Best Wordle Words

Wordle is extremely popular these days, and for good reason: it is so simple and elegant; a real joy.

Wordle reminds us that the best of games are simple in nature, with few rules, and a great deal of freedom for players to develop their own strategies. But, don’t let Wordle’s apparent simplicity fool you… it is all too easy to guess words ineffectively, and to lose!

In this article I will try to come up with some good words for initial guesses in Wordle.

5-Letter Words

Let’s assume that the only rule governing Wordle’s choice of words is that it be a 5-letter word. So, let’s just gather all 5-letter words.

import string

import matplotlib.pyplot as plt
import pandas as pd
import requests


def plot_table(df: pd.DataFrame):
    return df.style.background_gradient("YlGn")


words_request = requests.get(
    "https://gist.githubusercontent.com/nicobako/759adb8f0e7fa514f408afb946e80042/raw/d9783627c731268fb2935a731a618aa8e95cf465/words"
)
all_words = words_request.content.decode("utf-8").lower().split()

five_letter_words = pd.DataFrame(
    (
        word
        for word in all_words
        if len(word) == 5 and all(letter in string.ascii_lowercase for letter in word)
    ),
    columns=["word"],
)

How many 5-letter words do we have?

len(five_letter_words)
6071

Wow! That’s a lot!

Criteria for a Good Guess

There are many theories on what makes a good guess, so please don’t judge… but here are my simple, naive criteria for a good guess:

  • Contains as many common letters as possible

  • Contains as many different letters as possible

  • Contains as many letters in common places as possible

For now, let’s keep it this simple. We can always make things more complicated later…
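These criteria can be sketched with a toy frequency table before we compute the real numbers. The frequencies below are made up purely for illustration; the actual values are derived from the word list later in the article.

```python
# Hypothetical letter frequencies (percent) -- illustration only
toy_freq = {"e": 10.5, "s": 9.9, "a": 9.2, "r": 6.6, "o": 6.4}


def toy_score(word: str) -> float:
    # Criteria 1 & 2: sum frequencies over *unique* letters only,
    # so repeated letters earn nothing extra
    return sum(toy_freq.get(letter, 0.0) for letter in set(word))


print(toy_score("arose"))  # five distinct, common letters
print(toy_score("eerie"))  # repeats and rare letters score lower
```

A word made of five distinct, common letters scores much higher than one with repeats, which is exactly the behavior we want from a first guess.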

Collecting the Data

This is not so tricky… We look at each word, note which letter appears at each position, and keep track of everything. Later we can use this data to calculate everything we need.

def get_letters_data():
    data = []
    for word in five_letter_words["word"]:
        for position, letter in enumerate(word):
            data.append({"letter": letter, "position": position + 1})
    return pd.DataFrame(data)


letters = get_letters_data()

This data is a table whose rows contain two values: a letter and its position within a word. It might not seem like much, but it’s all we need. Here’s a quick glance at what this table looks like:

letters.head()
letter position
0 a 1
1 a 2
2 r 3
3 o 4
4 n 5

Most Common Letters

Given our data, we can now count the occurrences of each letter. This reflects how commonly each letter appears across all words. To calculate this we take our letters data, group it by letter, and tally up the count.

letters_count = letters.groupby("letter").count()
letters_count.columns = ["count"]
letters_count = letters_count.sort_values(by="count", ascending=False)
letters_count.plot.barh(y="count")
plt.show()
[Bar chart: count of each letter across all 5-letter words, sorted descending]

I don’t know about you, but these results surprised me! The top 10 letters are:

plot_table(letters_count.head(10))
  count
letter  
e 3187
s 3019
a 2785
r 2006
o 1946
i 1776
l 1746
t 1574
n 1535
d 1232

And the lowest 10 are:

plot_table(letters_count.tail(10))
  count
letter  
b 768
g 704
k 620
f 514
w 494
v 343
z 160
j 136
x 117
q 49

Let’s create a new column in our letters-count table with the percent occurrence of each letter:

letters_count["percent"] = 100 * letters_count["count"] / letters_count["count"].sum()

As usual, the sum of the percent column should equal 100%.

letters_count["percent"].sum()
100.0
letters_count["percent"].plot.pie(y="letter")
plt.show()
[Pie chart: percent occurrence of each letter]

Words With Most Common Letters

Using the occurrence of each letter, we can look at all of our words and give each one a score based on how common its letters are.

We also don’t want to give words extra points for having the same letter multiple times… Remember, the whole point of this is to come up with a good first guess for Wordle, so it would be more helpful if our first guess contained unique letters.

def score_based_on_letters_count(word: str) -> float:
    score = 0.0
    unique_letters = list(set(word))
    for letter in unique_letters:
        score += letters_count.loc[letter].percent
    return round(score, 2)

In this way, the score of a word like apple is:

score_based_on_letters_count("apple")
28.34

We can take this metric and calculate a score for all of our 5-letter words.

five_letter_words["score_letters_count"] = [
    score_based_on_letters_count(word) for word in five_letter_words["word"]
]

Let’s take a look at some of the top words:

plot_table(
    five_letter_words.sort_values("score_letters_count", ascending=False)
    .head(10)
    .set_index("word")
)
  score_letters_count
word  
arose 42.640000
arise 42.080000
aries 42.080000
raise 42.080000
earls 41.980000
laser 41.980000
reals 41.980000
aloes 41.780000
stare 41.410000
tears 41.410000

Most Common Positions of Letters

Another important thing to look at is the relative position of the letters in our first guess. We want a word whose letters are not only common, but whose positions of letters are in common places.

This is a little trickier to do, but still not that tough. We group our data by letter and position, and count the occurrences of each letter-position combination.

letters_position = pd.DataFrame(
    {"count": letters.groupby(["letter", "position"]).size()}
)

Here’s what it looks like for the letter a

letters_position.loc["a"]
count
position
1 338
2 1076
3 646
4 412
5 313

You can see that a is most commonly in the second position.
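The same lookup can be done for every letter at once with a `groupby` plus `idxmax`. Here is a small sketch on toy data (the counts below are made up; the real table is `letters_position`):

```python
import pandas as pd

# Toy letter/position observations -- made-up data for illustration
toy = pd.DataFrame(
    {
        "letter": ["a", "a", "a", "s", "s", "s"],
        "position": [1, 2, 2, 1, 5, 5],
    }
)
counts = toy.groupby(["letter", "position"]).size()

# idxmax within each letter returns the (letter, position) label
# with the highest count; keep just the position
most_common_position = counts.groupby("letter").idxmax().map(lambda pair: pair[1])
print(most_common_position)
```

Applied to the real `letters_position` table, this would give each letter’s single most likely slot at a glance.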

Let’s create a percent column for this table as well.

letters_position["percent"] = (
    100 * letters_position["count"] / letters_position["count"].sum()
)
letters_position["percent"] = letters_position["percent"].round(2)

Let’s check that the percent column sums to roughly 100% (it won’t be exact, since we rounded each entry):

letters_position["percent"].sum()
99.94999999999999
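The small shortfall is a rounding artifact: each percentage was rounded to two decimals before summing. A minimal sketch of the same effect:

```python
# Summing values rounded to two decimals loses a little mass
values = [100 / 3] * 3  # three thirds of 100%
rounded_sum = sum(round(v, 2) for v in values)
print(rounded_sum)  # just shy of 100
```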

And here’s a fancy chart displaying how the letters and positions look for each letter:

letters_position_pivoted = letters_position.reset_index().pivot(
    index="letter", columns="position", values="percent"
)
letters_position_pivoted.style.background_gradient("YlGn")
position 1 2 3 4 5
letter          
a 1.110000 3.540000 2.130000 1.360000 1.030000
b 1.480000 0.130000 0.510000 0.340000 0.070000
c 1.520000 0.260000 0.600000 0.850000 0.140000
d 1.070000 0.190000 0.660000 0.780000 1.350000
e 0.580000 2.410000 1.290000 3.980000 2.240000
f 0.990000 0.040000 0.230000 0.290000 0.130000
g 0.930000 0.070000 0.530000 0.580000 0.200000
h 0.830000 0.910000 0.130000 0.250000 0.740000
i 0.290000 2.210000 1.850000 1.210000 0.300000
j 0.380000 0.010000 0.040000 0.020000 nan
k 0.360000 0.100000 0.310000 0.760000 0.510000
l 0.970000 1.290000 1.430000 1.250000 0.810000
m 1.130000 0.250000 0.720000 0.600000 0.240000
n 0.440000 0.510000 1.530000 1.410000 1.160000
o 0.360000 2.980000 1.570000 1.030000 0.480000
p 1.240000 0.350000 0.490000 0.570000 0.270000
q 0.120000 0.030000 0.010000 0.000000 nan
r 0.900000 1.630000 1.810000 1.110000 1.160000
s 2.500000 0.180000 0.840000 0.870000 5.560000
t 1.240000 0.420000 0.900000 1.470000 1.160000
u 0.180000 1.690000 1.020000 0.540000 0.080000
v 0.370000 0.110000 0.420000 0.220000 0.010000
w 0.750000 0.250000 0.310000 0.220000 0.090000
x 0.020000 0.080000 0.190000 0.010000 0.090000
y 0.150000 0.310000 0.290000 0.140000 2.080000
z 0.080000 0.040000 0.180000 0.130000 0.090000

This chart is really quite useful: at a glance you can see that some letters are much more likely to appear in certain positions of the word, so it’s important to keep this in mind when making guesses.

Words with Most Common Letter Positions

We can use the above data to score each of our 5-letter words based on the positions of the letters. We just look at all of the letters of our word and their corresponding positions, and tally up the percent chance of encountering each letter in its position.

Here we will count all letters, even duplicates… not sure why, it just feels right.

def score_based_on_letters_position(word: str) -> float:
    score = 0.0
    for i, letter in enumerate(word):
        position = i + 1
        # index the (letter, position) MultiIndex explicitly
        score += letters_position.loc[(letter, position), "percent"]
    return score

So, the score of apple in this case would be:

score_based_on_letters_position("apple")
5.44

You can see right away that the score for the word apple is very different from before.
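We can verify this score by hand against the position table above: apple is a@1, p@2, p@3, l@4, e@5, and those cells read 1.11, 0.35, 0.49, 1.25, and 2.24.

```python
# Reading apple's letter-position percents off the table above
apple_score = 1.11 + 0.35 + 0.49 + 1.25 + 2.24
print(round(apple_score, 2))
```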

Let’s calculate the score for each of our 5-letter words based on the positions of its letters.

five_letter_words["score_letters_position"] = [
    score_based_on_letters_position(word) for word in five_letter_words["word"]
]

Let’s take a look at the top 10 in this case.

plot_table(
    five_letter_words[["word", "score_letters_position"]]
    .sort_values("score_letters_position", ascending=False)
    .head(10)
    .set_index("word")
)
  score_letters_position
word  
sales 17.010000
sores 16.830000
sates 16.480000
soles 16.450000
cares 16.410000
bares 16.370000
sames 16.300000
sades 16.240000
tares 16.130000
pares 16.130000

You may be surprised, as I was, to find that these two ways of scoring words generated very different lists!
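To make this concrete, take the two top-10 lists from the tables above and intersect them:

```python
# Top-10 lists copied from the two tables above
top_by_letter_count = ["arose", "arise", "aries", "raise", "earls",
                       "laser", "reals", "aloes", "stare", "tears"]
top_by_position = ["sales", "sores", "sates", "soles", "cares",
                   "bares", "sames", "sades", "tares", "pares"]

overlap = set(top_by_letter_count) & set(top_by_position)
print(overlap)  # empty -- the two rankings share no top-10 words
```

Not a single word appears in both top-10 lists, which is why combining the two scores is worth exploring.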

What is the Best Guess?

So, what is the best first guess? We’ll naively assume it is a combination of these two scoring methods. We’ll just add up the letter-count and letter-position scores and look at the top words.

five_letter_words["final_score"] = (
    five_letter_words["score_letters_count"]
    + five_letter_words["score_letters_position"]
)
plot_table(
    five_letter_words.sort_values("final_score", ascending=False)
    .set_index("word")
    .head(20)
)
  score_letters_count score_letters_position final_score
word      
tares 41.410000 16.130000 57.540000
tales 40.560000 15.750000 56.310000
rates 41.410000 14.880000 56.290000
dares 40.290000 15.960000 56.250000
aries 42.080000 14.130000 56.210000
cares 39.600000 16.410000 56.010000
lanes 40.430000 15.580000 56.010000
oates 41.220000 14.340000 55.560000
aloes 41.780000 13.510000 55.290000
pares 39.140000 16.130000 55.270000
mares 39.170000 16.020000 55.190000
bares 38.760000 16.370000 55.130000
dales 39.430000 15.580000 55.010000
hares 39.100000 15.720000 54.820000
earls 41.980000 12.740000 54.720000
danes 38.730000 15.680000 54.410000
reals 41.980000 12.250000 54.230000
earns 41.280000 12.900000 54.180000
races 39.600000 14.580000 54.180000
canes 38.050000 16.130000 54.180000

Conclusion

Don’t let this fancy code and math fool you, this is a naive approach. We are simply looking at which letters are most common, and which positions of letters are most common, and picking words that maximize this combination. There are a ton of other details that this code simply isn’t considering, and a lot of ways this code can be improved.

In the end, this article may help you come up with a decent first guess, but the rest is up to you! Anyway, good luck on your next Wordle game, and don’t forget to try out one of the top words!