The game Wordle has won the heart of social media in the past few weeks. Wordle is a word game in which the player tries to guess a 5-letter word within 6 guesses (tries), receiving progressively more information about the target word along the way. The game was created by Josh Wardle, an artist and engineer. Wordle starts when the player submits their first 5-letter word. Every time a word is submitted, feedback is provided on each letter of the submitted word, indicating whether the letter exists in the target word and whether its spot matches that in the target word. Below is a screenshot of the instructions.
Is there a good strategy to play the game? Obviously, prior to entering the first word, the player has no information about the target word, and it could be any one of approximately 15,000 5-letter English words. However, once the first word is submitted, the player gains information about the letters in the target word, and how much depends on the word entered. Is there a good strategy once the player starts receiving feedback? Perhaps there is one. After feedback on the first word is provided, success depends on many factors, including the player’s vocabulary and how well they can narrow down their next guess based on the feedback. However, the choice of the first word is independent of the player’s vocabulary or language skills. That is why we can perhaps talk about a strategy that would provide the best feedback (one with as much information as possible) after the first word is submitted. Basically, a good strategy for the first word would be one that eliminates as many remaining letters as possible. Better yet, a good strategy for the first word would be one that determines as many letters of the target word as possible, with as many of those letters as possible in their correct spots. In this analysis, I am trying to find a strategy, or rather a word, that can serve this purpose.
Based on this article on Wikipedia, the Webster’s Third New International Dictionary of the English Language contains 470,000 entries. However, a portion of these words are obsolete or may not fall into the category of valid single words that contain only letters (no numbers or symbols). I found a dataset of such words at this repository on GitHub. The file contains 370,103 English words that are single and contain only letters. After extracting only the 5-letter words from this list, I was left with a list of 15,918 words. I will explore this list to hopefully gain more insight into a good strategy for the first word entered into Wordle. Perhaps unrelated to this little project, I was also curious about the distribution of word counts by number of letters, and the following was the result. Apparently, the distribution is unimodal with a peak at words with 9 letters. The 5-letter words constitute just approximately 4.3% of all words in this list.
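The filtering step described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the short `sample` list stands in for the real word file, which is not reproduced here.

```python
from collections import Counter

def clean_words(lines):
    """Keep single, letters-only words, lowercased (ASCII letters only)."""
    words = (line.strip().lower() for line in lines)
    return [w for w in words if w.isascii() and w.isalpha()]

def length_distribution(words):
    """Map word length -> number of words of that length."""
    return Counter(len(w) for w in words)

# Toy sample standing in for the real word list (hypothetical data).
sample = ["Apple", "raise", "cat", "e-mail", "crwth", "strategy", ""]

words = clean_words(sample)
five_letter = [w for w in words if len(w) == 5]

print(five_letter)
print(sorted(length_distribution(words).items()))
```

With the real file, `lines` would simply be the lines read from the downloaded word list, and the share of 5-letter words would be `len(five_letter) / len(words)`.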
Next, I will review two different strategies, the Vowel Strategy and the Frequency Strategy. I will show that the Frequency Strategy is a better strategy and we will pick the best word based on the Frequency Strategy.
Vowels play an important role when trying to come up with a strategy to eliminate large numbers of words each round. This is because at least one vowel exists in each syllable of a word. There are 5 vowels: A, E, I, O and U. Even though the letter Y can act as a vowel in some words, I did not consider it a vowel here. Starting the search with vowels may be a good idea because every single word in English must have at least one vowel (well, this is not 100% true: as we will find a bit later, there are 8 words in the list without any vowels, although this does not bring the merit of this strategy into question).
I started my search through my list of 5-letter words by finding the number of words with one, two, three, four and five unique vowels. For instance, the word asana has only one unique vowel and the word alibi has two. It turns out there are 6223, 8568, 1055, 18 and 0 words with 1, 2, 3, 4 and 5 unique vowels, respectively. For example, the words adieu, auloi (plural of Aulos, an ancient Greek wind instrument), Aequi (an ancient Italian tribe) and uraei (plural of Uraeus, the upright form of an Egyptian cobra) all have 4 unique vowels. Needless to say, there were no 5-letter words that consisted of only vowels.
There were also 46 5-letter words in which the letter Y acted as the vowel, e.g., ghyll (a ravine or narrow valley in the North of England) or Scyld (a legendary Danish king). There were also 8 words without any vowels, such as crwth, which is a type of stringed instrument.
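The grouping by number of unique vowels can be sketched as below. This is a minimal illustration, not the code behind the counts above: the helper names are hypothetical, and the sample words are just the examples mentioned in the text.

```python
from collections import Counter

VOWELS = set("aeiou")  # Y is deliberately excluded, as in the text

def unique_vowel_count(word):
    """Number of distinct vowels (A, E, I, O, U) in the word."""
    return len(set(word.lower()) & VOWELS)

# Example words taken from the text; the real analysis runs over all 15,918.
sample = ["asana", "alibi", "adieu", "crwth", "ghyll"]

by_count = Counter(unique_vowel_count(w) for w in sample)

# Words with none of the five vowels, split by whether they contain a Y.
no_vowels = [w for w in sample if unique_vowel_count(w) == 0]
with_y = [w for w in no_vowels if "y" in w]
without_y = [w for w in no_vowels if "y" not in w]

print(dict(by_count))
print(with_y, without_y)
```

Using a set intersection means repeated vowels are counted once, which is why asana lands in the one-vowel group.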
Considering how important vowels are in the English language, a strategy based on vowels would be to use first words that contain as many unique vowels as possible. This will help us determine the existence or absence of as many vowels as possible in the target word. As mentioned above, there are no 5-letter words that consist of only vowels. However, there are 18 words that consist of 4 unique vowels. These words include: adieu, aequi, aoife, audio, aueto, auloi, aurei, avoue, heiau, kioea, louie, miaou, ouabe, ouija, oukia, ourie, ousia and uraei.
One may argue that any of these 18 words would make a good first try at Wordle. However, let’s see if any of the 5 vowels is more or less frequent than the others in 5-letter words. The following shows the frequency of appearance for each of the 5 vowels in 5-letter words (not counting repeated appearances, i.e., for the letter A, the word asana counts as 1).
The graph above shows that the vowel U is the least frequent of the 5 vowels. Filtering out the words that contain U from the list of 5-letter words with 4 unique vowels, we are left with a list of just two words, Aoife (an Irish feminine given name) and Kioea (a Hawaiian bird that became extinct in the 19th century). A quick search through the list shows that the consonant K appeared in 1663 5-letter words, whereas the consonant F appeared in 1115. Therefore, this strategy would suggest the word Kioea. It is important to mention that this strategy completely ignores the placement of vowels in the word and only determines their existence or absence in the target word. We will see in the next section how the Frequency Strategy outperforms the Vowel Strategy.
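The selection just described can be reproduced in a few lines. The sketch below uses the 18 four-vowel words and the two consonant counts exactly as quoted in the text; the variable names are hypothetical, and treating the tie-break as "pick the word whose consonant appears in more words" is my reading of the paragraph above.

```python
# The 18 words with four unique vowels, as listed in the text.
four_vowel_words = [
    "adieu", "aequi", "aoife", "audio", "aueto", "auloi", "aurei", "avoue",
    "heiau", "kioea", "louie", "miaou", "ouabe", "ouija", "oukia", "ourie",
    "ousia", "uraei",
]

# Drop every word that contains the least frequent vowel, U.
without_u = [w for w in four_vowel_words if "u" not in w]

# Number of 5-letter words containing each consonant, as quoted in the text.
consonant_word_counts = {"k": 1663, "f": 1115}

def consonant_score(word):
    """Sum of the quoted counts for the consonants in the word (assumed tie-break)."""
    return sum(consonant_word_counts.get(ch, 0) for ch in word)

best = max(without_u, key=consonant_score)
print(without_u, best)
```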
The previous strategy focused only on the vowels. This strategy, however, will focus on all of the letters. We will evaluate the most frequently used letters in the alphabet and will also determine the most frequent placement of the most frequently used letters in 5-letter words. Based on those, we will determine the best words to be entered first into the game.
I found the frequency of occurrence of each letter in the alphabet in the 5-letter words in the dataset and sorted them from largest to smallest. The following graph shows the frequencies.
In the above graph, each occurrence of a letter in a word was counted as 1. So I decided to look at the average frequency of letters per word to see if it was any different from the above. Looking at the average frequency of letters in 5-letter words, I did not see any difference in the order of letters, sorted from most commonly appearing to least commonly appearing (see below).
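As a rough sketch of the two counts compared above: the code below assumes that "average frequency per word" means total occurrences divided by the number of words, which is only one possible reading of the text. The function names and the four-word sample are hypothetical.

```python
from collections import Counter

def letter_totals(words):
    """Total occurrences of each letter, counting repeats within a word."""
    return Counter(ch for w in words for ch in w)

def letter_average_per_word(words):
    """Assumed definition: total occurrences divided by the number of words."""
    totals = letter_totals(words)
    return {ch: n / len(words) for ch, n in totals.items()}

# Toy sample (hypothetical); the real analysis uses the full 5-letter list.
sample = ["asana", "raise", "arose", "serai"]

totals = letter_totals(sample)
averages = letter_average_per_word(sample)

for letter, count in totals.most_common(3):
    print(letter, count, averages[letter])
```

Under this definition the two rankings are necessarily identical, since every total is divided by the same number; a different denominator (for example, the number of words containing the letter) could change the order.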
This means the most commonly used letters in 5-letter words (in terms of total frequency as well as average frequency) were the letters A, E, S, O, R, I, L, T, etc. I decided to focus on the top six letters since the average frequency dropped significantly after the sixth letter. There are 96 words that are made up of only these letters (repetition allowed). However, if we agree that the purpose of the first word is to eliminate as many remaining letters (or determine as many letters in the target word) as possible, perhaps we should not allow letters to repeat. If we don’t allow repetition, the list reduces to only 12 words. These words are: aesir, aries, arise, arose, ireos, oreas, orias, osier, raise, seora, serai and serio. Which one of these 12 words would be the best first word in Wordle?
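The two filters described above (only the top six letters, then no repeated letters) might look like this. The helper names are hypothetical and the sample list is a toy stand-in for the full list of 5-letter words.

```python
TOP_SIX = set("aesori")  # A, E, S, O, R, I from the text

def made_of(word, letters):
    """True if every letter of the word belongs to `letters`."""
    return set(word) <= set(letters)

def no_repeats(word):
    """True if no letter occurs more than once in the word."""
    return len(set(word)) == len(word)

# Toy sample (hypothetical); the real input is the full 5-letter word list.
sample = ["arise", "asses", "raise", "eerie", "tears", "serai"]

allowed = [w for w in sample if made_of(w, TOP_SIX)]
distinct = [w for w in allowed if no_repeats(w)]

print(allowed)   # repetition allowed
print(distinct)  # repetition not allowed
```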
To answer this question, I decided to look at the frequency of appearance of each of the top six letters in each spot of the 5-letter words (first letter, second letter, etc.). The result is shown below.
I also calculated the average frequency of the top six letters in 5-letter words to see if it shows any significant difference from the absolute frequencies but it did not turn out to be different. The average frequencies are calculated by dividing the absolute frequencies by the number of 5-letter words, in which that particular letter appears in that particular spot. The average frequency plot is presented below.
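A per-spot count of this kind can be sketched as follows. The absolute counts are straightforward; for the normalized version I divide by the number of words that contain the letter anywhere, which is an assumption about the denominator and may differ from what was actually used. All names and the sample data are hypothetical.

```python
def position_counts(words, letters):
    """counts[letter][i] = number of 5-letter words with `letter` at index i."""
    counts = {ch: [0] * 5 for ch in letters}
    for w in words:
        for i, ch in enumerate(w[:5]):
            if ch in counts:
                counts[ch][i] += 1
    return counts

def position_shares(words, letters):
    """Assumed normalization: divide by the number of words containing the letter."""
    counts = position_counts(words, letters)
    shares = {}
    for ch in letters:
        containing = sum(1 for w in words if ch in w)
        shares[ch] = [c / containing if containing else 0.0 for c in counts[ch]]
    return shares

# Toy sample (hypothetical data).
sample = ["raise", "arise", "roses", "serai"]

counts = position_counts(sample, "aesori")
shares = position_shares(sample, "aesori")
print(counts["s"])
print(shares["s"])
```

Note that with this denominator the shares of a letter can add up to more than 1 when the letter repeats within a word, as S does in roses.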
This shows, for example, that the letter S frequently appears in 5-letter words as the fifth letter, but it almost never appears as the third letter. Based on this, I used a simple scoring system to assign a score to each word, which basically consists of the sum of the average frequencies for its letters based on the above results. This scoring system assumes that the 6 letters are all valued equally and focuses only on frequencies per spot. For example, the score for the word aesir is calculated as approximately 0.1619 + 0.2928 + 0.1162 + 0.2771 + 0.1840 = 1.032, since the average frequency of the letter A in the first spot is 0.1619, the average frequency of the letter E in the second spot is 0.2928, and so on. The table and figure below show the calculated score for all 12 words in the list.
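The worked example for aesir can be written as a small scoring function. The five values below are the ones quoted in the text; the table structure (a dictionary keyed by letter and zero-based spot) and the function name are hypothetical, and the remaining entries of the table are not filled in because they are not given here.

```python
# Average frequency of a letter in a given spot (0 = first letter).
# Only the five values quoted in the text are included.
avg_freq = {
    ("a", 0): 0.1619,
    ("e", 1): 0.2928,
    ("s", 2): 0.1162,
    ("i", 3): 0.2771,
    ("r", 4): 0.1840,
}

def word_score(word, table):
    """Sum of the per-spot average frequencies of the word's letters."""
    return sum(table[(ch, i)] for i, ch in enumerate(word))

score = word_score("aesir", avg_freq)
print(round(score, 3))  # 1.032
```

Scoring the other 11 candidates would only require filling in the rest of the table from the per-spot frequencies.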
Based on this analysis, the word Aries (Latin word for ram) has the highest calculated score. This suggests that, if used as the first word entered into Wordle, Aries can on average determine the largest number of letters in the target word.
To test the effectiveness of Aries in identifying letters in the target word, I used a random selection of 5000 words from the list of 5-letter words, and calculated how many letters, on average, would be indicated when the word Aries is used as the first word on Wordle. I replicated this process 10 times. The following shows that the average number of letters (per word) whose existence in the target word was identified after Aries was used as the first word was between 2.055 and 2.1. Please note that the following result does not distinguish between letters whose spot was correctly identified and those whose spot was not. It simply includes all the letters that were identified in the target word. In other words, all the letters that turn Gold or Green after the word is entered.
I conducted the same analysis for the word Kioea (which was suggested by our Vowel Strategy), and the result was an average of only 1.79 letters identified. This indicates that the Frequency Strategy was superior to the Vowel Strategy in identifying letters in the target word.
Next, I calculated the average number of letters (per word), whose actual spot in the target word was correctly identified by the word Aries. This means, not only is the letter identified, but its spot in the target word is also correctly identified. In other words, this is the average number of letters that turn Green after the word is entered. For the simulation I again used 10 replications and 5000 randomly selected words in each replication. The following shows the results for Aries.
I ran the same analysis for all the 12 words in the list of top words to see if any of them could beat Aries. As expected, the word Aries demonstrated the highest value for average number of letters (per target word), whose spots were correctly identified. For this analysis also I used 10 replications and 5000 randomly selected words in each replication and reported the average across all 10 replications.
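The comparison across candidate words could be sketched like this. It is an illustration only: the function names, seed and toy word list are hypothetical, and reusing the same random samples for every candidate is a design choice of this sketch (it makes the comparison less noisy), not something stated in the text.

```python
import random

def green_count(guess, target):
    """Number of letters of the guess that sit in the correct spot."""
    return sum(g == t for g, t in zip(guess, target))

def mean_greens(guess, targets):
    """Average number of green letters over a list of targets."""
    return sum(green_count(guess, t) for t in targets) / len(targets)

def rank_candidates(candidates, words, sample_size, replications, seed=0):
    """Rank candidates by mean green letters, averaged over all replications."""
    rng = random.Random(seed)
    samples = [rng.sample(words, sample_size) for _ in range(replications)]
    scores = {}
    for cand in candidates:
        per_replication = [mean_greens(cand, s) for s in samples]
        scores[cand] = sum(per_replication) / replications
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy word list (hypothetical); the text uses 5000 targets and 10 replications.
toy_words = ["arise", "aries", "areas", "crwth", "roses", "audio"]
ranking = rank_candidates(["aries", "serai", "raise"], toy_words,
                          sample_size=4, replications=5)
print(ranking)
```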
Based on the results of this study, if used as the first word, the word Aries can identify the existence of approximately 2.07 letters of the target word on average, and the correct spot of approximately 0.6 letters on average.
I realized later that, unfortunately, Aries is not on Wordle’s list of accepted words, and neither are the next best words on the list, Orias and Serio (based on the word scores calculated above). The next best word on the list was serai, which is another word for caravanserai or inn and is indeed on Wordle’s list of accepted words. The origin of the name is Persian and Turkish, with slightly different pronunciations (saray or sarāī, also see caravanserai). In our testing model, serai and Aries identify the same average number of letters in the target word (approximately 2.07 letters). However, serai identifies a slightly lower average number of letter spots correctly (approximately 0.47 compared to 0.58 for Aries). Below, you see serai used as the first word on the Wordle of January 16, identifying the existence of 3 letters, with the spot of two of them correctly identified.
In conclusion, I am not sure if the selection of words for Wordle is a completely random process. You may argue that some words may have had some reference to daily global events (see here for a list of past Wordle words in 2022). And after all, it may not be too much fun playing based on an analysis or strategy.
Happy Wordling everyone (although Wordling is probably not on Wordle’s list of accepted words)!