Word lists for writing computer word games
This started out as a post about writing a computer version of a word game, but it ended up focusing mostly on computerized word lists.
Wordle got me thinking about a vaguely related (but not the same) word game called Fives that I learned as a kid. I wrote about Fives in a 1997 Words & Stuff post, in which I also mentioned that I had always wanted to turn it into a computer game.
And now I have! But so far it’s just a text-only game to run on the command line. The core of it turned out to be easy to write; I wrote the whole game in about a hundred lines of Perl code, without trying to be particularly concise. I chose Perl because I figured it would be quick and easy to write that way, but I neglected to take into account the fact that I haven’t written much Perl code in a long time, so I ended up having to look up the Perl syntax for all sorts of basic things.
One issue that I’ll have to deal with before releasing it publicly is finding a freely licensed word list.
It turns out that it’s a good idea for a word game like Wordle to include two lists of five-letter words (or whatever other kind of words match the game’s criteria):
- One list of as many legitimate five-letter words as possible. When a player enters a guess, the program checks their guess against this list to see whether the guess is a valid five-letter word. If they type GGGGG as a guess, then the program can determine that that’s not a valid word, at least not for the purposes of the game.
- Another, much shorter, list of common five-letter words. When the program needs to pick a word to be the answer for the current game, it picks from this list. (Every word on this list, of course, has to be on the other list as well.)
So, for example, if a player wants to enter the word BAVIN as a guess, it’s nice to let them do so; that’s a legitimate word, listed in some dictionaries. But it’s a pretty obscure word, so you probably don’t want the program to pick it as the answer; players who don’t know the word would find it frustratingly hard to figure out.
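As a rough illustration of the two-list idea, here is a minimal sketch in Python (not the Perl the game was actually written in). The function names and the tiny sample lists are invented for the example; a real game would load much larger lists from files.

```python
import random

def is_valid_guess(guess, valid_words):
    """Return True if the guess appears in the full validation list."""
    return guess.strip().upper() in valid_words

def pick_answer(answer_words, rng=random):
    """Pick the secret word from the shorter list of common words."""
    # Sort first so the choice does not depend on set iteration order.
    return rng.choice(sorted(answer_words))

# Tiny stand-in lists (hypothetical sample data).
VALID_WORDS = {"BAVIN", "BREAD", "CHAIR", "STONE"}
ANSWER_WORDS = {"BREAD", "CHAIR", "STONE"}

# Every possible answer must also be accepted as a guess.
assert ANSWER_WORDS <= VALID_WORDS
```

With lists like these, BAVIN is accepted as a guess but can never be chosen as the answer, and GGGGG is rejected outright.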
But in order to include those two lists of words, you need to acquire them. One way to do that would be to license a word list from a company that provides them; some word game apps do this. Another way would be to use an existing freely usable list.
Many UNIX-derived operating systems include a freely usable word list, as a text file. For example, in macOS, the /usr/share/dict directory includes a couple of word lists. The web2 file in that directory contains about 235,000 words from Webster’s Second International, which was published in 1934 and whose copyright has lapsed, so that list of words is in the public domain. And it’s easy enough to extract all five-letter words from that list.
(Or all five-letter words with no repeating letters, for use in Fives.)
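That extraction might look something like the following sketch, written in Python rather than Perl. The function names are made up for illustration, and the code assumes the file has one word per line. Capitalized entries are simply uppercased and kept; whether to drop them is a separate decision.

```python
def five_letter_words(lines, unique_letters=False):
    """Collect five-letter words from an iterable of lines, one word per line.

    If unique_letters is True, keep only words with no repeated letters
    (the kind of word used in Fives).
    """
    words = set()
    for line in lines:
        word = line.strip()
        # Skip anything that is not exactly five plain ASCII letters.
        if len(word) != 5 or not (word.isascii() and word.isalpha()):
            continue
        word = word.upper()
        if unique_letters and len(set(word)) != 5:
            continue
        words.add(word)
    return sorted(words)

def load_five_letter_words(path, unique_letters=False):
    """Read a word-list file and return its five-letter words."""
    with open(path, encoding="utf-8") as handle:
        return five_letter_words(handle, unique_letters)

# Example call (the file may not exist on every system):
# fives_words = load_five_letter_words("/usr/share/dict/web2", unique_letters=True)
```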
But there are three problems with using that list, for my purposes:
- It doesn’t include inflected forms of words. So, for example, it doesn’t include BAKED or BAKES.
- It doesn’t include words coined since 1934.
- It doesn’t indicate how common each word is. So it’s useful for the full guess-validation list, but I would have to extract common words manually to create the possible-answers list.
For the inflections issue, I came up with a kludgy workaround: I extracted all the four-letter words and added S to the end of each. (I could similarly add D to the ends of four-letter words ending in E.) This kludge works surprisingly well, creating lots of legitimate five-letter plurals; but it means that my full list now also includes lots of strings that aren’t words, such as INKYS.
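A minimal Python sketch of that kludge (again, not the actual Perl; the function name is hypothetical) shows both why it works and why it produces junk:

```python
def naive_inflections(four_letter_words):
    """Build five-letter candidates by blindly adding S (and D after a final E).

    This applies no grammar at all, so it generates non-words too.
    """
    candidates = set()
    for word in four_letter_words:
        word = word.strip().upper()
        if len(word) != 4:
            continue
        candidates.add(word + "S")
        if word.endswith("E"):
            candidates.add(word + "D")
    return candidates
```

Given BAKE, INKY, and WORD, this yields BAKES, BAKED, and WORDS, but also INKYS. That is tolerable for a guess-validation list, but such strings should never reach the possible-answers list.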
For the lack-of-recent-words issue, I suspect that there aren’t all that many five-letter words with no repeating letters coined in the past 90 years. But I also suspect that modern players would find it frustrating to guess such a word and be told that it’s not valid.
For the commonness issue, manually pulling out the common words isn’t that hard; I started to do it, and it went pretty quickly. But that approach requires making a lot of decisions about what counts as common.
So I went looking for other options.
So far, the most promising-looking option I’ve found is SCOWL, though I need to look into it a bit more. It also has the advantage of scoring words by how common they are.
Even if I go with SCOWL’s list, or something similar, I’ll still need to manually look through the possible-answers list before I publish the game. For example, I don’t want the game to pick various common insults as answers. But I think that using the SCOWL list as a starting point will make various things easier.
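To make that concrete, here is a hedged Python sketch of building both lists from scored words. It does not read SCOWL's actual files. It assumes a plain mapping from word to a numeric score where lower means more common, and the threshold, the function name, and the blocklist contents are all placeholders.

```python
def build_lists(scored_words, max_answer_score, blocklist=()):
    """Split scored words into (valid_guesses, possible_answers).

    scored_words: dict mapping word -> score, where a lower score is
    assumed to mean a more common word (an assumption of this sketch).
    blocklist: words that stay guessable but are never picked as answers.
    """
    blocked = {word.upper() for word in blocklist}
    valid = {word.upper() for word in scored_words}
    answers = {
        word.upper()
        for word, score in scored_words.items()
        if score <= max_answer_score and word.upper() not in blocked
    }
    return valid, answers
```

The manual review step then amounts to maintaining the blocklist: every word stays guessable, but nothing on the blocklist can be chosen as the answer.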