
Smarter spellcheckers


There are a lot of typos that spellcheckers won't catch.

For some of those, there's probably nothing to be done until we have much better grammar checkers, which may require strong AI.

But there are a fair number of typos where the intended word is pretty common, and the accidentally typed word is fairly unusual. For example, many people accidentally type "solider" for "soldier"; out of 26 occurrences of "solider" in the stories we received in 2000 through 2006, every single one is a typo for "soldier." We should have probabilistic spellcheckers that would be aware of how common a word is, and would flag uncommon words that are also common misspellings of more common words. That would require spellcheckers to have more than one kind of flag--in addition to saying "this word is misspelled," they would also sometimes have to say "this word is a valid word, but may not be the word you meant."

The two other most common examples I see are "florescent" for "fluorescent" and "lightening" for "lightning." In both cases, a smart spellchecker would help a lot--"florescent" is almost never intentional (over a hundred instances from 2000-2006, and although I've only spot-checked a few, I would bet that not one of them is intentional), and "lightening" is only intentional about half the time.
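
Here's a rough sketch of what that second kind of flag might look like, in Python. Everything in it is an assumption for illustration: the word counts are made up (they are not the submission counts mentioned above), the `LIKELY_INTENDED` table and the `min_ratio` threshold are hypothetical, and a real checker would get its numbers from a large corpus.

```python
from typing import Dict

# Made-up frequency counts for illustration only.
WORD_COUNTS: Dict[str, int] = {
    "soldier": 5000, "solider": 3,
    "fluorescent": 800, "florescent": 2,
    "lightning": 1200, "lightening": 400,
}

# Hypothetical table: valid-but-rare word -> the word people usually meant.
LIKELY_INTENDED: Dict[str, str] = {
    "solider": "soldier",
    "florescent": "fluorescent",
    "lightening": "lightning",
}


def check_word(word: str, min_ratio: float = 10.0) -> str:
    """Return one of three verdicts: misspelled, suspicious-but-valid, or ok."""
    w = word.lower()
    if w not in WORD_COUNTS:
        return "misspelled"
    intended = LIKELY_INTENDED.get(w)
    # Flag a valid word only when the word it resembles is far more common.
    if intended is not None and WORD_COUNTS[intended] / WORD_COUNTS[w] >= min_ratio:
        return f"valid, but did you mean '{intended}'?"
    return "ok"
```

With these made-up numbers, `check_word("solider")` gets the "did you mean" flag, while `check_word("lightening")` passes at the default threshold and is flagged only if you lower `min_ratio` to 3 or less, which roughly matches the idea that "lightening" deserves a weaker warning.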

When I say "more common," I mean both in the world in general, and in your own writing. Word processors should be tracking the words you type, and the words you later correct, and should pay attention to what words are more common in your vocabulary than in the general vocabulary. They could even let you track different kinds of writing differently--one log for academic writing, one for blog entries, one for personal emails, one for fiction, etc. (Potentially, they could use heuristics to guess which log to use for any given document, but that would probably be hard to get right; I'd be satisfied with letting the user apply user-specified tags to documents to put them in one category or another.)
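
The per-category logs could be something as simple as the following Python sketch. The `WritingLog` class, its method names, and the tag strings are all hypothetical; this is just one way the bookkeeping might work, not a description of any existing word processor.

```python
from collections import Counter, defaultdict


class WritingLog:
    """Tracks, per user-chosen tag, which words get typed and later corrected."""

    def __init__(self) -> None:
        self.typed = defaultdict(Counter)      # tag -> word -> times typed
        self.corrected = defaultdict(Counter)  # tag -> word -> times later changed

    def record_typed(self, tag: str, word: str) -> None:
        self.typed[tag][word.lower()] += 1

    def record_correction(self, tag: str, word: str) -> None:
        # Called when the user goes back and replaces `word` with something else.
        self.corrected[tag][word.lower()] += 1

    def correction_rate(self, tag: str, word: str) -> float:
        """Fraction of the times `word` was typed under `tag` that it was later corrected."""
        w = word.lower()
        times_typed = self.typed[tag][w]
        if times_typed == 0:
            return 0.0
        return self.corrected[tag][w] / times_typed
```

A word with a high correction rate under the "fiction" tag but a low one under "academic" would then be flagged more aggressively in stories than in papers.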

Spellcheckers should be smarter in other ways, too. For example, they should look at context. If you type "solider" in a paragraph (or even a document) where you've used the word "soldier" three times already, the spellchecker should weight the probabilities even more toward considering "solider" to be a typo. It won't always be a typo, of course; but if you're writing a story about soldiers, it's even more likely to be a typo in that context than it normally is.
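
One simple way to do that weighting, sketched in Python: start from a baseline probability that the word is a typo, and let each earlier use of the intended word remove a fixed fraction of the remaining doubt. The function name, the `boost_per_use` parameter, and the update rule itself are assumptions made up for this example; a real system would presumably learn the weighting from data.

```python
import re


def typo_probability(intended: str, context: str, base_prob: float,
                     boost_per_use: float = 0.5) -> float:
    """Estimate how likely a suspicious word is a typo for `intended`.

    base_prob is the context-free estimate; each exact occurrence of
    `intended` in `context` shrinks the remaining doubt by `boost_per_use`.
    """
    tokens = re.findall(r"[a-z']+", context.lower())
    uses = tokens.count(intended.lower())  # exact matches only, so "soldiers" is not counted
    remaining_doubt = (1.0 - base_prob) * (1.0 - boost_per_use) ** uses
    return 1.0 - remaining_doubt
```

With a baseline of 0.8 and three earlier uses of "soldier," this rule gives about 0.975; with no earlier uses it just returns the baseline.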

Also, I would like a pony.

(This entry is one of a continuing series that I might call something like a "tech wishlist." I'm writing them partly just to express dissatisfaction with the state of today's technology, but also partly as an attempt to help forestall software patents on some of this stuff. Even though I'm not implementing any of it myself, I would hope that if someone tries to patent a probabilistic spellchecker like this in a couple years, this entry may help reduce the likelihood of that patent being granted. Though of course some of the stuff I'm describing here already exists--some grammar checkers do some similar things, and the iPhone typing-correction feature does some similar things, and the spellchecker built into Google Search does some similar things.)


Since disk space is getting so cheap, it seems like you could dump a huge corpus of quality text through a set of variable-length Markov chains. Then, as you're typing, your spell-checker could weigh the probabilities like you suggest. You could even have multiple probability collections ("Technical Manuals," "18th Century Literature," "All of Project Gutenberg," etc.), and the system would use the best match -- determined, of course, by an overall statistical analysis of your text.
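
As a very rough illustration, here is the simplest fixed-length version of that idea in Python: a first-order (bigram) Markov model rather than the variable-length chains described above. The function names are hypothetical, and a real system would need a far larger corpus plus smoothing for word pairs it has never seen.

```python
from collections import Counter, defaultdict


def train_bigrams(text: str):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for prev, cur in zip(words, words[1:]):
        model[prev][cur] += 1
    return model


def next_word_prob(model, prev: str, word: str) -> float:
    """Estimate P(word | prev) from the raw counts (no smoothing)."""
    followers = model.get(prev.lower())
    if not followers:
        return 0.0
    return followers[word.lower()] / sum(followers.values())
```

Trained on a collection where "the soldier" is common and "the solider" never appears, the model would give "soldier" a much higher probability after "the," which is the signal the spell-checker could weigh.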

The danger, of course, is that you end up with some evil Clippy-like thing:

"You typed 'batrachian' -- 90% of the time people use this word, they really mean 'bactrian,' however your use of the word 'frog' within a ten word range lowers that probability to 40%. Or did you mean 'brachial?' Since it looks like you're writing a speculative fiction (75%) space opera (77%), maybe it's a formal name and merely needs capitalization?"

The OpenOffice auto-complete feature sorta-kinda works that way, although I've found it consistently annoying enough that I usually turn it off!

And, oddly enough, I would kill for this feature in Mobile Word's word-completion-hinting. I'm often baffled by the words that it suggests.

My master's thesis in GIS involved fuzzy logic, a form of many-valued logic based on fuzzy set theory. In classic set theory, given an object and a set, either the object is in the set or not in the set. A binary condition. A cat is either an animal or it is not an animal. In fuzzy set theory, the element is assigned a value between 0 and 1 that is a measure of its membership. For example, is jello in the set of solids or not? Or when classifying plots of land, are they forest or not (how many trees per square foot, for example, are required for a plot to count as forest)? It seems like the perfect mathematics for the kind of spell checker you're looking for.
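
To make the membership idea concrete, here is a minimal Python sketch of one common kind of fuzzy membership function, a linear ramp. The function name and the thresholds in the example are hypothetical; they are not taken from the thesis or from any forestry standard.

```python
def ramp_membership(x: float, low: float, high: float) -> float:
    """Degree of membership in a fuzzy set: 0 at or below `low`,
    1 at or above `high`, and linearly interpolated in between."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)
```

For instance, if a plot with 0 trees is definitely not forest and one with 100 trees definitely is, `ramp_membership(50, 0, 100)` gives a membership of 0.5. A spell checker could treat "is this word a typo?" the same way, as a degree rather than a yes or no.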
