Sometime around the early 1980s, Logical Business Machines (creators of computers with names like David and Goliath, and a programming language called English) released a computer system called Mike. This system came with speech-recognition software and a microphone. An executive of the company attempted to demonstrate the product on a television show; he stood at the microphone and said, "Hello, Mike." Mike displayed, "Excuse me, what was that?" The executive tried various other phrases, but all Mike would say in response was "Excuse me, what was that?"
Needless to say, the product never did very well.
Some years back, I participated in a speech-recognition project at Apple. After I spent half an hour reading various phrases into a microphone, the people running the project gave me a t-shirt with a picture of a beach strewn with apple cores. The caption read "I helped Apple wreck a nice beach." As an environmentalist, I was somewhat upset by this shirt—until someone told me to read the phrase aloud. Comprehension dawned.
This confusion illustrates one of the basic problems of speech recognition: words and phrases often sound like other words and phrases.
Last week I got the chance to try out the latest in speech-recognition software, Dragon Systems' Naturally Speaking. Things have come a long way since Mike, but without training the software to my voice, it was still wrong more often than not, much like the handwriting recognition on Apple's Newton. I gather, though, that if you take the time to train it, and if you assiduously correct it when it makes mistakes, it can become quite proficient at recognizing your speech; it not only learns how you pronounce words, it adapts its vocabulary and sentence structure to yours.
We spent some time making fun of the software's untrained efforts at recognizing the speech of three rather different-voiced people, and then spent a couple of days mis-hearing and misunderstanding each other, which gave us a somewhat greater appreciation of the difficulties of speech recognition. After all, if not even humans can do it perfectly, it's asking a lot of a computer. For instance, I've occasionally used the phrase "my friend Steph," and gotten confused looks from people who think I said "my friend's deaf."
Earlier speech-recognition software relied on discrete speech: the user had to pause briefly after every word. The latest generation can recognize continuous speech, spoken normally without pauses. That's quite a feat when you consider that there's often no obvious phonological break between words; the sound of one word can merge into the sound of the next with no silence between them. We're adept at figuring out where the word boundaries are from context most of the time, but for a computer it's not as simple as looking for spaces between words.
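To see why finding word boundaries is harder than it sounds, here's a toy sketch: given a transcript with no spaces, it lists every way to carve the string into known words. The five-word vocabulary is invented purely for illustration, and real recognizers score acoustic and language-model probabilities rather than doing dictionary lookup, but the boundary ambiguity is the same.

```python
# Toy word-boundary sketch: list every way to split a spaceless
# string into words drawn from a small (hypothetical) vocabulary.
VOCAB = {"a", "an", "ice", "nice", "man"}

def segmentations(text, vocab=VOCAB):
    """Return all ways to carve `text` into words from `vocab`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix in vocab:
            # keep this word, then segment whatever remains
            results += [[prefix] + rest
                        for rest in segmentations(text[i:], vocab)]
    return results

# "aniceman" is genuinely ambiguous, just like speech without pauses:
print(segmentations("aniceman"))
# → [['a', 'nice', 'man'], ['an', 'ice', 'man']]
```

Even this trivial setup produces two equally valid readings; a real system has to pick among such readings using context, which is exactly where "wreck a nice beach" sneaks in.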
As I understand it, speech recognition currently works pretty well under either of two circumstances: for a single user who's trained the software, or for a limited knowledge domain (such as weather). The general problem—recognizing a full range of speech from any speaker—is still beyond the abilities of our software, as far as I know.
A related problem, which also has not yet been solved, is parsing natural language, a core task in natural language processing (NLP). If the computer is to respond to natural-language input (whether spoken or written), it must first parse that input, figuring out which parts of the sentence refer to what. (I'm guessing that advanced speech recognition does a certain amount of parsing as it goes, to help eliminate unlikely possibilities.) Again, parsing can be done fairly successfully in limited domains, but if you open it up to any domain you begin to approach the boundaries of current knowledge in AI and linguistics.
One current state-of-the-art speech-recognition/NLP system is Jupiter, "a conversational interface for on-line weather information over the phone." You can call the system at 1-888-573-8255 (outside the US it's a toll call: 1-617-258-0300) and ask it questions like "What's the forecast for today in Boston?" It's pretty good at figuring out what you're asking and giving appropriate answers, as long as you ask it about one of the cities or countries it knows about. (You can also ask it to tell you what locations it's familiar with—"What cities do you know in China?") As with Naturally Speaking, Jupiter works best when you speak clearly but naturally—don't use exaggerated enunciation or computer-esque wording. Note that your questions are recorded for research purposes.
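Jupiter's grammar is far richer than anything shown here, but a caricature suggests why a limited domain is tractable: a handful of question templates plus a list of known places goes a long way. Everything below (the patterns, the place list, the `parse_query` function) is invented for illustration, not taken from any real system.

```python
import re

# A toy limited-domain parser in the spirit of weather-query systems:
# it maps a few English question shapes onto a (place, day) pair.
KNOWN_PLACES = {"boston", "tokyo", "paris"}

PATTERNS = [
    re.compile(r"what(?:'s| is) the forecast for "
               r"(?P<day>today|tomorrow) in (?P<place>[a-z ]+)\??"),
    re.compile(r"will it rain in (?P<place>[a-z ]+) "
               r"(?P<day>today|tomorrow)\??"),
]

def parse_query(text):
    """Return (place, day) if the question fits a known shape, else None."""
    text = text.lower().strip()
    for pattern in PATTERNS:
        match = pattern.fullmatch(text)
        if match and match.group("place") in KNOWN_PLACES:
            return match.group("place"), match.group("day")
    return None  # outside the domain: the system would ask for clarification

print(parse_query("What's the forecast for today in Boston?"))
# → ('boston', 'today')
```

The narrow domain is doing all the work: a question about an unknown city, or phrased in an unanticipated way, simply falls through, which is why opening a parser up to arbitrary topics is so much harder.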
In a different medium, Philip A. Bralich is working on NLP without (as yet) a connection to speech; his system is intended for parsing textual statements and queries. (Of course, a speech recognition system could pass its textual output along to such an NLP engine.) He's been working on connecting his NLP system to 3D graphics using VRML; for more information, see his Web page (including a free demo of his system that reveals both some of its strengths and some of its weaknesses).
It's an exciting area of research—it would be nice to be able to interact with a computer entirely through voice commands. But we've still got a ways to go before computers can really wreck a nice beach.
For information on Dragon products, see the Dragon Systems Web page, or the 21st Century Eloquence Dragon-related page (which includes some, probably less advanced, Macintosh products). Besides Naturally Speaking, Dragon also produces multilingual and specialized/limited-domain versions of their software. This info is provided as a service for those who want to pursue it; I can't vouch for the quality of these systems.