I’ve written here in the past about speech recognition (column DD, and brief notes on Google Voice), but I haven’t written much about speech synthesis, except for a post about song synthesis and an aside in column iii.
So I’m pleased to note that Google has made some remarkable improvements in text-to-speech lately.
For example, as I posted elsewhere at the time, a couple of months ago Google provided audio samples from a paper titled “Natural TTS [Text To Speech] Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.” I recommend going to that page and, without listening to any of the earlier samples, scrolling down to the “Tacotron 2 or Human?” section. Listen to the four pairs of recordings, and see if you can tell, in each pair, which one is machine-generated and which is a human.
(Google apparently hasn't said what the answers are, but an article at Inc provides a likely-sounding meta-criterion.)
After you’ve listened to the computer-or-human samples, it’s worth also listening to the other samples on the page. Most of those do still sound machine-generated to me, but not nearly as much as the output of most text-to-speech systems.
And this week, Google announced a new text-to-speech service called Cloud Text-to-Speech that anyone can use. It’s partly powered by WaveNet, a machine-learning system that generates audio waveforms directly with a neural network, rather than stitching together prerecorded speech fragments the way most traditional speech synthesis software does.
The main documentation page lets you enter text and choose one of the many available voices and accents to read the text aloud. Unfortunately, only a few of those voices are currently powered by WaveNet (possibly only for US English; I’m not sure), but I’m hoping there’ll be a wider range of WaveNet voices soon.
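For the curious, here’s a rough sketch of what a request to the service looks like. This just builds the JSON body that the REST API’s text:synthesize endpoint expects; the voice name `en-US-Wavenet-A` is illustrative (check the service’s docs for the voices actually available), and actually sending the request requires an API key or credentials, which I omit.

```python
import json

def build_tts_request(text, voice_name="en-US-Wavenet-A", language_code="en-US"):
    """Build the JSON body for a Cloud Text-to-Speech text:synthesize call.

    Field names follow Google's v1 REST API; the voice name here is an
    illustrative guess -- consult the documentation for current voice names.
    """
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": "MP3"},
    }

body = build_tts_request("Hello from a WaveNet voice!")
print(json.dumps(body, indent=2))

# You'd POST this body (with your credentials) to:
#   https://texttospeech.googleapis.com/v1/text:synthesize
# The response carries the synthesized audio, base64-encoded, in its
# "audioContent" field.
```

In other words, the service is a plain HTTPS API: text and a voice choice in, encoded audio out.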