Soon, you might have a hard time telling the difference between human and computer voices.
Apple’s Siri personal assistant is getting a lot smarter in the upcoming iOS 10, but odds are she’ll still sound like a computer. Meanwhile, a subsidiary of Google (her creator’s rival) is working on an entirely new model for teaching computers to convert text to speech.
It’s called WaveNet, and Google says it can mimic any human voice while sounding more natural than text-to-speech algorithms available today.
WaveNet is based on research from DeepMind, which this week offered an in-depth look at its efforts to synthesize audio signals for more natural-sounding artificial voices. It all starts with convolutional neural networks, the same technology that powers everything from self-driving cars to disease detection.
Neural networks also now power some current text-to-speech products, including Siri, which two years ago was rebuilt to take advantage of this form of machine learning. But Siri and her colleagues, like Google Voice Search or Amazon’s Alexa, still use a database of short speech fragments that are strung together to form complete words and sentences. The result is a halting, emotionless voice, even if it is understandable.
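To see why that approach sounds stilted, consider a toy version of concatenative synthesis in Python. Everything here is invented for illustration: the fragment table and the sine-burst stand-ins for recorded speech are not anyone's real pipeline, but the stitching step is the same in spirit as what fragment-based systems do.

```python
# Toy illustration of concatenative text-to-speech: pre-recorded audio
# fragments are looked up in a database and strung together to form an
# utterance. The fragment table and unit names below are invented for
# this sketch; real systems store thousands of recorded speech units.

import numpy as np

SAMPLE_RATE = 16_000  # samples per second, typical for speech audio

def fake_fragment(freq_hz: float, duration_s: float = 0.1) -> np.ndarray:
    """Stand-in for a recorded speech fragment: a short sine burst,
    so the script runs standalone without any audio files."""
    t = np.linspace(0, duration_s, int(SAMPLE_RATE * duration_s), endpoint=False)
    return np.sin(2 * np.pi * freq_hz * t)

# Hypothetical database mapping speech units to waveforms.
fragment_db = {
    "HH": fake_fragment(220.0),
    "EH": fake_fragment(330.0),
    "L":  fake_fragment(262.0),
    "OW": fake_fragment(392.0),
}

def synthesize(units: list[str]) -> np.ndarray:
    """Concatenate stored fragments end to end. The joins between
    units are where the halting, robotic quality creeps in."""
    return np.concatenate([fragment_db[u] for u in units])

audio = synthesize(["HH", "EH", "L", "OW"])  # "hello", crudely
print(f"{audio.size} samples = {audio.size / SAMPLE_RATE:.2f}s of audio")
```

Every join between fragments is a seam the listener can hear, which is why stitched-together speech tends toward that flat, mechanical cadence.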
What if, instead of stitching together speech fragments, a computer could generate raw audio waveforms directly? Not only would that allow for more natural-sounding speech, it would also let the computer mimic virtually any sound, even faithfully reproducing music. DeepMind engineers set to work.
At first, they fought an uphill battle against the inherent density of raw audio, which packs 16,000 or more samples into every second of sound. But the engineers were at last able to build a neural network trained on real waveforms from human speakers. It generates audio one sample at a time, predicting a probability distribution for each new sample from all the samples that came before it, in essence teaching the computer how to speak like a human.
“Building up samples one step at a time like this is computationally expensive,” according to DeepMind, “but we have found it essential for generating complex, realistic-sounding audio.”
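To make the one-step-at-a-time idea concrete, here is a minimal sketch of autoregressive waveform generation. This is not DeepMind's code: the stand-in next_sample_distribution() replaces WaveNet's deep stack of dilated convolutions, and only the loop structure, the 256 quantized amplitude levels, and the 16,000-samples-per-second figure come from the published work.

```python
# Minimal sketch of the autoregressive idea behind WaveNet (not
# DeepMind's implementation): each new audio sample is drawn from a
# probability distribution conditioned on the samples generated so far.

import numpy as np

rng = np.random.default_rng(0)
QUANT_LEVELS = 256      # 8-bit quantized amplitude levels, as in the paper
RECEPTIVE_FIELD = 64    # how many past samples this toy model looks at

def next_sample_distribution(history: np.ndarray) -> np.ndarray:
    """Stand-in for the neural network: returns a probability over the
    256 possible values of the next sample. Here it simply prefers
    values near the recent mean, which is enough to demonstrate the
    generation loop; WaveNet learns this distribution from real speech."""
    if history.size:
        center = int(history[-RECEPTIVE_FIELD:].mean())
    else:
        center = QUANT_LEVELS // 2
    logits = -0.01 * (np.arange(QUANT_LEVELS) - center) ** 2
    probs = np.exp(logits)
    return probs / probs.sum()

def generate(n_samples: int) -> np.ndarray:
    """One model evaluation per sample: 16,000 of them for a single
    second of 16 kHz audio, which is why this is expensive."""
    audio = np.empty(n_samples, dtype=np.int64)
    for i in range(n_samples):
        probs = next_sample_distribution(audio[:i])
        audio[i] = rng.choice(QUANT_LEVELS, p=probs)
    return audio

clip = generate(16_000)  # one second of toy "audio", sample by sample
print(clip[:10])
```

Even in this toy, producing one second of audio means 16,000 sequential passes through the model, which is exactly the cost DeepMind is describing.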
The result is remarkable. DeepMind provided samples of WaveNet's speech alongside output from the systems in use today, and the difference in inflection, tone, and emotion is immediately apparent. Have a listen for yourself.
It’s only natural that computers’ speech synthesis will become more, well, natural: Google and its competitors have invested significant resources in developing personal assistants. In order for them to catch on, humans need to think of them less as a gimmick and more as articulate, pleasant robots.