
Google DeepMind makes breakthrough in making computers sound like humans

Side-by-side comparison of text-to-speech methods as rated by human listeners. Google’s DeepMind is teaching Assistant to sound like humans in a step-by-step process.


Alphabet Inc’s (NASDAQ:GOOGL) artificial intelligence arm DeepMind says it has achieved a milestone in making machine-generated speech sound more natural.

Called WaveNet, the new AI is a deep neural network that generates speech as raw audio waveforms after being trained on samples of real human speech. While it still sounds slightly odd, the new AI speech flows far better than the kinds of responses you’ll get from Siri or Cortana, which chop up recorded human speech and paste it back together in a way that gets individual pronunciations right but throws the flow of the speech completely off. There are multiple unexplored possibilities here; perhaps most impressively, the system is also able to synthesise speech without any text input.

That cut-and-paste approach, which relies on pre-recorded words and phrases, is called concatenative speech synthesis. WaveNet works differently: all the audio samples are fed into the neural network’s algorithm, which learns a complex set of rules determining which tones follow which other tones in every common context of speech. The text input matters, though; training WaveNets without text results in gibberish.
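To make the contrast concrete, here is a minimal Python sketch of the two approaches. This is a hypothetical illustration, not DeepMind’s code: the `model` callable and `clip_library` dictionary are stand-ins, and the real network is far more sophisticated.

```python
import numpy as np

def concatenative_tts(words, clip_library):
    """Old approach: stitch pre-recorded audio clips together.
    Pronunciation is correct, but the joins break the natural flow."""
    return np.concatenate([clip_library[w] for w in words])

def wavenet_style_generate(model, n_samples, context_len=1024):
    """WaveNet-style approach: generate raw audio one sample at a time,
    feeding each new sample back in as context for the next prediction."""
    audio = [0] * context_len                  # silent seed context
    for _ in range(n_samples):
        probs = model(audio[-context_len:])    # distribution over next sample
        audio.append(int(np.random.choice(len(probs), p=probs)))
    return np.array(audio[context_len:])
```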

If WaveNet is trained on recordings from a single speaker, the resulting speech will resemble that speaker. Trained on recordings from many speakers, the same network can generate a variety of different voices, depending on which speaker identity it is told to use. It can reportedly beat existing text-to-speech systems.
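One plausible way to picture that voice-switching is an extra conditioning input per speaker. The sketch below is an assumption about the mechanism for illustration only; the embedding sizes, names, and `model` interface are all invented.

```python
import numpy as np

rng = np.random.default_rng(0)
# One embedding vector per known voice; the network is told which one to use.
speaker_embeddings = {
    "speaker_a": rng.normal(size=16),
    "speaker_b": rng.normal(size=16),
}

def generate_in_voice(model, text_features, speaker_id, n_samples):
    """Same trained model, different voice: the speaker embedding is an
    extra conditioning input that selects which voice comes out."""
    conditioning = speaker_embeddings[speaker_id]
    audio = []
    for _ in range(n_samples):
        probs = model(audio, text_features, conditioning)
        audio.append(int(np.argmax(probs)))    # greedy choice, for brevity
    return np.array(audio)
```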

It’s only natural that computers’ speech synthesis will become more, well, natural: Google and its competitors have invested significant resources in developing personal assistants.

WaveNet can also generate non-speech sounds like breaths and mouth movements.

WaveNet does not read raw text directly: a different system is used to translate written words into audio precursors, like a computer-readable phonetic spelling. The system still has certain shortcomings, which might be corrected once more diverse speaker patterns are introduced into it. Or, like me, you may just rejoice that there’s finally hope for an ebook reader that doesn’t sound like the re-animated corpse of a 1980s Commodore computer.
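A toy sketch of what that text front-end does, assuming a simple pronunciation lookup (the lexicon here is a made-up stand-in for a real pronunciation dictionary):

```python
# Toy pronunciation dictionary standing in for a real lexicon.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def text_to_phonemes(text):
    """Translate written words into a computer-readable phonetic spelling."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON.get(word, ["?"]))  # flag unknown words
    return phonemes

print(text_to_phonemes("hello world"))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```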


With WaveNet, the idea is to directly “model the raw waveform of the audio signal, one sample at a time”, producing more natural-sounding speech. You can listen for yourselves to the samples above: the first is parametric, the second concatenative (think Siri), and the last is the new WaveNet.
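To see what “one sample at a time” means in practice: raw audio is a long stream of numbers, typically 16,000 or more per second, and the WaveNet paper quantizes each one to 256 levels with μ-law companding so that predicting the next sample becomes a 256-way classification. A small, self-contained sketch of that encoding (mine, not DeepMind’s):

```python
import numpy as np

def mu_law_encode(waveform, mu=255):
    """Compress a waveform in [-1, 1] and quantize it to 256 integer levels,
    so predicting the next sample becomes a 256-way classification."""
    compressed = np.sign(waveform) * np.log1p(mu * np.abs(waveform)) / np.log1p(mu)
    return np.round((compressed + 1) / 2 * mu).astype(np.int64)

toy_wave = np.sin(np.linspace(0, 2 * np.pi, 8))  # eight raw samples
print(mu_law_encode(toy_wave))  # one class label (0-255) per sample
```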
