Tuesday, January 17, 2017

Synthesizing your voice: WaveNet by Google DeepMind

I remember the days, sometime in the 90s, when computer-generated voices sounded, well, synthetic. Today they have gotten very good, although you can still tell the difference between a human and a machine speaking to you. Then again, it's probably a good thing that you can tell the difference. Think about all the implicit expectations we would have if we thought a human was speaking to us...

But then again, think about how much existing text content we could leverage if a natural-sounding voice read it to us. Many e-learning platforms already use TTS, but to be honest most of them do not cut it. It's a different experience to watch a YouTube tutorial with an energetic tutor who grabs my attention.

But technology keeps catching up: WaveNet by Google DeepMind looks promising, generating voices from actual audio samples. Imagine hearing your own voice reading a book or a tutorial without you ever having read it aloud (yes, I know it's awkward to hear your own voice when you are not used to it).

Based on deep learning techniques, WaveNet picks up subtleties such as breathing rhythm and individual intonation. Energizing the generated TTS with some markup is probably not so far away...
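
To make the markup idea a bit more concrete, here is a rough sketch of what "energizing" a sentence could look like using SSML, the W3C markup language for speech synthesis. The helper name `energize` and its default values are made up for illustration, and whether a WaveNet-based voice would accept this kind of markup is purely an assumption on my part.

```python
from xml.sax.saxutils import escape


def energize(text, rate="fast", pitch="+10%"):
    """Wrap plain text in SSML prosody markup.

    Hypothetical helper: the name and the defaults are illustrative only.
    """
    # escape() protects &, < and > so the result stays well-formed XML.
    return (
        "<speak>"
        f'<prosody rate="{rate}" pitch="{pitch}">{escape(text)}</prosody>'
        "</speak>"
    )


print(energize("Let's build this & ship it!"))
```

The resulting string could then be handed to any TTS engine that understands SSML, which would read the sentence a little faster and at a slightly higher pitch than its default voice.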