Microsoft's New AI Has Learned to Imitate the Human Voice

Author at ApiX-Drive

An intelligent text-to-speech converter can accurately copy not only a particular person's voice, but also the speaker's emotional tone and the acoustic environment of the recording.

On January 5, 2023, Microsoft developers presented a new AI model that converts text to speech and closely mimics a human voice after hearing a sample of it only 3 seconds long. They call the model VALL-E. Once the model has picked up a voice, it can act as a full-fledged stand-in for its owner, preserving the nuances of timbre and emotional tone.

According to Microsoft, VALL-E is a neural codec language model. It is based on EnCodec, an audio codec technology that Meta developers announced in October 2022. The creators of VALL-E believe that it can be used to build a new generation of text-to-speech applications, as well as speech editing services and tools for creating high-quality audio content. The model is also expected to complement other generative AI models (for example, GPT-3).
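To make the term "neural codec language model" more concrete: a codec such as EnCodec turns audio into a sequence of discrete codes, and a language model then predicts those codes much as a text model predicts words. The sketch below only counts how many codes a 3-second voice sample would produce. The article gives none of these figures; the frame rate, number of codebooks, and codebook size are assumptions based on commonly cited EnCodec settings for 24 kHz audio, and the names CodecConfig, prompt_token_count, and bitrate_bps are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecConfig:
    # All three values are assumptions, not figures from the article.
    frame_rate_hz: int = 75     # codec frames produced per second of audio
    num_codebooks: int = 8      # discrete codes emitted per frame
    codebook_size: int = 1024   # number of possible values for each code


def prompt_token_count(seconds: float, cfg: CodecConfig = CodecConfig()) -> int:
    """Number of discrete codes needed to represent `seconds` of audio."""
    frames = round(seconds * cfg.frame_rate_hz)
    return frames * cfg.num_codebooks


def bitrate_bps(cfg: CodecConfig = CodecConfig()) -> int:
    """Bits per second implied by the codec settings."""
    # Bits needed to index one codebook entry (10 bits for 1024 entries).
    bits_per_code = (cfg.codebook_size - 1).bit_length()
    return cfg.frame_rate_hz * cfg.num_codebooks * bits_per_code


if __name__ == "__main__":
    print("codes in a 3-second sample:", prompt_token_count(3.0))
    print("implied bitrate (bits/s):", bitrate_bps())
```

Under these assumed settings, the voice sample is a short sequence of codes that the model can condition on, which is one way to see why a few seconds of audio may be enough to carry a speaker's timbre.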

As always, the coin has two sides. The developers are aware of the danger such technology poses. Since VALL-E can synthesize speech that is hard to tell apart from a real voice, it could be used to spoof voice identification systems or to impersonate another person. Building a separate model that distinguishes synthesized speech from real speech would help reduce these risks.