Speech synthesis. How does TTS work

Speech synthesis is performed in several stages. First, a dedicated algorithm prepares the text so that it is convenient for the machine to read: it spells out all numbers in words and expands abbreviations. The text is then divided into separate phrases, each of which is read with a continuous intonation; for this, the system relies on punctuation and fixed constructions.
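The preparation stage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abbreviation table, the function names, and the limit of three-digit numbers are all invented for the example and do not describe any particular TTS engine.

```python
import re

# Hypothetical abbreviation table; a real system would use a far larger one.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 999 (the range is an assumption of this sketch)."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text):
    """Expand known abbreviations, then spell out numbers of up to three digits."""
    for abbr, full in ABBREVIATIONS.items():
        # Match the abbreviation only when it does not continue a longer word.
        text = re.sub(r"(?<!\w)" + re.escape(abbr), full, text)
    # Longer digit runs (for example years) are left untouched here.
    return re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())), text)


def split_phrases(text):
    """Split normalized text into phrases at punctuation marks."""
    parts = re.split(r"[.,;:!?]+", text)
    return [p.strip() for p in parts if p.strip()]


normalized = normalize("Dr. Smith lives at 42 Elm St., room 7.")
phrases = split_phrases(normalized)
```

Abbreviations are expanded before splitting so that the period in "Dr." is not mistaken for a phrase boundary.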

Next, a phonetic transcription is produced for every word. To determine how a word is pronounced and where its stress falls, the system consults built-in dictionaries compiled by people. If a word is missing from the dictionaries, the computer builds the transcription itself, based on academic rules. If those rules are not enough, statistical rules come into play: the system goes through announcers' recordings and determines which syllable the announcers stressed.
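The three-step fallback (dictionary, then rules, then statistics) might look roughly like the sketch below. It covers only stress placement, not the full transcription, and every table in it is a tiny hypothetical stand-in: the lexicon, the suffix rules, and the "observations from recordings" are invented for illustration.

```python
from collections import Counter

# Stress positions are counted from the end of the word (1 = last syllable).

# Hypothetical hand-compiled lexicon.
LEXICON = {"computer": 2, "record": 2}

# Hypothetical suffix rules standing in for the "academic rules".
SUFFIX_RULES = {"tion": 2, "ity": 3}

# Hypothetical observations extracted from announcer recordings:
# (word, stressed syllable counted from the end).
OBSERVED = [("synthesis", 3), ("synthesis", 3), ("synthesis", 2)]


def stressed_syllable(word):
    """Return (position_from_end, source), or (None, "unknown") if nothing applies."""
    word = word.lower()
    # 1. Dictionary compiled by people.
    if word in LEXICON:
        return LEXICON[word], "lexicon"
    # 2. Rule-based guess.
    for suffix, position in SUFFIX_RULES.items():
        if word.endswith(suffix):
            return position, "rule"
    # 3. Statistics: pick the stress the announcers used most often.
    votes = Counter(pos for w, pos in OBSERVED if w == word)
    if votes:
        return votes.most_common(1)[0][0], "statistics"
    return None, "unknown"
```

For example, `stressed_syllable("information")` is resolved by the suffix rule, while `stressed_syllable("synthesis")` falls through to the recorded observations.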

Once the transcription is ready, the computer calculates how many frames (fragments of 25 milliseconds) the utterance contains. Each frame is then described by a set of parameters: which phoneme it is part of, what position it occupies, and which syllable that phoneme belongs to. For each vowel, the system also records whether it is stressed or unstressed. Finally, the system works out the correct intonation, using data about the phrase and the sentence.
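As a rough illustration of the frame step, the sketch below cuts a sequence of phonemes into 25-millisecond frames and attaches a few descriptive fields to each one. The phoneme durations, the `Frame` fields, and the function name are assumptions made for the example; a real system predicts durations and uses a much richer feature set.

```python
import math
from dataclasses import dataclass

FRAME_MS = 25  # frame length given in the article


@dataclass
class Frame:
    phoneme: str            # which phoneme this frame is part of
    index_in_phoneme: int   # position of the frame inside the phoneme
    frames_in_phoneme: int  # how many frames the phoneme spans
    syllable: int           # index of the syllable containing the phoneme
    stressed: bool          # whether the phoneme is a stressed vowel


def build_frames(phonemes):
    """phonemes: list of (symbol, duration_ms, syllable_index, stressed)."""
    frames = []
    for symbol, duration_ms, syllable, stressed in phonemes:
        # Round up so that even a very short phoneme gets one frame.
        count = max(1, math.ceil(duration_ms / FRAME_MS))
        for i in range(count):
            frames.append(Frame(symbol, i, count, syllable, stressed))
    return frames


# "hello" with made-up durations in milliseconds.
hello = [("HH", 50, 0, False), ("AH", 75, 0, False),
         ("L", 50, 1, False), ("OW", 150, 1, True)]
frames = build_frames(hello)
print(len(frames))  # 2 + 3 + 2 + 6 = 13 frames
```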

The system then uses an acoustic model to read the prepared text. The model establishes correspondences between sounds and phonemes with particular characteristics. Thanks to machine learning, the acoustic model knows how to pronounce each phoneme correctly and how to give the sentence the right intonation. The more data the model is trained on, the better the result.

As for voices, they are recognizable first of all by their timbre, which depends on the structure of the speaker's vocal apparatus. The timbre of any voice can be modeled, that is, its characteristics can be described; it is enough for the speaker to read a small amount of text in a studio. After that, the timbre data can be used for speech synthesis in any language. When the system needs to say something, it uses a sound wave generator called a vocoder. The vocoder is loaded with information about the frequency characteristics of the phrase, obtained from the acoustic model, together with the timbre data, which gives the voice its recognizable coloring.
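To make the idea of combining frequency information with timbre more concrete, here is a deliberately toy stand-in for a vocoder. It is simple additive synthesis, not how production vocoders work: the pitch value per frame plays the role of the acoustic model's output, and a list of harmonic amplitudes plays the role of the timbre. The sample rate, the function name, and the input values are all assumptions.

```python
import math

SAMPLE_RATE = 16000  # samples per second (an assumption of this sketch)
FRAME_MS = 25        # frame length given in the article


def synthesize(pitch_per_frame, timbre):
    """Generate a waveform from per-frame pitch and a fixed timbre.

    pitch_per_frame: fundamental frequency in Hz for each 25 ms frame.
    timbre: positive relative amplitudes of harmonics 1, 2, 3, ...
    """
    samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000
    norm = sum(timbre) or 1.0  # keeps the output within [-1, 1]
    phase = 0.0                # carried across frames to avoid clicks
    samples = []
    for f0 in pitch_per_frame:
        step = 2 * math.pi * f0 / SAMPLE_RATE
        for _ in range(samples_per_frame):
            phase += step
            value = sum(amp * math.sin(k * phase)
                        for k, amp in enumerate(timbre, start=1))
            samples.append(value / norm)
    return samples


# Two frames with slightly rising pitch and a three-harmonic "voice".
wave = synthesize([120.0, 130.0], [1.0, 0.5, 0.25])
```

Changing only the `timbre` list while keeping the same pitch contour yields a differently colored sound, which is the separation between phrase-level frequency data and voice timbre that the paragraph above describes.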

It is worth noting that modern speech synthesis technologies still have some problems. The first is artificiality. Synthesized speech is harder for a person to process, so understanding it takes additional effort. As a result, people can comfortably listen to synthesized speech for only about 20 minutes at a time. Synthesized speech also tends to lack emotional coloring and has low noise immunity: even slight background noise interferes with its perception.

In 2019, speech synthesis became topical for another serious reason. The Federal Communications Commission issued an order that is not very pleasant for game developers. It requires all games released after January 2019 to provide players with a speech synthesis function if the game uses text chat. These requirements were introduced so that people with disabilities can play games fully.

Microsoft has already added to the Xbox One development kit the ability to incorporate real-time text transcription of audio chat and to have written text read aloud into the audio chat.