|
High-Quality Personalized Text-to-Speech Synthesis for Belgian Standard Dutch

Presenter: Mr Lukas Latacz [Email]

Abstract

Human speech is highly diverse: there are more than 7000 languages in the world and even more regional language variants, and speakers adapt their speaking style, partly unconsciously, to the situation in which they are speaking. In many applications, using recorded human speech directly is too inconvenient or too costly, and speech synthesis is used instead. Synthetic speech is typically generated from input text. People with communicative disabilities often depend on synthetic speech, for example in a speaking device for people who can no longer speak properly, or in a reading device for people with reduced eyesight or dyslexia.

The number of commercially available synthetic voices is still rather small. This is especially true for medium-sized languages such as Dutch and for specific language variants such as Belgian standard Dutch (sometimes also referred to as Flemish). Each synthetic voice typically has a single, neutral speaking style, which is not appropriate in every situation. This thesis focuses on how to build more appropriate high-quality synthetic voices without requiring significant effort or expert knowledge from the voice builder. The lack of high-quality Belgian standard Dutch synthetic voices available for research purposes inspired us to construct a new high-quality speech synthesizer at the Vrije Universiteit Brussel, the DSSP synthesizer, which can synthesize Belgian standard Dutch and English speech.

This work is structured into three main parts. The first part describes how the recordings of a speaker are used to create new synthetic voices. Our synthesizer supports the two dominant speech synthesis techniques: unit selection synthesis and statistical parametric synthesis.
The latter uses a flexible statistical parametric model of speech, but sounds less natural than unit selection synthesis, which selects small speech units from the recordings and concatenates their waveforms. The second part describes the language-specific aspects of our Belgian standard Dutch synthetic voices and how a speaker's speaking style can be captured by modeling speaker-specific pronunciations, prosodic phrase breaks, silences, accented words, and prominent syllables. Finally, the third part looks at some use cases in which the DSSP synthesizer is used to create personalized high-quality synthetic speech.
|