Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis This publication appears in: Speech Communication Authors: W. Mattheyses, L. Latacz and W. Verhelst Volume: 55 Issue: 7-8 Pages: 857-876 Publication Date: Sep. 2013
Abstract: The use of visemes as atomic speech units in visual speech analysis and synthesis systems is well-established. Viseme labels are determined using a many-to-one phoneme-to-viseme mapping. However, due to visual coarticulation effects, an accurate mapping from phonemes to visemes should define a many-to-many mapping scheme instead. In this research it was found that neither the use of standardized nor speaker-dependent many-to-one viseme labels could satisfy the quality requirements of concatenative visual speech synthesis. Therefore, a novel technique to define a many-to-many phoneme-to-viseme mapping scheme is introduced, which makes use of both tree-based and k-means clustering approaches. We show that these many-to-many viseme labels more accurately describe the visual speech information as compared to both phoneme-based and many-to-one viseme-based speech labels. In addition, we found that the use of these many-to-many visemes improves the precision of the segment selection phase in concatenative visual speech synthesis using limited speech databases. Furthermore, the resulting synthetic visual speech was both objectively and subjectively found to be of higher quality when the many-to-many visemes are used to describe the speech database and the synthesis targets.
|