1. Introduction
A system for the automatic time synchronization of two renditions of the same speech utterance modifies the timing structure of the first utterance (replacement or dub) so as to synchronize it with the second utterance (reference or guide), which serves as the timing reference and may have been produced by the same or by a different speaker. In general, such a system achieves the time synchronization using a two-step analysis-synthesis approach, as illustrated in the conceptual block diagram of Fig. 1:
- Analysis: first, the relative timing differences between the corresponding speech sounds (phonemes) in the two utterances are measured by means of an appropriate timing analysis algorithm. From the resulting timing relationship, the local time-scale modifications, viz. the varying amounts of time stretching and compression that are necessary to bring the time axis of the replacement (Uy) into optimal alignment with that of the reference (Ux), can be derived.
- Synthesis: in the second step, the relative timing discrepancies between the utterances are cancelled out by time-scaling the replacement speech utterance in accordance with the measured timing relationship, such that the timing of the acoustic-phonetic features of the result (Uz) conforms to that of the timing reference. A minimal code sketch of this two-step concept is given below Fig. 1.
Fig. 1: General concept of the automatic time synchronization of two renditions of the same speech utterance.
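To make the analysis step more concrete, the following Python sketch computes a plain DTW alignment between frame-level feature sequences of the reference (Ux) and the replacement (Uy) and converts the resulting warping path into local time-scale factors. It is a minimal illustration under assumed feature representations; the feature extraction and the actual time-scale modification (the synthesis step) are omitted, and none of the names below refer to the actual implementation of [3,4].

```python
import numpy as np

def dtw_path(X, Y):
    """X: (n, d) reference features, Y: (m, d) replacement features.
    Returns the optimal warping path as a list of (i, j) frame pairs."""
    n, m = len(X), len(Y)
    cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # local distances
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # backtrack from the end of both sequences to the start
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def local_scale_factors(path, n_ref):
    """Derive one time-scale factor per reference frame from the local slope of
    the warping path: > 1 means the replacement is locally longer than the
    reference (compress), < 1 that it is locally shorter (stretch)."""
    i_idx, j_idx = np.array(path).T
    # mean replacement-frame index aligned to each reference frame
    w = np.array([j_idx[i_idx == i].mean() for i in range(n_ref)])
    return np.gradient(w)
```

In a complete system, these factors would then drive a waveform-domain time-scaling method (e.g. an overlap-add technique) applied to the replacement signal, which is not shown here.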
In the past, a number of systems have been developed that perform the time synchronization task automatically. Although these systems significantly reduce the amount of time required to produce manually edited or naturally revoiced time-aligned results, it has also been observed that they often produce results that are of unacceptable quality and/or insufficiently synchronized with the reference utterance.
Fig. 2: Functional block diagram of the proposed system for automatic time synchronization [3,4].
At ETRO, we developed a novel approach for automatic time synchronization. In contrast to earlier approaches, all of which perform the timing analysis in one single step, the proposed approach splits the analysis into several steps (see Fig. 2). In the first step, which aims to solve the difficult problem of precisely inserting new, and deleting or resizing existing, non-speech segments, such as breathing pauses, an explicit distinction is made between the speech and non-speech segments in both the reference and the replacement speech waveforms. This information is then used to identify the replacement speech segments that match those of the timing reference using a split Dynamic Time Warping (DTW) algorithm. Thereafter, the timing relationship for each pair of matching speech segments is computed and then processed in such a way that the time-scale modification of each replacement speech segment is performed more gradually (smoothing), while at the same time the speech rate of the time-scaled result is systematically controlled in relation to that of the timing reference (post-processing); a sketch of these smoothing and rate-control ideas follows Table 1. We refer to [3,4] for details of the proposed procedures. Subjective audio-visual listening tests have demonstrated that this approach leads to time-scaled results that are perceptually more acceptable than those obtained using any of the state-of-the-art solutions. As an illustration, Table 1 shows how each stage of the proposed approach contributes to the final synchronized results.
Table 1: Illustration of the contribution of each stage of the proposed approach to the final results.
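As a rough illustration of the smoothing and post-processing ideas (not the exact procedures of [3,4]), the sketch below first smooths the raw per-frame time-scale factors so that the modification evolves gradually, and then rescales them so that the overall speech rate of the result stays tied to that of the reference. The window length and the rate-control rule are illustrative assumptions.

```python
import numpy as np

def smooth_factors(alpha, win=9):
    """Edge-normalized moving-average smoothing of the local time-scale
    factors, so that the time-scale modification varies gradually."""
    kernel = np.ones(win) / win
    num = np.convolve(alpha, kernel, mode="same")
    den = np.convolve(np.ones_like(alpha), kernel, mode="same")
    return num / den

def control_rate(alpha, target_mean=1.0):
    """Rescale the smoothed factors so that their mean equals target_mean,
    i.e. so that the overall speech rate of the result is tied to that of
    the reference rather than drifting with the raw DTW path."""
    return alpha * (target_mean / alpha.mean())

# usage on a hypothetical factor curve for one matched pair of speech segments
rng = np.random.default_rng(0)
alpha_raw = np.clip(rng.normal(1.0, 0.4, 200), 0.3, 3.0)
alpha = control_rate(smooth_factors(alpha_raw))
```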
2. Automatic Dialogue Replacement
Among the many possible uses of time synchronization systems is the automatic post-synchronization of revoiced studio recordings with the corresponding recordings made on the film set: during the production of motion picture soundtracks, dialogues are frequently re-recorded in a studio and used to replace the original ones recorded on the set. This is necessary, for example, if the original recordings have been corrupted by some kind of background noise that is difficult to control, such as that caused by a passing airplane. As another example, it is sometimes argued that an actor can produce a markedly improved spoken performance in a studio compared to the one produced on the set, which is usually very chaotic and makes it difficult to capture the true mood of the scene. In either case, it is well known that a straightforward replacement of the original recordings by the studio dialogues introduces many mismatches between the lip and mouth movements in the picture on the one hand and the actual timing and duration of the individual phonemes in the replacement speech on the other. Automatic Dialogue Replacement (ADR) is the most widespread technique for the indirect compensation of such audio-visual "lip-synch" errors. It operates as follows: for each scene that contains lines that need to be replaced, the original actors are invited into the studio for a special dubbing session, during which the appropriate pictures are projected onto a screen in front of them while the original audio recordings are replayed over headphones. The actors then revoice the original dialogues, ensuring not only that their replacement speech precisely synchronizes with the on-screen lip movements, but also that the nuances of their performances match the original. Post-synchronizing dialogue is generally considered very difficult because most actors find it hard to maintain synchrony while speaking. In addition, its repetitive nature also makes it very dull and time-consuming for the actors, as they often need to redeliver their lines over and over again until the director and the dialogue editor have settled on a compromise between the desired level of performance and timing. Therefore, a system that can automatically time-align one single well-performed dialogue redelivery will not only save time and money, it will also free the actors from the technical preoccupation of speaking in synchrony with a picture soundtrack and thus allow them to fully concentrate on their primary task of acting and producing great performances!
3. LipSynch - A tool for automatic lip-synchronization
Fig. 3 shows a screenshot of LipSynch, a software tool developed for the automatic replacement of dialogues in motion picture, video or TV fragments. The tool allows both audio and video to be edited; these are displayed in the corresponding panels ("waveform editor" and "video editor"). The GUI also provides control over the settings of the speech/non-speech classification algorithm ("classification"), as well as those of the (optional) noise tracking and noise suppression algorithms ("noise suppression"). In order to replace the audio of an original video fragment, one first needs to partition both the original and the replacement audio waveforms into speech and non-speech (e.g. breathing pauses) intervals. Typically, this segmentation is achieved in two steps:
- In the first step, an energy-based segmentation algorithm is used to distinguish between speech and non-speech intervals by comparing local estimates of the signal energy, evaluated in blocks of 20 ms with 50% overlap, against a user-adjustable, utterance-specific threshold. This is illustrated in the lower subpanel of the waveform editor: the blue curve represents the smoothed energy contour of the waveform displayed on top of it, which is itself an enhanced version of the original noisy waveform in the upper subpanel. Applying a threshold (green line) to this function yields a first estimate of where the speech segments are located (red square wave). Several other parameters of the algorithm can be fine-tuned in the GUI in order to optimize the segmentation process. For example, in the classification subpanel, one can fine-tune the values for the energy threshold (Th), the minimum duration of the non-speech (NSS) and speech segments (SS), as well as the relaxation of the speech segment onset (RX on) and offset time markers (RX off); a code sketch of such a segmentation is given after this list. Further, in the absence of noise (typically the dubbed signal in the context of ADR), the resulting speech/non-speech transition markers are usually accurate enough to allow further processing. However, for noisy speech waveforms (such as guide tracks that were recorded on location), the Signal-to-Noise Ratio (SNR) first needs to be enhanced in order to obtain acceptable estimates. For this purpose, one implementation of LipSynch is based on the classical spectral subtraction approach, in which the estimate of the noise power spectrum is continuously tracked using time- and frequency-dependent smoothing parameters, which are themselves adjusted based on speech presence probabilities in sub-bands. The speech presence is determined by computing the ratio of the smoothed input power spectrum to its local minimum, which is updated continuously by averaging past values of the noisy speech power spectra with a look-ahead factor.
- Because a precise speech/non-speech segmentation is essential for the successful application of the proposed split DTW algorithm, the obtained speech/non-speech transition markers are typically checked manually for timing errors in a second step. For clean speech waveforms, this manual intervention is often minimal or even unnecessary, but for noisy speech waveforms, the post-correction is generally indispensable. LipSynch allows such corrections to be made in a very efficient, interactive way: at any time, it is possible to zoom into and scroll through the waveforms, select and audition specific portions of them, and correct the transition markers by dragging them to the left or to the right. It is even possible to insert new or delete existing speech and non-speech segments (this is useful, for example, if someone is talking in the background).
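The following Python sketch illustrates an energy-based speech/non-speech segmentation of the kind controlled by the parameters Th, NSS, SS, RX on and RX off. The exact smoothing and relaxation rules of LipSynch are not detailed here, so the specifics below (the smoothing window, the order of the minimum-duration checks) are assumptions for illustration only; the noise-suppression front end needed for noisy guide tracks is omitted.

```python
import numpy as np

def segment_speech(x, fs, th=0.05, nss=0.150, ss=0.250, rx_on=0.0, rx_off=0.025):
    """Return a list of [start, end] times (s) of the detected speech segments."""
    block = int(0.020 * fs)                                   # 20 ms blocks
    hop = block // 2                                          # 50% overlap
    frames = np.lib.stride_tricks.sliding_window_view(x, block)[::hop]
    rms = np.sqrt(np.mean(frames.astype(float) ** 2, axis=1))
    energy = np.convolve(rms, np.ones(5) / 5, mode="same")    # smoothed energy contour
    active = energy > th * energy.max()                       # Th as fraction of max energy

    # collect raw speech segments from the binary decision
    hop_s, segs, start = hop / fs, [], None
    for k, a in enumerate(np.append(active, False)):
        if a and start is None:
            start = k
        elif not a and start is not None:
            segs.append([start * hop_s, k * hop_s])
            start = None

    # enforce minimum non-speech (NSS) and speech (SS) durations
    merged = []
    for s in segs:
        if merged and s[0] - merged[-1][1] < nss:             # close too-short pauses
            merged[-1][1] = s[1]
        else:
            merged.append(s)
    merged = [s for s in merged if s[1] - s[0] >= ss]         # drop too-short speech

    # relax onset/offset markers outwards by RX on / RX off
    return [[max(0.0, s - rx_on), e + rx_off] for s, e in merged]
```

With the default values of Fig. 3, a call such as segment_speech(x, fs) returns the speech intervals of a waveform x sampled at fs Hz.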
Once the two waveforms have been segmented, the "align" button processes the data according to the procedure described in [3,4]. Finally, the aligned audio is automatically re-assembled with the original video, so that both the achieved lip-synchronization accuracy and the perceived quality of the time-aligned audio can be evaluated.
Fig. 3: Screenshot of LipSynch. Default values for the segmentation process were chosen as follows: Th = 0.05 = 5% of the maximum absolute RMS signal energy; NSS = 150ms; SS = 250ms; RX on = 0ms; RX off = 25ms.
4. Demos
In this section, we present a number of demos that illustrate the performance of the developed system.
- Plain ADR: For the first demo, we asked a female speaker to read a passage intended for use in a multimedia platform that aims to enrich the visual experience of visitors of art exhibitions by offering the possibility to retrieve related accompanying audio, video, text and images according to personal preference via a mobile device (http://www.universumdigitalis.be/). Because the level of microphone noise in the original recording proved unacceptable for use in the application, the passage was re-recorded and then synchronized with the original recording. The final results are summarized in Table 2.
Table 2: ADR demo (download script here).
We remark that the time-aligned audio goes slightly out of synch at around second 23. This can be explained by the fact that the speaker forgot to pronounce the word "Sam" in the dub. In such situations, LipSynch stretches the phonemes in the dub around the gap such that the time-aligned audio strikes a good compromise between the achieved lip-synch accuracy and the perceived voice quality.
- ADR in Background Noise: The scenario in the second demo shows a man cleaning his living room while talking. Because both the vacuum cleaner fan and the brushing movements on the wooden floor render the speech unintelligible (SNR < 0 dB), the cleaner was asked to revoice his transcribed speech at the Nosey Elephant Studios. In addition, another male (♂) and a female (♀) speaker were asked to produce a similar dub. Then, using LipSynch, the three replacement speech samples were synchronized with the original audio recording and re-assembled with the video; the results can be viewed in Table 3.
Table 3: ADR in background noise & voice transplantation demo (download script here).
- EUSIPCO 2010: In [3,4], we formally tested our ADR system by conducting a series of subjective audio-visual listening tests with the aim of comparing its performance with that of the industry-standard VocALign PRO (V4.0) (http://www.synchroarts.com/), which is widely considered the benchmark system for ADR and automatic time synchronization. For this purpose, an audio-visual corpus was recorded, comprising a total of 80 different samples produced by two male and six female native Dutch speakers (as an illustration, Table 4 shows a reference, revoiced and time-aligned sample for one of the female speakers). The data from this corpus was extracted from two sets of recording sessions:
- During a first series of recording sessions, we invited two speakers at a time for an informal 30-minute table talk, during which they were allowed to chat freely on a subject of their choice. To produce as much useful data as possible within the given time limit, the participants were allowed to change the subject of the talk at any time. From each of the recorded conversations, a total of five samples was extracted for each person. In doing so, care was taken that the selected samples were sufficiently long to cover a wide range of speaking rates (and variations thereof) as well as pauses of different types and durations.
- In a second series, the same speakers were asked to mimic the selected parts of their conversations by revoicing the literal transcriptions of their lines, displayed on a large screen, at a pace they felt comfortable with. In contrast to the traditional approach in ADR, we did not request the speakers to deliver performances with near-perfect lip-synch accuracy. As a consequence, substantial timing differences can generally be observed between the corresponding sample pairs, which therefore constitute a suitable test database for investigating the alignment capabilities of a given alignment algorithm, and in particular its robustness against the acoustic-phonetic differences described in [2].
Table 4: Sample from the EUSIPCO-2010 experiment (download script here).
- EUSIPCO 2012: Current research aims at improving the across-speaker robustness of automatic time synchronization. This is motivated by recent experiments, which have shown that the overall performance of our time synchronization system as described in [3,4] degrades drastically when moving from speaker-dependent to speaker-independent time synchronization, especially when the reference and replacement speakers are of opposite gender. While the across-speaker variability of the acoustic speech signal stems from a complex combination of many factors, such as differences in speaking style and pronunciation, it is commonly agreed that a major part of the variability is due to physiological differences between speakers, in particular differences in their vocal tract length (VTL) and shape. In one of the simplest physiological models, the human vocal tract is treated as a uniform tube resonator. According to this model, the resonant or formant frequencies are inversely proportional to the length of the tube. As a result, Vocal Tract Length Normalization (VTLN) approaches typically neutralize speaker-specific aspects by warping the frequency axis of the acoustic speech signal in accordance with an appropriate frequency mapping function. Currently, we study the use of this speaker normalization technique for the time synchronization task. In [5], we developed a novel method that attempts to reduce the variability in spectral formant peak positions for corresponding speech sounds produced by different speakers. This is achieved by means of an efficient bilinear frequency warping procedure, in which the amount of warping is iteratively optimized in accordance with a criterion that is directly related to the output of the standard Dynamic Time Warping algorithm; a sketch of such a bilinear warping follows Table 5. Subjective listening tests performed on mixed-gender time-aligned results obtained with a subset of data from the English EUROM1 Many Talker Set have shown that the proposed procedure significantly improves the overall speech quality and time synchronization accuracy. As an illustration, Table 5 shows some of the obtained results. Depending on the timing relationship (time warping path) that is used in the synthesis stage, we distinguish three types of time-aligned results: the Baseline (B) system uses a smoothed version of the time warping path obtained with the standard DTW algorithm, whereas the Intermediate (I) and Proposed (P) systems first apply the proposed VTLN procedure in order to obtain an improved version of the standard DTW path, which is then smoothed, and additionally post-processed by the Proposed system only.
Table 5: Some results from the EUSIPCO-2012 experiment (download script here); SCx and SCy represent EUROM1 speaker codes.
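To give an idea of what a bilinear (first-order all-pass) frequency warping for VTLN looks like, the sketch below warps the frequency axis of magnitude-spectrum frames through the phase response of a first-order all-pass filter and selects the warping factor by a simple grid search over the resulting DTW alignment cost. The grid search and the dtw_cost callback are illustrative stand-ins for the iterative optimization criterion of [5], not the actual implementation.

```python
import numpy as np

def bilinear_warp(omega, alpha):
    """Map normalized frequencies omega in [0, pi] through the phase response
    of a first-order all-pass filter with warping factor alpha (|alpha| < 1)."""
    return omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

def warp_spectrum(frame, alpha):
    """Resample one magnitude-spectrum frame onto the warped frequency grid."""
    omega = np.linspace(0.0, np.pi, len(frame))
    return np.interp(omega, bilinear_warp(omega, alpha), frame)

def pick_alpha(spec_x, spec_y, dtw_cost, grid=np.linspace(-0.2, 0.2, 21)):
    """Choose the warping factor that minimizes the DTW alignment cost between
    the reference spectrogram spec_x and the warped replacement spec_y;
    dtw_cost is any user-supplied function returning the accumulated DTW cost."""
    costs = [dtw_cost(spec_x, np.array([warp_spectrum(f, a) for f in spec_y]))
             for a in grid]
    return grid[int(np.argmin(costs))]
```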