Animated talking heads
Introduction
When we speak, the motion of the face is strongly linked to the sounds we make. The face therefore carries information about the speech, and we, as humans, rely on it heavily to help us understand what is being said. We do this especially when the sound is noisy (such as at parties) or when there is no sound at all (lipreading). Yet even when the sound is perfectly audible, our perception of the face motion affects the sound we perceive, to the extent that if we see mismatched facial motion, we hear different sounds (this is known as the McGurk effect).
In this research, we attempt to capture the correspondence between the face and the sound as a function of the phonetic transcription of what is being said. This opens the door to exciting applications, such as the generation of animated faces from speech, automatic synchronisation of audio and video in sequences containing speech, improved automatic speech recognition with the help of video and, to an extent, even automatic speech reading.
The model
For an in-depth description of our model, please refer to our NIPS 2007 paper (coming soon). We will be presenting this work at the upcoming NIPS conference in December, where our paper was accepted for a full oral presentation.
Evaluation of the model
To evaluate the validity of our approach, we asked volunteers to compare sequences generated by our model with sequences generated by other, existing models and with sequences extracted from real video. This test was done online, and is still accessible for reference and validation purposes at this page. At the end of the test, the volunteers were offered some more information. They were also shown some example sequences, and were asked not to take the test again, so as to avoid biasing the results.
Resources
The data used for this research is available online, here. The code used for the modelling will also be put online here soon.