
Realistic Face Animation for Speech

Gregor A. Kalberer
Computer Vision Group, ETH Zürich, Switzerland
[email protected]

Luc Van Gool
Computer Vision Group, ETH Zürich, Switzerland
ESAT / VISICS, Kath. Univ. Leuven, Belgium
[email protected]

Keywords: face animation, speech, visemes, eigen space, realism

Abstract

Realistic face animation is especially hard as we are all experts in the perception and interpretation of face dynamics. One approach is to simulate facial anatomy. Alternatively, animation can be based on first observing the visible 3D dynamics, extracting the basic modes, and putting these together according to the required performance. This is the strategy followed by the paper, which focuses on speech. The approach follows a kind of bootstrap procedure. First, 3D shape statistics are learned from a talking face with a relatively small number of markers. A 3D reconstruction is produced at temporal intervals of 1/25 seconds. A topological mask of the lower half of the face is fitted to the motion. Principal component analysis (PCA) of the mask shapes reduces the dimension of the mask shape space. The result is two-fold. On the one hand, the face can be animated; in our case it can be made to speak new sentences. On the other hand, face dynamics can be tracked in 3D without markers for performance capture.

Introduction

Realistic face animation is a hard problem. Humans will typically focus on faces and are incredibly good at spotting the slightest glitch in an animation. On the other hand, there is probably no shape more important for animation than the human face. Several applications come immediately to mind, such as games, special effects for movies, avatars, and virtual assistants for information kiosks. This paper focuses on the realistic animation of the mouth area for speech.

Face animation research dates back to the early 1970s. Since then, the level of sophistication has increased dramatically. For example, the human face models used in Pixar's Toy Story had several thousand control points each [1]. Methods can be distinguished by mainly two criteria. On the one hand, there are image-based and 3D model-based methods. The method proposed here uses 3D face models. On the other hand, the synthesis can be based on facial anatomy, i.e. both interior and exterior structures of the face can be brought to bear, or it can be based purely on the exterior shape. The proposed method only uses exterior shape. By now, several papers have appeared for each of these strands. A complete discussion is not possible, so the sequel rather focuses on a number of contributions that are particularly relevant for the method presented here.

So far, one of the most effective approaches for reaching photorealism has been the use of 2D morphing between photographic images [2, 3, 4]. These techniques typically require animators to specify carefully chosen feature correspondences between frames. Bregler et al. [5] used morphing of mouth regions to lip-synch existing video to a novel sound track. This Video Rewrite approach works largely automatically and directly from speech. The principle is the re-ordering of existing video frames. It is of particular interest here as the focus is on detailed lip motions, including co-articulation effects between phonemes. Still, a problem with such 2D image morphing or re-ordering techniques is that they do not allow much freedom in the choice of face orientation or in compositing the image with other 3D objects, two requirements of many animation applications.

    In order to achieve such freedom, 3D techniques seem the most direct route. Chen et al. [6] applied

    3D morphing between cylindrical laser scans of human heads. The animator must manually indicate a

  • number of correspondences on every scan. Brand [7] generates full facial animations from expressive

    information in an audio track, but the results are not photo-realistic yet. Very realistic expressions have

    been achieved by Pighin et al. [8]. They present face animation for emotional expressions, based on

    linear morphs between 3D models acquired for the different expressions. The 3D models are created

    by matching a generic model to 3D points measured on an individual’s face using photogrammetric

    techniques and interactively indicated correspondences. Though this approach is very convincing for

    expressions, it would be harder to implement for speech, where higher levels of geometric detail are

    required, certainly on the lips. Hai Tao et al. [9] applied a 3D facial motion tracking based on a piece-

    wise beziér volume deformation model and manually defined action units to track and synthesize visual

    speech subsequently. Also this approach is less convincing around the mouth, probably because only a

    few specific feature points are tracked and used for all the deformations. Per contra L. Reveret et. al. [10]

    have applied a sophisticated 3D lip model, which is represented as a parametric surface guided by 30

    control points. Unfortunately the motion around the lips, which is also very important for increased

    realism, was tracked by only 30 markers on one side of the face and finally mirrored. Knowing that most

    of the people talks spacially unsymetric, the chosen approach results in a very symmetric and not very

    detailed animation.

Here, we present a face animation approach that is based on the detailed analysis of 3D face shapes during speech. To that end, 3D reconstructions of faces have been generated at temporal sampling rates of 25 reconstructions per second. A PCA analysis of the displacements of a selection of control points yields a compact 3D description of visemes, the visual counterparts of phonemes. With 38 points on the lips themselves and a total of 124 on the larger part of the face that is influenced by speech, this analysis is quite detailed. By directly learning the facial deformations from real speech, their parameterisation in terms of principal components is a natural and perceptually relevant one. This seems less the case for anatomically based models [11, 12]. Concatenation of visemes yields realistic animations. In addition, the results yield a robust face tracker for performance capture that works without special markers.

The structure of the paper is as follows. The first Section describes how the 3D face shapes observed during speech are acquired and how these data are used to analyse the space of corresponding face deformations. The second Section uses these results in the context of performance capture, and the third Section discusses the use for speech-based animation of a face for which 3D lip dynamics have been learned, as well as for faces to which the learned dynamics were copied. A last Section concludes the paper.

The Space of Face Shapes

Our performance capture and speech-based animation modules both make use of a compact parameterisation of real face deformations during speech. This section describes the extraction and analysis of the real, 3D input data.

Face Shape Acquisition

When acquiring 3D face data for speech, a first issue is the actual part of the face to be measured. The results of Munhall and Vatikiotis-Bateson [13] provide evidence that lip and jaw motions affect the entire facial structure below the eyes. Therefore, we extract 3D data for the area between the eyes and the chin, to which we fit a topological model or 'mask', as shown in fig. 1.

This mask consists of 124 vertices: the 34 standard MPEG-4 vertices and 90 additional vertices for increased realism. Of these vertices, 38 are on the lips and 86 are spread over the remaining part of the mask. The remainder of this section explores the shapes that this mask takes on when it is fitted to the face of a speaking person. The shape of a talking face was extracted at a temporal sampling rate of 25 3D snapshots per second (video). We have used Eyetronics' ShapeSnatcher system for this purpose [14]. It projects a grid onto the face, and extracts the 3D shape and texture from a single image. By using a video camera, a quick succession of 3D snapshots can be gathered. The ShapeSnatcher yields several thousand points for every snapshot, as a connected, triangulated and textured surface. The problem is that these 3D points correspond to projected grid intersections, not to corresponding, physical points of the face. We have simplified the problem by putting markers on the face for each of the 124 mask vertices, as shown in fig. 2.

The 3D coordinates of these 124 markers (actually of the centroids of the marker dots) were measured for each 3D snapshot, through linear interpolation of the neighbouring grid intersection coordinates. This yielded 25 subsequent mask shapes for every second. One such mask fit is also shown in fig. 2. The markers were extracted automatically, except for the first snapshot, where the mask vertices were fitted manually to the markers. Thereafter, the fit of the previous frame was used as an initialisation for the next, and it was usually sufficient to move the mask vertices to the nearest markers. In cases where there were two nearby candidate markers, the situation could almost without exception be disambiguated by first aligning the vertices with only one such candidate.
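As an illustration of this frame-to-frame marker tracking, the following Python sketch (not the authors' implementation) snaps each mask vertex of the previous frame to the nearest marker centroid detected in the current snapshot; the array shapes are assumptions.

```python
# Illustrative sketch: propagate the 124-vertex mask fit by snapping each vertex of
# the previous frame to the nearest marker centroid found in the current snapshot.
import numpy as np
from scipy.spatial import cKDTree

def snap_to_markers(prev_vertices: np.ndarray, markers: np.ndarray) -> np.ndarray:
    """prev_vertices: (124, 3) mask fit of the previous frame.
    markers: (M, 3) marker centroids measured in the current 3D snapshot."""
    tree = cKDTree(markers)
    _, idx = tree.query(prev_vertices, k=1)   # nearest marker for every vertex
    return markers[idx]                       # (124, 3) mask fit for this frame
```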

Before the data were extracted, it had to be decided what the test person would say during the acquisition. It was important that all relevant visemes would be observed at least once, i.e. all visually distinct mouth shape patterns that occur during speech. Moreover, these different shapes should be observed in as short a time as possible, in order to keep processing time low. The subject was asked to pronounce a series of words, one directly after the other as in fluent speech, where each word was targeting one viseme. These words are given in the table of fig. 5. This table will be discussed in more detail later.

Face Shape Analysis

The 3D measurements yield different shapes of the mask during speech. A Principal Component Analysis (PCA) was applied to these shapes in order to extract the natural modes. The recorded data points represent 372 degrees of freedom (124 vertices with three displacements each). Because only 145 3D snapshots were used for training, at most 144 components could be found. This poses no problem, as 98% of the total variance was found to be represented by the first 10 components or 'eigenmasks', i.e. the eigenvectors with the 10 highest eigenvalues of the covariance matrix of the displacements. This leads to a compact, low-dimensional representation in terms of eigenmasks. It has to be added that so far we have experimented with the face of a single person. Work on automatically animating faces of people for whom no dynamic 3D face data are available is planned for the near future. Next, we describe the extraction of the eigenmasks in more detail.

The extraction of the eigenmasks follows traditional PCA, applied to the displacements of the 124 selected points on the face. This analysis cannot be performed on the raw data, however. First, the mask position is normalised with respect to the rigid rotation and translation of the head. This normalisation is carried out by aligning the points that are not affected by speech, such as the points on the upper side of the nose and the corners of the eyes. After this normalisation, the 3D positions of the mask vertices are collected into a single vector $m_k$ for every frame $k = 1 \ldots N$, with $N = 145$ in this case:

$$ m_k = (x_{k,1}, y_{k,1}, z_{k,1}, \ldots, x_{k,124}, y_{k,124}, z_{k,124})^T \qquad (1) $$

where $T$ stands for the transpose. Then, the average mask $\bar{m}$,

$$ \bar{m} = \frac{1}{N} \sum_{k=1}^{N} m_k \, , \qquad N = 145, \qquad (2) $$

is subtracted to obtain displacements with respect to the average, denoted as $\Delta m_k = m_k - \bar{m}$. The covariance matrix $\Sigma$ of the displacements is obtained as

$$ \Sigma = \frac{1}{N-1} \sum_{k=1}^{N} \Delta m_k \, \Delta m_k^T \, , \qquad N = 145. \qquad (3) $$

Upon decomposing this matrix as the product of a rotation, a scaling and the inverse rotation,

$$ \Sigma = R \Lambda R^T, \qquad (4) $$

one obtains the PCA decomposition, with $\Lambda$ the diagonal scaling matrix with the eigenvalues $\lambda$ sorted from the largest to the smallest magnitude, and the columns of the rotation matrix $R$ the corresponding eigenvectors. The eigenvectors with the highest eigenvalues characterize the most important modes of face deformation. Mask shapes can be approximated as a linear combination of the 144 modes:

$$ m_j = \bar{m} + R w_j. \qquad (5) $$

The weight vector $w_j$ describes the deviation of the mask shape $m_j$ from the average mask $\bar{m}$ in terms of the eigenvectors, coined eigenmasks for this application. By varying $w_j$ within reasonable bounds, realistic mask shapes are generated. As already mentioned at the beginning of this section, it was found that most of the variance (98%) is represented by the first 10 modes, hence further use of the eigenmasks is limited to linear combinations of the first 10. They are shown in fig. 3.
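The eigenmask computation of eqs. (1)-(5) can be summarised in a few lines. The following Python sketch assumes the N = 145 rigidly normalised mask fits have been stacked into an (N, 372) array; it is a minimal illustration, not the authors' code.

```python
# Minimal eigenmask computation, eqs. (1)-(5): PCA on the stacked mask vectors.
import numpy as np

def eigenmasks(masks: np.ndarray, n_modes: int = 10):
    """masks: (N, 372) array of mask vectors m_k (124 vertices x 3 coordinates)."""
    m_bar = masks.mean(axis=0)                      # average mask, eq. (2)
    dm = masks - m_bar                              # displacements Delta m_k
    cov = dm.T @ dm / (len(masks) - 1)              # covariance matrix, eq. (3)
    eigval, eigvec = np.linalg.eigh(cov)            # decomposition, eq. (4)
    order = np.argsort(eigval)[::-1][:n_modes]      # keep the dominant modes
    return m_bar, eigvec[:, order], eigval[order]

def reconstruct(m_bar: np.ndarray, R: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Eq. (5): mask shape corresponding to a weight vector w in eigenmask space."""
    return m_bar + R @ w
```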

Performance Capture

A face tracker has been developed that can serve as a performance capture system for speech. It fits the face mask to subsequent 3D snapshots, but now without markers. Again, 3D snapshots taken with the ShapeSnatcher at 1/25 second intervals are the input. The face tracker decomposes the 3D motions into rigid motions and motions due to the visemes.

The tracker first adjusts the rigid head motion and then adapts the weight vector $w_j$ to fit the remaining motions, mainly those of the lips. A schematic overview is given in fig. 4(a). Such performance capture can e.g. be used to drive a face model at a remote location, by only transmitting a few face animation parameters: 6 parameters for the rigid motion and the 10 components of the weight vector.

For the very first frame, the system has no clue where the face is and where to try fitting the mask. In this special case, it starts by detecting the nose tip. It is found as a point with particularly high curvature in both the horizontal and vertical direction:

$$ n = \{ (x, y) \mid \min(\max(0, k_x), \max(0, k_y)) \text{ is maximal} \}, \qquad (6) $$

where $k_x$ and $k_y$ are the two curvatures, which are in fact averaged over a small region around the points in order to reduce the influence of noise. The curvatures are extracted from the 3D face data obtained with the ShapeSnatcher. After the nose tip vertex of the mask has been aligned with the nose tip detected on the face, and with the mask oriented upright, the rigid transformation can be fixed by aligning the upper part of the mask with the corresponding part of the face. After the first frame, the previous position of the mask is normally close enough to directly home in on the new position with the rigid motion adjustment routine alone.
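A minimal sketch of the nose-tip criterion of eq. (6) is given below; it assumes that smoothed horizontal and vertical curvature values $k_x$ and $k_y$ are already available for every reconstructed point, which the paper obtains from the ShapeSnatcher data.

```python
# Nose-tip detection, eq. (6): pick the point where min(max(0, kx), max(0, ky)) peaks.
import numpy as np

def nose_tip(kx: np.ndarray, ky: np.ndarray):
    """kx, ky: per-point (locally averaged) horizontal and vertical curvatures."""
    score = np.minimum(np.maximum(kx, 0.0), np.maximum(ky, 0.0))
    return np.unravel_index(np.argmax(score), score.shape)
```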

The rigid motion adjustment routine focuses on the upper part of the mask, as this part hardly deforms during speech. The alignment is achieved by minimizing the distances between the vertices of this part of the mask and the face surface. In order not to spend too much time on extracting the true distances, the cost $E_o$ of a match is simplified. Instead, the distances are summed between the mask vertices $x$ and the points $p$ where lines through these vertices and parallel to the viewing direction of the 3D acquisition system hit the 3D face surface:

$$ E_o = \sum_{i \in \{\text{upper part}\}} d_i \, , \qquad d_i = \| p_i - x_i(w) \|. \qquad (7) $$

Note that the sum is only over the vertices in the upper part of the mask. The optimization is performed with the downhill simplex method [15], with 3 rotation angles and 3 translation components as parameters. Fig. 4 gives an example where the mask starts from an initial position (b) and is iteratively rotated and translated to end up in the rigidly adjusted position (c).
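The rigid registration step can be sketched as follows. The cost of eq. (7) is minimised over three rotation angles and a translation with SciPy's Nelder-Mead (downhill simplex) optimiser; `distance_to_surface` is a hypothetical helper that returns, for each vertex, the distance to the scanned surface measured along the viewing direction of the acquisition system.

```python
# Hedged sketch of the rigid-motion adjustment: minimise E_o of eq. (7) with the
# downhill simplex method over 3 rotation angles and 3 translation components.
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def rigid_cost(params, upper_vertices, distance_to_surface):
    angles, t = params[:3], params[3:]
    moved = Rotation.from_euler("xyz", angles).apply(upper_vertices) + t
    return float(np.sum(distance_to_surface(moved)))    # E_o, upper-mask vertices only

def align_rigid(upper_vertices, distance_to_surface, x0=np.zeros(6)):
    res = minimize(rigid_cost, x0, args=(upper_vertices, distance_to_surface),
                   method="Nelder-Mead")                 # downhill simplex [15]
    return res.x                                         # 3 angles + 3 translations
```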

Once the rigid motion has been canceled out, a fine-registration step deforms the mask in order to precisely fit the instantaneous 3D facial data due to speech. To that end, the components of the weight vector $w$ are optimised. Just as is the case with face spaces [16], PCA also here brings the advantage that the dimensionality of the search space is kept low. Again, a downhill simplex procedure is used to minimize a cost function for subsequent frames $j$. This cost function is of the same form as eq. (7), with the difference that now the distances for all mask vertices are taken into account (i.e. also for the non-rigidly moving parts). Each time starting from the previous weight vector $w_{j-1}$ (for the first frame starting with the average mask shape, i.e. $w_{j-1} = 0$), an updated vector $w_j$ is calculated for the frame at hand. These weight vectors have dimension 10, as only the eigenmasks with the 10 largest eigenvalues are considered (see the Face Shape Analysis section). Fig. 4(d) shows the fine registration for this example.
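The non-rigid refinement can be sketched in the same way: the 10-dimensional weight vector is optimised so that the deformed mask of eq. (5) fits the current snapshot, warm-started from the previous frame. Again, `distance_to_surface` is a hypothetical helper and the sketch is only an illustration.

```python
# Fine registration: optimise the 10 eigenmask weights with the downhill simplex
# method, starting from the previous frame's weight vector.
import numpy as np
from scipy.optimize import minimize

def fit_weights(m_bar, R, w_prev, distance_to_surface):
    """m_bar: (372,) average mask; R: (372, 10) eigenmasks; w_prev: (10,) start value."""
    def cost(w):
        mask = (m_bar + R @ w).reshape(-1, 3)    # eq. (5), now using all 124 vertices
        return float(np.sum(distance_to_surface(mask)))
    return minimize(cost, w_prev, method="Nelder-Mead").x
```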

The sequence of weight vectors – i.e. mask shapes – extracted in this way can be used as a performance capture result, to animate the face and reproduce the original motion. This reproduced motion still contains some jitter, due to sudden changes in the values of the weight vector's components. Therefore, these components are smoothed with B-splines (of degree 3). These smoothed mask deformations are used to drive a detailed 3D face model, which has many more vertices than the mask. For the animation of the face vertices between the mask vertices, a lattice deformation was used (Maya, deformer type wrap).
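The temporal smoothing of the tracked weight vectors might look as follows; the smoothing factor is an assumption, not a value from the paper.

```python
# Smooth each of the 10 weight components over time with a cubic B-spline to remove
# frame-to-frame jitter, then resample at the original frame positions.
import numpy as np
from scipy.interpolate import splev, splrep

def smooth_weights(weights: np.ndarray, s: float = 0.1) -> np.ndarray:
    """weights: (n_frames, 10) tracked weight vectors; returns the smoothed version."""
    t = np.arange(len(weights))
    return np.column_stack([splev(t, splrep(t, weights[:, i], k=3, s=s))
                            for i in range(weights.shape[1])])
```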

Fig. 8 shows some results. The first row (A) shows different frames of the input video sequence. The person says "Hello, my name is Jlona". The second row (B) shows the 3D ShapeSnatcher output, i.e. the input for the performance capture. The third row (C) shows the extracted mask shapes for the same time instances. The fourth row (D) shows the reproduced expressions of the detailed face model as driven by the tracker.

Animation

The use of performance capture is limited, as it only allows a verbatim replay of what has been observed. This limitation can be lifted if one can animate faces based on speech input, either as an audio track or as text. Our system deals with both types of input.

Animation of speech has much in common with speech synthesis. Rather than composing a sequence of phonemes according to the laws of co-articulation to get the transitions between the phonemes right, the animation generates sequences of visemes. Visemes correspond to the basic, visual mouth expressions that are observed in speech. Whereas there is a reasonably strong consensus about the set of phonemes, there is less unanimity about the selection of visemes. Approaches aimed at realistic animation of speech have used anywhere from as few as 16 [2] up to about 50 visemes [17]. This number is by no means the only parameter in assessing the level of sophistication of different schemes. Much also depends on the addition of co-articulation effects. There certainly is no simple one-to-one relation between the 52 phonemes and the visemes, as different sounds may look the same, and therefore this mapping is rather many-to-one. For instance, \b\ and \p\ are two bilabial stops which differ only in the fact that the former is voiced while the latter is voiceless. Visually, there is hardly any difference in fluent speech.

We based our selection of visemes on the work of Owens [18] for the consonants. We use his consonant groups, except for two of them, which we combine into a single \k,g,n,l,ng,h,y\ viseme. The groups are considered as single visemes because they yield the same visual impression when uttered. We do not consider all the possible instances of different, neighbouring vowels that Owens distinguishes, however. In fact, we only consider two cases for each cluster, rounded and widened, which represent the instances farthest from the neutral expression. For instance, the viseme associated with \m\ differs depending on whether the speaker is uttering the sequence omo or umu vs. the sequence eme or imi. In the former case, the \m\ viseme assumes a rounded shape, while in the latter it assumes a more widened shape. Therefore, each consonant was assigned to these two types of visemes. For the visemes that correspond to vowels, we used those proposed by Montgomery and Jackson [19].

As shown in fig. 5, the selection contains a total of 20 visemes: 12 representing the consonants (boxes with red 'consonant' title), 7 representing the monophthongs (boxes with title 'monophthong') and one representing the neutral pose (box with title 'silence'); diphthongs (box with title 'diphthong') are divided into two separate monophthongs and their mutual influence is taken care of as a co-articulation effect. The boxes with the smaller title 'allophones' can be disregarded by the reader for the moment. The table also contains example words producing the visemes when they are pronounced.

This viseme selection differs from others proposed earlier. It contains more consonant visemes than most, mainly because the distinction between the rounded and widened shapes is made systematically. For the sake of comparison, Ezzat and Poggio [2] used 6 (only one for each of Owens' consonant groups, while also combining two of them), Bregler et al. [5] used 10 (the same clusters, but they subdivided the cluster \t,d,s,z,th,dh\ into \th,dh\ and the rest, and \k,g,n,l,ng,h,y\ into \ng\, \h\, \y\, and the rest, making an even more precise subdivision for this cluster), and Massaro [20] used 9 (but this animation was restricted to cartoon-like figures, which do not show the same complexity as real faces). We feel that our selection is a good compromise between the number of visemes needed in the animation and the realism that is obtained.

Animation can then be considered as navigating through a graph where each node represents one of $N_V$ visemes, and the interconnections between nodes represent the $N_V^2$ viseme transformations (co-articulation). From an animator's perspective, the visemes represent key masks, and the transformations represent a method of interpolating between them. As a preparation for the animation, the visemes were mapped into the 10-dimensional eigenmask space. This yields one weight vector $w_{vis}$ for every viseme. The advantage of performing the animation as transitions between these points in the eigenmask space is that the interpolated shapes all look realistic. As was the case for tracking, point-to-point navigation in the eigenmask space as a way of concatenating visemes yields jerky motions. Moreover, when generating the temporal samples, these may not precisely coincide with the pace at which the visemes change. Both problems are solved through B-spline fitting to the different components of the weight vectors $w_{vis}(t)$, with $t$ time, as illustrated in fig. 6, which yields trajectories that are smooth and that can be sampled as desired.
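A sketch of this concatenation step is given below: the weight vector of every scheduled viseme is placed at its time stamp, a cubic B-spline is fitted per component, and the trajectory is sampled at the video frame rate. The dictionary `w_vis`, mapping viseme labels to their 10-dimensional weight vectors, is a hypothetical name introduced for the illustration.

```python
# Viseme concatenation in eigenmask space: spline through the key weight vectors
# and sample the smooth trajectory at the animation frame rate.
import numpy as np
from scipy.interpolate import splev, splrep

def viseme_track(visemes, times, w_vis, fps=25.0):
    """visemes: ordered viseme labels; times: their time stamps (seconds), increasing."""
    keys = np.array([w_vis[v] for v in visemes])         # (n_keys, 10) weight vectors
    t_out = np.arange(times[0], times[-1], 1.0 / fps)    # animation sampling instants
    return np.column_stack([splev(t_out, splrep(times, keys[:, i], k=3))
                            for i in range(keys.shape[1])])
```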

As input for the animation experiments we have used both text and audio. The visemes which have to be visited, the order in which this should happen, and the time intervals in between can be calculated from a pure audio track containing speech. First, a file is generated that contains the ordered list of allophones and their timing. 'Allophones' correspond to a finer subdivision of phonemes. This transcription has not been our work; we have used an existing tool, described in [21]. The allophones are then translated into visemes. The vowels and 'silence' are directly mapped to the viseme in the box immediately to their left in fig. 5. For the consonants, the context plays a role. If they immediately follow a vowel among \o\, \u\, and \@@\ (the latter is the vowel as in 'bird'), then the allophone is mapped onto a rounded consonant (the corresponding box in the left column of fig. 5). If the vowel is among \i\, \a\, and \e\, then the allophone is mapped onto a widened consonant (the corresponding box in the right column of fig. 5). When the consonant is not preceded immediately by a vowel, but the subsequent allophone is one, then a similar decision is made. If the consonant is flanked by two other consonants, the preceding vowel decides.
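The context rules for the consonants can be made concrete with the following sketch. The allophone symbols, the vowel sets and the fallback to 'widened' when no vowel is found at all are illustrative assumptions rather than the authors' exact inventory.

```python
# Context-dependent choice between the rounded and widened variant of a consonant
# viseme, following the rules described in the text.
ROUNDED_VOWELS = {"o", "u", "@@"}
WIDENED_VOWELS = {"i", "a", "e"}
VOWELS = ROUNDED_VOWELS | WIDENED_VOWELS

def shape_class(vowel: str) -> str:
    return "rounded" if vowel in ROUNDED_VOWELS else "widened"

def consonant_shape(allophones: list, i: int) -> str:
    """Decide rounded vs. widened for the consonant at position i."""
    if i > 0 and allophones[i - 1] in VOWELS:                    # directly after a vowel
        return shape_class(allophones[i - 1])
    if i + 1 < len(allophones) and allophones[i + 1] in VOWELS:  # a vowel follows
        return shape_class(allophones[i + 1])
    preceding = [a for a in allophones[:i] if a in VOWELS]       # flanked by consonants
    return shape_class(preceding[-1]) if preceding else "widened"  # fallback: assumption
```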

Once the sequence of visemes and their timing are available, the mask deformations are determined. The mask then drives the detailed face model. Fig. 8 (E) shows a few snapshots of the animated head model, for the same sentence as used in the performance capture example. Row (F) shows a detail of the lips from another viewing angle.

It is of course interesting at this point to test what the result would be of verbatim copying of the visemes onto another face. If successful, this would mean that no new lip dynamics have to be captured for that face, and much time and effort could be saved. Such results are shown in fig. 7. Although these static images seem reasonable, the corresponding sequences are not really satisfactory.

Conclusions

Realistic face animation is still a hard nut to crack. We have tried to attack this problem via the acquisition and analysis of exterior, 3D face measurements. With 38 points on the lips alone and a further 86 around the mouth region to cover all parts of the face that are influenced by speech, this analysis seems more detailed than earlier ones. Based on a proposed selection of visemes, speech animation is approached as the concatenation of 3D mask deformations, expressed in a compact space of 'eigenmasks'. This approach was also demonstrated for performance capture.

This work still has to be extended in a number of ways. First, the current animation suite only supports animation of the face of the person for whom the 3D snapshots were acquired. Although we have tried to transplant visemes onto other people's faces, it became clear that a really realistic animation requires visemes that are adapted to the shape or 'physiognomy' of the face at hand. Hence one cannot simply copy the deformations that have been extracted from one face to a novel face. It is not precisely known at this point how the viseme deformations depend on the physiognomy, but ongoing experiments have already shown that adaptations are possible without a complete relearning of the face dynamics.

Secondly, there are still unnatural effects with some co-articulations between subsequent consonants. Although Massaro [22] has suggested to use a finite inventory of visemes rather than an approach with a huge number of disemes, these effects have to be removed through the refinement of the spline trajectories in the eigenmask space and a more sophisticated dominance model in general. Other necessary improvements are the rounding of the lips into the mouth cavity, which is not yet present because these parts of the lips are not observed in the 3D data (the reason why the mouth does not close completely yet), and the addition of wrinkles on the lips and elsewhere, which can be addressed by also using the dynamically observed texture data (e.g. when the lips are rounded).

Acknowledgments

This research work has been supported by the ETH Research Council and by the European Commission through the IST project MESH (www.meshproject.com, 2002), with the assistance of: Univ. Freiburg, DURAN, EPFL, Eyetronics, and Univ. Geneva.

References

[1] E. Ostby. Personal communication. Pixar Animation Studios, 1997.

[2] T. Ezzat and T. Poggio. Visual speech synthesis by morphing visemes. International Journal of Computer Vision, 38:45–57, 2000.

[3] T. Beier and S. Neely. Feature-based image metamorphosis. In SIGGRAPH '92 Conference Proceedings, volume 26, pages 35–42, 1992.

[4] C. Bregler and S. Omohundro. Nonlinear image interpolation using manifold learning. In NIPS, volume 7, 1995.

[5] C. Bregler, M. Covell, and M. Slaney. Video Rewrite: driving visual speech with audio. In Proc. SIGGRAPH, pages 353–360, 1997.

[6] D. Chen and A. State. Interactive shape metamorphosis. In Symposium on Interactive 3D Graphics, pages 43–44, 1995.

[7] M. Brand. Voice puppetry. In Proc. SIGGRAPH, 1999.

[8] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D.H. Salesin. Synthesizing realistic facial expressions from photographs. In Proc. SIGGRAPH, pages 75–84, 1998.

[9] H. Tao and T.S. Huang. Explanation-based facial motion tracking using a piecewise Bézier volume deformation model. In Proc. CVPR, 1999.

[10] L. Reveret, G. Bailly, and P. Badin. MOTHER: a new generation of talking heads providing a flexible articulatory control for video-realistic speech animation. In Proc. ICSLP 2000, 2000.

[11] S. King, R. Parent, and L. Olsafsky. An anatomically-based 3D parametric lip model to support facial animation and synchronized speech. In Proc. Deform Workshop, pages 1–19, 2000.

[12] K. Waters and J. Frisbie. A coordinated muscle model for speech animation. In Graphics Interface, pages 163–170, 1995.

[13] K.G. Munhall and E. Vatikiotis-Bateson. The moving face during speech communication. In R. Campbell, B. Dodd, and D. Burnham, editors, Hearing by Eye, volume 2, chapter 6, pages 123–139. Psychology Press, East Sussex, UK, 1998.

[14] Eyetronics. http://www.eyetronics.com.

[15] J.A. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, 7:308–313, 1965.

[16] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proc. SIGGRAPH, pages 187–194, 1999.

[17] K.C. Scott, D.S. Kagels, S.H. Watson, H. Rom, J.R. Wright, M. Lee, and K.J. Hussey. Synthesis of speaker facial movement to match selected speech sequences. In Proc. Fifth Australian Conference on Speech Science and Technology, volume 2, pages 620–625, 1994.

[18] E. Owens and B. Blazek. Visemes observed by hearing-impaired and normal-hearing adult viewers. Journal of Speech and Hearing Research, 28:381–393, 1985.

[19] A. Montgomery and P. Jackson. Physical characteristics of the lips underlying vowel lipreading performance. Journal of the Acoustical Society of America, 73:2134–2144, 1983.

[20] D.W. Massaro. Perceiving Talking Faces. MIT Press, 1998.

[21] C. Traber. SVOX: The Implementation of a Text-to-Speech System. PhD thesis, Computer Engineering and Networks Laboratory, ETH Zürich, No. 11064, 1995.

[22] D.W. Massaro. Perceiving Talking Faces. MIT Press, 1998.

Gregor A. Kalberer is a PhD student at the Computer Vision Group BIWI, D-ITET, ETH Zurich, Switzerland. He received the MSc degree in electrical engineering from ETH Zurich in 1999. His research interests include computer vision, graphics, animation and virtual reality. He is a member of the IEEE.

Luc Van Gool is professor for Computer Vision at the University of Leuven in Belgium and at ETH Zurich in Switzerland. He is a member of the editorial boards of several computer vision journals and of the program committees of international conferences on the same subject. His research includes object recognition, tracking, texture, 3D reconstruction, and the confluence of vision and graphics. Vision and graphics for archaeology is among his favourite applications.

Figure 1. Left: example of 3D input for one snapshot; right: the mask used for tracking the facial motions during speech.

Figure 2. Left: markers put on the face, one for each of the 124 mask vertices; right: 3D mask fitted by matching the mask vertices with the face markers.

Figure 3. Average mask (0) and the 10 dominant 'eigenmasks' for visual speech, in order of descending importance from (1) to (10).

Figure 4. (a) Schematic overview of the performance capture tracker: nose tip finding, rigid motion cancelation and fine registration, the latter two using the simplex downhill method with avoidance of local minima; (b) the facial mask in its starting position; (c) mask position after rigid motion cancelation; (d) mask after additional deformation to fit the lips.

Figure 5. Viseme table: the 20 visemes (rounded and widened consonant visemes, the monophthongs and silence), the allophones grouped under each, and example words producing them; diphthongs are divided into two monophthongs.

Figure 6. Spline interpolation between visemes (red dots) and samples taken for animation (blue quads).

Figure 7. Transplantation of speech: visemes copied verbatim onto another face.

Figure 8. Results for the sentence "Hello, my name is Jlona", at frames 009, 013, 030, 032 and 042. Row (A): frames of the input video sequence; (B): the 3D ShapeSnatcher output; (C): the extracted mask shapes; (D): the reproduced expressions of the detailed face model driven by the tracker; (E): the animated head model driven by the speech input; (F): a detail of the lips from another viewing angle.

