
Realistic Face Animation for Speech

Gregor A. Kalberer
Computer Vision Group, ETH Zürich, Switzerland
[email protected]

Luc Van Gool
Computer Vision Group, ETH Zürich, Switzerland
ESAT / VISICS, Kath. Univ. Leuven, Belgium
[email protected]

Keywords: face animation, speech, visemes, eigen space, realism

Abstract

Realistic face animation is especially hard as we are all experts in the perception and interpretation of face dynamics. One approach is to simulate facial anatomy. Alternatively, animation can be based on first observing the visible 3D dynamics, extracting the basic modes, and putting these together according to the required performance. This is the strategy followed by the paper, which focuses on speech. The approach follows a kind of bootstrap procedure. First, 3D shape statistics are learned from a talking face with a relatively small number of markers. A 3D reconstruction is produced at temporal intervals of 1/25 seconds. A topological mask of the lower half of the face is fitted to the motion. Principal component analysis (PCA) of the mask shapes reduces the dimension of the mask shape space. The result is two-fold. On the one hand, the face can be animated; in our case it can be made to speak new sentences. On the other hand, face dynamics can be tracked in 3D without markers for performance capture.

Introduction

Realistic face animation is a hard problem. Humans will typically focus on faces and are incredibly good at spotting the slightest glitch in an animation. On the other hand, there is probably no shape more important for animation than the human face. Several applications come immediately to mind, such as games, special effects for movies, avatars, and virtual assistants for information kiosks. This paper focuses on the realistic animation of the mouth area for speech.

Face animation research dates back to the early 1970s. Since then, the level of sophistication has increased dramatically. For example, the human face models used in Pixar's Toy Story had several thousand control points each [1]. Methods can be distinguished by mainly two criteria. On the one hand, there are image-based and 3D model-based methods. The method proposed here uses 3D face models. On the other hand, the synthesis can be based on facial anatomy, i.e. both interior and exterior structures of the face can be brought to bear, or it can be based purely on the exterior shape. The proposed method only uses exterior shape. By now, several papers have appeared for each of these strands. A complete discussion is not possible, so the sequel rather focuses on a number of contributions that are particularly relevant for the method presented here.

So far, one of the most effective approaches for reaching photorealism has been the use of 2D morphing between photographic images [2, 3, 4]. These techniques typically require animators to specify carefully chosen feature correspondences between frames. Bregler et al. [5] used morphing of mouth regions to lip-synch existing video to a novel sound track. This Video Rewrite approach works largely automatically and directly from speech. The principle is the re-ordering of existing video frames. It is of particular interest here as the focus is on detailed lip motions, including co-articulation effects between phonemes. Still, a problem with such 2D image morphing or re-ordering techniques is that they do not allow much freedom in the choice of face orientation or in compositing the image with other 3D objects, two requirements of many animation applications.

    In order to achieve such freedom, 3D techniques seem the most direct route. Chen et al. [6] applied

    3D morphing between cylindrical laser scans of human heads. The animator must manually indicate a

  • number of correspondences on every scan. Brand [7] generates full facial animations from expressive

    information in an audio track, but the results are not photo-realistic yet. Very realistic expressions have

    been achieved by Pighin et al. [8]. They present face animation for emotional expressions, based on

    linear morphs between 3D models acquired for the different expressions. The 3D models are created

    by matching a generic model to 3D points measured on an individual’s face using photogrammetric

    techniques and interactively indicated correspondences. Though this approach is very convincing for

    expressions, it would be harder to implement for speech, where higher levels of geometric detail are

    required, certainly on the lips. Hai Tao et al. [9] applied a 3D facial motion tracking based on a piece-

    wise beziér volume deformation model and manually defined action units to track and synthesize visual

    speech subsequently. Also this approach is less convincing around the mouth, probably because only a

    few specific feature points are tracked and used for all the deformations. Per contra L. Reveret et. al. [10]

    have applied a sophisticated 3D lip model, which is represented as a parametric surface guided by 30

    control points. Unfortunately the motion around the lips, which is also very important for increased

    realism, was tracked by only 30 markers on one side of the face and finally mirrored. Knowing that most

    of the people talks spacially unsymetric, the chosen approach results in a very symmetric and not very

    detailed animation.

Here, we present a face animation approach that is based on the detailed analysis of 3D face shapes during speech. To that end, 3D reconstructions of faces have been generated at temporal sampling rates of 25 reconstructions per second. A PCA analysis of the displacements of a selection of control points yields a compact 3D description of visemes, the visual counterparts of phonemes. With 38 points on the lips themselves and a total of 124 on the larger part of the face that is influenced by speech, this analysis is quite detailed. By directly learning the facial deformations from real speech, their parameterisation in terms of principal components is a natural and perceptually relevant one. This seems less the case for anatomically based models [11, 12]. Concatenation of visemes yields realistic animations. In addition, the results yield a robust face tracker for performance capture that works without special markers.

The structure of the paper is as follows. The first Section describes how the 3D face shapes observed during speech are acquired and how these data are used to analyse the space of corresponding face deformations. The second Section uses these results in the context of performance capture, and the third Section discusses the use for speech-based animation of a face for which 3D lip dynamics have been learned, as well as for faces to which the learned dynamics were copied. A last Section concludes the paper.

The Space of Face Shapes

Our performance capture and speech-based animation modules both make use of a compact parameterisation of real face deformations during speech. This section describes the extraction and analysis of the real, 3D input data.

Face Shape Acquisition

When acquiring 3D face data for speech, a first issue is the actual part of the face to be measured. The results of Munhall and Vatikiotis-Bateson [13] provide evidence that lip and jaw motions affect the entire facial structure below the eyes. Therefore, we extract 3D data for the area between the eyes and the chin, to which we fit a topological model or 'mask', as shown in fig. 1.

This mask consists of 124 vertices: the 34 standard MPEG-4 vertices and 90 additional vertices for increased realism. Of these vertices, 38 are on the lips and 86 are spread over the remaining part of the mask. The remainder of this section explores the shapes that this mask takes on when it is fitted to the face of a speaking person. The shape of a talking face was extracted at a temporal sampling rate of 25 3D snapshots per second (video). We have used Eyetronics' ShapeSnatcher system for this purpose [14]. It projects a grid onto the face, and extracts the 3D shape and texture from a single image. By using a video camera, a quick succession of 3D snapshots can be gathered. The ShapeSnatcher yields several thousand points for every snapshot, as a connected, triangulated and textured surface. The problem is that these 3D points correspond to projected grid intersections, not to corresponding, physical points of the face. We have simplified the problem by putting markers on the face for each of the 124 mask vertices, as shown in fig. 2.

The 3D coordinates of these 124 markers (actually of the centroids of the marker dots) were measured for each 3D snapshot, through linear interpolation of the neighbouring grid intersection coordinates. This yielded 25 subsequent mask shapes for every second. One such mask fit is also shown in fig. 2. The markers were extracted automatically, except for the first snapshot, where the mask vertices were fitted manually to the markers. Thereafter, the fit of the previous frame was used as an initialisation for the next, and it was usually sufficient to move the mask vertices to the nearest markers. In cases where there were two nearby candidate markers, the situation could almost without exception be disambiguated by first aligning the vertices with only one such candidate.
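As an illustration of this frame-to-frame marker tracking, the following Python sketch (not the authors' implementation) snaps each mask vertex of the previous frame to the nearest marker centroid detected in the current snapshot; the array shapes are assumptions.

```python
# Illustrative sketch: propagate the 124-vertex mask fit by snapping each vertex of
# the previous frame to the nearest marker centroid found in the current snapshot.
import numpy as np
from scipy.spatial import cKDTree

def snap_to_markers(prev_vertices: np.ndarray, markers: np.ndarray) -> np.ndarray:
    """prev_vertices: (124, 3) mask fit of the previous frame.
    markers: (M, 3) marker centroids measured in the current 3D snapshot."""
    tree = cKDTree(markers)
    _, idx = tree.query(prev_vertices, k=1)   # nearest marker for every vertex
    return markers[idx]                       # (124, 3) mask fit for this frame
```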

Before the data were extracted, it had to be decided what the test person would say during the acquisition. It was important that all relevant visemes would be observed at least once, i.e. all visually distinct mouth shape patterns that occur during speech. Moreover, these different shapes should be observed in as short a time as possible, in order to keep processing time low. The subject was asked to pronounce a series of words, one directly after the other as in fluent speech, where each word was targeting one viseme. These words are given in the table of fig. 5. This table will be discussed in more detail later.

Face Shape Analysis

The 3D measurements yield different shapes of the mask during speech. A Principal Component Analysis (PCA) was applied to these shapes in order to extract the natural modes. The recorded data points represent 372 degrees of freedom (124 vertices with three displacements each). Because only 145 3D snapshots were used for training, at most 144 components could be found. This poses no problem, as 98% of the total variance was found to be represented by the first 10 components or 'eigenmasks', i.e. the eigenvectors with the 10 highest eigenvalues of the covariance matrix of the displacements. This leads to a compact, low-dimensional representation in terms of eigenmasks. It has to be added that so far we have experimented with the face of a single person. Work on automatically animating faces of people for whom no dynamic 3D face data are available is planned for the near future. Next, we describe the extraction of the eigenmasks in more detail.

The extraction of the eigenmasks follows traditional PCA, applied to the displacements of the 124 selected points on the face. This analysis cannot be performed on the raw data, however. First, the mask position is normalised with respect to the rigid rotation and translation of the head. This normalisation is carried out by aligning the points that are not affected by speech, such as the points on the upper side of the nose and the corners of the eyes. After this normalisation, the 3D positions of the mask vertices are collected into a single vector $m_k$ for every frame $k = 1 \ldots N$, with $N = 145$ in this case:

$$ m_k = (x_{k,1}, y_{k,1}, z_{k,1}, \ldots, x_{k,124}, y_{k,124}, z_{k,124})^T \qquad (1) $$

where $T$ stands for the transpose. Then, the average mask $\bar{m}$,

$$ \bar{m} = \frac{1}{N} \sum_{k=1}^{N} m_k \, , \qquad N = 145, \qquad (2) $$

is subtracted to obtain displacements with respect to the average, denoted as $\Delta m_k = m_k - \bar{m}$. The covariance matrix $\Sigma$ of the displacements is obtained as

$$ \Sigma = \frac{1}{N-1} \sum_{k=1}^{N} \Delta m_k \, \Delta m_k^T \, , \qquad N = 145. \qquad (3) $$

Upon decomposing this matrix as the product of a rotation, a scaling and the inverse rotation,

$$ \Sigma = R \Lambda R^T, \qquad (4) $$

one obtains the PCA decomposition, with $\Lambda$ the diagonal scaling matrix with the eigenvalues $\lambda$ sorted from the largest to the smallest magnitude, and the columns of the rotation matrix $R$ the corresponding eigenvectors. The eigenvectors with the highest eigenvalues characterize the most important modes of face deformation. Mask shapes can be approximated as a linear combination of the 144 modes:

$$ m_j = \bar{m} + R w_j. \qquad (5) $$

The weight vector $w_j$ describes the deviation of the mask shape $m_j$ from the average mask $\bar{m}$ in terms of the eigenvectors, coined eigenmasks for this application. By varying $w_j$ within reasonable bounds, realistic mask shapes are generated. As already mentioned at the beginning of this section, it was found that most of the variance (98%) is represented by the first 10 modes, hence further use of the eigenmasks is limited to linear combinations of the first 10. They are shown in fig. 3.
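The eigenmask computation of eqs. (1)-(5) can be summarised in a few lines. The following Python sketch assumes the N = 145 rigidly normalised mask fits have been stacked into an (N, 372) array; it is a minimal illustration, not the authors' code.

```python
# Minimal eigenmask computation, eqs. (1)-(5): PCA on the stacked mask vectors.
import numpy as np

def eigenmasks(masks: np.ndarray, n_modes: int = 10):
    """masks: (N, 372) array of mask vectors m_k (124 vertices x 3 coordinates)."""
    m_bar = masks.mean(axis=0)                      # average mask, eq. (2)
    dm = masks - m_bar                              # displacements Delta m_k
    cov = dm.T @ dm / (len(masks) - 1)              # covariance matrix, eq. (3)
    eigval, eigvec = np.linalg.eigh(cov)            # decomposition, eq. (4)
    order = np.argsort(eigval)[::-1][:n_modes]      # keep the dominant modes
    return m_bar, eigvec[:, order], eigval[order]

def reconstruct(m_bar: np.ndarray, R: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Eq. (5): mask shape corresponding to a weight vector w in eigenmask space."""
    return m_bar + R @ w
```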

Performance Capture

A face tracker has been developed that can serve as a performance capture system for speech. It fits the face mask to subsequent 3D snapshots, but now without markers. Again, 3D snapshots taken with the ShapeSnatcher at 1/25 second intervals are the input. The face tracker decomposes the 3D motions into rigid motions and motions due to the visemes.

The tracker first adjusts the rigid head motion and then adapts the weight vector $w_j$ to fit the remaining motions, mainly those of the lips. A schematic overview is given in fig. 4(a). Such performance capture can e.g. be used to drive a face model at a remote location, by only transmitting a few face animation parameters: 6 parameters for the rigid motion and the 10 components of the weight vector.

For the very first frame, the system has no clue where the face is and where to try fitting the mask. In this special case, it starts by detecting the nose tip. It is found as a point with particularly high curvature in both the horizontal and vertical direction:

$$ n = \{ (x, y) \mid \min(\max(0, k_x), \max(0, k_y)) \text{ is maximal} \}, \qquad (6) $$

where $k_x$ and $k_y$ are the two curvatures, which are in fact averaged over a small region around the points in order to reduce the influence of noise. The curvatures are extracted from the 3D face data obtained with the ShapeSnatcher. After the nose tip vertex of the mask has been aligned with the nose tip detected on the face, and with the mask oriented upright, the rigid transformation can be fixed by aligning the upper part of the mask with the corresponding part of the face. After the first frame, the previous position of the mask is normally close enough to directly home in on the new position with the rigid motion adjustment routine alone.
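A minimal sketch of the nose-tip criterion of eq. (6) is given below; it assumes that smoothed horizontal and vertical curvature values $k_x$ and $k_y$ are already available for every reconstructed point, which the paper obtains from the ShapeSnatcher data.

```python
# Nose-tip detection, eq. (6): pick the point where min(max(0, kx), max(0, ky)) peaks.
import numpy as np

def nose_tip(kx: np.ndarray, ky: np.ndarray):
    """kx, ky: per-point (locally averaged) horizontal and vertical curvatures."""
    score = np.minimum(np.maximum(kx, 0.0), np.maximum(ky, 0.0))
    return np.unravel_index(np.argmax(score), score.shape)
```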

The rigid motion adjustment routine focuses on the upper part of the mask, as this part hardly deforms during speech. The alignment is achieved by minimizing the distances between the vertices of this part of the mask and the face surface. In order not to spend too much time on extracting the true distances, the cost $E_o$ of a match is simplified. Instead, the distances are summed between the mask vertices $x$ and the points $p$ where lines through these vertices and parallel to the viewing direction of the 3D acquisition system hit the 3D face surface:

$$ E_o = \sum_{i \in \{\text{upper part}\}} d_i \, , \qquad d_i = \| p_i - x_i(w) \|. \qquad (7) $$

Note that the sum is only over the vertices in the upper part of the mask. The optimization is performed with the downhill simplex method [15], with 3 rotation angles and 3 translation components as parameters. Fig. 4 gives an example where the mask starts from an initial position (b) and is iteratively rotated and translated to end up in the rigidly adjusted position (c).
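The rigid registration step can be sketched as follows. The cost of eq. (7) is minimised over three rotation angles and a translation with SciPy's Nelder-Mead (downhill simplex) optimiser; `distance_to_surface` is a hypothetical helper that returns, for each vertex, the distance to the scanned surface measured along the viewing direction of the acquisition system.

```python
# Hedged sketch of the rigid-motion adjustment: minimise E_o of eq. (7) with the
# downhill simplex method over 3 rotation angles and 3 translation components.
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def rigid_cost(params, upper_vertices, distance_to_surface):
    angles, t = params[:3], params[3:]
    moved = Rotation.from_euler("xyz", angles).apply(upper_vertices) + t
    return float(np.sum(distance_to_surface(moved)))    # E_o, upper-mask vertices only

def align_rigid(upper_vertices, distance_to_surface, x0=np.zeros(6)):
    res = minimize(rigid_cost, x0, args=(upper_vertices, distance_to_surface),
                   method="Nelder-Mead")                 # downhill simplex [15]
    return res.x                                         # 3 angles + 3 translations
```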

Once the rigid motion has been canceled out, a fine-registration step deforms the mask in order to precisely fit the instantaneous 3D facial data due to speech. To that end, the components of the weight vector $w$ are optimised. Just as is the case with face spaces [16], PCA also here brings the advantage that the dimensionality of the search space is kept low. Again, a downhill simplex procedure is used to minimize a cost function for subsequent frames $j$. This cost function is of the same form as eq. (7), with the difference that now the distances for all mask vertices are taken into account (i.e. also for the non-rigidly moving parts). Each time starting from the previous weight vector $w_{j-1}$ (for the first frame starting with the average mask shape, i.e. $w_{j-1} = 0$), an updated vector $w_j$ is calculated for the frame at hand. These weight vectors have dimension 10, as only the eigenmasks with the 10 largest eigenvalues are considered (see the Face Shape Analysis section). Fig. 4(d) shows the fine registration for this example.
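The non-rigid refinement can be sketched in the same way: the 10-dimensional weight vector is optimised so that the deformed mask of eq. (5) fits the current snapshot, warm-started from the previous frame. Again, `distance_to_surface` is a hypothetical helper and the sketch is only an illustration.

```python
# Fine registration: optimise the 10 eigenmask weights with the downhill simplex
# method, starting from the previous frame's weight vector.
import numpy as np
from scipy.optimize import minimize

def fit_weights(m_bar, R, w_prev, distance_to_surface):
    """m_bar: (372,) average mask; R: (372, 10) eigenmasks; w_prev: (10,) start value."""
    def cost(w):
        mask = (m_bar + R @ w).reshape(-1, 3)    # eq. (5), now using all 124 vertices
        return float(np.sum(distance_to_surface(mask)))
    return minimize(cost, w_prev, method="Nelder-Mead").x
```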

The sequence of weight vectors – i.e. mask shapes – extracted in this way can be used as a performance capture result, to animate the face and reproduce the original motion. This reproduced motion still contains some jitter, due to sudden changes in the values of the weight vector's components. Therefore, these components are smoothed with B-splines (of degree 3). These smoothed mask deformations are used to drive a detailed 3D face model, which has many more vertices than the mask. For the animation of the face vertices between the mask vertices, a lattice deformation was used (Maya, deformer type wrap).
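The temporal smoothing of the tracked weight vectors might look as follows; the smoothing factor is an assumption, not a value from the paper.

```python
# Smooth each of the 10 weight components over time with a cubic B-spline to remove
# frame-to-frame jitter, then resample at the original frame positions.
import numpy as np
from scipy.interpolate import splev, splrep

def smooth_weights(weights: np.ndarray, s: float = 0.1) -> np.ndarray:
    """weights: (n_frames, 10) tracked weight vectors; returns the smoothed version."""
    t = np.arange(len(weights))
    return np.column_stack([splev(t, splrep(t, weights[:, i], k=3, s=s))
                            for i in range(weights.shape[1])])
```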

Fig. 8 shows some results. The first row (A) shows different frames of the input video sequence. The person says "Hello, my name is Jlona". The second row (B) shows the 3D ShapeSnatcher output, i.e. the input for the performance capture. The third row (C) shows the extracted mask shapes for the same time instances. The fourth row (D) shows the reproduced expressions of the detailed face model as driven by the tracker.

Animation

The use of performance capture is limited, as it only allows a verbatim replay of what has been observed. This limitation can be lifted if one can animate faces based on speech input, either as an audio track or as text. Our system deals with both types of input.

Animation of speech has much in common with speech synthesis. Rather than composing a sequence of phonemes according to the laws of co-articulation to get the transitions between the phonemes right, the animation generates sequences of visemes. Visemes correspond to the basic, visual mouth expressions that are observed in speech. Whereas there is a reasonably strong consensus about the set of phonemes, there is less unanimity about the selection of visemes. Approaches aimed at realistic animation of speech have used anywhere from as few as 16 [2] up to about 50 visemes [17]. This number is by no means the only parameter in assessing the level of sophistication of different schemes. Much also depends on the addition of co-articulation effects. There certainly is no simple one-to-one relation between the 52 phonemes and the visemes, as different sounds may look the same, and therefore this mapping is rather many-to-one. For instance, \b\ and \p\ are two bilabial stops which differ only in the fact that the former is voiced while the latter is voiceless. Visually, there is hardly any difference in fluent speech.

We based our selection of visemes on the work of Owens [18] for the consonants. We use his consonant groups, except for two of them, which we combine into a single \k,g,n,l,ng,h,y\ viseme. The groups are considered as single visemes because they yield the same visual impression when uttered. We do not consider all the possible instances of different, neighbouring vowels that Owens distinguishes, however. In fact, we only consider two cases for each cluster, rounded and widened, which represent the instances farthest from the neutral expression. For instance, the viseme associated with \m\ differs depending on whether the speaker is uttering the sequence omo or umu vs. the sequence eme or imi. In the former case, the \m\ viseme assumes a rounded shape, while in the latter it assumes a more widened shape. Therefore, each consonant was assigned to these two types of visemes. For the visemes that correspond to vowels, we used those proposed by Montgomery and Jackson [19].

As shown in fig. 5, the selection contains a total of 20 visemes: 12 representing the consonants (boxes with red 'consonant' title), 7 representing the monophthongs (boxes with title 'monophthong') and one representing the neutral pose (box with title 'silence'); diphthongs (box with title 'diphthong') are divided into two separate monophthongs and their mutual influence is taken care of as a co-articulation effect. The boxes with the smaller title 'allophones' can be disregarded by the reader for the moment. The table also contains example words producing the visemes when they are pronounced.

This viseme selection differs from others proposed earlier. It contains more consonant visemes than most, mainly because the distinction between the rounded and widened shapes is made systematically. For the sake of comparison, Ezzat and Poggio [2] used 6 (only one for each of Owens' consonant groups, while also combining two of them), Bregler et al. [5] used 10 (the same clusters, but they subdivided the cluster \t,d,s,z,th,dh\ into \th,dh\ and the rest, and \k,g,n,l,ng,h,y\ into \ng\, \h\, \y\, and the rest, making an even more precise subdivision for this cluster), and Massaro [20] used 9 (but this animation was restricted to cartoon-like figures, which do not show the same complexity as real faces). We feel that our selection is a good compromise between the number of visemes needed in the animation and the realism that is obtained.

Animation can then be considered as navigating through a graph where each node represents one of $N_V$ visemes, and the interconnections between nodes represent the $N_V^2$ viseme transformations (co-articulation). From an animator's perspective, the visemes represent key masks, and the transformations represent a method of interpolating between them. As a preparation for the animation, the visemes were mapped into the 10-dimensional eigenmask space. This yields one weight vector $w_{vis}$ for every viseme. The advantage of performing the animation as transitions between these points in the eigenmask space is that the interpolated shapes all look realistic. As was the case for tracking, point-to-point navigation in the eigenmask space as a way of concatenating visemes yields jerky motions. Moreover, when generating the temporal samples, these may not precisely coincide with the pace at which the visemes change. Both problems are solved through B-spline fitting to the different components of the weight vectors $w_{vis}(t)$, with $t$ time, as illustrated in fig. 6, which yields trajectories that are smooth and that can be sampled as desired.
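A sketch of this concatenation step is given below: the weight vector of every scheduled viseme is placed at its time stamp, a cubic B-spline is fitted per component, and the trajectory is sampled at the video frame rate. The dictionary `w_vis`, mapping viseme labels to their 10-dimensional weight vectors, is a hypothetical name introduced for the illustration.

```python
# Viseme concatenation in eigenmask space: spline through the key weight vectors
# and sample the smooth trajectory at the animation frame rate.
import numpy as np
from scipy.interpolate import splev, splrep

def viseme_track(visemes, times, w_vis, fps=25.0):
    """visemes: ordered viseme labels; times: their time stamps (seconds), increasing."""
    keys = np.array([w_vis[v] for v in visemes])         # (n_keys, 10) weight vectors
    t_out = np.arange(times[0], times[-1], 1.0 / fps)    # animation sampling instants
    return np.column_stack([splev(t_out, splrep(times, keys[:, i], k=3))
                            for i in range(keys.shape[1])])
```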

As input for the animation experiments we have used both text and audio. The visemes which have to be visited, the order in which this should happen, and the time intervals in between can be calculated from a pure audio track containing speech. First, a file is generated that contains the ordered list of allophones and their timing. 'Allophones' correspond to a finer subdivision of phonemes. This transcription has not been our work; we have used an existing tool, described in [21]. The allophones are then translated into visemes. The vowels and 'silence' are directly mapped to the viseme in the box immediately to their left in fig. 5. For the consonants, the context plays a role. If they immediately follow a vowel among \o\, \u\, and \@@\ (the latter is the vowel as in 'bird'), then the allophone is mapped onto a rounded consonant (the corresponding box in the left column of fig. 5). If the vowel is among \i\, \a\, and \e\, then the allophone is mapped onto a widened consonant (the corresponding box in the right column of fig. 5). When the consonant is not preceded immediately by a vowel, but the subsequent allophone is one, then a similar decision is made. If the consonant is flanked by two other consonants, the preceding vowel decides.
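The context rules for the consonants can be made concrete with the following sketch. The allophone symbols, the vowel sets and the fallback to 'widened' when no vowel is found at all are illustrative assumptions rather than the authors' exact inventory.

```python
# Context-dependent choice between the rounded and widened variant of a consonant
# viseme, following the rules described in the text.
ROUNDED_VOWELS = {"o", "u", "@@"}
WIDENED_VOWELS = {"i", "a", "e"}
VOWELS = ROUNDED_VOWELS | WIDENED_VOWELS

def shape_class(vowel: str) -> str:
    return "rounded" if vowel in ROUNDED_VOWELS else "widened"

def consonant_shape(allophones: list, i: int) -> str:
    """Decide rounded vs. widened for the consonant at position i."""
    if i > 0 and allophones[i - 1] in VOWELS:                    # directly after a vowel
        return shape_class(allophones[i - 1])
    if i + 1 < len(allophones) and allophones[i + 1] in VOWELS:  # a vowel follows
        return shape_class(allophones[i + 1])
    preceding = [a for a in allophones[:i] if a in VOWELS]       # flanked by consonants
    return shape_class(preceding[-1]) if preceding else "widened"  # fallback: assumption
```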

Once the sequence of visemes and their timing are available, the mask deformations are determined. The mask then drives the detailed face model. Fig. 8 (E) shows a few snapshots of the animated head model, for the same sentence as used in the performance capture example. Row (F) shows a detail of the lips from another viewing angle.

It is of course interesting at this point to test what the result would be of verbatim copying of the visemes onto another face. If successful, this would mean that no new lip dynamics have to be captured for that face, and much time and effort could be saved. Such results are shown in fig. 7. Although these static images seem reasonable, the corresponding sequences are not really satisfactory.

Conclusions

Realistic face animation is still a hard nut to crack. We have tried to attack this problem via the acquisition and analysis of exterior, 3D face measurements. With 38 points on the lips alone and a further 86 around the mouth region to cover all parts of the face that are influenced by speech, this analysis seems more detailed than earlier ones. Based on a proposed selection of visemes, speech animation is approached as the concatenation of 3D mask deformations, expressed in a compact space of 'eigenmasks'. This approach was also demonstrated for performance capture.

This work still has to be extended in a number of ways. First, the current animation suite only supports animation of the face of the person for whom the 3D snapshots were acquired. Although we have tried to transplant visemes onto other people's faces, it became clear that a really realistic animation requires visemes that are adapted to the shape or 'physiognomy' of the face at hand. Hence one cannot simply copy the deformations that have been extracted from one face to a novel face. It is not precisely known at this point how the viseme deformations depend on the physiognomy, but ongoing experiments have already shown that adaptations are possible without a complete relearning of the face dynamics.

Secondly, there are still unnatural effects with some co-articulations between subsequent consonants. Although Massaro [22] has suggested to use a finite inventory of visemes rather than an approach with a huge number of disemes, these effects have to be removed through the refinement of the spline trajectories in the eigenmask space and a more sophisticated dominance model in general. Other necessary improvements are the rounding of the lips into the mouth cavity, which is not yet present because these parts of the lips are not observed in the 3D data (the reason why the mouth does not close completely yet), and the addition of wrinkles on the lips and elsewhere, which can be addressed by also using the dynamically observed texture data (e.g. when the lips are rounded).

Acknowledgments

This research work has been supported by the ETH Research Council and by the European Commission through the IST project MESH (www.meshproject.com, 2002), with the assistance of: Univ. Freiburg, DURAN, EPFL, Eyetronics, and Univ. Geneva.

References

[1] E. Ostby. Personal communication. Pixar Animation Studios, 1997.

[2] T. Ezzat and T. Poggio. Visual speech synthesis by morphing visemes. International Journal of Computer Vision, 38:45–57, 2000.

[3] T. Beier and S. Neely. Feature-based image metamorphosis. In SIGGRAPH '92 Conference Proceedings, volume 26, pages 35–42, 1992.

[4] C. Bregler and S. Omohundro. Nonlinear image interpolation using manifold learning. In NIPS, volume 7, 1995.

[5] C. Bregler, M. Covell, and M. Slaney. Video Rewrite: driving visual speech with audio. In Proc. SIGGRAPH, pages 353–360, 1997.

[6] D. Chen and A. State. Interactive shape metamorphosis. In Symposium on Interactive 3D Graphics, pages 43–44, 1995.

[7] M. Brand. Voice puppetry. In Proc. SIGGRAPH, 1999.

[8] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D.H. Salesin. Synthesizing realistic facial expressions from photographs. In Proc. SIGGRAPH, pages 75–84, 1998.

[9] H. Tao and T.S. Huang. Explanation-based facial motion tracking using a piecewise Bézier volume deformation model. In Proc. CVPR, 1999.

[10] L. Reveret, G. Bailly, and P. Badin. MOTHER: a new generation of talking heads providing a flexible articulatory control for video-realistic speech animation. In Proc. ICSLP 2000, 2000.

[11] S. King, R. Parent, and L. Olsafsky. An anatomically-based 3D parametric lip model to support facial animation and synchronized speech. In Proc. Deform Workshop, pages 1–19, 2000.

[12] K. Waters and J. Frisbie. A coordinated muscle model for speech animation. In Graphics Interface, pages 163–170, 1995.

[13] K.G. Munhall and E. Vatikiotis-Bateson. The moving face during speech communication. In R. Campbell, B. Dodd, and D. Burnham, editors, Hearing by Eye, volume 2, chapter 6, pages 123–139. Psychology Press, East Sussex, UK, 1998.

[14] Eyetronics. http://www.eyetronics.com.

[15] J.A. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, 7:308–313, 1965.

[16] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proc. SIGGRAPH, pages 187–194, 1999.

[17] K.C. Scott, D.S. Kagels, S.H. Watson, H. Rom, J.R. Wright, M. Lee, and K.J. Hussey. Synthesis of speaker facial movement to match selected speech sequences. In Proc. Fifth Australian Conference on Speech Science and Technology, volume 2, pages 620–625, 1994.

[18] E. Owens and B. Blazek. Visemes observed by hearing-impaired and normal-hearing adult viewers. Journal of Speech and Hearing Research, 28:381–393, 1985.

[19] A. Montgomery and P. Jackson. Physical characteristics of the lips underlying vowel lipreading performance. Journal of the Acoustical Society of America, 73:2134–2144, 1983.

[20] D.W. Massaro. Perceiving Talking Faces. MIT Press, 1998.

[21] C. Traber. SVOX: The Implementation of a Text-to-Speech System. PhD thesis, Computer Engineering and Networks Laboratory, ETH Zürich, No. 11064, 1995.

[22] D.W. Massaro. Perceiving Talking Faces. MIT Press, 1998.

Gregor A. Kalberer is a PhD student at the Computer Vision Group BIWI, D-ITET, ETH Zurich, Switzerland. He received the MSc degree in electrical engineering from ETH Zurich in 1999. His research interests include computer vision, graphics, animation and virtual reality. He is a member of the IEEE.

Luc Van Gool is professor for Computer Vision at the University of Leuven in Belgium and at ETH Zurich in Switzerland. He is a member of the editorial boards of several computer vision journals and of the program committees of international conferences on the same subject. His research includes object recognition, tracking, texture, 3D reconstruction, and the confluence of vision and graphics. Vision and graphics for archaeology is among his favourite applications.

Figure 1. Left: example of 3D input for one snapshot; right: the mask used for tracking the facial motions during speech.

Figure 2. Left: markers put on the face, one for each of the 124 mask vertices; right: 3D mask fitted by matching the mask vertices with the face markers.

Figure 3. Average mask (0) and the 10 dominant 'eigenmasks' for visual speech, in order of descending importance from (1) to (10).

Figure 4. (a) Schematic overview of the performance capture tracker: nose tip finding, rigid motion cancelation and fine registration, the latter two using the simplex downhill method with avoidance of local minima; (b) the facial mask in its starting position; (c) mask position after rigid motion cancelation; (d) mask after additional deformation to fit the lips.

Figure 5. Viseme table: the 20 visemes (rounded and widened consonant visemes, the monophthongs and silence), the allophones grouped under each, and example words producing them; diphthongs are divided into two monophthongs.

Figure 6. Spline interpolation between visemes (red dots) and samples taken for animation (blue quads).

Figure 7. Transplantation of speech: visemes copied verbatim onto another face.

Figure 8. Results for the sentence "Hello, my name is Jlona", at frames 009, 013, 030, 032 and 042. Row (A): frames of the input video sequence; (B): the 3D ShapeSnatcher output; (C): the extracted mask shapes; (D): the reproduced expressions of the detailed face model driven by the tracker; (E): the animated head model driven by the speech input; (F): a detail of the lips from another viewing angle.

