Real Time Dissociation of Facial Appearance and Dynamics
during Natural Conversation
Steven M. Boker and Jeffrey F. Cohn
June 3, 2009
Introduction
As we converse, we produce facial expressions, head movements, and vocal prosody that are important sources of information in communication. The semantic content of conversation is accompanied
by vocal prosody, non–word vocalizations, head movement, gestures, postural adjustments, eye
movements, smiles, eyebrow movements, and other facial muscle changes. Coordination between
speakers’ and listeners’ head movements, facial expressions, and vocal prosody has been widely
reported (Bernieri & Rosenthal, 1991; Bernieri, Davis, Rosenthal, & Knee, 1994; Cappella, 1981;
Chartrand, Maddux, & Lakin, 2005; Condon, 1976; Lafrance, 1985). Conversational coordination
can be defined as occurring when an action generated by one individual is predictive of a symmetric action
by another (Rotondo & Boker, 2002; Griffin & Gonzalez, 2003). This coordination is a form of spa-
tiotemporal symmetry between individuals (Boker & Rotondo, 2002) that has behaviorally useful
outcomes (Chartrand et al., 2005).
Movements of the head, facial expressions, and vocal prosody have been reported to be important
in judgments of identity (Fox, Gross, Cohn, & Reilly, 2007; Hill & Johnston, 2001; Munhall &
Buchan, 2004), rapport (Grahe & Bernieri, 2006; Bernieri et al., 1994), attractiveness (Morrison,
Gralewski, Campbell, & Penton-Voak, 2007), gender (Morrison et al., 2007; Hill & Johnston, 2001;
Berry, 1991), personality (Levesque & Kenny, 1993), and affect (Ekman, 1993; Hill, Troje, &
Johnston, 2003). In a dyadic conversation, each conversant’s perception of the other person produces
an ever–evolving behavioral context that in turn influences her/his ensuing actions, thus creating a
nonstationary feedback system as conversants form patterns of movements, expressions, and vocal
inflections; sometimes with high symmetry between the conversants and sometimes with little or no
similarity (Ashenfelter, Boker, Waddell, & Vitanov, in press). This skill is so automatic that little
thought is given to it unless it begins to break down.
As symmetry is formed between two conversants, the ability to predict the actions of one based
on the actions of the other increases and the perception of empathy increases (Baaren, Holland,
Steenaert, & Knippenberg, 2003). Symmetry in movements implies redundancy, which can be
defined as negative Shannon information (Redlich, 1993; Shannon & Weaver, 1949). By this logic,
when symmetry between two conversants is high, they are sharing an embodied state and thus
may feel greater empathy towards one another. When interpersonal symmetry is broken, that is
when there is a change from similar expressions and movements to dissimilar ones, individuals
are less likely to be able to predict each other’s movements due to lowered redundancy and thus
increased Shannon information. In this way, changes in symmetry can be interpreted as changes in
information flow between the conversants.
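To make the link between symmetry and information concrete, one simple illustration is a windowed correlation between two conversants' movement series: when the local correlation is high the two streams are redundant, and when it drops the streams carry more independent information. The following sketch is only an illustration on simulated head–movement signals (sampled at the 81.6 Hz rate used in our motion tracking, described below); it is not the analysis pipeline used in our experiments.

import numpy as np

def windowed_symmetry(x, y, window=200, step=50):
    # Absolute Pearson correlation of the two series within sliding windows;
    # high values indicate redundancy (symmetry), low values indicate that
    # the two streams carry more independent information.
    starts = range(0, min(len(x), len(y)) - window, step)
    return np.array([abs(np.corrcoef(x[i:i + window], y[i:i + window])[0, 1])
                     for i in starts])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    t = np.arange(4000) / 81.6                          # ~49 s at 81.6 Hz
    a = np.sin(2 * np.pi * 0.5 * t) + 0.3 * rng.standard_normal(t.size)
    b = a.copy()
    b[2000:] = 0.3 * rng.standard_normal(2000)          # symmetry breaks halfway through
    print(windowed_symmetry(a, b).round(2))             # high early, near zero later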
An information flow view of the dynamics of dyadic conversation is consistent with a model in
which contributions from audition, vision, and proprioception are combined in a low–level mirror
system (Rizzolatti & Craighero, 2004) that uses the continuous stream of auditory and visual input
as sources of information available for grammatical, semantic, and affective perception. A conceptual
diagram of this model is shown in Figure 1.
[Figure 1 about here.]
According to this model, motor activity can be generated from the mirror system, but this activity is, in general, suppressed. Release of this suppression produces motor mimicry, which is used intermittently to elicit engagement and express empathy with interlocutors. This model suggests that observed
adaptive dynamics of head movements and facial expressions in conversation are composed of both
low–level perception–action contributions exhibited as periods of high symmetry with an interlocutor as well as top–down cognitive contributions exhibited in the regulation of these periods of
symmetry.
Adaptation to Context: Dynamics and Appearance
The context of a conversation can influence its course: The person you think you are speaking with
can influence what you say and how you say it. The appearance of an interlocutor is composed
of his/her facial and body structure as well as how he/she moves and speaks. Separate pathways
for perception of structural appearance and biological motion have been proposed (Giese & Poggio,
2003). Evidence for this view comes from neurological (Steede, Tree, & Hole, 2007b, 2007a) and
judgment (Knappmeyer, Thornton, & Bulthoff, 2003; Hill & Johnston, 2001) studies. Munhall and Buchan (2004) postulate that motion contributes to the identification of faces through (a) better structural cues available from a moving face and (b) dynamic facial signatures. Berry (1991) reports that children can
recognize gender from point light faces when the recorded motion was from a conversation, but not
when it was from a recitative reading. This suggests that contextual dynamics cues are likely to be
stronger in normal interactions than when generated from a scripted and acted sequence.
If we accept that the dynamics of facial expressions, head movements, and gestures during
natural conversation are generated as part of a system that uses mechanisms of adaptive feedback
and varies the informational content of its output, we conclude that these dynamics are likely to
exhibit highly complex time–dependency. When building statistical models of such a system, we
may expect to encounter data with high numbers of degrees of freedom and complex nonlinear
interactions. The successful study of such systems requires measurement with high precision both
in time and in space. Large numbers of data samples are required in order to have sufficient power
to be able to distinguish models with nonlinear time–dependence from those with nonstationary
linear components. Finally, in order to test models, precise experimental perturbations are needed
in order to be able to distinguish causal from correlational structure.
Measuring and Manipulating Dynamics and Appearance
Given these needs for studying coordinative movement in natural conversation, we sought a non–
intrusive method for automatically tracking facial expressions and head movements. In addition,
we sought a method for covertly introducing known perturbations into natural conversation. We wished to be able to make adjustments to both appearance and
dynamics without a naive conversant knowing that these perturbations were present.
The adaptive regulation of cognition and expressive affect has long been studied using labor–
intensive methods such as hand coding of video tape or film (e.g., Cohn, Ambadar, & Ekman,
2007; Cappella, 1996; Condon & Ogston, 1966). Experiments in the regulation of expressive affect
have primarily used fixed or recorded stimuli because it is otherwise difficult to control the context
to which the participant is adapting. However, people react very differently to recorded stimuli than they do when they are engaged in a live conversation. Perceptions and implicit biases triggered by
appearance and dynamics of the interlocutor have the potential to change the way that a participant
self–regulates during an interaction. We sought a way to control the context of a natural
conversation by allowing the random assignment of appearance variables such as age, sex, race, and
attractiveness.
Videoconference Paradigm
In order to control context and to acquire high quality full face video and acoustically isolated audio
of conversation participants, we selected a videoconference paradigm. In our current laboratory
setup, each participant sits on a stool in a small video booth, as shown in Figure 2, facing a back projection screen approximately 2 m away. A small (2 cm × 10 cm) “lipstick” video camera is mounted in front of the back projection screen at a position that corresponds to the forehead of the life-size image of the interlocutor’s face projected on the screen. Each participant wears headphones, and
a small microphone is mounted overhead, out of the participants’ field of view. The walls of the
booths are moveable “gobos” built of sound–diffusing compressed fiberglass panels covered with
white fabric. The participants are lit from the front and the gobos and a white fabric booth ceiling
serve as reflective surfaces so that there are few shadows on the face, facilitating automatic video
tracking. Apart from the image of the interlocutor, the field of view in the booth is featureless white fabric. Acoustically, the sound diffusion panels prevent coherent early reflections and thus
the booth does not sound as small as it is. This effect and the open sides are used to help prevent
feelings of claustrophobia that might otherwise occur for some participants in a small (2.5m × 2m)
enclosed space.
[Figure 2 about here.]
While the participants are in separate rooms, and thus acoustically isolated from one another, they are in close physical proximity. This allows a single magnetic field covering both booths to be used for motion tracking (Ascension Technologies MotionStar), synchronously recording the participants’ head movements (6 DOF sampled at 81.6 Hz). Each participant
wears a headband or hat to which a tracker is attached. Naive participants are informed that we are “measuring magnetic fields during conversation,” and in our experiments over the past 10 years all but one participant (N > 200) have accepted this cover story. One participant in one of our recent
experiments immediately indicated that he knew that the headband was a motion capture sensor
and so his data were not used. By withholding the fact that we are motion tracking, we wish to prevent participants from feeling self-conscious about their movements during their conversations.
Video and audio can be transmitted between the booths with a minimal delay — there is a one
video frame delay (33ms) at the projector while it builds a frame buffer to project. Audio between
booths is delayed so as to match the arrival of the video. In some of our experiments, we have
used delays of between 3 and 5 frames (99 ms to 165 ms) but have so far found no effects on movements within this range of delays. As delays become longer than 200 ms, conversational patterns
can change. Delays of over 500ms can cause breakdowns in conversational flow as individuals begin
talking at the same time, having difficulty in negotiating smooth speaker–listener turn taking.
Real Time Facial Avatars
The videoconference paradigm allows us to track head movements, but it also allows us to track
facial movements using Active Appearance Models (AAMs) (Cootes, Edwards, & Taylor, 2001;
Cootes, Wheeler, Walker, & Taylor, 2002) by digitizing a video stream and applying tracking
software developed by our colleagues at Carnegie Mellon (Matthews, Ishikawa, & Baker, 2004;
Matthews & Baker, 2004). From the tracking data, we can redisplay a computer generated avatar
face for each video frame (Theobald, Matthews, Wilkinson, Cohn, & Boker, 2007). The tracking
and redisplay take less time than a single video frame, so the whole process takes 33ms for the frame
digitizing and 33ms for the tracking and redisplay of the frame buffer. Thus, we can track a face
and from that data synthesize a video avatar within 66ms. We have been using an off–the–shelf
PCIe AJA Video Kona card for video digitizing and redisplay in a standard 3.0 GHz Mac Pro, which performs the tracking and synthesis.
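The timing budget described above can be summarized in a small simulation. The stage functions below are hypothetical placeholders (the real system uses the hardware digitizing card and the AAM fitter); the sketch only illustrates that tracking and synthesis must finish within one 33 ms frame period so that the avatar lags the camera by roughly two frames.

import time

FRAME_PERIOD = 1.0 / 30.0                    # one video frame, ~33 ms

def digitize_frame():                        # placeholder: frame buffer fills
    time.sleep(FRAME_PERIOD)
    return "frame"

def track_and_synthesize(frame):             # placeholder: AAM fit + avatar render
    time.sleep(0.005)                        # must finish well under one frame period
    return "avatar_frame"

def run_pipeline(n_frames=30):
    for _ in range(n_frames):
        start = time.monotonic()
        avatar = track_and_synthesize(digitize_frame())
        latency = time.monotonic() - start
        assert latency < 2 * FRAME_PERIOD + 0.01, "missed the ~66 ms budget"

if __name__ == "__main__":
    run_pipeline()
    print("simulated glass-to-glass latency stayed within ~66 ms")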
The capability to track facial movements and redisplay them brings with it the possibility of
creating perturbations to conversation in a covert manner and randomly assigning them within a
conversation. In order to do this, we first needed to find out whether the synthesized avatars were
accepted as being video by naive participants. Then we needed to develop a method for applying
perturbations of appearance and dynamics to the resynthesized avatar faces.
Facial Avatars Using Active Appearance Models (AAMs)
AAMs are generative, parametric models and consist of both shape and appearance components.
The shape component of the AAM is a triangulated mesh that moves like a face undergoing both
rigid motion (head pose variation) and non-rigid motion (expression) in response to changes in
the parameters. The shape components are identified by first hand–labeling the 68 vertices of the triangular mesh in 30 to 50 frames of video, a process that takes a trained
research assistant about 2 to 3 hours. The video frames are chosen so as to cover the range of facial
motion normally exhibited by the target individual’s expressions. Once these video frames are
hand–labeled, the remainder of the process is automatic. Thus, once a model is constructed, we can
continue to use the model to track an individual over multiple occasions and multiple experiments.
A principal components analysis is performed on these labeled video frames to extract between 8
and 12 shape components that can be thought of as axes of facial movement. These components
are independent and additive so that the estimated shape of a face mesh in a single frame of video
is expressed as the weighted sum of the retained components
s = s_0 + \sum_{i=1}^{m} p_i s_i, \qquad (1)
where s is the estimated shape, s_0 is the mean shape, the s_i are the component loadings, and the p_i are the shape parameters. Figure 3–a plots the first three of these components for one target individual.
The arrows in each wireframe face in Figure 3–a demonstrate how one unit of change in each of the first three principal components creates simultaneous change across multiple vertices of the triangular
mesh.
[Figure 3 about here.]
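As a concrete illustration of Equation (1), the following sketch builds a toy shape model with NumPy: a matrix of hand–labeled meshes (one row per labeled frame, with the 68 (x, y) vertices flattened to 136 numbers) is decomposed to obtain the mean shape s_0 and the shape components s_i, and a new mesh is synthesized as the mean plus a weighted sum of components. The array sizes and data are stand–ins, not our actual labeled frames.

import numpy as np

def build_shape_model(labeled_meshes, n_components=10):
    # PCA of the hand-labeled meshes: mean shape plus orthonormal shape components.
    s0 = labeled_meshes.mean(axis=0)
    _, _, vt = np.linalg.svd(labeled_meshes - s0, full_matrices=False)
    return s0, vt[:n_components]

def synthesize_shape(s0, components, p):
    # Equation (1): s = s0 + sum_i p_i * s_i
    return s0 + p @ components

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    meshes = rng.normal(size=(40, 136))           # stand-in for 40 labeled frames
    s0, comps = build_shape_model(meshes)
    p = np.zeros(10)
    p[0] = 1.0                                    # one unit along the first shape mode
    print(synthesize_shape(s0, comps, p).shape)   # (136,) = 68 mesh vertices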
The appearance component of the AAM is an image of the face, which itself can vary under the
control of the parameters. As the parameters are varied, the appearance changes to model effects
such as the appearance of furrows and wrinkles. Again, we use principal components analysis on the
same labeled video frames and retain a few (8 to 12) appearance components. Thus, the estimated
appearance A(x) is the mean appearance A_0(x) plus a weighted sum of appearance images A_i(x):
A(x) = A_0(x) + \sum_{i=1}^{l} \lambda_i A_i(x) \quad \forall\, x \in s_0, \qquad (2)

where the coefficients λ_i are the appearance parameters. Figure 3–b shows the mean appearance
and the first two appearance components for a target individual’s face. Combining the shape and
appearance models we can create a wide variety of natural–looking head poses and facial expressions.
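Equation (2) can be sketched in the same way. In the toy example below the appearance images are small arrays standing in for the face texture over the mean shape; in the real system the synthesized appearance is then warped from the mean shape to the current shape when the avatar is rendered. Sizes and values are assumptions for illustration only.

import numpy as np

def synthesize_appearance(A0, appearance_images, lam):
    # Equation (2): A(x) = A0(x) + sum_i lambda_i * A_i(x), for x in the mean shape.
    return A0 + np.tensordot(lam, appearance_images, axes=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A0 = rng.random((64, 64))                # mean appearance (toy resolution)
    A = rng.random((8, 64, 64))              # eight appearance components
    lam = rng.normal(scale=0.1, size=8)      # appearance parameters
    print(synthesize_appearance(A0, A, lam).shape)   # (64, 64) face texture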
Fitting an AAM is a difficult non-linear optimization problem. Matthews and Baker (2004)
recently proposed and demonstrated an AAM fitting algorithm that is more robust and faster than
previous algorithms, tracking faces at over 200 frames per second. Our project uses this algorithm
to fit a pre–built AAM model to each frame of video as it becomes available at the digitizing frame
buffer.
Figures 4–a and 4–b display video frames of two research assistants as captured by the video
camera during an experiment. Below each individual’s picture, Figures 4–c and 4–d present the
respective computer–generated facial avatars with the mean shape and appearance for the two research assistants. The required number of degrees of freedom for each generated avatar is surprisingly low (fewer than 25 DOF), yet the avatars’ appearance is very similar to the individuals’ actual appearance. Note that the mean shape and appearance are somewhat smoother than any particular video frame and that the mean shape and appearance do not correspond to a completely relaxed expression.
[Figure 4 about here.]
Figure 5 displays a still from the video feeds, facial tracking, and avatar from one conversation
during an experiment. Figure 5–a shows the video that was captured from the research assistant.
In Figure 5–b, her face is tracked by the 68 vertex triangular mesh. From these tracking data and
a previously constructed model, a video frame with a matching avatar is displayed in Figure 5–c.
The naive participant, shown in Figure 5–d, sees only the avatar image from Figure 5–c, while the
research assistant sees the full video of the naive participant as seen in Figure 5–d.
[Figure 5 about here.]
Note that there appears to be a high degree of symmetry exhibited by the faces in Figure 5. After the conversation session is over, an AAM is built for each naive participant and used to track his or her face. In this
way, we have been capturing measurements that will be used to construct and test specific models
for the dynamic ebb and flow of symmetry formation and symmetry breaking during conversation.
Swapping Appearance by Using Avatar Models
Once we were able to construct and display an avatar in real time, we began to work on methods for
introducing manipulations of the appearance and dynamics of the avatar. Changing the appearance
of the avatar is akin to putting a flexible mask on a person. That is to say, the avatar’s expressions
are driven by the captured motions of the person whose face is being motion tracked, but the avatar
model that is shown making these expressions is one that was generated from a different individual.
To accomplish this, we first constructed a set of short video clips that was representative of each
of our six research assistants. We then videotaped each assistant while she or he mimicked the facial
expressions of each of the other assistants as shown on the video clips. Then we used the captured
video from each individual research assistant to build a model that covered approximately the same
space of expressions. Finally, we simply substituted one person’s mean shape and appearance for
another person’s during the synthesis portion of the process. As an example of how this works, the
research assistant in Figure 6–a is being tracked and his expressions are mapped onto the other five
research assistants in Figures 6–b through –f.
[Figure 6 about here.]
Note that the expressions in Figure 6 are not exactly the same. One might think of this mapping as taking the difference between person (a)’s current expression and his mean expression and applying that difference to person (b)’s face, yielding the expression person (b) would show if he had departed from his own mean in the same way. This tends to create natural–appearing expressions, since the shapes themselves are not mapped, but only an expression that represents a similar point in expression space. By sampling all individuals mimicking the same
movements, we were able to have the axes of the spaces be relatively similar. This is an important
point, because principal component axes generated from the distribution of naturally occurring
movements from one individual may be substantially different from the axes generated from another
person who may have a very different distribution of characteristic movements. Methods for rotation
and scaling of axes between avatar models (Theobald, Matthews, Cohn, & Boker, 2007) may improve
expression mapping and reduce the need for mimic–based video sequences on which to build models.
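The mapping just described can be sketched as follows, assuming two avatar models whose component axes span comparable expression spaces (as arranged by the mimicry recordings). The source person's parameters encode how his current expression differs from his own mean; applying those same parameter values to the target model yields the expression the target would show at the same point in expression space. The models and meshes below are random stand–ins, and the projection step assumes orthonormal components.

import numpy as np

def estimate_parameters(shape, s0, components):
    # Project the deviation from the mean onto the (orthonormal) shape components.
    return components @ (shape - s0)

def map_expression(source_shape, source_model, target_model):
    s0_src, comps_src = source_model
    s0_tgt, comps_tgt = target_model
    p = estimate_parameters(source_shape, s0_src, comps_src)
    return s0_tgt + p @ comps_tgt             # target mean + the same parameter offsets

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    def toy_model():
        s0 = rng.normal(size=136)
        comps = np.linalg.qr(rng.normal(size=(136, 10)))[0].T   # orthonormal rows
        return s0, comps
    source, target = toy_model(), toy_model()
    tracked = source[0] + 0.5 * source[1][0]  # a deviation along the first mode
    print(map_expression(tracked, source, target).shape)        # (136,) target mesh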
We used this mean shape and appearance substitution method to produce displays that map appearance from one sex to another (Boker et al., in press). For instance, in Figure 7
the research assistant’s face was mapped to a male avatar. Each video frame is mapped, so the
dynamics of the movements produced by the female research assistant were reproduced in the male avatar’s expressions. In addition, the female research assistant’s voice was processed using a TC–
Helicon VoicePro vocal formant processor to change the fundamental frequency and formants to
approximate a male voice. In this experiment, the naive participant in Figure 7–d was informed
that she would have six different conversations. She actually talked to two research assistants, one
male and one female, but she thought she had spoken with six different individuals, three male and
three female. The research assistants were blind to whether they appeared as a male or a female in
any particular conversation.
[Figure 7 about here.]
In our avatar videoconference experiments, we have run over 100 naive participants and only
two of them have doubted our cover story that the faces they see are live video that has been “cut
out” so that they only see the face of the person they are talking to. One of those was the person
who also guessed that we were putting motion capture sensors on him. Unfortunately, as knowledge
of this technology becomes more widely disseminated, we will not be able to rely on participants
trusting that the face they see in a video conference in fact belongs to the person with whom they
are speaking.
Future Directions
Now that it is practical to precisely and non–invasively measure and control non–rigid facial move-
ments produced in natural conversation, we expect that there will be a surge of experiments that test
hypotheses about the coupled dynamics of interpersonal coordination. We expect that a mapping
will be developed between a semantic space of adjectives describing emotion and a low degree–
of–freedom avatar model of the human face. This mapping will allow the automatic tracking of
affective facial displays in a way that may revolutionize human–computer interactions.
We are also interested in perturbing the dynamics of expressions. In affective disorders such as depression, individuals display facial behavior during conversation that differs from the interpersonal coordination of expressions seen in normal conversation. Depressed individuals also report
feelings of being distant from others. By better understanding the way that these patterns of
affective display develop and persist, we may be able to devise better interventions that allow these
individuals to recover from depressive episodes more quickly and effectively.
Another area amenable to study using real–time avatars is stereotyping and bias. Since we can
convincingly change a person’s apparent sex, we expect that further work will allow us to randomly
assign variables such as race and age during natural conversation. Studying stereotyping using
this paradigm is particularly interesting since the research assistant whose characteristics are being
modified can be kept blind to the modification. It is not as if an assistant is asked to act a part.
The only way the assistant can know how he or she appears to the conversational partner is by
how the conversational partner treats the assistant. By counterbalancing so that the conversational
partner has more than one conversation with the same assistant in each appearance condition, we
can attribute effects observed during conversation to the randomly assigned appearance variable.
Applications for this technology in human–computer interaction are not difficult to envision.
For instance, a NASA–funded pilot project has been proposed to track teachers’ faces and map
them onto celestial objects so that, in distance–learning equipped classrooms, children can “talk to
Jupiter”. Transmitting avatar displays requires extremely low bandwidth, so these displays may
find application in cell phones and other videoconferencing applications (Brick, Spies, Theobald,
Matthews, & Boker, 2009). Computer–based tutoring systems may be able to use webcams to
track whether a learner is displaying confusion or frustration. Autonomous avatars may be able to
display expressions that are perceived as showing empathy by tracking viewers’ faces and displaying
an appropriate amount of interpersonal symmetry, thereby reducing the feeling that the autonomous faces are cold and mechanical. Appropriate responses to detected affect in human facial expressions may allow human–robot interactions to be less threatening and more fulfilling for humans.
Conclusions
We have presented an overview of our team’s work in developing and testing real–time facial avatars driven by motion capture from video. The avatar technology has enabled videoconference experiments that randomly assign appearance variables and examine how people coordinate their motions
and expressions in natural conversation.
After 24 to 30 minutes of the videoconference conversations, 98% of naive participants did not
doubt the cover story that we were “cutting out video to just show the face”. We find this to
be surprising since each video frame was constructed from approximately 25 floating point values
applied to a model. Contrast that with the fact that a real video frame contains over 300,000 pixels, each of which is represented by a 24–bit number.
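A back–of–the–envelope comparison makes the contrast explicit; the figures below assume 32–bit floats for the roughly 25 avatar parameters and a 640 × 480 frame for the “over 300,000 pixels,” neither of which is stated precisely above.

params_bits = 25 * 32                    # ~800 bits to describe an avatar frame
pixel_bits = 640 * 480 * 24              # 7,372,800 bits for a raw video frame
print(params_bits, pixel_bits, pixel_bits // params_bits)   # ratio of about 9,216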
We can think of three reasons why this illusion is so convincing. The first possible reason is that
when we produce facial expressions, we largely coordinate our muscles in correlated patterns, so
that the total number of degrees of freedom we express is relatively small — on the order of three
degrees of freedom for head pose and 7 to 12 degrees of freedom for facial expression.
The second possible reason is that there may be some limiting of perceived degrees of freedom
as we view a facial expression. Thus, our perceptual system may be mapping the facial expressions
onto a lower dimensional space than actually exists in the world, so when the number of degrees
of freedom in the display is reduced, we do not notice. Such a perceptual effect might also explain
why it is so easy to see a face in an arbitrary pattern with only marginal similarity to a face — the
so–called “face on Mars” or “face on the tortilla” effect.
A third possible reason is that in real–time conversation, a participant is expecting to interact
with a real person and is engaged in that interaction. Thus the dynamics of the symmetry formation
and symmetry breaking are appropriate and convince the participant that since the interaction is
real, the video image must be real. Contrast that situation with a judgment paradigm where the
participant may adopt a more critical attitude and is not dynamically engaged with the person on
the display. Thus the nature of the context and task may lead to greater or lesser credibility of the
avatar display.
We expect that real–time facial avatars will be in common, everyday use within ten years or less.
We expect facial avatar technology to be influential in teaching, in human–computer interaction,
and in the diagnosis and treatment of affective disorders. In the meantime, these methods provide
powerful tools for examining human interpersonal communication.
Author Note
The authors gratefully acknowledge the contributions of the many investigators and research
assistants who worked on this project: Zara Ambadar, Kathy Ashenfelter, Timothy Brick, Tamara
Buretz, Enoch Chow, Eric Covey, Pascal Deboeck, Katie Jackson, Hannah Kim, Jen Koltiska, Nancy
Liu, Michael Mangini, Iain Matthews, Sean McGowan, Ryan Mounaime, Sagar Navare, Andrew
Quilpa, Jeffrey Spies, Barry–John Theobald, Stacey Tiberio, Michael Villano, Chris Wagner, and
Meng Zhao. Funding for this work was provided in part by NSF Grant BCS–0527485. Any opinions,
findings, and conclusions or recommendations expressed in this material are those of the authors
and do not necessarily reflect the views of the National Science Foundation. Correspondence may
be addressed to Steven M. Boker, Department of Psychology, The University of Virginia, PO Box
400400, Charlottesville, VA 22904, USA; email sent to [email protected]; or browsers pointed to
http://people.virginia.edu/~smb3u.
References
Ashenfelter, K. T., Boker, S. M., Waddell, J. R., & Vitanov, N. (in press). Spatiotemporal
symmetry and multifractal structure of head movements during dyadic conversation. Journal
of Experimental Psychology: Human Perception and Performance.
Baaren, R. B. van, Holland, R. W., Steenaert, B., & Knippenberg, A. van. (2003). Mimicry for
money: Behavioral consequences of imitation. Journal of Experimental Social Psychology,
39 (4), 393–398.
Bernieri, F. J., Davis, J. M., Rosenthal, R., & Knee, C. R. (1994). Interactional synchrony and
rapport: Measuring synchrony in displays devoid of sound and facial affect. Personality and
Social Psychology Bulletin, 20 (3), 303–311.
Bernieri, F. J., & Rosenthal, R. (1991). Interpersonal coordination: Behavior matching and
interactional synchrony. In R. S. Feldman & B. Rime (Eds.), Fundamentals of nonverbal
behavior (pp. 401–431). Cambridge, UK: Cambridge University Press.
Berry, D. S. (1991). Child and adult sensitivity to gender information in patterns of facial motion.
Ecological Psychology, 3 (4), 349–366.
Boker, S. M., Cohn, J. F., Theobald, B.-J., Matthews, I., Mangini, M., Spies, J. R., et al. (in
press). Something in the way we move: Motion dynamics, not perceived sex, influence head
movements in conversation. Journal of Experimental Psychology: Human Perception and
Performance, ?? (??), ??
Boker, S. M., & Rotondo, J. L. (2002). Symmetry building and symmetry breaking in synchronized
movement. In M. Stamenov & V. Gallese (Eds.), Mirror neurons and the evolution of brain
and language (pp. 163–171). Amsterdam: John Benjamins.
Brick, T. R., Spies, J. R., Theobald, B., Matthews, I., & Boker, S. M. (2009). High–presence, low–
bandwidth, apparent 3–d video–conferencing with a single camera. In Proceedings of the 2009
International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS).
IEEE.
Cappella, J. N. (1981). Mutual influence in expressive behavior: Adult–adult and infant–adult
dyadic interaction. Psychological Bulletin, 89 (1), 101–132.
Cappella, J. N. (1996). Dynamic coordination of vocal and kinesic behavior in dyadic interaction: Methods, problems, and interpersonal outcomes. In J. H. Watt & C. A. VanLear (Eds.),
Methodology in social research (pp. 353–386). Thousand Oaks, CA: Sage.
Chartrand, T. L., Maddux, W. W., & Lakin, J. L. (2005). Beyond the perception–behavior link:
The ubiquitous utility and motivational moderators of nonconscious mimicry. In R. Hassin,
J. Uleman, & J. A. Bargh (Eds.), The new unconscious (pp. 334–361). New York: Oxford
University Press.
Cohn, J. F., Ambadar, Z., & Ekman, P. (2007). Observer–based measurement of facial expression
with the Facial Action Coding System. In J. A. Coan & J. J. B. Allen (Eds.), The handbook
of emotion elicitation and assessment (pp. 203–221). New York: Oxford University Press.
Condon, W. S. (1976). An analysis of behavioral organization. Sign Language Studies, 13, 285–318.
Condon, W. S., & Ogston, W. D. (1966). Sound film analysis of normal and pathological behavior
patterns. Journal of Nervous and Mental Disease, 143 (4), 338–347.
Cootes, T. F., Edwards, G., & Taylor, C. J. (2001). Active appearance models. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 23 (6), 681–685.
Cootes, T. F., Wheeler, G. V., Walker, K. N., & Taylor, C. J. (2002). View-based active appearance
models. Image and Vision Computing, 20 (9–10), 657–664.
Ekman, P. (1993). Facial expression and emotion. American Psychologist, 48, 384–392.
Fox, N. A., Gross, R., Cohn, J. F., & Reilly, R. B. (2007). Robust biometric person identification
using automatic classifier fusion of speech, mouth, and face experts. IEEE Transactions on
Multimedia, 9 (4), 701–714.
Giese, M. A., & Poggio, T. (2003). Neural mechanisms for the recognition of biological movements.
Nature Reviews Neuroscience, 4, 179–192.
Grahe, J. E., & Bernieri, F. J. (2006). The importance of nonverbal cues in judging rapport.
Journal of Nonverbal Behavior, 23 (4), 253–269.
Griffin, D., & Gonzalez, R. (2003). Models of dyadic social interaction. Philosophical Transactions
of the Royal Society of London, B, 358 (1431), 573–581.
Hill, H. C. H., & Johnston, A. (2001). Categorizing sex and identity from the biological motion of
faces. Current Biology, 11 (3), 880–885.
Hill, H. C. H., Troje, N. F., & Johnston, A. (2003). Range– and domain–specific exaggeration of
facial speech. Journal of Vision, 5, 793–807.
Knappmeyer, B., Thornton, I. M., & Bulthoff, H. H. (2003). The use of facial motion and facial
form during the processing of identity. Vision Research, 43 (18), 1921–1936.
Lafrance, M. (1985). Postural mirroring and intergroup relations. Personality and Social Psychology
Bulletin, 11 (2), 207–217.
Levesque, M. J., & Kenny, D. A. (1993). Accuracy of behavioral predictions at zero acquaintance:
A social relations model. Journal of Personality and Social Psychology, 65 (6), 1178–1187.
Matthews, I., & Baker, S. (2004). Active appearance models revisited. International Journal of
Computer Vision, 60 (2), 135–164.
Matthews, I., Ishikawa, T., & Baker, S. (2004). The template update problem. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 26, 810–815.
Morrison, E. R., Gralewski, L., Campbell, N., & Penton-Voak, I. S. (2007). Facial movement varies
by sex and is related to attractiveness. Evolution and Human Behavior, 28, 186–192.
Munhall, K. G., & Buchan, J. N. (2004). Something in the way she moves. Trends in Cognitive
Sciences, 8 (2), 51–53.
Redlich, N. A. (1993). Redundancy reduction as a strategy for unsupervised learning. Neural
Computation, 5, 289–304.
Rizzolatti, G., & Craighero, L. (2004). The mirror–neuron system. Annual Review of Neuroscience,
27, 169–192.
Rotondo, J. L., & Boker, S. M. (2002). Behavioral synchronization in human conversational
interaction. In M. Stamenov & V. Gallese (Eds.), Mirror neurons and the evolution of brain
and language (pp. 151–162). Amsterdam: John Benjamins.
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana: The
University of Illinois Press.
Steede, L. L., Tree, J. J., & Hole, G. J. (2007a). Dissociating mechanisms involved in accessing
identity by dynamic and static cues. Object Perception, Attention, and Memory (OPCAM)
2006 Conference Report, Visual Cognition, 15 (1), 116–123.
Steede, L. L., Tree, J. J., & Hole, G. J. (2007b). I can’t recognize your face but I can recognize its
movement. Cognitive Neuropsychology, 24 (4), 451–466.
Theobald, B., Matthews, I., Cohn, J. F., & Boker, S. (2007). Real–time expression cloning using
appearance models. In Proceedings of the 9th international conference on multimodal interfaces
(pp. 134–139). New York: Association for Computing Machinery.
Theobald, B., Matthews, I., Wilkinson, N., Cohn, J. F., & Boker, S. (2007). Animating faces using
appearance models. In Proceedings of the 2007 workshop on vision, video and graphics.
List of Figures

1 A conceptual model for adaptive feedback between two individuals engaged in conversation.
2 Layout of the videoconference booth and motion tracking system.
3 Active Appearance Models (AAMs) have both shape and appearance components.
4 Video frames and mean shape and appearance models for two research assistants.
5 One frame from conversation during a videoconference experiment.
6 Six avatars generated from the facial expression captured from person (a).
7 One frame from a conversation in which the appearance and voice of the research assistant was changed to appear to be male.
[Figure 1 diagram: each conversant (Conversant A and Conversant B) is represented by boxes for Cognition, Mirror System, Vision, Audition, and Motor Output.]
Figure 1: A conceptual model for adaptive feedback between two individuals engaged in conversation. A
mirror system tracks the movements and vocalizations of the interlocutor, but the output of the mirror
system is frequently suppressed. When symmetric action is called for, the mirror system is pre–primed
to produce symmetry by enabling its otherwise suppressed output.
[Figure 2 diagram: Booth 1 and Booth 2 (Stage 1 and Stage 2) separated by a sound isolation wall, each with a back projection screen, projector, camera, stool, lights, and gobos, within a shared magnetic field.]
Figure 2: Layout of the videoconference booth and motion tracking system. The oval magnetic field
penetrates the magnetically transparent sound isolation wall so that participants sit approximately 3m
apart in the same motion tracking field.
Figure 3: Active Appearance Models (AAMs) have both shape and appearance components. (a) The
first 3 shape modes. (b) The mean appearance (left) and first 2 appearance modes. (c) Three example
faces generated with the AAM in (a) and (b).
Figure 4: Video frames and mean shape and appearance models for two research assistants.
Figure 5: One frame from conversation during a videoconference experiment. (a) The research assistant
whose face was tracked sat in one booth. (b) The tracking mesh is automatically fit to the research
assistant’s face. (c) The synthesized avatar is displayed to the naive participant within 99ms of the light
captured by the camera in the research assistant’s booth. (d) The naive participant’s image is seen by
the research assistant.
Figure 6: Six avatars generated from the facial expression captured from person (a).
Figure 7: One frame from a conversation in which the appearance and voice of the research assistant
was changed to appear to be male. (a) The research assistant whose face was tracked sat in one booth.
(b) The tracking mesh is automatically fit to the research assistant’s face. (c) A synthesized avatar with
mean appearance taken from a male research assistant is displayed to the naive participant within 99ms
of the light captured by the camera in the research assistant’s booth. (d) The naive participant’s image
is seen by the research assistant.