Real Time Dissociation of Facial Appearance and Dynamics
during Natural Conversation
Steven M. Boker and Jeffrey F. Cohn
June 3, 2009
Introduction
As we converse, we produce facial expressions, head movements, and vocal prosody that are important sources of information in communication. The semantic content of conversation is accompanied
by vocal prosody, non–word vocalizations, head movement, gestures, postural adjustments, eye
movements, smiles, eyebrow movements, and other facial muscle changes. Coordination between
speakers’ and listeners’ head movements, facial expressions, and vocal prosody has been widely
reported (Bernieri & Rosenthal, 1991; Bernieri, Davis, Rosenthal, & Knee, 1994; Cappella, 1981;
Chartrand, Maddux, & Lakin, 2005; Condon, 1976; Lafrance, 1985). Conversational coordination
can be defined as occurring when an action generated by one individual is predictive of a symmetric action
by another (Rotondo & Boker, 2002; Griffin & Gonzalez, 2003). This coordination is a form of spa-
tiotemporal symmetry between individuals (Boker & Rotondo, 2002) that has behaviorally useful
outcomes (Chartrand et al., 2005).
Movements of the head, facial expressions, and vocal prosody have been reported to be important
in judgments of identity (Fox, Gross, Cohn, & Reilly, 2007; Hill & Johnston, 2001; Munhall &
Buchan, 2004), rapport (Grahe & Bernieri, 2006; Bernieri et al., 1994), attractiveness (Morrison,
Gralewski, Campbell, & Penton-Voak, 2007), gender (Morrison et al., 2007; Hill & Johnston, 2001;
Berry, 1991), personality (Levesque & Kenny, 1993), and affect (Ekman, 1993; Hill, Troje, &
Johnston, 2003). In a dyadic conversation, each conversant’s perception of the other person produces
an ever–evolving behavioral context that in turn influences her/his ensuing actions, thus creating a
nonstationary feedback system as conversants form patterns of movements, expressions, and vocal
inflections; sometimes with high symmetry between the conversants and sometimes with little or no
similarity (Ashenfelter, Boker, Waddell, & Vitanov, in press). This skill is so automatic that little
thought is given to it unless it begins to break down.
As symmetry is formed between two conversants, the ability to predict the actions of one based
on the actions of the other increases and the perception of empathy increases (Baaren, Holland,
Steenaert, & Knippenberg, 2003). Symmetry in movements implies redundancy, which can be
defined as negative Shannon information (Redlich, 1993; Shannon & Weaver, 1949). By this logic,
when symmetry between two conversants is high, they are sharing an embodied state and thus
may feel greater empathy towards one another. When interpersonal symmetry is broken, that is
when there is a change from similar expressions and movements to dissimilar ones, individuals
are less likely to be able to predict each other’s movements due to lowered redundancy and thus
increased Shannon information. In this way, changes in symmetry can be interpreted as changes in
information flow between the conversants.
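To make the link between symmetry and information concrete, one simple illustration is a windowed correlation between two conversants' movement series: when the local correlation is high the two streams are redundant, and when it drops the streams carry more independent information. The following sketch is only an illustration on simulated head–movement signals (sampled at the 81.6 Hz rate used in our motion tracking, described below); it is not the analysis pipeline used in our experiments.

import numpy as np

def windowed_symmetry(x, y, window=200, step=50):
    # Absolute Pearson correlation of the two series within sliding windows;
    # high values indicate redundancy (symmetry), low values indicate that
    # the two streams carry more independent information.
    starts = range(0, min(len(x), len(y)) - window, step)
    return np.array([abs(np.corrcoef(x[i:i + window], y[i:i + window])[0, 1])
                     for i in starts])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    t = np.arange(4000) / 81.6                          # ~49 s at 81.6 Hz
    a = np.sin(2 * np.pi * 0.5 * t) + 0.3 * rng.standard_normal(t.size)
    b = a.copy()
    b[2000:] = 0.3 * rng.standard_normal(2000)          # symmetry breaks halfway through
    print(windowed_symmetry(a, b).round(2))             # high early, near zero later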
An information flow view of the dynamics of dyadic conversation is consistent with a model in
which contributions from audition, vision, and proprioception are combined in a low–level mirror
system (Rizzolatti & Craighero, 2004) that uses the continuous stream of auditory and visual input
as sources of information available for grammatical, semantic, and affective perception. A conceptual
diagram of this model is shown in Figure 1.
[Figure 1 about here.]
According to this model, motor activity can be generated from the mirror system, but this activity is, in general, suppressed. Release of this suppression produces motor mimicry, which is used intermittently to elicit engagement and express empathy with interlocutors. This model suggests that observed
adaptive dynamics of head movements and facial expressions in conversation are composed of both
low–level perception–action contributions exhibited as periods of high symmetry with an interlocutor as well as top–down cognitive contributions exhibited in the regulation of these periods of
symmetry.
Adaptation to Context: Dynamics and Appearance
The context of a conversation can influence its course: The person you think you are speaking with
can influence what you say and how you say it. The appearance of an interlocutor is composed
of his/her facial and body structure as well as how he/she moves and speaks. Separate pathways
for perception of structural appearance and biological motion have been proposed (Giese & Poggio,
2003). Evidence for this view comes from neurological (Steede, Tree, & Hole, 2007b, 2007a) and
judgment (Knappmeyer, Thornton, & Bulthoff, 2003; Hill & Johnston, 2001) studies. Munhall and Buchan (2004) postulate that motion contributes to the identification of faces through (a) better structural cues available from a moving face and (b) dynamic facial signatures. Berry (1991) reports that children can
recognize gender from point light faces when the recorded motion was from a conversation, but not
when it was from a recitative reading. This suggests that contextual dynamics cues are likely to be
stronger in normal interactions than when generated from a scripted and acted sequence.
If we accept that the dynamics of facial expressions, head movements, and gestures during
natural conversation are generated as part of a system that uses mechanisms of adaptive feedback
and varies the informational content of its output, we conclude that these dynamics are likely to
exhibit highly complex time–dependency. When building statistical models of such a system, we
may expect to encounter data with high numbers of degrees of freedom and complex nonlinear
interactions. The successful study of such systems requires measurement with high precision both
in time and in space. Large numbers of data samples are required in order to have sufficient power
to be able to distinguish models with nonlinear time–dependence from those with nonstationary
linear components. Finally, in order to test models, precise experimental perturbations are needed
in order to be able to distinguish causal from correlational structure.
Measuring and Manipulating Dynamics and Appearance
Given these needs for studying coordinative movement in natural conversation, we sought a non–
intrusive method for automatically tracking facial expressions and head movements. In addition,
we sought a method for covertly introducing known perturbations into natural conversation. We wished to be able to make adjustments to both appearance and
dynamics without a naive conversant knowing that these perturbations were present.
The adaptive regulation of cognition and expressive affect has long been studied using labor–
intensive methods such as hand coding of video tape or film (e.g., Cohn, Ambadar, & Ekman,
2007; Cappella, 1996; Condon & Ogston, 1966). Experiments in the regulation of expressive affect
have primarily used fixed or recorded stimuli because it is otherwise difficult to control the context
to which the participant is adapting. However, people react very differently to recorded stimuli than they do when they are engaged in a live conversation. Perceptions and implicit biases triggered by
appearance and dynamics of the interlocutor have the potential to change the way that a participant
self–regulates during an interaction. We sought a way to control the context of a natural
conversation by allowing the random assignment of appearance variables such as age, sex, race, and
attractiveness.
Videoconference Paradigm
In order to control context and to acquire high quality full face video and acoustically isolated audio
of conversation participants, we selected a videoconference paradigm. In our current laboratory
setup, each participant sits on a stool in a small video booth, as shown in Figure 2, facing a back projection screen approximately 2 m away. A small (2 cm × 10 cm) “lipstick” video camera is mounted in front of the back projection screen at a position that corresponds to the forehead of the life-size image of the interlocutor’s face projected on the screen. Each participant wears headphones, and
a small microphone is mounted overhead, out of the participants’ field of view. The walls of the
booths are moveable “gobos” built of sound–diffusing compressed fiberglass panels covered with
white fabric. The participants are lit from the front and the gobos and a white fabric booth ceiling
serve as reflective surfaces so that there are few shadows on the face, facilitating automatic video
tracking. Apart from the image of the interlocutor, the field of view in the booth is featureless white fabric. Acoustically, the sound diffusion panels prevent coherent early reflections and thus
the booth does not sound as small as it is. This effect and the open sides are used to help prevent
feelings of claustrophobia that might otherwise occur for some participants in a small (2.5m × 2m)
enclosed space.
[Figure 2 about here.]
While the participants are in separate rooms, and thus acoustically isolated from one another, they are in close physical proximity. This allows a single magnetic field covering both booths to be used for motion tracking (Ascension Technologies MotionStar), synchronously recording the participants’ head movements (6 DOF sampled at 81.6 Hz). Each participant
wears a headband or hat to which a tracker is attached. Naive participants are informed that we are “measuring magnetic fields during conversation,” and in our experiments over the past 10 years all but one participant (N > 200) have accepted this cover story. One participant in one of our recent
experiments immediately indicated that he knew that the headband was a motion capture sensor
and so his data were not used. By withholding the fact that we are motion tracking, we wish to prevent participants from feeling self-conscious about their movements during their conversations.
Video and audio can be transmitted between the booths with a minimal delay — there is a one
video frame delay (33ms) at the projector while it builds a frame buffer to project. Audio between
booths is delayed so as to match the arrival of the video. In some of our experiments, we have
used delays of between 3 and 5 frames (99 ms to 165 ms) but have so far found no effects on movements within this range of delays. As delays become longer than 200 ms, conversational patterns
can change. Delays of over 500ms can cause breakdowns in conversational flow as individuals begin
talking at the same time, having difficulty in negotiating smooth speaker–listener turn taking.
Real Time Facial Avatars
The videoconference paradigm allows us to track head movements, but it also allows us to track
facial movements using Active Appearance Models (AAMs) (Cootes, Edwards, & Taylor, 2001;
Cootes, Wheeler, Walker, & Taylor, 2002) by digitizing a video stream and applying tracking
software developed by our colleagues at Carnegie Mellon (Matthews, Ishikawa, & Baker, 2004;
Matthews & Baker, 2004). From the tracking data, we can redisplay a computer generated avatar
face for each video frame (Theobald, Matthews, Wilkinson, Cohn, & Boker, 2007). The tracking
and redisplay take less time than a single video frame, so the whole process takes 33ms for the frame
digitizing and 33ms for the tracking and redisplay of the frame buffer. Thus, we can track a face
and from that data synthesize a video avatar within 66ms. We have been using an off–the–shelf
PCIe AJA Video Kona card for video digitizing and redisplay in a standard 3.0 GHz Mac Pro, which performs the tracking and synthesis.
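The timing budget described above can be summarized in a small simulation. The stage functions below are hypothetical placeholders (the real system uses the hardware digitizing card and the AAM fitter); the sketch only illustrates that tracking and synthesis must finish within one 33 ms frame period so that the avatar lags the camera by roughly two frames.

import time

FRAME_PERIOD = 1.0 / 30.0                    # one video frame, ~33 ms

def digitize_frame():                        # placeholder: frame buffer fills
    time.sleep(FRAME_PERIOD)
    return "frame"

def track_and_synthesize(frame):             # placeholder: AAM fit + avatar render
    time.sleep(0.005)                        # must finish well under one frame period
    return "avatar_frame"

def run_pipeline(n_frames=30):
    for _ in range(n_frames):
        start = time.monotonic()
        avatar = track_and_synthesize(digitize_frame())
        latency = time.monotonic() - start
        assert latency < 2 * FRAME_PERIOD + 0.01, "missed the ~66 ms budget"

if __name__ == "__main__":
    run_pipeline()
    print("simulated glass-to-glass latency stayed within ~66 ms")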
The capability to track facial movements and redisplay them brings with it the possibility of
creating perturbations to conversation in a covert manner and randomly assigning them within a
conversation. In order to do this, we first needed to find out whether the synthesized avatars were
accepted as being video by naive participants. Then we needed to develop a method for applying
perturbations of appearance and dynamics to the resynthesized avatar faces.
Facial Avatars Using Active Appearance Models (AAMs)
AAMs are generative, parametric models and consist of both shape and appearance components.
The shape component of the AAM is a triangulated mesh that moves like a face undergoing both
rigid motion (head pose variation) and non-rigid motion (expression) in response to changes in
the parameters. The shape components are identified by first hand–labeling the 68 vertices of the triangular mesh in 30 to 50 frames of video, a process that takes a trained
research assistant about 2 to 3 hours. The video frames are chosen so as to cover the range of facial
motion normally exhibited by the target individual’s expressions. Once these video frames are
hand–labeled, the remainder of the process is automatic. Thus, once a model is constructed, we can
continue to use the model to track an individual over multiple occasions and multiple experiments.
A principal components analysis is performed on these labeled video frames to extract between 8
and 12 shape components that can be thought of as axes of facial movement. These components
are independent and additive so that the estimated shape of a face mesh in a single frame of video
is expressed as the weighted sum of the retained components
s = s_0 + \sum_{i=1}^{m} p_i s_i, \qquad (1)
where s is the estimated shape, s_0 is the mean shape, the s_i are the component loadings, and the p_i are the shape parameters. Figure 3–a plots the first three of these components for one target individual.
The arrows in each wireframe face in Figure 3–a demonstrate how one unit of change in each of the first three principal components creates simultaneous change across multiple vertices of the triangular
mesh.
[Figure 3 about here.]
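As a concrete illustration of Equation (1), the following sketch builds a toy shape model with NumPy: a matrix of hand–labeled meshes (one row per labeled frame, with the 68 (x, y) vertices flattened to 136 numbers) is decomposed to obtain the mean shape s_0 and the shape components s_i, and a new mesh is synthesized as the mean plus a weighted sum of components. The array sizes and data are stand–ins, not our actual labeled frames.

import numpy as np

def build_shape_model(labeled_meshes, n_components=10):
    # PCA of the hand-labeled meshes: mean shape plus orthonormal shape components.
    s0 = labeled_meshes.mean(axis=0)
    _, _, vt = np.linalg.svd(labeled_meshes - s0, full_matrices=False)
    return s0, vt[:n_components]

def synthesize_shape(s0, components, p):
    # Equation (1): s = s0 + sum_i p_i * s_i
    return s0 + p @ components

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    meshes = rng.normal(size=(40, 136))           # stand-in for 40 labeled frames
    s0, comps = build_shape_model(meshes)
    p = np.zeros(10)
    p[0] = 1.0                                    # one unit along the first shape mode
    print(synthesize_shape(s0, comps, p).shape)   # (136,) = 68 mesh vertices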
The appearance component of the AAM is an image of the face, which itself can vary under the
control of the parameters. As the parameters are varied, the appearance changes to model effects
such as the appearance of furrows and wrinkles. Again, we use principal components analysis on the
same labeled video frames and retain a few (8 to 12) appearance components. Thus, the estimated
appearance A(x) is the mean appearance A_0(x) plus a weighted sum of appearance images A_i(x):
A(x) = A_0(x) + \sum_{i=1}^{l} \lambda_i A_i(x) \quad \forall\, x \in s_0, \qquad (2)

where the coefficients λ_i are the appearance parameters. Figure 3–b shows the mean appearance
and the first two appearance components for a target individual’s face. Combining the shape and
appearance models we can create a wide variety of natural–looking head poses and facial expressions.
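Equation (2) can be sketched in the same way. In the toy example below the appearance images are small arrays standing in for the face texture over the mean shape; in the real system the synthesized appearance is then warped from the mean shape to the current shape when the avatar is rendered. Sizes and values are assumptions for illustration only.

import numpy as np

def synthesize_appearance(A0, appearance_images, lam):
    # Equation (2): A(x) = A0(x) + sum_i lambda_i * A_i(x), for x in the mean shape.
    return A0 + np.tensordot(lam, appearance_images, axes=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A0 = rng.random((64, 64))                # mean appearance (toy resolution)
    A = rng.random((8, 64, 64))              # eight appearance components
    lam = rng.normal(scale=0.1, size=8)      # appearance parameters
    print(synthesize_appearance(A0, A, lam).shape)   # (64, 64) face texture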
Fitting an AAM is a difficult non-linear optimization problem. Matthews and Baker (2004)
recently proposed and demonstrated an AAM fitting algorithm that is more robust and faster than
previous algorithms, tracking faces at over 200 frames per second. Our project uses this algorithm
to fit a pre–built AAM model to each frame of video as it becomes available at the digitizing frame
buffer.
Figures 4–a and 4–b display video frames of two research assistants as captured by the video
camera during an experiment. Below each individual’s picture, Figures 4–c and 4–d present the
respective computer–generated facial avatars with the mean shape and appearance for the two research assistants. The required number of degrees of freedom for each generated avatar is surprisingly low (fewer than 25 DOF), yet the avatars’ appearance is very similar to the individuals’ actual appearance. Note that the mean shape and appearance are somewhat smoother than any particular video frame and that the mean shape and appearance do not correspond to a completely relaxed expression.
[Figure 4 about here.]
Figure 5 displays a still from the video feeds, facial tracking, and avatar from one conversation
during an experiment. Figure 5–a shows the video that was captured from the research assistant.
In Figure 5–b, her face is tracked by the 68 vertex triangular mesh. From these tracking data and
a previously constructed model, a video frame with a matching avatar is displayed in Figure 5–c.
The naive participant, shown in Figure 5–d, sees only the avatar image from Figure 5–c, while the
research assistant sees the full video of the naive participant as seen in Figure 5–d.
[Figure 5 about here.]
Note that there appears to be a high degree of symmetry exhibited by the faces in Figure 5. After the conversation session is over, an AAM is built for each naive participant and used to track his or her face. In this
way, we have been capturing measurements that will be used to construct and test specific models
for the dynamic ebb and flow of symmetry formation and symmetry breaking during conversation.
Swapping Appearance by Using Avatar Models
Once we were able to construct and display an avatar in real time, we began to work on methods for
introducing manipulations of the appearance and dynamics of the avatar. Changing the appearance
of the avatar is akin to putting a flexible mask on a person. That is to say, the avatar’s expressions
are driven by the captured motions of the person whose face is being motion tracked, but the avatar
model that is shown making these expressions is one that was generated from a different individual.
To accomplish this, we first constructed a set of short video clips that was representative of each
of our six research assistants. We then videotaped each assistant while she or he mimicked the facial
expressions of each of the other assistants as shown on the video clips. Then we used the captured
video from each individual research assistant to build a model that covered approximately the same
space of expressions. Finally, we simply substituted one person’s mean shape and appearance for
another person’s during the synthesis portion of the process. As an example of how this works, the
research assistant in Figure 6–a is being tracked and his expressions are mapped onto the other five
research assistants in Figures 6–b through –f.
[Figure 6 about here.]
Note that the expressions in Figure 6 are not exactly the same. One might think of this mapping as taking the difference between person (a)’s current expression and his mean expression and applying that difference to person (b)’s face, yielding the expression person (b) would show if he had departed from his own mean in the same way. This tends to create natural–appearing expressions, since the shapes themselves are not mapped, but only an expression that represents a similar point in expression space. By sampling all individuals mimicking the same
movements, we were able to have the axes of the spaces be relatively similar. This is an important
point, because principal component axes generated from the distribution of naturally occurring
movements from one individual may be substantially different from the axes generated from another
person who may have a very different distribution of characteristic movements. Methods for rotation
and scaling of axes between avatar models (Theobald, Matthews, Cohn, & Boker, 2007) may improve
expression mapping and reduce the need for mimic–based video sequences on which to build models.
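The mapping just described can be sketched as follows, assuming two avatar models whose component axes span comparable expression spaces (as arranged by the mimicry recordings). The source person's parameters encode how his current expression differs from his own mean; applying those same parameter values to the target model yields the expression the target would show at the same point in expression space. The models and meshes below are random stand–ins, and the projection step assumes orthonormal components.

import numpy as np

def estimate_parameters(shape, s0, components):
    # Project the deviation from the mean onto the (orthonormal) shape components.
    return components @ (shape - s0)

def map_expression(source_shape, source_model, target_model):
    s0_src, comps_src = source_model
    s0_tgt, comps_tgt = target_model
    p = estimate_parameters(source_shape, s0_src, comps_src)
    return s0_tgt + p @ comps_tgt             # target mean + the same parameter offsets

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    def toy_model():
        s0 = rng.normal(size=136)
        comps = np.linalg.qr(rng.normal(size=(136, 10)))[0].T   # orthonormal rows
        return s0, comps
    source, target = toy_model(), toy_model()
    tracked = source[0] + 0.5 * source[1][0]  # a deviation along the first mode
    print(map_expression(tracked, source, target).shape)        # (136,) target mesh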
We used this mean shape and appearance substitution method to produce displays that map appearance from one sex to another (Boker et al., in press). For instance, in Figure 7
the research assistant’s face was mapped to a male avatar. Each video frame is mapped, so the
dynamics of the movements produced by the female research assistant were reproduced in the male avatar’s expressions. In addition, the female research assistant’s voice was processed using a TC–
Helicon VoicePro vocal formant processor to change the fundamental frequency and formants to
approximate a male voice. In this experiment, the naive participant in Figure 7–d was informed
that she would have six different conversations. She actually talked to two research assistants, one
male and one female, but she thought she had spoken with six different individuals, three male and
three female. The research assistants were blind to whether they appeared as a male or a female in
any particular conversation.
[Figure 7 about here.]
In our avatar videoconference experiments, we have run over 100 naive participants and only
two of them have doubted our cover story that the faces they see are live video that has been “cut
out” so that they only see the face of the person they are talking to. One of those was the person
who also guessed that we were putting motion capture sensors on him. Unfortunately, as knowledge
of this technology becomes more widely disseminated, we will not be able to rely on participants
trusting that the face they see in a video conference in fact belongs to the person with whom they
are speaking.
Future Directions
Now that it is practical to precisely and non–invasively measure and control non–rigid facial move-
ments produced in natural conversation, we expect that there will be a surge of experiments that test
hypotheses about the coupled dynamics of interpersonal coordination. We expect that a mapping
will be developed between a semantic space of adjectives describing emotion and a low degree–
of–freedom avatar model of the human face. This mapping will allow the automatic tracking of
affective facial displays in a way that may revolutionize human–computer interactions.
We are also interested in perturbing the dynamics of expressions. In affective disorders such as depression, individuals display facial behavior during conversation that differs from the interpersonal coordination of expressions seen in normal conversation. Depressed individuals also report
feelings of being distant from others. By better understanding the way that these patterns of
affective display develop and persist, we may be able to devise better interventions that allow these
individuals to recover from depressive episodes more quickly and effectively.
Another area amenable to study using real–time avatars is stereotyping and bias. Since we can
convincingly change a person’s apparent sex, we expect that further work will allow us to randomly
assign variables such as race and age during natural conversation. Studying stereotyping using
this paradigm is particularly interesting since the research assistant whose characteristics are being
modified can be kept blind to the modification. It is not as if an assistant is asked to act a part.
The only way the assistant can know how he or she appears to the conversational partner is by
how the conversational partner treats the assistant. By counterbalancing so that the conversational
partner has more than one conversation with the same assistant in each appearance condition, we
can attribute effects observed during conversation to the randomly assigned appearance variable.
Applications for this technology in human–computer interaction are not difficult to envision.
For instance, a NASA–funded pilot project has been proposed to track teachers’ faces and map
them onto celestial objects so that, in distance–learning equipped classrooms, children can “talk to
Jupiter”. Transmitting avatar displays requires extremely low bandwidth, so these displays may
find application in cell phones and other videoconferencing applications (Brick, Spies, Theobald,
Matthews, & Boker, 2009). Computer–based tutoring systems may be able to use webcams to
track whether a learner is displaying confusion or frustration. Autonomous avatars may be able to
display expressions that are perceived as showing empathy by tracking viewers’ faces and displaying
an appropriate amount of interpersonal symmetry, thereby reducing the feeling that the autonomous faces are cold and mechanical. Appropriate responses to detected affect in human facial expressions may allow human–robot interactions to be less threatening and more fulfilling for humans.
Conclusions
We have presented an overview of our team’s work in developing and testing real–time facial avatars driven by motion capture from video. The avatar technology has enabled videoconference experiments that randomly assign appearance variables and examine how people coordinate their motions
and expressions in natural conversation.
After 24 to 30 minutes of the videoconference conversations, 98% of naive participants did not
doubt the cover story that we were “cutting out video to just show the face”. We find this to
be surprising since each video frame was constructed from approximately 25 floating point values
applied to a model. Contrast that with the fact that a real video frame contains over 300,000 pixels, each of which is represented by a 24–bit number.
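A back–of–the–envelope comparison makes the contrast explicit; the figures below assume 32–bit floats for the roughly 25 avatar parameters and a 640 × 480 frame for the “over 300,000 pixels,” neither of which is stated precisely above.

params_bits = 25 * 32                    # ~800 bits to describe an avatar frame
pixel_bits = 640 * 480 * 24              # 7,372,800 bits for a raw video frame
print(params_bits, pixel_bits, pixel_bits // params_bits)   # ratio of about 9,216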
We can think of three reasons why this illusion is so convincing. The first possible reason is that
when we produce facial expressions, we largely coordinate our muscles in correlated patterns, so
that the total number of degrees of freedom we express is relatively small — on the order of three
degrees of freedom for head pose and 7 to 12 degrees of freedom for facial expression.
The second possible reason is that there may be some limiting of perceived degrees of freedom
as we view a facial expression. Thus, our perceptual system may be mapping the facial expressions
onto a lower dimensional space than actually exists in the world, so when the number of degrees
of freedom in the display is reduced, we do not notice. Such a perceptual effect might also explain
why it is so easy to see a face in an arbitrary pattern with only marginal similarity to a face — the
so–called “face on Mars” or “face on the tortilla” effect.
A third possible reason is that in real–time conversation, a participant is expecting to interact
with a real person and is engaged in that interaction. Thus the dynamics of the symmetry formation
and symmetry breaking are appropriate and convince the participant that since the interaction is
real, the video image must be real. Contrast that situation with a judgment paradigm where the
participant may adopt a more critical attitude and is not dynamically engaged with the person on
the display. Thus the nature of the context and task may lead to greater or lesser credibility of the
avatar display.
We expect that real–time facial avatars will be in common, everyday use within ten years or less.
We expect facial avatar technology to be influential in teaching, in human–computer interaction,
and in the diagnosis and treatment of affective disorders. In the meantime, these methods provide
powerful tools for examining human interpersonal communication.
Author Note
The authors gratefully acknowledge the contributions of the many investigators and research
assistants who worked on this project: Zara Ambadar, Kathy Ashenfelter, Timothy Brick, Tamara
Buretz, Enoch Chow, Eric Covey, Pascal Deboeck, Katie Jackson, Hannah Kim, Jen Koltiska, Nancy
Liu, Michael Mangini, Iain Matthews, Sean McGowan, Ryan Mounaime, Sagar Navare, Andrew
Quilpa, Jeffrey Spies, Barry–John Theobald, Stacey Tiberio, Michael Villano, Chris Wagner, and
Meng Zhao. Funding for this work was provided in part by NSF Grant BCS–0527485. Any opinions,
findings, and conclusions or recommendations expressed in this material are those of the authors
and do not necessarily reflect the views of the National Science Foundation. Correspondence may
be addressed to Steven M. Boker, Department of Psychology, The University of Virginia, PO Box
400400, Charlottesville, VA 22904, USA; email sent to [email protected]; or browsers pointed to
http://people.virginia.edu/~smb3u.
References
Ashenfelter, K. T., Boker, S. M., Waddell, J. R., & Vitanov, N. (in press). Spatiotemporal
symmetry and multifractal structure of head movements during dyadic conversation. Journal
of Experimental Psychology: Human Perception and Performance.
Baaren, R. B. van, Holland, R. W., Steenaert, B., & Knippenberg, A. van. (2003). Mimicry for
money: Behavioral consequences of imitation. Journal of Experimental Social Psychology,
39 (4), 393–398.
Bernieri, F. J., Davis, J. M., Rosenthal, R., & Knee, C. R. (1994). Interactional synchrony and
rapport: Measuring synchrony in displays devoid of sound and facial affect. Personality and
Social Psychology Bulletin, 20 (3), 303–311.
Bernieri, F. J., & Rosenthal, R. (1991). Interpersonal coordination: Behavior matching and
interactional synchrony. In R. S. Feldman & B. Rime (Eds.), Fundamentals of nonverbal
behavior (pp. 401–431). Cambridge, UK: Cambridge University Press.
Berry, D. S. (1991). Child and adult sensitivity to gender information in patterns of facial motion.
Ecological Psychology, 3 (4), 349–366.
Boker, S. M., Cohn, J. F., Theobald, B.-J., Matthews, I., Mangini, M., Spies, J. R., et al. (in
press). Something in the way we move: Motion dynamics, not perceived sex, influence head
movements in conversation. Journal of Experimental Psychology: Human Perception and
Performance, ?? (??), ??
Boker, S. M., & Rotondo, J. L. (2002). Symmetry building and symmetry breaking in synchronized
movement. In M. Stamenov & V. Gallese (Eds.), Mirror neurons and the evolution of brain
and language (pp. 163–171). Amsterdam: John Benjamins.
Brick, T. R., Spies, J. R., Theobald, B., Matthews, I., & Boker, S. M. (2009). High–presence, low–
bandwidth, apparent 3–d video–conferencing with a single camera. In Proceedings of the 2009
International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS).
IEEE.
Cappella, J. N. (1981). Mutual influence in expressive behavior: Adult–adult and infant–adult
dyadic interaction. Psychological Bulletin, 89 (1), 101–132.
Cappella, J. N. (1996). Dynamic coordination of vocal and kinesic behavior in dyadic interaction: Methods, problems, and interpersonal outcomes. In J. H. Watt & C. A. VanLear (Eds.),
Methodology in social research (pp. 353–386). Thousand Oaks, CA: Sage.
Chartrand, T. L., Maddux, W. W., & Lakin, J. L. (2005). Beyond the perception–behavior link:
The ubiquitous utility and motivational moderators of nonconscious mimicry. In R. Hassin,
J. Uleman, & J. A. Bargh (Eds.), The new unconscious (pp. 334–361). New York: Oxford
University Press.
Cohn, J. F., Ambadar, Z., & Ekman, P. (2007). Observer–based measurement of facial expression
with the Facial Action Coding System. In J. A. Coan & J. J. B. Allen (Eds.), The handbook
of emotion elicitation and assessment (pp. 203–221). New York: Oxford University Press.
Condon, W. S. (1976). An analysis of behavioral organization. Sign Language Studies, 13, 285–318.
Condon, W. S., & Ogston, W. D. (1966). Sound film analysis of normal and pathological behavior
patterns. Journal of Nervous and Mental Disease, 143 (4), 338–347.
Cootes, T. F., Edwards, G., & Taylor, C. J. (2001). Active appearance models. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 23 (6), 681–685.
Cootes, T. F., Wheeler, G. V., Walker, K. N., & Taylor, C. J. (2002). View-based active appearance
models. Image and Vision Computing, 20 (9–10), 657–664.
Ekman, P. (1993). Facial expression and emotion. American Psychologist, 48, 384–392.
Fox, N. A., Gross, R., Cohn, J. F., & Reilly, R. B. (2007). Robust biometric person identification
using automatic classifier fusion of speech, mouth, and face experts. IEEE Transactions on
Multimedia, 9 (4), 701–714.
Giese, M. A., & Poggio, T. (2003). Neural mechanisms for the recognition of biological movements.
Nature Reviews Neuroscience, 4, 179–192.
Grahe, J. E., & Bernieri, F. J. (2006). The importance of nonverbal cues in judging rapport.
Journal of Nonverbal Behavior, 23 (4), 253–269.
Griffin, D., & Gonzalez, R. (2003). Models of dyadic social interaction. Philosophical Transactions
of the Royal Society of London, B, 358 (1431), 573–581.
Hill, H. C. H., & Johnston, A. (2001). Categorizing sex and identity from the biological motion of
faces. Current Biology, 11 (3), 880–885.
Hill, H. C. H., Troje, N. F., & Johnston, A. (2003). Range– and domain–specific exaggeration of
facial speech. Journal of Vision, 5, 793–807.
Knappmeyer, B., Thornton, I. M., & Bulthoff, H. H. (2003). The use of facial motion and facial
form during the processing of identity. Vision Research, 43 (18), 1921–1936.
Lafrance, M. (1985). Postural mirroring and intergroup relations. Personality and Social Psychology
Bulletin, 11 (2), 207–217.
Levesque, M. J., & Kenny, D. A. (1993). Accuracy of behavioral predictions at zero acquaintance:
A social relations model. Journal of Personality and Social Psychology, 65 (6), 1178–1187.
Matthews, I., & Baker, S. (2004). Active appearance models revisited. International Journal of
Computer Vision, 60 (2), 135–164.
Matthews, I., Ishikawa, T., & Baker, S. (2004). The template update problem. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 26, 810–815.
Morrison, E. R., Gralewski, L., Campbell, N., & Penton-Voak, I. S. (2007). Facial movement varies
by sex and is related to attractiveness. Evolution and Human Behavior, 28, 186–192.
Munhall, K. G., & Buchan, J. N. (2004). Something in the way she moves. Trends in Cognitive
Sciences, 8 (2), 51–53.
Redlich, N. A. (1993). Redundancy reduction as a strategy for unsupervised learning. Neural
Computation, 5, 289–304.
Rizzolatti, G., & Craighero, L. (2004). The mirror–neuron system. Annual Review of Neuroscience,
27, 169–192.
Rotondo, J. L., & Boker, S. M. (2002). Behavioral synchronization in human conversational
interaction. In M. Stamenov & V. Gallese (Eds.), Mirror neurons and the evolution of brain
and language (pp. 151–162). Amsterdam: John Benjamins.
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana: The
University of Illinois Press.
Steede, L. L., Tree, J. J., & Hole, G. J. (2007a). Dissociating mechanisms involved in accessing
identity by dynamic and static cues. Object Perception, Attention, and Memory (OPCAM)
2006 Conference Report, Visual Cognition, 15 (1), 116–123.
Steede, L. L., Tree, J. J., & Hole, G. J. (2007b). I can’t recognize your face but I can recognize its
movement. Cognitive Neuropsychology, 24 (4), 451–466.
Theobald, B., Matthews, I., Cohn, J. F., & Boker, S. (2007). Real–time expression cloning using
appearance models. In Proceedings of the 9th international conference on multimodal interfaces
(pp. 134–139). New York: Association for Computing Machinery.
Theobald, B., Matthews, I., Wilkinson, N., Cohn, J. F., & Boker, S. (2007). Animating faces using
appearance models. In Proceedings of the 2007 workshop on vision, video and graphics.
List of Figures

1 A conceptual model for adaptive feedback between two individuals engaged in conversation.
2 Layout of the videoconference booth and motion tracking system.
3 Active Appearance Models (AAMs) have both shape and appearance components.
4 Video frames and mean shape and appearance models for two research assistants.
5 One frame from conversation during a videoconference experiment.
6 Six avatars generated from the facial expression captured from person (a).
7 One frame from a conversation in which the appearance and voice of the research assistant was changed to appear to be male.
[Figure 1 diagram: each conversant (Conversant A and Conversant B) is represented by boxes for Cognition, Mirror System, Vision, Audition, and Motor Output.]
Figure 1: A conceptual model for adaptive feedback between two individuals engaged in conversation. A
mirror system tracks the movements and vocalizations of the interlocutor, but the output of the mirror
system is frequently suppressed. When symmetric action is called for, the mirror system is pre–primed
to produce symmetry by enabling its otherwise suppressed output.
[Figure 2 diagram: Booth 1 and Booth 2 (Stage 1 and Stage 2) separated by a sound isolation wall, each with a back projection screen, projector, camera, stool, lights, and gobos, within a shared magnetic field.]
Figure 2: Layout of the videoconference booth and motion tracking system. The oval magnetic field
penetrates the magnetically transparent sound isolation wall so that participants sit approximately 3m
apart in the same motion tracking field.
Figure 3: Active Appearance Models (AAMs) have both shape and appearance components. (a) The
first 3 shape modes. (b) The mean appearance (left) and first 2 appearance modes. (c) Three example
faces generated with the AAM in (a) and (b).
Figure 4: Video frames and mean shape and appearance models for two research assistants.
Figure 5: One frame from conversation during a videoconference experiment. (a) The research assistant
whose face was tracked sat in one booth. (b) The tracking mesh is automatically fit to the research
assistant’s face. (c) The synthesized avatar is displayed to the naive participant within 99ms of the light
captured by the camera in the research assistant’s booth. (d) The naive participant’s image is seen by
the research assistant.
Figure 6: Six avatars generated from the facial expression captured from person (a).
Figure 7: One frame from a conversation in which the appearance and voice of the research assistant
was changed to appear to be male. (a) The research assistant whose face was tracked sat in one booth.
(b) The tracking mesh is automatically fit to the research assistant’s face. (c) A synthesized avatar with
mean appearance taken from a male research assistant is displayed to the naive participant within 99ms
of the light captured by the camera in the research assistant’s booth. (d) The naive participant’s image
is seen by the research assistant.