Face Animation Overview with Shameless Bias Toward MPEG-4 Face Animation Tools

Dr. Eric Petajan
Chief Scientist and Founder
face2face animation, inc.
eric@f2f-inc.com
Computer-generated Face Animation Methods

– Morph targets/key frames (traditional)
– Speech articulation model (TTS)
– Facial Action Coding System (FACS)
– Physics-based (skin and muscle models)
– Marker-based (dots glued to face)
– Video-based (surface features)
Morph targets/key frames

Advantages
– Complete manual control of each frame
– Good for exaggerated expressions

Disadvantages
– Hard to achieve good lipsync without manual tweaking
– Morph targets must be downloaded to terminal for streaming animation (delay)
Speech articulation model

Advantages
– High-level control of face
– Enables TTS

Disadvantages
– Robotic character
– Hard to sync with real voice
Facial Action Coding System

Advantages
– Very high-level control of face
– Maps to morph targets
– Explicit specification of emotional states

Disadvantages
– Not good for speech
– Not quantified
Physics-based

Advantages
– Good for realistic skin, muscle and fat
– Collision detection

Disadvantages
– High complexity
– Must be driven by high-level articulation parameters (TTS)
– Hard to drive with motion capture data
Marker-based

Advantages
– Can provide accurate motion data from most of the face
– Face models can be animated directly from surface feature point motion

Disadvantages
– Dots glued to face
– Dots must be manually registered
– Not good for accurate inner lip contour or eyelid tracking
Video-based

Advantages
– Simple to capture video of face
– Face models can be animated directly from surface feature motion

Disadvantages
– Must have good view of face
What is MPEG-4 Multimedia?

– Natural audio and video objects
– 2D and 3D graphics (based on VRML)
– Animation (virtual humans)
– Synthetic speech and audio
Samples versus Objects

– Traditional video coding is sample based (blocks of pixels are compressed)
– MPEG-4 provides visual object representation for better compression and new functionalities
– Objects are rendered in the terminal after decoding object descriptors
Object-based Functionalities

– User can choose display of content layers
– Individual objects (text, models) can be searched or stored for later use
– Content is independent of display resolution
– Content can be easily repurposed by provider for different networks and users
MPEG-4 Object Composition

– Objects are organized in a scene graph
– Scene graphs are specified using a binary format called BIFS (based on VRML)
– Both 2D and 3D objects, properties and transforms are specified in BIFS
– BIFS allows objects to be transmitted once and instanced repeatedly in the scene after transformations
MPEG-4 Operation Sequence

(stages along the time axis)

1. Terminal setup
   – communications bit-rate
   – available RAM and disk
   – available MIPS
   – accelerators for graphics, etc.
   – display resolution
2. Initial download
   – geometry
   – textures
   – articulated face model
3. Incremental or streaming data download
   – newly exposed geometry and textures
   – audio data
4. Time-stamped control parameter stream
   – view position
   – object instancing/destruction
   – face animation parameters
MPEG-4 Decoder Functional Architecture

[Block diagram, reconstructed as a list: the MPEG-4 bitstream and user input enter the system layer, which feeds the decoders; decoded objects are cached, then composited and rendered to the display under level-of-detail control.]

System layer (bitstream)
– content addressing
– synchronization
– scalability
– error correction

Video/Image decoders
– H.263
– MPEG-2
– JPEG

Audio decoder
– low bit-rate speech
– 64 kbps AAC

Audio synthesizer/processor
– MIDI/Structured Audio
– 3D processor

2D/3D geometry decoder
– polygonal meshes
– segmentation masks

Cached data
– geometry
– textures
– articulated figures (faces)
– video clips
– audio clips
– FAP codebooks

Compositing and rendering (to display)
– user input
– user content mods
– user POV mods
– level-of-detail control
Faces are Special

– Humans are hard-wired to respond to faces
– The face is the primary communication interface
– Human faces can be automatically analyzed and parameterized for a wide variety of applications
MPEG-4 Face and Body Animation Coding

– Face animation is in MPEG-4 version 1
– Body animation is in MPEG-4 version 2
– Face animation parameters displace feature points from neutral position
– Body animation parameters are joint angles
– Face and body animation parameter sequences are compressed to low bitrates
Neutral Face Definition

– Head axes parallel to the world axes
– Gaze is in direction of Z axis
– Eyelids tangent to the iris
– Pupil diameter is one third of iris diameter
– Mouth is closed and the upper and lower teeth are touching
– Tongue is flat, horizontal with the tip of tongue touching the boundary between upper and lower teeth
Face Feature Points

[Figure: the MPEG-4 face feature point diagram, with numbered feature points (groups 2-11) marked on front and profile views of the head and on detail views of the mouth, tongue, teeth, nose, and right/left eyes. Feature points affected by FAPs are distinguished from other feature points.]
Face Animation Parameter Normalization

– Face Animation Parameters (FAPs) are normalized to facial dimensions
– Each FAP is measured as a fraction of neutral face mouth width, mouth-nose distance, eye separation, or iris diameter
– 3 head and 2 eyeball rotation FAPs are Euler angles
Neutral Face Dimensions for FAP Normalization

[Figure: neutral face annotated with the normalization distances MW0 (mouth width), MNS0 (mouth-nose separation), ENS0 (eye-nose separation), ES0 (eye separation), and IRISD0 (iris diameter).]
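The normalization described above can be sketched in code. Each FAP is expressed in a FAP unit (FAPU) derived from one of the neutral-face distances; the 1024 divisor and the angle unit below follow the usual MPEG-4 FAPU convention, but treat the exact constants as assumptions rather than a restatement of the standard.

```python
# Minimal sketch of FAP normalization via FAP units (FAPUs).
# The /1024 scaling and the 1e-5 rad angle unit are the conventional
# MPEG-4 values; verify against the spec before relying on them.

def fap_units(mw0, mns0, ens0, es0, irisd0):
    """Derive FAP units from neutral-face distances (model units)."""
    return {
        "MW":    mw0    / 1024.0,  # mouth width unit
        "MNS":   mns0   / 1024.0,  # mouth-nose separation unit
        "ENS":   ens0   / 1024.0,  # eye-nose separation unit
        "ES":    es0    / 1024.0,  # eye separation unit
        "IRISD": irisd0 / 1024.0,  # iris diameter unit
        "AU":    1e-5,             # angle unit (radians) for rotation FAPs
    }

def normalize_fap(displacement, unit_name, units):
    """Express a feature-point displacement as an integer FAP value."""
    return round(displacement / units[unit_name])
```

Because every FAP is a count of face-relative units, the same FAP stream animates a small cartoon head and a large photorealistic head consistently.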
FAP Groups

Group                                            Number of FAPs
1: visemes and expressions 2
2: jaw, chin, inner lowerlip, cornerlips, midlip 16
3: eyeballs, pupils, eyelids 12
4: eyebrow 8
5: cheeks 4
6: tongue 5
7: head rotation 3
8: outer lip positions 10
9: nose 4
10: ears 4
Lip FAPs

– Mouth is closed if the sum of upper and lower lip FAPs = 0
Face Model Independence

– FAPs are always normalized for model independence
– FAPs (and BAPs) can be used without MPEG-4 systems/BIFS
– Private face models can be accurately animated with FAPs
– Face models can be simple or complex depending on terminal resources
MPEG-4 BIFS Face Node

– Face node contains FAP node, Face scene graph, Face Definition Parameters (FDP), FIT, and FAT
– FIT (Face Interpolation Table) specifies interpolation of FAPs in terminal
– FAT (Face Animation Table) maps FAPs to face model deformation
– FDP information includes face feature point positions and texture map
Face Model Download

– 3D graphical models (e.g. faces) can be downloaded to the terminal with MPEG-4
– 3D model specification is based on VRML
– Face Animation Table (FAT) maps FAPs to face model vertex displacements
– Appearance and animation of downloaded face models is exactly predictable
FAP Compression

– FAPs are adaptively quantized to desired quality level
– Quantized FAPs are differentially coded
– Adaptive arithmetic coding further reduces bitrate
– Typical compressed FAP bitrate is less than 2 kilobits/second
FAP Predictive Coding

[Block diagram: FAP(t) minus the prediction (the previous reconstructed value, produced by the inverse quantizer Q⁻¹ and a frame delay) is quantized by Q; the quantized residual is entropy coded by the arithmetic coder to form the bitstream.]
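The DPCM loop in the diagram can be sketched as follows. Each FAP is predicted by the previous reconstructed value, and only the quantized prediction error would be passed to the arithmetic coder (omitted here). The fixed step size is illustrative; the real codec adapts quantization to the desired quality level.

```python
# Sketch of the predictive (DPCM) loop for one FAP track.
# step=4 is an illustrative quantizer step, not the codec's actual value.

def fap_dpcm_encode(faps, step=4):
    prev_recon = 0
    residuals = []                            # would feed the arithmetic coder
    for f in faps:
        q = round((f - prev_recon) / step)    # Q: quantize prediction error
        residuals.append(q)
        prev_recon += q * step                # Q^-1 + frame delay: track decoder state
    return residuals

def fap_dpcm_decode(residuals, step=4):
    prev_recon = 0
    out = []
    for q in residuals:
        prev_recon += q * step
        out.append(prev_recon)
    return out
```

Because the encoder feeds back the *reconstructed* value rather than the original, quantization error cannot accumulate: the decoded track stays within half a step of the input.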
Face Analysis System

– MPEG-4 does not specify analysis systems
– face2face face analysis system tracks nostrils for robust operation
– Inner lip contour estimated using adaptive color thresholding and lip modeling
– Eyelids, eyebrows and gaze direction
Nostril Tracking

Nostrils are detected only if:
– At least 75% of nostril window area is skin color as indicated by RGB skin-color table
– After RGB thresholding the nostril window, at least 15% of the area is subthreshold (nostril)
– Min/max constraints are met for nostril width, height, gap, center spacing, and orientation in the thresholded projection domain

[Figure: nostril window with horizontal and vertical nostril projections and the projection threshold; the width, height, gap, and spacing of the two nostril regions are labeled.]
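The detection tests above can be sketched on a small boolean mask. The 75%/15% thresholds come from the slide; the projection step and the geometric min/max limits below are illustrative assumptions, not the slide's actual constraint values.

```python
# Sketch of the nostril-window tests. `window` is a 2D list of booleans,
# True = skin-colored pixel after RGB thresholding. The width/gap limits
# are placeholder values; only the 75%/15% ratios come from the slide.

def dark_runs(projection, thresh):
    """Runs of columns whose subthreshold-pixel count exceeds thresh."""
    runs, start = [], None
    for i, v in enumerate(projection + [0]):      # sentinel closes a final run
        if v > thresh and start is None:
            start = i
        elif v <= thresh and start is not None:
            runs.append((start, i - 1))
            start = None
    return runs

def nostrils_detected(window, min_width=1, max_width=4, min_gap=1, max_gap=6):
    n = sum(len(row) for row in window)
    dark = sum(not px for row in window for px in row)
    if (n - dark) / n < 0.75:                     # >= 75% skin color
        return False
    if dark / n < 0.15:                           # >= 15% subthreshold (nostrils)
        return False
    # horizontal projection of subthreshold pixels onto the x axis
    cols = [sum(not row[c] for row in window) for c in range(len(window[0]))]
    runs = dark_runs(cols, 0)
    if len(runs) != 2:                            # expect exactly two nostrils
        return False
    widths_ok = all(min_width <= r - l + 1 <= max_width for l, r in runs)
    gap = runs[1][0] - runs[0][1] - 1             # columns between the nostrils
    return widths_ok and min_gap <= gap <= max_gap
```

The real system also checks height, center spacing, and orientation against the vertical projection; this sketch keeps only the horizontal half of those tests.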
Inner Lip Contour Estimation

1. Detect mouth closure
2. Train horizontal mouth threshold array while the mouth is closed
3. Apply the threshold array to the mouth region
4. Locate teeth by color and position
5. Form the inner lip contour around inner-mouth and teeth pixels
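Steps 2 and 3 of the pipeline above can be sketched as a per-column threshold array: while the mouth is closed, the darkest pixel in each column of the mouth window calibrates that column's threshold; afterwards, pixels darker than their column threshold are labeled inner-mouth. This is my reading of "train horizontal mouth threshold array", and the margin constant is an assumption.

```python
# Sketch of the adaptive per-column thresholding idea (steps 2-3).
# Windows are 2D lists of luma values; margin=5 is an illustrative guess.

def train_threshold_array(closed_mouth_window, margin=5):
    """Calibrate one threshold per column from a closed-mouth frame."""
    ncols = len(closed_mouth_window[0])
    return [min(row[c] for row in closed_mouth_window) - margin
            for c in range(ncols)]

def inner_mouth_mask(mouth_window, thresholds):
    """True where a pixel is darker than its trained column threshold."""
    return [[px < thresholds[c] for c, px in enumerate(row)]
            for row in mouth_window]
```

Training per column rather than globally lets the threshold follow lighting gradients across the mouth, which is presumably why the slide calls it a horizontal threshold *array*.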
FAP Estimation Algorithm

– Head scale is normalized based on neutral mouth (closed mouth) width
– Head pitch is approximated based on vertical nostril deviation from neutral head position
– Head roll is computed from smoothed eye or nostril orientation depending on availability
– Inner lip FAPs are measured directly from the inner lip contour as deviations from the neutral lip position (closed mouth)
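The head-roll step reduces to the in-plane angle of the line through the two eye centers (or the two nostrils, when the eyes are unavailable), relative to the horizontal neutral orientation. A minimal sketch, assuming image coordinates:

```python
# Sketch of head-roll estimation from a pair of tracked feature points.
import math

def head_roll(left_pt, right_pt):
    """In-plane rotation (radians) of the segment left_pt -> right_pt.
    Points are (x, y); 0 means a level head."""
    dx = right_pt[0] - left_pt[0]
    dy = right_pt[1] - left_pt[1]
    return math.atan2(dy, dx)
```

In practice the point pair is smoothed over several frames first, as the slide notes, so single-frame tracking jitter does not show up as head wobble.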
FAP Sequence Smoothing

[Figure: two plots of the lower_t_midlip and raise_b_midlip FAP values (range roughly -500 to 200) against time in 1/30-second frames, illustrating FAP sequence smoothing.]
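MPEG-4 does not mandate any particular smoothing, so as one plausible sketch of what the figure shows: a sliding median over a FAP track such as lower_t_midlip removes single-frame tracking spikes while keeping genuine lip-closure transitions sharp. The window length is an assumption.

```python
# Sketch of FAP track smoothing with a sliding median filter.
# window=5 (five 1/30 s frames) is an illustrative choice.

def smooth_fap_track(values, window=5):
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        neighborhood = sorted(values[lo:hi])
        out.append(neighborhood[len(neighborhood) // 2])
    return out
```

A median is preferable to a moving average here because an averaged spike would smear a closed mouth open for several frames, which is exactly the artifact lip-sync cannot tolerate.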
MPEG-4 Visemes and Expressions

– A weighted combination of 2 visemes and 2 facial expressions for each frame
– Decoder is free to interpret effect of visemes and expressions after FAPs are applied
– Definitions of visemes and expressions using FAPs can also be downloaded
Visemes

viseme_select   phonemes        example
0 none na
1 p, b, m put, bed, mill
2 f, v far, voice
3 T,D think, that
4 t, d tip, doll
5 k, g call, gas
6 tS, dZ, S chair, join, she
7 s, z sir, zeal
8 n, l lot, not
9 r red
10 A: car
11 e bed
12 I tip
13 Q top
14 U book
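The viseme table above can be written directly as a lookup, keyed by the slide's SAMPA-style phoneme symbols; unknown phonemes fall back to viseme 0 (none):

```python
# The slide's phoneme-to-viseme table as a Python lookup.

VISEME_OF_PHONEME = {
    "p": 1, "b": 1, "m": 1,       # put, bed, mill
    "f": 2, "v": 2,               # far, voice
    "T": 3, "D": 3,               # think, that
    "t": 4, "d": 4,               # tip, doll
    "k": 5, "g": 5,               # call, gas
    "tS": 6, "dZ": 6, "S": 6,     # chair, join, she
    "s": 7, "z": 7,               # sir, zeal
    "n": 8, "l": 8,               # lot, not
    "r": 9,                       # red
    "A:": 10,                     # car
    "e": 11,                      # bed
    "I": 12,                      # tip
    "Q": 13,                      # top
    "U": 14,                      # book
}

def viseme_select(phoneme):
    return VISEME_OF_PHONEME.get(phoneme, 0)
```

A TTS front end would run its phoneme string through this mapping, then blend the two active visemes per frame with the weights described on the previous slide.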
Facial Expressions

expression_select  expression name  textual description
0  na        na
1  joy       The eyebrows are relaxed. The mouth is open and the mouth corners pulled back toward the ears.
2  sadness   The inner eyebrows are bent upward. The eyes are slightly closed. The mouth is relaxed.
3  anger     The inner eyebrows are pulled downward and together. The eyes are wide open. The lips are pressed against each other or opened to expose the teeth.
4  fear      The eyebrows are raised and pulled together. The inner eyebrows are bent upward. The eyes are tense and alert.
5  disgust   The eyebrows and eyelids are relaxed. The upper lip is raised and curled, often asymmetrically.
6  surprise  The eyebrows are raised. The upper eyelids are wide open, the lower relaxed. The jaw is opened.
Free Face Model Software

– Wireface is an OpenGL-based, MPEG-4 compliant face model
– Good starting point for building high quality face models for web applications
– Reads FAP file and raw audio file
– Renders face and audio in real time
– Wireface source is freely available
Body Animation

– Harmonized with VRML H-Anim spec
– Body Animation Parameters (BAPs) are humanoid skeleton joint Euler angles
– Body Animation Table (BAT) can be downloaded to map BAPs to skin deformation
– BAPs can be highly compressed for streaming
Body Animation Parameters (BAPs)

– 186 humanoid skeleton Euler angles
– 110 free parameters for use with downloaded body surface mesh
– Coded using same codecs as FAPs
– Typical bitrate for coded BAPs is 5-10 kbps
Body Definition Parameters (BDPs)

– Humanoid joint center positions
– Names and hierarchy harmonized with VRML/Web3D H-Anim working group
– Default positions in standard for broadcast applications
– Download just BDPs to accurately animate unknown body model
Faces Enhance the User Experience

– Virtual call center agents
– News readers (e.g. Ananova)
– Story tellers for the child in all of us
– eLearning
– Program guide
– Multilingual (same face, different voice)
– Entertainment animation
– Multiplayer games
Visual Content for the Practical Internet

– Broadband deployment is happening slowly
– DSL availability is limited and cable is shared
– Talking heads need high frame-rate
– Consumer graphics hardware is cheap and powerful
– MPEG-4 SNHC/FBA tools are matched to available bandwidth and terminals
Visual Speech Processing

– FAPs can be used to improve speech recognition accuracy
– Text-to-speech systems can use FAPs to animate face models
– FAPs can be used in computer-human dialogue systems to communicate emotions, intentions and speech, especially in noisy environments
Video-driven Face Animation

– Facial expressions, lip movements and head motion transferred to face model
– FAPs extracted from talking head video with special computer vision system
– No face markers or lipstick required
– Normal lighting is used
– Communicates lip movements and facial expressions with visual anonymity
Automatic Face Animation Demonstration

– FAPs extracted from camcorder video
– FAPs compressed to less than 2 kbits/sec
– 30 frames/sec animation generated automatically
– Face models animated with bones rig or fixed deformable mesh (real-time)
What is easy, solved, or almost solved

– Can we do photorealistic non-animated face models? YES
– Can we do near-real-time lip syncing that is indistinguishable from a human? NO
What is really hard

– Synthesizing human speech and facial expressions
– Hair
What we have assumed someone else is solving

– Graphics acceleration
– Video camera cost and resolution
– Multimedia communication infrastructure
Where we need help

– We have a face with 68 parameters, but we need the psychologists to tell us how to drive it autonomously
– We need to embody our agents into graphical models that have a couple of thousand parameters to control gaze, gesture, and body language, and do collision detection -> NEED MORE SPEED
Core functionality of the face

Speech
– Lips, teeth, tongue
Emotional expressions
– Gaze, eyebrows, eyelids, head pose
Non-verbal communication
Sensory responsivity
Technical requirements
– Framerate
– Synchronization
– Latency
– Bitrate
– Spatial resolution
– Complexity
Common framework with body
Interaction
Different faces should respond similarly to common commands
Accessible to everyone
Interaction with other components

Language and discourse
– Phoneme-to-viseme mapping
– Given/new
Action in the environment
Global information
– Emotional state
– Personality
– Culture
– World knowledge
– Central time-base and timestamps
Open questions

– Central vs peripheral functionality
– Degree of interface commonality
– Degree of agent autonomy
– What should the VH be capable of?