Face Animation Overview with Shameless Bias Toward MPEG-4 Face Animation Tools

Dr. Eric Petajan
Chief Scientist and Founder
face2face animation, inc.
eric@f2f-inc.com
Computer-generated Face Animation Methods

– Morph targets/key frames (traditional)
– Speech articulation model (TTS)
– Facial Action Coding System (FACS)
– Physics-based (skin and muscle models)
– Marker-based (dots glued to face)
– Video-based (surface features)
Morph targets/key frames

Advantages
– Complete manual control of each frame
– Good for exaggerated expressions

Disadvantages
– Hard to achieve good lipsync without manual tweaking
– Morph targets must be downloaded to terminal for streaming animation (delay)
Speech articulation model

Advantages
– High-level control of face
– Enables TTS

Disadvantages
– Robotic character
– Hard to sync with real voice
Facial Action Coding System

Advantages
– Very high-level control of face
– Maps to morph targets
– Explicit specification of emotional states

Disadvantages
– Not good for speech
– Not quantified
Physics-based

Advantages
– Good for realistic skin, muscle and fat
– Collision detection

Disadvantages
– High complexity
– Must be driven by high-level articulation parameters (TTS)
– Hard to drive with motion capture data
Marker-based

Advantages
– Can provide accurate motion data from most of the face
– Face models can be animated directly from surface feature point motion

Disadvantages
– Dots glued to face
– Dots must be manually registered
– Not good for accurate inner lip contour or eyelid tracking
Video-based

Advantages
– Simple to capture video of face
– Face models can be animated directly from surface feature motion

Disadvantages
– Must have good view of face
What is MPEG-4 Multimedia?

– Natural audio and video objects
– 2D and 3D graphics (based on VRML)
– Animation (virtual humans)
– Synthetic speech and audio
Samples versus Objects

– Traditional video coding is sample based (blocks of pixels are compressed)
– MPEG-4 provides visual object representation for better compression and new functionalities
– Objects are rendered in the terminal after decoding object descriptors
Object-based Functionalities

– User can choose display of content layers
– Individual objects (text, models) can be searched or stored for later use
– Content is independent of display resolution
– Content can be easily repurposed by provider for different networks and users
MPEG-4 Object Composition

– Objects are organized in a scene graph
– Scene graphs are specified using a binary format called BIFS (based on VRML)
– Both 2D and 3D objects, properties and transforms are specified in BIFS
– BIFS allows objects to be transmitted once and instanced repeatedly in the scene after transformations
MPEG-4 Operation Sequence

(stages along the time axis)

1. Terminal setup
   – communications bit-rate
   – available RAM and disk
   – available MIPS
   – accelerators for graphics, etc.
   – display resolution
2. Initial download
   – geometry
   – textures
   – articulated face model
3. Incremental or streaming data download
   – newly exposed geometry and textures
   – audio data
4. Time-stamped control parameter stream
   – view position
   – object instancing/destruction
   – face animation parameters
MPEG-4 Decoder Functional Architecture

[Block diagram, reconstructed as a list: the MPEG-4 bitstream and user input enter the system layer, which feeds the decoders; decoded objects are cached, then composited and rendered to the display under level-of-detail control.]

System layer (bitstream)
– content addressing
– synchronization
– scalability
– error correction

Video/Image decoders
– H.263
– MPEG-2
– JPEG

Audio decoder
– low bit-rate speech
– 64 kbps AAC

Audio synthesizer/processor
– MIDI/Structured Audio
– 3D processor

2D/3D geometry decoder
– polygonal meshes
– segmentation masks

Cached data
– geometry
– textures
– articulated figures (faces)
– video clips
– audio clips
– FAP codebooks

Compositing and rendering (to display)
– user input
– user content mods
– user POV mods
– level-of-detail control
Faces are Special

– Humans are hard-wired to respond to faces
– The face is the primary communication interface
– Human faces can be automatically analyzed and parameterized for a wide variety of applications
MPEG-4 Face and Body Animation Coding

– Face animation is in MPEG-4 version 1
– Body animation is in MPEG-4 version 2
– Face animation parameters displace feature points from neutral position
– Body animation parameters are joint angles
– Face and body animation parameter sequences are compressed to low bitrates
Neutral Face Definition

– Head axes parallel to the world axes
– Gaze is in direction of Z axis
– Eyelids tangent to the iris
– Pupil diameter is one third of iris diameter
– Mouth is closed and the upper and lower teeth are touching
– Tongue is flat, horizontal with the tip of tongue touching the boundary between upper and lower teeth
Face Feature Points

[Figure: the MPEG-4 face feature point diagram, with numbered feature points (groups 2-11) marked on front and profile views of the head and on detail views of the mouth, tongue, teeth, nose, and right/left eyes. Feature points affected by FAPs are distinguished from other feature points.]
Face Animation Parameter Normalization

– Face Animation Parameters (FAPs) are normalized to facial dimensions
– Each FAP is measured as a fraction of neutral face mouth width, mouth-nose distance, eye separation, or iris diameter
– 3 head and 2 eyeball rotation FAPs are Euler angles
Neutral Face Dimensions for FAP Normalization

[Figure: neutral face annotated with the normalization distances MW0 (mouth width), MNS0 (mouth-nose separation), ENS0 (eye-nose separation), ES0 (eye separation), and IRISD0 (iris diameter).]
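The normalization described above can be sketched in code. Each FAP is expressed in a FAP unit (FAPU) derived from one of the neutral-face distances; the 1024 divisor and the angle unit below follow the usual MPEG-4 FAPU convention, but treat the exact constants as assumptions rather than a restatement of the standard.

```python
# Minimal sketch of FAP normalization via FAP units (FAPUs).
# The /1024 scaling and the 1e-5 rad angle unit are the conventional
# MPEG-4 values; verify against the spec before relying on them.

def fap_units(mw0, mns0, ens0, es0, irisd0):
    """Derive FAP units from neutral-face distances (model units)."""
    return {
        "MW":    mw0    / 1024.0,  # mouth width unit
        "MNS":   mns0   / 1024.0,  # mouth-nose separation unit
        "ENS":   ens0   / 1024.0,  # eye-nose separation unit
        "ES":    es0    / 1024.0,  # eye separation unit
        "IRISD": irisd0 / 1024.0,  # iris diameter unit
        "AU":    1e-5,             # angle unit (radians) for rotation FAPs
    }

def normalize_fap(displacement, unit_name, units):
    """Express a feature-point displacement as an integer FAP value."""
    return round(displacement / units[unit_name])
```

Because every FAP is a count of face-relative units, the same FAP stream animates a small cartoon head and a large photorealistic head consistently.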
FAP Groups

Group                                            Number of FAPs
1: visemes and expressions 2
2: jaw, chin, inner lowerlip, cornerlips, midlip 16
3: eyeballs, pupils, eyelids 12
4: eyebrow 8
5: cheeks 4
6: tongue 5
7: head rotation 3
8: outer lip positions 10
9: nose 4
10: ears 4
Lip FAPs

– Mouth is closed if the sum of upper and lower lip FAPs = 0
Face Model Independence

– FAPs are always normalized for model independence
– FAPs (and BAPs) can be used without MPEG-4 systems/BIFS
– Private face models can be accurately animated with FAPs
– Face models can be simple or complex depending on terminal resources
MPEG-4 BIFS Face Node

– Face node contains FAP node, Face scene graph, Face Definition Parameters (FDP), FIT, and FAT
– FIT (Face Interpolation Table) specifies interpolation of FAPs in terminal
– FAT (Face Animation Table) maps FAPs to face model deformation
– FDP information includes face feature point positions and texture map
Face Model Download

– 3D graphical models (e.g. faces) can be downloaded to the terminal with MPEG-4
– 3D model specification is based on VRML
– Face Animation Table (FAT) maps FAPs to face model vertex displacements
– Appearance and animation of downloaded face models is exactly predictable
FAP Compression

– FAPs are adaptively quantized to desired quality level
– Quantized FAPs are differentially coded
– Adaptive arithmetic coding further reduces bitrate
– Typical compressed FAP bitrate is less than 2 kilobits/second
FAP Predictive Coding

[Block diagram: FAP(t) minus the prediction (the previous reconstructed value, produced by the inverse quantizer Q⁻¹ and a frame delay) is quantized by Q; the quantized residual is entropy coded by the arithmetic coder to form the bitstream.]
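The DPCM loop in the diagram can be sketched as follows. Each FAP is predicted by the previous reconstructed value, and only the quantized prediction error would be passed to the arithmetic coder (omitted here). The fixed step size is illustrative; the real codec adapts quantization to the desired quality level.

```python
# Sketch of the predictive (DPCM) loop for one FAP track.
# step=4 is an illustrative quantizer step, not the codec's actual value.

def fap_dpcm_encode(faps, step=4):
    prev_recon = 0
    residuals = []                            # would feed the arithmetic coder
    for f in faps:
        q = round((f - prev_recon) / step)    # Q: quantize prediction error
        residuals.append(q)
        prev_recon += q * step                # Q^-1 + frame delay: track decoder state
    return residuals

def fap_dpcm_decode(residuals, step=4):
    prev_recon = 0
    out = []
    for q in residuals:
        prev_recon += q * step
        out.append(prev_recon)
    return out
```

Because the encoder feeds back the *reconstructed* value rather than the original, quantization error cannot accumulate: the decoded track stays within half a step of the input.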
Face Analysis System

– MPEG-4 does not specify analysis systems
– face2face face analysis system tracks nostrils for robust operation
– Inner lip contour estimated using adaptive color thresholding and lip modeling
– Eyelids, eyebrows and gaze direction
Nostril Tracking

Nostrils are detected only if:
– At least 75% of nostril window area is skin color as indicated by RGB skin-color table
– After RGB thresholding the nostril window, at least 15% of the area is subthreshold (nostril)
– Min/max constraints are met for nostril width, height, gap, center spacing, and orientation in the thresholded projection domain

[Figure: nostril window with horizontal and vertical nostril projections and the projection threshold; the width, height, gap, and spacing of the two nostril regions are labeled.]
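The detection tests above can be sketched on a small boolean mask. The 75%/15% thresholds come from the slide; the projection step and the geometric min/max limits below are illustrative assumptions, not the slide's actual constraint values.

```python
# Sketch of the nostril-window tests. `window` is a 2D list of booleans,
# True = skin-colored pixel after RGB thresholding. The width/gap limits
# are placeholder values; only the 75%/15% ratios come from the slide.

def dark_runs(projection, thresh):
    """Runs of columns whose subthreshold-pixel count exceeds thresh."""
    runs, start = [], None
    for i, v in enumerate(projection + [0]):      # sentinel closes a final run
        if v > thresh and start is None:
            start = i
        elif v <= thresh and start is not None:
            runs.append((start, i - 1))
            start = None
    return runs

def nostrils_detected(window, min_width=1, max_width=4, min_gap=1, max_gap=6):
    n = sum(len(row) for row in window)
    dark = sum(not px for row in window for px in row)
    if (n - dark) / n < 0.75:                     # >= 75% skin color
        return False
    if dark / n < 0.15:                           # >= 15% subthreshold (nostrils)
        return False
    # horizontal projection of subthreshold pixels onto the x axis
    cols = [sum(not row[c] for row in window) for c in range(len(window[0]))]
    runs = dark_runs(cols, 0)
    if len(runs) != 2:                            # expect exactly two nostrils
        return False
    widths_ok = all(min_width <= r - l + 1 <= max_width for l, r in runs)
    gap = runs[1][0] - runs[0][1] - 1             # columns between the nostrils
    return widths_ok and min_gap <= gap <= max_gap
```

The real system also checks height, center spacing, and orientation against the vertical projection; this sketch keeps only the horizontal half of those tests.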
Inner Lip Contour Estimation

1. Detect mouth closure
2. Train horizontal mouth threshold array while the mouth is closed
3. Apply the threshold array to the mouth region
4. Locate teeth by color and position
5. Form the inner lip contour around inner-mouth and teeth pixels
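Steps 2 and 3 of the pipeline above can be sketched as a per-column threshold array: while the mouth is closed, the darkest pixel in each column of the mouth window calibrates that column's threshold; afterwards, pixels darker than their column threshold are labeled inner-mouth. This is my reading of "train horizontal mouth threshold array", and the margin constant is an assumption.

```python
# Sketch of the adaptive per-column thresholding idea (steps 2-3).
# Windows are 2D lists of luma values; margin=5 is an illustrative guess.

def train_threshold_array(closed_mouth_window, margin=5):
    """Calibrate one threshold per column from a closed-mouth frame."""
    ncols = len(closed_mouth_window[0])
    return [min(row[c] for row in closed_mouth_window) - margin
            for c in range(ncols)]

def inner_mouth_mask(mouth_window, thresholds):
    """True where a pixel is darker than its trained column threshold."""
    return [[px < thresholds[c] for c, px in enumerate(row)]
            for row in mouth_window]
```

Training per column rather than globally lets the threshold follow lighting gradients across the mouth, which is presumably why the slide calls it a horizontal threshold *array*.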
FAP Estimation Algorithm

– Head scale is normalized based on neutral mouth (closed mouth) width
– Head pitch is approximated based on vertical nostril deviation from neutral head position
– Head roll is computed from smoothed eye or nostril orientation depending on availability
– Inner lip FAPs are measured directly from the inner lip contour as deviations from the neutral lip position (closed mouth)
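The head-roll step reduces to the in-plane angle of the line through the two eye centers (or the two nostrils, when the eyes are unavailable), relative to the horizontal neutral orientation. A minimal sketch, assuming image coordinates:

```python
# Sketch of head-roll estimation from a pair of tracked feature points.
import math

def head_roll(left_pt, right_pt):
    """In-plane rotation (radians) of the segment left_pt -> right_pt.
    Points are (x, y); 0 means a level head."""
    dx = right_pt[0] - left_pt[0]
    dy = right_pt[1] - left_pt[1]
    return math.atan2(dy, dx)
```

In practice the point pair is smoothed over several frames first, as the slide notes, so single-frame tracking jitter does not show up as head wobble.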
FAP Sequence Smoothing

[Figure: two plots of the lower_t_midlip and raise_b_midlip FAP values (range roughly -500 to 200) against time in 1/30-second frames, illustrating FAP sequence smoothing.]
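MPEG-4 does not mandate any particular smoothing, so as one plausible sketch of what the figure shows: a sliding median over a FAP track such as lower_t_midlip removes single-frame tracking spikes while keeping genuine lip-closure transitions sharp. The window length is an assumption.

```python
# Sketch of FAP track smoothing with a sliding median filter.
# window=5 (five 1/30 s frames) is an illustrative choice.

def smooth_fap_track(values, window=5):
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        neighborhood = sorted(values[lo:hi])
        out.append(neighborhood[len(neighborhood) // 2])
    return out
```

A median is preferable to a moving average here because an averaged spike would smear a closed mouth open for several frames, which is exactly the artifact lip-sync cannot tolerate.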
MPEG-4 Visemes and Expressions

– A weighted combination of 2 visemes and 2 facial expressions for each frame
– Decoder is free to interpret effect of visemes and expressions after FAPs are applied
– Definitions of visemes and expressions using FAPs can also be downloaded
Visemes

viseme_select   phonemes        example
0 none na
1 p, b, m put, bed, mill
2 f, v far, voice
3 T,D think, that
4 t, d tip, doll
5 k, g call, gas
6 tS, dZ, S chair, join, she
7 s, z sir, zeal
8 n, l lot, not
9 r red
10 A: car
11 e bed
12 I tip
13 Q top
14 U book
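The viseme table above can be written directly as a lookup, keyed by the slide's SAMPA-style phoneme symbols; unknown phonemes fall back to viseme 0 (none):

```python
# The slide's phoneme-to-viseme table as a Python lookup.

VISEME_OF_PHONEME = {
    "p": 1, "b": 1, "m": 1,       # put, bed, mill
    "f": 2, "v": 2,               # far, voice
    "T": 3, "D": 3,               # think, that
    "t": 4, "d": 4,               # tip, doll
    "k": 5, "g": 5,               # call, gas
    "tS": 6, "dZ": 6, "S": 6,     # chair, join, she
    "s": 7, "z": 7,               # sir, zeal
    "n": 8, "l": 8,               # lot, not
    "r": 9,                       # red
    "A:": 10,                     # car
    "e": 11,                      # bed
    "I": 12,                      # tip
    "Q": 13,                      # top
    "U": 14,                      # book
}

def viseme_select(phoneme):
    return VISEME_OF_PHONEME.get(phoneme, 0)
```

A TTS front end would run its phoneme string through this mapping, then blend the two active visemes per frame with the weights described on the previous slide.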
Facial Expressions

expression_select  expression name  textual description
0  na        na
1  joy       The eyebrows are relaxed. The mouth is open and the mouth corners pulled back toward the ears.
2  sadness   The inner eyebrows are bent upward. The eyes are slightly closed. The mouth is relaxed.
3  anger     The inner eyebrows are pulled downward and together. The eyes are wide open. The lips are pressed against each other or opened to expose the teeth.
4  fear      The eyebrows are raised and pulled together. The inner eyebrows are bent upward. The eyes are tense and alert.
5  disgust   The eyebrows and eyelids are relaxed. The upper lip is raised and curled, often asymmetrically.
6  surprise  The eyebrows are raised. The upper eyelids are wide open, the lower relaxed. The jaw is opened.
Free Face Model Software

– Wireface is an OpenGL-based, MPEG-4 compliant face model
– Good starting point for building high quality face models for web applications
– Reads FAP file and raw audio file
– Renders face and audio in real time
– Wireface source is freely available
Body Animation

– Harmonized with VRML H-Anim spec
– Body Animation Parameters (BAPs) are humanoid skeleton joint Euler angles
– Body Animation Table (BAT) can be downloaded to map BAPs to skin deformation
– BAPs can be highly compressed for streaming
Body Animation Parameters (BAPs)

– 186 humanoid skeleton Euler angles
– 110 free parameters for use with downloaded body surface mesh
– Coded using same codecs as FAPs
– Typical bitrate for coded BAPs is 5-10 kbps
Body Definition Parameters (BDPs)

– Humanoid joint center positions
– Names and hierarchy harmonized with VRML/Web3D H-Anim working group
– Default positions in standard for broadcast applications
– Download just BDPs to accurately animate unknown body model
Faces Enhance the User Experience

– Virtual call center agents
– News readers (e.g. Ananova)
– Story tellers for the child in all of us
– eLearning
– Program guide
– Multilingual (same face, different voice)
– Entertainment animation
– Multiplayer games
Visual Content for the Practical Internet

– Broadband deployment is happening slowly
– DSL availability is limited and cable is shared
– Talking heads need high frame-rate
– Consumer graphics hardware is cheap and powerful
– MPEG-4 SNHC/FBA tools are matched to available bandwidth and terminals
Visual Speech Processing

– FAPs can be used to improve speech recognition accuracy
– Text-to-speech systems can use FAPs to animate face models
– FAPs can be used in computer-human dialogue systems to communicate emotions, intentions and speech, especially in noisy environments
Video-driven Face Animation

– Facial expressions, lip movements and head motion transferred to face model
– FAPs extracted from talking head video with special computer vision system
– No face markers or lipstick required
– Normal lighting is used
– Communicates lip movements and facial expressions with visual anonymity
Automatic Face Animation Demonstration

– FAPs extracted from camcorder video
– FAPs compressed to less than 2 kbits/sec
– 30 frames/sec animation generated automatically
– Face models animated with bones rig or fixed deformable mesh (real-time)
What is easy, solved, or almost solved

– Can we do photorealistic non-animated face models? YES
– Can we do near-real-time lip syncing that is indistinguishable from a human? NO
What is really hard

– Synthesizing human speech and facial expressions
– Hair
What we have assumed someone else is solving

– Graphics acceleration
– Video camera cost and resolution
– Multimedia communication infrastructure
Where we need help

– We have a face with 68 parameters, but we need the psychologists to tell us how to drive it autonomously
– We need to embody our agents into graphical models that have a couple of thousand parameters to control gaze, gesture, and body language, and do collision detection -> NEED MORE SPEED
Core functionality of the face

Speech
– Lips, teeth, tongue
Emotional expressions
– Gaze, eyebrows, eyelids, head pose
Non-verbal communication
Sensory responsivity
Technical requirements
– Framerate
– Synchronization
– Latency
– Bitrate
– Spatial resolution
– Complexity
Common framework with body
Interaction
Different faces should respond similarly to common commands
Accessible to everyone
Interaction with other components

Language and discourse
– Phoneme-to-viseme mapping
– Given/new
Action in the environment
Global information
– Emotional state
– Personality
– Culture
– World knowledge
– Central time-base and timestamps
Open questions

– Central vs peripheral functionality
– Degree of interface commonality
– Degree of agent autonomy
– What should the VH be capable of?