
Facial Expression Recognition and Generation

Deepali Aneja, Ph.D. student

Computer Science and Engineering, University of Washington

Motivation

• Accurate facial expression depiction is critical for storytelling.
• And difficult!

[Bar charts: Mechanical Turk recognition rates (0-63%) over the labels Joy, Sadness, Anger, Surprise, Fear, Disgust, and Neutral for three animator-created "surprised" poses]

We asked three professional animators to make the character appear as surprised as possible. None of the expressions achieved above 50% recognition in Mechanical Turk testing.

Use human anatomy (FACS) to generate expressions

MPEG-4 (Anger), HapFACS (Anger), HapFACS (Fear), FACSGen (Fear)

Adobe Character Animator (Geometry + Audio input)

Problem Statement

Given that simple geometric mappings are not sufficient:

• How can we transfer human expressions to stylized characters without losing perceptual information?
• How can we use human expressions to quickly and automatically create expressions for a wide range of characters?

Generate characters from human expressions

Our Approach

• Use deep learning to learn mappings between:
  • human expressions and human expressions
  • character expressions and character expressions
  • human expressions and character expressions

• Seven classes of expressions: Joy, Sadness, Anger, Disgust, Surprise, Fear, and Neutral

• This isn’t just geometry mapping

• It is perceptual modelling of expressions

Step 1: Use deep learning to create a perceptual model of human expressions (human feature space, f(·))

Step 2: Learn an analogous character model (character feature space, f′(·))

Step 3: Learn a mapping between f(·) and f′(·)

Step 4: Retrieve characters using the perceptual model and geometry

Part 1: Expression Retrieval

Steps

Data Collection → Data Pre-processing → Network Training using Deep Learning → Transfer expressions

Data Collection - Human Database

• CK+: The Extended Cohn-Kanade dataset [REF] - 309 images
• DISFA: Denver Intensity of Spontaneous Facial Actions [REF] - 60,000 images
• KDEF: The Karolinska Directed Emotional Faces [REF] - 4,900 images
• MMI: 10,000 images
• Total of ~75K images. We balanced the final number of samples used to train our network to avoid bias towards any particular expression.

Data Collection - Character Database

• Eight stylized characters
• The animator creates the key poses for each expression
• Key poses are labeled via Mechanical Turk (MT) to populate the database initially
• We only used the expression key poses with 70% MT test agreement among 50 Turkers for the same pose. Interpolating between the key poses resulted in 60,000 images (around 8,000 images per character).
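For illustration, a minimal sketch of how in-between samples could be generated by interpolating key poses, assuming each key pose is stored as a vector of rig parameters (the parameter values and function name below are hypothetical, not the project's actual rig):

```python
import numpy as np

def interpolate_poses(pose_a, pose_b, num_steps=10):
    """Linearly interpolate between two key-pose parameter vectors."""
    pose_a = np.asarray(pose_a, dtype=float)
    pose_b = np.asarray(pose_b, dtype=float)
    # t = 0 reproduces pose_a, t = 1 reproduces pose_b
    return [(1.0 - t) * pose_a + t * pose_b for t in np.linspace(0.0, 1.0, num_steps)]

# Example: in-between poses for two hypothetical "surprise" key poses
in_betweens = interpolate_poses([0.0, 0.2, 0.9], [0.1, 0.6, 1.0], num_steps=8)
```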

Data Pre-processing

1. Extract the face and 49 facial landmarks (IntraFace)
2. Register faces to an average frontal face via an affine transformation
3. Select the face bounding box
4. Re-size to 256x256 pixels for analysis
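A rough sketch of the registration step, assuming landmark coordinates are already available for the image and for the average frontal face (the OpenCV-based estimate below is an assumption; it is not the IntraFace pipeline itself):

```python
import cv2
import numpy as np

def register_face(image, landmarks, reference_landmarks, output_size=256):
    """Warp a face so its landmarks align with the average frontal face."""
    src = np.asarray(landmarks, dtype=np.float32)
    dst = np.asarray(reference_landmarks, dtype=np.float32)
    # Estimate the affine transform mapping this face's 49 landmarks
    # onto the average frontal face's landmark positions.
    matrix, _ = cv2.estimateAffine2D(src, dst)
    # Apply the transform and crop/scale to the analysis resolution.
    return cv2.warpAffine(image, matrix, (output_size, output_size))
```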

[Registered face examples: Disgust (CK+), Joy (DISFA), Anger (KDEF), Surprise (MMI)]

Training networks

[Diagram: a Human Neural Network and a Stylized Character Neural Network each score the seven expression classes (Anger, Disgust, Fear, Joy, Neutral, Sadness, Surprise); a mapping is learned by finding the correlation between the corresponding expressions.]

Network Training using Deep Learning

Data augmentation
• 5 crops of 227x227: the four corners and a center crop
• Horizontal flip

Training the human model
• 4 convolutional layers
• 4 pooling layers
• 2 fully connected layers

Training the character model
• 3 convolutional layers
• 3 pooling layers
• 2 fully connected layers

Fine-tuning the character model
• N-1 layer features
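A minimal PyTorch sketch of a 4-conv / 4-pool / 2-FC classifier of this kind for the seven expression classes; the channel widths, kernel sizes, and the 227x227 input are illustrative assumptions, not the exact architecture from the talk:

```python
import torch.nn as nn

class HumanExpressionCNN(nn.Module):
    """Illustrative 4-conv, 4-pool, 2-FC network over 227x227 face crops."""
    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 14 * 14, 512), nn.ReLU(),  # "N-1" fully connected layer
            nn.Linear(512, num_classes),               # class scores (softmax applied at loss time)
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```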

Network Architecture

• Human CNN (HCNN)
• Character CNN (CCNN)
• Shared CNN (SCNN)

When and How to Fine-tune?

• New dataset is small and similar to the original dataset:
  • Fine-tuning the ConvNet is not a good idea (overfitting)
  • Train a linear classifier on the CNN codes
• New dataset is medium/large and similar to the original dataset:
  • Fine-tune through the full network (our Shared CNN)
• New dataset is small but very different from the original dataset:
  • Train an SVM classifier on activations from somewhere earlier in the network
• New dataset is large and very different from the original dataset:
  • Train from scratch, or initialize with weights from a pre-trained model
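A small sketch of the medium/large-and-similar case (fine-tune through the full network); the checkpoint path and optimizer settings are placeholders, and HumanExpressionCNN refers to the illustrative model sketched earlier:

```python
import torch

# Initialize from a pre-trained model, leave every layer trainable,
# and fine-tune with a small learning rate.
model = HumanExpressionCNN(num_classes=7)
model.load_state_dict(torch.load("pretrained_expression_cnn.pt"))  # hypothetical file

for param in model.parameters():
    param.requires_grad = True  # no layers frozen: fine-tune through the full network

optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
# ...then run a standard training loop over the character expression data...
```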

Transfer Learning

[Diagram: FC6 features extracted from the HCNN and FC6 features extracted from the SCNN are projected into a shared human-character feature space.]

Distance Metrics

• We extract features from the last fully connected layer of both models (the human expression model and the fine-tuned character expression model) and normalize the feature vectors.
• To retrieve the stylized character expression that most closely matches the human expression, we combine:
  • Jensen-Shannon divergence on the expression feature vectors ((N-1)-layer features) for expression clarity
  • Geometric feature distance on the geometry feature vectors for expression refinement

Jensen-Shannon divergence

• JS divergence is symmetric and gives a finite value:

  JSD(P ‖ Q) = ½ KL(P ‖ M) + ½ KL(Q ‖ M), where M = ½ (P + Q)

• Kullback-Leibler divergence is given as:

  KL(X ‖ M) = Σ_i X(i) log( X(i) / M(i) )

  where X and M are discrete probability distributions.
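A small Python sketch of this distance, applied to the seven-class expression distributions of a human query and a character candidate (the probability values shown are made up for illustration):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))  # Kullback-Leibler divergence
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Example: expression distributions over the seven classes
human_probs = [0.05, 0.02, 0.03, 0.80, 0.04, 0.03, 0.03]
char_probs = [0.10, 0.05, 0.05, 0.65, 0.05, 0.05, 0.05]
print(js_divergence(human_probs, char_probs))
```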

Multiple correct label results

Geometric distance refinement

• Since expressions are mainly controlled by the muscles around the mouth, eyes, and eyebrows, we focus on features that characterize the shape and location of these parts of the face.
• We use the facial landmarks to extract geometric features, including the following measurements:
  • left/right eyebrow height
  • left/right eyelid height
  • nose width
  • left/right mouth corner to mouth center distance
• We normalize these feature vectors and compute the L2 distance between the human geometry vector and the character geometry vectors with the correct expression label. Finally, we re-order the retrieved images within the matched label based on matched geometry.
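A minimal sketch of this geometry-based re-ranking, assuming the measurements above have already been collected into feature vectors (the candidate format is an assumption for illustration):

```python
import numpy as np

def geometry_distance(human_feats, char_feats):
    """L2 distance between normalized geometric feature vectors."""
    h = np.asarray(human_feats, dtype=float)
    c = np.asarray(char_feats, dtype=float)
    h = h / (np.linalg.norm(h) + 1e-12)
    c = c / (np.linalg.norm(c) + 1e-12)
    return float(np.linalg.norm(h - c))

def rerank_by_geometry(human_feats, candidates):
    """Re-order retrieved images (within the matched expression label) by geometry."""
    # candidates: list of (image_id, geometry_feature_vector) pairs
    return sorted(candidates, key=lambda item: geometry_distance(human_feats, item[1]))
```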

Layers Visualization

[Figure: input image, conv1 filters, and feature maps from conv1, conv2, and conv3; prediction label: Surprise]

[Figure: top match results (Surprise and Joy) showing the query and the character retrievals]

Expression-based Retrieval

[Figure: retrieval results using the CCNN vs. using the HCNN]

Evaluation

How close is the retrieved character expression label to the human query expression label?

• Retrieval Score
• Spearman rank correlation coefficient
• Kendall τ test
• Expert Comparison

Retrieval Score

• We measured the retrieval performance of our method by calculating the average normalized rank of relevant results (0 is the best score).
• The evaluation score for a query human expression image q was calculated as follows:

  Score(q) = (1 / (N · N_rel)) ( Σ_{k=1}^{N_rel} R_k - N_rel (N_rel + 1) / 2 )

  where N is the number of images in the database, N_rel is the number of images with a relevant expression label for q, and R_k is the rank assigned to the k-th relevant image.
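A sketch of this score in code, under the assumption that the reconstructed formula above (the standard average normalized rank of relevant results) is the one intended:

```python
def retrieval_score(relevant_ranks, num_images):
    """Average normalized rank of relevant results; 0 is the best possible score."""
    n_rel = len(relevant_ranks)
    best_possible_sum = n_rel * (n_rel + 1) / 2  # relevant items ranked 1..n_rel
    return (sum(relevant_ranks) - best_possible_sum) / (num_images * n_rel)

# Example: three relevant images ranked 1, 2, and 5 in a 100-image database
print(retrieval_score([1, 2, 5], num_images=100))  # small value = good retrieval
```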

Average retrieval score for each expression across all characters

Sample expert comparison

[Figure: sample expert comparison for five test queries, showing the Rank 1 through Rank 5 retrievals for each query under three orderings: Expression only, Expert, and Expression + Geometry]

Rank correlation coefficient

• Spearman rank correlation coefficient (the Pearson correlation coefficient computed on ranks)
  • The closer the value is to 1, the better the two ranks are correlated.
  • The average Spearman correlation coefficient for the 30 validation rank orderings is 0.773 ± 0.336.
  • Rank 1 correlation is 0.934 (the most relevant match!)
• Kendall τ test
  • A pairwise measure of how many pairs are ranked discordantly; the best matching ranks get a τ value of 1.
  • The average Kendall correlation coefficient for the 30 validation rank orderings is 0.706 ± 0.355.
  • Rank 1 correlation is 0.910 (the most relevant match!)
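For reference, a minimal sketch of computing both statistics with SciPy for one validation set; the rank vectors here are made-up illustrations, not the study's data:

```python
from scipy.stats import spearmanr, kendalltau

# Hypothetical rank orderings: our method's ranking vs. the expert's ranking
method_ranks = [1, 2, 3, 4, 5]
expert_ranks = [1, 3, 2, 4, 5]

rho, _ = spearmanr(method_ranks, expert_ranks)
tau, _ = kendalltau(method_ranks, expert_ranks)
print(f"Spearman: {rho:.3f}, Kendall tau: {tau:.3f}")
```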

Correlation metrics with expert

[Plot: Spearman and Kendall correlation coefficients (y-axis, -0.4 to 1.2) for each of the 30 validation sets (x-axis)]

Part 2: Generating Character Expressions

[Diagrams: a convolutional neural network (convolutional, max pooling, and fully connected layers) first classifies the human input expression (e.g., Surprise); the N-1 layer feature vector is then taken from the fully connected layer instead of the class label.]

Learn character model parameters

[Diagram: a network with convolutional, max pooling, fully connected, and softmax layers maps the N-1 feature vector to Maya rig parameters.]
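A hedged sketch of this final regression step: a small head that maps the N-1 feature vector to Maya rig parameters; the feature size, number of rig parameters, and layer widths are assumptions for illustration:

```python
import torch.nn as nn

class MayaParameterHead(nn.Module):
    """Illustrative regressor from an expression feature vector to rig parameters."""
    def __init__(self, feature_dim=512, num_rig_params=30):
        super().__init__()
        self.regressor = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(),
            nn.Linear(256, num_rig_params),
            nn.Sigmoid(),  # assume rig controls are normalized to [0, 1]
        )

    def forward(self, expression_features):
        # expression_features: the N-1 layer feature vector from the expression CNN
        return self.regressor(expression_features)
```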

Preliminary Result:

[Figure: a Disgust expression query and the corresponding Disgust expression rendered from the predicted Maya parameters]

Applications

• Improve visual storytelling applications:
  • Animated films
  • Gaming
  • Online marketing
  • VR/AR experiences
  • Robotics
• Medically-motivated application: teaching children with autism spectrum disorder (ASD) to both recognize and convey expressions using cartoon characters in an interactive environment.

Expression retrieval work to be presented at Asian Conference on Computer Vision (Nov 2016).

Project webpage http://grail.cs.washington.edu/projects/deepexpr/

Questions?
