multimodality, universals, natural interaction…
and some other stories…
Kostas Karpouzis & Stefanos Kollias
ICCS/NTUA
HUMAINE WP4
going multimodal
• ‘multimodal’ is this decade’s defining aspect of ‘affective interaction’
• a plethora of modalities is available to capture and process
– visual, aural, haptic…
– ‘visual’ can be broken down into ‘facial expressivity’, ‘hand gesturing’, ‘body language’, etc.
– ‘aural’ into ‘prosody’, ‘linguistic content’, etc.
why multimodal?
• extending unimodality…
– recognition from traditional unimodal inputs had serious limitations
– multimodal corpora are becoming available
• what is there to gain?
– have recognition rates improved?
– or have we just introduced more uncertain features?
essential reading
• Communications of the ACM, Nov. 1999, Vol. 42, No. 11, pp. 74-81 (S. Oviatt, ‘Ten myths of multimodal interaction’)
putting it all together
• myth #6: multimodal integration involves redundancy of content between modes
• you have features from a person’s
– facial expressions and body language
– speech prosody and linguistic content
– even their heart rate
• so, what do you do when their face tells you something different than their …heart? (one possible answer is sketched below)
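One common answer is confidence-weighted late fusion: each modality votes with a probability distribution, scaled by how much we currently trust it. Below is a minimal sketch in Python; the label set, classifier outputs and confidence values are invented for illustration and are not taken from any HUMAINE system.

```python
# A minimal sketch of confidence-weighted late fusion, assuming each
# modality-specific classifier outputs a distribution over a shared
# label set plus a self-reported confidence. Labels, classifiers and
# numbers are invented for illustration.

LABELS = ["angry", "happy", "neutral", "sad"]

def fuse(per_modality):
    """per_modality maps modality name -> (probs over LABELS, confidence)."""
    fused = [0.0] * len(LABELS)
    total = sum(conf for _, conf in per_modality.values()) or 1.0
    for probs, conf in per_modality.values():
        for i, p in enumerate(probs):
            fused[i] += conf * p / total   # each modality votes by its trust
    return dict(zip(LABELS, fused))

# the face confidently says 'happy'; heart rate weakly hints at anger
observations = {
    "face":  ([0.10, 0.70, 0.15, 0.05], 0.9),
    "heart": ([0.60, 0.10, 0.20, 0.10], 0.4),
}
scores = fuse(observations)
print(max(scores, key=scores.get), scores)   # the face wins here
```

With this scheme a conflicting but low-confidence channel shifts the fused distribution without overriding a confident one; how the confidences themselves are estimated is the hard part.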
first, look at this video
and now, listen!
but it can be good
• what happens when one of the available modalities is not robust?
– better yet, when the ‘weak’ modality changes over time?
• consider the ‘bartender problem’ (sketched below)
– very little linguistic content reaches its target
– mouth shapes are available (visemes)
– limited vocabulary
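A hedged sketch of how a system might cope with the bartender problem: estimate the audio channel's reliability and, when it collapses, lean on viseme sequences plus the limited vocabulary. The SNR-based reliability mapping, the toy viseme matching and the vocabulary are all invented for illustration.

```python
# The 'bartender problem': little speech survives the noise, so gate
# the audio channel by a crude reliability estimate and fall back to
# lip shapes (visemes) over a small drinks vocabulary.

VOCABULARY = {"beer": "B-IY-R", "wine": "W-AY-N", "water": "W-AO-T-ER"}

def audio_reliability(snr_db):
    # crude mapping of signal-to-noise ratio to a weight in [0, 1]
    return max(0.0, min(1.0, snr_db / 20.0))

def recognize(viseme_string, audio_guess, snr_db):
    # score each vocabulary word by toy viseme overlap
    def viseme_score(word):
        target = VOCABULARY[word]
        hits = sum(1 for v in viseme_string.split("-") if v in target)
        return hits / max(1, len(target.split("-")))
    best_visual = max(VOCABULARY, key=viseme_score)
    # gate: in a loud bar, trust the mouth rather than the microphone
    return audio_guess if audio_reliability(snr_db) > 0.5 else best_visual

print(recognize("W-AO-T", audio_guess="wine", snr_db=3.0))  # -> 'water'
```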
again, why multimodal?
• holy grail: assigning labels to different parts of human-human or human-computer interaction
• yes, labels can be nice!
– humans do it all the time
– and so do computers (e.g., classification)
– OK, but what kind of label?
In the beginning …
• Based on the claim that ‘there are six facial expressions recognized universally across cultures’…
• all video databases used to contain images of sad, angry, happy or fearful people…
• thus, more sad, angry, happy or fearful people appear, even when the data involve HCI, and subtle emotions/additional labels are out of the picture
– can you really be afraid that often when using your computer?
the Humaine approach
• so where is Humaine in all that?
– subtle emotions
– natural expressivity
– alternative emotion representations
– discussing dynamics
– classification of emotional episodes from life-like HCI and reality TV
Humaine WP4 results

• ERMIS SAL (QUB-ICCS)
– frames/users/length: four subjects, ~2 hr of audio/video annotated with Feeltrace
– modalities present: facial expressions, speech prosody, head pose
– features extracted: FAPs per frame, acoustic features per tune, phonemes/visemes
– until now: one subject analyzed, ~34,000 frames, ~800 tunes
– plans for 2007: extract facial and prosody features from the three remaining subjects; analyze head pose
– recognition rates: recurrent NNs 87%, rule-based 78.4%, possibilistic 65.1%
• EmoTV (LIMSI)
– frames/users/length: 28 clips, ~5 minutes total
– modalities present: subtle facial expressions, restricted gesturing
– features extracted: overall activation (FAPs or prosody not possible)
– until now: all clips
– plans for 2007: extract remaining expressivity features (where possible)
– recognition rates: correlation with manual annotator κ = 0.83
• EmoTaboo (LIMSI)
– frames/users/length: 2 clips, ~5 minutes
– modalities present: facial expressions, speech prosody
– features extracted: FAPs
– until now: all clips
– plans for 2007: head pose, prosody features
– recognition rates: annotation not yet available
• CEICES (FAU)
– frames/users/length: 51 children, ~9 hrs recorded and annotated
– modalities present: speech prosody
– features extracted: acoustic features per turn/word
– until now: all clips
– plans for 2007: completed analysis, pending comparison of recognition schemes
– recognition rates: mean recognition rate 55.8%
• Genoa06 corpus (Genoa)
– frames/users/length: 10 subjects, ~50 gesture repetitions each, ~1 hour
– modalities present: facial expressions (FAPs), gesturing, pseudolanguage
– features extracted: FAPs, gestures, speech
– until now: all clips
– plans for 2007: expressivity features from hand movement
– recognition rates: facial 59.6%, gestures 67.1%, speech 70.8%, multimodal 78.3%
• GEMEP (GERG)
– frames/users/length: 1200 clips total
– modalities present: facial expressions (FAPs), gesturing, pseudolanguage
– features extracted: expressivity, gestures, FAPs, speech
– until now: 8 body clips, 30 face clips
– plans for 2007: analyze remaining 1200 clips
– recognition rates: few clips analyzed
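For a feel of how a recurrent classifier can consume per-frame FAP features like those above, here is a minimal numpy forward pass. The dimensions, random weights and label count are placeholders; this is not the ERMIS/SAL network, only the general recurrence such a classifier relies on.

```python
# Minimal Elman-style recurrent forward pass over a sequence of
# per-frame FAP vectors; sizes and weights are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
n_faps, n_hidden, n_labels = 17, 32, 4   # placeholder dimensions

W_in  = rng.normal(0, 0.1, (n_hidden, n_faps))
W_rec = rng.normal(0, 0.1, (n_hidden, n_hidden))
W_out = rng.normal(0, 0.1, (n_labels, n_hidden))

def classify_sequence(fap_frames):
    """fap_frames: array of shape (T, n_faps), one row per video frame."""
    h = np.zeros(n_hidden)
    for x in fap_frames:                   # unroll over time
        h = np.tanh(W_in @ x + W_rec @ h)  # recurrence carries the dynamics
    logits = W_out @ h                     # classify from the final state
    return np.exp(logits) / np.exp(logits).sum()

sequence = rng.normal(size=(120, n_faps))  # ~5 s of frames at 24 fps
print(classify_sequence(sequence))         # distribution over emotion labels
```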
HUMAINE 2010
three years from now, in a galaxy (not) far, far away…
a fundamental question
• OK, people may be angry or sad, or express positive/active emotions
• face recognition provides an answer to the ‘who?’ question
• ‘when?’ and ‘where?’ are usually known or irrelevant
• but, does anyone know ‘why?’
– context information
– semantics
a fundamental question (2)
is it me or?...
• some modalities may display no clues or, worse, contradictory clues
• the same expression may mean different things coming from different people
• can we ‘bridge’ what we know about someone or about the interaction with what we sense? (see the sketch below)
– and can we adapt what we know based on that?
– or can we align what we sense with other sources?
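One simple way to ‘bridge’ person-specific knowledge with what we sense is to keep a running per-user baseline and interpret features relative to it, so the same raw expression reads differently for an expressive and a reserved user. The sketch below uses Welford's online mean/variance update; the class name and the values are invented for illustration.

```python
# Per-user baseline adaptation: z-score each feature against that
# user's own running statistics (Welford's online update).

class UserBaseline:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def zscore(self, x):
        var = self.m2 / self.n if self.n > 1 else 1.0
        return (x - self.mean) / (var ** 0.5 or 1.0)

# identical raw smile intensity, two very different users
expressive, reserved = UserBaseline(), UserBaseline()
for v in (0.7, 0.8, 0.9, 0.75):
    expressive.update(v)
for v in (0.1, 0.2, 0.15, 0.1):
    reserved.update(v)

print(expressive.zscore(0.6))  # negative: below this user's norm
print(reserved.zscore(0.6))    # large positive: unusual for this user
```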
another kind of language
• sign language analysis poses a number of interesting problems
– image processing and understanding tasks
– syntactic analysis
– context (e.g. when referring to a third person)
– natural language processing
– vocabulary limitations
want answers?
Let us try to extend some of the issues already raised!
Semantic Analysis
Semantics – Context (a peek at the future)
[architecture diagram: visual data → segmentation → feature extraction → classifiers C1…Cn → fusion → adaptation → labelling; the visual analysis and context analysis paths feed a Fuzzy Reasoning Engine (FiRE) built on an ontology infrastructure and backed by a centralised/decentralised knowledge repository]
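To make the fuzzy-reasoning step of the diagram concrete, here is a toy min/max fuzzy rule evaluation over membership degrees of the kind the visual and context paths might produce. This is not the FiRE API; the rules, cue names and degrees are invented for illustration.

```python
# Toy fuzzy rule evaluation: min as AND (t-norm), max as OR (s-norm),
# applied to invented membership degrees from upstream analysis.

def fuzzy_and(*degrees):
    return min(degrees)

def fuzzy_or(*degrees):
    return max(degrees)

# degrees of membership produced by classifiers / context analysis
visual = {"smile": 0.8, "frown": 0.1}
context = {"game_won": 0.9, "deadline_near": 0.2}

# rule: happy if smiling AND some positive context cue is active
happy = fuzzy_and(visual["smile"],
                  fuzzy_or(context["game_won"], context["deadline_near"]))
# rule: stressed if frowning OR a negative context cue is active
stressed = fuzzy_or(visual["frown"], context["deadline_near"])

print({"happy": happy, "stressed": stressed})  # {'happy': 0.8, 'stressed': 0.2}
```

The point of the fuzzy layer is that context can raise or lower the degree of an emotional label without forcing a hard, premature decision in any single modality.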
Standardisation Activities
• W3C Multimedia Semantics Incubator Group
• W3C Emotion Incubator Group
Provide machine-understandable representations of available emotion modelling, analysis and synthesis theory, cues and results, to be accessed through the Web and used in all types of affective interaction.
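As a rough idea of what such a machine-understandable representation could look like, the sketch below serializes a multimodal emotion annotation as XML. The element and attribute names are invented for illustration; they do not follow a published specification of either incubator group.

```python
# Illustrative serialization of a multimodal emotion annotation as
# XML; all element/attribute names are hypothetical.

import xml.etree.ElementTree as ET

emotion = ET.Element("emotion")
ET.SubElement(emotion, "category", name="happiness", confidence="0.78")
ET.SubElement(emotion, "dimension", name="activation", value="0.6")
ET.SubElement(emotion, "dimension", name="valence", value="0.4")
ET.SubElement(emotion, "modality", source="face")
ET.SubElement(emotion, "modality", source="prosody")

print(ET.tostring(emotion, encoding="unicode"))
```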