multimodality, universals, natural interaction…
and some other stories…
Kostas Karpouzis & Stefanos Kollias
ICCS/NTUA
HUMAINE WP4
going multimodal
• ‘multimodal’ is this decade’s defining aspect of ‘affective interaction’
• a plethora of modalities is available to capture and process
– visual, aural, haptic…
– ‘visual’ can be broken down into ‘facial expressivity’, ‘hand gesturing’, ‘body language’, etc.
– ‘aural’ into ‘prosody’, ‘linguistic content’, etc.
why multimodal?
• extending unimodality…
– recognition from traditional unimodal inputs had serious limitations
– multimodal corpora are becoming available
• what is there to gain?
– have recognition rates improved?
– or have we just introduced more uncertain features?
essential reading
• Communications of the ACM, Nov. 1999, Vol. 42, No. 11, pp. 74-81 (S. Oviatt, ‘Ten myths of multimodal interaction’)
putting it all together
• myth #6: multimodal integration involves redundancy of content between modes
• you have features from a person’s
– facial expressions and body language
– speech prosody and linguistic content
– even their heart rate
• so, what do you do when their face tells you something different than their …heart? (one possible answer is sketched below)
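One common answer is confidence-weighted late fusion: each modality votes with a probability distribution, scaled by how much we currently trust it. Below is a minimal sketch in Python; the label set, classifier outputs and confidence values are invented for illustration and are not taken from any HUMAINE system.

```python
# A minimal sketch of confidence-weighted late fusion, assuming each
# modality-specific classifier outputs a distribution over a shared
# label set plus a self-reported confidence. Labels, classifiers and
# numbers are invented for illustration.

LABELS = ["angry", "happy", "neutral", "sad"]

def fuse(per_modality):
    """per_modality maps modality name -> (probs over LABELS, confidence)."""
    fused = [0.0] * len(LABELS)
    total = sum(conf for _, conf in per_modality.values()) or 1.0
    for probs, conf in per_modality.values():
        for i, p in enumerate(probs):
            fused[i] += conf * p / total   # each modality votes by its trust
    return dict(zip(LABELS, fused))

# the face confidently says 'happy'; heart rate weakly hints at anger
observations = {
    "face":  ([0.10, 0.70, 0.15, 0.05], 0.9),
    "heart": ([0.60, 0.10, 0.20, 0.10], 0.4),
}
scores = fuse(observations)
print(max(scores, key=scores.get), scores)   # the face wins here
```

With this scheme a conflicting but low-confidence channel shifts the fused distribution without overriding a confident one; how the confidences themselves are estimated is the hard part.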
first, look at this video
and now, listen!
but it can be good
• what happens when one of the available modalities is not robust?
– better yet, when the ‘weak’ modality changes over time?
• consider the ‘bartender problem’ (sketched below)
– very little linguistic content reaches its target
– mouth shapes are available (visemes)
– limited vocabulary
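A hedged sketch of how a system might cope with the bartender problem: estimate the audio channel's reliability and, when it collapses, lean on viseme sequences plus the limited vocabulary. The SNR-based reliability mapping, the toy viseme matching and the vocabulary are all invented for illustration.

```python
# The 'bartender problem': little speech survives the noise, so gate
# the audio channel by a crude reliability estimate and fall back to
# lip shapes (visemes) over a small drinks vocabulary.

VOCABULARY = {"beer": "B-IY-R", "wine": "W-AY-N", "water": "W-AO-T-ER"}

def audio_reliability(snr_db):
    # crude mapping of signal-to-noise ratio to a weight in [0, 1]
    return max(0.0, min(1.0, snr_db / 20.0))

def recognize(viseme_string, audio_guess, snr_db):
    # score each vocabulary word by toy viseme overlap
    def viseme_score(word):
        target = VOCABULARY[word]
        hits = sum(1 for v in viseme_string.split("-") if v in target)
        return hits / max(1, len(target.split("-")))
    best_visual = max(VOCABULARY, key=viseme_score)
    # gate: in a loud bar, trust the mouth rather than the microphone
    return audio_guess if audio_reliability(snr_db) > 0.5 else best_visual

print(recognize("W-AO-T", audio_guess="wine", snr_db=3.0))  # -> 'water'
```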
again, why multimodal?
• holy grail: assigning labels to different parts of human-human or human-computer interaction
• yes, labels can be nice!
– humans do it all the time
– and so do computers (e.g., classification)
– OK, but what kind of label?
In the beginning …
• Based on the claim that ‘there are six facial expressions recognized universally across cultures’…
• all video databases used to contain images of sad, angry, happy or fearful people…
• thus, more sad, angry, happy or fearful people appear, even when the data involve HCI, and subtle emotions/additional labels are out of the picture
– can you really be afraid that often when using your computer?
the Humaine approach
• so where is Humaine in all that?
– subtle emotions
– natural expressivity
– alternative emotion representations
– discussing dynamics
– classification of emotional episodes from life-like HCI and reality TV
Humaine WP4 results

• ERMIS SAL (QUB-ICCS)
– frames/users/length: four subjects, ~2 hr of audio/video annotated with Feeltrace
– modalities present: facial expressions, speech prosody, head pose
– features extracted: FAPs per frame, acoustic features per tune, phonemes/visemes
– until now: one subject analyzed, ~34,000 frames, ~800 tunes
– plans for 2007: extract facial and prosody features from the three remaining subjects; analyze head pose
– recognition rates: recurrent NNs 87%, rule-based 78.4%, possibilistic 65.1%
• EmoTV (LIMSI)
– frames/users/length: 28 clips, ~5 minutes total
– modalities present: subtle facial expressions, restricted gesturing
– features extracted: overall activation (FAPs or prosody not possible)
– until now: all clips
– plans for 2007: extract remaining expressivity features (where possible)
– recognition rates: correlation with manual annotator κ = 0.83
• EmoTaboo (LIMSI)
– frames/users/length: 2 clips, ~5 minutes
– modalities present: facial expressions, speech prosody
– features extracted: FAPs
– until now: all clips
– plans for 2007: head pose, prosody features
– recognition rates: annotation not yet available
• CEICES (FAU)
– frames/users/length: 51 children, ~9 hrs recorded and annotated
– modalities present: speech prosody
– features extracted: acoustic features per turn/word
– until now: all clips
– plans for 2007: completed analysis, pending comparison of recognition schemes
– recognition rates: mean recognition rate 55.8%
• Genoa06 corpus (Genoa)
– frames/users/length: 10 subjects, ~50 gesture repetitions each, ~1 hour
– modalities present: facial expressions (FAPs), gesturing, pseudolanguage
– features extracted: FAPs, gestures, speech
– until now: all clips
– plans for 2007: expressivity features from hand movement
– recognition rates: facial 59.6%, gestures 67.1%, speech 70.8%, multimodal 78.3%
• GEMEP (GERG)
– frames/users/length: 1200 clips total
– modalities present: facial expressions (FAPs), gesturing, pseudolanguage
– features extracted: expressivity, gestures, FAPs, speech
– until now: 8 body clips, 30 face clips
– plans for 2007: analyze remaining 1200 clips
– recognition rates: few clips analyzed
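For a feel of how a recurrent classifier can consume per-frame FAP features like those above, here is a minimal numpy forward pass. The dimensions, random weights and label count are placeholders; this is not the ERMIS/SAL network, only the general recurrence such a classifier relies on.

```python
# Minimal Elman-style recurrent forward pass over a sequence of
# per-frame FAP vectors; sizes and weights are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
n_faps, n_hidden, n_labels = 17, 32, 4   # placeholder dimensions

W_in  = rng.normal(0, 0.1, (n_hidden, n_faps))
W_rec = rng.normal(0, 0.1, (n_hidden, n_hidden))
W_out = rng.normal(0, 0.1, (n_labels, n_hidden))

def classify_sequence(fap_frames):
    """fap_frames: array of shape (T, n_faps), one row per video frame."""
    h = np.zeros(n_hidden)
    for x in fap_frames:                   # unroll over time
        h = np.tanh(W_in @ x + W_rec @ h)  # recurrence carries the dynamics
    logits = W_out @ h                     # classify from the final state
    return np.exp(logits) / np.exp(logits).sum()

sequence = rng.normal(size=(120, n_faps))  # ~5 s of frames at 24 fps
print(classify_sequence(sequence))         # distribution over emotion labels
```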
HUMAINE 2010
three years from now, in a galaxy (not) far, far away…
a fundamental question
• OK, people may be angry or sad, or express positive/active emotions
• face recognition provides an answer to the ‘who?’ question
• ‘when?’ and ‘where?’ are usually known or irrelevant
• but, does anyone know ‘why?’
– context information
– semantics
a fundamental question (2)
is it me or?...
• some modalities may display no clues or, worse, contradictory clues
• the same expression may mean different things coming from different people
• can we ‘bridge’ what we know about someone or about the interaction with what we sense? (see the sketch below)
– and can we adapt what we know based on that?
– or can we align what we sense with other sources?
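One simple way to ‘bridge’ person-specific knowledge with what we sense is to keep a running per-user baseline and interpret features relative to it, so the same raw expression reads differently for an expressive and a reserved user. The sketch below uses Welford's online mean/variance update; the class name and the values are invented for illustration.

```python
# Per-user baseline adaptation: z-score each feature against that
# user's own running statistics (Welford's online update).

class UserBaseline:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def zscore(self, x):
        var = self.m2 / self.n if self.n > 1 else 1.0
        return (x - self.mean) / (var ** 0.5 or 1.0)

# identical raw smile intensity, two very different users
expressive, reserved = UserBaseline(), UserBaseline()
for v in (0.7, 0.8, 0.9, 0.75):
    expressive.update(v)
for v in (0.1, 0.2, 0.15, 0.1):
    reserved.update(v)

print(expressive.zscore(0.6))  # negative: below this user's norm
print(reserved.zscore(0.6))    # large positive: unusual for this user
```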
another kind of language
• sign language analysis poses a number of interesting problems
– image processing and understanding tasks
– syntactic analysis
– context (e.g. when referring to a third person)
– natural language processing
– vocabulary limitations
want answers?
Let us try to extend some of the issues already raised!
Semantic Analysis
Semantics – Context (a peek at the future)
[architecture diagram: visual data → segmentation → feature extraction → classifiers C1…Cn → fusion → adaptation → labelling; the visual analysis and context analysis paths feed a Fuzzy Reasoning Engine (FiRE) built on an ontology infrastructure and backed by a centralised/decentralised knowledge repository]
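To make the fuzzy-reasoning step of the diagram concrete, here is a toy min/max fuzzy rule evaluation over membership degrees of the kind the visual and context paths might produce. This is not the FiRE API; the rules, cue names and degrees are invented for illustration.

```python
# Toy fuzzy rule evaluation: min as AND (t-norm), max as OR (s-norm),
# applied to invented membership degrees from upstream analysis.

def fuzzy_and(*degrees):
    return min(degrees)

def fuzzy_or(*degrees):
    return max(degrees)

# degrees of membership produced by classifiers / context analysis
visual = {"smile": 0.8, "frown": 0.1}
context = {"game_won": 0.9, "deadline_near": 0.2}

# rule: happy if smiling AND some positive context cue is active
happy = fuzzy_and(visual["smile"],
                  fuzzy_or(context["game_won"], context["deadline_near"]))
# rule: stressed if frowning OR a negative context cue is active
stressed = fuzzy_or(visual["frown"], context["deadline_near"])

print({"happy": happy, "stressed": stressed})  # {'happy': 0.8, 'stressed': 0.2}
```

The point of the fuzzy layer is that context can raise or lower the degree of an emotional label without forcing a hard, premature decision in any single modality.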
Standardisation Activities
• W3C Multimedia Semantics Incubator Group
• W3C Emotion Incubator Group
Provide machine-understandable representations of available emotion modelling, analysis and synthesis theory, cues and results, to be accessed through the Web and used in all types of affective interaction.
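As a rough idea of what such a machine-understandable representation could look like, the sketch below serializes a multimodal emotion annotation as XML. The element and attribute names are invented for illustration; they do not follow a published specification of either incubator group.

```python
# Illustrative serialization of a multimodal emotion annotation as
# XML; all element/attribute names are hypothetical.

import xml.etree.ElementTree as ET

emotion = ET.Element("emotion")
ET.SubElement(emotion, "category", name="happiness", confidence="0.78")
ET.SubElement(emotion, "dimension", name="activation", value="0.6")
ET.SubElement(emotion, "dimension", name="valence", value="0.4")
ET.SubElement(emotion, "modality", source="face")
ET.SubElement(emotion, "modality", source="prosody")

print(ET.tostring(emotion, encoding="unicode"))
```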