Occasion: HUMAINE / WP4 / Workshop "From Signals to Signs of Emotion and Vice Versa", Santorin / Fira, 18th – 22nd September, 2004

Talk: Ronald Müller

Speech Emotion Recognition Combining Acoustic and Semantic Analyses

Institute for Human-Machine Communication

Technische Universität München




Slide -2-

Outline

System Overview

Emotional Speech Corpus

Acoustic Analysis

Semantic Analysis

Stream Fusion

Results


Slide -3-

System Overview

[Block diagram. Acoustic stream: Speech signal -> Prosodic features -> Classifier (SVM). Semantic stream: Speech signal -> ASR unit -> Semantic interpretation (Bayesian Networks). Both streams -> Stream fusion (MLP) -> Emotion]


Slide -4-

Emotional Speech Corpus

Emotion set: anger, disgust, fear, joy, neutrality, sadness, surprise

Corpus 1: Practical course

404 acted samples per emotion

13 speakers (1 female)

Recorded within one year

Total: $\sum_i E_i = 2828$ samples

Corpus 2: Driving simulator

500 spontaneous emotion samples

200 acted samples (disgust, sadness)

Total: $\sum_i E_i = 700$ samples



Slide -6-

Acoustic Analysis


Low-level features

Pitch contour (AMDF, low-pass filtering)

Energy contour

Spectrum

Signal
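The pitch contour above is obtained with the average magnitude difference function (AMDF). A minimal sketch for a single frame follows. The function names, the sample rate and the pitch search range are assumptions chosen for illustration, and the low-pass filtering step named on the slide is left out.

```python
import math


def amdf(frame, lag):
    """Average magnitude difference between the frame and a copy shifted by `lag`."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n


def amdf_pitch(frame, sample_rate, f_min, f_max):
    """Estimate the pitch in Hz from the lag with the deepest AMDF valley.

    f_min and f_max bound the search; a range wider than one octave can
    pick a multiple of the true period (octave error).
    """
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag = min(range(lag_min, lag_max + 1), key=lambda lag: amdf(frame, lag))
    return sample_rate / best_lag


# Synthetic test frame: 40 ms of a 200 Hz sine sampled at 8 kHz (assumed values)
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200.0 * t / sample_rate) for t in range(320)]
pitch = amdf_pitch(frame, sample_rate, f_min=110.0, f_max=400.0)
```

Applying `amdf_pitch` frame by frame would yield a pitch contour from which utterance-level statistics can be derived.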

High-level features

Statistical analysis of contours

Elimination of mean, normalization by standard deviation

Duration of one utterance (1-5 seconds)
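The step from frame-level contours to utterance-level features can be sketched as follows. The three statistics shown are only a small selection, the contour values are made up, and the slides do not state exactly which quantities are mean-normalized, so `standardize` is written for a generic list of values.

```python
import statistics


def contour_statistics(contour):
    """Derive a few utterance-level statistics from a frame-level contour."""
    gradients = [b - a for a, b in zip(contour, contour[1:])]
    return {
        "mean": statistics.fmean(contour),
        "std": statistics.pstdev(contour),
        "max_gradient": max(gradients),
    }


def standardize(values):
    """Remove the mean and scale to unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up pitch contour in Hz, for illustration only
pitch_contour = [110.0, 120.0, 150.0, 140.0, 130.0]
features = contour_statistics(pitch_contour)
normalized = standardize(pitch_contour)
```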


Slide -7-

Acoustic Analysis

Feature selection (1/2)

Initial set of 200 statistical features

Ranking 1: Individual performance of each feature (nearest-mean classifier)

Ranking 2: Sequential Forward Floating Search (SFFS), wrapped around a nearest-mean classifier
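Both rankings can be sketched in a few lines. This is a simplified illustration, not the implementation behind the slides: the classifier is scored on its own training data, the toy data set is made up, and `target_size` must not exceed the number of features.

```python
def nearest_mean_accuracy(samples, labels, subset):
    """Resubstitution accuracy of a nearest-mean classifier on a feature subset."""
    means = {}
    for label in sorted(set(labels)):
        rows = [s for s, l in zip(samples, labels) if l == label]
        means[label] = [sum(r[j] for r in rows) / len(rows) for j in subset]
    correct = 0
    for sample, label in zip(samples, labels):
        def squared_distance(candidate):
            return sum((sample[j] - m) ** 2 for j, m in zip(subset, means[candidate]))
        correct += min(means, key=squared_distance) == label
    return correct / len(samples)


def rank_single_features(samples, labels):
    """Ranking 1: order features by their individual accuracy."""
    indices = range(len(samples[0]))
    return sorted(indices, key=lambda j: nearest_mean_accuracy(samples, labels, [j]),
                  reverse=True)


def sffs(samples, labels, target_size):
    """Ranking 2: sequential forward floating search, simplified."""
    def score(subset):
        return nearest_mean_accuracy(samples, labels, subset)

    selected, best_by_size = [], {}
    while len(selected) < target_size:
        # Forward step: add the feature that improves the score most.
        candidates = [j for j in range(len(samples[0])) if j not in selected]
        selected = selected + [max(candidates, key=lambda j: score(selected + [j]))]
        size = len(selected)
        best_by_size[size] = max(best_by_size.get(size, 0.0), score(selected))
        # Floating step: drop a feature while that beats the best smaller subset.
        while len(selected) > 2:
            reduced = max(([k for k in selected if k != j] for j in selected), key=score)
            if score(reduced) <= best_by_size[len(reduced)]:
                break
            selected = reduced
            best_by_size[len(reduced)] = score(reduced)
    return selected


# Toy data: feature 0 separates the classes, features 1 and 2 are weaker
samples = [[0.0, 5.0, 1.0], [0.2, 1.0, 1.2], [0.1, 3.0, 0.0],
           [1.0, 4.0, 0.1], [1.2, 2.0, 1.1], [0.9, 0.0, 0.2]]
labels = ["a", "a", "a", "b", "b", "b"]
ranking = rank_single_features(samples, labels)
selected = sffs(samples, labels, target_size=2)
```

The order in which `sffs` adds features gives the SFFS rank, while `rank_single_features` gives the single-feature ranking.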


Slide -8-

Acoustic Analysis

Feature selection (2/2)

Top 10 features

Acoustic feature                                              | SFFS rank | Single perf.
Pitch, maximum gradient                                       | 1         | 31.5
Pitch, standard deviation of distance between reversal points | 2         | 23.0
Pitch, mean value                                             | 3         | 25.6
Signal, number of zero-crossings                              | 4         | 16.9
Pitch, standard deviation                                     | 5         | 27.6
Duration of silences, mean value                              | 6         | 17.5
Duration of voiced sounds, mean value                         | 7         | 18.5
Energy, median of fall-time                                   | 8         | 17.8
Energy, mean distance between reversal points                 | 9         | 19.0
Energy, mean of rise-time                                     | 10        | 17.6


Slide -9-

Acoustic Analysis

Classification

Evaluation of various classification methods (33 features)

Classifier | Error, % (speaker indep.) | Error, % (speaker dep.)
kMeans     | 57.05                     | 27.38
kNN        | 30.41                     | 17.39
GMM        | 25.17                     | 10.88
MLP        | 26.86                     | 9.36
SVM        | 23.88                     | 7.05
ML-SVM     | 18.71                     | 9.05

Output: Vector of (pseudo-) recognition confidences
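The slides do not state how the speaker-independent and speaker-dependent conditions were split. One common way to obtain a speaker-independent estimate is leave-one-speaker-out evaluation, sketched below under that assumption; the speaker IDs are made up.

```python
def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices) for each speaker."""
    for held_out in sorted(set(speaker_ids)):
        train = [i for i, s in enumerate(speaker_ids) if s != held_out]
        test = [i for i, s in enumerate(speaker_ids) if s == held_out]
        yield held_out, train, test


# One speaker ID per sample (hypothetical)
speaker_ids = ["s1", "s1", "s2", "s3", "s3", "s2"]
folds = list(leave_one_speaker_out(speaker_ids))
```

In each fold the classifier never sees the held-out speaker during training, which is what distinguishes this setting from a speaker-dependent one.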


Slide -10-

Acoustic Analysis

Classification

Multi-Layer Support Vector Machines

Input: acoustic feature vector

Layer 1: ang, ntl, fea, joy / dis, sur, sad

Layer 2: ang, ntl / fea, joy; dis, sur / sad

Layer 3: ang / ntl; fea / joy; dis / sur

Output classes: ang, ntl, fea, joy, sad, dis, sur

No confidence vector to forward to fusion
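The layered decision scheme can be sketched as a binary tree walk. The node names and the threshold rules below are hypothetical stand-ins for trained binary SVMs; only the class groupings are taken from the slide.

```python
# Leaves are emotion labels; internal nodes are (name, left, right).
TREE = (
    "layer1",
    ("layer2a", ("layer3a", "ang", "ntl"), ("layer3b", "fea", "joy")),
    ("layer2b", ("layer3c", "dis", "sur"), "sad"),
)


def classify(node, features, deciders):
    """Walk down the tree; each decider returns False (left) or True (right)."""
    while not isinstance(node, str):
        name, left, right = node
        node = right if deciders[name](features) else left
    return node


# Hypothetical threshold rules standing in for trained binary SVMs
deciders = {
    "layer1": lambda x: x[0] > 0.0,
    "layer2a": lambda x: x[1] > 0.0,
    "layer2b": lambda x: x[1] > 0.0,
    "layer3a": lambda x: x[2] > 0.0,
    "layer3b": lambda x: x[2] > 0.0,
    "layer3c": lambda x: x[2] > 0.0,
}

label = classify(TREE, [-1.0, 1.0, 1.0], deciders)
```

The walk ends in a single label rather than a score per emotion, which illustrates why this classifier has no confidence vector to pass on to the fusion stage.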



Slide -12-

Semantic Analysis


ASR-Unit

HMM-based

German vocabulary of 1300 words

No language model

5-best phrase hypotheses

Recognition confidences per word

Example output (first hypothesis):

Word       | I    | can't | stand | this | every | tray | traffic-jam
Confidence | 69.3 | 34.6  | 72.1  | 20.0 | 36.1  | 15.9 | 55.8
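A hypothesis of this kind can be carried to later stages as a list of word and confidence pairs, so that uncertain words can be weighted accordingly. The sketch below only shows the pairing; the variable names are illustrative and nothing here reflects the actual ASR interface.

```python
# First hypothesis from the example output above
words = ["I", "can't", "stand", "this", "every", "tray", "traffic-jam"]
confidences = [69.3, 34.6, 72.1, 20.0, 36.1, 15.9, 55.8]

# Pair each word with its recognition confidence
hypothesis = list(zip(words, confidences))

# The word the recognizer is least sure about
least_reliable = min(hypothesis, key=lambda pair: pair[1])
```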


Slide -13-

Semantic Analysis


Conditions

Natural language

Erroneous speech recognition

Uncertain knowledge

Incomplete knowledge

Superfluous knowledge

Probabilistic spotting approach

Bayesian Belief Networks


Slide -14-

Semantic Analysis

Bayesian Belief Networks

Acyclic graph of nodes and directed edges

One state variable $X_i$ per node (here with states $x_i$, $\bar{x}_i$)

Setting node dependencies via conditional probability matrices (P: parent, C: child):

$$P(X_C \mid X_P) = \begin{pmatrix} P(x_C \mid x_P) & P(x_C \mid \bar{x}_P) \\ P(\bar{x}_C \mid x_P) & P(\bar{x}_C \mid \bar{x}_P) \end{pmatrix}$$

Setting initial probabilities in root nodes:

$$P(X_R) = \left( P(x_R),\ P(\bar{x}_R) \right)^T$$

Observation A causes evidence in a child node (i.e. $P(x_C)$ is known)

Inference to direct parent nodes and finally to root nodes

Bayes' rule:

$$P(X_P \mid X_C) = \frac{P(X_C \mid X_P)\, P(X_P)}{P(X_C)}$$
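The inference step from an observed child node to its parent can be sketched for two-state nodes as follows. All probabilities are made-up illustration values, and the dictionary layout `cpt[child_state][parent_state]` is an assumption of this sketch.

```python
def child_marginal(prior, cpt):
    """P(X_C): sum over parent states of P(X_C | X_P) * P(X_P)."""
    return {c: sum(cpt[c][p] * prior[p] for p in prior) for c in cpt}


def parent_posterior(prior, cpt, observed_child):
    """Bayes' rule: P(X_P | X_C) = P(X_C | X_P) * P(X_P) / P(X_C)."""
    evidence = child_marginal(prior, cpt)[observed_child]
    return {p: cpt[observed_child][p] * prior[p] / evidence for p in prior}


# Initial probabilities of the parent node (hypothetical)
prior = {"true": 0.2, "false": 0.8}

# Conditional probability matrix, indexed as cpt[child_state][parent_state]
cpt = {
    "true": {"true": 0.9, "false": 0.1},
    "false": {"true": 0.1, "false": 0.9},
}

# The child node is observed in state "true"
posterior = parent_posterior(prior, cpt, "true")
```

With these numbers, observing the child raises the belief in the parent state from 0.2 to roughly 0.69. Repeating the step parent by parent propagates the evidence up to the root nodes.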


Slide -15-

Semantic Analysis

Emotion modelling

[Figure: Bayesian network for emotion interpretation. Levels: input level, words, superwords, phrases, super-phrases, interpretation. Processing steps between levels: spotting, clustering, sequence handling. Example input: "I can't stand this nasty every tray traffic-jam". Spotted words: can't, stand, nasty. Superwords: cannot, stand, bad, disgusting. Further nodes shown: I, first_person, I_hate, I_like, Bad, Good, Abhorrence, Positive, Negative. Emotion nodes shown: Anger, Disgust, Joy.]

Output: Vector of "real" recognition confidences



Slide -17-

Stream Fusion


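Pairwise mean:

$$P_{\text{fusion}}(E_n) = \frac{1}{2}\left( P_{\text{acoustic}}(E_n) + P_{\text{semantic}}(E_n) \right)$$

Discriminative fusion applying MLP

Input layer: 2 x 7 confidences

Hidden layer: 100 nodes

Output layer: 7 recognition confidences

Decision:

$$\hat{n} = \arg\max_{n} P_{\text{fusion}}(E_n)$$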
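Both fusion variants can be sketched as follows. The confidence vectors are made up, the MLP weights are random and untrained, and the tanh and softmax activations and the absence of bias terms are assumptions of this sketch; only the 14-100-7 layer layout comes from the slide.

```python
import math
import random

EMOTIONS = ["ang", "dis", "fea", "joy", "ntl", "sad", "sur"]


def fuse_by_mean(acoustic, semantic):
    """Pairwise mean of the two confidence vectors."""
    return [(a + s) / 2.0 for a, s in zip(acoustic, semantic)]


def decide(confidences):
    """Return the emotion with the highest fused confidence."""
    best = max(range(len(confidences)), key=confidences.__getitem__)
    return EMOTIONS[best]


def mlp_forward(inputs, w_hidden, w_out):
    """One tanh hidden layer followed by a softmax output (assumed activations)."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in w_hidden]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up confidence vectors, ordered as in EMOTIONS
acoustic = [0.40, 0.05, 0.10, 0.05, 0.30, 0.05, 0.05]
semantic = [0.20, 0.50, 0.05, 0.05, 0.10, 0.05, 0.05]
mean_fused = fuse_by_mean(acoustic, semantic)
mean_decision = decide(mean_fused)

# Untrained random weights, only to show the 14 -> 100 -> 7 layout
rng = random.Random(0)
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(14)] for _ in range(100)]
w_out = [[rng.uniform(-0.1, 0.1) for _ in range(100)] for _ in range(7)]
mlp_fused = mlp_forward(acoustic + semantic, w_hidden, w_out)
```

Unlike the fixed mean, a trained MLP can learn which stream to trust for which emotion, since it sees all 14 confidences at once.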


Slide -18-

Results

Acoustic recognition rates (SVM):

Emotion | ang  | dis  | fea  | joy  | ntl  | sad  | sur  | Mean
%       | 95.5 | 61.3 | 78.7 | 75.1 | 78.5 | 62.1 | 68.3 | 74.2

Semantic recognition rates:

Emotion | ang  | dis  | fea  | joy  | ntl  | sad  | sur  | Mean
%       | 78.4 | 71.2 | 53.4 | 57.7 | 56.0 | 35.0 | 65.5 | 59.6


Slide -19-

Results

Recognition rates after discriminative fusion:

Emotion | ang  | dis  | fea  | joy  | ntl  | sad  | sur  | Mean
%       | 98.0 | 78.7 | 88.3 | 95.9 | 98.2 | 91.7 | 95.8 | 92.0

Overview:

  | Acoustic information | Language information | Fusion by means | Fusion by MLP
% | 74.2                 | 59.6                 | 83.1            | 92.0
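One way to read the overview figures is as a relative reduction of the error rate compared with the acoustic stream alone. This derived measure is not stated on the slides; the sketch below only does the arithmetic on the four mean rates.

```python
# Mean recognition rates in percent, taken from the overview
rates = {"acoustic": 74.2, "language": 59.6, "fusion_mean": 83.1, "fusion_mlp": 92.0}


def relative_error_reduction(baseline_rate, improved_rate):
    """Percentage of the baseline error that the improved system removes."""
    baseline_error = 100.0 - baseline_rate
    improved_error = 100.0 - improved_rate
    return 100.0 * (baseline_error - improved_error) / baseline_error


gain_mean = relative_error_reduction(rates["acoustic"], rates["fusion_mean"])
gain_mlp = relative_error_reduction(rates["acoustic"], rates["fusion_mlp"])
```

Under this reading, fusion by means removes about a third of the acoustic-only error and MLP fusion about two thirds.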


Slide -20-

Summary


Acted Emotions

7 discrete emotion categories

Prosodic feature selection via

Single feature performance

Sequential forward floating search

Evaluative comparison of different classifiers

SVMs outperform the other classifiers

Semantic analysis applying Bayesian Networks

Significant gain by discriminative stream fusion
