Occasion: HUMAINE / WP4 / Workshop "From Signals to Signs of Emotion and Vice Versa", Santorin / Fira, 18th – 22nd September, 2004

Talk: Ronald Müller

Speech Emotion Recognition Combining Acoustic and Semantic Analyses

Institute for Human-Machine Communication

Technische Universität München

Slide -2-

Outline

System Overview

Emotional Speech Corpus

Acoustic Analysis

Semantic Analysis

Stream Fusion

Results

Slide -3-

System Overview

[Block diagram] Speech signal, processed in two streams:

Acoustic stream: Prosodic features, then Classifier (SVM)

Semantic stream: ASR unit, then Semantic interpretation (Bayesian Networks)

Stream fusion (MLP)

Emotion

Slide -4-

Emotional Speech Corpus

Emotion set:

Anger, disgust, fear, joy, neutrality, sadness, surprise

Corpus 1: Practical course

404 acted samples per emotion (2828 samples in total)

13 speakers (1 female)

Recorded within one year

Corpus 2: Driving simulator

500 spontaneous emotion samples

200 acted samples (disgust, sadness)

700 samples in total

Slide -5-

System Overview

[Block diagram repeated from Slide -3-]

Slide -6-

Acoustic Analysis

Low-level features

Pitch contour (AMDF, low-pass filtering)

Energy contour

Spectrum

Signal

High-level features

Statistical analysis of contours

Mean removal and normalization by the standard deviation

Duration of one utterance (1-5 seconds)
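
The two feature levels on this slide can be illustrated with a minimal Python sketch. It estimates pitch for one frame with the average magnitude difference function (AMDF) and derives a few contour statistics, including mean removal and scaling by the standard deviation. The function names, the frame length, the search range and the synthetic test tone are assumptions made for illustration; the slide does not give the actual parameters or the low-pass filter design, so filtering is omitted here.

```python
import math
import statistics


def amdf(frame, lag):
    """Average magnitude difference between a frame and itself shifted by `lag`."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n


def estimate_pitch(frame, sample_rate, f_min, f_max):
    """Pitch in Hz for the lag that minimizes the AMDF within [f_min, f_max]."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag = min(range(min_lag, max_lag + 1), key=lambda lag: amdf(frame, lag))
    return sample_rate / best_lag


def contour_statistics(contour):
    """A few high-level statistics of a low-level contour (pitch or energy)."""
    gradients = [b - a for a, b in zip(contour, contour[1:])]
    return {
        "mean": statistics.fmean(contour),
        "std": statistics.pstdev(contour),
        "max_gradient": max(abs(g) for g in gradients),
    }


def z_normalize(values):
    """Remove the mean and scale to unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Illustrative input (assumption): a 200 Hz sine, 40 ms frame at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * t / sample_rate) for t in range(320)]

# The search range is kept below one octave above the shortest expected period
# multiple, so that a doubled period cannot win the minimum (octave error).
pitch = estimate_pitch(frame, sample_rate, f_min=110.0, f_max=400.0)
stats = contour_statistics([100.0, 120.0, 110.0])
normalized = z_normalize([100.0, 120.0, 110.0])
```

A periodic signal produces AMDF minima at every multiple of its period, which is why the lag range matters in this sketch.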

Slide -7-

Acoustic Analysis

Feature selection (1/2)

Initial set of 200 statistical features

Ranking 1: Individual performance of each feature (nearest-mean classifier)

Ranking 2: Sequential Forward Floating Search (SFFS), wrapped around a nearest-mean classifier
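
Both rankings can be sketched in a few lines of Python. The code below is an illustration under stated assumptions, not the implementation used in the talk: the toy data, the resubstitution scoring (training and testing on the same samples) and all function names are made up, and the floating step follows the common textbook form of SFFS.

```python
def nearest_mean_accuracy(train_x, train_y, test_x, test_y, features):
    """Accuracy of a nearest-mean classifier restricted to the given feature indices."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(features))
        for j, f in enumerate(features):
            acc[j] += x[f]
        counts[y] = counts.get(y, 0) + 1
    means = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x):
        # Squared Euclidean distance to each class mean, smallest wins.
        return min(
            means,
            key=lambda y: sum((x[f] - m) ** 2 for f, m in zip(features, means[y])),
        )

    hits = sum(predict(x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


def rank_single(score, n_features):
    """Ranking 1: order features by the score each one reaches on its own."""
    return sorted(range(n_features), key=lambda f: score([f]), reverse=True)


def sffs(score, n_features, target_size):
    """Ranking 2: sequential forward floating search maximizing score(subset)."""
    selected = []
    best = {}  # subset size -> (best score seen, subset)
    while len(selected) < target_size:
        # Forward step: add the feature that helps the current subset most.
        candidates = [f for f in range(n_features) if f not in selected]
        add = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [add]
        current = score(selected)
        if current > best.get(len(selected), (float("-inf"), None))[0]:
            best[len(selected)] = (current, list(selected))
        # Floating step: drop features while the smaller subset beats the
        # best subset of that size found so far.
        while len(selected) > 2:
            drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            reduced_score = score(reduced)
            if reduced_score > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (reduced_score, list(reduced))
            else:
                break
    return best[target_size][1]


# Toy data (assumption): feature 0 separates the classes, feature 1 is
# uninformative, feature 2 is weakly informative.
X = [[0.0, 5.0, 0.1], [0.2, 1.0, 0.0], [0.1, 3.0, 0.9],
     [1.0, 4.0, 0.9], [0.9, 2.0, 1.0], [1.1, 3.0, 0.2]]
y = ["a", "a", "a", "b", "b", "b"]


def score(subset):
    return nearest_mean_accuracy(X, y, X, y, subset)


single_ranking = rank_single(score, n_features=3)
subset = sffs(score, n_features=3, target_size=2)
```

In practice the score would come from held-out data rather than resubstitution; the wrapper structure stays the same.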

Slide -8-

Acoustic Analysis

Feature selection (2/2)

Top 10 features

Acoustic feature                                               SFFS rank   Single perf.
Pitch, maximum gradient                                        1           31.5
Pitch, standard deviation of distance between reversal points  2           23.0
Pitch, mean value                                              3           25.6
Signal, number of zero-crossings                               4           16.9
Pitch, standard deviation                                      5           27.6
Duration of silences, mean value                               6           17.5
Duration of voiced sounds, mean value                          7           18.5
Energy, median of fall-time                                    8           17.8
Energy, mean distance between reversal points                  9           19.0
Energy, mean of rise-time                                      10          17.6

Slide -9-

Acoustic Analysis

Classification

Evaluation of various classification methods

33 features

Classifier   Error, % (speaker indep.)   Error, % (speaker dep.)
kMeans       57.05                       27.38
kNN          30.41                       17.39
GMM          25.17                       10.88
MLP          26.86                       9.36
SVM          23.88                       7.05
ML-SVM       18.71                       9.05

Output: Vector of (pseudo-) recognition confidences

Slide -10-

Acoustic Analysis

Classification

Multi-Layer Support Vector Machines

[Decision tree] Input: acoustic feature vector

Layer 1: ang, ntl, fea, joy / dis, sur, sad

Layer 2: ang, ntl / fea, joy; dis, sur / sad

Layer 3: ang / ntl; fea / joy; dis / sur

Leaves: ang, ntl, fea, joy, sad, dis, sur

No confidence vector to forward to fusion
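
The layered splits above can be read as a binary decision tree. The following Python sketch encodes that tree and descends it with one binary decision per inner node. The `decide` callback is a hypothetical stand-in for a trained binary SVM at each node, and the toy decision function is invented so that the example runs; neither comes from the talk.

```python
# Binary tree taken from the slide: each inner node splits two emotion groups.
TREE = (
    (("ang", "ntl"), ("fea", "joy")),
    (("dis", "sur"), "sad"),
)


def leaves(node):
    """All emotion labels below a node, left to right."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def classify(features, tree, decide):
    """Descend the tree; decide(features, left, right) is True for the left branch."""
    node = tree
    while not isinstance(node, str):
        left, right = node
        node = left if decide(features, left, right) else right
    return node


def toy_decide(features, left, right):
    # Hypothetical stand-in for a per-node binary SVM: it simply looks up a
    # label stored in the input, so the example is deterministic.
    return features["label"] in leaves(left)


result = classify({"label": "joy"}, TREE, toy_decide)
```

The result is a single leaf label rather than a score per emotion, which matches the slide's remark that this classifier yields no confidence vector for the fusion stage.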

Slide -11-

System Overview

[Block diagram repeated from Slide -3-]

Slide -12-

Semantic Analysis

ASR unit

HMM-based

German vocabulary of 1300 words

No language model

5-best phrase hypotheses

Recognition confidences per word

Example output (first hypothesis):

Word         I      can't  stand  this   every  tray   traffic-jam
Confidence   69.3   34.6   72.1   20.0   36.1   15.9   55.8

Slide -13-

Semantic Analysis

Conditions

Natural language

Erroneous speech recognition

Uncertain knowledge

Incomplete knowledge

Superfluous knowledge

Probabilistic spotting approach

Bayesian Belief Networks

Slide -14-

Semantic Analysis

Bayesian Belief Networks

Acyclic graph of nodes and directed edges

One state variable $X_i$ per node (here states $x_i$, $\bar{x}_i$)

Setting node dependencies via conditional probability matrices (C: child, P: parent):

$$P(X_C \mid X_P) = \begin{pmatrix} P(x_C \mid x_P) & P(x_C \mid \bar{x}_P) \\ P(\bar{x}_C \mid x_P) & P(\bar{x}_C \mid \bar{x}_P) \end{pmatrix}$$

Setting initial probabilities in root nodes:

$$P(X_R) = \left( P(x_R),\ P(\bar{x}_R) \right)^T$$

Observation A causes evidence in a child node (i.e. $P(x_C)$ is known)

Inference to direct parent nodes and finally to root nodes

Bayes' rule:

$$P(X_P \mid X_C) = \frac{P(X_C \mid X_P)\, P(X_P)}{P(X_C)}$$
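
As a worked illustration of the inference step, the Python sketch below applies Bayes' rule to a network with one parent and one child, each with two states. The probabilities are invented for the example and the function names are assumptions; the conditional probability matrix is stored with one row per child state and one column per parent state.

```python
def child_marginal(cpt, prior):
    """P(X_C): sum over parent states of P(X_C | X_P) * P(X_P)."""
    return [
        sum(cpt[c][p] * prior[p] for p in range(len(prior)))
        for c in range(len(cpt))
    ]


def parent_posterior(cpt, prior, child_state):
    """P(X_P | observed child state) via Bayes' rule."""
    joint = [cpt[child_state][p] * prior[p] for p in range(len(prior))]
    evidence = sum(joint)  # equals P(X_C = child_state)
    return [j / evidence for j in joint]


# State index 0 means "true", index 1 means "not true".
# Made-up numbers: the parent could be an emotion node, the child a word node.
prior = [0.2, 0.8]          # P(x_P), P(not x_P)
cpt = [
    [0.6, 0.1],             # P(x_C | x_P),     P(x_C | not x_P)
    [0.4, 0.9],             # P(not x_C | x_P), P(not x_C | not x_P)
]

marginal = child_marginal(cpt, prior)
posterior = parent_posterior(cpt, prior, child_state=0)
```

With these numbers, observing the child raises the belief in the parent from 0.2 to 0.12 / 0.20 = 0.6. In a deeper network the same update is repeated from each child to its parent until the root nodes are reached.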

Slide -15-

Semantic Analysis

Emotion modelling

[Figure: layered Bayesian network for emotion modelling; only the labels below are recoverable]

Levels: Input level, Words, Superwords, Phrases, Super-phrases, Interpretation

Processing steps: Spotting, Clustering, Sequence Handling

Example input: I can't stand this nasty every tray traffic-jam

Words: can't, stand, nasty

Superwords: cannot, stand, bad, disgusting

Further node labels: I, first_person, I_hate, I_like, Bad, Good, Abhorrence, Negative, Positive, ...

Interpretation nodes: Joy, Disgust, Anger, ...

Output: Vector of "real" recognition confidences

Slide -16-

System Overview

[Block diagram repeated from Slide -3-]

Slide -17-

Stream Fusion

Pairwise mean:

$$P_{\text{fusion}}(E_n) = \tfrac{1}{2}\left( P_{\text{acoustic}}(E_n) + P_{\text{semantic}}(E_n) \right)$$

Discriminative fusion applying MLP

Input layer: 2 x 7 confidences

Hidden layer: 100 nodes

Output layer: 7 recognition confidences

Decision:

$$\arg\max_n P_{\text{fusion}}(E_n)$$
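
Both fusion variants can be sketched in Python as follows. The confidence values are invented, and the MLP uses random, untrained weights purely to show the 14-100-7 layer sizes from the slide. The tanh hidden units and the softmax output are assumptions, since the slide does not name the activation functions.

```python
import math
import random

EMOTIONS = ["ang", "dis", "fea", "joy", "ntl", "sad", "sur"]


def fuse_by_mean(acoustic, semantic):
    """Pairwise mean of the two confidence vectors."""
    return [(a + s) / 2 for a, s in zip(acoustic, semantic)]


def decide(confidences):
    """Emotion with the highest fused confidence (arg max over n)."""
    best = max(range(len(confidences)), key=lambda n: confidences[n])
    return EMOTIONS[best]


def mlp_forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer (tanh, assumed) followed by a softmax output (assumed)."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_out, b_out)
    ]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up confidence vectors, ordered as in EMOTIONS.
acoustic = [0.10, 0.40, 0.05, 0.05, 0.10, 0.20, 0.10]
semantic = [0.30, 0.35, 0.05, 0.05, 0.05, 0.10, 0.10]

mean_fused = fuse_by_mean(acoustic, semantic)
mean_decision = decide(mean_fused)

# Untrained weights, only to demonstrate the shapes: 2 x 7 inputs, 100 hidden, 7 outputs.
random.seed(0)
w_hidden = [[random.uniform(-0.1, 0.1) for _ in range(14)] for _ in range(100)]
b_hidden = [0.0] * 100
w_out = [[random.uniform(-0.1, 0.1) for _ in range(100)] for _ in range(7)]
b_out = [0.0] * 7

mlp_fused = mlp_forward(acoustic + semantic, w_hidden, b_hidden, w_out, b_out)
```

The mean treats both streams as equally reliable for every emotion, whereas a trained MLP can learn which stream to trust for which class.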

Slide -18-

Results

Acoustic recognition rates (SVM):

Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
%         95.5   61.3   78.7   75.1   78.5   62.1   68.3   74.2

Semantic recognition rates:

Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
%         78.4   71.2   53.4   57.7   56.0   35.0   65.5   59.6

Slide -19-

Results

Recognition rates after discriminative fusion:

Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
%         98.0   78.7   88.3   95.9   98.2   91.7   95.8   92.0

Overview:

     Acoustic information   Language information   Fusion by means   Fusion by MLP
%    74.2                   59.6                   83.1              92.0

Slide -20-

Summary

Acted emotions

7 discrete emotion categories

Prosodic feature selection via

Single feature performance

Sequential forward floating search

Evaluative comparison of different classifiers

SVMs outperform the other classifiers

Semantic analysis applying Bayesian Networks

Significant gain by discriminative stream fusion

Slide -21-
