0409 HUMAINE WP4 Santorin SpeechEmotionRecognition_v3


Occasion: HUMAINE / WP4 / Workshop

"From Signals to Signs of Emotion and Vice Versa"

Santorini / Fira, 18th-22nd September 2004

Talk: Ronald Müller

Speech Emotion Recognition Combining Acoustic and Semantic Analyses

Institute for Human-Machine Communication, Technische Universität München


Outline

System Overview

Emotional Speech Corpus

Acoustic Analysis

Semantic Analysis

Stream Fusion

Results


System Overview

Speech signal

Acoustic stream: Prosodic features, Classifier (SVM)

Semantic stream: ASR unit, Semantic interpretation (Bayesian Networks)

Stream fusion (MLP)

Emotion


Emotional Speech Corpus

Emotion set: anger, disgust, fear, joy, neutrality, sadness, surprise

Corpus 1: Practical course

404 acted samples per emotion

13 speakers (1 female)

Recorded within one year

Corpus 2: Driving simulator

500 spontaneous emotion samples

200 acted samples (disgust, sadness)



Total number of samples: 2828 (Corpus 1), 700 (Corpus 2)




Acoustic Analysis

Low-level features

Pitch contour (AMDF, low-pass filtering)

Energy contour

Spectrum

Signal

High-level features

Statistical analysis of contours

Elimination of the mean, normalization to the standard deviation

Duration of one utterance (1-5 seconds)
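The low-level and high-level steps above can be illustrated with a minimal Python sketch. It assumes NumPy, a single voiced frame as input, and a pitch search range of 60 to 400 Hz. The function names, the search range and the chosen statistics are illustrative assumptions, not the implementation used in the talk.

```python
import numpy as np

def amdf_pitch(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the pitch (Hz) of one voiced frame with the AMDF.

    f_min and f_max are assumed search limits, not values from the slides.
    """
    frame = np.asarray(frame, dtype=float)
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    lags = np.arange(lag_min, lag_max + 1)
    # AMDF: mean absolute difference between the frame and its shifted copy.
    amdf = np.array([np.mean(np.abs(frame[lag:] - frame[:-lag])) for lag in lags])
    # The lag with the smallest difference is taken as the pitch period.
    return sample_rate / lags[np.argmin(amdf)]

def contour_statistics(contour):
    """A few example statistics of a pitch or energy contour."""
    c = np.asarray(contour, dtype=float)
    return {
        "mean": c.mean(),
        "std": c.std(),
        "max_gradient": np.max(np.abs(np.diff(c))),
    }

def standardize(features):
    """Remove the mean of each feature column and scale it to unit std."""
    f = np.asarray(features, dtype=float)
    std = f.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at zero instead of dividing by 0
    return (f - f.mean(axis=0)) / std
```

A wide search range can make the AMDF pick a multiple of the true period, which is one reason a pitch contour is usually smoothed afterwards.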


Acoustic Analysis: Feature selection (1/2)

Initial set of 200 statistical features

Ranking 1: Single performance of each feature (nearest-mean classifier)

Ranking 2: Sequential Forward Floating Search (SFFS), wrapped around a nearest-mean classifier
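A minimal sketch of how a sequential forward floating search can wrap a nearest-mean classifier is given here in Python with NumPy. The train/test split used for scoring, the function names and the stopping rule are assumptions for illustration. The slides do not state how feature subsets were scored.

```python
import numpy as np

def nearest_mean_accuracy(X_train, y_train, X_test, y_test):
    """Accuracy of a nearest-mean classifier (Euclidean distance)."""
    classes = np.unique(y_train)
    means = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - means[None, :, :], axis=2)
    return float(np.mean(classes[np.argmin(dists, axis=1)] == y_test))

def sffs(X_train, y_train, X_test, y_test, n_select):
    """Return n_select feature indices found by a floating forward search."""
    def score(subset):
        cols = list(subset)
        return nearest_mean_accuracy(X_train[:, cols], y_train,
                                     X_test[:, cols], y_test)

    selected = []
    best = {}  # subset size -> (best score so far, feature list)
    while len(selected) < n_select:
        # Forward step: add the feature that improves the score most.
        remaining = [f for f in range(X_train.shape[1]) if f not in selected]
        f_add = max(remaining, key=lambda f: score(selected + [f]))
        selected = selected + [f_add]
        k = len(selected)
        s = score(selected)
        if s > best.get(k, (-1.0, None))[0]:
            best[k] = (s, list(selected))
        # Floating step: drop a feature while that beats the best
        # subset of the smaller size found so far.
        while len(selected) > 2:
            f_drop = max(selected,
                         key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != f_drop]
            s_red = score(reduced)
            if s_red > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (s_red, list(reduced))
            else:
                break
    return best[n_select][1]
```

Scoring each single-feature subset with `nearest_mean_accuracy` alone corresponds to the first ranking. The search adds the interaction between features that the single-feature ranking cannot see.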


Acoustic Analysis: Feature selection (2/2)

Top 10 features

Acoustic feature                                                SFFS rank   Single perf.
Pitch, maximum gradient                                         1           31.5
Pitch, standard deviation of distance between reversal points   2           23.0
Pitch, mean value                                               3           25.6
Signal, number of zero-crossings                                4           16.9
Pitch, standard deviation                                       5           27.6
Duration of silences, mean value                                6           17.5
Duration of voiced sounds, mean value                           7           18.5
Energy, median of fall-time                                     8           17.8
Energy, mean distance between reversal points                   9           19.0
Energy, mean of rise-time                                       10          17.6


Acoustic Analysis: Classification

Evaluation of various classification methods

33 features

Classifier   Error, % (speaker indep.)   Error, % (speaker dep.)
kMeans       57.05                       27.38
kNN          30.41                       17.39
GMM          25.17                       10.88
MLP          26.86                       9.36
SVM          23.88                       7.05
ML-SVM       18.71                       9.05

Output: Vector of (pseudo-) recognition confidences


Acoustic Analysis: Classification

Multi-Layer Support Vector Machines

Input: acoustic feature vector

Layer 1: ang, ntl, fea, joy / dis, sur, sad

Layer 2: ang, ntl / fea, joy and dis, sur / sad

Layer 3: ang / ntl, fea / joy, dis / sur

Result: ang, ntl, fea, joy, dis, sur, sad

No confidence vector to forward to fusion
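The layered decision scheme can be sketched as a binary tree that is descended one hard decision at a time. In the Python sketch below the tree follows the grouping on the slide. The `decide` callback is a hypothetical stand-in for the trained binary SVM at each node and is not part of the original system.

```python
# Inner nodes are pairs (left, right); leaves are emotion labels.
TREE = (
    (("ang", "ntl"), ("fea", "joy")),
    (("dis", "sur"), "sad"),
)

def classify(x, node, decide, path=()):
    """Descend the tree until a single emotion is left.

    decide(x, path) is an assumed callback that returns 0 (left) or
    1 (right) for the node reached by `path`; in a real system it would
    wrap the binary SVM trained for that node.
    """
    if isinstance(node, str):
        return node
    branch = decide(x, path)
    return classify(x, node[branch], decide, path + (branch,))
```

Each call returns only the winning label, which illustrates why this structure yields no confidence vector for the other six emotions.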




Semantic Analysis

ASR unit

HMM-based

German vocabulary of 1300 words

No language model

5-best phrase hypotheses

Recognition confidences per word

Example output (first hypothesis):

I      can't   stand   this   every   tray   traffic-jam
69.3   34.6    72.1    20.0   36.1    15.9   55.8


Semantic Analysis: Conditions

Natural language

Erroneous speech recognition

Uncertain knowledge

Incomplete knowledge

Superfluous knowledge

Probabilistic spotting approach

Bayesian Belief Networks


Semantic Analysis: Bayesian Belief Networks

Acyclic graph of nodes and directed edges

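One state variable $X_i$ per node (here states $x_i$, $\bar{x}_i$)

Setting node dependencies via conditional probability matrices:

$$P(X_{Child} \mid X_{Parent}) = P(X_C \mid X_P) = \begin{pmatrix} P(x_C \mid x_P) & P(x_C \mid \bar{x}_P) \\ P(\bar{x}_C \mid x_P) & P(\bar{x}_C \mid \bar{x}_P) \end{pmatrix}$$

Setting initial probabilities in root nodes:

$$P(X_R) = \left( P(x_R),\; P(\bar{x}_R) \right)^T$$

Observation A causes evidence in a child node (i.e. $P(x_C)$ is known)

Inference to direct parent nodes and finally to root nodes

Bayes' rule:

$$P(X_P \mid X_C) = \frac{P(X_C \mid X_P)\, P(X_P)}{P(X_C)}$$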
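A small numeric sketch may help to see how evidence in a child node propagates to its parent via Bayes' rule. It assumes binary nodes, a conditional probability matrix whose rows index the child state and whose columns index the parent state, and soft evidence that is combined by weighting the per-state posteriors. These conventions and the function name are assumptions, not taken from the talk.

```python
import numpy as np

def infer_parent(prior_parent, cpm, child_evidence):
    """Update the belief in a parent node from evidence in its child.

    prior_parent:   P(X_P), order (x_P, not x_P)
    cpm:            cpm[c, p] = P(X_C = c | X_P = p), columns sum to 1
    child_evidence: observed belief over the child states, sums to 1
    """
    prior_parent = np.asarray(prior_parent, dtype=float)
    cpm = np.asarray(cpm, dtype=float)
    child_evidence = np.asarray(child_evidence, dtype=float)
    # P(X_C) as predicted by the prior (assumed non-zero for every state).
    p_child = cpm @ prior_parent
    # Bayes' rule per child state: posterior[c, p] = P(X_P = p | X_C = c).
    posterior = cpm * prior_parent[None, :] / p_child[:, None]
    # Weight the per-state posteriors by the (possibly soft) evidence.
    return child_evidence @ posterior

# Usage: a parent with prior 0.2 and a child that follows it 90% of the time.
belief = infer_parent([0.2, 0.8], [[0.9, 0.1], [0.1, 0.9]], [1.0, 0.0])
```

Applying the same update from each parent to its own parent carries the evidence up to the root nodes.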


Semantic Analysis: Emotion modelling

Figure: Bayesian network for emotion modelling, built up in levels

Levels: Input level, Words, Superwords, Phrases, Super-phrases, Interpretation

Processing steps between levels: Spotting, Clustering, Sequence handling

Example input: I can't stand this nasty every tray traffic-jam

Spotted words: can't stand, nasty

Clustered to: cannot stand, bad, disgusting

Other example nodes: I, first_person, I_hate, I_like, Bad, Good, Abhorrence, Negative, Positive

Interpretation nodes: Disgust, Anger, Joy, ...

Output: Vector of real recognition confidences




Stream Fusion

Pairwise mean:

$$P_{fusion}(E_n) = \frac{1}{2}\left( P_{acoustic}(E_n) + P_{semantic}(E_n) \right)$$

Discriminative fusion applying MLP

Input layer: 2 x 7 confidences

Hidden layer: 100 nodes

Output layer: 7 recognition confidences

Recognized emotion:

$$\arg\max_{n} P_{fusion}(E_n)$$
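Both fusion variants can be sketched in a few lines of Python with NumPy. The layer sizes (14 inputs, 100 hidden nodes, 7 outputs) come from the slide. The tanh and softmax activations, the emotion order and the weight arguments are assumptions, since the slides do not specify them, and the weights would have to be trained on labelled data.

```python
import numpy as np

EMOTIONS = ["ang", "dis", "fea", "joy", "ntl", "sad", "sur"]

def fuse_by_mean(p_acoustic, p_semantic):
    """Pairwise mean of the two 7-dimensional confidence vectors."""
    return 0.5 * (np.asarray(p_acoustic, float) + np.asarray(p_semantic, float))

def fuse_by_mlp(p_acoustic, p_semantic, w1, b1, w2, b2):
    """Forward pass of a 14-100-7 MLP over the concatenated confidences.

    w1: (100, 14), b1: (100,), w2: (7, 100), b2: (7,) are assumed to be
    trained elsewhere; tanh and softmax are illustrative choices.
    """
    x = np.concatenate([p_acoustic, p_semantic])  # 2 x 7 = 14 inputs
    h = np.tanh(w1 @ x + b1)                      # 100 hidden nodes
    z = w2 @ h + b2                               # 7 outputs
    e = np.exp(z - z.max())                       # numerically stable softmax
    return e / e.sum()

def decide(p_fusion):
    """Pick the emotion with the highest fused confidence."""
    return EMOTIONS[int(np.argmax(p_fusion))]
```

Unlike the mean, the MLP can learn per-emotion weightings of the two streams, for example trusting one stream more for some emotions than for others.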



Results

Acoustic recognition rates (SVM):

Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
%         95.5   61.3   78.7   75.1   78.5   62.1   68.3   74.2

Semantic recognition rates:

Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
%         78.4   71.2   53.4   57.7   56.0   35.0   65.5   59.6


Recognition rates after discriminative fusion:

Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
%         98.0   78.7   88.3   95.9   98.2   91.7   95.8   92.0

Overview:

Method   Acoustic information   Language information   Fusion by mean   Fusion by MLP
%        74.2                   59.6                   83.1             92.0


Summary

Acted Emotions

7 discrete emotion categories

Prosodic feature selection via

Single feature performance

Sequential forward floating search

Evaluative comparison of different classifiers

SVMs outperform the other classifiers

Semantic analysis applying Bayesian Networks

Significant gain by discriminative stream fusion

