0409 HUMAINE WP4 Santorin SpeechEmotionRecognition_v3

  • 7/31/2019 0409 HUMAINE WP4 Santorin SpeechEmotionRecognition_v3

    Occasion: HUMAINE / WP4 / Workshop

    "From Signals to Signs of Emotion and Vice Versa"

    Santorin / Fira, 18th to 22nd September, 2004

    Talk: Ronald Müller

    Speech Emotion Recognition: Combining Acoustic and Semantic Analyses

    Institute for Human-Machine Communication

    Technische Universität München

    Slide -2-

    Outline

    System Overview

    Emotional Speech Corpus

    Acoustic Analysis

    Semantic Analysis

    Stream Fusion

    Results

    Slide -3-

    System Overview

    Speech signal

    Acoustic stream: Prosodic features, then Classifier (SVM)

    Semantic stream: ASR unit, then Semantic interpretation (Bayesian Networks)

    Stream fusion (MLP)

    Emotion

    Slide -4-

    Emotional Speech Corpus

    Emotion set: Anger, disgust, fear, joy, neutrality, sadness, surprise

    Corpus 1: Practical course

    404 acted samples per emotion

    13 speakers (1 female)

    Recorded within one year

    Total: 2828 samples

    Corpus 2: Driving simulator

    500 spontaneous emotion samples

    200 acted samples (disgust, sadness)

    Total: 700 samples

    Slide -6-

    Acoustic Analysis

    Low-level features

    Pitch contour (AMDF, low-pass filtering)

    Energy contour

    Spectrum

    Signal

    High-level features

    Statistical analysis of contours

    Elimination of mean, normalization to standard deviation

    Duration of one utterance (1-5 seconds)
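The pitch and normalization steps listed above can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: the search range (`f_min`, `f_max`), the `tolerance` used to pick the first deep AMDF valley, and all function names are assumptions, and the low-pass filtering of the contour is left out.

```python
import math

def amdf(frame, lag):
    """Average magnitude difference between the frame and its copy shifted by `lag`."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n

def amdf_pitch(frame, sample_rate, f_min=60.0, f_max=400.0, tolerance=0.05):
    """Estimate the pitch (Hz) of one voiced frame from the first deep AMDF valley.

    f_min, f_max and tolerance are illustrative defaults, not values from the slides.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    values = {lag: amdf(frame, lag) for lag in range(lag_min, lag_max + 1)}
    lowest, highest = min(values.values()), max(values.values())
    # Take the shortest lag close to the global minimum to avoid octave errors.
    threshold = lowest + tolerance * (highest - lowest)
    for lag in sorted(values):
        if values[lag] <= threshold:
            return sample_rate / lag

def normalize_contour(contour):
    """Remove the mean and scale the contour to unit standard deviation."""
    mean = sum(contour) / len(contour)
    std = math.sqrt(sum((v - mean) ** 2 for v in contour) / len(contour))
    if std == 0.0:
        return [0.0] * len(contour)
    return [(v - mean) / std for v in contour]
```

Statistical features such as the mean or standard deviation of the pitch contour would then be computed over the per-frame pitch values of one utterance.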

    Slide -7-

    Acoustic Analysis

    Feature selection (1/2)

    Initial set of 200 statistical features

    Ranking 1: Individual performance of each feature (nearest-mean classifier)

    Ranking 2: Sequential Forward Floating Search (SFFS), wrapped around a nearest-mean classifier
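Both rankings can be sketched with a small nearest-mean classifier. This is a hedged illustration under stated assumptions: the score is the accuracy on the training data itself (the slides do not say how the nearest-mean classifier was evaluated), and the function names are hypothetical.

```python
def nearest_mean_accuracy(samples, labels, feature_idx):
    """Accuracy of a nearest-mean classifier restricted to the given feature indices.

    Assumption: scored on the training data itself (resubstitution).
    """
    classes = sorted(set(labels))
    means = {}
    for c in classes:
        rows = [s for s, l in zip(samples, labels) if l == c]
        means[c] = [sum(r[j] for r in rows) / len(rows) for j in feature_idx]
    correct = 0
    for s, l in zip(samples, labels):
        point = [s[j] for j in feature_idx]
        pred = min(classes, key=lambda c: sum((p - m) ** 2
                                              for p, m in zip(point, means[c])))
        correct += pred == l
    return correct / len(samples)

def rank_single_features(samples, labels):
    """Ranking 1: order features by their individual nearest-mean accuracy."""
    scores = [(nearest_mean_accuracy(samples, labels, [j]), j)
              for j in range(len(samples[0]))]
    return [j for _, j in sorted(scores, key=lambda t: (-t[0], t[1]))]

def sffs(samples, labels, target_size):
    """Ranking 2: Sequential Forward Floating Search up to `target_size` features."""
    score = lambda subset: nearest_mean_accuracy(samples, labels, subset)
    selected, best_by_size = [], {}
    while len(selected) < target_size:
        # Forward step: add the feature that helps the current subset most.
        candidates = [j for j in range(len(samples[0])) if j not in selected]
        selected = selected + [max(candidates, key=lambda j: score(selected + [j]))]
        s = score(selected)
        if len(selected) not in best_by_size or s > best_by_size[len(selected)][0]:
            best_by_size[len(selected)] = (s, list(selected))
        # Floating step: drop features while that beats the best smaller subset.
        while len(selected) > 2:
            worst = max(selected, key=lambda j: score([k for k in selected if k != j]))
            reduced = [k for k in selected if k != worst]
            s_reduced = score(reduced)
            if s_reduced <= best_by_size[len(reduced)][0]:
                break
            selected = reduced
            best_by_size[len(reduced)] = (s_reduced, list(reduced))
    return best_by_size[target_size][1]
```

With the 200 statistical features of the talk, `samples` would hold one 200-dimensional vector per utterance and `labels` the emotion of each utterance.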

    Slide -8-

    Acoustic Analysis

    Feature selection (2/2)

    Top 10 features

    Acoustic feature                                                SFFS rank   Single perf.
    Pitch, maximum gradient                                         1           31.5
    Pitch, standard deviation of distance between reversal points   2           23.0
    Pitch, mean value                                               3           25.6
    Signal, number of zero-crossings                                4           16.9
    Pitch, standard deviation                                       5           27.6
    Duration of silences, mean value                                6           17.5
    Duration of voiced sounds, mean value                           7           18.5
    Energy, median of fall-time                                     8           17.8
    Energy, mean distance between reversal points                   9           19.0
    Energy, mean of rise-time                                       10          17.6

    Slide -9-

    Acoustic Analysis

    Classification

    Evaluation of various classification methods

    33 features

    Classifier   Error, % (speaker indep.)   Error, % (speaker dep.)
    kMeans       57.05                       27.38
    kNN          30.41                       17.39
    GMM          25.17                       10.88
    MLP          26.86                        9.36
    SVM          23.88                        7.05
    ML-SVM       18.71                        9.05

    Output: Vector of (pseudo-) recognition confidences

    Slide -10-

    Acoustic Analysis

    Classification

    Multi-Layer Support Vector Machines (ML-SVM)

    acoustic feature vector

    Layer 1: ang, ntl, fea, joy / dis, sur, sad

    Layer 2: ang, ntl / fea, joy    dis, sur / sad

    Layer 3: ang / ntl    fea / joy    dis / sur

    Output: ang    ntl    fea    joy    dis    sur    sad

    No confidence vector to forward to fusion
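The layered decision scheme can be sketched as a binary tree in which every inner node makes one two-way decision. To keep the sketch dependency-free, each node uses a nearest-centroid rule as a stand-in for the binary SVM of the talk; that substitution, the `TREE` encoding, and the function names are assumptions for illustration only.

```python
# Emotion grouping taken from the slide; each tuple is one binary split.
TREE = ((("ang", "ntl"), ("fea", "joy")), (("dis", "sur"), "sad"))

def leaves(node):
    """All emotion labels below a node."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])

def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def train(node, samples, labels):
    """Fit one binary decision per inner node.

    Stand-in for the SVMs: each branch is represented by the centroid of
    all training samples whose emotion lies below that branch.
    """
    if isinstance(node, str):
        return {}
    model = {}
    branch_centroids = []
    for branch in node:
        members = set(leaves(branch))
        rows = [s for s, l in zip(samples, labels) if l in members]
        branch_centroids.append(centroid(rows))
        model.update(train(branch, samples, labels))
    model[node] = branch_centroids
    return model

def classify(node, model, x):
    """Walk from the root to a leaf, taking the closer branch at every node."""
    while not isinstance(node, str):
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in model[node]]
        node = node[0] if dists[0] <= dists[1] else node[1]
    return node
```

The walk returns a single hard label, which matches the remark on the slide: a cascade of this kind yields no per-emotion confidence vector that could be passed on to the fusion stage.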

    Slide -12-

    Semantic Analysis

    ASR unit: HMM-based

    German vocabulary of 1300 words

    No language model

    5-best phrase hypotheses

    Recognition confidences per word

    Example output (first hypothesis):

    Word         I      can't   stand   this   every   tray   traffic-jam
    Confidence   69.3   34.6    72.1    20.0   36.1    15.9   55.8

    Slide -13-

    Semantic Analysis

    Conditions:

    Natural language

    Erroneous speech recognition

    Uncertain knowledge

    Incomplete knowledge

    Superfluous knowledge

    Probabilistic spotting approach

    Bayesian Belief Networks

    Slide -14-

    Semantic Analysis

    Bayesian Belief Networks

    Acyclic graph of nodes and directed edges

    One state variable $X_i$ per node (here with two states, $x_i$ and $\bar{x}_i$)

    Setting node dependencies via conditional probability matrices (P: parent, C: child):

    $$P(X_C \mid X_P) = \begin{pmatrix} P(x_C \mid x_P) & P(x_C \mid \bar{x}_P) \\ P(\bar{x}_C \mid x_P) & P(\bar{x}_C \mid \bar{x}_P) \end{pmatrix}$$

    Setting initial probabilities in root nodes:

    $$P(X_R) = \left( P(x_R),\ P(\bar{x}_R) \right)^T$$

    Observation A causes evidence in a child node (i.e. $X_C$ is known)

    Inference to direct parent nodes and finally to root nodes

    Bayes' rule:

    $$P(X_P \mid X_C) = \frac{P(X_C \mid X_P)\, P(X_P)}{P(X_C)}$$
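The inference step from an observed child node back to its parent can be illustrated with a small numeric sketch of Bayes' rule. The prior and the conditional probability matrix below are made-up numbers, and `posterior_parent` is a hypothetical helper, not part of the system described in the talk.

```python
def posterior_parent(prior, cpt, child_state):
    """Posterior over the parent states after observing the child state.

    prior[p]  = P(X_P = p)
    cpt[c][p] = P(X_C = c | X_P = p)
    """
    joint = [cpt[child_state][p] * prior[p] for p in range(len(prior))]
    evidence = sum(joint)  # P(X_C = child_state)
    return [j / evidence for j in joint]

# Illustrative values only: state 0 means "present", state 1 means "absent".
prior = [0.2, 0.8]
cpt = [[0.9, 0.1],
       [0.1, 0.9]]
posterior = posterior_parent(prior, cpt, child_state=0)
```

With these numbers the joint terms are 0.18 and 0.08, so observing the child in state 0 raises the belief in parent state 0 from 0.2 to 0.18 / 0.26, roughly 0.69. Repeating the step from parent to grandparent carries the evidence up to the root nodes.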

    Slide -15-

    Semantic Analysis

    Emotion modelling

    [Figure: layered Bayesian Belief Network for emotion modelling]

    Levels: Input level, Words, Superwords, Phrases, Superphrases, Interpretation

    Operations between levels: Spotting, Clustering, Sequence handling

    Example nodes: I, I_hate, I_like, first_person, Bad, Good, Abhorrence, Negative, Positive, Joy, Anger, Disgust

    Example input: "I can't stand this nasty every tray traffic-jam"

    Spotted words: "can't stand", "nasty"

    Clustered terms: "cannot stand", "bad", "disgusting"

    Output: Vector of real recognition confidences

    Slide -17-

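    Stream Fusion

    Pairwise mean:

    $$P_{\mathrm{fusion}}(E_n) = \frac{1}{2} \left( P_{\mathrm{acoustic}}(E_n) + P_{\mathrm{semantic}}(E_n) \right)$$

    Discriminative fusion applying MLP

    Input layer: 2 x 7 confidences

    Hidden layer: 100 nodes

    Output layer: 7 recognition confidences

    Decision:

    $$\hat{E} = \arg\max_{E_n} P_{\mathrm{fusion}}(E_n)$$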
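Both fusion variants can be sketched in a few lines. The layer sizes (2 x 7 inputs, 100 hidden nodes, 7 outputs) come from the slide; the tanh hidden units, the softmax output, the random untrained weights, and all names are assumptions made only to keep the example runnable.

```python
import math
import random

EMOTIONS = ["ang", "dis", "fea", "joy", "ntl", "sad", "sur"]

def fuse_by_mean(acoustic, semantic):
    """Pairwise mean of the two confidence vectors."""
    return [(a + s) / 2 for a, s in zip(acoustic, semantic)]

def decide(confidences):
    """Pick the emotion with the highest fused confidence (arg max)."""
    best = max(range(len(confidences)), key=confidences.__getitem__)
    return EMOTIONS[best]

def init_mlp(n_in=14, n_hidden=100, n_out=7, seed=0):
    """Random, untrained weights; a real system would learn them from data."""
    rng = random.Random(seed)
    def layer(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"w1": layer(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "w2": layer(n_out, n_hidden), "b2": [0.0] * n_out}

def mlp_forward(params, acoustic, semantic):
    """Forward pass: 14 inputs, tanh hidden layer, softmax over 7 emotions."""
    x = list(acoustic) + list(semantic)
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(params["w1"], params["b1"])]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(params["w2"], params["b2"])]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up confidence vectors in the order of EMOTIONS.
acoustic = [0.60, 0.10, 0.05, 0.05, 0.10, 0.05, 0.05]
semantic = [0.30, 0.40, 0.05, 0.05, 0.10, 0.05, 0.05]
fused = fuse_by_mean(acoustic, semantic)
label = decide(fused)
mlp_out = mlp_forward(init_mlp(), acoustic, semantic)
```

In this made-up case the semantic stream alone would favour disgust, while the pairwise mean settles on anger. The MLP replaces the fixed averaging rule with a learned mapping from the 14 stream confidences to 7 fused confidences.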

    Slide -18-

    Results

    Acoustic recognition rates (SVM):

    Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
    %         95.5   61.3   78.7   75.1   78.5   62.1   68.3   74.2

    Semantic recognition rates:

    Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
    %         78.4   71.2   53.4   57.7   56.0   35.0   65.5   59.6

    Slide -19-

    Results

    Recognition rates after discriminative fusion:

    Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
    %         98.0   78.7   88.3   95.9   98.2   91.7   95.8   92.0

    Overview:

    Method   Acoustic information   Language information   Fusion by means   Fusion by MLP
    %        74.2                   59.6                   83.1              92.0

    Slide -20-

    Summary

    Acted emotions

    7 discrete emotion categories

    Prosodic feature selection via

    Single feature performance

    Sequential forward floating search

    Evaluative comparison of different classifiers

    SVMs outperform the other classifiers

    Semantic analysis applying Bayesian Networks

    Significant gain by discriminative stream fusion
