Some Statements on the Automatic Classification of Emotion in Speech
or: A Set of Hypotheses and Statements re Automatic Extraction and Evaluation of Acoustic (mainly Prosodic) and other Features for the Automatic Classification of Emotional States in Spontaneous Speech, Based on Experience and Educated Guessing
Anton Batliner, Christian Hacker, Elmar Nöth, Stefan Steidl
FAU (University of Erlangen)
HUMAINE WP4, Santorin, September 2004


Page 1

Some Statements on the Automatic Classification of Emotion in Speech

or: A Set of Hypotheses and Statements re Automatic Extraction and Evaluation of Acoustic (mainly Prosodic) and other Features for the Automatic Classification of Emotional States in Spontaneous Speech, Based on Experience and Educated Guessing

Anton Batliner, Christian Hacker, Elmar Nöth, Stefan Steidl

FAU (University of Erlangen)

HUMAINE WP4, Santorin, September 2004

Page 2

Overview

• why these hypotheses / statements?

• our basic approach:

– features
– classifiers
– examples

• the hypotheses / statements

• suggestions and outlook

Page 3

Why these Hypotheses

• significance and benchmarking in single studies are no proof – cumulative evidence is better, and that is how it works, i.e., via replication

• there's much expertise in HUMAINE at different sites

• to put forth "one man's opinion" to be compared with other experience

• hopefully, at the end of HUMAINE: corroboration, modification, or disproof

Page 4

Overview

• why these hypotheses / statements?

• our basic approach:

– features
– classifiers
– examples

• the hypotheses / statements

• suggestions and outlook

Page 5

Feature vector for a context of 2 words: 95 prosodic, 80 spectral, 30 POS

• mean values of duration, energy, F0  

• duration features: absolute; normalized with mean duration; absolute duration divided by number of syllables

• energy features: regression coefficient with mean square error; mean, maximum with position on the time axis, absolute; energy normalized with respect to the mean energy

• F0 features: regression coefficient with mean square error; mean, maximum, minimum, onset, and offset values, and their positions on the time axis; all F0 features logarithmized and normalized with respect to the mean F0 value

• length of pause before and after word

• HNR (harmonics-to-noise ratio) and formant-based features for the most frequent vowels (frequency and energy), MFCC

• 6 coarse POS (part-of-speech) features
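As an illustration of the F0 features listed above, here is a minimal sketch in Python (the frame values are invented; a real system would extract F0 with a pitch tracker and normalize per speaker or utterance):

```python
import math

def f0_features(f0, mean_f0):
    """Sketch of per-word F0 features in the spirit of the slide:
    log-F0 normalized to the mean F0 value, regression coefficient
    (slope) with mean square error, max/min/onset/offset values and
    their positions on the time axis."""
    # logarithmize and normalize with respect to the mean F0 value
    z = [math.log(v) - math.log(mean_f0) for v in f0]
    n = len(z)
    t = list(range(n))                       # positions on the time axis (frames)
    t_mean = sum(t) / n
    z_mean = sum(z) / n
    # least-squares regression coefficient (slope) and mean square error
    num = sum((ti - t_mean) * (zi - z_mean) for ti, zi in zip(t, z))
    den = sum((ti - t_mean) ** 2 for ti in t)
    slope = num / den
    intercept = z_mean - slope * t_mean
    mse = sum((zi - (slope * ti + intercept)) ** 2 for ti, zi in zip(t, z)) / n
    return {
        "slope": slope, "mse": mse,
        "max": max(z), "pos_max": z.index(max(z)),
        "min": min(z), "pos_min": z.index(min(z)),
        "onset": z[0], "offset": z[-1],
    }

# a rising F0 contour over four frames (invented values, in Hz)
feats = f0_features([110, 120, 130, 140], mean_f0=120.0)
```

The duration and energy features on the slide follow the same pattern: raw value, position on the time axis, and a normalization against a mean.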

Page 6

Classifiers

• Linear Discriminant Analysis (LDA), Decision Trees (DT, e.g., Classification and Regression Trees), Neural Networks (NN), Support Vector Machines (SVM), Language Models (LM), Gaussian Mixtures (GM), ...

• heretical suggestion: classifiers are not that important - at least in the context of HUMAINE
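The heretical suggestion can be made concrete with a toy example (data and both decision rules are invented stand-ins, not the classifiers named above): on reasonably separable features, two very different classifiers reach the same decision.

```python
# Two minimal classifiers: a nearest-mean rule (a crude LDA-like
# classifier with identity covariance) and a 1-nearest-neighbour rule.
def nearest_mean(train, labels, x):
    """Assign x to the class whose mean is closest."""
    means = {}
    for lab in set(labels):
        pts = [p for p, l in zip(train, labels) if l == lab]
        means[lab] = [sum(col) / len(pts) for col in zip(*pts)]
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return min(means, key=lambda lab: dist(means[lab], x))

def one_nn(train, labels, x):
    """Assign x the label of its nearest training point."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    i = min(range(len(train)), key=lambda i: dist(train[i], x))
    return labels[i]

# invented 2-D feature points for two classes
train = [(0.1, 0.2), (0.2, 0.1), (0.9, 1.0), (1.0, 0.8)]
labels = ["neutral", "neutral", "problem", "problem"]
test_point = (0.85, 0.9)
```

Both rules classify `test_point` as "problem"; on clean data, the choice of classifier matters far less than the choice of features.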

Page 7

Problem vs. no problem (= joy/neutral), LDA, Sympafly (automatic call centre), cross classification, different feature classes, turn-based classification

features                          RR    CL
(word-based) probabilities        77.4  75.6
prosodic                          75.8  72.1
part-of-speech (POS)              77.1  66.3
spectral                          74.4  67.1
linguistic                        72.5  68.7
pros + spec                       78.0  73.7
POS + ling                        74.8  70.5
prob + pros + POS + spec + ling   81.1  77.3

RR: overall recognition rate
CL: class-wise averaged recognition rate: mean of problem / no problem
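The relation between RR and CL can be sketched with an invented confusion matrix (the numbers below are made up, not from the Sympafly data): with unbalanced classes, RR is pulled towards the majority class, while CL averages the per-class recall.

```python
# rows: true class, cols: hypothesized class (invented counts)
confusion = {
    "problem":    {"problem": 30,  "no_problem": 20},    # 50 turns
    "no_problem": {"problem": 15,  "no_problem": 135},   # 150 turns
}

# RR: overall recognition rate = correctly classified / all turns
correct = sum(confusion[c][c] for c in confusion)
total = sum(sum(row.values()) for row in confusion.values())
RR = 100.0 * correct / total

# CL: class-wise averaged recognition rate = mean of per-class recalls
CL = sum(100.0 * confusion[c][c] / sum(confusion[c].values())
         for c in confusion) / len(confusion)
```

Here RR comes out at 82.5 but CL only at 75.0, which is why the table reports both.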

Page 8

Overview

• why these hypotheses / statements?

• our basic approach:

– features
– classifiers
– examples

• the hypotheses / statements

• suggestions and outlook

Page 9

The Hypothesis

• and the reason why

e.g.

Black Swans do not exist

• reason: I've never seen any (and I've seen plenty of white ones)

caveat: a few black swans do not falsify the hypothesis (the exception proves the rule because it is unlikely, i.e., it is no natural law)

Page 10

Best way to go: compute many (more or less) relevant features, and let the classifier do the work

• cons: features often (highly) correlated; only wood, no trees; a „brute force, dumb shotgun approach", i.e., interpretation not possible*

• pros: best classification possible, i.e., let the classifier do the work of finding and throwing away the irrelevant ones

• and even if optimization is too costly, still good performance

• note: our feature vector has been selected carefully: > 500 → 276 → 125 → 95

* Interpretation is something else, cf. below!

Page 11

Omnibus Approach: With many features, you can always use the same feature vector, irrespective of the task

• e.g., for accents, boundaries, questions, offtalk, repairs, etc. – and emotions

• this makes life easier = less effort

• and – possibly – classifiers more robust while coping with new tasks / databases (cf. questions)

• not much deterioration w.r.t. an optimized feature vector

• and there is not much difference between 79% and 81% – if it is not the world record but the overall quality of the system that counts, both are roughly 4/5

Page 12

No dramatic decline in performance if all features are used, still good performance if only a few are used

• but of course, these few have to be the most important ones, and they can only be obtained by automatic assessment

• slight deterioration with more than some 40 or fewer than some 20 features – can such results be generalized somehow?
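One minimal sketch of such an automatic assessment, assuming a simple per-feature filter score (the feature names and values below are invented; real feature selection would typically use the classifier itself in a wrapper scheme):

```python
# Rank features by a Fisher-style separability score and keep the
# best ones; an illustrative filter, not the method used on the slides.
def fisher_score(values, labels):
    """(difference of class means)^2 / sum of class variances,
    for one feature and a two-class problem."""
    stats = []
    for c in sorted(set(labels)):
        xs = [v for v, l in zip(values, labels) if l == c]
        m = sum(xs) / len(xs)
        var = sum((x - m) ** 2 for x in xs) / len(xs)
        stats.append((m, var))
    (m0, v0), (m1, v1) = stats
    return (m0 - m1) ** 2 / (v0 + v1 + 1e-9)   # small term avoids /0

labels = [0, 0, 0, 1, 1, 1]
features = {
    "f0_range":  [0.1, 0.2, 0.15, 0.8, 0.9, 0.85],   # separates the classes
    "pause_len": [0.5, 0.4, 0.6, 0.5, 0.6, 0.4],     # uninformative
}
ranked = sorted(features, key=lambda f: fisher_score(features[f], labels),
                reverse=True)
```

On this toy data, `f0_range` is ranked first; keeping only the top-k features of such a ranking is the kind of reduction the slide alludes to.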

Page 13

Raw values always contribute more than complex / combined values (other things being equal)

• because the classifier is better at evaluating the impact of single features than we are, and normally, there is some added (= more) information in raw values (range vs. max + min, integral vs. energy + duration, etc.)

Page 14

There are no „most important“ features in prosody

• all the features are – more or less – correlated with each other

• thus, it is not detrimental if one or some are missing

• and we are far away from any definite assessment

• still, of course, it might matter if one feature is added or missing – but if the effect is pronounced, chances are that you've done something wrong before (note: there are exceptions)

• and of course, different phenomena can be characterized by different features

Page 15

F0 features are not more important than other prosodic features

• 15 years ago, this would have been a very indecent suggestion in some sub-cultures (not in emotion research?)

• two possible – and not competing – reasons: they simply aren't, or they aren't because extraction is error-prone

• „They simply aren't“ means: a high F0 range correlates with longer duration and vice versa, etc. – does it matter which is the hen and which is the egg?

• assessment only possible with manually corrected feature values

Page 16

Intonational models are sub-optimal for use in automatic classification*

• They concentrate on pitch

• They are designed for something else

• because of quantisation error

* note: this holds for classification but not necessarily for synthesis / typology / etc. !

Page 17

Two new "dimensions"

[Schematic figure: x-axis „Interpretation“ (bad → good), y-axis „Class. Performance“ (bad → good); „many features“ and „few features“ are placed in this plane]

Page 18

Classification performance is negatively correlated with interpretability

• so it is best to separate them

• and maybe use different classifiers for the two tasks, e.g., LDA and DT (with PC) for interpretation, and – if you have plenty of time – NN or SVM for classification

• context features are good for performance but often not easy to interpret – maybe because of spurious effects; open questions: how to incorporate which context – to represent a „neutral“ baseline? „neutral“ w.r.t. speaker or task? which unit of analysis?

Page 19

Spectral features are not irrelevant but much less important than people would like to believe

• either because of extraction problems, sparse data, noise, or because they simply are not important / indicate different things:

• spectral features good at micro-level = segmental, prosodic features good at macro-level = supra-segmental

• the „sparse data problem“ in spontaneous speech is maybe most important, because spectral features depend much more on the segmental context than prosody does

• definitely no bi-uniqueness of form and function

Page 20

An example: What laryngealisation can indicate (phonation type / voice quality, prosodic and spectral features)

• accentuation

• vowels

• word boundaries

• native language

• the end of an utterance, i.e., turn-taking

• speaker idiosyncrasies

• speech pathology

• too many drinks / cigarettes

• competence / power

• social class membership

• and: emotional state (sadness, etc.)

Page 21

Multi-Modality does not enhance classification performance because:

emotions are no brontosauruses* but sausages**

• because humans are holistic beings, i.e., if the emotion is strong, then all simultaneous modalities are pronounced and vice versa (no hens, no eggs)

• of course, one modality might be „complementary“ in the absence / ambiguity of the other modality (the „open mouth problem“) → „sequential multi-modality“

• fusion is a problem in itself! No added value but added noise?

* All brontosauruses are thin at one end, much MUCH thicker in the middle, and then thin again at the far end. (J. Cleese alias Miss Anne Elk)

** Sausages are either thick or thin.
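A toy late-fusion sketch of the „sausage“ claim (all scores invented): when two modalities carry the same information, averaging their posteriors changes not a single decision.

```python
# Invented per-utterance P(emotional) scores from two channels that
# move together, as the "holistic beings" argument predicts.
speech = [0.90, 0.80, 0.20, 0.10]   # speech channel
face   = [0.85, 0.90, 0.15, 0.20]   # facial channel

decide = lambda scores: [s > 0.5 for s in scores]   # threshold decision

# naive late fusion: average the two posteriors
fused = [(s + f) / 2 for s, f in zip(speech, face)]
```

Here `decide(fused)` equals `decide(speech)` and `decide(face)`; fusion only helps in the „sequential“ case where one modality is absent or ambiguous.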

Page 22

These are my theories ...

Page 23

In "representative" data, peformance for a two-class problem is below 80%, for a 4-class problem, it is below

60%

• thus, results that are (much!) better are unicorns!

• this will hopefully change slightly, but we will face an upper limit pretty soon – unless new knowledge sources are detected and taken into account

• given the inconsistency of annotations, maybe such „low“ recognition rates are the best ones one can get?

• question: are there similar statements for (facial and hand) gestures?

Page 24

All this only holds for speaker-independent analyses of spontaneous, „representative“ speech – but: do not use acted and/or speaker-dependent data unless this is your intended application!

• it is like read vs. spontaneous speech (remember the sobering break-down of classification performance: 98% → 20%)

• thus, results obtained from acted emotions might be taken as an indication of where to look, but not more!

• two types of „emotional“ data: rich and poor?

Page 25

Overview

• why these hypotheses / statements?

• our basic approach:

– features
– classifiers
– examples

• the hypotheses / statements

• suggestions and outlook

Page 26

open for discussion

• consent to / rejection of the statements?

• if rejection: same type of data / features / procedures?

→ catalogue of old and new / alternative statements

Page 27

Thank you for your attention