
Applying articulatory features to speech emotion recognition

Yu Zhou, Yanqing Sun, Lin Yang, Yonghong Yan

ThinkIT Speech Lab., Institute of Acoustics, Chinese Academy of Sciences, Beijing
zhouyu, [email protected]

Abstract—In this paper, we present an approach that uses articulatory features (AFs) derived from spectral features for speech emotion recognition. We also investigated the combination of AFs and spectral features. Systems based on AFs only and on combined spectral-articulatory features are tested on the CASIA Mandarin emotional corpus. Experimental results show that AFs alone are not suitable for speech emotion recognition and that the combination of spectral features and AFs does not improve the performance of the system using only spectral features.

Keywords-articulatory feature; emotion recognition;

I. INTRODUCTION

During the last few years, research on speech emotion recognition has received much attention [1], and there have been plenty of studies on the topic [2][3]. Most traditional emotion recognition systems have focused on the modeling of spectral features or prosodic features [4][2].

In recent years, researchers have started to investigate the articulatory features of speech for speech recognition and speaker identification [5][6]. AFs are abstract representations of important speech production properties, such as the manner and place of articulation, the vocal cord excitation, and lip motion. Speech is produced by the continuous movements of articulators in the vocal tract, excited by the air stream originating from the lungs. These phoneme-characterized or speaker-characterized articulations and excitations, which are imparted to the produced speech, are the origin of unique phoneme or speaker information [7][5]. AFs have been adopted as alternative or supplementary features for speech recognition [8], language identification [9] and confidence measures [10]. Many studies have investigated the relationships between emotions and articulatory properties [11][12], so it seems reasonable to think that articulatory features contain useful emotion-specific information for speech emotion recognition. However, articulatory information has not been widely applied to automatic emotion recognition. In this paper, we explore the use of articulatory features (AFs) to capture the movements of articulators in the vocal tract and their excitation during sound production for emotion recognition. It was found that AFs alone are not suitable for speech emotion recognition and that the combination of spectral features and AFs does not improve the performance of the system using only spectral features.

The paper is organized as follows. In Section II the extraction of articulatory features is introduced. Speech emotion recognition based on AFs is presented in Section III. Experiments and results are shown in Section IV. Finally, Section V gives conclusions.

II. ARTICULATORY FEATURE EXTRACTION

To extract AFs from emotional speech, a set of articulatory classifiers is trained to learn the mapping between the acoustic signals and the articulatory states. There are several ways to train such classifiers: the actual articulatory positions can be measured, e.g., by X-ray, which is expensive to realize, or the training targets can be derived from mappings between phonemes and their corresponding articulatory properties. In this paper, AFs were extracted from acoustic signals in a manner similar to [6]. To obtain the AFs, a sequence of acoustic vectors is fed to the classifiers in parallel, where each classifier represents a different articulatory property. The outputs of these classifiers (the posterior probabilities) form the AF vectors. The extracted AFs are then used for speech emotion recognition.
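To make the parallel-classifier step concrete, the following minimal sketch (in Python) concatenates the per-property class posteriors into one AF vector per frame; the per-property classifiers are assumed to be pre-trained models exposing a scikit-learn-style predict_proba interface, as a stand-in for the paper's MLPs rather than the authors' implementation:

import numpy as np

def extract_af_vectors(acoustic_frames, af_mlps):
    """Concatenate per-property class posteriors into one AF vector per frame.

    acoustic_frames: (T, D) array of input vectors, one per frame.
    af_mlps: dict mapping property name -> trained classifier with a
             scikit-learn-style predict_proba method (stand-in for the AF-MLPs).
    """
    posteriors = [af_mlps[name].predict_proba(acoustic_frames)   # each (T, n_classes)
                  for name in sorted(af_mlps)]
    return np.concatenate(posteriors, axis=1)                    # (T, total AF classes)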

Figure 1. The block diagram of the AF extraction for emotion recognition. (Blocks: speech input → NUFCC extraction → 9-frame extension → one MLP per articulatory property, e.g. degree and vowel → emotion recognition.)

In our speech emotion recognition system, different articulatory properties, as listed in the first column of Table I, are used. For each property, a Multi-Layer Perceptron (MLP) is used to estimate the probability distribution over its pre-defined output classes (listed in the second column of Table I). The extraction process is illustrated in Fig. 1. The inputs to these AF-MLPs are identical, while their numbers of outputs are equal to the numbers of AF classes of the corresponding properties in Table I.

The AF-MLPs were trained with phonetic labels from speech data. Given the phonetic labels, articulatory features can be derived from a mapping between phonemes and their states of articulation [8].
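As an illustration of such a mapping, a minimal sketch follows; the phoneme symbols and entries are illustrative examples based on standard Mandarin phonetics, not the authors' actual mapping table:

# Illustrative fragment only: maps a (pinyin-style) phoneme to articulatory states.
PHONE_TO_AF = {
    "b": {"Degree": "stop",  "Place": "bilabial", "Aspiration": "unaspirated"},
    "p": {"Degree": "stop",  "Place": "bilabial", "Aspiration": "aspirated"},
    "m": {"Degree": "nasal", "Place": "bilabial", "Aspiration": "other consonants"},
}

def af_targets(frame_phones, prop):
    """Convert a sequence of frame-level phoneme labels into the target
    class sequence for one articulatory property (i.e. one AF-MLP)."""
    return [PHONE_TO_AF[ph][prop] for ph in frame_phones]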



Table I. Design of articulatory features for Mandarin.

Name           Classes
Degree         stop, nasal, fricative, lateral, affricate
Aspiration     aspirated, unaspirated, other consonants
Place          bilabial, labiodental, dental, alveolar, retroflex, alveolo-palatal, velar, alveolar-nasal, velar-nasal
Frontness      front, mid-front, mid, mid-back, back
Height         high, mid-high, mid, mid-low, low
Vowel          a, aa, ak, at, au, e, ea, ee, er, err, i, ii, ix, iy, o, u, uu, v, iaa, ioo, iee, iii, iuu, ivv
Rounding       rounded, unrounded
Tone           high-level, high-rising, low-dipping, high-falling
Glottal state  voiced, unvoiced
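For reference, the Table I inventory can also be written as a simple data structure; a sketch transcribed directly from the table as printed, with each AF-MLP's number of outputs taken as the length of the corresponding class list:

# Articulatory properties and their output classes, as listed in Table I.
AF_CLASSES = {
    "Degree":        ["stop", "nasal", "fricative", "lateral", "affricate"],
    "Aspiration":    ["aspirated", "unaspirated", "other consonants"],
    "Place":         ["bilabial", "labiodental", "dental", "alveolar", "retroflex",
                      "alveolo-palatal", "velar", "alveolar-nasal", "velar-nasal"],
    "Frontness":     ["front", "mid-front", "mid", "mid-back", "back"],
    "Height":        ["high", "mid-high", "mid", "mid-low", "low"],
    "Vowel":         ["a", "aa", "ak", "at", "au", "e", "ea", "ee", "er", "err",
                      "i", "ii", "ix", "iy", "o", "u", "uu", "v",
                      "iaa", "ioo", "iee", "iii", "iuu", "ivv"],
    "Rounding":      ["rounded", "unrounded"],
    "Tone":          ["high-level", "high-rising", "low-dipping", "high-falling"],
    "Glottal state": ["voiced", "unvoiced"],
}

def n_outputs(prop):
    """Number of output nodes of the AF-MLP for one articulatory property."""
    return len(AF_CLASSES[prop])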

III. SPEECH EMOTION RECOGNITION BASED ON AFS

The block diagram of the AF extraction for emotion recognition is shown in Fig. 1. It consists of two steps: spectral feature extraction and AF extraction. The spectral features we used are non-uniform frequency cepstral coefficients (NUFCC) rather than MFCC, since it has been pointed out that NUFCC performs better than the traditional MFCC in speech emotion recognition [13]. To ensure a more accurate estimation of the AF values, multiple frames of NUFCCs, namely frames [t-n/2, ..., t, ..., t+n/2] created by a moving window, serve as the inputs to the AF-MLPs at frame t.
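A minimal sketch of this context-window stacking, assuming a 9-frame window and repetition of edge frames at utterance boundaries (a common choice, not something specified in the paper):

import numpy as np

def stack_context(feats, context=9):
    """Stack `context` consecutive frames centred on frame t into one input
    vector per frame, e.g. 9 x 39-dim NUFCC -> 351-dim MLP input.

    feats: (T, D) array of frame-level features.
    Returns: (T, context * D) array.
    """
    half = context // 2
    padded = np.pad(feats, ((half, half), (0, 0)), mode="edge")  # repeat edge frames
    return np.hstack([padded[i:i + len(feats)] for i in range(context)])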

A. NUFCC and AF extraction

The AF values are determined by the AF-MLPs, with the NUFCCs as the source for the AF extraction. After the NUFCCs are extracted from the speech data, a context window of 9 frames is applied [14], and the stacked frames are then fed to the AF-MLPs to determine the AFs. According to Table I, there are a total of 76 articulatory classes, which results in a 76-dimensional AF vector for each frame if we use all nine articulatory properties.

1) Correlation analysis: To investigate the discriminative ability of these features, and to find out which ones perform best, we first analyze the correlation coefficients between the 9 AFs and then choose the feature set that performs best in the emotion recognition system. The correlation coefficients between the results of the separate AF-based recognizers are used to estimate the correlations between AFs; they are defined as

cc(x, y) = | cov(x, y) / sqrt(cov(x, x) * cov(y, y)) |   (1)

where cov(x, y) is defined as

cov(x, y) = E[(x − E(x))(y − E(y))] (2)

Here x and y are the recognition results of two separate AF-based recognizers, and E(x) is the mathematical expectation of x.
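Eqs. (1) and (2) amount to the absolute value of the Pearson correlation coefficient; a minimal sketch, treating x and y simply as equal-length numeric sequences of recognition results:

import numpy as np

def af_correlation(x, y):
    """Absolute correlation coefficient between two result sequences,
    cc(x, y) = |cov(x, y) / sqrt(cov(x, x) * cov(y, y))|, per Eqs. (1)-(2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return abs(cov / np.sqrt(x.var() * y.var()))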

The correlation coefficient reflects the dependency between two variables and lies in the range [0, 1]. If the two variables are highly dependent, the correlation coefficient will be close to 1, which suggests large redundancy. If the two variables are nearly independent, it will be close to 0, which suggests a large difference between them; this may be due to poor performance of one of them. Neither case is suitable for combining the two variables. Only features with moderate correlation coefficients are chosen for combination.

B. Emotion recognition

A sequence of AF vectors is obtained, one for each frame. We applied GMMs to model the AF vectors of each emotion using all the training data; for each test utterance, the multi-dimensional AF vectors were fed to the five emotion models to obtain the final decision.
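A minimal sketch of this GMM-based classification, assuming scikit-learn's GaussianMixture; the number of mixture components and the diagonal covariance are assumptions, since the paper does not specify them:

import numpy as np
from sklearn.mixture import GaussianMixture

EMOTIONS = ["angry", "happy", "neutral", "sad", "surprise"]

def train_emotion_gmms(train_af, n_components=32):
    """Train one GMM per emotion on its AF vectors.
    train_af: dict emotion -> (N_frames, D) array of training AF vectors."""
    return {emo: GaussianMixture(n_components=n_components,
                                 covariance_type="diag").fit(train_af[emo])
            for emo in EMOTIONS}

def classify_utterance(af_vectors, gmms):
    """Score an utterance's frame-level AF vectors against the five emotion
    GMMs and return the emotion with the highest average log-likelihood."""
    scores = {emo: gmm.score(af_vectors) for emo, gmm in gmms.items()}
    return max(scores, key=scores.get)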

IV. EXPERIMENTS AND RESULTS

A. Corpus

In this study, we used the CASIA Mandarin emotional corpus provided by Chinese-LDC [15]. The corpus is designed and set up for emotion recognition studies. The database contains short utterances from four speakers, covering five primary emotions, namely angry, happy, surprise, neutral, and sad. Each utterance corresponds to one emotion. For each speaker, there are 1500 utterances, i.e., 300 utterances per emotion. Each utterance was recorded at a sampling rate of 16 kHz. For each speaker, 200 utterances of each emotion were used for training, and the remaining utterances were used for testing.
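A sketch of this train/test split; the speaker identifiers and utterance indexing are placeholders, while the 200/100 per-emotion split per speaker follows the text:

# CASIA corpus organisation used here: 4 speakers x 5 emotions x 300 utterances.
SPEAKERS = ["spk1", "spk2", "spk3", "spk4"]          # placeholder speaker IDs
EMOTIONS = ["angry", "happy", "neutral", "sad", "surprise"]

def split_corpus(utterances_per_emotion=300, n_train=200):
    """Return (train, test) lists of (speaker, emotion, index) keys: the first
    200 utterances per emotion and speaker for training, the rest for testing."""
    train, test = [], []
    for spk in SPEAKERS:
        for emo in EMOTIONS:
            for i in range(utterances_per_emotion):
                (train if i < n_train else test).append((spk, emo, i))
    return train, test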

B. Performance of AF-based systems

For spectral features, 39-dimensional NUFCCs were computed every 10 ms using a Hamming window of 25 ms. For the system that uses AFs as features, 76-dimensional AF vectors were obtained from the nine AF-MLPs, each with 351 input nodes (9 frames of the 39-dimensional spectral features) and a different number of hidden nodes. The MLPs were trained using the QuickNet toolkit [16].

1) Performance analysis of separate AFs: Using the method described above, we first calculated the recognition accuracy for each AF property separately. The results are shown in Table II.

Several conclusions can be drawn from Table II. First, even for the best single AF property, "vowel", the performance is inferior to systems using traditional (spectral and prosodic) features. Second, the AF properties with the lowest recognition accuracy can be excluded, since the system's performance may be degraded by these AFs.


Table II. Recognition rate (%) of the system based on each separate AF.

AFs         angry   happy   neutral   sad     surprise   average
Degree      58.00   28.50   57.25     54.25   31.75      45.95
Aspiration  48.25   23.50   61.75     49.25   30.00      42.55
Frontness   36.25   30.50   67.50     45.75   37.75      43.55
Glottal     19.00   15.50   73.25     45.25   26.25      35.85
Height      49.50   29.75   56.75     50.75   40.00      45.35
Place       43.50   32.50   63.00     56.75   37.75      46.70
Rounding    22.25   14.50   76.50     39.50   25.00      35.55
Tone        28.25   26.25   66.50     43.00   42.50      41.30
Vowel       54.50   43.75   53.00     58.00   45.50      50.95

2) Correlation coefficients between AFs: Although the performance of each single AF has been investigated, it is still unknown which feature set performs best for the emotion recognition system. Therefore, we analyzed the correlation coefficients between the separate AFs. The correlation coefficients computed according to Eqs. (1) and (2) are shown in Table III.

Table III. Correlation coefficients between the 9 AFs for GMM-based recognition.

AFs         Deg    Asp    Fro    Glo    Hei    Pla    Rou    Ton    Vow
Degree      1      0.18   0.17   0.12   0.19   0.23   0.09   0.11   0.20
Aspiration  0.18   1      0.12   0.11   0.16   0.17   0.07   0.13   0.19
Frontness   0.17   0.12   1      0.13   0.22   0.17   0.13   0.16   0.24
Glottal     0.12   0.11   0.13   1      0.09   0.15   0.17   0.15   0.12
Height      0.19   0.16   0.22   0.09   1      0.21   0.13   0.14   0.27
Place       0.23   0.17   0.17   0.15   0.21   1      0.07   0.13   0.16
Rounding    0.09   0.07   0.13   0.17   0.13   0.07   1      0.05   0.12
Tone        0.11   0.13   0.16   0.15   0.14   0.13   0.05   1      0.17
Vowel       0.20   0.19   0.24   0.12   0.27   0.16   0.12   0.17   1
average     0.25   0.24   0.26   0.23   0.27   0.25   0.20   0.23   0.28

From Table II we can see that the AF properties "Glottal", "Rounding", and "Tone" have the worst average accuracy rates. In Table III, these three properties also have the lowest average correlation coefficients. Following the correlation analysis above, we can drop these three AFs in order to improve the performance of the system using AFs. That is to say, "Degree", "Aspiration", "Frontness", "Height", "Place" and "Vowel" are chosen as the optimal feature set.
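The subset selection itself is simple; a sketch using the average recognition rates from Table II, which drops the three weakest properties (these are also the three with the lowest average correlation coefficients in Table III):

# Average recognition rates (%) per articulatory property, from Table II.
AVG_ACCURACY = {"Degree": 45.95, "Aspiration": 42.55, "Frontness": 43.55,
                "Glottal": 35.85, "Height": 45.35, "Place": 46.70,
                "Rounding": 35.55, "Tone": 41.30, "Vowel": 50.95}

def select_af_subset(avg_acc, n_drop=3):
    """Drop the n_drop weakest properties and keep the rest as the feature set."""
    dropped = sorted(avg_acc, key=avg_acc.get)[:n_drop]
    return [p for p in avg_acc if p not in dropped]

# select_af_subset(AVG_ACCURACY) keeps Degree, Aspiration, Frontness, Height, Place, Vowel.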

C. Performance of the selected AF set

For comparison, we investigated not only the performance of the selected AF set, but also the performance when all 9 AFs are used. The experimental results are shown in Table IV.

Table IV. Contrast of subset selection: recognition rate (%).

AF set                          angry   happy   neutral   sad     surprise   average
9 (all)                         56.25   33.50   64.75     56.50   43.25      50.85
6 (-Glottal, -Rounding, -Tone)  64.00   40.50   63.00     62.00   47.50      55.40

D. System fusion

Although AFs do not perform well in speech emotion recognition, they nevertheless capture emotion information, as their recognition rates are higher than the chance level, which is 20% for five emotions. Also, the system based on AFs and the one based on NUFCC are correlated but produce different errors. The two systems might be complementary to each other, as the errors made by one system might be compensated by the other, and vice versa. Therefore, the AF-based system and the spectral NUFCC-based system were combined with a linear fusion method at the score level.

If S_af and S_nufcc are the scores of the two systems, the emotion recognition decision is made based upon the fused score S_fuse, which is given by

S_fuse = α * S_af + (1 − α) * S_nufcc.   (3)

The weight α was varied within the interval [0, 1] with an increment of 0.02. The accuracy rates for different values of α are shown in Figure 2.
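A minimal sketch of the score-level fusion of Eq. (3) and the weight sweep; the score matrices are assumed to hold per-utterance emotion scores from the two systems (any score normalisation the authors may have applied is not described in the paper):

import numpy as np

def fuse_scores(s_af, s_nufcc, alpha):
    """Linear score-level fusion, Eq. (3): S_fuse = alpha*S_af + (1-alpha)*S_nufcc.
    s_af, s_nufcc: (N_utterances, N_emotions) score matrices."""
    return alpha * s_af + (1.0 - alpha) * s_nufcc

def sweep_alpha(s_af, s_nufcc, labels):
    """Evaluate fusion accuracy for alpha in [0, 1] with an increment of 0.02."""
    accuracies = {}
    for alpha in np.linspace(0.0, 1.0, 51):
        pred = np.argmax(fuse_scores(s_af, s_nufcc, alpha), axis=1)
        accuracies[round(float(alpha), 2)] = float(np.mean(pred == labels))
    return accuracies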

Figure 2. The accuracy rates for different values of α (x-axis: weight α from 0 to 1; y-axis: accuracy, %).

It can be seen from Figure 2 that the combined system achieves its best performance when α equals 0, i.e., the combination of spectral features and AFs does not improve the performance of the system using only spectral features.

V. CONCLUSION AND DISCUSSION

This paper has presented an approach using articulatory features (AFs) derived from spectral features for speech emotion recognition. Experimental results based on the CASIA Mandarin emotional corpus show that AFs alone are not suitable for speech emotion recognition. In addition, we investigated the combination of AFs and spectral features. The results show that the combination of spectral features and AFs does not improve the performance of the system using only spectral features.

Although AFs have been adopted as alternative or supplementary features for speech recognition, language identification and confidence measures, they are not effective features, either alone or as a supplement, for speech emotion recognition.

ACKNOWLEDGMENT

This work is partially supported by the National High Technology Research and Development Program of China (863 Program, 2006AA010102), the National Science & Technology Pillar Program (2008BAI50B00), MOST (973 Program, 2004CB318106), and the National Natural Science Foundation of China (10874203, 60875014, 60535030).

REFERENCES

[1] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine, vol. 18, no. 1, Jan. 2001.
[2] D. Ververidis and C. Kotropoulos, "Emotional speech recognition: Resources, features, and methods," Speech Communication, vol. 48, no. 9, pp. 1162–1181, 2006. [Online]. Available: http://www.sciencedirect.com/science/article/B6V1C-4K1HCKM-1/2/3c1a10a68e9fe662b07918424294495a
[3] T. Nwe, S. Foo, and L. D. Silva, "Speech emotion recognition using hidden Markov models," Speech Communication, vol. 41, no. 4.
[4] N. Sato and Y. Obuchi, "Emotion recognition using mel-frequency cepstral coefficients," Information and Media Technologies, vol. 2, no. 3, pp. 835–848, 2007.
[5] K. Kirchhoff, "Robust speech recognition using articulatory information," Tech. Rep., 1998.
[6] K. Y. Leung, M.-W. Mak, and S.-Y. Kung, "Applying articulatory features to telephone-based speaker verification," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004, pp. 85–88.
[7] G. Doddington, "Speaker recognition: identifying people by their voices," Proceedings of the IEEE, vol. 73, no. 11, pp. 1651–1664, Nov. 1985.
[8] K. Kirchhoff, G. A. Fink, and G. Sagerer, "Combining acoustic and articulatory feature information for robust speech recognition," Speech Communication, vol. 37, no. 3-4, pp. 303–319, 2002. [Online]. Available: http://www.sciencedirect.com/science/article/B6V1C-45628J4-9/2/9b4992df884f9b5fb630fc7509890a51
[9] S. Parandekar and K. Kirchhoff, "Multi-stream language identification using data-driven dependency selection," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 1, April 2003, pp. I-28–I-31.
[10] K.-Y. Leung and M. Siu, "Phone level confidence measure using articulatory features," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 1, April 2003, pp. I-600–I-603.
[11] S. Lee, E. Bresch, and S. Narayanan, "An exploratory study of emotional speech production using functional data analysis techniques," in Proc. 7th Int. Seminar on Speech Production, 2006.
[12] M. Nordstrand et al., "Measurements of articulatory variation in expressive speech for a set of Swedish vowels," Speech Communication, vol. 44, no. 1-4, pp. 187–196, 2004, Special Issue on Audio-Visual Speech Processing.
[13] Y. Zhou, Y. Sun, J. Li, J. Zhang, and Y. Yan, "Physiologically-inspired feature extraction for emotion recognition," in Proc. Interspeech, September 2009.
[14] J. Frankel et al., "Articulatory feature classifiers trained on 2000 hours of telephone speech," in Proc. Interspeech, 2007.
[15] Chinese-LDC, http://www.chineseldc.org/resourse.asp/.
[16] D. Johnson et al., QuickNet, www.icsi.berkeley.edu/speech/qn.html.
