Applying articulatory features to speech emotion recognition
Yu Zhou, Yanqing Sun, Lin Yang, Yonghong Yan
ThinkIT Speech Lab., Institute of Acoustics, Chinese Academy of Sciences, Beijing
zhouyu, [email protected]
Abstract—In this paper, we present an approach that uses articulatory features (AFs) derived from spectral features for speech emotion recognition. Also, we investigated the combination of AFs and spectral features. Systems based on AFs alone and on combined spectral-articulatory features are tested on the CASIA Mandarin emotional corpus. Experimental results show that AFs alone are not suitable for speech emotion recognition and that the combination of spectral features and AFs does not improve the performance of the system that uses only spectral features.
Keywords-articulatory feature; emotion recognition;
I. INTRODUCTION
During the last few years, research on speech emotion
recognition has received much attention [1], and there have
been plenty of studies on the topic [2][3]. Most traditional
emotion recognition systems have focused on the modeling
of spectral features or prosodic features [4][2].
In recent years, researchers have started to investigate the
articulatory features of speech in speech recognition and
speaker identification [5][6]. AFs are abstract representations
of important speech production properties, such as the manner
and place of articulation, the vocal cord excitation, and lip
motion. Speech is produced by the continuous movements of
articulators in the vocal tract, excited by the air stream
originating from the lungs. These phoneme-characterizing or
speaker-characterizing articulations and excitations, which are
imparted to the produced speech, are the origin of the unique
phoneme or speaker information in the signal [7][5]. AFs have
been adopted as alternative or
supplementary features for speech recognition [8], language
ID [9] and confidence measure [10]. From many studies
that have investigated the relationships between emotions
and articulatory properties [11][12], it seems reasonable
to think that articulatory features contain useful emotion-
specific information for speech emotion recognition. How-
ever, articulatory information has not been widely applied to
automatic emotion recognition. In this paper, we explore the
use of articulatory features (AFs) to capture the movements
of articulators in the vocal tract and their excitation during
sound production for emotion recognition. It was found that
AFs alone are not suitable for speech emotion recognition
and that the combination of spectral features and AFs does not
improve the performance of the system using only spectral
features.
The paper is organized as follows. In section 2 the
extraction of articulatory features is introduced. Speech
emotion recognition based on AFs is presented in section
3. Experiments and results are shown in section 4. Finally,
section 5 gives conclusions.
II. ARTICULATORY FEATURE EXTRACTION
To extract AFs of emotional speech, a set of articulatory
classifiers are trained to learn the mapping between the
acoustic signals and the articulatory states. There are several
ways to obtain the training targets for these classifiers: the
actual articulatory positions can be measured, for example
with X-rays, which is expensive to realize, or articulatory
properties can be derived from a mapping between phonemes
and their corresponding articulatory states. In this paper, AFs
were extracted from acoustic signals in a manner similar to
[6]. To obtain the AFs, a sequence of acoustic vectors is fed
to the classifiers in parallel, where each classifier represents
a different articulatory property. The outputs of these classifiers (the posterior
probabilities) form the AF vectors. The extracted AFs are
then used for speech emotion recognition.
[Figure 1 block diagram: speech input -> NUFCC extraction -> 9-frame extension -> parallel AF-MLPs (one per articulatory property, e.g. degree, ..., vowel) -> emotion recognition]
Figure 1. The block diagram of the AF extraction for emotion recognition.
In our speech emotion recognition system, different ar-
ticulatory properties, as listed in the first column of Table
1, are used. For each property, a Multi-Layer Perceptron
(MLP) is used to estimate the probability distributions of
its pre-defined output classes (they are listed in the second
column of Table 1). The extraction process is illustrated in
Fig. 1. The inputs to these AF-MLPs are identical, whereas
their numbers of outputs are equal to the numbers of AF
classes listed for each property in Table 1.
The AF-MLPs were trained with phonetic labels from
speech data. With the phonetic labels, articulatory features
can be derived from a mapping between phonemes and their
states of articulation [8].
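As a concrete illustration, the sketch below derives frame-level training targets for one AF-MLP from a phoneme-to-articulatory-attribute table. The phoneme symbols, the attribute assignments, and the function name af_targets are assumptions made for this example only; they are not the exact mapping used in this work.

# Hedged sketch: deriving frame-level AF training targets from phonetic labels
# via a phoneme-to-articulatory-attribute table. The entries below are
# illustrative assumptions, not the mapping actually used in the paper.
PHONE_TO_AF = {
    "b": {"degree": "stop",      "aspiration": "unaspirated",      "place": "bilabial"},
    "p": {"degree": "stop",      "aspiration": "aspirated",        "place": "bilabial"},
    "s": {"degree": "fricative", "aspiration": "other consonants", "place": "dental"},
    "a": {"vowel": "a", "frontness": "mid", "height": "low",
          "rounding": "unrounded", "glottal": "voiced"},
}

def af_targets(frame_phones, af_property):
    """Map a per-frame phoneme sequence to target classes for one AF property.

    Frames whose phoneme does not define the property (e.g. a vowel frame for
    the 'degree' property) get None and can be masked during MLP training.
    """
    return [PHONE_TO_AF.get(ph, {}).get(af_property) for ph in frame_phones]

# Example: targets for the 'degree' MLP from a toy frame-level alignment.
print(af_targets(["b", "a", "a", "s"], "degree"))
# -> ['stop', None, None, 'fricative']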
Name           Classes
Degree         stop, nasal, fricative, lateral, affricate
Aspiration     aspirated, unaspirated, other consonants
Place          bilabial, labiodental, dental, alveolar, retroflex, alveolo-palatal, velar, alveolar-nasal, velar-nasal
Frontness      front, mid-front, mid, mid-back, back
Height         high, mid-high, mid, mid-low, low
Vowel          a, aa, ak, at, au, e, ea, ee, er, err, i, ii, ix, iy, o, u, uu, v, iaa, ioo, iee, iii, iuu, ivv
Rounding       rounded, unrounded
Tone           high-level, high-rising, low-dipping, high-falling
Glottal state  voiced, unvoiced

Table I
DESIGN OF ARTICULATORY FEATURES FOR MANDARIN
III. SPEECH EMOTION RECOGNITION BASED ON AFS
The block diagram of the AF extraction for emotion
recognition is shown in Fig. 1. It consists of two steps:
spectral feature extraction and AF extraction. The spectral
features we used are non-uniform frequency cepstral
coefficients (NUFCCs) rather than MFCCs, since it has been
pointed out that NUFCCs perform better than the traditional
MFCC spectral features in speech emotion recognition [13].
To ensure a more accurate estimation of the AF values,
multiple frames of NUFCCs, namely the frames [t-n/2, ..., t, ..., t+n/2]
selected by a moving window, serve as
the inputs to the AF-MLPs at frame t.
A. NUFCC and AF extraction

The AF values are determined by the AF-MLPs, with
NUFCCs as their input. After the NUFCCs are extracted from
the speech data, a context window of 9 frames is applied [14],
and the windowed features are then fed to the AF-MLPs to
determine the AFs. According to Table 1, there are a total of
76 articulatory classes, which results in a 76-dimensional AF
vector for each frame if all nine articulatory properties are used.
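The following sketch is a simplified illustration of this step, not the actual implementation: a 9-frame context window of 39-dimensional NUFCCs (351 values) is stacked for each frame and passed to the per-property classifiers in parallel, and their posterior outputs are concatenated into one AF vector per frame. The scikit-learn-style predict_proba interface, the dummy classifiers, and the per-property class counts are assumptions made for the example.

import numpy as np

def stack_context(feats, context=9):
    """Stack a symmetric context window (here 9 frames of 39-dim NUFCCs,
    i.e. 351 values) around each frame, padding at the edges by repetition."""
    half = context // 2
    padded = np.pad(feats, ((half, half), (0, 0)), mode="edge")
    return np.stack([padded[t:t + context].ravel()
                     for t in range(feats.shape[0])])

def extract_af_vectors(nufcc, af_mlps):
    """Feed windowed NUFCCs to the per-property MLPs in parallel and
    concatenate their posterior outputs into one AF vector per frame."""
    windows = stack_context(nufcc)                           # (T, 351)
    posteriors = [mlp.predict_proba(windows) for mlp in af_mlps]
    return np.concatenate(posteriors, axis=1)                # (T, total AF classes)

# Toy usage with dummy classifiers standing in for the trained AF-MLPs.
class DummyMLP:
    def __init__(self, n_classes):
        self.n = n_classes
    def predict_proba(self, x):
        p = np.random.rand(x.shape[0], self.n)
        return p / p.sum(axis=1, keepdims=True)

# Placeholder per-property class counts; the paper's design totals 76 classes.
af_mlps = [DummyMLP(n) for n in (5, 3, 9, 5, 5, 24, 2, 4, 2)]
afs = extract_af_vectors(np.random.randn(100, 39), af_mlps)
print(afs.shape)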
1) Correlation analysis: To investigate the discriminative
ability of these features, and to find out which combination
performs best, we first analyze the correlation coefficients
between the 9 AFs and then choose the feature set that
performs best in the emotion recognition system. Correlation
coefficients of the separate AF-based recognition results are
used to estimate the correlations between AFs; the coefficient
is defined as
cc(x, y) = |cov(x, y) / √(cov(x, x) ∗ cov(y, y))| (1)
where cov(x, y) is defined as
cov(x, y) = E[(x − E(x))(y − E(y))] (2)
Here, x and y are the recognition results of two separate
AF-based recognizers, and E(x) denotes the mathematical expectation of x.
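For clarity, Equations (1) and (2) can be implemented directly, for example as follows; representing the recognition outputs as numeric sequences is an assumption about how the scores are encoded.

import numpy as np

def correlation_coefficient(x, y):
    """cc(x, y) = |cov(x, y) / sqrt(cov(x, x) * cov(y, y))| as in Eqs. (1)-(2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))     # Eq. (2)
    var_x = np.mean((x - x.mean()) ** 2)
    var_y = np.mean((y - y.mean()) ** 2)
    return abs(cov / np.sqrt(var_x * var_y))           # Eq. (1)

# Example: correlation between the outputs of two single-AF recognizers,
# encoded here as integer emotion labels (an assumption for illustration).
print(correlation_coefficient([0, 1, 2, 2, 4], [0, 1, 1, 2, 4]))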
The correlation coefficient reflects the dependency between
two variables and lies in the range [0, 1]. If the two variables
are highly dependent, the absolute value of the correlation
coefficient is close to 1, which suggests large redundancy. If
the two variables are nearly independent, the absolute value
is close to 0, which suggests a large difference that may be
caused by the poor performance of one of the features. Neither
case is suitable for combining the two variables. Only AFs
with moderate correlation coefficients are chosen for
combination, as sketched below.
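A hedged sketch of this selection rule is shown below; the thresholds low and high are illustrative values chosen for the example and are not taken from the paper.

import numpy as np

def select_moderate_afs(cc_matrix, names, low=0.24, high=0.90):
    """Keep AFs whose average correlation with the other AFs is moderate:
    not near 0 (suggesting poor single-AF performance) and not near 1
    (suggesting redundancy). Threshold values are illustrative assumptions."""
    avg_cc = np.asarray(cc_matrix).mean(axis=1)
    return [name for name, c in zip(names, avg_cc) if low <= c <= high]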
B. Emotion recognition
An AF vector is obtained for each frame, so each utterance
yields a sequence of AF vectors. We applied GMMs to model
the AF vectors of each emotion using all the training data;
for each test utterance, the multi-dimensional AF vectors were
fed to the five emotion models to obtain the final result.
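A minimal sketch of this back end is given below, using scikit-learn's GaussianMixture as a stand-in for whatever GMM implementation was actually used (the paper does not name one); the number of mixture components and the toy data are assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

EMOTIONS = ["angry", "happy", "neutral", "sad", "surprise"]

def train_emotion_gmms(train_frames, n_components=32):
    """Fit one GMM per emotion on the pooled per-frame AF vectors.
    train_frames maps an emotion name to an (n_frames, dim) array."""
    return {emo: GaussianMixture(n_components=n_components,
                                 covariance_type="diag").fit(x)
            for emo, x in train_frames.items()}

def classify_utterance(gmms, af_vectors):
    """Sum frame log-likelihoods under each emotion GMM and pick the best."""
    scores = {emo: gmm.score_samples(af_vectors).sum() for emo, gmm in gmms.items()}
    return max(scores, key=scores.get)

# Toy usage with random data standing in for AF vectors.
rng = np.random.default_rng(0)
gmms = train_emotion_gmms({e: rng.normal(size=(500, 76)) for e in EMOTIONS},
                          n_components=4)
print(classify_utterance(gmms, rng.normal(size=(120, 76))))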
IV. EXPERIMENTS AND RESULTS
A. Corpus
In this study, we used the CASIA Mandarin emotional
corpus provided by Chinese-LDC [15]. The corpus is de-
signed and set up for emotion recognition studies. The
database contains short utterances from four persons, cov-
ering five primary emotions, namely angry, happy, neutral,
sad, and surprise. Each utterance corresponds to one emotion.
For each person, there are 1500 utterances, i.e., 300
utterances for each emotion. Each utterance was recorded
at a sampling rate of 16 kHz. 200 utterances from each
emotion of each person were used for training, and the other
utterances were used for testing.
B. Performance of AF-based systems

For the spectral features, 39-dimensional NUFCCs were computed
every 10 ms using a Hamming window of 25 ms. For the
system that uses AFs as features, 76-dimensional AF vectors
were obtained from the nine AF-MLPs, each with 351 input
nodes (9 frames of 39-dimensional NUFCCs) and a different
number of hidden nodes. The MLPs were trained using the
QuickNet toolkit [16].
1) Performance analysis of separate AFs: Using the
method described before, we first calculate the recognition
accuracy for each AF property separately. The results are
shown in Table 2.

From Table 2, several conclusions can be drawn. First,
even for the best single AF property, "Vowel", the performance
is inferior to that of systems using traditional (spectral
and prosodic) features. Second, the AF properties with
the lowest recognition accuracy can be excluded, since they
may reduce the system's performance.
AFs         angry   happy   neutral  sad     surprise  average
Degree      58.00   28.50   57.25    54.25   31.75     45.95
Aspiration  48.25   23.50   61.75    49.25   30.00     42.55
Frontness   36.25   30.50   67.50    45.75   37.75     43.55
Glottal     19.00   15.50   73.25    45.25   26.25     35.85
Height      49.50   29.75   56.75    50.75   40.00     45.35
Place       43.50   32.50   63.00    56.75   37.75     46.70
Rounding    22.25   14.50   76.50    39.50   25.00     35.55
Tone        28.25   26.25   66.50    43.00   42.50     41.30
Vowel       54.50   43.75   53.00    58.00   45.50     50.95

Table II
RECOGNITION RATE (%) OF SYSTEM BASED ON SEPARATE AF
2) Correlation coefficients between AFs: Although the
performance of each single AF has been investigated, it is
still unknown which feature set performs best for the
emotion recognition system. Therefore, we analyze the
correlation coefficients between the separate AFs. According
to Equations (1) and (2), the correlation coefficients between
the separate AFs are calculated, and the results are shown in
Table 3.
AFs         Deg    Asp    Fro    Glo    Hei    Pla    Rou    Ton    Vow
Degree      1      0.18   0.17   0.12   0.19   0.23   0.09   0.11   0.20
Aspiration  0.18   1      0.12   0.11   0.16   0.17   0.07   0.13   0.19
Frontness   0.17   0.12   1      0.13   0.22   0.17   0.13   0.16   0.24
Glottal     0.12   0.11   0.13   1      0.09   0.15   0.17   0.15   0.12
Height      0.19   0.16   0.22   0.09   1      0.21   0.13   0.14   0.27
Place       0.23   0.17   0.17   0.15   0.21   1      0.07   0.13   0.16
Rounding    0.09   0.07   0.13   0.17   0.13   0.07   1      0.05   0.12
Tone        0.11   0.13   0.16   0.15   0.14   0.13   0.05   1      0.17
Vowel       0.20   0.19   0.24   0.12   0.27   0.16   0.12   0.17   1
average     0.25   0.24   0.26   0.23   0.27   0.25   0.20   0.23   0.28

Table III
CORRELATION COEFFICIENTS BETWEEN THE 9 AFS AND GMM-BASED RECOGNITION
From Table 2 we can see that the AF properties "Glottal",
"Rounding", and "Tone" have the worst average accuracy
rates. In Table 3, these three properties also have the lowest
average correlation coefficients. According to the correlation
analysis above, in order to improve the performance of
the AF-based system, we can drop these three AFs. That
is to say, "Degree", "Aspiration", "Frontness", "Height",
"Place", and "Vowel" are chosen as the optimal feature
set.
C. Performance of the selected AF set

For comparison, we investigated not only the performance
of the selected AF set but also the performance when
all 9 AFs are used. The experimental results are shown in
Table 4.
AF sets                         angry    happy    neutral  sad      surprise  average
9 (all)                         56.25%   33.50%   64.75%   56.50%   43.25%    50.85%
6 (-Glottal, -Rounding, -Tone)  64.00%   40.50%   63.00%   62.00%   47.50%    55.40%

Table IV
CONTRAST OF SUBSET SELECTION

D. System fusion

Although AFs do not perform well in speech emotion
recognition, they nevertheless capture emotion information,
as their recognition rates are higher than the chance level
of 20%. Also, the AF-based system and the NUFCC-based
system are correlated but produce different errors. The two
systems might be complementary to each other, as the errors
made by one system might be compensated by the other, and
vice versa. Therefore, the AF-based system and the spectral
feature (NUFCC) based system were combined with a linear
fusion method at the score level.
If Saf and Snufcc are the scores of the AF-based and
NUFCC-based systems, we make the emotion recognition
decision based upon the fused score Sfuse, which is given by
Sfuse = α ∗ Saf + (1 − α) ∗ Snufcc, (3)
The weight α was varied over the interval [0, 1] with an
increment of 0.02. The accuracy rates for different values of
α are shown in Figure 2.
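A sketch of this fusion and weight sweep is given below; the shape of the score matrices, their scaling, and the accuracy computation are assumptions about details the paper does not specify.

import numpy as np

def sweep_fusion_weight(s_af, s_nufcc, labels, step=0.02):
    """Linear score-level fusion Sfuse = a*Saf + (1-a)*Snufcc (Eq. 3),
    evaluated for a = 0, 0.02, ..., 1. s_af and s_nufcc are
    (n_utterances, n_emotions) score matrices; labels are true emotion ids."""
    results = []
    for a in np.arange(0.0, 1.0 + 1e-9, step):
        fused = a * s_af + (1.0 - a) * s_nufcc
        acc = np.mean(fused.argmax(axis=1) == labels)
        results.append((a, acc))
    return results

# Toy usage with random scores; in the paper these would be the AF-based
# and NUFCC-based GMM scores for each test utterance.
rng = np.random.default_rng(1)
labels = rng.integers(0, 5, size=200)
curve = sweep_fusion_weight(rng.normal(size=(200, 5)),
                            rng.normal(size=(200, 5)), labels)
best_alpha, best_acc = max(curve, key=lambda t: t[1])
print(best_alpha, best_acc)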
Figure 2. The accuracy rates (%) for different values of the fusion weight α.
It can be seen from Figure 2 that the combined system
achieves its best performance when α equals 0, i.e., the
combination of spectral features and AFs does not improve
the performance of the system that uses only spectral features.
V. CONCLUSION AND DISCUSSION
This paper has presented an approach that uses articulatory
features (AFs) derived from spectral features for speech
emotion recognition. Experimental results on the CASIA
Mandarin emotional corpus show that AFs alone are not
suitable for speech emotion recognition. In addition, we
investigated the combination of AFs and spectral features.
The results show that the combination of spectral features and
AFs does not improve the performance of the system that uses
only spectral features.
Although AFs have been adopted as alternative or
supplementary features for speech recognition, language ID and
confidence measures, they are not effective features, either
alone or as a supplement, for speech emotion recognition.
ACKNOWLEDGMENT
This work is partially supported by The National High
Technology Research and Development Program of China
(863 program, 2006AA010102), National Science & Tech-
nology Pillar Program (2008BAI50B00), MOST (973 pro-
gram, 2004CB318106), National Natural Science Founda-
tion of China (10874203, 60875014, 60535030).
REFERENCES
[1] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine, vol. 18, no. 1, Jan 2001.
[2] D. Ververidis and C. Kotropoulos, "Emotional speech recognition: Resources, features, and methods," Speech Communication, vol. 48, no. 9, pp. 1162–1181, 2006. [Online]. Available: http://www.sciencedirect.com/science/article/B6V1C-4K1HCKM-1/2/3c1a10a68e9fe662b07918424294495a
[3] T. Nwe, S. Foo, and L. D. Silva, "Speech emotion recognition using hidden Markov models," Speech Communication, vol. 41, no. 4.
[4] N. Sato and Y. Obuchi, "Emotion recognition using mel-frequency cepstral coefficients," Information and Media Technologies, vol. 2, no. 3, pp. 835–848, 2007.
[5] K. Kirchhoff, "Robust speech recognition using articulatory information," Tech. Rep., 1998.
[6] K.-Y. Leung, M.-W. Mak, and S.-Y. Kung, "Applying articulatory features to telephone-based speaker verification," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004, pp. 85–88.
[7] G. Doddington, "Speaker recognition: identifying people by their voices," Proceedings of the IEEE, vol. 73, no. 11, pp. 1651–1664, Nov. 1985.
[8] K. Kirchhoff, G. A. Fink, and G. Sagerer, "Combining acoustic and articulatory feature information for robust speech recognition," Speech Communication, vol. 37, no. 3-4, pp. 303–319, 2002. [Online]. Available: http://www.sciencedirect.com/science/article/B6V1C-45628J4-9/2/9b4992df884f9b5fb630fc7509890a51
[9] S. Parandekar and K. Kirchhoff, "Multi-stream language identification using data-driven dependency selection," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 1, April 2003, pp. I-28–I-31.
[10] K.-Y. Leung and M. Siu, "Phone level confidence measure using articulatory features," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 1, April 2003, pp. I-600–I-603.
[11] S. Lee, E. Bresch, and S. Narayanan, "An exploratory study of emotional speech production using functional data analysis techniques," in Proc. 7th Int. Seminar on Speech Production, 2006.
[12] M. Nordstrand et al., "Measurements of articulatory variation in expressive speech for a set of Swedish vowels," Speech Communication, vol. 44, no. 1-4, pp. 187–196, 2004, Special Issue on Audio-Visual Speech Processing.
[13] Y. Zhou, Y. Sun, J. Li, J. Zhang, and Y. Yan, "Physiologically-inspired feature extraction for emotion recognition," in Proc. Interspeech, September 2009.
[14] J. Frankel et al., "Articulatory feature classifiers trained on 2000 hours of telephone speech," in Proc. Interspeech, 2007.
[15] http://www.chineseldc.org/resourse.asp/.
[16] D. Johnson et al., QuickNet, www.icsi.berkeley.edu/speech/qn.html/.