INVESTIGATION OF EMOTION CLASSIFICATION USING SPEECH
RHYTHM METRICS
Yousef A. Alotaibi¹, Ali H. Meftah², Sid-Ahmed Selouani³
¹,²College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
3Université de Moncton, 218 bvd. J.-D.-Gauthier, Shippagan, E8S 1P6, Canada
{yaalotaibi, ameftah}@ksu.edu.sa, selouani@umcs.ca
ABSTRACT
The processing of emotion during speech production has
recently become a subject of increasing interest and attention
for many languages. Such research has a wide range of
applications in many different fields. The Arabic language,
however, has thus far received little attention in this area, with
the exception of a limited number of papers at some
conferences. The current paper investigates the relationship
between rhythm metrics Interval Measures (IM) (i.e., %V,
ΔC, and ΔV) and speech emotions. The four emotions of
‘funny,’ ‘sad,’ ‘surprised,’ and ‘question’ are investigated in
Modern Standard Arabic (MSA). The KACST Text To
Speech Database (KTD) was used to perform the study. We
found variations between the four emotion categories,
especially for sadness and question. A major conclusion is
that ΔC, and ΔV rhythm metrics can be used to classify
emotions such as those investigated in this study.
Index Terms— rhythm, funny, sad, surprised, question
1. INTRODUCTION
Human emotions are a combination of psychological and
physiological factors and can be expressed as subjective
experience, physiological changes in the body (tense muscles,
dry mouth, sweating, etc.), and individual behavior (facial
expressions, fleeing, hiding, etc.). Non-linguistic information
plays an important role in human communication, and may
be observed through facial expressions, the expression of
emotions, and even punctuation in cases of video, speech, and
written text, respectively [1], [2].
Recently, increased attention has been directed at
the study of the emotional content of speech signals [3].
Understanding emotions present in speech and synthesizing
desired emotions in speech according to the intended message
are the basic goals of emotional speech processing [1].
Speech emotion recognition has several applications, and it
has been the object of increasing research interest in recent
years. One such application of emotion recognition is in the
field of human-computer interaction, where interactions
could be made more enjoyable and realistic. The development of
robots that can understand and express emotion and
incorporate a more sensitive interface to respond to user
behavior has become an open field for researchers.
Analyzing the behavior of call attendants working
with customers in call center conversations, for example, can
improve the quality of service: users’ emotional states can be
analyzed to select appropriate conciliation strategies and to
better decide whether to transfer the call to a human agent.
Moreover, in the field of education, detecting bored users in
distance learning contexts would allow designers to change
the style and level of the materials provided. Other examples
include crime investigation departments using the emotion
analysis of telephone conversations between criminals in their
investigations, and emergency services (e.g., ambulance and
fire brigade) analyzing incoming calls to help evaluate the
genuineness of requests. There has also been recent work on
the relationship between driver emotions and driving
performance: information about the state of drivers can be
used to keep them alert, support the driving experience,
encourage better driving, and thus avoid accidents.
Schuller et al. [4] investigated emotion recognition
from speech recorded in the noisy conditions of moving cars.
Their aim was to categorize the current state of the speakers
to improve and monitor in-car safety and the performance of
the drivers. Automatic speech-to-speech translation systems
may be useful in automatic emotion analysis in which
translated speech is expected to represent the emotional states
of speakers. Finally, automatic emotion recognition systems
using speech data can also be used to recognize stressed
speech in aircraft cockpits for improving performance [1], [3],
[5].
1.1. Rhythm
Rhythm has been defined as an effect involving the
isochronous (the property of speech to organize itself in
pieces equal or equivalent in duration) recurrence of some
type of speech unit [6]. Rhythm metrics consist of [7]:
- ∆V, the standard deviation of vocalic intervals;
- ∆C, the standard deviation of consonantal intervals;
- %V, the percentage of utterance duration composed of vocalic intervals;
- VarcoV, the standard deviation of vocalic intervals divided by the mean vocalic duration (× 100);
- VarcoC, the standard deviation of consonantal intervals divided by the mean consonantal duration;
- VarcoVC, the standard deviation of vocalic plus consonantal intervals divided by the mean vocalic plus consonantal duration;
- nPVI-V, the normalized pairwise variability index for vocalic intervals (the mean of the differences between successive vocalic intervals divided by their sum);
- rPVI-C, the raw pairwise variability index for consonantal intervals (the mean of the differences between successive consonantal intervals);
- nPVI-VC, the normalized pairwise variability index for vocalic plus consonantal intervals (the mean of the differences between successive vocalic plus consonantal intervals divided by their sum);
- rPVI-VC, the raw pairwise variability index for vocalic plus consonantal intervals (the mean of the differences between successive vocalic plus consonantal intervals); and
- the articulation rate, the number of (orthographic) syllables produced per second, excluding pauses.
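To make these definitions concrete, the following is a minimal Python sketch (ours, not from the original study) that computes the three interval measures used in this paper (%V, ∆V, ∆C) together with the two basic PVI metrics. It assumes the vocalic and consonantal interval durations (in milliseconds) have already been extracted from a segmented utterance; whether the population or sample standard deviation is used for ∆V and ∆C varies across studies, and the population version is assumed here.

    import statistics

    def interval_measures(vocalic, consonantal):
        """Compute %V, ∆V, and ∆C from lists of interval durations (ms)."""
        total = sum(vocalic) + sum(consonantal)
        pct_v = 100.0 * sum(vocalic) / total         # %V
        delta_v = statistics.pstdev(vocalic)         # ∆V (population std. dev.)
        delta_c = statistics.pstdev(consonantal)     # ∆C
        return pct_v, delta_v, delta_c

    def npvi(intervals):
        """Normalized pairwise variability index (× 100)."""
        terms = [abs(a - b) / ((a + b) / 2.0)
                 for a, b in zip(intervals, intervals[1:])]
        return 100.0 * sum(terms) / len(terms)

    def rpvi(intervals):
        """Raw pairwise variability index."""
        diffs = [abs(a - b) for a, b in zip(intervals, intervals[1:])]
        return sum(diffs) / len(diffs)

    # Toy durations (ms) for a single hypothetical utterance.
    voc = [62, 48, 75, 55]
    con = [90, 110, 70, 130]
    print(interval_measures(voc, con))   # -> (%V, ∆V, ∆C)
    print(npvi(voc), rpvi(con))          # -> nPVI-V, rPVI-C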
In general, languages can be classified depending on
their rhythmic pattern into three classes: stress-timed
languages (e.g., Arabic, English, and German), syllable-
timed languages (e.g., French and Spanish), and mora-timed
languages (e.g., Japanese). The Arabic language has been
described as stress-timed, as given in [8], [9], and [10].
Recent works, such as [11] and [6], have proposed an
approach for describing the rhythmic structure of spoken
languages by relying on acoustic–phonetic measurements.
Ramus et al. [11] suggested a measure based on the
percentage of vocalic intervals (%V) and the standard
deviation of consonantal intervals (ΔC). According to their
research, stress-timed languages have higher ΔC and
lower %V than syllable-timed languages. In addition,
syllable-timed languages have higher ΔC and lower %V than
mora-timed languages [11]. Having found uncertainty in this
classification system, Grabe et al. [6] proposed another way
of classifying the rhythm of languages using the raw and
normalized pairwise variability index (nPVI, rPVI)
calculated from the differences in vocalic and consonantal
durations between successive syllables.
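For reference, writing $d_k$ for the duration of the $k$-th interval in a sequence of $m$ intervals, the raw and normalized PVI are usually given as follows (our notation for the definitions above):

\[ \mathrm{rPVI} = \frac{1}{m-1} \sum_{k=1}^{m-1} \left| d_k - d_{k+1} \right|, \qquad \mathrm{nPVI} = \frac{100}{m-1} \sum_{k=1}^{m-1} \left| \frac{d_k - d_{k+1}}{\left( d_k + d_{k+1} \right)/2} \right| \]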
1.2. Works related to Arabic speech rhythm and emotion
A review of the available literature related to Arabic speech
rhythm and emotion revealed that, until now, Arabic has
received little attention from researchers compared to other
languages. To our knowledge, there is no influential research
on Arabic in this field, with the exception of a limited number
of papers.
Regarding Arabic only, Khan et al. [12] attempted
to classify Arabic sentences into either question or non-
question sentences by segmenting them from continuous
speech using intensity and duration features and by extracting
the prosodic features from each sentence. Their results
indicated an accuracy of 75.7%. Al Dakkak et al. [13]
developed an automated tool for emotional Arabic synthesis
based on an automatic rough prosody generation model and
on the number of phonemes in a sentence. They claimed that
their system’s model tested successfully.
Tajima et al. [9] investigated the rhythmic mode of
three languages, specifically Arabic and English as stress-
timed languages and Japanese as a mora-timed language.
They found that there are phonetic differences and
similarities across languages and that there are systematic
differences even among languages of the same rhythm class.
Ghazali et al. [14] found a gradual decrease in %V from
Eastern Arabic dialects (those of Syria and Jordan) to
Western dialects (those of Morocco and Algeria), with the
dialects of Tunisia and Egypt having intermediate values;
further, they found that ∆C in the Arabic dialects of Tunisia
and Egypt was closer to the dialects of the Middle East than
to those of North Africa. This implies a geographical
relationship between these different Arabic dialects. Despite
their rhythmic differences, all Arabic dialects cluster around
the stress-timed languages in plots of rhythm metrics; however,
it has been reported that there are actually no clear-cut rhythm
classes, but rather overlapping rhythm classes between the
different categories [15].
Biadsy et al. [16] showed that the identification of a
speaker’s dialect can be significantly improved using
prosodic features (e.g., intonation and rhythm) over a purely
phonetic-based approach. In the emotion classification of
speech signals, the popularly employed features are the
statistics of fundamental frequency, energy contour, duration
of silence, and voice quality [17].
This paper investigates the relationship between
rhythm metrics (specifically, %V, ΔC, and ΔV) and the
emotions (i.e., funny, sad, surprised, and question) in MSA
using the KTD corpus. Details of the KTD corpus are
presented in Section 2. The experimental setup is described
in Section 3. The results and a discussion are provided in
Section 4, followed by our conclusions in Section 5.
2. KTD CORPUS
KTD is a Modern Standard Arabic (MSA) simulated
emotional read-speech corpus produced by the King
Abdulaziz City for Science and Technology (KACST) [18].
One actor simulates four emotions: funny, sadness,
surprise, and question. Table 1 shows the 16 selected
emotional spoken sentences that we used in our experiment.
Each sentence was spoken in a way that expressed the four
abovementioned emotions.
Table 1. Spoken sentences [18] (English gloss followed by phonetic transliteration)
S1: A new polio infection, and four hundred and fifteen leprosy cases, in Yemen.
?is?aabatun Ʒadiidatun biʃalalil ?at?faal wa?arbaʢumi?ata waxamsata ʢaʃara bilƷuðaam fil jaman
S2: The total number of leprosy patients reached, as of the beginning of the current year, seven thousand nine hundred and twenty-eight cases.
?alʢadadul kullijji limard?al Ʒuðaam balaʁa maʢa mat?laʢil ʢaamil Ʒaarii sabʢata ?aalaafin watisʢimi?atin waθamaanijatan waʢiʃriina ħaalah
S3: The death of Sheikh Al-Ghazali in Medina, in the month of March, in the year nineteen ninety-six.
wafaatuʃ ʃajxil ʁazaalijj fil madiinatil munawwarah fii ʃahri maaris ʢaama ?alfin watisʢimi?atin wasittatin watisʢiin
S4: The death of Sheikh Gad El-Haq, and his burial in his village, Turrah, in the Talkha district of Dakahlia.
wafaatu ʃʃajx Ʒaadi lħaq wadafnuhu fii qarjatihi bit?urrah bimarkazi t?alxaa biddaqahlijjah
S5: Sheikh El-Shaarawy is laid to rest in Daqadous.
?a ʃʃajxuʃ ʃaʢraawijj juwaaraθ θaraa fii daqaaduus
S6: Zaghloul El-Naggar speaks about the tsunami earthquake.
zaʁluulin naƷƷaar jatakallamu ʢan zilzaalit suunaamii
S7: Ahmed Fouad Pasha, a member of the Academy of the Immortals in Cairo.
?aħmad fu?aad baaʃaa ʢud?wan bimaƷmaʢi lxaalidajn fi lqaahirah
S8: Boris Yeltsin steps down from power for health reasons.
buuris jilsin jatanaħħaa ʢanis sult?ati li?asbaabin s?iħħijjah
S9: George Bush offers his mediation to resolve the crisis between Russia and Georgia.
ƷuurƷi buuʃ juqaddimu wasaat?atahu liħallil ?azmah bajna ruusjaa waƷuurƷijaa
S10: Mohammed Ragab El-Bayoumi, editor-in-chief of Al-Azhar magazine.
muħammad raƷabil bajjuumii ra?iisu taħriiri maƷallatil ?azhar
S11: Sadat, the hero of war and peace.
?assaadaat bat?alul ħarbi wassalaam
S12: Camp David, a binding agreement between Palestine and Israel.
kaambi diifiid ?ittifaaqijjatun mulzimatun bajna filasat?iin wa?israa?iil
S13: The Moroccan monarch, King Hassan, on a visit to the Jordanian monarch, King Hussein bin Talal.
?alʢaahilul maʁribijj ?almalikul ħasan fii zijaaratin lilʢaahilil ?ardunijj ?almaliki ħusajn bin t?alaal
S14: King Fahd threatens the Al-Qaeda organization in Saudi Arabia.
?almalik fahd jatawaʢʢadu tanðˤiimal qaaʢidati fis suʢuudijjah
S15: Ahmed Heikal, the former Minister of Culture, receives the State Appreciation Award.
?aħmad hiikal waziiruθ θaqaafatil ?asbaq jaħs?ulu ʢalaa Ʒaa?izatid dawlatit taqdiirijjah
S16: Bashar Al-Assad, Lahoud, and Mubarak at a trilateral summit.
baʃʃaaril ?asad walaħħuud wamubaarak fii qimmatin θulaaθijjah
3. EXPERIMENT SETUP
To compute rhythm metrics, it is important to segment and
label the speech signals of the corpus being analyzed.
Segmentation must identify and separate consonants, vowels,
and non-speech portions such as silences and short pauses.
This is because rhythm metrics computation depends heavily
on various comparisons between consonantal and vocalic
intervals in the given speech utterance.
In our research, segmentation and alignment were
performed automatically. We were forced to do this because
the original corpus transcriptions contained no phoneme
boundaries. To perform this task in a time-efficient manner,
we used the parallel accumulator of the HERest tool of the
HTK toolkit [19] for HMM re-estimation, in combination
with the capabilities of GNU Parallel [20]. The master label
file was divided into N parts to enable parallel time-alignment
with the HVite tool, as sketched below.
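The following is a minimal sketch of such a parallel alignment setup, not the exact commands used in this work: it splits the alignment jobs into N chunks and runs HVite on each chunk concurrently. The file names (hmmdefs, dict, phones.list, and the per-chunk part{i}.mlf and part{i}.scp) are hypothetical, and the exact HVite options depend on the local HTK configuration.

    import subprocess
    from concurrent.futures import ProcessPoolExecutor

    N = 8  # number of parallel jobs (assumption: one per CPU core)

    def align(part):
        # '-a' requests forced alignment against the labels in the given
        # MLF chunk; '-m' outputs model (phone)-level boundaries.
        cmd = ["HVite", "-a", "-m",
               "-H", "hmmdefs",               # trained HMM definitions
               "-I", f"part{part}.mlf",       # this chunk of the master label file
               "-i", f"aligned{part}.mlf",    # time-aligned output labels
               "-S", f"part{part}.scp",       # feature files for this chunk
               "dict", "phones.list"]         # pronunciation dictionary, phone list
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=N) as pool:
            list(pool.map(align, range(N)))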
Finally, the scheme illustrated in [11] (henceforth: the
Ramus scheme) was used to analyze the rhythm metric
results. The creation of different plots based on the three
rhythm metrics is the final phase of this experimental setup;
from these plots, we can separate the emotions by their rhythm.
4. RESULTS AND DISCUSSION
Table 2 shows the computed overall average standard
deviations of vocalic duration (∆V), the average standard
deviation of consonantal intervals (∆C), and the average
proportions of vocalic intervals (%V) for the given speaker in
each of the four emotion types under investigation, where
time units are milliseconds. The results show that the
‘question’ emotion type can be considered to be an
intermediate emotion between ‘sadness’ and the other two
emotions, ‘funny’ and ‘surprise.’ Both ∆C and ∆V almost
gradually increase as they move along the spectrum from
‘surprise’ to ‘funny’ to ‘question’ and finally to ‘sadness.’
Table 2. Summary of rhythm metrics
Emotion      ∆V      ∆C      %V
Funny      52.153  49.581  35.065
Sad        93.927  84.507  32.801
Surprised  50.535  49.764  35.088
Question   60.570  54.161  35.487
As shown in Table 2, %V is almost the same for all
four emotions except ‘sadness,’ which has a lower %V
value than the other three emotions. This implies that in the
case of sadness the vocalic portion of the utterance shrinks
compared with the other three emotions under investigation.
Next, there is also a large difference in ∆C between sadness
and the other emotions. Finally, ∆V shows the largest
discrepancy, implying that the emotion of sadness gives the
lowest speech rate, while in the cases of surprise and funny
the speech rate is the fastest.
Figure 1 shows the four investigated emotions
plotted on the three planes of the Ramus scheme [11] for
rhythm metric calculation: (a) %V–∆C, (b) %V–∆V, and
(c) ∆V–∆C. Presenting the planes graphically makes the
differences between emotions easier to visualize.
Dellwo et al. [21] showed that consonantal intervals are, on
average, longer in low-rate speech and shorter in high-rate
speech; in other words, the standard deviation of consonantal
intervals (i.e., ∆C) is inversely proportional to the speech rate.
According to this finding of Dellwo et al. [21], and comparing
our investigated emotions, the speech rates for the emotions
of ‘funny’ and ‘surprise’ were higher than that for the emotion
of ‘sadness,’ while the emotion of ‘question’ was positioned
in between. This large difference in speech rates indicates a
large difference in consonantal intervals, as noted above.
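For illustration, such plots can be reproduced from the Table 2 values alone with a short matplotlib sketch (ours, not the plotting code used for Figure 1):

    import matplotlib.pyplot as plt

    # Per-emotion rhythm metrics from Table 2: (∆V, ∆C, %V).
    metrics = {
        "Fun": (52.153, 49.581, 35.065),
        "Sad": (93.927, 84.507, 32.801),
        "Sur": (50.535, 49.764, 35.088),
        "Que": (60.570, 54.161, 35.487),
    }

    # The three Ramus-scheme planes: (x label, y label, x index, y index).
    panels = [("%V", "∆C", 2, 1), ("%V", "∆V", 2, 0), ("∆V", "∆C", 0, 1)]

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for ax, (xlab, ylab, xi, yi) in zip(axes, panels):
        for emo, vals in metrics.items():
            ax.scatter(vals[xi], vals[yi])            # one point per emotion
            ax.annotate(emo, (vals[xi], vals[yi]))    # label the point
        ax.set_xlabel(xlab)
        ax.set_ylabel(ylab)
    fig.tight_layout()
    plt.show()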
5. CONCLUSION
The Arabic language suffers from a lack of research in emotion
processing, despite the fact that emotion processing is a recent
and growing research topic with many possible applications.
We investigated the four emotions of funny, sadness, surprise,
and question using rhythm metrics. The results showed
important variations in rhythm metrics among these four
emotions in MSA speech. Differences between the emotions
of sadness and question were easily detected by rhythm
metrics. The consonantal intervals for the emotion of sadness
were larger than those for the emotions of funny, surprise, and
question. Therefore, it can be concluded that the speech rate
for the emotion of sadness is lower than the rates for the other
three investigated emotions. The emotions of funny and
surprise cannot be distinguished using the rhythm metrics
(i.e., %V, ΔC, and ΔV) because of their high similarity across
these metrics. The vocalic interval percentage, %V, is not a
sufficient parameter for distinguishing emotions. Conversely,
the rhythm metrics ΔC and ΔV can be used to distinguish
emotions. This finding will be exploited in the future to help
adapt speech recognizers to specific accents and regional
varieties of languages.
6. ACKNOWLEDGMENT
This work was supported by the NPST program under King
Saud University Project Number 10-INF1325-02.
Figure 1. Emotions distributed among a) %V–∆C, b) %V–∆V, and c) ∆V–∆C planes.
7. REFERENCES
[1] S. G. Koolagudi and K. S. Rao, “Emotion recognition from speech: a review,” International Journal of Speech Technology, vol. 15, no. 2, pp. 99-117, June 2012.
[2] A. Hassan, “On automatic emotion classification using acoustic features,” Doctoral Thesis, Faculty of Physical and Applied Sciences, University of Southampton, 2012, 204 pp.
[3] M. El Ayadi, M. S. Kamel, and F. Karray, “Survey on speech emotion recognition: Features, classification schemes, and databases,” Pattern Recognition, vol. 44, no. 3, pp. 572-587, March 2011.
[4] B. Schuller, G. Rigoll, M. Grimm, K. Kroschel, T. Moosmayr, and G. Ruske, “Effects of in-car noise-conditions on the recognition of emotion within speech,” in Proceedings of the 33rd Annual Conference on Acoustics, DAGA’07, pp. 305-306, Stuttgart, Germany, 2007.
[5] Yazid A. and Pierre D., “Automatic Emotion Recognition from Speech,” PhD Research Proposal, Springer-Verlag, Berlin, Heidelberg, 2011.
[6] E. Grabe and E. L. Low, “Durational variability in speech and the rhythm class hypothesis,” Papers in Laboratory Phonology, vol. 7, pp. 515-546, Berlin: Mouton, 2002.
[7] J. M. Liss, L. White, S. L. Mattys, K. Lansford, A. J. Lotto, S. M. Spitzer, and J. N. Caviness, “Quantifying speech rhythm abnormalities in the dysarthrias,” Journal of Speech, Language, and Hearing Research, vol. 52, no. 5, pp. 1334-1352, 2009.
[8] D. Abercrombie, Elements of General Phonetics, Edinburgh University Press, 1967.
[9] K. Tajima, B. Zawaydeh, and M. Kitahara, “A comparative study of speech rhythm in Arabic, English and Japanese,” in Proceedings of the XIVth ICPhS, San Francisco, U.S.A., 1999.
[10] P. Roach, “On the distinction between ‘stress-timed’ and ‘syllable-timed’ languages,” in Linguistic Controversies, D. Crystal, Ed., London: Edward Arnold, pp. 73-79, 1982.
[11] F. Ramus, M. Nespor, and J. Mehler, “Correlates of linguistic rhythm in the speech signal,” Cognition, vol. 73, no. 3, pp. 265-292, 1999.
[12] O. Khan, W. Al-Khatib, and L. Cheded, “Detection of questions in Arabic audio monologues using prosodic features,” in Proceedings of the IEEE International Symposium on Multimedia (ISM2007), pp. 29-36, Taichung, Taiwan, R.O.C., December 2007.
[13] O. Al-Dakkak, N. Ghneim, M. Abou Zliekha, and S. Al-Moubayed, “Prosodic feature introduction and emotion incorporation in an Arabic TTS,” in Information and Communication Technologies, ICTTA ’06, pp. 1317-1322, 2006.
[14] S. Ghazali, R. Hamdi, and M. Barkat, “Speech rhythm variation in Arabic dialects,” in 1st International Conference on Speech Prosody ’02, Aix-en-Provence, pp. 331-334, 2002.
[15] M. Barkat-Defradas, R. Hamdi, E. Ferragne, and F. Pellegrino, “Speech timing and rhythmic structure in Arabic dialects: a comparison of two approaches,” in Proc. INTERSPEECH, pp. 1613-1616, 2004.
[16] F. Biadsy and J. Hirschberg, “Using prosody and phonotactics in Arabic dialect identification,” in INTERSPEECH ’09, pp. 208-211, 2009.
[17] T. L. Nwe, S. W. Foo, and L. C. De Silva, “Speech emotion recognition using hidden Markov models,” Speech Communication, vol. 41, no. 4, pp. 603-623, November 2003.
[18] KTD datasheet, a KACST internal report (not published).
[19] S. J. Young, “The HTK hidden Markov model toolkit: Design and philosophy,” Entropic Cambridge Research Laboratory, Ltd., vol. 2, pp. 2-44, 1994.
[20] O. Tange, “GNU Parallel - the command-line power tool,” The USENIX Magazine, vol. 36, no. 1, pp. 42-47, Feb. 2011. Available: http://www.gnu.org/s/parallel.
[21] V. Dellwo and P. Wagner, “Relations between language rhythm and speech rate,” in Proceedings of the International Congress of Phonetic Sciences, pp. 471-474, Barcelona, 2003.