
INVESTIGATION OF EMOTION CLASSIFICATION USING SPEECH RHYTHM METRICS

Yousef A. Alotaibi¹, Ali H. Meftah², Sid-Ahmed Selouani³
¹,²College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
³Université de Moncton, 218 bvd. J.-D.-Gauthier, Shippagan, E8S 1P6, Canada
{yaalotaibi, ameftah}@ksu.edu.sa, [email protected]

ABSTRACT

The processing of emotion during speech production has recently become a subject of increasing interest and attention for many languages. Such research has a wide range of applications in many different fields. The Arabic language, however, has thus far received little attention in this area, apart from a limited number of conference papers. The current paper investigates the relationship between the rhythm-metric interval measures (IM), namely %V, ΔC, and ΔV, and speech emotions. The four emotions of 'funny,' 'sad,' 'surprised,' and 'question' are investigated in Modern Standard Arabic (MSA). The KACST Text To Speech Database (KTD) was used to perform the study. We found variations between the four emotion categories, especially for sadness and question. A major conclusion is that the ΔC and ΔV rhythm metrics can be used to classify emotions such as those investigated in this study.

Index Terms— Rhythm, funny, sad, surprised, question

1. INTRODUCTION

Human emotions are a combination of psychological and physiological factors and can be expressed as subjective experience, physiological changes in the body (tense muscles, dry mouth, sweating, etc.), and individual behavior (facial expressions, fleeing, hiding, etc.). Non-linguistic information plays an important role in human communication, and may be observed through facial expressions, the expression of emotions, and even punctuation in the cases of video, speech, and written text, respectively [1], [2].

Recently, increased attention has been directed at the study of the emotional content of speech signals [3]. Understanding the emotions present in speech and synthesizing desired emotions in speech according to the intended message are the basic goals of emotional speech processing [1]. Speech emotion recognition has several applications, and it has been the object of increasing research interest in recent years. One such application is in the field of human-computer interaction, where interactions could become more enjoyable and realistic. The development of robots that can understand and express emotion, and that incorporate a more sensitive interface to respond to user behavior, has become an open field for researchers.

Analyzing the behavior of call attendants working with customers in call-center conversations, for example, can be used to improve their quality of service: users' emotional states can be analyzed to select appropriate conciliation strategies and to better decide whether to transfer a call to a human agent. Moreover, in the field of education, detecting bored users in distance-learning contexts would allow designers to change the style and level of the materials provided. Other examples include crime-investigation departments using the emotion analysis of telephone conversations between criminals in their investigations, and emergency services (e.g., ambulance and fire brigade) analyzing incoming calls to help evaluate the genuineness of requests. There has also been recent work on the relationship between driver emotions and driving performance, aimed at avoiding accidents by supporting the driving experience and encouraging better driving; information about the state of drivers can be used to keep them alert while driving.

Schuller et al. [4] investigated emotion recognition from speech recorded in the noisy conditions of moving cars. Their aim was to categorize the current state of the speakers in order to monitor and improve in-car safety and driver performance. Automatic speech-to-speech translation systems may also benefit from automatic emotion analysis, since the translated speech is expected to reproduce the emotional state of the speaker. Finally, automatic emotion recognition systems using speech data can be used to recognize stressed speech in aircraft cockpits and thereby improve performance [1], [3], [5].

1.1. Rhythm

Rhythm has been defined as an effect involving the isochronous recurrence of some type of speech unit, where isochrony refers to the property of speech to organize itself in pieces equal or equivalent in duration [6]. The rhythm metrics consist of the following [7]:

- ΔV: the standard deviation of vocalic interval durations;
- ΔC: the standard deviation of consonantal interval durations;
- %V: the percentage of utterance duration composed of vocalic intervals;
- VarcoV: the standard deviation of vocalic intervals divided by the mean vocalic duration (× 100);
- VarcoC: the standard deviation of consonantal intervals divided by the mean consonantal duration;
- VarcoVC: the standard deviation of vocalic-plus-consonantal intervals divided by the mean vocalic-plus-consonantal duration;
- nPVI-V: the normalized pairwise variability index for vocalic intervals (the mean of the differences between successive vocalic intervals, each divided by their sum);
- rPVI-C: the raw pairwise variability index for consonantal intervals (the mean of the differences between successive consonantal intervals);
- nPVI-VC: the normalized pairwise variability index for vocalic-plus-consonantal intervals (the mean of the differences between successive vocalic-plus-consonantal intervals, divided by their sum);
- rPVI-VC: the raw pairwise variability index for vocalic and consonantal intervals (the mean of the differences between successive vocalic and consonantal intervals); and
- articulation rate: the number of (orthographic) syllables produced per second, excluding pauses.
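To make the interval measures used later in this paper concrete, the following Python sketch computes %V, ΔC, and ΔV from a consonant/vowel/pause labelling of an utterance. It is our own illustration rather than code from the study; the segment format, the 'P' pause label, and the use of the population standard deviation are assumptions.

```python
from statistics import pstdev

def vc_intervals(segments):
    """Merge adjacent segments of the same class into intervals.

    `segments` is a list of (label, start_ms, end_ms) tuples, where label is
    'V' (vocalic), 'C' (consonantal), or 'P' (pause/silence).  Pauses break
    the current interval and are excluded from all measures.
    """
    intervals = []           # list of (label, duration_ms)
    prev_label = None
    for label, start, end in segments:
        dur = end - start
        if label == 'P':
            prev_label = None        # a pause ends the current interval
            continue
        if label == prev_label:
            # Extend the interval opened by the previous segment.
            last_label, last_dur = intervals.pop()
            intervals.append((last_label, last_dur + dur))
        else:
            intervals.append((label, dur))
        prev_label = label
    return intervals

def interval_measures(segments):
    """Return (%V, deltaC, deltaV) as defined by Ramus et al. [11]."""
    intervals = vc_intervals(segments)
    v = [d for lab, d in intervals if lab == 'V']
    c = [d for lab, d in intervals if lab == 'C']
    total = sum(v) + sum(c)
    percent_v = 100.0 * sum(v) / total
    # Population standard deviation; the sample SD is an equally common choice.
    return percent_v, pstdev(c), pstdev(v)

# Toy example with times in milliseconds.
segs = [('C', 0, 80), ('V', 80, 150), ('C', 150, 190), ('C', 190, 260),
        ('P', 260, 400), ('V', 400, 520), ('C', 520, 600)]
print(interval_measures(segs))   # -> (%V, deltaC, deltaV)
```

Applying such a routine to each labeled recording and averaging over sentences per emotion would yield figures comparable in form to those reported in Table 2.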

In general, languages can be classified by their rhythmic pattern into three classes: stress-timed languages (e.g., Arabic, English, and German), syllable-timed languages (e.g., French and Spanish), and mora-timed languages (e.g., Japanese). Arabic has been described as stress-timed [8], [9], [10]. Recent works such as [11] and [6] have proposed approaches for describing the rhythmic structure of spoken languages by relying on acoustic-phonetic measurements. Ramus et al. [11] suggested a measure based on the percentage of vocalic intervals (%V) and the standard deviation of consonantal intervals (ΔC). According to their research, stress-timed languages have higher ΔC and lower %V than syllable-timed languages; in turn, syllable-timed languages have higher ΔC and lower %V than mora-timed languages [11]. Having found uncertainty in this classification system, Grabe and Low [6] proposed another way of classifying the rhythm of languages, using the raw and normalized pairwise variability indices (rPVI, nPVI) calculated from the differences in vocalic and consonantal durations between successive syllables.
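For completeness, the pairwise variability indices mentioned above can be written in a few lines. The sketch below follows the standard rPVI/nPVI formulation (the raw mean absolute difference between successive interval durations, and the same difference normalized by the mean of each pair and scaled by 100); it is an illustration of those definitions, not code from this paper, and the example durations are invented.

```python
def rpvi(durations):
    """Raw PVI: mean absolute difference between successive interval
    durations (usually applied to consonantal intervals)."""
    diffs = [abs(a - b) for a, b in zip(durations, durations[1:])]
    return sum(diffs) / len(diffs)

def npvi(durations):
    """Normalized PVI: each successive difference is divided by the mean of
    the pair, averaged, and scaled by 100 (usually applied to vocalic
    intervals, which reduces the influence of overall speech rate)."""
    terms = [abs(a - b) / ((a + b) / 2.0)
             for a, b in zip(durations, durations[1:])]
    return 100.0 * sum(terms) / len(terms)

# Invented vocalic interval durations (ms) for one utterance.
vowel_durs = [70, 120, 65, 140, 90]
print(round(npvi(vowel_durs), 1), round(rpvi(vowel_durs), 1))
```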

1.2. Works related to Arabic speech rhythm and emotion

A review of the available literature on Arabic speech rhythm and emotion reveals that, until now, researchers have paid little attention to Arabic compared with other languages. To our knowledge, there is no influential research on Arabic in this field apart from a limited number of papers.

Regarding Arabic specifically, Khan et al. [12] attempted to classify Arabic sentences as either question or non-question sentences by segmenting them from continuous speech using intensity and duration features and by extracting prosodic features from each sentence. Their results indicated an accuracy of 75.7% for their approach. Al-Dakkak et al. [13] developed an automated tool for emotional Arabic synthesis based on an automatic rough prosody generation model and on the number of phonemes in a sentence; they claimed that their system's model tested successfully.

Tajima et al. [9] investigated the rhythmic mode of three languages, specifically Arabic and English as stress-timed languages and Japanese as a mora-timed language. They found phonetic differences and similarities across languages, as well as systematic differences even among languages of the same rhythm class. Ghazali et al. [14] found a gradual decrease in %V from Eastern Arabic dialects (those of Syria and Jordan) to Western dialects (those of Morocco and Algeria), with the dialects of Tunisia and Egypt having intermediate values; further, they found that ΔC in the Arabic dialects of Tunisia and Egypt was closer to the dialects of the Middle East than to those of North Africa. This implies a geographical relationship between these different Arabic dialects. Despite their rhythmic differences, all Arabic dialects cluster around the stress-timed languages in plots of rhythm metrics; however, it has been reported that there are in fact no clear-cut rhythm classes and that, instead, the rhythm classes overlap across categories [15].

Biadsy et al. [16] showed that the identification of a speaker's dialect can be significantly improved by using prosodic features (e.g., intonation and rhythm) over a purely phonetic-based approach. In the emotion classification of speech signals, the features most commonly employed are statistics of the fundamental frequency, the energy contour, the duration of silences, and voice quality [17].

This paper investigates the relationship between rhythm metrics (specifically, %V, ΔC, and ΔV) and the emotions of funny, sad, surprised, and question in MSA, using the KTD corpus. Details of the KTD corpus are presented in Section 2. The experimental setup is described in Section 3. The results and a discussion are provided in Section 4, followed by our conclusions in Section 5.

2. KTD CORPUS

KTD is a Modern Standard Arabic (MSA) simulated emotional read-speech corpus produced by the King Abdulaziz City for Science and Technology (KACST) [18]. One actor simulates four emotions: funny, sadness, surprise, and question. Table 1 shows the 16 emotional spoken sentences selected for our experiment. Each sentence was spoken in a way that expressed each of the four abovementioned emotions.


Table 1. Spoken sentences [18]

S1:

.إصابة جديدة يمنأ جذامأ في الأ سة عشر بالأ بعمئة وخمأ ، وأرأ فالأ طأ بشلل الأ

?is?aabatun Ʒadiidatun biʃalalil ?at?faal wa?arbaʢumi?ata waxamsata ʢaʃara bilƷuðaam fil jaman

S2:

لع ، بلغ مع مطأ ضى الأجذامأ لمرأ كل ي .ألأعدد الأ رين حالةأ عمئة وثمانية وعشأ الأعام الأجاري، سبأعة آلف وتسأ

?alʢadadul kullijji limard?al Ʒuðaam balaʁa maʢa mat?laʢil ʢaamil Ʒaarii sabʢata ?aalaafin watisʢimi?atin

waθamaanijatan waʢiʃriina ħaalah

S3:

.وفاة الشيأخ عينأ عمئة، وستة وتسأ ، عام ألأف وتسأ ر مارسأ رةأ، في شهأ ، في الأمدينة الأمنو الأغزالي

wafaatuʃ ʃajxil ʁazaalijj fil madiinatil munawwarah fii ʃahri maaris ʢaama ?alfin watisʢimi?atin wasittatin

watisʢiin

S4: .وفاة ليةأ خا، بالدقهأ كز طلأ ةأ، بمرأ يته بطر ، ودفأنه في قرأ الشيأخ جاد الأحقأ

wafaatu ʃʃajx Ʒaadi lħaq wadafnuhu fii qarjatihi bit?urrah bimarkazi t?alxaa biddaqahlijjah

S5: ، يوارى الثرى في .ألشيأخ الشعأراوي أ دقادوسأ

?a ʃʃajxuʃ ʃaʢraawijj juwaaraθ θaraa fii daqaaduus

S6: ، يتكلم عنأ زلأزال تأسونامي ارأ لول النج .زغأ

zaʁluulin naƷƷaar jatakallamu ʢan zilzaalit suunaamii

S7: مع الأخالديأن في وا بمجأ مدأ فؤادأ باشا، عضأ .الأقاهرةأحأ

?aħmad fu?aad baaʃaa ʢud?wan bimaƷmaʢi lxaalidajn fi lqaahirah

S8: . يةأ باب صح لأطة لسأ ى عن الس بورسأ يلأسن، يتنح

buuris jilsin jatanaħħaa ʢanis sult?ati li?asbaabin s?iħħijjah

S9: م وساطته ج بوش، يقد جياجورأ يا وجورأ مةأ، بيأن روسأ زأ .لحل الأ

ƷuurƷi buuʃ juqaddimu wasaat?atahu liħallil ?azmah bajna ruusjaa waƷuurƷijaa

S10: هرأ زأ رير مجلة الأ بيومي، رئيس تحأ دأ رجب الأ .محم

muħammad raƷabil bajjuumii ra?iisu taħriiri maƷallatil ?azhar

S11: . ب والسلمأ ، بطل الأحرأ ألساداتأ

?assaadaat bat?alul ħarbi wassalaam

S12: . رائيلأ ب ديفيدأ، إت فاقية ملأزمة بيأن فلسطين وإسأ كامأ

kaambi diifiid ?ittifaaqijjatun mulzimatun bajna filasat?iin wa?israa?iil

S13: . ملك حسيأن بن طللأ ، ألأ دني أ رأ ، في زيارة للأعاهل الأ ملك الأحسنأ ، ألأ ألأعاهل الأمغأربي أ

?alʢaahilul maʁribijj ?almalikul ħasan fii zijaaratin lilʢaahilil ?ardunijj ?almaliki ħusajn bin t?alaal

S14: . دأ، يتوعد تنأظيم الأقاعدة في السعوديةأ ألأملكأ فهأ

?almalik fahd jatawaʢʢadu tanðˤiimal qaaʢidati fis suʢuudijjah

S15: . لة التقأديريةأ صل على جائزة الدوأ ، يحأ بقأ سأ ، وزير الثقافة الأ مدأ هيكلأ أحأ

?aħmad hiikal waziiruθ θaqaafatil ?asbaq jaħs?ulu ʢalaa Ʒaa?izatid dawlatit taqdiirijjah

S16: ة ثلثية. ودأ، ومباركأ في قم سدأ، ولح بشار الأ

baʃʃaaril ?asad walaħħuud wamubaarak fii qimmatin θulaaθijjah


3. EXPERIMENTAL SETUP

To compute rhythm metrics, it is important to segment and label the speech signals of the corpus being analyzed. Segmentation must identify and separate consonants, vowels, and non-speech portions such as silences and short pauses, because the computation of the rhythm metrics depends heavily on various comparisons between the consonantal and vocalic intervals of a given utterance.

In our research, segmentation and alignment were performed automatically; this was necessary because the original corpus transcriptions contained no phoneme boundaries. To perform this task in a time-efficient manner, we used the parallel accumulator of the HERest tool from the HTK toolkit [19] for HMM re-estimation, in combination with the capabilities of GNU Parallel [20]. The master label file was divided into N parts to enable parallel time alignment with the HVite tool.
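As an illustration of this parallelization step: HTK master label files start with a '#!MLF!#' header and contain one entry per utterance, terminated by a line holding a single period, so the file can be split on those entry boundaries and each part aligned by its own HVite process (driven, for example, by GNU Parallel [20]). The Python sketch below is our own reconstruction of such a splitting step; the file names and the round-robin assignment are assumptions, not the exact procedure used in the study.

```python
def split_mlf(mlf_path, n_parts, out_prefix="part"):
    """Split an HTK master label file into n_parts smaller MLFs so that
    time alignment (e.g., with HVite) can run in parallel on each part."""
    with open(mlf_path, "r", encoding="utf-8") as f:
        lines = f.read().splitlines()
    assert lines[0].strip() == "#!MLF!#", "not an MLF file"

    # Collect complete entries: each starts with a quoted label-file name
    # and ends with a line containing only '.'.
    entries, current = [], []
    for line in lines[1:]:
        current.append(line)
        if line.strip() == ".":
            entries.append(current)
            current = []

    # Distribute the entries round-robin over n_parts output files.
    outputs = [[] for _ in range(n_parts)]
    for i, entry in enumerate(entries):
        outputs[i % n_parts].append(entry)

    paths = []
    for i, chunk in enumerate(outputs):
        path = f"{out_prefix}_{i}.mlf"
        with open(path, "w", encoding="utf-8") as f:
            f.write("#!MLF!#\n")
            for entry in chunk:
                f.write("\n".join(entry) + "\n")
        paths.append(path)
    return paths

# Example (hypothetical file name): split into 8 parts for 8 parallel jobs.
# split_mlf("ktd_all.mlf", 8)
```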

Finally, the scheme described in [11] (henceforth, the Ramus scheme) was used to analyze the rhythm-metric results. The final phase of this experimental setup is the creation of plots based on the three rhythm metrics; from these plots we can classify the different emotions according to their rhythm.

4. RESULTS AND DISCUSSION

Table 2 shows the computed overall average standard deviation of vocalic intervals (ΔV), the average standard deviation of consonantal intervals (ΔC), and the average proportion of vocalic intervals (%V) for the given speaker in each of the four emotion types under investigation; time units are milliseconds. The results show that the 'question' emotion can be considered intermediate between 'sadness' and the other two emotions, 'funny' and 'surprise.' Both ΔC and ΔV increase almost gradually along the spectrum from 'surprise' to 'funny' to 'question' and finally to 'sadness.'

Table 2. Summary of rhythm metrics (ΔV and ΔC in ms; %V in %)

Emotion      ΔV       ΔC       %V
Funny        52.153   49.581   35.065
Sad          93.927   84.507   32.801
Surprised    50.535   49.764   35.088
Question     60.570   54.161   35.487
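For reference, plots like those in Figure 1 can be generated directly from the Table 2 averages. The matplotlib sketch below is a minimal illustration (not the authors' plotting code) that places one averaged point per emotion in each of the three Ramus-scheme planes, whereas Figure 1 plots individual sentences.

```python
import matplotlib.pyplot as plt

# Average rhythm metrics from Table 2 (dV, dC in ms; pV in percent).
metrics = {
    "Funny":     {"dV": 52.153, "dC": 49.581, "pV": 35.065},
    "Sad":       {"dV": 93.927, "dC": 84.507, "pV": 32.801},
    "Surprised": {"dV": 50.535, "dC": 49.764, "pV": 35.088},
    "Question":  {"dV": 60.570, "dC": 54.161, "pV": 35.487},
}

# The three planes of the Ramus scheme: (x-key, y-key, x-label, y-label).
planes = [("pV", "dC", "%V", "ΔC"),
          ("pV", "dV", "%V", "ΔV"),
          ("dV", "dC", "ΔV", "ΔC")]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (xk, yk, xl, yl) in zip(axes, planes):
    for emotion, m in metrics.items():
        ax.scatter(m[xk], m[yk], label=emotion)
        ax.annotate(emotion, (m[xk], m[yk]), textcoords="offset points",
                    xytext=(4, 4), fontsize=8)
    ax.set_xlabel(xl)
    ax.set_ylabel(yl)
fig.tight_layout()
plt.show()
```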

As shown in Table 2, %V is almost the same for all four emotions except 'sadness,' which has a lower %V value than the other three. This implies that in the case of sadness the vocalic portion of the utterance shrinks relative to the other three emotions under investigation. There is also a large difference in ΔC between sadness and the other emotions. Finally, ΔV shows the largest discrepancy, implying that the emotion of sadness yields the lowest speech rate, while for surprise and funny the speech rate is the fastest.

Figure 1 shows the four investigated emotions plotted on the three planes of the Ramus scheme [11] for rhythm-metric calculation: (a) %V–ΔC, (b) %V–ΔV, and (c) ΔV–ΔC. Presenting the planes graphically makes the differences between the emotions easier to visualize. Dellwo et al. [21] showed that consonantal intervals are, on average, longer in low-rate speech and shorter in high-rate speech; in other words, the standard deviation of consonantal intervals (ΔC) is inversely proportional to speech rate. According to this finding, and comparing our investigated emotions, the speech rates for 'funny' and 'surprise' were higher than that for 'sadness,' while 'question' was positioned between them. This large difference in speech rates indicates a large difference in consonantal intervals, as noted above.

5. CONCLUSION

The Arabic language suffers from a lack of research in emotion processing, despite the fact that emotion processing is a growing research topic with many possible applications. We investigated the four emotions of funny, sadness, surprise, and question using rhythm metrics. The results showed important variations in the rhythm metrics among these four emotions in MSA speech. Differences between the emotions of sadness and question were easily detected by the rhythm metrics. The consonantal intervals for the emotion of sadness were larger than those for the emotions of funny, surprise, and question; it can therefore be concluded that the speech rate for sadness is lower than the rates for the other three investigated emotions. The emotions of funny and surprise cannot be distinguished using the rhythm metrics (%V, ΔC, and ΔV) because of their high similarity across these metrics. The vocalic interval percentage, %V, is not a sufficient parameter for distinguishing emotions; conversely, the ΔC and ΔV rhythm metrics can be used to distinguish them. This finding will be exploited in the future to help adapt speech recognizers to specific accents and regional varieties of languages.

6. ACKNOWLEDGMENT

This work was supported by the NPST program under King Saud University Project Number 10-INF1325-02.


Figure 1. Emotions distributed among the (a) %V–ΔC, (b) %V–ΔV, and (c) ΔV–ΔC planes (one scatter plot per plane; legend: Fun, Sad, Sur, Que).


7. REFERENCES

[1] S. G. Koolagudi and K. S. Rao, "Emotion recognition from speech: a review," International Journal of Speech Technology, vol. 15, no. 2, pp. 99-117, June 2012.

[2] A. Hassan, On Automatic Emotion Classification Using Acoustic Features, Doctoral Thesis, Faculty of Physical and Applied Sciences, University of Southampton, 2012.

[3] M. El Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognition, vol. 44, no. 3, pp. 572-587, March 2011.

[4] B. Schuller, G. Rigoll, M. Grimm, K. Kroschel, T. Moosmayr, and G. Ruske, "Effects of in-car noise-conditions on the recognition of emotion within speech," in Proc. 33rd Annual Conference on Acoustics (DAGA '07), Stuttgart, Germany, pp. 305-306, 2007.

[5] A. Yazid and D. Pierre, Automatic Emotion Recognition from Speech, PhD research proposal, Springer-Verlag, Berlin, Heidelberg, 2011.

[6] E. Grabe and E. L. Low, "Durational variability in speech and the rhythm class hypothesis," in Papers in Laboratory Phonology 7, Berlin: Mouton, pp. 515-546, 2002.

[7] J. M. Liss, L. White, S. L. Mattys, K. Lansford, A. J. Lotto, S. M. Spitzer, and J. N. Caviness, "Quantifying speech rhythm abnormalities in the dysarthrias," Journal of Speech, Language, and Hearing Research, vol. 52, no. 5, pp. 1334-1352, 2009.

[8] D. Abercrombie, Elements of General Phonetics, Edinburgh University Press, 1967.

[9] K. Tajima, B. Zawaydeh, and M. Kitahara, "A comparative study of speech rhythm in Arabic, English and Japanese," in Proc. XIVth ICPhS, San Francisco, USA, 1999.

[10] P. Roach, "On the distinction between 'stress-timed' and 'syllable-timed' languages," in Linguistic Controversies, D. Crystal, Ed., London: Edward Arnold, pp. 73-79, 1982.

[11] F. Ramus, M. Nespor, and J. Mehler, "Correlates of linguistic rhythm in the speech signal," Cognition, vol. 73, no. 3, pp. 265-292, 1999.

[12] O. Khan, W. Al-Khatib, and L. Cheded, "Detection of questions in Arabic audio monologues using prosodic features," in Proc. IEEE International Symposium on Multimedia (ISM 2007), Taichung, Taiwan, pp. 29-36, December 2007.

[13] O. Al-Dakkak, N. Ghneim, M. Abou Zliekha, and S. Al-Moubayed, "Prosodic feature introduction and emotion incorporation in an Arabic TTS," in Proc. Information and Communication Technologies (ICTTA '06), pp. 1317-1322, 2006.

[14] S. Ghazali, R. Hamdi, and M. Barkat, "Speech rhythm variation in Arabic dialects," in Proc. 1st International Conference on Speech Prosody, Aix-en-Provence, pp. 331-334, 2002.

[15] M. Barkat-Defradas, R. Hamdi, E. Ferragne, and F. Pellegrino, "Speech timing and rhythmic structure in Arabic dialects: a comparison of two approaches," in Proc. INTERSPEECH, pp. 1613-1616, 2004.

[16] F. Biadsy and J. Hirschberg, "Using prosody and phonotactics in Arabic dialect identification," in Proc. INTERSPEECH '09, pp. 208-211, 2009.

[17] T. L. Nwe, S. W. Foo, and L. C. De Silva, "Speech emotion recognition using hidden Markov models," Speech Communication, vol. 41, no. 4, pp. 603-623, November 2003.

[18] KTD datasheet, KACST internal report (unpublished).

[19] S. J. Young, "The HTK hidden Markov model toolkit: Design and philosophy," Entropic Cambridge Research Laboratory, Ltd., vol. 2, pp. 2-44, 1994.

[20] O. Tange, "GNU Parallel - the command-line power tool," The USENIX Magazine, vol. 36, no. 1, pp. 42-47, Feb. 2011. Available: http://www.gnu.org/s/parallel.

[21] V. Dellwo and P. Wagner, "Relations between language rhythm and speech rate," in Proc. International Congress of Phonetic Sciences, Barcelona, pp. 471-474, 2003.
