Emotions in Hindi - Recognition and Conversion
S.S. Agrawal
CDAC, Noida & KIIT, Gurgaon
email: [email protected], [email protected]
Contents
• Intonation patterns with sentence type categories
• A relationship between F0 values in vowels and emotions: an analytical study
• Recognition and perception of emotions based on spectral and prosodic values obtained from vowels
• F0 pattern analysis of emotion sentences in Hindi
• Emotion conversion using the intonation database from sentences and words
• Comparison of machine and perception experiments
• Hindi speech possesses pitch patterns depending on the meaning, structure and type.
• Intonation also decides the meaning of certain words depending on the type of sentence or phrase where these occur.
• In Hindi we observe three levels of intonation, which can be classified as 'normal', 'high' and 'low'.
• In exceptional cases the presence of VH (very high) and EH (extremely high) is felt, though these rarely occur.
• For observing intonation patterns due to sentence type, we may classify sentences into the following eight categories: Affirmative, Negative, Interrogative, Imperative, Doubtful, Desiderative, Conditional, and Exclamatory.
Intonation Patterns of Hindi
Affirmative ( MHL pitch pattern )
Negative (MHL pitch pattern )
Imperative (ML pitch pattern)
Doubtful
Desiderative
Exclamatory (MHM Pitch pattern)
Intonation Patterns of Hindi
Application to Emotional Behavior
Recognition of Emotion
Conversion of Emotion
Emotion Recognition
Natural human-machine interaction requires machine-based emotional intelligence: to respond satisfactorily to human emotions, computer systems need accurate emotion recognition. Such recognition can also be used to monitor the physiological state of individuals in demanding work environments and to augment automated medical or forensic data-analysis systems.
METHOD

Material

Speakers: six male graduate students (from the drama club, Aligarh Muslim University, Aligarh), native speakers of Hindi, age group 20-23 years.
Sentences: 5 short neutral Hindi sentences.
Emotions: neutral, happiness, anger, sadness, and fear.
Repetitions: 4.
In this way there were 600 (6 × 5 × 5 × 4) sentences.
Recording

Electret microphone, partially sound-treated room, "PRAAT" software.
Sampling rate 16 kHz / 16 bit.
Distance between mouth and microphone was adjusted to nearly 30 cm.
Listening test

The 600 sentences above were first randomized across sentences and speakers, and then presented to 20 naive listeners to evaluate the emotions within five categories: neutral, happiness, anger, sadness, and fear. Only those sentences whose emotions were identified by at least 80% of the listeners were selected for this study. After selection, we were left with 400 sentences for our study.
Acoustic Analysis
Prosody-related features (mean value of pitch (F0), duration, rms value of sound pressure, and speech power).
Spectral features (15 mel-frequency cepstral coefficients).
Prosody features: For the present study, the central 60 ms portion of the vowel /a/ occurring in all the sentences at different positions (underlined in the sentences given in the Appendix) was used to measure all the features. In this way there were in total 13 /a/ vowels (3 in the first sentence, 3 in the second, 2 in the third, 4 in the fourth, and 1 in the fifth sentence). After taking the 60 ms portion of each /a/ vowel, the average over all the vowels of each sentence was taken. Besides F0, speech power and sound pressure were also calculated.
Feature extraction method: Praat software was used to measure all the prosody features.
The figure shows the waveform (upper) and spectrogram (lower, pitch in blue) for the word /sItaar/ in (a) anger, (b) fear, (c) happiness, (d) neutral, and (e) sadness, as obtained in "PRAAT".
Table 1: Mean F0 values (Hz) of vowels for each emotion

Emotion      A      E      I      i      Av
Anger        237    218    222.2  244    234.6
Sadness      107    110.4  106.0  111    110.1
Neutral      134.5  131.0  132.5  146.9  136.3
Happiness    194.5  190.5  189.1  189.0  191.8
Fear         160.9  163.7  162.3  191.2  173.2
Spectral features: MFCC coefficients were calculated using MATLAB. Frame duration was 16 ms, with an overlap of 9 ms between frames. From each frame, 3 MFCCs were calculated, and as we had five frames, we obtained 15 MFCCs for each sample. Thus in total there are 19 parameters. All 19 measured parameters of the sentences of each emotion were then normalized with respect to the parameters of the neutral sentences of the given speaker.
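The slides state that the MFCCs were computed in MATLAB; as an illustration, a minimal Python sketch of the same framing (16 ms frames, 9 ms overlap, 3 MFCCs per frame, first 5 frames kept) might look as follows, with librosa and the file name as assumptions, not the authors' code.

import librosa

# Hypothetical vowel segment (central 60 ms of /a/), 16 kHz as in the recordings
y, sr = librosa.load("vowel_a.wav", sr=16000)

n_fft = int(0.016 * sr)   # 16 ms frame -> 256 samples
hop   = int(0.007 * sr)   # 16 ms - 9 ms overlap -> 7 ms hop

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=3, n_fft=n_fft, hop_length=hop)
spectral_vec = mfcc[:, :5].T.flatten()   # 5 frames x 3 MFCCs = 15 coefficients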
Recognition of emotion
Independent variables: measured acoustic parameters. Dependent variables: emotional categories. Recognition was done by people as well as by a neural network classifier.

By people: The selected 400 sentences were randomized sentence-wise and speaker-wise. These randomized sentences were presented to 20 native listeners of Hindi to identify the emotions within five categories: neutral, happiness, anger, sadness, and fear. All the listeners were educated in a Hindi-medium background and in the age group of 18 to 28 years.
By neural network classifier (using PRAAT software): 70% of the data was used for training and 30% for the classification test. As the parameters were normalized with respect to the neutral category, only four emotions (anger, fear, happiness, and sadness) were recognized by the classifier.
Contd….
• In the present study a 3-layered (two hidden layers and one output layer) feed-forward neural network was used, in which each hidden layer had 10 nodes.
• There were 19 input units, representing the acoustic parameters used.
• The output layer had 4 units, representing the output categories (4 emotions in the present case).
• Results were obtained with the neural network classifier using 2000 training epochs and 1 run for each data set.
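For illustration, a minimal sketch of the described topology (19 inputs, two hidden layers of 10 nodes each, 4 output classes, 2000 training epochs) using scikit-learn; the study used PRAAT's own classifier, and the random arrays below are placeholders for the real feature data.

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X = np.random.rand(400, 19)        # placeholder: 19 normalized acoustic features
y = np.random.randint(0, 4, 400)   # placeholder: anger / fear / happiness / sadness

# 70% of the data for training, 30% for the classification test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=2000)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))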
RESULTS AND DISCUSSION

Recognition of emotion

By people

Most recognizable emotion: anger (82.3%). Least recognizable emotion: fear (75.8%). Average recognition of emotion: 78.3%. Recognition of emotion was in the order: anger > sadness > neutral > happy > fear.
Table 1. Confusion matrix of recognition of emotion by people (%)

Category    Neutral  Happiness  Anger  Sadness  Fear
Neutral     77.0     1.0        3.8    14.2     4.0
Happiness   4.1      76.5       7.8    5.4      6.2
Anger       7.2      5.0        82.3   2.1      3.4
Sadness     7.3      1.7        3.4    80.0     7.6
Fear        5.1      4.5        2.8    11.8     75.8
By neural network classifier (NNC)

The confusion matrix obtained by the NNC is shown in Table 2. Most recognizable emotions: anger (90%) and sadness (90%). Least recognizable emotion: fear (60%). Average recognition of emotion: 80%. The recognition of emotion was in the order: anger = sadness > happy > fear. Figure 2 shows a histogram comparing the percentage of correct emotion recognition by people and by the NNC.
Table 2. Confusion matrix of recognition of emotion by NNC (%)

Category    Happiness  Anger  Sadness  Fear
Happiness   80.0       3.3    3.3      3.4
Anger       6.7        90.0   0.0      3.3
Sadness     0.0        0.0    90.0     10.0
Fear        10.0       6.7    23.3     60.0
Figure 2. Comparison of percentage correct recognition of emotion by people and by NNC [histogram: percentage correct recognition (0-90%) for neutral, happiness, anger, sadness, and fear; bars for people and NNC].
Emotion Conversion
Intonation based Emotional Database
Six native speakers, 20 Hindi sentences, five expressive styles: neutral, sadness, anger, surprise, and happy.
Happiness

• F0 curve of utterances: rise-and-fall pattern at the beginning of the sentences
• Hold pattern at the end of the sentences
[Pitch contours (0-500 Hz) of happy utterances: "aaj office nahin jana hain", "tumara phone baj raha hain", "wah kal subah aayega"]
Anger

• F0 curve of utterances: rise and fall in the beginning of the sentences
• Fall towards the end of the sentences
[Pitch contours (0-500 Hz) of angry utterances: "wah kal subah aayega", "tumne aaj kya kiya", "aakhir aisa kyon hota hain"]
Sadness

• F0 contour of utterances: fall or hold at the end of the sentences
• Fall and rise in the beginning of the sentences
• Fall-fall pattern throughout the contour
[Pitch contours (0-500 Hz) of sad utterances: "mujhe aaj kaam karna hain", "aaj barsaat ho rahi hain", "mujhe aaj kaam karna hain" (female speaker)]
Normal

• F0 curve of utterances: falls at the end of the utterances
• Rise and fall in the beginning of the sentences
• In most cases we observed a fall in sentence-final position, irrespective of the speaker
[Pitch contours (0-500 Hz) of neutral utterances: "aaj barsaat ho rahi hain", "mujhe aaj kaam karna hain", "aaj office nahin jana hain"]
Surprise

• F0 curve of utterances: rise-and-fall pattern in sentence-initial position
• Rise pattern in sentence-final position
• Most of the surprise utterances take the form of question-based surprise
Emotion Conversion
• Storing all utterances of every expressive style is a difficult and time-consuming task.
• It also consumes a huge amount of memory.
• An approach is needed that minimizes the time and memory required for an emotion-rich database.
• Taking this into consideration, the authors have proposed an algorithm for emotion conversion.
Contd…
• The algorithm requires storing only neutral utterances in the database.
• Utterances in the other expressive styles are produced from the neutral ones.
• The proposed algorithm is based on a linear modification model (LMM), in which fundamental frequency (F0) is one of the factors used to convert emotions.
Intonation based Emotional Database
This is a second database, directly associated with the main emotion-conversion module. It stores the pitch-point values (see the Pitch Point Table below) for the utterances already present in the speech database. The number of pitch points depends on the number of syllables in the sentence and on the resolution frequency (fr). The resolution frequency is the minimum amount by which every remaining pitch point must lie above or below the line connecting its two neighbouring pitch points.
Table: Pitch Point Table (Hz), neutral emotion recorded by one of the speakers

Sent.  Pt1    Pt2    Pt3    Pt4    Pt5    Pt6    Pt7    Pt8    Pt9    Pt10   Pt11
1      200.4  232.2  391.4  236.5  185.7  208.1  496.8  211.6  179.2  -      -
2      244.2  213.8  262.6  -      159.6  210.0  172.6  177.4  -      -      -
3      200.9  262.3  231    259    219.7  175.8  87.7   88.8   207.6  201.9  -
4      200.1  231.1  183.8  230.1  234.6  188.7  173.2  152.1  246.8  233.7  -
5      227.3  255.3  220.1  249    189.9  231.7  166.7  221.5  187.4  170.5  203.4
6      232.9  252.7  197.7  237.5  205.5  258.3  206.8  246.3  201.9  193.5  -
7      205.7  237.6  203.4  228.2  165.9  202.1  -      -      -      -      -
8      260.9  230.1  251.8  211.6  238.3  200.3  98.4   94.2   202.3  182    -
9      258.5  215.5  202.3  233.7  175.8  144.3  83     197.5  181.9  -      -
10     229    203.9  316.8  229.4  207.2  79     256.8  192.8  202.4  148.3  193
11     208.5  201.8  235    203.5  216.7  507.9  489    216.2  168.4  85.3   96.7
12     253.1  223.5  251.4  221.6  249.9  189.7  172.7  85.6   89.2   203.3  186.8
13     229.4  204    273.6  198.3  240.7  200.3  234    161    198.3  -      -
14     244.6  265.6  224.4  280.5  198.4  265.6  165.7  191.7  -      -      -
15     259.6  209.7  308.8  235.6  224.7  252.5  205.4  177.4  -      -      -
16     210.6  223.2  181.3  91.3   93.6   -      -      -      -      -      -
17     277    225.1  107.3  105.2  229.7  110.9  108.8  211.1  198.4  98.1   93.4
18     273    234.4  262.9  204.4  228    506.0  257.6  180.5  185.7  -      -
19     264.8  219.6  254.5  195.9  225.1  184.2  192.7  97.5   189.9  209.8  178
20     242.5  207.9  257.7  179    201.3  162.8  191.2  -      -      -      -
F0 Based Emotion Conversion
Emotion conversion at Sentence level
Emotion conversion at Word level
F0 Based Emotion Conversion
In these methods, the pitch points (Pi) were studied for the source emotion (neutral) and the target emotion (surprise), and the differences between corresponding pitch points were evaluated after normalization.

This serves as an indicator of the amounts by which the pitch points of a source speech utterance must be increased or decreased to convert it to the target utterance.

For pitch analysis, the step length is taken as 0.01 second, and the minimum and maximum pitch are taken as 75 Hz and 500 Hz. A stylization process is then performed to remove the excess pitch points, and the valid number of pitch points is noted.
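As an illustration of these settings, a sketch using Parselmouth, a Python interface to Praat (the original analysis was done in Praat itself); the file name is hypothetical.

import parselmouth

snd = parselmouth.Sound("utterance_neutral.wav")   # hypothetical file
pitch = snd.to_pitch(time_step=0.01, pitch_floor=75.0, pitch_ceiling=500.0)

freqs = pitch.selected_array["frequency"]   # 0.0 where pitch is undefined
times = pitch.xs()
pitch_points = [(t, f) for t, f in zip(times, freqs) if f > 0]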
F0 Based Emotion Conversion (contd.)

On comparing the source- and target-emotion training sets, the pitch points are divided into four groups, with initial frequencies x1, x2, x3, and x4 respectively. On the basis of observations of the training set, y1, y2, y3, and y4 are added to the corresponding x values.

In some cases, the pitch point number also matters and is taken into account when deciding the transformed F0 value.

The xi and yi values were obtained after a rigorous analysis of the pitch patterns of neutral and emotional utterances.
Pitch point   Range difference y (Hz)   Utterance frequency
Pt1           +40 / +100 / +150         82% / 10% / 8%
Pt2           -40 / +40 / >+100         25% / 70% / 5%
Pt3           -100 / +25 / >+80         10% / 73% / 17%
Pt4           -10 / +40 / >+80          17% / 55% / 28%
Sentence Based Emotion Conversion - Algorithm

Select the desired sound waveform
Convert the speech waveform into a pitch tier
// Stylization
For all Pi:
    Select the Pi that lies closest to the straight line joining its neighbours
    and compare its distance with the resolution frequency (fr)
    if distance between Pi and the straight line > fr:
        stop the stylization
    else:
        remove Pi and repeat for the other Pi
Divide the pitch points into four groups
For each group:
    group[1] = x1 + y1  or  x1 - y1
    group[2] = x2 + y2 + 2 * (pitch point number)
    group[3] = x3 + y3
    group[4] = x4 + y4 + 3 * (pitch point number)
Remove the existing pitch points
Add the newly calculated pitch points in place of the old ones
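A speculative Python sketch of the sentence-level conversion, under one reading of the slides: stylize the pitch points against the resolution frequency fr, then recompute each quarter of the surviving points from its group's (x, y) offsets. The OFFSETS values below are placeholders, not the paper's fitted numbers.

def line_dist(prev, pt, nxt):
    """Vertical distance of pt from the straight line joining its neighbours."""
    (t0, f0), (t, f), (t1, f1) = prev, pt, nxt
    interp = f0 + (f1 - f0) * (t - t0) / (t1 - t0)
    return abs(f - interp)

def stylize(points, fr):
    """Drop pitch points lying within fr Hz of their neighbours' line."""
    pts = list(points)
    while len(pts) > 2:
        dists = [line_dist(pts[i - 1], pts[i], pts[i + 1])
                 for i in range(1, len(pts) - 1)]
        i_min = min(range(len(dists)), key=dists.__getitem__)
        if dists[i_min] > fr:      # closest remaining point is significant: stop
            break
        del pts[i_min + 1]         # otherwise remove it and repeat
    return pts

# Placeholder (x, y) offsets for the four groups; groups 2 and 4 also add a
# multiple of the pitch point number, as in the slides.
OFFSETS = [(200.0, 40.0), (230.0, 40.0), (250.0, 25.0), (220.0, 40.0)]

def convert(points):
    n = len(points)
    out = []
    for k, (t, _f_old) in enumerate(points, start=1):
        g = min(3, 4 * (k - 1) // n)   # group index 0..3 by position
        x, y = OFFSETS[g]
        f_new = x + y                  # group 1 may use x - y instead
        if g == 1:
            f_new += 2 * k
        elif g == 3:
            f_new += 3 * k
        out.append((t, f_new))
    return out

# Example with pitch points shaped like sentence 2 of the Pitch Point Table
new_points = convert(stylize([(0.1, 244.2), (0.3, 213.8), (0.5, 262.6),
                              (0.7, 159.6), (0.9, 210.0)], fr=2.0))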
Figure 1. Pitch points for natural neutral emotion
Figure 2. Pitch points for natural surprise emotion
Experimental Results

For this process, the sentence "कल तुम्हें फाँसी हो जाएगी।" (kal tumhen phaansi ho jaayegi) was considered; the results are given in Figure 5 and Table 5.

In the figure, the upper panel shows the natural surprise utterance and the lower panel shows the utterance transformed from neutral to surprise.

Table 5 compares the conversion algorithm's output pitch point by pitch point.

Figure 5. Natural and transformed surprise emotion utterance.

Table 5. Comparison table for "कल तुम्हें फाँसी हो जाएगी।"
Pitch point   Natural surprise utterance (Hz)   Transformed surprise utterance (Hz)
1             349.5                             326.4
2             389                               389.7
3             244.5                             291.6
4             217.9                             251.3
5             255.6                             479.7
6             414.1                             316.9
7             261.7                             321.8
8             492.1                             461.3
9             324.2                             355.4
10            375.9                             386.7
11            399.2                             452
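As a quick illustration of the Table 5 comparison, a snippet computing the mean absolute deviation between the natural and transformed surprise pitch points:

natural     = [349.5, 389.0, 244.5, 217.9, 255.6, 414.1, 261.7, 492.1, 324.2, 375.9, 399.2]
transformed = [326.4, 389.7, 291.6, 251.3, 479.7, 316.9, 321.8, 461.3, 355.4, 386.7, 452.0]

# Per-point |difference| averaged over the 11 pitch points of the sentence
mad = sum(abs(n - t) for n, t in zip(natural, transformed)) / len(natural)
print(f"mean absolute deviation: {mad:.1f} Hz")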
Analysis For Word Boundary Detection
Where one word ends and a new word begins in a sentence, the intensity value decreases significantly.

There are many points in a recorded speech signal where the intensity decreases, but not every such point is a word boundary.

In most cases, a point where the intensity decreases and the pitch value is undefined is a word boundary (refer to the figure).

There are several regions where the pitch value is undefined, and in each such region only one of the many candidate points can be a word boundary.

Sometimes there are low-intensity points where the pitch value is defined, and these may also be word boundaries.

In regions where pitch values are defined, no two word boundaries exist within a time span of 0.10 seconds.
Word Boundary Detection Algorithm
Rule 1: Intensity valleys above the threshold value I0 are not considered word-segment boundaries.
Rule 2: Intensity valleys below I0 are considered word boundaries.
Rule 3: Valleys in undefined-pitch (non-pitch) ranges can be considered word-segment boundaries.
Rule 4: If there is more than one intensity valley within a pitch-contour pattern, the valley with the lowest value is taken as the word-segment boundary.
Rule 5: If there is no intensity valley in an undefined-pitch region, there is no word boundary there.
Rule 6: If there are several intensity valleys in a pitch-defined range and their separation is less than 0.9 sec, only the lowest intensity point is taken as the word boundary.
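A simplified Python sketch of how Rules 1-6 could be combined; the threshold I0 and the valley list are assumed to be computed elsewhere (e.g. from a Praat intensity contour), and valley detection itself is not shown.

def word_boundaries(valleys, i0, min_gap=0.9):
    """valleys: list of (time_s, intensity_dB, pitch_defined) intensity minima."""
    # Rules 1-2: only valleys below the intensity threshold I0 qualify.
    cands = [(t, i, p) for t, i, p in valleys if i < i0]
    # Rules 3 and 5: in an undefined-pitch region, a qualifying valley marks
    # a boundary; no valley means no boundary there.
    bounds = [(t, i) for t, i, p in cands if not p]
    # Rules 4 and 6: within a pitch-defined range, keep only the lowest valley
    # among those separated by less than min_gap seconds.
    defined = sorted((t, i) for t, i, p in cands if p)
    cluster = []
    for t, i in defined + [(float("inf"), 0.0)]:   # sentinel flushes last cluster
        if cluster and t - cluster[-1][0] >= min_gap:
            bounds.append(min(cluster, key=lambda v: v[1]))
            cluster = []
        if t != float("inf"):
            cluster.append((t, i))
    return sorted(t for t, _ in bounds)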
Emotion Conversion algorithm (Word level)
For all Pi:
    Select the Pi that lies closest to the straight line joining its neighbours
    and compare its distance with the resolution frequency (fr)
    if distance between Pi and the straight line > fr:
        stop the stylization
    else:
        remove Pi and repeat for the other Pi
Divide the pitch points into word segments as produced by WBD
For each word segment's F0min, F0max, F0beg, and F0end:
    Word(i,f) = C(n,i,f) * X(n,i,f)
    // where n is the emotional state, i denotes the word segment, and f denotes
    // the F0 value at the given prosodic point
Remove the existing pitch points
Add the newly calculated pitch points and durations in place of the old ones
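A minimal sketch of the word-level rule Word(i,f) = C(n,i,f) * X(n,i,f): each word segment's F0 anchor points are scaled by emotion-specific coefficients. The coefficient values shown are invented placeholders, not the paper's trained values.

# emotion -> multipliers for (F0beg, F0min, F0max, F0end); placeholder values
COEFF = {
    "surprise": (1.10, 1.05, 1.35, 1.45),
    "anger":    (1.20, 1.10, 1.30, 0.85),
}

def convert_word(segment_f0, emotion):
    """segment_f0: dict with keys F0beg, F0min, F0max, F0end (values in Hz)."""
    c = dict(zip(("F0beg", "F0min", "F0max", "F0end"), COEFF[emotion]))
    return {key: value * c[key] for key, value in segment_f0.items()}

# e.g. convert_word({"F0beg": 200.4, "F0min": 179.2,
#                    "F0max": 232.2, "F0end": 208.1}, "surprise")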
Algorithm Implementation

Original neutral utterance: "aaj college jana hain"

Sentence-based conversion: anger, happy, sad, surprise
Word-based conversion: anger, happy, sad, surprise
Results (Word Boundary Detection)

Speaker ID   Recognition rate (%)   False recognition (%)
S1           88                     10.2
S2           91                     11.6
S3           87.6                   9.3
S4           93.4                   10.4
S5           91.6                   8.5
S6           85.2                   11.3
S7           82.5                   9.3
S8           89.7                   10.6
S9           93.2                   6.8
S10          93.1                   7.2
S11          87.3                   9.1
S12          70.3                   15.6
S13          88.8                   10.3
S14          93.2                   10.0
S15          86.6                   9.5
Results (Word Boundary Detection)

                    Word boundary   Non-word boundary
Word boundary       90.8%           9.2%
Non-word boundary   15.6%           80.1%
Perception Test

Comparison of transformed emotions: transformed-perception matrix. Listeners were divided into 3 groups of 5 candidates each.

Emotion    Perception
Surprise   91.2%
Sadness    89.6%
Neutral    10.4%
Anger      8.8%
Conclusion and Future Work

• In this work we have not considered the alignment of pitch points by linguistic rules; this is the next objective for emotion conversion.
• We have taken only the F0 and energy factors into account; other factors such as spectrum, duration, and syllable information can be investigated further.
• The experiment has been performed on 800 utterances, which is not a rich enough number; the database should be enlarged to improve accuracy.
• Since several distinctions set Hindi apart from other languages, it is justified to design a Hindi-based intonational model in which the transformation of emotions can be incorporated.