Emotions in Hindi - Recognition and Conversion
S.S. Agrawal
CDAC, Noida & KIIT, Gurgaon
email: [email protected], [email protected]
Contents
• Intonation patterns with sentence type categories
• A relationship between F0 values in vowels and emotions: an analytical study
• Recognition and perception of emotions based on spectral and prosodic values obtained from vowels
• F0 pattern analysis of emotion sentences in Hindi
• Emotion conversion using the intonation database from sentences and words
• Comparison of machine and perception experiments
• Hindi speech possesses pitch patterns depending on the meaning, structure and type.
• Intonation also decides the meaning of certain words depending on the type of sentence or phrase where these occur.
• In Hindi we observe three levels of intonation, which can be classified as 'normal', 'high' and 'low'.
• In exceptional cases the presence of VH (very high) and EH (extremely high) is felt, though these rarely occur.
• For observing intonation patterns due to sentence type, we may classify sentences into the following eight categories: Affirmative, Negative, Interrogative, Imperative, Doubtful, Desiderative, Conditional, and Exclamatory.
Intonation Patterns of Hindi
Affirmative ( MHL pitch pattern )
Negative (MHL pitch pattern )
Imperative (ML pitch pattern)
Doubtful
Desiderative
Exclamatory (MHM Pitch pattern)
Intonation Patterns of Hindi
Application to Emotional Behavior
Recognition of Emotion
Conversion of Emotion
Emotion Recognition
Natural human-machine interaction requires machine-based emotional intelligence: to respond satisfactorily to human emotions, computer systems need accurate emotion recognition. Such recognition can also be used to monitor the physiological state of individuals in demanding work environments and to augment automated medical or forensic data-analysis systems.
METHOD

Material

Speakers: six male graduate students (from the drama club, Aligarh Muslim University, Aligarh), native speakers of Hindi, age group 20-23 years.
Sentences: 5 short neutral Hindi sentences.
Emotions: neutral, happiness, anger, sadness, and fear.
Repetitions: 4.
In this way there were 600 (6 × 5 × 5 × 4) sentences.
Recording

Electret microphone, partially sound-treated room, "PRAAT" software.
Sampling rate 16 kHz / 16 bit.
Distance between mouth and microphone was adjusted to nearly 30 cm.
Listening test

The 600 sentences above were first randomized across sentences and speakers, and then presented to 20 naive listeners to evaluate the emotions within five categories: neutral, happiness, anger, sadness, and fear. Only those sentences whose emotions were identified by at least 80% of the listeners were selected for this study. After selection, we were left with 400 sentences for our study.
Acoustic Analysis
Prosody-related features (mean value of pitch (F0), duration, rms value of sound pressure, and speech power).
Spectral features (15 mel-frequency cepstral coefficients).
Prosody features: For the present study, the central 60 ms portion of the vowel /a/ occurring in all the sentences at different positions (underlined in the sentences given in the Appendix) was used to measure all the features. In this way there were in total 13 /a/ vowels (3 in the first sentence, 3 in the second, 2 in the third, 4 in the fourth, and 1 in the fifth sentence). After taking the 60 ms portion of each /a/ vowel, the average over all the vowels of each sentence was taken. Besides F0, speech power and sound pressure were also calculated.
Feature extraction method: Praat software was used to measure all the prosody features.
The figure shows the waveform (upper) and spectrogram (lower, pitch in blue) for the word /sItaar/ in (a) anger, (b) fear, (c) happiness, (d) neutral, and (e) sadness, as obtained in "PRAAT".
Table 1: Mean F0 values (Hz) of vowels for each emotion

Emotion      A      E      I      i      Av
Anger        237    218    222.2  244    234.6
Sadness      107    110.4  106.0  111    110.1
Neutral      134.5  131.0  132.5  146.9  136.3
Happiness    194.5  190.5  189.1  189.0  191.8
Fear         160.9  163.7  162.3  191.2  173.2
Spectral features: MFCC coefficients were calculated using MATLAB. Frame duration was 16 ms, with an overlap of 9 ms between frames. From each frame, 3 MFCCs were calculated, and as we had five frames, we obtained 15 MFCCs for each sample. Thus in total there are 19 parameters. All 19 measured parameters of the sentences of each emotion were then normalized with respect to the parameters of the neutral sentences of the given speaker.
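The slides state that the MFCCs were computed in MATLAB; as an illustration, a minimal Python sketch of the same framing (16 ms frames, 9 ms overlap, 3 MFCCs per frame, first 5 frames kept) might look as follows, with librosa and the file name as assumptions, not the authors' code.

import librosa

# Hypothetical vowel segment (central 60 ms of /a/), 16 kHz as in the recordings
y, sr = librosa.load("vowel_a.wav", sr=16000)

n_fft = int(0.016 * sr)   # 16 ms frame -> 256 samples
hop   = int(0.007 * sr)   # 16 ms - 9 ms overlap -> 7 ms hop

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=3, n_fft=n_fft, hop_length=hop)
spectral_vec = mfcc[:, :5].T.flatten()   # 5 frames x 3 MFCCs = 15 coefficients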
Recognition of emotion
Independent variables: measured acoustic parameters. Dependent variables: emotional categories. Recognition was done by people as well as by a neural network classifier.

By people: The selected 400 sentences were randomized sentence-wise and speaker-wise. These randomized sentences were presented to 20 native listeners of Hindi to identify the emotions within five categories: neutral, happiness, anger, sadness, and fear. All the listeners were educated in a Hindi-medium background and in the age group of 18 to 28 years.
By neural network classifier (using PRAAT software): 70% of the data was used for training and 30% for the classification test. As the parameters were normalized with respect to the neutral category, only four emotions (anger, fear, happiness, and sadness) were recognized by the classifier.
Contd….
• In the present study a 3-layered (two hidden layers and one output layer) feed-forward neural network was used, in which each hidden layer had 10 nodes.
• There were 19 input units, representing the acoustic parameters used.
• The output layer had 4 units, representing the output categories (4 emotions in the present case).
• Results were obtained with the neural network classifier using 2000 training epochs and 1 run for each data set.
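For illustration, a minimal sketch of the described topology (19 inputs, two hidden layers of 10 nodes each, 4 output classes, 2000 training epochs) using scikit-learn; the study used PRAAT's own classifier, and the random arrays below are placeholders for the real feature data.

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X = np.random.rand(400, 19)        # placeholder: 19 normalized acoustic features
y = np.random.randint(0, 4, 400)   # placeholder: anger / fear / happiness / sadness

# 70% of the data for training, 30% for the classification test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=2000)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))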
RESULTS AND DISCUSSION

Recognition of emotion

By people

Most recognizable emotion: anger (82.3%). Least recognizable emotion: fear (75.8%). Average recognition of emotion: 78.3%. Recognition of emotion was in the order: anger > sadness > neutral > happy > fear.
Table 1. Confusion matrix of recognition of emotion by people (%)

Category    Neutral  Happiness  Anger  Sadness  Fear
Neutral     77.0     1.0        3.8    14.2     4.0
Happiness   4.1      76.5       7.8    5.4      6.2
Anger       7.2      5.0        82.3   2.1      3.4
Sadness     7.3      1.7        3.4    80.0     7.6
Fear        5.1      4.5        2.8    11.8     75.8
By neural network classifier (NNC)

The confusion matrix obtained by the NNC is shown in Table 2. Most recognizable emotions: anger (90%) and sadness (90%). Least recognizable emotion: fear (60%). Average recognition of emotion: 80%. The recognition of emotion was in the order: anger = sadness > happy > fear. Figure 2 shows a histogram comparing the percentage of correct emotion recognition by people and by the NNC.
Table 2. Confusion matrix of recognition of emotion by NNC (%)

Category    Happiness  Anger  Sadness  Fear
Happiness   80.0       3.3    3.3      3.4
Anger       6.7        90.0   0.0      3.3
Sadness     0.0        0.0    90.0     10.0
Fear        10.0       6.7    23.3     60.0
Figure 2. Comparison of percentage correct recognition of emotion by people and by NNC [histogram: percentage correct recognition (0-90%) for neutral, happiness, anger, sadness, and fear; bars for people and NNC].
Emotion Conversion
Intonation based Emotional Database
Six native speakers, 20 Hindi sentences, five expressive styles: neutral, sadness, anger, surprise, and happy.
Happiness

• F0 curve of utterances: rise-and-fall pattern at the beginning of the sentences
• Hold pattern at the end of the sentences
[Pitch contours (0-500 Hz) of happy utterances: "aaj office nahin jana hain", "tumara phone baj raha hain", "wah kal subah aayega"]
Anger

• F0 curve of utterances: rise and fall in the beginning of the sentences
• Fall towards the end of the sentences
[Pitch contours (0-500 Hz) of angry utterances: "wah kal subah aayega", "tumne aaj kya kiya", "aakhir aisa kyon hota hain"]
Sadness

• F0 contour of utterances: fall or hold at the end of the sentences
• Fall and rise in the beginning of the sentences
• Fall-fall pattern throughout the contour
[Pitch contours (0-500 Hz) of sad utterances: "mujhe aaj kaam karna hain", "aaj barsaat ho rahi hain", "mujhe aaj kaam karna hain" (female speaker)]
Normal

• F0 curve of utterances: falls at the end of the utterances
• Rise and fall in the beginning of the sentences
• In most cases we observed a fall in sentence-final position, irrespective of the speaker
[Pitch contours (0-500 Hz) of neutral utterances: "aaj barsaat ho rahi hain", "mujhe aaj kaam karna hain", "aaj office nahin jana hain"]
Surprise

• F0 curve of utterances: rise-and-fall pattern in sentence-initial position
• Rise pattern in sentence-final position
• Most of the surprise utterances take the form of question-based surprise
Emotion Conversion
• Storing all utterances of every expressive style is a difficult and time-consuming task.
• It also consumes a huge amount of memory.
• An approach is needed that minimizes the time and memory required for an emotion-rich database.
• Taking this into consideration, the authors have proposed an algorithm for emotion conversion.
Contd…
• The algorithm requires storing only neutral utterances in the database.
• Utterances in the other expressive styles are produced from the neutral ones.
• The proposed algorithm is based on a linear modification model (LMM), in which fundamental frequency (F0) is one of the factors used to convert emotions.
Intonation based Emotional Database
This is a second database, directly associated with the main emotion-conversion module. It stores the pitch-point values (see the Pitch Point Table below) for the utterances already present in the speech database. The number of pitch points depends on the number of syllables in the sentence and on the resolution frequency (fr). The resolution frequency is the minimum amount by which every remaining pitch point must lie above or below the line connecting its two neighbouring pitch points.
Table: Pitch Point Table (Hz), neutral emotion recorded by one of the speakers

Sent.  Pt1    Pt2    Pt3    Pt4    Pt5    Pt6    Pt7    Pt8    Pt9    Pt10   Pt11
1      200.4  232.2  391.4  236.5  185.7  208.1  496.8  211.6  179.2  -      -
2      244.2  213.8  262.6  -      159.6  210.0  172.6  177.4  -      -      -
3      200.9  262.3  231    259    219.7  175.8  87.7   88.8   207.6  201.9  -
4      200.1  231.1  183.8  230.1  234.6  188.7  173.2  152.1  246.8  233.7  -
5      227.3  255.3  220.1  249    189.9  231.7  166.7  221.5  187.4  170.5  203.4
6      232.9  252.7  197.7  237.5  205.5  258.3  206.8  246.3  201.9  193.5  -
7      205.7  237.6  203.4  228.2  165.9  202.1  -      -      -      -      -
8      260.9  230.1  251.8  211.6  238.3  200.3  98.4   94.2   202.3  182    -
9      258.5  215.5  202.3  233.7  175.8  144.3  83     197.5  181.9  -      -
10     229    203.9  316.8  229.4  207.2  79     256.8  192.8  202.4  148.3  193
11     208.5  201.8  235    203.5  216.7  507.9  489    216.2  168.4  85.3   96.7
12     253.1  223.5  251.4  221.6  249.9  189.7  172.7  85.6   89.2   203.3  186.8
13     229.4  204    273.6  198.3  240.7  200.3  234    161    198.3  -      -
14     244.6  265.6  224.4  280.5  198.4  265.6  165.7  191.7  -      -      -
15     259.6  209.7  308.8  235.6  224.7  252.5  205.4  177.4  -      -      -
16     210.6  223.2  181.3  91.3   93.6   -      -      -      -      -      -
17     277    225.1  107.3  105.2  229.7  110.9  108.8  211.1  198.4  98.1   93.4
18     273    234.4  262.9  204.4  228    506.0  257.6  180.5  185.7  -      -
19     264.8  219.6  254.5  195.9  225.1  184.2  192.7  97.5   189.9  209.8  178
20     242.5  207.9  257.7  179    201.3  162.8  191.2  -      -      -      -
F0 Based Emotion Conversion
Emotion conversion at Sentence level
Emotion conversion at Word level
F0 Based Emotion Conversion
In these methods, the pitch points (Pi) were studied for the source emotion (neutral) and the target emotion (surprise), and the differences between corresponding pitch points were evaluated after normalization.

This serves as an indicator of the amounts by which the pitch points of a source speech utterance must be increased or decreased to convert it to the target utterance.

For pitch analysis, the step length is taken as 0.01 second, and the minimum and maximum pitch are taken as 75 Hz and 500 Hz. A stylization process is then performed to remove the excess pitch points, and the valid number of pitch points is noted.
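As an illustration of these settings, a sketch using Parselmouth, a Python interface to Praat (the original analysis was done in Praat itself); the file name is hypothetical.

import parselmouth

snd = parselmouth.Sound("utterance_neutral.wav")   # hypothetical file
pitch = snd.to_pitch(time_step=0.01, pitch_floor=75.0, pitch_ceiling=500.0)

freqs = pitch.selected_array["frequency"]   # 0.0 where pitch is undefined
times = pitch.xs()
pitch_points = [(t, f) for t, f in zip(times, freqs) if f > 0]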
F0 Based Emotion Conversion (contd.)

On comparing the source- and target-emotion training sets, the pitch points are divided into four groups, with initial frequencies x1, x2, x3, and x4 respectively. On the basis of observations of the training set, y1, y2, y3, and y4 are added to the corresponding x values.

In some cases, the pitch point number also matters and is taken into account when deciding the transformed F0 value.

The xi and yi values were obtained after a rigorous analysis of the pitch patterns of neutral and emotional utterances.
Pitch point   Range difference y (Hz)   Utterance frequency
Pt1           +40 / +100 / +150         82% / 10% / 8%
Pt2           -40 / +40 / >+100         25% / 70% / 5%
Pt3           -100 / +25 / >+80         10% / 73% / 17%
Pt4           -10 / +40 / >+80          17% / 55% / 28%
Sentence Based Emotion Conversion - Algorithm

Select the desired sound waveform
Convert the speech waveform into a pitch tier
// Stylization
For all Pi:
    Select the Pi that lies closest to the straight line joining its neighbours
    and compare its distance with the resolution frequency (fr)
    if distance between Pi and the straight line > fr:
        stop the stylization
    else:
        remove Pi and repeat for the other Pi
Divide the pitch points into four groups
For each group:
    group[1] = x1 + y1  or  x1 - y1
    group[2] = x2 + y2 + 2 * (pitch point number)
    group[3] = x3 + y3
    group[4] = x4 + y4 + 3 * (pitch point number)
Remove the existing pitch points
Add the newly calculated pitch points in place of the old ones
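A speculative Python sketch of the sentence-level conversion, under one reading of the slides: stylize the pitch points against the resolution frequency fr, then recompute each quarter of the surviving points from its group's (x, y) offsets. The OFFSETS values below are placeholders, not the paper's fitted numbers.

def line_dist(prev, pt, nxt):
    """Vertical distance of pt from the straight line joining its neighbours."""
    (t0, f0), (t, f), (t1, f1) = prev, pt, nxt
    interp = f0 + (f1 - f0) * (t - t0) / (t1 - t0)
    return abs(f - interp)

def stylize(points, fr):
    """Drop pitch points lying within fr Hz of their neighbours' line."""
    pts = list(points)
    while len(pts) > 2:
        dists = [line_dist(pts[i - 1], pts[i], pts[i + 1])
                 for i in range(1, len(pts) - 1)]
        i_min = min(range(len(dists)), key=dists.__getitem__)
        if dists[i_min] > fr:      # closest remaining point is significant: stop
            break
        del pts[i_min + 1]         # otherwise remove it and repeat
    return pts

# Placeholder (x, y) offsets for the four groups; groups 2 and 4 also add a
# multiple of the pitch point number, as in the slides.
OFFSETS = [(200.0, 40.0), (230.0, 40.0), (250.0, 25.0), (220.0, 40.0)]

def convert(points):
    n = len(points)
    out = []
    for k, (t, _f_old) in enumerate(points, start=1):
        g = min(3, 4 * (k - 1) // n)   # group index 0..3 by position
        x, y = OFFSETS[g]
        f_new = x + y                  # group 1 may use x - y instead
        if g == 1:
            f_new += 2 * k
        elif g == 3:
            f_new += 3 * k
        out.append((t, f_new))
    return out

# Example with pitch points shaped like sentence 2 of the Pitch Point Table
new_points = convert(stylize([(0.1, 244.2), (0.3, 213.8), (0.5, 262.6),
                              (0.7, 159.6), (0.9, 210.0)], fr=2.0))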
Figure 1. Pitch points for natural neutral emotion
Figure 2. Pitch points for natural surprise emotion
Experimental Results

For this process, the sentence "कल तुम्हें फाँसी हो जाएगी।" (kal tumhen phaansi ho jaayegi) was considered; the results are given in Figure 5 and Table 5.

In the figure, the upper panel shows the natural surprise utterance and the lower panel shows the utterance transformed from neutral to surprise.

Table 5 compares the conversion algorithm's output pitch point by pitch point.

Figure 5. Natural and transformed surprise emotion utterance.

Table 5. Comparison table for "कल तुम्हें फाँसी हो जाएगी।"
Pitch point   Natural surprise utterance (Hz)   Transformed surprise utterance (Hz)
1             349.5                             326.4
2             389                               389.7
3             244.5                             291.6
4             217.9                             251.3
5             255.6                             479.7
6             414.1                             316.9
7             261.7                             321.8
8             492.1                             461.3
9             324.2                             355.4
10            375.9                             386.7
11            399.2                             452
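As a quick illustration of the Table 5 comparison, a snippet computing the mean absolute deviation between the natural and transformed surprise pitch points:

natural     = [349.5, 389.0, 244.5, 217.9, 255.6, 414.1, 261.7, 492.1, 324.2, 375.9, 399.2]
transformed = [326.4, 389.7, 291.6, 251.3, 479.7, 316.9, 321.8, 461.3, 355.4, 386.7, 452.0]

# Per-point |difference| averaged over the 11 pitch points of the sentence
mad = sum(abs(n - t) for n, t in zip(natural, transformed)) / len(natural)
print(f"mean absolute deviation: {mad:.1f} Hz")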
Analysis For Word Boundary Detection
Where one word ends and a new word begins in a sentence, the intensity value decreases significantly.

There are many points in a recorded speech signal where the intensity decreases, but not every such point is a word boundary.

In most cases, a point where the intensity decreases and the pitch value is undefined is a word boundary (refer to the figure).

There are several regions where the pitch value is undefined, and in each such region only one of the many candidate points can be a word boundary.

Sometimes there are low-intensity points where the pitch value is defined, and these may also be word boundaries.

In regions where pitch values are defined, no two word boundaries exist within a time span of 0.10 seconds.
Word Boundary Detection Algorithm
Rule 1: Intensity valleys above the threshold value I0 are not considered word-segment boundaries.
Rule 2: Intensity valleys below I0 are considered word boundaries.
Rule 3: Valleys in undefined-pitch (non-pitch) ranges can be considered word-segment boundaries.
Rule 4: If there is more than one intensity valley within a pitch-contour pattern, the valley with the lowest value is taken as the word-segment boundary.
Rule 5: If there is no intensity valley in an undefined-pitch region, there is no word boundary there.
Rule 6: If there are several intensity valleys in a pitch-defined range and their separation is less than 0.9 sec, only the lowest intensity point is taken as the word boundary.
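A simplified Python sketch of how Rules 1-6 could be combined; the threshold I0 and the valley list are assumed to be computed elsewhere (e.g. from a Praat intensity contour), and valley detection itself is not shown.

def word_boundaries(valleys, i0, min_gap=0.9):
    """valleys: list of (time_s, intensity_dB, pitch_defined) intensity minima."""
    # Rules 1-2: only valleys below the intensity threshold I0 qualify.
    cands = [(t, i, p) for t, i, p in valleys if i < i0]
    # Rules 3 and 5: in an undefined-pitch region, a qualifying valley marks
    # a boundary; no valley means no boundary there.
    bounds = [(t, i) for t, i, p in cands if not p]
    # Rules 4 and 6: within a pitch-defined range, keep only the lowest valley
    # among those separated by less than min_gap seconds.
    defined = sorted((t, i) for t, i, p in cands if p)
    cluster = []
    for t, i in defined + [(float("inf"), 0.0)]:   # sentinel flushes last cluster
        if cluster and t - cluster[-1][0] >= min_gap:
            bounds.append(min(cluster, key=lambda v: v[1]))
            cluster = []
        if t != float("inf"):
            cluster.append((t, i))
    return sorted(t for t, _ in bounds)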
Emotion Conversion algorithm (Word level)
For all Pi:
    Select the Pi that lies closest to the straight line joining its neighbours
    and compare its distance with the resolution frequency (fr)
    if distance between Pi and the straight line > fr:
        stop the stylization
    else:
        remove Pi and repeat for the other Pi
Divide the pitch points into word segments as produced by WBD
For each word segment's F0min, F0max, F0beg, and F0end:
    Word(i,f) = C(n,i,f) * X(n,i,f)
    // where n is the emotional state, i denotes the word segment, and f denotes
    // the F0 value at the given prosodic point
Remove the existing pitch points
Add the newly calculated pitch points and durations in place of the old ones
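A minimal sketch of the word-level rule Word(i,f) = C(n,i,f) * X(n,i,f): each word segment's F0 anchor points are scaled by emotion-specific coefficients. The coefficient values shown are invented placeholders, not the paper's trained values.

# emotion -> multipliers for (F0beg, F0min, F0max, F0end); placeholder values
COEFF = {
    "surprise": (1.10, 1.05, 1.35, 1.45),
    "anger":    (1.20, 1.10, 1.30, 0.85),
}

def convert_word(segment_f0, emotion):
    """segment_f0: dict with keys F0beg, F0min, F0max, F0end (values in Hz)."""
    c = dict(zip(("F0beg", "F0min", "F0max", "F0end"), COEFF[emotion]))
    return {key: value * c[key] for key, value in segment_f0.items()}

# e.g. convert_word({"F0beg": 200.4, "F0min": 179.2,
#                    "F0max": 232.2, "F0end": 208.1}, "surprise")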
Algorithm Implementation

Original neutral utterance: "aaj college jana hain"

Sentence-based conversion: anger, happy, sad, surprise
Word-based conversion: anger, happy, sad, surprise
Results (Word Boundary Detection)

Speaker ID   Recognition rate (%)   False recognition (%)
S1           88                     10.2
S2           91                     11.6
S3           87.6                   9.3
S4           93.4                   10.4
S5           91.6                   8.5
S6           85.2                   11.3
S7           82.5                   9.3
S8           89.7                   10.6
S9           93.2                   6.8
S10          93.1                   7.2
S11          87.3                   9.1
S12          70.3                   15.6
S13          88.8                   10.3
S14          93.2                   10.0
S15          86.6                   9.5
Results (Word Boundary Detection)

                    Word boundary   Non-word boundary
Word boundary       90.8%           9.2%
Non-word boundary   15.6%           80.1%
Perception Test

Comparison of transformed emotions: transformed-perception matrix. Listeners were divided into 3 groups of 5 candidates each.

Emotion    Perception
Surprise   91.2%
Sadness    89.6%
Neutral    10.4%
Anger      8.8%
Conclusion and Future Work

• In this work we have not considered the alignment of pitch points by linguistic rules; this is the next objective for emotion conversion.
• We have taken only the F0 and energy factors into account; other factors such as spectrum, duration, and syllable information can be investigated further.
• The experiment has been performed on 800 utterances, which is not a rich enough number; the database should be enlarged to improve accuracy.
• Since several distinctions set Hindi apart from other languages, it is justified to design a Hindi-based intonational model in which the transformation of emotions can be incorporated.