Dr. O. Dakkak & Dr. N. Ghneim: HIAST
M. Abu-Zleikha & S. Al-Moubyed: IT Faculty, Damascus University
Prosodic Feature Introduction and Emotion Incorporation in an Arabic TTS
Presented by
Dr. O. Al Dakkak
Outline
• Arabic TTS
• Why Prosody generation?
• Prosody Analysis and Rule Extraction
• Emotion Inclusion
• Results
• Conclusion
Arabic Text-to-Speech System
– Arabic Text-to-Phonemes (ATOPH) conversion, including the open /E/ and /O/ phonemes and emphatic vowels
– Use of MBROLA diphone units to synthesize speech until our semi-syllable units are ready (their corpus is currently being recorded)
– Prosody Generation and Emotion Inclusion
– MBROLA synthesizes phonemes with control over duration and the F0 contour (given as a set of segments). We implemented an additional tool to control amplitude.
– Phonemes absent from the MBROLA voice are replaced by the nearest available phonemes.
– Prosody can be generated and tested.
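MBROLA takes its input as a .pho file. Each line holds one phoneme symbol, its duration in milliseconds, and optional pairs of (position in % of the phoneme's duration, F0 in Hz) that define the pitch contour. The minimal sketch below formats such lines and applies a nearest-phoneme substitution before output. The `pho_line` function and the `NEAREST_AVAILABLE` table, including its entries, are hypothetical illustrations. They are not the authors' tool or their actual substitution list.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical substitution table: phonemes missing from the MBROLA voice
# are mapped to the nearest available symbol. Entries are illustrative only.
NEAREST_AVAILABLE: Dict[str, str] = {"E": "e", "O": "o"}


def pho_line(
    phoneme: str,
    duration_ms: int,
    pitch_points: Optional[List[Tuple[int, int]]] = None,
    substitutions: Optional[Dict[str, str]] = None,
) -> str:
    """Format one MBROLA .pho line: 'phoneme duration [position% F0]...'."""
    table = NEAREST_AVAILABLE if substitutions is None else substitutions
    # Replace an absent phoneme by its nearest available neighbour.
    symbol = table.get(phoneme, phoneme)
    fields = [symbol, str(duration_ms)]
    for position_pct, f0_hz in pitch_points or []:
        # Pitch-point positions are percentages of the phoneme's duration.
        if not 0 <= position_pct <= 100:
            raise ValueError("pitch-point position must lie in 0..100")
        fields += [str(position_pct), str(f0_hz)]
    return " ".join(fields)


if __name__ == "__main__":
    # '_' is MBROLA's silence symbol.
    print(pho_line("_", 100))
    print(pho_line("a", 80, [(0, 180), (100, 200)]))
    print(pho_line("E", 60, [(50, 170)]))
```

Keeping the substitution table separate from the formatting makes it easy to swap voices whose phoneme inventories differ.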
Why Prosody Generation?
• Increases intelligibility and expressiveness
• Provides the context in which speech is interpreted
• Signals the speaker's intentions (special aids)
• Man-machine communication (airports, etc.)
• Dubbing (doublage)
Methodology
• Based on the punctuation marks (‘,’, ‘.’, ‘?’ and ‘!’), we classify sentences as continuous affirmation, long affirmation, interrogative and exclamation, respectively.
• Recording a corpus and analyzing its sentences to extract F0 and intensity curves
• Statistical study of the curves and extraction of rules to generate them automatically
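The punctuation-based classification can be sketched as a lookup on the mark that ends a sentence. The mapping follows the slide. Two parts of this sketch are assumptions: adding the Arabic comma (U+060C) and question mark (U+061F), and falling back to "long affirmation" for unpunctuated text. `classify_sentence` is a hypothetical helper name.

```python
# Punctuation mark -> sentence type, as listed in the methodology.
# The Arabic comma (U+060C) and question mark (U+061F) entries are an
# assumption for text written in Arabic script.
SENTENCE_TYPES = {
    ",": "continuous affirmation",
    "\u060c": "continuous affirmation",
    ".": "long affirmation",
    "?": "interrogative",
    "\u061f": "interrogative",
    "!": "exclamation",
}


def classify_sentence(sentence: str, default: str = "long affirmation") -> str:
    """Classify a sentence by the punctuation mark that ends it.

    Falling back to `default` for unpunctuated text is an assumption.
    """
    stripped = sentence.rstrip()
    if not stripped:
        raise ValueError("empty sentence")
    return SENTENCE_TYPES.get(stripped[-1], default)


if __name__ == "__main__":
    print(classify_sentence("Who do you think you are?"))
    print(classify_sentence("What a scary scene!"))
```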
The corpus
• A pre-recorded corpus of 12 short sentences per type, spoken by 5 speakers (4 male, 1 female). Each sentence has at most 14 phonemes.
• 10 additional sentences of variable length, recorded by 3 speakers:
– short: 4-20 phonemes
– medium: 20-40 phonemes
– long: more than 40 phonemes
• F0 and intensity curves were already available for the pre-recorded corpus and were computed for the additional recordings.
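The length classes for the additional recordings reduce to thresholds on the phoneme count. The slide's ranges overlap at 20 (short: 4-20, medium: 20-40). This sketch assumes that exactly 20 phonemes counts as short and exactly 40 as medium. `length_category` is a hypothetical helper.

```python
def length_category(n_phonemes: int) -> str:
    """Map a sentence's phoneme count to the short/medium/long classes.

    Boundary handling at 20 and 40 is an assumption (the slide's ranges overlap).
    """
    if n_phonemes < 4:
        # Below the smallest length given for the recordings.
        raise ValueError("fewer than 4 phonemes is outside the defined classes")
    if n_phonemes <= 20:
        return "short"
    if n_phonemes <= 40:
        return "medium"
    return "long"
```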
Emotion Inclusion
• Recording a corpus of sentences in 5 emotions (joy, anger, sadness, fear and surprise), together with their emotionless versions (20 sentences per emotion).
• Measurement of the prosodic features (F0, duration and intensity) and their variations, using Praat.
• Extraction of rules to automatically produce emotion in synthetic speech.
• Rule validation.
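Because each emotional sentence has an emotionless counterpart, a feature's variation can be expressed as a percentage change from the emotionless value, as in the percentages reported for anger. This minimal sketch makes two assumptions: the inputs are per-sentence measurements already taken in Praat (for example, mean F0 in Hz), and per-sentence changes are averaged. It is not necessarily how the authors aggregated their figures. The function names are hypothetical.

```python
from statistics import mean
from typing import Iterable, Tuple


def relative_change_pct(neutral: float, emotional: float) -> float:
    """Change of a prosodic measure, in % of its emotionless value."""
    if neutral == 0:
        raise ValueError("the emotionless value must be non-zero")
    return 100.0 * (emotional - neutral) / neutral


def mean_relative_change(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average the per-sentence change over (emotionless, emotional) pairs."""
    return mean(relative_change_pct(n, e) for n, e in pairs)
```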
[Figure: F0 contour of the sentence "أهو ذنبي أن أتحمل أنا ذلك؟" ("Is it my fault to bear it?"), phonemic transcription "c a h u w a D a n b j c a n c a t a H a m m a l a c a n aa D a aa l i k". Time axis 0 to 2.28 s, F0 axis 0 to 500 Hz, with F0 values between 133 and 245 Hz annotated along the contour.]
• Pitch: variation of F0
• F0 range: difference between F0max and F0min
• F0 average: mean value of F0
• Contour slope: shape of the contour slope (range variation)
• Variability: degree of F0 variation (high, low, ...)
• Jitter: irregularities between successive glottal pulses
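Several of these definitions translate directly into computations on a pitch track. This minimal sketch computes the F0 mean and range from frame-wise F0 values. It also computes jitter in the form Praat calls "local" jitter: the mean absolute difference between consecutive glottal periods divided by the mean period. It assumes the F0 frames (with 0 marking unvoiced frames) and the glottal periods (in seconds) have already been extracted. `f0_features` is a hypothetical helper.

```python
from statistics import mean
from typing import Dict, Sequence


def f0_features(f0_hz: Sequence[float], periods_s: Sequence[float]) -> Dict[str, float]:
    """F0 mean, F0 range and local jitter from pre-extracted measurements."""
    # Frames with F0 == 0 are treated as unvoiced (assumption) and skipped.
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("no voiced frames")
    if len(periods_s) < 2:
        raise ValueError("need at least two glottal periods for jitter")
    # Absolute differences between successive glottal periods.
    diffs = [abs(b - a) for a, b in zip(periods_s, periods_s[1:])]
    return {
        "mean": mean(voiced),
        "range": max(voiced) - min(voiced),  # F0max - F0min
        "jitter_local": mean(diffs) / mean(periods_s),
    }
```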
Example: Anger emotion
(+ = increase, - = decrease, relative to the emotionless version)
• F0 mean: +40% to +75%
• F0 range: +50% to +100%
• F0 at vowels and semi-vowels: +30%
• F0 slope: +
• Speech rate: +
• Silence rate: -
• Duration of vowels and semi-vowels: +
• Intensity mean: +
• Intensity: varies monotonically with F0
• Others: F0 variability +, F0 jitter +
Emotion Synthesis: Anger
• F0 mean: +30%
• F0 range: +30%
• F0 at vowels and semi-vowels: +100%
• Speech rate: +75% to +80%
• Duration of vowels and semi-vowels: +30%
• Duration of fricatives: +20%
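One way to apply the F0 mean and F0 range rules to a sentence's pitch points is to shift the contour's mean and stretch its excursions around that mean. Each point f becomes m·a + (f - m)·b, where m is the contour mean, a the mean factor and b the range factor. The new mean is then m·a and the new range is b times the old one. This sketch is our interpretation of those two rules only. `rescale_f0` is a hypothetical helper, and the vowel-specific F0 rule and the duration rules are not modeled. When both factors are equal (1.3 for anger), the transformation reduces to multiplying every F0 value by that factor.

```python
from statistics import mean
from typing import List, Sequence


def rescale_f0(contour_hz: Sequence[float], mean_factor: float,
               range_factor: float) -> List[float]:
    """Scale the contour mean by mean_factor and its range by range_factor."""
    if not contour_hz:
        raise ValueError("empty contour")
    if range_factor <= 0:
        raise ValueError("range_factor must be positive")
    m = mean(contour_hz)
    # New mean = m * mean_factor; deviations from m are stretched by range_factor.
    return [m * mean_factor + (f - m) * range_factor for f in contour_hz]


# Anger synthesis rules from the slide: F0 mean +30 %, F0 range +30 %.
ANGER_MEAN_FACTOR = 1.30
ANGER_RANGE_FACTOR = 1.30

if __name__ == "__main__":
    print(rescale_f0([180, 220, 200], ANGER_MEAN_FACTOR, ANGER_RANGE_FACTOR))
```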
Synthetic examples
Each sentence was synthesized both emotionless and with the target emotion:
• Anger: "Who do you think you are?"
• Joy: "No more clouds in the sky"
• Sadness: "I'm so sad today"
• Fear: "What a scary scene!"
• Surprise: "What a beautiful scene!"
EmoGen
Components shown in the EmoGen system diagram:
• Input text
• Text editor
• Normal Text to MBROLA Text Converter (NTMTC)
• Prosody Generator
• Emotion Generator
• MBROLA player interface
• Voice interface
• Speech and emotion properties
Results
• Five sentences per emotion were synthesized and played to 10 listeners.
• Each listener named the perceived emotion for each sentence (our list of emotions was not provided).
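With 5 sentences per emotion and 10 listeners, each emotion receives 5 × 10 = 50 judgments, and its recognition rate is the fraction of those judgments that match the intended emotion. This minimal sketch assumes free responses are matched by case-insensitive string comparison. In practice, answers such as "happy" for joy would need a manual mapping first. `recognition_rates` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def recognition_rates(judgments: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Recognition rate per intended emotion.

    `judgments` holds one (intended emotion, perceived label) pair per
    listener per sentence.
    """
    totals = Counter()
    hits = Counter()
    for intended, perceived in judgments:
        totals[intended] += 1
        # Case- and whitespace-insensitive match (assumption); synonyms
        # would need to be normalized before calling this function.
        if perceived.strip().lower() == intended.strip().lower():
            hits[intended] += 1
    return {emotion: hits[emotion] / totals[emotion] for emotion in totals}
```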