Dr. O. Dakkak & Dr. N. Ghneim: HIAST
M. Abu-Zleikha & S. Al-Moubyed: IT fac., Damascus U.

Prosodic Feature Introduction and Emotion Incorporation in an Arabic TTS

Presented by Dr. O. Al Dakkak



Outline

• Arabic TTS

• Why Prosody generation?

• Prosody Analysis and Rule Extraction

• Emotion Inclusion

• Results

• Conclusion

Arabic Text-to-Speech System

– Arabic Text-to-Phonemes (ATOPH), including the open /E/ and /O/ phonemes and emphatic vowels

– Use of MBROLA diphone units to synthesize speech until our semi-syllable units are ready (the corpus is currently being recorded)

– Prosody Generation and Emotion Inclusion

Arabic Text-to-Speech System

– MBROLA synthesizes phonemes with control over duration and the F0 contour (given as a set of segments); we implemented a tool to control the amplitude.

– Absent phonemes are replaced by the nearest available phonemes

– Possibility to generate and test prosody
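To make the duration and F0 control concrete: MBROLA reads a plain-text .pho file in which each line names a phoneme, its duration in milliseconds, and optional pairs of (position in % of the phoneme, pitch in Hz). The sketch below only formats such lines; the helper names and the sample phoneme symbols are illustrative assumptions, since the real symbols depend on the diphone database in use.

```python
def pho_line(phoneme, duration_ms, pitch_points=()):
    """Format one .pho line: phoneme, duration (ms), then (position %, F0 Hz) pairs."""
    fields = [phoneme, str(int(duration_ms))]
    for position_pct, f0_hz in pitch_points:
        fields += [str(int(position_pct)), str(int(f0_hz))]
    return " ".join(fields)


def to_pho(segments):
    """Join segments (tuples accepted by pho_line) into the text of a .pho file."""
    return "\n".join(pho_line(*seg) for seg in segments) + "\n"


# Hypothetical segments: "_" is the MBROLA silence symbol; "s" and "a" are placeholders.
example = [
    ("_", 100),
    ("s", 90),
    ("a", 120, [(0, 180), (100, 200)]),  # F0 rises from 180 Hz to 200 Hz
    ("_", 100),
]
print(to_pho(example), end="")
```

Generating and testing a prosody then amounts to changing the durations and pitch points before the file is passed to the synthesizer.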

Why Prosody Generation?

• Increases intelligibility and expressiveness.

• Provides the context in which speech is interpreted

• Signals speaker intentions (special aids)

• Man-machine communication (airports, ...)

• Dubbing (doublage)

Methodology

• Based on the punctuation marks (‘,’, ‘.’, ‘?’ and ‘!’), we classify sentences into continuous affirmation, long affirmation, interrogative and exclamation, respectively.

• Recording a corpus and analyzing its sentences to produce F0 and intensity curves.

• Statistical study of the curves and extraction of rules to generate them automatically.
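The punctuation-based classification in the first step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, and treating the Arabic comma (،) and Arabic question mark (؟) like their Latin counterparts is an added assumption.

```python
import re

# Mapping stated on the slide; the Arabic marks are an assumption added here.
SENTENCE_TYPES = {
    ",": "continuous affirmation",
    "،": "continuous affirmation",
    ".": "long affirmation",
    "?": "interrogative",
    "؟": "interrogative",
    "!": "exclamation",
}

MARKS = "".join(SENTENCE_TYPES)
# One chunk of non-punctuation text followed by the mark that closes it.
PATTERN = re.compile(r"([^%s]+)([%s])" % (re.escape(MARKS), re.escape(MARKS)))


def classify(text):
    """Return (chunk, sentence type) pairs; text after the last mark is ignored."""
    return [(chunk.strip(), SENTENCE_TYPES[mark]) for chunk, mark in PATTERN.findall(text)]


print(classify("Wait, it is late. Are you sure? Go!"))
```

Each chunk would then be given the F0 and intensity curve generated for its type.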

The corpus

• Use of a pre-recorded corpus of 12 short sentences for each type, by 5 speakers (4 male and 1 female). Each sentence has at most 14 phonemes.

• Recording of 10 further sentences of variable length, pronounced by 3 speakers:

– short: 4-20 phonemes

– medium: 20-40 phonemes

– long: more than 40 phonemes

• The F0 and intensity curves were available for the pre-recorded corpus and were computed for the additional recordings.

Rules Extraction

• Re-definition of the length concept, using fuzzy sets:
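The membership functions from the original figure are not reproduced here. As a rough illustration of what a fuzzy definition of length could look like, the sketch below uses piecewise-linear memberships that overlap around the 20- and 40-phoneme boundaries of the corpus; the breakpoints (15, 25, 35, 45) and the function names are assumptions, not the authors' values.

```python
def ramp_up(x, low, high):
    """0 below low, 1 above high, linear in between."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def ramp_down(x, low, high):
    """1 below low, 0 above high, linear in between."""
    return 1.0 - ramp_up(x, low, high)


def length_membership(n_phonemes):
    """Degrees to which a sentence is short, medium and long (hypothetical breakpoints)."""
    return {
        "short": ramp_down(n_phonemes, 15, 25),
        "medium": min(ramp_up(n_phonemes, 15, 25), ramp_down(n_phonemes, 35, 45)),
        "long": ramp_up(n_phonemes, 35, 45),
    }


for n in (10, 20, 30, 50):
    print(n, length_membership(n))
```

With such sets, a 20-phoneme sentence is partly short and partly medium, so rules for both lengths can contribute to its generated curve instead of switching abruptly at a hard threshold.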

Rules Extraction

• Curve stylization after stochastic analysis, e.g.:

Emotion Inclusion

• Recording a corpus of emotional sentences covering 5 emotions (joy, anger, sadness, fear and surprise), together with their emotionless versions (20 sentences per emotion).

• Measurement of the prosodic features F0, duration and intensity, and of their variations (using Praat).

• Extraction of rules to automatically produce emotion on synthetic speech.

• Rules Validation.

[Figure: pitch contour plot (vertical axis 0 to 500, time axis 0 to 2.27622 s) of the sentence أهو ذنبي أن أتحمل أنا ذلك؟ ("Is it my fault to bear it?")]

Phonemes: c a h u w a D a n b j c a n c a t a H a m m a l a c a n aa D a aa l i k

Values printed along the contour: 181 245 221 176 200 170 195 177 163 200 196 176 177 173 169 133 158 153 195 213

• Pitch: variation of F0

• Range: difference between F0max and F0min

• F0 average: mean value

• Contour slope: shape of the contour slope (range variation)

• Variability: degree of variability (high, low, ...)

• Jitter: irregularities between successive glottal pulses
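As an illustration of the first three quantities, the sketch below computes the range, the mean and a least-squares slope of an F0 track. It assumes the numbers printed with the figure are F0 readings in Hz that split into twenty three-digit values, and that they are evenly spaced over the 2.27622 s utterance; both are assumptions, and the function name is hypothetical.

```python
from statistics import mean


def f0_features(f0_hz, times_s):
    """Range (Hz), mean (Hz) and least-squares slope (Hz/s) of an F0 track."""
    f0_mean = mean(f0_hz)
    t_mean = mean(times_s)
    covariance = sum((t - t_mean) * (f - f0_mean) for t, f in zip(times_s, f0_hz))
    variance = sum((t - t_mean) ** 2 for t in times_s)
    return {
        "range": max(f0_hz) - min(f0_hz),
        "mean": f0_mean,
        "slope": covariance / variance,
    }


# Values as read from the figure (assumed split), evenly spaced in time (assumed).
f0 = [181, 245, 221, 176, 200, 170, 195, 177, 163, 200,
      196, 176, 177, 173, 169, 133, 158, 153, 195, 213]
times = [i * 2.27622 / (len(f0) - 1) for i in range(len(f0))]
print(f0_features(f0, times))
```

Comparing these figures between the emotional and the emotionless recording of the same sentence gives relative changes of the kind listed in the rules.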

Example: Anger emotion

• F0 mean: +40% to +75%

• F0 range: +50% to +100%

• F0 at vowels and semi-vowels: +30%

• F0 slope: +

• Speech rate: +

• Silence rate: -

• Duration of vowels and semi-vowels: +

• Intensity mean: +

• Intensity: monotonic with F0

• Others: F0 variability: +, F0 jitter: +

Analysis & Rule Extraction: Anger

[Figure: the same sentence with emotion (anger) and emotionless]

Emotion Synthesis: Anger

• F0 mean: +30%

• F0 range: +30%

• F0 at vowels and semi-vowels: +100%

• Speech rate: +75% to +80%

• Duration of vowels and semi-vowels: +30%

• Duration of fricatives: +20%
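A minimal sketch of how some of these rules could be applied to a neutral phoneme sequence is given below. The slides do not say how the rules are combined, so the interpretation here is an assumption: the F0 mean is raised by 30%, deviations from the mean are stretched by 30% (which widens the range by the same factor), and vowel and fricative durations are lengthened. The phoneme classes, names and data layout are hypothetical, and the vowel-specific F0 rule and the speech-rate rule are left out.

```python
from statistics import mean

# Percentages from the slide; the way they are combined below is an assumption.
ANGER = {"f0_mean": 0.30, "f0_range": 0.30, "vowel_dur": 0.30, "fricative_dur": 0.20}

# Hypothetical phoneme classes; the real sets depend on the phoneme inventory.
VOWELS_AND_SEMIVOWELS = {"a", "i", "u", "aa", "ii", "uu", "w", "j"}
FRICATIVES = {"s", "z", "f", "h", "H", "D"}


def apply_emotion(segments, rules):
    """segments: list of (phoneme, duration_ms, f0_hz or None for unvoiced)."""
    old_mean = mean(f0 for _, _, f0 in segments if f0 is not None)
    new_mean = old_mean * (1 + rules["f0_mean"])
    spread = 1 + rules["f0_range"]
    result = []
    for phoneme, duration, f0 in segments:
        if phoneme in VOWELS_AND_SEMIVOWELS:
            duration *= 1 + rules["vowel_dur"]
        elif phoneme in FRICATIVES:
            duration *= 1 + rules["fricative_dur"]
        if f0 is not None:
            # Shift the mean, then stretch the distance from it to widen the range.
            f0 = round(new_mean + (f0 - old_mean) * spread)
        result.append((phoneme, round(duration), f0))
    return result


neutral = [("s", 100, None), ("a", 100, 200.0), ("l", 50, 100.0)]
print(apply_emotion(neutral, ANGER))
```

In this toy input the F0 mean moves from 150 Hz to 195 Hz and the range from 100 Hz to 130 Hz, i.e. both grow by 30%.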

Synthetic examples

Each sentence is played in an emotionless version and a version with emotion:

– Anger: “Who do you think you are?”

– Joy: “No more clouds in the sky”

– Sadness: “I’m so sad today”

– Fear: “What a scary scene!”

– Surprise: “What a beautiful scene!”

EmoGen

[Diagram: EmoGen architecture. Components shown: Text Editor, Voice Interface, Input Text, Speech and emotion properties, Normal Text to MBROLA Text Converter (NTMTC), Prosody Generator, Emotion Generator, MBROLA Player interface]

Results

• Five sentences for each emotion were synthesized and listened to by 10 people.

• Each listener reports the emotion perceived in each sentence (the listeners are not given our list of emotions).
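One way to tabulate such open-response judgments is a confusion table of intended versus perceived emotion. The sketch below is only an illustration of the bookkeeping: the function names are hypothetical and the sample judgments are made-up placeholders, not results from this study.

```python
from collections import Counter, defaultdict


def confusion(judgments):
    """Count perceived labels per intended emotion from (intended, perceived) pairs."""
    table = defaultdict(Counter)
    for intended, perceived in judgments:
        table[intended][perceived.strip().lower()] += 1
    return table


def recognition_rate(table, emotion):
    """Share of judgments for `emotion` whose perceived label matches it."""
    total = sum(table[emotion].values())
    return table[emotion][emotion] / total if total else 0.0


# Placeholder judgments for illustration only (not data from the experiment).
toy = [("anger", "anger"), ("anger", "Anger "), ("anger", "fear"), ("joy", "joy")]
table = confusion(toy)
print(dict(table["anger"]), recognition_rate(table, "anger"))
```

Because listeners answer freely, labels would in practice need some normalization (synonyms, spelling) before counting; only case and surrounding whitespace are handled here.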


Conclusion

• An automated tool for emotional Arabic synthesis has been developed.

• The prosodic model proposed and tested in this work proved successful, especially in conversational contexts.

• Further work will follow to include other emotions: disgust, annoyance, …