Towards Conversational Speech Synthesis; “Lessons Learned from the Expressive Speech Processing Project” Nick Campbell NiCT / ATR-SLC National Institute

Towards Conversational Speech Synthesis;

“Lessons Learned from the Expressive Speech Processing Project”

Nick Campbell

NiCT / ATR-SLCNational Institute of Information and

Communications Technology&

ATR Spoken Language Communication Research LabsKeihanna Science City, Kyoto 619-0288, Japan

[email protected], [email protected]

The JST/CREST ‘ESP’ corpus

• The ATR “Expressive Speech Processing” project (JST/CREST) lasted from 4/’00 to 3/’05 and resulted in a corpus of 1,500 hours of natural conversational speech

• All recordings were transcribed, and about 10% are annotated for speaking-style, etc.

• The corpus is divided into 3 sections : i: esp_f, ii: esp_c, and iii: esp_m

Transcription example

匿名One “utterance” per line

Sections of the ESP corpus

• esp_f– one female speaker, head-mounted mic, 600 hours of

daily spoken interactions, emotion/speech-act/etc …

• esp_c– 10 adult speakers, 5m 5f, 2 Chinese, 2 English, – 30-minute telephone conversations x 10 weeks– all conversations in Japanese, free content

• esp_m– multi-speaker, head-mounted microphones, variety of

interaction settings (like esp_f but many more voices)

finding #1: the Function of Conversational Speech

• To establish a rapport with the listener• To show interest and attention• To convey propositional content

• Contrast “broadcast mode” (one-way) with “interactive mode” (two-way) speech

• Speech Synthesis can do broadcast mode but Conversational Speech is two-way!

Backchannels & affect bursts

the hundred most common utterances

Non-Verbal Speech Sounds

• Short, simple, repetitive noises

• How they are spoken is usually more important than what is being said

• Some examples:– The word ほんま– Means “really”– Used a lot in Osaka conversations …

Synthesis of Non-Verbal Speech Sounds

• The challenge now is how to synthesise these non-lexical speech sounds

• the same speaker says the same word in many consistently different ways …

• How should they best be (a) described

(b) realised?

Tap-to-talk

http:feast.atr.jp/imode

Characteristics of Non-Verbal Utterances

• Better described by icons?

• Short, expressive sounds• Phonetically ambiguous• Prosodically marked

• Not well specified by text input!– But frequent and textually ‘transpar

ent’

‘Wrappers’ and ‘Fillings’ - Interaction Devices

• Often used as “edge-markers”– At beginning and end of utterance chunks

• Add expressivity to propositional content– Not just “fillers” –they ‘wrap’ the utterance

e.g., “erm, it’s very simple, you know”

The Acoustic features of Wrappers (and Fillers)

• Prosodically very variable

in more than just pitch & duration …

• pca dimension reduction shows– 3 components account for more than 50% – 7 components account for more than 80% – Voice-quality comes up in the 1st component!

Voice Quality in Synthesis

• Chakai – affect-based unit selection

• using “whole-phrase” units• that vary according to expressivity• selected by their acoustics (princomps)

• They show affective relationships• and serve a pragmatic (phatic) function

Chakai

KeyTalk

Touch-sensitive Selection

• One big advantage of using a midi keyboard is touch-sensitivity – controlled sustain & attack

i.e., (perfect for the natural input of prosody)– with pitch-blend as well …

• Another is that keys can be intuitively grouped into related sets of utterances

Greetings replies opinion calling etc …

Octave or sub-octave clusters …. 5 & 7 black & white keys

Grouping Related Utterances

• It remains as future work to group related utterance types and plan a full keyboard for non-verbal speech sound synthesis

• Demo software is provided on the cd-rom proceedings – please let me know if you have any helpful ideas or suggestions

:-)

Summary

• Non-verbal sounds offer a challenge for the synthesis of interactive speech

• They are frequent and carry important affective and discourse-flow information

• Segments can be selected and reused from a conversational speech corpus

Conclusion

• This paper has presented some examples of non-linguistic uses of speech prosody

• Synthesis of expressive sounds is easy!– ‘Units’ can be whole phrases

• But unit selection is difficult!– They carry subtle differences of meaning– That can be very hard to specify in text

Listen:

• Some examples of conversational speech

• (a) taken from the corpus (natural)

• (b) synthesised using current technology

• (c) concatenated from a very-large corpus

• Listen to the non-linguistic prosody!

NATRnext-generation

advanced text rendering

– The original dialogue – ditto - synthesised– CHATR (& original ）– NATR – large-corpus– NATR – more lively

Morning もしもしMorning もしもしHello こんにちはhi_there_ まいどHaha ハハハbeen_a_long_time 久しぶりですねー came_staight_to_the_eighth_floor もう直接、八階の方に、はいReally あ、そうなんReally あ、ほんまseventh_floor_today 七階すか、今日yeah_yeah うーん、そうそうHahaha ワーハハハー what_time_did_you_come 何時頃来たんすかjust_now さっきabout_now さっきぐらいウアハハハハハハ、まじで bit_late ちょっと遅なっ＜てんやん、アハjust_in_time ぎりぎりー not_really いや、そういうわけじゃないねんけどyeah_yeah_yeah はーいあいはいはい Umm うんYeah そっかそっかSo そう came_by_bike あたし自転車やからfrom_Kyooubashi 京橋のほうじゃなかったですっけReally そうやYeah でしょうUmm うんfrom_Kyoubashi 京橋からby_bike チャリンコですぐですか、あーそんなもんで来れるんやー Yeah そうYeah うん、だいたいReally あそうなーんすかYeah うん

7. Acknowledgements• This work is supported by the National Institute o

f Information and Communications Technology (NiCT), and includes contributions from the Japan Science & Technology Corporation 　 (JST), and the Ministry of Public Management, Home Affairs, 　 Posts and Telecommunications, Japan (SCOPE).

• The author is especially grateful to the management of ATR Spoken Language Communication Research Labs for their continuing encouragement and support.

Thank you

coming next:

Thank you

Documents

Towards Conversational Speech Synthesis; “Lessons Learned from the Expressive Speech Processing Project” Nick Campbell NiCT / ATR-SLC National Institute