Kishore Prahallad ([email protected]), IIIT Hyderabad. Building a Limited Domain Voice Using Festvox (Workshop Talk at IIT Kharagpur, Mar 4-5, 2009)

Kishore Prahallad
Email: [email protected]

International Institute of Information Technology (IIIT) Hyderabad, India &
Language Technologies Institute, Carnegie Mellon University


Objective

• Objective: To provide an introduction to the inner workings of the Festival speech synthesis system

• Best Resources: Documentation of Festival, Festvox and Speech Tools and their mailing lists

• Topics:
– Festival, Festvox and Speech Tools
– Modules and data structures in Festival
– Synthesis flow
– Building a limited domain voice


Festival & Speech Tools

• Festival
– Full text-to-speech system
– Multi-lingual
– A general framework for building new voices in existing and new languages
– APIs: shell level, C++ library, Emacs interface

• Speech Tools
– A set of modules for common tasks found in speech processing (example: feature extraction)
– Interface: stand-alone executables and a set of library calls linked into user programs


Festvox

• Voice building tool

• Interface created on top of Festival and Speech Tools to build voices


How Festival, Festvox & Speech Tools are Related

[Figure: block diagram relating Speech Tools, the Festival multi-lingual synthesis engine, and the Festvox environment used to build voices]


Output of Festvox

[Figure: block diagram showing Speech Tools, the Festvox environment used to build voices, and the resulting Voice]

• Festvox uses Speech Tools and Festival to create a new voice

• The voice created is put back into the Festival framework to synthesize text


User Interface with Festival

[Figure: block diagram showing Speech Tools, the Festvox environment used to build voices, the Voice, and the user world]


Some Festival-Specific Terminology

• Utterance: *Name* of a data structure used in Festival

• Segment: A phone is referred to as a segment


Basic Modules of Festival TTS system

There are many modules in the Festival system - the basic modules used for text-to-speech are:

• Token_POS – Basic token identification

• Token – Applies the token-to-word rules (handles non-standard words)

• POS – A standard part-of-speech tagger

• Phrasify – A chunker that detects phrase boundaries

• Word – Implements letter-to-sound rules

Notes on tokens:

• Tokens are white-space separated
– European languages: space, carriage return, newline, tab, vertical tab, etc.
– Asian languages: no white-space separators; use dictionaries
– Punctuation needs care, for example:
  "The boy----was usually late-----but arrived on time!!"
  "We have orange/apple/banana flavors"
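
The token notes above can be illustrated with a minimal Python sketch. It is not Festival's tokenizer: the function name `tokenize`, the feature names, and the punctuation set are assumptions made for illustration. It splits on white space and then peels leading and trailing punctuation off each token.

```python
import string

# Assumption for illustration: every ASCII punctuation character may be
# attached to the start or end of a token.
PUNCTUATION = string.punctuation


def tokenize(text):
    """Split text on white space and separate leading/trailing punctuation.

    Returns one dict per token with the keys 'prepunc', 'name' and 'punc'.
    """
    tokens = []
    # str.split() with no argument splits on space, CR, newline, tab
    # and vertical tab alike.
    for raw in text.split():
        core = raw.lstrip(PUNCTUATION)
        prepunc = raw[: len(raw) - len(core)]
        name = core.rstrip(PUNCTUATION)
        punc = core[len(name):]
        tokens.append({"prepunc": prepunc, "name": name, "punc": punc})
    return tokens


example = tokenize("The boy----was usually late-----but arrived on time!!")
```

In this sketch the final token becomes `time` with trailing punctuation `!!`, while `boy----was` and `orange/apple/banana` would each stay a single token, because their punctuation is internal. That is why such cases need extra token-to-word rules.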


Basic Modules of Festival TTS system contd..

• Pauses – Prediction of pauses, inserting silences.

• Intonation – Prediction of accents: which syllables are accented (stressed).

• PostLex – Post-lexical rules that can modify segments based on their context. This is used for things like vowel reduction, contractions, etc.

• Duration – Prediction of durations of segments.

• Int_Targets – Realization of the F0 contour: given the accents/tones, generate an F0 contour.

• Wave_Synth – A general function that in turn calls the appropriate method to actually generate the waveform.


Data Structure in Festival

• Utterance: A "dashboard" data structure, that is, a common memory that all modules read from and write to

• An *Utterance* is the input and the output of every module in Festival

[Figure: Utterance → Module → Utterance]


What Does an Utterance Consist Of?

• *Items* and *Relations*

• Item:
– An object that stores strings representing a word, a segment, etc.

• Relation:
– A graph which links the items
– For example: "syllable" is a relation which links together the items storing segment names
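
A toy Python sketch of items and relations follows. It is a deliberate simplification and not Festival's C++ implementation: the class names, the `daughters` feature, and the use of plain lists for relations are all assumptions made for illustration. The point it shows is that the same item object can be reached through more than one relation.

```python
class Item:
    """An item stores named features, e.g. the name of a word or segment."""

    def __init__(self, **features):
        self.features = dict(features)


class Utterance:
    """Toy utterance: each relation name maps to an ordered list of items."""

    def __init__(self):
        self.relations = {}

    def append(self, relation, item):
        """Add an item to the named relation, creating the relation if needed."""
        self.relations.setdefault(relation, []).append(item)
        return item

    def names(self, relation):
        """Return the 'name' feature of every item in a relation."""
        return [item.features["name"] for item in self.relations.get(relation, [])]


utt = Utterance()
segments = [Item(name=phone) for phone in ["jh", "uu", "n"]]
for segment in segments:
    utt.append("Segment", segment)

# The syllable item links to the very same segment objects.
syllable = utt.append("Syllable", Item(name="syl1", stress=1, daughters=segments))
```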


What Each Module Does to an Utterance

• Each module accesses *items* and *relations* in an utterance and generates new features, items and relations in the same utterance

• For example, Token_POS:
– Input: an utterance with one item, a string representing a sentence
– Output: an utterance with multiple items, where each item represents a token

• The synthesis process in Festival is viewed as applying a set of modules to an utterance
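
The idea of synthesis as a chain of modules over one shared utterance can be sketched in Python as below. The module functions and the dictionary-based utterance are hypothetical stand-ins, not Festival's modules; the word module simply lower-cases tokens to keep the example short.

```python
def token_module(utt):
    """Read the raw text and add a 'Token' relation (white-space split)."""
    utt["Token"] = [{"name": t} for t in utt["Text"].split()]
    return utt


def word_module(utt):
    """Read the 'Token' relation and add a 'Word' relation.

    Placeholder logic: each token is lower-cased.
    """
    utt["Word"] = [{"name": tok["name"].lower()} for tok in utt["Token"]]
    return utt


def synthesize(text, modules):
    """Apply each module in turn; every module takes and returns the utterance."""
    utt = {"Text": text}
    for module in modules:
        utt = module(utt)
    return utt


utt = synthesize("Hello World", [token_module, word_module])
```

Each module only adds to the utterance, so later modules can still read everything earlier modules produced.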


Synthesis Flow

[Figure: synthesis flow for the input text "June 25", showing the modules applied and the relations they produce]

• Text: June 25
• Tokenize → Token relation: June, 25
• Token2Word → Word relation: June, Twenty, Fifth
• POS: Noun, Num, Num
• Syllable relation (stress values): 1 1 0 1
• Segment relation: jh uu n t w e n t ii f i f th
• Wave Synthesize → Wave relation
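
The Token2Word step for "June 25" can be sketched in Python as follows. This is an illustrative rule, not Festival's token-to-word code: the function names and the rule "a number after a month name is read as an ordinal day" are assumptions, and only days 1 to 31 are covered.

```python
ORDINALS = ["", "first", "second", "third", "fourth", "fifth", "sixth",
            "seventh", "eighth", "ninth", "tenth", "eleventh", "twelfth",
            "thirteenth", "fourteenth", "fifteenth", "sixteenth",
            "seventeenth", "eighteenth", "nineteenth"]
TENS = {20: ("twenty", "twentieth"), 30: ("thirty", "thirtieth")}
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}


def day_to_words(day):
    """Expand a day of the month (1-31) into ordinal words."""
    if not 1 <= day <= 31:
        raise ValueError("day must be between 1 and 31")
    if day < 20:
        return [ORDINALS[day]]
    cardinal, ordinal = TENS[day - day % 10]
    if day % 10 == 0:
        return [ordinal]
    return [cardinal, ORDINALS[day % 10]]


def token_to_words(tokens):
    """Expand numeric tokens that follow a month name; copy all others."""
    words = []
    for i, tok in enumerate(tokens):
        after_month = i > 0 and tokens[i - 1].lower() in MONTHS
        if tok.isdecimal() and after_month and 1 <= int(tok) <= 31:
            words.extend(day_to_words(int(tok)))
        else:
            words.append(tok)
    return words


words = token_to_words("June 25".split())
```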


Installation of Festival & Festvox

• Step 1: Install Speech Tools

• Step 2: Install Festival
– Synthesize text in English to check the sound card, rate of speech, etc.

• Step 3: Install Festvox

• Detailed notes are available from the course web site


Building a Limited Domain Voice

• Unit selection is applied to a limited domain with a restricted vocabulary

• This yields high quality speech systems

• Conceptually, the units are words

• Implementation in Festival:
– The units are still phones, but each is restricted to come from a specific word
– /p/ from "Pennsylvania" is differentiated from /p/ from "Pittsburgh"
– To synthesize "Pittsburgh", all the phones should come from the word "Pittsburgh" (there may be many examples of the same word)

• Example domains: talking clock, weather prediction, rail/air inquiry systems

• http://www.cs.cmu.edu/~awb/papers/ICSLP2000_ldom/index.html
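
The word-restricted phone units described above can be sketched in Python. This is a simplified illustration and not the clunits code in Festival: the `phone_word` naming scheme, the function names, and the phone strings in the example data are assumptions.

```python
from collections import defaultdict


def unit_type(phone, word):
    """Name a unit by its phone and the word it was recorded in."""
    return f"{phone}_{word.lower()}"


def build_inventory(recordings):
    """Index every recorded phone under its word-specific unit type.

    recordings: list of (word, [phones]) pairs taken from labeled prompts.
    Returns a dict mapping unit type to a list of
    (recording index, phone index) positions.
    """
    inventory = defaultdict(list)
    for rec_idx, (word, phones) in enumerate(recordings):
        for ph_idx, phone in enumerate(phones):
            inventory[unit_type(phone, word)].append((rec_idx, ph_idx))
    return inventory


def candidates(inventory, word, phones):
    """Return, for each phone of the target word, its candidate units."""
    return [inventory.get(unit_type(p, word), []) for p in phones]


# Hypothetical labeled data: two examples of one word, one of another.
pittsburgh = ["p", "ih", "t", "s", "b", "er", "g"]
pennsylvania = ["p", "eh", "n", "s", "ih", "l", "v", "ey", "n", "y", "ax"]
recordings = [("Pittsburgh", pittsburgh),
              ("Pennsylvania", pennsylvania),
              ("Pittsburgh", pittsburgh)]
inventory = build_inventory(recordings)
pitt_candidates = candidates(inventory, "Pittsburgh", pittsburgh)
```

Because the word is part of the unit name, the /p/ recorded in "Pennsylvania" never appears among the candidates for "Pittsburgh". A real system would then choose among the remaining candidates using acoustic and join costs.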


Limited Domain Setup (http://festvox.org/bsv/bsv-ldom-ch.html)

• 1. Set up the environment:
    $FESTVOXDIR/src/ldom/setup_ldom iiit time pra
  # This gives a talking clock setup.
  # To change it to another domain, replace "etc/time.data" with the domain-specific training sentences.
  # For non-English languages, these sentences are transliterated into English letters.

• 2. Generate prompts
– Synthesize the sentences which *you* are going to speak
– How can you synthesize? This is mostly applicable to English only
– Why synthesize at all? To *prompt* you on what to speak
    festival -b festvox/build_ldom.scm '(build_prompts "etc/txt.done.data")'

• 3. Record prompts
– For new languages, switch off the *playing of the prompt* by commenting out na_play in bin/prompt_them
    bin/prompt_them etc/txt.done.data

• 4. Label automatically
– Uses dynamic programming for labeling the speech
– Labeling builds the correspondence between the text and the speech
    bin/make_labs prompt-wav/*.wav

• 4.1 Manually correct the labeling errors
    emulabel etc/emu_lab time0001
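
Step 4 relies on dynamic programming. The Python sketch below shows the general technique, dynamic time warping, on two one-dimensional feature sequences. It illustrates the idea only and does not describe what bin/make_labs does internally: the function names, the absolute-difference frame distance, and the boundary-mapping rule are assumptions.

```python
def dtw_path(ref, rec):
    """Align two number sequences with dynamic time warping.

    Returns (total_cost, path), where path is a list of (i, j) pairs that
    align frame i of ref with frame j of rec.
    """
    n, m = len(ref), len(rec)
    inf = float("inf")
    # cost[i][j]: best cost of aligning ref[:i] with rec[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = abs(ref[i - 1] - rec[j - 1])
            cost[i][j] = dist + min(cost[i - 1][j - 1],
                                    cost[i - 1][j],
                                    cost[i][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


def map_boundary(path, ref_index):
    """Map a boundary frame in ref to the first frame aligned to it in rec."""
    return min(j for i, j in path if i == ref_index)


# Toy data: a reference with a known boundary at frame 2, and a slower
# recording of the same pattern.
reference = [0, 0, 5, 5]
recording = [0, 0, 0, 5, 5, 5]
total_cost, path = dtw_path(reference, recording)
boundary_in_recording = map_boundary(path, 2)
```

Once such a path exists, boundaries known in one sequence can be transferred to the other. That transfer is the correspondence between text and speech that labeling builds.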


Limited Domain Setup contd..

• 5. Generate pitch markers
    bin/make_pm_wave wav/*.wav

• 6. Correct the pitch markers
    bin/make_pm_fix pm/*.pm

• 7. Generate Mel cepstral coefficients
    bin/make_mcep wav/*.wav

• 8. Generate the utterance structure
    festival -b festvox/build_ldom.scm '(build_utts "etc/txt.done.data")'

• 9. Cluster the units
    festival -b festvox/build_ldom.scm '(build_clunits "etc/txt.done.data")'

• 10. Test the voice
    festival festvox/iiit_time_pra_ldom.scm '(voice_iiit_time_pra_ldom)'

• To see the units selected:
    (set! utt (SayText "abhii samaya hai...."))
    (clunits::units_selected utt "-")


References

• http://festvox.org

• 11-752 CMU course slides – http://festvox.org/festtut/

• 11-752 CMU course lecture notes – http://festvox.org/festtut/notes/festtut_toc.html

• Building Synthetic Voices – http://www.festvox.org/bsv/

• The Festival Speech Synthesis System – http://www.festvox.org/docs/manual-1.4.3/festival_toc.html

• Edinburgh Speech Tools Library – http://www.festvox.org/docs/speech_tools-1.2.0/book1.htm
