Kishore Prahallad, IIIT Hyderabad
Building a Limited Domain Voice Using Festvox
(Workshop Talk at IIT Kharagpur, Mar 4-5, 2009)
Kishore Prahallad
International Institute of Information Technology (IIIT) Hyderabad, India &
Language Technologies Institute, Carnegie Mellon University
Objective
• Objective: To introduce the inner workings of the Festival speech synthesis system
• Best resources: The Festival, Festvox and Speech Tools documentation and their mailing lists
• Topics:
  – Festival, Festvox and Speech Tools
  – Modules and data structures in Festival
  – Synthesis flow
  – Building a limited domain voice
Festival & Speech Tools
• Festival
  – Full text-to-speech system
  – Multi-lingual
  – A general framework for building new voices in existing and new languages
  – APIs: shell level, C++ library, Emacs interface
• Speech Tools
  – A set of modules for common tasks found in speech processing
    • Example: feature extraction
  – Interface: stand-alone executables and a set of library calls linked into user programs
Festvox
• Voice building tool
• Interface created on top of Festival and Speech Tools to build voices
How Festival, Festvox & Speech Tools are Related
[Diagram: Speech Tools; Festival multi-lingual synthesis engine; Festvox environment to build voices]
Output of Festvox
[Diagram: Speech Tools; Festival multi-lingual synthesis engine; Festvox environment to build voices; output: Voice]
• Festvox uses Speech Tools and Festival to create a new voice
• The new voice is then put back into the Festival framework to synthesize text
User Interface with Festival
[Diagram: Speech Tools; Festival multi-lingual synthesis engine; Festvox environment to build voices; Voice; User; World]
Some Festival-Specific Terminology
• Utterance: the *name* of the data structure used in Festival
• Segment: a phone is referred to as a segment
Basic Modules of Festival TTS system
There are many modules in the Festival system; the basic modules used for text-to-speech are:
• Token_POS
  – Basic token identification
• Token
  – Applies the token-to-word rules (handles non-standard words)
• POS
  – A standard part-of-speech tagger
• Phrasify
  – A chunker that detects the phrase boundaries
• Word
  – Looks up pronunciations in the lexicon or applies letter-to-sound rules
Tokens: white-space separated
  – European languages: space, CR, newline, tab, vertical tab, etc.
  – Asian languages: no white-space separators, so dictionaries are used
  – Punctuation needs care, e.g. "The boy----was usually late-----but arrived on time!!" and "We have orange/apple/banana flavors"
Basic Modules of Festival TTS system contd..
• Pauses
  – Prediction of pauses, inserting silences
• Intonation
  – Prediction of accents: which syllables carry an accent (stress)
• PostLex
  – Post-lexical rules that can modify segments based on their context; used for things like vowel reduction, contractions, etc.
• Duration
  – Prediction of the durations of segments
• Int_Targets
  – Realization of the F0 contour: given the accents/tones, generate an F0 contour
• Wave_Synth
  – A general function that in turn calls the appropriate method to actually generate the waveform
Data Structure in Festival
• Utterance: a shared "dashboard" data structure (all modules read from and write to this common memory)
• An *Utterance* is both the input and the output of every module in Festival
[Diagram: Utterance → Module → Utterance]
What Does an Utterance Consist Of?
• *Items* and *Relations*
• Item:
  – An object that stores a string representing a word, a segment, etc.
• Relation:
  – A graph that links the items
  – For example, “syllable” is a relation that links together the items storing segment names
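The idea of items shared between relations can be illustrated with a toy model. This is only a sketch in Python, not Festival's actual C++ or Scheme interface: the `Item` and `Utterance` classes, the `stress` feature name and the choice of the word "June" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Toy item: a name plus a dictionary of features."""
    name: str
    features: dict = field(default_factory=dict)


@dataclass
class Utterance:
    """Toy utterance: a dictionary of named relations over items."""
    relations: dict = field(default_factory=dict)


# Segment items for the word "June" (phone names as written on the slides).
segments = [Item("jh"), Item("uu"), Item("n")]

utt = Utterance()
# The Segment relation is a flat, ordered list of segment items.
utt.relations["Segment"] = segments
# The syllable relation links one syllable item to the same segment objects.
utt.relations["Syllable"] = [(Item("syl", {"stress": 1}), segments)]

syllable, members = utt.relations["Syllable"][0]
print(syllable.features["stress"], [s.name for s in members])
```

The point of the sketch is that both relations refer to the same item objects, so a feature written on a segment through one relation is visible through the other.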
What Each Module Does to an Utterance
• Each module accesses the *items* and *relations* in an utterance and generates new features, items and relations in the same utterance
  – For example, Token_POS:
    • Input: an utterance with one item, a string representing a sentence
    • Output: an utterance with multiple items, each item representing a token
• The synthesis process in Festival is viewed as applying a set of modules to an utterance
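The view of synthesis as modules applied in turn to one utterance can be sketched as follows. This is a toy illustration in Python under stated assumptions, not Festival code: the utterance is a plain dictionary, `tokenize` and `token_to_word` are hypothetical stand-ins for the real modules, and the `ORDINALS` table covers only the number used in the example.

```python
import re

# Hypothetical lookup table covering only the number used in this example.
ORDINALS = {"25": ["Twenty", "Fifth"]}


def tokenize(utt):
    """Toy stand-in for tokenization: split the text on white space."""
    utt["Token"] = utt["Text"].split()
    return utt


def token_to_word(utt):
    """Toy stand-in for token-to-word rules: expand digit tokens into words."""
    words = []
    for token in utt["Token"]:
        if re.fullmatch(r"\d+", token):
            words.extend(ORDINALS[token])  # a non-standard word
        else:
            words.append(token)
    utt["Word"] = words
    return utt


def synthesize(text, modules):
    """Apply each module in turn; every module reads and extends the same utterance."""
    utt = {"Text": text}
    for module in modules:
        utt = module(utt)
    return utt


utt = synthesize("June 25", [tokenize, token_to_word])
print(utt["Token"])
print(utt["Word"])
```

Each function takes an utterance and returns the same utterance with one more relation filled in, which is the contract the slide describes for every module.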
Synthesis Flow
[Diagram, built up over three slides: the text "June 25" passes through a sequence of modules, each adding relations to the utterance]
• Text: June 25
• Tokenize → Token relation: June, 25
• Token2Word → Word relation: June, Twenty, Fifth
• POS: Noun, Num, Num
• Syllable relation: 1 1 0 1 (one value per syllable)
• Segment relation: jh uu n t w e n t ii f i f th
• Wave Synthesize → Wave
Installation of Festival & Festvox
• Step 1: Install Speech Tools
• Step 2: Install Festival
  – Synthesize some English text to check the sound card, rate of speech, etc.
• Step 3: Install Festvox
• Detailed notes are available from the course web site
Building a Limited Domain Voice
• Unit selection is applied to a limited domain with a restricted vocabulary
• High-quality speech systems
• Units are words
  – Implementation in Festival:
    • The units are still phones, but each is restricted to come from a specific word
      – /p/ from “Pennsylvania” is differentiated from /p/ from “Pittsburgh”
      – To synthesize “Pittsburgh”, all the phones should come from the word “Pittsburgh” (there may be many examples of the same word)
• Examples: talking clock, weather prediction, rail/air inquiry systems
• http://www.cs.cmu.edu/~awb/papers/ICSLP2000_ldom/index.html
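The word-restricted units described above can be sketched by tagging every phone with the word it was spoken in. This is a minimal Python illustration, not the clustering code Festvox runs: the `phone_word` naming, the `RECORDINGS` table and the shortened phone strings are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical labelled recordings: (utterance id, word, phones of that word).
# The phone strings are shortened and illustrative only.
RECORDINGS = [
    ("utt01", "pittsburgh", ["p", "i", "t", "s"]),
    ("utt02", "pennsylvania", ["p", "e", "n", "s"]),
    ("utt03", "pittsburgh", ["p", "i", "t", "s"]),
]


def unit_type(phone, word):
    """Tag a phone with the word it was spoken in."""
    return f"{phone}_{word}"


def build_inventory(recordings):
    """Map each word-specific unit type to every place it occurs."""
    inventory = defaultdict(list)
    for utt_id, word, phones in recordings:
        for position, phone in enumerate(phones):
            inventory[unit_type(phone, word)].append((utt_id, position))
    return inventory


def candidates(inventory, word, phones):
    """For each target phone, list only the units recorded inside the same word."""
    return [inventory.get(unit_type(phone, word), []) for phone in phones]


inventory = build_inventory(RECORDINGS)
target = ["p", "i", "t", "s"]
for phone, units in zip(target, candidates(inventory, "pittsburgh", target)):
    print(phone, units)
```

With this tagging, the /p/ recorded in "pennsylvania" never appears among the candidates for "pittsburgh", and a word outside the domain yields no candidates at all, which is why the vocabulary has to be restricted.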
Limited Domain Setup (http://festvox.org/bsv/bsv-ldom-ch.html)
• 1. Set up the environment:
$FESTVOXDIR/src/ldom/setup_ldom iiit time pra
# This gives a talking clock setup.
# To change it to another domain, all you have to do is replace "etc/time.data"
# with the domain-specific training sentences.
# For non-English languages, these sentences are transliterated into English.
• 2. Generate prompts
  – Synthesize the sentences which *you* are going to speak
  – How can you synthesize? Mostly applicable to English only
  – Why synthesize at all? To *prompt* you what to speak
festival -b festvox/build_ldom.scm '(build_prompts "etc/txt.done.data")'
• 3. Record prompts
  – For new languages, switch off the *playing of the prompt* by commenting out na_play in bin/prompt_them
bin/prompt_them etc/txt.done.data
• 4. Label automatically
  – Uses dynamic programming for labeling the speech
  – Labeling builds the correspondence between the text and the speech
bin/make_labs prompt-wav/*.wav
• 4.1 Manually correct the labeling errors
emulabel etc/emu_lab time0001
Contd…
• 5. Generate pitch marks
bin/make_pm_wave wav/*.wav
• 6. Correct the pitch marks
bin/make_pm_fix pm/*.pm
• 7. Generate mel cepstral coefficients
bin/make_mcep wav/*.wav
• 8. Generate the utterance structures
festival -b festvox/build_ldom.scm '(build_utts "etc/txt.done.data")'
• 9. Cluster the units
festival -b festvox/build_ldom.scm '(build_clunits "etc/txt.done.data")'
• 10. Test the voice
festival festvox/iiit_time_pra_ldom.scm '(voice_iiit_time_pra_ldom)'
• To see the units selected:
(set! utt (SayText "abhii samaya hai...."))
(clunits::units_selected utt "-")
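Step 1 mentions replacing the prompt file with domain-specific sentences. Festvox prompt files conventionally hold one entry per line in the form ( id "text" ), with ids such as the time0001 used in step 4.1. The Python sketch below generates such lines; the helper names and the two example sentences are hypothetical, and the output is only printed, so writing it to the prompt file is left to the reader.

```python
def prompt_line(utt_id, text):
    """Format one prompt entry, escaping backslashes and double quotes in the text."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'( {utt_id} "{escaped}" )'


def make_prompt_lines(prefix, sentences):
    """Number the sentences from 0001 and format each as a prompt entry."""
    return [
        prompt_line(f"{prefix}{number:04d}", sentence)
        for number, sentence in enumerate(sentences, start=1)
    ]


# Hypothetical domain sentences; replace them with your own training sentences.
SENTENCES = [
    "The time is now ten past nine.",
    "The time is now half past four.",
]

lines = make_prompt_lines("time", SENTENCES)
print("\n".join(lines))
```

Generating the ids programmatically keeps them unique and zero-padded, which matters because the later build steps match recordings and labels to prompts by id.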
References
• http://festvox.org
• 11-752 CMU course slides
  – http://festvox.org/festtut/
• 11-752 CMU course lecture notes
  – http://festvox.org/festtut/notes/festtut_toc.html
• Building Synthetic Voices
  – http://www.festvox.org/bsv/
• The Festival Speech Synthesis System
  – http://www.festvox.org/docs/manual-1.4.3/festival_toc.html
• Edinburgh Speech Tools Library
  – http://www.festvox.org/docs/speech_tools-1.2.0/book1.htm