Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
6. Text-‐to-‐Speech Synthesis
(Most Of these slides come from Dan Jura>y’s course at Stanford)
Tractament Digital de la Parla 2
History of Speech Synthesis • In 1779, the Danish scientist Christian Kratzenstein builds models of
the human vocal tract that can produce the five long vowel sounds. • In 1791 Wolfgang von Kempelen (the creator of the Turk chess playing
game) devises the bellows(fuelle, mancha)-operated “automatic-mechanical speech machine”. It added models of the tongue and lips that allowed it to produce voewls and consonants.
• In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's design.
• In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated electronic speech analyzer and synthesizer that was said to be clearly intelligible. It was later refined into the VODER, which was exhibited at the 1939 New York World's Fair.
Von Kempelen: • Small whistles
controlled consonants • Rubber mouth and
nose; nose had to be covered with two fingers for non-nasals
• Unvoiced sounds: mouth covered, auxiliary bellows driven by string provides puff of air
From Traunmüller’s web site
Von Kempelen’s speaking machine
Bell labs VOCODER machine
Homer Dudley 1939 VODER • Synthesizing speech by electrical means • 1939 World’s Fair
Homer Dudley’s VODER
• Manually controlled through complex keyboard • Operator training was a problem
One of the first “talking” computers
Closer to a natural vocal tract: Riesz 1937
The UK Speaking Clock
• July 24, 1936 • Photographic storage on 4 glass disks • 2 disks for minutes, 1 for hour, one for
seconds. • Other words in sentence distributed across
4 disks, so all 4 used at once. • Voice of “Miss J. Cain”
The 1936 UK Speaking Clock
From hQp://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm
The UK speaking clock
Gunnar Fant’s OVE synthesizer • Of the Royal Institute of
Technology, Stockholm • Formant
Synthesizer for vowels
• F1 and F2 could be controlled
From Traunmüller’s web site
Cooper’s Pattern Playback
• Haskins Labs for investigating speech perception
• Works like an inverse of a spectrograph • Light from a lamp goes through a rotating
disk then through spectrogram into photovoltaic cells
• Thus amount of light that gets transmitted at each frequency band corresponds to amount of acoustic energy at that band
Cooper’s Pattern Playback
Modern TTS systems • 1960’s first full TTS: Umeda et al (1968) • 1970’s
– Joe Olive 1977 concatenation of linear-prediction diphones – Texas Instruments Speak and Spell,
• June 1978 • Paul Breedlove
• 1980’s – 1979 MIT MITalk (Allen, Hunnicut, Klatt)
• 1990’s-present – Diphone synthesis – Unit selection synthesis – HMM synthesis
Speak and spell demo
TTS Demos (Unit-Selection) • Cereproc
– Catalan – Spanish
• Festival (open source) – English
• ATT – English – South American
Types of Waveform Synthesis
• Articulatory Synthesis: – Model movements of articulators and acoustics of vocal
tract • Formant Synthesis:
– Start with acoustics, create rules/filters to create each formant
• Concatenative Synthesis: – Use databases of stored speech to assemble new
utterances. • Diphone • Unit Selection
• Statistical (HMM) Synthesis – Trains parameters on databases of speech
Text modified from Richard Sproat slides
1st gen. synthesis: Formant Synthesis
• Were the most common commercial systems when computers were slow and had little memory.
• 1979 MIT MITalk (Allen, Hunnicut, Klatt) • 1983 DECtalk system
– “Perfect Paul” (The voice of Stephen Hawking)
– “Beautiful Betty”
2nd Generation Synthesis
• Diphone Synthesis – Units are diphones; middle of one phone to
middle of next. – Why? Middle of phone is steady state. – Record 1 speaker saying each diphone – ~1400 recordings – Paste them together and modify prosody.
3rd GenerationSynthesis • All current commercial systems.
• Unit Selection Synthesis – Larger units of variable length – Record one speaker speaking 10 hours or more,
• Have multiple copies of each unit
– Use search to find best sequence of units
• Hidden Markov Model Synthesis – Train a statistical model on large amounts of data.
General TTS architecture
Raw text in
Text analysis: • Text NormalizaYon • Part-‐of-‐speech tagging • Homograph disambiguaYon
PhoneYc analysis • DicYonary lookup • Grapheme-‐to-‐phoneme(LTS)
Prosodic analysis • Boundary placement • Pitch accent assignment • DuraYon computaYon
Waveform synthesis
Speech out
1. Text Normalization • Analysis of raw text into pronounceable words • Sample problems:
– He stole $100 million from the bank – It's 13 St. Andrews St. – The home page is h?p://www.stanford.edu – yes, see you the following tues, that's 11/12/01
• Steps: – Identify tokens in text – Chunk tokens into reasonably sized sections – Map tokens to words – Identify types for words
Identify tokens and chunk them
• Whitespace can be viewed as separators • Punctuation can be separated from the
raw tokens • Festival converts text into
– ordered list of tokens – each with features:
• its own preceding whitespace • its own succeeding punctuation
End-of-utterance detection
• Relatively simple if utterance ends in ?! • But what about ambiguity of “.” • Ambiguous between end-of-utterance and
end-of-abbreviation – My place on Winfield St. is around the corner. – I live at 151 Winfield St. – (Not “I live at 151 Winfield St..”)
Identify Types of Tokens, and Convert Tokens to Words
• Pronunciation of numbers often depends on type. 3 ways to pronounce 1776: – 1776 date: seventeen seventy six. – 1776 phone number: one seven seven six – 1776 quantifier: one thousand seven hundred
(and) seventy six – Also:
• 25 day: twenty-fifth
Step 2: Classify token into 1 of 20 types
• EXPN: abbrev, contractions (adv, N.Y., mph, gov’t) • LSEQ: letter sequence (CIA, D.C., CDs) • ASWD: read as word, e.g. CAT, proper names • MSPL: misspelling • NUM: number (cardinal) (12,45,1/2, 0.6) • NORD: number (ordinal) e.g. May 7, 3rd, Bill Gates II • NTEL: telephone (or part) e.g. 212-555-4523 • NDIG: number as digits e.g. Room 101 • NIDE: identifier, e.g. 747, 386, I5, PC110 • NADDR: number as stresst address, e.g. 5000 Pennsylvania • NZIP, NTIME, NDATE, NYER, MONEY, BMONY, PRCT,URL,etc • SLNT: not spoken (KENT*REALTY)
POS Tagging: Definition
• The process of assigning a part-of-speech or lexical class marker to each word in a corpus:
the koala put the keys on the table
WORDS TAGS
N V P DET
Part of speech tagging
• 8 (ish) traditional parts of speech – Noun, verb, adjective, preposition, adverb,
article, interjection, pronoun, conjunction, etc – This idea has been around for over 2000
years (Dionysius Thrax of Alexandria, c. 100 B.C.)
– Called: parts-of-speech, lexical category, word classes, morphological classes, lexical tags, POS
– We’ll use POS most frequently
POS examples
• N noun chair, bandwidth, pacing • V verb study, debate, munch • ADJ adj purple, tall, ridiculous • ADV adverb unfortunately, slowly, • P preposition of, by, to • PRO pronoun I, me, mine • DET determiner the, a, that, those
POS Tagging example WORD tag
the DET koala N put V the DET keys N on P the DET table N
POS tagging: Choosing a tagset
• There are so many parts of speech, potential distinctions we can draw
• To do POS tagging, need to choose a standard set of tags to work with
• Could pick very coarse tagets – N, V, Adj, Adv.
• More commonly used set is finer grained, the “UPenn TreeBank tagset”, 45 tags – PRP$, WRB, WP$, VBG
• Even more fine-grained tagsets exist
Penn TreeBank POS Tag set
POS importance
• Part of speech tagging plays important role in TTS
• Most algorithms get 96-97% tag accuracy • Not a lot of studies on whether remaining
error tends to cause problems in TTS
1/5/07
PhoneYc analysis
• Deals with the pronunciaYon of each word • Finds, for each word, the way to pronounce it • Straigh^orward method would be a dicYonary lookup, but: – Unknown words and proper names will be missing – Words in other languages
Phoneme sets
ConverYng from words to phones • Two methods:
– DicYonary-‐based – Rule-‐based (through leQer-‐to-‐sound -‐LTS-‐ rules)
• All early systems used LTS. The first to use a dicYonary was MITTalk (10K words)
• A good dicYonary example is CMU dicYonary with 127K words
When dictionaries aren’t sufficient
• Unknown words – Seem to be linear with number of words in unseen text – Mostly person, company, product names – But also foreign words, etc. – From a Black et al analysis
• Of 39K tokens in part of the Wall Street Journal • 1775 (4.6%) were not in the OALD dictionary:
• So commercial systems have 3-part system: – Big dictionary – Special code for handling names – Machine learned LTS system for other unknown words
1/5/07
Learning LTS rules • Rules for Spanish are straigh^orward
– Grapheme(wriQen form) is very similar to the phoneme sequence
• In the past most big systems employed many linguists (of PhD students) to manually write LTS rules
• Now they are mostly induced automaYcally from a dicYonary of the language – An important step is the phoneme-‐grapheme alignment
Homograph disambiguaYon • An homograph is a word (or group of words) that share the same wriQen form but have different meanings.
• In pronunciaYon, they might have the same pronunciaYon (homonyms) or different (heteronyms).
• It is very important to detect and disambiguate them for correct pronunciaYon
• Examples: – English: wood(madera)/wood(bosque), bear(aguantar)/bear(barba), read(presente leer)/read(pasado leer)
– Spanish: all words are homonyms
Next steps afer text analysis
• Afer Text Analysis we now have a string of Phones. We also have some semanYc informaYon to do Prosodic Analysis, so next steps are:
• Prosody – Desired F0 for enYre uQerance – DuraYon for each phone – Stress value for each phone, possibly accent value
• Generate Waveforms
Prosody Prosody is in charge of converYng words+phones into boundaries, accent, F0 and duraYon informaYon • Prosodic phrasing
– Need to break uQerances into phrases – PunctuaYon is useful, not sufficient
• Accents: – PredicYons of accents: which syllables should be accented – RealizaYon of F0 contour: given accents/tones, – generate F0 contour
• DuraYon: – PredicYng duraYon of each phone
Graphic representation of F0
legumes are a good source of VITAMINS 50
100
150
200
250
300
350
400
time
F0 (i
n H
ertz
)
Slide from Jennifer Venditti
AlternaYve accents
• We might want to give prominence to different parts of a sentence depending on the context.
• For example, imagine answering a quesYon on the previous example
Q1: What types of foods are a good source of vitamins?
50
100
150
200
250
300
350
400
LEGUMES are a good source of vitamins
Slide from Jennifer Venditti
Q2: Are legumes a source of vitamins?
50
100
150
200
250
300
350
400
Legumes are a GOOD source of vitamins
Slide from Jennifer Venditti
Q3: I’ve heard that legumes are healthy, but what are they a good source of ?
legumes are a good source of VITAMINS 50
100
150
200
250
300
350
400
Slide from Jennifer Venditti
Duration • Simplest: fixed size for all phones (100 ms) • Next simplest: average duration for that phone (from training data).
Samples from SWBD in ms: – aa 118 b 68 – ax 59 d 68 – ay 138 dh 44 – eh 87 f 90 – ih 77 g 66
• Next Next Simplest: add in phrase-final and initial lengthening plus stress
• Lots of fancy models of duration prediction: – Using Z-scores and other clever normalizations – Sum-of-products model – New features like word predictability
• Words with higher bigram probability are shorter
Waveform synthesis generation
• Given: – String of phones – Prosody
• Desired F0 for entire utterance • Duration for each phone • Stress value for each phone, possibly accent value
• Generate: – Waveforms
The hourglass architecture
Internal Representation: Input to Waveform Synthesis
Waveform synthesis
• concatenaYve synthesis systems – Diphone synthesis: join diphones to form the waveform.
– Unit selecYon synthesis: join the longest unit possible.
• HMM-‐based sytnthesis
Diphones • mid-‐phone is more stable than edge • Need O(phone2) number of units
– Some combinaYons don’t exist (hopefully) – ATT (Olive et al. 1998) system had 43 phones
• 1849 possible diphones • PhonotacYcs ([h] only occurs before vowels), don’t need to keep diphones across silence
• Only 1172 actual diphones – May include stress, consonant clusters
• So could have more – Lots of phoneYc knowledge in design
• Database relaYvely small (by today’s standards) – Around 8 megabytes for English (16 KHz 16 bit)
Slide from Richard Sproat
Diphones • Mid-phone is more stable than edge:
Diphone synthesis • Training:
– Choose units (kinds of diphones) – Record 1 speaker saying 1 example of each diphone – Mark the boundaries of each diphones,
• cut each diphone out and create a diphone database
• Synthesizing an utterance, – grab relevant sequence of diphones from database – Concatenate the diphones, doing slight signal
processing at boundaries – use signal processing to change the prosody (F0,
energy, duration) of selected sequence of diphones
Diphone labelling • Recorded sentences need to be labelled with phone start-‐end and middle(stable parts). This can be done automaYcally with: – ASR forced alignment – Audio-‐to-‐syntheYc audio auto-‐alignment
Diphone auto-alignment
• Given – synthesized prompts – Human speech of same prompts
• Do a dynamic time warping alignment of the two – Using Euclidean distance
• Works very well 95%+ – Errors are typically large (easy to fix) – Maybe even automatically detected
Slide from Richard Sproat
Dynamic Time Warping
Slide from Richard Sproat
Pupng it all together
Once we have the diphones we want to join, we need to: • Join them together to form the words and sentences
• Modify the speaking speed to reflect the target duraYon
• Modify the pitch to reflect the target intonaYon
Joining diphones Given any group of diphones to join together, we can follow a set of techniques: • Dumb:
– just join -‐> we will get many arYfacts – at zero crossings
• Algorithms that allow joining and modifying the duraYon at the same Yme – SOLA – TD-‐PSOLA(Time-‐domain pitch-‐synchronous overlap-‐and-‐add)
– Necessary prerequisite in voiced signals: epoch labelling
Epoch-labeling
• An example of epoch-labeling useing “SHOW PULSES” in Praat:
It can be easily done using an EGG, or (less accurately) via signal processing)
Epoch-labeling: Electroglottograph (EGG)
• Also called laryngograph or Lx – Device that straps on
speaker’s neck near the larynx
– Sends small high frequency current through adam’s apple
– Human tissue conducts well; air not as well
– Transducer detects how open the glottis is (I.e. amount of air between folds) by measuring impedence.
Picture from UCLA PhoneYcs Lab
DuraYon/pitch modificaYon
If we playback a sound faster than usual (for example by resampling the signal) both pitch and length gets affected. To only alter one of them: • Time stretching: algorithms used to change the length/speed of a signal without altering its pitch
• Pitch scaling/shifing: algorithms used to change the pitch without affecYng the length/speed.
Speech as Short Term signals
Alan Black
Duration modification • Duplicate/remove short term signals
Pitch Modification • Move short-term signals closer together/further apart
Slide from Richard Sproat
Windowing • To avoid artifacts at the edges we first apply a
windowing to short-term signals • y[n] = w[n]s[n]
Windowing
• Multiply value of signal at sample number n by the value of a windowing function
• y[n] = w[n]s[n]
Pitch/duraYon modificaYon algorithms • OLA (Overlap-‐add synthesis): Applies a certain overlap to each short
term signal (if we desire a pitch modificaYon) and add them. If we want to lengthen the signal we just repeat some of the periods. It can create glitches when the short term boundaries are not well defined.
! !!!!!!!!!!!!! ! "!#!$!$ $$""!!%"&$ '!!"#$"
"("$("$" '!"#%#&#$"
)*+!%,-./+%!%0"&$!1&/2!3044+5! 451-!6+51!1&!,&! 0&7+58,/!3+.+&3+&7!1&!5+918+50&:!4,9715!;<=!3+40&+3!,%!,!5,701!14!!%06+!>!14!7*+!,&,/2%0%!?0&31?!?"&$!@2!7*+!.079*!.+5013(!!
! 'A!'() ! "!B!$!!
C&! .5,9709+=! ?+! 9*11%+! ;<! ! B=! ?*+&! 7*+! %.+975D-! 14! ! %0"&$! %0:&,/! ,..51E0-,7+%! 7*+!%.+975D-!14!%"&$!%0:&,/(!)*+&! 7*+!91&9,7+&,701&!.519+%%!9*,&:+%! 7*+!.079*!?07*1D7!,44+970&:!7*+!415-,&7%F!45+GD+&90+%(!)*+!D%+!14!3044+5+&7!5+918+50&:!4,9715!9,D%+%!%751&:!3+:5,3,701&!14!%2&7*+%06+3!%.++9*=!+(:(!@D660&:!15!+44+97!14!-+7,/!8109+(!!
!! "#$%&'(&$)"%'*%+,("#$+-.$/%01+$2%'(%"#$%$34%21"101+$%
)*+!91&9+.7D,/!%9*+-+!14!7*+!30.*1&+!%2&7*+%06+5!0%!%*1?&!1&!7*+!40:D5+!#(!)*+5+!,5+!7*5++!@,%09!.*,%+%!14!%2&7*+%0%(!C&!7*+!405%7!.*,%+=!07!0%!&+9+%%,52!71!95+,7+!,!3,7,@,%+!71!%715+!%,-./+%! 14! *D-,&! /,&:D,:+(! )*+&=! 7*+! .,5709D/,5! %+:-+&7%! 14! 915.D%=! %D9*! ,%! .*1&+-+%=!30.*1&+%=! .+5013%=! +79(! ,5+! ,D71-,709,//2! 5+91:&06+3(! )*0%! %7+.! 0%! 31&+! @2! 7*+! %.+90,/!5+91:&060&:! %147?,5+=! 7*+! .,57! 14! 1D5! %2%7+-(! )0-+! -,5H%! 415! 7*+! .,5709D/,5! %+:-+&7%! 14!915.D%=!7*+!1D7.D7!14!7*+!%147?,5+=!,5+!%715+3!0&!7*+!3,7,@,%+(!!
)*+!%+91&3!.*,%+!0%!7*+!%.++9*!%2&7*+%0%(!I%+5!?507+%!7+E7=!*+!?,&7%!71!%2&7*+%06+!?07*!3+%05+3!45+GD+&92(!;05%7/2=!7*+!7+E7!0%!,&,/2%+3(!)*+&!7*+!3,7,@,%+!0%!%+,59*+3!,&3!,991530&:!71!7*+! 5D/+%! 14! 7*+! /,&:D,:+=! ,..51.50,7+! 30.*1&+%! ,5+! -,5H+3(! ;0&,//2=! 7*+2! ,5+! %2&7*+%06+3!,991530&:! 71! 7*+! )J! KLM>N! ,/:1507*-(! L2&7*+%06+3! %.++9*! 0%! %715+3! 0&! 7*+! 3,7,@,%+! 415!4D57*+5!.519+%%0&:!"7*+!7*053!.*,%+$=!15!07!9,&!@+!./,2+3!18+5!,&!1D7.D7!3+809+(!!
!
L,-./+%!14!%.++9*!
OPI!3,7,@,%+!%2%7+-!
!
ND71-,709!5+91:&0701&!14!.*1&+-+%=!30.*1&+%!,&3!.+5013%
Q1--,&3!,&,/2%0%=!
L+,59*0&:!415!%+:-+&7%!
L.++9*!%2&7*+%0%!@,%+3!1&!!)J!KLM>N!
N91D%709,/!1D7.D7!
I%+5!91--,&3!
-%
-
--
--
--
--
---%
-----%
!!
*567%89! *+#,-./0123$,4-5-3+637".4+#-3$8#/4-$"9-:3
! !
! !!!!!!!!!!!!! ! "!#!$!$ $$""!!%"&$ '!!"#$"
"("$("$" '!"#%#&#$"
)*+!%,-./+%!%0"&$!1&/2!3044+5! 451-!6+51!1&!,&! 0&7+58,/!3+.+&3+&7!1&!5+918+50&:!4,9715!;<=!3+40&+3!,%!,!5,701!14!!%06+!>!14!7*+!,&,/2%0%!?0&31?!?"&$!@2!7*+!.079*!.+5013(!!
! 'A!'() ! "!B!$!!
C&! .5,9709+=! ?+! 9*11%+! ;<! ! B=! ?*+&! 7*+! %.+975D-! 14! ! %0"&$! %0:&,/! ,..51E0-,7+%! 7*+!%.+975D-!14!%"&$!%0:&,/(!)*+&! 7*+!91&9,7+&,701&!.519+%%!9*,&:+%! 7*+!.079*!?07*1D7!,44+970&:!7*+!415-,&7%F!45+GD+&90+%(!)*+!D%+!14!3044+5+&7!5+918+50&:!4,9715!9,D%+%!%751&:!3+:5,3,701&!14!%2&7*+%06+3!%.++9*=!+(:(!@D660&:!15!+44+97!14!-+7,/!8109+(!!
!! "#$%&'(&$)"%'*%+,("#$+-.$/%01+$2%'(%"#$%$34%21"101+$%
)*+!91&9+.7D,/!%9*+-+!14!7*+!30.*1&+!%2&7*+%06+5!0%!%*1?&!1&!7*+!40:D5+!#(!)*+5+!,5+!7*5++!@,%09!.*,%+%!14!%2&7*+%0%(!C&!7*+!405%7!.*,%+=!07!0%!&+9+%%,52!71!95+,7+!,!3,7,@,%+!71!%715+!%,-./+%! 14! *D-,&! /,&:D,:+(! )*+&=! 7*+! .,5709D/,5! %+:-+&7%! 14! 915.D%=! %D9*! ,%! .*1&+-+%=!30.*1&+%=! .+5013%=! +79(! ,5+! ,D71-,709,//2! 5+91:&06+3(! )*0%! %7+.! 0%! 31&+! @2! 7*+! %.+90,/!5+91:&060&:! %147?,5+=! 7*+! .,57! 14! 1D5! %2%7+-(! )0-+! -,5H%! 415! 7*+! .,5709D/,5! %+:-+&7%! 14!915.D%=!7*+!1D7.D7!14!7*+!%147?,5+=!,5+!%715+3!0&!7*+!3,7,@,%+(!!
)*+!%+91&3!.*,%+!0%!7*+!%.++9*!%2&7*+%0%(!I%+5!?507+%!7+E7=!*+!?,&7%!71!%2&7*+%06+!?07*!3+%05+3!45+GD+&92(!;05%7/2=!7*+!7+E7!0%!,&,/2%+3(!)*+&!7*+!3,7,@,%+!0%!%+,59*+3!,&3!,991530&:!71!7*+! 5D/+%! 14! 7*+! /,&:D,:+=! ,..51.50,7+! 30.*1&+%! ,5+! -,5H+3(! ;0&,//2=! 7*+2! ,5+! %2&7*+%06+3!,991530&:! 71! 7*+! )J! KLM>N! ,/:1507*-(! L2&7*+%06+3! %.++9*! 0%! %715+3! 0&! 7*+! 3,7,@,%+! 415!4D57*+5!.519+%%0&:!"7*+!7*053!.*,%+$=!15!07!9,&!@+!./,2+3!18+5!,&!1D7.D7!3+809+(!!
!
L,-./+%!14!%.++9*!
OPI!3,7,@,%+!%2%7+-!
!
ND71-,709!5+91:&0701&!14!.*1&+-+%=!30.*1&+%!,&3!.+5013%
Q1--,&3!,&,/2%0%=!
L+,59*0&:!415!%+:-+&7%!
L.++9*!%2&7*+%0%!@,%+3!1&!!)J!KLM>N!
N91D%709,/!1D7.D7!
I%+5!91--,&3!
-%
-
--
--
--
--
---%
-----%
!!
*567%89! *+#,-./0123$,4-5-3+637".4+#-3$8#/4-$"9-:3
! !
Original period Desired period
Pitch/duraYon modificaYon algorithms • PSOLA (Pitch synchronous overlap-‐add synthesis): Through a good
pitch/F0 detecYon it defines the boundaries of the short term signals to be amplitude peaks containing 2-‐3 pitch periods. Then it uses OLA to join them.
• TD-‐PSOLA (Time-‐domain PSOLA): Patented by France Telecom, is an efficient Yme-‐domain version of PSOLA (no FFT required, suitable for real-‐Yme TTS). Can modify pitch up to 2X or 0.5X
• MBR-‐PSOLA (MulY-‐band re-‐synthesis psola): tries to solve problems of TD-‐PSOLA in boundaries between diphones due to phase, pitch or spectral envelope mismatches y normalizing all the database a priory. It later turned into the open project called MBROLA, which focuses on minimizing the database storage.
• LP-‐PSOLA (Linear predictors PSOLA): modificaYon of PSOLA where the LP coefficients are stored instead of the signal itself.
! ! ! !!
!"#$%&'! !"#$%&&'(#)*+,!-)#.*/01%/#12'3-!45#6787("#$*+,!-)#.*/0#12'3-!4#931&#:7;<=7#:7><=7#:7?<#=#
(! )*+),-./*+.%0+1%23423)5/63.%
"#$%&%'(! )*%! +,%%-*(! +./)*%+01%2! 3.! 45! 6789:! ;<=#'0)*>! 0+! ?@0)%! @/2%'+);/2;3<%(!*@>;/! %;'! ,%'-%0&%+! 0)! ;+! +./)*%)0-A! B@')*%'>#'%(! ;!='%;)! %<#/=;)0#/! #C! +./)*%+01%2! +,%%-*!+0=/;<! 3.! ;220/=! ,%'0#2+(! -;@+%+! )*%! %-*#! %CC%-)A!D#/)';-)0#/! #C! ! )*%! +0=/;<! 2#%+! /#)! -;@+%!@/C;&#@';3<%! ;@20)0&%! ,%'-%,)0#/+A! "#$%&%'(! #>0))0/=!>#'%! ,%'0#2+! 0/! 20,*#/%(! -;@+%+! )*%!<#++!#C!+,%%-*!0/C#'>;)0#/!-#/)%/)A!!
4*%!45!6789:!;<=#'0)*>!$;+!0>,<%>%/)%2!0/!4D9!+-'0,)0/=!<;/=@;=%A!:+!;!'%+@<)!#C!3%<#/=0/=! )#! )*%! ='#@,! #C! 0/)%','%)%2! <;/=@;=%+(! 4D9! 0+! '%<;)0&%<.! +<#$A! E&%/! )*#@=*! )*%!,'#=';>>%!$;+!#,)0>01%2(! )*%! +./)*%+0+!#C!;!+%/)%/-%! <;+)+! +%&%';<! +%-#/2+A! F!+@,,#+%(! )*;)!)*%!@+%!#C!-#>,0<%2!<;/=@;=%!%A=A!DGG(!-#>30/%2!$0)*!*;'2$;'%!,%'C#'>;/-%!='#$)*!0/!)*%!C@)@'%(!-#@<2!+*#')%/!)*0+!)0>%!)#!*@/2'%2+!#C!!>0<0+%-#/2+A!4*%!@/2%/0;3<%!;2&;/);=%!#C!4D9!<;/=@;=%!'%>;0/+!)*%!,#++030<0).!#C!,#')0/=!4D9!+-'0,)+!;>#/=!#,%';)0/=!+.+)%>+A!!
4*%!C@/2;>%/);<!,'#3<%>!#C!)*%!EHI!4D9!-#/+#<%!0+!;!,<%/).!#C!%''#'+!0/!+#@'-%!-#2%A!H;/.!C@/-)0#/+!2#%+!/#)!$#'J!,'#,%'<.(!;/2!+#>%!#C!)*%>(!0/!+,%-0;<!-;+%+(!2#%+!/#)!$#'J!;)!;<<A!4*%'%C#'%!+#>%!#C!)*%>!*;2!)#!3%!/%$<.!0>,<%>%/)%2A!!
5%+,0)%! )*%! C;-)! )*;)! )*%! +,%%-*! -'%;)%2! 3.! -#/-;)%/;)0#/! #C! ! 20,*#/%+! 0+! +@,%'0#'! )#!-#/-;)%/;)0#/!#C! ,*#/%>%+(! )*0+! ;,,'#;-*!2#%+! /#)! %/;3<%! +./)*%+0+! #C! *0=*!?@;<0).!*@>;/!+,%%-*A!F)!0+!-;@+%2!3.!)*%!C;-)(!)*;)!/%0)*%'!;!*@=%!20,*#/%!2;);3;+%!0+!;3<%!)#!-#&%'!;!='%;)!&;'0%).!#C! !*@>;/!+,%%-*A!4*%!@+%!#C!H@<)0K;/2!EL-0);)0#/!M%+./)*%+0+!#/!2;);3;+%(!>0=*)!+<0=*)<.! 0>,'#&%! )*%!?@;<0).!#C! )*%!+,%%-*(!3@)! !;')0-@<;)%!+./)*%+01%'+! '%>;0/! )*%!&0+0#/!C#'!)*%!C@)@'%A!
43!343+)3.%NOP!5@)#0)(!4AQ!:/!F/)'#2@-)0#/!)#!4%L)R)#R+,%%-*!7./)*%+0+(!*)),QSS)-)+AC,>+A;-A3%S+./)*%+0+!NTP! D;++02.(! 7A(! ";''0/=)#/! UAQ! H@<)0R<%&%<! ://#);)0#/! 0/! )*%! EHI! 7,%%-*! 5;);3;+%!
H;/;=%>%/)!7.+)%>(!7,%%-*!D#>>@/0-;)0#/!VV(!TWWT!
NVP! 5@)#0)(!4A(!9%0-*(!"AQ!HKM!6789:!447!7./)*%+0+!K;+%2!#/!;/!HKE!M%R7./)*%+0+!#C!)*%!7%=>%/)+!5;);3;+%(!*)),QSS)-)+AC,>+A;-A3%!
NXP!7.'2;<(!:A!;/2!D#<AQ!45!6789:!Y%'+@+!";'>#/0-!Z#0+%!H#2%<!F/!50,*#/%!K;+%2!7,%%-*!7./)*%+0+(!*)),QSS$$$A10,,.A*#A;))A-#>!
! !
Problems with diphone synthesis
• Signal processing leave artifacts, making the speech sound unnatural
• Diphone synthesis only captures local effects – But there are many more global effects
(syllable structure, stress pattern, word-level effects)
Unit Selection Synthesis
• Generalization of the diphone intuition – Larger units
• From diphones to sentences
– Many many copies of each unit • 10 hours of speech instead of 1500 diphones (a
few minutes of speech)
– Little or no signal processing applied to each unit
• Unlike diphones
Why Unit Selection Synthesis
• Natural data solves problems with diphones – Diphone databases are carefully designed but:
• Speaker makes errors • Speaker doesn’t speak intended dialect • Require database design to be right
– If it’s automatic • Labeled with what the speaker actually said • Coarticulation, schwas, flaps are natural
• “There’s no data like more data” – Lots of copies of each unit mean you can choose just
the right one for the context – Larger units mean you can capture wider effects
Unit Selection Intuition • Given a big database • For each segment (diphone) that we want to synthesize
– Find the unit in the database that is the best to synthesize this target segment
• What does “best” mean? – “Target cost”: Closest match to the target description, in terms of
• Phonetic context • F0, stress, phrase position
– “Join cost”: Best join with neighboring units • Matching formants + other spectral characteristics • Matching energy • Matching F0
!
C(t1n,u1
n ) = Ctarget (i=1
n
" ti,ui) + C join (i= 2
n
" ui#1,ui)
Search of best units using Viterbi algorithm
!
ˆ u 1n = argmin
u1 ,...,un
C(t1n,u1
n )
HMM synthesis • It is a totally different approach to concatenaYve synthesis – It is much closer to ASR
• Hidden Markov Models (HMM) are trained from labeled data to learn how each phone is pronounced in each condiYon – It also learns its prosody
• Then, given a desired phoneme sequence and prosody paQern, it outputs the most probable audio sequence.
Samples ~2007 Sample 2010 NITech