Text to Speech Systems (TTS)
EE 516 Spring 2009
Alex Acero

Page 1: Text to Speech Systems (TTS) EE 516 Spring 2009

Text to Speech Systems (TTS)
EE 516 Spring 2009

Alex Acero

Page 2: Text to Speech Systems (TTS) EE 516 Spring 2009

Acknowledgments

• Thanks to Dan Jurafsky for a lot of slides
• Also thanks to Alan Black, Jennifer Venditti, Richard Sproat

Page 3: Text to Speech Systems (TTS) EE 516 Spring 2009

Outline

• History of TTS
• Architecture
• Text Processing
• Letter-to-Sound Rules
• Prosody
• Waveform Generation
• Evaluation

Page 4: Text to Speech Systems (TTS) EE 516 Spring 2009

Dave Barry on TTS

“And computers are getting smarter all the time; scientists tell us that soon they will be able to talk with us. (By ‘they’, I mean computers; I doubt scientists will ever be able to talk to us.)”

Page 5: Text to Speech Systems (TTS) EE 516 Spring 2009


Von Kempelen 1780

• Small whistles controlled consonants

• Rubber mouth and nose; nose had to be covered with two fingers for non-nasals

• Unvoiced sounds: mouth covered, auxiliary bellows driven by string provides puff of air

From Traunmüller’s web site

Page 6: Text to Speech Systems (TTS) EE 516 Spring 2009

Closer to a natural vocal tract: Riesz 1937

Page 7: Text to Speech Systems (TTS) EE 516 Spring 2009

The 1936 UK Speaking Clock

From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm

Page 8: Text to Speech Systems (TTS) EE 516 Spring 2009

The UK Speaking Clock

• July 24, 1936
• Photographic storage on 4 glass disks
• 2 disks for minutes, 1 for the hour, 1 for the seconds
• Other words in the sentence distributed across the 4 disks, so all 4 are used at once
• Voice of “Miss J. Cain”

Page 9: Text to Speech Systems (TTS) EE 516 Spring 2009

A technician adjusts the amplifiers of the first speaking clock

From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm

Page 10: Text to Speech Systems (TTS) EE 516 Spring 2009

Homer Dudley’s VODER 1939

• Synthesizing speech by electrical means
• 1939 World’s Fair
• Manually controlled through a complex keyboard
• Operator training was a problem
• 1939 vocoder

Page 11: Text to Speech Systems (TTS) EE 516 Spring 2009

Cooper’s Pattern Playback

Page 12: Text to Speech Systems (TTS) EE 516 Spring 2009

Dennis Klatt’s history of TTS (1987)

• More history at http://www.festvox.org/history/klatt.html (Dennis Klatt)

Page 13: Text to Speech Systems (TTS) EE 516 Spring 2009

Outline

• History of TTS
• Architecture
• Text Processing
• Letter-to-Sound Rules
• Prosody
• Waveform Generation
• Evaluation

Page 14: Text to Speech Systems (TTS) EE 516 Spring 2009

Types of Modern Synthesis

• Articulatory Synthesis:– Model movements of articulators and acoustics of vocal tract

• Formant Synthesis:– Start with acoustics, create rules/filters to create each formant

• Concatenative Synthesis:– Use databases of stored speech to assemble new utterances

• HMM-Based Synthesis– Run an HMM in generation mode

Page 15: Text to Speech Systems (TTS) EE 516 Spring 2009

Formant Synthesis

• Formant synthesizers were the most common commercial systems while computers were relatively underpowered
• 1979: MIT MITalk (Allen, Hunnicutt, Klatt)
• 1983: DECtalk system
• The voice of Stephen Hawking

Page 16: Text to Speech Systems (TTS) EE 516 Spring 2009

Concatenative Synthesis

• The basis of all current commercial systems
• Diphone synthesis
  – Units are diphones: middle of one phone to middle of the next
  – Why? The middle of a phone is its steady state
  – Record 1 speaker saying each diphone
• Unit selection synthesis
  – Larger units
  – Record 10 hours or more, so there are multiple copies of each unit
  – Use search to find the best sequence of units

Page 17: Text to Speech Systems (TTS) EE 516 Spring 2009

TTS Demos (all are Unit-Selection)

• AT&T: http://www.research.att.com/~ttsweb/tts/demo.php
• Microsoft: http://research.microsoft.com/en-us/groups/speech/tts.aspx
• Festival: http://www-2.cs.cmu.edu/~awb/festival_demos/index.html
• Cepstral: http://www.cepstral.com/cgi-bin/demos/general
• IBM: http://www-306.ibm.com/software/pervasive/tech/demos/tts.shtml

Page 18: Text to Speech Systems (TTS) EE 516 Spring 2009

Text Normalization

• Analysis of raw text into pronounceable words
• Sample problems:
  – He stole $100 million from the bank
  – It's 13 St. Andrews St.
  – The home page is http://ee.washington.edu
  – yes, see you the following tues, that's 11/12/01
• Steps:
  – Identify tokens in text
  – Chunk tokens into reasonably sized sections
  – Map tokens to words
  – Identify types for words

Page 19: Text to Speech Systems (TTS) EE 516 Spring 2009

Grapheme to Phoneme

• How to pronounce a word? Look in the dictionary! But:
  – Unknown words and names will be missing
  – Turkish, German, and other hard languages
    • uygarlaStIramadIklarImIzdanmISsInIzcasIna
    • “(behaving) as if you are among those whom we could not civilize”
    • uygar +laS +tIr +ama +dIk +lar +ImIz +dan +mIS +sInIz +casIna
      civilized +bec +caus +NegAble +ppart +pl +p1pl +abl +past +2pl +AsIf
• So we need letter-to-sound rules
• Also homograph disambiguation (wind, live, read)

Page 20: Text to Speech Systems (TTS) EE 516 Spring 2009

Prosody: from words + phones to boundaries, accent, F0, duration

• Prosodic phrasing
  – Need to break utterances into phrases
  – Punctuation is useful, but not sufficient
• Accents:
  – Prediction of accents: which syllables should be accented
  – Realization of the F0 contour: given accents/tones, generate F0

• Duration:– Predicting duration of each phone

Page 21: Text to Speech Systems (TTS) EE 516 Spring 2009

Waveform synthesis: from segments, F0, duration to waveform

• Collecting diphones:
  – need to record diphones in correct contexts
    • l sounds different in onset than in coda, t is flapped sometimes, etc.
  – need a quiet recording room, maybe an EGG (electroglottograph), etc.
  – then need to label them very, very exactly

• Unit selection: how to pick the right unit? Search• Joining the units

• dumb (just stick ’em together)
• PSOLA (Pitch-Synchronous Overlap and Add)
• MBROLA (Multi-Band Resynthesis OverLap-Add)

Page 22: Text to Speech Systems (TTS) EE 516 Spring 2009

Festival

• http://festvox.org/festival/
• Open source speech synthesis system
• Multiplatform (Windows/Unix)
• Designed for development and runtime use
  – Used in many academic systems (and some commercial)
  – Hundreds of thousands of users
  – Multilingual
• No built-in language
• Designed to allow addition of new languages
  – Additional tools for rapid voice development
    • Statistical learning tools
    • Scripts for building models

Page 23: Text to Speech Systems (TTS) EE 516 Spring 2009

Festival as software

• C/C++ code with a Scheme scripting language
• General replaceable modules:
  – Lexicons, LTS, duration, intonation, phrasing, POS tagging, tokenizing, diphone/unit selection, signal processing
• General tools:
  – Intonation analysis (F0, Tilt), signal processing, CART building, N-gram, SCFG, WFST

Page 24: Text to Speech Systems (TTS) EE 516 Spring 2009

CMU FestVox project

• Festival is an engine; how do you make voices?
• Festvox: building synthetic voices:
  – Tools, scripts, documentation
  – Discussion and examples for building voices
  – Example voice databases
  – Step-by-step walkthroughs of processes
• Support for English and other languages
• Support for different waveform synthesis methods
  – Diphone
  – Unit selection
  – Limited domain

Page 25: Text to Speech Systems (TTS) EE 516 Spring 2009

Outline

• History of TTS
• Architecture
• Text Processing
• Letter-to-Sound Rules
• Prosody
• Waveform Generation
• Evaluation

Page 26: Text to Speech Systems (TTS) EE 516 Spring 2009

Text Processing

• He stole $100 million from the bank
• It’s 13 St. Andrews St.
• The home page is http://ee.washington.edu
• Yes, see you the following tues, that’s 11/12/01
• IV: four, fourth, I.V.
• IRA: I.R.A. or Ira
• 1750: seventeen fifty (date, address) or one thousand seven hundred fifty (dollars)

Page 27: Text to Speech Systems (TTS) EE 516 Spring 2009

Steps

• Identify tokens in text
• Chunk tokens
• Identify types of tokens
• Convert tokens to words

Page 28: Text to Speech Systems (TTS) EE 516 Spring 2009

Step 1: identify tokens and chunk

• Whitespace can be viewed as separators
• Punctuation can be separated from the raw tokens
• Festival converts text into
  – an ordered list of tokens
  – each with features:
    • its own preceding whitespace
    • its own succeeding punctuation

Page 29: Text to Speech Systems (TTS) EE 516 Spring 2009

End-of-utterance detection

• Relatively simple if the utterance ends in ? or !
• But what about the ambiguity of “.”?
• Ambiguous between end-of-utterance and end-of-abbreviation:
  – My place on Forest Ave. is around the corner.
  – I live at 360 Forest Ave.
  – (Not “I live at 360 Forest Ave..”)

• How to solve this period-disambiguation task?

Page 30: Text to Speech Systems (TTS) EE 516 Spring 2009

Some rules used

• A dot with one or two letters is an abbrev
• A dot with 3 cap letters is an abbrev
• An abbrev followed by 2 spaces and a capital letter is an end-of-utterance
• Non-abbrevs followed by a capitalized word are breaks
• This fails for:
  – Cog. Sci. Newsletter
  – Lots of cases at end of line
  – Badly spaced/capitalized sentences
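These rules are easy to prototype. Below is a minimal Python sketch of the heuristics above; the abbreviation list is a hypothetical placeholder, and real systems use much richer rule sets:

ABBREVS = {"ave", "st", "dr", "mr", "mrs", "cog", "sci"}   # hypothetical list

def is_abbrev(token):
    core = token.rstrip(".")
    if len(core) <= 2:                       # a dot with one or two letters
        return True
    if len(core) == 3 and core.isupper():    # a dot with 3 cap letters
        return True
    return core.lower() in ABBREVS

def is_utterance_end(token, gap, next_token):
    """token ends in '.'; gap is the whitespace before next_token."""
    if not token.endswith("."):
        return False
    if is_abbrev(token):
        # abbrev + 2 spaces + capital letter => end-of-utterance
        return len(gap) >= 2 and next_token[:1].isupper()
    return next_token[:1].isupper()          # non-abbrev + capitalized word

print(is_utterance_end("Ave.", "  ", "The"))   # True
print(is_utterance_end("Ave.", " ", "is"))     # False

As the slide notes, rules like these break on badly spaced text, which motivates the decision-tree features on the next slide.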

Page 31: Text to Speech Systems (TTS) EE 516 Spring 2009

More sophisticated decision tree features

• Prob(word with “.” occurs at end-of-s)
• Prob(word after “.” occurs at begin-of-s)
• Length of word with “.”
• Length of word after “.”
• Case of word with “.”: Upper, Lower, Cap, Number
• Case of word after “.”: Upper, Lower, Cap, Number
• Punctuation after “.” (if any)
• Abbreviation class of word with “.” (month name, unit-of-measure, title, address name, etc.)

Page 32: Text to Speech Systems (TTS) EE 516 Spring 2009

CART

• Breiman, Friedman, Olshen, Stone. 1984. Classification and Regression Trees. Chapman & Hall, New York.

• Description/Use:
  – Binary tree of decisions; terminal nodes determine the prediction (“20 questions”)
  – If the dependent variable is categorical: “classification tree”
  – If continuous: “regression tree”

Page 33: Text to Speech Systems (TTS) EE 516 Spring 2009

Learning DTs

• DTs are rarely built by hand
• Hand-building is only possible for very simple features and domains
• Lots of algorithms for DT induction
• I’ll give a quick intuition here

Page 34: Text to Speech Systems (TTS) EE 516 Spring 2009

CART Estimation

• Creating a binary decision tree for classification or regression involves 3 steps:
  – Splitting rules: which split to take at a node?
  – Stopping rules: when to declare a node terminal?
  – Node assignment: which class/value to assign to a terminal node?

Page 35: Text to Speech Systems (TTS) EE 516 Spring 2009

Splitting Rules

• Which split to take at a node?
• Candidate splits considered:
  – Binary cuts: for continuous x (−∞ < x < ∞), consider splits of the form x ≤ k vs. x > k
  – Binary partitions: for categorical x ∈ X = {1, 2, …}, consider splits of the form x ∈ A vs. x ∈ X − A, for A ⊂ X

Page 36: Text to Speech Systems (TTS) EE 516 Spring 2009

Splitting Rules

• Choosing the best candidate split:
  – Method 1: choose k (continuous) or A (categorical) that minimizes the estimated classification (regression) error after the split
  – Method 2 (for classification): choose k or A that minimizes the estimated entropy after the split
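As a concrete illustration of Method 2, here is a toy Python chooser for a continuous feature (a sketch of the idea, not CART’s actual implementation):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_binary_cut(xs, ys):
    """Choose threshold k minimizing size-weighted entropy of x<=k vs. x>k."""
    best_h, best_k = float("inf"), None
    for k in sorted(set(xs))[:-1]:           # candidate cuts between values
        left = [y for x, y in zip(xs, ys) if x <= k]
        right = [y for x, y in zip(xs, ys) if x > k]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if h < best_h:
            best_h, best_k = h, k
    return best_k, best_h

print(best_binary_cut([1, 2, 3, 10, 11], ["a", "a", "a", "b", "b"]))  # (3, 0.0)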

Page 37: Text to Speech Systems (TTS) EE 516 Spring 2009

Decision Tree Stopping

• When to declare a node terminal?• Strategy (Cost-Complexity pruning):

1. Grow over-large tree2. Form sequence of subtrees, T0…Tn ranging from full tree to just

the root node.3. Estimate “honest” error rate for each subtree.4. Choose tree size with minimum “honest” error rate.

• To estimate “honest” error rate, test on data different from training data (I.e. grow tree on 9/10 of data, test on 1/10, repeating 10 times and averaging (cross-validation).

Page 38: Text to Speech Systems (TTS) EE 516 Spring 2009

Sproat’s EOS tree

Page 39: Text to Speech Systems (TTS) EE 516 Spring 2009

Steps 3+4: Identify Types of Tokens, and Convert Tokens to Words

• Pronunciation of numbers often depends on type (a toy expander follows below):
  – 1776 date: seventeen seventy six
  – 1776 phone number: one seven seven six
  – 1776 quantifier: one thousand seven hundred (and) seventy six
  – 25 day: twenty-fifth
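For illustration, a toy Python expander for two of these readings (digit-string and year-as-pairs); a real normalizer also needs cardinals, ordinals, and the exception cases discussed later:

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def two_digit(n):                      # 0..99 as words
    if n < 20:
        return ONES[n]
    t, o = divmod(n, 10)
    return TENS[t] + ("" if o == 0 else " " + ONES[o])

def as_digits(s):                      # phone-number style
    return " ".join(ONES[int(d)] for d in s)

def as_year(s):                        # pair-of-digits style (fails for 1900, 2005)
    return two_digit(int(s[:2])) + " " + two_digit(int(s[2:]))

print(as_year("1776"))                 # seventeen seventy six
print(as_digits("1776"))               # one seven seven six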

Page 40: Text to Speech Systems (TTS) EE 516 Spring 2009

Festival rule for dealing with “$1.2 million”

(define (token_to_words utt token name)
  (cond
   ((and (string-matches name "\\$[0-9,]+\\(\\.[0-9]+\\)?")
         (string-matches (utt.streamitem.feat utt token "n.name") ".*illion.?"))
    (append
     (builtin_english_token_to_words utt token (string-after name "$"))
     (list (utt.streamitem.feat utt token "n.name"))))
   ((and (string-matches (utt.streamitem.feat utt token "p.name")
                         "\\$[0-9,]+\\(\\.[0-9]+\\)?")
         (string-matches name ".*illion.?"))
    (list "dollars"))
   (t (builtin_english_token_to_words utt token name))))

Page 41: Text to Speech Systems (TTS) EE 516 Spring 2009

Rule-based versus machine learning

• As always, we can do things either way, or more often by a combination
• Rule-based:
  – Simple
  – Quick
  – Can be more robust
• Machine learning:
  – Works for complex problems where rules are hard to write
  – Higher accuracy in general
  – But worse generalization to very different test sets
• Real TTS and NLP systems often use aspects of both

Page 42: Text to Speech Systems (TTS) EE 516 Spring 2009

Machine learning method for Text Normalization

• From the 1999 Hopkins summer workshop “Normalization of Non-Standard Words”
• Sproat, R., Black, A., Chen, S., Kumar, S., Ostendorf, M., and Richards, C. 2001. Normalization of Non-standard Words. Computer Speech and Language, 15(3):287-333.
• NSW examples:
  – Numbers:
    • 123, 12 March 1994
  – Abbreviations, contractions, acronyms:
    • approx., mph, ctrl-C, US, pp, lb
  – Punctuation conventions:
    • 3-4, +/-, and/or
  – Dates, times, URLs, etc.

Page 43: Text to Speech Systems (TTS) EE 516 Spring 2009

How common are NSWs?

• Varies over text type
• Word not in lexicon, or with non-alphabetic characters:

  Text Type    % NSW
  novels        1.5%
  press wire    4.9%
  e-mail       10.7%
  recipes      13.7%
  classified   17.9%

Page 44: Text to Speech Systems (TTS) EE 516 Spring 2009

How hard are NSWs?

• Identification:
  – Some homographs: “Wed”, “PA”
  – False positives: OOV
• Realization:
  – Simple rule: money, $2.34
  – Type identification + rules: numbers
  – Text-type-specific knowledge (in classified ads, BR means bedroom)
• Ambiguity (acceptable multiple answers):
  – “D.C.” as letters or full words
  – “MB” as “meg” or “megabyte”
  – “250” as “two fifty” or “two hundred fifty”

Page 45: Text to Speech Systems (TTS) EE 516 Spring 2009

Step 1: Splitter

• Letter/number conjunctions (WinNT, SunOS, PC110)
• Hand-written rules in two parts:
  – Part I: group things not to be split (numbers, etc., including commas in numbers and slashes in dates)
  – Part II: apply rules:
    • at transitions from lower to upper case
    • after the penultimate upper-case char in transitions from upper to lower case
    • at transitions from digits to alpha
    • at punctuation

Page 46: Text to Speech Systems (TTS) EE 516 Spring 2009

Step 2: Classify token into 1 of 20 types

• EXPN: abbrev, contractions (adv, N.Y., mph, gov’t)
• LSEQ: letter sequence (CIA, D.C., CDs)
• ASWD: read as word, e.g. CAT, proper names
• MSPL: misspelling
• NUM: number (cardinal) (12, 45, 1/2, 0.6)
• NORD: number (ordinal), e.g. May 7, 3rd, Bill Gates II
• NTEL: telephone (or part), e.g. 212-555-4523
• NDIG: number as digits, e.g. Room 101
• NIDE: identifier, e.g. 747, 386, I5, PC110
• NADDR: number as street address, e.g. 5000 Pennsylvania
• NZIP, NTIME, NDATE, NYER, MONEY, BMONY, PRCT, URL, etc.
• SLNT: not spoken (KENT*REALTY)

Page 47: Text to Speech Systems (TTS) EE 516 Spring 2009

More about the types

• 4 categories for alphabetic sequences:
  – EXPN: expand to full word or word sequence (fplc for fireplace, NY for New York)
  – LSEQ: say as letter sequence (IBM)
  – ASWD: say as standard word (either OOV or acronyms)
  – (plus MSPL for misspellings)
• 5 main ways to read numbers:
  – Cardinal (quantities)
  – Ordinal (dates)
  – String of digits (phone numbers)
  – Pair of digits (years)
  – Trailing unit: serial until the last non-zero digit: 8765000 is “eight seven six five thousand” (some phone numbers, long addresses)
  – But still exceptions: (947-3030, 830-7056)

Page 48: Text to Speech Systems (TTS) EE 516 Spring 2009

Type identification algorithm

• Create a large hand-labeled training set and build a DT to predict type
• Example of features in the tree for the subclassifier for alphabetic tokens:
  – P(t|o) = p(o|t) p(t) / p(o)
  – p(o|t), for t in {ASWD, LSEQ, EXPN}, from a letter-trigram model:

    $p(o \mid t) = \prod_{i=1}^{N} p(l_i \mid l_{i-1}, l_{i-2})$

  – p(t) from counts of each tag in the text
  – p(o): a normalization factor
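In code, this decision is a standard Bayes classifier over letter strings. A sketch, assuming the trigram models and priors have already been estimated from the hand-labeled data (the smoothing floor here is a made-up choice):

import math

def trigram_logprob(token, model):
    """log p(o|t): sum of log p(l_i | l_{i-2}, l_{i-1}) over the letters."""
    s = "##" + token.lower() + "#"                 # '#' pads the boundaries
    return sum(math.log(model.get(s[i - 2:i + 1], 1e-6))
               for i in range(2, len(s)))

def classify_alpha(token, models, priors):
    # argmax_t p(o|t) p(t); p(o) is constant across t and can be dropped
    return max(priors, key=lambda t: trigram_logprob(token, models[t])
                                     + math.log(priors[t]))

# models = {"ASWD": {...}, "LSEQ": {...}, "EXPN": {...}}; priors likewise
# estimated from tag counts, e.g. {"ASWD": 0.5, "LSEQ": 0.3, "EXPN": 0.2}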

Page 49: Text to Speech Systems (TTS) EE 516 Spring 2009

Type identification algorithm

• Hand-written context-dependent rules:
  – List of lexical items (Act, Advantage, amendment) after which Roman numerals are read as cardinals, not ordinals
• Classifier accuracy:
  – 98.1% on news data
  – 91.8% on email

Page 50: Text to Speech Systems (TTS) EE 516 Spring 2009

Step 3: expanding NSW Tokens

• Type-specific heuristics
  – ASWD expands to itself
  – LSEQ expands to a list of words, one for each letter
  – NUM expands to a string of words representing the cardinal
  – NYER expands to 2 pairs of NUM digits…
  – NTEL: string of digits, with silence for punctuation
  – Abbreviations:
    • use the abbreviation lexicon if it’s one we’ve seen
    • else use the training set to learn how to expand
    • Cute idea: if “eat in kit” occurs in text, “eat-in kitchen” will also occur somewhere

Page 51: Text to Speech Systems (TTS) EE 516 Spring 2009

4 steps to Sproat et al. algorithm

1) Splitter: on whitespace, or also within a word (“AltaVista”)

2) Type identifier: for each split token, identify its type

3) Token expander: for each typed token, expand to words
   • Deterministic for numbers, dates, money, letter sequences
   • Only hard (nondeterministic) for abbreviations

4) Language model: to select between alternative pronunciations

Page 52: Text to Speech Systems (TTS) EE 516 Spring 2009

Homograph disambiguation

record 195, house 150, contract 143, lead 131, live 130, lives 105, protest 94, survey 91, project 90, separate 87, present 80, read 72, subject 68, rebel 48, finance 46, estimate 46

• Most frequent homographs, from Liberman and Church

• Not a huge problem, but still important

Page 53: Text to Speech Systems (TTS) EE 516 Spring 2009

POS Tagging for homograph disambiguation

• Many homographs can be distinguished by POS

live: l ay v vs. l ih v
REcord / reCORD; INsult / inSULT; OBject / obJECT; OVERflow / overFLOW; DIScount / disCOUNT; CONtent / conTENT

Page 54: Text to Speech Systems (TTS) EE 516 Spring 2009

Part of speech tagging

• 8 (ish) traditional parts of speech
  – This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
  – Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags, POS
  – We’ll use POS most frequently

Page 55: Text to Speech Systems (TTS) EE 516 Spring 2009

POS examples

N    noun         chair, bandwidth, pacing
V    verb         study, debate, munch
ADJ  adjective    purple, tall, ridiculous
ADV  adverb       unfortunately, slowly
P    preposition  of, by, to
PRO  pronoun      I, me, mine
DET  determiner   the, a, that, those

Page 56: Text to Speech Systems (TTS) EE 516 Spring 2009

POS Tagging: Definition

• The process of assigning a part-of-speech or lexical class marker to each word in a corpus:

WORDS: the koala put the keys on the table
TAGS:  DET  N    V   DET N    P  DET N

Page 57: Text to Speech Systems (TTS) EE 516 Spring 2009

POS Tagging example

WORD   TAG
the    DET
child  N
put    V
the    DET
keys   N
on     P
the    DET
table  N

Page 58: Text to Speech Systems (TTS) EE 516 Spring 2009

Open and closed class words

• Closed class: relatively fixed membership
  – Prepositions: of, in, by, …
  – Auxiliaries: may, can, will, had, been, …
  – Pronouns: I, you, she, mine, his, them, …
  – Usually function words (short common words which play a role in grammar)
• Open class: new ones can be created all the time
  – English has 4: nouns, verbs, adjectives, adverbs
  – Many languages have all 4, but not all!
  – In Lakhota and possibly Chinese, what English treats as adjectives act more like verbs

Page 59: Text to Speech Systems (TTS) EE 516 Spring 2009

Open class words

• Nouns
  – Proper nouns (Seattle University, Boulder, Neal Snider, William Gates Hall). English capitalizes these.
  – Common nouns (the rest). German capitalizes these.
  – Count nouns and mass nouns
    • Count: have plurals, get counted: goat/goats, one goat, two goats
    • Mass: don’t get counted (snow, salt, communism) (*two snows)
• Adverbs: tend to modify things
  – Unfortunately, John walked home extremely slowly yesterday
  – Directional/locative adverbs (here, home, downhill)
  – Degree adverbs (extremely, very, somewhat)
  – Manner adverbs (slowly, delicately)
• Verbs:
  – In English, have morphological affixes (eat/eats/eaten)

Page 60: Text to Speech Systems (TTS) EE 516 Spring 2009

Closed Class Words

• Idiosyncratic
• Examples:
  – prepositions: on, under, over, …
  – particles: up, down, on, off, …
  – determiners: a, an, the, …
  – pronouns: she, who, I, …
  – conjunctions: and, but, or, …
  – auxiliary verbs: can, may, should, …
  – numerals: one, two, three, third, …

Page 61: Text to Speech Systems (TTS) EE 516 Spring 2009

POS tagging: Choosing a tagset

• There are so many parts of speech and potential distinctions we can draw
• To do POS tagging, we need to choose a standard set of tags to work with
• Could pick very coarse tagsets
  – N, V, Adj, Adv

• More commonly used set is finer grained, the “UPenn TreeBank tagset”, 45 tags

– PRP$, WRB, WP$, VBG

• Even more fine-grained tagsets exist

Page 62: Text to Speech Systems (TTS) EE 516 Spring 2009

[Slide shows part of the Penn Treebank tagset, e.g. PRP vs. PRP$]

Page 63: Text to Speech Systems (TTS) EE 516 Spring 2009

Using the UPenn tagset

• The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
• Prepositions and subordinating conjunctions are marked IN (“although/IN I/PRP…”)
• Except the preposition/complementizer “to”, which is just marked TO

Page 64: Text to Speech Systems (TTS) EE 516 Spring 2009

POS Tagging

• Words often have more than one POS: back
  – The back door = JJ
  – On my back = NN
  – Win the voters back = RB
  – Promised to back the bill = VB
• The POS tagging problem is to determine the POS tag for a particular instance of a word.

Page 65: Text to Speech Systems (TTS) EE 516 Spring 2009

How hard is POS tagging? Measuring ambiguity

Unambiguous (1 tag):   38,857
Ambiguous (2–9 tags):   8,844

2 tags  6,731
3 tags  1,621
4 tags    357
5 tags     90
6 tags     32
7 tags      6   well, set, round, open, fit, down
8 tags      4   ’s, half, back, a
9 tags      3   that, more, in

Page 66: Text to Speech Systems (TTS) EE 516 Spring 2009

3 methods for POS tagging

1. Rule-based tagging (e.g., ENGTWOL)
2. Stochastic (= probabilistic) tagging
   – HMM (Hidden Markov Model) tagging
3. Transformation-based tagging
   – Brill tagger

Page 67: Text to Speech Systems (TTS) EE 516 Spring 2009

Rule-based tagging

• Start with a dictionary
• Assign all possible tags to words from the dictionary
• Write rules by hand to selectively remove tags
• Leaving the correct tag for each word

Page 68: Text to Speech Systems (TTS) EE 516 Spring 2009

Start with a dictionary

• she: PRP
• promised: VBN, VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB

• Etc… for the ~100,000 words of English

Page 69: Text to Speech Systems (TTS) EE 516 Spring 2009

Use the dictionary to assign every possible tag

She   promised   to   back   the   bill
PRP   VBD        TO   VB     DT    NN
      VBN             JJ           VB
                      RB
                      NN

Page 70: Text to Speech Systems (TTS) EE 516 Spring 2009

Write rules to eliminate tags

Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”:

She   promised          to   back   the   bill
PRP   VBD               TO   VB     DT    NN
      VBN (eliminated)       JJ           VB
                             RB
                             NN

Page 71: Text to Speech Systems (TTS) EE 516 Spring 2009

Stochastic Tagging

• Intuition: assign each word its “most probable” tag
• Simplest way to define “most probable”:
  – Collect a training corpus
  – Choose the tag which is most frequent for that word in the training corpus
  – I.e., choose the tag such that p(tag|word) is highest
  – Of all the times that “use” occurred in a training corpus, what percentage was it V, what percentage N? Choose the higher-probability tag.
• Does it work?
  – Achieves 90%! But we can do better
  – How? Context: in “to use”, use is V; in “the use of”, use is N
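This baseline is a few lines of Python (a sketch; tagged_corpus is any iterable of (word, tag) pairs):

from collections import Counter, defaultdict

def train_unigram_tagger(tagged_corpus):
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # argmax_tag p(tag|word) is just the most frequent tag for each word
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

tagger = train_unigram_tagger([("use", "VB"), ("use", "NN"), ("use", "VB")])
print(tagger["use"])   # VB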

Page 72: Text to Speech Systems (TTS) EE 516 Spring 2009

HMM Tagger

• Intuition: pick the most probable tag sequence for a sequence of words:

  $\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n)$

• But how to make the right-hand side operational? Use Bayes’ rule,

  $P(x \mid y) = \dfrac{P(y \mid x)\,P(x)}{P(y)}$

• Substituting:

  $\hat{t}_1^n = \arg\max_{t_1^n} \dfrac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)}$

Page 73: Text to Speech Systems (TTS) EE 516 Spring 2009

HMM Tagger: fundamental equations

• Since the word sequence is constant:

  $\hat{t}_1^n = \arg\max_{t_1^n} \dfrac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)} = \arg\max_{t_1^n} \underbrace{P(w_1^n \mid t_1^n)}_{\text{likelihood}}\;\underbrace{P(t_1^n)}_{\text{prior}}$

• Still too hard to compute directly

Page 74: Text to Speech Systems (TTS) EE 516 Spring 2009

HMM Tagger: Two simplifying assumptions

• Prob of word is independent of the other words and their tags:

  $P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$

• Prob of a tag depends only on the previous tag:

  $P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$

• Combining:

  $\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n) \approx \arg\max_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$

Page 75: Text to Speech Systems (TTS) EE 516 Spring 2009

Estimating these probabilities

• Determiners precede nouns in English, so expect P(NN|DT) to be high

• In the tagged 1-million-word Brown corpus:

  $P(t_i \mid t_{i-1}) = \dfrac{C(t_{i-1}, t_i)}{C(t_{i-1})}$,  e.g.  $P(\text{NN} \mid \text{DT}) = \dfrac{C(\text{DT},\text{NN})}{C(\text{DT})} = \dfrac{56509}{116454} = 0.49$

  $P(w_i \mid t_i) = \dfrac{C(t_i, w_i)}{C(t_i)}$,  e.g.  $P(\text{is} \mid \text{VBZ}) = \dfrac{C(\text{VBZ},\text{is})}{C(\text{VBZ})} = \dfrac{10073}{21627} = 0.47$
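The argmax over tag sequences is computed with the Viterbi algorithm (pictured a few slides below). A compact Python sketch, assuming p_init, p_trans, and p_emit are relative-frequency dictionaries estimated from counts as above:

import math

def viterbi(words, tags, p_init, p_trans, p_emit):
    lp = lambda p: math.log(p) if p > 0 else -1e9      # log with a floor
    V = [{t: lp(p_init.get(t, 0)) + lp(p_emit.get((t, words[0]), 0))
          for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda s: V[-1][s] + lp(p_trans.get((s, t), 0)))
            col[t] = (V[-1][prev] + lp(p_trans.get((prev, t), 0))
                      + lp(p_emit.get((t, w), 0)))
            ptr[t] = prev
        V.append(col)
        back.append(ptr)
    t = max(V[-1], key=V[-1].get)                      # best final state
    seq = [t]
    for ptr in reversed(back):                         # follow back-pointers
        t = ptr[t]
        seq.append(t)
    return list(reversed(seq))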

Page 76: Text to Speech Systems (TTS) EE 516 Spring 2009

An example

• Secretariat/NNP is/BEZ expected/VBN to/TO race/VB tomorrow/NR

• People/NNS continue/VB to/TO inquire/VB the/AT reason/NN for/IN the/AT race/NN for/IN outer/JJ space/NN

Page 77: Text to Speech Systems (TTS) EE 516 Spring 2009

An example of two tag sequences

Page 78: Text to Speech Systems (TTS) EE 516 Spring 2009

Picture of HMM

Page 79: Text to Speech Systems (TTS) EE 516 Spring 2009

Viterbi Algorithm

[Viterbi lattice over states S1–S5 for “promised to back the bill”, one column of candidate tags per word:]

promised   to   back   the   bill
VBD        TO   VB     DT    VB
VBN             JJ     NNP   NN
                NN
                RB

Page 80: Text to Speech Systems (TTS) EE 516 Spring 2009

Evaluation

• The result is compared with a manually coded “Gold Standard”

– Typically accuracy reaches 96–97%
– This may be compared with the result for a baseline tagger (one that uses no context)

• Important: 100% is impossible even for human annotators

Page 81: Text to Speech Systems (TTS) EE 516 Spring 2009

Outline

• History of TTS• Architecture• Text Processing• Letter-to-Sound Rules• Prosody• Waveform Generation• Evaluation

Page 82: Text to Speech Systems (TTS) EE 516 Spring 2009

Lexicons and Lexical Entries

• You can explicitly give pronunciations for words
  – Each language/dialect has its own lexicon
  – You can look up words with
    (lex.lookup WORD)
  – You can add entries to the current lexicon:
    (lex.add.entry NEWENTRY)
  – Entry: (WORD POS (SYL0 SYL1 …))
  – Syllable: ((PHONE0 PHONE1 …) STRESS)
  – Example:
    ("cepstra" n (((k eh p) 1) ((s t r aa) 0)))

Page 83: Text to Speech Systems (TTS) EE 516 Spring 2009

Converting from words to phones

• Two methods:
  – Dictionary-based
  – Rule-based (letter-to-sound = LTS)
• Early systems were all LTS
• MITalk was radical in having a huge 10K-word dictionary
• Now systems use a combination
• CMU dictionary: 127K words
  – http://www.speech.cs.cmu.edu/cgi-bin/cmudict

Page 84: Text to Speech Systems (TTS) EE 516 Spring 2009

Dictionaries aren’t always sufficient

• Unknown words
  – Grow roughly linearly with the number of words in unseen text
  – Mostly person, company, and product names
  – But also foreign words, etc.
• So commercial systems have a 3-part system:
  – Big dictionary
  – Special code for handling names
  – Machine-learned LTS system for other unknown words

Page 85: Text to Speech Systems (TTS) EE 516 Spring 2009

Letter-to-Sound Rules

• Festival LTS rules:
  (LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS)
• Example:
  ( # [ c h ] C = k )
  ( # [ c h ] = ch )
• # denotes beginning of word
• C means any consonant
• Rules apply in order:
  – “christmas” is pronounced with [k]
  – but word-initial ch followed by a non-consonant is pronounced [ch], e.g. “choice”
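A toy Python interpreter for rules of this shape (first matching rule wins; the rule table here contains only the ch example, and real Festival rule sets are far larger):

CONSONANTS = set("bcdfghjklmnpqrstvwxz")

# (left, target, right, phones); '#' = word boundary, 'C' = any consonant,
# '' = matches anything
RULES = [
    ("#", "ch", "C", ["k"]),       # ( # [ c h ] C = k )
    ("#", "ch", "",  ["ch"]),      # ( # [ c h ] = ch )
]

def _match(pat, ch):
    if pat == "":
        return True
    if pat == "C":
        return ch in CONSONANTS
    return ch == pat

def apply_lts(word):
    w = "#" + word + "#"
    i, phones = 1, []
    while i < len(w) - 1:
        for left, target, right, out in RULES:        # rules apply in order
            if (w.startswith(target, i) and _match(left, w[i - 1])
                    and _match(right, w[i + len(target)])):
                phones += out
                i += len(target)
                break
        else:
            phones.append(w[i])                       # no rule: pass letter through
            i += 1
    return phones

print(apply_lts("christmas")[0])   # k
print(apply_lts("choice")[0])      # ch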

Page 86: Text to Speech Systems (TTS) EE 516 Spring 2009

Stress rules in LTS

• English is famously evil; one rule from Allen et al. 1987:
• V -> [1-stress] / X _ C* {Vshort C C? | V} {Vshort C* | V}
• where X must contain all prefixes
• Assign 1-stress to the vowel in a syllable preceding a weak syllable followed by a morpheme-final syllable containing a short vowel and 0 or more consonants (e.g. difficult)
• Assign 1-stress to the vowel in a syllable preceding a weak syllable followed by a morpheme-final vowel (e.g. oregano)
• etc.

Page 87: Text to Speech Systems (TTS) EE 516 Spring 2009

Modern method: Learning LTS rules automatically

• Induce LTS from a dictionary of the language
• Black et al. 1998
• Applied to English, German, French
• Two steps: alignment and (CART-based) rule induction

Page 88: Text to Speech Systems (TTS) EE 516 Spring 2009

Alignment

• Letters: c h e c k e d
• Phones: ch _ eh _ k _ t
• Black et al. method 1:
  – First scatter epsilons in all possible ways to cause letters and phones to align
  – Then collect stats for P(letter|phone) and select the best to generate new stats
  – This is iterated a number of times until it settles (5–6 iterations)
  – This is the EM (expectation maximization) algorithm

Page 89: Text to Speech Systems (TTS) EE 516 Spring 2009

Alignment

• Black et al. method 2
• Hand-specify which letters can be rendered as which phones
  – C goes to k/ch/s/sh
  – W goes to w/v/f, etc.
• Once the mapping table is created, find all valid alignments, find p(letter|phone), score all alignments, and take the best

Page 90: Text to Speech Systems (TTS) EE 516 Spring 2009

Alignment

• Some alignments will turn out to be really bad
• These are just the cases where the pronunciation doesn’t match the letters:
  – Dept: d ih p aa r t m ah n t
  – CMU: s iy eh m y uw
  – Lieutenant: l eh f t eh n ax n t (British)
• Also foreign words
• These can just be removed from alignment training

Page 91: Text to Speech Systems (TTS) EE 516 Spring 2009

Building CART trees

• Build a CART tree for each letter in the alphabet (26, plus accented letters) using a context of ±3 letters
• # # # c h e c -> ch
• c h e c k e d -> _
• This produces 92–96% correct letter accuracy (58–75% word accuracy) for English

Page 92: Text to Speech Systems (TTS) EE 516 Spring 2009

Improvements

• Take names out of the training data
• And acronyms
• Detect both of these separately
• And build special-purpose tools to do LTS for names and acronyms

Page 93: Text to Speech Systems (TTS) EE 516 Spring 2009

Names

• A big problem area is names
• Names are common
  – 20% of tokens in typical newswire text will be names
  – The 1987 Donnelly list (72 million households) contains about 1.5 million names
  – Personal names: McArthur, D’Angelo, Jimenez, Rajan, Raghavan, Sondhi, Xu, Hsu, Zhang, Chang, Nguyen
  – Company/brand names: Infinit, Kmart, Cytyc, Medamicus, Inforte, Aaon, Idexx Labs, Bebe

Page 94: Text to Speech Systems (TTS) EE 516 Spring 2009

Names

• Methods:
  – Can do morphology (Walters -> Walter, Lucasville)
  – Can write stress-shifting rules (Jordan -> Jordanian)
  – Rhyme analogy: Plotsky by analogy with Trotsky (replace tr with pl)
  – Liberman and Church: for the 250K most common names, got 212K (85%) from these modified-dictionary methods, and used LTS for the rest
  – Can do automatic country detection (from letter trigrams) and then apply country-specific rules

Page 95: Text to Speech Systems (TTS) EE 516 Spring 2009

Outline

• History of TTS
• Architecture
• Text Processing
• Letter-to-Sound Rules
• Prosody
• Waveform Generation
• Evaluation

Page 96: Text to Speech Systems (TTS) EE 516 Spring 2009

Defining Intonation

• Ladd (1996), “Intonational phonology”
• “The use of suprasegmental phonetic features to convey sentence-level pragmatic meanings”
  – Suprasegmental = above and beyond the segment/phone: F0, intensity (energy), duration
  – I.e., meanings that apply to phrases or utterances as a whole, not lexical stress, not lexical tone

Page 97: Text to Speech Systems (TTS) EE 516 Spring 2009

Three aspects of prosody

• Prominence: some syllables/words are more prominent than others
• Structure/boundaries: sentences have prosodic structure
  – Some words group naturally together
  – Others have a noticeable break or disjuncture between them
• Tune: the intonational melody of an utterance

Page 98: Text to Speech Systems (TTS) EE 516 Spring 2009

Prosodic Prominence: Pitch Accents

A: What types of foods are a good source of vitamins?

B1: Legumes are a good source of VITAMINS.

B2: LEGUMES are a good source of vitamins.

• Prominent syllables are:
  – louder
  – longer
  – have higher F0 and/or sharper changes in F0 (higher F0 velocity)

Slides from Jennifer Venditti

Page 99: Text to Speech Systems (TTS) EE 516 Spring 2009

Prosodic Boundaries


French [bread and cheese]

[French bread] and [cheese]

Page 100: Text to Speech Systems (TTS) EE 516 Spring 2009

Prosodic Tunes

• Legumes are a good source of vitamins.

• Are legumes a good source of vitamins?

Page 101: Text to Speech Systems (TTS) EE 516 Spring 2009

TOPIC #1

Thinking about F0

Page 102: Text to Speech Systems (TTS) EE 516 Spring 2009

Graphic representation of F0

[Plot: F0 (in Hertz) vs. time, 50–400 Hz, for “legumes are a good source of VITAMINS”]

Page 103: Text to Speech Systems (TTS) EE 516 Spring 2009

The ‘ripples’

[Plot: the same F0 track, with gaps during the voiceless segments [t], [s], [s]]

F0 is not defined for consonants without vocal-fold vibration.

Page 104: Text to Speech Systems (TTS) EE 516 Spring 2009

The ‘ripples’

[Plot: the same F0 track, with perturbations around the constricted consonants [v], [g], [g], [z]]

... and F0 can be perturbed by consonants with an extreme constriction in the vocal tract.

Page 105: Text to Speech Systems (TTS) EE 516 Spring 2009

Abstraction of the F0 contour

[Plot: a smooth intonation contour abstracted from the raw F0 track]

Our perception of the intonation contour abstracts away from these perturbations.

Page 106: Text to Speech Systems (TTS) EE 516 Spring 2009

The ‘waves’ and the ‘swells’

[Plot: F0 track for “legumes are a good source of VITAMINS”]

‘wave’ = accent
‘swell’ = phrase

Page 107: Text to Speech Systems (TTS) EE 516 Spring 2009

TOPIC #2

Accent Placement and Intonational Tunes

Page 108: Text to Speech Systems (TTS) EE 516 Spring 2009

Stress vs. accent

• Stress is a structural property of a word — it marks a potential (arbitrary) location for an accent to occur, if there is one.

• Accent is a property of a word in context — it is a way to mark intonational prominence in order to ‘highlight’ important words in the discourse.

[Metrical grid for “vi-ta-mins” and “Ca-li-for-nia”: every syllable gets a mark, full-vowel syllables get another, the stressed syllable gets another, and the accent (if any) lands on the stressed syllable]

Page 109: Text to Speech Systems (TTS) EE 516 Spring 2009

Stress vs. accent (2)

• The speaker decides to make the word vitamin more prominent by accenting it.

• Lexical stress tells us that this prominence will appear on the first syllable, hence VItamin.

Page 110: Text to Speech Systems (TTS) EE 516 Spring 2009

Which word receives an accent?

• It depends on the context. For example, the ‘new’ information in the answer to a question is often accented, while the ‘old’ information usually is not.

– Q1: What types of foods are a good source of vitamins?– A1: LEGUMES are a good source of vitamins.

– Q2: Are legumes a source of vitamins?– A2: Legumes are a GOOD source of vitamins.

– Q3: I’ve heard that legumes are healthy, but what are they a good source of?

– A3: Legumes are a good source of VITAMINS.

Page 111: Text to Speech Systems (TTS) EE 516 Spring 2009

Same ‘tune’, different alignment

[Plot: F0 contour (50–400 Hz) for “LEGUMES are a good source of vitamins”]

The main rise-fall accent (= “I assert this”) shifts locations.

Page 112: Text to Speech Systems (TTS) EE 516 Spring 2009

Same ‘tune’, different alignment

[Plot: F0 contour (50–400 Hz) for “Legumes are a GOOD source of vitamins”]

The main rise-fall accent (= “I assert this”) shifts locations.

Page 113: Text to Speech Systems (TTS) EE 516 Spring 2009

Same ‘tune’, different alignment

[Plot: F0 contour (50–400 Hz) for “legumes are a good source of VITAMINS”]

The main rise-fall accent (= “I assert this”) shifts locations.

Page 114: Text to Speech Systems (TTS) EE 516 Spring 2009

Broad focus

[Plot: F0 contour (50–400 Hz) for “legumes are a good source of vitamins”, with prominent accents on “legumes” and “vitamins”]

In the absence of narrow focus, English tends to mark the first and last ‘content’ words with perceptually prominent accents.

Page 115: Text to Speech Systems (TTS) EE 516 Spring 2009

Yes-No question tune

[Plot: F0 contour (50–550 Hz) for “are LEGUMES a good source of vitamins”, rising at the end]

Rise from the main accent to the end of the sentence.

Page 116: Text to Speech Systems (TTS) EE 516 Spring 2009

Yes-No question tune

[Plot: F0 contour (50–550 Hz) for “are legumes a GOOD source of vitamins”, rising at the end]

Rise from the main accent to the end of the sentence.

Page 117: Text to Speech Systems (TTS) EE 516 Spring 2009

Yes-No question tune

[Plot: F0 contour (50–550 Hz) for “are legumes a good source of VITAMINS”, rising at the end]

Rise from the main accent to the end of the sentence.

Page 118: Text to Speech Systems (TTS) EE 516 Spring 2009

WH-questions

[Plot: F0 contour (50–400 Hz) for “WHAT are a good source of vitamins”, falling]

WH-questions typically have falling contours, like statements.

[I know that many natural foods are healthy, but ...]

Page 119: Text to Speech Systems (TTS) EE 516 Spring 2009

Rising statements

[Plot: F0 contour (50–550 Hz) for “legumes are a good source of vitamins”, with a high rise at the end]

High-rising statements can signal that the speaker is seeking approval.

[... does this statement qualify?]

Page 120: Text to Speech Systems (TTS) EE 516 Spring 2009

‘Surprise-redundancy’ tune

[Plot: F0 contour (50–400 Hz) for “legumes are a good source of vitamins”]

Low beginning followed by a gradual rise to a high at the end.

[How many times do I have to tell you ...]

Page 121: Text to Speech Systems (TTS) EE 516 Spring 2009

‘Contradiction’ tune

[Plot: F0 contour (50–400 Hz) for “linguini isn’t a good source of vitamins”]

Sharp fall at the beginning, flat and low, then rising at the end.

“I’ve heard that linguini is a good source of vitamins.”
[... how could you think that?]

Page 122: Text to Speech Systems (TTS) EE 516 Spring 2009

TOPIC #3

Intonational phrasing and disambiguation

Page 123: Text to Speech Systems (TTS) EE 516 Spring 2009

A single intonation phrase

[Plot: F0 contour (50–400 Hz) for “legumes are a good source of vitamins” as one phrase]

Broad focus statement consisting of one intonation phrase (that is, one intonation tune spans the whole unit).

Page 124: Text to Speech Systems (TTS) EE 516 Spring 2009

Multiple phrases

[Plot: F0 contour (50–400 Hz) for “legumes are a good source of vitamins” broken into smaller phrases]

Utterances can be ‘chunked’ up into smaller phrases in order to signal the importance of information in each unit.

Page 125: Text to Speech Systems (TTS) EE 516 Spring 2009

Phrasing can disambiguate

• Global ambiguity:

Sally saw % the man with the binoculars.

Sally saw the man % with the binoculars.

Page 126: Text to Speech Systems (TTS) EE 516 Spring 2009

Phrasing can disambiguate

• Temporary ambiguity:

When Madonna sings the song ...

Page 127: Text to Speech Systems (TTS) EE 516 Spring 2009

Phrasing can disambiguate

• Temporary ambiguity:

When Madonna sings the song is a hit.

Page 128: Text to Speech Systems (TTS) EE 516 Spring 2009

Phrasing can disambiguate

• Temporary ambiguity:

When Madonna sings % the song is a hit.

When Madonna sings the song % it’s a hit.

[from Speer & Kjelgaard (1992)]

Page 129: Text to Speech Systems (TTS) EE 516 Spring 2009

Phrasing can disambiguate

[Plot: F0 contour (50–400 Hz) for “I met Mary and Elena’s mother at the mall yesterday”, with “Mary and Elena’s mother” and “mall” labeled]

One intonation phrase with a relatively flat overall pitch range.

Page 130: Text to Speech Systems (TTS) EE 516 Spring 2009

Phrasing can disambiguate

[Plot: F0 contour (50–400 Hz) for the same sentence, with “Mary”, “Elena’s mother”, and “mall” as separate labeled regions]

Separate phrases, with expanded pitch movements.

Page 131: Text to Speech Systems (TTS) EE 516 Spring 2009

TOPIC #4

The TOBI Intonational Transcription Theory

Page 132: Text to Speech Systems (TTS) EE 516 Spring 2009

ToBI: Tones and Break Indices

• Pitch accent tones
  – H* “peak accent”
  – L* “low accent”
  – L+H* “rising peak accent” (contrastive)
  – L*+H “scooped accent”
  – H+!H* downstepped high
• Boundary tones
  – L-L% (final low; Am. Eng. declarative contour)
  – L-H% (continuation rise)
  – H-H% (yes-no question)
• Break indices
  – 0: clitics; 1: word boundaries; 2: short pause
  – 3: intermediate intonation phrase
  – 4: full intonation phrase/final boundary

Page 133: Text to Speech Systems (TTS) EE 516 Spring 2009

Examples of the TOBI system

• I don’t eat beef.             L* L* L* L-L%
• Marianna made the marmalade.  H* L-L%  /  L* H-H%
• “I” means insert.             H* H* H* L-L%  /  1 H* L- H* L-L% 3

Page 134: Text to Speech Systems (TTS) EE 516 Spring 2009

ToBI

• http://www.ling.ohio-state.edu/~tobi/
• ToBI for American English
  – http://www.ling.ohio-state.edu/~tobi/ame_tobi/
• Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J., and Hirschberg, J. (1992). ToBI: a standard for labelling English prosody. In Proceedings of ICSLP92, volume 2, pages 867-870.
• Pitrelli, J. F., Beckman, M. E., and Hirschberg, J. (1994). Evaluation of prosodic transcription labeling reliability in the ToBI framework. In ICSLP94, volume 1, pages 123-126.
• Pierrehumbert, J., and Hirschberg, J. (1990). The meaning of intonation contours in the interpretation of discourse. In P. R. Cohen, J. Morgan, and M. E. Pollack, eds., Plans and Intentions in Communication and Discourse, 271-311. MIT Press.
• Beckman and Elam. Guidelines for ToBI Labelling. Web.

Page 135: Text to Speech Systems (TTS) EE 516 Spring 2009

TOPIC #5

PRODUCING INTONATION IN TTS

Page 136: Text to Speech Systems (TTS) EE 516 Spring 2009

Intonation in TTS

1) Accent: Decide which words are accented, which syllable has accent, what sort of accent

2) Boundaries: Decide where intonational boundaries are

3) Duration: Specify length of each segment

4) F0: Generate F0 contour from these

Page 137: Text to Speech Systems (TTS) EE 516 Spring 2009

TOPIC #5a

Predicting pitch accent

Page 138: Text to Speech Systems (TTS) EE 516 Spring 2009

Factors in accent prediction

• Contrast– Legumes are poor source of VITAMINS– No, legumes are a GOOD source of vitamins

– I think JOHN and MARY should go– No, I think JOHN AND MARY should go

Page 139: Text to Speech Systems (TTS) EE 516 Spring 2009

But it’s more than just contrast

• List intonation:

• I went and saw ANNA, LENNY, MARY, and NORA.

Page 140: Text to Speech Systems (TTS) EE 516 Spring 2009

In fact, accents are common!

• A Broadcast News example from Hirschberg (1993):
• SUN MICROSYSTEMS INC, the UPSTART COMPANY that HELPED LAUNCH the DESKTOP COMPUTER industry TREND TOWARD HIGH powered WORKSTATIONS, was UNVEILING an ENTIRE OVERHAUL of its PRODUCT LINE TODAY. SOME of the new MACHINES, PRICED from FIVE THOUSAND NINE hundred NINETY five DOLLARS to seventy THREE thousand nine HUNDRED dollars, BOAST SOPHISTICATED new graphics and DIGITAL SOUND TECHNOLOGIES, HIGHER SPEEDS AND a CIRCUIT board that allows FULL motion VIDEO on a COMPUTER SCREEN.

Page 141: Text to Speech Systems (TTS) EE 516 Spring 2009

Factors in accent prediction

• Part of speech:
  – Content words are usually accented
  – Function words are rarely accented
    • of, for, in, on, that, the, a, an, no, to, and, but, or, will, may, would, can, her, is, their, its, our, there, am, are, was, were, etc.

Page 142: Text to Speech Systems (TTS) EE 516 Spring 2009

Factors in accent prediction

• Word order
• Preposed items are accented more frequently
  – TODAY we will BEGIN to LOOK at FROG anatomy.
  – We will BEGIN to LOOK at FROG anatomy today.

Page 143: Text to Speech Systems (TTS) EE 516 Spring 2009

Factors in Accent Prediction

• Information Status:
  – New versus old information: new information tends to be accented, while old information is usually deaccented
  – There are LAWYERS, and there are GOOD lawyers

Page 144: Text to Speech Systems (TTS) EE 516 Spring 2009

Complex Noun Phrase Structure

• Sproat, R. 1994. English noun-phrase accent prediction for text-to-speech. Computer Speech and Language 8:79-94.

• Proper names: stress on the right-most word
  – New York CITY; Paris, FRANCE
• Adjective-noun combinations: stress on the noun
  – Large HOUSE, red PEN, new NOTEBOOK
• Noun-noun compounds: stress the left noun
  – HOTdog (food) versus hot DOG (overheated animal)
  – WHITE house (place) versus white HOUSE (made of stucco)
• Examples:
  – Madison AVENUE, but PARK street; MEDICAL building; APPLE cake, but cherry PIE
• Some rules:
  – Furniture + Room -> RIGHT (e.g., kitchen TABLE)
  – Proper-name + Street -> LEFT (e.g., PARK street)

Page 145: Text to Speech Systems (TTS) EE 516 Spring 2009

Other features

• POS
• POS of previous word
• POS of next word
• Stress of current, previous, next syllable
• Unigram probability of word
• Bigram probability of word
• Position of word in sentence

Page 146: Text to Speech Systems (TTS) EE 516 Spring 2009

State of the art

• Hand-label large training sets
• Use CART, SVM, CRF, etc. to predict accent
• Lots of rich features from context
• Classic lit:
  – Hirschberg, Julia. 1993. Pitch Accent in context: predicting intonational prominence from text. Artificial Intelligence 63, 305-340.

Page 147: Text to Speech Systems (TTS) EE 516 Spring 2009

TOPIC #5b

Predicting boundaries

Page 148: Text to Speech Systems (TTS) EE 516 Spring 2009

Predicting Boundaries

• Intonation phrase boundaries
  – Intermediate phrase boundaries
  – Full intonation phrase boundaries

Page 149: Text to Speech Systems (TTS) EE 516 Spring 2009

More examples

• From Ostendorf and Veilleux. 1994 “Hierarchical Stochastic model for Automatic Prediction of Prosodic Boundary Location”, Computational Linguistics 20:1

• Computer phone calls, || which do everything | from selling magazine subscriptions || to reminding people about meetings || have become the telephone equivalent | of junk mail. ||

• Doctor Norman Rosenblatt, || dean of the college | of criminal justice at Northeastern University, || agrees.||

• For WBUR, || I’m Margo Melnicove.

Page 150: Text to Speech Systems (TTS) EE 516 Spring 2009

Ostendorf and Veilleux CART

Page 151: Text to Speech Systems (TTS) EE 516 Spring 2009

TOPIC #5c

Predicting duration

Page 152: Text to Speech Systems (TTS) EE 516 Spring 2009

Duration

• Simplest: fixed size for all phones (e.g., 100 ms)
• Next simplest: average duration for that phone (from training data). Samples from Switchboard, in ms:

    aa 118    b  68
    ax  59    d  68
    ay 138    dh 44
    eh  87    f  90
    ih  77    g  66

• Next next simplest: add in phrase-final and initial lengthening, plus stress
• Better: average duration for each triphone
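A sketch of the “next next simplest” model in Python, seeded with the Switchboard means above; the stress and phrase-final factors are illustrative values, not figures from the lecture:

MEAN_MS = {"aa": 118, "ax": 59, "ay": 138, "eh": 87, "ih": 77,
           "b": 68, "d": 68, "dh": 44, "f": 90, "g": 66}

def phone_duration(phone, stressed=False, phrase_final=False):
    d = MEAN_MS.get(phone, 100)        # unseen phone: fall back to 100 ms
    if stressed:
        d *= 1.2                       # illustrative stress lengthening
    if phrase_final:
        d *= 1.4                       # illustrative prepausal lengthening
    return d

print(phone_duration("aa", stressed=True, phrase_final=True))  # ~198 ms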

Page 153: Text to Speech Systems (TTS) EE 516 Spring 2009

Duration in Festival (2)

• Klatt duration rules. Modify duration based on:
  – Position in clause
  – Syllable position in word
  – Syllable type
  – Lexical stress
  – Left + right context phone
  – Prepausal lengthening
• Festival: 2 options
  – Klatt rules
  – Use a labeled training set with Klatt features to train a CART

Page 154: Text to Speech Systems (TTS) EE 516 Spring 2009

Duration: state of the art

• Lots of fancy models of duration prediction:
  – Using Z-scores and other clever normalizations
  – Sum-of-products model
  – New features like word predictability
    • Words with higher bigram probability are shorter

Page 155: Text to Speech Systems (TTS) EE 516 Spring 2009

TOPIC #5d

F0 Generation

Page 156: Text to Speech Systems (TTS) EE 516 Spring 2009

F0 Generation

• Generation in Festival
  – F0 generation by rule
  – F0 generation by linear regression
• Some constraints
  – F0 is constrained by accents and boundaries
  – F0 declines gradually over an utterance (“declination”)

Page 157: Text to Speech Systems (TTS) EE 516 Spring 2009

F0 generation by rule

• Generate a list of target F0 points for each syllable
• Here’s a rule to generate a simple H* “hat” accent (with fixed, speaker-specific F0 values):

(define (targ_func1 utt syl)
  "(targ_func1 UTT STREAMITEM)
Returns a list of targets for the given syllable."
  (let ((start (item.feat syl 'syllable_start))
        (end (item.feat syl 'syllable_end)))
    (if (equal? (item.feat syl "R:Intonation.daughter1.name") "Accented")
        (list (list start 110)
              (list (/ (+ start end) 2.0) 140)
              (list end 100)))))

Page 158: Text to Speech Systems (TTS) EE 516 Spring 2009

F0 generation by regression

• Supervised machine learning again
• We predict the value of F0 at 3 places in each syllable
• Predictor features:
  – Accent of current word, next word, previous word
  – Boundaries
  – Syllable type, phonetic information
  – Stress information
• Need training sets with pitch accents labeled

Page 159: Text to Speech Systems (TTS) EE 516 Spring 2009

Outline

• History of TTS
• Architecture
• Text Processing
• Letter-to-Sound Rules
• Prosody
• Waveform Generation
• Evaluation

Page 160: Text to Speech Systems (TTS) EE 516 Spring 2009

Articulatory Synthesis

The vocal tract is divided into a large number of short tubes, as in the electrical transmission-line analog, which are then combined and the resonant frequencies calculated.

from Sinder, 1999 (thesis work with Flanagan, Rutgers)

[audio example]

Page 161: Text to Speech Systems (TTS) EE 516 Spring 2009

Formant Synthesis

• Instead of specifying mouth shapes, formant synthesis specifies frequencies and bandwidths of resonators, which are used to filter a source waveform.

• Formant frequency analysis is difficult; bandwidth estimation is even more difficult. But the biggest perceptual problem in formant synthesis is not in the resonances, but in a “buzzy” quality most likely due to the glottal source model.

• Formant synthesis can sound identical to natural utterance if details of the glottal source and formants are well modeled.

[Audio examples: natural speech vs. synthetic speech (John Holmes, 1973)]

Each formant resonator i, with frequency $f_i$ and bandwidth $b_i$ (as fractions of the sample rate), has the transfer function

$H_i(z) = \dfrac{1}{1 - 2 e^{-\pi b_i} \cos(2\pi f_i)\, z^{-1} + e^{-2\pi b_i}\, z^{-2}}$

Page 162: Text to Speech Systems (TTS) EE 516 Spring 2009

Klatt’s formant synthesizer

[Block diagram of Klatt’s formant synthesizer: an impulse generator (driven by F0) and a random-number generator provide voicing and noise sources, shaped by glottal resonators (RGP, RGZ, RGS) and amplitude controls (AV, AVS, AH, AF); the source drives a cascade transfer function (resonators R1–R5 plus nasal pole/zero RNP, RNZ) in parallel with a parallel transfer function (resonators R1–R6 with amplitudes A1–A6, AN, and bypass AB), whose outputs are summed]

Page 163: Text to Speech Systems (TTS) EE 516 Spring 2009

Klatt’s parameter values

 N   Symbol  Name                                    Min    Max     Typ
 1   AV      Amplitude of voicing (dB)               0      80      0
 2   AF      Amplitude of frication (dB)             0      80      0
 3   AH      Amplitude of aspiration (dB)            0      80      0
 4   AVS     Amplitude of sinusoidal voicing (dB)    0      80      0
 5   F0      Fundamental frequency (Hz)              0      500     0
 6   F1      First formant frequency (Hz)            150    900     500
 7   F2      Second formant frequency (Hz)           500    2500    1500
 8   F3      Third formant frequency (Hz)            1300   3500    2500
 9   F4      Fourth formant frequency (Hz)           2500   4500    3500
10   FNZ     Nasal zero frequency (Hz)               200    700     250
11   AN      Nasal formant amplitude (dB)            0      80      0
12   A1      First formant amplitude (dB)            0      80      0
13   A2      Second formant amplitude (dB)           0      80      0
14   A3      Third formant amplitude (dB)            0      80      0
15   A4      Fourth formant amplitude (dB)           0      80      0
16   A5      Fifth formant amplitude (dB)            0      80      0
17   A6      Sixth formant amplitude (dB)            0      80      0
18   AB      Bypass path amplitude (dB)              0      80      0
19   B1      First formant bandwidth (Hz)            40     500     50
20   B2      Second formant bandwidth (Hz)           40     500     70
21   B3      Third formant bandwidth (Hz)            40     500     110
22   SW      Cascade/parallel switch                 0      1       0
…
32   FNP     Nasal pole frequency (Hz)               200    500     250
33   BNP     Nasal pole bandwidth (Hz)               50     500     100
34   BNZ     Nasal zero bandwidth (Hz)               50     500     100
35   BGS     Glottal resonator 2 bandwidth (Hz)      100    1000    200
36   SR      Sampling rate (Hz)                      500    20000   10000
37   NWS     Number of waveform samples per chunk    1      200     50
38   G0      Overall gain control (dB)               0      80      48
39   NFC     Number of cascaded formants             4      6       5

Page 164: Text to Speech Systems (TTS) EE 516 Spring 2009

Formant systems: Rule-Based Synthesis

• For synthesis of arbitrary text, formants and bandwidths for each phoneme are determined by analyzing speech of a single person.

• The models of each phoneme may be a single set of formant frequencies and bandwidths for a canonical phoneme at a single point in time, or a trajectory of frequencies, bandwidths, and source models over time.

• The formant frequencies for each phoneme are combined over time using a model of coarticulation, such as Klatt’s modified locus theory.

• Duration, pitch, and energy rules are applied

• Result: something like this: [audio example]

Page 165: Text to Speech Systems (TTS) EE 516 Spring 2009

Concatenative Synthesis

• Copy synthesis sounds great but synthesis by rule using formants does not. Why?… Problem with glottal source? Problem with coarticulation and formant transitions? Problem with prosody?

• Formant synthesis was the main TTS technique until the early-to-mid 1990s, when increasing memory size and CPU speed allowed concatenative synthesis to become a viable approach.

• Concatenative synthesis uses recordings of small units of speech (typically the region from the middle of one phoneme to the middle of another phoneme, or a diphone unit), and glues these units together to form words and sentences.

• Don’t have to worry about glottal source models or coarticulation, since the synthesis is just a concatenation of different waveforms containing “natural” glottal source and coarticulation.

Page 166: Text to Speech Systems (TTS) EE 516 Spring 2009

Concatenative Synthesis: Units

• The basic unit for concatenative synthesis is the diphone:

• More recent TTS research is on using larger units. Issues include:
  • how to decide what units will be used
  • how to select the best unit from a very large database

• With increasing size and variety of units, there is an exponential growth in the database size. Yet, despite massive databases that may take months to record, coverage is nowhere near complete. There is a very large number of infrequent events in speech.

sil-jh  jh-aa  aa-n  n-sil   (e.g., the diphones spanning “John”)

Page 167: Text to Speech Systems (TTS) EE 516 Spring 2009

Joining Units

• Dumb:
  – just join
  – better: join at low-amplitude regions
• TD-PSOLA
  – Time-domain pitch-synchronous overlap-and-add
  – Join at pitch periods (with windowing)

Page 168: Text to Speech Systems (TTS) EE 516 Spring 2009

Diphone boundaries in stops

Page 169: Text to Speech Systems (TTS) EE 516 Spring 2009

Prosodic Modification

• Modifying pitch and duration independently
• Changing the sample rate modifies both:
  – “Alvin and the Chipmunks” speech
• Duration: duplicate/remove parts of the signal
• Pitch: resample to change pitch

Page 170: Text to Speech Systems (TTS) EE 516 Spring 2009

Speech as Short Term signals

Page 171: Text to Speech Systems (TTS) EE 516 Spring 2009

Duration modification

• Duplicate/remove short term signals

Page 172: Text to Speech Systems (TTS) EE 516 Spring 2009

Pitch Modification

• Move short-term signals closer together/further apart

Page 173: Text to Speech Systems (TTS) EE 516 Spring 2009

Overlap-and-add (OLA)

Page 174: Text to Speech Systems (TTS) EE 516 Spring 2009

Overlap and Add (OLA)

• Hanning windows of length 2N are used to multiply the analysis signal
• Resulting windowed signals are added
• Analysis windows, spaced 2N
• Synthesis windows, spaced N
• Time compression is uniform with a factor of 2
• Pitch periodicity is somewhat lost around the 4th window
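A minimal NumPy sketch of this uniform compress-by-2 OLA (the frame half-length N is a free parameter; note the frames are not pitch-synchronous, which is exactly why plain OLA damages periodicity):

import numpy as np

def ola_compress_by_2(x, N=256):
    """Analysis windows spaced 2N, synthesis windows spaced N."""
    win = np.hanning(2 * N)
    n_frames = (len(x) - 2 * N) // (2 * N) + 1
    y = np.zeros(n_frames * N + N)
    for m in range(n_frames):
        frame = x[m * 2 * N : m * 2 * N + 2 * N] * win   # windowed frame
        y[m * N : m * N + 2 * N] += frame                # add at half spacing
    return y

x = np.sin(2 * np.pi * 100 * np.arange(16000) / 8000)    # 2 s of 100 Hz at 8 kHz
print(len(ola_compress_by_2(x)) / len(x))                # ~0.5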

Page 175: Text to Speech Systems (TTS) EE 516 Spring 2009

TD-PSOLA ™

• Time-Domain Pitch-Synchronous Overlap and Add
• Patented by France Telecom (CNET)
• Very efficient
  – No FFT (or inverse FFT) required
• Can modify F0 upward by up to a factor of two, or downward by half

Page 176: Text to Speech Systems (TTS) EE 516 Spring 2009

TD-PSOLA ™

Page 177: Text to Speech Systems (TTS) EE 516 Spring 2009

HMM-Based synthesis

• Generate the most likely sequence of spectral (e.g. MFCC) and excitation (F0) parameters for the given phone sequence using an HMM
• Create a filter from the spectral parameters
• Pass the excitation (F0, noise) through the filter to generate the waveform

Page 178: Text to Speech Systems (TTS) EE 516 Spring 2009

Block Diagram

• Zen & Toda (2005)

Page 179: Text to Speech Systems (TTS) EE 516 Spring 2009

HMM parameter generation

• Each model represents a phone or a subphone (diphone, triphone, etc.)
• Each model consists of multiple states
  – Tri-state model
  – Each Gaussian mixture component is represented by a different state, with the transition probability as the mixture weight
• Each state emits a spectral/F0 feature vector
  – 12–13 MFCCs, deltas, (delta-deltas)
  – F0, delta, (delta-delta)

Page 180: Text to Speech Systems (TTS) EE 516 Spring 2009

Problem for HMM parameter generation

• We know which models to concatenate, and in what order
• We do NOT know
  – which state in the model to use to generate each frame
  – which value to choose from the set of values observable within each state

Page 181: Text to Speech Systems (TTS) EE 516 Spring 2009

Tokuda et al. (1995)

• We need to solve

  $\hat{c} = \arg\max_{c} P(O, q \mid \lambda)$

  – O: observation sequence, where each observation is a feature vector consisting of MFCCs (c) and their deltas (Δc)
  – q: state sequence
  – λ: HMM

• The problem is that we don’t exactly know what q is

Page 182: Text to Speech Systems (TTS) EE 516 Spring 2009

Solution – (1)

Let’s assume that we know the state sequence q. Then

$\arg\max_c P(O, q \mid \lambda) = \arg\max_c P(O \mid q, \lambda)\, P(q \mid \lambda) = \arg\max_c P(O \mid q, \lambda)$

$P(O \mid q, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2)\, b_{q_3}(o_3) \cdots b_{q_T}(o_T)$

$b_{q_t}(o_t) = \mathcal{N}(c_t;\, \mu_{q_t}, \Sigma_{q_t})$

$\mathcal{N}(o;\, \mu, \Sigma) = \dfrac{1}{\sqrt{(2\pi)^N \lvert\Sigma\rvert}} \exp\!\left(-\tfrac{1}{2}\,(o - \mu)'\,\Sigma^{-1}\,(o - \mu)\right)$

Page 183: Text to Speech Systems (TTS) EE 516 Spring 2009

Solution – (1)

• Non-smooth spectrum

[Plot: the generated MFCC1 trajectory over time is piecewise constant, jumping at state boundaries]

Page 184: Text to Speech Systems (TTS) EE 516 Spring 2009

Solution – (2): Add deltas

$P(O \mid q, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2)\, b_{q_3}(o_3) \cdots b_{q_T}(o_T)$

$b_{q_t}(o_t) = \mathcal{N}(c_t;\, \mu_{q_t}, \Sigma_{q_t})\;\mathcal{N}(\Delta c_t;\, \Delta\mu_{q_t}, \Delta\Sigma_{q_t})$

1. Differentiate the log-probability with respect to c_t
2. Solve the resulting set of linear equations for c_t

[Plot: with deltas, the generated MFCC1 trajectory over time is smooth]

Page 185: Text to Speech Systems (TTS) EE 516 Spring 2009

Digression: Delta

• Simple calculation of delta:

  $d_t = \dfrac{c_{t+1} - c_{t-1}}{2}$

• More robust calculation of delta (regression over a window of ±L frames):

  $d_t = \dfrac{\sum_{l=1}^{L} l\,(c_{t+l} - c_{t-l})}{2\sum_{l=1}^{L} l^{2}}$

• Typically rewritten as below, where $w_l$ is derived from the above:

  $d_t = \sum_{l=-L}^{L} w_l\, c_{t+l}$
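The regression form is a one-liner per frame. A NumPy sketch for a single coefficient track (edges padded by repeating the first/last frame, one common convention):

import numpy as np

def deltas(c, L=2):
    """d_t = sum_l l*(c[t+l]-c[t-l]) / (2*sum_l l^2)."""
    c = np.asarray(c, dtype=float)
    padded = np.pad(c, (L, L), mode="edge")
    denom = 2 * sum(l * l for l in range(1, L + 1))
    return np.array([sum(l * (padded[t + L + l] - padded[t + L - l])
                         for l in range(1, L + 1)) / denom
                     for t in range(len(c))])

print(deltas([0, 1, 2, 3, 4]))   # [0.5 0.8 1.  0.8 0.5]; slope 1 in the interior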

Page 186: Text to Speech Systems (TTS) EE 516 Spring 2009

Finding the state sequence

• Recall that our problem was that we do NOT know
  1) which state in the model to use to generate each frame
  2) which value to choose from the set of values observable within each state
• The solution discussed thus far solved (2), assuming that we know the answer to (1)
• To really solve the problem, we should consider all possible state sequences and choose the c that gives the highest observation probability
• Directly solving the equation for all possible state sequences takes too much time

Page 187: Text to Speech Systems (TTS) EE 516 Spring 2009

How about excitation?

• Unvoiced speech: white noise. This is fine!
• Voiced speech: impulse train
• Problems:
  – Voiced speech has frication
  – Hard (binary) voiced/unvoiced decisions are hard

[Diagram: excitation passed through a filter h[n]]

Page 188: Text to Speech Systems (TTS) EE 516 Spring 2009

How about excitation?

• Use a mixed excitation model
• Learn the model parameters from data with the HMM
• Multi-band noise is better

[Diagram: impulse train and noise are mixed, then filtered (h[n], g[n]) and summed]

Page 189: Text to Speech Systems (TTS) EE 516 Spring 2009

HMM-Based concatenative Synthesis

• Given a big database
• Find the string of units that maximizes the probability under the HMM:

  $\hat{U} = \arg\max_U p(U \mid \lambda) = \arg\max_U \prod_{t=1}^{T} p(u_t \mid \lambda)$

• Intra-segment scores can be precomputed
• Concatenation scores could also be precomputed
  – All possible joins (could be large!)
  – Delta means and variances at boundaries are the key!
• Good job at concatenation matching!
• How about prosody? Use the HMM too!

Page 190: Text to Speech Systems (TTS) EE 516 Spring 2009

Stylistic TTS

[Block diagram: Text → Text Analysis (rules) → Letter-to-Sound (dictionary and rules) → Prosody → Waveform concatenation (from a database of recorded read speech) → Speech]

TTS in Windows since Windows 2000

Thanks to Min Chu, MSR Asia

Page 191: Text to Speech Systems (TTS) EE 516 Spring 2009

Outline

• History of TTS
• Architecture
• Text Processing
• Letter-to-Sound Rules
• Prosody
• Waveform Generation
• Evaluation

Page 192: Text to Speech Systems (TTS) EE 516 Spring 2009

Evaluation of TTS

• Intelligibility tests
  – Diagnostic Rhyme Test (DRT)
    • Humans make a listening identification choice between two words differing by a single phonetic feature
      – Voicing, nasality, sustention, sibilation
    • 96 rhyming pairs
    • veal/feel, meat/beat, vee/bee, zee/thee, etc.
    – Subject hears “veal”, chooses either “veal” or “feel”
    – Subject also hears “feel”, chooses either “veal” or “feel”
  • % of right answers is the intelligibility score
• Overall quality tests
  – Have listeners rate speech on a scale from 1 (bad) to 5 (excellent)
• Preference tests (prefer A, prefer B)
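Scoring a DRT run is then just the fraction of correct forced choices. A tiny sketch:

def drt_score(trials):
    """trials: list of (word_played, word_chosen) pairs."""
    correct = sum(1 for played, chosen in trials if played == chosen)
    return 100.0 * correct / len(trials)

trials = [("veal", "veal"), ("feel", "veal"), ("meat", "meat"), ("beat", "beat")]
print(drt_score(trials))   # 75.0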