CS 4705 Lecture 4: Sound Systems and Text-to-Speech

Sound Systems of Language

• Phonetics – The sounds (phones) of the world’s languages, the phonemes they map to, and how they are produced
• Phonology – Rules that govern how phones are realized differently in different contexts
• Technologies:
  – Automatic Speech Recognition (ASR) systems take sounds as input and output word hypotheses
  – Text-to-Speech (TTS) systems take text as input and produce speech

Letters and Sounds

• same spelling = different sounds
  – o: comb, tomb, bomb
  – oo: blood, food, good
  – c: court, center, cheese
  – s: reason, surreal, shy
• same sound = different spellings
  – [i]: sea, see, scene, receive, thief
  – [s]: cereal, same, miss
  – [u]: true, few, choose, lieu, do
  – [ay]: prime, buy, rhyme, lie
• combination of letters = single sound
  – ch: child, beach
  – th: that, bathe
  – oo: good, foot
  – gh: laugh
• single letter = combination of sounds
  – x: exit, Texas
  – u: use, music
• ‘silent’ letters
  – k: knife, know
  – p: psycho, pterodactyl
  – e: moose, bone
  – gh: through
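The mismatch between letters and sounds can be made concrete with a small table. The sketch below is only an illustration: the ARPAbet vowel assigned to each word is a hand-entered assumption, and the names `EXAMPLES`, `sounds_by_spelling`, and `spellings_by_sound` are hypothetical.

```python
from collections import defaultdict

# (word, spelling, ARPAbet vowel) triples for some of the slide's examples.
# The transcriptions are hand-entered assumptions, not dictionary output.
EXAMPLES = [
    ("comb", "o", "ow"), ("tomb", "o", "uw"), ("bomb", "o", "aa"),
    ("blood", "oo", "ah"), ("food", "oo", "uw"), ("good", "oo", "uh"),
]

def sounds_by_spelling(examples):
    """Map each spelling to the set of vowel phones it stands for."""
    table = defaultdict(set)
    for _word, spelling, phone in examples:
        table[spelling].add(phone)
    return dict(table)

def spellings_by_sound(examples):
    """Map each vowel phone to the set of spellings that produce it."""
    table = defaultdict(set)
    for _word, spelling, phone in examples:
        table[phone].add(spelling)
    return dict(table)

# One spelling, several sounds: "o" covers three different vowels.
print(sorted(sounds_by_spelling(EXAMPLES)["o"]))
# One sound, several spellings: [uw] is written "o" (tomb) and "oo" (food).
print(sorted(spellings_by_sound(EXAMPLES)["uw"]))
```

Neither direction of the mapping is one-to-one, which is why a TTS system cannot read pronunciation directly off the spelling.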

Articulators

[Figure: vocal tract diagram labeling the lips, teeth, alveolar ridge, palate, velum, uvula, pharynx, larynx, vocal folds (glottis), and trachea]

Articulators in action

“Why did Ken set the soggy net on top of his deck?”

(Sample from the Queen’s University / ATR Labs X-ray Film Database)

http://psyc.queensu.ca/~munhallk/05_database.htm

Vocal fold vibration

[UCLA Phonetics Lab demo]

http://www.humnet.ucla.edu/humnet/linguistics/faciliti/demos/vocalfolds/vocalfolds.htm

Places of articulation

[Figure: vocal tract diagram marking the places of articulation: labial, dental, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal]

http://www.chass.utoronto.ca/~danhall/phonetics/sammy.html

Articulatory parameters for English consonants (in ARPAbet)

PLACE OF ARTICULATION (columns) by MANNER OF ARTICULATION (rows)

MANNER    bilabial  labio-dental  inter-dental  alveolar  palatal  velar  glottal
stop      p b                                   t d                k g    q
fric.               f v           th dh         s z       sh zh           h
affric.                                                   ch jh
nasal     m                                     n                  ng
approx    w                                     l/r       y
flap                                            dx

VOICING: in each pair, the voiceless phone is listed first and the voiced phone second
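The consonant chart is in effect a feature table: each ARPAbet symbol is described by a place, a manner, and a voicing value. A minimal sketch of that idea follows. The `Consonant` type and the `natural_class` helper are hypothetical names, l/r is split into two entries, and the voicing values are filled in from standard phonetics as an assumption.

```python
from typing import NamedTuple

class Consonant(NamedTuple):
    place: str
    manner: str
    voiced: bool

# ARPAbet symbol -> articulatory features, following the chart's rows and columns.
CONSONANTS = {
    "p": Consonant("bilabial", "stop", False),
    "b": Consonant("bilabial", "stop", True),
    "t": Consonant("alveolar", "stop", False),
    "d": Consonant("alveolar", "stop", True),
    "k": Consonant("velar", "stop", False),
    "g": Consonant("velar", "stop", True),
    "q": Consonant("glottal", "stop", False),
    "f": Consonant("labio-dental", "fricative", False),
    "v": Consonant("labio-dental", "fricative", True),
    "th": Consonant("inter-dental", "fricative", False),
    "dh": Consonant("inter-dental", "fricative", True),
    "s": Consonant("alveolar", "fricative", False),
    "z": Consonant("alveolar", "fricative", True),
    "sh": Consonant("palatal", "fricative", False),
    "zh": Consonant("palatal", "fricative", True),
    "h": Consonant("glottal", "fricative", False),
    "ch": Consonant("palatal", "affricate", False),
    "jh": Consonant("palatal", "affricate", True),
    "m": Consonant("bilabial", "nasal", True),
    "n": Consonant("alveolar", "nasal", True),
    "ng": Consonant("velar", "nasal", True),
    "w": Consonant("bilabial", "approximant", True),
    "l": Consonant("alveolar", "approximant", True),
    "r": Consonant("alveolar", "approximant", True),
    "y": Consonant("palatal", "approximant", True),
    "dx": Consonant("alveolar", "flap", True),
}

def natural_class(place=None, manner=None, voiced=None):
    """Return the sorted symbols matching every feature that is not None."""
    return sorted(
        sym for sym, c in CONSONANTS.items()
        if (place is None or c.place == place)
        and (manner is None or c.manner == manner)
        and (voiced is None or c.voiced == voiced)
    )

print(natural_class(manner="stop", voiced=False))    # voiceless stops
print(natural_class(place="alveolar", manner="nasal"))
```

Sets of phones picked out by shared features like these are the kind of classes that phonological rules usually refer to.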

American English vowel space

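[Figure: American English vowel space, with FRONT to BACK on the horizontal axis and HIGH to LOW on the vertical axis. Vowels plotted: iy, ih, eh, ae, aa, ao, uw, uh, ah, ax, ix, ux. Diphthongs: ey, ow, aw, oy, ay.]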

Acoustic landmarks

“Patricia and Patsy and Sally”

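[Figure: the utterance annotated with phone labels, including the stops [p] and [t], the fricatives [sh] and [s], the nasal [n], the lateral [l], and the vowels [ix], [ih], [ax], [ae], [iy]]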

Syllables

• Syllabification important for
  – pronunciation: deny/denim
  – speaking rate calculation: syllables per second
  – word recognition in ASR
• (onset) + nucleus + (coda):
  – c a t
  – a
  – a t
  – t o
• Lexical stress: primary, secondary, tertiary
  – telephone
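The (onset) + nucleus + (coda) template can be sketched for single syllables like the ones listed above. This is a simplified illustration, not a full syllabifier: letters stand in for phones, the nucleus is assumed to be the first run of vowel letters, and `split_syllable` is a hypothetical name.

```python
VOWELS = set("aeiou")  # assumption: letters stand in for phones

def split_syllable(syllable):
    """Split one syllable into (onset, nucleus, coda).

    The nucleus is the first run of vowel letters. The onset is whatever
    precedes it and the coda is whatever follows it; both may be empty.
    """
    start = next((i for i, ch in enumerate(syllable) if ch in VOWELS), None)
    if start is None:
        raise ValueError(f"no nucleus found in {syllable!r}")
    end = start
    while end < len(syllable) and syllable[end] in VOWELS:
        end += 1
    return syllable[:start], syllable[start:end], syllable[end:]

for s in ["cat", "a", "at", "to"]:
    print(s, split_syllable(s))
```

Only the nucleus is obligatory, which is why "a" yields an empty onset and an empty coda.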

Phonological Rules

• Not all instances of a given phone [x] sound/look alike
• Phoneme /x/ may have many allophones
• Phonological rules map phonemes in context to allophones, e.g.
  – simple rules: /{t,d}/ --> [dx] / V’ _ V
  – FSAs, FSTs
  – declarative constraints: t: V’ _ V
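A context-sensitive rewrite rule of this kind, here flapping of /t/ and /d/ between a stressed vowel and a following vowel, can be sketched as a pass over a phone sequence. The sketch assumes vowels carry a trailing stress digit (as in CMUdict-style transcriptions) with "1" marking the stressed vowel V’. The function names are hypothetical, and a plain loop stands in for the FST that a real system might use.

```python
def is_vowel(phone):
    # Assumption: vowels carry a trailing stress digit, e.g. "ah1", "er0".
    return phone[-1].isdigit()

def is_stressed_vowel(phone):
    # Assumption: "1" marks the stressed vowel written V' in the rule.
    return phone.endswith("1")

def apply_flapping(phones):
    """Rewrite t or d as the flap dx when it sits between V' and V."""
    out = list(phones)
    for i in range(1, len(phones) - 1):
        if (phones[i] in ("t", "d")
                and is_stressed_vowel(phones[i - 1])
                and is_vowel(phones[i + 1])):
            out[i] = "dx"
    return out

print(apply_flapping(["b", "ah1", "t", "er0"]))   # "butter": t is flapped
print(apply_flapping(["ah0", "t", "ae1", "k"]))   # "attack": context not met
```

The rule reads its context from the input sequence rather than from the partially rewritten output, so one application never feeds another within the same pass.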

Allophones of /t/

• What we would consider a single ‘sound’ can be pronounced differently depending on the phonetic context. For example, the phoneme /t/:

Figure 4.8: Jurafsky & Martin (2000), page 104.

Application: Word Pronunciation for TTS

• Pronouncing dictionaries (the: [‘dhax], [‘dhiy])
• Problems:
  – Homographs (bass/bass, wind/wind, desert/desert)
  – Abbreviations (dr., st.)
  – Numbers (2125551212)
  – Acronyms (NAACL, IDIAP)
  – Morphological variation (unrelentingly)
  – Proper names and unknown words
• rules + dictionaries / dictionaries + rules
• Hybrid model:
  – FSTs model individual word pronunciation in lexicon (e.g. reg-noun-stem entry c:k a:ae t:t)
  – FSAs model morphology (e.g. reg-noun-stem + s)
  – FSTs for pronunciation rules (e.g. s --> z)
  – special rules to model name and acronym pronunciation
  – default letter2sound rules for other words
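The layering in the hybrid model (lexicon first, then morphology plus a pronunciation rule, then default letter-to-sound rules) can be sketched in a few lines. This is a toy under stated assumptions: dictionaries and if-statements stand in for the FSTs and FSAs, the lexicon and the one-letter-per-phone fallback table are invented for illustration, and the name and acronym rules are omitted.

```python
# Toy lexicon (assumed entries), e.g. the stem c:k a:ae t:t.
LEXICON = {"cat": ["k", "ae", "t"], "dog": ["d", "ao", "g"]}

# Voiceless phones after which the plural suffix stays [s].
VOICELESS = {"p", "t", "k", "f", "th"}

# Naive default letter-to-sound table (an assumption, far too crude for real use).
LETTER2SOUND = {
    "a": "ae", "b": "b", "c": "k", "d": "d", "e": "eh", "f": "f", "g": "g",
    "h": "h", "i": "ih", "j": "jh", "k": "k", "l": "l", "m": "m", "n": "n",
    "o": "aa", "p": "p", "q": "k", "r": "r", "s": "s", "t": "t", "u": "ah",
    "v": "v", "w": "w", "x": "k s", "y": "y", "z": "z",
}

def pronounce(word):
    word = word.lower()
    # 1. Whole-word lexicon lookup.
    if word in LEXICON:
        return list(LEXICON[word])
    # 2. Morphology: known stem + s, with the rule s --> z after a voiced phone.
    if word.endswith("s") and word[:-1] in LEXICON:
        stem = list(LEXICON[word[:-1]])
        suffix = "s" if stem[-1] in VOICELESS else "z"
        return stem + [suffix]
    # 3. Default letter-to-sound rules for everything else.
    phones = []
    for ch in word:
        phones.extend(LETTER2SOUND.get(ch, "").split())
    return phones

print(pronounce("cats"))
print(pronounce("dogs"))
print(pronounce("fax"))
```

Ordering the stages from most specific to most general means the crude fallback only applies to words that the lexicon and the morphology cannot account for.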

Inventive (and sometimes useful) Approaches for Pronouncing Unknown Words

• Rhyming analogy: varoom/room, todo/dodo
• Linguistic origin: Infiniti, vingt, Perez
• Abbreviation expansion:
  – spacious living/dining rm w/frplc --> dining room with fireplace
  – pls?
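Abbreviation expansion of the kind shown in the classified-ad example can be sketched as a table lookup over tokens. The table below contains only the three abbreviations from that example; the names `ABBREVIATIONS` and `expand` are hypothetical, and a real system would need far more entries plus context to resolve ambiguous cases.

```python
import re

# Hand-built expansion table covering only the slide's example (an assumption).
ABBREVIATIONS = {"rm": "room", "w/": "with", "frplc": "fireplace"}

def expand(text):
    """Expand known abbreviations; unknown tokens are left unchanged."""
    # Separate a leading "w/" from the word it is attached to.
    text = re.sub(r"\bw/", "w/ ", text)
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in text.split())

print(expand("spacious living/dining rm w/frplc"))
print(expand("pls"))  # not in the table, so it passes through unchanged
```

A token such as "pls" has no entry and comes back untouched, which is the point where a lookup table runs out and guessing begins.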

Summary

• Phones realize phonemes in different contexts
  – Different places and manners of articulation result in acoustic differences that can be detected by ASR systems as well as people

• Versatile FSTs can model phonological as well as morphological and spelling systems

• Many creative approaches toward pronunciation modeling for TTS

• Next time: Read Ch 5
