
J. Gustafson, CTT, KTH

Speech synthesis

DT2112

Joakim Gustafson, CTT, KTH

School for Computer Science and Communication
Many slides prepared by Olov Engwall (and others)

J. Gustafson, CTT, KTH 2

Text-To-Speech synthesis (TTS)

The automatic generation of synthesized sound or visual output from any phonetic string.

Our focus in this course!

J. Gustafson, CTT, KTH 3

Different kinds of speech synthesis

• Recorded speech

– Words or phrases (telephone banking)

– Fixed vocabulary – maintenance problems…

• Concatenative speech synthesis

• Parametric synthesis

• Multimodal synthesis

J. Gustafson, CTT, KTH 4

What a synthesiser is to convey

• The linguistic component: semantic information that is part of the speaker’s language (e.g. question intonation)

• The paralinguistic component: the speaker’s attitudinal or emotional states, sociolect and regional dialect.

• The extralinguistic component: the individuality, gender and age of a certain speaker. It can be judged independently of the language.

To adapt a speech synthesizer to a certain speaker, we need both the para- and extralinguistic components.

J. Gustafson, CTT, KTH

Desirable synthesis features from a dialogue perspective

• Real-time, Incremental, Interruptible

• Explicit control of prosodic parameters
– Fundamental frequency
– Intensity
– Natural-sounding lengthening, hesitation, interruptions

• Generation of extra-linguistic sounds
– Filled pauses
– Creaks/Gargles
– Smacks, inhalations and exhalations to give the turn

J. Gustafson, CTT, KTH 6

The synthesis space

Intelligibility

Naturalness

Bit rate

Vocabulary

Units

Complexity

Processing needs

Flexibility

Speech Knowledge

Cost


J. Gustafson, CTT, KTH 7

The steps in TTS

[Figure: the TTS pipeline from input text ("abcd") through linguistic analysis, prosodic analysis and phonetic description to sound generation, supported by language identification, morphological analysis, lexicons and rules, syntax analysis, unit selection and concatenation rules.]

J. Gustafson, CTT, KTH

The automatic generation of synthesized sound from any text string.

From text

J. Gustafson, CTT, KTH 9

Preprocessor

• Sentence end detection (semicolon, period; a period may mark a ratio, a time, a decimal point or the end of a sentence)

• Abbreviations (e.g. – for instance): changed to their full form with the help of lexicons

• Acronyms (I.B.M. can be read as a sequence of characters; NASA can be read the default way, as a word)

• Numbers (once detected, interpreted as rational numbers, times of day, dates or ordinals depending on their context)

• Idioms (e.g. "in spite of", "as a matter of fact" – combined into a single FSU using a special lexicon)
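A minimal text-normalization sketch along these lines; the abbreviation table and the number rules below are illustrative assumptions, not the rules of any particular system:

```python
import re

# Hypothetical abbreviation lexicon; a real system would have many more entries.
ABBREVIATIONS = {"e.g.": "for instance", "dr.": "doctor", "etc.": "et cetera"}

def expand_abbreviations(text):
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(re.escape(abbr), full, text, flags=re.IGNORECASE)
    return text

def expand_numbers(text):
    # Very rough context rules: "3.5" is read as a decimal, "12:30" as a time of day.
    text = re.sub(r"\b(\d+)\.(\d+)\b", r"\1 point \2", text)
    text = re.sub(r"\b(\d+):(\d+)\b", r"\1 \2", text)
    return text

def normalize(text):
    return expand_numbers(expand_abbreviations(text))

print(normalize("Meet Dr. Smith at 12:30, e.g. around mile 3.5."))
```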

J. Gustafson, CTT, KTH 10

Grapheme-to-phoneme conversion

• Dictionary:
– Store a maximum of phonological knowledge in a lexicon.
– Compounding rules describe how the morphemes of dictionary items are modified.
– Hand-corrected, expensive.
– The lexicon is never complete: needs an out-of-vocabulary pronouncer, transcribed by rule.

• Rules:
– A set of letter-to-sound (grapheme-to-phoneme) rules.
– Words pronounced in such a particular way that they have their own rule are stored in an exceptions dictionary.
– Fast & easy, but lower accuracy.

• Machine learning:
– CART trees
– Pronunciation by analogy
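A toy sketch of the dictionary-first strategy with a letter-to-sound fallback; the lexicon entries and the one-letter rules are invented for the example:

```python
# Hypothetical pronunciation lexicon (word -> phoneme string).
LEXICON = {"speech": "s p iy ch", "synthesis": "s ih n th ax s ih s"}

# Extremely simplified letter-to-sound rules, used only for out-of-vocabulary words.
LETTER_TO_SOUND = {"a": "ae", "b": "b", "c": "k", "d": "d", "e": "eh",
                   "i": "ih", "n": "n", "o": "ao", "s": "s", "t": "t"}

def grapheme_to_phoneme(word):
    word = word.lower()
    if word in LEXICON:                      # dictionary lookup first
        return LEXICON[word].split()
    # fall back to naive one-letter-one-phone rules
    return [LETTER_TO_SOUND.get(ch, ch) for ch in word]

print(grapheme_to_phoneme("speech"))   # found in the lexicon
print(grapheme_to_phoneme("bandit"))   # out of vocabulary, transcribed by rule
```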

J. Gustafson, CTT, KTH 11

Prosody

• Prosody = melody, rhythm, “tone” of speech

• Not what words are said, but how they are said

• Prosody is conveyed using:
– Pitch
– Phone durations
– Energy

• Human languages use prosody to convey:

– phrasing and structure (e.g. sentence boundaries)

– disfluencies (e.g. false starts, repairs, fillers)

– sentence mode (statement vs question)

– emotional attitudes (urgency, surprise, anger)

J. Gustafson, CTT, KTH 12

Intonation: F0 contour

[Figure annotations: large pitch range (female); authoritative (final fall); emphasis on "Finance" (H*); the final rise signals more information to come]

• Word stress and sentence intonation

– each word has at least one syllable which is spoken with higher prominence

– in each phrase the stressed syllable can be accented depending on the semantics and syntax of the phrase

• Prosody relies on syntax, semantics, pragmatics: personal reflection of the reader.


J. Gustafson, CTT, KTH 13

Pitch contour modeling

• Tonetics (the British school)
– tone groups composed of syllables {unstressed, stressed, accented or nuclear}
– nuclear syllables have nuclear tones {fall, rise, fall-rise, rise-fall}

• ToBI (Tones and Break Indices)
– Phrases split into intermediate phrases composed of syllables.
– Relative tone levels: high (H) or low (L) (plus diacritics) at every intonational or intermediate phrase boundary (%) and on every accented syllable

• Stylization method (prosodic pattern measured from natural speech)
– Demo

J. Gustafson, CTT, KTH

The automatic generation of synthesized sound from any text string.

To Speech

J. Gustafson, CTT, KTH 15

Synthesis approaches

By concatenation: elementary speech units are stored in a database and then concatenated and processed to produce the speech signal.

By rule: speech is produced by mathematical rules that describe the influence of phonemes on one another.

J. Gustafson, CTT, KTH

Research trends in speech synthesis

1950 Synthesis by analysis

1960 Phonetic rules

1970 Linguistic processing

1980 Concatenation

1990 Automatic procedures

2000

J. Gustafson, CTT, KTH

Text-to-Speech Synthesis Evolution

Timeline: 1962, 1967, 1972, 1982, 1987, 1992, 1997, 2002

• Formant synthesis (Bell Labs; Joint Speech Research Unit; MIT (DECtalk); Haskins Lab; KTH): poor intelligibility, poor naturalness, small footprint

• LPC-based diphone/dyad synthesis (Bell Labs; CNET; Bellcore; Berkeley Speech Technology): good intelligibility, poor naturalness

• Unit selection synthesis (ATR in Japan; CSTR in Scotland; BT in England; AT&T Labs (1998); L&H in Belgium): good intelligibility, customer-quality naturalness (limited context)

• HMM synthesis (HTS in Japan; CSTR in Scotland): multi-speaker training, speaker adaptation; naturalness, generative, small footprint

J. Gustafson, CTT, KTH 18

Synthesis by rule


J. Gustafson, CTT, KTH

Articulatory Synthesis

J. Gustafson, CTT, KTH 20

Articulatory synthesis: potential use

• Articulatory synthesis
– Calculations directly from cross-sectional areas
– Fluid dynamics calculations

• Visual synthesis
– Articulation training

• Demonstrations and research

J. Gustafson, CTT, KTH 21

Articulatory parameters

• Jaw opening

• Lip rounding

• Lip Protrusion

• Tongue position

• Tongue height

• Tongue tip

• Velum

• Hyoid

J. Gustafson, CTT, KTH

From articulation to acoustics

[Figure: the chain from articulation to acoustics: vocal tract model → cross-sections → area function → tubes → transfer function → waveform, or alternatively 3D air-flow calculations.]

J. Gustafson, CTT, KTH

Summary: Articulatory Synthesis

Benefits:
• Speech production in the same way as humans
• Can be made with few parameters
• The changes are intuitive (raise the tongue tip, round the lips)

Disadvantages:
• Computationally demanding
• Problems with consonants
• Articulatory measurements required
• State-of-the-art articulatory synthesis still sounds bad

J. Gustafson, CTT, KTH

Formant Synthesis


J. Gustafson, CTT, KTH

Formant synthesis (1959-1987)

• Haskins, 1959

• KTH – Stockholm, 1962

• Bell Labs, 1973

• MIT, 1976

• MITalk, 1979

• Speak & Spell, 1980

• Bell Labs, 1985

• DECtalk, 1987

J. Gustafson, CTT, KTH 26

• OVE I (1953)

• The original & a new version on the computer + OVE II (1962)

Let us take a look at OVE

formant.exe (Command Line)

J. Gustafson, CTT, KTH 27

OVE II

J. Gustafson, CTT, KTH

Rule-driven formant synthesis

• Parameters are generated by rule (RULSYS, Carlson et al.)

• Formant values are generated by interpolating between target frequencies

• Parameters are fed to a synthesizer (GLOVE, Carlson et al.)

[Figure: rule-generated formant tracks, 0-0.9 s and 0-4000 Hz, for the segment sequence M O B I: L sil.]

J. Gustafson, CTT, KTH

Summary: formant synthesis

Benefits:
• Possible to change the voice to get different:
– speakers
– emotions
– voice qualities
• Small footprint

Disadvantages:
• Hard to achieve naturalness in the voice source
• Some consonant sounds are hard to model with formants (bursts)

J. Gustafson, CTT, KTH 30

From rule-based to concatenative synthesis

• Rule-based synthesis sounds unnatural, while concatenative synthesis provides (piece-wise) high-quality speech.

• Certain sounds are hard to produce by rule but easy to concatenate:
– Bursts and voiceless stops are too difficult

• Rule-based synthesis had the advantage of a small footprint, but storing the segment database is no longer an issue

• Change of applications:
– From reading machines for the blind to spoken dialogue systems


J. Gustafson, CTT, KTH 31

Synthesis by concatenation

J. Gustafson, CTT, KTH 32

Let's get the terms straight

Concatenative synthesis
Definition: All kinds of synthesis based on the concatenation of units, regardless of type (sound, formant trajectories, articulatory parameters) and size (diphones, triphones, syllables, longer units). There is only one candidate per setting.
Everyday use: Concatenation of same-size sound units.

Unit selection synthesis
Definition: All kinds of synthesis based on the concatenation of units where there are several candidates to choose from, regardless of whether the candidates have the same fixed size or a variable size.
Everyday use: Concatenation of variable-sized sound units.

J. Gustafson, CTT, KTH 33

Database preparation when building a concatenative synthesis

• Choose the speech units (phone, diphone, sub-word unit, cluster-based unit selection)

• Compile and record utterances

• Segment the signal and extract speech units

• Store segment waveforms (along with context) and information in a database: dictionary, waveform, pitch marks
e.g. "ch-l r021 412.035 463.009 518.23"
(diphone, file, start time, middle time, end time)

• Pitch mark file: a list of each pitch mark position in the file

• Extract parameters; create a parametric segment database (for data compaction and prosody matching)

• Perform amplitude equalization (prevents mismatches)
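A sketch of how such an index entry could be parsed; the field layout follows the example line above, while the time unit (milliseconds) is an assumption:

```python
from collections import namedtuple

DiphoneEntry = namedtuple("DiphoneEntry", "diphone wavefile start middle end")

def parse_index_line(line):
    # e.g. "ch-l r021 412.035 463.009 518.23"
    diphone, wavefile, start, middle, end = line.split()
    return DiphoneEntry(diphone, wavefile, float(start), float(middle), float(end))

entry = parse_index_line("ch-l r021 412.035 463.009 518.23")
print(entry.diphone, entry.end - entry.start)  # unit name and its duration (ms assumed)
```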

J. Gustafson, CTT, KTH 34

Signal manipulations in concatenative synthesis

• Prosodic modifications
– Possibility to modify F0
– Possibility to lengthen or shorten segments

• Spectral modifications
– Interpolation of the spectrum at joints

J. Gustafson, CTT, KTH 35

Diphone Synthesis

Sequences of a particular sound/phone in all its environments of occurrence, or all/most two-phone sequences occurring in a language: _auto_ -> _a, au, ut, to, o_

• Rationale: the center of a phonetic realization is the most stable region, whereas the transition from one segment to another contains the most interesting phenomena, and is thus the hardest to model.

Assignment: Diphone "synthesis"; cut and paste

J. Gustafson, CTT, KTH

Diphones

• Need O(phones²) number of units
– Some combinations don't exist (hopefully)
– The AT&T system (Olive et al. 1998) had 43 phones
• 1849 possible diphones
• Phonotactics ([h] only occurs before vowels); don't need to keep diphones across silence
• Only 1172 actual diphones
– May include stress, consonant clusters
• So could have more
– Lots of phonetic knowledge in design

• Database relatively small
– Around 8 megabytes for English (16 kHz, 16 bit)

Slide from Richard Sproat
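A quick sanity check of the counting argument above (the 1172 figure is taken from the slide; the average unit length is a derived estimate, not a quoted number):

```python
n_phones = 43
possible_diphones = n_phones ** 2          # 1849 pairs if every combination occurred
actual_diphones = 1172                     # after phonotactic restrictions (from the slide)

# Rough database size: 8 MB of 16 kHz, 16-bit mono audio
bytes_total = 8 * 1024 * 1024
bytes_per_second = 16000 * 2
seconds_of_audio = bytes_total / bytes_per_second       # ~262 s of speech
avg_unit_length = seconds_of_audio / actual_diphones    # ~0.22 s per diphone

print(possible_diphones, round(seconds_of_audio), round(avg_unit_length, 3))
```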


J. Gustafson, CTT, KTH

Building diphone schemata

• Find the list of phones in the language:
– Plus interesting allophones
– Stress, tones, clusters, onset/coda, etc.
– Foreign (rare) phones.

• Build carriers for:
– Consonant-vowel, vowel-consonant
– Vowel-vowel, consonant-consonant
– Silence-phone, phone-silence
– Other special cases

• Check the output:
– List all diphones and justify missing ones
– Every diphone list has mistakes

Slide from Richard Sproat J. Gustafson, CTT, KTH

Designing a diphone inventory: Nonsense words

• Build set of carrier words:
– pau t aa b aa b aa pau
– pau t aa m aa m aa pau
– pau t aa m iy m aa pau
– pau t aa m iy m aa pau
– pau t aa m ih m aa pau

• Advantages:
– Easy to get all diphones
– Likely to be pronounced consistently
• No lexical interference

• Disadvantages:
– (possibly) bigger database
– Speaker becomes bored

Slide from Richard Sproat

J. Gustafson, CTT, KTH

Designing a diphone inventory: Natural words

• Greedily select sentences/words:
– Quebecois arguments
– Brouhaha abstractions
– Arkansas arranging

• Advantages:
– Will be pronounced naturally
– Easier for speaker to pronounce
– Smaller database? (505 pairs vs. 1345 words)

• Disadvantages:
– May not be pronounced correctly

Slide from Richard Sproat J. Gustafson, CTT, KTH

Making recordings consistent:

• Diphone should come from mid-word
– Help ensure full articulation

• Performed consistently
– Constant pitch (monotone), power, duration

• Use (synthesized) prompts:
– Helps avoid pronunciation problems
– Keeps speaker consistent
– Used for alignment in labeling

Slide from Richard Sproat

J. Gustafson, CTT, KTH

Recording conditions

• Ideal:
– Anechoic chamber
– Studio quality recording
– EGG signal

• More likely:
– Quiet room
– Cheap microphone/sound blaster
– No EGG
– Headmounted microphone

• What we can do:
– Repeatable conditions
– Careful setting of audio levels

Slide from Richard Sproat J. Gustafson, CTT, KTH

Labeling Diphones

• Run an ASR system in forced alignment mode
– Forced alignment:
• In: a trained ASR system, a wavefile, and a word transcription of the wavefile
• Returns: an alignment of the phones in the words to the wavefile

• Much easier than phonetic labeling:
– The words are defined
– The phone sequence is generally defined
– They are clearly articulated
– But sometimes the speaker still pronounces words wrong, so need to check

• Phone boundaries less important
– ±10 ms is okay

• Mid-phone boundaries important
– Where is the stable part?
– Can it be automatically found?

Slide from Richard Sproat


J. Gustafson, CTT, KTH

Finding diphone boundaries

• Stable part in phones
– For stops: one third in
– For phone-silence: one quarter in
– For other diphones: 50% in

• In the time alignment case:
– Given explicitly known diphone boundaries in the prompt label file
– Use dynamic time warping to find the same stable point in the new speech

• Optimal coupling (Taylor and Isard 1991, Conkie and Isard 1996)
– Instead of precutting the diphones, wait until we are about to concatenate the diphones together
– Then take the two complete (uncut) diphones
– Find optimal join points by measuring the cepstral distance at potential join points, and pick the best

Slide from Richard Sproat J. Gustafson, CTT, KTH
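A minimal sketch of the optimal-coupling idea: compare cepstral frames near the nominal boundary of the two uncut diphones and pick the join with the smallest distance. The MFCC extraction, the window size and the random test data are assumptions:

```python
import numpy as np

def best_join_point(left_cepstra, right_cepstra, window=10):
    """left_cepstra, right_cepstra: (frames, coeffs) arrays for the two uncut diphones.
    Compare the last `window` frames of the left unit with the first `window` frames
    of the right unit and return the frame pair with the smallest cepstral distance."""
    best = (None, None, np.inf)
    for i in range(max(0, len(left_cepstra) - window), len(left_cepstra)):
        for j in range(min(window, len(right_cepstra))):
            dist = np.linalg.norm(left_cepstra[i] - right_cepstra[j])
            if dist < best[2]:
                best = (i, j, dist)
    return best  # (cut frame in left unit, cut frame in right unit, distance)

# Toy usage with random "cepstra"
rng = np.random.default_rng(0)
print(best_join_point(rng.normal(size=(40, 12)), rng.normal(size=(40, 12))))
```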

Summary: Diphone Synthesis

• Well-understood, mature technology

• Augmentations
– Stress
– Onset/coda
– Demi-syllables

• Problems:
– Signal processing still necessary for modifying durations
– Source data is still not natural
– Units are just not large enough; can't handle word-specific effects, etc.

Slide from Dan Jurafsky

J. Gustafson, CTT, KTH

From diphone synthesis to Unit Selection Synthesis

• Natural data solves problems with diphones
– Diphone databases are carefully designed but:
• Speaker makes errors
• Speaker doesn't speak intended dialect
• Require database design to be right
– If it's automatic:
• Labeled with what the speaker actually said
• Coarticulation, schwas, flaps are natural

• "There's no data like more data"
– Lots of copies of each unit mean you can choose just the right one for the context
– Larger units mean you can capture wider effects

Slide from Dan Jurafsky J. Gustafson, CTT, KTH

Unit selection synthesis

Slide from Tokuda

J. Gustafson, CTT, KTH

Unit Selection Intuition

• Given a big database

• For each segment that we want to synthesize
– Find the unit in the database that is the best to synthesize this target segment

• What does "best" mean?
– Target cost: closest match to the target description, in terms of:
• Phonetic context
• Pitch, power, duration, phrase position
– Concatenation cost: the difference between the end of diphone 1 and the start of diphone 2:
• Matching formants + other spectral characteristics
• Matching energy
• Matching F0

Slide from Dan Jurafsky

Assignment 1: Practical exercises with the calculation of target and concatenation cost.

J. Gustafson, CTT, KTH 48

Target cost measures

Manhattan (city block) distance:

D = \sum_i |x_i - y_i|

Euclidean distance:

D = \sqrt{\sum_i (x_i - y_i)^2}
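The two distances above written out in code (a pure-Python sketch; the example feature vectors are invented):

```python
def manhattan_distance(x, y):
    # D = sum_i |x_i - y_i|
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def euclidean_distance(x, y):
    # D = sqrt(sum_i (x_i - y_i)^2)
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

target    = [120.0, 0.08, 1.0]   # e.g. F0 (Hz), duration (s), stressed
candidate = [135.0, 0.07, 1.0]
print(manhattan_distance(target, candidate), euclidean_distance(target, candidate))
```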


J. Gustafson, CTT, KTH 49

Concatenation cost measures

• Kullback-Leibler distance:

D = \sum_{i=1}^{N} (x_i - y_i) \log \frac{x_i}{y_i}

• Mahalanobis distance:

D = \sum_i \frac{(x_i - y_i)^2}{\sigma_i^2}

The Mahalanobis distance is useful when multi-normal distributions lead to non-spherically symmetric distributions.
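A sketch of the diagonal-covariance Mahalanobis form and the Kullback-Leibler-style distance above; the per-dimension variances and the example join features are invented:

```python
from math import log

def mahalanobis_diag(x, y, variances):
    # D = sum_i (x_i - y_i)^2 / sigma_i^2  (diagonal-covariance form)
    return sum((xi - yi) ** 2 / var for xi, yi, var in zip(x, y, variances))

def kl_like_distance(x, y):
    # D = sum_i (x_i - y_i) * log(x_i / y_i), defined for positive spectral values
    return sum((xi - yi) * log(xi / yi) for xi, yi in zip(x, y))

end_of_unit_1   = [1.2, 0.8, 0.5]   # e.g. spectral features at the join
start_of_unit_2 = [1.0, 0.9, 0.4]
print(mahalanobis_diag(end_of_unit_1, start_of_unit_2, [0.5, 0.5, 0.5]))
print(kl_like_distance(end_of_unit_1, start_of_unit_2))
```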

J. Gustafson, CTT, KTH 50

The units in Unit Selection

• Different types of units: e.g. diphones, phones, syllables, words, etc.

• Multiple occurrences of the units cover a wide space of the spectral and prosodic parameters

• Units nearest in this space to the targets will be chosen and will require only minor modification

• The corpus is segmented into phonetic units, indexed, and used as-is

• Selection is made on-line

• The trend is towards longer and longer units

[Embedded sound examples (.wav) illustrating the trend, 1999-2005]

J. Gustafson, CTT, KTH 51

Features of Unit Selection Synthesis

• Large databases of recorded natural speech

• Minimal processing

• Annotation of database – what information is needed?

• Few cuts > maximally long units selected (but context and prosody must fit well)

• Target and concatenation costs

Slide from Dan Jurafsky

J. Gustafson, CTT, KTH

Database creation: a good speaker

• Professional speakers are always better:
– Consistent style and articulation
– Although these databases are carefully labeled

• Ideally (according to AT&T experiments):
– Record 20 professional speakers (small amounts of data)
– Build simple synthesis examples
– Get many (200?) people to listen and score them
– Take best voices

• Correlates for human preferences:
– High power in unvoiced speech
– High power in higher frequencies
– Larger pitch range

Text from Paul Taylor and Richard Sproat

J. Gustafson, CTT, KTH

Database creation: good recording conditions

• Good script
– Application dependent helps
• Good word coverage
• News data synthesizes as news data
• News data is bad for dialog
– Good phonetic coverage, especially wrt context
– Low ambiguity
– Easy to read

• Annotate at phone level, with stress, word information, phrase breaks

Text from Paul Taylor and Richard Sproat J. Gustafson, CTT, KTH

Creating databaseCreating databaseCreating databaseCreating database

• Unlike diphone synthesis, prosodic variation is a good thing

• Accurate annotation is crucial

• Pitch annotation needs to be very accurate

• Phone alignments can be done automatically, as described for diphones

Slide from Dan Jurafsky


J. Gustafson, CTT, KTH

Practical System Issues

• Size of a typical system (Rhetorical rVoice):
– ~300 MB

• Speed:
– For each diphone, an average of 1000 units to choose from, so:
– 1000 target costs
– 1000 × 1000 join costs
– Each join cost, say, 30 × 30 floating point calculations
– 10-15 diphones per second
– ≈ 10 billion floating point calculations per second

• But commercial systems must run ~50x faster than real time

• Heavy pruning essential:
– 1000 units -> 25 units

Slide from Paul Taylor J. Gustafson, CTT, KTH
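The back-of-the-envelope numbers from the slide, spelled out (the 30 × 30 operations per join cost and 10-15 diphones per second are the slide's figures; 12 diphones per second is picked as a representative value):

```python
units_per_diphone = 1000
join_costs = units_per_diphone * units_per_diphone      # 1,000,000 join costs per diphone
flops_per_join = 30 * 30                                 # ~900 floating point ops each
diphones_per_second = 12                                 # "10-15 diphones per second"

flops_per_second = join_costs * flops_per_join * diphones_per_second
print(f"{flops_per_second:.1e} FLOP/s")   # ~1.1e10, i.e. on the order of 10 billion

# After pruning 1000 -> 25 candidates per diphone:
pruned = 25 * 25 * flops_per_join * diphones_per_second
print(f"{pruned:.1e} FLOP/s")             # ~6.8e6, roughly a factor of 1600 less
```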

Summary: Unit Selection

• Advantages
– Quality is far superior to diphones
– Natural prosody selection sounds better
– Non-linguistic features of the speaker's voice are built in

• Disadvantages:
– Fixed voice
– Quality can be very bad in places
• HCI problem: a mix of very good and very bad is quite annoying
– Large footprint, and it is computationally expensive
– Can't synthesize everything you want:
• The diphone technique can move emphasis
• Unit selection gives a good (but possibly incorrect) result

Slide from Richard Sproat

J. Gustafson, CTT, KTH

From Unit selection to HMM synthesis

• Problems with Unit Selection Synthesis
– Discontinuities: can't modify the signal
– Hit or miss: the database often doesn't have exactly what you want
– Fixed voice

• Solution: HMM (Hidden Markov Model) Synthesis
– Stable, smooth and easy to create multiple voices
– Sounds unnatural to researchers, but naïve subjects prefer it

Example: Nina as unit selection and HMM synthesis voice

Slide from Dan Jurafsky J. Gustafson, CTT, KTH

HMM Synthesis

Slide from Tokuda

J. Gustafson, CTT, KTH 59

Hidden Markov Models

• A HMM is a machine, with a limited number of possible states.

• The transition between two states is regulated by probabilities.

• Every transition results in an observation with a certain probability.

• The states are hidden, only the observations are visible.

[Figure: a left-to-right HMM with hidden states i, j, k, l, transition probabilities Pii, Pij, Pjj, Pjk, Pkl, Pll, and observations Oi, Oj, Ok, Ol]
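A minimal left-to-right HMM sketch in the spirit of the figure: hidden states emit observations, transitions are governed by probabilities, and state durations arise from the self-loops. The probabilities are illustrative only:

```python
import random

# Left-to-right HMM: each state either stays (self-loop) or moves to the next state.
STAY_PROB = {"i": 0.6, "j": 0.7, "k": 0.5, "l": 0.0}          # self-loop probabilities
EMISSIONS = {"i": ["Oi"], "j": ["Oj"], "k": ["Ok"], "l": ["Ol"]}
ORDER = ["i", "j", "k", "l"]

def sample_observation_sequence(seed=0):
    random.seed(seed)
    state_idx, observations = 0, []
    while True:
        state = ORDER[state_idx]
        observations.append(random.choice(EMISSIONS[state]))   # emit from the hidden state
        if state_idx == len(ORDER) - 1:
            return observations                                 # reached the final state
        if random.random() >= STAY_PROB[state]:                 # leave the current state
            state_idx += 1

print(sample_observation_sequence())   # only the observations are visible, e.g. ['Oi', 'Oi', 'Oj', ...]
```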

J. Gustafson, CTT, KTH 60

HMMs in synthesis


J. Gustafson, CTT, KTH 61

Relationship between unit selection and HMM synthesis

Slide from Tokuda J. Gustafson, CTT, KTH 62

Relationship between unit selection and HMM synthesis 2

Slide from Tokuda

J. Gustafson, CTT, KTH 63

The training part

• The training is automatic. You need:
– The text + recordings of about 1000 sentences

• The training on 1000 sentences
– takes 24 hours and generates a voice of less than 1 MB

• Separate HMMs for: spectrum, F0, duration

• Training in two steps:
1. Context-independent models
2. Use these models to create context-dependent models.

• Clustering:
– Storing all contexts requires much space
– It may be difficult to find alternatives for missing models
– Many models are very similar = redundancy

J. Gustafson, CTT, KTH 64

Examples of features in HMM synthesis training

• Segment features:
– immediate context
– position in syllable

• Syllable features:
– stress and lexical accent type
– position in word and phrase

• Word features:
– number of syllables
– position in phrase
– morphological feature (compound or not)
– part-of-speech tag (content or function word)

• Phrase features:
– phrase length in terms of syllables and words

• Utterance features:
– length in syllables, words and phrases

• Speaker:
– dialect, speaking style, emotional state

J. Gustafson, CTT, KTH 65

Clustering

• Groups a large database into clusters

• Three trees: duration, F0 and spectrum

• Division based on yes/no questions
– Grouping acoustically similar phonemes
– Features
– Context
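A sketch of the yes/no-question splitting behind such clustering trees; the questions, the toy models and the variance criterion are invented for illustration (real systems split on likelihood gain):

```python
# Each context-dependent model is described by a feature dict plus some acoustic value.
models = [
    {"phone_is_vowel": True,  "stressed": True,  "duration": 0.12},
    {"phone_is_vowel": True,  "stressed": False, "duration": 0.08},
    {"phone_is_vowel": False, "stressed": True,  "duration": 0.06},
    {"phone_is_vowel": False, "stressed": False, "duration": 0.05},
]
QUESTIONS = ["phone_is_vowel", "stressed"]

def variance(items):
    if not items:
        return 0.0
    mean = sum(m["duration"] for m in items) / len(items)
    return sum((m["duration"] - mean) ** 2 for m in items) / len(items)

def best_question(items):
    # Pick the yes/no question that most reduces duration variance after the split.
    scores = []
    for q in QUESTIONS:
        yes = [m for m in items if m[q]]
        no = [m for m in items if not m[q]]
        scores.append((variance(yes) + variance(no), q))
    return min(scores)[1]

print(best_question(models))   # -> "phone_is_vowel" for this toy data
```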

J. Gustafson, CTT, KTH 66

Speaker adaptation

http://homepages.inf.ed.ac.uk/jyamagis/Demo-html/map-new.html


J. Gustafson, CTT, KTH

Exempel från Simulekt-projektet (examples from the Simulekt project)

Dialect regions: Dalarna, Norrland, Skåne, Gotland, Svealand, Götaland

Example prompts (in Swedish):

• vart tar universum slut? ("where does the universe end?")

• centerpartister och kristdemokrater menar dock att brudparet kan slippa betala, det kan bådas föräldrar göra ("Centre Party and Christian Democrat members argue, though, that the bridal couple can avoid paying; both sets of parents can do that")

• gula sidorna finns från och med i fredags sökbart via mobiltelefonen ("as of last Friday, the Yellow Pages are searchable via the mobile phone")

• om man till exempel tar en telefon och frågar hur den fungerar så svarar vetenskapsmannen med att lyfta på luren och slå numret ("if, for example, you take a telephone and ask how it works, the scientist answers by lifting the receiver and dialling the number")

J. Gustafson, CTT, KTH 68

Use of HMM synthesis

• Various voices:
– Speaker adaptation

– Speaker interpolation

• Security of speaker identification systems

• Very low bit rate speech coder

• Small footprint, for use in mobile phones and web browsers

– E.g. in flash: http://www.furui.cs.titech.ac.jp/~dixonp/hts/index.html