Machine Learning in Speech Synthesis
Alan W Black
Language Technologies Institute
Carnegie Mellon University
Sept 2009
Overview
• Speech Synthesis History and Overview
  – From hand-crafted to data-driven techniques
• Text to Speech Processes
• Waveform synthesis
  – Unit selection and Statistical Parametric Synthesis
• Evaluation
• Voice conversion
• Project ideas
Physical Models
• Blowing air through tubes…
– von Kempelen’s
synthesizer, 1791
• Synthesis by physical models
– Homer Dudley’s Voder. 1939
More Computation – More Data
• Formant synthesis (60s-80s)
  – Waveform construction from components
• Diphone synthesis (80s-90s)
  – Waveform by concatenation of a small number of instances of speech
• Unit selection (90s-00s)
  – Waveform by concatenation of a very large number of instances of speech
• Statistical Parametric Synthesis (00s-..)
  – Waveform construction from parametric models
Waveform Generation
– Formant synthesis
– Random word/phrase concatenation
– Phone concatenation
– Diphone concatenation
– Sub-word unit selection
– Cluster-based unit selection
– Statistical Parametric Synthesis
Speech Synthesis
• Text Analysis
  – Chunking, tokenization, token expansion
• Linguistic Analysis
  – Pronunciations
  – Prosody
• Waveform generation
  – From phones and prosody to waveforms
Text processing
• Find the words
  – Splitting tokens, e.g. “04/11/2009”
  – Removing punctuation
• Identifying word types
  – Numbers: years, quantities, ordinals
  – “1996 sheep were stolen on 25 Nov 1996”
• Identifying words/abbreviations
  – CIA, 10m, 12sf, WeH7200
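The year-vs-quantity ambiguity above (“1996 sheep … 25 Nov 1996”) is the classic token-expansion problem: the same digit string must be read differently by context. A toy sketch (the heuristic and all function names here are illustrative, not Festival’s actual rules):

```python
UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def two_digits(n):
    if n < 20:
        return UNITS[n]
    return TENS[n // 10] + ("" if n % 10 == 0 else " " + UNITS[n % 10])

def quantity(n):
    # Quantity reading for 0..9999 ("1996" -> "one thousand nine hundred ...")
    parts = []
    if n >= 1000:
        parts.append(UNITS[n // 1000] + " thousand"); n %= 1000
    if n >= 100:
        parts.append(UNITS[n // 100] + " hundred"); n %= 100
    if n or not parts:
        parts.append(two_digits(n))
    return " ".join(parts)

def year(n):
    # Year reading ("1996" -> "nineteen ninety six")
    return two_digits(n // 100) + " " + two_digits(n % 100)

def expand(token, prev_token=""):
    # Toy disambiguation: a 4-digit number right after a month name is a year.
    months = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}
    n = int(token)
    if 1066 <= n <= 2100 and prev_token in months:
        return year(n)
    return quantity(n)

print(expand("1996", prev_token="Nov"))  # nineteen ninety six
print(expand("1996"))                    # one thousand nine hundred ninety six
```

In practice this step is learned or rule-tuned per token type; the point is only that expansion needs context, not the token alone.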
Pronunciations
• Giving a pronunciation for each word
  – A phoneme string (plus tone, stress …)
• A constructed lexicon
  – (“pencil” n (p eh1 n s ih l))
  – (“two” n (t uw1))
• Letter-to-sound rules
  – Pronunciation of out-of-vocabulary words
  – Machine learning prediction from letters
Pronunciation of Unknown Words
• How do you pronounce new words?
• 4% of tokens (in news) are new
• You can’t synthesize them without pronunciations
• You can’t recognize them without pronunciations
• Letter-to-Sound rules
• Grapheme-to-Phoneme rules
LTS: Hand written
• Hand-written rules
  – [LeftContext] X [RightContext] -> Y
  – e.g.
  – c [h r] -> k
  – c [h] -> ch
  – c [i] -> s
  – c -> k
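Such ordered rules can be applied mechanically: try each in order, and the first rule whose context matches wins (the catch-all with no context comes last). A minimal sketch of the matching loop, using the ‘c’ rules from the slide (the code itself is illustrative):

```python
# Ordered LTS rules for the letter 'c': (right_context, phones).
# The empty context acts as the default case.
RULES_FOR_C = [
    ("hr", ["k"]),    # c [h r] -> k   ("chrome")
    ("h",  ["ch"]),   # c [h]   -> ch  ("church")
    ("i",  ["s"]),    # c [i]   -> s   ("city")
    ("",   ["k"]),    # c       -> k   ("cat")
]

def apply_c_rule(word, i):
    """Return the phones for word[i] == 'c' using the ordered rules."""
    rest = word[i + 1:]
    for right, phones in RULES_FOR_C:
        if rest.startswith(right):
            return phones
    return []

print(apply_c_rule("chrome", 0))  # ['k']
print(apply_c_rule("church", 0))  # ['ch']
print(apply_c_rule("city", 0))    # ['s']
print(apply_c_rule("cat", 0))     # ['k']
```

Rule ordering matters: if `c [h] -> ch` preceded `c [h r] -> k`, “chrome” would never reach the more specific rule.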
LTS: Machine Learning Techniques
• Need an existing lexicon
  – Pronunciations: words and phones
  – But different numbers of letters and phones
• Need an alignment
  – Between letters and phones
  – checked -> ch eh k t
LTS: alignment
• checked -> ch eh k t
• Some letters go to nothing
• Some letters go to two phones
– box -> b aa k-s
– table -> t ey b ax-l -
c  h  e  c  k  e  d
ch _  eh k  _  _  t
Find alignment automatically
• Epsilon scattering
  – Find all possible alignments
  – Estimate p(L,P) on each alignment
  – Find most probable alignment
• Hand seed
  – Hand-specify allowable pairs
  – Estimate p(L,P) on each possible alignment
  – Find most probable alignment
• Statistical Machine Translation (IBM model 1)
  – Estimate p(L,P) on each possible alignment
  – Find most probable alignment
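The “hand seed” variant can be sketched as a small dynamic program: allowable letter/phone pairs (including epsilon) are listed by hand, with made-up scores standing in for p(L,P), and the highest-scoring alignment is found. A real system would re-estimate the scores from the resulting alignments and iterate:

```python
import math

# Hand-seeded allowable pairs: letter -> {phone or '_' (epsilon): score}.
# The scores are invented for illustration; p(L,P) would be re-estimated.
ALLOWED = {
    "c": {"ch": 0.4, "k": 0.5, "s": 0.1},
    "h": {"_": 0.6, "hh": 0.4},
    "e": {"eh": 0.5, "_": 0.4, "iy": 0.1},
    "k": {"k": 0.7, "_": 0.3},
    "d": {"d": 0.6, "t": 0.4},
}

def align(letters, phones):
    """Best letter/phone alignment; each letter maps to one phone or '_'."""
    n, m = len(letters), len(phones)
    # best[(i, j)] = (log prob, pairs) using i letters and j phones so far
    best = {(0, 0): (0.0, [])}
    for i, letter in enumerate(letters):
        for j in range(m + 1):
            if (i, j) not in best:
                continue
            logp, pairs = best[(i, j)]
            for ph, p in ALLOWED.get(letter, {}).items():
                nj = j if ph == "_" else j + 1
                if nj > m or (ph != "_" and phones[j] != ph):
                    continue  # phone doesn't match the next target phone
                cand = (logp + math.log(p), pairs + [(letter, ph)])
                if (i + 1, nj) not in best or cand[0] > best[(i + 1, nj)][0]:
                    best[(i + 1, nj)] = cand
    return best.get((n, m), (None, None))[1]

print(align(list("checked"), ["ch", "eh", "k", "t"]))
# [('c', 'ch'), ('h', '_'), ('e', 'eh'), ('c', 'k'), ('k', '_'), ('e', '_'), ('d', 't')]
```

Epsilon scattering differs only in that it enumerates all placements of epsilons instead of restricting pairs by hand.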
Not everything aligns
• 0, 1, and 2 letter cases
  – e -> epsilon  “moved”
  – x -> k-s, g-z  “box”, “example”
  – e -> y-uw  “askew”
• Some alignments aren’t sensible
  – dept -> d ih p aa r t m ax n t
  – cmu -> s iy eh m y uw
Training LTS models
• Use CART trees
  – One model for each letter
• Predict phone (epsilon, phone, dual phone)
  – From letter 3-context (and POS)
• # # # c h e c -> ch
• # # c h e c k -> _
• # c h e c k e -> eh
• c h e c k e d -> k
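The windowed training examples above can be generated mechanically from an aligned lexicon entry. This sketch reproduces the “checked” examples, with “#” padding and a width-3 context window as on the slide:

```python
def windows(word, aligned_phones, width=3):
    """Yield (letter-context window, phone) training pairs for one word."""
    padded = "#" * width + word + "#" * width
    for i, phone in enumerate(aligned_phones):
        # window = `width` letters either side of letter i
        yield padded[i:i + 2 * width + 1], phone

word = "checked"
phones = ["ch", "_", "eh", "k", "_", "_", "t"]  # letter-aligned, '_' = epsilon
for ctx, ph in windows(word, phones):
    print(" ".join(ctx), "->", ph)
# # # # c h e c -> ch
# # # c h e c k -> _
# # c h e c k e -> eh
# c h e c k e d -> k
# ...
```

Each letter gets its own CART tree, trained on all such windows for that letter across the lexicon.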
LTS results
• Split lexicon into train/test 90%/10%
– i.e. every tenth entry is extracted for testing
Lexicon    Letter Acc  Word Acc
Thai       95.60%      68.76%
DE-CELEX   98.79%      89.38%
BRULEX     99.00%      93.03%
CMUDICT    91.99%      57.80%
OALD       95.80%      75.56%
Example Tree
But we need more than phones
• What about lexical stress?
  – p r aa1 j eh k t -> p r aa j eh1 k t
• Two possibilities
  – A separate prediction model
  – Joint model – introduce eh/eh1 (BETTER)
         LTP+S    LTPS
Word     63.68%   74.56%
W no S   76.92%   74.69%
Letter   --       95.80%
L no S   96.36%   96.27%
Does it really work
• 40K words from Time Magazine
– 1775 (4.6%) not in OALD
– LTS gets 70% correct (test set was 74%)
             Occurs  %
Names        1360    76.6
Unknown      351     19.8
US Spelling  57      3.2
Typos        7       0.4
Prosody Modeling
• Phrasing
  – Where to take breaths
• Intonation
  – Where (and what size) are accents
  – F0 realization
• Duration
  – What is the length of each phoneme
Intonation Contour
Unit Selection vs Parametric
Unit Selection
  The “standard” method
  “Select appropriate sub-word units from large databases of natural speech”
Parametric Synthesis [NITECH: Tokuda et al]
  HMM-generation based synthesis
  Cluster units to form models
  Generate from the models
  “Take ‘average’ of units”
Unit Selection
• Target cost and Join cost [Hunt and Black 96]
– Target cost is the distance from the desired unit to the actual unit in the database
  • Based on phonetic, prosodic, metrical context
– Join cost is how well the selected units join
“Hunt and Black” Costs
HB Unit Selection
• Use Viterbi to find best set of units
• Note
– Finding “longest” is typically not optimal
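That Viterbi search can be sketched directly: for each target position there is a set of candidate units, and dynamic programming picks the sequence minimizing total target cost plus join cost. The cost functions and numbers below are toy stand-ins, not the actual Festival implementation:

```python
def select_units(candidates, target_cost, join_cost):
    """candidates: per-target lists of units; returns the cheapest unit path."""
    # best[u] = (cumulative cost, path ending in unit u) at current position
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for t in range(1, len(candidates)):
        new = {}
        for u in candidates[t]:
            # cheapest predecessor given the join cost into u
            prev = min(best, key=lambda p: best[p][0] + join_cost(p, u))
            cost = best[prev][0] + join_cost(prev, u) + target_cost(t, u)
            new[u] = (cost, best[prev][1] + [u])
        best = new
    return min(best.values(), key=lambda cp: cp[0])[1]

# Toy database: units are (name, pitch); targets want pitch 100, 120, 110.
candidates = [[("a1", 95), ("a2", 130)],
              [("b1", 118), ("b2", 90)],
              [("c1", 111), ("c2", 140)]]
targets = [100, 120, 110]

def target_cost(t, u):
    return abs(u[1] - targets[t])        # distance to the desired prosody

def join_cost(u, v):
    return 0.1 * abs(u[1] - v[1])        # mismatch at the concatenation point

print([u[0] for u in select_units(candidates, target_cost, join_cost)])
# ['a1', 'b1', 'c1']
```

With a zero join cost for units adjacent in the database, long contiguous stretches become cheap, which is why greedy “longest match” is not the same as the global optimum.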
Clustering Units
• Cluster units [Donovan et al 96, Black et al 97]
• Moves calculation to compile time
Unit Selection Issues
• Cost metrics
– Finding best weights, best techniques etc
• Database design
– Best database coverage
• Automatic labeling accuracy
– Finding errors/confidence
• Limited domain:
– Target the databases to a particular application
– Talking clocks
– Targeted domain synthesis
Old vs New
Unit Selection:
large, carefully labelled database
quality good when good examples available
quality will sometimes be bad
no control of prosody
Parametric Synthesis:
smaller, less carefully labelled database
quality consistent
resynthesis requires a vocoder (buzzy)
can (must) control prosody
model size much smaller than unit DB
Parametric Synthesis
• Probabilistic Models
• Simplification
• Generative model
– Predict acoustic frames from text
Trajectories
• Frame (state) based prediction
  – Ignores dynamics
• Various solutions
  – MLPG (maximum likelihood parameter generation)
  – Trajectory HMMs
  – Global Variance
  – MGE, minimal generation error
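MLPG, in its usual statement (a sketch, not from the slides): assume the models predict per-frame Gaussians with mean μ and covariance Σ over stacked static and delta features, and let W be the window matrix that computes those features from the static sequence c. Maximizing the likelihood of Wc gives a linear system whose solution is the smooth trajectory:

```latex
\hat{c} = \arg\max_{c}\; \mathcal{N}(Wc;\,\mu,\,\Sigma)
\quad\Longrightarrow\quad
\left(W^{\top}\Sigma^{-1}W\right)\hat{c} = W^{\top}\Sigma^{-1}\mu
```

The delta constraints in W are what couple neighbouring frames, repairing the frame-independence of the raw predictions.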
SPSS Systems
• HTS (NITECH)
  – Based on HTK
  – Predicts HMM states
  – (Default) uses MCEP and MLSA filter
  – Supported in Festival
• Clustergen (CMU)
  – No use of HTK
  – Predicts frames
  – (Default) uses MCEP and MLSA filter
  – More tightly coupled with Festival
FestVox CLUSTERGEN Synthesizer
Requires:
  Prompt transcriptions (txt.done.data)
  Waveform files (well recorded)
FestVox labelling
  EHMM (Kishore)
  Context-independent models and forced alignment
  (Have used Janus labels too)
Parameter extraction:
  (HTS’s) melcep/MLSA filter for resynthesis
  F0 extraction
Clustering
  Wagon vector clustering for each HMM-state name
Clustering by CART
Update to Wagon (Edinburgh Speech Tools).
Tight coupling of features with FestVox utts
Support for arbitrary vectors
Define impurity on clusters of N vectors
Clustering
F0 and MCEP
Tested jointly and separately
Features for clustering (51):
phonetic, syllable, phrasal context
Training Output
Three models:
Spectral (MCEP) CART tree
F0 CART tree
Duration CART tree
F0 model:
Smoothed extracted F0 through all speech
(i.e. unvoiced regions get F0 values)
Choose voicing at runtime phonetically
CLUSTERGEN Synthesis
Generate phoneme strings (as before)
For each phone:
  Find HMM-state names: ah_1, ah_2, ah_3
  Predict duration of each
  Create empty mcep vector to fill duration
  Predict mcep values from cluster tree
  Predict F0 value from cluster tree
Use MLSA filter to regenerate speech
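The runtime loop above, as a hedged sketch: the tree objects and `mlsa_filter` callable are hypothetical stand-ins for the trained CART models and the MLSA resynthesis step, not Festival’s API:

```python
FRAME_SHIFT = 0.005  # 5 ms frames

def synthesize(phones, dur_tree, mcep_tree, f0_tree, mlsa_filter):
    """Predict a frame sequence per HMM state, then vocode it."""
    frames = []
    for phone in phones:
        for state in (1, 2, 3):                 # HMM-state names: ah_1, ah_2, ah_3
            name = f"{phone}_{state}"
            dur = dur_tree.predict(name)        # predicted duration in seconds
            n_frames = max(1, round(dur / FRAME_SHIFT))
            for i in range(n_frames):
                pos = i / n_frames              # position within the state
                mcep = mcep_tree.predict(name, pos)
                f0 = f0_tree.predict(name, pos)
                frames.append((f0, mcep))
    return mlsa_filter(frames)                  # frames -> waveform
```

The key contrast with HTS is visible here: prediction is per frame (via the position feature), not per state.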
Objective Score
CLUSTERGEN
Mean Mel Cepstral Distortion over test set
MCD: Voice Conversion ranges 4.5-6.0
MCD: CG scores 4.0-8.0
Smaller is better
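Mel Cepstral Distortion is typically computed as a scaled Euclidean distance over cepstral coefficients 1–24 (excluding the energy term c0), averaged over aligned frames. A sketch; the exact coefficient range and scale convention vary between papers:

```python
import math

# Common MCD scale factor: (10 / ln 10) * sqrt(2) ~ 6.142
MCD_SCALE = 10.0 / math.log(10) * math.sqrt(2)

def mcd(ref_frames, syn_frames):
    """Mean Mel Cepstral Distortion (dB) over aligned frame pairs."""
    total = 0.0
    for r, s in zip(ref_frames, syn_frames):
        # coefficients 1..24; index 0 (energy) is excluded
        d2 = sum((a - b) ** 2 for a, b in zip(r[1:25], s[1:25]))
        total += MCD_SCALE * math.sqrt(d2)
    return total / len(ref_frames)
```

A constant offset of 0.1 on all 24 coefficients already yields about 3 dB, which gives a feel for how tight the 4.5-6.0 range quoted above is.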
Example CG Voices
7 Arctic databases:
  1200 utterances, 43K segs, 1 hr speech
  awb bdl clb jmk ksp rms slt
slt_arctic data size

Utts  Clusters  RMS F0  MCD
50    230       24.29   6.761
100   435       19.47   6.278
200   824       17.41   6.047
500   2227      15.02   5.755
1100  4597      14.55   5.685
Database size vs Quality
Making it Better
• Label data, build model
• But maybe there are better labels
• So find labels that maximize model accuracy
Move Labels
(Figure: at the a_1/a_2 label boundary, predicted Gaussians from the cluster trees are compared with the actual parameter values.)
Move Labels
• Use EHMM to label segments/HMM states
• Build Clustergen model
• Iterate:
  – Predict cluster (mean/std) for each frame
  – For each label boundary:
    • If dist(actual_after, pred_before) < dist(actual_after, pred_after): move label forward
    • If dist(actual_before, pred_after) < dist(actual_before, pred_before): move label backward
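One boundary test from that iteration, sketched with scalar parameters standing in for the frame vectors (real distances are Euclidean over the full parameter vector):

```python
def move_boundary(actual, pred_before, pred_after, b):
    """Return the new boundary index after at most one move.

    actual:       observed frame values, one per frame
    pred_before:  model's predicted mean for the label before the boundary
    pred_after:   predicted mean for the label after the boundary
    b:            frame index where the second label currently starts
    """
    dist = lambda x, y: abs(x - y)
    # The frame just after the boundary looks more like the earlier label:
    if dist(actual[b], pred_before) < dist(actual[b], pred_after):
        return b + 1      # move the label boundary forward
    # The frame just before the boundary looks more like the later label:
    if dist(actual[b - 1], pred_after) < dist(actual[b - 1], pred_before):
        return b - 1      # move the label boundary backward
    return b

# A jump from 'a'-like (~0.2) to 'b'-like (~1.0) values happens at index 3,
# but the label boundary was placed at index 2; one pass corrects it.
frames = [0.1, 0.2, 0.3, 0.9, 1.0]
print(move_boundary(frames, pred_before=0.2, pred_after=1.0, b=2))  # 3
```

Applied to every boundary and iterated with model rebuilding, this is the loop whose per-pass statistics appear in the table below.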
Distance Metric
• Distance from predicted to actual
  – Euclidean
  – F0, static, deltas, voicing
  – With/without standard deviation normalization
  – Weighting
• Best choice
  – Static without stddev normalization
  – (This is closest to MCD)
ML with 10 iterations
• rms voice (66 minutes of speech)
– train 1019 utts, test 113 utts (every tenth)
Pass  Move   +ve    -ve    MCD    stddev  F0
9     26839  13457  13382  5.040  1.750   14.187
8     27571  13730  13841  5.035  1.740   14.239
7     28361  14143  14218  5.042  1.757   14.240
6     29693  14813  14880  5.042  1.754   14.287
5     31131  15518  15613  5.046  1.753   14.306
4     33083  16503  16580  5.061  1.765   14.260
3     35059  17835  17224  5.073  1.779   14.267
2     40731  20223  20508  5.090  1.794   14.220
1     48211  23162  25949  5.121  1.846   14.251
0     0      0      0      5.247  1.965   13.990
Move Labels
Voice  2006   2008 base  2008 ml
slt    5.713  5.170      4.983
rxr    --     5.298      5.160
rms    5.731  5.247      5.035
ksp    5.980  5.289      5.140
jmk    6.165  5.398      5.239
clb    5.417  4.838      4.698
bdl    6.129  5.685      5.467
awb    6.557  4.445      4.483
ahw    --     5.234      5.057
Average improvement 0.172 (excluding awb)
Does it sound better
• rms
  – abtest (10 utterances)
    • ml 7
    • base 1
    • = 2
• slt (base vs ml)
  – abtest
    • ml 7
    • base 2
    • = 1
Arctic MLSB improvements
(Figure: “Move Label Segment Boundaries ΔMCD” – spectrum MCD (coefficients 1-24) vs power (MCD cep0) for the voices awb, jdb, ahw, slt, ksp, tta, jmk, sbb, rxr, clb, rms.)
Grapheme Based Synthesis
• Synthesis without a phoneme set
• Use the letters as phonemes
  – (“alan” nil (a l a n))
  – (“black” nil (b l a c k))
• Spanish (easier?)
  – 419 utterances
  – HMM training to label databases
  – Simple pronunciation rules
  – Policía -> p o l i c i' a
  – Cuatro -> c u a t r o
Spanish Grapheme Synthesis
English Grapheme Synthesis
– Use letters as phones
– 26 “phonemes”
– (“alan” n (a l a n))
– (“black” n (b l a c k))
– Build HMM acoustic models for labeling
– For English
  – “This is a pen”
  – “We went to the church at Christmas”
  – Festival intro
  – “do eight meat”
– Requires a method to fix errors
  – Letter-to-letter mapping
Common Data Sets
• Data-driven techniques need data
• Diphone databases
  – CSTR and CMU US English diphone sets (kal and ked)
• CMU ARCTIC databases
  – 1200 phonetically balanced utterances (about 1 hour)
  – 7 different speakers (2 male, 2 female, 3 accented)
  – EGG, phonetically labeled
  – Utterances chosen from out-of-copyright text
  – Easy to say
  – Freely distributable
  – Tools to build your own in your own language
Blizzard Challenge
• Realistic evaluation
  – Under the same conditions
• Blizzard Challenge [Black and Tokuda]
  – Participants build voices from a common dataset
  – Synthesize test sentences
  – Large set of listening experiments
  – Since 2005, now in 4th year
  – 18 groups in 2008
How to test synthesis
• Blizzard tests:
  – Do you like it? (MOS scores)
  – Can you understand it?
    • SUS sentences, e.g. “The unsure steaks overcame the zippy rudder”
• Can’t this be done automatically?
  – Not yet (at least not reliably enough)
  – But we now have lots of data for training techniques
• Why does it still sound like a robot?
  – Need better (appropriate) testing
SUS Sentences
• sus_00022
• sus_00012
• sus_00005
• sus_00017
SUS Sentences
• The serene adjustments foresaw the acceptable acquisition
• The temperamental gateways forgave the weatherbeaten finalist
• The sorrowful premieres sang the ostentatious gymnast
• The disruptive billboards blew the sugary endorsement
Voice Identity
• What makes a voice identity?
  – Lexical choice:
    • “Woo-hoo”
    • “I pity the fool …”
  – Phonetic choice
  – Intonation and duration
  – Spectral qualities (vocal tract shape)
  – Excitation
Voice Conversion techniques
• Full ASR and TTS
  – Much too hard to do reliably
• Codebook transformation
  – ASR HMM state to HMM state transformation
• GMM-based transformation
  – Build a mapping function between frames
Learning VC models
• First need to get parallel speech
  – Source and target say the same thing
  – Use DTW to align (in the spectral domain)
  – Trying to learn a functional mapping
  – 20-50 utterances
• “Text-independent” VC
  – Means no parallel speech is available
  – Use some form of synthesis to generate it
VC Training process
• Extract F0, power and MFCC from source and target utterances
• DTW align source and target
• Loop until convergence
  – Build GMM to map between source/target
  – DTW source/target using GMM mapping
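The DTW step in that loop is a standard dynamic program over frame distances. A sketch, with 1-D “spectra” standing in for MFCC vectors:

```python
def dtw_align(src, tgt):
    """Return the list of (i, j) aligned frame index pairs."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    # D[i][j] = cheapest cost aligning src[:i] with tgt[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(src[i - 1] - tgt[j - 1])   # frame-pair distance
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # trace back the cheapest path to collect the frame pairs
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min(((i - 1, j), (i, j - 1), (i - 1, j - 1)),
                   key=lambda ij: D[ij[0]][ij[1]])
    return path[::-1]

pairs = dtw_align([1.0, 2.0, 3.0], [1.0, 1.1, 2.0, 3.0])
print(pairs)  # [(0, 0), (0, 1), (1, 2), (2, 3)]
```

The resulting frame pairs are what the GMM mapping is trained on; re-running DTW with the mapped source (as in the convergence loop above) progressively tightens the alignment.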
VC Training process
VC Run-time
Voice Transformation
– Festvox GMM transformation suite (Toda)
(Audio examples: 4x4 grid of conversions among awb, bdl, jmk, slt.)
VC in Synthesis
• Can be used as a post filter in synthesis
  – Build kal_diphone to target VC
  – Use on all output of kal_diphone
• Can be used to convert a full DB
  – Convert a full DB and rebuild a voice
Style/Emotion Conversion
• Unit Selection (or SPS)
  – Requires lots of data in the desired style/emotion
• VC technique
  – Use as a filter on the main voice (same speaker)
  – Convert neutral to angry, sad, happy …
Can you say that again?
• Voice conversion for speaking in noise
• Different quality when you repeat things
• Different quality when you speak in noise
  – Lombard effect (when very loud)
  – “Speech-in-noise” in regular noise
Speaking in Noise (Langner)
• Collect data
  – Randomly play noise in person’s ears
  – Normal
  – In noise
• Collect 500 of each type
• Build VC model
  – Normal -> in-noise
• Actually
  – Spectral, duration, F0 and power differences
Synthesis in Noise
• For bus information task
• Play different synthesized information utts
  – With SIN synthesizer
  – With SWN synthesizer
  – With VC (SWN->SIN) synthesizer
• Measure their understanding
  – SIN synthesizer better (in noise)
  – SIN synthesizer better (without noise, for the elderly)
Transterpolation
• Incrementally transform a voice X%
  – BDL-SLT by 10%
  – SLT-BDL by 10%
• Count when you think it changes from M-F
• Fun, but what are the uses …
De-identification
• Remove speaker identity
  – But keep it still human-like
• Health records
  – HIPAA laws require this
  – Not just removing names and SSNs
• Remove identifiable properties
  – Use voice conversion to remove spectral
  – Use F0/duration mapping to remove prosodic
  – Use ASR/MT techniques to remove lexical
Summary
• Data-driven speech synthesis
  – Text processing
  – Prosody and pronunciation
  – Waveform synthesis
• Finding the right optimization
  – Find an objective metric that correlates with human perception
Potential Projects
• Find F0 in storytelling
  – F0 is easy to find in isolated sentences
  – What about full paragraphs?
  – Storytellers use a much wider range
• Find F0 shapes/accent types
  – Use HMMs to recognize “types” of accents
  – (trajectory modeling)
  – Following “tilt” and the Moeller model
Parametric Synthesis
• Better parametric representation of speech
  – Particularly excitation parameterization
• Better acoustic measures of quality
  – Use Blizzard answers to build/check objective measures
• Statistical Klatt parametric synthesis
  – Using “knowledge-based” parameters
  – F0, aspiration, nasality, formants
  – Automatically derive Klatt parameters for the DB
  – Use them for statistical parametric synthesis
Recommended