5-Text To Speech (TTS)
Speech Synthesis
Speech Synthesis Concept
Phone Units
Phone Sequence To Speech
Speech Naturalness
– Concatenative Approaches
– Rule-Based Approaches
Speech Synthesis Concept
Text -> [Text to Phone Sequence] -> [Phone Sequence to Speech] -> Speech
Text to Phone Sequence: Natural Language Processing (NLP)
Phone Sequence to Speech: Speech Processing
Phone Units
Paragraph ( )
Sentence ( )
Word (Depends on the language. Usually more than 100,000)
Syllable
Diphone & Triphone
Phoneme (between 10 and 100)
Phone Units (Cont’d)
Diphone: we model the transition between two phonemes
Phoneme sequence: p1 p2 p3 p4 p5 . . .
Phoneme: a single unit (p1, p2, ...)
Diphone: spans the transition between two adjacent phonemes (p1-p2, p2-p3, ...)
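The relation between a phoneme sequence and its diphones can be sketched in a few lines. This is only an illustration: the function name `to_diphones` is made up here, and each diphone is represented simply as a pair of adjacent phoneme labels.

```python
def to_diphones(phonemes):
    """Pair each phoneme with its successor: one diphone per transition."""
    return list(zip(phonemes, phonemes[1:]))


phonemes = ["p1", "p2", "p3", "p4", "p5"]
diphones = to_diphones(phonemes)
print(diphones)
```

A sequence of N phonemes yields N - 1 diphones, one for each transition.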
Phone Units (Cont’d)
Farsi has 30 phonemes, so theoretically there are 30*30 = 900 diphones.
In practice, the only diphone that does not occur in Farsi is /zho/.
Theoretically there are 30*30*30 = 27000 triphones, but in practice Farsi has about 15000.
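The theoretical counts follow directly from the number of phonemes. A short check, using only the figures quoted on this slide (the variable names are illustrative):

```python
NUM_PHONEMES = 30  # Farsi phoneme count quoted on the slide

theoretical_diphones = NUM_PHONEMES ** 2   # every ordered pair of phonemes
theoretical_triphones = NUM_PHONEMES ** 3  # every ordered triple of phonemes
observed_triphones = 15000                 # approximate figure quoted on the slide

print(theoretical_diphones)   # 900
print(theoretical_triphones)  # 27000
print(observed_triphones / theoretical_triphones)  # roughly 0.56
```

So only a little over half of the theoretically possible triphones need to be stored.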
Phone Units (Cont’d)
Syllable = Onset (Consonant) + Rhyme
A syllable is a set of phonemes that contains exactly one vowel
Syllables in Farsi: CV, CVC, CVCC
Farsi has about 4000 syllables
Syllables in English: V, CV, CVC, CCVC, CCVCC, CCCVC, CCCVCC, . . .
The number of syllables in English is very large
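As a rough illustration of the consonant/vowel patterns, the sketch below maps a phoneme sequence to its C/V pattern and checks it against the three Farsi patterns listed above. The vowel symbols in `VOWELS` are placeholders chosen for this example, not a real Farsi phoneme inventory, and the function names are hypothetical.

```python
FARSI_PATTERNS = {"CV", "CVC", "CVCC"}   # patterns listed on the slide
VOWELS = {"a", "e", "i", "o", "u", "A"}  # placeholder vowel symbols (assumption)


def cv_pattern(phonemes, vowels=VOWELS):
    """Map each phoneme to 'V' (vowel) or 'C' (consonant)."""
    return "".join("V" if p in vowels else "C" for p in phonemes)


def is_farsi_syllable(phonemes):
    """True if the C/V pattern is one of CV, CVC, CVCC."""
    return cv_pattern(phonemes) in FARSI_PATTERNS


print(cv_pattern(["d", "a", "s", "t"]))         # CVCC
print(is_farsi_syllable(["d", "a", "s", "t"]))  # True
print(is_farsi_syllable(["s", "t", "o", "p"]))  # False: CCVC is not in the Farsi list
```

Every accepted pattern contains exactly one V, which matches the definition of a syllable given above.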
Phone Sequence To Speech
Concatenative approaches: a trade-off between naturalness, memory usage, and the variety of desired functions
Rule-based approaches: the most important rule-based approach is the Klatt method
Phone Sequence To Speech (Cont’d)
Text -> [Text to Phone Sequence] -> [Phone Sequence to Primitive Utterance] -> [Primitive Utterance to Natural Speech] -> Speech
Text to Phone Sequence: NLP
Phone Sequence to Primitive Utterance, Primitive Utterance to Natural Speech: Speech Processing
Speech Naturalness
Removal of undesirable noise, distortion, and discontinuities from the speech
Prosody generation
– Speech energy
– Duration
– Pitch
– Intonation
– Stress
Speech Naturalness (Cont’d)
Intonation and stress strongly affect speech naturalness
Intonation: variation of the pitch frequency over the course of an utterance
Stress: an increase in the pitch frequency at a specific time
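The two definitions can be made concrete with a toy pitch contour. The sketch below assumes a linearly falling intonation and a small triangular pitch rise for stress; the function name, the frequency values, and the shape of the rise are all illustrative choices, not part of any specific synthesizer.

```python
def pitch_contour(n_frames, start_hz=130.0, end_hz=100.0,
                  stress_frame=None, stress_boost_hz=20.0, stress_width=3):
    """Return one pitch value (Hz) per frame.

    Intonation: a linear change from start_hz to end_hz over the utterance.
    Stress: a triangular pitch rise centred on stress_frame (assumed shape).
    """
    contour = []
    for i in range(n_frames):
        frac = i / (n_frames - 1) if n_frames > 1 else 0.0
        f0 = start_hz + (end_hz - start_hz) * frac
        if stress_frame is not None and abs(i - stress_frame) <= stress_width:
            f0 += stress_boost_hz * (1.0 - abs(i - stress_frame) / (stress_width + 1))
        contour.append(f0)
    return contour


plain = pitch_contour(21)
stressed = pitch_contour(21, stress_frame=10)
print(plain[0], plain[10], plain[-1])
print(stressed[10] - plain[10])
```

The stressed contour differs from the plain one only in the few frames around `stress_frame`; everywhere else the intonation line is unchanged.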
Concatenative Approaches
In these approaches we store units of natural speech and use them to reconstruct the desired speech
We can select the appropriate phone unit for speech synthesis
We can store compressed parameters instead of the main waveform
Concatenative Approaches (Cont’d)
Benefits of storing compressed parameters instead of the main waveform
– Less memory use
– A general state instead of a specific stored utterance
– Prosody is easier to generate
Concatenative Approaches (Cont’d)
Phone Unit    Type of Storing
Paragraph     Main Waveform
Sentence      Main Waveform
Word          Main Waveform
Syllable      Coded/Main Waveform
Diphone       Coded Waveform
Phoneme       Coded Waveform
Concatenative Approaches (Cont’d)
Pitch Synchronous Overlap-Add (PSOLA) is a well-known method for smoothing the transitions between phonemes
Overlap-add is a standard DSP method
PSOLA is a basic operation for voice conversion
In the analysis stage of this method, we select frames that are synchronized with the pitch markers
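A heavily simplified sketch of the idea follows. It assumes the pitch marks are already known and evenly spaced, uses two-period Hann-windowed frames, and relies on nothing beyond the standard library. A full PSOLA implementation also has to find the pitch marks and handle varying periods; the function names here are illustrative.

```python
import math


def hann(length):
    """Periodic Hann window; copies shifted by length/2 sum to 1."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / length))
            for n in range(length)]


def extract_frames(signal, marks, period):
    """Analysis: one two-period, Hann-windowed frame centred on each pitch mark."""
    window = hann(2 * period)
    frames = []
    for m in marks:
        frame = []
        for n in range(2 * period):
            i = m - period + n
            sample = signal[i] if 0 <= i < len(signal) else 0.0
            frame.append(sample * window[n])
        frames.append(frame)
    return frames


def overlap_add(frames, marks, period, length):
    """Synthesis: add each frame back, centred on its (possibly moved) mark."""
    out = [0.0] * length
    for frame, m in zip(frames, marks):
        for n, value in enumerate(frame):
            i = m - period + n
            if 0 <= i < length:
                out[i] += value
    return out


PERIOD = 20  # pitch period in samples (assumed known)
signal = [math.sin(2.0 * math.pi * n / PERIOD) for n in range(200)]
marks = list(range(0, 200, PERIOD))
frames = extract_frames(signal, marks, PERIOD)

# Same marks: the overlapping windows sum to 1, so the signal is rebuilt.
same = overlap_add(frames, marks, PERIOD, len(signal))

# Marks moved closer together (16 instead of 20 samples): higher pitch.
new_marks = [k * 16 for k in range(len(marks))]
higher = overlap_add(frames, new_marks, PERIOD, 16 * len(marks))
```

Placing the same frames on more closely spaced marks raises the pitch while each frame keeps its original spectral shape, which is why the operation is useful for prosody modification and voice conversion.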
Rule-Based Approach Stages
Determine the speech model and its parameters
Determine the type of phone units
Determine the parameter values for each phone unit
Replace the sequence of phone units with its equivalent parameter sequence
Feed the parameter sequence into the speech model
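The five stages can be sketched with a toy formant model. Everything specific below is an assumption made for illustration: the speech model is just two cascaded second-order resonators driven by an impulse train, the phone units are two vowel labels, and the formant values in `PHONE_PARAMS` are rough illustrative numbers rather than measured data. The resonator difference equation is the standard one used in formant synthesizers.

```python
import math

SAMPLE_RATE = 10000  # Hz, assumed for this sketch

# Stages 2 and 3: phone units and illustrative (not measured) parameter values.
PHONE_PARAMS = {
    "a": {"f1": 700.0, "f2": 1200.0, "dur_ms": 100},
    "i": {"f1": 300.0, "f2": 2300.0, "dur_ms": 100},
}


def resonator_coeffs(freq, bandwidth, sample_rate=SAMPLE_RATE):
    """Stage 1: coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth * t)
    b = 2.0 * math.exp(-math.pi * bandwidth * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c  # unity gain at 0 Hz
    return a, b, c


def phones_to_parameters(phones):
    """Stage 4: replace each phone unit by its parameter set."""
    return [PHONE_PARAMS[p] for p in phones]


def synthesize(param_seq, f0=100.0, bandwidth=80.0):
    """Stage 5: drive a cascade of two resonators with an impulse train."""
    out = []
    pitch_period = int(SAMPLE_RATE / f0)
    state = [[0.0, 0.0], [0.0, 0.0]]  # y[n-1], y[n-2] for each resonator
    n = 0
    for params in param_seq:
        coeffs = [resonator_coeffs(params["f1"], bandwidth),
                  resonator_coeffs(params["f2"], bandwidth)]
        for _ in range(SAMPLE_RATE * params["dur_ms"] // 1000):
            x = 1.0 if n % pitch_period == 0 else 0.0
            for k, (a, b, c) in enumerate(coeffs):
                y = a * x + b * state[k][0] + c * state[k][1]
                state[k][1] = state[k][0]
                state[k][0] = y
                x = y  # output of one resonator feeds the next
            out.append(x)
            n += 1
    return out


samples = synthesize(phones_to_parameters(["a", "i"]))
print(len(samples))
```

A real rule-based system adds many more parameters per phone unit and rules for interpolating them between units; the sketch switches parameters abruptly at unit boundaries.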
KLATT 80 Model
KLATT 88 Model
[Figure: The KLSYN88 cascade/parallel formant synthesizer (block diagram)]
Glottal sound sources: filtered impulse train, KLGLOTT88 model (default), modified LF model, spectral tilt low-pass resonator, aspiration noise generator (parameters F0, AV, OQ, SQ, FL, DI, SS, TL, AH, AF)
Cascade vocal tract model, laryngeal sound sources: nasal pole-zero pair, tracheal pole-zero pair, first to fifth formant resonators (parameters FNP, FNZ, FTP, FTZ, F1, B1, BNP, BNZ, BTP, BTZ, DF1, DB1, F2, B2, F3, B3, F4, B4, F5, B5, CP)
Parallel vocal tract model, laryngeal sound sources (normally not used): first difference preemphasis; nasal, first to fourth, and tracheal formant resonators (parameters ANV, A1V, A2V, A3V, A4V, ATV)
Parallel vocal tract model, frication sound sources: frication noise generator, second to sixth formant resonators, bypass path (parameters A2F, A3F, A4F, A5F, A6F, AB, B2F, B3F, B4F, B5F, B6F, F6)
Three Voicing Source Models in KLATT 88
The old KLSYN impulsive source
The KLGLOTT88 model
The modified LF model
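As a rough illustration of the second source, the KLGLOTT88 glottal flow pulse is commonly described as a polynomial of the form a*t^2 - b*t^3 during the open phase and zero during the closed phase. The sketch below follows that form; the normalization to a chosen peak value and the names `klglott88_pulse`, `open_quotient`, and `peak` are assumptions made here, not the synthesizer's own parameter definitions.

```python
def klglott88_pulse(period_samples, open_quotient=0.5, peak=1.0):
    """One pitch period of glottal flow: a*t^2 - b*t^3 while open, 0 while closed."""
    open_len = int(round(open_quotient * period_samples))
    if open_len < 1:
        raise ValueError("open phase must be at least one sample long")
    # b is chosen so the flow returns to zero at t = open_len;
    # a is chosen so the maximum (at t = 2/3 * open_len) equals `peak`.
    a = 27.0 * peak / (4.0 * open_len ** 2)
    b = a / open_len
    pulse = []
    for t in range(period_samples):
        if t < open_len:
            pulse.append(a * t * t - b * t ** 3)
        else:
            pulse.append(0.0)
    return pulse


pulse = klglott88_pulse(100, open_quotient=0.5)
print(max(pulse))
```

The pulse rises slowly and closes abruptly, unlike an impulse train, which concentrates all the excitation in a single sample per period.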