1 5-Text To Speech (TTS) Speech Synthesis Speech Synthesis Concept Phone Units Phone Sequence To Speech Speech Naturalness –Concatenative Approaches –Rule-Based


5-Text To Speech (TTS)

Speech Synthesis

Speech Synthesis Concept

Phone Units

Phone Sequence To Speech

Speech Naturalness
– Concatenative Approaches
– Rule-Based Approaches


Speech Synthesis Concept

[Diagram: Text → Text to Phone Sequence (Natural Language Processing, NLP) → Phone Sequence to Speech (Speech Processing) → Speech]

Phone Units

Paragraph ( )

Sentence ( )

Word (Depends on the language. Usually more than 100,000)

Syllable

Diphone & Triphone

Phoneme (between 10 and 100)

Phone Units (Cont’d)

Diphone: we model the transitions between two phonemes.

[Diagram: phoneme sequence p1 p2 p3 p4 p5 ..., with one Phoneme span and one Diphone span marked]
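As a minimal sketch of the diphone idea, the following Python function lists the transitions between adjacent phonemes in a sequence. The function name `to_diphones`, the `a-b` label format, and the `_` silence marker padded at both ends are assumptions made for illustration; they do not come from the slides.

```python
def to_diphones(phonemes):
    """Return the diphone units covering a phoneme sequence.

    Each diphone names the transition between two adjacent phonemes.
    A silence marker "_" is padded on both ends (an assumption made here
    so that the first and last phonemes also get a transition).
    """
    padded = ["_"] + list(phonemes) + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


if __name__ == "__main__":
    print(to_diphones(["p1", "p2", "p3", "p4", "p5"]))
```

A sequence of N phonemes yields N + 1 diphones under this padding convention.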



In Farsi there are 30 phonemes, so theoretically there are 30*30 = 900 diphones.

In practice, the only diphone that does not occur in Farsi is /zho/.

Theoretically there are 30*30*30 = 27000 triphones, but in practice Farsi has about 15000.
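The inventory sizes above follow from counting ordered phoneme sequences. The short sketch below reproduces that arithmetic; the helper name `theoretical_inventory` is hypothetical, and the figures 30 and 15000 are the ones quoted on the slide.

```python
def theoretical_inventory(num_phonemes, unit_length):
    """Number of ordered phoneme sequences of the given length."""
    return num_phonemes ** unit_length


FARSI_PHONEMES = 30          # figure quoted in the slides
OBSERVED_TRIPHONES = 15000   # approximate practical figure from the slides

diphones = theoretical_inventory(FARSI_PHONEMES, 2)
triphones = theoretical_inventory(FARSI_PHONEMES, 3)
coverage = OBSERVED_TRIPHONES / triphones  # share of triphones that occur

if __name__ == "__main__":
    print(diphones, triphones, round(coverage, 2))
```

Roughly 56% of the theoretical triphones occur in practice, which is why a triphone inventory is smaller than the cubic bound suggests.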



Syllable = Onset (Consonant) + Rhyme

A syllable is a set of phonemes that contains exactly one vowel.

Syllables in Farsi: CV, CVC, CVCC

There are about 4000 syllables in Farsi.

Syllables in English: V, CV, CVC, CCVC, CCVCC, CCCVC, CCCVCC, ...

The number of syllables in English is very large.
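Because every Farsi pattern listed above starts with exactly one consonant followed by the vowel, a syllable boundary can be placed just before the consonant that precedes each vowel. The sketch below applies that observation. The one-symbol-per-phoneme notation, the `VOWELS` set, and the function names are assumptions for illustration, not a standard transcription.

```python
FARSI_PATTERNS = {"CV", "CVC", "CVCC"}

# Hypothetical one-symbol-per-phoneme notation: "A" stands for the long a.
VOWELS = {"a", "e", "o", "A", "i", "u"}


def cv_pattern(phonemes):
    """Map a phoneme sequence to its consonant/vowel pattern, e.g. 'CVC'."""
    return "".join("V" if p in VOWELS else "C" for p in phonemes)


def syllabify(phonemes):
    """Split a phoneme list into CV, CVC or CVCC syllables.

    Each syllable starts at the consonant right before a vowel.
    Raises ValueError if the result does not fit the three patterns.
    """
    starts = [i - 1 for i, p in enumerate(phonemes) if p in VOWELS]
    if not starts or starts[0] != 0:
        raise ValueError("sequence must begin with a consonant plus a vowel")
    bounds = starts + [len(phonemes)]
    syllables = [phonemes[a:b] for a, b in zip(bounds, bounds[1:])]
    for syl in syllables:
        if cv_pattern(syl) not in FARSI_PATTERNS:
            raise ValueError(f"invalid syllable pattern: {cv_pattern(syl)}")
    return syllables


if __name__ == "__main__":
    print(syllabify(list("ketAb")))
    print(syllabify(list("dust")))
```

The same rule does not carry over to English, whose onsets may hold zero, two, or three consonants, so the boundary before a vowel is ambiguous there.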

Phone Sequence To Speech

Concatenative approaches: a trade-off between naturalness, memory usage, and the variety of desired functions.

Rule-based approaches: the most important rule-based approach is the Klatt method.

Phone Sequence To Speech (Cont’d)

[Diagram: Text → Text to Phone Sequence (NLP) → Phone Sequence to primitive utterance → primitive utterance to Natural Speech → Speech; the stages after the phone sequence belong to Speech Processing]

Speech Naturalness

Elimination of undesirable noise, distortion, and discontinuities in the speech

Prosody generation
– Speech energy
– Duration
– Pitch
– Intonation
– Stress

Speech Naturalness (Cont’d)

Intonation and stress have a strong effect on speech naturalness.

Intonation: the variation of pitch frequency over the course of an utterance.

Stress: a rise in pitch frequency at a specific time.
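One way to picture these two definitions is a pitch (F0) contour built from a slowly falling baseline for intonation plus a local rise for stress. The sketch below is an illustration only: the linear fall, the Gaussian-shaped rise, the function name `f0_contour`, and every numeric default are assumptions, not values from the slides.

```python
import math


def f0_contour(duration_s, stress_time_s, frame_s=0.01,
               start_hz=130.0, end_hz=100.0,
               stress_boost_hz=30.0, stress_width_s=0.05):
    """Return one F0 value (Hz) per frame.

    Intonation: a linear fall from start_hz to end_hz over the utterance.
    Stress: a smooth local rise of stress_boost_hz centred at stress_time_s.
    All default values are illustrative assumptions.
    """
    n_frames = int(round(duration_s / frame_s)) + 1
    contour = []
    for k in range(n_frames):
        t = k * frame_s
        base = start_hz + (end_hz - start_hz) * (t / duration_s)
        offset = (t - stress_time_s) / stress_width_s
        bump = stress_boost_hz * math.exp(-0.5 * offset * offset)
        contour.append(base + bump)
    return contour


if __name__ == "__main__":
    contour = f0_contour(duration_s=1.0, stress_time_s=0.5)
    print(len(contour), round(max(contour), 1))
```

A rule-based or concatenative synthesizer would then impose such a contour on the phone units, together with duration and energy.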

Concatenative Approaches

In these approaches we store units of natural speech and use them to reconstruct the desired speech.

We can select the appropriate phone unit for speech synthesis.

We can store compressed parameters instead of the main waveform.

Concatenative Approaches (Cont’d)

Benefits of storing compressed parameters instead of the main waveform
– Less memory use
– A general state instead of a specific stored utterance
– Easier prosody generation



Phone Unit    Type of Storing
Paragraph     Main Waveform
Sentence      Main Waveform
Word          Main Waveform
Syllable      Coded/Main Waveform
Diphone       Coded Waveform
Phoneme       Coded Waveform



Pitch Synchronous Overlap-Add (PSOLA) is a well-known method for smoothing phoneme transitions.

Overlap-add is a standard DSP method.

PSOLA is a basic operation in voice conversion.

In the analysis stage of this method, we select frames that are synchronized with the pitch markers.
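The pitch-synchronous overlap-add step can be sketched in a few lines of plain Python. This is a simplified time-domain illustration under stated assumptions: the pitch period is constant, the pitch marks are already known, each synthesis mark reuses the frame of the nearest analysis mark, and no gain normalization is applied. The function names are hypothetical.

```python
import math


def hann(length):
    """Periodic Hann window; copies shifted by length/2 sum to one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length)
            for n in range(length)]


def psola(signal, analysis_marks, period, synthesis_marks):
    """Overlap-add two-period frames centred on pitch marks.

    Analysis: cut a Hann-windowed frame of 2 * period samples around
    each analysis mark. Synthesis: add the frame of the nearest
    analysis mark at every synthesis mark. Closer synthesis marks
    raise the pitch, wider ones lower it.
    """
    window = hann(2 * period)
    out = [0.0] * len(signal)
    for s in synthesis_marks:
        a = min(analysis_marks, key=lambda m: abs(m - s))
        for n in range(2 * period):
            src = a - period + n
            dst = s - period + n
            if 0 <= src < len(signal) and 0 <= dst < len(out):
                out[dst] += window[n] * signal[src]
    return out


if __name__ == "__main__":
    period = 50
    sig = [math.sin(2.0 * math.pi * n / period) for n in range(500)]
    marks = list(range(50, 451, period))
    raised = psola(sig, marks, period, list(range(50, 451, 40)))
    print(len(raised))
```

When the synthesis marks equal the analysis marks, the overlapping Hann windows sum to one and the interior of the signal is reproduced unchanged, which is a convenient sanity check for the windowing.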

Rule-Based Approach Stages

Determine the speech model and the model parameters

Determine the type of phone units

Determine the parameter values for each phone unit

Replace the sequence of phone units with its equivalent parameter sequence

Feed the parameter sequence into the speech model
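The stages above can be sketched with a very small formant model. The resonator below uses the second-order difference equation y[n] = A x[n] + B y[n-1] + C y[n-2] described for Klatt-style formant synthesizers; everything else is an assumption for illustration: the sample rate, the impulse-train source, the phoneme-sized units, and the formant values in `PHONE_PARAMS`, which are rough textbook-style figures and not taken from the slides.

```python
import math

SAMPLE_RATE = 10000  # Hz, assumed

# Stages 2 and 3: phoneme units with hypothetical (formant Hz, bandwidth Hz)
# pairs. The numbers are illustrative only.
PHONE_PARAMS = {
    "a": [(700, 60), (1200, 90), (2600, 150)],
    "i": [(300, 50), (2300, 100), (3000, 200)],
}


def resonator(samples, freq_hz, bw_hz, fs=SAMPLE_RATE):
    """Second-order resonator: y[n] = A x[n] + B y[n-1] + C y[n-2]."""
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # gives unity gain at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(n_samples, f0_hz, fs=SAMPLE_RATE):
    """Stage 1 source: one unit impulse per pitch period."""
    period = int(round(fs / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def synthesize(phones, dur_s=0.1, f0_hz=100.0):
    """Stages 4 and 5: look up each unit's parameters and run the model.

    Simplification: filter state is reset at every unit boundary.
    """
    out = []
    for phone in phones:
        frame = impulse_train(int(round(dur_s * SAMPLE_RATE)), f0_hz)
        for freq_hz, bw_hz in PHONE_PARAMS[phone]:  # cascade of formants
            frame = resonator(frame, freq_hz, bw_hz)
        out.extend(frame)
    return out


if __name__ == "__main__":
    print(len(synthesize(["a", "i"])))
```

A full rule-based synthesizer would also interpolate the parameters between units and add noise sources, which this sketch leaves out.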

KLATT 80 Model

KLATT 88 Model


[Figure: The KLSYN88 cascade/parallel formant synthesizer.
Glottal sound sources: filtered impulse train, KLGLOTT88 model (default), modified LF model, spectral tilt low-pass resonator, aspiration noise generator.
Cascade vocal tract model (laryngeal sound sources): nasal pole-zero pair, tracheal pole-zero pair, first to fifth formant resonators.
Parallel vocal tract model (laryngeal sound sources, normally not used): first-difference preemphasis, nasal formant resonator, first to fourth formant resonators, tracheal formant resonator, with gains ANV, A1V to A4V, ATV.
Parallel vocal tract model (frication sound sources): frication noise generator, second to sixth formant resonators, bypass path, with gains A2F to A6F, AB.]

Three Voicing Source Models in KLATT 88

The old KLSYN impulsive source

The KLGLOTT88 model

The modified LF model
