Fundamentals & recent advances in HMM-based speech synthesis

Fundamentals & recent advancesin HMM-based speech synthesis

Heiga ZEN

Toshiba Research Europe Ltd.

Cambridge Research Laboratory

FALA2010, Vigo, Spain

November 10th, 2010

Text-to-speech synthesis (TTS)Text (seq of discrete symbols) Speech (continuous time series)

Automatic speech recognition (ASR)Speech (continuous time series) Text (seq of discrete symbols)

Machine Translation (MT)Text (seq of discrete symbols) Text (seq of discrete symbols)

Text-to-speech as a mapping problem

Dobré ráno Good morning

Good morning

text(concept)

speech

air flow

Sound sourcevoiced: pulseunvoiced: noise

frequencytransfercharacteristics

magnitudestart--end

fundamentalfrequency

Speech production process

Grapheme-to-Phoneme

Part-of-Speech tagging

Text normalization

Pause predictionProsody prediction

Waveform generation

Flow of text-to-speech synthesis

Text analysis

SYNTHESIZEDSPEECH

Speech synthesis

This talk focuses on speech synthesis part of TTS

Rule-based, formant synthesis (~’90s)

– Based on parametric representation of speech

– Hand-crafted rules to control phonetic unit

DECtalk (or KlattTalk / MITTalk) [Klatt;‘82]

Speech synthesis methods (1)

Block diagram of KlattTalk

Corpus-based, concatenative synthesis (’90s~)

– Concatenate small speech units (e.g., phone) from a database

– Large data + automatic learning High-quality synthetic voices

Single inventory; diphone synthesis [Moullnes;‘90]

Multiple inventory; unit selection synthesis [Sagisaka;‘92, Black;‘96]

Corpus-based, statistical parametric synthesis (mid ’90s~)

– Large data + automatic training

Automatic voice building

– Source-filter model + statistical modeling

Flexible to change its voice characteristics

Hidden Markov models (HMMs) as its statistical acoustic model

HMM-based speech synthesis (HTS) [Yoshimura;‘02]

Parametergeneration

Modeltraining

Featureextraction

Waveformgeneration

This talk gives the basic implementations & some recent work in HMM-based speech synthesis research

Popularity of HMM-based speech synthesis

Outline

Fundamentals of HMM-based speech synthesis– Mathematical formulation

– Implementation of individual components

– Examples demonstrating its flexibility

Recent work– Quality improvements

– Vocoding

– Acoustic modeling

– Oversmoothing

– Other challenging research topics

Conclusions

Mathematical formulation

Statistical parametric speech synthesisTraining

Synthesis

Using HMM as generative model

HMM-based speech synthesis (HTS) [Yoshimura;’02]

Excitationgeneration

Synthesis Filter

Text analysis

SYNTHESIZEDSPEECH

Training HMMs

Parameter generationfrom HMMs

Context-dependent HMMs& state duration models

Labels Excitationparameters

Excitation

Spectralparameters

Labels

Training part

Synthesis part

ExcitationParameterextraction

SPEECHDATABASE

SpectralParameterExtraction

Spectralparameters

Excitationparameters

Speech signal

HMM-based speech synthesis system (HTS)

SPEECHDATABASE

Synthesis Filter

Text analysis

SYNTHESIZEDSPEECH

Training HMMs

Excitation

Spectralparameters

Labels

Speech signal Training part

Synthesis part

Synthesis Filter

Text analysis

SYNTHESIZEDSPEECH

Training HMMs

Excitation

Spectralparameters

Labels

Training part

Synthesis part

SPEECHDATABASE

Spectralparameters

Speech signal

air flow

frequencytransfercharacteristics

magnitudestart--end

fundamentalfrequency

n speech

Sound sourcevoiced: pulseunvoiced: noise

Speech production process

Divide speech into frames

Speech is a non-stationary signal

… but can be assumed to be quasi-stationary

Divide speech into short-time frames (e.g., 5ms shift, 25ms length)

lineartime-invariant

system

excitationpulse train(voiced)

white noise(unvoiced)

speech

Excitation (source) part Spectral (filter) part

Fourier transform

Source-filter model

Parametric models speech spectrum

Autoregressive (AR) model Exponential (EX) model

ML estimation of spectral model parameters

: AR model Linear prediction (LP) [Itakura;’70]

: EX model ML-based cepstral analysis

Spectral (filter) model

Excitation (source) model

excitationpulse train

white noise

Excitation model: pulse/noise excitation– Voiced (periodic) pulse trains

– Unvoiced (aperiodic) white noise

Excitation model parameters– V/UV decision

– V fundamental frequency (F0):

Speech samples

Natural speech

Reconstructed speech from extracted parameters (cepstral coefficients & F0 with V/UV decisions)

Quality degrades, but main characteristics are preserved

SPEECHDATABASE

Synthesis Filter

Text analysis

SYNTHESIZEDSPEECH

Excitation

Spectralparameters

Synthesis part

Training HMMs

Spectralparameters

Labels

Structure of state-output (observation) vector

Spectrum part

Excitation part

Spectral parameters(e.g., cepstrum, LSPs)

log F0 with V/UV

Dynamic features

HMM-based modeling

・・

1 1 2 3 3 N

Observation sequence

State sequence

SentenceHMM

Transcription sil a i sil

1 2 31 2 3

Spectrum(cepstrum or LSP, & dynamic features)

Excitation(log F0

& dynamic features)

Stream

Multi-stream HMM structure

Unable to model by continuous or discrete distribution

Observation of F0

unvoiced

voiced

unvoiced

voiced

unvoiced

voiced

・・・

voiced/unvoiced weights

Multi-space probability distribution (MSD)

Voiced

Unvoiced

Voiced

Unvoiced

Voiced

Unvoiced

Log F0

Spectral params Single Gaussian

MSD(Gaussian & discrete)

Structure of state-output distributions

data & labels

Compute variancefloor

Initialize CI-HMMs bysegmental k-means

Reestimate CI-HMMs byEM algorithm

Copy CI-HMMs toCD-HMMs

monophone(context-independent, CI)

Reestimate CD-HMMs byEM algorithm

Decision tree-basedclustering

Untie parameter tyingstructure

Estimate CD-dur Modelsfrom FB stats

Estimated dur models

EstimatedHMMs

fullcontext(context-dependent, CD)

Training process

HMM-based modeling

・・

1 1 2 3 3 N

State sequence

SentenceHMM

Transcription

sil sil-a+i a-i+sil sil

1 2 31 2 3

Phoneme current phoneme {preceding, succeeding} two phonemes

Syllable # of phonemes at {preceding, current, succeeding} syllable {accent, stress} of {preceding, current, succeeding} syllable Position of current syllable in current word # of {preceding, succeeding} {accented, stressed} syllable in current phrase # of syllables {from previous, to next} {accented, stressed} syllable Vowel within current syllable

Word Part of speech of {preceding, current, succeeding} word # of syllables in {preceding, current, succeeding} word Position of current word in current phrase # of {preceding, succeeding} content words in current phrase # of words {from previous, to next} content word

Phrase # of syllables in {preceding, current, succeeding} phrase

Huge # of combinations Difficult to have all possible models⇒

Context-dependent modeling

data & labels

EstimatedHMMs

Training process

k-a+b/A:…

t-e+h/A:……

…w-a+t/A:…

w-i+sil/A:… w-o+sh/A:…

gy-e+sil/A:…

gy-a+pau/A:…

g-u+pau/A:…

leaf nodes

synthesized

states

yesyes

yes no

R=silence?

L=“gy”?

C=voiced?

L=“w”?

R=silence?

Decision tree-based context clustering [Odell;’95]

Spectrum & excitation have different context dependency

Build decision trees separately

Decision treesfor

mel-cepstrum

Decision treesfor F0

Stream-dependent clustering

data & labels

EstimatedHMMs

Training process

1 2 3 4 5 6 7 8

Estimation of state duration models [Yoshimura;’98]

Decision treesfor

mel-cepstrum

Decision treesfor F0

State durationmodel

Decision treefor state dur.

models

Stream-dependent clustering

data & labels

EstimatedHMMs

Training process

SPEECHDATABASE

Synthesis Filter

SYNTHESIZEDSPEECH

Excitation

Synthesis part

Training HMMs

Spectralparameters

Labels

Text analysisParameter generation

from HMMs

Spectralparameters

Composition of sentence HMM for given text

Text analysis

context-dependent label sequence

POS tagging

Text normalization

Pause prediction

sentence HMM given labels

This sentence HMM gives

Speech parameter generation algorithm

State sequence

・・

1 1 1 1 2 2 3 3

Determine state sequence via determining state durations

Determination of state sequence (1)

4 10 5State duration

Geometric

Gaussian

0.01 2 3 4 5 6 7 8 State duration

Mean Variance

step-wise, mean values

Without dynamic features

Speech param. vectors includes both static & dyn. feats.

The relationship between & can be arranged as

Integration of dynamic features

Solution

Mean Variance

ticGenerated speech parameter trajectory

a i silsil

Generated spectra

w/o dynamic features

w/ dynamic features

SPEECHDATABASE

Text analysis

Training HMMs

Labels

Spectralparameters

Labels

Synthesis partExcitationgeneration

Synthesis Filter

SYNTHESIZEDSPEECH

Excitation

Spectralparameters

lineartime-invariant

systemexcitation

pulse train

white noise

synthesizedspeech

Generatedexcitation parameter(log F0 with V/UV)

Generatedspectral parameter

(cepstrum, LSP)

Source-filter model

filtering

Speech samples

w/o dynamic features

w/ dynamic features

Use of dynamic features can reduce discontinuity

Outline

Recent works– Quality improvements

– Vocoding

– Oversmoothing

Conclusions

Adaptation/adaptive training of HMMs– Originally developed in ASR, but works very well in TTS

– Average voice model (AVM)-based speech synthesis [Yamagishi;’06]

– Require small amount of target speaker/speaking style data

More efficient creation of new voices

AdaptiveTraining

Average-voice model

Training speakers

Adaptation

Target speakers

Adaptation (mimic voice)

Speaker adaptation– VIP voice

– Child voice

Style (expressiveness) adaptation (in Japanese)– Joyful

– Sad

– Rough

Difficult to collect large amount of data Adaptation framework gives a way to build TTS with small database

Adaptation demo

Some samples are from http://homepages.inf.ed.ac.uk/jyamagis/Demo-html/demo.html

Interpolate parameters among representive HMM sets– Can obtain new voices even when no adaptation data is available

– Gradually change speakers & speaking styles [Yoshimura;’97, Tachibana;’05]

Interpolation (mixing voice)

Speaker interpolation (in Japanese)– Male & Female

Style interpolation– Neutral Angry

– Neutral Happy

Interpolation demo

Samples are from http://www.sp.nitech.ac.jp/& http://homepages.inf.ed.ac.uk/jyamagis/Demo-html/demo.html

Outline

Recent works– Quality improvements

– Vocoding

– Oversmoothing

Conclusions

Drawbacks of HTS

Quality of synthesized speech– Buzzy

– Flat

– Muffled

Three major factors degrade the quality– Poor vocoding

how to parameterize speech?

– Inaccurate acoustic modeling

how to model extracted speech parameter trajectories?

– Over-smoothing

how to recover generated speech parameter trajectories?

Recent work has drastically improved the quality of HTS

pulse train

white noseexcitation

Unvoiced Voiced

Problem in vocoding

Binary pulse or noise excitation– Voiced (periodic) pulse trains

– Unvoiced (aperiodic) white noiseDifficult to model mixture of voiced & unvoiced sounds

(e.g., voiced fricatives)

Results in buzziness in synthesized speech

Limitations of HMMs for modeling speech parameters– Piece-wise constant statistics

Statistics do not vary within an HMM state

/a/ /i/ /sil//sil/

-1.20.4

-0.5mean

variance

Problem in acoustic modeling

1q 2q 3q

1o 2o 3o

– Frame-wise conditional independence assumption

State output probability depends only on the current state

1 2 3 4 5 6 7 8 State duration

– Weak duration modeling

State duration probability decreases exponentially with time

– Weak duration modeling

State duration probability decreases exponentially with time

None of them hold for real speech data

Speech parameters are generated from acoustic models

Better acoustic model may produce better speech

Over-smoothing problem

Speech parameter generation algorithm– Dynamic features make generated parameters smooth

– Some times too smooth Muffled speech

Why?– Fine structure of spectra are removed by averaging effect

– Model underfitting

Nitech-HTS 2005 for Blizzard Challenge

Blizzard Challenge– Open evaluation of TTS

– Common database

– From 2005

Nitech-HTS 2005– Achieved best naturalness & intelligibility

– Still used as calibration system in Blizzard Challenge

– Includes various enhancements

Advanced vocoding

High-quality vocoder STRAIGHT

Better acoustic modeling

Hidden semi-Markov model (HSMM)

Oversmoothing compensation

Speech parameter generation algorithm considering GV

Waveform Synthetic waveform

Fixed-point analysis

F0 extraction

F0 adaptive spectralsmoothing in the

time-frequency region

Analysis

Smoothedspectrum

Aperiodicfactors

Mixed excitation withphase manipulation

Synthesis

STRAIGHT analysis

Frequency [kHz]

-200 8

FFT power spectrumFFT + mel-cepstral analysisSTRAIGHT + mel-cepstral analysis

Mel-cepstrum from STRAIGHT spectrum

Examples of generated excitation signals

pulse/noise

STRAIGHT

HSMM-based modeling

Hidden semi-Markov model (HSMM)

– Can be viewed as HMM with state-duration probability distributions

– State-output, state-duration, & state-transition probability distributions can be estimated simultaneously

Generated trajectory has smaller GV

0 1 2 3 time [sec]

Natural Generated

Parameter generation considering GV

Global variance intra-utterance var of speech params

Objective function of standard parameter generation

Objective function of parameter generation with GV

– 1st term is same as standard one

– 2nd term works as a penalty term for oversmoothing

Parameter generation considering GV

24 mcep. HMM

non-GV

24 ST. HMM

non-GV

24 mcep. HSMM non-GV

24 ST. HSMM non-GV

39 ST. HSMM non-GV

24 mcep. HSMM

24 ST. HSMM

39 ST. HSMM

3.2 3.2 3.1

4.7 4.7

Evaluations

Recent works to improve quality

Vocoding– MELP-style / CELP-style excitation

– LF model

– Sinusoidal models

Acoustic model– Segment models, trajectory models

– Model combination (product of experts)

– Minimum generation error training

– Bayesian modeling

Oversmoothing– Pre & postfiltering

– Improvements of GV

– Hybrid approaches

& more…

Fundamentals & recent advances in HMM-based speech synthesis

Technology

The NITECH HMM-based text-to-speech system for the ...swdkei/paper/Blizzard_slide_2015_09.pdf · The NITECH HMM-based text-to-speech system for the Blizzard Challenge 2015 Kei Sawada,

LNAI 4274 - Advances in Mandarin Broadcast Speech ......Advances in Mandarin Broadcast Speech Transcription at IBM 413 After segmentation, the frames classified as non-speech are discarded,

INTONATION ISSUES IN HMM-BASED SPEECH SYNTHESIS FOR …

Speech Recognition and HMM Learning1 Overview of speech recognition approaches – Standard Bayesian Model – Features – Acoustic Model Approaches – Language

Recent Advances in Speech Dereverberation

Performance analysis of bangla speech recognizer model using hmm

A Novel Visual Speech Representation and HMM

Enhancement of spectral envelope modeling in HMM-based speech … · 2019-05-07 · Diploma Thesis Enhancement of spectral envelope modeling in HMM-based speech synthesis Florian

CONNECTIONIST PROPABILITY ESTIMATORS IN HMM USING GENETIC CLUSTERING APPLICATION FOR SPEECH RECOGNITION AND MEDICAL DIAGNOSIS

Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

HMM Based Speech Background

HMM-based speech synthesis: the new generation of artificial voices

A novel irregular voice model for HMM-based speech synthesis

Speech Recognition Using HMM with MFCC-An Analysis Using Frequency Specral Decomposion Technique

Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing... · Hidden Markov Model (cont.) • HMM, an extended version of Observable Markov Model

Design of Tree-based Context Clustering for an HMM-based Thai Speech Synthesis System

A Bayesian Approach to HMM-Based Speech Synthesis

Hidden Markov Model (HMM) based Speech Synthesis for Urdu ...cle.org.pk/research/presentations/Hidden Markov... · Hidden Markov Model (HMM) based Speech Synthesis for Urdu Language

Information Bottleneck and HMM for Speech Recognition · 2008-01-05 · Information Bottleneck and HMM for Speech Recognition Thesis submitted for the degree of "Master of Science"

Usage of HMM-Based Speech Recognition Methods for …ansis.lv/raksti/ainl2019.pdf · 2020-02-22 · Usage of HMM-Based Speech Recognition Methods for Automated Determination of a