spcc.csd.uoc.gr/_docs/Lectures2016/SPCC16_Akamine.pdf

Copyright 2016, Toshiba Corporation.

CLOSED-LOOP TRAINING FOR DIPHONE-BASED TTS

Masami Akamine, Senior Fellow, Toshiba Research and Development Center

Personal Background

1985: Completed a Ph.D. with the thesis "State-Space Approach to Synthesis of Minimum Quantization Error Digital Filter without Limit-cycles"

Professional experience

Toshiba R&D Center for 30 years

Speech coding

Video coding

TTS

ASR

Spoken dialogue


Research topics on TTS at Toshiba labs

Unit concatenation-based speech synthesis

Closed-loop training for diphone-based TTS

Multiple unit selection and fusion

Statistical speech synthesis

Speaker and language adaptation

Cluster Adaptive Training (CAT)

Speech synthesis using sub-band basis models (SBM)

Bandwidth expansion based on SBM

Dynamic sinusoidal models

Complex cepstrum analysis

Today's lecture: closed-loop training for diphone-based TTS, and multiple unit selection and fusion


Outline

1. Diphone-based speech synthesis with TD-PSOLA

2. Closed-loop training of diphone units

3. Closed-loop training of prosody generation model

4. Multiple unit selection and fusion

5. Products and services

6. Conclusions

1. Diphone-based speech synthesis with TD-PSOLA

Structure of TTS

Three types of speech synthesis

What is diphone-based speech synthesis?

Prosodic modification using TD-PSOLA

Speech quality problems caused by prosodic modification

Structure of TTS

Text → text analysis (pronunciation, accent, linguistic information, ...) → prosody generation (pitch, duration, ...) → speech synthesis → synthetic speech.

Methods of speech synthesis

Formant synthesis

Unit concatenation-based speech synthesis

Single inventory: diphone-based speech synthesis

Multiple inventory: unit selection-based speech synthesis

Statistical parametric speech synthesis

HMM-based speech synthesis

Deep learning, e.g. DNN, RNN, LSTM

Formant synthesis

Klattalk (Klatt, 1983)

Based on the source–filter model (vocoder)

Handcrafted rules map phonemes to model parameters

Figure: a pulse train (voiced) or white noise (unvoiced) source drives a synthesis filter with resonances F1, F2, F3, controlled by parameters u/v, f0, gain, F1,2,3, Q1,2,3, producing speech.

http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html
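To make the source–filter idea concrete, here is a minimal sketch (not Klatt's actual rule set): a pulse-train source driving a cascade of second-order resonators, one per formant. The formant frequencies, bandwidths, and gains are illustrative values, not values from the lecture.

```python
import numpy as np
from scipy.signal import lfilter

fs = 11025                      # sampling rate (Hz)
f0, dur = 120.0, 0.5            # pitch and duration for a voiced sound
n = int(fs * dur)

# Source: pulse train (voiced); use np.random.randn(n) for unvoiced sounds
src = np.zeros(n)
src[::int(fs / f0)] = 1.0

# Filter: cascade of second-order resonators, one per formant.
# Illustrative formant frequencies/bandwidths for a vowel like /a/.
speech = src
for F, bw in [(700, 130), (1220, 70), (2600, 160)]:
    r = np.exp(-np.pi * bw / fs)             # pole radius from bandwidth
    theta = 2 * np.pi * F / fs               # pole angle from frequency
    a = [1.0, -2 * r * np.cos(theta), r * r] # resonator denominator
    speech = lfilter([1 - r], a, speech)     # apply one resonator
```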

Diphone-based speech synthesis

(Speech units / speech database / voice fonts)

Example: the input "tadaima" gives the phoneme sequence t-a-d-a-i-m-a and the diphone sequence ta, ad, da, ai, im, ma. Units are selected from a single inventory (the speech-unit database, or "voice font": ho, he, hu, hi, ha, no, ne, nu, ni, na, ...), prosodically modified to match the target F0 contour over time, and concatenated.

Unit selection-based speech synthesis

Multiple inventory with different spectra, f0, and durations

Unit selection based on target cost and concatenation cost: the target cost measures differences in linguistic features, e.g. phoneme, position, stress, etc.; the concatenation cost measures the mismatch at unit joins

High quality, but requires large data

Pipeline: linguistic features → unit selection from the unit inventory → unit concatenation

[A. Hunt and A. Black, ICASSP96]

Statistical parametric speech synthesis

Training: extract vocoder features from speech data and train statistical models of the vocoder parameters based on maximum likelihood (ML), replacing Klatt's handcrafted rules.

Synthesis: generate vocoder parameters from text and synthesize speech with the vocoder.

[K. Tokuda et al., ICASSP2000]

Diphone-based TTS

A TTS research project started in 1994.

Why diphone-based speech synthesis?

Business target

Focused on applications to embedded systems

The market for GPS navigation was growing

A voice user interface (VUI) was essential

Good synergy with Toshiba's semiconductor business

Research objectives

A method to synthesize high-quality speech

No muffled voice: a clear voice

Natural prosody: lively intonation

Small footprint (memory size): less than 1 MB

Low computational complexity: less than 15 MIPS

Memory size and speech quality

Figure: quality (poor / fair / good) versus memory size (1–1000 MB). Unit selection-based synthesis reaches good quality only at large memory sizes (e-dictionary scale); diphone-based synthesis fits the small (~1 MB) end but suffers from a muffled voice and flat f0. Our target: good quality at the small-footprint end.

Prosodic modification

Diphone-based speech synthesis needs prosodic modification of the diphone units.

Requirements: prosodic modification of speech with minimum distortion.

Time-scale modification changes the duration without changing the pitch.

Pitch-scale modification changes the pitch without changing the duration.

Cf. speeding up a recording to shorten its duration also changes its pitch.

TD-PSOLA

Time-domain pitch-synchronous overlap-and-add: signal processing to perform time-scale and pitch-scale modification of speech.

Isolate pitch waveforms in the original signal by windowing.

Perform the modification by allocating the isolated waveforms at new pitch marks.

Resynthesize the final waveform through an overlap-add operation.
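A minimal sketch of TD-PSOLA pitch-scale modification, assuming pitch marks are already given (pitch marking itself is a separate problem); the windowing, grain placement, and boundary handling are simplified relative to a production implementation:

```python
import numpy as np

def psola_pitch_shift(x, marks, ratio):
    """Minimal TD-PSOLA pitch-scale sketch. x: speech samples;
    marks: pitch-mark sample indices, one per period; ratio: new_f0/old_f0."""
    periods = np.diff(marks)
    # 1) isolate two-period pitch waveforms by Hann windowing around marks
    grains = []
    for i in range(1, len(marks) - 1):
        left, right = marks[i] - periods[i - 1], marks[i] + periods[i]
        grains.append((x[left:right] * np.hanning(right - left),
                       marks[i] - left))           # grain and centre offset
    # 2) place grains at new pitch marks spaced period/ratio apart,
    #    duplicating or skipping grains so the duration is preserved
    y = np.zeros(len(x))
    t = float(marks[1])
    while t < marks[-2]:
        i = int(np.clip(np.searchsorted(marks, t) - 1, 1, len(grains)))
        g, c = grains[i - 1]
        s = int(t) - c
        if 0 <= s and s + len(g) <= len(y):
            y[s:s + len(g)] += g                   # 3) resynthesis by OLA
        t += periods[i - 1] / ratio
    return y
```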

Pitch modification [Hamon et al., ICASSP89]

1. Windowing to decompose the signal into pitch waveforms.

2. Overlap-add at the new pitch marks.

Pitch and duration modification

1. Decompose into pitch waveforms.

2. Duplicate or delete pitch waveforms.

3. Resynthesize the speech waveform at the new pitch marks.

Distortion caused by prosodic modification

Prosodic modification resamples the spectrum envelope at harmonic frequencies (F0', 2F0', 3F0', ...) different from the original ones (F0, 2F0, 3F0, ...).

Figure: the spectrum envelope of "a" with a low F0, and the same envelope resampled after modification to a high F0.

The distortion is large for high-pitched voices and for large modifications of F0.

Handcrafting of diphone units

Clipping: clip small segments from recorded speech and select proper units by hand, evaluating them by listening.

A trial-and-error process: labor- and time-consuming.

(Recorded speech → evaluation by listening → diphone units)

Problems at the time

Poor speech quality and unnatural prosody

Muffled voice

Flat intonation

Robotic voice

Time-consuming to build synthesis units

2 or 3 years to develop a prototype voice

Trial and error

Hand tuning

2. Closed-loop training of diphone units

Formulate the distortion in speech synthesis as the error between original and synthesized speech.

Generate diphone units that minimize the distortion.

Two methods of closed-loop training:

Method 1: unit selection by closed-loop training

Method 2: analytic generation by closed-loop training

Formulation of distortion in speech synthesis

A TTS system takes text input and produces waveform output, so by itself there is no way to calculate distortion. Introduce recorded voice as a reference: text analysis, prosody generation (pitch, duration), and unit modification & concatenation produce synthetic speech, and the distortion is measured between the synthetic speech and the recorded voice.

Closed-loop training of diphone units

Method 1: unit selection by closed-loop training

Set candidate units.

Modify the prosody of each candidate and resynthesize waveforms.

Calculate the distortion between the synthesized and original waveforms in the training data.

Select the diphone units that minimize the total sum of distortion.

Method 2: analytic generation by closed-loop training

Formulate the synthesized waveform as a function of the diphone unit.

Formulate the distortion between synthesized and original waveforms.

Analytically generate the optimum unit that minimizes the total sum of distortion.

[Kagoshima et al., ICASSP97], [Akamine et al., ICSLP98]

Closed-loop training (Method 1)

Training data pass through prosody analysis; candidate synthesis units undergo prosodic modification; the distortion of the resulting synthetic speech against the training data is evaluated; and units are selected. Select the diphone units that minimize the total sum of distortion.

Prosodic modification

A candidate synthesis unit u(n) is (1) decomposed into pitch waveforms, (2) aligned with the training data r(n), and (3) overlap-added to produce the synthetic speech.

Distortion evaluation

The distortion between the training data r(n) and the synthetic speech y(n) is the squared error with normalized power:

e = \sum_n \bigl( r(n) - g\, y(n) \bigr)^2

where the gain g normalizes the power of the synthetic speech to that of the training data.

Synthesis unit selection

For each candidate unit u_i (i = 1..4) and each training datum r_j (j = 1..5), compute the error e_{ij}, giving an error matrix.

Cost function for a single unit:

C(i) = \frac{1}{M} \sum_{j=1}^{M} e_{ij}

Cost function for n units:

C(i_1, \dots, i_n) = \frac{1}{M} \sum_{j=1}^{M} \min\bigl( e_{i_1 j}, \dots, e_{i_n j} \bigr)

The synthesis units minimizing C(\cdot) are selected.

Example (n = 2)

Error matrix (rows: candidate units u_1..u_4; columns: training data r_1..r_5):

u_1: 1 6 2 2 4
u_2: 2 3 3 8 1
u_3: 4 3 2 4 3
u_4: 2 1 5 5 6

C(1,2) = (1+3+2+2+1)/5 = 1.8
C(1,3) = (1+3+2+2+3)/5 = 2.2
C(3,4) = (2+1+2+4+3)/5 = 2.4

C(1,2) is the minimum, so u_1 and u_2 are selected as synthesis units.
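The selection in this example can be reproduced in a few lines; a sketch over the error matrix above, using exhaustive search over pairs (feasible at this scale):

```python
import numpy as np
from itertools import combinations

# Error matrix from the example: rows = candidate units u1..u4,
# columns = training data r1..r5.
E = np.array([[1, 6, 2, 2, 4],
              [2, 3, 3, 8, 1],
              [4, 3, 2, 4, 3],
              [2, 1, 5, 5, 6]], dtype=float)

def cost(rows):
    """C(i1,...,in): mean over training data of the per-column minimum
    error across the chosen candidate units."""
    return E[list(rows)].min(axis=0).mean()

best = min(combinations(range(E.shape[0]), 2), key=cost)
print(best, cost(best))   # (0, 1) 1.8  ->  u1 and u2 are selected
```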

Experiment

Synthesizer

Japanese

waveform-concatenation based

CV/VC units (diphones)

one synthesis unit per diphone (262 units)

11.025 kHz sampling

Training data

50 min/speaker

one male and one female speaker

Comparison between CLT and OLT

Open-loop training (OLT), the baseline:

Step 1: calculate LSP parameters of all the candidates.

Step 2: take the average of all the LSP parameters.

Step 3: select the synthesis unit whose LSP parameters are the closest to the average.

Experimental result

Figure: preference scores (%, 0–80 scale) for CLT versus OLT, for female speech and male speech.

Conditions: 10 test sentences; 7 male subjects; prosody taken from natural speech.

Method 2: Analytic Generation


Closed-loop training to minimize the distortion

Treat the synthesis unit u as an unknown. Training data are drawn from the speech DB and their prosody analyzed; the unit u is prosodically modified into a synthesized speech vector; the error against the training data is computed; and minimization of that error yields the optimum unit: automatic generation of optimum units.

Analytic method

1. Prepare speech segments as training data.

2. Set initial synthesis unit vectors.

3. Partition the training vectors into clusters based on the nearest-neighbor condition.

4. Generate the optimal unit vector that minimizes the distortion in each cluster.

5. Update the synthesis unit vectors.

6. Repeat Steps 3 to 5.

A generalized method to create multiple entries per diphone unit.

Partition of training vectors

G_i = \{\, r_j : d(r_j, y_{j,i}) \le d(r_j, y_{j,k}),\ \text{for all } k \ne i \,\}

r_j : training vector

y_{j,i} : synthesized vector from unit vector u_i

d(\cdot,\cdot) : distance measure

Given initial unit vectors u_i, partition the training vectors r_j based on this distance.

Figure: training vectors (×) scattered in vector space and partitioned into clusters; cluster G_i is represented by unit u_i through its synthesized vectors y_{j,i}.

Generate the optimal unit vector that minimizes the distortion in each cluster.
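A schematic sketch of the iterative loop (Steps 3–5), assuming fixed-length vectors and a caller-supplied modify(u, r) standing in for the prosodic modification; the per-cluster mean below is only a placeholder for the analytic update derived on the following slides:

```python
import numpy as np

def train_units(train, units, modify, n_iter=10):
    """Sketch of iterative closed-loop training (Steps 3-5).
    train: list of equal-length training vectors r_j; units: list of
    initial unit vectors u_i; modify(u, r): returns y, unit u
    prosodically modified to the prosody of r (stand-in for A_j u)."""
    R = np.asarray(train, dtype=float)
    for _ in range(n_iter):
        # Step 3: nearest-neighbour partition by squared-error distance
        y = np.array([[modify(u, r) for r in train] for u in units])
        d = ((y - R[None]) ** 2).sum(axis=2)     # d[i, j] = d(r_j, y_{j,i})
        labels = d.argmin(axis=0)
        # Steps 4-5: regenerate each unit from its own cluster
        for i in range(len(units)):
            cluster = [r for r, l in zip(train, labels) if l == i]
            if cluster:                          # placeholder update; the
                units[i] = np.mean(cluster, 0)   # paper solves dE/du = 0
    return units
```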

Prosodic modification

y_{j,i} is produced by modifying the pitch period and duration of the unit so that they are identical with those of the training vector.

Figure: (a) unit vector, (b) pitch waveform vectors, (c) training vector, (d) modified vector.

Formulation of prosodic modification

y = A u

u : the original unit

y : the resynthesized waveform

A : the overlap-and-add operation, represented as a sparse matrix of 0/1 entries that duplicates and shifts the unit's pitch waveforms to the new pitch marks.
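A sketch of how the overlap-and-add operation can be written as a matrix, under the simplifying assumption that the unit u is stored as a stack of already-windowed pitch waveforms and that src[m] names the waveform copied to the m-th new pitch mark (both names are mine, not the slides'):

```python
import numpy as np

def ola_matrix(n_waveforms, wlen, src, new_marks, out_len):
    """Build A such that y = A @ u, where u stacks the unit's pitch
    waveforms (n_waveforms of length wlen each) and A places a
    (possibly duplicated or deleted) copy of waveform src[m] at each
    new pitch mark, overlap-adding where copies overlap."""
    A = np.zeros((out_len, n_waveforms * wlen))
    for mark, s in zip(new_marks, src):
        for k in range(wlen):
            row = mark + k
            if 0 <= row < out_len:
                A[row, s * wlen + k] += 1.0   # 0/1 entries, as on the slide
    return A

# Because y = A @ u is linear in u, the optimal unit can be solved for
# analytically in the next step.
```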

Distortion

The synthesized vector is y = A u, where u is the synthesis unit.

Error vector (the difference in timbre): e = r - g y, where r is the training vector and g is the optimal gain:

g = \frac{y^{T} r}{y^{T} y}
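The optimal gain follows from setting the derivative of the squared error to zero; a one-line check:

```latex
\frac{\partial}{\partial g}\,\lVert r - g\,y \rVert^{2}
  = -2\,y^{T}\!\left(r - g\,y\right) = 0
  \;\Longrightarrow\;
  g = \frac{y^{T} r}{y^{T} y}
```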

Automatic generation of the optimum unit

Calculate the distortion over all the training vectors in cluster G_i (the objective function):

E = \sum_{r_j \in G_i} (r_j - g_j A_j u)^{T} (r_j - g_j A_j u)

Setting \partial E / \partial u = 0 gives the linear equations

\Bigl( \sum_{r_j \in G_i} g_j^2 A_j^{T} A_j \Bigr) u = \sum_{r_j \in G_i} g_j A_j^{T} r_j ,

which are solved to obtain the optimum unit u.
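A minimal sketch of the resulting solver, assuming the normal-equation matrix is nonsingular; since each gain g_j depends on u, the sketch alternates gain and unit updates (a detail the slides do not spell out):

```python
import numpy as np

def optimum_unit(R, A_list, n_iter=5):
    """Solve for the unit u minimizing
    E = sum_j || r_j - g_j A_j u ||^2 over one cluster.
    R: list of training vectors r_j; A_list: matching OLA matrices A_j."""
    u = np.linalg.lstsq(A_list[0], R[0], rcond=None)[0]  # rough init
    for _ in range(n_iter):
        g = [y.T @ r / (y.T @ y)                 # optimal gain per vector
             for r, A in zip(R, A_list) for y in [A @ u]]
        M = sum(gj ** 2 * A.T @ A for gj, A in zip(g, A_list))
        b = sum(gj * A.T @ r for gj, (r, A) in zip(g, zip(R, A_list)))
        u = np.linalg.solve(M, b)                # dE/du = 0
    return u
```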

Experiments

Training corpus: 40 minutes, Japanese male and female voices

Synthesis units: 302 CV-VC units

Prosody was produced by a method based on closed-loop training

Objective evaluation

Figure: distortion (0.4–1.2) versus the number of synthesis units (1–4) for Method 1 and Method 2.

Subjective evaluation

Preference scores for the analytic method (Method 2) over the selective method (Method 1):

Female: 80%

Male: 76%

Closed-loop training minimizes the distortion in synthetic speech. The optimum unit captures the most likely spectrum across different f0 values (same envelope, different f0), exploiting the increase in sampling points of the spectrum envelope that the different f0 values provide.

3. Closed-loop training of prosody generation model

Generation model

Closed-loop training of the F0 contour codebook

Representative vector selection

Prosody generation model

From grammatical attributes and duration: vector selection from the F0 contour codebook, expansion or contraction to the target duration, offset level prediction, and offset adjustment produce the F0 contour.

Prosody generation algorithm

Generate the F0 contour codebook using the closed-loop training method.

Select the contour vector based on grammatical attributes.

Predict the offset based on grammatical attributes.

Quantification Theory Type I is used for both the contour vector selection model and the offset prediction model.

Training of an F0 contour codebook

Inputs: a duration corpus (duration matrices D_j) and an F0 contour corpus (contour vectors r_j).

Loop: error calculation and clustering (assigning each r_j to a cluster G_i by its approximation error e_{ij} against representative vector c_i, with optimal offset level b_{ij} and modified vector p_{ij}) → vector modification → codebook generation, yielding the F0 contour codebook.

The optimal offset levels and approximation errors obtained for each c_i serve as training data for estimating the approximation-error and offset models.

Figure: training F0 contour vectors (×) partitioned into clusters; cluster G_i is represented by vector c_i.

Generate the optimal representative vector that minimizes the distortion in each cluster.

Codebook generation

F0 contour control model: a segmental contour is approximated by the representative vector, expanded or contracted by the duration matrix, plus an offset level times the unit vector. Representative vectors are generated so as to minimize the total approximation error.

Representative vector selection

◆ Approximation-error prediction model: Quantification Theory Type I

◆ Training data: the approximation-error corpus obtained through codebook training

Grammatical attributes feed an error prediction model for each vector; the vector with the smallest predicted error is selected.

Quantification Theory Type I

A linear model over categorical attributes. Each attribute, e.g. (a) position of the current phrase in the sentence (first / middle / last), (b) current phrase length (categories 1, 2, ..., 11, 12), and further attributes (c) and (d) such as the relation to the next phrase (next / next but one / next but over one), contributes one learned coefficient per category (a_1, a_2, a_3; b_1, ..., b_{12}; c_1, c_2, c_3; d_1, d_2, d_3). The predicted value is the sum of the coefficients of the categories that apply:

x = a_2 + b_2 + c_3 + d_1
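Quantification Theory Type I amounts to linear regression on one-hot-encoded categorical attributes. A minimal sketch with made-up attributes and targets:

```python
import numpy as np

# Hypothetical training data: each row is one phrase, each column a
# categorical attribute, e.g. (a) position in sentence, (b) phrase length.
X = [("first", "short"), ("middle", "long"), ("last", "short"),
     ("middle", "short"), ("last", "long")]
y = np.array([1.2, 0.4, 0.9, 0.7, 0.5])   # target, e.g. approximation error

# One-hot encode every (attribute, category) pair
cats = sorted({(i, c) for row in X for i, c in enumerate(row)})
index = {p: k for k, p in enumerate(cats)}
D = np.zeros((len(X), len(cats)))
for j, row in enumerate(X):
    for i, c in enumerate(row):
        D[j, index[(i, c)]] = 1.0

# Least-squares fit: one coefficient per category; the prediction is the
# sum of the coefficients of the categories that apply (x = a2 + b1 + ...)
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
pred = D @ coef
print(pred.round(2))
```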

Offset level prediction

◆ Offset level prediction model: Quantification Theory Type I

◆ Training data: the optimal offset level corpus obtained through codebook training

Grammatical attributes feed the offset level prediction model, which outputs the offset level.

Experiment

Data sets

Japanese

Male voice

866 sentences (6398 phrases)

Automatic phoneme labeling

Automatic F0 extraction + manual correction

F0 contour codebook

48 vectors (6 accent types × 8 vectors)

Grammatical attributes

Accent type, phrase length, stress, part of speech, etc.


Listening tests

Speech samples:

(a) synthetic speech by the proposed method

(b) synthetic speech by a conventional method (Shiga et al., ICSLP94)

(c) synthetic speech with original prosody

Naturalness test: "Which is more natural, (a) or (b)?"

Individuality test: "Which resembles (c), (a) or (b)?"

Preference scores for the proposed method: 84.3% in the naturalness test and 92.9% in the individuality test.

Examples of F0 contours

Figure: the original F0 contour compared with the contour generated by the proposed method.

Memory size and speech quality

Figure: the same quality-versus-memory-size chart as before; with closed-loop training, diphone-based synthesis achieves improved quality while staying at the small-footprint (~1 MB, e-dictionary class) end, versus unit selection-based synthesis at far larger sizes.

Demonstration

Japanese for car navigation systems

UK English for car navigation systems

Footprint: 500 kB (Japanese) / 800 kB (English)


4. Multiple unit selection and fusion

Motivation

Unit selection

Unit fusion

Motivation [Tamura et al., ICASSP2005]

Pipeline: linguistic features and prosody → multi-unit selection from the speech corpus → unit fusion → prosodic modification → unit concatenation.

Goal: scalable systems with different corpus sizes and stable voice quality. The number of units to be fused is changed depending on the corpus size, so multiple unit selection and fusion yields a stable sound even when using a small corpus.

Unit selection

Input: F0, duration, and phoneme sequence.

1. Unit pre-selection by target cost.

2. Search for the best path based on target cost and concatenation cost.

3. Select N units based on the best path, target cost, and concatenation cost.

Figure: candidate units u1, u2, u3, u4, u5 and path nodes p1, p2, p3, p4 leading to the selected units.
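The best-path search is standard dynamic programming over unit candidates. A generic sketch (the cost structures are placeholders, not Toshiba's exact definitions):

```python
import numpy as np

def select_best_path(target_costs, concat_costs):
    """Dynamic-programming best path over candidate units.
    target_costs[t][i]: target cost of candidate i at position t;
    concat_costs[t][i][j]: cost of joining candidate i at t with
    candidate j at t+1. Returns (best path, its total cost)."""
    acc = [np.asarray(target_costs[0], dtype=float)]
    back = []
    for t in range(1, len(target_costs)):
        step = acc[-1][:, None] + np.asarray(concat_costs[t - 1], dtype=float)
        back.append(step.argmin(axis=0))          # best predecessor
        acc.append(step.min(axis=0) + np.asarray(target_costs[t], dtype=float))
    path = [int(acc[-1].argmin())]
    for b in reversed(back):                      # trace back
        path.append(int(b[path[-1]]))
    return path[::-1], float(acc[-1].min())
```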

Waveform generation: TD-PSOLA.

Unit fusion

The selected units are windowed into pitch waveforms, adjusted in duration and phase so that they are aligned, and averaged to produce the fused unit.
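A minimal sketch of unit fusion, with crude stand-ins for the duration and phase adjustment (truncation and a correlation-maximizing circular shift):

```python
import numpy as np

def fuse_units(units):
    """Sketch: align the selected units' pitch waveforms in duration and
    phase, then average them sample by sample.
    units: list of 1-D pitch-waveform arrays for the same diphone."""
    length = min(len(u) for u in units)
    ref = np.asarray(units[0][:length], dtype=float)
    aligned = []
    for u in units:
        u = np.asarray(u[:length], dtype=float)  # crude duration adjust
        # crude phase adjust: circular shift maximizing correlation with ref
        shift = int(np.argmax(np.correlate(np.tile(u, 2), ref)[:length]))
        aligned.append(np.roll(u, -shift))
    return np.mean(aligned, axis=0)              # the fused unit
```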

Demonstration: a boy and a girl

A quarrel about cakes:

"Ahh, big sister ate my share!"

"I'm one year older than you, so it's fine, right?"

"I'll tell Mom!"

"You'd tattle right away, you little tattletale."

"And you're a sweets glutton yourself, big sister!"


5. Products and services

1. Human machine interface

• In-car navigation systems

• Home appliances: TVs, air conditioners, fridges

• E-book readers

2. Content creation services

• E-learning

• Multimedia texts for education

3. Smartphone applications

Businesses in Toshiba

Middleware licensing; value addition to Toshiba products (microprocessors, semiconductor devices).

Products and services using TTS

Chips, software, and content across many products: video game software, FAX machines, smartphones, tablet PCs, mobile-phone services, PC software, GPS car navigation, telephones, e-dictionaries, Internet services, elevators.

ASR & TTS on a Bluetooth™ chipset for in-car use

You can make hands-free calls while driving ("Call Bob."). The mobile phone connects over Bluetooth™ to a microprocessor running ASR, a grapheme-to-phoneme translator (producing the phoneme sequence), and TTS.

ASR: small vocabulary (e.g. commands, digits); 9 American/European languages in total; focused on noise robustness, especially in the car.

User TTS voice creation

TTS voice creation opened to the public (http://tospeak.ivc.toshiba.co.jp/grcd)

• Provides a pipeline system to create new voices

• A user uploads recordings of 100 sentences

• The process completes in 1 hour

• More than 800 voices created

Variation

Speaker variation

Speaking-style variation: reading, reminder, alert; formal, conversational, reading.

Scalability

Figure: MOS (1.0–4.0) versus footprint (1–100 MB, log scale), for engines with footprints of 2.5 MB, 30 MB, 60 MB, and 100 MB.

RECAIUS™: a cloud service for media processing

Content creation using TTS

Transcription service using ASR

Speech-to-speech translation applications

Spoken dialogue

https://www.toshiba.co.jp/cl/pro/recaius/index_j.htm

https://developer.recaius.io/jp/top.html

CAT-based statistical models

Cluster Adaptive Training (CAT): the means of the distributions are defined as a linear combination of P clusters (factors):

\mu_m^{(s)} = \sum_{i=1}^{P} \lambda_i^{(s)}\, \mu_{c(m,i)}

• \mu_{c(m,i)}: the mean vector of the i-th cluster

• \lambda_i: the weight for cluster i

• The variance is shared across all clusters

The CAT weight vector \lambda^{(s)} represents a continuous voice space.
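A one-screen sketch of the CAT mean computation for a single Gaussian component, with made-up sizes and weights:

```python
import numpy as np

# Minimal sketch of a CAT mean: a speaker-specific Gaussian mean is a
# weighted sum of P cluster mean vectors (weights chosen per speaker).
P, dim = 4, 3                             # hypothetical sizes
cluster_means = np.random.randn(P, dim)   # mu_{c(m,i)} for one component m
lam = np.array([1.0, 0.2, -0.1, 0.4])     # speaker weight vector lambda^(s)

mu_speaker = lam @ cluster_means          # mu_m^(s) = sum_i lambda_i mu_{c(m,i)}
print(mu_speaker)
```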

CAT: model structure

Unlike eigenvoice models, CAT does not need to force the clusters to be orthogonal.

Decision trees can be independent, i.e. tree intersection, so more contexts can be represented efficiently.

Single tree: 8 leaves => 8 contexts. CAT tree intersection: 7 leaves => 12 contexts.

Control of voice character using CAT weights

Create a wide variety of TTS voices by controlling elements of voice characteristics such as the age, gender, and brightness of the voice:

\lambda = \lambda_0 + \sum_{e=1}^{E-1} w_e \,(\lambda_e - \lambda_0)

where \lambda_0 is the average voice model, \lambda_e the weight vector for voice element e, and w_e its control weight.

6. Conclusions

Diphone-based TTS

Prosodic modification using TD-PSOLA

Distortion in speech quality for large modifications and high-pitched voices

Closed-loop training of diphone units

Method 1: unit selection by closed-loop training

Method 2: analytic generation by closed-loop training

Minimize distortion between synthesized and original waveforms

Closed-loop training of the pitch contour codebook

Applications to embedded systems

Cf. MGE training for HMM-TTS [Wu et al., ICASSP2006]
