High-Quality Statistical Parametric Speech Synthesis Using Generative Adversarial Networkssython.org...

©Yuki Saito, 2018/01/25

High-Quality Statistical Parametric Speech Synthesis Using

Generative Adversarial Networks

48-166605 Yuki Saito

Supervisor: Prof. Hiroshi Saruwatari

Speech Synthesis

– Technique for synthesizing speech using computer

Applications

– Speech communication assistance (e.g., speech translation)

– Entertainments (e.g., singing voice conversion)

DNN-based speech synthesis* [Zen et al. 2013]

– High flexibility but low speech quality

Research Field: Speech Synthesis

Text-To-Speech (TTS)

Text Speech

Voice Conversion (VC)

Outputspeech

Inputspeech

Hello Hello[Sagisaka et al., 1988]

[Stylianou et al., 1988]

DNN: Deep Neural Network

Thesis Overview

Smoothness of speech parametersOverly smoothed Naturally smoothed

nit Goal

(natural speech)

Chapter 2Wa

Global Variance (GV) [Toda et al., 2007]

Chapter 4(proposed)

Generative Adversarial Nets

(GANs)Chapter 3

(proposed)DNNs w/ vocoders

[Zen et al., 2013]

DNNs w/o vocoders

[Takaki et al., 2017]

STFT: Short-Term Fourier Transform

Table of Contents

Chapter 1. Introduction

Chapter 2. Speech Synthesis Using DNNs

Chapter 3. Speech Synthesis Using GANs w/ Vocoders

Chapter 4. Speech Synthesis Using GANs w/o Vocoders

Chapter 5. Conclusion

Speech Analysis and Parameter Extraction

Speech signal

STFT amplitude spectra

(e.g., 513 dim.)

Vocoder-derived speech params.

(e.g., 32 dim.)

Speech synthesis w/ vocoders

(Sec. 2.2)

STFT analysis

Mel-cepstral coefficients(timbre)

Prosodic feats.(pitch, hoarseness)

Speech synthesis w/o vocoders

(Sec. 2.3)

Vocoder-based parameterization

[Kawahara et al., 1999]

DNN-based Speech Synthesis w/ Vocoders

Acoustic models(DNN)

⋯⋯

𝒙𝑇

Mel-cepstral coefficients

Prosodicfeats.

Linguisticfeats.

Speechparams.

⋯⋯ 0

Phoneme

Accent

Frameposition

[Zen et al., 2013]

𝒚𝑇

DNN-based Speech Synthesis w/o Vocoders

Acoustic models(DNN)

⋯⋯

𝒙𝑇

𝒚𝑇

⋯⋯ 0

Phoneme

Accent

Frameposition

[Takaki et al., 2017]

Prosodic feats.

Amplitudespectra

Linguisticfeats. + F0

Minimum Generation Error (MGE) Training Algorithm

𝐿MGE 𝒚, 𝒚

[Wu et al., 2016]

Naturalspeech

params.

𝐿MGE 𝒚, 𝒚 =1

𝑇 𝒚 − 𝒚 ⊤ 𝒚 − 𝒚 → Minimize

Generatedspeech params. 𝒚

Acoustic models

𝒙 𝒀

Linguisticfeats.

MLPG: Maximum Likelihood Parameter Generation [Tokuda et al., 2000]

Issue of DNN-based Speech Synthesis:Over-smoothing of Generated Speech Parameters

Natural MGE

21st mel-cepstral coefficient

These distributions are significantly different...(GV [Toda et al., 2007] explicitly compensates the 2nd moment.)

Narrow

Table of Contents

Generative Adversarial Nets (GANs) [Goodfellow et al., 2014]

1: true0: fake

True sample

→ Minimize𝐿DGAN

𝒚, 𝒚 = 𝐿D,1GAN

𝒚 + 𝐿D,0GAN 𝒚

Loss to recognizetrue sample as true

Loss to recognizefake sample as fake

Discriminator

𝐷 ⋅

Prior(e.g., 𝑁 𝟎, 𝑰 )

𝐿DGAN

𝒚, 𝒚

1: true

Fake sample

Generator

𝒚𝒛Latent

variable 𝒚

𝐿DGAN

𝒚, 𝒚 is equivalent to the cross-entropy function.

Discriminator

𝐷 ⋅

Minimize approx. JS* divergence betw. dists. of 𝒚 and 𝒚.

𝐿ADVGAN 𝒚

1: true

→ Minimize𝐿ADVGAN 𝒚 = 𝐿D,1

GAN 𝒚

Loss to recognizefake sample as true

Generative Adversarial Nets (GANs) [Goodfellow et al., 2014]

Generator

𝒚𝒛Latent

variable Fake sample

Prior(e.g., 𝑁 𝟎, 𝑰 )

JS: Jensen-Shannon

Proposed Method:Acoustic Model Training Using GANs

𝜔D: weight, 𝐸𝐿∗ : expectation values of 𝐿∗

𝐿G 𝒚, 𝒚 = 𝐿MGE 𝒚, 𝒚 + 𝜔D

𝐸𝐿MGE

𝐸𝐿ADV𝐿ADVGAN 𝒚 → Minimize

Discriminator

𝐷 ⋅

1: natural

𝐿ADVGAN 𝒚

Loss to recognizegenerated params. as natural

𝐿MGE 𝒚, 𝒚

Natural𝒚

Generated 𝒚𝒙 𝒀

Linguisticfeats.

Acoustic models(generator)

Distributions of Speech Parameters

The proposed algorithm alleviates the over-smoothing effect!

21st mel-cepstral coefficient

Natural MGE Proposed

Narrow

Wide asnatural speech

GANs = minimizing divergence betw. two distributions

Discussions

Compensating for distribution differences

– The proposed method generalizes the conventional methods such as the GV.

Integrating voice anti-spoofing techniques

– Features that are effective for detecting synthetic speech can be used (Sec. 3.4.8).

Changing a divergence to be minimized

– Earth mover’s distance (Wasserstein GAN [Arjovsky et al., 2017]) was the best for improving synthetic speech quality (Sec. 3.4.10).

Applying various speech synthesis

– Not only TTS (next slides), but also VC (Sec. 3.5).

Experimental Conditions

Train / evaluate data 450 sentences / 53 sentences (16 kHz sampling)

Linguistic feats. 442-dimensional vector

Speech params. Mel-cepstral coefficients and prosodic features

Optimizer AdaGrad [Duchi et al., 2011]

Acoustic models Feed-Forward 442 – 3x512 (ReLU) – 94 (linear)

Discriminator Feed-Forward 26 – 3x256 (ReLU) – 1 (sigmoid)

Weight 𝜔D 1.0 (Secs. 3.4.2 and 3.4.4)

Methods MGE [Wu et al., 2016], GV [Toda et al., 2007], Proposed

Subjective Evaluationsin Terms of Speech Quality

Preference AB test (select better sounded speech)

Proposed

Error bars denote 95% confidence intervals.

0.0 0.2 0.4

Preference score

0.6 0.8 1.0

Proposed

(a) MGE vs. Proposed

(b) GV vs. Proposed

Improved

Proposed method improves synthetic speech quality!

Proposed

Table of Contents

Issue in Speech Synthesis w/o Vocoders

Over-smoothing of generated STFT amplitude spectra

– Formants (spectral peaks) tend to be weakened.

– The method proposed in Chap. 3 cannot be applied directly.

• Difficulties in modeling highly complex distribution

To deal with the issue...

– Dimensionality reduction retaining spectral structures

Natural amplitudes

Acoustic Model Training UsingLow-resolution GANs

𝐿GL

𝒚, 𝒚 = 𝐿MGE 𝒚, 𝒚 + 𝜔DL 𝐸𝐿MGE

𝐸𝐿ADV𝐿ADVGAN 𝒚 L → Minimize

Low-resolutiondiscriminator

𝐷 L ⋅

1: natural

𝐿ADVGAN 𝒚 L

𝐿MGE 𝒚, 𝒚Natural 𝒚Generated 𝒚

Acoustic models

𝒙Linguisticfeats. + F0

Average pooling𝝓 ⋅

𝝓 ⋅ 𝒚 𝒚 L

𝒚 L

Examples ofNatural and Generated Amplitude Spectra

Natural

Proposed (Chap. 3)

Proposed (Chap. 4)

Low-resolution GANs capture differences in formants!

Subjective Evaluationsin Terms of Speech Quality

Low-resolution GAN improves synthetic speech quality!

Preference AB test (select better sounded speech)

Proposed(Chap. 3)

0.0 0.2 0.4

Preference score

0.6 0.8 1.0

Proposed(Chap. 4)

(a) MGE vs. Proposed (Chap. 3)

(b) MGE vs. Proposed (Chap. 4)

Improved

Proposed(Chap. 3)

Proposed(Chap. 4)

Degraded

Error bars denote 95% confidence intervals.

Table of Contents

Conclusion

Purpose: improving synthetic speech quality of SPSS

Proposed: acoustic model training algorithms using GANs

• They compensate for the distribution differences betw. natural / generated speech parameters.

Results

– The proposed algorithms improved synthetic speech quality compared to conventional methods.

Future works

– Investigating anti-spoofing techniques

– Further improving speech quality using STFT spectra

High-Quality Statistical Parametric Speech Synthesis Using Generative Adversarial Networkssython.org...

Documents

Speech Enhancement Using Synthesis and Adaptive Techniqueszduan/teaching/ece477/...B. Speech Synthesis and Recognition Techniques Many approaches exist to solving speech synthesis

L17: Speech synthesis (front-end)

Speech Signal Processing - Phil Garner · Speech Signal Processing Milos Cernak Introduction Speech synthesis signal processing Analysis Speech parameter generation Re-synthesis Synthesis

Statistical Speech Synthesis

Vergina: A Modern Greek Speech Database for Speech Synthesis

SSML for Urdu Speech Synthesis

8 Speech Synthesis

The Use of Speech Synthesis in Exploring DifferentThe use of speech synthesis in exploring different

Speech Synthesis April 12, 2013 Speech Synthesis: A Basic Overview Speech synthesis is the generation of speech by machine. The reasons for studying

Low-Cost Portable Text Recognition and Speech Synthesis ... · Low-cost Portable Text Recognition and Speech Synthesis with ... portable text recognition and speech synthesis

Text Processing for Speech Synthesis

1 5-Text To Speech (TTS) Speech Synthesis Speech Synthesis Concept Phone Units Phone Sequence To Speech Speech Naturalness –Concatenative Approaches –Rule-Based

6-Text To Speech (TTS) Speech Synthesis

Speech Synthesis Waveform generation 2 - Speech at · PDF file · 2008-10-01Speech Synthesis Waveform generation 2. Speech Synthesis Text Analysis ... HMM-generation based synthesis

Multimodal speech synthesis

4. Speech Synthesis

Improvements in Speech Synthesis

Speech Analysis Synthesis and Perception

Speech Synthesis: A Review - ijert.org · Speech synthesis is a process of automatic generation of speech by machines/computers. The goal of speech synthesis is to develop a machine

Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech