Deep neural network for speech synthesis
Heng Lu
University of Edinburgh
23 May 2013
Motivation
Speech synthesisers automatically learn from data.
Decision tree clustering needs expert linguistic knowledge for question-set design, while the VSM generates continuous labels from text using an Information Retrieval method.
Decision tree clustering uses a hard division for each training sample, while a Deep Neural Network is trained with back-propagation and every training sample is used to train all the parameters in the DNN.
From text to speech is a Complex Mapping
2 of 16
Conventional HMM-based parametric TTS
3 of 16
Framework for DNN-TTS
[Figure: feed-forward DNN architecture — input units x_1…x_M and context units c_1…c_M, hidden layers h, output units y]
4 of 16
DNN-TTS systems
6 systems are built for comparison.
(a) Letter-based binary DNN-TTS
(b) Letter-based VSM DNN-TTS
(c) Letter-based HTS baseline
(d) Phone-based binary frame-mapping DNN-TTS
(e) Phone-based HTS baseline
(f) Phone-based full-context-dependent HMM state-mapping DNN-TTS

For the letter-based systems, a 3-state HMM model is used; for phone-based state mapping, a 5-state HMM model is employed.
5 of 16
Binary DNN input
For the phone-based systems, phone context information is transformed into a binary vector of dimension 1256, according to the answers to a context-dependent question set. The question set includes (previous, current, following) phone category and position questions, plus word- and phrase-level questions. Utterance-level questions are excluded to reduce the input vector dimensionality.
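As a minimal sketch of this encoding — the questions and phone sets below are illustrative toys, not the actual 1256-question set:

```python
# Hypothetical sketch: answering a context-dependent question set over a
# (previous, current, following) phone triple to build a binary input
# vector. One bit per question; real systems use ~1256 questions.

def phone_category_questions():
    # Each question is a yes/no predicate over the context triple.
    return [
        lambda ctx: ctx[1] in {"a", "e", "i", "o", "u"},  # current is a vowel?
        lambda ctx: ctx[0] in {"p", "t", "k"},            # previous is a stop?
        lambda ctx: ctx[2] == "sil",                      # next is silence?
    ]

def context_to_binary(ctx, questions):
    # 1 if the answer to the question is yes, else 0.
    return [1 if q(ctx) else 0 for q in questions]

vec = context_to_binary(("s", "a", "n"), phone_category_questions())
```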
6 of 16
Vector space model (VSM)
The vector space model (VSM) is well established in Information Retrieval (IR) and Natural Language Processing (NLP) as a way of representing objects such as documents and words as vectors of descriptors. To build vector space models,

co-occurrence statistics are gathered in matrix form to produce high-dimensional representations of the distributional behaviour of word and letter types in the corpus.

Lower-dimensional representations are obtained by approximately factorizing the matrix of raw co-occurrence counts by the application of slim singular value decomposition (SVD).

The objective distance between the VSM output values directly represents the similarity of two units. The VSMs are learned in an unsupervised fashion from text: no labelled speech is required.
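The two steps above — gathering co-occurrence counts, then factorizing with a slim SVD — can be sketched as follows. The corpus, window size, and kept dimensionality are illustrative, not the paper's settings:

```python
import numpy as np

# Minimal VSM sketch: letter co-occurrence counts from raw text, reduced
# with a thin SVD. No labelled speech is involved anywhere.

corpus = "the cat sat on the mat"
letters = sorted(set(corpus))
idx = {c: i for i, c in enumerate(letters)}

# Co-occurrence matrix: counts of letter pairs within a +/-1 window.
C = np.zeros((len(letters), len(letters)))
for i, c in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            C[idx[c], idx[corpus[j]]] += 1

# Slim (thin) SVD: keep the top-k singular directions as the
# low-dimensional continuous representation of each letter type.
k = 3
U, s, Vt = np.linalg.svd(C)
vsm = U[:, :k] * s[:k]          # one k-dimensional vector per letter type
```

Distances between rows of `vsm` then serve as the similarity measure between units.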
7 of 16
Database
A database of a British English male speaker sampled at 16 kHz is used for the experiments. There are 1000 utterances in the database, of which 860 are used for training and 140 are held out for subjective and objective evaluation. 21st-order Line Spectral Pairs (LSP) plus delta and delta-delta are extracted to represent the spectrum. Log F0 plus delta and delta-delta are used to model pitch. A 10th-order Aperiodicity (AP) feature is extracted. A 3-dimensional binary vector indicates voiced/unvoiced for F0 and its delta and delta-delta for the current frame.
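The per-frame feature layout above can be tallied as a quick arithmetic check. Whether AP carries deltas is not stated on the slide, so static-only AP is assumed here and the total is illustrative:

```python
# Arithmetic sketch of the per-frame acoustic feature dimensions
# described above (ordering and the static-only AP are assumptions).
lsp    = 21 * 3   # 21st-order LSP + delta + delta-delta
log_f0 = 1 * 3    # log F0 + delta + delta-delta
ap     = 10       # 10th-order aperiodicity (assumed static only)
vuv    = 3        # binary voiced/unvoiced flags for F0 and its deltas

total = lsp + log_f0 + ap + vuv
print(total)
```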
8 of 16
DNN-training
Input vector dimensions:
  Phone-based binary context information: 1256
  Letter-based binary context information: 1134
  VSM-based continuous vector: 27
No pre-training is used, but layer-by-layer training is applied. Mean Square Error is used as the training criterion.
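A toy numpy sketch of such a network trained by back-propagation under an MSE criterion — the layer sizes, tanh activations, learning rate, and data here are assumptions for illustration, not the paper's settings:

```python
import numpy as np

# Feed-forward network with tanh hidden layers and a linear output,
# trained by plain gradient descent on a Mean Square Error criterion.

rng = np.random.default_rng(0)

def init(sizes):
    # Small random weights, zero biases, one (W, b) pair per layer.
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    acts = [x]
    for i, (W, b) in enumerate(params):
        z = acts[-1] @ W + b
        # tanh on hidden layers, linear output layer
        acts.append(np.tanh(z) if i < len(params) - 1 else z)
    return acts

def sgd_step(params, x, y, lr=0.01):
    acts = forward(params, x)
    # Gradient of the (0.5 * squared error) loss at the linear output.
    delta = (acts[-1] - y) / len(x)
    for i in reversed(range(len(params))):
        W, b = params[i]
        gW = acts[i].T @ delta
        gb = delta.sum(axis=0)
        if i > 0:
            # Back-propagate through the tanh nonlinearity.
            delta = (delta @ W.T) * (1 - acts[i] ** 2)
        params[i] = (W - lr * gW, b - lr * gb)
    return float(np.mean((acts[-1] - y) ** 2))

params = init([27, 64, 64, 63])      # e.g. VSM input -> acoustic output
x = rng.standard_normal((32, 27))
y = rng.standard_normal((32, 63))
losses = [sgd_step(params, x, y) for _ in range(50)]
```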
9 of 16
DNN error along with epochs for Letter-based binary DNN-TTS
[Plot: Mean Square Error (0.7–1.4) vs. training epochs (0–100); train and test error curves for 1, 2 and 3 hidden layers]
10 of 16
DNN error along with epochs for Letter-based VSM DNN-TTS
[Plot: Mean Square Error (0.7–1.4) vs. training epochs (0–100); train and test error curves for 1 and 2 hidden layers]
11 of 16
MLPG-generated LSP compared with LSP from natural speech for Phone-based full-context-dependent HMM state-mapping DNN-TTS
[Plot: LSP trajectories (Hz, 0–8000) over frames (0–1000), MLPG-generated vs. natural speech]
12 of 16
Objective Test
The RMSE for LSPs is calculated between DNN-TTS-generated trajectories and natural ones. A total of 140 test utterances are used.
System                                        LSP Error
Letter-based binary DNN-TTS                   0.222398
Letter-based VSM DNN-TTS                      0.228053
Letter-based HTS baseline                     0.157672
Phone-based binary frame-mapping DNN-TTS      0.244260
Phone-based HTS baseline                      0.135460
Phone-based HMM state-mapping DNN-TTS         0.176447

Table: RMSE for LSP
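The objective measure above can be sketched as follows; frame alignment between the generated and natural trajectories is assumed given, and the arrays here are synthetic:

```python
import numpy as np

# Sketch of the objective test: RMSE between DNN-TTS-generated and
# natural LSP trajectories, averaged over all frames and coefficients.

def lsp_rmse(generated, natural):
    # generated, natural: (frames, order) arrays of LSP coefficients
    return float(np.sqrt(np.mean((generated - natural) ** 2)))

natural   = np.zeros((100, 21))
generated = np.full((100, 21), 0.2)   # constant 0.2 offset per coefficient
err = lsp_rmse(generated, natural)
```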
13 of 16
Subjective Listening Test
To compare the systems carefully, pairwise AB tests are conducted. 19 native English speakers take part in the test, listening to 20 pairs of synthesized utterances for each sub-test. Results are shown below.
System (a) Letter-based binary DNN-TTS, compared pairwise:
  vs. (b): =    vs. (c): >    vs. (d): >    vs. (e): <    vs. (f): =

Table: Subjective listening test (system (a) vs. each other system)
14 of 16
DNN-TTS Synthesis Speech Demo
Letter based VSM DNN-TTS
Phone based HTS baseline
Phone based HMM state mapping DNN-TTS
15 of 16
[Audio demos for each of the three systems above: rjs_02_0439.wav, rjs_02_0440.wav, rjs_02_0443.wav, rjs_02_0444.wav, rjs_02_0445.wav]
Conclusion
Phone-label-based DNN-TTS sounds fair; the VSM-based and letter-based systems are still not quite intelligible.
More training data is needed for DNN training. Future work: try different DNN training criteria and include pre-training.
16 of 16