Deep neural network for speech synthesis
Heng Lu
University of Edinburgh
23 May 2013
Motivation
Speech synthesisers automatically learn from data.
Decision tree clustering needs expert linguistic knowledge for question-set design, while the VSM generates continuous labels from text using an Information Retrieval method.
Decision tree clustering uses a hard division for each training sample, while a Deep Neural Network is trained with back-propagation and every training sample is used to train all the parameters in the DNN.
From text to speech is a Complex Mapping
2 of 16
Conventional HMM-based parametric TTS
3 of 16
Framework for DNN-TTS
[Figure: feed-forward DNN architecture — input units x_1…x_M and context units c_1…c_M, hidden layers h, output units y]
4 of 16
DNN-TTS systems
6 systems are built for comparison.
(a) Letter-based binary DNN-TTS
(b) Letter-based VSM DNN-TTS
(c) Letter-based HTS baseline
(d) Phone-based binary frame-mapping DNN-TTS
(e) Phone-based HTS baseline
(f) Phone-based full-context-dependent HMM state-mapping DNN-TTS

For the letter-based systems, a 3-state HMM model is used; for phone-based state mapping, a 5-state HMM model is employed.
5 of 16
Binary DNN input
For the phone-based systems, phone context information is transformed into a binary vector of dimension 1256, according to the answers to a context-dependent question set. The question set includes (previous, current, following) phone category and position questions, plus word- and phrase-level questions. Utterance-level questions are excluded to reduce the input vector dimensionality.
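As a minimal sketch of this encoding — the questions and phone sets below are illustrative toys, not the actual 1256-question set:

```python
# Hypothetical sketch: answering a context-dependent question set over a
# (previous, current, following) phone triple to build a binary input
# vector. One bit per question; real systems use ~1256 questions.

def phone_category_questions():
    # Each question is a yes/no predicate over the context triple.
    return [
        lambda ctx: ctx[1] in {"a", "e", "i", "o", "u"},  # current is a vowel?
        lambda ctx: ctx[0] in {"p", "t", "k"},            # previous is a stop?
        lambda ctx: ctx[2] == "sil",                      # next is silence?
    ]

def context_to_binary(ctx, questions):
    # 1 if the answer to the question is yes, else 0.
    return [1 if q(ctx) else 0 for q in questions]

vec = context_to_binary(("s", "a", "n"), phone_category_questions())
```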
6 of 16
Vector space model (VSM)
The vector space model (VSM) is well established in Information Retrieval (IR) and Natural Language Processing (NLP) as a way of representing objects such as documents and words as vectors of descriptors. To build vector space models,

co-occurrence statistics are gathered in matrix form to produce high-dimensional representations of the distributional behaviour of word and letter types in the corpus.

Lower-dimensional representations are obtained by approximately factorizing the matrix of raw co-occurrence counts by the application of slim singular value decomposition (SVD).

The objective distance between the VSM output values directly represents the similarity of two units. The VSMs are learned in an unsupervised fashion from text: no labelled speech is required.
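The two steps above — gathering co-occurrence counts, then factorizing with a slim SVD — can be sketched as follows. The corpus, window size, and kept dimensionality are illustrative, not the paper's settings:

```python
import numpy as np

# Minimal VSM sketch: letter co-occurrence counts from raw text, reduced
# with a thin SVD. No labelled speech is involved anywhere.

corpus = "the cat sat on the mat"
letters = sorted(set(corpus))
idx = {c: i for i, c in enumerate(letters)}

# Co-occurrence matrix: counts of letter pairs within a +/-1 window.
C = np.zeros((len(letters), len(letters)))
for i, c in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            C[idx[c], idx[corpus[j]]] += 1

# Slim (thin) SVD: keep the top-k singular directions as the
# low-dimensional continuous representation of each letter type.
k = 3
U, s, Vt = np.linalg.svd(C)
vsm = U[:, :k] * s[:k]          # one k-dimensional vector per letter type
```

Distances between rows of `vsm` then serve as the similarity measure between units.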
7 of 16
Database
A database of a British English male speaker sampled at 16 kHz is used for the experiments. There are 1000 utterances in the database, of which 860 are used for training and 140 are held out for subjective and objective evaluation. 21st-order Line Spectral Pairs (LSP) plus delta and delta-delta are extracted to represent the spectrum. Log F0 plus delta and delta-delta are used to model pitch. A 10th-order Aperiodicity (AP) feature is extracted. A 3-dimensional binary vector indicates voiced/unvoiced for F0 and its delta and delta-delta for the current frame.
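The per-frame feature layout above can be tallied as a quick arithmetic check. Whether AP carries deltas is not stated on the slide, so static-only AP is assumed here and the total is illustrative:

```python
# Arithmetic sketch of the per-frame acoustic feature dimensions
# described above (ordering and the static-only AP are assumptions).
lsp    = 21 * 3   # 21st-order LSP + delta + delta-delta
log_f0 = 1 * 3    # log F0 + delta + delta-delta
ap     = 10       # 10th-order aperiodicity (assumed static only)
vuv    = 3        # binary voiced/unvoiced flags for F0 and its deltas

total = lsp + log_f0 + ap + vuv
print(total)
```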
8 of 16
DNN-training
Input vector dimensions:
  Phone-based binary context information: 1256
  Letter-based binary context information: 1134
  VSM-based continuous vector: 27
No pre-training is used, but layer-by-layer training is applied. Mean Square Error is used as the training criterion.
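A toy numpy sketch of such a network trained by back-propagation under an MSE criterion — the layer sizes, tanh activations, learning rate, and data here are assumptions for illustration, not the paper's settings:

```python
import numpy as np

# Feed-forward network with tanh hidden layers and a linear output,
# trained by plain gradient descent on a Mean Square Error criterion.

rng = np.random.default_rng(0)

def init(sizes):
    # Small random weights, zero biases, one (W, b) pair per layer.
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    acts = [x]
    for i, (W, b) in enumerate(params):
        z = acts[-1] @ W + b
        # tanh on hidden layers, linear output layer
        acts.append(np.tanh(z) if i < len(params) - 1 else z)
    return acts

def sgd_step(params, x, y, lr=0.01):
    acts = forward(params, x)
    # Gradient of the (0.5 * squared error) loss at the linear output.
    delta = (acts[-1] - y) / len(x)
    for i in reversed(range(len(params))):
        W, b = params[i]
        gW = acts[i].T @ delta
        gb = delta.sum(axis=0)
        if i > 0:
            # Back-propagate through the tanh nonlinearity.
            delta = (delta @ W.T) * (1 - acts[i] ** 2)
        params[i] = (W - lr * gW, b - lr * gb)
    return float(np.mean((acts[-1] - y) ** 2))

params = init([27, 64, 64, 63])      # e.g. VSM input -> acoustic output
x = rng.standard_normal((32, 27))
y = rng.standard_normal((32, 63))
losses = [sgd_step(params, x, y) for _ in range(50)]
```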
9 of 16
DNN error along with epochs for Letter-based binary DNN-TTS
[Plot: Mean Square Error (0.7–1.4) vs. training epochs (0–100); train and test error curves for 1, 2 and 3 hidden layers]
10 of 16
DNN error along with epochs for Letter-based VSM DNN-TTS
[Plot: Mean Square Error (0.7–1.4) vs. training epochs (0–100); train and test error curves for 1 and 2 hidden layers]
11 of 16
MLPG-generated LSP compared with LSP from natural speech for Phone-based full-context-dependent HMM state-mapping DNN-TTS
[Plot: LSP trajectories (Hz, 0–8000) over frames (0–1000), MLPG-generated vs. natural speech]
12 of 16
Objective Test
The RMSE for LSPs is calculated between DNN-TTS-generated trajectories and natural ones. A total of 140 test utterances are used.
System                                        LSP Error
Letter-based binary DNN-TTS                   0.222398
Letter-based VSM DNN-TTS                      0.228053
Letter-based HTS baseline                     0.157672
Phone-based binary frame-mapping DNN-TTS      0.244260
Phone-based HTS baseline                      0.135460
Phone-based HMM state-mapping DNN-TTS         0.176447

Table: RMSE for LSP
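The objective measure above can be sketched as follows; frame alignment between the generated and natural trajectories is assumed given, and the arrays here are synthetic:

```python
import numpy as np

# Sketch of the objective test: RMSE between DNN-TTS-generated and
# natural LSP trajectories, averaged over all frames and coefficients.

def lsp_rmse(generated, natural):
    # generated, natural: (frames, order) arrays of LSP coefficients
    return float(np.sqrt(np.mean((generated - natural) ** 2)))

natural   = np.zeros((100, 21))
generated = np.full((100, 21), 0.2)   # constant 0.2 offset per coefficient
err = lsp_rmse(generated, natural)
```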
13 of 16
Subjective Listening Test
To compare the systems carefully, pairwise AB tests are conducted. 19 native English speakers take part in the test, listening to 20 pairs of synthesized utterances for each sub-test. Results are shown below.
System (a) Letter-based binary DNN-TTS, compared pairwise:
  vs. (b): =    vs. (c): >    vs. (d): >    vs. (e): <    vs. (f): =

Table: Subjective listening test (system (a) vs. each other system)
14 of 16
DNN-TTS Synthesis Speech Demo
Letter based VSM DNN-TTS
Phone based HTS baseline
Phone based HMM state mapping DNN-TTS
15 of 16
[Audio demos for each of the three systems above: rjs_02_0439.wav, rjs_02_0440.wav, rjs_02_0443.wav, rjs_02_0444.wav, rjs_02_0445.wav]
Conclusion
Phone-label-based DNN-TTS sounds fair; the VSM-based and letter-based systems are still not quite intelligible.
More training data is needed for DNN training. Future work: try different DNN training criteria and include pre-training.
16 of 16