1. Deep Neural Network for Speech Synthesis
Heng Lu, University of Edinburgh, 23 May 2013
2. Motivation
- Speech synthesisers should learn automatically from data.
- Decision tree clustering needs expert linguistic knowledge to design the question set, whereas the VSM generates continuous labels from text using Information Retrieval methods.
- Decision tree clustering makes a hard division for each training sample, whereas a Deep Neural Network is trained with back-propagation, so every training sample is used to train all of the DNN's parameters.
- The mapping from text to speech is complex.
3. Conventional HMM-based parametric TTS
4. Framework for DNN-TTS
[Diagram: a feed-forward DNN mapping input context vectors x(1..M) through stacked hidden layers h(1..N) to acoustic output vectors y(1..M).]
5. DNN-TTS systems
Six systems are built for comparison:
(a) Letter-based binary DNN-TTS.
(b) Letter-based VSM DNN-TTS.
(c) Letter-based HTS baseline.
(d) Phone-based binary frame-mapping DNN-TTS.
(e) Phone-based HTS baseline.
(f) Phone-based full-context-dependent HMM state-mapping DNN-TTS.
For the letter-based systems, a 3-state HMM is used; for phone-based state mapping, a 5-state HMM is employed.
6. Binary DNN input
For the phone-based systems, the phone context information is transformed into a binary vector of dimension 1256 according to the answers to a context-dependent question set. The question set includes (previous, current, following) phone category and position questions, as well as word- and phrase-level questions. Utterance-level questions are excluded to reduce the input vector dimensionality.
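A minimal sketch of this encoding, assuming an illustrative context tuple and question set (the real system uses 1256 questions; the names and predicates below are hypothetical):

```python
# Sketch (not the author's code): turn a phone context into a binary input
# vector by answering a question set, one bit per question.

# A toy full-context description: (previous, current, following) phone
# plus a position feature. Real labels carry many more fields.
context = {"prev": "sil", "cur": "ae", "next": "t", "pos_in_word": 1}

# Each question is a yes/no predicate over the context.
questions = [
    ("prev_is_silence", lambda c: c["prev"] == "sil"),
    ("cur_is_vowel",    lambda c: c["cur"] in {"ae", "ih", "aa", "uw"}),
    ("next_is_plosive", lambda c: c["next"] in {"p", "t", "k", "b", "d", "g"}),
    ("word_initial",    lambda c: c["pos_in_word"] == 1),
]

# The DNN input is the concatenation of all answers as 0/1 bits.
binary_vector = [1 if ask(context) else 0 for _, ask in questions]
print(binary_vector)  # -> [1, 1, 1, 1] for this context
```

With 1256 such questions, the same loop yields the 1256-dimensional input described above.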
7. Vector space model (VSM)
The vector space model (VSM) is
well established in Information Retrieval (IR) and Natural Language
Processing (NLP) as a way of representing objects such as documents
and words as vectors of descriptors. To build vector space models,
co-occurrence statistics are gathered in matrix form to produce
high-dimensional representations of the distributional behavior of
word and letter types in the corpus. Lower dimensional
representations are obtained by approximately factorizing the
matrix of raw co-occurrence counts by the application of slim
singular value decomposition (SVD). The objective distance between
the VSM output values directly represents the similarity of two
units. The VSMs are learned in an unsupervised fashion from text:
no labeled speech is required.
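The pipeline above can be sketched on a toy corpus (illustrative only; the window size, corpus, and dimensionality here are assumptions, not the paper's settings):

```python
import numpy as np

# Sketch of the VSM idea: gather letter co-occurrence counts from a tiny
# corpus, then factorise the count matrix with a truncated ("slim") SVD
# to get low-dimensional continuous descriptors for each letter.
corpus = "the cat sat on the mat"
chars = corpus.replace(" ", "")
letters = sorted(set(chars))
index = {ch: i for i, ch in enumerate(letters)}

# Co-occurrence counts within a +/-1 character window.
counts = np.zeros((len(letters), len(letters)))
for i, ch in enumerate(chars):
    for j in (i - 1, i + 1):
        if 0 <= j < len(chars):
            counts[index[ch], index[chars[j]]] += 1

# Truncated SVD: keep only the top-k singular directions.
k = 2
U, S, Vt = np.linalg.svd(counts)
letter_vsm = U[:, :k] * S[:k]   # one k-dimensional vector per letter
print(letter_vsm.shape)         # (number of letters, k)
```

Distances between rows of `letter_vsm` then reflect distributional similarity between letters, which is what the continuous DNN input labels exploit.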
8. Database
A database of a British English male speaker sampled at 16 kHz is used for the experiments. There are 1000 utterances in the database, of which 860 are used for training and 140 are held out for subjective and objective evaluation. 21st-order Line Spectral Pairs (LSPs) plus their delta and delta-delta features are extracted to represent the spectrum. Log F0 plus delta and delta-delta are used to model pitch. A 10th-order aperiodicity (AP) feature is extracted. Three binary dimensions indicate voiced/unvoiced status for F0 and its delta and delta-delta at the current frame.
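Tallying the per-frame acoustic output vector implied by the slide (a sketch under the assumption that AP carries no deltas; the original system's exact layout may differ):

```python
# Illustrative tally of the per-frame acoustic feature dimensions.
lsp = 21 * 3   # 21st-order LSP + delta + delta-delta
lf0 = 1 * 3    # log F0 + delta + delta-delta
ap  = 10       # 10th-order aperiodicity (assumed static only)
vuv = 3        # voiced/unvoiced flags for F0, delta, delta-delta

total = lsp + lf0 + ap + vuv
print(total)   # -> 79 under these assumptions
```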
9. DNN training
Input vector dimensions:
- phone-based binary context information: 1256
- letter-based binary context information: 1134
- VSM-based continuous vector: 27
No pre-training is used; instead the network is trained layer by layer. Mean Square Error is the training criterion.
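A minimal sketch of back-propagation under a Mean Square Error criterion (illustrative dimensions and data, not the author's setup; "layer by layer" training would repeat this while adding one hidden layer at a time and keeping the already-trained weights):

```python
import numpy as np

# Toy regression: map 27-dim inputs (e.g. VSM vectors) to acoustic targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 27))
Y = rng.normal(size=(64, 5))

# One hidden layer with tanh units, linear output.
W1 = rng.normal(scale=0.1, size=(27, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 5));  b2 = np.zeros(5)
lr = 0.1

init_mse = ((np.tanh(X @ W1 + b1) @ W2 + b2 - Y) ** 2).mean()

for epoch in range(200):
    H = np.tanh(X @ W1 + b1)        # forward pass: hidden layer
    P = H @ W2 + b2                 # forward pass: linear output
    err = P - Y
    gP = 2 * err / err.size         # gradient of the MSE criterion
    gH = (gP @ W2.T) * (1 - H ** 2) # back-propagate through tanh
    W2 -= lr * (H.T @ gP); b2 -= lr * gP.sum(0)
    W1 -= lr * (X.T @ gH); b1 -= lr * gH.sum(0)

final_mse = ((np.tanh(X @ W1 + b1) @ W2 + b2 - Y) ** 2).mean()
print(init_mse, "->", final_mse)    # MSE decreases over training
```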
10. DNN error over training epochs for letter-based binary DNN-TTS
[Plot: Mean Square Error (0.7-1.4) vs. epochs (0-100); train and test error curves for 1, 2 and 3 hidden layers.]
11. DNN error over training epochs for letter-based VSM DNN-TTS
[Plot: Mean Square Error (0.7-1.4) vs. epochs (0-100); train and test error curves for 1 and 2 hidden layers.]
12. MLPG-generated LSPs compared with LSPs from natural speech, for the phone-based full-context-dependent HMM state-mapping DNN-TTS
[Plot: LSP trajectories in Hz (0-8000) over roughly 1000 frames.]
13. Objective test
The RMSE between DNN-TTS-generated LSPs and natural LSPs is calculated over all 140 test utterances.

System                                          LSP RMSE
(a) Letter-based binary DNN-TTS                 0.222398
(b) Letter-based VSM DNN-TTS                    0.228053
(c) Letter-based HTS baseline                   0.157672
(d) Phone-based binary frame-mapping DNN-TTS    0.244260
(e) Phone-based HTS baseline                    0.135460
(f) Phone-based HMM state-mapping DNN-TTS       0.176447

Table: RMSE for LSP
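The measure itself is straightforward; a sketch on illustrative data (the frame values below are made up, not from the experiments):

```python
import numpy as np

# RMSE between generated and natural LSP frames (frames x LSP order).
natural = np.array([[0.12, 0.25, 0.40],
                    [0.11, 0.24, 0.41]])
generated = np.array([[0.10, 0.27, 0.38],
                      [0.13, 0.22, 0.43]])

rmse = np.sqrt(((generated - natural) ** 2).mean())
print(round(rmse, 6))   # -> 0.02 for this toy data
```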
14. Subjective listening test
To compare the systems carefully, a pairwise AB test is conducted. 19 native English speakers take part, each listening to 20 pairs of synthesised utterances per sub-test. Results are shown below (rows compared against column systems a-f).

System                                          a b c d e f
(a) Letter-based binary DNN-TTS                 = > <
(b) Letter-based VSM DNN-TTS                    < = <
(c) Letter-based HTS baseline                   > = <
(d) Phone-based binary frame-mapping DNN-TTS    = <
(e) Phone-based HTS baseline                    > = >
(f) Phone-based HMM state-mapping DNN-TTS       > > < =

Table: Subjective listening test
15. DNN-TTS synthesised speech demo
- Letter-based VSM DNN-TTS
- Phone-based HTS baseline
- Phone-based HMM state-mapping DNN-TTS
16. Conclusion
- Phone-label-based DNN-TTS sounds fair; the VSM-based and letter-based systems are still not quite intelligible.
- More training data is needed for DNN training.
- Future work: try different DNN training criteria and include pre-training.