1. Deep Neural Network for Speech Synthesis
Heng Lu, University of Edinburgh, 23 May 2013
2. Motivation
- Speech synthesisers should learn automatically from data.
- Decision tree clustering needs expert linguistic knowledge to design the question set, whereas the VSM generates continuous labels from text using Information Retrieval methods.
- Decision tree clustering makes a hard division for each training sample, whereas a Deep Neural Network is trained with back-propagation, so every training sample is used to train all of the DNN's parameters.
- The mapping from text to speech is complex.
3. Conventional HMM-based parametric TTS
4. Framework for DNN-TTS
[Diagram: a feed-forward DNN mapping input context vectors x(1..M) through stacked hidden layers h(1..N) to acoustic output vectors y(1..M).]
5. DNN-TTS systems
Six systems are built for comparison:
(a) Letter-based binary DNN-TTS.
(b) Letter-based VSM DNN-TTS.
(c) Letter-based HTS baseline.
(d) Phone-based binary frame-mapping DNN-TTS.
(e) Phone-based HTS baseline.
(f) Phone-based full-context-dependent HMM state-mapping DNN-TTS.
For the letter-based systems, a 3-state HMM is used; for phone-based state mapping, a 5-state HMM is employed.
6. Binary DNN input
For the phone-based systems, the phone context information is transformed into a binary vector of dimension 1256 according to the answers to a context-dependent question set. The question set includes (previous, current, following) phone category and position questions, as well as word- and phrase-level questions. Utterance-level questions are excluded to reduce the input vector dimensionality.
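A minimal sketch of this encoding, assuming an illustrative context tuple and question set (the real system uses 1256 questions; the names and predicates below are hypothetical):

```python
# Sketch (not the author's code): turn a phone context into a binary input
# vector by answering a question set, one bit per question.

# A toy full-context description: (previous, current, following) phone
# plus a position feature. Real labels carry many more fields.
context = {"prev": "sil", "cur": "ae", "next": "t", "pos_in_word": 1}

# Each question is a yes/no predicate over the context.
questions = [
    ("prev_is_silence", lambda c: c["prev"] == "sil"),
    ("cur_is_vowel",    lambda c: c["cur"] in {"ae", "ih", "aa", "uw"}),
    ("next_is_plosive", lambda c: c["next"] in {"p", "t", "k", "b", "d", "g"}),
    ("word_initial",    lambda c: c["pos_in_word"] == 1),
]

# The DNN input is the concatenation of all answers as 0/1 bits.
binary_vector = [1 if ask(context) else 0 for _, ask in questions]
print(binary_vector)  # -> [1, 1, 1, 1] for this context
```

With 1256 such questions, the same loop yields the 1256-dimensional input described above.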
7. Vector space model (VSM)
The vector space model (VSM) is
well established in Information Retrieval (IR) and Natural Language
Processing (NLP) as a way of representing objects such as documents
and words as vectors of descriptors. To build vector space models,
co-occurrence statistics are gathered in matrix form to produce
high-dimensional representations of the distributional behavior of
word and letter types in the corpus. Lower dimensional
representations are obtained by approximately factorizing the
matrix of raw co-occurrence counts by the application of slim
singular value decomposition (SVD). The objective distance between
the VSM output values directly represents the similarity of two
units. The VSMs are learned in an unsupervised fashion from text:
no labeled speech is required.
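The pipeline above can be sketched on a toy corpus (illustrative only; the window size, corpus, and dimensionality here are assumptions, not the paper's settings):

```python
import numpy as np

# Sketch of the VSM idea: gather letter co-occurrence counts from a tiny
# corpus, then factorise the count matrix with a truncated ("slim") SVD
# to get low-dimensional continuous descriptors for each letter.
corpus = "the cat sat on the mat"
chars = corpus.replace(" ", "")
letters = sorted(set(chars))
index = {ch: i for i, ch in enumerate(letters)}

# Co-occurrence counts within a +/-1 character window.
counts = np.zeros((len(letters), len(letters)))
for i, ch in enumerate(chars):
    for j in (i - 1, i + 1):
        if 0 <= j < len(chars):
            counts[index[ch], index[chars[j]]] += 1

# Truncated SVD: keep only the top-k singular directions.
k = 2
U, S, Vt = np.linalg.svd(counts)
letter_vsm = U[:, :k] * S[:k]   # one k-dimensional vector per letter
print(letter_vsm.shape)         # (number of letters, k)
```

Distances between rows of `letter_vsm` then reflect distributional similarity between letters, which is what the continuous DNN input labels exploit.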
8. Database
A database of a British English male speaker sampled at 16 kHz is used for the experiments. There are 1000 utterances in the database, of which 860 are used for training and 140 are held out for subjective and objective evaluation. 21st-order Line Spectral Pairs (LSPs) plus their delta and delta-delta features are extracted to represent the spectrum. Log F0 plus delta and delta-delta are used to model pitch. A 10th-order aperiodicity (AP) feature is extracted. Three binary dimensions indicate voiced/unvoiced status for F0 and its delta and delta-delta at the current frame.
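Tallying the per-frame acoustic output vector implied by the slide (a sketch under the assumption that AP carries no deltas; the original system's exact layout may differ):

```python
# Illustrative tally of the per-frame acoustic feature dimensions.
lsp = 21 * 3   # 21st-order LSP + delta + delta-delta
lf0 = 1 * 3    # log F0 + delta + delta-delta
ap  = 10       # 10th-order aperiodicity (assumed static only)
vuv = 3        # voiced/unvoiced flags for F0, delta, delta-delta

total = lsp + lf0 + ap + vuv
print(total)   # -> 79 under these assumptions
```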
9. DNN training
Input vector dimensions:
- phone-based binary context information: 1256
- letter-based binary context information: 1134
- VSM-based continuous vector: 27
No pre-training is used; instead the network is trained layer by layer. Mean Square Error is the training criterion.
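A minimal sketch of back-propagation under a Mean Square Error criterion (illustrative dimensions and data, not the author's setup; "layer by layer" training would repeat this while adding one hidden layer at a time and keeping the already-trained weights):

```python
import numpy as np

# Toy regression: map 27-dim inputs (e.g. VSM vectors) to acoustic targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 27))
Y = rng.normal(size=(64, 5))

# One hidden layer with tanh units, linear output.
W1 = rng.normal(scale=0.1, size=(27, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 5));  b2 = np.zeros(5)
lr = 0.1

init_mse = ((np.tanh(X @ W1 + b1) @ W2 + b2 - Y) ** 2).mean()

for epoch in range(200):
    H = np.tanh(X @ W1 + b1)        # forward pass: hidden layer
    P = H @ W2 + b2                 # forward pass: linear output
    err = P - Y
    gP = 2 * err / err.size         # gradient of the MSE criterion
    gH = (gP @ W2.T) * (1 - H ** 2) # back-propagate through tanh
    W2 -= lr * (H.T @ gP); b2 -= lr * gP.sum(0)
    W1 -= lr * (X.T @ gH); b1 -= lr * gH.sum(0)

final_mse = ((np.tanh(X @ W1 + b1) @ W2 + b2 - Y) ** 2).mean()
print(init_mse, "->", final_mse)    # MSE decreases over training
```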
10. DNN error over training epochs for letter-based binary DNN-TTS
[Plot: Mean Square Error (0.7-1.4) vs. epochs (0-100); train and test error curves for 1, 2 and 3 hidden layers.]
11. DNN error over training epochs for letter-based VSM DNN-TTS
[Plot: Mean Square Error (0.7-1.4) vs. epochs (0-100); train and test error curves for 1 and 2 hidden layers.]
12. MLPG-generated LSPs compared with LSPs from natural speech, for the phone-based full-context-dependent HMM state-mapping DNN-TTS
[Plot: LSP trajectories in Hz (0-8000) over roughly 1000 frames.]
13. Objective test
The RMSE between DNN-TTS-generated LSPs and natural LSPs is calculated over all 140 test utterances.

System                                          LSP RMSE
(a) Letter-based binary DNN-TTS                 0.222398
(b) Letter-based VSM DNN-TTS                    0.228053
(c) Letter-based HTS baseline                   0.157672
(d) Phone-based binary frame-mapping DNN-TTS    0.244260
(e) Phone-based HTS baseline                    0.135460
(f) Phone-based HMM state-mapping DNN-TTS       0.176447

Table: RMSE for LSP
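The measure itself is straightforward; a sketch on illustrative data (the frame values below are made up, not from the experiments):

```python
import numpy as np

# RMSE between generated and natural LSP frames (frames x LSP order).
natural = np.array([[0.12, 0.25, 0.40],
                    [0.11, 0.24, 0.41]])
generated = np.array([[0.10, 0.27, 0.38],
                      [0.13, 0.22, 0.43]])

rmse = np.sqrt(((generated - natural) ** 2).mean())
print(round(rmse, 6))   # -> 0.02 for this toy data
```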
14. Subjective listening test
To compare the systems carefully, a pairwise AB test is conducted. 19 native English speakers take part, each listening to 20 pairs of synthesised utterances per sub-test. Results are shown below (rows compared against column systems a-f).

System                                          a b c d e f
(a) Letter-based binary DNN-TTS                 = > <
(b) Letter-based VSM DNN-TTS                    < = <
(c) Letter-based HTS baseline                   > = <
(d) Phone-based binary frame-mapping DNN-TTS    = <
(e) Phone-based HTS baseline                    > = >
(f) Phone-based HMM state-mapping DNN-TTS       > > < =

Table: Subjective listening test
15. DNN-TTS synthesised speech demo
- Letter-based VSM DNN-TTS
- Phone-based HTS baseline
- Phone-based HMM state-mapping DNN-TTS
16. Conclusion
- Phone-label-based DNN-TTS sounds fair; the VSM-based and letter-based systems are still not quite intelligible.
- More training data is needed for DNN training.
- Future work: try different DNN training criteria and include pre-training.