Deep neural network for speech synthesis
Heng Lu, University of Edinburgh, 23 May 2013


  • Deep neural network for speech synthesis

    Heng Lu

    University of Edinburgh

    23 May 2013

  • Motivation

    Speech synthesisers automatically learn from data.

    Decision tree clustering needs expert linguistic knowledge to design the question set, while the VSM generates continuous labels from text using Information Retrieval methods.

    Decision tree clustering makes a hard division of the training samples, while a DNN is trained with back-propagation and every training sample is used to train all of the parameters in the DNN.

    From text to speech is a complex mapping.

  • Conventional HMM-based parametric TTS

  • Framework for DNN-TTS

    [Figure: feed-forward DNN architecture mapping an input feature vector x through hidden layers h to an output acoustic feature vector y]

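    A minimal sketch of the feed-forward mapping the figure depicts, assuming plain fully connected layers with tanh hidden units and a linear output. The layer sizes below are illustrative, not those reported in the slides.

```python
import numpy as np

def forward(x, weights, biases):
    """Map a linguistic input vector x to an acoustic output vector y
    through tanh hidden layers and a linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(W @ h + b)           # hidden layers h
    return weights[-1] @ h + biases[-1]  # linear output y

# Illustrative sizes: 1256-dim binary input (phone based system),
# three hidden layers, 79-dim acoustic output. Hidden widths are guesses.
sizes = [1256, 512, 512, 512, 79]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) * 0.01 for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

x = rng.integers(0, 2, size=1256).astype(float)  # binary context vector
y = forward(x, weights, biases)                  # predicted acoustic frame
```
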

  • DNN-TTS systems

    Six systems are built for comparison:

    (a) Letter based binary DNN-TTS
    (b) Letter based VSM DNN-TTS
    (c) Letter based HTS baseline
    (d) Phone based binary for frame mapping DNN-TTS
    (e) Phone based HTS baseline
    (f) Phone based full context dependent HMM state mapping DNN-TTS

    For the letter based systems, a 3-state HMM is used; for phone based state mapping, a 5-state HMM is employed.

  • Binary DNN input

    For the phone based binary context information, the phone context is transformed into a 1256-dimensional binary vector according to the answers to a context-dependent question set. The question set includes (previous, current, following) phone category, position, word and phrase level questions. Utterance level questions are excluded to reduce the input vector dimensionality.
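    A hedged sketch of this encoding: each question is a yes/no predicate over the phone context, and the input vector is simply the vector of answers. The three questions below are illustrative stand-ins, not the real 1256-question set.

```python
# Each question is a predicate over a context record; the binary DNN input
# is the vector of answers. Illustrative questions only.
QUESTIONS = [
    lambda ctx: ctx["cur_phone"] in {"a", "e", "i", "o", "u"},  # vowel?
    lambda ctx: ctx["prev_phone"] == "sil",                     # after silence?
    lambda ctx: ctx["pos_in_word"] == 0,                        # word-initial?
]

def encode_context(ctx):
    """Answer every question to produce the binary input vector."""
    return [1.0 if q(ctx) else 0.0 for q in QUESTIONS]

print(encode_context({"cur_phone": "a", "prev_phone": "sil", "pos_in_word": 0}))
# -> [1.0, 1.0, 1.0]
```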

  • Vector space model (VSM)

    The vector space model (VSM) is well established in Information Retrieval (IR) and Natural Language Processing (NLP) as a way of representing objects such as documents and words as vectors of descriptors. To build vector space models,

    co-occurrence statistics are gathered in matrix form to produce high-dimensional representations of the distributional behaviour of word and letter types in the corpus.

    Lower dimensional representations are obtained by approximately factorizing the matrix of raw co-occurrence counts through the application of slim singular value decomposition (SVD).

    The objective distance between VSM output values directly represents the similarity of two units. The VSMs are learned in an unsupervised fashion from text: no labelled speech is required.
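    As a toy illustration of the idea (not the exact co-occurrence scheme or corpus used here), the following builds letter vectors from neighbour co-occurrence counts and truncates them with SVD:

```python
import numpy as np

# Toy corpus; the real system gathers counts from a large text corpus.
corpus = "the cat sat on the mat"
letters = sorted(set(corpus))
idx = {c: i for i, c in enumerate(letters)}

# Co-occurrence counts of each letter type with its immediate neighbours.
C = np.zeros((len(letters), len(letters)))
for a, b in zip(corpus, corpus[1:]):
    C[idx[a], idx[b]] += 1
    C[idx[b], idx[a]] += 1

# Truncated SVD: keep the k leading components as low-dimensional
# continuous descriptors of each letter type.
k = 3
U, S, _ = np.linalg.svd(C)
vsm = U[:, :k] * S[:k]  # one k-dim vector per letter

# Distance between two vectors reflects distributional similarity.
print(np.linalg.norm(vsm[idx["a"]] - vsm[idx["e"]]))
```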

  • Database

    A database of a British English male speaker sampled at 16 kHz is used for the experiments. There are 1000 utterances in the database, of which 860 are used for training and 140 are held out for subjective and objective evaluation. 21st-order Line Spectral Pairs (LSP) plus delta and delta-delta are extracted to represent the spectrum. Log F0 plus delta and delta-delta are used to model pitch. A 10th-order aperiodicity (AP) feature is extracted. Three binary dimensions indicate the voiced/unvoiced status of F0 and its delta and delta-delta for the current frame.
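    Reading that description literally, the per-frame acoustic vector would be assembled as below. This is a back-of-the-envelope tally; the exact feature layout is not given in the slides.

```python
# Hypothetical per-frame acoustic vector layout implied by the text.
lsp = 21 * 3   # LSP + delta + delta-delta
lf0 = 1 * 3    # log F0 + delta + delta-delta
ap = 10        # aperiodicity
vuv = 3        # voiced/unvoiced flags for F0, delta, delta-delta
print(lsp + lf0 + ap + vuv)  # 79 dimensions per frame
```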

  • DNN-training

    Input vector dimensions:

        Phone based binary context information   1256
        Letter based binary context information  1134
        VSM based continuous vector                27

    No pre-training is used; instead, the network is trained layer by layer. Mean Square Error is used as the training criterion.
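    One plausible reading of "layer by layer training" is greedy growth: fit a one-hidden-layer network, then insert a fresh hidden layer below the output and keep training the whole stack. The sketch below assumes that reading, with plain SGD on the MSE criterion and tanh hidden units; widths, learning rate and data are placeholders.

```python
import numpy as np
rng = np.random.default_rng(0)

def init(n_in, n_out):
    return rng.standard_normal((n_out, n_in)) * 0.1, np.zeros(n_out)

def forward(x, layers):
    acts = [x]
    for W, b in layers[:-1]:
        acts.append(np.tanh(W @ acts[-1] + b))   # tanh hidden layers
    W, b = layers[-1]
    acts.append(W @ acts[-1] + b)                # linear output
    return acts

def sgd_step(x, y, layers, lr=0.01):
    acts = forward(x, layers)
    delta = acts[-1] - y                         # MSE gradient (up to a constant)
    for i in reversed(range(len(layers))):
        W, b = layers[i]
        layers[i] = (W - lr * np.outer(delta, acts[i]), b - lr * delta)
        if i > 0:
            delta = (W.T @ delta) * (1.0 - acts[i] ** 2)  # back through tanh
    return float(np.mean((acts[-1] - y) ** 2))

n_in, n_hid, n_out = 27, 32, 79   # VSM input, toy width, acoustic output
layers = [init(n_in, n_hid), init(n_hid, n_out)]
data = [(rng.standard_normal(n_in), rng.standard_normal(n_out)) for _ in range(100)]

for depth in range(1, 4):          # grow to three hidden layers
    if depth > 1:
        layers.insert(-1, init(n_hid, n_hid))  # new hidden layer below output
    for epoch in range(10):
        mse = np.mean([sgd_step(x, y, layers) for x, y in data])
    print(f"{depth} hidden layer(s): MSE {mse:.3f}")
```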

  • DNN error along with epochs for Letter based binary DNN-TTS

    [Figure: Mean Square Error vs. training epochs (0-100) for networks with 1, 2 and 3 hidden layers, showing train and test error for each]

  • DNN error along with epochs for Letter based VSM DNN-TTS

    [Figure: Mean Square Error vs. training epochs (0-100) for networks with 1 and 2 hidden layers, showing train and test error for each]

  • MLPG generated LSP compared with LSP from natural speech for Phone based full context dependent HMM state mapping DNN-TTS

    [Figure: MLPG-generated and natural LSP trajectories (Hz, 0-8000) over frames 0-1000]
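    As a compact sketch of what MLPG (maximum-likelihood parameter generation) does: it solves a weighted least-squares problem so the generated trajectory is consistent with both the predicted static and delta means. The 1-D version below assumes diagonal variances and a central-difference delta window; the real system also uses delta-deltas.

```python
import numpy as np

def mlpg_1d(mu_static, mu_delta, var_static, var_delta):
    """Solve (W' D W) c = W' D mu for the smooth trajectory c, where W
    stacks an identity (static) block and a delta block, and D is the
    diagonal precision matrix."""
    T = len(mu_static)
    Wd = (np.eye(T, k=1) - np.eye(T, k=-1)) / 2.0  # delta window [-0.5, 0, 0.5]
    W = np.vstack([np.eye(T), Wd])                 # (2T, T)
    mu = np.concatenate([mu_static, mu_delta])
    D = np.diag(np.concatenate([1 / var_static, 1 / var_delta]))
    return np.linalg.solve(W.T @ D @ W, W.T @ D @ mu)

# Toy example: noisy static means; near-zero delta means pull the
# trajectory towards smoothness.
T = 100
t = np.arange(T)
mu_s = np.sin(t / 10) + 0.3 * np.random.default_rng(0).standard_normal(T)
c = mlpg_1d(mu_s, np.zeros(T), np.ones(T), 0.1 * np.ones(T))
```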

  • Objective Test

    The RMSE of the LSPs is calculated between the DNN-TTS generated speech and the natural speech. A total of 140 test utterances are used.

    System                                         LSP Error
    Letter based binary DNN-TTS                    0.222398
    Letter based VSM DNN-TTS                       0.228053
    Letter based HTS baseline                      0.157672
    Phone based binary for frame mapping DNN-TTS   0.244260
    Phone based HTS baseline                       0.135460
    Phone based HMM state mapping DNN-TTS          0.176447

    Table: RMSE for LSP
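    A sketch of this objective measure, assuming frame-aligned generated and natural LSP matrices per utterance (file handling and alignment are omitted):

```python
import numpy as np

def lsp_rmse(generated, natural):
    """RMSE between frame-aligned LSP matrices of shape (frames, order)."""
    return float(np.sqrt(np.mean((generated - natural) ** 2)))

def corpus_rmse(pairs):
    """Average RMSE over (generated, natural) pairs, e.g. the 140 test utterances."""
    return float(np.mean([lsp_rmse(g, n) for g, n in pairs]))
```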

  • Subjective Listening Test

    To compare the systems carefully, a pairwise AB test is conducted. 19 native English speakers take part in the test, and they listen to 20 pairs of synthesized utterances for each sub-test. Results are shown below.

    System                            b   c   d   e   f
    (a) Letter based binary DNN-TTS   =   >   >   <   =

    Table: Subjective Listening Test (preference of system (a) against each other system)

  • DNN-TTS Synthesis Speech Demo

    Letter based VSM DNN-TTS

    Phone based HTS baseline

    Phone based HMM state mapping DNN-TTS

    Demo audio (one set per system, audio/wav): rjs_02_0439.wav, rjs_02_0440.wav, rjs_02_0443.wav, rjs_02_0444.wav, rjs_02_0445.wav

  • Conclusion

    Phone label based DNN-TTS sounds fair; the VSM based and letter based systems are still not quite intelligible.

    More training data is needed for DNN training. Future work will try different DNN training criteria and include pre-training.
