
Page 1

http://www.natural-speech-technology.org

Natural Speech Technology Programme Overview

Steve Renals
Centre for Speech Technology Research

University of Edinburgh

28 May 2015

Page 2

NST Programme Grant

• At the outset of NST we identified several weaknesses in speech technology systems

• Fragile operation across domains

• Synthesis and recognition developed independently

• Reliance on supervised approaches, manually transcribed training data

• Models for synthesis and recognition include relatively little speech knowledge

• Models only weakly factor the underlying sources of variability

• Systems react crudely (if at all) to the context / environment

• These weaknesses still drive our objectives

Page 3

NST Technical Objectives

• Learning and Adaptation
  • Learning to compactly represent speech and to adapt to new scenarios and speaking styles

• Natural Speech Transcription
  • Speech recognition systems that operate seamlessly across domain and acoustic environment

• Natural Speech Synthesis
  • Controllable synthesisers that learn from data and can generate expressive conversational speech

• Exemplar Applications
  • Prototype deployment in applications, focusing on the health/social domain, media, and the needs of User Group stakeholders

Page 4

NST Highlights 2014-15

• Best paper awards at IEEE SLT-2015, IEEE ICASSP-2014

• Open source software – HTK, Kaldi, HTS

• Speech Recognition applications – BBC (NewsHack and MGB Challenge), Ericsson (Just-in-time ASR), MediaEval, Browsing oral history (English Heritage)

• Voice banking and reconstruction

• homeService

• Challenges and Evaluations – Spoofing challenge

Page 5

[Overview diagram: Theory, Technology, Applications layers; core areas: Speech Synthesis, Speech Recognition, Learning, Adaptation; applications: English Heritage, homeService, Media Archives, Voice reconstruction, donation & banking; Technology showcase]

Learning and adaptation

Structuring diverse data

Canonical acoustic models & adaptation

Canonical language models

Learning speech representations

Adaptation of DNN acoustic models

RNN encoder-decoder for large vocabulary ASR

Training, adaptation, decoding using RNNLMs

Corpora: new collections & structuring diverse data

Bottleneck features for DNN synthesis
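A minimal sketch of the bottleneck-feature idea in the last item above: a DNN is trained with one deliberately narrow hidden layer, and that layer's activations are then extracted as compact features for a downstream synthesis (or recognition) model. Layer sizes and the use of PyTorch are illustrative assumptions, not the NST implementation.

```python
# Hedged sketch: bottleneck feature extraction from a trained DNN.
# Layer sizes (9x39 input, 64-dim bottleneck) are assumptions for illustration.
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, n_in=9 * 39, n_hidden=2000, n_bottleneck=64, n_out=40):
        super().__init__()
        # Layers up to and including the narrow bottleneck.
        self.to_bottleneck = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_bottleneck), nn.Sigmoid(),
        )
        # Remaining layers are only needed while training the network.
        self.head = nn.Sequential(
            nn.Linear(n_bottleneck, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_out),
        )

    def forward(self, x):
        return self.head(self.to_bottleneck(x))

    def bottleneck_features(self, x):
        # After training, only this truncated forward pass is used:
        # the 64-dim activations become input features for another model.
        with torch.no_grad():
            return self.to_bottleneck(x)

# Usage: extract features for a batch of 100 spliced frames.
model = BottleneckDNN()
frames = torch.randn(100, 9 * 39)
features = model.bottleneck_features(frames)   # shape: (100, 64)
```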

Page 6


Natural transcription

Software

Environment models & multiple sound sources

Use of rich contexts

Wide domain coverage

The MGB Challenge at ASRU-2015

Discriminative LIMABEAM

Speaker-informed DNN acoustic models

Kaldi extensions

HTK extensions

Page 7


Natural Synthesis

Fluent and disfluent speech synthesis

Canonical acoustic models for TTS

DNN-based speech synthesis (see the sketch after this list)

Disfluent speech synthesis

Reconstructing voices in the multiple-average-voice framework

Model selection

Multitask learning

Sentence-level control

Assessment of TTS

Are we using enough listeners to evaluate synthetic speech?
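As referenced above, a minimal sketch of the DNN-based speech synthesis idea: a feed-forward network regresses from per-frame linguistic context features to vocoder parameters, trained with a mean-squared-error loss. The feature dimensions, layer sizes, and use of PyTorch are illustrative assumptions, not the NST systems.

```python
# Hedged sketch: feed-forward DNN for statistical parametric speech synthesis.
# Dimensions are assumptions: ~600 linguistic context features per frame in,
# ~190 vocoder parameters (spectrum, F0, aperiodicity and deltas) out.
import torch
import torch.nn as nn

LINGUISTIC_DIM = 600
ACOUSTIC_DIM = 190

tts_dnn = nn.Sequential(
    nn.Linear(LINGUISTIC_DIM, 1024), nn.Tanh(),
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, ACOUSTIC_DIM),        # linear output for regression
)

optimiser = torch.optim.SGD(tts_dnn.parameters(), lr=0.01)
mse = nn.MSELoss()

# One illustrative training step on random stand-in data
# (real training uses aligned linguistic and acoustic frames).
ling = torch.randn(256, LINGUISTIC_DIM)
acoustic = torch.randn(256, ACOUSTIC_DIM)

optimiser.zero_grad()
loss = mse(tts_dnn(ling), acoustic)
loss.backward()
optimiser.step()

# At synthesis time, predicted parameter trajectories would be smoothed
# and passed to a vocoder to generate the waveform.
```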

Page 8

Applications


Voice banking and voice reconstruction

BBC transcription

ASR for people with disordered speech

English Heritage - Oral History demo

Page 9

HTK ANN Extensions

• HTK 3.5 will support ANNs, maintaining compatibility with most existing functions

• Minimises the effort to reuse previous source code and tools

• Allows transfer of e.g. SI/SD input transforms, MPE/MMI sequence training

• 64-bit compatible

• Generic extensions

• Flexible input feature configurations

• ANN structures can be any directed acyclic graph

• Stochastic gradient descent supporting frame/sequence training

• CPU/GPU math kernels for ANNs

• Decoders extended to support tandem/hybrid systems, system combination
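To make the training side of this slide concrete, here is a minimal sketch of frame-level cross-entropy training of an ANN acoustic model with stochastic gradient descent. It is written with PyTorch purely for illustration; it does not reflect the HTK 3.5 tools, configuration files, or source code, and all sizes are assumptions.

```python
# Hedged sketch: frame-level cross-entropy training of an ANN acoustic model
# with stochastic gradient descent. Not HTK code; sizes are assumptions.
import torch
import torch.nn as nn

N_INPUT = 9 * 39      # spliced acoustic feature window
N_HIDDEN = 2000
N_STATES = 6000       # tied context-dependent HMM states

model = nn.Sequential(
    nn.Linear(N_INPUT, N_HIDDEN), nn.Sigmoid(),
    nn.Linear(N_HIDDEN, N_HIDDEN), nn.Sigmoid(),
    nn.Linear(N_HIDDEN, N_STATES),            # logits over HMM states
)
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

def train_minibatch(frames, state_targets):
    """One SGD step on a minibatch of frames with forced-aligned state labels."""
    optimiser.zero_grad()
    loss = criterion(model(frames), state_targets)
    loss.backward()
    optimiser.step()
    return loss.item()

# Stand-in minibatch: 256 random frames with random state labels.
frames = torch.randn(256, N_INPUT)
targets = torch.randint(0, N_STATES, (256,))
print(train_minibatch(frames, targets))
```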

Page 10

HTK Language Model extensions

• HTK v3.5 support for decoding with RNN language models

• Lattice rescoring using RNNLMs

• Class / Full word outputs, interpolation with n-grams

• Similar functionality for feed-forward NN LMs

• RNNLM estimation enhancements

• bunch mode GPU training

• full/class output RNN LMs

• NCE training

• variance regularised training
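A minimal sketch of the n-gram interpolation step used when rescoring with an RNNLM, as listed above: each word's probability is a weighted mix of the RNNLM and n-gram estimates, and hypothesis scores are re-accumulated from the interpolated values. Plain Python with stand-in probability functions; it does not correspond to the HTK lattice-rescoring tools.

```python
# Hedged sketch: linear interpolation of RNNLM and n-gram word probabilities
# during hypothesis rescoring. The two model functions below are stand-ins.
import math

def rnnlm_prob(word, history):
    """Stand-in for P_rnn(word | full history) from a recurrent LM."""
    return 0.02

def ngram_prob(word, history):
    """Stand-in for P_ng(word | last n-1 words) from a back-off n-gram LM."""
    return 0.01

def rescore(hypothesis, lam=0.5):
    """Return the interpolated log probability of a word sequence.

    lam weights the RNNLM; (1 - lam) weights the n-gram model.
    """
    logprob = 0.0
    history = []
    for word in hypothesis:
        p = lam * rnnlm_prob(word, history) + (1.0 - lam) * ngram_prob(word, history)
        logprob += math.log(p)
        history.append(word)
    return logprob

print(rescore(["natural", "speech", "technology"]))
```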

Page 11

MGB Challenge

Page 12

Spoofing Challenge

Page 13

The Rise of Neural Nets

Page 14

The Rise and Fall and Rise of Neural Nets

Page 15

Neural network acoustic models (1990s)

1 hidden layer

~2000 hidden units

~40 CI phone outputs

9x39 MFCC inputs

Bourlard & Morgan, 1994

[Plot: error rate (%) vs. millions of parameters, DARPA RM 1992 – CI-HMM, CI-MLP, CD-HMM, and MIX systems; Renals, Morgan, Cohen & Franco, ICASSP 1992]
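A minimal numpy sketch of the hybrid NN/HMM idea behind these 1990s systems: a single-hidden-layer MLP maps a 9x39 input window to context-independent phone posteriors, which are divided by phone priors to give scaled likelihoods for the HMM decoder. The weights and priors below are random stand-ins, not a trained model.

```python
# Hedged sketch: 1990s-style hybrid MLP acoustic model (1 hidden layer,
# ~2000 units, ~40 CI phone outputs, 9x39 inputs) with scaled likelihoods.
# Random weights/priors stand in for a trained model and corpus counts.
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_HID, N_PHONES = 9 * 39, 2000, 40

W1 = rng.normal(scale=0.01, size=(N_IN, N_HID))
b1 = np.zeros(N_HID)
W2 = rng.normal(scale=0.01, size=(N_HID, N_PHONES))
b2 = np.zeros(N_PHONES)
phone_priors = np.full(N_PHONES, 1.0 / N_PHONES)   # P(phone), from training alignments

def phone_posteriors(x):
    """MLP forward pass: sigmoid hidden layer, softmax output P(phone | x)."""
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_log_likelihoods(x):
    """log p(x | phone) up to a constant: log P(phone | x) - log P(phone).

    These replace GMM emission scores inside a conventional HMM decoder.
    """
    return np.log(phone_posteriors(x)) - np.log(phone_priors)

frame_window = rng.normal(size=(1, N_IN))
print(scaled_log_likelihoods(frame_window).shape)   # (1, 40)
```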

Page 16

Broadcast News 1998: 20.8% WER (best GMM-based system: 13.5%)

Cook, Christie, Ellis, Fosler-Lussier, Gotoh, Kingsbury, Morgan, Renals, Robinson & Williams, DARPA, 1999

Neural network acoustic models (1990s)

1 hidden layer

~2000 hidden units

~40 CI phone outputs

9x39 MFCC inputs

Bourlard & Morgan, 1994

[System diagram: speech → perceptual linear prediction and modulation spectrogram features → CI RNN, CI MLP, and CD RNN acoustic models → Chronos decoders → ROVER combination → utterance hypothesis]

Page 17

NN acoustic models: Limitations vs GMMs

• Computationally restricted to monophone outputs

• CD-RNN factored over multiple networks – limited within-word context

• Training not easily parallelisable

• experimental turnaround slower

• systems less complex (fewer parameters)

• RNN – <100k parameters

• MLP – ~1M parameters

• Rapid adaptation hard (cf MLLR)

Page 18

NN acoustic models: Benefits

• Fewer limitations on inputs

• Correlated features

• Multi-frame windows

• Discriminative training criteria (frame level and sequence level)

• Can be used to generate ‘higher-level’ features

• tandem, posteriorgrams

• bottleneck features

Page 19

(Deep) neural network acoustic models (2010s)

3-8 hidden layers

~2000 hidden units

~6000 CD phone outputs

9x39 MFCC inputs

Hinton, Deng, Yu, Dahl, Mohamed, Jaitly, Senior, Vanhoucke, Nguyen, Sainath & Kingsbury, IEEE SP Mag 2012

Dahl, Yu, Deng & Acero, IEEE TASLP 2012
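A minimal sketch of a 2010s-style DNN-HMM acoustic model at the scale described on this slide: several wide hidden layers mapping a 9x39 spliced input to ~6000 tied context-dependent state outputs. The layer count, activation choice, and use of PyTorch are illustrative assumptions rather than a specific published system.

```python
# Hedged sketch: deep feed-forward acoustic model at 2010s scale:
# 9x39 spliced inputs -> 6 x 2000 hidden units -> ~6000 CD state outputs.
import torch
import torch.nn as nn

def make_dnn(n_in=9 * 39, n_hidden=2000, n_layers=6, n_states=6000):
    layers, width = [], n_in
    for _ in range(n_layers):
        layers += [nn.Linear(width, n_hidden), nn.ReLU()]
        width = n_hidden
    layers.append(nn.Linear(width, n_states))   # logits over CD HMM states
    return nn.Sequential(*layers)

dnn = make_dnn()
frames = torch.randn(32, 9 * 39)                # a minibatch of spliced frames
state_posteriors = torch.softmax(dnn(frames), dim=-1)
print(state_posteriors.shape)                   # torch.Size([32, 6000])
# As in the earlier hybrid sketch, dividing these posteriors by state priors
# gives scaled likelihoods for HMM decoding.
```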

Page 20

(Deep) neural network acoustic models (2010s)

3-8 hidden layers

~2000 hidden units

~6000 CD phone outputs

9x39 MFCC inputs

DEEP – Automatically learned feature extraction

WIDE – Softmax output layer

ACOUSTIC INPUT – Spectral? Cepstral? Derived features?

Hinton, Deng, Yu, Dahl, Mohamed, Jaitly, Senior, Vanhoucke, Nguyen, Sainath & Kingsbury, IEEE SP Mag 2012

Dahl, Yu, Deng & Acero, IEEE TASLP 2012

Page 21

(Deep) neural network acoustic models (2010s)

3-8 hidden layers

~2000 hidden units

~6000 CD phone outputs

9x39 MFCC inputs

DEEP – Automatically learned feature extraction

WIDE – Softmax output layer

ACOUSTIC INPUT – Spectral? Cepstral? Derived features?

ACTIVATION FUNCTIONS – pooling, RELU, gated units

WEIGHT SHARING – adaptation, CNNs

TRAINING – optimisation, objective fn

ADAPTATION

ARCHITECTURES – recurrent, convolutional, …


Page 23

Neural networks & NST

[Chart: number of NST papers per year, 2012–2015, split into NN and non-NN, shown separately for ASR and TTS]

Page 24

Today’s agenda

• 9:30 – 11:20 Intro, 4 talks, poster spotlights

• 11:20 – 13:00 Coffee + demos/posters [LR4, ground floor]

• 13:00 – 14:15 Lunch

• 14:15 – 15:15 3 talks

• 15:15 – 15:45 Coffee

• 15:45 – 16:45 Discussion: Clinical, Media, Future Challenges

• 16:45 – 17:00 Wrap-up

• 17:00 – 18:30 Advisory board meeting

• 19:00 Dinner at Emmanuel College