
Page 1

Achieving Human Parity in Conversational Speech Recognition

Speech & Dialog Research Group

AI & Research

Page 2

Acknowledgements

Joint work with great collaborators!

Page 3

Human Parity in Conversational Speech Recognition

• What is Human Parity?
  • Humans make mistakes, too. Can ASR make fewer?
• Conversational Speech Recognition
  • Humans talking in an unplanned way
  • Focus on each other, not on a computer
• The result of thirty years of progress
  • DARPA / US Government programs
  • Conversational Speech Recognition is the latest in a series of increasingly difficult tasks.

Page 4

Significance: History

[Chart: benchmark history across the successive tasks RM, ATIS, WSJ, SWB, CH]

For many years, DARPA drove the field

Page 5

Significance: Community

Building on accumulated knowledge of many institutions!

Page 6

Significance: Technical

• The right tool for the right job

• CNNs, LSTMs!

• Building on lots of past innovations:
  • HMM modeling
  • Distributed Representations [Hinton '84]
  • Early CNNs, RNNs, TDNNs [Lang & Hinton '88, Waibel et al. '89, Robinson '91, Pineda '87]
  • Hybrid training [Renals et al. '91, Bourlard & Morgan '94]
  • Discriminative modeling
  • Speaker adaptation
  • System combination

[Diagram: GMMs → RNNs / CNNs]

Page 7

Outline

• Acoustic Modeling

• Language Modeling

• Decoding, Rescoring & System Combination

• Measuring Human Performance

• Results

• Counterpoint – Letter based CTC

• Conclusions

Page 8

Outline

• Acoustic Modeling
  • Model Structures

• Language Modeling

• Decoding, Rescoring & System Combination

• Measuring Human Performance

• Results

• Counterpoint – Letter based CTC

• Conclusions

Page 9

Acoustic Modeling: Hybrid HMM/DNN

[Yu et al., 2010; Dahl et al., 2011]

Record performance in 2011 [Seide et al.]

Hybrid HMM/NN approach standard
But DNN model now obsolete (!)
• Poor spatial/temporal invariance

1st pass decoding

Page 10

Acoustic Modeling: VGG CNN

[Simonyan & Zisserman, 2014; Frossard 2016, Saon et al., 2016, Krizhevsky et al., 2012]

Adapted from image processing
Robust to temporal and frequency shifts

Page 11

Acoustic Modeling: ResNet

[He et al., 2015]

Add a non-linear offset to a linear transformation of features
Similar to fMPE in Povey et al., 2005
See also Ghahremani & Droppo, 2016
Our best single model after rescoring
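As a rough illustration of that residual idea (a generic block, not the exact convolutional ResNet used here; the shapes and the two-layer offset network are assumptions), the output is a linear transform of the input plus a learned non-linear offset:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W_shortcut, W1, W2):
    """y = (linear transform of x) + (non-linear offset F(x)).

    With W_shortcut set to the identity this reduces to the usual
    ResNet form y = relu(x + F(x)).
    """
    offset = W2 @ relu(W1 @ x)              # small non-linear network F(x)
    return relu(W_shortcut @ x + offset)

# Toy usage on a 64-dim feature vector with an identity shortcut
d = 64
y = residual_block(np.random.randn(d), np.eye(d),
                   np.random.randn(d, d) * 0.01, np.random.randn(d, d) * 0.01)
```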

1st pass decoding

Page 12

Acoustic Modeling: LACE CNN

[Yu et al., 2016]

Combines batch normalization, ResNet jumps, and attention masks in a CNN
Tied for 2nd best single model after rescoring

1st pass decoding

Page 13

CNN Comparison

Very deep
Many parameters
Small convolution patterns
Processing ~½ second per window

Page 14

Acoustic Modeling: Bidirectional LSTMs

Stable form of recurrent neural net
Robust to temporal shifts
Tied for 2nd best single model

[Hochreiter & Schmidhuber, 1997; Graves & Schmidhuber, 2005; Sak et al., 2014]

[Graves & Jaitly '14]

Page 15

Outline

• Acoustic Modeling
  • Training Techniques

• Language Modeling

• Decoding, Rescoring & System Combination

• Measuring Human Performance

• Results

• Counterpoint – Letter based CTC

• Conclusions

Page 16

I-vector Adaptation

[Figure: DNN acoustic model with hidden layers H1–H4 predicting the frame LABELS; the per-frame input Y(t) is augmented with a speaker i-vector ŝ(t)]

[Dehak et al. 2011; Saon et al., 2013]

5-10% relative improvement for Switchboard


I-vectors provide a fixed-length representation of a speaker’s voice characteristics.
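A minimal sketch of the adaptation mechanism, assuming the i-vector is simply appended to every frame's features before the network (array shapes here are illustrative):

```python
import numpy as np

def append_ivector(frames: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """Append a fixed-length speaker i-vector to every acoustic frame.

    frames:  (T, F) array of per-frame features (e.g. log-mel filterbanks)
    ivector: (D,)   fixed-length speaker representation
    returns: (T, F + D) array used as the network input
    """
    tiled = np.tile(ivector, (frames.shape[0], 1))   # repeat the i-vector for each frame
    return np.concatenate([frames, tiled], axis=1)

# Usage: 40-dim filterbanks plus a 100-dim i-vector -> 140-dim network input
x = append_ivector(np.random.randn(300, 40), np.random.randn(100))
```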

Page 17

Spatial Regularization

[Figure: 2-D unrolling minus its smoothed version gives the high-frequency residual]

Regularize with L2 norm of Hi-frequency residual

5-10% relative improvement for BLSTM

[Droppo et al. in progress]
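Since this is unpublished work in progress, the following is only a rough sketch of the stated idea: smooth a 2-D unrolled quantity, subtract the smoothed version, and penalize the L2 norm of the remaining high-frequency residual. The smoothing filter and the choice of what gets unrolled are assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def spatial_reg_penalty(unrolled_2d: np.ndarray, smooth_size: int = 3) -> float:
    """L2 penalty on the high-frequency residual of a 2-D map.

    The map is smoothed, the smoothed version is subtracted out, and the
    energy of what remains (the high-frequency part) is penalized.
    """
    smoothed = uniform_filter(unrolled_2d, size=smooth_size)
    hi_freq = unrolled_2d - smoothed
    return float(np.sum(hi_freq ** 2))

# Usage: the penalty would be added to the training loss with some weight lambda
penalty = spatial_reg_penalty(np.random.randn(64, 40))
```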

Page 18

Lattice Free MMI

• Simple brute force MMI

• Avoids need to generate lattices

• Alignments always current

$$
\hat{\theta}
= \arg\max_{\theta} \sum_{(w,a)\,\in\,\text{Data}} \log \frac{P(w,a;\theta)}{P(w)\,P(a;\theta)}
= \arg\max_{\theta} \sum_{(w,a)\,\in\,\text{Data}} \log \frac{P(a \mid w;\theta)}{P(a;\theta)}
= \arg\max_{\theta} \sum_{(w,a)\,\in\,\text{Data}} \log \frac{P(a \mid w;\theta)}{\sum_{w'} P(w')\,P(a \mid w';\theta)}
$$

Traditionally approximated by word sequences in lattice (DAG)

Instead LFMMI uses all possible word sequences in cyclic FSA

[Chen et al., 2006; McDermott et al., 2014; Povey et al., 2016]
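A toy numerical sketch of the criterion above: the joint (LM × acoustic) log score of the reference minus the log-sum over competing hypotheses. In LF-MMI the denominator runs over the full cyclic FSA rather than an explicit hypothesis list, and the scores below are made up.

```python
import numpy as np
from scipy.special import logsumexp

def mmi_objective(log_pw: np.ndarray, log_p_a_given_w: np.ndarray, ref: int) -> float:
    """log P(w_ref | a) = log [P(w_ref) P(a | w_ref)] - log sum_w' P(w') P(a | w').

    log_pw:          log language-model scores, one per hypothesis
    log_p_a_given_w: log acoustic scores, one per hypothesis
    ref:             index of the reference transcription
    """
    joint = log_pw + log_p_a_given_w                 # log P(w) P(a | w) per hypothesis
    return float(joint[ref] - logsumexp(joint))      # numerator minus denominator

# Toy usage: 3 competing hypotheses, hypothesis 0 is the reference
obj = mmi_objective(np.log([0.5, 0.3, 0.2]), np.array([-10.0, -11.0, -12.5]), ref=0)
```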

Page 19

Denominator GPU computation

• Represent FSA of all possible state sequences as a sparse transition matrix A

• Implement exact alpha beta computations

• Execute in straight “for” loops on GPU with cusparseDcsrmv and cublasDdgmm

• Beautifully simple

$$
\alpha_{t+1} = o_{t+1} \odot \left(A^{\top} \alpha_{t}\right),
\qquad
\beta_{t} = A \left(o_{t+1} \odot \beta_{t+1}\right)
$$
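A CPU sketch of the same forward (alpha) recursion, using SciPy's sparse matrix-vector product in place of the cuSPARSE/cuBLAS kernels; the FSA, observation likelihoods, and shapes are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def forward_alphas(A: sp.csr_matrix, obs: np.ndarray, alpha0: np.ndarray) -> np.ndarray:
    """Exact alpha recursion over a sparse transition matrix A.

    A:      (S, S) sparse transition matrix of the denominator FSA
    obs:    (T, S) per-frame observation likelihoods from the acoustic model
    alpha0: (S,)   initial state distribution
    """
    T, S = obs.shape
    alphas = np.zeros((T, S))
    alpha = alpha0 * obs[0]                      # alpha_1 = pi ⊙ o_1
    alphas[0] = alpha
    for t in range(1, T):                        # straight "for" loop, one sparse mat-vec per frame
        alpha = (A.T @ alpha) * obs[t]           # alpha_{t+1} = o_{t+1} ⊙ (Aᵀ alpha_t)
        alphas[t] = alpha
    return alphas

# Toy usage: 4-state FSA, 10 frames
A = sp.random(4, 4, density=0.5, format="csr")
alphas = forward_alphas(A, np.abs(np.random.rand(10, 4)), np.ones(4) / 4)
```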

Page 20

LFMMI Improvements

• Denominator LM graph has 52k states and 215k transitions
• GPU-side alpha-beta computation is 0.18 xRT exclusive of NN evaluation

8-14% relative improvement on SWB

Page 21

Cognitive Toolkit (CNTK) Training

• Flexible

• Multi-GPU

• Multi-Server

• 1-bit SGD

• All AM training

• Best LM training

Page 22

Outline

• Acoustic Modeling

• Language Modeling

• Decoding, Rescoring & System Combination

• Measuring Human Performance

• Results

• Counterpoint – Letter based CTC

• Conclusions

Page 23

Language Models

• 1st Pass n-gram:
  • SRI-LM, 30k vocab, 16M n-grams
• Rescoring n-gram:
  • SRI-LM, 145M n-grams
• RNN LM:
  • CUED Toolkit, two 1000-unit layers
  • ReLU activations, NCE training
• LSTM LM:
  • Cognitive Toolkit (CNTK), three 1000-unit layers
  • Letter trigram input, no NCE
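As an illustration of the letter-trigram input (the exact feature construction used for the LSTM LM may differ), each word can be represented by the bag of its boundary-padded letter trigrams rather than a one-hot word index:

```python
from collections import Counter

def letter_trigrams(word: str) -> Counter:
    """Bag of letter trigrams for a word, padded with '#' word-boundary markers."""
    padded = f"#{word.lower()}#"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

# "cat" -> {'#ca': 1, 'cat': 1, 'at#': 1}; unseen words still get a usable representation
print(letter_trigrams("cat"))
```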

Page 24

LM Training Trick: Self-stabilization

• Learn an overall scaling function for each layer

$$
y = W x \quad\text{becomes}\quad y = W(\beta\, x)
$$

where β is the learned per-layer scaling.

[Ghahremani & Droppo, 2016]

Applied to the LSTM networks, between layers.
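A minimal sketch of a self-stabilized linear transform in the spirit of the equation above, with a single scalar β per layer learned jointly with the weights (the exact parameterization of β in Ghahremani & Droppo may differ):

```python
import numpy as np

class SelfStabilizedLinear:
    """y = W (beta * x): beta is a learned per-layer scalar that rescales the input."""

    def __init__(self, in_dim: int, out_dim: int):
        self.W = np.random.randn(out_dim, in_dim) * 0.01
        self.beta = 1.0                      # trained jointly with W

    def forward(self, x: np.ndarray) -> np.ndarray:
        return self.W @ (self.beta * x)

layer = SelfStabilizedLinear(1024, 1024)
y = layer.forward(np.random.randn(1024))
```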

Page 25

Language Model Perplexities

Perplexities on the 1997 eval set

LSTM beats RNN

Letter trigram input slightly better than word input

Note both forward- and backward-running models

Page 26

Outline

• Acoustic Modeling

• Language Modeling

• Decoding, Rescoring & System Combination

• Measuring Human Performance

• Results

• Counterpoint – Letter based CTC

• Conclusions

Page 27

Overall Process

Per acoustic model (ResNet, LACE, BLSTM), the same pipeline is run:

  Lattice Generation → N-gram Rescoring → 500-best Generation → Rescoring: RNNs, LSTMs, Pron Probs

The three rescored outputs then go through Confusion Network Combination & Select Best.
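A hedged sketch of the per-system rescoring step on the 500-best lists: each hypothesis gets a log-linear combination of its acoustic score and the various LM and pronunciation scores (the score names and weights below are illustrative, not the system's actual configuration):

```python
def rescore_nbest(nbest, weights):
    """Pick the best hypothesis from an N-best list by log-linear score combination.

    nbest:   list of dicts with per-hypothesis log scores,
             e.g. {"text": ..., "am": ..., "ngram": ..., "lstm_fwd": ..., "lstm_bwd": ...}
    weights: dict mapping the same score names to interpolation weights
    """
    def combined(hyp):
        return sum(weights[name] * hyp[name] for name in weights)
    return max(nbest, key=combined)

# Toy usage with two hypotheses and made-up scores
best = rescore_nbest(
    [{"text": "uh huh", "am": -12.0, "ngram": -3.1, "lstm_fwd": -2.8, "lstm_bwd": -2.9},
     {"text": "a hum",  "am": -12.4, "ngram": -4.0, "lstm_fwd": -3.5, "lstm_bwd": -3.6}],
    {"am": 1.0, "ngram": 0.5, "lstm_fwd": 0.5, "lstm_bwd": 0.5},
)
```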

Page 28

Greedy System Combination

• Make confusion network from best single system

• Repeat:
  • Compute error rate on development data for each possible system addition
  • Add the system that improves it most (see the sketch below)
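A sketch of that greedy loop, assuming a hypothetical combine_and_score helper that runs confusion-network combination over the chosen systems and returns dev-set WER; the stopping rule (stop when no addition helps) is an assumption:

```python
def greedy_combination(systems, combine_and_score):
    """Greedily add systems while the dev-set error rate keeps improving.

    systems:           list of system identifiers (best single system first)
    combine_and_score: callable taking a list of systems, returning dev-set WER
                       after confusion-network combination (hypothetical helper)
    """
    selected = [systems[0]]                       # start from the best single system
    best_wer = combine_and_score(selected)
    while True:
        candidates = [s for s in systems if s not in selected]
        if not candidates:
            break
        trials = {s: combine_and_score(selected + [s]) for s in candidates}
        best_sys = min(trials, key=trials.get)
        if trials[best_sys] >= best_wer:          # stop when no addition helps
            break
        selected.append(best_sys)
        best_wer = trials[best_sys]
    return selected, best_wer

# Toy usage with a fake WER oracle that rewards two-system combinations
print(greedy_combination(["resnet", "lace", "blstm"],
                         lambda sel: {1: 6.3, 2: 5.9, 3: 6.0}[len(sel)]))
```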

Page 29

Rescoring Performance

ResNet CNN Acoustic Model (no combination)

One LSTM ~ 0.5% better than one RNN.

Multiple LSTMs provide further 0.3%

Page 30

Outline

• Acoustic Modeling

• Language Modeling

• Decoding, Rescoring & System Combination

• Measuring Human Performance

• Results

• Counterpoint – Letter based CTC

• Conclusions

Page 31

A First Try

• The 4% rumor

[Lippmann, 1997]

Page 32

Another Attempt

[Glenn et al., 2010]

Significant variability.

Note the bulk of the training data was “quick transcribed.”

Page 33

Getting a Positive ID on Actual Test Data

• Skype Translator has a weekly transcription contract
  • Quality control, training, etc.

• Transcription followed by a second checking pass

• One week, we added eval 2000 to the pile…

Page 34

The Results

• Switchboard: 5.9% error rate

• CallHome: 11.3% error rate

• SWB in the 4.1% - 9.6% range expected

• CH is difficult for both people and machines
  • High ASR error not just because of mismatched conditions

Page 35

Outline

• Acoustic Modeling

• Language Modeling

• Decoding, Rescoring & System Combination

• Measuring Human Performance

• Results

• Counterpoint – Letter based CTC

• Conclusions

Page 36

The Bottom Line & Comparisons

Single CNN does remarkably well

Parity with professional transcribers

Best previous number

Page 37

Runtimes

AM Training: Forward, Backward + Update computations
AM eval: Forward probability computation only
Decoding: Mixed GPU/CPU, complete decoding time with open beams

Titan X GPU & Intel Xeon E5-2620 v3 @ 2.4 GHz, 12 cores
All times are xRT (fraction of real-time required) on the Titan X GPU

GPU 10 to 100x faster than CPU

(Multiples of real-time, smaller is better)

Page 38

Error Analysis: Substitutions (~21k words in each test set)

“ums” and “uh-hums” most frequent mistakes – but most errors are in the long tail

Page 39

Error Analysis

Deletions Insertions

Both people and machines insert "I" and "and" a lot.

Page 40

Outline

• Acoustic Modeling

• Language Modeling

• Decoding, Rescoring & System Combination

• Measuring Human Performance

• Results

• Counterpoint – Letter based CTC

• Conclusions

Page 41

Two Ways to Move Up the Field

The Dive The Pass

Page 42

The CTC Alternative

• Wouldn’t it be nice if we could just look at the frame-level labels,

de-dup,

and read-off the transcription?

• For example, with a character model,

S - U U - P P P E E - - R  G G - - O O - D D

Super Good

• CTC [Graves et al. 2006] can train a model so you can do this!
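A small sketch of the "de-dup and read off" step, using "-" as the blank symbol as in the example above:

```python
from itertools import groupby

def collapse_ctc(frame_labels, blank="-"):
    """Collapse repeated labels, then remove blanks (greedy CTC readout)."""
    deduped = [label for label, _ in groupby(frame_labels)]   # merge runs of identical labels
    return [label for label in deduped if label != blank]

# Prints "SUPERGOD": the GG and OO runs collapse into single letters,
# which is exactly the double-letter issue addressed two slides below.
print("".join(collapse_ctc("S-UU-PPPEE--RGG--OO-DD")))
```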

Page 43

Objective Function and Gradient:

Obj. function:
$$
P(w \mid a) = \sum_{\text{alignments } q} \prod_{t} P_{t}\bigl(q(t)\bigr)
$$

$P_{t}(q)$: neural-net output at time $t$ for symbol $q$

Gradient (before softmax) of $-\log P(w \mid a)$:
$$
\frac{\partial}{\partial a_{t}^{q}} = P_{t}(q) - P\bigl(q_{t} = q \mid w, a\bigr)
$$

Make the neural-net outputs look like the transcript-constrained posteriors
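A small numpy sketch of the gradient expression above: given the transcript-constrained posteriors (which in practice come from a forward-backward pass over the alignments), the gradient of the loss before the softmax is just the network's softmax output minus those posteriors.

```python
import numpy as np

def ctc_grad_before_softmax(logits: np.ndarray, posteriors: np.ndarray) -> np.ndarray:
    """Gradient of -log P(w|a) w.r.t. pre-softmax activations.

    logits:     (T, K) pre-softmax network outputs
    posteriors: (T, K) transcript-constrained label posteriors from forward-backward
    """
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)        # softmax per frame
    return probs - posteriors                        # drives outputs toward the posteriors

grad = ctc_grad_before_softmax(np.random.randn(5, 4), np.full((5, 4), 0.25))
```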

Page 44

Defining the Symbols

• Characters:
  • Generalize to new words
  • No problem with infrequent words
• Couple of issues:
  • Double letters (e.g. "hello") don't work with de-duping
  • Where to insert spaces to form words (e.g. "darkroom" vs "dark room")
• Solution:
  • Introduce double-letter units (ll, oo, etc.)
  • Introduce word-initial letters (capital letters)

Explicit space character aligns acoustics to nothing.
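An illustrative sketch of mapping a word to such units (the actual unit inventory may differ): doubled letters become single units, and the word-initial letter is emitted in its capital form to mark word starts:

```python
def word_to_units(word: str) -> list:
    """Map a word to CTC symbols: double-letter units plus a capitalized word-initial unit."""
    units, i = [], 0
    w = word.lower()
    while i < len(w):
        if i + 1 < len(w) and w[i] == w[i + 1]:   # "ll", "oo", ... become single units
            units.append(w[i] * 2)
            i += 2
        else:
            units.append(w[i])
            i += 1
    units[0] = units[0].upper()                   # capital marks the start of a word
    return units

# "hello" -> ['H', 'e', 'll', 'o'];  "room" -> ['R', 'oo', 'm']
print(word_to_units("hello"), word_to_units("room"))
```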

Page 45

CUDNN RNN Implementation

• Process full minibatch per CUDNN call

• 32 utterances per minibatch

• Computation on CPU:
  • 8-way parallelization / OMP
• Best Configuration:
  • 9 ReLU-RNN layers
  • Bidirectional
  • 1024 wide
  • 0.0058 xRT (!)

Page 46

Results: 2000 Hour Training

New record for CTC on Switchboard

Best previous result – ensemble from Hannun et al. 2014

Conclusion: Much simpler systems can produce good performance [Zweig et al., 2016]

Page 47

Outline

• Acoustic Modeling

• Language Modeling

• Decoding, Rescoring & System Combination

• Measuring Human Performance

• Results

• Counterpoint – Letter based CTC

• Conclusions

Page 48

Summary: Human Parity after Twenty Years

5.9% Human performance: Microsoft, October 2016

Page 49

Concluding Remarks

• Parity – Are we done?
  • Cocktail party problem
  • Farfield
  • Robustness

Cocktail Party Problem

Page 50

Concluding Remarks

• Parity – Are we done?
  • Cocktail party problem
  • Farfield
  • Robustness
• What is interesting?
  • New network structures
  • Process simplification, e.g. CTC

Page 51

Concluding Remarks

• Parity – Are we done?
  • Cocktail party problem
  • Farfield
  • Robustness
• What is interesting?
  • New network structures
  • Process simplification, e.g. CTC
• Are we stuck?
  • CNNs! LSTMs! Attention & More.

• The future is bright

Page 52

Thank You!
