ASR for TICSP - Dan Ellis 1999sep02

An overview of Speech Recognition research at ICSI

Dan Ellis
International Computer Science Institute, Berkeley CA
<[email protected]>

Outline
1  About ICSI
2  Hybrid connectionist-HMM speech recognition
3  Overview of current projects
4  Some details: Combinations, News retrieval, SpeechCorder
5  Conclusions

About ICSI
http://www.icsi.berkeley.edu/

• Founded 1988 as 'portal' between U.S. and European academic systems
  - attached to UC Berkeley (but independent)
  - about 100 people at any time

• Infrastructure funding from European government/industry
  - .. in return for hosting visitors
  - Germany, Italy, Spain, Switzerland, Netherlands

• Research groups consist of staff, visitors and UC Berkeley graduate students

• Original vision: 'massively parallel systems'
  - has diversified since then

Groups at ICSI

• "Realization" (Real Speech Collective?)
  - Nelson Morgan, Steve Greenberg, Dan Ellis
  - Speech recognition for realistic conditions
  - Systems support for ASR applications

• ACIRI (AT&T Center for Internet Research at ICSI)
  - Network routing, traffic, security

• Theory
  - mathematical complexity, coding theory

• Applications
  - natural language understanding
  - connectionist models of cognition

• Networks
  - multimedia communications applications

ICSI Realization highlights

• Connectionist framework for ASR (Morgan & Bourlard)

• RASTA front-end processing (Hermansky & Morgan)

• The Ring Array Processor (RAP)

• SPERT/T0
  - SBUS boards with custom 8-way vector microprocessor

Outline
1  About ICSI
2  Hybrid connectionist-HMM speech recognition
   - visualizing speech recognition
   - building a recognizer
   - some issues in ASR
3  Overview of current projects
4  Some details
5  Conclusions

The Hybrid Connectionist-HMM system

• Conventional ASR: symbols S, observations X:

  S* = argmax_S P(S | X)
     = argmax_S P(S, X) / P(X)
     = argmax_S ∏_i P(X_i | S_i) · P(S_i | S_i-1)

• P(X_i | S_i) is the acoustic likelihood model, e.g. GMM

• Connectionist system replaces it with the posterior, P(S_i | X_i)

[Diagram: input sound → feature calculation (C0, C1, C2 ... Ck over frames t_n .. t_n+w) → neural network acoustic model → posterior probabilities for context-independent phone classes (h#, pcl, bcl, tcl, dcl ...) → HMM decoder with word pronunciation models and grammar → word sequence hypotheses]
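The decoding step above can be sketched as a Viterbi search over "scaled likelihoods", a common way of stating the hybrid approach: each frame's network posterior is divided by the phone prior so it can stand in for P(X_i|S_i). The phone labels, priors, and transition values below are purely illustrative, not ICSI's actual models.

```python
import math

def viterbi_scaled(posteriors, priors, trans, init):
    """Viterbi decode using scaled likelihoods p(q|x)/p(q), in the log domain.

    posteriors: per-frame dicts of network outputs p(q | x_t)
    priors:     dict of phone priors p(q) (e.g. from training-label counts)
    trans:      trans[p][q] = p(q_t | q_t-1)
    init:       initial state distribution
    """
    states = list(priors)

    def sl(t, q):  # log scaled likelihood: log p(q|x_t) - log p(q)
        return math.log(posteriors[t][q]) - math.log(priors[q])

    delta = {q: math.log(init[q]) + sl(0, q) for q in states}
    backptrs = []
    for t in range(1, len(posteriors)):
        ptr, new = {}, {}
        for q in states:
            best = max(states, key=lambda p: delta[p] + math.log(trans[p][q]))
            new[q] = delta[best] + math.log(trans[best][q]) + sl(t, q)
            ptr[q] = best
        backptrs.append(ptr)
        delta = new
    # trace the best final state back through the stored pointers
    q = max(states, key=lambda s: delta[s])
    path = [q]
    for ptr in reversed(backptrs):
        q = ptr[q]
        path.append(q)
    return path[::-1]
```

A real decoder would of course search over word-level HMM topologies with pruning rather than a flat phone loop.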

Visualizing speech recognition

• Speech as a sequence of discrete symbols

[Diagram: sound → front end → feature vectors → phone classifier (acoustic models) → label probabilities → HMM decoder (word models, grammar) → phone & word labelling]

Building a recognizer

• Define pronunciation models
  - application vocabulary
  - standard dictionaries + phonetic rules?

• Build language model
  - P(W_i | W_i-1, ...)
  - count n-grams in example texts?

• Train acoustic model
  - choose features to suit conditions
  - train neural net on large labeled corpus
  - relabel & retrain?
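The "count n-grams in example texts" step can be sketched as a toy maximum-likelihood bigram model (illustrative only; a real language model would also smooth and back off):

```python
from collections import Counter

def bigram_lm(sentences):
    """Estimate P(w_i | w_i-1) by maximum likelihood from bigram counts."""
    uni, bi = Counter(), Counter()
    for words in sentences:
        toks = ['<s>'] + words + ['</s>']
        uni.update(toks[:-1])                # history counts
        bi.update(zip(toks[:-1], toks[1:]))  # bigram counts
    def prob(w, prev):
        return bi[(prev, w)] / uni[prev] if uni[prev] else 0.0
    return prob

p = bigram_lm([['the', 'dow', 'rose'], ['the', 'dow', 'fell']])
# p('dow', 'the') is 1.0 here: 'dow' always follows 'the' in the toy corpus
```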

How much training data?

• The bulk of recent improvements derives from larger training sets

• Largest model: 2.5M parameters, 32M patterns
  = 24 days to train on TetraSPERT, 10^15 ops

[Figure: WER for PLP12N nets vs. net size & training data — WER% (32-44) as a function of hidden layer units (500-4000) and training set hours (9.25-74)]

Some issues in ASR

• 'Spectrogram reading' paradigm
  - short-time features, concatenative models

• Goal: classifier accuracy / Word Error Rate (WER)
  - objective measures, but quite opaque
  - normalization vs. generalization

• HMM requires search over all word sequences
  - dominates processing time in large-vocabulary tasks

• Best solution (e.g. features) depends on task
  - RASTA plus delta-features good for small vocab
  - plain normalized PLP best for Broadcast News
  - modulation spectrum features best for combo...

• Key challenge = Robustness
  - to: task, acoustic conditions, speaking rate, style, accent ...
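Since WER is the headline metric throughout this talk, a standard edit-distance implementation (a generic sketch, not from the talk) may help make it concrete:

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions) / len(ref),
    computed as Levenshtein distance over word lists with a rolling row."""
    d = list(range(len(hyp) + 1))      # distances from the empty reference
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i           # prev holds the old diagonal d[i-1][j-1]
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,        # deletion
                      d[j - 1] + 1,    # insertion
                      prev + (r != h)) # substitution (or free match)
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)
```

For example, `wer('the dow rose'.split(), 'the doe rose'.split())` is 1/3: one substitution against three reference words.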

Outline
1  About ICSI
2  Hybrid connectionist-HMM speech recognition
3  Overview of current projects
   - Front-end features
   - Pronunciation and language modeling
   - Modeling alternatives
   - Other topics
4  Some details
5  Conclusions

The modulation-filtered spectrogram
(Brian Kingsbury)

• Goal: invariance to variable acoustics
  - filter out irrelevant modulations
  - channel adaptation (on-line automatic gain control)
  - multiple representations

[Diagram: speech → Bark-scale power-spectral filterbank → lowpass envelope filtering → lowpass (0-16 Hz) and bandpass (2-16 Hz) feature streams, each passed through automatic gain control stages (time constants τ = 160 ms, 320 ms, 640 ms)]

• Comparison:

[Figure: plp log-spectral features vs. msg1a features, frequency (Bark) against time (s)]
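A minimal sketch of the envelope-smoothing and on-line gain-control ideas behind the modulation-filtered spectrogram: a one-pole lowpass plus a divide-by-running-average AGC. The filter form and time constants here are toy assumptions, not Kingsbury's actual design.

```python
import math

def onepole_lowpass(env, fs, tau):
    """Smooth a subband envelope with a one-pole lowpass of time constant tau (s)."""
    a = math.exp(-1.0 / (fs * tau))
    out, y = [], 0.0
    for x in env:
        y = a * y + (1.0 - a) * x  # exponential moving average
        out.append(y)
    return out

def agc(env, fs, tau, eps=1e-6):
    """On-line automatic gain control: divide by a running average of the level."""
    avg = onepole_lowpass(env, fs, tau)
    return [x / (m + eps) for x, m in zip(env, avg)]
```

A bandpass modulation filter (as in the 2-16 Hz stream) could be formed by subtracting two such lowpass outputs with different time constants.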

Data-driven feature design
(Mike Shire)

• 'Optimal' features for different conditions
  - subband envelope domain
  - linear-discriminant analysis (LDA) for filter coeffs
  - separation of labeled classes is optimized

• Modulation-frequency domain responses for clean, reverb, mixture:

[Figure: 1st and 2nd discriminant filter responses, dB (-30 to 0) vs. modulation frequency (Hz), for Clean, Light Reverb, Severe Reverb, and Clean+Severe Reverb conditions]

Automatic pronunciation extraction
(Eric Fosler)

• Canonical pronunciations are too limited

• Phonetic rules overproduce (→ homonyms)

• Filter candidates against acoustics:

[Diagram: dictionary (jones = jh ..., dow = d aw, the = dh ax) + phone trees / letter-to-phone trees → FSG building → FSG (e.g. dh/d/dcl → ax/iy → d → aw/aa) → FSG decoder with acoustics → alignment (dh dh dh dh ax ax dcl d ...) and acoustic confidences (dh=0.24 ax=0.4 dcl=0.81 ...) → word strings ("The Dow Jones ...")]

• Decision trees for canonical → real mapping

Topic modeling (Latent Semantic Analysis)
(Dan Gildea & Thomas Hofmann)

• Bayesian model:
  - p(word | doc) = ∑_t p(word | topic) p(topic | doc)
  - EM modeling of p(word | topic) & p(topic | doc) over training set
  - p(topic | doc) estimated from context in recognition

• Use to modify language model weights
  - p(word) ∝ p_tri(word) · p_top(word) / p_uni(word)
  - Trigram language model perplexity of 109 reduced 17%

• Use for topic segmentation?
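The two formulas above can be sketched directly. The topic and word names are hypothetical, and the renormalization in `reweight_lm` is my addition, since the slide only states a proportionality:

```python
def topic_word_prob(word, p_w_given_t, p_t_given_doc):
    """p(word | doc) = sum over topics of p(word | topic) * p(topic | doc)."""
    return sum(p_w_given_t[t].get(word, 0.0) * p_t_given_doc[t]
               for t in p_t_given_doc)

def reweight_lm(p_tri, p_top, p_uni):
    """p(word) proportional to p_tri(word) * p_top(word) / p_uni(word),
    renormalized over the vocabulary."""
    raw = {w: p_tri[w] * p_top[w] / p_uni[w] for w in p_tri}
    z = sum(raw.values())
    return {w: v / z for w, v in raw.items()}
```

The ratio p_top / p_uni boosts words that the inferred topic mixture makes more likely than their corpus-wide frequency suggests.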

Multiband systems
(Adam Janin / Nikki Mirghafori)

• Separate recognizers look at different bands
  - Fletcher/Allen model of human speech recog.
  - noise/corruption in one channel is limited
  - how to combine results?

• Weighted average of all possible combos
  - p(S | a,b,c,d) = ∑_B p(S | B, a,b,c,d) · p(B)
    where B ranges over 16 possible band combinations
  - p(B) from? constant, local feature (entropy)

[Diagram: speech signal → per-band power front end → MLP probability estimators → merger → Viterbi decoder → recognized words]
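The weighted-average combination can be sketched as a sum over all 16 subsets of the four bands. The uniform p(B) and the per-subset distributions in the usage below are made up purely for illustration:

```python
from itertools import combinations

def multiband_posterior(p_s_given_bands, p_b):
    """p(S | a,b,c,d) = sum over band subsets B of p(S | B) * p(B),
    where B ranges over the 16 subsets of the four bands."""
    bands = ('a', 'b', 'c', 'd')
    total = {}
    for r in range(len(bands) + 1):
        for subset in combinations(bands, r):
            weight = p_b[subset]                     # prior on this combination
            for s, p in p_s_given_bands[subset].items():
                total[s] = total.get(s, 0.0) + p * weight
    return total
```

In practice p(B) might be constant, or derived from a local reliability cue such as per-band entropy, as the slide suggests.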

Buried Markov Models
(Jeff Bilmes)

• Increasing the scope of the classifier input

• Add state-dependent sparse links:

[Diagram: hidden state sequence Q_t-1 = q_t-1, Q_t = q_t, Q_t+1 = q_t+1, Q_t+2 = q_t+2 over observation streams X and Y, with added sparse cross-stream dependency links]

• How to add links?
  - maximum conditional mutual information
  - greedy algorithm

• How to model?
  - linear dependence as first attempt
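The "maximum conditional mutual information" criterion for choosing links can be sketched with a generic estimator of I(X;Y|Q) from discrete joint samples; this is a standard counting estimator, not Bilmes's implementation:

```python
import math
from collections import Counter

def conditional_mi(triples):
    """Estimate I(X;Y|Q) from samples (x, y, q):
    sum over (x,y,q) of p(x,y,q) * log[ p(x,y,q) p(q) / (p(x,q) p(y,q)) ]."""
    n = len(triples)
    pxyq = Counter(triples)
    pxq = Counter((x, q) for x, _, q in triples)
    pyq = Counter((y, q) for _, y, q in triples)
    pq = Counter(q for _, _, q in triples)
    mi = 0.0
    for (x, y, q), c in pxyq.items():
        # counts cancel: p(x,y,q) p(q) / (p(x,q) p(y,q)) = c * pq / (pxq * pyq)
        mi += (c / n) * math.log(c * pq[q] / (pxq[(x, q)] * pyq[(y, q)]))
    return mi
```

A greedy link-selection loop would repeatedly add the candidate dependency with the largest such score.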

Connectionist speaker recognition
(Dominique Genoud)

• Use neural networks to model speakers rather than phones?

• Specialize a phone classifier for a particular speaker?

• Do both at once for "Twin-output MLP":

[Diagram: 9 feature vectors (t0-4 .. t0+4) of 12 features = 108 inputs → MLP 108:2000:107 → output layer of posterior probability vectors: non-speech + 53 'speaker' phones + 53 'world' phones]

Acoustic Segment Classification
(Gethin Williams)

• Features from posteriors show utterance type:
  - average per-frame entropy
  - 'dynamism' - mean squared 1st-order difference
  - average energy of 'silence' label
  - covariance matrix distance to clean speech

• 1.3% error on 2.5 second speech/music test set

• Use for finding segment boundaries?

[Figures: dynamism vs. entropy for 2.5 second segments of speech, music, and speech+music; spectrogram and posteriors over a 12 s speech / music / speech+music example]
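The first two features above can be sketched directly over a list of per-frame posterior vectors (the toy inputs in the test are illustrative, not actual classifier outputs):

```python
import math

def entropy_and_dynamism(posteriors):
    """Average per-frame entropy and mean squared first-order difference
    ('dynamism') of a sequence of posterior probability vectors."""
    ent = sum(-sum(p * math.log(p) for p in frame if p > 0)
              for frame in posteriors) / len(posteriors)
    dyn = sum(sum((a - b) ** 2 for a, b in zip(f1, f2))
              for f1, f2 in zip(posteriors, posteriors[1:]))
    dyn /= max(len(posteriors) - 1, 1)
    return ent, dyn
```

Intuitively, speech gives confident (low-entropy) but rapidly changing (high-dynamism) posteriors, while music tends to leave the phone classifier uncertain and static.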

Perceptual experiments
(Steven Greenberg et al.)

• Use spectrally-sparse speech:

[Figure: waveform and spectrogram of the original signal vs. four narrow spectral 'slits' (approx. 298-375, 750-945, 1890-2381, 4762-6000 Hz)]

• Effect of temporal alignment between bands:

[Figure: intelligibility (WER%) for the four synchronized slits vs. slits de-synchronized by 25, 50, and 75 ms; error rises sharply with de-synchronization]

Outline
1  About ICSI
2  Hybrid connectionist-HMM speech recognition
3  Overview of current projects
4  Some details:
   - Information stream combination
   - Broadcast News spoken-document retrieval
   - SpeechCorder PDA
5  Conclusions

4.1 Information stream combinations

• Task: AURORA noisy digits
  - continuous digits
  - test: 4 noise types x 7 SNR levels
  - train: mixed clean/noisy data

• Feature design evaluation
  - intermediate representation for mobile phones
  - evaluation specifies GMM-HMM (HTK) system

• Baseline results:

  Feature | System | WER ratio
  --------|--------|----------
  mfcc    | HTK    | 100.0%
  plp     | Hybrid |  89.6%
  msg     | Hybrid |  87.1%
  msg     | HTK    | 205.0%
  msg KG  | HTK    | 184.5%

• Can we combine features advantageously?

Combination schemes

• Simple probability combination works best:

  P(q_i | X1, X2) = P(q_i | X1) · P(q_i | X2) / P(q_i)   ... if X1 ⊥ X2 | q_i

[Diagrams: three architectures for combining two feature streams —
 hypothesis combination: two complete recognizers (feature calculation → acoustic classifier → HMM decoder), word hypotheses merged into a final hypothesis;
 posterior combination: two acoustic classifiers, phone probabilities merged before a single HMM decoder;
 feature combination: the two feature streams concatenated into a single acoustic classifier and HMM decoder]
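The posterior-combination rule above can be sketched directly, renormalizing over the phone set (the phone labels and probabilities below are illustrative):

```python
def combine_posteriors(p1, p2, prior):
    """P(q | X1, X2) proportional to P(q | X1) * P(q | X2) / P(q),
    valid when X1 and X2 are independent given q; renormalized here."""
    raw = {q: p1[q] * p2[q] / prior[q] for q in prior}
    z = sum(raw.values())
    return {q: v / z for q, v in raw.items()}
```

Dividing by the prior is what lets two streams that individually carry weak evidence reinforce each other rather than simply average.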

Posterior multiplication

  Features  | System        | WER ratio
  ----------|---------------|----------
  plp + msg | Feature combo | 74.1%
  plp + msg | Prob. combo   | 63.0%
  plp + msg | HTK on probs. | 51.6%

4.2 Spoken document retrieval

• Based on DARPA/NIST Broadcast News

• Training material recorded off-air
  - ABC, CNN, CSPAN, NPR
  - 200 hour training set (TREC: 550 hour archive)
  - training: word transcriptions + speaker time boundaries

• Best WER results:
  - 1996: HTK: 27%
  - 1997: HTK: 16% (but: easier; 22% on 1996 eval)
  - 1998: LIMSI: 14% (SPRACH: 21%)

• Some clear conclusions
  - one classifier for all conditions (or male/female)
  - feature adaptation (VTLN, MLLR, SAT)
  - importance of segmentation
  - more data is useful

Applications for BN systems

• Live transcription
  - subtitles
  - transcripts
  - but: more than words?

• Video editing
  - precision word-time alignments
  - commercial systems by IBM, Virage, etc.

• Information Retrieval (IR)
  - TREC/MUC 'spoken documents'
  - tolerant of word error rate, e.g.:

F0: THE VERY EARLY RETURNS OF THE NICARAGUAN PRESIDENTIAL ELECTION SEEMED TO FADE BEFORE THE LOCAL MAYOR ON A LOT OF LAW

F4: AT THIS STAGE OF THE ACCOUNTING FOR SEVENTY SCOTCH ONE LEADER DANIEL ORTEGA IS IN SECOND PLACE THERE WERE TWENTY THREE PRESIDENTIAL CANDIDATES OF THE ELECTION

F5: THE LABOR MIGHT DO WELL TO REMEMBER THE LOST A MAJOR EPISODE OF TRANSATLANTIC CONNECT TO A CORPORATION IN BOTH CONSERVATIVE PARTY OFFICIALS FROM BRITAIN GOING TO WASHINGTON THEY WENT TO WOOD BUYS GEORGE BUSH ON HOW TO WIN A SECOND TO NONE IN LONDON THIS IS STEPHEN BEARD FOR MARKETPLACE

Thematic Indexing of Spoken Language
(Thisl)

• EC collaboration, BBC providing data

• 1000+ hr archive data

• IR is key factor
  - stop lists
  - weighting schemes
  - query expansion

[Diagram: broadcast control/text/audio/video → receiver → ASR and segmentation → NLP and IR → database archive, answering queries over http]

ThislGui

• Tcl/Tk front-end to Thisl IR engine

• Spoken query input: SPRACHdemo/AbbotDemo

• NLP integration: prolog lattice parser

4.3 SpeechCorder

• Convergence of interesting problems:
  - ubiquitous PDAs
  - multimedia processing
  - very fast, low-power CPU design
  - resource-bound speech recognition

• ICSI / UC Berkeley / MIT collaboration
  - ICSI: speech & audio processing
  - UCB: user interface design
  - MIT: new low-power CPUs

• Current proposal
  - PDA
  - meeting / memo recorder
  - .. also for sociological study?
    (new Human Centered Computing consortium)

The SpeechCorder GUI

• Live annotation of recognized speech

• Application issues:
  - correcting errorful transcriptions
  - finding places in the recording
  - annotations (speaker, notes, emphasis)

SpeechCorder: Audio

• Speech recognition
  - not close-mic'd
  - speaker ID
  - low-power / low-memory / vectorized
  - non-local processing for 'best' transcript?

• Other audio issues
  - identifying speech versus nonspeech
  - finding / indexing nonspeech events...

Outline
1  About ICSI
2  Hybrid connectionist-HMM speech recognition
3  Overview of current projects
4  Some details
5  Conclusions
   - more criticisms of ASR
   - future work

Conclusions

• The downside of objective evaluation
  - research priority has been the pragmatic goal: reduce WER
  - human speech recog. uses many constraints
  - grammatic/semantic constraints implicit in word sequence statistics (grammar)
  - automatic analysis of large corpora is possible & helpful

• The problems with a grammar
  - unexpected (unseen) phrases are discounted
  - highly brittle alternatives
  - masks underlying performance

• A more scientific approach
  - first work on the underlying phoneme classifier
  - follow nonsense syllable performance (Fletcher)

The signal model in speech recognition

• Systems & approach have been optimized for speech-alone situation
  - minimize classifier parameters, maximize use of 'feature space'
  - e.g. cepstra [example]

• Possibly non-lexical data thrown away
  - pitch
  - timing/rhythm
  - speaker identification

• Dire consequences
  - .. dealing with nonspeech sounds
  - .. distinguishing success & failure

• Popular focus of research
  - e.g. segmental models, pitch features
  - fail to obtain robust improvements

Future work

• Continue improving robustness
  - better features
  - better pronunciations
  - better modeling

• Still looking for a good architecture
  - multiband
  - multistream
  - more adaptation
  - more contextual dependence

• Speech recognition: useful for applications
  - archive indexing, summarizing
  - personal devices, new interfaces
  - tie-in to general audio analysis...