An overview of Speech Recognition research at ICSI
Dan Ellis, International Computer Science Institute, Berkeley CA
ASR for TICSP - Dan Ellis 1999sep02
Outline

1. About ICSI
2. Hybrid connectionist-HMM speech recognition
3. Overview of current projects
4. Some details: combinations, News retrieval, SpeechCorder
5. Conclusions
About ICSI
http://www.icsi.berkeley.edu/

• Founded 1988 as ‘portal’ between U.S. and European academic systems
- attached to UC Berkeley (but independent)
- about 100 people at any time

• Infrastructure funding from European government/industry
- .. in return for hosting visitors
- Germany, Italy, Spain, Switzerland, Netherlands

• Research groups consist of staff, visitors and UC Berkeley graduate students

• Original vision: ‘massively parallel systems’
- has diversified since then
Groups at ICSI

• “Realization” (Real Speech Collective?)
- Nelson Morgan, Steve Greenberg, Dan Ellis
- Speech recognition for realistic conditions
- Systems support for ASR applications

• ACIRI (AT&T Center for Internet Research at ICSI)
- Network routing, traffic, security

• Theory
- mathematical complexity, coding theory

• Applications
- natural language understanding
- connectionist models of cognition

• Networks
- multimedia communications applications
ICSI Realization highlights

• Connectionist framework for ASR (Morgan & Bourlard)

• RASTA front-end processing (Hermansky & Morgan)

• The Ring Array Processor (RAP)

• SPERT/T0
- SBUS boards with custom 8-way vector µprocessor
Outline

1. About ICSI
2. Hybrid connectionist-HMM speech recognition
- visualizing speech recognition
- building a recognizer
- some issues in ASR
3. Overview of current projects
4. Some details
5. Conclusions
The Hybrid Connectionist-HMM system

• Conventional ASR: symbols S, observations X:

  S* = argmax_S P(S | X)
     = argmax_S P(S, X) / P(X)
     = argmax_S ∏_i P(Xi | Si) · P(Si | Si-1)

• P(Xi | Si) is the acoustic likelihood model, e.g. GMM

• Connectionist approach replaces it with the posterior, P(Si | Xi)
[Figure: system block diagram - input sound → feature calculation over windows tn .. tn+w → neural network acoustic model → posterior probabilities for context-independent phone classes (C0, C1, C2 .. Ck; h#, pcl, bcl, tcl, dcl ..) → HMM decoder, using word pronunciation models and a grammar → word sequence hypotheses]
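In hybrid decoding, the network's posteriors are commonly converted to scaled likelihoods by dividing by the class priors, since P(Xi | Si) ∝ P(Si | Xi) / P(Si). A minimal numpy sketch of that conversion (the array names are illustrative, not from the talk):

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, priors, floor=1e-8):
    """Convert per-frame phone posteriors P(q|x) into scaled
    likelihoods P(x|q)/P(x) = P(q|x)/P(q) for HMM decoding."""
    posteriors = np.asarray(posteriors, dtype=float)
    priors = np.maximum(np.asarray(priors, dtype=float), floor)
    return posteriors / priors  # broadcasts the prior over frames

# Toy example: 2 frames x 3 phone classes
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]])
priors = np.array([0.5, 0.3, 0.2])
scaled = posteriors_to_scaled_likelihoods(post, priors)
```

The flooring of the priors only guards against division by zero for classes absent from the training set.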
Visualizing speech recognition

• Speech as a sequence of discrete symbols qi

[Figure: recognition pipeline - sound → front end → feature vectors → phone classifier (acoustic models) → label probabilities → HMM decoder (word models, grammar) → phone & word labelling]
Building a recognizer

• Define pronunciation models
- application vocabulary
- standard dictionaries + phonetic rules?

• Build language model
- P(Wi | Wi-1, ...)
- count n-grams in example texts?

• Train acoustic model
- choose features to suit conditions
- train neural net on large labeled corpus
- relabel & retrain?
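Estimating P(Wi | Wi-1) by counting n-grams, as in the language-model step above, can be sketched in a few lines (maximum-likelihood bigrams only, no smoothing; the function and data are illustrative):

```python
from collections import Counter

def bigram_lm(sentences):
    """Estimate P(w | prev) by counting bigrams in example text.
    Toy MLE sketch: a real system would smooth and back off."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded[:-1])                   # history counts
        bigrams.update(zip(padded[:-1], padded[1:]))   # pair counts
    return lambda w, prev: (bigrams[(prev, w)] / unigrams[prev]
                            if unigrams[prev] else 0.0)

p = bigram_lm([["the", "dow", "rose"], ["the", "dow", "fell"]])
```

Here `p("dow", "the")` is 1.0 because every occurrence of "the" is followed by "dow" in the toy corpus.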
How much training data?

• The bulk of recent improvements derives from larger training sets

• Largest model: 2.5M parameters, 32M patterns
= 24 days to train on TetraSPERT, 10^15 ops

[Figure: WER% for PLP12N nets vs. net size & training data - hidden layers of 500 to 4000 units, training sets of 9.25 to 74 hours; WER falls from roughly 44% to 32% as both grow]
Some issues in ASR

• ‘Spectrogram reading’ paradigm
- short-time features, concatenative models

• Goal: classifier accuracy / Word Error Rate (WER)
- objective measures, but quite opaque
- normalization vs. generalization

• HMM requires search over all word sequences
- dominates processing time in large-vocabulary tasks

• The best solution (e.g. features) depends on the task
- RASTA plus delta-features good for small vocab
- plain normalized PLP best for Broadcast News
- modulation spectrum features best for combo...

• Key challenge = Robustness
- to: task, acoustic conditions, speaking rate, style, accent ...
Outline

1. About ICSI
2. Hybrid connectionist-HMM speech recognition
3. Overview of current projects
- Front-end features
- Pronunciation and language modeling
- Modeling alternatives
- Other topics
4. Some details
5. Conclusions
The modulation-filtered spectrogram
(Brian Kingsbury)

• Goal: invariance to variable acoustics
- filter out irrelevant modulations
- channel adaptation (on-line automatic gain control)
- multiple representations

• Comparison:

[Figure: processing diagram - speech → Bark-scale power-spectral filterbank → lowpass (0-16 Hz) and bandpass (2-16 Hz) envelope filtering paths, each followed by automatic gain control stages with time constants τ = 160 ms and 320 ms → lowpass and bandpass features; spectrogram panels compare plp log-spectral features with msg1a features over ~3.5 s of speech]
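The core idea of the comparison above - keep only slow envelope modulations in each subband - can be sketched with a simple lowpass filter over the envelope trajectories. This is an illustration of the principle under stated assumptions (2nd-order Butterworth, zero-phase filtering), not the exact MSG front end:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def modulation_lowpass(envelopes, frame_rate_hz, cutoff_hz=16.0):
    """Lowpass-filter each subband envelope trajectory over time,
    keeping only modulations below cutoff_hz."""
    b, a = butter(2, cutoff_hz / (frame_rate_hz / 2.0))
    return filtfilt(b, a, envelopes, axis=0)  # filter along time axis

# Toy input: 200 frames x 4 subbands at 100 frames/s,
# carrying a 4 Hz modulation that should pass a 16 Hz cutoff
t = np.arange(200) / 100.0
env = np.outer(np.sin(2 * np.pi * 4 * t), np.ones(4))
smooth = modulation_lowpass(env, frame_rate_hz=100.0)
```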
Data-driven feature design
(Mike Shire)

• ‘Optimal’ features for different conditions
- subband envelope domain
- linear-discriminant analysis (LDA) for filter coeffs
- separation of labeled classes is optimized

• Modulation-frequency domain responses for clean, reverb, mixture:

[Figure: magnitude responses (dB, roughly 0 to -30 dB, vs. modulation frequency in Hz on a log axis) of the 1st and 2nd discriminant filters derived from clean, light reverb, severe reverb, and clean+severe reverb training conditions]
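The LDA step can be sketched directly: treat each temporal window of a subband envelope as a feature vector, and find the projections that maximize between-class over within-class scatter - each resulting direction is, in effect, an FIR filter on the envelope. A numpy-only Fisher LDA sketch (the data here are synthetic, purely for illustration):

```python
import numpy as np

def lda_directions(X, y):
    """Fisher LDA: directions maximizing between-class scatter
    relative to within-class scatter. X is (samples, dims); if dims
    are taps of a temporal window, each column of the result is a
    discriminant filter."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))   # within-class scatter
    Sb = np.zeros_like(Sw)                    # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-vals.real)
    return vecs.real[:, order]  # columns sorted by discriminability

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(2, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)
W = lda_directions(X, y)
```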
Automatic pronunciation extraction
(Eric Fosler)

• Canonical pronunciations are too limited

• Phonetic rules overproduce (→ homonyms)

• Filter candidates against acoustics:

• Decision trees for canonical → real mapping

[Figure: FSG building - a dictionary (jones=jh.., dow=d aw, the=dh ax) plus phone trees and letter-to-phone trees generate a finite-state grammar with alternatives (e.g. d aw / d aa / ax d aw for "dow"); aligning the FSG against the acoustics yields a phone alignment (dh dh dh dh ax ax dcl d ..) and per-phone acoustic confidences (dh=0.24 ax=0.4 dcl=0.81 ..); the FSG decoder outputs word strings such as "The Dow Jones"]
Topic modeling (Latent Semantic Analysis)
(Dan Gildea & Thomas Hofmann)

• Bayesian model:
- p(word | doc) = ∑t p(word | topic) p(topic | doc)
- EM modeling of p(word | topic) & p(topic | doc) over training set
- p(topic | doc) estimated from context in recognition

• Use to modify language model weights
- p(word) ∝ ptri(word) · ptop(word) / puni(word)
- Trigram language model perplexity of 109 reduced 17%

• Use for topic segmentation?
Multiband systems
(Adam Janin / Nikki Mirghafori)

• Separate recognizers look at different bands
- Fletcher/Allen model of human speech recog.
- noise/corruption in one channel is limited
- how to combine results?

• Weighted average of all possible combos
- p(S | a,b,c,d) = ∑B p(S | B,a,b,c,d) · p(B)
- B ranges over 16 possible band combinations
- p(B) from? constant, local feature (entropy)

[Figure: multiband architecture - the front end splits the speech signal into four subbands; per-band MLP probability estimators feed a merger MLP, whose outputs drive a Viterbi decoder producing the recognized words]
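The weighted average over band subsets can be sketched as follows. Since the talk does not specify the per-subset models p(S | B, ...), this sketch substitutes a geometric mean of the included bands' posteriors as a stand-in estimate - that substitution is an assumption of the example, not the ICSI system:

```python
import numpy as np
from itertools import chain, combinations

def multiband_merge(band_posteriors, p_B=None):
    """Weighted average over all non-empty subsets B of bands:
    p(S|a,b,c,d) = sum_B p(S|B,...) p(B). Each subset's estimate is
    the renormalized geometric mean of its bands' posteriors
    (an illustrative stand-in for trained per-subset models)."""
    n_bands, n_classes = band_posteriors.shape
    subsets = list(chain.from_iterable(
        combinations(range(n_bands), k) for k in range(1, n_bands + 1)))
    if p_B is None:
        p_B = np.ones(len(subsets)) / len(subsets)  # constant weights
    merged = np.zeros(n_classes)
    for w, B in zip(p_B, subsets):
        est = np.exp(np.log(band_posteriors[list(B)] + 1e-12).mean(0))
        merged += w * est / est.sum()
    return merged / merged.sum()

bands = np.array([[0.8, 0.1, 0.1],   # per-band phone posteriors
                  [0.6, 0.3, 0.1],
                  [0.1, 0.1, 0.8],   # one corrupted band disagrees
                  [0.7, 0.2, 0.1]])
p = multiband_merge(bands)
```

The averaging over subsets limits the damage the one corrupted band can do to the final decision.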
Buried Markov Models
(Jeff Bilmes)

• Increasing the scope of the classifier input

• Add state-dependent sparse links:

[Figure: HMM state sequence Qt-1 = qt-1, Qt = qt, Qt+1 = qt+1, Qt+2 = qt+2 over observation streams X and Y, with additional sparse dependency links between observations]

• How to add links?
- maximum conditional mutual information
- greedy algorithm

• How to model?
- linear dependence as first attempt
Connectionist speaker recognition
(Dominique Genoud)

• Use neural networks to model speakers rather than phones?

• Specialize a phone classifier for a particular speaker?

• Do both at once for “Twin-output MLP”:

[Figure: a 9-frame context window (t0-4 .. t0+4) of 12 features each, 9x12 = 108 inputs, feeds an MLP of size 108:2000:107; the output layer holds 53 'speaker' phones, 53 'world' phones, and non-speech, producing posterior probability vectors]
Acoustic Segment Classification
(Gethin Williams)

• Features from posteriors show utterance type:
- average per-frame entropy
- ‘dynamism’ - mean squared 1st-order difference
- average energy of ‘silence’ label
- covariance matrix distance to clean speech

• 1.3% error on 2.5 second speech-music test set

• Use for finding segment boundaries?

[Figure: spectrogram (0-4000 Hz) and phone posteriors over ~12 s of speech / music / speech+music, and a scatter plot of dynamism (0-0.3) vs. entropy (0-3.5) for 2.5 second segments, with the three classes forming distinct clusters]
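The first two posterior-based features above are easy to compute from a (frames × classes) posterior matrix. A sketch (the toy "speech-like" and "music-like" matrices are illustrative):

```python
import numpy as np

def posterior_stats(posteriors):
    """Average per-frame entropy and 'dynamism' (mean squared
    1st-order difference) of a (frames, classes) posterior matrix."""
    p = np.clip(posteriors, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1).mean()
    dynamism = (np.diff(posteriors, axis=0) ** 2).mean()
    return entropy, dynamism

# Speech-like posteriors: confident and switching between classes.
# Music-like posteriors: flat and static.
speech = np.eye(4)[[0, 0, 1, 1, 2, 3]]
music = np.full((6, 4), 0.25)
e_s, d_s = posterior_stats(speech)
e_m, d_m = posterior_stats(music)
```

Speech gives low entropy and high dynamism; non-speech that the classifier cannot fit gives the opposite, which is what makes the scatter plot separable.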
Perceptual experiments
(Steven Greenberg et al.)

• Use spectrally-sparse speech: the signal is reduced to four narrow spectral ‘slits’

• Effect of temporal alignment between bands:

[Figure: waveform and spectrogram (0-8 kHz) of the original signal and of slits 1-4, with intelligibility (WER%) compared for synchronized slits and for slits de-synchronized by 25, 50 and 75 ms]
Outline

1. About ICSI
2. Hybrid connectionist-HMM speech recognition
3. Overview of current projects
4. Some details:
- Information stream combination
- Broadcast News spoken-document retrieval
- SpeechCorder PDA
5. Conclusions
4.1 Information stream combinations

• Task: AURORA noisy digits
- continuous digits
- test: 4 noise types x 7 SNR levels
- train: mixed clean/noisy data

• Feature design evaluation
- intermediate representation for mobile phones
- evaluation specifies GMM-HMM (HTK) system

• Baseline results:

Feature   System   WER ratio
mfcc      HTK      100.0%
plp       Hybrid    89.6%
msg       Hybrid    87.1%
msg       HTK      205.0%
msg KG    HTK      184.5%

• Can we combine features advantageously?
Combination schemes

• Simple probability combination works best:
P(qi | X1, X2) = P(qi | X1) · P(qi | X2) / P(qi) ... if X1 ⊥ X2 | q

[Figure: three architectures for combining two feature streams - (1) hypothesis combination: each feature calculation feeds its own acoustic classifier and HMM decoder, and the two sets of word hypotheses are merged into final hypotheses; (2) posterior combination: the two classifiers' phone probabilities are merged before a single HMM decoder; (3) feature combination: the two feature streams are merged before a single acoustic classifier and HMM decoder]
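The posterior-combination rule on this slide is a one-liner per frame; a sketch with illustrative toy values:

```python
import numpy as np

def combine_posteriors(p1, p2, priors):
    """Posterior combination under a conditional independence
    assumption (X1 ⊥ X2 | q):
    P(q|X1,X2) ∝ P(q|X1) P(q|X2) / P(q), renormalized."""
    combined = p1 * p2 / np.maximum(priors, 1e-12)
    return combined / combined.sum(axis=-1, keepdims=True)

# Two streams agreeing on class 0 yield a sharper combined posterior
p_plp = np.array([0.6, 0.3, 0.1])
p_msg = np.array([0.7, 0.2, 0.1])
priors = np.array([1 / 3, 1 / 3, 1 / 3])
p = combine_posteriors(p_plp, p_msg, priors)
```

When the streams agree, the product sharpens the decision; when they disagree, it hedges - which is the behavior the WER results below reflect.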
Posterior multiplication

Features    System          WER ratio
plp + msg   Feature combo    74.1%
plp + msg   Prob. combo      63.0%
plp + msg   HTK on probs.    51.6%
4.2 Spoken document retrieval

• Based on DARPA/NIST Broadcast News

• Training material recorded off-air
- ABC, CNN, CSPAN, NPR
- 200 hour training set (TREC: 550 hour archive)
- training: word transcriptions + speaker time boundaries

• Best WER results:
- 1996: HTK: 27%
- 1997: HTK: 16% (but: easier; 22% on 1996 eval)
- 1998: LIMSI: 14% (SPRACH: 21%)

• Some clear conclusions
- one classifier for all conditions (or male/female)
- feature adaptation (VTLN, MLLR, SAT)
- importance of segmentation
- more data is useful
Applications for BN systems
CTION
DER
ODE OF VE PARTY OD BUYS IS
ASR for TICSP - Dan Ellis 1999sep02 - 26
• Live transcription- subtitles- transcripts- but: more than words?
• Video editing- precision word-time alignments- commercial systems by IBM, Virage, etc.
• Information Retrieval (IR)- TREC/MUC ‘spoken documents’- tolerant of word error rate, e.g.:
F0: THE VERY EARLY RETURNS OF THE NICARAGUAN PRESIDENTIAL ELESEEMED TO FADE BEFORE THE LOCAL MAYOR ON A LOT OF LAW
F4: AT THIS STAGE OF THE ACCOUNTING FOR SEVENTY SCOTCH ONE LEADANIEL ORTEGA IS IN SECOND PLACE THERE WERE TWENTY THREE PRESIDENTIAL CANDIDATES OF THE ELECTION
F5: THE LABOR MIGHT DO WELL TO REMEMBER THE LOST A MAJOR EPISTRANSATLANTIC CONNECT TO A CORPORATION IN BOTH CONSERVATIOFFICIALS FROM BRITAIN GOING TO WASHINGTON THEY WENT TO WOGEORGE BUSH ON HOW TO WIN A SECOND TO NONE IN LONDON THISSTEPHEN BEARD FOR MARKETPLACE
Thematic Indexing of Spoken Language
(Thisl)

• EC collaboration, BBC providing data

• 1000+ hr archive data

• IR is key factor
- stop lists
- weighting schemes
- query expansion

[Figure: system diagram - a receiver takes in the broadcast control, text, audio and video streams; segmentation and ASR feed NLP and IR components over a database/archive, accessed via http from the Thisl GUI in response to queries]

Thisl GUI demo

• Tcl/Tk front-end to Thisl IR engine

• Spoken query input: SPRACH demo / Abbot

• NLP integration: prolog lattice parser
4.3 SpeechCorder

• Convergence of interesting problems:
- ubiquitous PDAs
- multimedia processing
- very fast, low-power CPU design
- resource-bound speech recognition

• ICSI / UC Berkeley / MIT collaboration
- ICSI: speech & audio processing
- UCB: user interface design
- MIT: new low-power CPUs

• Current proposal
- PDA
- meeting / memo recorder
- .. also for sociological study?
(new Human Centered Computing consortium)

The SpeechCorder GUI

• Live annotation of recognized speech

• Application issues:
- correcting errorful transcriptions
- finding places in the recording
- annotations (speaker, notes, emphasis)
SpeechCorder: Audio

• Speech recognition
- not close-mic’d
- speaker ID
- low-power / low-memory / vectorized
- non-local processing for ‘best’ transcript?

• Other audio issues
- identifying speech versus nonspeech
- finding / indexing nonspeech events...
Outline

1. About ICSI
2. Hybrid connectionist-HMM speech recognition
3. Overview of current projects
4. Some details
5. Conclusions
- more criticisms of ASR
- future work
Conclusions

• The downside of objective evaluation
- research priority has been pragmatic goal: reduce WER
- human speech recog. uses many constraints
- grammatic/semantic constraints implicit in word sequence statistics (grammar)
- automatic analysis of large corpora is possible & helpful

• The problems with a grammar
- unexpected (unseen) phrases are discounted
- highly brittle alternatives
- masks underlying performance

• A more scientific approach
- first work on the underlying phoneme classifier
- follow nonsense syllable performance (Fletcher)
The signal model in speech recognition

• Systems & approach have been optimized for the speech-alone situation
- minimize classifier parameters, maximize use of ‘feature space’
- e.g. cepstra [example]

• Possibly non-lexical data thrown away
- pitch
- timing/rhythm
- speaker identification

• Dire consequences
- .. dealing with nonspeech sounds
- .. distinguishing success & failure

• Popular focus of research
- e.g. segmental models, pitch features
- fail to obtain robust improvements
Future work

• Continue improving robustness
- better features
- better pronunciations
- better modeling

• Still looking for a good architecture
- multiband
- multistream
- more adaptation
- more contextual dependence

• Speech recognition: useful for applications
- archive indexing, summarizing
- personal devices, new interfaces
- tie-in to general audio analysis...