Audio-Visual Speech Recognition: Audio Noise, Video Noise, and Pronunciation Variability
Mark Hasegawa-Johnson, Electrical and Computer Engineering


Page 1: Audio-Visual Speech Recognition

Audio-Visual Speech Recognition:
Audio Noise, Video Noise,
and Pronunciation Variability

Mark Hasegawa-Johnson
Electrical and Computer Engineering

Page 2: Audio-Visual Speech Recognition

Audio-Visual Speech Recognition

1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beam-Form, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition

Page 3: Audio-Visual Speech Recognition

I. Video Noise

1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beam-Form, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition

Page 4: Audio-Visual Speech Recognition

AVICAR Database

• AVICAR = Audio-Visual In a CAR
• 100 talkers
• 4 cameras, 7 microphones
• 5 noise conditions: engine idling, 35 mph, 35 mph with windows open, 55 mph, 55 mph with windows open
• Three types of utterances:
  – Digits & phone numbers, for training and testing phone-number recognizers
  – TIMIT sentences, for training and testing large-vocabulary speech recognition
  – Isolated letters, to test the use of video for an acoustically hard recognition problem

Page 5: Audio-Visual Speech Recognition

AVICAR Recording Hardware
(Lee, Hasegawa-Johnson et al., ICSLP 2004)

• 4 cameras, glare shields, adjustable mounting. Best place = dashboard.
• 8 mics, pre-amps, wooden baffle. Best place = sunvisor.

System is not permanently installed; mounting requires 10 minutes.

Page 6: Audio-Visual Speech Recognition

AVICAR Video Noise

• Lighting: many different angles, many types of weather
• Interlace: 30 fps NTSC encoding used to transmit data from camera to digital video tape
• Facial features:
  – Hair
  – Skin
  – Clothing
  – Obstructions

Page 7: Audio-Visual Speech Recognition

AVICAR Noisy Image Examples

Page 8: Audio-Visual Speech Recognition

Related Problem: Dimensionality

• Dimension of the raw grayscale lip rectangle: 30x200 = 6,000 pixels
• Dimension of the DCT of the lip rectangle: 30x200 = 6,000 dimensions
• Smallest truncated DCT that allows a human viewer to recognize lip shapes (Hasegawa-Johnson, informal experiments): 25x25 = 625 dimensions
• Truncated DCT typically used in AVSR: 4x4 = 16 dimensions
• Dimension of "geometric lip features" that allow high-accuracy AVSR (e.g., Chu and Huang, 2000): 3 dimensions (lip height, lip width, vertical asymmetry)
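A minimal sketch of the truncation step described above: take the 2-D DCT of a grayscale lip rectangle and keep only the low-frequency 4x4 block typically used in AVSR. The 30x200 ROI size and the 4x4 cut come from the slide; the image itself is a random placeholder.

```python
import numpy as np
from scipy.fft import dctn

lip_roi = np.random.rand(30, 200)       # stand-in for a tracked lip rectangle
coeffs = dctn(lip_roi, norm='ortho')    # full DCT: still 30x200 = 6000 dimensions
feature = coeffs[:4, :4].flatten()      # truncated DCT: 4x4 = 16 dimensions
print(feature.shape)                    # (16,)
```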

Page 9: Audio-Visual Speech Recognition

Dimensionality Reduction: The Classics

[Two scatter plots of the same two-class data, illustrating the PCA and LDA projection directions]

• Principal Components Analysis (PCA): project onto eigenvectors of the total covariance matrix. The projection includes noise.
• Linear Discriminant Analysis (LDA): project onto v = W⁻¹d, where d is the difference of the class means and W is the within-class covariance. The projection reduces noise.
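A short sketch contrasting the two classical projections on synthetic two-class data: PCA takes the leading eigenvector of the total covariance (noise included), while LDA takes v = W⁻¹(μ₁ − μ₂) with W the within-class covariance. The data here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal([0.5, 1.0], [0.2, 0.8], size=(200, 2))   # class 1
x2 = rng.normal([2.0, 2.0], [0.2, 0.8], size=(200, 2))   # class 2
X = np.vstack([x1, x2])

# PCA direction: leading eigenvector of the total covariance matrix
total_cov = np.cov(X, rowvar=False)
pca_dir = np.linalg.eigh(total_cov)[1][:, -1]

# LDA direction: within-class covariance, inverted, times the class-mean difference
W = 0.5 * (np.cov(x1, rowvar=False) + np.cov(x2, rowvar=False))
lda_dir = np.linalg.solve(W, x1.mean(0) - x2.mean(0))
lda_dir /= np.linalg.norm(lda_dir)

print("PCA direction:", pca_dir, "LDA direction:", lda_dir)
```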

Page 10: Audio-Visual Speech Recognition

Manifold Estimation
(e.g., Roweis and Saul, Science 2000)

[Scatter plot of a neighborhood graph built on two-class data]

• Neighborhood graph
  – Node = data point
  – Edge = connect each data point to its K nearest neighbors
• Manifold estimation
  – The K nearest neighbors of each data point define the local (K−1)-dimensional tangent space of a manifold
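A minimal sketch of the neighborhood-graph construction the slide describes: connect each data point to its K nearest neighbors under Euclidean distance. The feature matrix here is a random stand-in.

```python
import numpy as np

def knn_graph(X, K=5):
    """Boolean adjacency matrix connecting each row of X to its K nearest neighbors."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # no self-edges
    neighbors = np.argsort(dists, axis=1)[:, :K]    # indices of the K nearest neighbors
    A = np.zeros(dists.shape, dtype=bool)
    rows = np.repeat(np.arange(len(X)), K)
    A[rows, neighbors.ravel()] = True
    return A

X = np.random.rand(100, 16)   # e.g., 16-dim truncated-DCT lip features
A = knn_graph(X, K=5)
print(A.sum(axis=1))          # every node has exactly K outgoing edges
```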

Page 11: Audio-Visual Speech Recognition

Local Discriminant Graph
(Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007)

[Scatter plot: local same-class and other-class neighborhoods on two-class data]

Maximize the local inter-manifold interpolation errors, subject to a constant same-class interpolation error:

Find P to maximize
  D = Σi ‖Pᵀ(xi − Σk ck yk)‖²,  yk ∈ KNN(xi), other classes

subject to S = constant, where
  S = Σi ‖Pᵀ(xi − Σj cj xj)‖²,  xj ∈ KNN(xi), same class
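A sketch of one way to read this objective, not the authors' code: reconstruct each point from its K same-class and K other-class neighbors with least-squares weights, accumulate the residual scatter matrices, and note that both D and S are then quadratic in P, so the constrained maximization reduces to a generalized eigenvalue problem. All data and the ridge term are my own assumptions.

```python
import numpy as np
from scipy.linalg import eigh, lstsq

def reconstruction_residual(x, neighbors):
    """Residual of reconstructing x from the rows of `neighbors` with least-squares weights."""
    c, *_ = lstsq(neighbors.T, x)
    return x - neighbors.T @ c

def ldg_projection(X, labels, K=5, dim=2):
    n, d = X.shape
    S_between = np.zeros((d, d))   # other-class ("inter-manifold") residual scatter
    S_within = np.zeros((d, d))    # same-class residual scatter
    for i in range(n):
        same = np.flatnonzero((labels == labels[i]) & (np.arange(n) != i))
        other = np.flatnonzero(labels != labels[i])
        same = same[np.argsort(np.linalg.norm(X[same] - X[i], axis=1))[:K]]
        other = other[np.argsort(np.linalg.norm(X[other] - X[i], axis=1))[:K]]
        e_b = reconstruction_residual(X[i], X[other])
        e_w = reconstruction_residual(X[i], X[same])
        S_between += np.outer(e_b, e_b)
        S_within += np.outer(e_w, e_w)
    # maximize tr(P' S_between P) subject to tr(P' S_within P) = const
    vals, vecs = eigh(S_between, S_within + 1e-6 * np.eye(d))
    return vecs[:, -dim:]          # columns of P: top generalized eigenvectors

X = np.random.rand(60, 16)
labels = np.repeat([0, 1, 2], 20)
P = ldg_projection(X, labels)
print(P.shape)                     # (16, 2)
```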

Page 12: Audio-Visual Speech Recognition

PCA, LDA, LDG: Experimental Test
(Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007)

Lip feature extraction:
DCT = discrete cosine transform; PCA = principal components analysis; LDA = linear discriminant analysis; LEA = local eigenvector analysis; LDG = local discriminant graph

Page 13: Audio-Visual Speech Recognition

Lip Reading Results (Digits)
(Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007)

[Chart: word error rate (%), roughly 61–64.5%, vs. Gaussians per state (2, 4, 8, 32) for DCT, PCA, LDA, LEA, and LDG features]

DCT = discrete cosine transform; PCA = principal components analysis; LDA = linear discriminant analysis; LEA = local eigenvector analysis; LDG = local discriminant graph

Page 14: Audio-Visual Speech Recognition

II. Audio Noise

1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beam-Form, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition

Page 15: Audio-Visual Speech Recognition

Audio Noise

Page 16: Audio-Visual Speech Recognition

Audio Noise Compensation

• Beamforming
  – Filter-and-sum (MVDR) vs. delay-and-sum
• Post-filter
  – MMSE log spectral amplitude estimator (Ephraim and Malah, 1984) vs. spectral subtraction
• Voice activity detection
  – Likelihood-ratio method (Sohn and Sung, ICASSP 1998)
  – Noise estimates:
    • Fixed noise
    • Time-varying noise (autoregressive estimator)
    • High-variance noise (backoff estimator)
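A minimal sketch of the delay-and-sum beamformer the slide contrasts with MVDR: time-align each microphone channel to a reference using per-channel delays and average. The steering delays here are hypothetical; estimating them (e.g., from the array geometry or by cross-correlation) is a separate step not shown.

```python
import numpy as np

def delay_and_sum(mics, delays_samples):
    """mics: (n_mics, n_samples) array; delays_samples: integer delay per mic."""
    n_mics, n = mics.shape
    out = np.zeros(n)
    for ch, d in zip(mics, delays_samples):
        out += np.roll(ch, -d)     # advance each channel by its delay (circular shift for simplicity)
    return out / n_mics

fs = 16000
mics = np.random.randn(7, fs)      # stand-in for the 7-channel AVICAR array
delays = [0, 2, 4, 1, 3, 5, 2]     # hypothetical steering delays, in samples
enhanced = delay_and_sum(mics, delays)
```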

Page 17: Audio-Visual Speech Recognition

MVDR Beamformer + MMSE-logSA Postfilter
(MVDR = minimum variance distortionless response)
(MMSE-logSA = MMSE log spectral amplitude estimator)
(Proof of optimality: Balan and Rosca, ICASSP 2002)
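The MMSE-logSA postfilter involves a priori/a posteriori SNR tracking and an exponential-integral gain, so as a simpler stand-in here is the spectral-subtraction postfilter it is compared against on the next slides. The noise spectrum is assumed already estimated from non-speech frames; the spectral floor is my own choice.

```python
import numpy as np

def spectral_subtraction(frames_stft, noise_psd, floor=0.05):
    """frames_stft: complex STFT (n_frames, n_bins); noise_psd: (n_bins,) noise power estimate."""
    power = np.abs(frames_stft) ** 2 + 1e-12
    clean_power = np.maximum(power - noise_psd, floor * power)   # subtract, with a spectral floor
    gain = np.sqrt(clean_power / power)
    return gain * frames_stft                                    # keep the noisy phase

# usage sketch: noise_psd estimated from, e.g., the first 10 frames of the beamformer output
# enhanced_stft = spectral_subtraction(stft_of_beamformer_output, noise_psd)
```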

Page 18: Audio-Visual Speech Recognition

Word Error Rate: Beamformers
Ten-digit phone numbers; trained and tested with a 50/50 mix of quiet (idle) and noisy (55 mph, windows open) conditions.
DS = delay-and-sum; MVDR = minimum variance distortionless response

Page 19: Audio-Visual Speech Recognition

Word Error Rate: Postfilters

Page 20: Audio-Visual Speech Recognition

Voice Activity Detection

• Most errors at low SNR arise because noise gets misrecognized as speech.
• Effective solution: voice activity detection (VAD).
• Likelihood-ratio VAD (Sohn and Sung, ICASSP 1998):

  Λt = log { p(Xt = St + Nt) / p(Xt = Nt) }

  Xt = measured power spectrum
  St, Nt = exponentially distributed speech and noise power

  Λt > threshold → speech present
  Λt < threshold → speech absent
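A sketch of a per-frame likelihood-ratio VAD under the slide's exponential model for speech and noise power. For each frequency bin, the log-likelihood ratio works out to γ·ξ/(1+ξ) − log(1+ξ), with γ = X/N the a posteriori SNR and ξ = S/N the a priori SNR; the frame score averages over bins. The crude maximum-likelihood a priori SNR estimate and the threshold are my own simplifications.

```python
import numpy as np

def vad_frame(X, noise_psd, threshold=0.2):
    """X: measured power spectrum of one frame; noise_psd: current noise estimate."""
    gamma = X / noise_psd                            # a posteriori SNR
    xi = np.maximum(gamma - 1.0, 1e-3)               # crude a priori SNR estimate
    llr = gamma * xi / (1.0 + xi) - np.log1p(xi)     # per-bin log-likelihood ratio
    score = llr.mean()                               # frame-level statistic (Lambda_t)
    return score > threshold, score                  # True -> speech present
```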

Page 21: Audio-Visual Speech Recognition

VAD: Noise Estimators

• Fixed estimate: N0 = average of the first 10 frames
• Autoregressive estimator (Sohn and Sung):
  Nt = αt Xt + (1 − αt) Nt−1
  αt = function of Xt, N0
• Backoff estimator (Lee and Hasegawa-Johnson, DSP for In-Vehicle and Mobile Systems, 2007):
  Nt = αt Xt + (1 − αt) N0
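A minimal sketch of the two time-varying noise estimators above, with αt reduced to a fixed smoothing constant applied only during detected non-speech frames (the slide only says αt is a function of Xt and N0, so this is a simplification).

```python
def update_noise(N_prev, N_0, X_t, is_speech, alpha=0.1, mode="autoregressive"):
    """One frame of noise-PSD tracking. X_t: current power spectrum (array or scalar)."""
    if is_speech:
        return N_prev                               # freeze the estimate during speech
    if mode == "autoregressive":                    # N_t = a*X_t + (1-a)*N_{t-1}
        return alpha * X_t + (1 - alpha) * N_prev
    return alpha * X_t + (1 - alpha) * N_0          # backoff: N_t = a*X_t + (1-a)*N_0
```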

Page 22: Audio-Visual Speech Recognition

Word Error Rate: Digits

[Chart: word error rate (%), roughly 0–10%, by noise condition (Idle, 35U, 35D, 55U, 55D) for backoff estimation, autoregressive estimation, and fixed noise]

Page 23: Audio-Visual Speech Recognition

III. Pronunciation Variability

1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beam-Form, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition

Page 24: Audio-Visual Speech Recognition

Graphical Methods: Dynamic Bayesian Network

• Bayesian network (BN) = a graph in which
  – Nodes are random variables (RVs)
  – Edges represent dependence
• Dynamic Bayesian network (DBN) = a BN in which RVs are repeated once per time step
• Example: an HMM is a DBN
  – Most important RV: the "phonestate" variable qt
  – Typically qt ∈ {phones} x {1, 2, 3}
  – Acoustic features xt and video features yt depend on qt

Page 25: Audio-Visual Speech Recognition

Example: HMM is a DBN

[DBN diagram: in each frame t, nodes qt, νt, xt, yt, wt, winct, qinct, with the same structure repeated from frame t−1 to frame t]

• qt is the phonestate, e.g., qt ∈ { /w/1, /w/2, /w/3, /n/1, /n/2, … }
• wt is the word label at time t, e.g., wt ∈ {"one", "two", …}
• νt is the position of phone qt within word wt: νt ∈ {1st, 2nd, 3rd, …}
• qinct ∈ {0, 1} specifies whether νt+1 = νt or νt+1 = νt + 1
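A small sketch of the deterministic bookkeeping in this HMM-as-DBN: given the phone-increment (qinc) and word-increment (winc) decisions at frame t, advance the word w, the within-word position ν, and hence the phonestate q. The toy lexicon is a stand-in, not the recognizer's dictionary.

```python
LEXICON = {"one": ["/w/1", "/w/2", "/w/3", "/n/1", "/n/2", "/n/3"],
           "two": ["/t/1", "/t/2", "/t/3", "/u/1", "/u/2", "/u/3"]}

def advance(word, nu, qinc, winc, next_word=None):
    """Return (word, nu, q) at frame t+1 given the frame-t variables."""
    if winc:                        # word boundary: restart the position in the next word
        word, nu = next_word, 0
    elif qinc:                      # phonestate increment within the current word
        nu = nu + 1
    q = LEXICON[word][nu]           # the phonestate is determined by (word, position)
    return word, nu, q

print(advance("one", 2, qinc=1, winc=0))   # ('one', 3, '/n/1')
```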

Page 26: Audio-Visual Speech Recognition

Pronunciation Variability

• Even when reading phone numbers, talkers "blend" articulations.
• For example, "seven eight": /sɛvәnet/ → /sɛvne?/
• As speech gets less formal, pronunciation variability gets worse, e.g., worse in a car than in the lab; worse in conversation than in read speech.

Page 27: Audio-Visual Speech Recognition

A Related Problem: Asynchrony

• Audio and video information are not synchronous.
• For example, the "th" (/θ/) in "three" is visible, but not yet audible, because the audio is still silent.
• Should the HMM be in qt = "silence," or qt = /θ/?

Page 28: Audio-Visual Speech Recognition

A Solution: Two State Variables
(Chu and Huang, ICASSP 2000)

[Coupled-HMM diagram: in each frame, an audio chain (qt, xt) and a video chain (vt, yt), each with its own position and increment variables, coupled through the word variables wt, winct]

• Coupled HMM (CHMM): two parallel HMMs
• qt: audio state (xt: audio observation)
• vt: video state (yt: video observation)
• δt = asynchrony between the audio and video state indices, capped at |δt| < 3
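A tiny sketch of what the asynchrony cap does to the joint state space: enumerate the (audio-state, video-state) pairs a coupled HMM may occupy when the index difference is bounded by the slide's |δt| < 3. The per-stream chain length of 6 (e.g., two 3-state phones) is illustrative.

```python
N_STATES = 6
allowed = [(a, v) for a in range(N_STATES) for v in range(N_STATES) if abs(a - v) < 3]
print(len(allowed), "of", N_STATES * N_STATES, "joint states allowed")   # 24 of 36
```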

Page 29: Audio-Visual Speech Recognition

Asynchrony in Articulatory Phonology
(Livescu and Glass, 2004)

• It's not really the AUDIO and VIDEO that are asynchronous…
• It is the LIPS, TONGUE, and GLOTTIS that are asynchronous.

[DBN diagram from Livescu and Glass: a word variable, per-stream index variables ind1–ind3, per-stream state variables S1–S3 with observations U1–U3, and pairwise synchrony variables sync1,2 and sync2,3]

Page 30: Audio-Visual Speech Recognition

Asynchrony in Articulatory Phonology

• It's not really the AUDIO and VIDEO that are asynchronous…
• It is the LIPS, TONGUE, and GLOTTIS that are asynchronous.

[Gestural scores for "three" over time:
 – Dictionary form: tongue = dental /θ/ → retroflex /r/ → palatal /i/; glottis = unvoiced → voiced.
 – Casual speech: the same tongue sequence, with stretches marked Silent, so the dental /θ/ gesture can precede any audible speech.]

Page 31: Audio-Visual Speech Recognition

Asynchrony in Articulatory Phonology

[Gestural scores for "seven" over time:
 – Dictionary form /sɛvәn/: lips = fricative /v/; tongue = fricative /s/ → wide /ɛ/ → neutral /ә/ → closed /n/.
 – Casual speech /sɛvn/: the same gestures, but the tongue's closed /n/ overlaps the lips' fricative /v/, so the neutral /ә/ is no longer heard.]

The same mechanism represents pronunciation variability:
• "Seven": /vәn/ → /vn/ if the tongue closes before the lips open
• "Eight": /et/ → /e?/ if the glottis closes before the tongue tip closes

Page 32: Audio-Visual Speech Recognition

An Articulatory Feature Model
(Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)

[DBN diagram: in each frame, separate lip, tongue, and glottis chains (lt, tt, gt), each with its own position and increment variables, coupled through the word variables wt, winct]

• There is no "phonestate" variable. Instead, we use a vector qt → [lt, tt, gt]:
  – Lipstate variable lt
  – Tonguestate variable tt
  – Glotstate variable gt
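A rough sketch of the factorization idea: the single phonestate is replaced by one sub-state per articulator, each stream advancing through its own sequence of targets for the current word, with the streams allowed to desynchronize. The target inventories and the pairwise asynchrony bound below are my own illustrative assumptions (by analogy with the CHMM cap), not the paper's settings.

```python
from itertools import product

# hypothetical per-stream target sequences for the word "seven"
LIPS = ["open", "fricative /v/", "open"]
TONGUE = ["fricative /s/", "wide /E/", "neutral /@/", "closed /n/"]
GLOTTIS = ["unvoiced", "voiced"]

MAX_ASYNC = 1    # assumed bound on asynchrony between stream position indices

def allowed_positions():
    combos = product(range(len(LIPS)), range(len(TONGUE)), range(len(GLOTTIS)))
    return [(l, t, g) for l, t, g in combos
            if abs(l - t) <= MAX_ASYNC and abs(t - g) <= MAX_ASYNC and abs(l - g) <= MAX_ASYNC]

for l, t, g in allowed_positions():
    print(LIPS[l], "|", TONGUE[t], "|", GLOTTIS[g])
```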

Page 33: Audio-Visual Speech Recognition

Experimental Test
(Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)

• Training and test data: CUAVE corpus
  – Patterson, Gurbuz, Tufekci and Gowdy, ICASSP 2002
  – 169 utterances used, 10 digits each, silence between words
  – Recorded without audio or video noise (studio lighting; silent background)
• Audio prepared by Kate Saenko at MIT
  – NOISEX speech babble added at various SNRs
  – MFCC+d+dd feature vectors, 10 ms frames
• Video prepared by Amar Subramanya at UW
  – Feature vector = DCT of lip rectangle
  – Upsampled from 33 ms frames to 10 ms frames (a resampling sketch follows this list)
• Experimental condition: train-test mismatch
  – Training on clean data
  – Audio/video weights tuned on noise-specific dev sets
  – Language model: uniform (all words equally probable), constrained to have the right number of words per utterance
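One simple way to do the frame-rate alignment mentioned above: resample the video-rate DCT lip features (about one frame per 33 ms) to the 10 ms audio frame rate by linear interpolation. The slide does not say whether interpolation or frame repetition was used, and the feature values here are placeholders.

```python
import numpy as np

video_feats = np.random.rand(90, 16)             # ~3 s of 16-dim DCT features at ~30 fps
t_video = np.arange(len(video_feats)) * 0.033    # video frame times (s)
t_audio = np.arange(0, t_video[-1], 0.010)       # audio frame times, 10 ms hop

upsampled = np.stack([np.interp(t_audio, t_video, video_feats[:, d])
                      for d in range(video_feats.shape[1])], axis=1)
print(upsampled.shape)                           # (n_audio_frames, 16)
```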

Page 34: Audio-Visual Speech Recognition

Experimental Questions
(Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)

1) Does video reduce word error rate?
2) Does audio-video asynchrony reduce word error rate?
3) Should asynchrony be represented as
   1) audio-video asynchrony (CHMM), or
   2) lips-tongue-glottis asynchrony (AFM)?
4) Is it better to use only the CHMM, only the AFM, or a combination of both methods?

Page 35: Audio-Visual Speech Recognition

Results, part 1: Should we use video?
Answer: YES. Audio-visual WER < single-stream WER.

[Chart: word error rate (%), roughly 0–90%, for audio-only, video-only, and audiovisual systems at CLEAN, SNR 12 dB, 10 dB, 6 dB, 4 dB, and −4 dB]

Page 36: Audio-Visual Speech Recognition

Results, part 2: Are audio and video asynchronous?
Answer: YES. Asynchronous WER < synchronous WER.

[Chart: word error rate (%), roughly 0–70%, at CLEAN through SNR −4 dB, for no asynchrony, 1 state of asynchrony, 2 states of asynchrony, and unlimited asynchrony]

Page 37: Audio-Visual Speech Recognition

Results, part 3: Should we use the CHMM or the AFM?
Answer: It DOESN'T MATTER. The WERs are equal.

[Chart: word error rate (%), roughly 0–80%, at Clean through SNR −4 dB, for the phone-viseme (CHMM) and articulatory-feature systems]

Page 38: Audio-Visual Speech Recognition

Results, part 4: Should we combine systems?
Answer: YES. Best is the AFM+CHMM1+CHMM2 ROVER combination.

[Chart: word error rate (%), roughly 17–23%, for the A+C1+C2 ROVER vs. the CU+C1+C2 ROVER, where A = AFM and C1, C2 = CHMMs]
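A small sketch of the word-level voting idea behind ROVER system combination: align the word strings from the component recognizers (assumed already aligned here; the real ROVER does this with dynamic programming) and take a majority vote position by position. The hypotheses are made up.

```python
from collections import Counter

def vote(aligned_hyps):
    """aligned_hyps: list of equal-length word lists from different systems."""
    return [Counter(words).most_common(1)[0][0] for words in zip(*aligned_hyps)]

afm   = ["two", "five", "oh",   "nine"]
chmm1 = ["two", "nine", "oh",   "nine"]
chmm2 = ["two", "five", "four", "nine"]
print(vote([afm, chmm1, chmm2]))   # ['two', 'five', 'oh', 'nine']
```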

Page 39: Audio-Visual Speech Recognition

Conclusions

• Video feature extraction:
  – A manifold discriminant is better than a global discriminant.
• Audio feature extraction:
  – Beamformer: delay-and-sum beats filter-and-sum.
  – Postfilter: spectral subtraction gives the best WER (though MMSE-logSA sounds best).
  – VAD: backoff noise estimation works best in this corpus.
• Audio-video fusion:
  – Video reduces WER in train-test mismatch conditions.
  – Audio and video are asynchronous (CHMM).
  – Lips, tongue, and glottis are asynchronous (AFM).
  – It doesn't matter whether you use the CHMM or the AFM, but…
  – Best result: combine both representations.