Audio-Visual Speech Recognition:
Audio Noise, Video Noise, and Pronunciation Variability
Mark Hasegawa-Johnson
Electrical and Computer Engineering
Audio-Visual Speech Recognition
1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beam-Form, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition
I. Video Noise
1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beam-Form, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition
AVICAR Database
● AVICAR = Audio-Visual In a CAR
● 100 Talkers
● 4 Cameras, 7 Microphones
● 5 noise conditions: Engine idling, 35mph, 35mph with windows open, 55mph, 55mph with windows open
● Three types of utterances:
– Digits & Phone numbers, for training and testing phone-number recognizers
– TIMIT sentences, for training and testing large-vocabulary speech recognition
– Isolated Letters, to test the use of video for an acoustically hard recognition problem
AVICAR Recording Hardware
(Lee, Hasegawa-Johnson et al., ICSLP 2004)
● 4 cameras, glare shields, adjustable mounting. Best place = dashboard.
● 8 mics, pre-amps, wooden baffle. Best place = sunvisor.
● System is not permanently installed; mounting requires 10 minutes.
AVICAR Video Noise
● Lighting: Many different angles, many types of weather
● Interlace: 30fps NTSC encoding used to transmit data from camera to digital video tape
● Facial Features:
– Hair
– Skin
– Clothing
– Obstructions
AVICAR Noisy Image Examples
Related Problem: Dimensionality
● Dimension of the raw grayscale lip rectangle: 30x200 = 6000 pixels
● Dimension of the DCT of the lip rectangle: 30x200 = 6000 dimensions
● Smallest truncated DCT that allows a human viewer to recognize lip shapes (Hasegawa-Johnson, informal experiments): 25x25 = 625 dimensions
● Truncated DCT typically used in AVSR: 4x4 = 16 dimensions
● Dimension of “geometric lip features” that allow high-accuracy AVSR (e.g., Chu and Huang, 2000): 3 dimensions (lip height, lip width, vertical asymmetry)
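The truncated-DCT feature extraction above is easy to sketch in Python (a minimal illustration with scipy and a placeholder array, not the actual AVICAR pipeline):

```python
import numpy as np
from scipy.fft import dctn

lip_roi = np.random.rand(30, 200)   # hypothetical grayscale lip rectangle

# Full 2-D DCT: as many coefficients as pixels (6000 dimensions).
coeffs = dctn(lip_roi, norm='ortho')

# Truncation keeps only the low-frequency corner of the coefficient grid:
# 4x4 = 16 dimensions, the size typically used in AVSR.
features_16 = coeffs[:4, :4].flatten()

# 25x25 = 625 dimensions, roughly the smallest truncation that still lets
# a human viewer recognize lip shapes. Note the ROI is only 30 rows tall,
# so a 25x25 block keeps nearly all vertical frequencies.
features_625 = coeffs[:25, :25].flatten()
```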
Dimensionality Reduction: The Classics
[Scatter plots: the same two-class 2-D data, showing the PCA and LDA projection directions]
● Principal Components Analysis (PCA): Project onto eigenvectors of the total covariance matrix. Projection includes noise.
● Linear Discriminant Analysis (LDA): Project onto v = W⁻¹d, where W = within-class covariance and d = the between-class mean difference. Projection reduces noise.
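A minimal scikit-learn sketch of the contrast between the two projections, on synthetic two-class data (the dimensions and class separation are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two synthetic classes in 16 dimensions (stand-ins for truncated-DCT features).
X0 = rng.normal(loc=0.0, scale=1.0, size=(200, 16))
X1 = rng.normal(loc=0.5, scale=1.0, size=(200, 16))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# PCA: directions of maximum total variance; keeps noise if noise is large.
X_pca = PCA(n_components=3).fit_transform(X)

# LDA: direction v = W^{-1} d that separates the classes, suppressing
# within-class (noise) variance. For 2 classes, at most 1 component.
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
```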
Manifold Estimation
(e.g., Roweis and Saul, Science 2000)
[Scatter plot: data points connected to their K nearest neighbors]
● Neighborhood Graph
– Node = data point
– Edge = connect each data point to its K nearest neighbors
● Manifold Estimation
– The K nearest neighbors of each data point define the local (K-1)-dimensional tangent space of a manifold
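The neighborhood graph itself is one call in scikit-learn (a sketch on placeholder features):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.random((100, 16))   # placeholder feature vectors

K = 5
# Sparse adjacency matrix: A[i, j] = 1 if x_j is one of the K nearest
# neighbors of x_i. Each neighborhood spans a local tangent space of
# dimension at most K-1.
A = kneighbors_graph(X, n_neighbors=K, mode='connectivity')
```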
Local Discriminant Graph
(Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007)
● Maximize Local Inter-Manifold Interpolation Errors, subject to a constant Same-Class Interpolation Error:
● Find P to maximize
D = Σᵢ ||Pᵀ(xᵢ − Σₖ cₖyₖ)||², yₖ Є KNN(xᵢ), other classes
● Subject to S = constant,
S = Σᵢ ||Pᵀ(xᵢ − Σⱼ cⱼxⱼ)||², xⱼ Є KNN(xᵢ), same class
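A rough numpy/scikit-learn sketch of this optimization, under simplifying assumptions (uniform interpolation weights cₖ, and the constrained maximization solved as a generalized eigenproblem); the exact formulation in the ICIP 2007 paper may differ:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import NearestNeighbors

def ldg_projection(X, y, K=5, n_components=3):
    n, d = X.shape
    S = np.zeros((d, d))   # same-class interpolation scatter
    D = np.zeros((d, d))   # other-class interpolation scatter
    for i in range(n):
        for scatter, mask in ((S, y == y[i]), (D, y != y[i])):
            cand = np.where(mask)[0]
            cand = cand[cand != i]
            k = min(K, len(cand))
            nbrs = NearestNeighbors(n_neighbors=k).fit(X[cand])
            _, idx = nbrs.kneighbors(X[i:i + 1])
            recon = X[cand[idx[0]]].mean(axis=0)   # uniform weights c_k
            e = X[i] - recon
            scatter += np.outer(e, e)
    # Maximize tr(P^T D P) subject to tr(P^T S P) = const: take the top
    # generalized eigenvectors of (D, S); small ridge keeps S invertible.
    w, V = eigh(D, S + 1e-6 * np.eye(d))
    return V[:, np.argsort(w)[::-1][:n_components]]
```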
PCA, LDA, LDG: Experimental Test
(Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007)
● Lip Feature Extraction:
DCT = discrete cosine transform; PCA = principal components analysis;
LDA = linear discriminant analysis; LEA = local eigenvector analysis;
LDG = local discriminant graph
Lip Reading Results (Digits)
(Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007)
[Chart: Word Error Rate (%), 61 to 64.5, vs. Gaussians per State (2, 4, 8, 32), for DCT, PCA, LDA, LEA, and LDG features]
DCT = discrete cosine transform; PCA = principal components analysis;
LDA = linear discriminant analysis; LEA = local eigenvector analysis;
LDG = local discriminant graph
II. Audio Noise
1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beam-Form, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition
Audio Noise
● Beamforming (see the delay-and-sum sketch below)
– Filter-and-sum (MVDR) vs. Delay-and-sum
● Post-Filter
– MMSE log spectral amplitude estimator (Ephraim and Malah, 1984) vs. Spectral Subtraction
● Voice Activity Detection
– Likelihood ratio method (Sohn and Sung, ICASSP 1998)
– Noise estimates:
   Fixed noise
   Time-varying noise (autoregressive estimator)
   High-variance noise (backoff estimator)
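For reference, the simpler of the two beamformers is a few lines of numpy. This is a minimal sketch assuming known integer-sample steering delays; real systems estimate fractional delays from the array geometry or the data:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """channels: list of 1-D arrays, one per microphone.
    delays: integer sample delays that time-align each channel to the talker."""
    n = min(len(x) for x in channels)
    out = np.zeros(n)
    for x, d in zip(channels, delays):
        out += np.roll(x[:n], -d)   # simple alignment; wrap-around at edges
    return out / len(channels)      # speech adds coherently, noise averages down
```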
Audio Noise Compensation
● MVDR Beamformer + MMSElogSA Postfilter
(MVDR = Minimum variance distortionless response)
(MMSElogSA = MMSE log spectral amplitude estimator)
(Proof of optimality: Balan and Rosca, ICASSP 2002)
Word Error Rate: Beamformers
● Ten-digit phone numbers; trained and tested with 50/50 mix of quiet (idle) and noisy (55mph open)
● DS = Delay-and-sum; MVDR = Minimum variance distortionless response
Word Error Rate: Postfilters
Voice Activity Detection
● Most errors at low SNR occur because noise gets misrecognized as speech
● Effective solution: voice activity detection (VAD)
● Likelihood ratio VAD (Sohn and Sung, ICASSP 1998):
Λₜ = log { p(Xₜ = Sₜ + Nₜ) / p(Xₜ = Nₜ) }
Xₜ = measured power spectrum
Sₜ, Nₜ = exponentially distributed speech and noise
Λₜ > threshold → speech present
Λₜ < threshold → speech absent
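A per-frame sketch of this likelihood-ratio test (assuming exponential power-spectrum models; the a-priori SNR xi is fixed here rather than estimated by decision-directed smoothing as in the original):

```python
import numpy as np

def frame_llr(X, N, xi=1.0):
    """X, N: measured and estimated noise power spectra for one frame.
    Per-bin log-likelihood ratio under exponential models, averaged."""
    gamma = X / np.maximum(N, 1e-12)              # a-posteriori SNR
    llr = gamma * xi / (1.0 + xi) - np.log1p(xi)  # log LR per frequency bin
    return llr.mean()

def is_speech(X, N, threshold=0.2):
    return frame_llr(X, N) > threshold            # Lambda_t > threshold
```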
VAD: Noise Estimators
● Fixed estimate: N₀ = average of first 10 frames
● Autoregressive estimator (Sohn and Sung):
Nₜ = αₜXₜ + (1 − αₜ)Nₜ₋₁
αₜ = function of Xₜ, N₀
● Backoff estimator (Lee and Hasegawa-Johnson, DSP for In-Vehicle and Mobile Systems, 2007):
Nₜ = αₜXₜ + (1 − αₜ)N₀
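The two adaptive updates, side by side (numpy-style sketch; in the cited work αₜ is a function of Xₜ and N₀ rather than a constant):

```python
def autoregressive_update(N_prev, X_t, alpha_t):
    """N_t = alpha_t * X_t + (1 - alpha_t) * N_{t-1}: smooths toward the
    running estimate, so estimation errors can compound across frames."""
    return alpha_t * X_t + (1.0 - alpha_t) * N_prev

def backoff_update(N_0, X_t, alpha_t):
    """N_t = alpha_t * X_t + (1 - alpha_t) * N_0: always backs off toward
    the fixed initial estimate; more robust when the noise has high variance."""
    return alpha_t * X_t + (1.0 - alpha_t) * N_0
```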
Word Error Rate: Digits
[Bar chart: Word Error Rate (%), 0 to 10, by Noise Condition (Idle, 35U, 35D, 55U, 55D) for Backoff Estimation, Autoregressive Estimation, and Fixed Noise]
III. Pronunciation Variability
1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beam-Form, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition
Graphical Methods: Dynamic Bayesian Network
● Bayesian Network = A graph in which
– Nodes are Random Variables (RVs)
– Edges represent dependence
● Dynamic Bayesian Network = A BN in which RVs are repeated once per time step
● Example: an HMM is a DBN
– Most important RV: the “phonestate” variable qₜ
– Typically qₜ Є {Phones} × {1,2,3}
– Acoustic features xₜ and video features yₜ depend on qₜ
Example: HMM is a DBN
[DBN diagram, frames t-1 and t: word wₜ, phone position φₜ, phonestate qₜ, observations xₜ and yₜ, with increment variables qincₜ and wincₜ linking consecutive frames]
● qₜ is the phonestate, e.g., qₜ Є { /w/1, /w/2, /w/3, /n/1, /n/2, … }
● wₜ is the word label at time t, for example, wₜ Є {“one”, “two”, …}
● φₜ is the position of phone qₜ within word wₜ: φₜ Є {1st, 2nd, 3rd, …}
● qincₜ Є {0,1} specifies whether φₜ₊₁ = φₜ or φₜ₊₁ = φₜ + 1
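A toy sketch of these transition semantics (the pronunciation table is invented for illustration, not taken from the paper):

```python
# Each word maps to a sequence of phonestates; phi indexes into it.
PRON = {"one": ["/w/1", "/w/2", "/w/3", "/ah/1", "/n/1"],
        "two": ["/t/1", "/t/2", "/uw/1"]}

def phonestate(w_t, phi_t):
    """q_t is determined by the word w_t and the position phi_t within it."""
    return PRON[w_t][phi_t]

def next_position(phi_t, qinc_t):
    """qinc_t in {0, 1}: phi stays or advances by one each frame."""
    return phi_t + qinc_t
```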
Pronunciation Variability
● Even when reading phone numbers, talkers “blend” articulations.
● For example: “seven eight:” /sεvәnet/ → /sεvne?/
● As speech gets less formal, pronunciation variability gets worse, e.g., worse in a car than in the lab; worse in conversation than in read speech
A Related Problem: Asynchrony
● Audio and video information are not synchronous
● For example: “th” (/θ/) in “three” is visible, but not yet audible, because the audio is still silent
● Should the HMM be in qₜ = “silence,” or qₜ = /θ/?
A Solution: Two State Variables
(Chu and Huang, ICASSP 2000)
● Coupled HMM (CHMM): Two parallel HMMs
● qₜ: Audio state (xₜ: audio observation)
● vₜ: Video state (yₜ: video observation)
● Asynchrony = difference between the audio-chain and video-chain phone positions, capped at |asynchrony| < 3
[DBN diagram, frames t-1 and t: word wₜ with coupled audio chain (qₜ emitting xₜ) and video chain (vₜ emitting yₜ), each with its own position and increment variables]
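A sketch of the resulting joint state space, with the asynchrony cap enforced as a constraint on which (audio, video) position pairs are legal:

```python
from itertools import product

def chmm_states(n_phones, max_async=2):
    """All legal joint (audio_pos, video_pos) states for an n-phone word,
    keeping |audio - video| <= 2 (i.e., asynchrony capped below 3)."""
    return [(a, v) for a, v in product(range(n_phones), repeat=2)
            if abs(a - v) <= max_async]

def chmm_successors(a, v, n_phones, max_async=2):
    """Each chain independently stays or advances by one position."""
    return [(a2, v2)
            for a2 in {a, min(a + 1, n_phones - 1)}
            for v2 in {v, min(v + 1, n_phones - 1)}
            if abs(a2 - v2) <= max_async]
```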
Asynchrony in Articulatory Phonology
(Livescu and Glass, 2004)
● It’s not really the AUDIO and VIDEO that are asynchronous…
● It is the LIPS, TONGUE, and GLOTTIS that are asynchronous
[DBN diagram: word node with per-articulator index nodes (ind1, ind2, ind3), state nodes (S1, S2, S3), observations (U1, U2, U3), and pairwise synchrony nodes sync1,2 and sync2,3]
Asynchrony in Articulatory Phonology
[Gestural score, “three,” dictionary form (time →): Tongue: Dental /θ/ → Retroflex /r/ → Palatal /i/; Glottis: Unvoiced → Voiced]
● It’s not really the AUDIO and VIDEO that are asynchronous…
● It is the LIPS, TONGUE, and GLOTTIS that are asynchronous
[Gestural score, “three,” casual speech (time →): Tongue: Dental /θ/ → Retroflex /r/ → Palatal /i/; Glottis: Silent → Unvoiced → Voiced, with the tongue gesture beginning while the glottis is still silent]
Asynchrony in Articulatory Phonology
[Gestural score, “seven,” dictionary form /sεvәn/ (time →): Lips: Fricative /v/; Tongue: Fricative /s/ → Wide /ε/ → Neutral /ә/ → Closed /n/]
● Same mechanism represents pronunciation variability:
– “Seven:” /vәn/ → /vn/ if tongue closes before lips open
– “Eight:” /et/ → /e?/ if glottis closes before tongue tip closes
[Gestural score, “seven,” casual speech /sεvn/ (time →): Lips: Fricative /v/; Tongue: Fricative /s/ → Wide /ε/ → Closed /n/, with the tongue closing before the lips open]
An Articulatory Feature Model
(Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
● There is no “phonestate” variable. Instead, we use a vector qₜ → [lₜ, tₜ, gₜ]
– Lipstate variable lₜ
– Tonguestate variable tₜ
– Glotstate variable gₜ
[DBN diagram, frames t-1 and t: word wₜ with three coupled articulator chains (lipstate lₜ, tonguestate tₜ, glotstate gₜ), each with its own position and increment variables]
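Extending the CHMM sketch above to the articulatory-feature model, the state becomes a vector of three articulator positions; the pairwise asynchrony cap here is an assumption mirroring the CHMM cap, not a detail from the paper:

```python
from itertools import product

def afm_states(n_targets, max_async=2):
    """Legal (lip, tongue, glottis) position vectors within one word:
    the 'phonestate' is replaced by a factored articulator state."""
    return [s for s in product(range(n_targets), repeat=3)
            if max(s) - min(s) <= max_async]
```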
Experimental Test
(Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
● Training and test data: CUAVE corpus
– Patterson, Gurbuz, Tufekci and Gowdy, ICASSP 2002
– 169 utterances used, 10 digits each, silence between words
– Recorded without audio or video noise (studio lighting; silent background)
● Audio prepared by Kate Saenko at MIT
– NOISEX speech babble added at various SNRs
– MFCC+d+dd feature vectors, 10ms frames (see the sketch after this list)
● Video prepared by Amar Subramanya at UW
– Feature vector = DCT of lip rectangle
– Upsampled from 33ms frames to 10ms frames
● Experimental Condition: Train-Test Mismatch
– Training on clean data
– Audio/video weights tuned on noise-specific dev sets
– Language model: uniform (all words equal probability), constrained to have the right number of words per utterance
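A minimal sketch of MFCC+d+dd extraction at a 10ms frame rate with librosa (not necessarily the toolchain used for the experiments; the filename is hypothetical):

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)    # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            hop_length=sr // 100)  # 10 ms hop
delta = librosa.feature.delta(mfcc)                # d
delta2 = librosa.feature.delta(mfcc, order=2)      # dd
features = np.vstack([mfcc, delta, delta2])        # 39 x n_frames
```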
Experimental Questions
(Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
1) Does Video reduce word error rate?
2) Does Audio-Video Asynchrony reduce word error rate?
3) Should asynchrony be represented as
   1) Audio-Video Asynchrony (CHMM), or
   2) Lips-Tongue-Glottis Asynchrony (AFM)?
4) Is it better to use only CHMM, only AFM, or a combination of both methods?
Results, part 1: Should we use video?
Answer: YES. Audio-Visual WER < Single-stream WER
[Bar chart: WER (%), 0 to 90, for Audio, Video, and Audiovisual systems at CLEAN, SNR 12dB, 10dB, 6dB, 4dB, and -4dB]
Results, part 2: Are Audio and Video asynchronous?
Answer: YES. Async WER < Sync WER.
[Bar chart: WER (%), 0 to 70, at CLEAN, SNR 12dB, 10dB, 6dB, 4dB, and -4dB for No Asynchrony, 1 State Async, 2 States Async, and Unlimited Async]
Results, part 3: Should we use CHMM or AFM?
Answer: DOESN’T MATTER! WERs are equal.
[Bar chart: WER (%), 0 to 80, at Clean, SNR 12dB, 10dB, 6dB, 4dB, and -4dB for Phone-viseme (CHMM) and Articulatory features (AFM)]
Results, part 4: Should we combine systems?
Answer: YES. Best is AFM+CH1+CH2 ROVER
[Bar chart: WER (%), 17 to 23, comparing A+C1+C2 ROVER and CU+C1+C2 ROVER; A: AFM, C1: CHMM, C2: CHMM]
Conclusions
● Video Feature Extraction:
– Manifold discriminant is better than a global discriminant
● Audio Feature Extraction:
– Beamformer: Delay-and-sum beats Filter-and-sum
– Postfilter: Spectral subtraction gives best WER (though MMSE-logSA sounds best)
– VAD: Backoff noise estimation works best in this corpus
● Audio-Video Fusion:
– Video reduces WER in train-test mismatch conditions
– Audio and video are asynchronous (CHMM)
– Lips, tongue and glottis are asynchronous (AFM)
– It doesn’t matter whether you use CHMM or AFM, but...
– Best result: combine both representations