Multi-Channel Modeling and Integration
with Applications in
Speech and Multimodal Processing
Prof. Hervé Bourlard ([email protected])
IDIAP Research Institute, Martigny, Switzerland
http://www.idiap.ch/
Swiss Federal Institute of Technology, Lausanne (EPFL)
**********
Outline
1. Multi-channel modeling and integration
• Problems
• State-of-the-art solutions
2. Multi-stream and Asynchronous HMM, with application in:
• Audio-visual speech recognition
3. Multi-layered HMMs, with applications in:
• Speech recognition
• Multi-modal meeting (group action) modeling
**********
Notation
• HMMs:
◦ Powerful models used to handle sequences
◦ State-of-the-art models for speech recognition
◦ Efficient training and decoding algorithms
• Notation:
◦ $O^k = (o^k_1, \dots, o^k_{N_k})$ is the $k$-th observation sequence
◦ $O = [O^1, \dots, O^k, \dots, O^K]$
◦ $K$ = number of channels, and $N_k$ = number of feature vectors (of dimension $d_k$) in the $k$-th observation stream
◦ $Q^k = \{q^k_1, \dots, q^k_{N_k}\}$ is the set of (piecewise stationary) states modeling the $k$-th observation stream.
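
As a concrete illustration of this notation, a minimal Python/NumPy sketch; the shapes are toy values, chosen to match the audio-visual setting used later (33 audio features at 100 Hz, 48 visual features at 25 Hz):

import numpy as np

K = 2                 # number of channels (streams)
N = [400, 100]        # N_k: frames in stream k (4 s at 100 Hz and at 25 Hz)
d = [33, 48]          # d_k: feature dimension of stream k

# O^k is an (N_k x d_k) sequence of feature vectors o^k_1 ... o^k_{N_k};
# O = [O^1, ..., O^K] collects the K streams (random placeholders here).
O = [np.random.randn(N[k], d[k]) for k in range(K)]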
**********
1) Multi-Channel Modeling: Problems (1)
Joint Multi-Channel Modeling
• Training: parameters that maximise the likelihood of $L$ multi-stream sequences:

$$\theta_j^* = \arg\max_{\theta_j} \prod_{l=1}^{L} p(O_l \mid \theta_j) \qquad (1)$$
• Decoding: given a multi-stream observation sequence $O$, find the sequence $W$ of "events" (classes) maximizing the joint likelihood $p(O \mid W, \theta_j^*)$.
• The best-known way to model such a distribution efficiently is to use Hidden Markov Models (HMMs).
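
For a single stream, criterion (1) is exactly what standard EM training of an HMM implements. A minimal sketch on toy data, assuming the third-party hmmlearn package is available, training one model per class $j$ on its $L$ example sequences:

import numpy as np
from hmmlearn import hmm          # assumed available third-party package

seqs = [np.random.randn(50, 13) for _ in range(10)]   # L = 10 toy sequences
X, lengths = np.concatenate(seqs), [len(s) for s in seqs]

model = hmm.GaussianHMM(n_components=3)   # theta_j for one class j
model.fit(X, lengths)                     # EM: locally maximizes sum_l log p(O_l | theta_j)
print(model.score(X, lengths))            # total log-likelihood after training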
**********
Multi-Channel Modeling: Problems (2)
• Examples of multiple feature sequences:
◦ Extracted from the same signal, e.g., multi-rate processing, different frequency bands
◦ Extracted from different modalities, but associated with the same "event" (class), typically in multimodal processing (audio-visual speech recognition, multimodal meeting actions, etc.)
• Asynchrony, with different frame rates ($N_k$) and/or a variable number of streams ($K$) over time.
◦ Piecewise stationarity assumption ⇒ exponentially complex models
• Complexity: more parameters, and higher-level information becomes difficult to model/extract
**********
Multi-Channel Modeling: Problems (3)
Piecewise Stationarity Assumption
[Figure: two asynchronous streams, each traversing states A, B, C. Modeled separately: 3 $d_1$-dimensional states + 3 $d_2$-dimensional states. Joint/naive integration: 5 $(d_1+d_2)$-dimensional states. Multi-stream: 3 $d_1$- + 3 $d_2$-dimensional states. Asynchronous: 3 $(d_1+d_2)$-dimensional states.]
**********
Multi-Channel Modeling: State-of-the-Art
• HMM/DBN variants based on different assumptions about stream inter-dependencies, such as:
◦ feature-level correlation,
◦ state-level correlation, and
◦ whether streams evolve synchronously or asynchronously.
• HMM/DBN variants include:
◦ Early integration HMM
◦ Late integration HMM
◦ Coupled HMMs [Brand, 1997]
◦ Multi-Rate HMMs [Cetin, 2004]
◦ Multi-Stream HMMs [Bourlard et al, 1996]
◦ Asynchronous HMMs [Bengio, 2002]
**********
2) Multi-Stream and Asynchronous HMMs
Multi-stream HMM (Bourlard et al, 1996)
[Figure: (1) MSHMM with one 3-state chain per stream; (2) the equivalent product MSHMM over composite states.]

(1) $p(o^k_t \mid q^k_t) = \sum_{g=1}^{G} w^k_g \, \mathcal{N}(o^k_t; \mu^k_g, \sigma^k_g), \quad \forall k$

(2) $p(o_t \mid q_t) = \prod_{k=1}^{K} p(o^k_t \mid q^k_t) \;\Rightarrow\;$ Viterbi (1) $\equiv$ Viterbi (2)

More complex recombinations $\Rightarrow$ Viterbi (1) $\not\equiv$ Viterbi (2)
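
A minimal NumPy sketch of the state-level recombination in (2): each stream scores its own diagonal-covariance GMM, and the frame-synchronous product model sums the per-stream log-likelihoods (all parameters here are toy values):

import numpy as np

def log_gmm(o, w, mu, sigma):
    # log sum_g w_g N(o; mu_g, sigma_g) for one frame, diagonal covariances
    logN = -0.5 * np.sum(((o - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2), axis=1)
    return np.logaddexp.reduce(np.log(w) + logN)

# Toy per-stream GMMs for one composite state q_t = (q^1_t, q^2_t), with G = 2
dims = (33, 48)
gmms = [dict(w=np.full(2, 0.5), mu=np.random.randn(2, dk), sigma=np.ones((2, dk)))
        for dk in dims]

o = [np.random.randn(dk) for dk in dims]                   # o^1_t, o^2_t
log_p = sum(log_gmm(o[k], **gmms[k]) for k in range(2))    # eq. (2), log domain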
**********
Multi-Stream and Asynchronous HMM
Multi-stream HMM:
• Assumes partial independence between feature streams
• While anchor (synchronization) points may occur anywhere, complexity becomes exponential in the number of streams.
• Thus, in practice, recombination is often done at the HMM state level.
◦ Assumes that streams evolve frame-synchronously.
◦ Doesn't really model the joint likelihood of the feature streams.
◦ Significant success in several applications such as noise-robust speech recognition, multi-stream speech recognition, audio-visual speech recognition, etc.
**********
Asynchronous HMMs [Bengio, 2002]
• Maximise the joint likelihood $P(O^1, \dots, O^K \mid \Theta)$ of the observation streams given the model $\Theta$
◦ Stream $O^k$ can emit an empty symbol ($\epsilon$) with probability $\alpha_k$:

$$p(o^k_t \mid q_t, g) = \begin{cases} (1 - \alpha_k)\, \mathcal{N}(o^k_t; \mu^k_g, \sigma^k_g) & \text{if } o^k_t \neq \epsilon \\ \alpha_k & \text{otherwise} \end{cases}$$

◦ $p(o_t \mid q_t) = \sum_{g=1}^{G} w_g \prod_{k=1}^{K} p(o^k_t \mid q_t, g)$
• Can deal with observation streams of different lengths.
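
A minimal sketch of this emission model in the same NumPy style (toy parameters; None stands in for the empty symbol $\epsilon$):

import numpy as np

def log_normal(o, mu, sigma):
    return -0.5 * np.sum(((o - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2))

def log_p_stream(o_k, alpha_k, mu_kg, sigma_kg):
    # log p(o^k_t | q_t, g): epsilon with prob. alpha_k, else a Gaussian emission
    if o_k is None:                                   # o^k_t == epsilon
        return np.log(alpha_k)
    return np.log(1.0 - alpha_k) + log_normal(o_k, mu_kg, sigma_kg)

def log_p_joint(o, w, alphas, mus, sigmas):
    # log p(o_t | q_t) = log sum_g w_g prod_k p(o^k_t | q_t, g)
    per_g = [sum(log_p_stream(o[k], alphas[k], mus[g][k], sigmas[g][k])
                 for k in range(len(o)))
             for g in range(len(w))]
    return np.logaddexp.reduce(np.log(np.asarray(w)) + np.asarray(per_g))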
**********
Asynchronous HMMs (2)
• Enables re-synchronization of observation streams.
• One HMM: maximizes the joint likelihood of all streams, modeling correlation in the feature space.
[Figure: state sequence $q_{t-1}, q_t, q_{t+1}$ with transitions $P(q_t \mid q_{t-1})$; each state emits either $p(o^1_t \mid q_t)$ alone or the joint $p(o^1_t, o^2_s \mid q_t)$, according to $p(\text{emit } o^2_s \mid q_t)$.]
• Forward-backward and EM training are possible, involving two hidden variables (instead of one in standard HMMs).
**********
Likelihood and Forward Recurrence
• Assuming two sequences $x_1^T = \{x_1, \dots, x_T\}$ and $y_1^S = \{y_1, \dots, y_S\}$, with $S < T$
• Forward loop for HMMs:

$$\alpha(i, t) = p(q_t{=}i, x_1^t) = p(x_t \mid q_t{=}i) \sum_{j=1}^{N} P(q_t{=}i \mid q_{t-1}{=}j)\, \alpha(j, t-1)$$

• Forward loop for AHMMs:

$$\alpha(i, s, t) = \epsilon(i, t)\, p(x_t, y_s \mid q_t{=}i) \sum_{j=1}^{N} P(q_t{=}i \mid q_{t-1}{=}j)\, \alpha(j, s{-}1, t{-}1) \;+\; (1 - \epsilon(i, t))\, p(x_t \mid q_t{=}i) \sum_{j=1}^{N} P(q_t{=}i \mid q_{t-1}{=}j)\, \alpha(j, s, t{-}1)$$

• $\epsilon(i, t)$ is the probability that the system emits on sequence $y$ at time $t$ while in state $i$.
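
A minimal NumPy sketch of this AHMM forward pass (log domain, O(N S T) time and memory; initial state probabilities are omitted for brevity, and all local terms are assumed precomputed):

import numpy as np

def ahmm_forward(logA, log_pxy, log_px, eps):
    # logA[j, i]     = log P(q_t=i | q_{t-1}=j)         (N x N)
    # log_pxy[i,t,s] = log p(x_{t+1}, y_{s+1} | q=i)    (N x T x S), 0-based
    # log_px[i, t]   = log p(x_{t+1} | q=i)             (N x T)
    # eps[i, t]      = epsilon(i, t+1)                  (N x T)
    N, T = log_px.shape
    S = log_pxy.shape[2]
    la = np.full((N, S + 1, T), -np.inf)   # la[i, s, t]: s = #y-frames consumed
    la[:, 1, 0] = np.log(eps[:, 0]) + log_pxy[:, 0, 0]     # emit (x_1, y_1)
    la[:, 0, 0] = np.log(1 - eps[:, 0]) + log_px[:, 0]     # emit x_1 alone
    for t in range(1, T):
        for i in range(N):
            # sum_j P(q_t=i | q_{t-1}=j) alpha(j, ., t-1), for every s at once
            prev = np.logaddexp.reduce(logA[:, i, None] + la[:, :, t - 1], axis=0)
            for s in range(S + 1):
                skip = np.log(1 - eps[i, t]) + log_px[i, t] + prev[s]
                emit = (np.log(eps[i, t]) + log_pxy[i, t, s - 1] + prev[s - 1]
                        if s > 0 else -np.inf)
                la[i, s, t] = np.logaddexp(emit, skip)
    return la   # log p(x_1^T, y_1^S) = logsumexp of la[:, S, T-1]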
**********
Multi-Stream vs Asynchronous HMM
• Multi-stream HMM:
◦ Allows a different number of HMM states for different observation streams (e.g., allowing different models for slow and fast streams).
◦ Doesn't explicitly model correlation between streams, but does so implicitly through anchor points.
◦ Full asynchrony between anchor points.
• Asynchronous HMM:
◦ Assumes a single HMM, thus the same number of HMM states for each observation stream.
◦ Models correlation between streams directly at the observation level, if $p(x, y \mid q)$ doesn't make independence assumptions.
◦ If so, it actually estimates $P(X, Y \mid \Theta)$.
**********
Audio-Visual Speech Recognition
Database
• Audio-Visual M2VTS database: 37 subjects, 185 recordings, (French) digit sequences; 5-fold cross-validation
• Audio:
◦ 16 MFCC coefficients and their first derivatives, plus the derivative of the log energy: 33 features, 100 Hz frame rate
• Video:
◦ 12 shape features and 12 intensity features, plus first derivatives: 48 features, 25 Hz frame rate
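
The two streams thus arrive at different frame rates. For a frame-synchronous baseline (early integration), a common trick is to repeat each video frame up to the audio rate; a minimal sketch with toy arrays (the AHMM instead consumes both streams at their native rates):

import numpy as np

audio = np.random.randn(400, 33)     # 4 s of audio features at 100 Hz
video = np.random.randn(100, 48)     # 4 s of visual features at 25 Hz

video_up = np.repeat(video, 100 // 25, axis=0)   # repeat each frame 4x -> (400, 48)
early = np.hstack([audio, video_up])             # (400, 81) joint feature stream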
**********
Audio-Visual Speech Recognition
Results
[Figure: Word Error Rate (%) as a function of noise level (0, 5, 10, 15 dB) for four systems: audio HMM, audio+video HMM, audio+video AHMM, and video HMM.]
**********
3) Multi-layered HMMs
• Posterior probabilities: used as a local measure/classifier or as transformed features.
• How to improve those posterior estimates:
1. Using contextual information, AHMM, etc.
2. Using any available a priori knowledge (e.g., specific HMM topological constraints)
3. Replacing $p(q^i_t \mid x_t)$ by $\gamma(i, t) = p(q^i_t \mid X_t, M)$, where:
◦ $X_t$ represents $x_t$ in context (e.g., the whole utterance $X$ in the case of ASR)
◦ $M$ is the (e.g., HMM) model representing the prior information
• Yielding "optimal" multi-layered (hierarchical) HMM structures.
**********
Posteriors as ASR features
• Use MLP-generated posteriors as acoustic features
• The MLP usually accommodates some acoustic context $x_{t-c}^{t+c}$, typically 9 frames
• The MLP can be used to merge features and generate a reduced set of "optimal" features (resulting from nonlinear discriminant analysis, NLDA)
• Those features can be transformed (usually through a KLT) to reduce the space dimension and/or to make them better suited to HMM/GMM modeling
• Successful example: "Tandem" features, integrated in the full SRI CTS system
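
A minimal NumPy sketch of the Tandem feature computation: log of the MLP posteriors followed by a decorrelating KLT, here computed via an SVD of the centered features (the MLP outputs are toy values; taking the log of the posteriors is used as a stand-in for the pre-final-nonlinearity outputs):

import numpy as np

def tandem_features(posteriors, n_out=25):
    # posteriors: (T, Q) per-frame MLP outputs p(q^i_t | x_t)
    logp = np.log(posteriors + 1e-10)
    logp = logp - logp.mean(axis=0)           # center before the KLT
    _, _, Vt = np.linalg.svd(logp, full_matrices=False)
    return logp @ Vt[:n_out].T                # decorrelated, reduced Tandem features

post = np.random.dirichlet(np.ones(46), size=300)   # toy posteriors over 46 phones
feats = tandem_features(post)                       # -> input to the HMM/GMM system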
**********
Posteriors as features: Tandem
[Figure: Tandem pipeline: acoustic context $X$ → MLP → log of the MLP outputs $p(q^i_t \mid x_t)$ (taken prior to the final nonlinearity) → KL transform → Tandem features → HMM/GMM system → word sequence $W$.]
**********
HMM State Gammas as Features (1)
Likelihood-based systems
• Alpha recursion:

$$\alpha(i, t) = p(x_1^t, q_t^i) = p(x_t \mid q_t^i) \sum_j p(q_t^i \mid q_{t-1}^j)\, \alpha(j, t-1)$$

• Beta recursion:

$$\beta(i, t) = p(x_{t+1}^T \mid q_t^i) = \sum_j p(x_{t+1} \mid q_{t+1}^j)\, p(q_{t+1}^j \mid q_t^i)\, \beta(j, t+1)$$

• State gamma:

$$\gamma(i, t) = p(q_t^i \mid x_1^T) = \frac{\alpha(i, t)\, \beta(i, t)}{\sum_j \alpha(j, T)}$$
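
These three recursions in a minimal NumPy sketch (log domain; logB plays the role of the local likelihoods $p(x_t \mid q^i_t)$):

import numpy as np

def hmm_gammas(logA, logB, log_pi):
    # logA[j, i] = log p(q^i_t | q^j_{t-1}); logB[i, t] = log p(x_t | q^i_t)
    N, T = logB.shape
    la = np.empty((N, T))                 # log alpha
    lb = np.zeros((N, T))                 # log beta; beta(i, T) = 1
    la[:, 0] = log_pi + logB[:, 0]
    for t in range(1, T):                 # alpha recursion
        la[:, t] = logB[:, t] + np.logaddexp.reduce(logA + la[:, t-1][:, None], axis=0)
    for t in range(T - 2, -1, -1):        # beta recursion
        lb[:, t] = np.logaddexp.reduce(logA + (logB[:, t+1] + lb[:, t+1])[None, :], axis=1)
    log_px = np.logaddexp.reduce(la[:, -1])      # sum_j alpha(j, T)
    return np.exp(la + lb - log_px)              # gamma(i, t) = p(q^i_t | x_1^T)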
**********
HMM State Gammas as Features (2)
Posterior-based systems
• Scaled-alpha recursion:

$$\alpha_s(i, t) = \frac{p(x_1^t, q_t^i)}{\prod_{\tau=1}^{t} p(x_\tau)} = \frac{p(q_t^i \mid x_t)}{p(q_t^i)} \sum_j p(q_t^i \mid q_{t-1}^j)\, \alpha_s(j, t-1)$$

• Scaled-beta recursion:

$$\beta_s(i, t) = \frac{p(x_{t+1}^T \mid q_t^i)}{\prod_{\tau=t+1}^{T} p(x_\tau)} = \sum_j \frac{p(q_{t+1}^j \mid x_{t+1})}{p(q_{t+1}^j)}\, p(q_{t+1}^j \mid q_t^i)\, \beta_s(j, t+1)$$

• State gamma:

$$\gamma_s(i, t) = p(q_t^i \mid x_1^T) = \frac{\alpha_s(i, t)\, \beta_s(i, t)}{\sum_j \alpha_s(j, T)}$$
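
The only change with respect to the likelihood-based recursions is that the local term $p(x_t \mid q^i_t)$ is replaced by the posterior/prior ratio, e.g., from an MLP. A minimal sketch of the scaled-alpha recursion, in the same style as hmm_gammas above:

import numpy as np

def scaled_alpha(logA, log_post, log_prior):
    # log_post[i, t] = log p(q^i_t | x_t) (MLP output); log_prior[i] = log p(q^i_t)
    ratio = log_post - log_prior[:, None]          # log [ p(q|x) / p(q) ]
    N, T = log_post.shape
    la = np.empty((N, T))
    la[:, 0] = log_post[:, 0]    # assuming pi_i = p(q^i): alpha_s(i, 1) = p(q^i_1 | x_1)
    for t in range(1, T):
        la[:, t] = ratio[:, t] + np.logaddexp.reduce(logA + la[:, t-1][:, None], axis=0)
    return la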
**********
“Gamma” posteriors as features
[Figure: $X$ → MLP → posterior/prior ratios $p(q^i_t \mid x_t)/p(q^i_t)$ → state gammas $\gamma_s(i, t)$ → log → KL transform → "Gamma" features → HMM/GMM system → word sequence $W$.]
**********
Multi-layered HMMs for ASR: Numbers’95
Task: Numbers’95
Features                          WER
PLP                               6.9%
Tandem phone posteriors (alone)   4.9%
Gamma phone posteriors (alone)    4.6%

Table 1: Word error rate (WER) on the Numbers'95 task: 31 lexicon words, 9x39(PLP)-1200-24 MLP (resulting in 25 features after KLT), 80 CD HMM/GMM phone models, 24-dimensional Tandem features, 1206 test utterances.
**********
DARPA CTS sub-task
Task: Conversational Telephone Speech
Features                          WER
PLP                               44.3%
PLP+Tandem phone posteriors       42.5%
PLP+Gamma phone posteriors        41.7%

Table 2: Word error rate (WER) on the male part of a reduced-vocabulary version of the DARPA Conversational Telephone Speech-to-text (CTS) task: 1,000 lexicon words, with multi-words and multi-pronunciations, 9x39(PLP)-1300-46 MLP (resulting in 25 features after KLT).
**********
Multi-modal meeting (group action) modeling
IDIAP Instrumented Meeting Room
[Figure: the IDIAP instrumented meeting room: meeting table with a microphone array, lapel microphones on the participants, whiteboard, projector screen, cameras, and an equipment rack.]
**********
Multi-layered HMMs for meeting actions
• Event lexicon: V = (monologue1, monologue2, monologue3, monologue4, discussion, presentation, whiteboard-presentation, note-taking)
• A-V features: 39-dimensional observation vectors extracted from 12 audio and 3 visual channels, at 5 Hz.
[Figure: two-layer architecture: each participant's A-V features (from microphones and cameras) feed an individual-level HMM (I-HMM 1 ... I-HMM N); the I-HMM outputs, together with group-level A-V features, feed a group-level HMM (G-HMM).]
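
A minimal sketch of the two-layer decomposition (names and shapes are illustrative, not the actual system): the first layer maps each participant's A-V features to per-frame individual-level state posteriors; these, stacked with the group-level features, form the observations of the group-action HMM:

import numpy as np

def individual_layer(person_feats, i_hmm):
    # 1st layer: per-frame state posteriors from an individual-level HMM
    # (e.g., hmmlearn models expose this as predict_proba)
    return i_hmm.predict_proba(person_feats)       # (T, #individual states)

def group_observations(person_feats_list, group_feats, i_hmms):
    # 2nd-layer input: concatenate all I-HMM outputs with the group A-V features
    outs = [individual_layer(f, m) for f, m in zip(person_feats_list, i_hmms)]
    return np.hstack(outs + [group_feats])         # (T, sum of widths)

# A G-HMM decoded on these observations yields the meeting-action sequence.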
**********
Meeting Actions: A-V features
Feature                          Modality   Participants
Seat speech activity             Audio      Individual
White-board speech activity      Audio      Group
Presentation speech activity     Audio      Group
Speech pitch                     Audio      Individual
Speech energy                    Audio      Individual
Speaking rate                    Audio      Individual
Head blob vertical centroid      Visual     Individual
Hand blob horizontal centroid    Visual     Individual
Hand blob eccentricity           Visual     Individual
Hand blob angle                  Visual     Individual
Combined motion                  Visual     Individual
White-board/presentation blob    Visual     Group
**********
Recognizing Sequences of Meeting Actions (1)
Method      Features       AER (%)
One-Layer   Visual Only    48.2
            Audio Only     36.7
            Audio-Visual   23.7
Two-Layer   Visual Only    42.5
            Audio Only     32.4
            Audio-Visual   16.6
            Async HMM      15.1

AER = Action Error Rate
• A-V multi-stream is better than early integration.
• It is important to model the correlation (interaction) between participants.
• Best results with multi-layered HMMs, using an AHMM at the participant level (1st HMM layer).
**********
Conclusion
• Multi-channel processing is an interesting challenge, with important potential applications in speech recognition, multimodal processing, etc.
• The AHMM is an interesting step towards principled modeling of the multi-stream joint likelihood.
• MSHMMs and AHMMs give good results on noisy speech recognition tasks, as well as on multi-modal group action modeling.
• Still not very efficient in space and time (but this can be controlled).
• Hierarchical HMMs, and the associated posterior combination, provide additional advantages and a new way of solving complex problems hierarchically.
**********