Multi-Channel Modeling and Integration
with Applications in
Speech and Multimodal Processing
Prof. Hervé Bourlard ([email protected])
IDIAP Research Institute, Martigny, Switzerland
http://www.idiap.ch/
Swiss Federal Institute of Technology, Lausanne (EPFL)
**********
Outline
1. Multi-channel modeling and integration
• Problems
• State-of-the-art solutions
2. Multi-stream and Asynchronous HMM, with application in:
• Audio-visual speech recognition
3. Multi-layered HMMs, with applications in:
• Speech recognition
• Multi-modal meeting (group action) modeling
**********
Notation
• HMMs:
◦ Powerful models used to handle sequences
◦ State-of-the-art models for speech recognition
◦ Efficient training and decoding algorithms
• Notation:
◦ $O^k = (o^k_1, \dots, o^k_{N_k})$ is the $k$-th observation sequence
◦ $O = [O^1, \dots, O^k, \dots, O^K]$
◦ $K$ = number of channels, and $N_k$ = number of feature vectors (of dimension $d_k$) in the $k$-th observation stream
◦ $Q^k = \{q^k_1, \dots, q^k_{N_k}\}$ is the set of (piecewise stationary) states modeling the $k$-th observation stream.
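
As a concrete illustration of this notation, a minimal Python/NumPy sketch; the shapes are toy values, chosen to match the audio-visual setting used later (33 audio features at 100 Hz, 48 visual features at 25 Hz):

import numpy as np

K = 2                 # number of channels (streams)
N = [400, 100]        # N_k: frames in stream k (4 s at 100 Hz and at 25 Hz)
d = [33, 48]          # d_k: feature dimension of stream k

# O^k is an (N_k x d_k) sequence of feature vectors o^k_1 ... o^k_{N_k};
# O = [O^1, ..., O^K] collects the K streams (random placeholders here).
O = [np.random.randn(N[k], d[k]) for k in range(K)]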
**********
1) Multi-Channel Modeling: Problems (1)
Joint Multi-Channel Modeling
• Training: parameters that maximise the likelihood of $L$ multi-stream sequences:

$$\theta_j^* = \arg\max_{\theta_j} \prod_{l=1}^{L} p(O_l \mid \theta_j) \qquad (1)$$
• Decoding: given a multi-stream observation sequence $O$, find the sequence $W$ of "events" (classes) maximizing the joint likelihood $p(O \mid W, \theta_j^*)$.
• The best-known way to model such a distribution efficiently is to use Hidden Markov Models (HMMs).
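
For a single stream, criterion (1) is exactly what standard EM training of an HMM implements. A minimal sketch on toy data, assuming the third-party hmmlearn package is available, training one model per class $j$ on its $L$ example sequences:

import numpy as np
from hmmlearn import hmm          # assumed available third-party package

seqs = [np.random.randn(50, 13) for _ in range(10)]   # L = 10 toy sequences
X, lengths = np.concatenate(seqs), [len(s) for s in seqs]

model = hmm.GaussianHMM(n_components=3)   # theta_j for one class j
model.fit(X, lengths)                     # EM: locally maximizes sum_l log p(O_l | theta_j)
print(model.score(X, lengths))            # total log-likelihood after training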
**********
Multi-Channel Modeling: Problems (2)
• Examples of multiple feature sequences:
◦ Extracted from the same signal, e.g., multi-rate processing, different frequency bands
◦ Extracted from different modalities, but associated with the same "event" (class), typically in multimodal processing (audio-visual speech recognition, multimodal meeting actions, etc.)
• Asynchrony, with different frame rates ($N_k$) and/or a variable number of streams ($K$) over time.
◦ Piecewise stationarity assumption ⇒ exponentially complex models
• Complexity: more parameters, and higher-level information becomes difficult to model/extract
**********
Multi-Channel Modeling: Problems (3)
Piecewise Stationarity Assumption
[Figure: two asynchronous streams, each traversing states A, B, C. Modeled separately: 3 $d_1$-dimensional states + 3 $d_2$-dimensional states. Joint/naive integration: 5 $(d_1+d_2)$-dimensional states. Multi-stream: 3 $d_1$- + 3 $d_2$-dimensional states. Asynchronous: 3 $(d_1+d_2)$-dimensional states.]
**********
Multi-Channel Modeling: State-of-the-Art
• HMM/DBN variants based on different assumptions about stream inter-dependencies, such as:
◦ feature-level correlation,
◦ state-level correlation, and
◦ whether streams evolve synchronously or asynchronously.
• HMM/DBN variants include:
◦ Early integration HMM
◦ Late integration HMM
◦ Coupled HMMs [Brand, 1997]
◦ Multi-Rate HMMs [Cetin, 2004]
◦ Multi-Stream HMMs [Bourlard et al, 1996]
◦ Asynchronous HMMs [Bengio, 2002]
**********
2) Multi-Stream and Asynchronous HMMs
Multi-stream HMM (Bourlard et al, 1996)
[Figure: (1) MSHMM with one 3-state chain per stream; (2) the equivalent product MSHMM over composite states.]

(1) $p(o^k_t \mid q^k_t) = \sum_{g=1}^{G} w^k_g \, \mathcal{N}(o^k_t; \mu^k_g, \sigma^k_g), \quad \forall k$

(2) $p(o_t \mid q_t) = \prod_{k=1}^{K} p(o^k_t \mid q^k_t) \;\Rightarrow\;$ Viterbi (1) $\equiv$ Viterbi (2)

More complex recombinations $\Rightarrow$ Viterbi (1) $\not\equiv$ Viterbi (2)
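
A minimal NumPy sketch of the state-level recombination in (2): each stream scores its own diagonal-covariance GMM, and the frame-synchronous product model sums the per-stream log-likelihoods (all parameters here are toy values):

import numpy as np

def log_gmm(o, w, mu, sigma):
    # log sum_g w_g N(o; mu_g, sigma_g) for one frame, diagonal covariances
    logN = -0.5 * np.sum(((o - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2), axis=1)
    return np.logaddexp.reduce(np.log(w) + logN)

# Toy per-stream GMMs for one composite state q_t = (q^1_t, q^2_t), with G = 2
dims = (33, 48)
gmms = [dict(w=np.full(2, 0.5), mu=np.random.randn(2, dk), sigma=np.ones((2, dk)))
        for dk in dims]

o = [np.random.randn(dk) for dk in dims]                   # o^1_t, o^2_t
log_p = sum(log_gmm(o[k], **gmms[k]) for k in range(2))    # eq. (2), log domain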
**********
Multi-Stream and Asynchronous HMM
Multi-stream HMM:
• Assumes partial independence between feature streams
• While anchor (synchronization) points may occur anywhere, complexity becomes exponential in the number of streams.
• Thus, in practice, recombination is often done at the HMM state level.
◦ Assumes that streams evolve frame-synchronously.
◦ Doesn't really model the joint likelihood of the feature streams.
◦ Significant success in several applications such as noise-robust speech recognition, multi-stream speech recognition, audio-visual speech recognition, etc.
**********
Asynchronous HMMs [Bengio, 2002]
• Maximise the joint likelihood $P(O^1, \dots, O^K \mid \Theta)$ of the observation streams given the model $\Theta$
◦ Stream $O^k$ can emit an empty symbol ($\epsilon$) with probability $\alpha_k$:

$$p(o^k_t \mid q_t, g) = \begin{cases} (1 - \alpha_k)\, \mathcal{N}(o^k_t; \mu^k_g, \sigma^k_g) & \text{if } o^k_t \neq \epsilon \\ \alpha_k & \text{otherwise} \end{cases}$$

◦ $p(o_t \mid q_t) = \sum_{g=1}^{G} w_g \prod_{k=1}^{K} p(o^k_t \mid q_t, g)$
• Can deal with observation streams of different lengths.
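
A minimal sketch of this emission model in the same NumPy style (toy parameters; None stands in for the empty symbol $\epsilon$):

import numpy as np

def log_normal(o, mu, sigma):
    return -0.5 * np.sum(((o - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2))

def log_p_stream(o_k, alpha_k, mu_kg, sigma_kg):
    # log p(o^k_t | q_t, g): epsilon with prob. alpha_k, else a Gaussian emission
    if o_k is None:                                   # o^k_t == epsilon
        return np.log(alpha_k)
    return np.log(1.0 - alpha_k) + log_normal(o_k, mu_kg, sigma_kg)

def log_p_joint(o, w, alphas, mus, sigmas):
    # log p(o_t | q_t) = log sum_g w_g prod_k p(o^k_t | q_t, g)
    per_g = [sum(log_p_stream(o[k], alphas[k], mus[g][k], sigmas[g][k])
                 for k in range(len(o)))
             for g in range(len(w))]
    return np.logaddexp.reduce(np.log(np.asarray(w)) + np.asarray(per_g))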
**********
Asynchronous HMMs (2)
• Enables re-synchronization of observation streams.
• One HMM: maximizes the joint likelihood of all streams, modeling correlation in the feature space.
[Figure: state sequence $q_{t-1}, q_t, q_{t+1}$ with transitions $P(q_t \mid q_{t-1})$; each state emits either $p(o^1_t \mid q_t)$ alone or the joint $p(o^1_t, o^2_s \mid q_t)$, according to $p(\text{emit } o^2_s \mid q_t)$.]
• Forward-backward and EM training are possible, involving two hidden variables (instead of one in standard HMMs).
**********
Likelihood and Forward Recurrence
• Assuming two sequences $x_1^T = \{x_1, \dots, x_T\}$ and $y_1^S = \{y_1, \dots, y_S\}$, with $S < T$
• Forward loop for HMMs:

$$\alpha(i, t) = p(q_t{=}i, x_1^t) = p(x_t \mid q_t{=}i) \sum_{j=1}^{N} P(q_t{=}i \mid q_{t-1}{=}j)\, \alpha(j, t-1)$$

• Forward loop for AHMMs:

$$\alpha(i, s, t) = \epsilon(i, t)\, p(x_t, y_s \mid q_t{=}i) \sum_{j=1}^{N} P(q_t{=}i \mid q_{t-1}{=}j)\, \alpha(j, s{-}1, t{-}1) \;+\; (1 - \epsilon(i, t))\, p(x_t \mid q_t{=}i) \sum_{j=1}^{N} P(q_t{=}i \mid q_{t-1}{=}j)\, \alpha(j, s, t{-}1)$$

• $\epsilon(i, t)$ is the probability that the system emits on sequence $y$ at time $t$ while in state $i$.
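
A minimal NumPy sketch of this AHMM forward pass (log domain, O(N S T) time and memory; initial state probabilities are omitted for brevity, and all local terms are assumed precomputed):

import numpy as np

def ahmm_forward(logA, log_pxy, log_px, eps):
    # logA[j, i]     = log P(q_t=i | q_{t-1}=j)         (N x N)
    # log_pxy[i,t,s] = log p(x_{t+1}, y_{s+1} | q=i)    (N x T x S), 0-based
    # log_px[i, t]   = log p(x_{t+1} | q=i)             (N x T)
    # eps[i, t]      = epsilon(i, t+1)                  (N x T)
    N, T = log_px.shape
    S = log_pxy.shape[2]
    la = np.full((N, S + 1, T), -np.inf)   # la[i, s, t]: s = #y-frames consumed
    la[:, 1, 0] = np.log(eps[:, 0]) + log_pxy[:, 0, 0]     # emit (x_1, y_1)
    la[:, 0, 0] = np.log(1 - eps[:, 0]) + log_px[:, 0]     # emit x_1 alone
    for t in range(1, T):
        for i in range(N):
            # sum_j P(q_t=i | q_{t-1}=j) alpha(j, ., t-1), for every s at once
            prev = np.logaddexp.reduce(logA[:, i, None] + la[:, :, t - 1], axis=0)
            for s in range(S + 1):
                skip = np.log(1 - eps[i, t]) + log_px[i, t] + prev[s]
                emit = (np.log(eps[i, t]) + log_pxy[i, t, s - 1] + prev[s - 1]
                        if s > 0 else -np.inf)
                la[i, s, t] = np.logaddexp(emit, skip)
    return la   # log p(x_1^T, y_1^S) = logsumexp of la[:, S, T-1]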
**********
Multi-Stream vs Asynchronous HMM
• Multi-stream HMM:
◦ Allows a different number of HMM states for different observation streams (e.g., allowing different models for slow and fast streams).
◦ Doesn't explicitly model correlation between streams, but does so implicitly through anchor points.
◦ Full asynchrony between anchor points.
• Asynchronous HMM:
◦ Assumes a single HMM, thus the same number of HMM states for each observation stream.
◦ Models correlation between streams directly at the observation level, if $p(x, y \mid q)$ doesn't make independence assumptions.
◦ If so, it actually estimates $P(X, Y \mid \Theta)$.
**********
Audio-Visual Speech Recognition
Database
• Audio-Visual M2VTS database: 37 subjects, 185 recordings, (French) digit sequences; 5-fold cross-validation
• Audio:
◦ 16 MFCC coefficients and their first derivatives, plus the derivative of the log energy: 33 features, 100 Hz frame rate
• Video:
◦ 12 shape features and 12 intensity features, plus first derivatives: 48 features, 25 Hz frame rate
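
The two streams thus arrive at different frame rates. For a frame-synchronous baseline (early integration), a common trick is to repeat each video frame up to the audio rate; a minimal sketch with toy arrays (the AHMM instead consumes both streams at their native rates):

import numpy as np

audio = np.random.randn(400, 33)     # 4 s of audio features at 100 Hz
video = np.random.randn(100, 48)     # 4 s of visual features at 25 Hz

video_up = np.repeat(video, 100 // 25, axis=0)   # repeat each frame 4x -> (400, 48)
early = np.hstack([audio, video_up])             # (400, 81) joint feature stream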
**********
Audio-Visual Speech Recognition
Results
[Figure: Word Error Rate (%) as a function of noise level (0, 5, 10, 15 dB) for four systems: audio HMM, audio+video HMM, audio+video AHMM, and video HMM.]
**********
3) Multi-layered HMMs
• Posterior probabilities: used as a local measure/classifier or as transformed features.
• How to improve those posterior estimates:
1. Using contextual information, AHMM, etc.
2. Using any available a priori knowledge (e.g., specific HMM topological constraints)
3. Replacing $p(q^i_t \mid x_t)$ by $\gamma(i, t) = p(q^i_t \mid X_t, M)$, where:
◦ $X_t$ represents $x_t$ in context (e.g., the whole utterance $X$ in the case of ASR)
◦ $M$ is the (e.g., HMM) model representing the prior information
• Yielding "optimal" multi-layered (hierarchical) HMM structures.
**********
Posteriors as ASR features
• Use MLP-generated posteriors as acoustic features
• The MLP usually accommodates some acoustic context $x_{t-c}^{t+c}$, typically 9 frames
• The MLP can be used to merge features and generate a reduced set of "optimal" features (resulting from nonlinear discriminant analysis, NLDA)
• Those features can be transformed (usually through a KLT) to reduce the space dimension and/or to make them better suited to HMM/GMM modeling
• Successful example: "Tandem" features, integrated in the full SRI CTS system
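
A minimal NumPy sketch of the Tandem feature computation: log of the MLP posteriors followed by a decorrelating KLT, here computed via an SVD of the centered features (the MLP outputs are toy values; taking the log of the posteriors is used as a stand-in for the pre-final-nonlinearity outputs):

import numpy as np

def tandem_features(posteriors, n_out=25):
    # posteriors: (T, Q) per-frame MLP outputs p(q^i_t | x_t)
    logp = np.log(posteriors + 1e-10)
    logp = logp - logp.mean(axis=0)           # center before the KLT
    _, _, Vt = np.linalg.svd(logp, full_matrices=False)
    return logp @ Vt[:n_out].T                # decorrelated, reduced Tandem features

post = np.random.dirichlet(np.ones(46), size=300)   # toy posteriors over 46 phones
feats = tandem_features(post)                       # -> input to the HMM/GMM system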
**********
Posteriors as features: Tandem
[Figure: Tandem pipeline: acoustic context $X$ → MLP → log of the MLP outputs $p(q^i_t \mid x_t)$ (taken prior to the final nonlinearity) → KL transform → Tandem features → HMM/GMM system → word sequence $W$.]
**********
HMM State Gammas as Features (1)
Likelihood-based systems
• Alpha recursion:

$$\alpha(i, t) = p(x_1^t, q_t^i) = p(x_t \mid q_t^i) \sum_j p(q_t^i \mid q_{t-1}^j)\, \alpha(j, t-1)$$

• Beta recursion:

$$\beta(i, t) = p(x_{t+1}^T \mid q_t^i) = \sum_j p(x_{t+1} \mid q_{t+1}^j)\, p(q_{t+1}^j \mid q_t^i)\, \beta(j, t+1)$$

• State gamma:

$$\gamma(i, t) = p(q_t^i \mid x_1^T) = \frac{\alpha(i, t)\, \beta(i, t)}{\sum_j \alpha(j, T)}$$
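
These three recursions in a minimal NumPy sketch (log domain; logB plays the role of the local likelihoods $p(x_t \mid q^i_t)$):

import numpy as np

def hmm_gammas(logA, logB, log_pi):
    # logA[j, i] = log p(q^i_t | q^j_{t-1}); logB[i, t] = log p(x_t | q^i_t)
    N, T = logB.shape
    la = np.empty((N, T))                 # log alpha
    lb = np.zeros((N, T))                 # log beta; beta(i, T) = 1
    la[:, 0] = log_pi + logB[:, 0]
    for t in range(1, T):                 # alpha recursion
        la[:, t] = logB[:, t] + np.logaddexp.reduce(logA + la[:, t-1][:, None], axis=0)
    for t in range(T - 2, -1, -1):        # beta recursion
        lb[:, t] = np.logaddexp.reduce(logA + (logB[:, t+1] + lb[:, t+1])[None, :], axis=1)
    log_px = np.logaddexp.reduce(la[:, -1])      # sum_j alpha(j, T)
    return np.exp(la + lb - log_px)              # gamma(i, t) = p(q^i_t | x_1^T)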
**********
HMM State Gammas as Features (2)
Posterior-based systems
• Scaled-alpha recursion:

$$\alpha_s(i, t) = \frac{p(x_1^t, q_t^i)}{\prod_{\tau=1}^{t} p(x_\tau)} = \frac{p(q_t^i \mid x_t)}{p(q_t^i)} \sum_j p(q_t^i \mid q_{t-1}^j)\, \alpha_s(j, t-1)$$

• Scaled-beta recursion:

$$\beta_s(i, t) = \frac{p(x_{t+1}^T \mid q_t^i)}{\prod_{\tau=t+1}^{T} p(x_\tau)} = \sum_j \frac{p(q_{t+1}^j \mid x_{t+1})}{p(q_{t+1}^j)}\, p(q_{t+1}^j \mid q_t^i)\, \beta_s(j, t+1)$$

• State gamma:

$$\gamma_s(i, t) = p(q_t^i \mid x_1^T) = \frac{\alpha_s(i, t)\, \beta_s(i, t)}{\sum_j \alpha_s(j, T)}$$
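
The only change with respect to the likelihood-based recursions is that the local term $p(x_t \mid q^i_t)$ is replaced by the posterior/prior ratio, e.g., from an MLP. A minimal sketch of the scaled-alpha recursion, in the same style as hmm_gammas above:

import numpy as np

def scaled_alpha(logA, log_post, log_prior):
    # log_post[i, t] = log p(q^i_t | x_t) (MLP output); log_prior[i] = log p(q^i_t)
    ratio = log_post - log_prior[:, None]          # log [ p(q|x) / p(q) ]
    N, T = log_post.shape
    la = np.empty((N, T))
    la[:, 0] = log_post[:, 0]    # assuming pi_i = p(q^i): alpha_s(i, 1) = p(q^i_1 | x_1)
    for t in range(1, T):
        la[:, t] = ratio[:, t] + np.logaddexp.reduce(logA + la[:, t-1][:, None], axis=0)
    return la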
**********
“Gamma” posteriors as features
[Figure: $X$ → MLP → posterior/prior ratios $p(q^i_t \mid x_t)/p(q^i_t)$ → state gammas $\gamma_s(i, t)$ → log → KL transform → "Gamma" features → HMM/GMM system → word sequence $W$.]
**********
Multi-layered HMMs for ASR: Numbers’95
Task: Numbers’95
Features                          WER
PLP                               6.9%
Tandem phone posteriors (alone)   4.9%
Gamma phone posteriors (alone)    4.6%

Table 1: Word error rate (WER) on the Numbers'95 task: 31 lexicon words, 9x39(PLP)-1200-24 MLP (resulting in 25 features after KLT), 80 CD HMM/GMM phone models, 24-dimensional Tandem features, 1206 test utterances.
**********
DARPA CTS sub-task
Task: Conversational Telephone Speech
Features                          WER
PLP                               44.3%
PLP+Tandem phone posteriors       42.5%
PLP+Gamma phone posteriors        41.7%

Table 2: Word error rate (WER) on the male part of a reduced-vocabulary version of the DARPA Conversational Telephone Speech-to-text (CTS) task: 1,000 lexicon words, with multi-words and multi-pronunciations, 9x39(PLP)-1300-46 MLP (resulting in 25 features after KLT).
**********
Multi-modal meeting (group action) modeling
IDIAP Instrumented Meeting Room
[Figure: the IDIAP instrumented meeting room: meeting table with a microphone array, lapel microphones on the participants, whiteboard, projector screen, cameras, and an equipment rack.]
**********
Multi-layered HMMs for meeting actions
• Event lexicon: V = (monologue1, monologue2, monologue3, monologue4, discussion, presentation, whiteboard-presentation, note-taking)
• A-V features: 39-dimensional observation vectors extracted from 12 audio and 3 visual channels, at 5 Hz.
[Figure: two-layer architecture: each participant's A-V features (from microphones and cameras) feed an individual-level HMM (I-HMM 1 ... I-HMM N); the I-HMM outputs, together with group-level A-V features, feed a group-level HMM (G-HMM).]
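
A minimal sketch of the two-layer decomposition (names and shapes are illustrative, not the actual system): the first layer maps each participant's A-V features to per-frame individual-level state posteriors; these, stacked with the group-level features, form the observations of the group-action HMM:

import numpy as np

def individual_layer(person_feats, i_hmm):
    # 1st layer: per-frame state posteriors from an individual-level HMM
    # (e.g., hmmlearn models expose this as predict_proba)
    return i_hmm.predict_proba(person_feats)       # (T, #individual states)

def group_observations(person_feats_list, group_feats, i_hmms):
    # 2nd-layer input: concatenate all I-HMM outputs with the group A-V features
    outs = [individual_layer(f, m) for f, m in zip(person_feats_list, i_hmms)]
    return np.hstack(outs + [group_feats])         # (T, sum of widths)

# A G-HMM decoded on these observations yields the meeting-action sequence.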
**********
Meeting Actions: A-V features
Feature                          Modality   Participants
Seat speech activity             Audio      Individual
White-board speech activity      Audio      Group
Presentation speech activity     Audio      Group
Speech pitch                     Audio      Individual
Speech energy                    Audio      Individual
Speaking rate                    Audio      Individual
Head blob vertical centroid      Visual     Individual
Hand blob horizontal centroid    Visual     Individual
Hand blob eccentricity           Visual     Individual
Hand blob angle                  Visual     Individual
Combined motion                  Visual     Individual
White-board/presentation blob    Visual     Group
**********
Recognizing Sequences of Meeting Actions (1)
Method      Features       AER (%)
One-Layer   Visual Only    48.2
            Audio Only     36.7
            Audio-Visual   23.7
Two-Layer   Visual Only    42.5
            Audio Only     32.4
            Audio-Visual   16.6
            Async HMM      15.1

AER = Action Error Rate
• A-V multi-stream is better than early integration.
• It is important to model the correlation (interaction) between participants.
• Best results with multi-layered HMMs, using an AHMM at the participant level (1st HMM layer).
**********
Conclusion
• Multi-channel processing is an interesting challenge, with important potential applications in speech recognition, multimodal processing, etc.
• The AHMM is an interesting step towards principled modeling of the multi-stream joint likelihood.
• MSHMMs and AHMMs give good results on noisy speech recognition tasks, as well as on multi-modal group action modeling.
• Still not very efficient in space and time (but this can be controlled).
• Hierarchical HMMs, and the associated posterior combination, provide additional advantages and a new way of solving complex problems hierarchically.
**********