Siddiqi and Moore, www.autonlab.org
Fast Inference and Learning in Large-State-Space HMMs
Sajid M. Siddiqi and Andrew W. Moore
The Auton Lab, Carnegie Mellon University
HMM Overview
Reducing quadratic complexity in the number of states
  • The model
  • Algorithms for fast evaluation and inference
  • Algorithms for fast learning
Results
  • Speed
  • Accuracy
Conclusion
Transition Model

[Diagram: a chain of hidden states q0 → q1 → q2 → q3 → q4; each outgoing transition is shown with probability 1/3]

Each of these probability tables is identical:

i   P(qt+1=s1|qt=si)   P(qt+1=s2|qt=si)   …   P(qt+1=sj|qt=si)   …   P(qt+1=sN|qt=si)
1   a11                a12                …   a1j                …   a1N
2   a21                a22                …   a2j                …   a2N
3   a31                a32                …   a3j                …   a3N
:   :                  :                  :   :                  :   :
i   ai1                ai2                …   aij                …   aiN
N   aN1                aN2                …   aNj                …   aNN

Notation: aij = P(qt+1 = sj | qt = si)
Siddiqi and Moore, www.autonlab.org
Observation Modelq0
q1
q2
q3
q4
O0
O1
O2
O3
O4
i P(Ot=1|qt=si) P(Ot=2|qt=si) … P(Ot=k|qt=si) … P(Ot=M|qt=si)
1 b1(1) b1 (2) … b1 (k) … b1(M)
2 b2 (1) b2 (2) … b2(k) … b2 (M)
3 b3 (1) b3 (2) … b3(k) … b3 (M)
: : : : : : :
i bi(1) bi (2) … bi(k) … bi (M)
: : : : : : :
N bN (1) bN (2) … bN(k) … bN (M)
Siddiqi and Moore, www.autonlab.org
Observation Modelq0
q1
q2
q3
q4
O0
O1
O2
O3
O4
i P(Ot=1|qt=si) P(Ot=2|qt=si) … P(Ot=k|qt=si) … P(Ot=M|qt=si)
1 b1(1) b1 (2) … b1 (k) … b1(M)
2 b2 (1) b2 (2) … b2(k) … b2 (M)
3 b3 (1) b3 (2) … b3(k) … b3 (M)
: : : : : : :
i bi(1) bi (2) … bi(k) … bi (M)
: : : : : : :
N bN (1) bN (2) … bN(k) … bN (M)
Notation:
)|()( itti sqkOPkb
Some Famous HMM Tasks

Question 1: State Estimation
What is P(qT=Si | O1O2…OT)?

Question 2: Most Probable Path
Given O1O2…OT, what is the most probable path that I took?
(e.g. Woke up at 8.35, got on bus at 9.46, sat in lecture 10.05–11.22, …)

Question 3: Learning HMMs
Given O1O2…OT, what is the maximum-likelihood HMM that could have produced this string of observations?

[Diagram: a three-state HMM over activities Eat, Bus, Walk, with transition probabilities aAA, aAB, aBA, aBB, aBC, aCB, aCC and emission probabilities bA(Ot-1), bB(Ot), bC(Ot+1) for observations Ot-1, Ot, Ot+1]
Basic Operations in HMMs

For an observation sequence O = O1…OT, the three basic HMM operations are:

Problem                                           Algorithm          Complexity
Evaluation:  Calculating P(O|λ)                   Forward-Backward   O(TN²)
Inference:   Computing Q* = argmaxQ P(O,Q|λ)      Viterbi Decoding   O(TN²)
Learning:    Computing λ* = argmaxλ P(O|λ)        Baum-Welch (EM)    O(TN²)

(T = # timesteps, i.e. datapoints; N = # states)

This talk: a simple approach to reducing the complexity in N
HMM Overview
Reducing quadratic complexity in the number of states
  • The model
  • Algorithms for fast evaluation and inference
  • Algorithms for fast learning
Results
  • Speed
  • Accuracy
Conclusion
Reducing Quadratic Complexity in N

Why does it matter?
• Quadratic HMM algorithms hinder HMM computations when N is large
• There are several promising applications for efficient large-state-space HMM algorithms, in:
  • topic modeling
  • speech recognition
  • real-time HMM systems, such as for activity monitoring
  • … and more
Idea One: Sparse Transition Matrix

• Only K << N non-zero next-state probabilities per row, e.g. with N = 5 and K = 2:

    0     0     0.30  0     0.70
    0.50  0     0     0.50  0
    0     0.25  0     0     0.75
    0     0.70  0     0.30  0
    0.60  0     0.40  0     0

• Only O(TNK)!
• But such a model can get very badly confused by "impossible transitions"
• And it cannot learn the sparse structure (once chosen, the structure cannot change)
Dense-Mostly-Constant (DMC) Transitions

• K non-constant probabilities per row; the remaining N−K entries of each row share a single constant value
• DMC HMMs comprise a richer and more expressive class of models than sparse HMMs

A DMC transition matrix with K = 2:

    0.15  0.15  0.30  0.15  0.25
    0.46  0.01  0.01  0.51  0.01
    0.05  0.25  0.05  0.05  0.60
    0.04  0.70  0.04  0.18  0.04
    0.40  0.10  0.30  0.10  0.10

The transition model for state i now consists of:
• K = the number of non-constant values per row
• NCi = { j : si→sj is a non-constant transition probability }
• ci = the transition probability from si to every state not in NCi
• aij = the non-constant transition probability for si→sj, j ∈ NCi

E.g. for row 3 of the matrix above: K = 2, NC3 = {2, 5}, c3 = 0.05, a32 = 0.25, a35 = 0.6
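To make the representation concrete, here is a minimal sketch (my own illustration, not the authors' code) of compressing a dense probability row into DMC form and expanding it back. Choosing the K largest entries as NCi is an assumption made for illustration:

```python
def make_dmc_row(dense_row, K):
    """Compress a dense probability row into (constant c, {j: a_ij}) with K non-constant entries."""
    N = len(dense_row)
    # Pick the K largest entries as the non-constant set NC_i (illustrative choice)
    nc = sorted(range(N), key=lambda j: dense_row[j], reverse=True)[:K]
    rest = [dense_row[j] for j in range(N) if j not in nc]
    c = sum(rest) / len(rest)          # shared constant for the other N-K states
    return c, {j: dense_row[j] for j in nc}

def expand_dmc_row(c, a, N):
    """Recover the dense row from the DMC representation."""
    return [a.get(j, c) for j in range(N)]

# Row 3 of the example matrix (0-indexed states): c3 == 0.05, non-constant states 2 and 5
row3 = [0.05, 0.25, 0.05, 0.05, 0.60]
c3, a3 = make_dmc_row(row3, K=2)
```

Storing a row this way takes K+1 numbers instead of N, which is what makes the O(TNK) recursions in the following slides possible.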
HMM Overview
Reducing quadratic complexity in the number of states
  • The model
  • Algorithms for fast evaluation and inference
  • Algorithms for fast learning
Results
  • Speed
  • Accuracy
Conclusion
Evaluation in Regular HMMs

P(qt = si | O1, O2 … Ot) = αt(i) / Σj=1..N αt(j)

where αt(i) = P(O1 O2 … Ot ∧ qt = si)  — called the "forward variables"

Then, P(O|λ) = Σj αT(j)

The forward variables form a T × N table (rows t = 1…T, columns αt(1), αt(2), … αt(N)), filled in row by row with the recursion

    αt+1(j) = bj(Ot+1) Σi αt(i) aij

• Cost: O(TN²)
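The recursion αt+1(j) = bj(Ot+1) Σi αt(i) aij translates directly into code. A minimal pure-Python sketch (variable names are my own; pi is the initial state distribution, which the slides leave implicit):

```python
def forward(pi, A, B, obs):
    """alpha[t][i] = P(O_1..O_t and q_t = s_i).  Cost O(TN^2)."""
    N = len(pi)
    # Base case: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        # Recursion: alpha_{t+1}(j) = b_j(O_{t+1}) * sum_i alpha_t(i) * a_ij
        alpha.append([B[j][obs[t]] * sum(prev[i] * A[i][j] for i in range(N))
                      for j in range(N)])
    return alpha

pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]          # A[i][j] = a_ij
B = [[0.7, 0.3], [0.4, 0.6]]          # B[i][k] = b_i(k)
alpha = forward(pi, A, B, obs=[0, 1, 0])
likelihood = sum(alpha[-1])           # P(O | lambda) = sum_j alpha_T(j)
```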
Similarly, the "backward variables"

    βt(i) = P(Ot+1 Ot+2 … OT | qt = Si)

are computed with the recursion

    βt(i) = Σj aij bj(Ot+1) βt+1(j)

which also costs O(TN²).
Fast Evaluation in DMC HMMs

Since aij = ci for every j ∉ NCi, the forward recursion splits into a shared part and a small correction:

    αt+1(j) = bj(Ot+1) [ Σi ci αt(i)  +  Σ{i : j ∈ NCi} (aij − ci) αt(i) ]

The first sum is O(N), but it is only computed once per row of the table; the corrections amount to O(K) work for each αt(j) entry. This yields O(TNK) complexity for the evaluation problem.
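One step of this DMC forward recursion might be coded as below (an illustrative sketch of the trick, not the authors' implementation; c[i] is the constant for row i and a[i] a dict of the non-constant entries over NCi):

```python
def dmc_forward_step(alpha_t, c, a, B, o_next):
    """One DMC forward step: O(N) shared work plus O(NK) corrections, vs O(N^2)."""
    N = len(alpha_t)
    # Constant part sum_i c_i * alpha_t(i): computed once per timestep
    const = sum(c[i] * alpha_t[i] for i in range(N))
    nxt = [const] * N
    # Corrections: O(NK) total, since row i contributes |NC_i| = K terms
    for i in range(N):
        for j, aij in a[i].items():
            nxt[j] += (aij - c[i]) * alpha_t[i]
    return [B[j][o_next] * nxt[j] for j in range(N)]

# Row i of the transition matrix equals c[i] everywhere except the entries in a[i]
c = [0.2, 0.1, 0.3]
a = [{1: 0.6}, {0: 0.8}, {2: 0.4}]
B = [[0.5, 0.5], [0.3, 0.7], [0.9, 0.1]]
next_alpha = dmc_forward_step([0.2, 0.3, 0.5], c, a, B, o_next=1)
```

Expanding the DMC rows to a dense matrix and running the ordinary O(N²) step gives the same numbers, which is the point: the DMC step is exact, not an approximation.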
Fast Inference in DMC HMMs

The O(N²) Viterbi recursion in the regular model,

    δt+1(j) = bj(Ot+1) maxi [ δt(i) aij ],

becomes an O(NK) recursion in the DMC model: as with evaluation, there is an O(N) term computed only once per row of the table, plus O(K) work for each δt(j) entry.
HMM Overview
Reducing quadratic complexity in the number of states
  • The model
  • Algorithms for fast evaluation and inference
  • Algorithms for fast learning
Results
  • Speed
  • Accuracy
Conclusion
Learning a DMC HMM

• Idea One:
  • Ask the user to tell us the DMC structure
  • Learn the parameters using EM
• Simple!
• But in general, we don't know the DMC structure
Learning a DMC HMM

• Idea Two: use EM to learn the DMC structure as well
  1. Guess a DMC structure
  2. Find expected transition counts and observation parameters, given the current model and observations
  3. Find the maximum-likelihood DMC model given the counts
  4. Go to 2

The DMC structure can (and does) change between iterations. In fact, we can just start with an all-constant transition model.
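Step 3 (the maximum-likelihood DMC model given expected counts) can be sketched as: for each row, keep the K transitions with the largest expected counts as non-constant, and spread the remaining probability mass uniformly over the other states. This is a hypothetical illustration of the idea, not the authors' exact procedure:

```python
def ml_dmc_row(counts, K):
    """Given expected transition counts for one state, return a DMC row (c, {j: a_ij})."""
    N = len(counts)
    total = sum(counts)
    # Keep the K most-used transitions as the non-constant set
    nc = sorted(range(N), key=lambda j: counts[j], reverse=True)[:K]
    a = {j: counts[j] / total for j in nc}
    # Remaining probability mass shared equally by the other N-K states
    c = (1.0 - sum(a.values())) / (N - K)
    return c, a

c, a = ml_dmc_row([8.0, 1.0, 1.0, 30.0, 10.0], K=2)
# transitions to states 3 and 4 (0-indexed) become non-constant; the rest share c
```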
Learning a DMC HMM

Step 2: Find expected transition counts and observation parameters, given the current model and observations.
We want a new estimate aij_new of P(qt+1=sj | qt=si):

    aij_new = (Expected # transitions i→j | O1, O2, …, OT, λold) / (Σk=1..N Expected # transitions i→k | O1, O2, …, OT, λold)

            = (Σt=1..T-1 P(qt=si, qt+1=sj | O1, O2, …, OT, λold)) / (Σk=1..N Σt=1..T-1 P(qt=si, qt+1=sk | O1, O2, …, OT, λold))

            = Sij / Σk=1..N Sik

where

    Sij = Σt=1..T-1 P(qt=si, qt+1=sj, O1, …, OT | λold)

(the P(O1…OT | λold) factor cancels between numerator and denominator). Applying Bayes' rule to both terms gives us

    Sij = aij Σt=1..T-1 αt(i) βt+1(j) bj(Ot+1)
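Written out directly, computing all the Sij = aij Σt αt(i) bj(Ot+1) βt+1(j) takes O(TN²) time; this is the bottleneck the next slides attack. A naive sketch (my own code, with the observation term assumed already folded into betahat):

```python
def transition_counts(alpha, betahat, A):
    """S[i][j] = A[i][j] * sum_t alpha[t][i] * betahat[t+1][j]  --  O(TN^2)."""
    T, N = len(alpha), len(alpha[0])
    S = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            # Dot product of column i of alpha with (shifted) column j of betahat
            dot = sum(alpha[t][i] * betahat[t + 1][j] for t in range(T - 1))
            S[i][j] = A[i][j] * dot
    return S

# With all-ones tables, S[i][j] reduces to A[i][j] * (T-1): an easy sanity check
alpha = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
betahat = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
A = [[0.5, 0.5], [0.25, 0.75]]
S = transition_counts(alpha, betahat, A)
```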
We want aij_new = Sij / Σk=1..N Sik, where Sij = aij Σt=1..T-1 αt(i) βt+1(j) bj(Ot+1).

We can get the full T × N table of α values, and likewise the T × N table of β values, in O(TN) time each.

Folding the observation term into the backward variables, β̂t(j) = bj(Ot) βt(j), the remaining sum Σt αt(i) β̂t+1(j) is a dot product of the i'th column of the α table with the (shifted) j'th column of the β̂ table.
Each Sij is then a dot product of columns: for example, S24 is the dot product of the 2nd column of the α table with the 4th column of the β̂ table. Computing the whole S table this way (an N × N table of T-term dot products, i.e. the matrix product αᵀβ̂) costs O(TN²).

Speedups:
• Strassen?
• Approximate by DMC?
• Approximate randomized AᵀB multiplication?
• Sparse structure fine?
• Fixed DMC fine?
• What we want: a speedup without approximation
We want aij_new = Sij / Σk=1..N Sik, where Sij = Σt αt(i) β̂t(j).

• Insight One: we only need the top K entries in each row of S
• Insight Two: the values in the columns of α and β̂ are often very skewed
For i = 1..N, store the indexes of the R largest values in the i'th column of α: call these "α-biggies(i)".
For j = 1..N, store the indexes of the R largest values in the j'th column of β̂: "β̂-biggies(j)".
Here R << T, and it takes O(TN) time to do all the indexes.

(There's an important detail I'm omitting here, to do with prescaling the rows of α and β̂.)

Now split the dot product:

    Sij = Σt=1..T αt(i) β̂t(j)
        = Σ{t ∈ α-biggies(i) ∪ β̂-biggies(j)} αt(i) β̂t(j)  +  Σ{t ∉ α-biggies(i) ∪ β̂-biggies(j)} αt(i) β̂t(j)

The first term is an O(R) computation. Each term of the second sum is at most α(R)(i) β̂t(j), where α(R)(i) is the R'th largest value in the i'th column of α — O(1) time to obtain, as is Σt β̂t(j) (precached for all j in time O(TN)). So the second sum can be bounded without ever computing it.
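The bounding trick can be sketched as follows (my own illustration; colA and colB stand for the i'th column of α and the j'th column of β̂, and the rest-sum here is computed directly rather than precached as in the real algorithm):

```python
import heapq

def biggies(col, R):
    """Indexes of the R largest values in a column (the 'biggies')."""
    return set(heapq.nlargest(R, range(len(col)), key=lambda t: col[t]))

def dot_bounds(colA, colB, R):
    """Lower/upper bounds on sum_t colA[t]*colB[t], inspecting only O(R) products.
    Assumes nonnegative entries (true for forward/backward variables)."""
    bA, bB = biggies(colA, R), biggies(colB, R)
    union = bA | bB
    exact = sum(colA[t] * colB[t] for t in union)     # the O(R) part, and a lower bound
    rthA = min(colA[t] for t in bA)                   # R'th largest value in colA
    # In the real algorithm sum_t colB[t] is precached for every column in O(TN)
    rest = sum(colB[t] for t in range(len(colB)) if t not in union)
    # Every omitted term is at most rthA * colB[t], giving the upper bound
    return exact, exact + rthA * rest

colA = [0.5, 0.1, 0.05, 0.3, 0.02]
colB = [0.01, 0.6, 0.2, 0.02, 0.4]
lo, hi = dot_bounds(colA, colB, R=2)
true = sum(x * y for x, y in zip(colA, colB))   # lo <= true <= hi
```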
Computing the i'th row of S:

• In O(NR) time, we can put upper and lower bounds on Sij for j = 1, 2, …, N
• We only need exact values of Sij for the K largest entries within the row
• So: ignore the j's whose bounds show they can't be among the best, and be exact for the rest, at O(T) time each
• If there's enough pruning, the total time is O(TN + RN²)
In Short …
• Sub-quadratic evaluation
• Sub-quadratic inference
• 'Nearly' sub-quadratic learning
• Fully connected transition models allowed
• Some extra work to extract 'important' transitions from data
HMM Overview
Reducing quadratic complexity in the number of states
  • The model
  • Algorithms for fast evaluation and inference
  • Algorithms for fast learning
Results
  • Speed
  • Accuracy
Conclusion
Evaluation and Inference Speedup

[Plot. Dataset: synthetic data with T = 2000 timesteps]
Parameter Learning Speedup

[Plot. Dataset: synthetic data with T = 2000 timesteps]
HMM Overview
Reducing quadratic complexity in the number of states
  • The model
  • Algorithms for fast evaluation and inference
  • Algorithms for fast learning
Results
  • Speed
  • Accuracy
Conclusion
Datasets
• DMC-friendly dataset: from a 2-D Gaussian 20-state DMC HMM with K = 5 (20,000 train, 5,000 test)
• Anti-DMC dataset: from a 2-D Gaussian 20-state regular HMM with steadily varying, well-distributed transition probabilities (20,000 train, 5,000 test)
• Motionlogger dataset: accelerometer data from two sensors worn over several days (10,000 train, 4,720 test)
HMMs Used
• Regular and DMC HMMs: 20 states
• Baseline 1: 5-state regular HMM (do we really need a large HMM?)
• Baseline 2: 20-state HMM with uniform transition probabilities (does the transition model matter?)
Learning Curves for DMC-friendly data
[Plot] The DMC model achieves the full model's score!

Learning Curves for Anti-DMC data
[Plot] The DMC model is worse than the full model.

Learning Curves for Motionlogger data
[Plot] The DMC model achieves the full model's score! The baselines do much worse.
Regularization with DMC HMMs
• # of transition parameters in a regular 100-state HMM: 10,000
• # of transition parameters in a DMC 100-state HMM with K = 5: 500
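The counts follow from simple arithmetic — N² stored transition entries for a regular HMM versus N·K for a DMC one (a trivial helper, written for illustration):

```python
def transition_params(N, K=None):
    """Stored transition parameters: N*N for a regular HMM, N*K for a DMC HMM."""
    return N * N if K is None else N * K
```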
Tradeoffs between N and K
• We vary N and K while keeping the number of transition parameters (N×K) constant
• Increasing N and decreasing K allows more states for modeling data features, but fewer parameters per state for temporal structure
Tradeoffs between N and K
• Average test-set log-likelihoods at convergence
• Datasets:
  • A: DMC-friendly
  • B: Anti-DMC
  • C: Motionlogger
• Each dataset has a different optimal N-vs-K tradeoff
HMM Overview
Reducing quadratic complexity in the number of states
  • The model
  • Algorithms for fast evaluation and inference
  • Algorithms for fast learning
Results
  • Speed
  • Accuracy
Conclusion
Conclusions
• DMC HMMs are an important class of models that allow parameterized complexity-vs-efficiency tradeoffs in large state spaces
• The speedup can be several orders of magnitude
• Even for non-DMC domains, DMC HMMs yield higher scores than baseline models
• The DMC HMM model can be applied to arbitrary state spaces and observation densities
Related Work
• Felzenszwalb et al. (2003): fast HMM algorithms when transition probabilities can be expressed as distances in an underlying parameter space
• Murphy and Paskin (2002): fast inference in hierarchical HMMs cast as DBNs
• Salakhutdinov et al. (2003): combines EM and conjugate gradient for faster HMM learning when the amount of missing information is high
• Ghahramani and Jordan (1996): Factorial HMMs for distributed representation of large state spaces
• Beam Search: a widely used heuristic in Viterbi inference for speech systems
Future Work
• Eliminate the R parameter using an automatic backoff evaluation approach
• Investigate DMC HMMs as a regularization mechanism
• Compare robustness against overfitting with factorial HMMs for large-state-space problems