
A Sober Look at Spectral Learning

Han Zhao and Pascal Poupart

June 17, 2014


Spectral Learning

What is spectral learning?

- New methods in machine learning for tackling mixture models and graphical models with latent variables.

- Dates back to Karl Pearson's method-of-moments approach to mixtures of Gaussians.

- An alternative to the principle of maximum likelihood estimation and to Bayesian inference.

- Has been widely applied to various models, including Hidden Markov Models [1, 2], mixtures of Gaussians [3], topic models [4, 5, 6], latent junction trees [7, 8], etc.

Today I will focus on the spectral algorithm for Hidden Markov Models.


HMM

Hidden Markov Model:

- A discrete-time stochastic process.

- Satisfies the Markov property.

- The state of the system at each time step is hidden; only the observation of the system is visible.

HMM

An HMM can be defined as a triple ⟨T, O, π⟩:

- Transition matrix T ∈ R^{m×m}, with T_{ij} = Pr(s_{t+1} = i | s_t = j).

- Observation matrix O ∈ R^{n×m}, with O_{ij} = Pr(o_t = i | s_t = j).

- Initial distribution π ∈ R^m, with π_i = Pr(s_1 = i).
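As a concrete illustration (not from the original slides), here is a minimal numpy sketch of such a triple for a toy HMM with m = 2 hidden states and n = 3 observation symbols; the column-stochastic convention matches the definitions above.

```python
import numpy as np

m, n = 2, 3  # hypothetical sizes: hidden states, observation symbols

# Columns are conditional distributions: T[i, j] = Pr(s_{t+1} = i | s_t = j)
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])

# O[i, j] = Pr(o_t = i | s_t = j)
O = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])

# pi[i] = Pr(s_1 = i)
pi = np.array([0.6, 0.4])

# Sanity check: every column of T and O sums to one.
assert np.allclose(T.sum(axis=0), 1.0) and np.allclose(O.sum(axis=0), 1.0)
```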

HMM

Given an HMM H = ⟨T, O, π⟩, we are interested in two inference problems:

1. Marginal inference (estimation problem). Computing the marginal probability

   Pr(o_{1:t}) = Σ_{s_{1:t}} Pr(o_{1:t}, s_{1:t}) = Σ_{s_{1:t}} Pr(s_{1:t}) Pr(o_{1:t} | s_{1:t})

   Dynamic programming!

2. MAP inference (decoding problem). Computing the sequence s*_{1:t} maximizing the posterior probability

   s*_{1:t} = argmax_{s_{1:t}} Pr(s_{1:t} | o_{1:t})

   Viterbi algorithm!

What about the learning problem?
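To make the dynamic-programming solution to the marginal inference problem concrete, here is a minimal sketch (my illustration, not the authors' code) of the forward algorithm, assuming `T`, `O`, `pi` are numpy arrays with the column-stochastic convention of the previous example.

```python
import numpy as np

def marginal_probability(obs, T, O, pi):
    """Forward algorithm: return Pr(o_{1:t}) for a sequence of observation symbols."""
    alpha = O[obs[0], :] * pi           # alpha_1[j] = Pr(o_1, s_1 = j)
    for x in obs[1:]:
        alpha = O[x, :] * (T @ alpha)   # alpha_t[i] = Pr(o_{1:t}, s_t = i)
    return float(alpha.sum())           # marginalize out the last hidden state

# e.g. marginal_probability([0, 2, 1], T, O, pi)
```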


HMM Reparametrization

Let H = ⟨T, O, π⟩ be an HMM and define the following observable operators:

   A_x ≜ T diag(O_{x,1}, ..., O_{x,m}),  ∀x ∈ [n]

Then H = ⟨π, {A_x}_{x ∈ [n]}⟩ is an equivalent parameterization of the HMM:

   A_x[i, j] = Pr(s_{t+1} = i | s_t = j) · Pr(o_t = x | s_t = j) = Pr(s_{t+1} = i, o_t = x | s_t = j).
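A minimal sketch (illustrative, under the same toy setup as before) of constructing these operators:

```python
import numpy as np

def observable_operators(T, O):
    """Return [A_0, ..., A_{n-1}] with A_x = T @ diag(O[x, :]), i.e. A_x[i, j] = T[i, j] * O[x, j]."""
    n = O.shape[0]
    return [T @ np.diag(O[x, :]) for x in range(n)]
```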


HMM Reparametrization

We can express the marginal probability in terms of the observable operators:

   Pr(o_{1:t}) = Σ_{s_{1:t+1}} Pr(o_{1:t}, s_{1:t+1})
               = Σ_{s_{1:t+1}} [Pr(s_{t+1} | s_t) Pr(o_t | s_t)] ··· [Pr(s_2 | s_1) Pr(o_1 | s_1)] Pr(s_1)
               = Σ_{s_{1:t+1}} A_{o_t}[s_{t+1}, s_t] ··· A_{o_1}[s_2, s_1] π_{s_1}
               = 1^T A_{o_t} ··· A_{o_1} π

Goal of learning: estimate the observable operators from sequences of observations.
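As a quick check (again illustrative; `observable_operators` and `marginal_probability` are the hypothetical helpers sketched earlier), the operator form of the marginal is a single chain of matrix-vector products and should agree with the forward algorithm:

```python
def operator_marginal(obs, A, pi):
    """Compute Pr(o_{1:t}) = 1^T A_{o_t} ... A_{o_1} pi."""
    v = pi.copy()
    for x in obs:          # apply A_{o_1} first, then A_{o_2}, and so on
        v = A[x] @ v
    return float(v.sum())  # left-multiplication by the all-ones row vector

# operator_marginal([0, 2, 1], observable_operators(T, O), pi)
# should match marginal_probability([0, 2, 1], T, O, pi).
```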


Spectral Learning for HMM [1]

Assumption 1: π > 0 element-wise, and T and O are full rank (rank(T) = rank(O) = m). Define the first three moments of the observations:

   P_1[i] = Pr(x_1 = i)
   P_{2,1}[i, j] = Pr(x_2 = i, x_1 = j)
   P_{3,x,1}[i, j] = Pr(x_3 = i, x_2 = x, x_1 = j),  ∀x ∈ [n]

Let U ∈ R^{n×m} be the matrix of top-m left singular vectors of P_{2,1}, and define the following observable operators:

   b_1 = U^T P_1
   b_∞ = (P_{2,1}^T U)^+ P_1
   B_x = (U^T P_{3,x,1})(U^T P_{2,1})^+,  ∀x ∈ [n]

where M^+ denotes the Moore–Penrose pseudoinverse of a matrix M.
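A minimal sketch (an assumption-laden illustration, not the authors' implementation) of estimating these quantities from i.i.d. triples of the first three observations, encoded as integers in [0, n):

```python
import numpy as np

def spectral_operators(triples, n, m):
    """Estimate b_1, b_inf and {B_x} from an iterable of (x1, x2, x3) observation triples."""
    P1 = np.zeros(n)
    P21 = np.zeros((n, n))            # P21[i, j]     ~ Pr(x2 = i, x1 = j)
    P3x1 = np.zeros((n, n, n))        # P3x1[x, i, j] ~ Pr(x3 = i, x2 = x, x1 = j)
    triples = list(triples)
    for x1, x2, x3 in triples:
        P1[x1] += 1
        P21[x2, x1] += 1
        P3x1[x2, x3, x1] += 1
    P1, P21, P3x1 = P1 / len(triples), P21 / len(triples), P3x1 / len(triples)

    U = np.linalg.svd(P21)[0][:, :m]  # top-m left singular vectors of P21

    b1 = U.T @ P1
    binf = np.linalg.pinv(P21.T @ U) @ P1
    B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(n)]
    return b1, binf, B
```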


Spectral Learning for HMM [1]

Theorem (Observable HMM Representation [1]).
Assume the HMM obeys Assumption 1. Then

1. b_1 = (U^T O) π

2. b_∞^T = 1^T (U^T O)^{-1}

3. B_x = (U^T O) A_x (U^T O)^{-1},  ∀x ∈ [n]

4. Pr(o_{1:t}) = b_∞^T B_{x_t} ··· B_{x_1} b_1

b_1, b_∞ and B_x depend only on the first three moments of the observations, free of hidden states!
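Item 4 gives the prediction rule. A short sketch (reusing the hypothetical `spectral_operators` output above):

```python
def spectral_marginal(obs, b1, binf, B):
    """Estimate Pr(o_{1:t}) = b_inf^T B_{x_t} ... B_{x_1} b_1."""
    v = b1
    for x in obs:
        v = B[x] @ v
    return float(binf @ v)

# b1, binf, B = spectral_operators(triples, n, m)
# spectral_marginal([0, 2, 1], b1, binf, B)
```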


Spectral Learning for HMM [1]

Main result of the spectral learning algorithm for HMM:

Theorem (Sample Complexity).
There exists a constant C > 0 such that the following holds. Pick any 0 < ε, η < 1 and t ≥ 1, assume the HMM obeys Assumption 1, and let

   N ≥ C · (t²/ε²) · ( m log(1/ε) / (σ_m(O)² σ_m(P_{2,1})⁴) + m n_0(ε) log(1/ε) / (σ_m(O)² σ_m(P_{2,1})²) )

Then with probability at least 1 − η, the model returned by the spectral learning algorithm for HMM satisfies

   Σ_{x_1,...,x_t} |Pr(x_{1:t}) − P̂r(x_{1:t})| ≤ ε

where n_0(ε) = O(ε^{1/(1−s)}) for a constant s > 1.

Compared with EM

Expectation-Maximization [9]:

- A local-search heuristic based on the principle of maximum likelihood estimation.

- Suffers from local optima.

- No consistency guarantees.

For any given t ≥ 1 and 0 < ε, η < 1, the spectral learning algorithm offers:

- A finite sample complexity that suffices for consistency in terms of the L1 error on the marginal probability.

- No local optima, since it only solves an SVD without any local search.


EM vs. Spectral algorithm

Two synthetic experiments:

                            SmallSyn   LargeSyn
  # states                  4          50
  # observations            8          100
  test set size             4096       10,000
  length of test sequence   4          50

Measure: the normalized L1 prediction error on the test data set,

   L1 = ( Σ_{x_{1:t} ∈ T} |Pr(x_{1:t}) − P̂r(x_{1:t})| )^{1/t}

where T is the test set.
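A short sketch of the error measure (illustrative; the exponent 1/t reflects my reconstruction of the normalization above), given true and estimated probabilities for every sequence in the test set:

```python
import numpy as np

def normalized_l1_error(p_true, p_hat, t):
    """Normalized L1 prediction error over a test set of length-t sequences."""
    diff = np.abs(np.asarray(p_true) - np.asarray(p_hat))
    return float(diff.sum() ** (1.0 / t))
```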

EM vs. Spectral algorithm

[Figure: normalized L1 error vs. rank hyperparameter on SmallSyn and LargeSyn for training sizes from 10,000 to 1,000,000, and normalized L1 error vs. training size comparing LearnHMM with EM (SmallSyn, m = 4; LargeSyn, m = 50).]

EM vs. Spectral algorithm

The negative probability problem with the spectral learning algorithm depends on:

- the size of the training data;

- the estimation of the rank hyperparameter;

- the length of the test sequence.

Proportion of negative probabilities:

   NEG_PROP = |{ P̂r(x_{1:t}) < 0 : x_{1:t} ∈ T }| / |T|

[Figure: proportion of negative probabilities vs. rank hyperparameter on SmallSyn and LargeSyn for training sizes from 10,000 to 1,000,000.]
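A one-line sketch of the plotted quantity (illustrative), given the estimated probabilities of the test sequences:

```python
import numpy as np

def neg_prop(p_hat):
    """Fraction of test sequences assigned a negative 'probability' by the learned model."""
    return float(np.mean(np.asarray(p_hat) < 0))
```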


Compared with EM

Why does EM succeed in practice?
Conjecture: the log-likelihood of the model parameters tends to become concave/quasi-concave as the sample size goes to infinity. If this holds, then:

1. Local search algorithms, e.g., the EM algorithm in our case, converge to the global optimum and hence obtain the maximum likelihood estimator [10].

2. Consistency: the sequence of MLEs converges in probability to the true model parameter (assuming the model is identifiable by its parameters) [11].

3. Asymptotic normality: the distribution of the MLE tends to a Gaussian with mean the true parameter and covariance equal to the inverse of the Fisher information matrix, i.e., it becomes more and more concentrated [11].

4. The MLE is the most statistically efficient consistent estimator of the model parameter [11].


Synthetic Experiment

Is our conjecture true for HMMs? Consider an HMM with a single parameter θ, for ease of visualization:

   T = [  θ    1−θ ]     O = [ 0.7  0.3 ]     π = (0.5, 0.5)
       [ 1−θ    θ  ]         [ 0.3  0.7 ]

A Beta distribution with the uniform distribution as the prior on θ; exact Bayesian updating with more and more observations.
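The plots that follow show how the (normalized) likelihood of θ evolves as observations accumulate. A minimal sketch (my illustration, reusing the hypothetical `marginal_probability` helper from earlier) of how such a curve can be computed:

```python
import numpy as np

def likelihood_curve(obs, thetas=np.linspace(0.0, 1.0, 101)):
    """Likelihood of theta for the single-parameter HMM above, normalized to integrate to 1."""
    O = np.array([[0.7, 0.3], [0.3, 0.7]])
    pi = np.array([0.5, 0.5])
    lik = np.array([
        marginal_probability(obs, np.array([[th, 1 - th], [1 - th, th]]), O, pi)
        for th in thetas
    ])
    return thetas, lik / np.trapz(lik, thetas)  # normalize so the curve integrates to one
```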

Synthetic Experiment

[Figure: (normalized) likelihood of θ after 10, 20, ..., 100 observations; as the number of observations grows, the likelihood curve becomes increasingly peaked.]

Synthetic Experiment

Another small synthetic experiment: an HMM with 2 states, 2 observations and 4 free parameters.

[Figure: log-likelihood comparison vs. training size, plotting the EM log-likelihood against the true log-likelihood.]


Conclusions

Spectral learning for HMM

Pros:

1. Additive L1 error bound with finite sample complexity.

2. No local optima.

Cons:

1. Negative probabilities.

2. Not the most statistically efficient.

3. Slow to converge.


Conclusions

EM for HMM

Pros:

1. Fast to converge.

2. Statistically efficient.

3. An optimization-based approach.

Cons:

1. A local-search heuristic with no provable guarantee of reaching the global optimum.

2. Can get stuck in local optima of the non-convex optimization.


References

[1] D. Hsu, S. M. Kakade, and T. Zhang, "A spectral algorithm for learning Hidden Markov Models," Journal of Computer and System Sciences, vol. 78, pp. 1460–1480, Sept. 2012.

[2] A. Anandkumar, D. Hsu, and S. M. Kakade, "A Method of Moments for Mixture Models and Hidden Markov Models," arXiv preprint arXiv:1203.0683, 2012.

[3] D. Hsu and S. M. Kakade, "Learning mixtures of spherical Gaussians: moment methods and spectral decompositions," in Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pp. 11–20, ACM, 2013.

[4] A. Anandkumar, D. P. Foster, and D. Hsu, "A Spectral Algorithm for Latent Dirichlet Allocation," in NIPS, pp. 926–934, 2012.

[5] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky, "Tensor Decompositions for Learning Latent Variable Models," arXiv preprint arXiv:1210.7559, pp. 1–55, 2012.

[6] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu, "A practical algorithm for topic modeling with provable guarantees," arXiv preprint arXiv:1212.4777, 2012.

[7] A. Parikh, G. Teodoru, G. Tech, M. Ishteva, and E. P. Xing, "A Spectral Algorithm for Latent Junction Trees," arXiv preprint arXiv:1210.4884, 2012.

[8] A. Parikh and E. P. Xing, "A Spectral Algorithm for Latent Tree Graphical Models," in Proceedings of the 28th International Conference on Machine Learning, pp. 1065–1072, 2011.

[9] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[10] S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[11] G. Casella and R. L. Berger, Statistical Inference, vol. 70. Duxbury Press, Belmont, CA, 1990.