
A Sober Look at Spectral Learning

Han Zhao and Pascal Poupart

June 17, 2014


Spectral Learning

What is spectral learning?

- New methods in machine learning for tackling mixture models and graphical models with latent variables.

- Dates back to Karl Pearson's method-of-moments approach to mixtures of Gaussians.

- An alternative to the principle of maximum likelihood estimation and to Bayesian inference.

- Has been widely applied to various models, including Hidden Markov Models [1, 2], mixtures of Gaussians [3], topic models [4, 5, 6], latent junction trees [7, 8], etc.

Today I will focus on the spectral algorithm for Hidden Markov Models.


HMM

Hidden Markov Model:

- A discrete-time stochastic process.

- Satisfies the Markov property.

- The state of the system at each time step is hidden; only the observation of the system is visible.

HMM

An HMM can be defined as a triple ⟨T, O, π⟩:

- Transition matrix T ∈ R^{m×m}, with T_{ij} = Pr(s_{t+1} = i | s_t = j).

- Observation matrix O ∈ R^{n×m}, with O_{ij} = Pr(o_t = i | s_t = j).

- Initial distribution π ∈ R^m, with π_i = Pr(s_1 = i).
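As a concrete illustration (not from the original slides), here is a minimal numpy sketch of such a triple for a toy HMM with m = 2 hidden states and n = 3 observation symbols; the column-stochastic convention matches the definitions above.

```python
import numpy as np

m, n = 2, 3  # hypothetical sizes: hidden states, observation symbols

# Columns are conditional distributions: T[i, j] = Pr(s_{t+1} = i | s_t = j)
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])

# O[i, j] = Pr(o_t = i | s_t = j)
O = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])

# pi[i] = Pr(s_1 = i)
pi = np.array([0.6, 0.4])

# Sanity check: every column of T and O sums to one.
assert np.allclose(T.sum(axis=0), 1.0) and np.allclose(O.sum(axis=0), 1.0)
```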

HMM

Given an HMM H = ⟨T, O, π⟩, we are interested in two inference problems:

1. Marginal inference (estimation problem). Computing the marginal probability

   Pr(o_{1:t}) = Σ_{s_{1:t}} Pr(o_{1:t}, s_{1:t}) = Σ_{s_{1:t}} Pr(s_{1:t}) Pr(o_{1:t} | s_{1:t})

   Dynamic programming!

2. MAP inference (decoding problem). Computing the sequence s*_{1:t} maximizing the posterior probability

   s*_{1:t} = argmax_{s_{1:t}} Pr(s_{1:t} | o_{1:t})

   Viterbi algorithm!

What about the learning problem?
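To make the dynamic-programming solution to the marginal inference problem concrete, here is a minimal sketch (my illustration, not the authors' code) of the forward algorithm, assuming `T`, `O`, `pi` are numpy arrays with the column-stochastic convention of the previous example.

```python
import numpy as np

def marginal_probability(obs, T, O, pi):
    """Forward algorithm: return Pr(o_{1:t}) for a sequence of observation symbols."""
    alpha = O[obs[0], :] * pi           # alpha_1[j] = Pr(o_1, s_1 = j)
    for x in obs[1:]:
        alpha = O[x, :] * (T @ alpha)   # alpha_t[i] = Pr(o_{1:t}, s_t = i)
    return float(alpha.sum())           # marginalize out the last hidden state

# e.g. marginal_probability([0, 2, 1], T, O, pi)
```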


HMM Reparametrization

Let H = ⟨T, O, π⟩ be an HMM and define the following observable operators:

   A_x ≜ T diag(O_{x,1}, ..., O_{x,m}),  ∀x ∈ [n]

Then H = ⟨π, {A_x}_{x ∈ [n]}⟩ is an equivalent parameterization of the HMM:

   A_x[i, j] = Pr(s_{t+1} = i | s_t = j) · Pr(o_t = x | s_t = j) = Pr(s_{t+1} = i, o_t = x | s_t = j).
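A minimal sketch (illustrative, under the same toy setup as before) of constructing these operators:

```python
import numpy as np

def observable_operators(T, O):
    """Return [A_0, ..., A_{n-1}] with A_x = T @ diag(O[x, :]), i.e. A_x[i, j] = T[i, j] * O[x, j]."""
    n = O.shape[0]
    return [T @ np.diag(O[x, :]) for x in range(n)]
```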


HMM Reparametrization

We can express the marginal probability in terms of the observable operators:

   Pr(o_{1:t}) = Σ_{s_{1:t+1}} Pr(o_{1:t}, s_{1:t+1})
               = Σ_{s_{1:t+1}} [Pr(s_{t+1} | s_t) Pr(o_t | s_t)] ··· [Pr(s_2 | s_1) Pr(o_1 | s_1)] Pr(s_1)
               = Σ_{s_{1:t+1}} A_{o_t}[s_{t+1}, s_t] ··· A_{o_1}[s_2, s_1] π_{s_1}
               = 1^T A_{o_t} ··· A_{o_1} π

Goal of learning: estimate the observable operators from sequences of observations.
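As a quick check (again illustrative; `observable_operators` and `marginal_probability` are the hypothetical helpers sketched earlier), the operator form of the marginal is a single chain of matrix-vector products and should agree with the forward algorithm:

```python
def operator_marginal(obs, A, pi):
    """Compute Pr(o_{1:t}) = 1^T A_{o_t} ... A_{o_1} pi."""
    v = pi.copy()
    for x in obs:          # apply A_{o_1} first, then A_{o_2}, and so on
        v = A[x] @ v
    return float(v.sum())  # left-multiplication by the all-ones row vector

# operator_marginal([0, 2, 1], observable_operators(T, O), pi)
# should match marginal_probability([0, 2, 1], T, O, pi).
```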


Spectral Learning for HMM [1]

Assumption 1: π > 0 element-wise, and T and O are full rank (rank(T) = rank(O) = m). Define the first three moments of the observations:

   P_1[i] = Pr(x_1 = i)
   P_{2,1}[i, j] = Pr(x_2 = i, x_1 = j)
   P_{3,x,1}[i, j] = Pr(x_3 = i, x_2 = x, x_1 = j),  ∀x ∈ [n]

Let U ∈ R^{n×m} be the matrix of top-m left singular vectors of P_{2,1}, and define the following observable operators:

   b_1 = U^T P_1
   b_∞ = (P_{2,1}^T U)^+ P_1
   B_x = (U^T P_{3,x,1})(U^T P_{2,1})^+,  ∀x ∈ [n]

where M^+ denotes the Moore–Penrose pseudoinverse of a matrix M.
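A minimal sketch (an assumption-laden illustration, not the authors' implementation) of estimating these quantities from i.i.d. triples of the first three observations, encoded as integers in [0, n):

```python
import numpy as np

def spectral_operators(triples, n, m):
    """Estimate b_1, b_inf and {B_x} from an iterable of (x1, x2, x3) observation triples."""
    P1 = np.zeros(n)
    P21 = np.zeros((n, n))            # P21[i, j]     ~ Pr(x2 = i, x1 = j)
    P3x1 = np.zeros((n, n, n))        # P3x1[x, i, j] ~ Pr(x3 = i, x2 = x, x1 = j)
    triples = list(triples)
    for x1, x2, x3 in triples:
        P1[x1] += 1
        P21[x2, x1] += 1
        P3x1[x2, x3, x1] += 1
    P1, P21, P3x1 = P1 / len(triples), P21 / len(triples), P3x1 / len(triples)

    U = np.linalg.svd(P21)[0][:, :m]  # top-m left singular vectors of P21

    b1 = U.T @ P1
    binf = np.linalg.pinv(P21.T @ U) @ P1
    B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(n)]
    return b1, binf, B
```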


Spectral Learning for HMM [1]

Theorem (Observable HMM Representation [1]).
Assume the HMM obeys Assumption 1. Then

1. b_1 = (U^T O) π

2. b_∞^T = 1^T (U^T O)^{-1}

3. B_x = (U^T O) A_x (U^T O)^{-1},  ∀x ∈ [n]

4. Pr(o_{1:t}) = b_∞^T B_{x_t} ··· B_{x_1} b_1

b_1, b_∞ and B_x depend only on the first three moments of the observations, free of hidden states!
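Item 4 gives the prediction rule. A short sketch (reusing the hypothetical `spectral_operators` output above):

```python
def spectral_marginal(obs, b1, binf, B):
    """Estimate Pr(o_{1:t}) = b_inf^T B_{x_t} ... B_{x_1} b_1."""
    v = b1
    for x in obs:
        v = B[x] @ v
    return float(binf @ v)

# b1, binf, B = spectral_operators(triples, n, m)
# spectral_marginal([0, 2, 1], b1, binf, B)
```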


Spectral Learning for HMM [1]

Main result of the spectral learning algorithm for HMM:

Theorem (Sample Complexity).
There exists a constant C > 0 such that the following holds. Pick any 0 < ε, η < 1 and t ≥ 1, assume the HMM obeys Assumption 1, and let

   N ≥ C · (t²/ε²) · ( m log(1/ε) / (σ_m(O)² σ_m(P_{2,1})⁴) + m n_0(ε) log(1/ε) / (σ_m(O)² σ_m(P_{2,1})²) )

Then with probability at least 1 − η, the model returned by the spectral learning algorithm for HMM satisfies

   Σ_{x_1,...,x_t} |Pr(x_{1:t}) − P̂r(x_{1:t})| ≤ ε

where n_0(ε) = O(ε^{1/(1−s)}) for a constant s > 1.

Compared with EM

Expectation-Maximization [9]:

- A local-search heuristic based on the principle of maximum likelihood estimation.

- Suffers from local optima.

- No consistency guarantees.

For any given t ≥ 1 and 0 < ε, η < 1, the spectral learning algorithm offers:

- A finite sample complexity that suffices for consistency in terms of the L1 error on the marginal probability.

- No local optima, since it only solves an SVD without any local search.


EM vs. Spectral algorithm

Two synthetic experiments:

                            SmallSyn   LargeSyn
  # states                  4          50
  # observations            8          100
  test set size             4096       10,000
  length of test sequence   4          50

Measure: the normalized L1 prediction error on the test data set,

   L1 = ( Σ_{x_{1:t} ∈ T} |Pr(x_{1:t}) − P̂r(x_{1:t})| )^{1/t}

where T is the test set.
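A short sketch of the error measure (illustrative; the exponent 1/t reflects my reconstruction of the normalization above), given true and estimated probabilities for every sequence in the test set:

```python
import numpy as np

def normalized_l1_error(p_true, p_hat, t):
    """Normalized L1 prediction error over a test set of length-t sequences."""
    diff = np.abs(np.asarray(p_true) - np.asarray(p_hat))
    return float(diff.sum() ** (1.0 / t))
```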

EM vs. Spectral algorithm

[Figure: normalized L1 error vs. rank hyperparameter on SmallSyn and LargeSyn for training sizes from 10,000 to 1,000,000, and normalized L1 error vs. training size comparing LearnHMM with EM (SmallSyn, m = 4; LargeSyn, m = 50).]

EM vs. Spectral algorithm

The negative probability problem with the spectral learning algorithm depends on:

- the size of the training data;

- the estimation of the rank hyperparameter;

- the length of the test sequence.

Proportion of negative probabilities:

   NEG_PROP = |{ P̂r(x_{1:t}) < 0 : x_{1:t} ∈ T }| / |T|

[Figure: proportion of negative probabilities vs. rank hyperparameter on SmallSyn and LargeSyn for training sizes from 10,000 to 1,000,000.]
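A one-line sketch of the plotted quantity (illustrative), given the estimated probabilities of the test sequences:

```python
import numpy as np

def neg_prop(p_hat):
    """Fraction of test sequences assigned a negative 'probability' by the learned model."""
    return float(np.mean(np.asarray(p_hat) < 0))
```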


Compared with EM

Why does EM succeed in practice?
Conjecture: the log-likelihood of the model parameters tends to become concave/quasi-concave as the sample size goes to infinity. If this holds, then:

1. Local search algorithms, e.g., the EM algorithm in our case, converge to the global optimum and hence obtain the maximum likelihood estimator [10].

2. Consistency: the sequence of MLEs converges in probability to the true model parameter (assuming the model is identifiable by its parameters) [11].

3. Asymptotic normality: the distribution of the MLE tends to a Gaussian with mean the true parameter and covariance equal to the inverse of the Fisher information matrix, i.e., it becomes more and more concentrated [11].

4. The MLE is the most statistically efficient consistent estimator of the model parameter [11].


Synthetic Experiment

Is our conjecture true for HMMs? Consider an HMM with a single parameter θ, for ease of visualization:

   T = [  θ    1−θ ]     O = [ 0.7  0.3 ]     π = (0.5, 0.5)
       [ 1−θ    θ  ]         [ 0.3  0.7 ]

A Beta distribution with the uniform distribution as the prior on θ; exact Bayesian updating with more and more observations.
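The plots that follow show how the (normalized) likelihood of θ evolves as observations accumulate. A minimal sketch (my illustration, reusing the hypothetical `marginal_probability` helper from earlier) of how such a curve can be computed:

```python
import numpy as np

def likelihood_curve(obs, thetas=np.linspace(0.0, 1.0, 101)):
    """Likelihood of theta for the single-parameter HMM above, normalized to integrate to 1."""
    O = np.array([[0.7, 0.3], [0.3, 0.7]])
    pi = np.array([0.5, 0.5])
    lik = np.array([
        marginal_probability(obs, np.array([[th, 1 - th], [1 - th, th]]), O, pi)
        for th in thetas
    ])
    return thetas, lik / np.trapz(lik, thetas)  # normalize so the curve integrates to one
```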

Synthetic Experiment

[Figure: (normalized) likelihood of θ after 10, 20, ..., 100 observations; as the number of observations grows, the likelihood curve becomes increasingly peaked.]

Synthetic Experiment

Another small synthetic experiment: an HMM with 2 states, 2 observations and 4 free parameters.

[Figure: log-likelihood comparison vs. training size, plotting the EM log-likelihood against the true log-likelihood.]


Conclusions

Spectral learning for HMM

Pros:

1. Additive L1 error bound with finite sample complexity.

2. No local optima.

Cons:

1. Negative probabilities.

2. Not the most statistically efficient.

3. Slow to converge.


Conclusions

EM for HMM

Pros:

1. Fast to converge.

2. Statistically efficient.

3. An optimization-based approach.

Cons:

1. A local-search heuristic with no provable guarantee of reaching the global optimum.

2. Can get stuck in local optima of the non-convex optimization.


References

[1] D. Hsu, S. M. Kakade, and T. Zhang, "A spectral algorithm for learning Hidden Markov Models," Journal of Computer and System Sciences, vol. 78, pp. 1460–1480, Sept. 2012.

[2] A. Anandkumar, D. Hsu, and S. M. Kakade, "A Method of Moments for Mixture Models and Hidden Markov Models," arXiv preprint arXiv:1203.0683, 2012.

[3] D. Hsu and S. M. Kakade, "Learning mixtures of spherical Gaussians: moment methods and spectral decompositions," in Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pp. 11–20, ACM, 2013.

[4] A. Anandkumar, D. P. Foster, and D. Hsu, "A Spectral Algorithm for Latent Dirichlet Allocation," in NIPS, pp. 926–934, 2012.

[5] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky, "Tensor Decompositions for Learning Latent Variable Models," arXiv preprint arXiv:1210.7559, pp. 1–55, 2012.

[6] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu, "A practical algorithm for topic modeling with provable guarantees," arXiv preprint arXiv:1212.4777, 2012.

[7] A. Parikh, G. Teodoru, G. Tech, M. Ishteva, and E. P. Xing, "A Spectral Algorithm for Latent Junction Trees," arXiv preprint arXiv:1210.4884, 2012.

[8] A. Parikh and E. P. Xing, "A Spectral Algorithm for Latent Tree Graphical Models," in Proceedings of the 28th International Conference on Machine Learning, pp. 1065–1072, 2011.

[9] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[10] S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[11] G. Casella and R. L. Berger, Statistical Inference, vol. 70. Duxbury Press, Belmont, CA, 1990.