
Page 1

Identifiability and Learning of Topic Models:
Tensor Decompositions under Structural Constraints

Anima Anandkumar

U.C. Irvine

Joint work with Daniel Hsu, Majid Janzamin, Adel Javanmard and Sham Kakade.

Page 2

Latent Variable Modeling

Goal: Discover hidden effects from observed measurements

Topic Models

Observations: words. Hidden: topics.

Nursing Home Is Faulted Over Care After Storm
By MICHAEL POWELL and SHERI FINK
Amid the worst hurricane to hit New York City in nearly 80 years, officials have claimed that the Promenade Rehabilitation and Health Care Center failed to provide the most basic care to its patients.

In One Day, 11,000 Flee Syria as War and Hardship Worsen
By RICK GLADSTONE and NEIL MacFARQUHAR
The United Nations reported that 11,000 Syrians fled on Friday, the vast majority of them clambering for safety over the Turkish border.

Obama to Insist on Tax Increase for the Wealthy
By HELENE COOPER and JONATHAN WEISMAN
Amid talk of compromise, President Obama and Speaker John A. Boehner both indicated unchanged stances on this issue, long a point of contention.

Hurricane Exposed Flaws in Protection of Tunnels
By ELISABETH ROSENTHAL
Nearly two weeks after Hurricane Sandy struck, the vital arteries that bring cars, trucks and subways into New York City's transportation network have recovered, with one major exception: the Brooklyn-Battery Tunnel remains closed.

Behind New York Gas Lines, Warnings and Crossed Fingers
By DAVID W. CHEN, WINNIE HU and CLIFFORD KRAUSS
The return of 1970s-era gas lines to the five boroughs of New York City was not the result of a single miscalculation, but a combination of ignored warnings and indecisiveness.

Modeling communities in social networks, modeling gene regulation . . .

Pages 3-5: Challenges in Learning Topic Models

Learning Topic Models Using Word Observations

Challenges in Identifiability
When can topics be identified?
Conditions on the model parameters, e.g. on the topic-word matrix Φ and on the distribution of the topic proportions h?
Does identifiability also lead to tractable algorithms?

Challenges in Design of Learning Algorithms
Maximum-likelihood learning of topic models is NP-hard (Arora et al.).
In practice, heuristics such as Gibbs sampling, variational Bayes, etc.
Guaranteed learning with minimal assumptions? Efficient methods? Low sample and computational complexities?

Moment-based approach: learning using low-order observed moments.

Page 6

Probabilistic Topic Models

Useful abstraction for automatic categorization of documents.
Observed: words. Hidden: topics.
Bag of words: the order of words does not matter.

Graphical model representation
l words in a document: x1, . . . , xl.
h: proportions of topics in the document.
Word xi is generated from topic yi.
Exchangeability: x1 ⊥ x2 ⊥ . . . | h.
Φ(i, j) := P[xm = i | ym = j]: the topic-word matrix.

[Figure: graphical model with topic mixture h, topics y1, . . . , y5, and words x1, . . . , x5, each word generated through Φ.]

Page 7

Formulation as Linear Models

Distribution of the topic proportions vector h
If there are k topics, h is distributed over the simplex ∆^{k−1}:
∆^{k−1} := {h ∈ R^k : h_i ∈ [0, 1], Σ_i h_i = 1}.

Distribution of the words x1, x2, . . .
Order the n words in the vocabulary. If x1 is the jth word, assign it e_j ∈ R^n.
Distribution of each xi: supported on the vertices of ∆^{n−1}.

Properties
Linear model: E[xi | h] = Φh.
Multiview model: h is fixed and multiple words (xi) are generated.
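
The linear-model property is easy to sanity-check numerically. Below is a minimal NumPy sketch (all sizes and names are illustrative, not from the tutorial): it draws h from a Dirichlet, samples exchangeable one-hot words, and verifies that their empirical mean approaches Φh.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3                                  # toy vocabulary size and topic count

Phi = rng.dirichlet(np.ones(n), size=k).T    # n x k; each column is a topic's word distribution
h = rng.dirichlet(np.ones(k))                # topic proportions for one document
p = Phi @ h                                  # word distribution: E[x_i | h] = Phi h

l = 50_000                                   # exchangeable words in the document
words = rng.choice(n, size=l, p=p)
X = np.eye(n)[words]                         # one-hot vectors e_j, one per word

print(np.abs(X.mean(axis=0) - p).max())      # small: empirical mean ~ Phi h
```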

Pages 8-12: Geometric Picture for Topic Models

[Figure sequence: a document's topic proportions vector h lies on the topic simplex; for a single-topic document, h sits at a vertex. Words x1, x2, x3, . . . are generated through Φ and lie on vertices of the word simplex.]

Moment-based estimation: co-occurrences of words in documents.

Page 13

Outline

1 Introduction
2 Form of Moments
3 Matrix Case: Learning using Pairwise Moments
    Identifiability and Learning of Topic-Word Matrix
    Learning Latent Space Parameters of the Topic Model
4 Tensor Case: Learning From Higher Order Moments
    Overcomplete Representations
5 Conclusion

Pages 14-17: Recap of CP Decomposition

Recall the form of the moments for the single-topic/Dirichlet model.

E[xi | h] = Φh,   λ := E[h].

Learn the topic-word matrix Φ and the vector λ.

Pairs Matrix M2
M2 := E[x1 x2^T] = E[E[x1 x2^T | h]] = Φ E[hh^T] Φ^T = Σ_{r=1}^k λ_r φ_r φ_r^T

Similarly, Triples Tensor M3
M3 := E[x1 ⊗ x2 ⊗ x3] = Σ_{r=1}^k λ_r φ_r ⊗ φ_r ⊗ φ_r

Matrix and Tensor Forms (φ_r := rth column of Φ):
M2 = Σ_{r=1}^k λ_r φ_r ⊗ φ_r,   M3 = Σ_{r=1}^k λ_r φ_r ⊗ φ_r ⊗ φ_r
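
As a sanity check on the pairs matrix, the following sketch (synthetic single-topic data; all names illustrative) forms the empirical M2 from one-hot word pairs and compares it with Σ_r λ_r φ_r φ_r^T.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, docs = 6, 3, 50_000

Phi = rng.dirichlet(np.ones(n), size=k).T    # n x k topic-word matrix
lam = rng.dirichlet(np.ones(k))              # lam_r = P[topic r] for the single-topic model

topics = rng.choice(k, size=docs, p=lam)     # one topic per document
# two exchangeable words per document, conditionally independent given the topic
x1 = np.array([rng.choice(n, p=Phi[:, t]) for t in topics])
x2 = np.array([rng.choice(n, p=Phi[:, t]) for t in topics])

E1, E2 = np.eye(n)[x1], np.eye(n)[x2]        # one-hot encodings
M2_hat = E1.T @ E2 / docs                    # empirical E[x1 x2^T]

M2 = (Phi * lam) @ Phi.T                     # sum_r lam_r phi_r phi_r^T
print(np.abs(M2_hat - M2).max())             # shrinks as docs grows
```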

Page 18

Multi-linear Transformation

For a tensor T, define (for matrices V1, V2, V3 of appropriate dimensions)

[T(V1, V2, V3)]_{i1,i2,i3} := Σ_{j1,j2,j3} T_{j1,j2,j3} Π_{m∈[3]} Vm(jm, im)

For a matrix M2, this reduces to M2(V1, V2) := V1^T M2 V2.

For T = Σ_{r=1}^k λ_r φ_r ⊗ φ_r ⊗ φ_r:

T(W, W, W) = Σ_{r∈[k]} λ_r (W^T φ_r)^{⊗3}
T(I, v, v) = Σ_{r∈[k]} λ_r ⟨v, φ_r⟩^2 φ_r
T(I, I, v) = Σ_{r∈[k]} λ_r ⟨v, φ_r⟩ φ_r φ_r^T
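
These identities map directly onto einsum contractions. A minimal sketch, assuming a small synthetic symmetric tensor (names illustrative), verifying T(I, v, v) = Σ_r λ_r ⟨v, φ_r⟩² φ_r:

```python
import numpy as np

def multilinear(T, V1, V2, V3):
    """[T(V1,V2,V3)]_{i1,i2,i3} = sum_{j1,j2,j3} T_{j1,j2,j3} V1[j1,i1] V2[j2,i2] V3[j3,i3]."""
    return np.einsum('abc,ai,bj,ck->ijk', T, V1, V2, V3)

rng = np.random.default_rng(2)
n, k = 5, 3
Phi = rng.standard_normal((n, k))
lam = rng.random(k)

# T = sum_r lam_r phi_r^{(x)3}
T = np.einsum('r,ar,br,cr->abc', lam, Phi, Phi, Phi)

v = rng.standard_normal(n)
lhs = multilinear(T, np.eye(n), v[:, None], v[:, None])[:, 0, 0]   # T(I, v, v)
rhs = (lam * (Phi.T @ v) ** 2 * Phi).sum(axis=1)                   # sum_r lam_r <v,phi_r>^2 phi_r
print(np.allclose(lhs, rhs))
```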

Pages 19-21: Form of Moments for a General Topic Model

E[xi | h] = Φh. Learn Φ and the distribution of h. What is the form of the moments?

Pairs Matrix M2
M2 := E[x1 x2^T] = E[E[x1 x2^T | h]] = Φ E[hh^T] Φ^T

Similarly, Triples Tensor M3
M3 := E[x1 ⊗ x2 ⊗ x3] = Σ_{i,j,k} E[h^{⊗3}]_{i,j,k} φ_i ⊗ φ_j ⊗ φ_k = E[h^{⊗3}](Φ, Φ, Φ)

Tucker Tensor Decomposition
Find the decomposition M3 = E[h^{⊗3}](Φ, Φ, Φ).
Key difference from CP: E[h^{⊗3}] is NOT a diagonal tensor.
A lot more parameters to estimate.

Page 22

Guaranteed Learning of Topic Models

Two learning approaches:
CP tensor decomposition: parametric topic distributions (constraints on h) but a general topic-word matrix Φ.
Tucker tensor decomposition: constrain the topic-word matrix Φ but allow general (non-degenerate) distributions on h.

[Figure: the exchangeable topic model, with topic mixture h, topics y1, . . . , y5, and words x1, . . . , x5 via Φ.]

Pages 23-24: Outline (as on Page 13)

Pages 25-29: Pairwise Moments for Learning

Learning using pairwise moments: minimal information.

Exchangeable Topic Model
[Figure: graphical model with topic mixture h, topics y1, . . . , y5, and words x1, . . . , x5 via Φ; bipartite topic-word graph between h1, . . . , hk and x(1), . . . , x(n).]

So far..
Parametric h: Dirichlet, single topic, independent components, . . .
No restrictions on Φ (other than non-degeneracy).
Learning using third-order moments through tensor decompositions.

What if..
Allow a general h: model arbitrary topic correlations.
Constrain the topic-word matrix Φ: sparsity constraints.

Pages 30-32: Some Intuitions

Learning using second-order moments
Linear model: E[xi | h] = Φh and E[x1 x2^T] = Φ E[hh^T] Φ^T.
Learning: recover Φ from Φ E[hh^T] Φ^T.
Ill-posed without further restrictions.

When h is non-degenerate: recover Φ from its column space Col(Φ).
No other restrictions on h: arbitrary dependencies.

Sparsity constraints on the topic-word matrix Φ
Main constraint: the columns of Φ are the sparsest vectors in Col(Φ).

Pages 33-34: Sufficient Conditions for Identifiability

The columns of Φ are the sparsest vectors in Col(Φ). Sufficient conditions?

[Figure: bipartite graph between topics h1, . . . , hk and words x(1), . . . , x(n), showing a topic subset S and its word neighborhood N(S).]

Structural Condition: (Additive) Graph Expansion
|N(S)| > |S| + d_max, for all S ⊂ [k]

Parametric Condition: Generic Parameters
‖Φv‖_0 > |N_Φ(supp(v))| − |supp(v)|

A. Anandkumar, D. Hsu, A. Javanmard, and S. M. Kakade. Learning Bayesian Networks with Latent Variables. In Proc. of Intl. Conf. on Machine Learning, June 2013.

Pages 35-36: Brief Proof Sketch

Structural Condition: (Additive) Graph Expansion
|N(S)| > |S| + d_max, for all S ⊂ [k]

Parametric Condition: Generic Parameters
‖Φv‖_0 > |N_Φ(supp(v))| − |supp(v)|

The structural and parametric conditions imply:
When |supp(v)| > 1: ‖Φv‖_0 > |N_Φ(supp(v))| − |supp(v)| > d_max.
Thus |supp(v)| = 1 is required for Φv to be one of the k sparsest vectors in Col(Φ).

Claim: the parametric conditions are satisfied for generic parameters.
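
For toy instances, the expansion condition can be checked by brute force. A sketch (exponential in k, so only for small graphs; it assumes d_max is the maximum degree over topic nodes, a convention not spelled out on the slide):

```python
import numpy as np
from itertools import combinations

def expands(Phi, tol=0.0):
    """Brute-force check of |N(S)| > |S| + d_max over all non-empty topic subsets S."""
    support = np.abs(Phi) > tol                 # n x k bipartite adjacency: word i <-> topic j
    d_max = support.sum(axis=0).max()           # max topic degree (assumed convention for d_max)
    k = support.shape[1]
    for size in range(1, k + 1):
        for S in combinations(range(k), size):
            if support[:, list(S)].any(axis=1).sum() <= size + d_max:  # |N(S)| <= |S| + d_max
                return False
    return True

rng = np.random.default_rng(3)
Phi = rng.random((12, 3)) * (rng.random((12, 3)) < 0.4)   # a sparse 12 x 3 toy topic-word matrix
print(expands(Phi))
```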

Page 37

Tractable Learning Algorithm

Learning Task
Recover the topic-word matrix Φ from M2 = Φ E[hh^T] Φ^T.

Exhaustive search:
min_{z ≠ 0} ‖Φz‖_0

Convex relaxation:
min_z ‖Φz‖_1  s.t.  b^T z = 1,
where b is a row of Φ.

Change of variables:
min_w ‖M2^{1/2} w‖_1  s.t.  e_i^T M2^{1/2} w = 1.

Under “reasonable” conditions, the above program exactly recovers Φ.
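
One way to realize the ℓ1 program is as a linear program with slack variables t ≥ |Φw|. A hedged SciPy sketch on synthetic data (function and variable names are illustrative; exact recovery is only expected under the conditions above, not for arbitrary random Φ):

```python
import numpy as np
from scipy.optimize import linprog

def l1_direction(A, b):
    """Solve min_w ||A w||_1 s.t. b^T w = 1 as an LP with slacks t >= |A w|."""
    m, d = A.shape
    obj = np.concatenate([np.zeros(d), np.ones(m)])        # minimize sum(t)
    G = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])      # A w - t <= 0 and -A w - t <= 0
    h = np.zeros(2 * m)
    A_eq = np.concatenate([b, np.zeros(m)])[None, :]
    res = linprog(obj, A_ub=G, b_ub=h, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(None, None)] * (d + m))
    return res.x[:d]

rng = np.random.default_rng(4)
n, k = 20, 5
Phi = rng.random((n, k)) * (rng.random((n, k)) < 0.3)      # sparse synthetic topic-word matrix
row = Phi[np.flatnonzero(Phi[:, 0])[0], :]                 # b: a row of Phi touching topic 0
z = l1_direction(Phi, row)
print(np.round(Phi @ z, 3))                                # sparse when recovery conditions hold
```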

Page 38: Outline (as on Page 13)

Page 39

Latent General Topic Models

So far: recover the topic-word matrix Φ from Φ E[hh^T] Φ^T.

Learning the topic proportions distribution
E[hh^T] is not enough to recover general distributions.
Higher-order moments are needed to learn the distribution of h.
Are there models where low-order moments suffice? E.g. Dirichlet/single-topic models require only third-order moments. What about other distributions?
Are there other topic distributions which can be learned efficiently?

Pages 40-45: Learning Latent Bayesian Networks

BN: Markov relationships on a DAG.
Pa_i: parents of node i.  P(h) = Π_{i=1}^n P(h_i | h_{Pa_i})

Linear Bayesian network: h_j = Σ_{i∈Pa_j} λ_{ji} h_i + η_j, i.e.
h = Λh + η,   E[xi | η] = Φ(I − Λ)^{−1}η = Φ′η,   with the η_i uncorrelated.

[Figure: the model with correlated topics h1, . . . , hk and words x(1), . . . , x(n) via Φ is equivalent, through (I − Λ)^{−1}, to a model with uncorrelated sources η1, . . . , ηk and mixing matrix Φ′.]

Φ is structured and sparse, while Φ′ is dense.
h: correlated topics, while the η are uncorrelated.
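
The equivalence h = (I − Λ)^{−1}η is easy to verify for a toy DAG. A minimal sketch (a strictly lower-triangular Λ encodes a topological order; all values synthetic):

```python
import numpy as np

rng = np.random.default_rng(5)
k = 4

# Strictly lower-triangular Lambda: h_j depends only on earlier h_i in the DAG order.
Lam = np.tril(rng.standard_normal((k, k)) * 0.5, k=-1)
eta = rng.standard_normal(k)          # one sample of the exogenous noise

# Sequential generation h_j = sum_{i in Pa_j} Lam[j, i] h_i + eta_j ...
h = np.zeros(k)
for j in range(k):
    h[j] = Lam[j] @ h + eta[j]

# ... matches the closed form h = (I - Lam)^{-1} eta
print(np.allclose(h, np.linalg.solve(np.eye(k) - Lam, eta)))
```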

Pages 46-49: Learning Latent Bayesian Networks

E[xi | η] = Φ(I − Λ)^{−1}η = Φ′η,   E[η] = λ.

E[x1 ⊗ x2 ⊗ x3] = E[η^{⊗3}](Φ′, Φ′, Φ′) = Σ_i λ_i (φ′_i)^{⊗3}

Solving CP Decomposition through the Tensor Power Method
Recall the η_i are uncorrelated: E[η^{⊗3}] is diagonal.
Reduction to CP decomposition: can be efficiently solved via the tensor power method.

Sparse Tucker Decomposition: Unmixing via Convex Optimization
Un-mix Φ from Φ′ = Φ(I − Λ)^{−1} through ℓ1 optimization.

Learning both the structure and parameters of Φ and the distribution of h: combine non-convex and convex methods for learning!
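
A minimal sketch of the tensor power iteration, using the T(I, v, v) map from Page 18. It assumes an orthogonally decomposable tensor (e.g. after a whitening step not shown here); components and weights are synthetic:

```python
import numpy as np

def tensor_power(T, n_iter=200, seed=6):
    """Extract one component of an orthogonally decomposable symmetric tensor."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = np.einsum('abc,b,c->a', T, v, v)   # v <- T(I, v, v), then renormalize
        v /= np.linalg.norm(v)
    return np.einsum('abc,a,b,c->', T, v, v, v), v   # weight = T(v, v, v)

# Toy odeco tensor: orthonormal components (columns of Q) with weights lam
Q, _ = np.linalg.qr(np.random.default_rng(7).standard_normal((5, 3)))
lam = np.array([3.0, 2.0, 1.0])
T = np.einsum('r,ar,br,cr->abc', lam, Q, Q, Q)

w, v = tensor_power(T)
print(round(w, 3), np.round(np.abs(Q.T @ v), 3))   # one weight; |cosines| ~ a standard basis vector
```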

Pages 50-51: Outline (as on Page 13)

Pages 52-53: Extension to Learning Overcomplete Representations

So far..
Pairwise moments for learning structured topic-word matrices.
Third-order moments for learning latent Bayesian network models.
Number of topics k, vocabulary size n, with k < n.

[Figure: undercomplete representation (k < n) vs. overcomplete representation (k > n) of the bipartite topic-word graph.]

What about overcomplete models, where k > n? Do higher-order moments help?

Pages 54-56: Learning Overcomplete Representations

Why overcomplete representations?
Flexible modeling, robust to noise.
Huge gains in many applications, e.g. speech and computer vision.

Recall the Tucker form of the moments for topic models:
M2 := E[x1 ⊗ x2] = E[h^{⊗2}](Φ, Φ) ≡ Φ E[hh^T] Φ^T
M3 := E[x1 ⊗ x2 ⊗ x3] = E[h^{⊗3}](Φ, Φ, Φ)

For k > n: the Tucker decomposition is not unique, so the model is non-identifiable.

Identifiability of Overcomplete Models
Possible under the notion of topic persistence.
Includes the single-topic model as a special case.

Pages 57-60: Persistent Topic Models

Bag-of-Words Model
[Figure: topic mixture h, one topic y_i per word x_i, for words x1, . . . , x5.]

Persistent Topic Model
[Figure: topic mixture h, topics y1, y2, y3, each persisting across consecutive words among x1, . . . , x5.]

The single-topic model is a special case.
Persistence incorporates locality, i.e. the order of words.
Identifiability conditions for overcomplete models?

A. Anandkumar, D. Hsu, M. Janzamin, and S. M. Kakade. When are Overcomplete Representations Identifiable? Uniqueness of Tensor Decompositions Under Expansion Constraints. Preprint, June 2013.

Pages 61-63: Identifiability of Overcomplete Models

Recall the Tucker form of the moments for the bag-of-words model:
Tensor form: E[x1 ⊗ x2 ⊗ x3 ⊗ x4] = E[h^{⊗4}](Φ, Φ, Φ, Φ)
Matricized form: E[(x1 ⊗ x2)(x3 ⊗ x4)^T] = (Φ ⊗ Φ) E[(h ⊗ h)(h ⊗ h)^T] (Φ ⊗ Φ)^T

For the persistent topic model:
Tensor form: E[x1 ⊗ x2 ⊗ x3 ⊗ x4] = E[hh^T](Φ ⊙ Φ, Φ ⊙ Φ)
Matricized form: E[(x1 ⊗ x2)(x3 ⊗ x4)^T] = (Φ ⊙ Φ) E[hh^T] (Φ ⊙ Φ)^T

Kronecker vs. Khatri-Rao Products
Φ: topic-word matrix, n × k.
Φ ⊗ Φ: Kronecker product, an n² × k² matrix.
Φ ⊙ Φ: Khatri-Rao product, an n² × k matrix.
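
The dimension count is the whole point, and a few lines of NumPy make it concrete (khatri_rao here is a hypothetical helper, not a library routine): for k > n the Kronecker product Φ ⊗ Φ has rank at most n² < k², while the Khatri-Rao product Φ ⊙ Φ is generically full column rank.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: (A (.) B)[:, r] = A[:, r] kron B[:, r]."""
    (n, k), (m, k2) = A.shape, B.shape
    assert k == k2
    return np.einsum('ir,jr->ijr', A, B).reshape(n * m, k)

rng = np.random.default_rng(8)
n, k = 3, 5                                  # overcomplete: k > n
Phi = rng.random((n, k))

kron = np.kron(Phi, Phi)                     # n^2 x k^2
kr = khatri_rao(Phi, Phi)                    # n^2 x k

print(kron.shape, np.linalg.matrix_rank(kron))   # (9, 25), rank <= 9: columns dependent
print(kr.shape, np.linalg.matrix_rank(kr))       # (9, 5), generically rank 5
```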

Page 64

Some Intuitions

Bag-of-words model: (Φ ⊗ Φ) E[(h ⊗ h)(h ⊗ h)^T] (Φ ⊗ Φ)^T
Persistent model: (Φ ⊙ Φ) E[hh^T] (Φ ⊙ Φ)^T

Effective topic-word matrix given fourth-order moments:
Topic-word matrix Φ: n × k.
Bag-of-words model: Kronecker product Φ ⊗ Φ, n² × k². Not identifiable.
Persistent model: Khatri-Rao product Φ ⊙ Φ, n² × k. Identifiable.

Page 65: Outline (as on Page 13)

Pages 66-68: Conclusion

Moment-based Estimation of Latent Variable Models
Moments are easy to estimate.
Low-order moments have good concentration properties.

Tensor Decomposition Methods
Moment tensors have tractable forms for many models, e.g. topic models, HMMs, Gaussian mixtures, ICA.
Efficient CP tensor decomposition through power iterations.
Efficient structured Tucker decomposition through ℓ1 optimization.
Structured topic-word matrices: expansion conditions.
Can be extended to overcomplete representations.

Practical Considerations for Tensor Methods
Not covered in detail in this tutorial.
Matrix algebra and iterative methods.
Scalable: parallel implementation on GPUs.