Page 1:

Scalable Deep Poisson Factor Analysis for Topic Modeling

Zhe Gan, Changyou Chen, Ricardo Henao, David Carlson, Lawrence Carin

Duke University

July 9th, 2015

Presented by David Carlson (Duke)

Page 2:

Outline

1 Introduction

2 Model Formulation

3 Scalable Posterior Inference

4 Experiments

5 Summary

Page 3:

Introduction

Problem of interest: How do we develop deep generative models for documents represented in bag-of-words form?

Directed Graphical Models:
- Latent Dirichlet Allocation (LDA) (Blei et al., 2003)
- Focused Topic Model (FTM) (Williamson et al., 2010)
- Poisson Factor Analysis (PFA) (Zhou et al., 2012)

Going "Deep"? Hierarchical tree-structured topic models:
- nested Chinese Restaurant Process (nCRP) (Blei et al., 2004)
- Hierarchical Dirichlet Process (HDP) (Teh et al., 2006)
- nested Hierarchical Dirichlet Process (nHDP) (Paisley et al., 2015)

What if we want to model general topic correlations?

Page 4:

Introduction

Undirected Graphical Models:
- Replicated Softmax Model (RSM) (Salakhutdinov and Hinton, 2009b), a generalization of the Restricted Boltzmann Machine (RBM) (Hinton, 2002)

Going Deep?
- Deep Belief Networks (DBN) (Hinton et al., 2006; Hinton and Salakhutdinov, 2011)
- Deep Boltzmann Machines (DBM) (Salakhutdinov and Hinton, 2009a; Srivastava et al., 2013)

However, topics are not defined "properly" in these models.

Page 5:

Introduction

Main idea: Poisson Factor Analysis (PFA) + a deep Sigmoid Belief Network (SBN) or Restricted Boltzmann Machine (RBM).
- PFA is employed to interact with the data at the bottom layer.
- A deep SBN or RBM serves as a flexible prior for revealing topic structure.

Figure: Graphical model for the Deep Poisson Factor Analysis with three layers of hidden binary hierarchies. The directed binary hierarchy may be replaced by a deep Boltzmann machine.

Page 6:

Model Formulation

Poisson Factor Analysis (Zhou et al., 2012):

We represent a discrete matrix X ∈ Z_+^{P×N}, containing counts from N documents and P words, as

X = Pois(Φ(Θ ∘ H^(1))) .    (1)

- Each column of Φ, φ_k, encodes the relative importance of each word in topic k.
- Each column of Θ, θ_n, contains relative topic intensities specific to document n.
- Each column of H^(1), h_n^(1), defines a sparse set of topics associated with each document.

Page 7:

Model Formulation

Poisson Factor Analysis (Zhou et al., 2012):

We construct PFAs by placing Dirichlet priors on φ_k and gamma priors on θ_n:

x_pn = ∑_{k=1}^K x_pnk ,    x_pnk ∼ Pois(φ_pk θ_kn h_kn^(1)) ,    (2)

with priors specified as φ_k ∼ Dir(a_φ, ..., a_φ), θ_kn ∼ Gamma(r_k, p_n/(1 − p_n)), r_k ∼ Gamma(γ_0, 1/c_0), and γ_0 ∼ Gamma(e_0, 1/f_0).

- Previously, a beta-Bernoulli process prior was placed on h_n^(1), assuming topic independence (Zhou and Carin, 2015).
- The novelty in our models comes from the prior for h_n^(1).
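As a concrete illustration, here is a minimal NumPy sketch of the generative process in (1)-(2). The sizes, hyperparameter values, the fixed p_n, and the independent Bernoulli draw for h_n^(1) are illustrative placeholders, not the paper's settings; DPFA replaces that independent draw with a deep SBN/RBM prior.

```python
import numpy as np

rng = np.random.default_rng(0)
P, N, K = 2000, 100, 128                  # vocabulary size, #documents, #topics (illustrative)
a_phi, c0, e0, f0 = 0.05, 1.0, 1.0, 1.0   # illustrative hyperparameters

gamma0 = rng.gamma(e0, 1.0 / f0)                    # gamma_0 ~ Gamma(e0, 1/f0)
r = rng.gamma(gamma0, 1.0 / c0, size=K)             # r_k ~ Gamma(gamma_0, 1/c0)
Phi = rng.dirichlet(a_phi * np.ones(P), size=K).T   # P x K; columns phi_k ~ Dir(a_phi, ..., a_phi)
p = np.full(N, 0.5)                                 # p_n fixed here for simplicity
Theta = rng.gamma(r[:, None], (p / (1.0 - p))[None, :])  # K x N; theta_kn ~ Gamma(r_k, p_n/(1-p_n))
H1 = (rng.random((K, N)) < 0.5).astype(float)       # binary topic usage h_kn^(1); DPFA draws this
                                                    # from a deep SBN/RBM instead of independently
X = rng.poisson(Phi @ (Theta * H1))                 # X ~ Pois(Phi (Theta o H^(1)))  -- eq. (1)
print(X.shape, int(X.sum()))
```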

Page 8:

Model Formulation

Structured Priors on the Latent Binary Matrix:

Assume h_n^(1) ∈ {0,1}^{K_1}; we define another hidden set of units h_n^(2) ∈ {0,1}^{K_2} placed at a layer "above" h_n^(1).

Modeling with the RBM (undirected):

−E(h_n^(1), h_n^(2)) = (h_n^(1))ᵀ c^(1) + (h_n^(1))ᵀ W^(1) h_n^(2) + (h_n^(2))ᵀ c^(2) .    (3)

Modeling with the SBN (Neal, 1992) (directed):

p(h_{k2,n}^(2) = 1) = σ(c_{k2}^(2)) ,    (4)

p(h_{k1,n}^(1) = 1 | h_n^(2)) = σ((w_{k1}^(1))ᵀ h_n^(2) + c_{k1}^(1)) .    (5)
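To make (4)-(5) concrete, here is a minimal sketch of ancestral sampling from the two-layer SBN prior; the layer widths and random weights are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

K1, K2 = 128, 64                          # layer widths (illustrative)
W1 = 0.1 * rng.standard_normal((K1, K2))  # weights W^(1)
c1, c2 = np.zeros(K1), np.zeros(K2)       # biases c^(1), c^(2)

h2 = (rng.random(K2) < sigmoid(c2)).astype(float)            # eq. (4): top-layer prior
h1 = (rng.random(K1) < sigmoid(W1 @ h2 + c1)).astype(float)  # eq. (5): conditional on layer above
```

Note that an RBM prior with energy (3) admits no such single top-down pass; sampling from it requires Gibbs iterations alternating between h^(1) and h^(2), since the model is undirected.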

Page 9:

Model Formulation

Going Deep? Add multiple layers of SBNs or RBMs.

Stacking L layers of SBNs yields the deep prior

p(h_n^(1), ..., h_n^(L)) = p(h_n^(L)) ∏_{ℓ=2}^{L} p(h_n^(ℓ−1) | h_n^(ℓ)) ,

where p(h_n^(L)) is the top-layer prior as in (4), each conditional p(h_n^(ℓ−1) | h_n^(ℓ)) is as in (5), with weights W^(ℓ) ∈ R^{K_ℓ×K_{ℓ+1}} and biases c^(ℓ) ∈ R^{K_ℓ}. A similar deep architecture may be designed for the RBM (Salakhutdinov and Hinton, 2009a). The joint distribution over a document and its binary units h_n = {h_n^(1), ..., h_n^(L)} is

p(x_n, h_n) = p(x_n | h_n^(1)) p(h_n^(1), ..., h_n^(L)) ,

with p(x_n | h_n^(1)) as in (2). This flexible prior on binary vectors encodes high-order interactions across the elements of h_n^(1), replacing the independent beta-Bernoulli topic-usage prior of the NB-FTM (Zhou and Carin, 2015). The resulting model is the Deep Poisson Factor Analysis (DPFA).

Figure: Graphical model for the Deep Poisson Factor Analysis with three layers of hidden binary hierarchies. The directed binary hierarchy may be replaced by a deep Boltzmann machine.
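Because the stacked SBN prior is a directed chain, a draw from it is a single top-down ancestral pass. A sketch under the same illustrative conventions as above:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

widths = [128, 64, 32]                        # K_1, ..., K_L (illustrative)
Ws = [0.1 * rng.standard_normal((widths[l], widths[l + 1]))
      for l in range(len(widths) - 1)]        # W^(1), ..., W^(L-1)
cs = [np.zeros(k) for k in widths]            # c^(1), ..., c^(L)

h = (rng.random(widths[-1]) < sigmoid(cs[-1])).astype(float)  # sample top layer h^(L)
for l in reversed(range(len(widths) - 1)):                    # layers L-1, ..., 1
    h = (rng.random(widths[l]) < sigmoid(Ws[l] @ h + cs[l])).astype(float)
# h is now a draw of h^(1): the topic-usage indicators that enter the PFA likelihood (2)
```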

Page 10:

Scalable Posterior Inference

Challenge: Designing scalable Bayesian inference algorithms.

Solutions: Scaling up inference by stochastic algorithms.
- Applying the Bayesian conditional density filtering algorithm (Guhaniyogi et al., 2014).
- Extending recently proposed work on stochastic gradient thermostats (Ding et al., 2014).

Page 11:

Scalable Posterior Inference

Bayesian conditional density filtering (BCDF):
- Repeatedly update the surrogate conditional sufficient statistics (SCSS) using the current mini-batch.
- Draw samples from the conditional posterior distributions of the model parameters, based on the SCSS.
- "Stochastic Gibbs-style" updates.

Input: text documents, i.e., a count matrix X.
Initialize Ψ_g^(0) randomly and set S_g^(0) all to zero.
for t = 1 to ∞ do
    Get one mini-batch X^(t).
    Initialize Ψ_g^(t) = Ψ_g^(t−1) and S_g^(t) = S_g^(t−1).
    Initialize Ψ_l^(t) randomly.
    for s = 1 to S do
        Run Gibbs sampling for DPFA on X^(t).
        Collect samples Ψ_g^{1:S}, Ψ_l^{1:S}, and S_g^{1:S}.
    end for
    Set Ψ_g^(t) = mean(Ψ_g^{1:S}) and S_g^(t) = mean(S_g^{1:S}).
end for

Notation: Ψ_g are the global parameters, Ψ_l the local hidden variables, and S_g the SCSS for Ψ_g.
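In outline, BCDF reduces to the loop below. Here `gibbs_step` and `update_scss` are hypothetical placeholders for the model-specific conditional updates; this is a structural sketch of the algorithm above, not the authors' code.

```python
import numpy as np

def bcdf(minibatches, psi_g, s_g, gibbs_step, update_scss, S=5):
    """Structural sketch of BCDF (Guhaniyogi et al., 2014).

    minibatches -- iterable of count matrices X^(t)
    psi_g, s_g  -- initial global parameters and SCSS (arrays)
    gibbs_step  -- placeholder: draws (psi_g, psi_l) from the conditional posteriors
    update_scss -- placeholder: folds the mini-batch statistics into the SCSS
    """
    for X_t in minibatches:
        psi_l = None                      # local variables re-initialized per mini-batch
        g_draws, s_draws = [], []
        for _ in range(S):                # S Gibbs-style sweeps on the current mini-batch
            psi_g, psi_l = gibbs_step(X_t, psi_g, psi_l, s_g)
            g_draws.append(psi_g)
            s_draws.append(update_scss(s_g, X_t, psi_l))
        psi_g = np.mean(g_draws, axis=0)  # propagate averaged draws and statistics
        s_g = np.mean(s_draws, axis=0)    # to the next mini-batch
    return psi_g, s_g
```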

Page 12:

Scalable Posterior Inference

Stochastic Gradient Nosé-Hoover Thermostats (SGNHT):
- Extending Hamiltonian Monte Carlo using stochastic gradients.
- Introducing a thermostat to maintain the system temperature.
- Adaptively absorbing stochastic gradient noise.

The motion of the particles in the system is defined by the stochastic differential equations (SDEs)

dΨ_g = v dt ,    dv = f(Ψ_g) dt − ξ v dt + √D dW ,
dξ = ( (1/M) vᵀv − 1 ) dt ,    (6)

where Ψ_g ∈ R^M are the model parameters, v ∈ R^M are the momentum variables, f(Ψ_g) ≜ −∇_{Ψ_g} U(Ψ_g), and U(Ψ_g) is the negative log-posterior.

Page 13:

Scalable Posterior Inference

Extension:
- Extending the SGNHT by introducing multiple thermostat variables (ξ_1, ..., ξ_M) into the system, such that each ξ_i controls one degree of freedom of the particle momentum.
- The proposed SGNHT is defined by the following SDEs:

dΨ_g = v dt ,    dv = f(Ψ_g) dt − Ξ v dt + √D dW ,
dΞ = (q − I) dt ,    (7)

where Ξ = diag(ξ_1, ξ_2, ..., ξ_M) and q = diag(v_1², ..., v_M²).

Theorem. The equilibrium distribution of the SDE system in (7) is

p(Ψ_g, v, Ξ) ∝ exp( −(1/2) vᵀv − U(Ψ_g) − (1/2) tr{ (Ξ − D)ᵀ (Ξ − D) } ).
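Reading (7) componentwise makes the role of each thermostat explicit: under the equilibrium distribution of the theorem each v_i is standard normal, so the thermostat drift vanishes on average. This is a consistency check, not a proof.

```latex
% Componentwise, the thermostat equation in (7) reads
\begin{align*}
  d\xi_i = \left(v_i^2 - 1\right) dt .
\end{align*}
% Under the stated equilibrium, v_i \sim \mathcal{N}(0,1), hence
\begin{align*}
  \mathbb{E}\!\left[v_i^2\right] = 1
  \quad\Longrightarrow\quad
  \mathbb{E}\!\left[d\xi_i\right] = 0 ,
\end{align*}
% i.e., each \xi_i self-adjusts until its own momentum coordinate sits at
% unit temperature, absorbing the stochastic-gradient noise in that coordinate.
```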

Page 14:

Scalable Posterior Inference

Stochastic Gradient Nosé-Hoover Thermostats (SGNHT):

Input: text documents, i.e., a count matrix X.
Random initialization.
for t = 1 to ∞ do
    Ψ_g^(t+1) = Ψ_g^(t) + v^(t) h.
    v^(t+1) = v^(t) + f(Ψ_g^(t+1)) h − Ξ^(t) v^(t) h + √(2Dh) N(0, I).
    Ξ^(t+1) = Ξ^(t) + (q^(t+1) − I) h, where q = diag(v_1², ..., v_M²).
end for

Discussion:
- BCDF: ease of implementation, but requires conditional densities for all the parameters.
- SGNHT: more general and robust, with fast convergence.
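A minimal sketch of the update rule above on a toy target with U(Ψ) = (1/2)ΨᵀΨ, so f(Ψ) = −Ψ; the step size h and diffusion D are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(3)
M, h, D = 10, 1e-2, 1.0                  # dimension, step size, diffusion (illustrative)

psi = rng.standard_normal(M)             # parameters Psi_g
v = np.zeros(M)                          # momenta
xi = np.full(M, D)                       # thermostat variables (diagonal of Xi)

def f(psi):
    return -psi                          # f = -grad U for the toy target U = psi^T psi / 2

samples = []
for t in range(20000):
    psi = psi + v * h
    v = v + f(psi) * h - xi * v * h + np.sqrt(2.0 * D * h) * rng.standard_normal(M)
    xi = xi + (v * v - 1.0) * h          # dXi = (q - I) dt, componentwise
    samples.append(psi.copy())

burned = np.array(samples)[10000:]
print(burned.mean(axis=0), burned.var(axis=0))  # should be near 0 and 1 for this target
```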

Page 16:

Experiments

Datasets:
- 20 Newsgroups: 20K documents with a vocabulary size of 2K.
- RCV1-v2: 800K documents with a vocabulary size of 10K.
- Wikipedia: 10M documents with a vocabulary size of 8K.

Page 17:

Experiments

Quantitative Evaluation:

Table: 20 Newsgroups.

MODEL        METHOD   DIM            PERP.
DPFA-SBN-t   GIBBS    128-64-32      827
DPFA-SBN     GIBBS    128-64-32      846
DPFA-SBN     SGNHT    128-64-32      846
DPFA-RBM     SGNHT    128-64-32      896
DPFA-SBN     BCDF     128-64-32      905
DPFA-SBN     GIBBS    128-64         851
DPFA-SBN     SGNHT    128-64         850
DPFA-RBM     SGNHT    128-64         893
DPFA-SBN     BCDF     128-64         896
LDA          GIBBS    128            893
NB-FTM       GIBBS    128            887
RSM          CD5      128            877
nHDP         SVB      (10,10,5)      889

Table: RCV1-v2 & Wikipedia.

MODEL        METHOD   DIM            RCV    WIKI
DPFA-SBN     SGNHT    1024-512-256   964    770
DPFA-SBN     SGNHT    512-256-128    1073   799
DPFA-SBN     SGNHT    128-64-32      1143   876
DPFA-RBM     SGNHT    128-64-32      920    942
DPFA-SBN     BCDF     128-64-32      1149   986
LDA          BCDF     128            1179   1059
NB-FTM       BCDF     128            1155   991
RSM          CD5      128            1171   1001
nHDP         SVB      (10,5,5)       1041   932

Page 18:

Experiments

Quantitative Evaluation:

[Line plots: test perplexity as a function of iteration number (20 News) and of the number of documents seen (RCV1-v2, Wikipedia), comparing LDA, NB-FTM, DPFA-SBN (Gibbs), DPFA-SBN (BCDF), DPFA-SBN (SGNHT), and DPFA-RBM (SGNHT).]

Sensitivity Analysis:

[Line plots: test perplexity vs. number of documents seen for mini-batch sizes between 1 and 100.]

Figure: Perplexities. (Left) 20 News. (Middle) RCV1-v2. (Right) Wikipedia.

Page 19:

Experiments

Topics we learned on 20 Newsgroups:

T1      T3      T8         T9         T10       T14     T15      T19       T21      T24
year    people  group      world      evidence  game    israel   software  files    team
hit     real    groups     country    claim     games   israeli  modem     port     mac
runs    simply  reading    countries  people    win     jews     port      ftp      player
good    world   newsgroup  germany    argument  cup     arab     mac       program  play
season  things  pro        nazi       agree     hockey  jewish   serial    format   teams

T25        T26       T29      T40    T41          T43         T50       T54     T55      T64
god        fire      people   wrong  image        boston      problem   card    windows  turkish
existence  fbi       life     doesn  program      toronto     work      video   dos      armenian
exist      koresh    death    jim    application  montreal    problems  memory  file     armenians
human      children  kill     agree  widget       chicago     system    mhz     win      turks
atheism    batf      killing  quote  color        pittsburgh  fine      bit     ms       armenia

T65    T69      T78     T81    T91       T94     T112      T118    T120   T126
truth  window   drive   makes  question  code    children  people  men    sex
true   server   disk    power  answer    mit     father    make    women  sexual
point  display  scsi    make   means     comp    child     person  man    cramer
fact   manager  hard    doesn  true      unix    mother    things  hand   gay
body   client   drives  part   people    source  son       feel    world  homosexual

Page 20:

Experiments

Visualization: Sports, Computers, and Politics/Law.

Figure: Graphs induced by the correlation structure learned by DPFA-SBN for the 20 Newsgroups. (Three panels over the same set of topic nodes, highlighting the Sports, Computers, and Politics/Law clusters, respectively.)

Page 21:

Summary

Model: Deep Poisson Factor Analysis (DPFA)
- PFA is employed to interact with the data at the bottom layer.
- A deep SBN or RBM serves as a flexible prior for revealing topic structure.

Scalable Inference:
- Bayesian conditional density filtering.
- Stochastic gradient thermostats.

https://github.com/zhegan27/dpfa_icml2015

Page 22:

Questions?

Page 23:

References I

Blei, D. M., Griffiths, T., Jordan, M. I., and Tenenbaum, J. B. (2004). Hierarchical topic models and the nested Chinese restaurant process. NIPS.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. JMLR.

Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., and Neven, H. (2014). Bayesian sampling using stochastic gradient thermostats. NIPS.

Guhaniyogi, R., Qamar, S., and Dunson, D. B. (2014). Bayesian conditional density filtering. arXiv:1401.3632.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation.

Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation.

Hinton, G. E. and Salakhutdinov, R. (2011). Discovering binary codes for documents by learning deep generative models. Topics in Cognitive Science.

Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence.

Paisley, J., Wang, C., Blei, D. M., and Jordan, M. I. (2015). Nested hierarchical Dirichlet processes. PAMI.

Page 24:

References II

Salakhutdinov, R. and Hinton, G. E. (2009a). Deep Boltzmann machines. AISTATS.

Salakhutdinov, R. and Hinton, G. E. (2009b). Replicated softmax: an undirected topic model. NIPS.

Srivastava, N., Salakhutdinov, R., and Hinton, G. E. (2013). Modeling documents with deep Boltzmann machines. UAI.

Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. JASA.

Williamson, S., Wang, C., Heller, K., and Blei, D. M. (2010). The IBP compound Dirichlet process and its application to focused topic modeling. ICML.

Zhou, M. and Carin, L. (2015). Negative binomial process count and mixture modeling. PAMI.

Zhou, M., Hannah, L., Dunson, D., and Carin, L. (2012). Beta-negative binomial process and Poisson factor analysis. AISTATS.
