Neurocomputing 399 (2020) 296–306
Nonparametric Topic Modeling with Neural Inference
Xuefei Ning a, Yin Zheng b, Zhuxi Jiang c, Yu Wang a, Huazhong Yang a, Junzhou Huang d, Peilin Zhao d,∗
a Tsinghua University, China; b Weixin Group, Tencent, China; c Momenta, China; d Tencent AI Lab, China
Article history: Received 10 April 2019; Revised 26 November 2019; Accepted 27 December 2019; Available online 28 February 2020. Communicated by Dr. Lixin Duan.
Abstract

This work focuses on combining nonparametric topic models with Auto-Encoding Variational Bayes (AEVB). Specifically, we first propose iTM-VAE, where the topics are treated as trainable parameters and the document-specific topic proportions are obtained by a stick-breaking construction. The inference of iTM-VAE is modeled by neural networks such that it can be computed in a simple feed-forward manner. We also describe how to introduce a hyper-prior into iTM-VAE so as to model the uncertainty of the prior parameter. The hyper-prior technique is quite general, and we show that it can be applied to other AEVB-based models to alleviate the collapse-to-prior problem elegantly. Moreover, we propose HiTM-VAE, where the document-specific topic distributions are generated in a hierarchical manner. HiTM-VAE is even more flexible and can generate topic representations with better variability and sparsity. Experimental results on the 20News and Reuters RCV1-V2 datasets show that the proposed models outperform state-of-the-art baselines significantly. The advantages of the hyper-prior technique and the hierarchical model construction are also confirmed by experiments.
© 2020 Elsevier B.V. All rights reserved.
1. Introduction
Probabilistic topic models focus on discovering the abstract
“topics” that occur in a collection of documents, and represent a
document as a weighted mixture of the discovered topics. Classi-
cal topic models [5] have achieved success in a range of applica-
tions [5,38,40,46] . A major challenge of topic models is that the
inference of the distribution over topics does not have a closed-
form solution and must be approximated, using either MCMC sam-
pling or variational inference. When small changes are made to the model, the inference algorithm must be re-derived. In contrast, black-box inference methods [21,32,37,39] require only limited model-specific analysis and can be flexibly applied to new models.
Among all the black-box inference methods, Auto-Encoding
Variational Bayes (AEVB) [21,39] is a promising one for topic mod-
els. AEVB contains an inference network that can map a document
directly to a variational posterior without the need for further
∗ Corresponding author.
E-mail addresses: [email protected] (X. Ning), [email protected]
(Y. Zheng), [email protected] (Z. Jiang), [email protected] (Y.
Wang), [email protected] (H. Yang), [email protected] (J. Huang),
[email protected] (P. Zhao).
https://doi.org/10.1016/j.neucom.2019.12.128
0925-2312/© 2020 Elsevier B.V. All rights reserved.
local variational updates on test data, and the Stochastic Gradient Variational Bayes (SGVB) estimator allows efficient approximate inference for a broad class of posteriors, which makes topic models more flexible. Hence, an increasing number of models have been proposed recently to combine topic models with AEVB, such as [8,29,30,43].

Although these AEVB-based topic models achieve promising performance, the number of topics, which is important to the performance of these models, has to be specified manually with model selection methods. Nonparametric models, however, have the ability to adapt the topic number to the data. For example, Yee Whye Teh and Blei [47] proposed the Hierarchical Dirichlet Process (HDP), which models each document with a Dirichlet Process (DP); all the DPs for the documents in a corpus share a base distribution that is itself sampled from a DP. HDP has a potentially infinite number of topics and allows the number to grow as more documents are observed. It is appealing that nonparametric topic models can also be equipped with AEVB techniques to enjoy the benefits brought by neural black-box inference. We make progress on this problem by proposing the infinite Topic Model with Variational Auto-Encoders (iTM-VAE), which is a nonparametric topic model with AEVB.

For nonparametric topic models with a stick-breaking prior [41], the concentration parameter α plays an important role in deciding
Table 1
The notations used in this paper.

Section 3.1
  α          the concentration parameter in Beta/GEM
  π_k        the k-th atom weight
  ν_k        the k-th sample from the Beta prior
  θ_k        the k-th topic
  φ_k        the unconstrained parameter for θ_k
  x^(j)      the j-th document
  G^(j)      the document-specific topic distribution for x^(j)
  w_n^(j)    the n-th word in x^(j)
Section 3.2
  a_k, b_k   the Kumaraswamy variational parameters for ν_k
  ψ          the parameters of the inference neural network
Section 3.3
  s_1, s_2   the prior parameters of the Gamma hyper-prior on α
  γ_1, γ_2   the Gamma variational parameters for α
Section 4
  γ          the corpus-level concentration parameter
  β′_i       the i-th sample from the corpus-level Beta prior
  β_i        the i-th corpus-level atom weight
  c_k^(j)    the indicator variables for the j-th document x^(j)
  ϕ_k        the multinomial variational parameters for c_k^(j)
  u_i, v_i   the corpus-level Beta variational parameters for β′_i
the growth of the number of topics.¹ The larger α is, the more topics the model tends to discover. Hence, one can place a hyper-prior [2] over α such that the model can adapt it to the data [4,10,47]. Moreover, the hyper-prior technique plays a two-fold role in topic models powered by AEVB, because the AEVB framework suffers from the problem that the latent representation tends to collapse to the prior [6,9,42]; in our case, this means that the prior parameter α will tightly control the number of discovered topics, especially when the decoder is strong. Common heuristic tricks to alleviate this issue are 1) KL-annealing [42] and 2) decoder regularizing [6]. Introducing a hyper-prior into the AEVB framework is nontrivial and not yet well explored in the community. In this paper, we show that, as a natural relaxation of the prior, the hyper-prior technique can alleviate the collapse-to-prior issue in the training process and increase the adaptive capability of the model.²

To further increase the flexibility of iTM-VAE, we propose HiTM-VAE, which models the document-specific topic distribution in a hierarchical manner. This hierarchical construction can help to generate topic representations with better variability and sparsity, which is more suitable for handling heterogeneous documents.
The main contributions of the paper are:
• We propose iTM-VAE and iTM-VAE-Prod, which are two
novel nonparametric topic models equipped with AEVB, and
outperform the state-of-the-art models on the benchmarks.
• We propose iTM-VAE-HP, in which a hyper-prior helps the
model to adapt the prior parameter to data. We also show
that this technique can help other AEVB-based models to al-
leviate the collapse-to-prior problem elegantly.
• We propose HiTM-VAE, which is a hierarchical extension
of iTM-VAE. This construction and its corresponding AEVB-
based inference method can help the model to learn more
topics and produce topic proportions with higher variability
and sparsity.
2. Related work

Topic models have been studied extensively in a variety of applications such as document modeling, information retrieval, computer vision and bioinformatics [3,5,36,38,40,46]. Recently, with the impressive success of deep learning, neural topic models [12,23,32] have achieved encouraging performance in document modeling tasks. Although these models achieve competitive performance, they do not explicitly model the generative story of documents, and hence are less explainable.

Several recent works proposed to model the generative procedure explicitly, with the inference of the topic distributions computed by deep neural networks, which makes these models explainable, powerful and easily extendable. For example, Srivastava and Sutton [43] proposed AVITM, which embeds the original Latent Dirichlet Allocation (LDA) [5] formulation within AEVB. By utilizing a Laplace approximation for the Dirichlet distribution, AVITM can be optimized by the SGVB estimator efficiently. AVITM achieves state-of-the-art performance on the topic coherence metric [25], which indicates that the learned topics match closely to human judgment. Since the Laplace approximation cannot model the posterior sparsity effectively, other inference techniques for the Dirichlet distribution have been proposed. After decomposing the Dirichlet distribution into Gamma distributions, Joo et al. [16] proposed to use an approximation of the inverse Gamma CDF for the reparametrization trick. Zhang et al. [48] proposed to use the Weibull distribution in the inference of a

¹ Please refer to Section 3.1 for more details about the concentration parameter.
² The hyper-prior technique can also alleviate the collapse-to-prior issue in other scenarios; an example is demonstrated in Appendix B.
multi-layer generative model for learning hierarchical latent representations. Burkhardt and Kramer [7] proposed to decouple the sparsity and smoothness variational parameters and to use an efficient rejection sampler for the Gamma random variables.

Nonparametric topic models [1,17,27,44,47] potentially have an infinite topic capacity and can adapt the topic number to the data. Nalisnick and Smyth [34] proposed the Stick-Breaking VAE (SB-VAE), a Bayesian nonparametric version of the traditional VAE with a stochastic dimensionality. iTM-VAE differs from SB-VAE in three aspects: 1) iTM-VAE is a topic model for discrete text data; 2) a hyper-prior is introduced into the AEVB framework to increase the adaptive capability; 3) a hierarchical extension of iTM-VAE is proposed to further increase the flexibility. Miao et al. [29] proposed GSM, GSB, RSB and RSB-TF to model documents. RSB-TF uses a heuristic indicator to guide the growth of the topic number, and can adapt the topic number to the data.
3. The iTM-VAE model

In this section, we describe the generative and inference procedures of iTM-VAE and iTM-VAE-Prod in Section 3.1 and Section 3.2. Then, Section 3.3 describes the hyper-prior extension iTM-VAE-HP. The notations used in this paper are summarized in Table 1, and the abbreviations of the various methods are summarized in Appendix F.

3.1. The generative procedure of iTM-VAE

Suppose the atom weights π = {π_k}_{k=1}^∞ are drawn from a GEM distribution [33], i.e. π ∼ GEM(α), where the GEM distribution is defined as:

ν_k ∼ Beta(1, α),   π_k = ν_k ∏_{l=1}^{k−1} (1 − ν_l) = ν_k (1 − Σ_{l=1}^{k−1} π_l)   (1)

Let θ_k = σ(φ_k) denote the k-th topic, which is a multinomial distribution over the vocabulary, where φ_k ∈ R^V is the parameter of θ_k, σ(·) is the softmax function and V is the vocabulary size. In iTM-VAE, there is an unlimited number of topics, and we denote Θ = {θ_k}_{k=1}^∞ and Φ = {φ_k}_{k=1}^∞ as the collections of these countably infinite topics and the corresponding parameters. The generation of a document x^(j) = w^(j)_{1:N^(j)} by iTM-VAE can then be mathematically described as:
• Get the document-specific G^(j)(θ; π^(j), Θ) = Σ_{k=1}^∞ π_k^(j) δ_{θ_k}(θ), where π^(j) ∼ GEM(α)
• For each word w_n in x^(j): 1) draw a topic θ̂_n ∼ G^(j)(θ; π^(j), Θ); 2) w_n ∼ Cat(θ̂_n)

where α is the concentration parameter, Cat(θ̂_n) is a categorical distribution parameterized by θ̂_n, and δ_{θ_k}(θ) is a discrete Dirac function that equals 1 when θ = θ_k and 0 otherwise. In the following, we omit the superscript (j) for simplicity.

Thus, the joint probability of w_{1:N} = {w_n}_{n=1}^N, θ̂_{1:N} = {θ̂_n}_{n=1}^N and π can be written as:

p(w_{1:N}, π, θ̂_{1:N} | α, Θ) = p(π|α) ∏_{n=1}^N p(w_n|θ̂_n) p(θ̂_n|π, Θ)   (2)

where p(π|α) = GEM(α), p(θ̂_n|π, Θ) = G(θ̂_n; π, Θ) and p(w_n|θ̂_n) = Cat(θ̂_n).

Similar to [43], we collapse the variables θ̂_{1:N} and rewrite Eq. 2 as:

p(w_{1:N}, π | α, Θ) = p(π|α) ∏_{n=1}^N p(w_n|π, Θ)   (3)

where p(w_n|π, Θ) = Cat(θ̄) and θ̄ = Σ_{k=1}^∞ π_k θ_k.

In Eq. 3, θ̄ is a mixture of multinomials. This formulation cannot make predictions sharper than the distributions being mixed [12], which may result in some topics of poor quality. Replacing the mixture of multinomials with a weighted product of experts is one method to make sharper predictions [11,43]. Hence, a product-of-experts version of iTM-VAE (i.e. iTM-VAE-Prod) can be obtained by simply computing θ̂ for each document as θ̂ = σ(Σ_{k=1}^∞ π_k φ_k).
3.2. The inference procedure of iTM-VAE

In this section, we describe the inference procedure of iTM-VAE, i.e. how to draw π given a document w_{1:N}. Suppose ν = [ν_1, ν_2, ..., ν_{K−1}] is a (K−1)-dimensional vector, where ν_k is a random variable sampled from a Kumaraswamy distribution κ(ν; a_k, b_k) parameterized by a_k and b_k [22,34]. iTM-VAE models the joint distribution q_ψ(ν|w_{1:N}) as:³

[a_1, ..., a_{K−1}; b_1, ..., b_{K−1}] = g(w_{1:N}; ψ)   (4)

q_ψ(ν|w_{1:N}) = ∏_{k=1}^{K−1} κ(ν_k; a_k, b_k)   (5)

where g(w_{1:N}; ψ) is a neural network with parameters ψ. Then, π = {π_k}_{k=1}^K can be drawn by:

ν ∼ q_ψ(ν|w_{1:N})   (6)

π = (π_1, π_2, ..., π_{K−1}, π_K) = (ν_1, ν_2(1−ν_1), ..., ν_{K−1} ∏_{n=1}^{K−2} (1−ν_n), ∏_{n=1}^{K−1} (1−ν_n))   (7)

In the above procedure, we truncate the infinite sequence of mixture weights π = {π_k}_{k=1}^∞ at K elements, and ν_K is always set to 1 to ensure Σ_{k=1}^K π_k = 1. Notably, as discussed in [4], the truncation of the variational posterior does not indicate that we are

³ Ideally, the Beta distribution would be the most suitable candidate, since iTM-VAE assumes π is drawn from a GEM distribution in the generative procedure. However, as the Beta distribution does not satisfy the differentiable non-centered parameterization (DNCP) [19] requirement of SGVB [21], we use the Kumaraswamy distribution instead.
using a finite-dimensional prior, since we never truncate the GEM prior. Hence, iTM-VAE still has the ability to model the uncertainty of the number of topics and adapt it to data [34]. iTM-VAE can be optimized by maximizing the Evidence Lower Bound (ELBO):

L(w_{1:N} | Θ, ψ) = E_{q_ψ(ν|w_{1:N})}[log p(w_{1:N}|π, Θ)] − KL(q_ψ(ν|w_{1:N}) || p(ν|α))   (8)

where p(ν|α) is the product of K−1 Beta(1, α) probability density functions. The details of the optimization can be found in Appendix C. The optimization procedure of iTM-VAE is summarized in Algorithm 1.
Algorithm 1 The optimization procedure of iTM-VAE.
1: EPOCH: the total number of epochs
2: epoch = 0
3: while epoch < EPOCH do
4:   for all w_{1:N} in dataset do
5:     [a_1, ..., a_{K−1}; b_1, ..., b_{K−1}] = g(w_{1:N}; ψ)
6:     q_ψ(ν|w_{1:N}) = ∏_{k=1}^{K−1} κ(ν_k; a_k, b_k)
7:     ν ∼ q_ψ(ν|w_{1:N})
8:     π = (ν_1, ν_2(1−ν_1), ..., ν_{K−1} ∏_{n=1}^{K−2}(1−ν_n), ∏_{n=1}^{K−1}(1−ν_n))
9:     ψ = ψ + η_ψ ∇_ψ (log p(w_{1:N}|π, Θ) − KL(q_ψ(ν|w_{1:N}) || p(ν|α)))
10:    Φ = Φ + η_Φ ∇_Φ log p(w_{1:N}|π, Θ)
11:  end for
12:  epoch = epoch + 1
13: end while
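A minimal NumPy sketch of the quantities on lines 7-9 of Algorithm 1 may help: it draws ν via the Kumaraswamy inverse CDF (the DNCP property noted in footnote 3), builds π by stick breaking, and forms a single-sample SGVB estimate of the ELBO in Eq. 8 with the iTM-VAE-Prod decoder. All names are ours, and the KL term is estimated with one Monte Carlo sample rather than an analytic form:

```python
import numpy as np

def log_softmax(z):
    m = z.max()
    return z - m - np.log(np.exp(z - m).sum())

def elbo_single_sample(word_counts, a, b, phi, alpha, rng):
    """One-sample SGVB estimate of Eq. 8 with a product-of-experts decoder.

    word_counts: (V,) bag-of-words vector for one document.
    a, b: (K-1,) Kumaraswamy parameters (in practice, outputs of g(w; psi)).
    phi: (K, V) unconstrained topic parameters.
    """
    # Line 7: reparameterized nu ~ Kumaraswamy(a, b) via the closed-form inverse CDF
    u = rng.uniform(low=1e-8, high=1.0 - 1e-8, size=a.shape)
    nu = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

    # Line 8: stick-breaking pi, with nu_K = 1 closing the stick
    nu_full = np.append(nu, 1.0)
    pi = nu_full * np.concatenate(([1.0], np.cumprod(1.0 - nu_full[:-1])))

    # log p(w_{1:N} | pi, Phi): theta-hat = softmax(pi @ phi) (iTM-VAE-Prod)
    log_lik = word_counts @ log_softmax(pi @ phi)

    # Single-sample KL(q || p) = log q_Kumaraswamy(nu) - log p_Beta(1, alpha)(nu)
    log_q = np.sum(np.log(a * b) + (a - 1.0) * np.log(nu)
                   + (b - 1.0) * np.log1p(-nu ** a))
    log_p = np.sum(np.log(alpha) + (alpha - 1.0) * np.log1p(-nu))
    return log_lik - (log_q - log_p)

rng = np.random.default_rng(1)
V, K = 30, 10
word_counts = rng.integers(0, 5, size=V).astype(float)
phi = rng.normal(size=(K, V))
a = np.full(K - 1, 1.2)
b = np.full(K - 1, 3.0)
elbo = elbo_single_sample(word_counts, a, b, phi, alpha=5.0, rng=rng)
```

Because the Kumaraswamy CDF inverts in closed form, the sample `nu` is a differentiable function of `u`, `a` and `b`, which is exactly what line 9's gradient step requires.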
3.3. Modeling the uncertainty of the prior parameter

In the generative procedure, the concentration parameter α of GEM(α) can have a significant impact on the growth of the number of topics. The larger α is, the more "breaks" it will create, and consequently, more topics will be used. Hence, it is generally reasonable to place a hyper-prior on α to model its uncertainty [4,10,47]. For example, Escobar and West [10] placed a Gamma hyper-prior on α for the urn-based samplers and implemented the corresponding Gibbs updates with auxiliary-variable methods. Blei et al. [4] also placed a Gamma prior on α and derived a closed-form update for the variational parameters. Different from previous work, we introduce the hyper-prior into the AEVB framework and propose to optimize the model jointly by stochastic gradient descent (SGD) methods.

Concretely, since the Gamma distribution is conjugate to Beta(1, α), we place a Gamma(s_1, s_2) prior on α. Then the ELBO of iTM-VAE-HP can be written as:

L(w_{1:N} | Θ, ψ) = E_{q_ψ(ν|w_{1:N})}[log p(w_{1:N}|π, Θ)]
  + E_{q_ψ(ν|w_{1:N}) q(α|γ_1,γ_2)}[log p(ν|α)]
  − E_{q_ψ(ν|w_{1:N})}[log q_ψ(ν|w_{1:N})]
  − KL(q(α|γ_1, γ_2) || p(α|s_1, s_2))   (9)

where p(α|s_1, s_2) = Gamma(s_1, s_2), p(ν_k|α) = Beta(1, α), and q(α|γ_1, γ_2) is the corpus-level variational posterior for α. The derivation of Eq. 9 can be found in Appendix D. In our experiments, we find that iTM-VAE-Prod always performs better than iTM-VAE; therefore we only report the performance of iTM-VAE-Prod with the hyper-prior, and refer to this variant as iTM-VAE-HP. As discussed in Section 1, the hyper-prior technique can also be applied to other AEVB-based models to alleviate the collapse-to-prior problem. In Appendix B, we show that by introducing a hyper-prior into SB-VAE, more latent units can be activated and the model achieves better performance.
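For reference, two α-related quantities needed for Eq. 9 have simple forms under the Gamma posterior q(α|γ_1, γ_2): E_q[α] = γ_1/γ_2, E_q[log α] = ψ(γ_1) − log γ_2, and the Gamma-Gamma KL is available in closed form. The sketch below is illustrative, not the authors' code; the helper names are ours, a shape/rate parameterization is assumed, and SciPy supplies the special functions:

```python
import numpy as np
from scipy.special import digamma, gammaln

def gamma_kl(g1, g2, s1, s2):
    """KL( Gamma(g1, g2) || Gamma(s1, s2) ), shape/rate parameterization.

    This is the last term of Eq. 9, KL(q(alpha|gamma_1, gamma_2) || p(alpha|s_1, s_2)).
    """
    return ((g1 - s1) * digamma(g1) - gammaln(g1) + gammaln(s1)
            + s1 * (np.log(g2) - np.log(s2)) + g1 * (s2 - g2) / g2)

def expected_log_prior(nu, g1, g2):
    """E_{q(alpha)}[log p(nu | alpha)] for one sampled nu, with p(nu_k|alpha) = Beta(1, alpha).

    log Beta(1, alpha) density is log(alpha) + (alpha - 1) log(1 - nu), so the
    expectation only needs E_q[alpha] = g1/g2 and E_q[log alpha] = digamma(g1) - log(g2).
    """
    e_alpha = g1 / g2
    e_log_alpha = digamma(g1) - np.log(g2)
    return float(np.sum(e_log_alpha + (e_alpha - 1.0) * np.log1p(-nu)))
```

Both terms are differentiable in γ_1 and γ_2, which is what allows the joint SGD optimization described above.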
⁴ http://qwone.com/~jason/20Newsgroups/
⁵ http://trec.nist.gov/data/reuters/reuters.html
⁶ In these baselines, at most 200 topics are used. Please refer to Table 2 for details.
4. Hierarchical iTM-VAE

In this section, we describe the generative and inference procedures of HiTM-VAE in Section 4.1 and Section 4.2. The relationship between iTM-VAE and HiTM-VAE is discussed in Section 4.3.

4.1. The generative procedure of HiTM-VAE

The generation of a document by HiTM-VAE is described as follows:

• Get the corpus-level base distribution G^(0): β ∼ GEM(γ); G^(0)(θ; β, Θ) = Σ_{i=1}^∞ β_i δ_{θ_i}(θ)
• For each document x^(j) = w^(j)_{1:N^(j)} in the corpus:
  • Draw the document-level stick-breaking weights π^(j) ∼ GEM(α)
  • Draw document-level atoms ζ_k^(j) ∼ G^(0), k = 1, ..., ∞; then we get a document-specific distribution G^(j)(θ; π^(j), {ζ_k^(j)}_{k=1}^∞, Θ) = Σ_{k=1}^∞ π_k^(j) δ_{ζ_k^(j)}(θ)
  • For each word w_n in the document: 1) draw a topic θ̂_n ∼ G^(j); 2) w_n ∼ Cat(θ̂_n)
To sample the document-level atoms ζ^(j) = {ζ_k^(j)}_{k=1}^∞, a series of indicator variables c^(j) = {c_k^(j)}_{k=1}^∞ is drawn i.i.d.: c_k^(j) ∼ Cat(β). Then, the document-level atoms are ζ_k^(j) = θ_{c_k^(j)}.

Let D and N^(j) denote the size of the dataset and the number of words in each document x^(j), respectively. After collapsing the per-word assignment random variables {{θ̂_n^(j)}_{n=1}^{N^(j)}}_{j=1}^D, the joint probability of the corpus-level atom weights β, the documents X = {x^(j)}_{j=1}^D, the stick-breaking weights Π = {π^(j)}_{j=1}^D and the indicator variables C = {c^(j)}_{j=1}^D can be written as:

p(β, X, Π, C | γ, α, Θ) = p(β|γ) ∏_{j=1}^D p(π^(j)|α) p(c^(j)|β) p(x^(j)|π^(j), c^(j), Θ)   (10)

where p(β|γ) = GEM(γ), p(π^(j)|α) = GEM(α), p(c^(j)|β) = Cat(β), and p(x^(j)|π^(j), c^(j), Θ) = ∏_{n=1}^{N^(j)} p(w_n^(j)|π^(j), c^(j), Θ) = ∏_{n=1}^{N^(j)} Cat(w_n^(j) | Σ_{k=1}^∞ π_k^(j) θ_{c_k^(j)}).
4.2. The inference procedure of HiTM-VAE

Setting the truncation levels of the corpus-level and document-level GEMs to T and K, HiTM-VAE models the per-document posterior q(ν, c | w_{1:N}) for every document w_{1:N} as:

[a_1, ..., a_{K−1}; b_1, ..., b_{K−1}; ϕ_1, ..., ϕ_K] = g(w_{1:N}; ψ)   (11)

q(ν, c | w_{1:N}) = q_ψ(ν|w_{1:N}) q_ψ(c|w_{1:N})   (12)

q_ψ(ν|w_{1:N}) = ∏_{k=1}^{K−1} κ(ν_k; a_k, b_k);   q_ψ(c|w_{1:N}) = ∏_{k=1}^K Cat(c_k; ϕ_k)   (13)

where g(w_{1:N}; ψ) is a neural network with parameters ψ, and ϕ_k = {ϕ_ki}_{i=1}^T are the multinomial variational parameters for each document-level indicator variable c_k. Then, π = {π_k}_{k=1}^K can be constructed by the stick-breaking process using ν.

As shown in Section 4.1, the generation of the corpus-level atom weights β is as follows:

β′_i ∼ Beta(1, γ);   β_i = β′_i ∏_{l=1}^{i−1} (1 − β′_l)   (14)

The corpus-level variational posterior for β′ with truncation level T is q(β′) = ∏_{i=1}^{T−1} Beta(β′_i | u_i, v_i), where {u_i, v_i}_{i=1}^{T−1} are the corpus-level variational parameters.

The ELBO of the training dataset can be written as:

L(D | Θ, ψ) = E_{q(β′)}[log (P(β′|γ) / q(β′|u, v))]
  + Σ_{j=1}^D { E_{q(ν^(j))}[log (P(ν^(j)|α) / q(ν^(j)))]
  + Σ_{k=1}^K E_{q(β′) q(c_k^(j)|ϕ_k^(j))}[log (P(c_k^(j)|β) / q(c_k^(j)|ϕ_k^(j)))]
  + E_{q(ν^(j)) q(c^(j))}[log P(x^(j)|ν^(j), c^(j), Θ)] }   (15)

where β = {β_i}_{i=1}^T, ν^(j) = {ν_k^(j)}_{k=1}^{K−1}, c^(j) = {c_k^(j)}_{k=1}^K, and ϕ_k^(j) = {ϕ_ki^(j)}_{i=1}^T. The details of the derivation of the ELBO can be found in Appendix E.

The Gumbel-Softmax estimator [15] is used for backpropagating through the categorical random variables c. Instead of training them jointly with the neural-network parameters, we use mean-field updates to learn the corpus-level variational parameters {u_i, v_i}_{i=1}^{T−1}:

u_i = 1 + Σ_{j=1}^D Σ_{k=1}^K ϕ_ki^(j);   v_i = γ + Σ_{j=1}^D Σ_{k=1}^K Σ_{l=i+1}^T ϕ_kl^(j)   (16)
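Eq. 16 is straightforward to implement once the variational parameters ϕ are arranged in an array; a NumPy sketch with our own array layout and function name:

```python
import numpy as np

def corpus_level_updates(phi, gamma):
    """Mean-field updates of Eq. 16 for the corpus-level Beta posteriors.

    phi: (D, K, T) array with phi[j, k, i] = q(c_k^(j) = i);
    gamma: the corpus-level concentration parameter.
    Returns u, v of length T-1, one Beta(u_i, v_i) factor per corpus-level stick.
    """
    counts = phi.sum(axis=(0, 1))             # expected usage count of each corpus atom
    u = 1.0 + counts[:-1]                     # u_i = 1 + sum_{j,k} phi_{ki}^{(j)}
    suffix = np.cumsum(counts[::-1])[::-1]    # suffix[i] = sum_{l >= i} counts[l]
    v = gamma + suffix[1:]                    # v_i = gamma + sum_{l > i} counts[l]
    return u, v
```

The suffix-sum trick avoids the explicit triple sum over l > i, so one update costs O(DKT) for the counts plus O(T) for the sticks.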
4.3. Discussion

In iTM-VAE, we get the document-specific topic distribution G^(j) by sampling the atom weights from a GEM. Instead of being drawn from a continuous base distribution, the atoms are modeled as trainable parameters as in [5,29,43]. Thus, the atoms are shared by all documents naturally, without the need for a hierarchical construction like HDP [47]. The hierarchical extension, HiTM-VAE, which models G^(j) in a hierarchical manner, is more flexible and can generate topic representations with better variability and sparsity. A detailed comparison is presented in Section 5.3.
5. Experiments

In this section, we evaluate the performance of iTM-VAE and its variants on two public benchmarks, 20News⁴ and Reuters RCV1-V2⁵, and demonstrate the advantages brought by the variants of iTM-VAE. The 20News dataset contains about 18,000 documents partitioned into 20 different classes, and RCV1-V2 [26] is a much bigger dataset composed of 804,414 documents manually categorized into 103 classes. To make a fair comparison, we use exactly the same data and vocabulary as [43].

The configuration of the experiments is as follows. We use a two-layer fully-connected neural network for g(w_{1:N}; ψ) of Eq. 4, and the number of hidden units is set to 256 and 512 for 20News and RCV1-V2, respectively. The concentration parameter α of the GEM distribution is cross-validated on the validation set from {10, 20, 30, 50} for iTM-VAE and iTM-VAE-Prod. The truncation level K in Eq. 7 is set to 200 so that the maximum number of topics never exceeds the ones used by the baselines,⁶ and we empirically find K = 200 is sufficiently large for the actually learned posteriors with these α on the two benchmarks. Batch-Renormalization [14] is used to stabilize the training procedure. Adam [18] is used to optimize the model, and the learning rate is set to 0.01. The code of iTM-VAE and its variants is available at http://anonymous .
Table 2
Comparison of perplexity (lower is better) and topic coherence (higher is better) between different topic models on the 20News and RCV1-V2 datasets.

                    Perplexity                Coherence
                    20News       RCV1-V2     20News        RCV1-V2
#Topics             50    200    50    200   50     200    50     200
LDA [13] a          893   1015   1062  1058  0.131  0.112  -      -
DocNADE [23]        797   804    856   670   0.086  0.082  0.079  0.065
HDP [45] a          937          918         -             -
NVDM [30] a         837   873    717   588   0.186  0.157  -      -
NVLDA [43]          1078  993    791   797   0.162  0.133  0.153  0.172
ProdLDA [43]        1009  989    780   788   0.236  0.217  0.252  0.179
GSM [29] a          787   829    653   521   0.223  0.186  -      -
GSB [29] a          816   815    712   544   0.217  0.171  -      -
RSB [29] a          785   792    662   534   0.224  0.177  -      -
RSB-TF [29] a       788          532         -             -
iTM-VAE             877          1124        0.205         0.218
iTM-VAE-Prod        775          508         0.278         0.300
iTM-VAE-HP          876          692         0.285         0.311
HiTM-VAE            912          747         0.290         0.270

a We take these results from [29] directly, since we use exactly the same datasets. The symbol "-" indicates that [29] does not provide the corresponding values. As this paper is based on variational inference, we do not compare with LDA and HDP using Gibbs sampling, which is usually time-consuming. Nonparametric methods adapt the topic number, so a single value is reported per dataset.
Table 3
Top 10 words of topics learned by iTM-VAE-Prod, without cherry picking.

Geography   turkish armenians turks armenia armenian turkey azerbaijan greek greece village
Sports      season team player nhl score playoff hockey game coach hitter
Religion    jesus bible god faith scripture christ doctrine belief eternal church
Space       orbit shuttle launch lunar spacecraft nasa satellite probe rocket moon
Hardware    scsi ide scsus motherboard ram controller upgrade meg cache floppy
Encryption  ripem escrow rsa des encrypt cipher privacy crypto chip nsa
Trade       shipping sale annual manual tor vs det cd price excellent
X           system window xterm font colormap server xlib widget xt windows toolkit
Hockey a    det tor buf cal pit que mon pt vs calgary
Health      msg b patient disease symptom doctor food pain mouse cancer hospital
Circuit     voltage puck connector signal amp input circuit pin wire connect
Lawsuit     gun homicide militia weapon amendment handgun criminal firearm crime knife
Traffic     bike brake car tire ride engine honda bmw rear motorcycle

a All these words are about hockey teams of different cities, e.g. "que" means Quebec.
b "msg" means monosodium glutamate.
5.1. Perplexity and topic coherence

Perplexity is widely used by topic models to measure the goodness-of-fit capability. It is defined as exp(−(1/D) Σ_{j=1}^D (1/|x^(j)|) log p(x^(j))), where D is the number of documents and |x^(j)| is the number of words in the j-th document x^(j). Following previous work, the variational lower bound is used to estimate the perplexity.
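Given per-document log-likelihood estimates (here, ELBO values standing in for log p(x^(j))), the definition above is a one-liner; a small sketch with our own function name:

```python
import numpy as np

def perplexity(log_px, doc_lengths):
    """exp( -(1/D) * sum_j (1/|x^(j)|) * log p(x^(j)) ).

    log_px: per-document log-likelihoods (in the paper, the variational
    lower bound stands in for the intractable log p(x^(j)));
    doc_lengths: per-document word counts |x^(j)|.
    """
    log_px = np.asarray(log_px, dtype=float)
    doc_lengths = np.asarray(doc_lengths, dtype=float)
    return float(np.exp(-np.mean(log_px / doc_lengths)))
```

Because the ELBO lower-bounds log p(x^(j)), the reported number is an upper bound on the true perplexity.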
As the quality of the learned topics is not directly reflected by perplexity [35], topic coherence is designed to match human judgment. We adopt NPMI [25] as the measurement of topic coherence, as is done in [29,43].⁷ We define a topic to be an Effective Topic if it becomes the top-1 significant topic of a sample in the training set more than τ × D times, where D is the training set size and τ is a ratio. We set τ to 0.5% in our experiments. Following [29], we use an average over the topic coherence computed with the top-5 and top-10 words across five random runs, which is more robust [24].
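Counting Effective Topics from a matrix of per-document topic proportions takes only a few lines; a sketch with our own names:

```python
import numpy as np

def effective_topics(topic_props, tau=0.005):
    """Count Effective Topics: topics that are the top-1 topic of more than
    tau * D documents, where topic_props is a (D, K) matrix of per-document
    topic proportions and tau = 0.5% as in the experiments."""
    D, K = topic_props.shape
    top1 = topic_props.argmax(axis=1)              # top-1 topic of each document
    counts = np.bincount(top1, minlength=K)        # how often each topic wins
    return int(np.sum(counts > tau * D))
```

This measure ignores topics that never dominate any document, which is what makes it a useful proxy for the number of topics the model actually uses.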
Table 2 shows the perplexity and topic coherence of different topic models on the 20News and RCV1-V2 datasets. We can clearly see that our models outperform the baselines, which indicates that our models have better goodness-of-fit capability and can discover topics that match more closely to human judgment. We can also see that HiTM-VAE achieves better perplexity than [45], in which a similar hierarchical construction is used. Note that comparing the ELBO-estimated perplexity of HiTM-VAE with other models directly is not suitable, as it has many more random variables, which usually leads to a looser bound and thus a higher estimated perplexity. The possible reasons for the good coherence achieved by our models are: 1) the product of experts enables the model to model sharper distributions; 2) the nonparametric characteristic means the models can adapt the number of topics to the data, so topics can be sufficiently trained and of high diversity.
The diversity of the learned topics is another desired property in many applications of topic modeling, as diverse topics often come with better latent topic representations of documents, suitable for discriminative tasks (e.g. document retrieval). Table 3 illustrates the highly coherent and diverse topics learned by iTM-VAE-Prod. In contrast, as we list in Table A.1 in

⁷ We use the code provided by [25] at https://github.com/jhlau/topic_interpretability/ to calculate the NPMI score.
Appendix A, there are a lot of redundant topics among the topics learned by ProdLDA [43]. As a result, the latent representation learned by ProdLDA has poor discriminative power. For an intuitive comparison of the quality of the latent topic representations, see Fig. 1 for the TSNE visualization of the topic representations learned by iTM-VAE-Prod and ProdLDA.
5.2. The effect of the hyper-prior on iTM-VAE

In this section, we provide quantitative evaluations of the effect of the hyper-prior on iTM-VAE. Specifically, a relatively non-informative hyper-prior Gamma(1, 0.05) is imposed on α, and we initialize the global variational parameters γ_1 and γ_2 of Eq. 9 to the same values as the non-informative Gamma prior. Thus, the expectation of α given the variational posterior q(α|γ_1, γ_2) is 20 before training. An SGD optimizer with a learning rate of 0.01 is used to optimize γ_1 and γ_2. No KL annealing or decoder regularization is used for iTM-VAE-HP.
Fig. 1. (a) The TSNE visualization of the representation learned by iTM-VAE-Prod. (b) The TSNE visualization of the representation learned by ProdLDA [43] with the best topic coherence on 20News (K = 50).
Fig. 2. Topic coverage w.r.t. the number of used topics learned by iTM-VAE-HP.
Fig. 3. Comparison of the topic coverage (a) and sparsity (b) between iTM-VAE-Prod (α = 5), iTM-VAE-Prod (α = 20) and HiTM-VAE (γ = 20, α = 5). We can see that HiTM-VAE can simultaneously discover more topics and produce sparser posterior topic proportions.
Table 4
The posterior distribution of α learned by iTM-VAE-HP on subsets of the 20News dataset.

#classes   γ_1     γ_2    E_q(α)[α]
1          16.88   4.58   3.68
2          23.03   3.68   6.25
5          31.43   2.88   10.93
10         39.64   2.69   14.71
20         48.91   2.98   16.39
⁸ Since there are no labels for the 20News dataset provided by [43], we preprocess the dataset ourselves in this illustrative experiment.
Table 4 reports the learned global variational parameters γ_1, γ_2 and the expectation of α given the variational posterior q(α|γ_1, γ_2) on several subsets of the 20News dataset, which contain 1, 2, 5, 10 and 20 classes, respectively.⁸ We can see that, once the training is done, the variational posterior q(α|γ_1, γ_2) is very confident, and E_{q(α|γ_1,γ_2)}[α], the expectation of α given the variational posterior, is adjusted to the training set. For example, if the training set contains only 1 class of documents, E_{q(α|γ_1,γ_2)}[α] after training is 3.68, whereas, when the training set consists of 10 classes of documents, E_{q(α|γ_1,γ_2)}[α] after training is 14.71. This indicates
Table A.1
Top 10 words of some redundant topics learned by ProdLDA [43].

Topics about Religion
1  jesus christian scripture faith god christ heaven christianity verse resurrection
2  jesus christ doctrine revelation verse scripture satan christian interpretation god
3  belief god passage scripture moral atheist christian truth principle jesus
4  god belief existence faith jesus atheist bible christian religion sin
5  jesus son holy christ father god doctrine heaven spirit prophet
6  homosexual marriage belief islam moral christianity truth islamic religion god

Topics about Hardware
7  floppy controller scsus ide scsi ram hd mb cache isa
8  printer meg adapter scsi motherboard windows modem mhz vga hd
9  ide mb connector controller isa scsi scsus floppy jumper disk
10 mb controller bio rom interface mhz scsus scsi floppy ide
11 ide meg motherboard shipping adapter simm hd mhz monitor scsi
12 ram controller dos bio windows disk scsi rom scsus meg
13 honda motherboard bike amp quadra hd brake apple upgrade meg

Topics about Lawsuit
20 morality truth moral objective absolute belief murder existence principle human
21 homicide vancouver seattle handgun firearm child states percent study file
22 murder moral constitution morality criminal objective rights gun law weapon

Topics about Politics
14 decision stephanopoulos president armenian gay package congress myers february armenians
15 armenian turkish genocide armenians turks jesus massacre muslim armenia muslims
16 armenians father gang soldier neighbor apartment girl armenian troops rape
17 muslim greek turks turkish armenian muslims village genocide armenia jews
18 armenian turks armenians armenia turkish muslim massacre village turkey greek
19 armenians armenian neighbor apartment woman soviet kill bus azerbaijan hide
...
that iTM-VAE-HP can learn to adjust α to the data, and thus the number of discovered topics adapts to the data better. In contrast, for iTM-VAE-Prod (without the hyper-prior), when the decoder is strong, no matter how many classes the dataset contains, the number of topics will be constrained tightly by the prior due to the collapse-to-prior problem of AEVB, and KL-annealing and decoder-regularizing tricks do not help much.
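Since the variational posterior q(α|γ1, γ2) is a Gamma distribution with shape γ1 and rate γ2, its mean is simply γ1/γ2, so the E_q[α] column of Table 4 can be reproduced directly from the reported parameters (a quick sketch; the results match the table up to rounding of the printed γ values):

```python
# The mean of a Gamma(gamma1, gamma2) posterior (shape/rate parameterization)
# is gamma1 / gamma2, which reproduces the E_q[alpha] column of Table 4.
table4 = {
    1: (16.88, 4.58),    # #classes: (gamma1, gamma2)
    2: (23.03, 3.68),
    5: (31.43, 2.88),
    10: (39.64, 2.69),
    20: (48.91, 2.98),
}
expected_alpha = {n: g1 / g2 for n, (g1, g2) in table4.items()}
```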
Fig. 2 illustrates the training set coverage w.r.t. the number of used topics when the training set contains 1, 2, 5, 10 and 20 classes, respectively. Specifically, we compute the average weight of every topic on the training dataset, and sort the topics according to their average weights. The topic coverage is then defined as the cumulative sum of these weights. Fig. 2 shows that, as the number of classes increases, more topics are utilized by iTM-VAE-HP to reach the same level of topic coverage, which indicates that the model has the ability to adapt to data.
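The coverage computation above can be sketched in a few lines; `theta` is a hypothetical (n_docs, n_topics) matrix of posterior topic proportions, used here only for illustration:

```python
import numpy as np

# Topic coverage as described above: average each topic's weight over the
# training documents, sort topics by that average, and take the cumulative sum.
def topic_coverage(theta):
    avg = theta.mean(axis=0)          # average weight of every topic
    order = np.argsort(avg)[::-1]     # most-used topics first
    return np.cumsum(avg[order])      # cumulative coverage curve

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.full(50, 0.1), size=1000)   # toy sparse proportions
coverage = topic_coverage(theta)
topics_for_95 = int(np.searchsorted(coverage, 0.95) + 1)  # topics for 95% coverage
```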
5.3. The evaluation of HiTM-VAE
In this section, by comparing the topic coverage and sparsity of
iTM-VAE-Prod and HiTM-VAE, we show that the hierarchical con-
struction can help the model to learn more topics, and produce
posterior topic representations with higher sparsity, which is de-
sirable in many applications [28] .
The model configurations are the same for iTM-VAE-Prod and
HiTM-VAE, except that α is set to 5 and 20 for iTM-VAE-Prod,
and γ = 20 , α = 5 for HiTM-VAE. For HiTM-VAE, the corpus-level
updates are done every 200 epochs on 20News, and 20 epochs on
RCV1-V2.
To compare the overall sparsity of the posterior topic representations of each model, we sort the topic weights of every training document and average them across the dataset. The logarithm of the average weights is then plotted against the topic index. As shown
in Fig. 3 , HiTM-VAE can learn more topics than iTM-VAE-Prod
( α = 20 ), and the sparsity of its posterior topic proportions is sig-
nificantly higher. iTM-VAE-Prod ( α = 5 ) has higher sparsity than
iTM-VAE-Prod ( α = 20 ). However, its sparsity is still lower than
HiTM-VAE with the same document-level concentration parameter
α, and it can only learn a small number of topics, which means
that there might exist rare topics that are not learned by the
model. The comparison of HiTM-VAE and iTM-VAE-Prod ( α = 5 )
shows that the superior sparsity not only comes from a smaller
per-document concentration hyper-parameter α, but also from the
flexibility brought by the hierarchical construction itself.
6. Conclusion
In this paper, we propose iTM-VAE and iTM-VAE-Prod, nonparametric topic models whose inference is modeled by Variational Auto-Encoders. Specifically, a stick-breaking prior is used to generate the atom weights of countably infinite shared topics, and the Kumaraswamy distribution is exploited so that the model can be optimized by the AEVB algorithm. We also propose iTM-VAE-HP, which introduces a hyper-prior into the VAE framework so that the model can adapt better to the data. This technique is general and can be incorporated into other VAE-based models to alleviate the collapse-to-prior problem. To further diversify the document-specific topic distributions, we use a hierarchical construction in the generative procedure, and we show that the resulting model, HiTM-VAE, can learn more topics and produce sparser posterior topic proportions. The advantage of iTM-VAE and its variants over traditional nonparametric topic models is that the inference is performed by feed-forward neural networks, which have rich representation capacity and require only limited knowledge of the data. Hence, it is easy to develop extensions and incorporate more information sources into the model. Experimental results on two public benchmarks show that iTM-VAE and its variants outperform the state-of-the-art baselines.
CRediT authorship contribution statement

Xuefei Ning: Software, Formal analysis, Investigation, Visualization. Yin Zheng: Conceptualization, Methodology, Writing - original draft, Resources. Zhuxi Jiang: Formal analysis, Investigation. Yu Wang: Writing - review & editing, Supervision. Huazhong Yang: Writing - review & editing, Supervision. Junzhou Huang: Project administration. Peilin Zhao: Writing - review & editing, Project administration.
Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment

This work was supported by the National Natural Science Foundation of China (No. U19B2019, 61832007, 61621091) and the project of Tsinghua University and Toyota Joint Research Center for AI Technology of Automated Vehicle (TT2019-01).
Appendix A. Learned Topics of ProdLDA

The learned topics of ProdLDA are listed in Table A.1.
Appendix B. The Effect of Hyper-Prior on SB-VAE

Introducing a hyper-prior into AEVB and performing global variational inference to learn the hyper-posterior is a general technique and can be applied to other VAE-based models in other scenarios. In this section, we show that the hyper-prior can also help SB-VAE to gain more adaptive power on the CIFAR-10 dataset.9

Specifically, for both SB-VAE and SB-VAE with hyper-prior (SB-VAE-HP), the encoder network is modeled as a 5-layer convolutional network, and the decoder is modeled as a 5-layer deconvolutional network. The truncation level is set to 1000 and the discretized logistic observation model [20] is used. We use the last 5000 training samples as the validation set, and a 50-epoch look-ahead is used for early stopping. Adamax [18] with a 0.001 learning rate is used as the optimizer. Table B.1 shows that with the hyper-prior technique, more latent units can be activated and the model can adapt better to datasets of different sizes. The hyper-prior can also be incorporated into other variants of VAE, and we leave the detailed evaluation for future work.
Table B.1
The comparison between vanilla SB-VAE and SB-VAE-HP on different subsets of CIFAR-10. The α of SB-VAE is set to 5 and the hyper-prior of SB-VAE-HP is set to Gamma(1.0, 0.2). 99% coverage is used to define the number of active units (AU) following [34]. bits/dim (lower is better) is calculated as $-\log_2 p(x)/D$, where D is the input dimensionality.

#classes   SB-VAE               SB-VAE-HP
           bits/dim   #AU       bits/dim   #AU      E_q[α]
1          6.73       40.7      6.62       56.9     10.4
5          6.66       58.6      6.45       85.2     17.7
10         6.59       63.0      6.34       100.1    21.7
Appendix C. The Evidence Lower Bound of iTM-VAE

In this section, we show how to compute the Evidence Lower Bound (ELBO) of iTM-VAE, which can be written as:

$$\mathcal{L}(w_{1:N} \mid \Theta, \psi) = \mathbb{E}_{q_\psi(\nu \mid w_{1:N})}\left[\log p(w_{1:N} \mid \pi, \Theta)\right] - \mathrm{KL}\left(q_\psi(\nu \mid w_{1:N}) \,\|\, p(\nu \mid \alpha)\right) \tag{C.1}$$

• $\mathbb{E}_{q_\psi(\nu \mid w_{1:N})}[\log p(w_{1:N} \mid \pi, \Theta)]$:
Similar to other VAE-based models, the SGVB estimator and the reparameterization trick can be used to approximate this intractable expectation and propagate the gradient flow into the inference network g. Specifically, we have:

$$\mathbb{E}_{q_\psi(\nu \mid w_{1:N})}\left[\log p(w_{1:N} \mid \pi, \Theta)\right] \approx \frac{1}{L}\sum_{l=1}^{L}\sum_{i=1}^{N}\log p(w_i \mid \pi^{(l)}, \Theta) \tag{C.2}$$

where L is the number of Monte Carlo samples in the SGVB estimator and can be set to 1, and N is the number of words in the document.
According to Section 3.2, $\pi^{(l)}$ can be obtained by

$$[a_1, \ldots, a_{K-1};\; b_1, \ldots, b_{K-1}] = g(w_{1:N}; \psi) \tag{C.3}$$

$$\nu_k \sim \kappa(\nu; a_k, b_k) \tag{C.4}$$

$$\pi = (\pi_1, \pi_2, \ldots, \pi_{K-1}, \pi_K) = \Big(\nu_1,\; \nu_2(1-\nu_1),\; \ldots,\; \nu_{K-1}\prod_{l=1}^{K-2}(1-\nu_l),\; \prod_{l=1}^{K-1}(1-\nu_l)\Big) \tag{C.5}$$
9 As the motivation of this experiment is to show the effect of the hyper-prior on SB-VAE, we use a much smaller and naive network architecture and do not compare with the state-of-the-art models.
where $g(w_{1:N}; \psi)$ is an inference network with parameters ψ, κ denotes the Kumaraswamy distribution, and K is the truncation level. Here we omit the superscript (l) for simplicity.
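The sampling steps of Eqs. (C.3)-(C.5) can be sketched as follows. The inference network g itself is omitted, and the Kumaraswamy parameters (a_k, b_k) are assumed given; a Kumaraswamy sample is drawn by the inverse-CDF reparameterization $\nu = (1-(1-u)^{1/b})^{1/a}$ with u ~ Uniform(0, 1):

```python
import numpy as np

def sample_kumaraswamy(a, b, rng):
    # Inverse-CDF reparameterization: differentiable in (a, b) for fixed u.
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(a))
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

def stick_breaking(nu):
    # pi_k = nu_k * prod_{l<k}(1 - nu_l); the K-th weight is the leftover stick.
    remainders = np.concatenate(([1.0], np.cumprod(1.0 - nu)))
    return np.append(nu * remainders[:-1], remainders[-1])

rng = np.random.default_rng(0)
a = np.full(9, 1.0)          # K - 1 = 9 sticks, truncation level K = 10
b = np.full(9, 5.0)
nu = sample_kumaraswamy(a, b, rng)
pi = stick_breaking(nu)      # topic proportions of length K, summing to one
```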
In our experiments, following [29], we factorize the parameter $\phi_k$ of topic $\theta_k$ as $\phi_k = t_k W$, where $t_k \in \mathbb{R}^H$ is the k-th topic factor vector, $W \in \mathbb{R}^{H \times V}$ is the word factor matrix, and H is the factor dimension. According to the generative procedure in Section 3.1, $p(w_i \mid \pi^{(l)}, \Theta)$ can be computed by

$$p(w_i \mid \pi^{(l)}, \Theta) = \begin{cases} \displaystyle\sum_{k=1}^{\infty}\pi_k^{(l)}\,\sigma(t_k W) \approx \sum_{k=1}^{K}\pi_k^{(l)}\,\sigma(t_k W) & \text{iTM-VAE} \\[2ex] \displaystyle\sigma\Big(\sum_{k=1}^{\infty}\pi_k^{(l)}\,t_k W\Big) \approx \sigma\Big(\sum_{k=1}^{K}\pi_k^{(l)}\,t_k W\Big) & \text{iTM-VAE-Prod} \end{cases} \tag{C.6}$$

where σ( · ) is the softmax function.
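On toy numbers, the two branches of Eq. (C.6) can be contrasted directly: iTM-VAE mixes the per-topic word distributions (a mixture of softmaxes), while iTM-VAE-Prod mixes the topic factors before the softmax (a product of experts). The values of t, W and π below are hypothetical random placeholders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
K, H, V = 4, 3, 6                      # topics, factor dim, vocabulary size
t = rng.normal(size=(K, H))            # topic factor vectors
W = rng.normal(size=(H, V))            # word factor matrix
pi = rng.dirichlet(np.ones(K))         # topic proportions

# iTM-VAE: mixture of the per-topic word distributions.
p_mix = sum(pi[k] * softmax(t[k] @ W) for k in range(K))
# iTM-VAE-Prod: softmax of the mixed logits (product of experts).
p_prod = softmax(sum(pi[k] * (t[k] @ W) for k in range(K)))
```

Both results are valid word distributions, but they are generally different; the product form tends to be sharper than the mixture.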
• $\mathrm{KL}(q_\psi(\nu \mid w_{1:N}) \,\|\, p(\nu \mid \alpha))$:
By applying the KL divergence of a Kumaraswamy distribution $\kappa(\nu; a_k, b_k)$ from a beta distribution $p(\nu; 1, \alpha)$, we have:

$$\begin{aligned}\mathrm{KL}\left(q_\psi(\nu \mid w_{1:N}) \,\|\, p(\nu \mid \alpha)\right) &= \sum_{k=1}^{K-1}\mathrm{KL}\left(q_\psi(\nu_k \mid w_{1:N}) \,\|\, p(\nu_k \mid \alpha)\right) \\ &= \sum_{k=1}^{K-1}\bigg[\frac{a_k-1}{a_k}\Big(-\gamma - \Psi(b_k) - \frac{1}{b_k}\Big) + \log a_k b_k + \log B(1, \alpha) \\ &\qquad + (\alpha-1)\sum_{m=1}^{\infty}\frac{b_k}{m + a_k b_k}B\Big(\frac{m}{a_k}, b_k\Big) - \frac{b_k-1}{b_k}\bigg]\end{aligned} \tag{C.7}$$

where B( ·, · ) is the Beta function and γ is Euler's constant.
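Eq. (C.7) can be checked numerically for a single stick: the closed form (with the infinite series truncated, here at 100 terms) should agree with a Monte Carlo estimate of $\mathbb{E}_q[\log q(\nu) - \log p(\nu)]$. A sketch using SciPy for the digamma and log-Beta functions, with arbitrary example values of (a, b, α):

```python
import numpy as np
from scipy.special import digamma, betaln

def kl_kumaraswamy_beta(a, b, alpha, n_terms=100):
    # Closed form of Eq. (C.7) for one stick; the infinite series is truncated.
    m = np.arange(1, n_terms + 1)
    series = np.sum(b / (m + a * b) * np.exp(betaln(m / a, b)))
    return ((a - 1.0) / a * (-np.euler_gamma - digamma(b) - 1.0 / b)
            + np.log(a * b) + betaln(1.0, alpha)
            + (alpha - 1.0) * series - (b - 1.0) / b)

# Monte Carlo estimate of KL = E_q[log q(nu) - log p(nu)] under the
# Kumaraswamy posterior q(nu; a, b) and the Beta(1, alpha) prior.
a, b, alpha = 2.0, 3.0, 5.0
rng = np.random.default_rng(0)
u = rng.uniform(1e-9, 1.0 - 1e-9, size=200_000)
nu = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
log_q = np.log(a * b) + (a - 1) * np.log(nu) + (b - 1) * np.log1p(-nu ** a)
log_p = np.log(alpha) + (alpha - 1) * np.log1p(-nu)
kl_mc = np.mean(log_q - log_p)
kl_closed = kl_kumaraswamy_beta(a, b, alpha)
```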
Appendix D. The Evidence Lower Bound of iTM-VAE-HP

In this section, we show how to compute the Evidence Lower Bound (ELBO) of iTM-VAE-HP, which can be written as:

$$\begin{aligned}\mathcal{L}(w_{1:N} \mid \Theta, \psi) ={}& \mathbb{E}_{q_\psi(\nu \mid w_{1:N})}\left[\log p(w_{1:N} \mid \pi, \Theta)\right] \\ &+ \mathbb{E}_{q_\psi(\nu \mid w_{1:N})\,q(\alpha \mid \gamma_1, \gamma_2)}\left[\log p(\nu \mid \alpha)\right] \\ &- \mathbb{E}_{q_\psi(\nu \mid w_{1:N})}\left[\log q_\psi(\nu \mid w_{1:N})\right] \\ &- \mathrm{KL}\left(q(\alpha \mid \gamma_1, \gamma_2) \,\|\, p(\alpha \mid s_1, s_2)\right)\end{aligned} \tag{D.1}$$

Specifically, each item in Eq. (D.1) can be obtained as follows:
• $\mathbb{E}_{q_\psi(\nu \mid w_{1:N})}[\log p(w_{1:N} \mid \pi, \Theta)]$:
The derivation is exactly the same as in Appendix C.

• $\mathbb{E}_{q_\psi(\nu \mid w_{1:N})\,q(\alpha \mid \gamma_1, \gamma_2)}[\log p(\nu \mid \alpha)]$:
Recall that the prior of the stick length variable $\nu_k$ is Beta(1, α), i.e. $p(\nu_k \mid \alpha) = \alpha(1-\nu_k)^{\alpha-1}$, and the variational posterior of the concentration parameter α is a Gamma distribution $q(\alpha; \gamma_1, \gamma_2)$. We have

$$\begin{aligned}&\mathbb{E}_{q_\psi(\nu \mid w_{1:N})\,q(\alpha \mid \gamma_1, \gamma_2)}\left[\log p(\nu \mid \alpha)\right] \\ &= \mathbb{E}_{q_\psi(\nu \mid w_{1:N})}\Big[\sum_{k=1}^{K-1}\mathbb{E}_{q(\alpha \mid \gamma_1, \gamma_2)}\left[\log\alpha + (\alpha-1)\log(1-\nu_k)\right]\Big] \\ &= (K-1)\,\mathbb{E}_{q(\alpha \mid \gamma_1, \gamma_2)}[\log\alpha] + \sum_{k=1}^{K-1}\frac{\gamma_1-\gamma_2}{\gamma_2}\,\mathbb{E}_{q_\psi(\nu_k \mid w_{1:N})}\left[\log(1-\nu_k)\right]\end{aligned} \tag{D.2}$$

Now, we provide more details about the calculation of the two expectations in Eq. (D.2):
◦ $\mathbb{E}_{q(\alpha \mid \gamma_1, \gamma_2)}[\log\alpha]$:
First, we can write the Gamma distribution $q(\alpha; \gamma_1, \gamma_2)$ in its exponential family form:

$$q(\alpha; \gamma_1, \gamma_2) = \frac{1}{\alpha}\exp\Big(-\gamma_2\alpha + \gamma_1\log\alpha - \big(\log\Gamma(\gamma_1) - \gamma_1\log\gamma_2\big)\Big) \tag{D.3}$$

Using the general fact that the derivative of the log normalizer $\log\Gamma(\gamma_1) - \gamma_1\log\gamma_2$ of an exponential family distribution with respect to its natural parameter $\gamma_1$ equals the expectation of the sufficient statistic $\log\alpha$, we can compute $\mathbb{E}_{q(\alpha \mid \gamma_1, \gamma_2)}[\log\alpha]$ in the first term of Eq. (D.2) as follows:

$$\mathbb{E}_{q(\alpha \mid \gamma_1, \gamma_2)}[\log\alpha] = \Psi(\gamma_1) - \log\gamma_2 \tag{D.4}$$

where Ψ is the digamma function, the first derivative of the log Gamma function.
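Eq. (D.4) is easy to verify numerically: sample from the Gamma posterior (note that NumPy's gamma sampler takes a scale, the reciprocal of the rate γ2) and compare the empirical mean of log α with ψ(γ1) − log γ2. A sketch for hypothetical values of (γ1, γ2):

```python
import numpy as np
from scipy.special import digamma

g1, g2 = 4.5, 1.5                       # hypothetical (gamma1, gamma2)
closed_form = digamma(g1) - np.log(g2)  # Eq. (D.4)

rng = np.random.default_rng(0)
alpha_samples = rng.gamma(shape=g1, scale=1.0 / g2, size=500_000)  # rate -> scale
mc_estimate = np.log(alpha_samples).mean()
```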
◦ $\mathbb{E}_{q_\psi(\nu_k \mid w_{1:N})}[\log(1-\nu_k)]$:
By applying a Taylor expansion, $\mathbb{E}_{q_\psi(\nu_k \mid w_{1:N})}[\log(1-\nu_k)]$ can be written as an infinite sum of the Kumaraswamy's m-th moments:

$$\mathbb{E}_{q_\psi(\nu_k \mid w_{1:N})}\left[\log(1-\nu_k)\right] = -\sum_{m=1}^{\infty}\frac{1}{m}\,\mathbb{E}_{q_\psi(\nu_k \mid w_{1:N})}\left[\nu_k^m\right] = -\sum_{m=1}^{\infty}\frac{b_k}{m + a_k b_k}B\Big(\frac{m}{a_k}, b_k\Big) \tag{D.5}$$

where B( ·, · ) is the Beta function.
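Likewise, the truncated series of Eq. (D.5) can be compared against a Monte Carlo estimate of E[log(1 − ν)] under Kumaraswamy(a, b) samples drawn via the inverse CDF; a sketch with arbitrary example parameters:

```python
import numpy as np
from scipy.special import betaln

a, b = 2.0, 3.0
m = np.arange(1, 200)                   # truncation of the infinite sum
series = -np.sum(b / (m + a * b) * np.exp(betaln(m / a, b)))  # Eq. (D.5)

rng = np.random.default_rng(0)
u = rng.uniform(1e-9, 1.0 - 1e-9, size=500_000)
nu = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)  # Kumaraswamy(a, b) samples
mc = np.log1p(-nu).mean()
```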
By substituting Eq. (D.4) and Eq. (D.5) into Eq. (D.2), we obtain:

$$\begin{aligned}&\mathbb{E}_{q_\psi(\nu \mid w_{1:N})\,q(\alpha \mid \gamma_1, \gamma_2)}\left[\log p(\nu \mid \alpha)\right] \\ &= (K-1)\big(\Psi(\gamma_1) - \log\gamma_2\big) - \frac{\gamma_1-\gamma_2}{\gamma_2}\sum_{k=1}^{K-1}\sum_{m=1}^{\infty}\frac{b_k}{m + a_k b_k}B\Big(\frac{m}{a_k}, b_k\Big)\end{aligned} \tag{D.6}$$
• $-\mathbb{E}_{q_\psi(\nu \mid w_{1:N})}[\log q_\psi(\nu \mid w_{1:N})]$:
According to Section 4.11 of [31], the Kumaraswamy's entropy is given as

$$\begin{aligned}-\mathbb{E}_{q_\psi(\nu \mid w_{1:N})}\left[\log q_\psi(\nu \mid w_{1:N})\right] &= -\sum_{k=1}^{K-1}\mathbb{E}_{q_\psi(\nu_k \mid w_{1:N})}\left[\log q_\psi(\nu_k \mid w_{1:N})\right] \\ &= \sum_{k=1}^{K-1}\bigg[-\log(a_k b_k) + \frac{a_k-1}{a_k}\Big(\gamma + \Psi(b_k) + \frac{1}{b_k}\Big) + \frac{b_k-1}{b_k}\bigg]\end{aligned} \tag{D.7}$$

where γ is Euler's constant.
• $\mathrm{KL}(q(\alpha \mid \gamma_1, \gamma_2) \,\|\, p(\alpha \mid s_1, s_2))$:
The KL divergence of one Gamma distribution $q(\alpha; \gamma_1, \gamma_2)$ from another Gamma distribution $p(\alpha; s_1, s_2)$ evaluates to

$$\mathrm{KL}(q \,\|\, p) = s_1\log\frac{\gamma_2}{s_2} - \log\frac{\Gamma(\gamma_1)}{\Gamma(s_1)} + (\gamma_1 - s_1)\Psi(\gamma_1) - (\gamma_2 - s_2)\frac{\gamma_1}{\gamma_2} \tag{D.8}$$
Appendix E. The Evidence Lower Bound of HiTM-VAE

In this section, we show how to compute the Evidence Lower Bound of HiTM-VAE on the whole dataset, which can be written as:

$$\begin{aligned}\mathcal{L}(\mathcal{D} \mid \Theta, \psi) ={}& \mathbb{E}_{q(\beta')}\Big[\log\frac{p(\beta' \mid \gamma)}{q(\beta' \mid u, v)}\Big] + \sum_{j=1}^{D}\bigg\{\mathbb{E}_{q(\nu^{(j)})}\Big[\log\frac{p(\nu^{(j)} \mid \alpha)}{q(\nu^{(j)})}\Big] \\ &+ \sum_{k=1}^{K}\mathbb{E}_{q(\beta')\,q(c_k^{(j)} \mid \phi_k^{(j)})}\Big[\log\frac{p(c_k^{(j)} \mid \beta)}{q(c_k^{(j)} \mid \phi_k^{(j)})}\Big] \\ &+ \mathbb{E}_{q(\nu^{(j)})\,q(c^{(j)})}\Big[\log p(x^{(j)} \mid \nu^{(j)}, c^{(j)}, \Theta)\Big]\bigg\}\end{aligned} \tag{E.1}$$

Specifically, each item in Eq. (E.1) can be obtained as follows:
• $\mathbb{E}_{q(\beta')}\big[\log\frac{p(\beta' \mid \gamma)}{q(\beta' \mid u, v)}\big]$:
The KL divergence between two series of independent Beta distributions, Beta($u_i$, $v_i$) and Beta(1, γ), is:

$$\begin{aligned}\sum_{i=1}^{T-1}\mathrm{KL}\big(\mathrm{Beta}(u_i, v_i) \,\|\, \mathrm{Beta}(1, \gamma)\big) ={}& -(T-1)\log\gamma - \sum_{i=1}^{T-1}\Big\{\log\frac{\Gamma(u_i)\Gamma(v_i)}{\Gamma(u_i+v_i)} \\ &+ \big(\Psi(u_i+v_i) - \Psi(u_i)\big)(u_i - 1) \\ &+ \big(\Psi(u_i+v_i) - \Psi(v_i)\big)(v_i - \gamma)\Big\}\end{aligned} \tag{E.2}$$
• The first term in the summation, $\mathbb{E}_{q(\nu^{(j)} \mid a^{(j)}, b^{(j)})}\big[\log\frac{p(\nu^{(j)} \mid \alpha)}{q(\nu^{(j)})}\big]$, takes the same form as Eq. (C.7):

$$\begin{aligned}\sum_{k=1}^{K-1}\mathrm{KL}\big(q(\nu_k^{(j)}) \,\|\, p(\nu_k^{(j)} \mid \alpha)\big) ={}& \sum_{k=1}^{K-1}\bigg[\frac{a_k-1}{a_k}\Big(-\gamma - \Psi(b_k) - \frac{1}{b_k}\Big) + \log a_k b_k + \log B(1, \alpha) \\ &+ (\alpha-1)\sum_{m=1}^{\infty}\frac{b_k}{m + a_k b_k}B\Big(\frac{m}{a_k}, b_k\Big) - \frac{b_k-1}{b_k}\bigg]\end{aligned} \tag{E.3}$$
• The second term in the summation, $\sum_{k=1}^{K}\mathbb{E}_{q(\beta')\,q(c_k^{(j)} \mid \phi_k^{(j)})}\big[\log\frac{p(c_k^{(j)} \mid \beta)}{q(c_k^{(j)} \mid \phi_k^{(j)})}\big]$:
Let $A_i^{(j)} = \sum_{k=1}^{K}\phi_{ki}^{(j)}$, and let $H(\phi_k^{(j)})$ denote the entropy of the multinomial $q(c_k^{(j)} \mid \phi_k^{(j)})$. Then this term can be written as

$$\begin{aligned}&\sum_{k=1}^{K}H(\phi_k^{(j)}) + \sum_{i=1}^{T}\mathbb{E}_q[\log\beta_i]\,A_i^{(j)} \\ &= \sum_{k=1}^{K}H(\phi_k^{(j)}) + \sum_{i=1}^{T-1}A_i^{(j)}\,\mathbb{E}_q[\log\beta_i'] + \sum_{i=1}^{T}A_i^{(j)}\sum_{l=1}^{i-1}\mathbb{E}_q[\log(1-\beta_l')] \\ &= \sum_{k=1}^{K}H(\phi_k^{(j)}) + \sum_{i=1}^{T-1}A_i^{(j)}\big\{\Psi(u_i) - \Psi(u_i+v_i)\big\} + \sum_{i=1}^{T}A_i^{(j)}\sum_{l=1}^{i-1}\big\{\Psi(v_l) - \Psi(u_l+v_l)\big\}\end{aligned}$$
$$= \sum_{k=1}^{K}H(\phi_k^{(j)}) + \sum_{i=1}^{T-1}\bigg\{A_i^{(j)}\,\Psi(u_i) - \Big(\sum_{l=i}^{T}A_l^{(j)}\Big)\Psi(u_i+v_i) + \Big(\sum_{l=i+1}^{T}A_l^{(j)}\Big)\Psi(v_i)\bigg\} \tag{E.4}$$
• The third term in the summation, $\mathbb{E}_{q(\nu^{(j)})\,q(c^{(j)})}\big[\log p(x^{(j)} \mid \nu^{(j)}, c^{(j)}, \Theta)\big]$, is estimated by MC sampling. For backpropagating through the stochastic units, the reparameterization trick for the Kumaraswamy posterior $q(\{\nu_k^{(j)}\}_{k=1}^{K})$ and the Gumbel-Softmax approximation [15] for the multinomial posterior $q(\{c_k^{(j)}\}_{k=1}^{K})$ are used.
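A sketch of the Gumbel-Softmax step: the categorical posterior $q(c_k^{(j)} \mid \phi_k^{(j)})$ is relaxed by perturbing the log-probabilities with Gumbel(0, 1) noise and applying a temperature-controlled softmax, which keeps the sample differentiable with respect to φ. The probabilities and temperature below are toy values:

```python
import numpy as np

def gumbel_softmax(log_phi, tau, rng):
    # Gumbel(0, 1) noise via the inverse CDF, then a tempered softmax.
    g = -np.log(-np.log(rng.uniform(size=log_phi.shape)))
    y = (log_phi + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

rng = np.random.default_rng(0)
phi = np.array([0.1, 0.2, 0.6, 0.1])                 # toy posterior over T atoms
c_relaxed = gumbel_softmax(np.log(phi), tau=0.5, rng=rng)
# As tau -> 0 the relaxed sample approaches a one-hot vector.
```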
Appendix F. Abbreviations of Methods

Table F.1
The abbreviations used in this paper.

Abbreviation               Description
LDA                        Latent Dirichlet Allocation [5]
HDP                        Hierarchical Dirichlet Process [44]
AEVB                       Auto-Encoding Variational Bayes [21,39]
AVITM                      Autoencoded Variational Inference For Topic Model [43]
SB-VAE                     Stick-Breaking Variational Auto-Encoder [34]
NVDM                       Neural Variational Document Model [30]
GSM, GSB, RSB, RSB-TF      Methods proposed in [29]
iTM-VAE (Sec. 3.1)         infinite Topic Model with Variational Auto-Encoders
iTM-VAE-Prod (Sec. 3.1)    iTM-VAE using Product-of-experts
iTM-VAE-HP (Sec. 3.3)      iTM-VAE with the Hyper-Prior extension
HiTM-VAE (Sec. 4)          Hierarchical iTM-VAE
References

[1] C. Archambeau, B. Lakshminarayanan, G. Bouchard, Latent IBP compound Dirichlet allocation, IEEE Trans. Pattern Anal. Mach. Intell. 37 (2) (2015) 321–333.
[2] J.M. Bernardo, A.F. Smith, Bayesian Theory, 2001.
[3] D.M. Blei, Probabilistic topic models, Commun. ACM 55 (4) (2012) 77–84.
[4] D.M. Blei, M.I. Jordan, et al., Variational inference for Dirichlet process mixtures, Bayesian Anal. 1 (1) (2006) 121–143.
[5] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3 (Jan) (2003) 993–1022.
[6] S.R. Bowman, L. Vilnis, O. Vinyals, A.M. Dai, R. Józefowicz, S. Bengio, Generating sentences from a continuous space, arXiv preprint arXiv:1511.06349 (2015).
[7] S. Burkhardt, S. Kramer, Decoupling sparsity and smoothness in the Dirichlet variational autoencoder topic model, J. Mach. Learn. Res. 20 (131) (2019) 1–27.
[8] D. Card, C. Tan, N.A. Smith, A neural framework for generalized topic models, arXiv preprint arXiv:1705.09296 (2017).
[9] X. Chen, D.P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, P. Abbeel, Variational lossy autoencoder, in: International Conference on Learning Representations, 2017.
[10] M.D. Escobar, M. West, Bayesian density estimation and inference using mixtures, J. Am. Stat. Assoc. 90 (430) (1995) 577–588.
[11] G.E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Comput. 14 (8) (2006).
[12] G.E. Hinton, R.R. Salakhutdinov, Replicated softmax: an undirected topic model, in: Advances in Neural Information Processing Systems, 2009, pp. 1607–1614.
[13] M. Hoffman, F.R. Bach, D.M. Blei, Online learning for latent Dirichlet allocation, in: Advances in Neural Information Processing Systems, 2010.
[14] S. Ioffe, Batch renormalization: towards reducing minibatch dependence in batch-normalized models, in: Advances in Neural Information Processing Systems, 2017, pp. 1945–1953.
[15] E. Jang, S. Gu, B. Poole, Categorical reparameterization with Gumbel-Softmax, in: International Conference on Learning Representations, 2017.
[16] W. Joo, W. Lee, S. Park, I.-C. Moon, Dirichlet variational autoencoder, arXiv preprint arXiv:1901.02739 (2019).
[17] D.I. Kim, E.B. Sudderth, The doubly correlated nonparametric topic model, in: Advances in Neural Information Processing Systems, 2011, pp. 1980–1988.
[18] D. Kingma, J. Ba, Adam: a method for stochastic optimization, in: International Conference on Learning Representations, 2015.
[19] D. Kingma, M. Welling, Efficient gradient-based inference through transformations between Bayes nets and neural nets, in: International Conference on Machine Learning, 2014, pp. 1782–1790.
[20] D.P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, M. Welling, Improving variational inference with inverse autoregressive flow, in: International Conference on Learning Representations, 2016.
[21] D.P. Kingma, M. Welling, Auto-encoding variational Bayes, in: International Conference on Learning Representations, 2014.
[22] P. Kumaraswamy, A generalized probability density function for double-bounded random processes, J. Hydrol. 46 (1-2) (1980) 79–88.
[23] H. Larochelle, S. Lauly, A neural autoregressive topic model, in: Advances in Neural Information Processing Systems, 2012, pp. 2708–2716.
[24] J.H. Lau, T. Baldwin, The sensitivity of topic coherence evaluation to topic cardinality, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 483–487.
[25] J.H. Lau, D. Newman, T. Baldwin, Machine reading tea leaves: automatically evaluating topic coherence and topic model quality, in: Conference of the European Chapter of the Association for Computational Linguistics, 2014, pp. 530–539.
[26] D.D. Lewis, Y. Yang, T.G. Rose, F. Li, RCV1: a new benchmark collection for text categorization research, J. Mach. Learn. Res. 5 (Apr) (2004) 361–397.
[27] K.W. Lim, W. Buntine, C. Chen, L. Du, Nonparametric Bayesian topic modelling with the hierarchical Pitman–Yor processes, Int. J. Approx. Reason. 78 (2016) 172–191.
[28] T. Lin, W. Tian, Q. Mei, H. Cheng, The dual-sparse topic model: mining focused topics and focused terms in short text, in: International World Wide Web Conference, ACM, 2014, pp. 539–550.
[29] Y. Miao, E. Grefenstette, P. Blunsom, Discovering discrete latent topics with neural variational inference, in: International Conference on Machine Learning, JMLR.org, 2017, pp. 2410–2419.
[30] Y. Miao, L. Yu, P. Blunsom, Neural variational inference for text processing, in: International Conference on Machine Learning, 2016, pp. 1727–1736.
[31] J.V. Michalowicz, J.M. Nichols, F. Bucholtz, Handbook of Differential Entropy, CRC Press, 2013.
[32] A. Mnih, K. Gregor, Neural variational inference and learning in belief networks, in: International Conference on Machine Learning, 2014.
[33] K.P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
[34] E. Nalisnick, P. Smyth, Stick-breaking variational autoencoders, in: International Conference on Learning Representations, 2017.
[35] D. Newman, J.H. Lau, K. Grieser, T. Baldwin, Automatic evaluation of topic coherence, in: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, ACL, 2010, pp. 100–108.
[36] D. Putthividhy, H.T. Attias, S.S. Nagarajan, Topic regression multi-modal latent Dirichlet allocation for image annotation, in: Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 3408–3415.
[37] R. Ranganath, S. Gerrish, D. Blei, Black box variational inference, in: International Conference on Artificial Intelligence and Statistics, 2014.
[38] N. Rasiwasia, N. Vasconcelos, Latent Dirichlet allocation models for image classification, IEEE Trans. Pattern Anal. Mach. Intell. 35 (11) (2013) 2665–2679.
[39] D.J. Rezende, S. Mohamed, D. Wierstra, Stochastic backpropagation and variational inference in deep latent Gaussian models, in: International Conference on Machine Learning, 2014.
[40] S. Rogers, M. Girolami, C. Campbell, R. Breitling, The latent process decomposition of cDNA microarray data sets, IEEE/ACM Trans. Comput. Biol. Bioinform. 2 (2) (2005) 143–156.
[41] J. Sethuraman, A constructive definition of Dirichlet priors, Stat. Sin. 4 (1994) 639–650.
[42] C.K. Sønderby, T. Raiko, L. Maaløe, S.K. Sønderby, O. Winther, Ladder variational autoencoders, in: Advances in Neural Information Processing Systems, 2016, pp. 3738–3746.
[43] A. Srivastava, C. Sutton, Autoencoding variational inference for topic models, in: International Conference on Learning Representations, 2017.
[44] Y.W. Teh, A hierarchical Bayesian language model based on Pitman–Yor processes, in: International Conference on Computational Linguistics, ACL, 2006, pp. 985–992.
[45] C. Wang, J. Paisley, D. Blei, Online variational inference for the hierarchical Dirichlet process, in: International Conference on Artificial Intelligence and Statistics, 2011, pp. 752–760.
[46] X. Wei, W.B. Croft, LDA-based document models for ad-hoc retrieval, in: International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2006, pp. 178–185.
[47] Y.W. Teh, M.I. Jordan, M.J. Beal, D.M. Blei, Hierarchical Dirichlet processes, J. Am. Stat. Assoc. 101 (476) (2006) 1566–1581.
[48] H. Zhang, B. Chen, D. Guo, M. Zhou, WHAI: Weibull hybrid autoencoding inference for deep topic modeling, in: International Conference on Learning Representations, 2018.
Xuefei Ning is currently a Ph.D. student in the Department of Electronic Engineering, Tsinghua University. She received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 2016. Her research interests include the reliability and robustness of neural networks.

Yin Zheng is currently a senior researcher with Weixin Group, Tencent. He received his Ph.D. degree from Tsinghua University in 2015. He serves as a PC member or reviewer for many journals and conferences in his area.

Zhuxi Jiang received the Bachelor's degree and the Master's degree from Beijing Institute of Technology, Beijing, China, in 2015 and 2018, respectively. His research interests include machine learning, computer vision and intelligent transportation systems.

Yu Wang (S'05-M'07-SM'14) received the B.S. and Ph.D. (with honor) degrees from Tsinghua University, Beijing, in 2002 and 2007. He is currently a tenured professor with the Department of Electronic Engineering, Tsinghua University. His research interests include brain inspired computing, application specific hardware computing, parallel circuit analysis, and power/reliability aware system design methodology. He has authored and coauthored more than 200 papers in refereed journals and conferences. He is a recipient of the DAC Under-40 Innovator Award (2018) and the IBM X10 Faculty Award (2010). He served as TPC/track chair and program committee member for leading conferences in the EDA and FPGA fields.

Huazhong Yang (M'97-SM'00) received the B.S. degree in microelectronics in 1989, and the M.S. and Ph.D. degrees in electronic engineering in 1993 and 1998, respectively, all from Tsinghua University, Beijing. In 1993, he joined the Department of Electronic Engineering, Tsinghua University, Beijing, where he has been a Full Professor since 1998. Dr. Yang was awarded the Distinguished Young Researcher by NSFC in 2000 and Cheung Kong Scholar by the Ministry of Education of China in 2012. He has been in charge of several projects, including projects sponsored by the national science and technology major project, the 863 program, NSFC, the 9th five-year national program and several international research projects. Dr. Yang has authored and co-authored over 400 technical papers, 7 books, and over 100 granted Chinese patents. His current research interests include wireless sensor networks, data converters, energy-harvesting circuits, nonvolatile processors, and brain inspired computing.

Junzhou Huang received the B.E. degree from Huazhong University of Science and Technology, Wuhan, China, the M.S. degree from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, and the Ph.D. degree in Computer Science from Rutgers, The State University of New Jersey. His major research interests include machine learning, computer vision and biomedical informatics, with a focus on the development of sparse modeling, imaging, and learning for big data analytics.

Peilin Zhao is currently a Principal Researcher at Tencent AI Lab, China. Previously, he has worked at Rutgers University, the Institute for Infocomm Research (I2R), and Ant Financial Services Group. His research interests include online learning, deep learning, recommendation systems, automatic machine learning, etc. He has published over 100 papers in top venues, including JMLR, ICML, KDD, etc. He has been invited as a PC member, reviewer or editor for many international conferences and journals, such as ICML, JMLR, etc. He received his bachelor's degree from Zhejiang University, and his Ph.D. degree from Nanyang Technological University.