Neurocomputing 399 (2020) 296–306
Nonparametric Topic Modeling with Neural Inference
Xuefei Ning a, Yin Zheng b, Zhuxi Jiang c, Yu Wang a, Huazhong Yang a, Junzhou Huang d, Peilin Zhao d,∗
a Tsinghua University, China; b Weixin Group, Tencent, China; c Momenta, China; d Tencent AI Lab, China
Article history: Received 10 April 2019; Revised 26 November 2019; Accepted 27 December 2019; Available online 28 February 2020. Communicated by Dr. Lixin Duan.
Abstract

This work focuses on combining nonparametric topic models with Auto-Encoding Variational Bayes (AEVB). Specifically, we first propose iTM-VAE, where the topics are treated as trainable parameters and the document-specific topic proportions are obtained by a stick-breaking construction. The inference of iTM-VAE is modeled by neural networks such that it can be computed in a simple feed-forward manner. We also describe how to introduce a hyper-prior into iTM-VAE so as to model the uncertainty of the prior parameter. The hyper-prior technique is quite general, and we show that it can be applied to other AEVB-based models to alleviate the collapse-to-prior problem elegantly. Moreover, we propose HiTM-VAE, where the document-specific topic distributions are generated in a hierarchical manner. HiTM-VAE is even more flexible and can generate topic representations with better variability and sparsity. Experimental results on the 20News and Reuters RCV1-V2 datasets show that the proposed models outperform state-of-the-art baselines significantly. The advantages of the hyper-prior technique and the hierarchical model construction are also confirmed by experiments.
© 2020 Elsevier B.V. All rights reserved.
1. Introduction
Probabilistic topic models focus on discovering the abstract
“topics” that occur in a collection of documents, and represent a
document as a weighted mixture of the discovered topics. Classi-
cal topic models [5] have achieved success in a range of applica-
tions [5,38,40,46] . A major challenge of topic models is that the
inference of the distribution over topics does not have a closed-
form solution and must be approximated, using either MCMC sam-
pling or variational inference. When small changes are made to the model, the inference algorithm must be re-derived. In contrast, black-box inference methods [21,32,37,39] require only limited model-specific analysis and can be flexibly applied to new models.
Among all the black-box inference methods, Auto-Encoding
Variational Bayes (AEVB) [21,39] is a promising one for topic mod-
els. AEVB contains an inference network that can map a document
directly to a variational posterior without the need for further
∗ Corresponding author.
E-mail addresses: [email protected] (X. Ning), [email protected]
(Y. Zheng), [email protected] (Z. Jiang), [email protected] (Y.
Wang), [email protected] (H. Yang), [email protected] (J. Huang),
[email protected] (P. Zhao).
https://doi.org/10.1016/j.neucom.2019.12.128
0925-2312/© 2020 Elsevier B.V. All rights reserved.
local variational updates on test data, and the Stochastic Gradient Variational Bayes (SGVB) estimator allows efficient approximate inference for a broad class of posteriors, which makes topic models more flexible. Hence, an increasing number of models have been proposed recently to combine topic models with AEVB, such as [8,29,30,43].

Although these AEVB-based topic models achieve promising performance, the number of topics, which is important to the performance of these models, has to be specified manually with model selection methods. Nonparametric models, however, have the ability to adapt the topic number to the data. For example, Yee Whye Teh and Blei [47] proposed the Hierarchical Dirichlet Process (HDP), which models each document with a Dirichlet Process (DP); all the DPs for the documents in a corpus share a base distribution that is itself sampled from a DP. HDP has a potentially infinite number of topics and allows the number to grow as more documents are observed. It is appealing that nonparametric topic models can also be equipped with AEVB techniques to enjoy the benefits brought by neural black-box inference. We make progress on this problem by proposing the infinite Topic Model with Variational Auto-Encoders (iTM-VAE), which is a nonparametric topic model with AEVB.

For nonparametric topic models with a stick-breaking prior [41], the concentration parameter α plays an important role in deciding
Table 1
The notations used in this paper.

Section 3.1
  α          the concentration parameter in Beta/GEM
  π_k        the k-th atom weight
  ν_k        the k-th sample from the Beta prior
  θ_k        the k-th topic
  φ_k        the unconstrained parameter for θ_k
  x^(j)      the j-th document
  G^(j)      the document-specific topic distribution for x^(j)
  w_n^(j)    the n-th word in x^(j)
Section 3.2
  a_k, b_k   the Kumaraswamy variational parameters for ν_k
  ψ          the parameters of the inference neural network
Section 3.3
  s_1, s_2   the prior parameters of the Gamma hyper-prior on α
  γ_1, γ_2   the Gamma variational parameters for α
Section 4
  γ          the corpus-level concentration parameter
  β′_i       the i-th sample from the corpus-level Beta prior
  β_i        the i-th corpus-level atom weight
  c_k^(j)    the indicator variables for the j-th document x^(j)
  ϕ_k        the multinomial variational parameters for c_k^(j)
  u_i, v_i   the corpus-level Beta variational parameters for β′_i
the growth of the number of topics.¹ The larger α is, the more topics the model tends to discover. Hence, one can place a hyper-prior [2] over α such that the model can adapt it to the data [4,10,47]. Moreover, the hyper-prior technique plays a two-fold role in topic models powered by AEVB, because the AEVB framework suffers from the problem that the latent representation tends to collapse to the prior [6,9,42]; in our case, this means that the prior parameter α will tightly control the number of discovered topics, especially when the decoder is strong. Common heuristic tricks to alleviate this issue are 1) KL-annealing [42] and 2) decoder regularizing [6]. Introducing a hyper-prior into the AEVB framework is nontrivial and not yet well explored in the community. In this paper, we show that, as a natural relaxation of the prior, the hyper-prior technique can alleviate the collapse-to-prior issue in the training process and increase the adaptive capability of the model.²

To further increase the flexibility of iTM-VAE, we propose HiTM-VAE, which models the document-specific topic distribution in a hierarchical manner. This hierarchical construction can help to generate topic representations with better variability and sparsity, which is more suitable for handling heterogeneous documents.
The main contributions of the paper are:
• We propose iTM-VAE and iTM-VAE-Prod, which are two
novel nonparametric topic models equipped with AEVB, and
outperform the state-of-the-art models on the benchmarks.
• We propose iTM-VAE-HP, in which a hyper-prior helps the
model to adapt the prior parameter to data. We also show
that this technique can help other AEVB-based models to al-
leviate the collapse-to-prior problem elegantly.
• We propose HiTM-VAE, which is a hierarchical extension
of iTM-VAE. This construction and its corresponding AEVB-
based inference method can help the model to learn more
topics and produce topic proportions with higher variability
and sparsity.
2. Related work

Topic models have been studied extensively in a variety of applications such as document modeling, information retrieval, computer vision and bioinformatics [3,5,36,38,40,46]. Recently, with the impressive success of deep learning, neural topic models [12,23,32] have achieved encouraging performance in document modeling tasks. Although these models achieve competitive performance, they do not explicitly model the generative story of documents, and hence are less explainable.

Several recent works proposed to model the generative procedure explicitly, with the inference of the topic distributions computed by deep neural networks, which makes these models explainable, powerful and easily extendable. For example, Srivastava and Sutton [43] proposed AVITM, which embeds the original Latent Dirichlet Allocation (LDA) [5] formulation within AEVB. By utilizing a Laplace approximation for the Dirichlet distribution, AVITM can be optimized by the SGVB estimator efficiently. AVITM achieves state-of-the-art performance on the topic coherence metric [25], which indicates that the learned topics match closely to human judgment. Since the Laplace approximation cannot model the posterior sparsity effectively, other inference techniques for the Dirichlet distribution have been proposed. After decomposing the Dirichlet distribution into Gamma distributions, Joo et al. [16] proposed to use an approximation of the inverse Gamma CDF for the reparametrization trick. Zhang et al. [48] proposed to use the Weibull distribution in the inference of a

¹ Please refer to Section 3.1 for more details about the concentration parameter.
² The hyper-prior technique can also alleviate the collapse-to-prior issue in other scenarios; an example is demonstrated in Appendix B.
multi-layer generative model for learning hierarchical latent representations. Burkhardt and Kramer [7] proposed to decouple the sparsity and smoothness variational parameters and to use an efficient rejection sampler for the Gamma random variables.

Nonparametric topic models [1,17,27,44,47] potentially have an infinite topic capacity and can adapt the topic number to the data. Nalisnick and Smyth [34] proposed the Stick-Breaking VAE (SB-VAE), a Bayesian nonparametric version of the traditional VAE with a stochastic dimensionality. iTM-VAE differs from SB-VAE in three aspects: 1) iTM-VAE is a topic model for discrete text data; 2) a hyper-prior is introduced into the AEVB framework to increase the adaptive capability; 3) a hierarchical extension of iTM-VAE is proposed to further increase the flexibility. Miao et al. [29] proposed GSM, GSB, RSB and RSB-TF to model documents. RSB-TF uses a heuristic indicator to guide the growth of the topic number, and can adapt the topic number to the data.
3. The iTM-VAE model

In this section, we describe the generative and inference procedures of iTM-VAE and iTM-VAE-Prod in Section 3.1 and Section 3.2. Then, Section 3.3 describes the hyper-prior extension iTM-VAE-HP. The notations used in this paper are summarized in Table 1, and the abbreviations of the various methods are summarized in Appendix F.

3.1. The generative procedure of iTM-VAE

Suppose the atom weights π = {π_k}_{k=1}^∞ are drawn from a GEM distribution [33], i.e. π ∼ GEM(α), where the GEM distribution is defined as:

ν_k ∼ Beta(1, α),   π_k = ν_k ∏_{l=1}^{k−1} (1 − ν_l) = ν_k (1 − Σ_{l=1}^{k−1} π_l)   (1)

Let θ_k = σ(φ_k) denote the k-th topic, which is a multinomial distribution over the vocabulary, where φ_k ∈ R^V is the parameter of θ_k, σ(·) is the softmax function and V is the vocabulary size. In iTM-VAE, there is an unlimited number of topics, and we denote Θ = {θ_k}_{k=1}^∞ and Φ = {φ_k}_{k=1}^∞ as the collections of these countably infinite topics and the corresponding parameters. The generation of a document x^(j) = w^(j)_{1:N^(j)} by iTM-VAE can then be mathematically described as:
• Get the document-specific G^(j)(θ; π^(j), Θ) = Σ_{k=1}^∞ π_k^(j) δ_{θ_k}(θ), where π^(j) ∼ GEM(α)
• For each word w_n in x^(j): 1) draw a topic θ̂_n ∼ G^(j)(θ; π^(j), Θ); 2) w_n ∼ Cat(θ̂_n)

where α is the concentration parameter, Cat(θ̂_n) is a categorical distribution parameterized by θ̂_n, and δ_{θ_k}(θ) is a discrete Dirac function that equals 1 when θ = θ_k and 0 otherwise. In the following, we omit the superscript (j) for simplicity.

Thus, the joint probability of w_{1:N} = {w_n}_{n=1}^N, θ̂_{1:N} = {θ̂_n}_{n=1}^N and π can be written as:

p(w_{1:N}, π, θ̂_{1:N} | α, Θ) = p(π|α) ∏_{n=1}^N p(w_n|θ̂_n) p(θ̂_n|π, Θ)   (2)

where p(π|α) = GEM(α), p(θ̂_n|π, Θ) = G(θ̂_n; π, Θ) and p(w_n|θ̂_n) = Cat(θ̂_n).

Similar to [43], we collapse the variables θ̂_{1:N} and rewrite Eq. 2 as:

p(w_{1:N}, π | α, Θ) = p(π|α) ∏_{n=1}^N p(w_n|π, Θ)   (3)

where p(w_n|π, Θ) = Cat(θ̄) and θ̄ = Σ_{k=1}^∞ π_k θ_k.

In Eq. 3, θ̄ is a mixture of multinomials. This formulation cannot make predictions sharper than the distributions being mixed [12], which may result in some topics of poor quality. Replacing the mixture of multinomials with a weighted product of experts is one method to make sharper predictions [11,43]. Hence, a product-of-experts version of iTM-VAE (i.e. iTM-VAE-Prod) can be obtained by simply computing θ̂ for each document as θ̂ = σ(Σ_{k=1}^∞ π_k φ_k).
3.2. The inference procedure of iTM-VAE

In this section, we describe the inference procedure of iTM-VAE, i.e. how to draw π given a document w_{1:N}. Suppose ν = [ν_1, ν_2, ..., ν_{K−1}] is a (K−1)-dimensional vector, where ν_k is a random variable sampled from a Kumaraswamy distribution κ(ν; a_k, b_k) parameterized by a_k and b_k [22,34]. iTM-VAE models the joint distribution q_ψ(ν|w_{1:N}) as:³

[a_1, ..., a_{K−1}; b_1, ..., b_{K−1}] = g(w_{1:N}; ψ)   (4)

q_ψ(ν|w_{1:N}) = ∏_{k=1}^{K−1} κ(ν_k; a_k, b_k)   (5)

where g(w_{1:N}; ψ) is a neural network with parameters ψ. Then, π = {π_k}_{k=1}^K can be drawn by:

ν ∼ q_ψ(ν|w_{1:N})   (6)

π = (π_1, π_2, ..., π_{K−1}, π_K) = (ν_1, ν_2(1−ν_1), ..., ν_{K−1} ∏_{n=1}^{K−2} (1−ν_n), ∏_{n=1}^{K−1} (1−ν_n))   (7)

In the above procedure, we truncate the infinite sequence of mixture weights π = {π_k}_{k=1}^∞ at K elements, and ν_K is always set to 1 to ensure Σ_{k=1}^K π_k = 1. Notably, as discussed in [4], the truncation of the variational posterior does not indicate that we are

³ Ideally, the Beta distribution would be the most suitable candidate, since iTM-VAE assumes π is drawn from a GEM distribution in the generative procedure. However, as the Beta distribution does not satisfy the differentiable non-centered parameterization (DNCP) [19] requirement of SGVB [21], we use the Kumaraswamy distribution instead.
using a finite-dimensional prior, since we never truncate the GEM prior. Hence, iTM-VAE still has the ability to model the uncertainty of the number of topics and adapt it to data [34]. iTM-VAE can be optimized by maximizing the Evidence Lower Bound (ELBO):

L(w_{1:N} | Θ, ψ) = E_{q_ψ(ν|w_{1:N})}[log p(w_{1:N}|π, Θ)] − KL(q_ψ(ν|w_{1:N}) || p(ν|α))   (8)

where p(ν|α) is the product of K−1 Beta(1, α) probability density functions. The details of the optimization can be found in Appendix C. The optimization procedure of iTM-VAE is summarized in Algorithm 1.
Algorithm 1 The optimization procedure of iTM-VAE.
1: EPOCH: the total number of epochs
2: epoch = 0
3: while epoch < EPOCH do
4:   for all w_{1:N} in dataset do
5:     [a_1, ..., a_{K−1}; b_1, ..., b_{K−1}] = g(w_{1:N}; ψ)
6:     q_ψ(ν|w_{1:N}) = ∏_{k=1}^{K−1} κ(ν_k; a_k, b_k)
7:     ν ∼ q_ψ(ν|w_{1:N})
8:     π = (ν_1, ν_2(1−ν_1), ..., ν_{K−1} ∏_{n=1}^{K−2}(1−ν_n), ∏_{n=1}^{K−1}(1−ν_n))
9:     ψ = ψ + η_ψ ∇_ψ (log p(w_{1:N}|π, Θ) − KL(q_ψ(ν|w_{1:N}) || p(ν|α)))
10:    Φ = Φ + η_Φ ∇_Φ log p(w_{1:N}|π, Θ)
11:  end for
12:  epoch = epoch + 1
13: end while
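A minimal NumPy sketch of the quantities on lines 7-9 of Algorithm 1 may help: it draws ν via the Kumaraswamy inverse CDF (the DNCP property noted in footnote 3), builds π by stick breaking, and forms a single-sample SGVB estimate of the ELBO in Eq. 8 with the iTM-VAE-Prod decoder. All names are ours, and the KL term is estimated with one Monte Carlo sample rather than an analytic form:

```python
import numpy as np

def log_softmax(z):
    m = z.max()
    return z - m - np.log(np.exp(z - m).sum())

def elbo_single_sample(word_counts, a, b, phi, alpha, rng):
    """One-sample SGVB estimate of Eq. 8 with a product-of-experts decoder.

    word_counts: (V,) bag-of-words vector for one document.
    a, b: (K-1,) Kumaraswamy parameters (in practice, outputs of g(w; psi)).
    phi: (K, V) unconstrained topic parameters.
    """
    # Line 7: reparameterized nu ~ Kumaraswamy(a, b) via the closed-form inverse CDF
    u = rng.uniform(low=1e-8, high=1.0 - 1e-8, size=a.shape)
    nu = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

    # Line 8: stick-breaking pi, with nu_K = 1 closing the stick
    nu_full = np.append(nu, 1.0)
    pi = nu_full * np.concatenate(([1.0], np.cumprod(1.0 - nu_full[:-1])))

    # log p(w_{1:N} | pi, Phi): theta-hat = softmax(pi @ phi) (iTM-VAE-Prod)
    log_lik = word_counts @ log_softmax(pi @ phi)

    # Single-sample KL(q || p) = log q_Kumaraswamy(nu) - log p_Beta(1, alpha)(nu)
    log_q = np.sum(np.log(a * b) + (a - 1.0) * np.log(nu)
                   + (b - 1.0) * np.log1p(-nu ** a))
    log_p = np.sum(np.log(alpha) + (alpha - 1.0) * np.log1p(-nu))
    return log_lik - (log_q - log_p)

rng = np.random.default_rng(1)
V, K = 30, 10
word_counts = rng.integers(0, 5, size=V).astype(float)
phi = rng.normal(size=(K, V))
a = np.full(K - 1, 1.2)
b = np.full(K - 1, 3.0)
elbo = elbo_single_sample(word_counts, a, b, phi, alpha=5.0, rng=rng)
```

Because the Kumaraswamy CDF inverts in closed form, the sample `nu` is a differentiable function of `u`, `a` and `b`, which is exactly what line 9's gradient step requires.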
3.3. Modeling the uncertainty of the prior parameter

In the generative procedure, the concentration parameter α of GEM(α) can have a significant impact on the growth of the number of topics. The larger α is, the more "breaks" it will create, and consequently, more topics will be used. Hence, it is generally reasonable to place a hyper-prior on α to model its uncertainty [4,10,47]. For example, Escobar and West [10] placed a Gamma hyper-prior on α for the urn-based samplers and implemented the corresponding Gibbs updates with auxiliary-variable methods. Blei et al. [4] also placed a Gamma prior on α and derived a closed-form update for the variational parameters. Different from previous work, we introduce the hyper-prior into the AEVB framework and propose to optimize the model jointly by stochastic gradient descent (SGD) methods.

Concretely, since the Gamma distribution is conjugate to Beta(1, α), we place a Gamma(s_1, s_2) prior on α. Then the ELBO of iTM-VAE-HP can be written as:

L(w_{1:N} | Θ, ψ) = E_{q_ψ(ν|w_{1:N})}[log p(w_{1:N}|π, Θ)]
  + E_{q_ψ(ν|w_{1:N}) q(α|γ_1,γ_2)}[log p(ν|α)]
  − E_{q_ψ(ν|w_{1:N})}[log q_ψ(ν|w_{1:N})]
  − KL(q(α|γ_1, γ_2) || p(α|s_1, s_2))   (9)

where p(α|s_1, s_2) = Gamma(s_1, s_2), p(ν_k|α) = Beta(1, α), and q(α|γ_1, γ_2) is the corpus-level variational posterior for α. The derivation of Eq. 9 can be found in Appendix D. In our experiments, we find that iTM-VAE-Prod always performs better than iTM-VAE; therefore we only report the performance of iTM-VAE-Prod with the hyper-prior, and refer to this variant as iTM-VAE-HP. As discussed in Section 1, the hyper-prior technique can also be applied to other AEVB-based models to alleviate the collapse-to-prior problem. In Appendix B, we show that by introducing a hyper-prior into SB-VAE, more latent units can be activated and the model achieves better performance.
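For reference, two α-related quantities needed for Eq. 9 have simple forms under the Gamma posterior q(α|γ_1, γ_2): E_q[α] = γ_1/γ_2, E_q[log α] = ψ(γ_1) − log γ_2, and the Gamma-Gamma KL is available in closed form. The sketch below is illustrative, not the authors' code; the helper names are ours, a shape/rate parameterization is assumed, and SciPy supplies the special functions:

```python
import numpy as np
from scipy.special import digamma, gammaln

def gamma_kl(g1, g2, s1, s2):
    """KL( Gamma(g1, g2) || Gamma(s1, s2) ), shape/rate parameterization.

    This is the last term of Eq. 9, KL(q(alpha|gamma_1, gamma_2) || p(alpha|s_1, s_2)).
    """
    return ((g1 - s1) * digamma(g1) - gammaln(g1) + gammaln(s1)
            + s1 * (np.log(g2) - np.log(s2)) + g1 * (s2 - g2) / g2)

def expected_log_prior(nu, g1, g2):
    """E_{q(alpha)}[log p(nu | alpha)] for one sampled nu, with p(nu_k|alpha) = Beta(1, alpha).

    log Beta(1, alpha) density is log(alpha) + (alpha - 1) log(1 - nu), so the
    expectation only needs E_q[alpha] = g1/g2 and E_q[log alpha] = digamma(g1) - log(g2).
    """
    e_alpha = g1 / g2
    e_log_alpha = digamma(g1) - np.log(g2)
    return float(np.sum(e_log_alpha + (e_alpha - 1.0) * np.log1p(-nu)))
```

Both terms are differentiable in γ_1 and γ_2, which is what allows the joint SGD optimization described above.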
⁴ http://qwone.com/~jason/20Newsgroups/
⁵ http://trec.nist.gov/data/reuters/reuters.html
⁶ In these baselines, at most 200 topics are used. Please refer to Table 2 for details.
4. Hierarchical iTM-VAE

In this section, we describe the generative and inference procedures of HiTM-VAE in Section 4.1 and Section 4.2. The relationship between iTM-VAE and HiTM-VAE is discussed in Section 4.3.

4.1. The generative procedure of HiTM-VAE

The generation of a document by HiTM-VAE is described as follows:

• Get the corpus-level base distribution G^(0): β ∼ GEM(γ); G^(0)(θ; β, Θ) = Σ_{i=1}^∞ β_i δ_{θ_i}(θ)
• For each document x^(j) = w^(j)_{1:N^(j)} in the corpus:
  • Draw the document-level stick-breaking weights π^(j) ∼ GEM(α)
  • Draw document-level atoms ζ_k^(j) ∼ G^(0), k = 1, ..., ∞; then we get a document-specific distribution G^(j)(θ; π^(j), {ζ_k^(j)}_{k=1}^∞, Θ) = Σ_{k=1}^∞ π_k^(j) δ_{ζ_k^(j)}(θ)
  • For each word w_n in the document: 1) draw a topic θ̂_n ∼ G^(j); 2) w_n ∼ Cat(θ̂_n)
To sample the document-level atoms ζ^(j) = {ζ_k^(j)}_{k=1}^∞, a series of indicator variables c^(j) = {c_k^(j)}_{k=1}^∞ is drawn i.i.d.: c_k^(j) ∼ Cat(β). Then, the document-level atoms are ζ_k^(j) = θ_{c_k^(j)}.

Let D and N^(j) denote the size of the dataset and the number of words in each document x^(j), respectively. After collapsing the per-word assignment random variables {{θ̂_n^(j)}_{n=1}^{N^(j)}}_{j=1}^D, the joint probability of the corpus-level atom weights β, the documents X = {x^(j)}_{j=1}^D, the stick-breaking weights Π = {π^(j)}_{j=1}^D and the indicator variables C = {c^(j)}_{j=1}^D can be written as:

p(β, X, Π, C | γ, α, Θ) = p(β|γ) ∏_{j=1}^D p(π^(j)|α) p(c^(j)|β) p(x^(j)|π^(j), c^(j), Θ)   (10)

where p(β|γ) = GEM(γ), p(π^(j)|α) = GEM(α), p(c^(j)|β) = Cat(β), and p(x^(j)|π^(j), c^(j), Θ) = ∏_{n=1}^{N^(j)} p(w_n^(j)|π^(j), c^(j), Θ) = ∏_{n=1}^{N^(j)} Cat(w_n^(j) | Σ_{k=1}^∞ π_k^(j) θ_{c_k^(j)}).
4.2. The inference procedure of HiTM-VAE

Setting the truncation levels of the corpus-level and document-level GEMs to T and K, HiTM-VAE models the per-document posterior q(ν, c | w_{1:N}) for every document w_{1:N} as:

[a_1, ..., a_{K−1}; b_1, ..., b_{K−1}; ϕ_1, ..., ϕ_K] = g(w_{1:N}; ψ)   (11)

q(ν, c | w_{1:N}) = q_ψ(ν|w_{1:N}) q_ψ(c|w_{1:N})   (12)

q_ψ(ν|w_{1:N}) = ∏_{k=1}^{K−1} κ(ν_k; a_k, b_k);   q_ψ(c|w_{1:N}) = ∏_{k=1}^K Cat(c_k; ϕ_k)   (13)

where g(w_{1:N}; ψ) is a neural network with parameters ψ, and ϕ_k = {ϕ_ki}_{i=1}^T are the multinomial variational parameters for each document-level indicator variable c_k. Then, π = {π_k}_{k=1}^K can be constructed by the stick-breaking process using ν.

As shown in Section 4.1, the generation of the corpus-level atom weights β is as follows:

β′_i ∼ Beta(1, γ);   β_i = β′_i ∏_{l=1}^{i−1} (1 − β′_l)   (14)

The corpus-level variational posterior for β′ with truncation level T is q(β′) = ∏_{i=1}^{T−1} Beta(β′_i | u_i, v_i), where {u_i, v_i}_{i=1}^{T−1} are the corpus-level variational parameters.

The ELBO of the training dataset can be written as:

L(D | Θ, ψ) = E_{q(β′)}[log (P(β′|γ) / q(β′|u, v))]
  + Σ_{j=1}^D { E_{q(ν^(j))}[log (P(ν^(j)|α) / q(ν^(j)))]
  + Σ_{k=1}^K E_{q(β′) q(c_k^(j)|ϕ_k^(j))}[log (P(c_k^(j)|β) / q(c_k^(j)|ϕ_k^(j)))]
  + E_{q(ν^(j)) q(c^(j))}[log P(x^(j)|ν^(j), c^(j), Θ)] }   (15)

where β = {β_i}_{i=1}^T, ν^(j) = {ν_k^(j)}_{k=1}^{K−1}, c^(j) = {c_k^(j)}_{k=1}^K, and ϕ_k^(j) = {ϕ_ki^(j)}_{i=1}^T. The details of the derivation of the ELBO can be found in Appendix E.

The Gumbel-Softmax estimator [15] is used for backpropagating through the categorical random variables c. Instead of training them jointly with the neural-network parameters, we use mean-field updates to learn the corpus-level variational parameters {u_i, v_i}_{i=1}^{T−1}:

u_i = 1 + Σ_{j=1}^D Σ_{k=1}^K ϕ_ki^(j);   v_i = γ + Σ_{j=1}^D Σ_{k=1}^K Σ_{l=i+1}^T ϕ_kl^(j)   (16)
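Eq. 16 is straightforward to implement once the variational parameters ϕ are arranged in an array; a NumPy sketch with our own array layout and function name:

```python
import numpy as np

def corpus_level_updates(phi, gamma):
    """Mean-field updates of Eq. 16 for the corpus-level Beta posteriors.

    phi: (D, K, T) array with phi[j, k, i] = q(c_k^(j) = i);
    gamma: the corpus-level concentration parameter.
    Returns u, v of length T-1, one Beta(u_i, v_i) factor per corpus-level stick.
    """
    counts = phi.sum(axis=(0, 1))             # expected usage count of each corpus atom
    u = 1.0 + counts[:-1]                     # u_i = 1 + sum_{j,k} phi_{ki}^{(j)}
    suffix = np.cumsum(counts[::-1])[::-1]    # suffix[i] = sum_{l >= i} counts[l]
    v = gamma + suffix[1:]                    # v_i = gamma + sum_{l > i} counts[l]
    return u, v
```

The suffix-sum trick avoids the explicit triple sum over l > i, so one update costs O(DKT) for the counts plus O(T) for the sticks.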
4.3. Discussion

In iTM-VAE, we get the document-specific topic distribution G^(j) by sampling the atom weights from a GEM. Instead of being drawn from a continuous base distribution, the atoms are modeled as trainable parameters as in [5,29,43]. Thus, the atoms are shared by all documents naturally, without the need for a hierarchical construction like HDP [47]. The hierarchical extension, HiTM-VAE, which models G^(j) in a hierarchical manner, is more flexible and can generate topic representations with better variability and sparsity. A detailed comparison is presented in Section 5.3.
5. Experiments

In this section, we evaluate the performance of iTM-VAE and its variants on two public benchmarks, 20News⁴ and Reuters RCV1-V2⁵, and demonstrate the advantages brought by the variants of iTM-VAE. The 20News dataset contains about 18,000 documents partitioned into 20 different classes, and RCV1-V2 [26] is a much bigger dataset composed of 804,414 documents manually categorized into 103 classes. To make a fair comparison, we use exactly the same data and vocabulary as [43].

The configuration of the experiments is as follows. We use a two-layer fully-connected neural network for g(w_{1:N}; ψ) of Eq. 4, and the number of hidden units is set to 256 and 512 for 20News and RCV1-V2, respectively. The concentration parameter α of the GEM distribution is cross-validated on the validation set from {10, 20, 30, 50} for iTM-VAE and iTM-VAE-Prod. The truncation level K in Eq. 7 is set to 200 so that the maximum number of topics never exceeds the ones used by the baselines,⁶ and we empirically find K = 200 is sufficiently large for the actually learned posteriors with these α on the two benchmarks. Batch-Renormalization [14] is used to stabilize the training procedure. Adam [18] is used to optimize the model, and the learning rate is set to 0.01. The code of iTM-VAE and its variants is available at http://anonymous .
Table 2
Comparison of perplexity (lower is better) and topic coherence (higher is better) between different topic models on the 20News and RCV1-V2 datasets.

                    Perplexity                Coherence
                    20News       RCV1-V2     20News        RCV1-V2
#Topics             50    200    50    200   50     200    50     200
LDA [13] a          893   1015   1062  1058  0.131  0.112  -      -
DocNADE [23]        797   804    856   670   0.086  0.082  0.079  0.065
HDP [45] a          937          918         -             -
NVDM [30] a         837   873    717   588   0.186  0.157  -      -
NVLDA [43]          1078  993    791   797   0.162  0.133  0.153  0.172
ProdLDA [43]        1009  989    780   788   0.236  0.217  0.252  0.179
GSM [29] a          787   829    653   521   0.223  0.186  -      -
GSB [29] a          816   815    712   544   0.217  0.171  -      -
RSB [29] a          785   792    662   534   0.224  0.177  -      -
RSB-TF [29] a       788          532         -             -
iTM-VAE             877          1124        0.205         0.218
iTM-VAE-Prod        775          508         0.278         0.300
iTM-VAE-HP          876          692         0.285         0.311
HiTM-VAE            912          747         0.290         0.270

a We take these results from [29] directly, since we use exactly the same datasets. The symbol "-" indicates that [29] does not provide the corresponding values. As this paper is based on variational inference, we do not compare with LDA and HDP using Gibbs sampling, which is usually time-consuming. Nonparametric methods adapt the topic number, so a single value is reported per dataset.
Table 3
Top 10 words of topics learned by iTM-VAE-Prod, without cherry picking.

Geography   turkish armenians turks armenia armenian turkey azerbaijan greek greece village
Sports      season team player nhl score playoff hockey game coach hitter
Religion    jesus bible god faith scripture christ doctrine belief eternal church
Space       orbit shuttle launch lunar spacecraft nasa satellite probe rocket moon
Hardware    scsi ide scsus motherboard ram controller upgrade meg cache floppy
Encryption  ripem escrow rsa des encrypt cipher privacy crypto chip nsa
Trade       shipping sale annual manual tor vs det cd price excellent
X           system window xterm font colormap server xlib widget xt windows toolkit
Hockey a    det tor buf cal pit que mon pt vs calgary
Health      msg b patient disease symptom doctor food pain mouse cancer hospital
Circuit     voltage puck connector signal amp input circuit pin wire connect
Lawsuit     gun homicide militia weapon amendment handgun criminal firearm crime knife
Traffic     bike brake car tire ride engine honda bmw rear motorcycle

a All these words are about hockey teams of different cities, e.g. "que" means Quebec.
b "msg" means monosodium glutamate.
5.1. Perplexity and topic coherence

Perplexity is widely used by topic models to measure the goodness-of-fit capability. It is defined as exp(−(1/D) Σ_{j=1}^D (1/|x^(j)|) log p(x^(j))), where D is the number of documents and |x^(j)| is the number of words in the j-th document x^(j). Following previous work, the variational lower bound is used to estimate the perplexity.
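Given per-document log-likelihood estimates (here, ELBO values standing in for log p(x^(j))), the definition above is a one-liner; a small sketch with our own function name:

```python
import numpy as np

def perplexity(log_px, doc_lengths):
    """exp( -(1/D) * sum_j (1/|x^(j)|) * log p(x^(j)) ).

    log_px: per-document log-likelihoods (in the paper, the variational
    lower bound stands in for the intractable log p(x^(j)));
    doc_lengths: per-document word counts |x^(j)|.
    """
    log_px = np.asarray(log_px, dtype=float)
    doc_lengths = np.asarray(doc_lengths, dtype=float)
    return float(np.exp(-np.mean(log_px / doc_lengths)))
```

Because the ELBO lower-bounds log p(x^(j)), the reported number is an upper bound on the true perplexity.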
As the quality of the learned topics is not directly reflected by perplexity [35], topic coherence is designed to match human judgment. We adopt NPMI [25] as the measurement of topic coherence, as is done in [29,43].⁷ We define a topic to be an Effective Topic if it becomes the top-1 significant topic of a sample in the training set more than τ × D times, where D is the training set size and τ is a ratio. We set τ to 0.5% in our experiments. Following [29], we use an average over the topic coherence computed with the top-5 and top-10 words across five random runs, which is more robust [24].
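Counting Effective Topics from a matrix of per-document topic proportions takes only a few lines; a sketch with our own names:

```python
import numpy as np

def effective_topics(topic_props, tau=0.005):
    """Count Effective Topics: topics that are the top-1 topic of more than
    tau * D documents, where topic_props is a (D, K) matrix of per-document
    topic proportions and tau = 0.5% as in the experiments."""
    D, K = topic_props.shape
    top1 = topic_props.argmax(axis=1)              # top-1 topic of each document
    counts = np.bincount(top1, minlength=K)        # how often each topic wins
    return int(np.sum(counts > tau * D))
```

This measure ignores topics that never dominate any document, which is what makes it a useful proxy for the number of topics the model actually uses.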
Table 2 shows the perplexity and topic coherence of different topic models on the 20News and RCV1-V2 datasets. We can clearly see that our models outperform the baselines, which indicates that our models have better goodness-of-fit capability and can discover topics that match more closely to human judgment. We can also see that HiTM-VAE achieves better perplexity than [45], in which a similar hierarchical construction is used. Note that comparing the ELBO-estimated perplexity of HiTM-VAE with other models directly is not suitable, as it has many more random variables, which usually leads to a looser bound and thus a higher estimated perplexity. The possible reasons for the good coherence achieved by our models are: 1) the product of experts enables the model to model sharper distributions; 2) the nonparametric characteristic means the models can adapt the number of topics to the data, so topics can be sufficiently trained and of high diversity.
The diversity of the learned topics is another desired property in many applications of topic modeling, as diverse topics often come with better latent topic representations of documents, suitable for discriminative tasks (e.g. document retrieval). Table 3 illustrates the highly coherent and diverse topics learned by iTM-VAE-Prod. In contrast, as we list in Table A.1 in

⁷ We use the code provided by [25] at https://github.com/jhlau/topic_interpretability/ to calculate the NPMI score.
Appendix A, there are a lot of redundant topics among the topics learned by ProdLDA [43]. As a result, the latent representation learned by ProdLDA has poor discriminative power. For an intuitive comparison of the quality of the latent topic representations, see Fig. 1 for the TSNE visualization of the topic representations learned by iTM-VAE-Prod and ProdLDA.
5.2. The effect of the hyper-prior on iTM-VAE

In this section, we provide quantitative evaluations of the effect of the hyper-prior on iTM-VAE. Specifically, a relatively non-informative hyper-prior Gamma(1, 0.05) is imposed on α, and we initialize the global variational parameters γ_1 and γ_2 of Eq. 9 to the same values as the non-informative Gamma prior. Thus, the expectation of α given the variational posterior q(α|γ_1, γ_2) is 20 before training. An SGD optimizer with a learning rate of 0.01 is used to optimize γ_1 and γ_2. No KL annealing or decoder regularization is used for iTM-VAE-HP.
Fig. 1. (a) The TSNE visualization of the representation learned by iTM-VAE-Prod. (b) The TSNE visualization of the representation learned by ProdLDA [43] with the best topic coherence on 20News (K = 50).
Fig. 2. Topic coverage w.r.t. the number of used topics learned by iTM-VAE-HP.
Fig. 3. Comparison of the topic coverage (a) and sparsity (b) between iTM-VAE-Prod (α = 5), iTM-VAE-Prod (α = 20) and HiTM-VAE (γ = 20, α = 5). We can see that HiTM-VAE can simultaneously discover more topics and produce sparser posterior topic proportions.
Table 4
The posterior distribution of α learned by iTM-VAE-HP on subsets of the 20News dataset.

#classes   γ_1     γ_2    E_q(α)[α]
1          16.88   4.58   3.68
2          23.03   3.68   6.25
5          31.43   2.88   10.93
10         39.64   2.69   14.71
20         48.91   2.98   16.39
⁸ Since there are no labels for the 20News dataset provided by [43], we preprocess the dataset ourselves in this illustrative experiment.
Table 4 reports the learned global variational parameters γ_1, γ_2 and the expectation of α given the variational posterior q(α|γ_1, γ_2) on several subsets of the 20News dataset, which contain 1, 2, 5, 10 and 20 classes, respectively.⁸ We can see that, once the training is done, the variational posterior q(α|γ_1, γ_2) is very confident, and E_{q(α|γ_1,γ_2)}[α], the expectation of α given the variational posterior, is adjusted to the training set. For example, if the training set contains only 1 class of documents, E_{q(α|γ_1,γ_2)}[α] after training is 3.68, whereas, when the training set consists of 10 classes of documents, E_{q(α|γ_1,γ_2)}[α] after training is 14.71. This indicates
Table A.1
Top 10 words of some redundant topics learned by ProdLDA [43].

Topics about Religion
1  jesus christian scripture faith god christ heaven christianity verse resurrection
2  jesus christ doctrine revelation verse scripture satan christian interpretation god
3  belief god passage scripture moral atheist christian truth principle jesus
4  god belief existence faith jesus atheist bible christian religion sin
5  jesus son holy christ father god doctrine heaven spirit prophet
6  homosexual marriage belief islam moral christianity truth islamic religion god

Topics about Hardware
7  floppy controller scsus ide scsi ram hd mb cache isa
8  printer meg adapter scsi motherboard windows modem mhz vga hd
9  ide mb connector controller isa scsi scsus floppy jumper disk
10 mb controller bio rom interface mhz scsus scsi floppy ide
11 ide meg motherboard shipping adapter simm hd mhz monitor scsi
12 ram controller dos bio windows disk scsi rom scsus meg
13 honda motherboard bike amp quadra hd brake apple upgrade meg

Topics about Lawsuit
20 morality truth moral objective absolute belief murder existence principle human
21 homicide vancouver seattle handgun firearm child states percent study file
22 murder moral constitution morality criminal objective rights gun law weapon

Topics about Politics
14 decision stephanopoulos president armenian gay package congress myers february armenians
15 armenian turkish genocide armenians turks jesus massacre muslim armenia muslims
16 armenians father gang soldier neighbor apartment girl armenian troops rape
17 muslim greek turks turkish armenian muslims village genocide armenia jews
18 armenian turks armenians armenia turkish muslim massacre village turkey greek
19 armenians armenian neighbor apartment woman soviet kill bus azerbaijan hide
...
that iTM-VAE-HP can learn to adjust α to the data, and thus the number of discovered topics adapts to the data better. In contrast, for iTM-VAE-Prod (without the hyper-prior), when the decoder is strong, no matter how many classes the dataset contains, the number of topics will be constrained tightly by the prior due to the collapse-to-prior problem of AEVB, and KL-annealing and decoder-regularizing tricks do not help much.
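Since the variational posterior q(α|γ1, γ2) is a Gamma distribution with shape γ1 and rate γ2, its mean is simply γ1/γ2, so the E_q[α] column of Table 4 can be reproduced directly from the reported parameters (a quick sketch; the results match the table up to rounding of the printed γ values):

```python
# The mean of a Gamma(gamma1, gamma2) posterior (shape/rate parameterization)
# is gamma1 / gamma2, which reproduces the E_q[alpha] column of Table 4.
table4 = {
    1: (16.88, 4.58),    # #classes: (gamma1, gamma2)
    2: (23.03, 3.68),
    5: (31.43, 2.88),
    10: (39.64, 2.69),
    20: (48.91, 2.98),
}
expected_alpha = {n: g1 / g2 for n, (g1, g2) in table4.items()}
```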
Fig. 2 illustrates the training set coverage w.r.t. the number of used topics when the training set contains 1, 2, 5, 10 and 20 classes, respectively. Specifically, we compute the average weight of every topic on the training dataset, and sort the topics according to their average weights. The topic coverage is then defined as the cumulative sum of these weights. Fig. 2 shows that, as the number of classes increases, more topics are utilized by iTM-VAE-HP to reach the same level of topic coverage, which indicates that the model has the ability to adapt to data.
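The coverage computation above can be sketched in a few lines; `theta` is a hypothetical (n_docs, n_topics) matrix of posterior topic proportions, used here only for illustration:

```python
import numpy as np

# Topic coverage as described above: average each topic's weight over the
# training documents, sort topics by that average, and take the cumulative sum.
def topic_coverage(theta):
    avg = theta.mean(axis=0)          # average weight of every topic
    order = np.argsort(avg)[::-1]     # most-used topics first
    return np.cumsum(avg[order])      # cumulative coverage curve

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.full(50, 0.1), size=1000)   # toy sparse proportions
coverage = topic_coverage(theta)
topics_for_95 = int(np.searchsorted(coverage, 0.95) + 1)  # topics for 95% coverage
```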
5.3. The evaluation of HiTM-VAE
In this section, by comparing the topic coverage and sparsity of
iTM-VAE-Prod and HiTM-VAE, we show that the hierarchical con-
struction can help the model to learn more topics, and produce
posterior topic representations with higher sparsity, which is de-
sirable in many applications [28] .
The model configurations are the same for iTM-VAE-Prod and
HiTM-VAE, except that α is set to 5 and 20 for iTM-VAE-Prod,
and γ = 20 , α = 5 for HiTM-VAE. For HiTM-VAE, the corpus-level
updates are done every 200 epochs on 20News, and 20 epochs on
RCV1-V2.
To compare the overall sparsity of the posterior topic representations of each model, we sort the topic weights of every training document and average them across the dataset. The logarithm of the average weights is then plotted against the topic index. As shown
in Fig. 3 , HiTM-VAE can learn more topics than iTM-VAE-Prod
( α = 20 ), and the sparsity of its posterior topic proportions is sig-
nificantly higher. iTM-VAE-Prod ( α = 5 ) has higher sparsity than
iTM-VAE-Prod ( α = 20 ). However, its sparsity is still lower than
HiTM-VAE with the same document-level concentration parameter
α, and it can only learn a small number of topics, which means
that there might exist rare topics that are not learned by the
model. The comparison of HiTM-VAE and iTM-VAE-Prod ( α = 5 )
shows that the superior sparsity not only comes from a smaller
per-document concentration hyper-parameter α, but also from the
flexibility brought by the hierarchical construction itself.
6. Conclusion
In this paper, we propose iTM-VAE and iTM-VAE-Prod, nonparametric topic models whose inference is modeled by Variational Auto-Encoders. Specifically, a stick-breaking prior is used to generate the atom weights of countably infinite shared topics, and the Kumaraswamy distribution is exploited so that the model can be optimized by the AEVB algorithm. We also propose iTM-VAE-HP, which introduces a hyper-prior into the VAE framework so that the model can adapt better to the data. This technique is general and can be incorporated into other VAE-based models to alleviate the collapse-to-prior problem. To further diversify the document-specific topic distributions, we use a hierarchical construction in the generative procedure, and we show that the resulting model, HiTM-VAE, can learn more topics and produce sparser posterior topic proportions. The advantage of iTM-VAE and its variants over traditional nonparametric topic models is that the inference is performed by feed-forward neural networks, which have rich representation capacity and require only limited knowledge of the data. Hence, it is easy to develop extensions and incorporate more information sources into the model. Experimental results on two public benchmarks show that iTM-VAE and its variants outperform the state-of-the-art baselines.
CRediT authorship contribution statement

Xuefei Ning: Software, Formal analysis, Investigation, Visualization. Yin Zheng: Conceptualization, Methodology, Writing - original draft, Resources. Zhuxi Jiang: Formal analysis, Investigation. Yu Wang: Writing - review & editing, Supervision. Huazhong Yang: Writing - review & editing, Supervision. Junzhou Huang: Project administration. Peilin Zhao: Writing - review & editing, Project administration.
Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment

This work was supported by the National Natural Science Foundation of China (No. U19B2019, 61832007, 61621091) and the project of Tsinghua University and Toyota Joint Research Center for AI Technology of Automated Vehicle (TT2019-01).
Appendix A. Learned Topics of ProdLDA

The learned topics of ProdLDA are listed in Table A.1.
Appendix B. The Effect of Hyper-Prior on SB-VAE

Introducing a hyper-prior into AEVB and performing global variational inference to learn the hyper-posterior is a general technique and can be applied to other VAE-based models in other scenarios. In this section, we show that the hyper-prior can also help SB-VAE to gain more adaptive power on the CIFAR-10 dataset.9

Specifically, for both SB-VAE and SB-VAE with hyper-prior (SB-VAE-HP), the encoder network is modeled as a 5-layer convolutional network, and the decoder is modeled as a 5-layer deconvolutional network. The truncation level is set to 1000 and the discretized logistic observation model [20] is used. We use the last 5000 training samples as the validation set, and a 50-epoch look-ahead is used for early stopping. Adamax [18] with a 0.001 learning rate is used as the optimizer. Table B.1 shows that with the hyper-prior technique, more latent units can be activated and the model can adapt better to datasets of different sizes. The hyper-prior can also be incorporated into other variants of VAE, and we leave the detailed evaluation for future work.
Table B.1
The comparison between vanilla SB-VAE and SB-VAE-HP on different subsets of CIFAR-10. The α of SB-VAE is set to 5 and the hyper-prior of SB-VAE-HP is set to Gamma(1.0, 0.2). 99% coverage is used to define the number of active units (AU) following [34]. bits/dim (lower is better) is calculated as $-\log_2 p(x)/D$, where D is the input dimensionality.

#classes   SB-VAE               SB-VAE-HP
           bits/dim   #AU       bits/dim   #AU      E_q[α]
1          6.73       40.7      6.62       56.9     10.4
5          6.66       58.6      6.45       85.2     17.7
10         6.59       63.0      6.34       100.1    21.7
Appendix C. The Evidence Lower Bound of iTM-VAE

In this section, we show how to compute the Evidence Lower Bound (ELBO) of iTM-VAE, which can be written as:

$$\mathcal{L}(w_{1:N} \mid \Theta, \psi) = \mathbb{E}_{q_\psi(\nu \mid w_{1:N})}\left[\log p(w_{1:N} \mid \pi, \Theta)\right] - \mathrm{KL}\left(q_\psi(\nu \mid w_{1:N}) \,\|\, p(\nu \mid \alpha)\right) \tag{C.1}$$

• $\mathbb{E}_{q_\psi(\nu \mid w_{1:N})}[\log p(w_{1:N} \mid \pi, \Theta)]$:
Similar to other VAE-based models, the SGVB estimator and the reparameterization trick can be used to approximate this intractable expectation and propagate the gradient flow into the inference network g. Specifically, we have:

$$\mathbb{E}_{q_\psi(\nu \mid w_{1:N})}\left[\log p(w_{1:N} \mid \pi, \Theta)\right] \approx \frac{1}{L}\sum_{l=1}^{L}\sum_{i=1}^{N}\log p(w_i \mid \pi^{(l)}, \Theta) \tag{C.2}$$

where L is the number of Monte Carlo samples in the SGVB estimator and can be set to 1, and N is the number of words in the document.
According to Section 3.2, $\pi^{(l)}$ can be obtained by

$$[a_1, \ldots, a_{K-1};\; b_1, \ldots, b_{K-1}] = g(w_{1:N}; \psi) \tag{C.3}$$

$$\nu_k \sim \kappa(\nu; a_k, b_k) \tag{C.4}$$

$$\pi = (\pi_1, \pi_2, \ldots, \pi_{K-1}, \pi_K) = \Big(\nu_1,\; \nu_2(1-\nu_1),\; \ldots,\; \nu_{K-1}\prod_{l=1}^{K-2}(1-\nu_l),\; \prod_{l=1}^{K-1}(1-\nu_l)\Big) \tag{C.5}$$
9 As the motivation of this experiment is to show the effect of the hyper-prior on SB-VAE, we use a much smaller and naive network architecture and do not compare with the state-of-the-art models.
where $g(w_{1:N}; \psi)$ is an inference network with parameters ψ, κ denotes the Kumaraswamy distribution, and K is the truncation level. Here we omit the superscript (l) for simplicity.
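The sampling steps of Eqs. (C.3)-(C.5) can be sketched as follows. The inference network g itself is omitted, and the Kumaraswamy parameters (a_k, b_k) are assumed given; a Kumaraswamy sample is drawn by the inverse-CDF reparameterization $\nu = (1-(1-u)^{1/b})^{1/a}$ with u ~ Uniform(0, 1):

```python
import numpy as np

def sample_kumaraswamy(a, b, rng):
    # Inverse-CDF reparameterization: differentiable in (a, b) for fixed u.
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(a))
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

def stick_breaking(nu):
    # pi_k = nu_k * prod_{l<k}(1 - nu_l); the K-th weight is the leftover stick.
    remainders = np.concatenate(([1.0], np.cumprod(1.0 - nu)))
    return np.append(nu * remainders[:-1], remainders[-1])

rng = np.random.default_rng(0)
a = np.full(9, 1.0)          # K - 1 = 9 sticks, truncation level K = 10
b = np.full(9, 5.0)
nu = sample_kumaraswamy(a, b, rng)
pi = stick_breaking(nu)      # topic proportions of length K, summing to one
```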
In our experiments, following [29], we factorize the parameter $\phi_k$ of topic $\theta_k$ as $\phi_k = t_k W$, where $t_k \in \mathbb{R}^H$ is the k-th topic factor vector, $W \in \mathbb{R}^{H \times V}$ is the word factor matrix, and H is the factor dimension. According to the generative procedure in Section 3.1, $p(w_i \mid \pi^{(l)}, \Theta)$ can be computed by

$$p(w_i \mid \pi^{(l)}, \Theta) = \begin{cases} \displaystyle\sum_{k=1}^{\infty}\pi_k^{(l)}\,\sigma(t_k W) \approx \sum_{k=1}^{K}\pi_k^{(l)}\,\sigma(t_k W) & \text{iTM-VAE} \\[2ex] \displaystyle\sigma\Big(\sum_{k=1}^{\infty}\pi_k^{(l)}\,t_k W\Big) \approx \sigma\Big(\sum_{k=1}^{K}\pi_k^{(l)}\,t_k W\Big) & \text{iTM-VAE-Prod} \end{cases} \tag{C.6}$$

where σ( · ) is the softmax function.
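On toy numbers, the two branches of Eq. (C.6) can be contrasted directly: iTM-VAE mixes the per-topic word distributions (a mixture of softmaxes), while iTM-VAE-Prod mixes the topic factors before the softmax (a product of experts). The values of t, W and π below are hypothetical random placeholders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
K, H, V = 4, 3, 6                      # topics, factor dim, vocabulary size
t = rng.normal(size=(K, H))            # topic factor vectors
W = rng.normal(size=(H, V))            # word factor matrix
pi = rng.dirichlet(np.ones(K))         # topic proportions

# iTM-VAE: mixture of the per-topic word distributions.
p_mix = sum(pi[k] * softmax(t[k] @ W) for k in range(K))
# iTM-VAE-Prod: softmax of the mixed logits (product of experts).
p_prod = softmax(sum(pi[k] * (t[k] @ W) for k in range(K)))
```

Both results are valid word distributions, but they are generally different; the product form tends to be sharper than the mixture.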
• $\mathrm{KL}(q_\psi(\nu \mid w_{1:N}) \,\|\, p(\nu \mid \alpha))$:
By applying the KL divergence of a Kumaraswamy distribution $\kappa(\nu; a_k, b_k)$ from a beta distribution $p(\nu; 1, \alpha)$, we have:

$$\begin{aligned}\mathrm{KL}\left(q_\psi(\nu \mid w_{1:N}) \,\|\, p(\nu \mid \alpha)\right) &= \sum_{k=1}^{K-1}\mathrm{KL}\left(q_\psi(\nu_k \mid w_{1:N}) \,\|\, p(\nu_k \mid \alpha)\right) \\ &= \sum_{k=1}^{K-1}\bigg[\frac{a_k-1}{a_k}\Big(-\gamma - \Psi(b_k) - \frac{1}{b_k}\Big) + \log a_k b_k + \log B(1, \alpha) \\ &\qquad + (\alpha-1)\sum_{m=1}^{\infty}\frac{b_k}{m + a_k b_k}B\Big(\frac{m}{a_k}, b_k\Big) - \frac{b_k-1}{b_k}\bigg]\end{aligned} \tag{C.7}$$

where B( ·, · ) is the Beta function and γ is Euler's constant.
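Eq. (C.7) can be checked numerically for a single stick: the closed form (with the infinite series truncated, here at 100 terms) should agree with a Monte Carlo estimate of $\mathbb{E}_q[\log q(\nu) - \log p(\nu)]$. A sketch using SciPy for the digamma and log-Beta functions, with arbitrary example values of (a, b, α):

```python
import numpy as np
from scipy.special import digamma, betaln

def kl_kumaraswamy_beta(a, b, alpha, n_terms=100):
    # Closed form of Eq. (C.7) for one stick; the infinite series is truncated.
    m = np.arange(1, n_terms + 1)
    series = np.sum(b / (m + a * b) * np.exp(betaln(m / a, b)))
    return ((a - 1.0) / a * (-np.euler_gamma - digamma(b) - 1.0 / b)
            + np.log(a * b) + betaln(1.0, alpha)
            + (alpha - 1.0) * series - (b - 1.0) / b)

# Monte Carlo estimate of KL = E_q[log q(nu) - log p(nu)] under the
# Kumaraswamy posterior q(nu; a, b) and the Beta(1, alpha) prior.
a, b, alpha = 2.0, 3.0, 5.0
rng = np.random.default_rng(0)
u = rng.uniform(1e-9, 1.0 - 1e-9, size=200_000)
nu = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
log_q = np.log(a * b) + (a - 1) * np.log(nu) + (b - 1) * np.log1p(-nu ** a)
log_p = np.log(alpha) + (alpha - 1) * np.log1p(-nu)
kl_mc = np.mean(log_q - log_p)
kl_closed = kl_kumaraswamy_beta(a, b, alpha)
```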
Appendix D. The Evidence Lower Bound of iTM-VAE-HP

In this section, we show how to compute the Evidence Lower Bound (ELBO) of iTM-VAE-HP, which can be written as:

$$\begin{aligned}\mathcal{L}(w_{1:N} \mid \Theta, \psi) ={}& \mathbb{E}_{q_\psi(\nu \mid w_{1:N})}\left[\log p(w_{1:N} \mid \pi, \Theta)\right] \\ &+ \mathbb{E}_{q_\psi(\nu \mid w_{1:N})\,q(\alpha \mid \gamma_1, \gamma_2)}\left[\log p(\nu \mid \alpha)\right] \\ &- \mathbb{E}_{q_\psi(\nu \mid w_{1:N})}\left[\log q_\psi(\nu \mid w_{1:N})\right] \\ &- \mathrm{KL}\left(q(\alpha \mid \gamma_1, \gamma_2) \,\|\, p(\alpha \mid s_1, s_2)\right)\end{aligned} \tag{D.1}$$

Specifically, each item in Eq. (D.1) can be obtained as follows:
• $\mathbb{E}_{q_\psi(\nu \mid w_{1:N})}[\log p(w_{1:N} \mid \pi, \Theta)]$:
The derivation is exactly the same as in Appendix C.

• $\mathbb{E}_{q_\psi(\nu \mid w_{1:N})\,q(\alpha \mid \gamma_1, \gamma_2)}[\log p(\nu \mid \alpha)]$:
Recall that the prior of the stick length variable $\nu_k$ is Beta(1, α), i.e. $p(\nu_k \mid \alpha) = \alpha(1-\nu_k)^{\alpha-1}$, and the variational posterior of the concentration parameter α is a Gamma distribution $q(\alpha; \gamma_1, \gamma_2)$. We have

$$\begin{aligned}&\mathbb{E}_{q_\psi(\nu \mid w_{1:N})\,q(\alpha \mid \gamma_1, \gamma_2)}\left[\log p(\nu \mid \alpha)\right] \\ &= \mathbb{E}_{q_\psi(\nu \mid w_{1:N})}\Big[\sum_{k=1}^{K-1}\mathbb{E}_{q(\alpha \mid \gamma_1, \gamma_2)}\left[\log\alpha + (\alpha-1)\log(1-\nu_k)\right]\Big] \\ &= (K-1)\,\mathbb{E}_{q(\alpha \mid \gamma_1, \gamma_2)}[\log\alpha] + \sum_{k=1}^{K-1}\frac{\gamma_1-\gamma_2}{\gamma_2}\,\mathbb{E}_{q_\psi(\nu_k \mid w_{1:N})}\left[\log(1-\nu_k)\right]\end{aligned} \tag{D.2}$$

Now, we provide more details about the calculation of the two expectations in Eq. (D.2):
◦ $\mathbb{E}_{q(\alpha \mid \gamma_1, \gamma_2)}[\log\alpha]$:
First, we can write the Gamma distribution $q(\alpha; \gamma_1, \gamma_2)$ in its exponential family form:

$$q(\alpha; \gamma_1, \gamma_2) = \frac{1}{\alpha}\exp\Big(-\gamma_2\alpha + \gamma_1\log\alpha - \big(\log\Gamma(\gamma_1) - \gamma_1\log\gamma_2\big)\Big) \tag{D.3}$$

Using the general fact that the derivative of the log normalizer $\log\Gamma(\gamma_1) - \gamma_1\log\gamma_2$ of an exponential family distribution with respect to its natural parameter $\gamma_1$ equals the expectation of the sufficient statistic $\log\alpha$, we can compute $\mathbb{E}_{q(\alpha \mid \gamma_1, \gamma_2)}[\log\alpha]$ in the first term of Eq. (D.2) as follows:

$$\mathbb{E}_{q(\alpha \mid \gamma_1, \gamma_2)}[\log\alpha] = \Psi(\gamma_1) - \log\gamma_2 \tag{D.4}$$

where Ψ is the digamma function, the first derivative of the log Gamma function.
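Eq. (D.4) is easy to verify numerically: sample from the Gamma posterior (note that NumPy's gamma sampler takes a scale, the reciprocal of the rate γ2) and compare the empirical mean of log α with ψ(γ1) − log γ2. A sketch for hypothetical values of (γ1, γ2):

```python
import numpy as np
from scipy.special import digamma

g1, g2 = 4.5, 1.5                       # hypothetical (gamma1, gamma2)
closed_form = digamma(g1) - np.log(g2)  # Eq. (D.4)

rng = np.random.default_rng(0)
alpha_samples = rng.gamma(shape=g1, scale=1.0 / g2, size=500_000)  # rate -> scale
mc_estimate = np.log(alpha_samples).mean()
```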
◦ $\mathbb{E}_{q_\psi(\nu_k \mid w_{1:N})}[\log(1-\nu_k)]$:
By applying a Taylor expansion, $\mathbb{E}_{q_\psi(\nu_k \mid w_{1:N})}[\log(1-\nu_k)]$ can be written as an infinite sum of the Kumaraswamy's m-th moments:

$$\mathbb{E}_{q_\psi(\nu_k \mid w_{1:N})}\left[\log(1-\nu_k)\right] = -\sum_{m=1}^{\infty}\frac{1}{m}\,\mathbb{E}_{q_\psi(\nu_k \mid w_{1:N})}\left[\nu_k^m\right] = -\sum_{m=1}^{\infty}\frac{b_k}{m + a_k b_k}B\Big(\frac{m}{a_k}, b_k\Big) \tag{D.5}$$

where B( ·, · ) is the Beta function.
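Likewise, the truncated series of Eq. (D.5) can be compared against a Monte Carlo estimate of E[log(1 − ν)] under Kumaraswamy(a, b) samples drawn via the inverse CDF; a sketch with arbitrary example parameters:

```python
import numpy as np
from scipy.special import betaln

a, b = 2.0, 3.0
m = np.arange(1, 200)                   # truncation of the infinite sum
series = -np.sum(b / (m + a * b) * np.exp(betaln(m / a, b)))  # Eq. (D.5)

rng = np.random.default_rng(0)
u = rng.uniform(1e-9, 1.0 - 1e-9, size=500_000)
nu = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)  # Kumaraswamy(a, b) samples
mc = np.log1p(-nu).mean()
```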
By substituting Eq. (D.4) and Eq. (D.5) into Eq. (D.2), we obtain:

$$\begin{aligned}&\mathbb{E}_{q_\psi(\nu \mid w_{1:N})\,q(\alpha \mid \gamma_1, \gamma_2)}\left[\log p(\nu \mid \alpha)\right] \\ &= (K-1)\big(\Psi(\gamma_1) - \log\gamma_2\big) - \frac{\gamma_1-\gamma_2}{\gamma_2}\sum_{k=1}^{K-1}\sum_{m=1}^{\infty}\frac{b_k}{m + a_k b_k}B\Big(\frac{m}{a_k}, b_k\Big)\end{aligned} \tag{D.6}$$
• $-\mathbb{E}_{q_\psi(\nu \mid w_{1:N})}[\log q_\psi(\nu \mid w_{1:N})]$:
According to Section 4.11 of [31], the Kumaraswamy's entropy is given as

$$\begin{aligned}-\mathbb{E}_{q_\psi(\nu \mid w_{1:N})}\left[\log q_\psi(\nu \mid w_{1:N})\right] &= -\sum_{k=1}^{K-1}\mathbb{E}_{q_\psi(\nu_k \mid w_{1:N})}\left[\log q_\psi(\nu_k \mid w_{1:N})\right] \\ &= \sum_{k=1}^{K-1}\bigg[-\log(a_k b_k) + \frac{a_k-1}{a_k}\Big(\gamma + \Psi(b_k) + \frac{1}{b_k}\Big) + \frac{b_k-1}{b_k}\bigg]\end{aligned} \tag{D.7}$$

where γ is Euler's constant.
• $\mathrm{KL}(q(\alpha \mid \gamma_1, \gamma_2) \,\|\, p(\alpha \mid s_1, s_2))$:
The KL divergence of one Gamma distribution $q(\alpha; \gamma_1, \gamma_2)$ from another Gamma distribution $p(\alpha; s_1, s_2)$ evaluates to

$$\mathrm{KL}(q \,\|\, p) = s_1\log\frac{\gamma_2}{s_2} - \log\frac{\Gamma(\gamma_1)}{\Gamma(s_1)} + (\gamma_1 - s_1)\Psi(\gamma_1) - (\gamma_2 - s_2)\frac{\gamma_1}{\gamma_2} \tag{D.8}$$
Appendix E. The Evidence Lower Bound of HiTM-VAE

In this section, we show how to compute the Evidence Lower Bound of HiTM-VAE on the whole dataset, which can be written as:

$$\begin{aligned}\mathcal{L}(\mathcal{D} \mid \Theta, \psi) ={}& \mathbb{E}_{q(\beta')}\Big[\log\frac{p(\beta' \mid \gamma)}{q(\beta' \mid u, v)}\Big] + \sum_{j=1}^{D}\bigg\{\mathbb{E}_{q(\nu^{(j)})}\Big[\log\frac{p(\nu^{(j)} \mid \alpha)}{q(\nu^{(j)})}\Big] \\ &+ \sum_{k=1}^{K}\mathbb{E}_{q(\beta')\,q(c_k^{(j)} \mid \phi_k^{(j)})}\Big[\log\frac{p(c_k^{(j)} \mid \beta)}{q(c_k^{(j)} \mid \phi_k^{(j)})}\Big] \\ &+ \mathbb{E}_{q(\nu^{(j)})\,q(c^{(j)})}\Big[\log p(x^{(j)} \mid \nu^{(j)}, c^{(j)}, \Theta)\Big]\bigg\}\end{aligned} \tag{E.1}$$

Specifically, each item in Eq. (E.1) can be obtained as follows:
• $\mathbb{E}_{q(\beta')}\big[\log\frac{p(\beta' \mid \gamma)}{q(\beta' \mid u, v)}\big]$:
The KL divergence between two series of independent Beta distributions, Beta($u_i$, $v_i$) and Beta(1, γ), is:

$$\begin{aligned}\sum_{i=1}^{T-1}\mathrm{KL}\big(\mathrm{Beta}(u_i, v_i) \,\|\, \mathrm{Beta}(1, \gamma)\big) ={}& -(T-1)\log\gamma - \sum_{i=1}^{T-1}\Big\{\log\frac{\Gamma(u_i)\Gamma(v_i)}{\Gamma(u_i+v_i)} \\ &+ \big(\Psi(u_i+v_i) - \Psi(u_i)\big)(u_i - 1) \\ &+ \big(\Psi(u_i+v_i) - \Psi(v_i)\big)(v_i - \gamma)\Big\}\end{aligned} \tag{E.2}$$
• The first term in the summation, $\mathbb{E}_{q(\nu^{(j)} \mid a^{(j)}, b^{(j)})}\big[\log\frac{p(\nu^{(j)} \mid \alpha)}{q(\nu^{(j)})}\big]$, takes the same form as Eq. (C.7):

$$\begin{aligned}\sum_{k=1}^{K-1}\mathrm{KL}\big(q(\nu_k^{(j)}) \,\|\, p(\nu_k^{(j)} \mid \alpha)\big) ={}& \sum_{k=1}^{K-1}\bigg[\frac{a_k-1}{a_k}\Big(-\gamma - \Psi(b_k) - \frac{1}{b_k}\Big) + \log a_k b_k + \log B(1, \alpha) \\ &+ (\alpha-1)\sum_{m=1}^{\infty}\frac{b_k}{m + a_k b_k}B\Big(\frac{m}{a_k}, b_k\Big) - \frac{b_k-1}{b_k}\bigg]\end{aligned} \tag{E.3}$$
• The second term in the summation, $\sum_{k=1}^{K}\mathbb{E}_{q(\beta')\,q(c_k^{(j)} \mid \phi_k^{(j)})}\big[\log\frac{p(c_k^{(j)} \mid \beta)}{q(c_k^{(j)} \mid \phi_k^{(j)})}\big]$:
Let $A_i^{(j)} = \sum_{k=1}^{K}\phi_{ki}^{(j)}$, and let $H(\phi_k^{(j)})$ denote the entropy of the multinomial $q(c_k^{(j)} \mid \phi_k^{(j)})$. Then this term can be written as

$$\begin{aligned}&\sum_{k=1}^{K}H(\phi_k^{(j)}) + \sum_{i=1}^{T}\mathbb{E}_q[\log\beta_i]\,A_i^{(j)} \\ &= \sum_{k=1}^{K}H(\phi_k^{(j)}) + \sum_{i=1}^{T-1}A_i^{(j)}\,\mathbb{E}_q[\log\beta_i'] + \sum_{i=1}^{T}A_i^{(j)}\sum_{l=1}^{i-1}\mathbb{E}_q[\log(1-\beta_l')] \\ &= \sum_{k=1}^{K}H(\phi_k^{(j)}) + \sum_{i=1}^{T-1}A_i^{(j)}\big\{\Psi(u_i) - \Psi(u_i+v_i)\big\} + \sum_{i=1}^{T}A_i^{(j)}\sum_{l=1}^{i-1}\big\{\Psi(v_l) - \Psi(u_l+v_l)\big\}\end{aligned}$$
$$= \sum_{k=1}^{K}H(\phi_k^{(j)}) + \sum_{i=1}^{T-1}\bigg\{A_i^{(j)}\,\Psi(u_i) - \Big(\sum_{l=i}^{T}A_l^{(j)}\Big)\Psi(u_i+v_i) + \Big(\sum_{l=i+1}^{T}A_l^{(j)}\Big)\Psi(v_i)\bigg\} \tag{E.4}$$
• The third term in the summation, $\mathbb{E}_{q(\nu^{(j)})\,q(c^{(j)})}\big[\log p(x^{(j)} \mid \nu^{(j)}, c^{(j)}, \Theta)\big]$, is estimated by MC sampling. For backpropagating through the stochastic units, the reparameterization trick for the Kumaraswamy posterior $q(\{\nu_k^{(j)}\}_{k=1}^{K})$ and the Gumbel-Softmax approximation [15] for the multinomial posterior $q(\{c_k^{(j)}\}_{k=1}^{K})$ are used.
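A sketch of the Gumbel-Softmax step: the categorical posterior $q(c_k^{(j)} \mid \phi_k^{(j)})$ is relaxed by perturbing the log-probabilities with Gumbel(0, 1) noise and applying a temperature-controlled softmax, which keeps the sample differentiable with respect to φ. The probabilities and temperature below are toy values:

```python
import numpy as np

def gumbel_softmax(log_phi, tau, rng):
    # Gumbel(0, 1) noise via the inverse CDF, then a tempered softmax.
    g = -np.log(-np.log(rng.uniform(size=log_phi.shape)))
    y = (log_phi + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

rng = np.random.default_rng(0)
phi = np.array([0.1, 0.2, 0.6, 0.1])                 # toy posterior over T atoms
c_relaxed = gumbel_softmax(np.log(phi), tau=0.5, rng=rng)
# As tau -> 0 the relaxed sample approaches a one-hot vector.
```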
Appendix F. Abbreviations of Methods

Table F.1
The abbreviations used in this paper.

Abbreviation               Description
LDA                        Latent Dirichlet Allocation [5]
HDP                        Hierarchical Dirichlet Process [44]
AEVB                       Auto-Encoding Variational Bayes [21,39]
AVITM                      Autoencoded Variational Inference For Topic Model [43]
SB-VAE                     Stick-Breaking Variational Auto-Encoder [34]
NVDM                       Neural Variational Document Model [30]
GSM, GSB, RSB, RSB-TF      Methods proposed in [29]
iTM-VAE (Sec. 3.1)         infinite Topic Model with Variational Auto-Encoders
iTM-VAE-Prod (Sec. 3.1)    iTM-VAE using Product-of-experts
iTM-VAE-HP (Sec. 3.3)      iTM-VAE with the Hyper-Prior extension
HiTM-VAE (Sec. 4)          Hierarchical iTM-VAE
References

[1] C. Archambeau, B. Lakshminarayanan, G. Bouchard, Latent IBP compound Dirichlet allocation, IEEE Trans. Pattern Anal. Mach. Intell. 37 (2) (2015) 321–333.
[2] J.M. Bernardo, A.F. Smith, Bayesian Theory, 2001.
[3] D.M. Blei, Probabilistic topic models, Commun. ACM 55 (4) (2012) 77–84.
[4] D.M. Blei, M.I. Jordan, et al., Variational inference for Dirichlet process mixtures, Bayesian Anal. 1 (1) (2006) 121–143.
[5] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3 (Jan) (2003) 993–1022.
[6] S.R. Bowman, L. Vilnis, O. Vinyals, A.M. Dai, R. Józefowicz, S. Bengio, Generating sentences from a continuous space, arXiv preprint arXiv:1511.06349 (2015).
[7] S. Burkhardt, S. Kramer, Decoupling sparsity and smoothness in the Dirichlet variational autoencoder topic model, J. Mach. Learn. Res. 20 (131) (2019) 1–27.
[8] D. Card, C. Tan, N.A. Smith, A neural framework for generalized topic models, arXiv preprint arXiv:1705.09296 (2017).
[9] X. Chen, D.P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, P. Abbeel, Variational lossy autoencoder, in: International Conference on Learning Representations, 2017.
[10] M.D. Escobar, M. West, Bayesian density estimation and inference using mixtures, J. Am. Stat. Assoc. 90 (430) (1995) 577–588.
[11] G.E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Comput. 14 (8) (2006).
[12] G.E. Hinton, R.R. Salakhutdinov, Replicated softmax: an undirected topic model, in: Advances in Neural Information Processing Systems, 2009, pp. 1607–1614.
[13] M. Hoffman, F.R. Bach, D.M. Blei, Online learning for latent Dirichlet allocation, in: Advances in Neural Information Processing Systems, 2010.
[14] S. Ioffe, Batch renormalization: towards reducing minibatch dependence in batch-normalized models, in: Advances in Neural Information Processing Systems, 2017, pp. 1945–1953.
[15] E. Jang, S. Gu, B. Poole, Categorical reparameterization with Gumbel-Softmax, in: International Conference on Learning Representations, 2017.
[16] W. Joo, W. Lee, S. Park, I.-C. Moon, Dirichlet variational autoencoder, arXiv preprint arXiv:1901.02739 (2019).
[17] D.I. Kim, E.B. Sudderth, The doubly correlated nonparametric topic model, in: Advances in Neural Information Processing Systems, 2011, pp. 1980–1988.
[18] D. Kingma, J. Ba, Adam: a method for stochastic optimization, in: International Conference on Learning Representations, 2015.
[19] D. Kingma, M. Welling, Efficient gradient-based inference through transformations between Bayes nets and neural nets, in: International Conference on Machine Learning, 2014, pp. 1782–1790.
[20] D.P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, M. Welling, Improving variational inference with inverse autoregressive flow, in: International Conference on Learning Representations, 2016.
[21] D.P. Kingma, M. Welling, Auto-encoding variational Bayes, in: International Conference on Learning Representations, 2014.
[22] P. Kumaraswamy, A generalized probability density function for double-bounded random processes, J. Hydrol. 46 (1-2) (1980) 79–88.
[23] H. Larochelle, S. Lauly, A neural autoregressive topic model, in: Advances in Neural Information Processing Systems, 2012, pp. 2708–2716.
[24] J.H. Lau, T. Baldwin, The sensitivity of topic coherence evaluation to topic cardinality, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 483–487.
[25] J.H. Lau, D. Newman, T. Baldwin, Machine reading tea leaves: automatically evaluating topic coherence and topic model quality, in: Conference of the European Chapter of the Association for Computational Linguistics, 2014, pp. 530–539.
[26] D.D. Lewis, Y. Yang, T.G. Rose, F. Li, RCV1: a new benchmark collection for text categorization research, J. Mach. Learn. Res. 5 (Apr) (2004) 361–397.
[27] K.W. Lim, W. Buntine, C. Chen, L. Du, Nonparametric Bayesian topic modelling with the hierarchical Pitman–Yor processes, Int. J. Approx. Reason. 78 (2016) 172–191.
[28] T. Lin, W. Tian, Q. Mei, H. Cheng, The dual-sparse topic model: mining focused topics and focused terms in short text, in: International World Wide Web Conference, ACM, 2014, pp. 539–550.
[29] Y. Miao, E. Grefenstette, P. Blunsom, Discovering discrete latent topics with neural variational inference, in: International Conference on Machine Learning, JMLR.org, 2017, pp. 2410–2419.
[30] Y. Miao, L. Yu, P. Blunsom, Neural variational inference for text processing, in: International Conference on Machine Learning, 2016, pp. 1727–1736.
[31] J.V. Michalowicz, J.M. Nichols, F. Bucholtz, Handbook of Differential Entropy, CRC Press, 2013.
[32] A. Mnih, K. Gregor, Neural variational inference and learning in belief networks, in: International Conference on Machine Learning, 2014.
[33] K.P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
[34] E. Nalisnick, P. Smyth, Stick-breaking variational autoencoders, in: International Conference on Learning Representations, 2017.
[35] D. Newman, J.H. Lau, K. Grieser, T. Baldwin, Automatic evaluation of topic coherence, in: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, ACL, 2010, pp. 100–108.
[36] D. Putthividhy, H.T. Attias, S.S. Nagarajan, Topic regression multi-modal latent Dirichlet allocation for image annotation, in: Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 3408–3415.
[37] R. Ranganath, S. Gerrish, D. Blei, Black box variational inference, in: International Conference on Artificial Intelligence and Statistics, 2014.
[38] N. Rasiwasia, N. Vasconcelos, Latent Dirichlet allocation models for image classification, IEEE Trans. Pattern Anal. Mach. Intell. 35 (11) (2013) 2665–2679.
[39] D.J. Rezende, S. Mohamed, D. Wierstra, Stochastic backpropagation and variational inference in deep latent Gaussian models, in: International Conference on Machine Learning, 2014.
[40] S. Rogers, M. Girolami, C. Campbell, R. Breitling, The latent process decomposition of cDNA microarray data sets, IEEE/ACM Trans. Comput. Biol. Bioinform. 2 (2) (2005) 143–156.
[41] J. Sethuraman, A constructive definition of Dirichlet priors, Stat. Sin. 4 (1994) 639–650.
[42] C.K. Sønderby, T. Raiko, L. Maaløe, S.K. Sønderby, O. Winther, Ladder variational autoencoders, in: Advances in Neural Information Processing Systems, 2016, pp. 3738–3746.
[43] A. Srivastava, C. Sutton, Autoencoding variational inference for topic models, in: International Conference on Learning Representations, 2017.
[44] Y.W. Teh, A hierarchical Bayesian language model based on Pitman–Yor processes, in: International Conference on Computational Linguistics, ACL, 2006, pp. 985–992.
[45] C. Wang, J. Paisley, D. Blei, Online variational inference for the hierarchical Dirichlet process, in: International Conference on Artificial Intelligence and Statistics, 2011, pp. 752–760.
[46] X. Wei, W.B. Croft, LDA-based document models for ad-hoc retrieval, in: International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2006, pp. 178–185.
[47] Y.W. Teh, M.I. Jordan, M.J. Beal, D.M. Blei, Hierarchical Dirichlet processes, J. Am. Stat. Assoc. 101 (476) (2006) 1566–1581.
[48] H. Zhang, B. Chen, D. Guo, M. Zhou, WHAI: Weibull hybrid autoencoding inference for deep topic modeling, in: International Conference on Learning Representations, 2018.
Xuefei Ning is currently a Ph.D. student in the Department of Electronic Engineering, Tsinghua University. She received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 2016. Her research interests include the reliability and robustness of neural networks.

Yin Zheng is currently a senior researcher with Weixin Group, Tencent. He received his Ph.D. degree from Tsinghua University in 2015. He serves as a PC member or reviewer for many journals and conferences in his area.

Zhuxi Jiang received the Bachelor's degree and the Master's degree from Beijing Institute of Technology, Beijing, China, in 2015 and 2018, respectively. His research interests include machine learning, computer vision and intelligent transportation systems.

Yu Wang (S'05-M'07-SM'14) received the B.S. and Ph.D. (with honor) degrees from Tsinghua University, Beijing, in 2002 and 2007. He is currently a tenured professor with the Department of Electronic Engineering, Tsinghua University. His research interests include brain inspired computing, application specific hardware computing, parallel circuit analysis, and power/reliability aware system design methodology. He has authored and coauthored more than 200 papers in refereed journals and conferences. He is a recipient of the DAC Under-40 Innovator Award (2018) and the IBM X10 Faculty Award (2010). He served as TPC/track chair and program committee member for leading conferences in the EDA and FPGA fields.

Huazhong Yang (M'97-SM'00) received the B.S. degree in microelectronics in 1989, and the M.S. and Ph.D. degrees in electronic engineering in 1993 and 1998, respectively, all from Tsinghua University, Beijing. In 1993, he joined the Department of Electronic Engineering, Tsinghua University, Beijing, where he has been a Full Professor since 1998. Dr. Yang was awarded the Distinguished Young Researcher by NSFC in 2000 and Cheung Kong Scholar by the Ministry of Education of China in 2012. He has been in charge of several projects, including projects sponsored by the national science and technology major project, the 863 program, NSFC, the 9th five-year national program and several international research projects. Dr. Yang has authored and co-authored over 400 technical papers, 7 books, and over 100 granted Chinese patents. His current research interests include wireless sensor networks, data converters, energy-harvesting circuits, nonvolatile processors, and brain inspired computing.

Junzhou Huang received the B.E. degree from Huazhong University of Science and Technology, Wuhan, China, the M.S. degree from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, and the Ph.D. degree in Computer Science from Rutgers, The State University of New Jersey. His major research interests include machine learning, computer vision and biomedical informatics, with a focus on the development of sparse modeling, imaging, and learning for big data analytics.

Peilin Zhao is currently a Principal Researcher at Tencent AI Lab, China. Previously, he has worked at Rutgers University, the Institute for Infocomm Research (I2R), and Ant Financial Services Group. His research interests include online learning, deep learning, recommendation systems, automatic machine learning, etc. He has published over 100 papers in top venues, including JMLR, ICML, KDD, etc. He has been invited as a PC member, reviewer or editor for many international conferences and journals, such as ICML, JMLR, etc. He received his bachelor's degree from Zhejiang University, and his Ph.D. degree from Nanyang Technological University.