A fast algorithm with minimax optimal guarantees for topic models
with an unknown number of topics.
Xin Bing∗ Florentina Bunea† Marten Wegkamp‡
September 6, 2019
Abstract
Topic models have become popular for the analysis of data that consists of a collection of n
independent multinomial observations, with parameters Ni ∈ N and Πi ∈ [0, 1]p for i = 1, . . . , n.
The model links all cell probabilities, collected in a p×n matrix Π, via the assumption that Π can
be factorized as the product of two nonnegative matrices A ∈ [0, 1]p×K and W ∈ [0, 1]K×n. Topic
models were originally developed in text mining, where one browses through n documents,
based on a dictionary of p words, and covering K topics. In this terminology, the matrix A
is called the word-topic matrix, and is the main target of estimation. It can be viewed as a
matrix of conditional probabilities, and it is uniquely defined, under appropriate separability
assumptions, discussed in detail in this work. Notably, the unique A is required to satisfy what
is commonly known as the anchor word assumption, under which A has an unknown number of
rows respectively proportional to the canonical basis vectors in RK . The indices of such rows
are referred to as anchor words. Recent computationally feasible algorithms, with theoretical
guarantees, utilize constructively this assumption by linking the estimation of the set of anchor
words with that of estimating the K vertices of a simplex. This crucial step in the estimation
of A requires K to be known, and cannot be easily extended to the more realistic set-up when
K is unknown.
This work takes a different view on anchor word estimation, and on the estimation of A.
We propose a new method of estimation in topic models, that is not a variation on the existing
simplex finding algorithms, and that estimates K from the observed data. We derive new finite
sample minimax lower bounds for the estimation of A, as well as new upper bounds for our
proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our
finite sample analysis is valid for any n,Ni, p and K, and both p and K are allowed to increase
with n, a situation not handled well by previous analyses.
∗Department of Statistical Science, Cornell University, Ithaca, NY. E-mail: [email protected].
†Department of Statistical Science, Cornell University, Ithaca, NY. E-mail: [email protected].
‡Department of Mathematics and Department of Statistical Science, Cornell University, Ithaca, NY. E-mail:
arXiv:1805.06837v3 [stat.ML] 4 Sep 2019
We complement our theoretical results with a detailed simulation study. We illustrate that
the new algorithm is faster and more accurate than the current ones, even though we start with
the computational and theoretical disadvantage of not knowing the correct number of topics K,
while the competing methods are given its correct value in our simulations.
Keywords: Topic model, latent model, overlapping clustering, identification, high dimensional
estimation, minimax estimation, anchor words, separability, nonnegative matrix factorization,
adaptive estimation
1 Introduction
1.1 Background
Topic models have been developed during the last two decades in natural language processing and
machine learning for discovering the themes, or “topics”, that occur in a collection of documents.
They have also been successfully used to explore structures in data from genetics, neuroscience and
computational social science, to name just a few areas of application. Earlier works on versions
of these models, called latent semantic indexing models, appeared mostly in the computer sci-
ence and information science literature, for instance Deerwester et al. (1990); Papadimitriou et al.
(1998); Hofmann (1999); Papadimitriou et al. (2000). Bayesian solutions, involving latent Dirichlet
allocation models, have been introduced in Blei et al. (2003) and MCMC-type solvers have been
considered by Griffiths and Steyvers (2004), to give a very limited number of earlier references. We
refer to Blei (2012) for an in-depth overview of this field. One weakness of the earlier work on topic
models was of computational nature, which motivated further, more recent, research on the devel-
opment of algorithms with polynomial running time, see, for instance, Anandkumar et al. (2012);
Arora et al. (2013); Bansal et al. (2014); Ke and Wang (2017). Despite these recent advances, fast
algorithms leading to estimators with sharp statistical properties are still lacking, which motivates
this work.
We begin by describing the topic model, using the terminology employed for its original usage,
that of text mining. It is assumed that we observe a collection of n independent documents,
and that each document is written using the same dictionary of p words. For each document
i ∈ [n] := {1, . . . , n}, we sample Ni words and record their frequencies in the vector Xi ∈ Rp. It
is further assumed that the probability Πji with which a word j appears in a document i depends
on the topics covered in the document, justifying the following informal application of the Bayes’
theorem,
Πji := Pi(Word j) = ∑_{k=1}^K Pi(Word j | Topic k) Pi(Topic k).
The topic model assumption is that the conditional probability of the occurrence of a word, given
the topic, is the same for all documents. This leads to the topic model specification:
Πji = ∑_{k=1}^K P(Word j | Topic k) Pi(Topic k),   for each j ∈ [p], i ∈ [n].   (1)

We collect the above conditional probabilities in the p × K word-topic matrix A and we let Wi ∈ R^K denote the vector containing the probabilities of each of the K topics occurring in document
i ∈ [n]. With this notation, data generated from topic models are observed count frequencies Xi
corresponding to independent
Yi := Ni Xi ∼ Multinomialp(Ni, AWi),   for each i ∈ [n].   (2)
Let X be the p × n observed data matrix, W be the K × n matrix with columns Wi, and Π be
the p × n matrix with entries Πji satisfying (1). The topic model therefore postulates that the
expectation of the word-document frequency matrix X has the non-negative factorization
Π := E[X] = AW, (3)
and the goal is to borrow strength across the n samples to estimate the common matrix of condi-
tional probabilities, A. Since the columns in Π, A and W are probabilities specified by (1), they
have non-negative entries and satisfy
∑_{j=1}^p Πji = 1,   ∑_{j=1}^p Ajk = 1,   ∑_{k=1}^K Wki = 1,   for any k ∈ [K] and i ∈ [n].   (4)
In Section 2 we discuss in detail separability conditions on A and W that ensure the uniqueness of
the factorization in (3).
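As a concrete illustration of the sampling scheme in (2)-(4), the model can be simulated in a few lines. The sizes p, K, n, N and the Dirichlet draws below are our own illustrative choices, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
p, K, n, N = 30, 3, 100, 200                   # illustrative sizes only

# Word-topic matrix A and topic-weight matrix W, with unit column sums as in (4).
A = rng.dirichlet(np.ones(p), size=K).T        # p x K
W = rng.dirichlet(np.ones(K), size=n).T        # K x n
Pi = A @ W                                     # p x n cell probabilities, as in (3)

# Observed frequencies X_i, with Y_i = N X_i ~ Multinomial_p(N, A W_i), as in (2).
X = np.column_stack([rng.multinomial(N, Pi[:, i]) / N for i in range(n)])
```

Every column of Pi and of X sums to one, matching the constraints in (4).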
In this context, the main goal of this work is to estimate A optimally, both computationally
and from a minimax-rate perspective, in identifiable topic models, with an unknown number K of
topics, that is allowed to depend on n,Ni, p.
1.2 Outline and contributions
In this section we describe the outline of this paper and give a precise summary of our results
which are developed via the following overall strategy: (i) We first show that A can be derived,
uniquely, at the population level, from quantities that can be estimated independently of A. (ii)
We use the constructive procedure in (i) for estimation, and replace population level quantities by
appropriate estimates, tailored to our final goal of minimax optimal estimation of A in (3), via fast
computation.
Recovery of A at the population level. We prove in Propositions 2 and 3 of Section 3 that
the target word-topic matrix A can be uniquely derived from Π, and give the resulting procedure
in Algorithm 1. The proofs require the separability Assumptions 1 and 2, common in the topic
model literature, when K is known. All model assumptions are stated and discussed in Section 2,
and informally described here. Assumption 1 is placed on the word-topic matrix A, and is known
as the anchor-word assumption as it requires the existence of words that are solely associated with
one topic. In Assumption 2 we require that W have full rank.
To the best of our knowledge, parameter identifiability in topic models has received limited
attention. If model (3) and Assumptions 1 and 2 hold, and provided that the index set I
corresponding to anchor words, as well as the number of topics K, are known, Lemma 3.1 of Arora
et al. (2012) shows that A can be constructed uniquely via Π. If I is unknown, but K is known,
Theorem 3.1 of Bittorf et al. (2012) further shows that the matrices A and W can be constructed
uniquely via Π, by connecting the problem of finding I with that of finding the K vertices of an
appropriately defined simplex. Methods that utilize simplex structures are common in the topic
models literature, such as the simplex structure in the word-word co-occurrence matrix (Arora
et al., 2012, 2013), in the original matrix Π (Ding et al., 2013), and in the singular vectors of Π
(Ke and Wang, 2017).
In this work we provide a solution to the open problem of constructing I, and then A, in topic
models, in the more realistic situation when K is unknown. For this, we develop a method that
is not a variation of the existing simplex-based constructions. Under the additional Assumption 3
of Section 2, but without a priori knowledge of K, we recover the index set I of all anchor words,
as well as its partition I. This constitutes Proposition 2. Our proof only requires the existence
of one anchor word for each topic, but we allow for the existence of more, as is typically
the case in practice; see, for instance, Blei (2012). Our method is optimization-free. It involves
comparisons between row and column maxima of a scaled version of the matrix ΠΠT , specifically
of the matrix R given by (11). Example 1 of Section 3 illustrates our procedure, whereas a contrast
with simplex-based approaches is given in Remark 1 of Section 3.
Estimation of A. In Section 5.2, we follow the steps of Algorithm 1 of Section 3 to develop
Algorithms 2 and 3 for estimating A from the data.
We show first how to construct estimators of I, I and K, and summarize this construction
in Algorithm 2 of Section 4, with theoretical guarantees provided in Theorem 4. Since we follow
Algorithm 1, this step of our estimation procedure does not involve any of the previously used
simplex recovery algorithms, such as those mentioned above.
The estimators of I, I and K are employed in the second step of our procedure, summarized
in Algorithm 3 of Section 5.2. This step yields the estimator Â of A, and only requires solving a
system of equations under linear restrictions, which, in turn, requires the estimation of the inverse
of a matrix. For the latter, we develop a fast and stable algorithm, tailored to this model, which
reduces to solving K linear programs, each optimizing over a K-dimensional space. This is less
involved, computationally, than the next best competing estimator of A in Arora et al. (2013),
albeit one developed for known K. After estimating I, their estimate of A requires solving p restricted
convex optimization problems, each optimizing over a K-dimensional parameter space.
We assess the theoretical performance of our estimator Â with respect to the L1,∞ and L1 losses
defined below, by providing finite sample lower and upper bounds on these quantities, that hold
for all p, K, Ni and n. In particular, we allow K and p to grow with n, as we expect that when the
number of available documents n increases, so will the number K of topics that they cover, and
possibly the number p of words used in these documents. Specifically, we let HK denote the set of
all K ×K permutation matrices and define:
‖Â − A‖1,∞ := max_{1≤k≤K} ∑_{j=1}^p |Âjk − Ajk|,   ‖Â − A‖1 := ∑_{j=1}^p ∑_{k=1}^K |Âjk − Ajk|,

L1,∞(Â, A) := min_{P∈HK} ‖Â − AP‖1,∞,   L1(Â, A) := min_{P∈HK} ‖Â − AP‖1.
We provide upper bounds for L1(Â, A) and L1,∞(Â, A) in Theorem 7 of Section 5.3. To benchmark
these upper bounds, Theorem 6 in Section 5.1 shows that the corresponding lower bounds are:
inf_Â sup_A PA( L1,∞(Â, A) ≥ c0 √( K(|Imax| + |J|) / (nN) ) ) ≥ c1,   (5)

inf_Â sup_A PA( L1(Â, A) ≥ c0 K √( (|I| + K|J|) / (nN) ) ) ≥ c1,

for absolute constants c0 > 0 and c1 ∈ (0, 1], assuming N := N1 = · · · = Nn for ease of presentation.
The infimum is taken over all estimators Â, while the supremum is taken over all matrices A in
a prescribed class A, defined in (34). The lower bounds depend on the largest number of anchor
words within each topic (|Imax|), the total number of anchor words (|I|), and the number of non-
anchor words (|J |) with J := [p]\I. In Section 5.3 we discuss conditions under which our estimator
Â is minimax optimal, up to a logarithmic factor, under both losses. To the best of our knowledge,
these lower and upper bounds on the L1,∞ loss of our estimators are new, and valid for growing K
and p. They imply the more commonly studied bounds on the L1 loss.
Our estimation procedure and the analysis of the resulting estimator Â are tailored to count
data, and utilize the restrictions (4) on the parameters of model (3). Consequently, both the
estimation method and the properties of the estimator differ from those developed for general
identifiable latent variable models, for instance those in Bing et al. (2017), and we refer to the
latter for further references and a recent overview of estimation in such models.
To the best of our knowledge, computationally efficient estimators of the word-topic matrix A in
(3), that are also accompanied by a theoretical analysis, have only been developed for the situation
in which K is known in advance. Even in that case, the existing results are limited.
Arora et al. (2012, 2013) are the first to analyze theoretically, from a rate perspective, estimators
of A in the topic model. They establish upper bounds on the global L1 loss of their estimators,
and their analysis allows K and p to grow with n. Unfortunately, these bounds differ by at least a
factor of order p3/2 from the minimax optimal rate given by our Theorem 7, even when K is fixed
and does not grow with n.
The recent work of Ke and Wang (2017) is tailored to topic models with a small, known, number
of topics K, which is independent of the number of documents n. Their procedure makes clever
use of the geometric simplex structure in the singular vectors of Π. To the best of our knowledge,
Ke and Wang (2017) is the first work that proves a minimax lower bound for the estimation of
A in topic models, with respect to the L1 loss, over a different parameter space than the one we
consider. We discuss in detail the corresponding rate over this space, and compare it with ours, in
Remark 5 in Section 5.1. The procedure developed by Ke and Wang (2017) is rate optimal for fixed
K, under suitable conditions tailored to their set-up (see pages 13 – 14 in Ke and Wang (2017)).
We defer a detailed rate comparison with existing results to Remark 5 of Section 5.1 and to
Section 5.3.1.
In Section 6 we present a simulation study, in which we compare numerically the quality of
our estimator with that of the best performing estimator to date, developed in Arora et al. (2013),
which also comes with theoretical guarantees, albeit not minimax optimal. We found that the
competing estimator is generally fast and accurate when K is known, but it is very sensitive to
the misspecification of K, as we illustrate in Appendix G of the supplement. Further, extensive
comparisons are presented in Section 6, in terms of the estimation of I, A and the computational
running time of the algorithms. We found that our procedure dominates on all these counts.
Finally, the proofs of Propositions 1 and 2 of Section 3 and the results of Sections 4 and 5 are
deferred to the appendices.
Summary of new contributions. We propose a new method that estimates
(a) the number of topics K;
(b) the anchor words and their partition;
(c) the word-topic matrix A;
and provide a finite sample analysis that allows K, in addition to Ni and p, to grow
with the sample size (number of documents) n. In this regime,
(d) we establish a minimax lower bound for estimating the word-topic matrix A;
(e) we show that the number of topics can be estimated correctly, with high probability;
(f) we show that A can be estimated at the minimax-optimal rate.
Furthermore,
(g) the estimation of K is optimization free;
(h) the estimation of the anchor words and that of A are scalable in n, Ni, p and K.
To the best of our knowledge, estimators of A that are scalable not only with p, but also with K,
and for which (a), (b) and (d) - (f) hold are new in the literature.
1.3 Notation
The following notation will be used throughout the entire paper.
The integer set {1, . . . , n} is denoted by [n]. For a generic set S, we denote its cardinality by |S|.
For a generic vector v ∈ Rd, we let ‖v‖q denote the vector ℓq norm, for q = 0, 1, 2, . . . , ∞, and
supp(v) denote its support. We denote by diag(v) the d × d diagonal matrix with diagonal elements
equal to v. For a generic matrix Q ∈ Rd×m, we write ‖Q‖∞ = max_{1≤i≤d, 1≤j≤m} |Qij|, ‖Q‖1 = ∑_{1≤i≤d, 1≤j≤m} |Qij| and ‖Q‖∞,1 = max_{1≤i≤d} ∑_{1≤j≤m} |Qij|. We let Qi· and Q·j denote the ith row and jth column of Q. For a set S, we let QS denote the |S| × m submatrix of Q with row indices in S. We write the d × d diagonal matrix

DQ = diag(‖Q1·‖1, . . . , ‖Qd·‖1)

and let (DQ)ii denote its ith diagonal element.
We write an ≲ bn if there exists an absolute constant c > 0 such that an ≤ c bn, and write
an ≍ bn if there exist two absolute constants c, c′ > 0 such that c bn ≤ an ≤ c′ bn.
We let n stand for the number of documents and Ni for the number of randomly drawn words
at document i ∈ [n]. Furthermore, p is the total number of words (dictionary size) and K is the
number of topics. We define M := (maxi Ni) ∨ n ∨ p. Finally, I is the (index) set of anchor words,
and its complement J := [p] \ I forms the (index) set of non-anchor words.
2 Preliminaries
In this section we introduce and discuss the assumptions under which A in model (3) can be
uniquely determined via Π, although W is not observed.
2.1 An information bound perspective on model assumptions
If we had access to W in model (3), then the problem of estimating A would become the more
standard problem of estimation in multivariate response regression under the constraints (4), with
dependent errors. In that case, A is uniquely defined if W has full rank, which is our Assumption 2
below. Since W is not observable, we mentioned earlier that the identifiability of A requires extra
assumptions. We provide insight into their nature, via a classical information bound calculation.
We view W as a nuisance parameter and ask when the estimation of A can be done with the same
precision whether W is known or not. In classical information bound jargon (Cox and Reid, 1987),
we study when the parameters A and W are orthogonal. The latter is equivalent to verifying

E[ −∂²ℓ(X1, . . . , Xn) / (∂Ajk ∂Wk′i) ] = 0   for all j ∈ [p], i ∈ [n] and k, k′ ∈ [K],   (6)

where ℓ(X1, . . . , Xn) is the log-likelihood of n independent multinomial vectors. Proposition 1
below gives necessary and sufficient conditions for parameter orthogonality.
Proposition 1. If X1, . . . , Xn is an independent sample from (2), and (3) holds, then A and W are orthogonal parameters, in the sense of (6) above, if and only if

|supp(Aj·) ∩ supp(W·i)| ≤ 1,   for all j ∈ [p], i ∈ [n].   (7)
We observe that condition (7) is implied by either of the two following extreme conditions:
(1) All rows in A are proportional to canonical vectors in RK , which is equivalent to assuming
that all words are anchor words.
(2) C := n^{-1} W W^T is diagonal.
In the first scenario, each topic is described via words exclusively used for that topic, which is
unrealistic. In the second case, the topics are totally unrelated to one another, an assumption that
is not generally met, but is perhaps more plausible than (1). Proposition 1 above shows that one
cannot expect the estimation of A in (3), when W is not observed, to be as easy as that when W is
observed, unless the very stringent conditions of this proposition hold. However, it points towards
quantities that play a crucial role in the estimation of A: the anchor words and the rank of W .
This motivates the study of this model, with both A and W unknown, under the more realistic
assumptions introduced in the next section and used throughout this paper.
2.2 Main assumptions
We make the following three main assumptions:
Assumption 1. For each topic k = 1, . . . , K, there exists at least one word j such that Ajk > 0
and Ajℓ = 0 for any ℓ ≠ k.
Assumption 2. The matrix W has rank K ≤ n.
Assumption 3. The inequality

cos(∠(Wi·, Wj·)) < (ζi/ζj) ∧ (ζj/ζi),   for all 1 ≤ i ≠ j ≤ K,

holds, with ζi := ‖Wi·‖2 / ‖Wi·‖1.
Conditions on A and W under which A can be uniquely determined from Π are generically
known as separability conditions, and were first introduced by Donoho and Stodden (2004), for the
identifiability of the factors in general nonnegative matrix factorization (NMF) problems. Versions
of such conditions have been subsequently adopted in most of the literature on topic models, which
are particular instances of NMF, see, for instance, Bittorf et al. (2012); Arora et al. (2012, 2013).
In the context and interpretation of the topic model, the commonly accepted Assumption 1
postulates that for each topic k there exists at least one word solely associated with that topic.
Such words are called anchor words, as the appearance of an anchor word is a clear indicator of
the occurrence of its corresponding topic, and typically more than one anchor word is present. For
future reference, for a given word-topic matrix A, we let I := I(A) be the set of anchor words, and
𝓘 be its partition relative to topics:

Ik := {j ∈ [p] : Ajk > 0, Ajℓ = 0 for all ℓ ≠ k},   I := ⋃_{k=1}^K Ik,   𝓘 := {I1, . . . , IK}.   (8)
The earlier work of Anandkumar et al. (2012) proposes a tensor-based approach that does not require
the anchor word assumption, but assumes that the topics are uncorrelated. Li and McCallum
(2006); Blei and Lafferty (2007) showed that, in practice, there is strong evidence against the lack
of correlation between topics. We therefore relax the orthogonality conditions on the matrix W in
our Assumption 2, similar to Arora et al. (2012, 2013). We note that Assumption 2 requires
K ≤ n: the total number K of topics covered by the n documents does not exceed the number
of documents.
Assumption 2 guarantees that the rows of W , viewed as vectors in Rn, are not parallel, and
Assumption 3 strengthens this, by placing a mild condition on the angle between any two rows of
W. If, for instance, WW^T is a diagonal matrix, or if ζi is the same for all i ∈ [K], then Assumption
2 implies Assumption 3. However, the two assumptions are not equivalent, and neither implies
the other, in general. We illustrate this in the examples of Section E.1 in the supplement. It is
worth mentioning that, when the columns of W are i.i.d. samples from the Dirichlet distribution
as commonly assumed in the topic model literature (Blei et al., 2003; Blei and Lafferty, 2007; Blei,
2012), Assumption 3 holds with high probability under mild conditions on the hyper-parameter of
the Dirichlet distribution. We defer their precise expressions to Lemma 25 in Appendix E.3 of the
supplement.
We discuss these assumptions further in Remark 1 of Section 3 and Remark 3 of Section 4
below.
3 Exact recovery of I, I and A at the population level
In this section we construct A via Π. Under Assumptions 1 and 3, we show first that the set of
anchor words I and its partition can be determined from the matrix R given in (11) below. We
begin by re-normalizing the three matrices involved in model (3) such that their rows sum up to 1:

W̄ := D_W^{-1} W,   Π̄ := D_Π^{-1} Π,   Ā := D_Π^{-1} A D_W.   (9)

Then

Π̄ = Ā W̄,   (10)

and

R := n Π̄ Π̄^T = Ā C̄ Ā^T,   (11)

with C̄ = n W̄ W̄^T. This normalization is standard in the topic model literature (Arora et al., 2012, 2013), and it preserves the anchor word structure: the matrices A and Ā have the same support, and Assumption 1 is equivalent with the existence, for each k ∈ [K], of at least one word j such that Ājk = 1 and Ājℓ = 0 for any ℓ ≠ k. Therefore A and Ā have the same I and the same partition of I. We differ from the existing literature in the way we make use of this normalization and explain this in Remark 1 below. Let

Ti := max_{1≤j≤p} Rij,   Si := {j ∈ [p] : Rij = Ti},   for any i ∈ [p].   (12)
In words, Ti is the maximum entry of row i, and Si is the set of column indices of the entries in
row i that equal the row maximum. The following proposition shows the exact recovery
of I and of its partition from R.
Proposition 2. Assume that model (3) and Assumptions 1 and 3 hold. Then:
(a) i ∈ I ⇐⇒ Ti = Tj , for all j ∈ Si.
(b) The anchor word set I can be determined uniquely from R. Moreover, its partition I is
unique and can be determined from R up to label permutations.
The proof of this proposition is given in Appendix 2, and its success relies on the equivalent formulation of Assumption 3,

min_{1≤i<j≤K} ( C̄ii ∧ C̄jj − C̄ij ) > 0.
The short proof of Proposition 3 below gives an explicit construction of A from

Θ := (1/n) Π Π^T,   (13)

using the unique partition of I given by Proposition 2 above.
Proposition 3. Under model (3) and Assumptions 1, 2 and 3, A can be uniquely recovered from
Θ, given the partition of the anchor words, up to column permutations.
Proof. Given the partition of anchor words {I1, . . . , IK}, we construct a set L = {i1, . . . , iK} by selecting one anchor word ik ∈ Ik for each topic k ∈ [K]. We let AL be the diagonal matrix

AL = diag(Ai11, . . . , AiKK).   (14)

We show first that B := A A_L^{-1} can be constructed from Θ. Assuming, for now, that B has been constructed, then A = B AL. The diagonal elements of AL can be readily determined from this relationship, since, via model (3) satisfying (4), the columns of A sum up to 1:

1 = ‖A·k‖1 = Aikk ‖B·k‖1,   (15)

for each k. Therefore, although B is only unique up to the choice of L and of the scaling matrix AL, the matrix A with unit column sums thus constructed is unique.
It remains to construct B from Θ. Let J = {1, . . . , p} \ I. We let BJ denote the |J| × K sub-matrix of B with row indices in J and BI denote the |I| × K sub-matrix of B with row indices in I. Recall that C := n^{-1} W W^T. Model (3) implies the following decomposition of the submatrix of Θ with row and column indices in L ∪ J:

[ ΘLL  ΘLJ ]   [ AL C AL    AL C AJ^T ]
[ ΘJL  ΘJJ ] = [ AJ C AL    AJ C AJ^T ].

In particular, we have

ΘLJ = AL C AJ^T = AL C AL (AL^{-1} AJ^T) = ΘLL (AL^{-1} AJ^T) = ΘLL BJ^T.   (16)
Note that Aikk > 0, for each k ∈ [K], from Assumption 1 which, together with Assumption 2, implies that ΘLL is invertible. We then have

BJ = ΘJL ΘLL^{-1}.   (17)
On the other hand, for any i ∈ Ik, for each k ∈ [K], we have Bik = Aik / Aikk, by the definition of B. Also, model (3) and Assumption 1 imply that, for any i ∈ Ik,

(1/n) ∑_{t=1}^n Πit = Aik ( (1/n) ∑_{t=1}^n Wkt ).   (18)

Therefore, the matrix BI has entries

Bik = ‖Πi·‖1 / ‖Πik·‖1,   for any i ∈ Ik and k ∈ [K].   (19)
This, together with BJ given above, completes the construction of B, and uniquely determines A.
Algorithm 1 Recover the word-topic matrix A from Π

Require: true word-document frequency matrix Π ∈ Rp×n
1: procedure Top(Π)
2:     compute Θ = n^{-1} Π Π^T and R from (11)
3:     recover I via FindAnchorWords(R)
4:     construct L = {i1, . . . , iK} by choosing any ik ∈ Ik, for k ∈ [K]
5:     compute BJ from (17) and BI from (19)
6:     recover A by normalizing B to unit column sums
7:     return I and A

8: procedure FindAnchorWords(R)
9:     initialize I = ∅ and P = [p]
10:    while P ≠ ∅ do
11:        take any i ∈ P, compute Si and Ti from (12)
12:        if ∃ j ∈ Si s.t. Ti ≠ Tj then
13:            P = P \ {i}
14:        else
15:            P = P \ Si and add Si to I
16:    return I
Our approach for recovering both I and A is constructive and can be easily adapted to estima-
tion. For this reason, we summarize our approach in Algorithm 1 and illustrate the algorithm with
a simple example.
Example 1. Let K = 3, p = 6, n = 3 and consider the following A and W:

A =
[ 0.3  0    0   ]
[ 0.2  0    0   ]
[ 0    0.5  0   ]
[ 0    0    0.4 ]
[ 0.2  0.5  0.3 ]
[ 0.3  0    0.3 ]

W =
[ 0.6  0.2  0.2 ]
[ 0.3  0.7  0.0 ]
[ 0.1  0.1  0.8 ]

Π = AW =
[ 0.18  0.06  0.06 ]
[ 0.12  0.04  0.04 ]
[ 0.15  0.35  0.00 ]
[ 0.04  0.04  0.32 ]
[ 0.30  0.42  0.28 ]
[ 0.21  0.09  0.30 ]
Applying FindAnchorWords in Algorithm 1 to R gives I = {1, 2, 3, 4}, with partition {{1, 2}, {3}, {4}}, from

R =
[ 1.32  1.32  0.96  0.72  0.96  1.02 ]
[ 1.32  1.32  0.96  0.72  0.96  1.02 ]
[ 0.96  0.96  1.74  0.30  1.15  0.63 ]
[ 0.72  0.72  0.30  1.98  0.89  1.35 ]
[ 0.96  0.96  1.15  0.89  1.03  0.92 ]
[ 1.02  1.02  0.63  1.35  0.92  1.19 ]

=⇒

T1 = 1.32, S1 = {1, 2}: word 1 is an anchor word ✓
T2 = 1.32, S2 = {1, 2}: word 2 is an anchor word ✓
T3 = 1.74, S3 = {3}: word 3 is an anchor word ✓
T4 = 1.98, S4 = {4}: word 4 is an anchor word ✓
T5 = 1.15, S5 = {3}: word 5 is not an anchor word ✗
T6 = 1.35, S6 = {4}: word 6 is not an anchor word ✗
Based on the recovered partition, the matrix A can be recovered via Proposition 3, which is executed in steps 4-6 of Algorithm 1. Specifically, by taking L = {1, 3, 4} as the representative set of anchor words, it follows from (17) and (19) that
BI =
[ 1    0  0 ]
[ 2/3  0  0 ]
[ 0    1  0 ]
[ 0    0  1 ]

BJ = ΘJL ΘLL^{-1} =
[ 0.03  0.06  0.04 ] [ 0.01  0.02  0.01 ]^{-1}   [ 2/3  1  3/4 ]
[ 0.02  0.02  0.04 ] [ 0.02  0.05  0.01 ]      = [ 1    0  3/4 ].
                     [ 0.01  0.01  0.04 ]
Finally, A is recovered by normalizing B = [BI^T, BJ^T]^T to have unit column sums,

A =
[ 1    0  0   ]
[ 2/3  0  0   ]
[ 0    1  0   ]
[ 0    0  1   ]
[ 2/3  1  3/4 ]
[ 1    0  3/4 ]
· diag(0.3, 0.5, 0.4) =
[ 0.3  0    0   ]
[ 0.2  0    0   ]
[ 0    0.5  0   ]
[ 0    0    0.4 ]
[ 0.2  0.5  0.3 ]
[ 0.3  0    0.3 ].
Remark 1. Contrast with existing results. It is easy to see that the rows in R (or, alternatively,
Π) corresponding to non-anchor words j ∈ J are convex combinations of the rows in R (or Π)
corresponding to anchor words i ∈ I. Therefore, finding K representative anchor words amounts
to finding the K vertices of a simplex. The latter can be accomplished by finding the unique
solution of an appropriate linear program, that uses K as input, as shown by Bittorf et al. (2012).
This result only utilizes Assumption 1 and a relaxation of Assumption 2, in which it is assumed
that no rows of W are convex combinations of the rest. To the best of our knowledge, Theorem
3.1 in Bittorf et al. (2012) is the only result to guarantee that, after representative anchor words
are found, a partition of I in K groups can also be found, for the specified K.
When K is not known, this strategy can no longer be employed, since finding the representative
anchor words requires knowledge of K. However, we showed that this problem can still be solved
under our mild additional Assumption 3. This assumption allows us to provide the if and only if
characterization of I proved in part (a) of Proposition 2. Moreover, part (b) of this proposition
shows that K is in one-to-one correspondence with the number of groups in I, and we exploit this
observation for the estimation of K.
4 Estimation of the anchor word set and of the number of topics
Algorithm 1 above recovers the index set I, its partition I and the number of topics K from the
matrix
R = n Π̄ Π̄^T = (n D_Π^{-1}) Θ (n D_Π^{-1})
Algorithm 2 Estimate the partition of the anchor words I by Î

Require: matrix R̂ ∈ Rp×p, C1 and Q ∈ Rp×p such that Q[j, ℓ] := C1 δjℓ
1: procedure FindAnchorWords(R̂, Q)
2:     initialize Î = ∅
3:     for i ∈ [p] do
4:         âi = argmax_{1≤j≤p} R̂ij
5:         set Î(i) = {ℓ ∈ [p] : R̂iâi − R̂iℓ ≤ Q[i, âi] + Q[i, ℓ]} and Anchor(i) = True
6:         for j ∈ Î(i) do
7:             âj = argmax_{1≤k≤p} R̂jk
8:             if |R̂ij − R̂jâj| > Q[i, j] + Q[j, âj] then
9:                 Anchor(i) = False
10:                break
11:        if Anchor(i) then
12:            Î = Merge(Î(i), Î)
13:    return Î = {Î1, Î2, . . . , ÎK̂}

14: procedure Merge(Î(i), Î)
15:    for G ∈ Î do
16:        if G ∩ Î(i) ≠ ∅ then
17:            replace G in Î by G ∩ Î(i)
18:            return Î
19:    add Î(i) to Î
20:    return Î
with Θ = n^{-1} Π Π^T. Algorithm 2 is a sample version of Algorithm 1. It has O(p²) computational complexity and is optimization-free.

The matrix Π is replaced by the observed frequency data matrix X ∈ Rp×n with independent columns X1, . . . , Xn. Since these are assumed to follow the multinomial model (2), an unbiased estimator of Θ is given by
Θ̂ := (1/n) ∑_{i=1}^n [ (Ni/(Ni − 1)) Xi Xi^T − (1/(Ni − 1)) diag(Xi) ],   (20)
with Ni representing the total number of words in document i. We then estimate R by

R̂ := (n D_X^{-1}) Θ̂ (n D_X^{-1}).   (21)
The quality of our estimator depends on how well we can control the noise level R̂ − R. In the
computer science literature, albeit for different algorithms (Arora et al., 2012; Bittorf et al.,
2012), only global ‖R̂ − R‖∞,1 control is considered, which ultimately has a negative impact on the rate
of convergence of Â. In general latent models with pure variables, the latter being the analogues
of anchor words, Bing et al. (2017) developed an algorithm similar to ours, under a less stringent
‖R̂ − R‖∞ control, which is still not precise enough for sharp estimation in topic models. To see
why, we note that Algorithm 2 involves comparisons between two different entries in a row of R̂.
In these comparisons, we must allow for small entry-wise error margins. These margin levels are
precise bounds C1δjℓ such that |R̂jℓ − Rjℓ| ≤ C1δjℓ for all j, ℓ ∈ [p], with high probability, for
some universal constant C1 > 1. The explicit deterministic bounds are stated in Proposition 16 of
Appendix C.2, while practical data-driven choices are based on Corollary 17 of Appendix C.2 and
given in Section 6.
Since the estimation of I is based on R̂, which is a perturbation of R, one cannot distinguish an anchor word from a non-anchor word that is very close to it without further signal strength conditions on A. Nevertheless, Theorem 4 shows that even without such conditions we can still estimate K consistently. Moreover, we guarantee the recovery of I and of its partition {I_1, . . . , I_K} with minimal mistakes.
Specifically, we denote the set of quasi-anchor words by
J1 :=j ∈ J : there exists k ∈ [K] such that Ajk ≥ 1− 4δ/ν
(22)
where
ν := min1≤i<j≤K
(Cii ∧ Cjj − Cij
)(23)
and
δ := max1≤j,`≤p
δj`. (24)
In the proof of Proposition 2, we argued that the set of anchor words, defined in Assumption 1, coincides with that of the scaled matrix Ā given in (9). The words corresponding to indices in J_1 are almost anchor words: in a row of Ā corresponding to such an index, the largest entry is close to 1 while the other entries are close to 0, provided δ/ν is small.
For the remainder of the paper we make the blanket assumption that all documents have equal length, that is, N_1 = · · · = N_n = N. We make this assumption for ease of presentation only, as all our results continue to hold when the documents have unequal lengths.
Theorem 4. Under model (3) and Assumption 1, assume

ν > 2 max{ 2δ, √(2 ‖C‖_∞ δ) }    (25)

with ν defined in (23), and

min_{1≤j≤p} (1/n) Σ_{i=1}^n Π_{ji} ≥ (2 log M)/(3N),    min_{1≤j≤p} max_{1≤i≤n} Π_{ji} ≥ (3 log M)² / N.    (26)
Then, with probability greater than 1 − 8M^{-1}, we have

K̂ = K,    I ⊆ Î ⊆ I ∪ J_1,    I_{π(k)} ⊆ Î_k ⊆ I_{π(k)} ∪ J_1^{π(k)},    for all 1 ≤ k ≤ K,

where J_1^k := { j ∈ J_1 : Ā_{jk} ≥ 1 − 4δ/ν } and π : [K] → [K] is some label permutation.
If we further impose the signal strength assumption J1 = ∅, the following corollary guarantees
exact recovery of all anchor words.
Corollary 5. Under model (3) and Assumption 1, assume ν > 4δ, (26) and J_1 = ∅. With probability greater than 1 − 8M^{-1}, we have K̂ = K, Î = I and Î_k = I_{π(k)}, for all 1 ≤ k ≤ K and some permutation π.
Remark 2.
(1) Condition (26) is assumed for the analysis only and the implementation of our procedure only
requires N ≥ 2. Furthermore, we emphasize that (26) is assumed to simplify our presentation.
In particular, we used it to obtain the precise expressions of δ_{jℓ} and η_{jℓ} given in (50) – (51) of Section 6. In fact, (26) can be relaxed to

min_{1≤j≤p} (1/n) Σ_{i=1}^n Π_{ji} ≥ (c log M)/(nN)    (27)

for some sufficiently large constant c > 0, under which more complicated expressions of δ′_{jℓ} and η′_{jℓ} can be derived, see Corollary 18 of Appendix C.2. Theorem 4 continues to hold, provided that (25) holds for δ′ = max_{j,ℓ} δ′_{jℓ} in lieu of δ, that is,

ν > 2 max{ 2δ′, √(2 ‖C‖_∞ δ′) }.    (28)
Note that condition (27) implies the restriction

nN ≥ c · p log M,    (29)

by using

min_{1≤j≤p} (1/n) Σ_{i=1}^n Π_{ji} ≤ (1/p) Σ_{j=1}^p (1/n) Σ_{i=1}^n Π_{ji} = 1/p.    (30)
Intuitively, both (26) and (27) preclude the average frequency of each word, over all doc-
uments, from being very small. Otherwise, if a word rarely occurs, one cannot reasonably
expect to detect/sample it: ‖Xj·‖1 will be close to 0, and the estimation of R in (21) becomes
problematic. For this reason, removing rare words or grouping several rare words together to
form a new word are commonly used strategies in data pre-processing (Arora et al., 2013, 2012;
Bansal et al., 2014; Blei et al., 2003), which we also employ in the data analyses presented in
Section 6.
(2) To interpret the requirement J_1 = ∅, by recalling that Ā = D_Π^{-1} A D_W,

Ā_{jk} = ( n^{-1} ‖W_{k·}‖_1 A_{jk} ) / ( n^{-1} ‖Π_{j·}‖_1 )

can be viewed as

P(Topic k | Word j) = P(Topic k) × P(Word j | Topic k) / P(Word j).
If J1 6= ∅, then P(Topic k |Word j) ≈ 1, for a quasi-anchor word j. Then, quasi-anchor words
also determine a topic, and it is hopeless to try to distinguish them exactly from the anchor
words of the same topic. However, Theorem 4 shows that in this case our algorithm places
quasi-anchor words and anchor words for the same topic in the same estimated group, as soon
as (25) of Theorem 4 holds. When we have only anchor words, and no quasi-anchor words,
J1 = ∅, there is no possibility for confusion. Then, we can have less separation between the
rows of W , ν > 4δ, and exact anchor word recovery, as shown in Corollary 5.
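The probabilistic reading of Ā above can be checked numerically: with the columns of A and of W normalized to sum to one, each row of Ā = D_Π^{-1} A D_W is a probability vector over topics, since ‖Π_{j·}‖_1 = Σ_k A_{jk} ‖W_{k·}‖_1. A small sketch of ours, with arbitrary simulated A and W:

```python
import numpy as np

rng = np.random.default_rng(1)

# Numerical check of the identity A_bar = D_Pi^{-1} A D_W: the scaled matrix
# collects the posterior probabilities P(Topic k | Word j), so rows sum to 1.
p, K, n = 6, 3, 10
A = rng.uniform(size=(p, K)); A /= A.sum(axis=0)             # columns sum to 1
W = rng.uniform(size=(K, n)); W /= W.sum(axis=0)             # columns sum to 1
Pi = A @ W                                                   # p x n cell probabilities
A_bar = (A * W.sum(axis=1)) / Pi.sum(axis=1, keepdims=True)  # D_Pi^{-1} A D_W
assert np.allclose(A_bar.sum(axis=1), 1.0)                   # rows are distributions
```

Note that the factors n^{-1} in the numerator and denominator cancel, so they are omitted in the computation.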
Remark 3 (Assumption 3 and condition ν > 4δ).
(1) The exact recovery of anchor words in the noiseless case (Proposition 2) relies on Assumption
3, which requires the angle between two different rows of W not be too small in the following
sense:
cos(∠(W_{i·}, W_{j·})) < (ζ_i/ζ_j) ∧ (ζ_j/ζ_i),    for all 1 ≤ i ≠ j ≤ K,    (31)

with ζ_i := ‖W_{i·}‖_2 / ‖W_{i·}‖_1. Therefore, the more balanced the rows of W are, the less restrictive this assumption becomes. The ideal case is min_i ζ_i / max_i ζ_i → 1, under which (31) holds whenever two rows of W are not parallel, whereas the least favorable case is min_i ζ_i / max_i ζ_i → 0, for which we need the rows of W to be close to orthogonal (the topics are uncorrelated).
Although in this work the matrix W has non-random entries, it is interesting to study when
(31) holds, with high probability, under appropriate distributional assumptions on W . A
popular and widely used distribution of the columns of W is the Dirichlet distribution (Blei
et al., 2003). Lemma 25 in the supplement shows that, when the columns of W are i.i.d. sam-
ples from the Dirichlet distribution, (31) holds with high probability, under mild conditions
on the hyper-parameter of the Dirichlet distribution.
(2) We prove in Lemma 23 that Assumption 3 is equivalent with ν > 0, where we recall that ν
has been defined in (23). For finding the anchor words from the noisy data, we need that
ν > 4δ, a strengthened version of Assumption 3. Furthermore, Lemmas 23 and 24 in the
supplement guarantee that there exists a sequence εn such that ν > 4δ is implied by
cos(∠(W_{i·}, W_{j·})) < ( (ζ_i/ζ_j) ∧ (ζ_j/ζ_i) ) (1 − ε_n),    for all 1 ≤ i ≠ j ≤ K.    (32)
Thus, we need εn more separation between any two different rows of W than what we require
in (31). Under the following balance condition of words across documents,

max_{1≤i≤n} Π_{ji} / ( (1/n) Σ_{i=1}^n Π_{ji} ) = o(√n),    for 1 ≤ j ≤ p,    (33)
Lemma 24 guarantees that εn → 0 as n → ∞. The same interplay between the angle of
rows of W and their balance, as described in part (1) above, holds. We view (33) as a reasonable, mild balance condition, as it effectively asks that the maximum frequency of each particular word across documents not be larger than the average frequency of that word over the n documents, multiplied by √n.
If the columns of W follow the Dirichlet distribution, under mild conditions on the hyper-
parameter, we directly prove, in Lemma 25 in the supplement, that ν > 4δ holds with probability greater than 1 − O(M^{-1}), with M := n ∨ p ∨ N.
5 Estimation of the word-topic membership matrix.
We derive minimax lower bounds for the estimation of A in topic models, with respect to the L_1 and L_{1,∞} losses, in Section 5.1. We follow with a description of our estimator Â of A in Section 5.2. In Section 5.3, we establish upper bounds on L_1(A, Â) and L_{1,∞}(A, Â) for the estimator Â constructed in Section 5.2, and provide conditions under which the bounds are minimax optimal.
5.1 Minimax lower bounds
In this section, we establish lower bounds for model (3), based on L_1(A, Â) and L_{1,∞}(A, Â), valid for any estimator Â of A over the parameter space

A(K, |I|, |J|) := { A ∈ R_+^{p×K} : A^T 1_p = 1_K, A has |I| anchor words },    (34)
where 1_d denotes the d-dimensional vector with all entries equal to 1. Let

W = W^0 + (1/(nN)) 1_K 1_K^T − (K/(nN)) I_K,    W^0 = ( e_1, . . . , e_1, e_2, . . . , e_2, . . . , e_K, . . . , e_K ),    (35)

where e_k is repeated n_k times in W^0, with Σ_{k=1}^K n_k = n and |n_i − n_j| ≤ 1 for 1 ≤ i, j ≤ K. We use e_k and I_d to denote, respectively, the canonical basis vectors in R^K and the identity matrix in R^{d×d}. It is easy to verify that W defined above satisfies Assumptions 2 and 3. Denote by P_A the joint distribution of (X_1, . . . , X_n), under model (3) for this choice of W. Let |I_max| = max_k |I_k|.
Theorem 6. Under model (3), assume (2) and let |I| + K|J| ≤ c(nN), for some universal constant c > 1. Then, there exist c_0 > 0 and c_1 ∈ (0, 1] such that

inf_Â sup_A P_A{ L_1(A, Â) ≥ c_0 K √( (|I| + K|J|) / (nN) ) } ≥ c_1.    (36)

Moreover, if K(|I_max| + |J|) ≤ c(nN) holds, we further have

inf_Â sup_A P_A{ L_{1,∞}(A, Â) ≥ c_0 √( K(|I_max| + |J|) / (nN) ) } ≥ c_1.

The infimum is taken over all estimators Â of A; the supremum is taken over all A ∈ A(K, |I|, |J|).
Remark 4. The product nN is the total number of sampled words, while |I|+K|J | is the number of
unknown parameters in A ∈ A(K, |I|, |J |). Since we do not make any further structural assumptions
on the parameter space, we studied minimax-optimality of estimation in topic models with anchor
words in the regime
nN > c(|I|+K|J |),
in which one can expect to be able to develop procedures for the consistent estimation of the matrix
A.
In order to facilitate the interpretation of the lower bound in the L_1-loss, we can rewrite (36) as

inf_Â sup_{A ∈ A(K, |I|, |J|)} P_A{ L_1(A, Â) / ‖A‖_1 ≥ c_0 √( (|I| + K|J|) / (nN) ) } ≥ c_1,

using the fact that ‖A‖_1 = K. Thus, the right-hand side becomes the square root of the ratio between the number of parameters to estimate and the overall sample size.
Remark 5. When K is known and independent of n or p, Ke and Wang (2017) derived the minimax rate (37) of L_1(A, Â) in their Theorem 2.2:

inf_Â sup_{A ∈ A(p, K)} P{ L_1(A, Â) ≥ c_1 √( p / (nN) ) } ≥ c_2    (37)

for some constants c_1, c_2 > 0. The parameter space considered in Ke and Wang (2017) for the derivation of the lower bound in (37) is

A(p, K) = { A ∈ R_+^{p×K} : A^T 1_p = 1_K, ‖A_{j·}‖_1 ≥ c_3/p, ∀ j ∈ [p] }

for some constant c_3 > 0, and the lower bound is independent of K. In contrast, the lower bound in Theorem 6 holds over A(K, |I|, |J|) ⊆ A(p, K), and the dependency on K in (36) is explicit.
The upper bounds derived for L1(A, A) in both this work and Ke and Wang (2017) correspond
to A ∈ A(K, |I|, |J |), making the latter the appropriate space for discussing attainability of lower
bounds.
Nevertheless, we notice that, when K is treated as a fixed constant, and recalling that |I| + |J| = p, the lower bounds over both spaces have the same order of magnitude, √(p/(nN)). From this perspective, when K is fixed, the result in Ke and Wang (2017) can be viewed as a minimax result over the smaller parameter space.
A non-trivial modification of the proof in Ke and Wang (2017) allowed us to recover the dependency on K that was absent in their original lower bound (37): the corresponding rate is √(pK/(nN)), and it is relative to estimation over the larger parameter space A(p, K). For comparison purposes, we note that this space corresponds to A(K, |I|, |J|), with I = ∅ and |J| = p. In this case, our lower bound (36) becomes K√(pK/(nN)), larger by a factor of K than the bound that can be derived
by modifying arguments in Ke and Wang (2017). Therefore, Theorem 6 improves upon existing
lower bounds for estimation in topic models without anchor words and with a growing number of
topics, and offers the first minimax lower bound for estimation in topic models with anchor words
and a growing K.
5.2 An estimation procedure for A
Our estimation procedure follows the constructive proof of Proposition 3. Given the set of estimated anchor words Î = {Î_1, . . . , Î_K̂}, we begin by selecting a set of representative indices of words per topic, by choosing i_k ∈ Î_k at random, to form L := {i_1, . . . , i_K̂}. As we explained in the proof of Proposition 3, we first estimate a normalized version of A, the matrix B = A A_L^{-1}. We estimate B_I and B_J separately. In light of (19), we estimate the |I| × K matrix B_I by

B̂_{ik} = ‖X_{i·}‖_1 / ‖X_{i_k·}‖_1,  if i ∈ Î_k and 1 ≤ k ≤ K;    B̂_{ik} = 0,  otherwise.    (38)
Recall from (17) that B_J = Θ_{JL} Θ_{LL}^{-1}, and that Assumption 2 ensures that Θ_{LL} := A_L C A_L is invertible, with Θ defined in (13). Since we have already obtained Î, we can estimate J by Ĵ = {1, . . . , p} \ Î. We then use the estimator Θ̂ given in (20) to estimate Θ_{JL} by Θ̂_{ĴL}. It remains to estimate the K × K matrix Ω := Θ_{LL}^{-1}. For this, we solve the linear program

(t̂, Ω̂) = argmin_{t ∈ R_+, Ω ∈ R^{K×K}} t    (39)

subject to

‖Ω Θ̂_{LL} − I‖_{∞,1} ≤ λ t,    ‖Ω‖_{∞,1} ≤ t,    (40)

with λ = C_0 max_{i∈L} Σ_{j∈L} η_{ij}, where η_{ij} is defined such that |Θ̂_{ij} − Θ_{ij}| ≤ C_0 η_{ij} for all i, j ∈ [p], with high probability, and C_0 is a universal constant.
Proposition 16 of Appendix C.2, see also Remark 8 below. To accelerate the computation, we can
decouple the above optimization problem, and solve instead K linear programs separately. We estimate Ω by Ω̂ = (ω̂_1, . . . , ω̂_K) where, for any k = 1, . . . , K,

ω̂_k := argmin_{ω ∈ R^K} ‖ω‖_1    (41)

subject to

‖Θ̂_{LL} ω − e_k‖_1 ≤ λ ‖ω‖_1,    (42)
with e_1, . . . , e_K denoting the canonical basis in R^K. After constructing Ω̂ as above, we estimate B_J by

B̂_Ĵ = ( Θ̂_{ĴL} Ω̂ )_+,    (43)

where the operation (·)_+ = max(0, ·) is applied entry-wise. Recalling that A_L can be determined from B via (15), the combination of (43) with (38) yields B̂ and hence the desired estimator of A:

Â = B̂ · diag( ‖B̂_{·1}‖_1^{-1}, . . . , ‖B̂_{·K}‖_1^{-1} ).    (44)
Remark 6. The decoupled linear programs given by (41) and (42) are computationally attractive
and can be done in parallel. This improvement over (39) becomes significant when K is large.
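One way to cast a decoupled program in LP form, sketched below with scipy.optimize.linprog, is to introduce the usual positive and negative parts ω = u − v, residual bounds r ≥ |Θ̂_{LL} ω − e_k|, and a slack s that upper bounds ‖ω‖_1; the LP then minimizes s subject to Σ r ≤ λs. This is our illustration of the decoupling idea, not the authors' implementation, and the function name is ours.

```python
import numpy as np
from scipy.optimize import linprog

def solve_omega_column(Theta_LL, k, lam):
    """LP sketch of one decoupled program: minimize s subject to
    ||omega||_1 <= s and ||Theta_LL @ omega - e_k||_1 <= lam * s.
    Variables are stacked as [u, v, r, s] with omega = u - v."""
    K = Theta_LL.shape[0]
    e_k = np.zeros(K); e_k[k] = 1.0
    nv = 3 * K + 1
    c = np.zeros(nv); c[-1] = 1.0                      # minimize the slack s
    A_ub, b_ub = [], []
    # +/- (Theta(u - v) - e_k) <= r, i.e. r dominates the residual entry-wise
    for sign in (1.0, -1.0):
        block = np.hstack([sign * Theta_LL, -sign * Theta_LL,
                           -np.eye(K), np.zeros((K, 1))])
        A_ub.append(block); b_ub.append(sign * e_k)
    row = np.zeros(nv); row[2 * K:3 * K] = 1.0; row[-1] = -lam
    A_ub.append(row[None, :]); b_ub.append(np.zeros(1))   # sum(r) <= lam * s
    row = np.zeros(nv); row[:2 * K] = 1.0; row[-1] = -1.0
    A_ub.append(row[None, :]); b_ub.append(np.zeros(1))   # sum(u + v) <= s
    res = linprog(c, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
                  bounds=[(0, None)] * nv, method="highs")
    u, v = res.x[:K], res.x[K:2 * K]
    return u - v
```

For instance, with Θ̂_{LL} = I_2, λ = 1/2 and k = 1, the optimum balances ‖ω‖_1 against the residual and returns ω = (2/3, 0).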
Remark 7. Since we can select all anchor words with high probability, as shown in Theorem 4, in practice we can repeatedly select different sets of representatives L from Î at random, and estimate A via (38) – (44) for each L. The entry-wise average of these estimates inherits, via Jensen's inequality, the same theoretical guarantees shown in Section 5.3, while benefiting from an improved numerical performance.
Remark 8. To preserve the flow of the presentation we refer to Proposition 16 of Appendix C.2
for the precise expressions of ηij used in constructing the tuning parameter λ. The estimates
of ηij , recommended for practical implementation, are shown in (51) based on Corollary 17 in
Appendix C.2. We also note that in precision matrix estimation, λ is proportional, in our notation, to the norm ‖Θ̂_{LL} − Θ_{LL}‖_∞; see, for instance, Bing et al. (2017) and the references therein for
a similar construction, but devoted to general sub-Gaussian distributions. In this work, the data
is multinomial, and we exploited this fact to propose a more refined tuning parameter, based on
entry-wise control.
We summarize our procedure, called Top, in the following algorithm.
Algorithm 3 Estimate the word-topic matrix A

Require: frequency data matrix X ∈ R^{p×n} with document lengths N_1, . . . , N_n; two positive constants C_0, C_1 and a positive integer T
1: procedure Top(X, N_1, . . . , N_n; C_0, C_1)
2:   compute Θ̂ from (20) and R̂ from (21)
3:   compute η_{ij} and Q[i, j] := C_1 δ_{ij} from (50) and (51), for i, j ∈ [p]
4:   estimate Î via FindAnchorWords(R̂, Q)
5:   for i = 1, . . . , T do
6:     randomly select L and solve Ω̂ from (41) by using λ = C_0 max_{i∈L} Σ_{j∈L} η_{ij} in (42)
7:     estimate B̂ from (38) and (43)
8:     compute Â^{(i)} from (44)
9:   return Î = {Î_1, Î_2, . . . , Î_K̂} and Â = T^{-1} Σ_{i=1}^T Â^{(i)}
5.3 Upper bounds of the estimation rate of A
In this section we derive upper bounds for the estimator Â constructed in Section 5.2, under the matrix ‖·‖_1 and ‖·‖_{1,∞} norms. Â is obtained by choosing the tuning parameter λ = C_0 max_{i∈L} Σ_{j∈L} η_{ij} in the optimization (41). To simplify notation and properly adjust the scales, we define

α_j := p max_{1≤k≤K} A_{jk},    γ_k := (K/n) Σ_{i=1}^n W_{ki},    for each j ∈ [p], k ∈ [K],    (45)

such that Σ_{k=1}^K γ_k = K and p ≤ Σ_{j=1}^p α_j ≤ pK from (4). We further set

ᾱ_I = max_{i∈I} α_i,    α̲_I = min_{i∈I} α_i,    ρ_j = α_j / ᾱ_I,    γ̄ = max_{1≤k≤K} γ_k,    γ̲ = min_{1≤k≤K} γ_k.    (46)

For future reference, we note that γ̄ ≥ 1 ≥ γ̲.
Theorem 7. Under model (3), Assumptions 1 and 2, assume ν > 4δ, J_1 = ∅ and (26). Then, with probability 1 − 8M^{-1}, we have

min_{P ∈ H_K} ‖Â_{·k} − (AP)_{·k}‖_1 ≤ Rem(I, k) + Rem(J, k),    for all 1 ≤ k ≤ K,

where H_K is the space of K × K permutation matrices,

Rem(I, k) ≲ √( (K log M) / (npN) ) · Σ_{i∈I_k} α_i / √( α̲_I γ̲ ),

Rem(J, k) ≲ √( (K log M) / (nN) ) · ( γ̄^{1/2} ‖C^{-1}‖_{∞,1} / K ) · [ (ᾱ_I/α̲_I) √( |J| + Σ_{j∈J} ρ_j ) + (ᾱ_I/α̲_I) √( K Σ_{j∈J} ρ_j ) ].

Moreover, summing over 1 ≤ k ≤ K yields

L_1(A, Â) ≲ Σ_{k=1}^K Rem(I, k) + Σ_{k=1}^K Rem(J, k).
In Theorem 7 we explicitly state bounds on Rem(I, k) and Rem(J, k), respectively, which allows
us to separate out the error made in the estimation of the rows of A that correspond to anchor
words from that corresponding to non-anchor words. This facilitates the statement and explanation
of the quantities that play a key role in this rate, and of the conditions under which our estimator
achieves near minimax optimal rate, up to a logarithmic factor of M . We summarize it in the
following corollary and the remarks following it. Recall that C = n−1WW T .
Corollary 8 (Attaining the optimal rate). In addition to the conditions in Theorem 7, suppose that

(i) ᾱ_I ≍ α̲_I and Σ_{j∈J} ρ_j ≲ |J|;

(ii) γ̄ ≍ γ̲ and Σ_{k′≠k} √(C_{kk′}) = o( √(C_{kk}) ) for any 1 ≤ k ≤ K

hold. Then with probability 1 − 8M^{-1}, we have

L_{1,∞}(A, Â) ≲ √( K(|I_max| + |J|) log M / (nN) ),    L_1(A, Â) ≲ K √( (|I| + K|J|) log M / (nN) ).    (47)
Remark 9. The optimal estimation rate depends on the bounds for Θ̂_{jℓ} − Θ_{jℓ} and R̂_{jℓ} − R_{jℓ} derived via a careful analysis in Proposition 16 in Appendix C.2. We rule out quasi-anchor words (J_1 = ∅, see Remark 2 as well) since otherwise the presentation, analysis and proofs would become much more cumbersome.
Remark 10 (Relation between document length N and dictionary size p). Our procedure can be im-
plemented for any N ≥ 2. However, Theorem 7 and Corollary 8 indirectly impose some restrictions
on N and p. Indeed, the restriction
N ≥ (2p log M / 3) ∨ (9p log² M / K)    (48)
is subsumed by (26), via (30) and

min_{1≤j≤p} max_{1≤i≤n} Π_{ji} = min_{1≤j≤p} Σ_{k=1}^K A_{jk} max_{1≤i≤n} W_{ki} ≤ (1/p) Σ_{j=1}^p Σ_{k=1}^K A_{jk} = K/p.
Inequality (48) describes the regime for which we establish the upper bound results in this section,
and are able to show minimax optimality, as the lower bound restriction cnN ≥ |I|+ |J |K for some
c > 1 in Theorem 6 implies N ≥ p/(cn).
We can extend the range (48) of N at the cost of a stronger condition than (25) on ν. Assume (28) holds with δ′ = max_{j,ℓ} δ′_{jℓ} and with δ′_{jℓ} defined in Corollary 18 of Appendix C.2. In that case, as in Remark 2, condition (26) can be relaxed to (27). Provided

min_{j∈I} (1/n) Σ_{i=1}^n Π_{ji} ≥ (c log M)/N,    min_{j∈I} max_{1≤i≤n} Π_{ji} ≥ c′ (log M)² / N    (49)
for some constant c, c′ > 0, we prove in Appendix D.2.3 of the supplement that Theorem 7 and
Corollary 8 still hold. As discussed in Remark 2, condition (27) implies (29),
N ≥ c · (p logM)/n,
which is a much weaker restriction on N and p than (48). Condition (49) in turn is weaker than (26)
as it only restricts the smallest (averaged over documents) frequency of anchor words. As a result, (49) does not necessarily imply the constraint (48). For instance, if min_{j∈I} n^{-1} Σ_{i=1}^n Π_{ji} ≳ 1/|I|, then (49) is implied by N ≳ |I| (log M)². The problem of developing a procedure that can be shown to be minimax-rate optimal in the absence of condition (49) is left open for future research.
Remark 11 (Interpretation of the conditions of Corollary 8).
(1) Conditions regarding anchor words. Condition ᾱ_I ≍ α̲_I implies that all anchor words, across topics, have the same order of frequency. The second condition, Σ_{j∈J} ρ_j ≲ |J|, is equivalent to |J|^{-1} Σ_{j∈J} ‖A_{j·}‖_∞ ≲ max_{i∈I} ‖A_{i·}‖_∞. Thus it holds when the average frequency of the non-anchor words is no greater, in order, than the largest frequency among all anchor words.
(2) Conditions regarding the topic matrix W. Condition (ii) implies that the topics are balanced, through γ̄ ≍ γ̲, and prevents too strong a linear dependency between the rows of W, via Σ_{k′≠k} √(C_{kk′}) = o(√(C_{kk})). As a result, we can show ‖C^{-1}‖_{∞,1} = O(K) (see Lemma 22 in the supplement) and the rate of Rem(J, k) in Theorem 7 can be improved by a factor of 1/√K. The most favorable situation under which condition (ii) holds corresponds to the extreme case when each document contains a prevalent topic k, in that the corresponding W_{ki} ≈ 1, and the topics are approximately balanced across documents, so that approximately n/K documents cover the same prevalent topic. The minimax lower bound is also derived based on this ideal structure of C. At the other extreme, all topics are equally likely to be covered in each document, so that W_{ki} ≈ 1/K, for all i and k. In the latter case, γ̄ ≍ γ̲ ≈ 1, but ‖C^{-1}‖_{∞,1} may be larger, in order, than K, and the rates in Theorem 7 are slower than the optimal rates by at least a factor of √K. When K is fixed or comparatively small, this loss is negligible. Nevertheless, our condition (ii) rules out this extreme case, as in general we do not expect any of the given documents to cover, in the same proportion, all of the K topics we consider.
Remark 12 (Extensions). Both our procedure and the outline of our analysis can be naturally
extended to the more general Nonnegative Matrix Factorization (NMF) setting, and to different
data generating distributions, as long as Assumptions 1, 2 and 3 hold, by adapting the control of
the stochastic error terms ε.
5.3.1 Comparison with the rates of other existing estimators
As mentioned in the Introduction, the rate analysis of estimators in topic models received very little
attention, with the two exceptions discussed below, both assuming that K is known in advance.
An upper bound on L_1(A, Â) has been established in Arora et al. (2012, 2013) for the estimators Â considered in these works, which differ from ours. Since the estimator of Arora et al. (2013) inherits the rate of Arora et al. (2012), we only discuss the latter rate, given below:

L_1(A, Â) ≲ ( a² K³ / (Γ δ_p³) ) · √( log p / (nN) ).
Here a can be viewed as γ̄/γ̲, Γ can be treated as the ℓ_1-condition number of C = n^{-1} W W^T, and δ_p is the smallest non-zero entry among all the anchor words, which corresponds to α̲_I / p in our notation. To understand the order of magnitude of this bound, we evaluate it in the most favorable scenario, that of W = W^0 in (35). Then Γ ≤ √K σ_min(C) ≲ 1/√K, where σ_min(C) is the smallest eigenvalue of C, and γ̄ ≍ γ̲. Since Σ_{j∈J} ρ_j ≲ |J| implies ᾱ_I ≳ p^{-1} Σ_{j=1}^p α_j and p ≤ Σ_{j=1}^p α_j ≤ pK, suppose also ᾱ_I ≥ K. Then, the above rate becomes

L_1(A, Â) ≲ p³ · √( K log p / (nN) ),

which is slower than what we obtained in (47) by at least a factor of (p⁵ log p)^{1/2} / K.
The upper bound on L1(A, A) in Ke and Wang (2017) is derived for K fixed, under a number
of non-trivial assumptions on Π, A and W given in their work. Their rate analysis does not assume
all anchor words have the same order of frequency, but requires that the number of anchor words in each topic grows as p² log²(n)/(nN) at the estimation level. With an abundance of anchor
words, the estimation problem becomes easier, as there will be fewer parameters to estimate. If
this assumption does not hold, the error upper bound established in Theorem 2.1 of Ke and Wang
(2017), for fixed K, may become sub-optimal by factors in p. In contrast, although in our work we
allow for the existence of more anchor words per topic, we only require a minimum of one anchor
word per topic.
To further understand how the number of anchor words per topic affects the estimation rate, we consider the extreme example, used for illustration purposes only, of I = {1, . . . , p} =: [p], when all words are anchor words. Our Theorem 6 immediately shows that in this case the minimax lower bound for L_1(A, Â) becomes

inf_Â sup_{A ∈ A(K, p, 0)} P_A{ L_1(A, Â) ≥ c_0 K √( p / (nN) ) } ≥ c_1

for two universal constants c_0, c_1 > 0, where the infimum is taken over all estimators Â. Theorem 7 shows that our estimator does indeed attain this rate when γ ≍ 1 and min_{i∈I} ‖A_{i·}‖_1 ≳ K/p. This rate becomes faster (by a factor of √K), as expected, since there is only one non-zero entry in each row of A to estimate. These considerations show that when we return to the realistic case in which an unknown subset of the words are anchor words, the bounds on L_1(A, Â), for our estimator Â, increase at most by an optimal factor of √K, and not by factors depending on p.
6 Experimental results
Notation: Recall that n denotes the number of documents, N denotes the number of words drawn from each document, p denotes the dictionary size, K denotes the number of topics, and |I_k| denotes the number of anchor words for topic k. We write ξ := min_{i∈I} K^{-1} Σ_{k=1}^K A_{ik} for the minimal average frequency of the anchor words. The quantity ξ plays the same role in our work as δ_p defined in the separability assumption of Arora et al. (2013). Larger values are more favorable for estimation.
Data generating mechanism: For each document i ∈ [n], we randomly generate the topic vector W_i ∈ R^K according to the following principle. We first randomly choose the cardinality s_i of W_i from the integer set {1, . . . , ⌊K/3⌋}. Then we randomly choose its support of cardinality s_i from [K]. Each entry of the chosen support is then generated from Uniform(0, 1). Finally, we normalize W_i such that it sums to 1. In this way, each document contains a (small) subset of topics instead of all possible topics.

Regarding the word-topic matrix A, we first generate its anchor words by putting A_{ik} := Kξ for any i ∈ I_k and k ∈ [K]. Then, each entry of the non-anchor words is sampled from a Uniform(0, 1) distribution. Finally, we normalize each sub-column A_{Jk} ⊂ A_{·k} to have sum 1 − Σ_{i∈I} A_{ik}.

Given the matrix A and W_i, we generate the p-dimensional column NX_i by drawing from a Multinomial_p(N, AW_i) distribution.
We consider the setting N = 1500, n = 1500, p = 1000, K = 30, |Ik| = p/100 and ξ = 1/p as
our benchmark setting.
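For concreteness, the generating mechanism above can be sketched as follows. This is a simplified illustration of ours (the function name is assumed, and the anchor words of topic k are placed in consecutive rows), not the simulation code used for the reported experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_data(n, N, p, K, anchors_per_topic, xi):
    """Sketch of the simulation design: build W (K x n), A (p x K) and the
    observed frequency matrix X (p x n) with N*X_i ~ Multinomial_p(N, A W_i)."""
    # topic weights: each document covers a small random subset of topics
    W = np.zeros((K, n))
    for i in range(n):
        s = rng.integers(1, max(K // 3, 1) + 1)       # support cardinality s_i
        support = rng.choice(K, size=s, replace=False)
        w = rng.uniform(size=s)
        W[support, i] = w / w.sum()                   # column sums to 1
    # word-topic matrix: anchor rows first, then uniform non-anchor rows
    I = anchors_per_topic * K                         # total number of anchor words
    A = np.zeros((p, K))
    for k in range(K):
        A[k * anchors_per_topic:(k + 1) * anchors_per_topic, k] = K * xi
    A[I:, :] = rng.uniform(size=(p - I, K))
    A[I:, :] *= (1 - anchors_per_topic * K * xi) / A[I:, :].sum(axis=0)
    # observed frequencies: N * X_i ~ Multinomial_p(N, A W_i)
    X = np.column_stack([rng.multinomial(N, A @ W[:, i]) / N for i in range(n)])
    return A, W, X
```

Each column of A sums to one (anchor mass plus rescaled non-anchor mass), and each column of X is a frequency vector summing to one.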
Specification of the tuning parameters in our algorithm. In practice, based on Corollary
17 in Appendix C.2, we recommend the choices
δ_{jℓ} = ( n² / ( ‖X_{j·}‖_1 ‖X_{ℓ·}‖_1 ) ) { η_{jℓ} + 2 Θ̂_{jℓ} √( log M / n ) [ ( n / ‖X_{j·}‖_1 ) ( (1/n) Σ_{i=1}^n X_{ji}/N_i )^{1/2} + ( n / ‖X_{ℓ·}‖_1 ) ( (1/n) Σ_{i=1}^n X_{ℓi}/N_i )^{1/2} ] }    (50)

and

η_{jℓ} = 3√6 ( ‖X_{j·}‖_∞^{1/2} + ‖X_{ℓ·}‖_∞^{1/2} ) √( log M / n ) ( (1/n) Σ_{i=1}^n X_{ji} X_{ℓi} / N_i )^{1/2}
+ (2 log M / n) ( ‖X_{j·}‖_∞ + ‖X_{ℓ·}‖_∞ ) (1/n) Σ_{i=1}^n 1/N_i + 31 √( (log M)⁴ / n ) ( (1/n) Σ_{i=1}^n (X_{ji} + X_{ℓi}) / N_i³ )^{1/2}    (51)
and set C0 = 0.01 and C1 = 1.1 in Algorithm 3. We found that these choices for C0 and C1 not
only give good overall performance, but are robust as well. To verify this claim, we generated 50
datasets under a benchmark setting of N = 1500, n = 1500, p = 1000, K = 30, |Ik| = p/100 and
ξ = 1/p. We first applied our Algorithm 3 with T = 1 to each dataset by setting C1 = 1.1 and
varying C0 within the grid 0.001, 0.003, 0.005, . . . , 0.097, 0.099. The estimation error L1(A, A)/K,
averaged over the 50 datasets, is shown in Figure 1 and clearly demonstrates that our algorithm is
robust to the choice of C0 in terms of overall estimation error. In addition, we applied Algorithm
3 by keeping C0 = 0.01 and varying C1 from 0.1, 0.2, . . . , 11.9, 12. Since C1 mainly controls the
selection of anchor words in Algorithm 2, we averaged the estimated topics number K, sensitivity
|I ∩ I|/ |I| and specificity |Ic ∩ Ic|/ |Ic| of the selected anchor words over the 50 datasets. Figure 2
shows that Algorithm 2 recovers all anchor words by choosing any C1 from the whole range of [1, 10]
and consistently estimates the number of topics for all 0.2 ≤ C1 ≤ 10, which strongly supports the
26
robustness of Algorithm 2 relative to the choice of the tuning parameter C1.
Figure 1: Plots of overall estimation error L_1(A, Â)/K versus C_0. The right plot is zoomed in.
Figure 2: Plots of K̂, sensitivity and specificity versus C_1, when the true K = 30.
Throughout, we consider two versions of our algorithm: Top1 and Top10, described in Algorithm 3 with T = 1 and T = 10, respectively. We compare Top with the best performing algorithm available, that of Arora et al. (2013). We denote this algorithm by Recover-L2 or Recover-KL, depending on which loss function is used for estimating the non-anchor rows in their Algorithm 3. In Appendix G we conducted a small simulation study to compare these two methods, and ours, with the recent procedure of Ke and Wang (2017), using the implementation the authors kindly made available to us. Their method is tailored to topic models with a known, small, number of topics. Our study revealed that, in the "small K" regime, their procedure is comparable to, or outperformed by, existing methods. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is a popular Bayesian approach to
topic models, but is computationally demanding.
The procedures from Arora et al. (2013) have better performance than LDA in terms of overall
loss and computational cost, as evidenced by their simulations. For this reason, we only focus on
the comparison of our method with Recover-L2 and Recover-KL for the synthetic data. The
comparison with LDA is considered in the semi-synthetic data.
We report the findings of our simulation studies in this section by showing that our algorithms
estimate both the number of topics and anchor words consistently, and have superior performance
in terms of estimation error as well as computational time in various settings over the existing
algorithms.
We re-emphasize that in all the comparisons presented below, the existing methods have as input
the true K used to simulate the data, while we also estimate K. In Appendix G, we show that these
algorithms are very sensitive to the choice of K. This demonstrates that correct estimation of K is
indeed highly critical for the estimation of the entire matrix A.
Topics and anchor words recovery
Top10 and Top1 use the same procedure (Algorithm 2) to select the anchor words; likewise for Recover-L2 and Recover-KL. We present in Table 1 the observed sensitivity |Î ∩ I|/|I| and specificity |Îᶜ ∩ Iᶜ|/|Iᶜ| of the selected anchor words in the benchmark setting with |I_k| varying. It is clear that Top recovers all anchor words and estimates the number of topics K consistently. All algorithms perform perfectly in not selecting non-anchor words. We emphasize that the correct K is given to the Recover procedures.
Table 1: Anchor word recovery and topic recovery for varying |I_k|.

                            Top                               Recover
|I_k|               2     4     6     8     10       2     4      6      8      10
sensitivity       100%  100%  100%  100%  100%      50%   25%   16.7%  12.5%   10%
specificity       100%  100%  100%  100%  100%     100%  100%   100%   100%   100%
Number of topics        100% (all settings)                     N/A
Estimation error
In the benchmark setting, we varied N and n over {500, 1000, 1500, 2000, 2500}, p over {500, 800, 1000, 1200, 1500}, K over {20, 25, 30, 35, 40} and |I_k| over {2, 4, 6, 8, 10}, one at a time. For each case, the overall estimation error ‖Â − AP‖_1/K and the topic-wise estimation error ‖Â − AP‖_{1,∞}, averaged over 50 generated datasets for each dimensional setting, were recorded. We used a simple linear program to find the best permutation matrix P which aligns Â with A. Since the two measures had similar patterns for all settings, we only present the overall estimation error in Figure 3, which can be summarized as follows:
- The estimation error of all four algorithms decreases as n or N increases, while it increases
as p or K increases. This confirms our theoretical findings and indicates that A is harder to
estimate when not only p, but K as well, is allowed to grow.
- In all settings, Top10 has the smallest estimation error. Meanwhile, Top1 has better perfor-
mance than Recover-L2 and Recover-KL except for N = 500 and |Ik| = 2. The difference
between Top10 and Top1 decreases as the length N of each sampled document increases.
This is to be expected since the larger the N , the better each column of X approximates the
corresponding column of Π, which lessens the benefit of selecting different representative sets
L of anchor words.
- Recover-KL is more sensitive to the specification of K and |I_k| than the other approaches.
Its performance worsens relative to the other procedures as K increases. On the other hand,
when the sizes |I_k| are small, it performs almost as well as Top10; however, its performance
does not improve as much as that of the other algorithms in the presence of more anchor
words.
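Aligning Â with A over column permutations is an assignment problem. The sketch below uses brute-force enumeration over permutations (adequate for small K; the paper solves a linear program instead, and all names here are ours):

```python
import itertools

import numpy as np


def align_columns(A_hat, A):
    """Align the ground truth A to the estimate A_hat by the column
    permutation P minimizing ||A_hat - A P||_1.  Returns the permuted A,
    the overall error ||A_hat - A P||_1 / K and the topic-wise error
    ||A_hat - A P||_{1,inf}."""
    K = A.shape[1]
    # cost[j, k] = L1 distance between column j of A_hat and column k of A
    cost = np.abs(A_hat[:, :, None] - A[:, None, :]).sum(axis=0)
    best = min(itertools.permutations(range(K)),
               key=lambda perm: sum(cost[j, perm[j]] for j in range(K)))
    A_perm = A[:, list(best)]
    diff = np.abs(A_hat - A_perm)
    return A_perm, diff.sum() / K, diff.sum(axis=0).max()
```

For larger K, one would replace the enumeration with an assignment solver; the error values are unchanged.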
[Figure 3 here: five panels plotting the averaged overall estimation error ‖Â − AP‖₁/K of Top10, Top1, Recover-L2 and Recover-KL against n ∈ {500, ..., 2500}, p ∈ {500, ..., 1500}, N ∈ {500, ..., 2500}, K ∈ {20, ..., 40} and |I_k| ∈ {2, ..., 10}, one parameter varied at a time.]

Figure 3: Plots of averaged overall estimation error, varying one parameter at a time.
Running time
The running time of all four algorithms is shown in Figure 4. As expected, Top1 dominates in terms
of computational efficiency; its computational cost increases only slightly in p or K. Meanwhile,
the running time of Top10 is lower than that of Recover-L2 in most of the settings and becomes
comparable when K is large or p is small. Recover-KL is overall much more computationally
demanding than the others. We see that Top1 and Top10 are nearly independent of n, the number
of documents, and N, the document length, as these parameters only appear in the computation
of the matrix R̂ and the tuning parameters δ̂_ij and η̂_ij. More importantly, as the dictionary size p
increases, the two Recover algorithms become much more computationally expensive than Top.
This difference stems from the fact that our procedure of estimating A is almost independent of
p computationally. Top solves K linear programs in K-dimensional space, while Recover must
solve p convex optimization problems in K-dimensional spaces.
We emphasize again that our Top procedure accurately estimates K in the reported times,
whereas we provide the two Recover versions with the true values of K. In practice, one needs to
resort to various cross-validation schemes to select a value of K for the Recover algorithms, see
Arora et al. (2013). This would dramatically increase the actual running time for these procedures.
[Figure 4 here: five panels plotting the running time (in seconds) of Top10, Top1, Recover-L2 and Recover-KL against n, p, N, K and |I_k|, one parameter varied at a time.]

Figure 4: Plots of running time, varying one parameter at a time.
Semi-synthetic data from the NIPS corpus
In this section, we compare our algorithm with existing competitors on semi-synthetic data, gen-
erated as follows.

We begin with a real-world dataset¹, a corpus of NIPS articles (Dheeru and Karra Taniskidou,
2017), to benchmark our algorithm and compare Top1 with LDA (Blei et al., 2003), Recover-L2
and Recover-KL. We use the LDA code of Riddell et al. (2016), implemented via fast collapsed
Gibbs sampling with the default 1000 iterations. To preprocess the data, following Arora et al.
(2013), we removed common stop words and rare words occurring in fewer than 150 documents,
and discarded documents with fewer than 150 words. The resulting dataset has n = 1480 documents
with dictionary size p = 1253 and mean document length 858.

¹More comparisons based on the New York Times dataset are relegated to the supplement.
To generate semi-synthetic data, we first apply Top to this real dataset, in order to obtain
the estimated word-topic matrix Â, which we then use as the ground truth in our simulation
experiments, performed as follows.² For each document i ∈ [n], we sample W_i from a specified
distribution (see below) and we sample X_i from Multinomial_p(N_i, ÂW_i). The estimated Â from
Top (with C₁ = 4.5 chosen via cross-validation and C₀ = 0.01) contains 178 anchor words and 120
topics. We consider three distributions for W, chosen as in Arora et al. (2013):
(a) symmetric Dirichlet distribution with parameter 0.03;
(b) logistic-normal distribution with block diagonal covariance matrix and ρ = 0.02;
(c) logistic-normal distribution with block diagonal covariance matrix and ρ = 0.2.
Cases (b) and (c) are designed to investigate how the correlation among topics affects the estimation
error. To construct the block-diagonal covariance structure, we divide the 120 topics into 10 groups.
Within each group, the off-diagonal elements of the covariance matrix of the topics are set to ρ while
the diagonal entries are set to 1. The parameter ρ ∈ {0.02, 0.2} reflects the magnitude of correlation
among topics.
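The samplers for cases (b) and (c) can be sketched as follows, assuming the usual logistic-normal construction (exponentiate a multivariate normal draw, then normalize to the simplex); the function name and defaults are ours:

```python
import numpy as np


def sample_logistic_normal(n, K=120, n_groups=10, rho=0.2, seed=0):
    """Draw n topic-weight vectors from a logistic-normal distribution with a
    block-diagonal covariance: K topics split into n_groups equal groups,
    within-group correlation rho, unit variances.  Returns an (n, K) array
    whose rows lie on the probability simplex."""
    rng = np.random.default_rng(seed)
    block_size = K // n_groups
    block = np.full((block_size, block_size), rho)
    np.fill_diagonal(block, 1.0)
    # block-diagonal covariance over all K topics
    Sigma = np.kron(np.eye(n_groups), block)
    Z = rng.multivariate_normal(np.zeros(K), Sigma, size=n)
    W = np.exp(Z)
    W /= W.sum(axis=1, keepdims=True)  # normalize each row to the simplex
    return W
```

Setting rho = 0.02 or rho = 0.2 reproduces cases (b) and (c) respectively, while rho controls how correlated topics within a group are.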
The number of documents n is varied over {2000, 3000, 4000, 5000, 6000} and the document length
is set to N_i = 850 for 1 ≤ i ≤ n. In each setting, we generate 20 datasets and report the
averaged overall estimation error and topic-wise estimation error (defined above) of the
different algorithms in Figure 5. The running time of each algorithm is reported in Table 2.
Overall, LDA is outperformed by the other three methods, though its performance might
improve with more iterations. Top, Recover-KL and Recover-L2 are
comparable when the columns of W are sampled from a symmetric Dirichlet with parameter 0.03,
whereas Top performs better as the correlation among topics increases. Moreover, Top
has the best control of the topic-wise estimation error, as expected, while the comparison between
Recover-KL and Recover-L2 depends on the error metric. In terms of running time,
Top is significantly faster than the other three methods.
Finally, we emphasize that we provide LDA and the two Recover algorithms with the true
K, whereas Top estimates it.
²Arora et al. (2013) use the posterior estimate of A from LDA with K = 100. Since we have no prior
information on K, we instead use Top to estimate it. Moreover, the posterior estimate from LDA does not satisfy the
anchor word assumption, so to evaluate the effect of anchor words one has to add anchor
words manually (Arora et al., 2013). In contrast, the estimated Â from Top automatically contains anchor words.
[Figure 5 here: two rows of panels (overall estimation error; topic-wise estimation error), each with three sub-panels for W drawn from Dirichlet(0.03), logistic-normal with ρ = 0.02 and logistic-normal with ρ = 0.2, plotted against n ∈ {2000, ..., 6000}.]

Figure 5: Plots of averaged overall estimation error and topic-wise estimation error of Top,
Recover-L2 (L2), Recover-KL (KL) and LDA. Top estimates K; the other methods
use the true K as input. The bars denote one standard deviation.
Acknowledgements
Bunea and Wegkamp are supported in part by NSF grant DMS-1712709. We thank the Editor,
Associate Editor and three referees for their constructive remarks.
References
Anandkumar, A., Foster, D. P., Hsu, D. J., Kakade, S. M. and Liu, Y.-k. (2012). A spectral
algorithm for latent dirichlet allocation. In Advances in Neural Information Processing Systems
25 (F. Pereira, C. J. C. Burges, L. Bottou and K. Q. Weinberger, eds.). Curran Associates, Inc.,
917–925.
Arora, S., Ge, R., Halpern, Y., Mimno, D. M., Moitra, A., Sontag, D., Wu, Y. and Zhu,
M. (2013). A practical algorithm for topic modeling with provable guarantees. In ICML (2).
Arora, S., Ge, R. and Moitra, A. (2012). Learning topic models–going beyond svd. In Foun-
dations of Computer Science (FOCS), 2012, IEEE 53rd Annual Symposium. IEEE.
Bansal, T., Bhattacharyya, C. and Kannan, R. (2014). A provable svd-based algorithm for
learning topics in dominant admixture corpus. In Proceedings of the 27th International Confer-
ence on Neural Information Processing Systems - Volume 2. NIPS’14, MIT Press, Cambridge,
MA, USA.
Table 2: Running time (seconds) of different algorithms
Top Recover-L2 Recover-KL LDA
n = 2000 21.4 428.2 2404.5 3052.3
n = 3000 22.3 348.2 1561.8 4649.5
n = 4000 25.3 353.5 1764.8 6051.1
n = 5000 28.5 349.0 1800.4 7113.0
n = 6000 29.5 346.6 1848.1 7318.4
Bing, X., Bunea, F., Yang, N. and Wegkamp, M. (2017). Sparse latent factor models with
pure variables for overlapping clustering. arXiv: 1704.06977 .
Bittorf, V., Recht, B., Re, C. and Tropp, J. A. (2012). Factoring nonnegative matrices with
linear programs. arXiv:1206.1270 .
Blei, D. M. (2012). Introduction to probabilistic topic models. Communications of the ACM 55
77–84.
Blei, D. M. and Lafferty, J. D. (2007). A correlated topic model of science. Ann. Appl. Stat.
1 17–35.
Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of
Machine Learning Research 3 993–1022.
Cox, D. R. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference.
Journal of the Royal Statistical Society. Series B (Methodological) 49 1–39.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. and Harshman, R.
(1990). Indexing by latent semantic analysis. Journal of the American society for information
science 41 391–407.
Dheeru, D. and Karra Taniskidou, E. (2017). UCI machine learning repository.
Ding, W., Rohban, M. H., Ishwar, P. and Saligrama, V. (2013). Topic discovery through
data dependent and random projections. In Proceedings of the 30th International Conference
on Machine Learning (S. Dasgupta and D. McAllester, eds.), vol. 28 of Proceedings of Machine
Learning Research. PMLR, Atlanta, Georgia, USA.
URL http://proceedings.mlr.press/v28/ding13.html
Donoho, D. and Stodden, V. (2004). When does non-negative matrix factorization give a correct
decomposition into parts? In Advances in Neural Information Processing Systems 16 (S. Thrun,
L. K. Saul and P. B. Scholkopf, eds.). MIT Press, 1141–1148.
Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National
Academy of Sciences 101 5228–5235.
Hofmann, T. (1999). Probabilistic latent semantic indexing. Proceedings of the Twenty-Second
Annual International SIGIR Conference .
Ke, T. Z. and Wang, M. (2017). A new SVD approach to optimal topic estimation.
arXiv:1704.07016 .
Li, W. and McCallum, A. (2006). Pachinko allocation: Dag-structured mixture models of topic
correlations. In Proceedings of the 23rd International Conference on Machine Learning. ICML
2006, ACM, New York, NY, USA.
Mimno, D., Wallach, H. M., Talley, E., Leenders, M. and McCallum, A. (2011). Op-
timizing semantic coherence in topic models. In Proceedings of the Conference on Empirical
Methods in Natural Language Processing. EMNLP ’11, Association for Computational Linguis-
tics, Stroudsburg, PA, USA.
Papadimitriou, C. H., Raghavan, P., Tamaki, H. and Vempala, S. (2000). Latent semantic
indexing: A probabilistic analysis. Journal of Computer and System Sciences 61 217–235.
Papadimitriou, C. H., Tamaki, H., Raghavan, P. and Vempala, S. (1998). Latent semantic
indexing: A probabilistic analysis. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-
SIGART Symposium on Principles of Database Systems. PODS ’98, ACM, New York, NY, USA.
Riddell, A., Hopper, T. and Grivas, A. (2016). lda: 1.0.4.
URL https://doi.org/10.5281/zenodo.57927
Stevens, K., Kegelmeyer, P., Andrzejewski, D. and Buttler, D. (2012). Exploring topic
coherence over many models and many topics. In Proceedings of the 2012 Joint Conference
on Empirical Methods in Natural Language Processing and Computational Natural Language
Learning. Association for Computational Linguistics, Jeju Island, Korea.
Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer, New York.
Supplement
From the topic model specifications, the matrices Π, A and W are all scaled as
\[
\sum_{j=1}^p \Pi_{ji} = 1, \qquad \sum_{j=1}^p A_{jk} = 1, \qquad \sum_{k=1}^K W_{ki} = 1 \tag{52}
\]
for any 1 ≤ i ≤ n and 1 ≤ k ≤ K. In order to adjust their scales properly, we denote
\[
m_j = p\max_{1\le i\le n}\Pi_{ji}, \qquad \mu_j = \frac{p}{n}\sum_{i=1}^n\Pi_{ji}, \qquad \alpha_j = p\max_{1\le k\le K}A_{jk}, \qquad \gamma_k = \frac{K}{n}\sum_{i=1}^n W_{ki}, \tag{53}
\]
so that
\[
\sum_{k=1}^K \gamma_k = K, \qquad \sum_{j=1}^p \mu_j = p. \tag{54}
\]
We further denote $m_{\min} = \min_{1\le j\le p} m_j$ and $\mu_{\min} = \min_{1\le j\le p}\mu_j$.
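The scaling quantities in (53) and the identities in (54) can be checked numerically; a minimal sketch (names are ours):

```python
import numpy as np


def scale_quantities(Pi, A, W):
    """Compute the scaling quantities (53): m_j, mu_j, alpha_j and gamma_k
    for a p x n matrix Pi, a p x K matrix A and a K x n matrix W, all with
    unit column sums as in (52)."""
    p, n = Pi.shape
    K = A.shape[1]
    m = p * Pi.max(axis=1)            # m_j     = p * max_i Pi_ji
    mu = (p / n) * Pi.sum(axis=1)     # mu_j    = (p/n) * sum_i Pi_ji
    alpha = p * A.max(axis=1)         # alpha_j = p * max_k A_jk
    gamma = (K / n) * W.sum(axis=1)   # gamma_k = (K/n) * sum_i W_ki
    return m, mu, alpha, gamma
```

With column-stochastic inputs, the outputs satisfy sum(gamma) = K and sum(mu) = p as in (54), and mu_j ≤ m_j ≤ alpha_j entrywise as in (67) below.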
A Proofs of Section 2
Proof of Proposition 1. Recall that $Y_i := N_iX_i \sim \mathrm{Multinomial}_p(N_i;\Pi_i)$ for any $i\in[n]$. The joint
log-likelihood of $(Y_1,\ldots,Y_n)$ is
\[
\ell(Y_1,\ldots,Y_n) = \sum_{i=1}^n\log(N_i!) - \sum_{i=1}^n\sum_{j=1}^p\log(Y_{ji}!) + \sum_{i=1}^n\sum_{j=1}^p Y_{ji}\log\Pi_{ji}
= \sum_{i=1}^n\log(N_i!) - \sum_{i=1}^n\sum_{j=1}^p\log(Y_{ji}!) + \sum_{i=1}^n\sum_{j=1}^p Y_{ji}\log\Bigl(\sum_{k=1}^K A_{jk}W_{ki}\Bigr).
\]
Fix any $j\in[p]$, $k\in[K]$ and $i\in[n]$. It follows that
\[
\frac{\partial\ell(Y_1,\ldots,Y_n)}{\partial A_{jk}} =
\begin{cases}
\sum_{i=1}^n Y_{ji}W_{ki}\big/\bigl(\sum_{t=1}^K A_{jt}W_{ti}\bigr), & \text{if } A_{jk}\neq 0,\ W_{ki}\neq 0,\\
0, & \text{otherwise,}
\end{cases}
\]
from which we further deduce
\[
\frac{\partial^2\ell(Y_1,\ldots,Y_n)}{\partial A_{jk}\,\partial W_{ki}} =
\begin{cases}
\sum_{i=1}^n Y_{ji}\bigl(\sum_{t=1}^K A_{jt}W_{ti} - A_{jk}W_{ki}\bigr)\big/\bigl(\sum_{t=1}^K A_{jt}W_{ti}\bigr)^2, & \text{if } A_{jk}\neq 0,\ W_{ki}\neq 0,\\
0, & \text{otherwise.}
\end{cases}
\]
Since $\mathbb{E}[Y_{ji}] = N_i\Pi_{ji}$, taking expectations yields
\[
\mathbb{E}\Bigl[-\frac{\partial^2\ell(Y_1,\ldots,Y_n)}{\partial A_{jk}\,\partial W_{ki}}\Bigr] =
\begin{cases}
\sum_{i=1}^n N_i\bigl(\sum_{t\neq k}A_{jt}W_{ti}\bigr)\big/\bigl(\sum_{t=1}^K A_{jt}W_{ti}\bigr), & \text{if } A_{jk}\neq 0,\ W_{ki}\neq 0,\\
0, & \text{otherwise.}
\end{cases}
\tag{55}
\]
Similarly, for this $j$, $k$ and $i$ but with any $k'\neq k$, we have
\[
\frac{\partial^2\ell(Y_1,\ldots,Y_n)}{\partial A_{jk}\,\partial W_{k'i}} =
\begin{cases}
-\sum_{i=1}^n Y_{ji}A_{jk'}W_{ki}\big/\bigl(\sum_{t=1}^K A_{jt}W_{ti}\bigr)^2, & \text{if } A_{jk}\neq 0,\ A_{jk'}\neq 0,\ W_{k'i}\neq 0,\ W_{ki}\neq 0,\\
0, & \text{otherwise,}
\end{cases}
\]
and
\[
\mathbb{E}\Bigl[-\frac{\partial^2\ell(Y_1,\ldots,Y_n)}{\partial A_{jk}\,\partial W_{k'i}}\Bigr] =
\begin{cases}
\sum_{i=1}^n N_iA_{jk'}W_{ki}\big/\sum_{t=1}^K A_{jt}W_{ti}, & \text{if } A_{jk}\neq 0,\ A_{jk'}\neq 0,\ W_{k'i}\neq 0,\ W_{ki}\neq 0,\\
0, & \text{otherwise.}
\end{cases}
\tag{56}
\]
From (55) and (56), it is easy to see that condition (7) implies
\[
\mathbb{E}\Bigl[-\frac{\partial^2\ell(Y_1,\ldots,Y_n)}{\partial A_{jk}\,\partial W_{k'i}}\Bigr] = 0 \tag{57}
\]
for any $j\in[p]$, $k,k'\in[K]$ and $i\in[n]$. This proves the sufficiency. To show the necessity, we argue by
contradiction. Suppose (57) holds for any $j\in[p]$, $k,k'\in[K]$ and $i\in[n]$, and that there exist at least one
$j\in[p]$ and $i\in[n]$ such that $\mathrm{supp}(A_{j\cdot})\cap\mathrm{supp}(W_{\cdot i}) = \{k_1,k_2\}$ with $k_1\neq k_2$. Then, (56) implies
\[
\mathbb{E}\Bigl[-\frac{\partial^2\ell(Y_1,\ldots,Y_n)}{\partial A_{jk_1}\,\partial W_{k_2 i}}\Bigr] \ge \frac{N_iA_{jk_1}W_{k_2 i}}{A_{jk_1}W_{k_1 i} + A_{jk_2}W_{k_2 i}} \neq 0.
\]
This contradicts (7) and concludes the proof.
Proof of Proposition 2. Since the columns of $A$ sum up to 1, and Assumption 1 holds, the matrix $\bar A$ satisfies
\[
\bar A_{jk} \ge 0, \qquad \|\bar A_{j\cdot}\|_1 = 1, \qquad \text{for each } j = 1,\ldots,p \text{ and } k = 1,\ldots,K. \tag{58}
\]
Additionally, $\bar A$ has the same sparsity pattern as $A$, and thus $\bar A$ satisfies Assumption 1, with the
same $I$ and $I_k$, $k\in[K]$. We further notice that Assumption 3 is equivalent to
\[
|\langle W_{i\cdot}, W_{j\cdot}\rangle| < \|W_{i\cdot}\|_2^2 \wedge \|W_{j\cdot}\|_2^2, \qquad \text{for all } 1\le i<j\le K,
\]
which is in turn equivalent to $\nu > 0$, with $\nu$ defined in (23).

To finish the proof we invoke Theorem 1 in Bing et al. (2017), slightly adapted to our situation.
Specifically, we consider any matrix $R$ defined in (11) that factorizes as $R = \bar A C\bar A^\top$, where $\bar A$ satisfies
Assumption 1 and $C$ satisfies (23). Note that the quantities $M_i$ and $S_i$ defined on page 9 of Bing
et al. (2017) are replaced by, respectively, $T_i$ and $S_i$ in (12). We proceed to prove (a) and (b) of
Proposition 2.
36
Proof of (a). We first show the sufficiency part. Consider any $i\in[p]$ with $T_i = T_j$ for all
$j\in S_i$. Part (a) of Lemma 10, stated after the proof, shows that there exists a $j\in I_a\cap S_i$ for some
$a\in[K]$. For this $j\in I_a$, we have $T_j = C_{aa}$ from part (b) of Lemma 10. Invoking our premise
$T_j = T_i$ as $j\in S_i$, we conclude that $T_i = C_{aa}$, that is, $\max_{k\in[p]} R_{ik} = C_{aa}$. By Lemma 9, stated after
the proof, this maximum is achieved for any $i\in I_a$; however, if $i\notin I_a$, we have $R_{ik} < C_{aa}$ for
all $k\in[p]$. Hence $i\in I_a$, and this concludes the proof of the sufficiency part.

It remains to prove the necessity part. Let $i\in I_a$ for some $a\in[K]$ and $j\in S_i$. Lemma 10 implies
that $j\in I_a$ and $T_i = C_{aa}$. Since $j\in S_i$, we have $R_{ij} = T_i = C_{aa}$, while $j\in I_a$ yields $R_{jk}\le C_{aa}$ for
all $k\in[p]$, and $R_{jk} = C_{aa}$ for $k\in I_a$, as a result of Lemma 9. Hence, $T_j = \max_{k\in[p]} R_{jk} = C_{aa} = T_i$
for any $j\in S_i$, which proves our claim.
Proof of (b). The proof of (b) follows by the same arguments for proving part (b) of Theorem
1 in Bing et al. (2017). The proof is then complete.
The following two lemmas are used in the above proof. They are adapted from Lemmas 1 and
2 of the supplement of Bing et al. (2017).
Lemma 9. For any a ∈ [K] and i ∈ Ia, we have
(a) Rij = Caa for all j ∈ Ia,
(b) Rij < Caa for all j 6∈ Ia.
Proof. The proof follows the same argument for proving Lemma 1 in Bing et al. (2017).
Lemma 10. Let $T_i$ and $S_i$ be defined in (12). We have

(a) $S_i\cap I \neq \emptyset$, for any $i\in[p]$;

(b) $S_i\cup\{i\} = I_a$ and $T_i = C_{aa}$, for any $i\in I_a$ and $a\in[K]$.

Proof. Lemma 9 implies that, for any $i\in I_a$, $T_i = C_{aa}$ and $S_i\cup\{i\} = I_a$, which proves part (b). Part
(a) follows from the result of part (b) and the same arguments as in the proof of Lemma 2 in Bing
et al. (2017), replacing $M_i$ by $T_i$, $|\Sigma_{ij}|$ by $R_{ij}$, $A$ by $\bar A$ and $C$ by $C$.
B Error bounds of stochastic errors
In this section we present tight bounds on the stochastic error terms that are critical for our later
estimation rates. Recall that $\varepsilon_{ji} := X_{ji} - \Pi_{ji}$, for $1\le i\le n$ and $1\le j\le p$, and assume
$N_1 = \ldots = N_n = N$ for ease of presentation, since similar results for unequal $N_i$ can be derived by
the same arguments. The following results, Lemmas 13–15, control several terms related to the $\varepsilon_{ji}$
under the multinomial assumption (2). We start by stating the well-known Bernstein and Hoeffding
inequalities for bounded random variables, which are used in the sequel.
Lemma 11 (Bernstein's inequality for bounded random variables). For independent random vari-
ables $Y_1,\ldots,Y_n$ with bounded ranges $[-B,B]$ and zero means,
\[
\mathbb{P}\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n Y_i\Bigr| > x\Biggr\} \le 2\exp\Bigl(-\frac{n^2x^2/2}{v + nBx/3}\Bigr), \qquad \text{for any } x\ge 0,
\]
where $v\ge \mathrm{var}(Y_1+\ldots+Y_n)$.
Lemma 12 (Hoeffding’s inequality). Let Y1, . . . , Yn be independent random variables with E[Yi] =
0 and bounded by [ai, bi]: For any t ≥ 0, we have
P
∣∣∣∣∣n∑i=1
Yi
∣∣∣∣∣ > t
≤ 2 exp
(− 2t2∑n
i=1(bi − ai)2
).
Lemma 13. Assume $\mu_{\min}/p \ge 2\log M/(3N)$. With probability $1-2M^{-1}$,
\[
\frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| \le 2\sqrt{\frac{\mu_j\log M}{npN}}\Bigl(1+\sqrt{6}\,n^{-1/2}\Bigr), \qquad \text{uniformly in } 1\le j\le p.
\]

Proof. Fix $1\le j\le p$. From model (3), we know $NX_{ji}\sim\mathrm{Binomial}(N;\Pi_{ji})$. We express this
binomial random variable as a sum of i.i.d. Bernoulli random variables:
\[
\varepsilon_{ji} = X_{ji} - \Pi_{ji} = \frac{1}{N}\sum_{k=1}^N\bigl(B^j_{ik} - \Pi_{ji}\bigr) := \frac{1}{N}\sum_{k=1}^N Z^j_{ik}
\]
with $B^j_{ik}\sim\mathrm{Bernoulli}(\Pi_{ji})$, such that $N\sum_{i=1}^n\varepsilon_{ji} = \sum_{i=1}^n\sum_{k=1}^N Z^j_{ik}$. Note that $|Z^j_{ik}|\le 1$, $\mathbb{E}[Z^j_{ik}] = 0$
and $\mathbb{E}[(Z^j_{ik})^2] = \Pi_{ji}(1-\Pi_{ji})\le\Pi_{ji}$, for all $i\in[n]$ and $k\in[N]$. An application of Bernstein's
inequality (Lemma 11) to the $Z^j_{ik}$ with $v = N\sum_{i=1}^n\Pi_{ji} = Nn\mu_j/p$ and $B = 1$ gives
\[
\mathbb{P}\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| > t\Biggr\} \le 2\exp\Bigl(-\frac{n^2N^2t^2/2}{N\sum_{i=1}^n\Pi_{ji} + nNt/3}\Bigr), \qquad \text{for any } t>0.
\]
This implies, for all $t>0$,
\[
\mathbb{P}\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| > \sqrt{\frac{\mu_jt}{npN}} + \frac{t}{nN}\Biggr\} \le 2e^{-t/2}.
\]
Choosing $t = 4\log M$, and recalling that $M = n\vee N\vee p \ge p$, we find by the union bound
\[
\sum_{j=1}^p\mathbb{P}\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| > 2\sqrt{\frac{\mu_j\log M}{npN}} + \frac{4\log M}{nN}\Biggr\} \le 2pM^{-2} \le 2M^{-1}. \tag{59}
\]
Using $\mu_{\min}/p \ge 2\log M/(3N)$ concludes the proof.
Remark 13. By inspection of the proof of Lemma 13, if instead of the bound $\mathbb{E}[(Z^j_{ik})^2]\le\Pi_{ji}$ we
had used the crude bound $\mathbb{E}[(Z^j_{ik})^2]\le 1$, an application of Bernstein's inequality would yield
\[
\frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| \le c\sqrt{\frac{\log M}{nN}}, \qquad \text{uniformly in } 1\le j\le p.
\]
Summing over $1\le j\le p$ in the above display would further give
\[
\frac{1}{n}\sum_{j=1}^p\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| \le c\,p\sqrt{\frac{\log M}{nN}},
\]
and the right-hand side would be slower by an important $\sqrt p$ factor than what we obtain by
summing the bound in Lemma 13 over $1\le j\le p$, since
\[
2\sum_{j=1}^p\sqrt{\frac{\mu_j\log M}{npN}} \le 2\sqrt{\frac{\log M}{nN}}\sqrt{\sum_{j=1}^p\mu_j} = 2\sqrt{\frac{p\log M}{nN}}.
\]
In this last display, we used the Cauchy-Schwarz inequality in the first step and the constraint
(54) in the last equality. The bound of Lemma 13 is an important intermediate step for deriving
the final bounds of Theorem 7, and the simple calculations presented above show that the constraint of
unit column sums induced by model (3) permits a more refined control of the stochastic errors than
those previously considered. A larger impact of this refinement on the final rate of convergence is
illustrated in Remark 14, following the proof of Lemma 15, in which we control sums of the quadratic,
dependent terms $\varepsilon_{ji}\varepsilon_{\ell i}$.
Lemma 14. With probability $1-2M^{-1}$,
\[
\frac{1}{n}\Bigl|\sum_{i=1}^n\Pi_{\ell i}\varepsilon_{ji}\Bigr| \le \sqrt{\frac{6m_\ell\Theta_{j\ell}\log M}{npN}} + \frac{2m_\ell\log M}{npN}, \qquad \text{uniformly in } 1\le j,\ell\le p.
\]

Proof. As in the proof of Lemma 13, we write
\[
\Pi_{\ell i}\varepsilon_{ji} = \frac{1}{N}\sum_{k=1}^N\Pi_{\ell i}Z^j_{ik},
\]
such that $|\Pi_{\ell i}Z^j_{ik}|\le\Pi_{\ell i}\le m_\ell/p$ by (53), $\mathbb{E}[\Pi_{\ell i}Z^j_{ik}] = 0$ and $\mathbb{E}[\Pi_{\ell i}^2(Z^j_{ik})^2] = \Pi_{\ell i}^2\Pi_{ji}(1-\Pi_{ji}) \le m_\ell\Pi_{\ell i}\Pi_{ji}/p$, for all $i\in[n]$ and $k\in[N]$. Fix $1\le j,\ell\le p$ and recall that $\Theta_{j\ell} = n^{-1}\sum_{i=1}^n\Pi_{ji}\Pi_{\ell i}$.
Applying Bernstein's inequality to $\Pi_{\ell i}Z^j_{ik}$ with $v = nNm_\ell\Theta_{j\ell}/p$ and $B = m_\ell/p$ gives
\[
\mathbb{P}\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\Pi_{\ell i}\varepsilon_{ji}\Bigr| \ge t\Biggr\} \le 2\exp\Bigl(-\frac{n^2N^2t^2/2}{nNm_\ell\Theta_{j\ell}/p + nNm_\ell t/(3p)}\Bigr),
\]
which further implies
\[
\mathbb{P}\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\Pi_{\ell i}\varepsilon_{ji}\Bigr| > \sqrt{\frac{m_\ell\Theta_{j\ell}t}{npN}} + \frac{m_\ell t}{3npN}\Biggr\} \le 2e^{-t/2}.
\]
Taking the union bound over $1\le j,\ell\le p$, and choosing $t = 6\log M$, concludes the proof.
Lemma 15. If $\mu_{\min}/p \ge 2\log M/(3N)$, then with probability $1-4M^{-1}$,
\[
\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i} - \mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr| \le 12\sqrt{6}\sqrt{\Theta_{j\ell} + \frac{(\mu_j+\mu_\ell)\log M}{pN}}\sqrt{\frac{(\log M)^3}{nN^2}} + 4M^{-3}
\]
holds, uniformly in $1\le j,\ell\le p$.

Proof. For any $1\le j\le p$, recall that $\varepsilon_{ji} = X_{ji} - \Pi_{ji}$ and
\[
\varepsilon_{ji} = \frac{1}{N}\sum_{k=1}^N Z^j_{ik},
\]
where $Z^j_{ik} := B^j_{ik} - \Pi_{ji}$ and $B^j_{ik}\sim\mathrm{Bernoulli}(\Pi_{ji})$. By the arguments in the proof of Lemma 13, an application
of Bernstein's inequality to $Z^j_{ik}$ with $v = N\Pi_{ji}$ and $B = 1$ gives
\[
\mathbb{P}\{|\varepsilon_{ji}| > t\} \le 2\exp\Bigl(-\frac{Nt^2/2}{\Pi_{ji} + t/3}\Bigr), \qquad \text{for any } t>0,
\]
which yields
\[
|\varepsilon_{ji}| \le \sqrt{\frac{6\Pi_{ji}\log M}{N}} + \frac{2\log M}{N} := T_{ji}
\]
with probability greater than $1-2M^{-3}$. We define $Y_{ji} = \varepsilon_{ji}1_{S_{ji}}$ with $S_{ji} := \{|\varepsilon_{ji}|\le T_{ji}\}$ for each
$1\le j\le p$ and $1\le i\le n$, and $S := \cap_{j=1}^p\cap_{i=1}^n S_{ji}$. It follows that $\mathbb{P}(S_{ji})\ge 1-2M^{-3}$ for all $1\le i\le n$
and $1\le j\le p$, so that $\mathbb{P}(S)\ge 1-2M^{-1}$ as $M := n\vee p\vee N$. On the event $S$, we have
\[
\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i} - \mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr| \le \underbrace{\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(Y_{ji}Y_{\ell i} - \mathbb{E}[Y_{ji}Y_{\ell i}]\bigr)\Bigr|}_{T_1} + \underbrace{\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}] - \mathbb{E}[Y_{ji}Y_{\ell i}]\bigr)\Bigr|}_{T_2}.
\]
We first study $T_2$. By writing
\[
\mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}] = \mathbb{E}[Y_{ji}Y_{\ell i}] + \mathbb{E}\bigl[Y_{ji}\varepsilon_{\ell i}1_{S^c_{\ell i}}\bigr] + \mathbb{E}\bigl[\varepsilon_{ji}1_{S^c_{ji}}\varepsilon_{\ell i}\bigr],
\]
we have
\[
T_2 = \frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}] - \mathbb{E}[Y_{ji}Y_{\ell i}]\bigr)\Bigr|
\le \frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\mathbb{E}\bigl[Y_{ji}\varepsilon_{\ell i}1_{S^c_{\ell i}}\bigr] + \mathbb{E}\bigl[\varepsilon_{ji}1_{S^c_{ji}}\varepsilon_{\ell i}\bigr]\bigr)\Bigr|
\le \frac{1}{n}\sum_{i=1}^n\bigl(\mathbb{P}(S^c_{ji}) + \mathbb{P}(S^c_{\ell i})\bigr) \le 4M^{-3}, \tag{60}
\]
by using $|Y_{ji}|\le|\varepsilon_{ji}|\le 1$ in the second inequality.

Next we bound $T_1$. Since $|Y_{ji}|\le T_{ji}$, we know $-2T_{ji}T_{\ell i}\le Y_{ji}Y_{\ell i} - \mathbb{E}[Y_{ji}Y_{\ell i}]\le 2T_{ji}T_{\ell i}$ for all
$1\le i\le n$. Applying Hoeffding's inequality (Lemma 12) with $a_i = -2T_{ji}T_{\ell i}$ and $b_i = 2T_{ji}T_{\ell i}$ gives
\[
\mathbb{P}\Biggl\{\Bigl|\sum_{i=1}^n\bigl(Y_{ji}Y_{\ell i} - \mathbb{E}[Y_{ji}Y_{\ell i}]\bigr)\Bigr| \ge t\Biggr\} \le 2\exp\Bigl(-\frac{t^2}{8\sum_{i=1}^n T^2_{ji}T^2_{\ell i}}\Bigr).
\]
Taking $t = \bigl(24\sum_{i=1}^n T^2_{ji}T^2_{\ell i}\log M\bigr)^{1/2}$ yields
\[
T_1 = \frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(Y_{ji}Y_{\ell i} - \mathbb{E}[Y_{ji}Y_{\ell i}]\bigr)\Bigr| \le 2\sqrt{6}\Bigl(\frac{1}{n}\sum_{i=1}^n T^2_{ji}T^2_{\ell i}\cdot\frac{\log M}{n}\Bigr)^{1/2} \tag{61}
\]
with probability greater than $1-2M^{-3}$. Finally, note that
\[
\frac{1}{n}\sum_{i=1}^n T^2_{ji}T^2_{\ell i} = \frac{1}{n}\sum_{i=1}^n\Bigl(\Pi_{ji}\Pi_{\ell i}\frac{36(\log M)^2}{N^2} + \Bigl(\frac{2\log M}{N}\Bigr)^4 + (\Pi_{ji}+\Pi_{\ell i})\frac{24(\log M)^3}{N^3}\Bigr)
= 36\Theta_{j\ell}\Bigl(\frac{\log M}{N}\Bigr)^2 + 16\Bigl(\frac{\log M}{N}\Bigr)^4 + 24\frac{(\mu_j+\mu_\ell)(\log M)^3}{pN^3} \tag{62}
\]
by using (53) and $\Theta_{j\ell} = (n^{-1}\Pi\Pi^\top)_{j\ell}$ in the second equality. Finally, combining (60)–(62) and using
$\mu_{\min}/p \ge 2\log M/(3N)$ concludes the proof.
Remark 14. We illustrate the improvement of our result over a simple application of the Hanson-Wright
inequality. Write
\[
4\varepsilon_{ji}\varepsilon_{\ell i} = (\varepsilon_{ji}+\varepsilon_{\ell i})^2 - (\varepsilon_{ji}-\varepsilon_{\ell i})^2
\]
for each $i\in[n]$. Since $\varepsilon_{ji} = N^{-1}\sum_{k=1}^N Z^j_{ik}$ and $\|\varepsilon_{ji}\pm\varepsilon_{\ell i}\|_{\psi_2}\le 2/\sqrt N$, a direct application of the
Hanson-Wright inequality to the two terms on the right-hand side gives
\[
\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i} - \mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr| \le c\sqrt{\frac{\log M}{nN}}
\]
with high probability. Summing over $1\le j\le p$ and $1\le\ell\le p$ further yields
\[
\sum_{j,\ell=1}^p\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i} - \mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr| \le c\,p^2\sqrt{\frac{\log M}{nN}}. \tag{63}
\]
By contrast, summing the first term in Lemma 15 yields
\[
\sum_{j,\ell=1}^p\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i} - \mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr| \le c\sqrt{\frac{p^2(\log M)^3}{nN^2}}\sqrt{\sum_{1\le j,\ell\le p}\Theta_{j\ell}} + c\sqrt{\frac{p^3(\log M)^4}{nN^3}}
= c\sqrt{\frac{p^2(\log M)^3}{nN^2}} + c\sqrt{\frac{p^3(\log M)^4}{nN^3}},
\]
by using the Cauchy-Schwarz inequality in the first inequality and (52) in the last equality. This is
faster than the result in (63) by a factor of $(p\sqrt N)\wedge(N\sqrt p)$, after ignoring logarithmic terms.
C Proofs of Section 4
Throughout this section, we define the event $\mathcal{E} := \mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$ by
\[
\mathcal{E}_1 = \bigcap_{j=1}^p\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| \le 2\bigl(1+\sqrt{6/n}\bigr)\sqrt{\frac{\mu_j\log M}{npN}}\Biggr\},
\]
\[
\mathcal{E}_2 = \bigcap_{j,\ell=1}^p\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\Pi_{\ell i}\varepsilon_{ji}\Bigr| \le \sqrt{\frac{6m_\ell\Theta_{j\ell}\log M}{npN}} + \frac{2m_\ell\log M}{npN}\Biggr\},
\]
\[
\mathcal{E}_3 = \bigcap_{j,\ell=1}^p\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i} - \mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr| \le 12\sqrt{6}\sqrt{\Theta_{j\ell} + \frac{(\mu_j+\mu_\ell)\log M}{pN}}\sqrt{\frac{(\log M)^3}{nN^2}} + 4M^{-3}\Biggr\}.
\]
Recalling (53) and (54), if
\[
\min_{1\le j\le p}\frac{1}{n}\sum_{i=1}^n\Pi_{ji} \ge \frac{2\log M}{3N}
\]
holds, then
\[
\frac{1}{p} \ge \frac{\mu_{\min}}{p} = \min_{1\le j\le p}\frac{1}{n}\sum_{i=1}^n\Pi_{ji} \ge \frac{2\log M}{3N}. \tag{64}
\]
Therefore, invoking Lemmas 13–15 yields $\mathbb{P}(\mathcal{E})\ge 1-8M^{-1}$.
C.1 Preliminaries

From the model specifications (52) and (53), we first record some useful expressions which are repeatedly
invoked later.

(a) For any $j\in[p]$, by using (53),
\[
\mu_j = \frac{p}{n}\sum_{i=1}^n\Pi_{ji} = \frac{p}{n}\sum_{i=1}^n\sum_{k=1}^K A_{jk}W_{ki} = \frac{p}{K}\sum_{k=1}^K A_{jk}\gamma_k
\;\Rightarrow\; \frac{p}{K}\sum_{k=1}^K A_{jk}\cdot\min_{k\in[K]}\gamma_k \le \mu_j \le \alpha_j. \tag{65}
\]
In particular, for any $j\in I_k$ with any $k\in[K]$,
\[
\mu_j = \frac{p}{n}\sum_{i=1}^n\sum_{k=1}^K A_{jk}W_{ki} = \frac{p}{K}A_{jk}\gamma_k \overset{(53)}{=} \frac{\alpha_j\gamma_k}{K}. \tag{66}
\]

(b) For any $j\in[p]$,
\[
m_j \overset{(53)}{=} p\max_{1\le i\le n}\Pi_{ji} = p\max_{1\le i\le n}\sum_{k=1}^K A_{jk}W_{ki} \le p\max_{1\le k\le K}A_{jk} \overset{(53)}{=} \alpha_j
\;\Rightarrow\; \mu_j \le m_j \le \alpha_j, \tag{67}
\]
by using $0\le W_{ki}\le 1$ and $\sum_k W_{ki} = 1$ for any $i\in[n]$.
C.2 Control of $\hat\Theta - \Theta$ and $\hat R - R$

Proposition 16. Under model (3), assume (26). Let $\hat\Theta$ and $\hat R$ be defined in (20) and (21),
respectively. Then $\hat\Theta$ is an unbiased estimator of $\Theta$. Moreover, with probability greater than
$1-8M^{-1}$,
\[
|\hat\Theta_{j\ell} - \Theta_{j\ell}| \le \eta_{j\ell}, \qquad |\hat R_{j\ell} - R_{j\ell}| \le \delta_{j\ell}
\]
for all $1\le j,\ell\le p$, where
\[
\eta_{j\ell} := 3\sqrt{6}\Bigl(\sqrt{\frac{m_j}{p}} + \sqrt{\frac{m_\ell}{p}}\Bigr)\sqrt{\frac{\Theta_{j\ell}\log M}{nN}} + \frac{2(m_j+m_\ell)}{p}\,\frac{\log M}{nN} + 31(1+\kappa_1)\sqrt{\frac{\mu_j+\mu_\ell}{p}\,\frac{(\log M)^4}{nN^3}} + \kappa_2 \tag{68}
\]
and
\[
\delta_{j\ell} := (1+\kappa_1\kappa_3)\frac{p^2}{\mu_j\mu_\ell}\eta_{j\ell} + \kappa_3\frac{p^2\Theta_{j\ell}}{\mu_j\mu_\ell}\Bigl(\sqrt{\frac{p}{\mu_j}} + \sqrt{\frac{p}{\mu_\ell}}\Bigr)\sqrt{\frac{\log M}{nN}}, \tag{69}
\]
with $\kappa_1 = \sqrt{6/n}$, $\kappa_2 = 4/M^3$ and
\[
\kappa_3 = \frac{2(1+\kappa_1)}{(1-\kappa_1-\kappa_1^2)^2}.
\]
Remark 15. For ease of presentation, we assumed that the document lengths are equal, that is,
$N_i = N$ for all $i\in[n]$. By inspection of the proofs of Lemmas 13–15 and Proposition 16, we may
allow for unequal document lengths $N_i$ by adjusting the quantities $\eta_{j\ell}$ and $\delta_{j\ell}$ to
\[
\eta_{j\ell} := 3\sqrt{6}\bigl(\sqrt{m_j}+\sqrt{m_\ell}\bigr)\sqrt{\frac{\log M}{np}}\Bigl(\frac{1}{n}\sum_{i=1}^n\frac{\Pi_{ji}\Pi_{\ell i}}{N_i}\Bigr)^{1/2} + \frac{2(m_j+m_\ell)\log M}{np}\Bigl(\frac{1}{n}\sum_{i=1}^n\frac{1}{N_i}\Bigr)
+ 31(1+\kappa_1)\sqrt{\frac{(\log M)^4}{n}}\Bigl(\frac{1}{n}\sum_{i=1}^n\frac{\Pi_{ji}+\Pi_{\ell i}}{N_i^3}\Bigr)^{1/2} + \kappa_2 \tag{70}
\]
\[
\delta_{j\ell} := (1+\kappa_1\kappa_3)\frac{p^2\eta_{j\ell}}{\mu_j\mu_\ell} + \kappa_3\frac{p^3\Theta_{j\ell}}{\mu_j\mu_\ell}\sqrt{\frac{\log M}{n}}\Biggl[\Bigl(\frac{1}{n}\sum_{i=1}^n\frac{\Pi_{ji}}{\mu_j^2N_i}\Bigr)^{1/2} + \Bigl(\frac{1}{n}\sum_{i=1}^n\frac{\Pi_{\ell i}}{\mu_\ell^2N_i}\Bigr)^{1/2}\Biggr]. \tag{71}
\]
The quantities $m_j$ and $\mu_j$ appearing in the above rates are related to $\Pi$ and can be directly
estimated from $X$. Let
\[
\frac{\hat m_j}{p} = \max_{1\le i\le n}X_{ji}, \qquad \frac{\hat\mu_j}{p} = \frac{1}{n}\sum_{i=1}^n X_{ji}. \tag{72}
\]
The following corollary gives data-dependent bounds on $\hat\Theta - \Theta$ and $\hat R - R$.

Corollary 17. Under the same conditions as Proposition 16, with probability greater than
$1-8M^{-1}$, we have
\[
|\hat\Theta_{j\ell} - \Theta_{j\ell}| \le \hat\eta_{j\ell}, \qquad |\hat R_{j\ell} - R_{j\ell}| \le \hat\delta_{j\ell}, \qquad \text{for all } 1\le j,\ell\le p.
\]
The quantities $\hat\eta_{j\ell}$ have the same form as (70) and (71), except that $\Theta_{j\ell}$, $m_j/p$ and
$\mu_j/p$ are replaced by $\hat\Theta_{j\ell}+\kappa_5$, $\hat m_j/p+\kappa_4$ and $\hat\mu_j/p+\kappa_5$, respectively, with $\kappa_4 = O(\sqrt{\log M/N})$ and $\kappa_5 = O(\sqrt{\log M/(nN)})$. Similarly, $\hat\delta_{j\ell}$ is obtained in the same way by replacing $\eta_{j\ell}$, $(\mu_j/p)^{-1}$ and
$\Theta_{j\ell}$ by $\hat\eta_{j\ell}$, $(\hat\mu_j/p-\kappa_5)^{-1}$ and $\hat\Theta_{j\ell}+\kappa_5$, respectively.
Proof of Proposition 16. Throughout the proof, we work on the event $\mathcal{E}$. Write $X = (X_1,\ldots,X_n) \in \mathbb{R}^{p\times n}$, and similarly $\varepsilon = (\varepsilon_1,\ldots,\varepsilon_n)$ and $W = (W_1,\ldots,W_n)$. We first show that $\mathbb{E}[\hat\Theta] = \Theta$.
Recall that $X_i = AW_i + \varepsilon_i$ with
\[
\mathbb{E}X_i = AW_i, \qquad \mathrm{Cov}(X_i) = \frac{1}{N_i}\mathrm{diag}(AW_i) - \frac{1}{N_i}AW_iW_i^\top A^\top.
\]
This gives
\[
\mathbb{E}[\hat\Theta] = \frac{1}{n}\sum_{i=1}^n\Bigl[\frac{N_i}{N_i-1}\mathbb{E}[X_iX_i^\top] - \frac{1}{N_i-1}\mathrm{diag}(\mathbb{E}X_i)\Bigr] = \frac{1}{n}\sum_{i=1}^n AW_iW_i^\top A^\top = \Theta.
\]
Next we bound the entry-wise error of $\hat\Theta - \Theta$. Observe that
\[
\hat\Theta = \frac{1}{n}\sum_{i=1}^n\Bigl[\frac{N_i}{N_i-1}(AW_i+\varepsilon_i)(AW_i+\varepsilon_i)^\top - \frac{1}{N_i-1}\mathrm{diag}(AW_i+\varepsilon_i)\Bigr]
\]
\[
= \frac{1}{n}\sum_{i=1}^n\Bigl[\frac{N_i}{N_i-1}\bigl(AW_iW_i^\top A^\top + AW_i\varepsilon_i^\top + \varepsilon_i(AW_i)^\top + \varepsilon_i\varepsilon_i^\top\bigr) - \frac{1}{N_i-1}\mathrm{diag}(AW_i+\varepsilon_i)\Bigr]
\]
\[
= \frac{1}{n}\sum_{i=1}^n\Bigl[AW_iW_i^\top A^\top + \frac{N_i}{N_i-1}\bigl(AW_i\varepsilon_i^\top + \varepsilon_i(AW_i)^\top\bigr) - \frac{\mathrm{diag}(\varepsilon_i)}{N_i-1} + \frac{N_i}{N_i-1}\bigl(\varepsilon_i\varepsilon_i^\top - \mathbb{E}[\varepsilon_i\varepsilon_i^\top]\bigr)\Bigr].
\]
The third equality comes from the fact that
\[
\mathbb{E}[\varepsilon_i\varepsilon_i^\top] = \frac{1}{N_i}\mathrm{diag}(AW_i) - \frac{1}{N_i}AW_iW_i^\top A^\top.
\]
Recall that $\Theta = n^{-1}\sum_{i=1}^n AW_iW_i^\top A^\top$. We have
\[
\bigl|\hat\Theta_{j\ell} - \Theta_{j\ell}\bigr| \le \underbrace{\Bigl|\frac{1}{n}\sum_{i=1}^n\bigl(AW_i\varepsilon_i^\top + \varepsilon_i(AW_i)^\top\bigr)_{j\ell}\Bigr|}_{T_1}
+ \underbrace{\Bigl|\frac{1}{n}\sum_{i=1}^n\frac{1}{N_i}\bigl(\mathrm{diag}(\varepsilon_i)\bigr)_{j\ell}\Bigr|}_{T_2}
+ \underbrace{\Bigl|\frac{1}{n}\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i} - \mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr|}_{T_3}. \tag{73}
\]
It remains to bound $T_1$, $T_2$ and $T_3$. Fix $1\le j,\ell\le p$. For $T_1$, we have
\[
T_1 \le \frac{1}{n}\Bigl|\sum_{i=1}^n\Pi_{ji}\varepsilon_{\ell i}\Bigr| + \frac{1}{n}\Bigl|\sum_{i=1}^n\Pi_{\ell i}\varepsilon_{ji}\Bigr|
\overset{\mathcal{E}}{\le} (\sqrt{m_j}+\sqrt{m_\ell})\sqrt{\frac{6\Theta_{j\ell}\log M}{npN}} + \frac{2(m_j+m_\ell)\log M}{npN}. \tag{74}
\]
For $T_2$, we have
\[
T_2 = \frac{1}{nN}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| \overset{\mathcal{E}}{\le} 2(1+\kappa_1)\sqrt{\frac{\mu_j\log M}{npN^3}}, \tag{75}
\]
if $j = \ell$, while $(T_2)_{j\ell} = 0$ if $j\neq\ell$. Finally, for $T_3$, we obtain
\[
T_3 \le \frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i} - \mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr|
\overset{\mathcal{E}}{\le} 12\sqrt{6}\sqrt{\Theta_{j\ell} + \frac{(\mu_j+\mu_\ell)\log M}{pN}}\sqrt{\frac{(\log M)^3}{nN^2}} + \kappa_2
\le 12\sqrt{\frac{6\Theta_{j\ell}(\log M)^3}{nN^2}} + 12\sqrt{\frac{6(\mu_j+\mu_\ell)(\log M)^4}{npN^3}} + \kappa_2. \tag{76}
\]
Since (26) implies
\[
\frac{m_{\min}}{p} = \min_{1\le j\le p}\max_{1\le i\le n}\Pi_{ji} \ge \frac{(3\log M)^2}{N} \tag{77}
\]
by recalling (53), we have
\[
12\sqrt{\frac{6\Theta_{j\ell}(\log M)^3}{nN^2}} + (\sqrt{m_j}+\sqrt{m_\ell})\sqrt{\frac{6\Theta_{j\ell}\log M}{npN}} \le 3\sqrt{6}\,(\sqrt{m_j}+\sqrt{m_\ell})\sqrt{\frac{\Theta_{j\ell}\log M}{npN}}.
\]
In addition,
\[
2(1+\kappa_1)\sqrt{\frac{\mu_j\log M}{npN^3}}\,1_{\{j=\ell\}} + 12\sqrt{\frac{6(\mu_j+\mu_\ell)(\log M)^4}{npN^3}} \le 31(1+\kappa_1)\sqrt{\frac{(\mu_j+\mu_\ell)(\log M)^4}{npN^3}}.
\]
Combining (74)–(76) concludes the desired rate for $\hat\Theta - \Theta$.
To prove the rate of $\hat R - R$, recall that $R = (nD_\Pi^{-1})\Theta(nD_\Pi^{-1})$. Fix $1\le j,\ell\le p$. By using the
diagonal structure of $D_X$ and $D_\Pi$, it follows that
\[
\bigl[(nD_X^{-1})\hat\Theta(nD_X^{-1}) - R\bigr]_{j\ell}
= \underbrace{n^2\bigl(D_X^{-1}-D_\Pi^{-1}\bigr)_{jj}\hat\Theta_{j\ell}\bigl(D_X^{-1}\bigr)_{\ell\ell}}_{T_4}
+ \underbrace{n^2\bigl(D_\Pi^{-1}\bigr)_{jj}\hat\Theta_{j\ell}\bigl(D_X^{-1}-D_\Pi^{-1}\bigr)_{\ell\ell}}_{T_5}
+ \underbrace{n^2\bigl(D_\Pi^{-1}\bigr)_{jj}\bigl(\hat\Theta-\Theta\bigr)_{j\ell}\bigl(D_\Pi^{-1}\bigr)_{\ell\ell}}_{T_6}.
\]
We first quantify the term $n(D_X^{-1}-D_\Pi^{-1})$. From the definitions,
\[
n\bigl|\bigl(D_X^{-1}-D_\Pi^{-1}\bigr)_{jj}\bigr|
= n\Bigl|\frac{1}{\sum_{i=1}^n\Pi_{ji} + \sum_{i=1}^n\varepsilon_{ji}} - \frac{1}{\sum_{i=1}^n\Pi_{ji}}\Bigr|
\overset{(53)}{\le} \frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr|\Big/\Biggl(\frac{\mu_j}{p}\Bigl|\frac{\mu_j}{p}+\frac{1}{n}\sum_{i=1}^n\varepsilon_{ji}\Bigr|\Biggr)
\overset{\mathcal{E}}{\le} \frac{2(1+\kappa_1)}{1-\kappa_1(1+\kappa_1)}\cdot\frac{p}{\mu_j}\sqrt{\frac{p\log M}{\mu_jnN}}, \tag{78}
\]
where the last inequality uses
\[
\Bigl|\frac{\mu_j}{p}+\frac{1}{n}\sum_{i=1}^n\varepsilon_{ji}\Bigr|
\overset{\mathcal{E}}{\ge} \frac{\mu_j}{p} - 2(1+\kappa_1)\sqrt{\frac{\mu_j\log M}{npN}}
= \frac{\mu_j}{p}\Biggl(1-2(1+\kappa_1)\sqrt{\frac{p\log M}{\mu_jnN}}\Biggr)
\overset{(64)}{\ge} \frac{\mu_j}{p}\Biggl(1-2(1+\kappa_1)\sqrt{\frac{3}{2n}}\Biggr)
= \frac{\mu_j}{p}\bigl(1-\kappa_1(1+\kappa_1)\bigr),
\]
by recalling that $\kappa_1 = \sqrt{6/n}$. Since
\[
\bigl(D_\Pi^{-1}\bigr)_{jj} = \frac{1}{\sum_{i=1}^n(AW)_{ji}} = \frac{1}{\sum_{i=1}^n\Pi_{ji}} = \frac{p}{n\mu_j}, \tag{79}
\]
combined with (78), we find
\[
n\bigl|\bigl(D_X^{-1}\bigr)_{jj}\bigr| \le \frac{p}{\mu_j}\Biggl(1+\frac{2(1+\kappa_1)}{1-\kappa_1(1+\kappa_1)}\sqrt{\frac{p\log M}{\mu_jnN}}\Biggr)
\overset{(64)}{\le} \Bigl(1+\frac{\kappa_1(1+\kappa_1)}{1-\kappa_1(1+\kappa_1)}\Bigr)\frac{p}{\mu_j}
= \frac{1}{1-\kappa_1(1+\kappa_1)}\cdot\frac{p}{\mu_j}. \tag{80}
\]
Finally, since $|\hat\Theta_{j\ell}| \le \Theta_{j\ell} + |\hat\Theta_{j\ell}-\Theta_{j\ell}|$, combining (79) and (80) gives
\[
|T_4|+|T_5| \le \frac{2(1+\kappa_1)}{(1-\kappa_1(1+\kappa_1))^2}\cdot\frac{p^2}{\mu_j\mu_\ell}\Bigl(\sqrt{\frac{p}{\mu_j}}+\sqrt{\frac{p}{\mu_\ell}}\Bigr)\sqrt{\frac{\log M}{nN}}\bigl(\Theta_{j\ell}+|\hat\Theta_{j\ell}-\Theta_{j\ell}|\bigr),
\qquad
|T_6| = \frac{p^2}{\mu_j\mu_\ell}\,|\hat\Theta_{j\ell}-\Theta_{j\ell}|.
\]
Collecting these bounds for $T_4$, $T_5$ and $T_6$ and using (64) again yields
\[
|(\hat R-R)_{j\ell}| \le (1+\kappa_1\kappa_3)\frac{p^2}{\mu_j\mu_\ell}|\hat\Theta_{j\ell}-\Theta_{j\ell}| + \kappa_3\frac{p^2\Theta_{j\ell}}{\mu_j\mu_\ell}\Bigl(\sqrt{\frac{p}{\mu_j}}+\sqrt{\frac{p}{\mu_\ell}}\Bigr)\sqrt{\frac{\log M}{nN}},
\]
with
\[
\kappa_3 = \frac{2(1+\kappa_1)}{(1-\kappa_1-\kappa_1^2)^2}.
\]
This completes the proof of Proposition 16.
Proof of Corollary 17. It suffices to show that, on the event $\mathcal{E}$,
\[
|\hat\Theta_{j\ell}-\Theta_{j\ell}| = O\Bigl(\sqrt{\tfrac{\log M}{nN}}\Bigr), \qquad
\frac{|\hat m_j-m_j|}{p} = O\Bigl(\sqrt{\tfrac{\log M}{N}}\Bigr), \qquad
\frac{|\hat\mu_j-\mu_j|}{p} = O\Bigl(\sqrt{\tfrac{\log M}{nN}}\Bigr),
\]
for all $j,\ell\in[p]$. Recall the definitions in (53). Since
\[
\frac{\mu_j}{p} = \frac{1}{n}\sum_{i=1}^n\Pi_{ji} \le \max_{1\le i\le n}\Pi_{ji} = \frac{m_j}{p} \le 1, \qquad
\Theta_{j\ell} = \frac{1}{n}\sum_{i=1}^n\Pi_{ji}\Pi_{\ell i} \le 1, \tag{81}
\]
for any $j,\ell\in[p]$, display (68) implies
\[
\eta_{j\ell} \le 3\sqrt{\frac{6\log M}{nN}} + \frac{2\log M}{nN} + 31(1+\kappa_1)\sqrt{\frac{2(\log M)^4}{nN^3}} + \kappa_2 = O\bigl(\sqrt{\log M/(nN)}\bigr).
\]
In addition,
\[
\frac{|\hat\mu_j-\mu_j|}{p} = \frac{1}{n}\Bigl|\sum_{i=1}^n(X_{ji}-\Pi_{ji})\Bigr| = \frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr|
\overset{\mathcal{E}}{\le} 2(1+\kappa_1)\sqrt{\frac{\mu_j\log M}{npN}}
\overset{(81)}{\le} 2(1+\kappa_1)\sqrt{\frac{\log M}{nN}} = O\Bigl(\sqrt{\frac{\log M}{nN}}\Bigr).
\]
Finally, we show $|\hat m_j-m_j|/p = O(\sqrt{\log M/N})$. Fix any $j\in[p]$ and define
\[
i^* := \arg\max_{1\le i\le n}\Pi_{ji}, \qquad i' := \arg\max_{1\le i\le n}X_{ji}.
\]
Thus, from the definitions (53) and (72), we have
\[
\frac{|\hat m_j-m_j|}{p} = |X_{ji'}-\Pi_{ji^*}|
\le |X_{ji'}-\Pi_{ji'}| + \Pi_{ji^*}-\Pi_{ji'}
\le 2|X_{ji'}-\Pi_{ji'}| + |X_{ji^*}-\Pi_{ji^*}| + X_{ji^*}-X_{ji'}
\le 2|\varepsilon_{ji'}| + |\varepsilon_{ji^*}|,
\]
using the definition of $i'$ in the last step. From the proof of Lemma 15, we conclude that, on the event
$S_{ji}$ (which holds with probability at least $1-2M^{-3}$),
\[
\frac{|\hat m_j-m_j|}{p} \le 2|\varepsilon_{ji'}| + |\varepsilon_{ji^*}|
\le 3\sqrt{\frac{6\Pi_{ji^*}\log M}{N}} + \frac{6\log M}{N}
\overset{(81)}{=} O\Bigl(\sqrt{\frac{\log M}{N}}\Bigr).
\]
This completes the proof.
The following corollary provides the expressions of $\delta_{j\ell}$ and $\eta_{j\ell}$ under condition (27).

Corollary 18. Under model (3) and (27), with probability greater than $1-O(M^{-1})$,
\[
|\hat\Theta_{j\ell}-\Theta_{j\ell}| \le c_0\eta_{j\ell}, \qquad |\hat R_{j\ell}-R_{j\ell}| \le c_1\delta_{j\ell}, \qquad \text{for all } 1\le j,\ell\le p,
\]
for some constants $c_0, c_1 > 0$, where
\[
\eta_{j\ell} = \sqrt{\frac{\Theta_{j\ell}\log M}{nN}}\sqrt{\frac{m_j+m_\ell}{p}\vee\frac{\log^2M}{N}} + \frac{2(m_j+m_\ell)}{p}\frac{\log M}{nN} + \sqrt{\frac{\log^4M}{nN^3}}\sqrt{\frac{\mu_j+\mu_\ell}{p}\vee\frac{\log M}{N}} \tag{82}
\]
and
\[
\delta_{j\ell} := \frac{p^2\eta_{j\ell}}{\mu_j\mu_\ell} + \frac{p^2\Theta_{j\ell}}{\mu_j\mu_\ell}\Bigl(\sqrt{\frac{p}{\mu_j}} + \sqrt{\frac{p}{\mu_\ell}}\Bigr)\sqrt{\frac{\log M}{nN}}. \tag{83}
\]

Proof. We define the event $\mathcal{E}' := \mathcal{E}'_1\cap\mathcal{E}_2\cap\mathcal{E}'_3$ with
\[
\mathcal{E}'_1 = \bigcap_{j=1}^p\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| \lesssim \sqrt{\frac{\mu_j\log M}{npN}}\Biggr\},
\]
\[
\mathcal{E}_2 = \bigcap_{j,\ell=1}^p\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\Pi_{\ell i}\varepsilon_{ji}\Bigr| \le \sqrt{\frac{6m_\ell\Theta_{j\ell}\log M}{npN}} + \frac{2m_\ell\log M}{npN}\Biggr\},
\]
\[
\mathcal{E}'_3 = \bigcap_{j,\ell=1}^p\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i}-\mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr| \lesssim \sqrt{\frac{\Theta_{j\ell}\log^3M}{nN^2}} + \sqrt{\frac{\log^5M}{nN^4}} + \sqrt{\frac{(\mu_j+\mu_\ell)\log^4M}{npN^3}}\Biggr\}.
\]
From display (59) in the proof of Lemma 13 and condition (27), one has $\mathbb{P}(\mathcal{E}'_1)\ge 1-O(M^{-1})$.
Further invoking Lemma 14 and (60)–(62) in Lemma 15 yields $\mathbb{P}(\mathcal{E}')\ge 1-O(M^{-1})$.

We proceed to work on $\mathcal{E}'$. The bound on $|\hat\Theta_{j\ell}-\Theta_{j\ell}|$ requires upper bounds on $T_1$, $T_2$ and $T_3$
defined in (73). From (74)–(76) and by invoking $\mathcal{E}'$, after a bit of algebra, we have
\[
|\hat\Theta_{j\ell}-\Theta_{j\ell}| \lesssim \sqrt{\frac{\Theta_{j\ell}\log M}{nN}}\sqrt{\frac{\log^2M}{N}\vee\frac{m_j+m_\ell}{p}} + \frac{(m_j+m_\ell)\log M}{npN} + \sqrt{\frac{\log^4M}{nN^3}}\sqrt{\frac{\log M}{N}\vee\frac{\mu_j+\mu_\ell}{p}}.
\]
Finally, the bound on $|\hat R_{j\ell}-R_{j\ell}|$ follows from the same arguments as in the proof of Proposition 16,
invoking condition (27) instead of (64).
C.3 Proof of Theorem 4

We start by stating and proving the following lemma, which is crucial for the proof of Theorem 4. Recall that
\[
J_1^a := \{j \in [p] : A_{ja} \ge 1 - 4\delta/\nu\}, \quad \text{for all } a \in [K].
\]
Let
\[
\widehat a_i := \operatorname{argmax}_{1\le j\le p}\widehat R_{ij}, \quad \text{for any } i \in [p].
\]

Lemma 19. Under the conditions in Theorem 4, for any $i \in I_a$ with some $a \in [K]$, the following inequalities hold on the event $\mathcal{E}$:
\[
\big|\widehat R_{ij} - \widehat R_{ik}\big| \le \delta_{ij} + \delta_{ik}, \quad \text{for all } j, k \in I_a; \tag{84}
\]
\[
\widehat R_{ij} - \widehat R_{ik} > \delta_{ij} + \delta_{ik}, \quad \text{for all } j \in I_a,\ k \notin I_a\cup J_1^a; \tag{85}
\]
\[
\widehat R_{ij} - \widehat R_{ik} < \delta_{ij} + \delta_{ik}, \quad \text{for all } j \in J_1^a \text{ and } k \in I_a. \tag{86}
\]
For any $i \in J_1^a$, we have
\[
\widehat R_{i\widehat a_i} - \widehat R_{ij} \le \delta_{i\widehat a_i} + \delta_{ij}, \quad \text{for any } j \in I_a. \tag{87}
\]
Proof. We work on the event $\mathcal{E}$ so that, in particular, $|\widehat R_{j\ell} - R_{j\ell}| \le \delta_{j\ell}$ for all $1 \le j, \ell \le p$.

To prove (84), fix $i \in I_a$ and $j, k \in I_a$ with some $a \in [K]$. Since $R = ACA^\top$, we have $R_{ij} = R_{ik} = C_{aa}$ and
\[
|\widehat R_{ij} - \widehat R_{ik}| \le |\widehat R_{ij} - R_{ij}| + |\widehat R_{ik} - R_{ik}| \overset{\mathcal E}{\le} \delta_{ij} + \delta_{ik}.
\]
To prove (85), fix $i, j \in I_a$ and $k \in [p]\setminus I_a$. On the one hand,
\[
\widehat R_{ik} \overset{\mathcal E}{\le} \sum_{b=1}^KA_{kb}C_{ab} + \delta_{ik} \overset{(23)}{\le} A_{ka}C_{aa} + (1-A_{ka})(C_{aa}-\nu) + \delta_{ik} = C_{aa} - (1-A_{ka})\nu + \delta_{ik}. \tag{88}
\]
On the other hand, $i, j \in I_a$ implies $R_{ij} = C_{aa}$. Thus, on the event $\mathcal{E}$, (88) gives
\[
\widehat R_{ij} - \widehat R_{ik} \overset{\mathcal E}{\ge} (1-A_{ka})\nu - \delta_{ij} - \delta_{ik}.
\]
If $A_{ka} = 0$, using (24) and $\nu > 4\delta$ gives the desired result. If $A_{ka} > 0$, from the definition of $J_1^a$ we have $A_{ka} < 1 - 4\delta/\nu$, which finishes the proof by using (24) again.
To prove (86), observe that, for any $j \in J_1^a$ and $k \in I_a$,
\[
\widehat R_{ij} \overset{(88)}{\le} C_{aa} - (1-A_{ja})\nu + \delta_{ij} < C_{aa} + \delta_{ij} = R_{ik} + \delta_{ij} \overset{\mathcal E}{\le} \widehat R_{ik} + \delta_{ij} + \delta_{ik}.
\]
It remains to show (87). For any $i \in J_1^a$ and $j \in I_a$, we have, for some $c \in [K]$,
\[
\widehat R_{i\widehat a_i} \overset{\mathcal E}{\le} \max_{k\in[p]}R_{ik} + \delta_{i\widehat a_i} \overset{(*)}{\le} \sum_{b=1}^KA_{ib}C_{bc} + \delta_{i\widehat a_i} \overset{(**)}{\le} \sum_{b=1}^KA_{ib}C_{ba} + \delta_{i\widehat a_i} = R_{ij} + \delta_{i\widehat a_i} \overset{\mathcal E}{\le} \widehat R_{ij} + \delta_{ij} + \delta_{i\widehat a_i}.
\]
Inequality $(*)$ holds since
\[
\max_{k\in[p]}R_{ik} = \max_{k\in[p]}\sum_{b=1}^KA_{kb}\Bigg(\sum_{a=1}^KA_{ia}C_{ab}\Bigg) \le \max_{k\in[p]}\max_{b\in[K]}\sum_{a=1}^KA_{ia}C_{ab} = \sum_{a=1}^KA_{ia}C_{ac}.
\]
Inequality $(**)$ holds since, for any $c \ne a$, we have
\[
\sum_{b=1}^KA_{ib}C_{bc} \le A_{ia}C_{ac} + (1-A_{ia})C_{cc} \overset{(23)}{\le} A_{ia}(C_{aa}-\nu) + (1-A_{ia})C_{cc}
\]
and
\[
\sum_{b=1}^KA_{ib}C_{ab} \overset{(23)}{\ge} A_{ia}C_{aa},
\]
whence
\[
\sum_{b=1}^KA_{ib}C_{ab} - \sum_{b=1}^KA_{ib}C_{bc} \ge A_{ia}\nu - (1-A_{ia})C_{cc} > \nu - 2(1-A_{ia})C_{cc}.
\]
The term on the right is positive, since condition (25) guarantees that
\[
\nu \ge \frac{8\delta}{\nu}C_{cc} \ge 2(1-A_{ia})C_{cc},
\]
where the last inequality is due to the definition of $J_1$. This concludes the proof.

Lemma 19 remains valid under the conditions of Corollary 5, in which case $J_1 = \emptyset$ and we only need $\nu > 4\delta$ to prove (85).
Proof of Theorem 4. We work on the event $\mathcal{E}$ throughout the proof. Without loss of generality, we assume that the label permutation $\pi$ is the identity. We start by presenting three claims which are sufficient to prove the result. Let $\widehat I(i)$ be defined in step 5 of Algorithm 2, for any $i \in [p]$.

(1) For any $i \in J\setminus J_1$, we have $\widehat I(i) \notin \widehat{\mathcal I}$.

(2) For any $i \in I_a$ and $a \in [K]$, we have $i \in \widehat I(i)$, $I_a \subseteq \widehat I(i)$ and $\widehat I(i)\setminus I_a \subseteq J_1^a$.

(3) For any $i \in J_1^a$ and $a \in [K]$, we have $I_a \subseteq \widehat I(i)$.

If we can prove these claims, then (1) and the Merge step in Algorithm 2 guarantee that $\widehat{\mathcal I}\cap(J\setminus J_1) = \emptyset$, and thus enable us to focus on $i \in I\cup J_1$. For any $a \in [K]$, (2) implies that there exists $i \in I_a$ such that, with $\widehat I_a := \widehat I(i)$, $I_a \subseteq \widehat I_a$ and $\widehat I_a\setminus I_a \subseteq J_1^a$. Finally, (3) guarantees that no anchor word is excluded by any $i \in J_1$ in the Merge step. Thus, $\widehat K = K$ and $\widehat{\mathcal I} = \{\widehat I_1, \ldots, \widehat I_K\}$ is the desired partition. We therefore proceed to prove (1) – (3). Recall that $\widehat a_i := \operatorname{argmax}_{1\le j\le p}\widehat R_{ij}$ for all $1 \le i \le p$.

To prove (1), let $i \in J\setminus J_1$ be fixed. We first prove that $\widehat I(i) \notin \widehat{\mathcal I}$ when $\widehat I(i)\cap I \ne \emptyset$. From steps 8 – 10 of Algorithm 2, it suffices to show that there exists $j \in \widehat I(i)$ such that
\[
\big|\widehat R_{ij} - \widehat R_{jk}\big| \le \delta_{ij} + \delta_{jk} \tag{89}
\]
fails for every $k \in \widehat a_j$.
Suppose $\widehat I(i)\cap I \ne \emptyset$, so that there exists $j \in I_b\cap\widehat I(i)$ for some $b \in [K]$. For this $j$, we have $R_{ij} = \sum_aA_{ia}C_{ab}$ and
\[
\widehat R_{ij} \overset{(88)}{\le} C_{bb} - (1-A_{ib})\nu + \delta_{ij}. \tag{90}
\]
On the other hand, for any $k \in \widehat a_j$ and $k' \in I_b$, using the definition of $\widehat a_j$ gives
\[
\widehat R_{jk} \ge \widehat R_{jk'} \overset{\mathcal E}{\ge} R_{jk'} - \delta_{jk'} = C_{bb} - \delta_{jk'}. \tag{91}
\]
Combining (90) with (91) gives
\[
\widehat R_{jk} - \widehat R_{ij} \ge (1-A_{ib})\nu - \delta_{ij} - \delta_{jk'}.
\]
The definition of $J_1$ and (24) with $\nu > 4\delta$ give
\[
\widehat R_{jk} - \widehat R_{ij} > \delta_{jk} + \delta_{ij}.
\]
This shows that, for any $i \in J\setminus J_1$, if $\widehat I(i)\cap I \ne \emptyset$ then $\widehat I(i) \notin \widehat{\mathcal I}$. Therefore, to complete the proof of (1), we show that $\widehat I(i)\cap I = \emptyset$ is impossible when $i \in J\setminus J_1$. For fixed $i \in J\setminus J_1$ and $j \in \widehat a_i$, we have
\[
R_{ij} = \sum_b\sum_aA_{ia}A_{jb}C_{ab} \le \max_{1\le b\le K}\sum_aA_{ia}C_{ab} = \sum_aA_{ia}C_{ab^*} = R_{ik}
\]
for some $b^*$ and any $k \in I_{b^*}$. Therefore,
\[
\widehat R_{ij} - \widehat R_{ik} \overset{\mathcal E}{\le} R_{ij} - R_{ik} + \delta_{ij} + \delta_{ik} \le \delta_{ij} + \delta_{ik}.
\]
On the other hand, assume $\widehat I(i)\cap I = \emptyset$. Since $k \in I_{b^*}$, we know $k \notin \widehat I(i)$, which implies
\[
\widehat R_{ij} - \widehat R_{ik} > \delta_{ij} + \delta_{ik},
\]
from step 5 of Algorithm 2. The last two displays contradict each other, and we conclude that $\widehat I(i)\cap I \ne \emptyset$ for any $i \in J\setminus J_1$. This completes the proof of (1).
To prove (2), fix $i \in I_a$ with $a \in [K]$. From (85) in Lemma 19 and step 5 of Algorithm 2, any $j \in \widehat I(i)$ satisfies $j \in I_a\cup J_1^a$. Thus we write $\widehat I(i) = (\widehat I(i)\cap I_a)\cup(\widehat I(i)\cap J_1^a)$. For any $j \in \widehat I(i)\cap I_a$, by the same reasoning, $\widehat a_j$ belongs to either $I_a$ or $J_1^a$. In both cases, since $i, j \in I_a$ and $i \ne j$, (84) and (87) in Lemma 19 guarantee that (89) holds. On the other hand, for any $j \in \widehat I(i)\cap J_1^a$, (86) in Lemma 19 implies that (89) still holds. Thus we have shown that $i \in \widehat I(i)$ for any $i \in I_a$. To show $I_a \subseteq \widehat I(i)$, take any $j \in I_a$ and observe that $\widehat a_i$ can only belong to $I_a\cup J_1^a$. In both cases, (84) and (87) imply $j \in \widehat I(i)$. Thus $I_a \subseteq \widehat I(i)$. Finally, $\widehat I(i)\setminus I_a \subseteq J_1^a$ follows immediately from (85).

We conclude the proof by noting that (3) follows directly from (86).
D Proofs of Section 5
D.1 Proofs of Lower bounds in Section 5.1
Proof of Theorem 6. We first show the result for the matrix $\ell_1$ norm. Let
\[
A^{(0)} = \begin{bmatrix} e_1\otimes\mathbf{1}_{g_1} & e_2\otimes\mathbf{1}_{g_2} & \cdots & e_K\otimes\mathbf{1}_{g_K}\\ \mathbf{1}_{|J|} & \mathbf{1}_{|J|} & \cdots & \mathbf{1}_{|J|}\end{bmatrix}\times\operatorname{diag}\Bigg(\frac{1}{g_1+|J|}, \ldots, \frac{1}{g_K+|J|}\Bigg) \tag{92}
\]
with $g_k := |I_k|$ for any $1 \le k \le K$ and $|g_k - g_{k+1}| \le 1$. We use $e_k$ to denote the canonical basis vectors of $\mathbb{R}^K$, and $\mathbf{1}_d$ and $\otimes$ to denote, respectively, the $d$-dimensional vector with all entries equal to 1 and the Kronecker product. We start by constructing a set of "hypotheses" of $A$. Assume $g_k + |J|$ is even for $1 \le k \le K$. Let
\[
\mathcal{M} := \{0,1\}^{(|I|+K|J|)/2}.
\]
By the Varshamov–Gilbert bound (Lemma 2.9 in Tsybakov, 2009), there exist $w^{(j)} \in \mathcal{M}$ for $j = 0, 1, \ldots, T$ such that
\[
\big\|w^{(i)} - w^{(j)}\big\|_1 \ge \frac{|I|+K|J|}{16}, \quad \text{for any } 0 \le i \ne j \le T, \tag{93}
\]
with $w^{(0)} = 0$ and
\[
\log(T) \ge \frac{\log 2}{16}(|I|+K|J|). \tag{94}
\]
For each $w^{(j)} \in \mathbb{R}^{(|I|+K|J|)/2}$, we divide it into $K$ chunks as $w^{(j)} = (w^{(j)}_1, \ldots, w^{(j)}_K)$ with $w^{(j)}_k \in \mathbb{R}^{(g_k+|J|)/2}$. For each $w^{(j)}_k$, we write $\widetilde w^{(j)}_k \in \mathbb{R}^p$ for its augmented counterpart, such that $\widetilde w^{(j)}_k(S_k) = [w^{(j)}_k, -w^{(j)}_k]$ and $\widetilde w^{(j)}_k(\ell) = 0$ for any $\ell \notin S_k$, where $S_k := \operatorname{supp}(A^{(0)}_k)$. For $1 \le j \le T$, we choose $A^{(j)}$ as
\[
A^{(j)} = A^{(0)} + \gamma\big[\widetilde w^{(j)}_1 \ \cdots\ \widetilde w^{(j)}_K\big] \tag{95}
\]
with
\[
\gamma = \sqrt{\frac{\log 2}{16^2c^2(1+c_0)}}\sqrt{\frac{K^2}{nN(|I|+K|J|)}} \tag{96}
\]
for some constants $c_0, c > 0$. Under $|I|+K|J| \le c(nN)$, it is easy to verify that $A^{(j)} \in \mathcal{A}(K, |I|, |J|)$ for all $0 \le j \le T$.
In order to apply Theorem 2.5 in Tsybakov (2009), we need to check the following conditions:

(a) $\mathrm{KL}(\mathbb{P}_{A^{(j)}}, \mathbb{P}_{A^{(0)}}) \le \log(T)/16$, for each $j = 1, \ldots, T$.

(b) $L_1(A^{(i)}, A^{(j)}) \ge c_1K\sqrt{(|I|+K|J|)/(nN)}$, for $0 \le i < j \le T$ and some positive constant $c_1$.

(c) $L_1(\cdot\,,\cdot)$ satisfies the triangle inequality.

The expression of the Kullback–Leibler divergence between two multinomial distributions is given in (Ke and Wang, 2017, Lemma 6.7). For completeness, we include it here.

Lemma 20 (Lemma 6.7 of Ke and Wang (2017)). Let $D$ and $D'$ be two $p\times n$ matrices such that each of their columns is a weight vector. Under model (3), let $\mathbb{P}$ and $\mathbb{P}'$ be the probability measures associated with $D$ and $D'$, respectively. Suppose $D$ is a positive matrix. Let
\[
\eta = \max_{1\le j\le p,\,1\le i\le n}\frac{|D'_{ji} - D_{ji}|}{D_{ji}}
\]
and assume $\eta < 1$. There exists a universal constant $c_0 > 0$ such that
\[
\mathrm{KL}(\mathbb{P}', \mathbb{P}) \le (1+c_0\eta)N\sum_{i=1}^n\sum_{j=1}^p\frac{|D'_{ji} - D_{ji}|^2}{D_{ji}}.
\]
Fix $1 \le j \le T$ and choose $D^{(j)} = A^{(j)}W$ with
\[
W = W^0 + \frac{1}{nN}\mathbf{1}_K\mathbf{1}_K^\top - \frac{K}{nN}I_K \in \mathbb{R}_+^{K\times n},
\]
where $W^0$ is defined in (35). The above choice of $W$ perturbs $W^0$ to avoid $D_{ji} = 0$ for any $j \in I$ and $i \in [n]$, in the presence of anchor words. Since there exists some large enough constant $c > 0$ such that $g_k \le c|I|/K$, it then follows that
\[
D^{(0)}_{\ell i} = \sum_{k=1}^KA^{(0)}_{\ell k}W_{ki} = \begin{cases}W_{ki}/(g_k+|J|) \ge cW_{ki}/(|I|/K+|J|), & \text{if }\ell\in I_k,\ k\in[K],\\[2pt] \sum_kW_{ki}/(g_k+|J|) \ge c/(|I|/K+|J|), & \text{if }\ell\in J,\end{cases} \tag{97}
\]
for any $1 \le i \le n$, where we also use $\sum_kW_{ki} = 1$. Similarly, we have
\[
\big|D^{(j)}_{\ell i} - D^{(0)}_{\ell i}\big| = \gamma\Bigg|\sum_{k=1}^K\widetilde w^{(j)}_k(\ell)W_{ki}\Bigg| \le \begin{cases}W_{ki}\gamma, & \text{if }\ell\in I_k,\ k\in[K],\\ \gamma, & \text{if }\ell\in J,\end{cases} \tag{98}
\]
for all $1 \le i \le n$, where $\widetilde w^{(j)}_k(\ell)$ denotes the $\ell$th element of $\widetilde w^{(j)}_k$. Thus, by choosing a proper $c_0$ in (96) and using $|I|+K|J| \le c(nN)$, we have
\[
\max_{1\le\ell\le p,\,1\le i\le n}\frac{|D^{(j)}_{\ell i} - D^{(0)}_{\ell i}|}{D^{(0)}_{\ell i}} \le c\gamma(|I|/K+|J|) < 1,
\]
for any $1 \le j \le T$. Invoking Lemma 20 gives
\[
\mathrm{KL}(\mathbb{P}_{A^{(j)}}, \mathbb{P}_{A^{(0)}}) \le (1+c_0)N\sum_{i=1}^n\sum_{\ell=1}^p\frac{|D^{(j)}_{\ell i} - D^{(0)}_{\ell i}|^2}{D^{(0)}_{\ell i}} \le c(1+c_0)N\Bigg(\frac{|I|}{K}+|J|\Bigg)\sum_{i=1}^n\Bigg(\gamma^2|J| + \sum_{k=1}^K\gamma^2g_kW_{ki}\Bigg) \le c(1+c_0)nN(|I|+K|J|)^2\gamma^2/K^2 \overset{(94)}{\le} \frac{1}{16}\log T,
\]
where the second inequality uses (97) and (98), and the third uses $g_k \le c|I|/K$ and $\sum_kW_{ki} = 1$. This verifies (a).
To show (b), from (95) we obtain
\[
L_1(A^{(j)}, A^{(\ell)}) = \sum_{k=1}^K\big\|A^{(j)}_{\cdot k} - A^{(\ell)}_{\cdot k}\big\|_1 = 2\gamma\sum_{k=1}^K\big\|w^{(j)}_k - w^{(\ell)}_k\big\|_1 = 2\gamma\big\|w^{(j)} - w^{(\ell)}\big\|_1 \overset{(93)}{\ge} \frac{\gamma}{8}(|I|+K|J|).
\]
Plugging in the expression of $\gamma$ verifies (b). Since (c) is already verified in (Ke and Wang, 2017, page 31), invoking (Tsybakov, 2009, Theorem 2.5) concludes the proof for even $g_k+|J|$. The complementary case follows with slight modifications. Specifically, denote $S_{\mathrm{odd}} := \{1 \le k \le K : g_k+|J| \text{ is odd}\}$. Then we change $\mathcal{M} := \{0,1\}^{\mathrm{Card}}$ with
\[
\mathrm{Card} = \sum_{k\in S_{\mathrm{odd}}}\frac{g_k+|J|-1}{2} + \sum_{k\in S_{\mathrm{odd}}^c}\frac{g_k+|J|}{2}.
\]
For each $w^{(j)}$, we write it as $w^{(j)} = (w^{(j)}_1, \ldots, w^{(j)}_K)$, where each $w^{(j)}_k$ has length $(g_k+|J|-1)/2$ if $k \in S_{\mathrm{odd}}$ and $(g_k+|J|)/2$ otherwise. We then construct $A^{(j)}_k = A^{(0)}_k + \gamma\widetilde w^{(j)}_k$, where $\widetilde w^{(j)}_k \in \mathbb{R}^p$ is the same augmented counterpart of $w^{(j)}_k$. The result follows from the same arguments.
To conclude the proof, we show the lower bound for $\|\widehat A - A\|_{1,\infty}$. Let $k^* := \operatorname{argmax}_kg_k$. Then we can repeat the above arguments, changing only the $k^*$th column of $A^{(0)}$. Specifically, assume $|J|+g_{k^*}$ is even and draw $w^{(j)}$ from $\mathcal{M} = \{0,1\}^{(|J|+g_{k^*})/2}$ for $1 \le j \le T$ with
\[
\log(T) \ge \log(2)(|J|+g_{k^*})/16.
\]
Let $\widetilde w^{(j)}$ be its augmented counterpart and choose $A^{(j)}_{k^*} = A^{(0)}_{k^*} + \gamma\widetilde w^{(j)}$ with
\[
\gamma = \sqrt{\frac{\log 2}{16^2c^2(1+c_0)}}\sqrt{\frac{K}{nN(g_{k^*}+|J|)}}
\]
and $\|w^{(j)} - w^{(\ell)}\|_1 \ge (g_{k^*}+|J|)/16$ for any $0 \le j \ne \ell \le T$. Then we need to check:

(a') $\mathrm{KL}(\mathbb{P}_{A^{(j)}}, \mathbb{P}_{A^{(0)}}) \le \log(T)/16$, for each $j = 1, \ldots, T$.

(b') $L_{1,\infty}(A^{(i)}, A^{(j)}) \ge c_1\sqrt{K(g_{k^*}+|J|)/(nN)}$, for $0 \le i < j \le T$ and some positive constant $c_1$.

(c') $L_{1,\infty}(\cdot\,,\cdot)$ satisfies the triangle inequality.

Differently from (97) and (98), we have, for all $1 \le i \le n$,
\[
D^{(0)}_{\ell i} = \sum_{k=1}^KA^{(0)}_{\ell k}W_{ki} = \begin{cases}W_{ki}/(g_k+|J|) \ge W_{ki}/(g_{k^*}+|J|), & \text{if }\ell\in I_k,\ k\in[K],\\[2pt] \sum_kW_{ki}/(g_k+|J|) \ge 1/(g_{k^*}+|J|), & \text{if }\ell\in J,\end{cases} \tag{99}
\]
and
\[
\big|D^{(j)}_{\ell i} - D^{(0)}_{\ell i}\big| = \begin{cases}\gamma W_{k^*i}\,\widetilde w^{(j)}(\ell) \le \gamma W_{k^*i}, & \text{if }\ell\in I_{k^*}\cup J,\\ 0, & \text{otherwise}.\end{cases} \tag{100}
\]
Using the same arguments, one can derive
\[
\mathrm{KL}(\mathbb{P}_{A^{(j)}}, \mathbb{P}_{A^{(0)}}) \le (1+c_0)N\gamma^2(g_{k^*}+|J|)\sum_{i=1}^n\big(W^2_{k^*i}|J| + W_{k^*i}g_{k^*}\big).
\]
Since there exists some constant $c > 0$ such that
\[
\sum_{i=1}^n\big(W^2_{k^*i}|J| + W_{k^*i}g_{k^*}\big) = \sum_{a=1}^n\sum_{i\in n_a}\big(W^2_{k^*i}|J| + W_{k^*i}g_{k^*}\big) \le c(|J|+g_{k^*})n/K,
\]
we conclude that
\[
\mathrm{KL}(\mathbb{P}_{A^{(j)}}, \mathbb{P}_{A^{(0)}}) \le c(1+c_0)\gamma^2Nn(g_{k^*}+|J|)^2/K \le \log(T)/16.
\]
To show (b'), observe that
\[
L_{1,\infty}(A^{(j)}, A^{(\ell)}) = 2\gamma\max_{1\le k\le K}\big\|w^{(j)}_k - w^{(\ell)}_k\big\|_1 \ge \frac{\gamma(g_{k^*}+|J|)}{8}.
\]
Finally, we verify (c') by showing that $L_{1,\infty}(\cdot\,,\cdot)$ satisfies the triangle inequality. Consider a triplet $(A, \widetilde A, \widehat A)$ and observe that
\[
L_{1,\infty}(A, \widehat A) = \min_{P\in\mathcal{H}_K}\|AP - \widehat A\|_{1,\infty} = \min_{P,Q\in\mathcal{H}_K}\|AP - \widehat AQ\|_{1,\infty} \le \min_{P,Q\in\mathcal{H}_K}\big(\|AP - \widetilde A\|_{1,\infty} + \|\widetilde A - \widehat AQ\|_{1,\infty}\big) = \min_{P\in\mathcal{H}_K}\|AP - \widetilde A\|_{1,\infty} + \min_{Q\in\mathcal{H}_K}\|\widetilde A - \widehat AQ\|_{1,\infty} = L_{1,\infty}(A, \widetilde A) + L_{1,\infty}(\widetilde A, \widehat A).
\]
The proof is complete.
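The triangle inequality verified above can also be checked numerically. Below is a minimal sketch (ours, not part of the paper's code), assuming $\mathcal{H}_K$ is the set of $K\times K$ column permutations; the helper `l1_inf` is a hypothetical brute-force evaluation of $L_{1,\infty}$.

```python
import itertools
import numpy as np

def l1_inf(A, B):
    """Brute-force L_{1,inf}(A, B): minimum over column permutations P of
    the maximum column-wise l1 norm of AP - B."""
    K = A.shape[1]
    return min(
        np.abs(A[:, list(perm)] - B).sum(axis=0).max()
        for perm in itertools.permutations(range(K))
    )

rng = np.random.default_rng(0)
p, K = 6, 3
for _ in range(100):
    A, A1, A2 = rng.random((3, p, K))
    # triangle inequality: L(A, A2) <= L(A, A1) + L(A1, A2)
    assert l1_inf(A, A2) <= l1_inf(A, A1) + l1_inf(A1, A2) + 1e-12
print("triangle inequality holds on 100 random triples")
```

The brute force over all $K!$ permutations is only feasible for small $K$; it serves purely as a check of the loss's metric-like behavior.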
D.2 Proofs of upper bounds in Section 5.3

We work on the event $\mathcal{E}$. Recall that under (26) we have $\mathbb{P}(\mathcal{E}) \ge 1 - M^{-3}$. From Theorem 4, we have $\widehat K = K$ and $\widehat I = I$. Without loss of generality, we assume that the label permutation aligning the estimated topics with the true ones is the identity ($\widehat I_k = I_k$). In particular, this implies that any chosen $\widehat L \subseteq \widehat I$ has the correct partition, and we have $\widehat L = L$. We first give a lemma that is crucial for the proof of Theorem 7. From (68), recall that
\[
\eta_{j\ell} \asymp \big(\sqrt{m_j}+\sqrt{m_\ell}\big)\sqrt{\frac{\Theta_{j\ell}\log M}{npN}} + \frac{(m_j+m_\ell)\log M}{npN} + \sqrt{\frac{(\mu_j+\mu_\ell)(\log M)^4}{npN^3}}, \tag{101}
\]
for all $j, \ell \in [p]$.
Lemma 21. Under the conditions of Theorem 7, we have
\[
\max_{i\in L}\sum_{j\in L}\eta_{ij} \lesssim \sqrt{\alpha_I^3\gamma}\sqrt{\frac{\log M}{np^3N}}, \tag{102}
\]
\[
\max_{i\in L}\sum_{j\in J}\eta_{ij} \lesssim \sqrt{\alpha_I\gamma S_J}\sqrt{\frac{\log M}{Knp^2N}}, \tag{103}
\]
with $S_J = |J|\bar\alpha_I + \sum_{j\in J}\alpha_j$, for any $i \in I_k$ and $k \in [K]$. In addition, if $\sum_{k'\ne k}\sqrt{C_{kk'}} = o(\sqrt{C_{kk}})$ for any $1 \le k \le K$, then
\[
\max_{i\in L}\sum_{j\in L}\eta_{ij} \lesssim \sqrt{\alpha_I^3\gamma}\sqrt{\frac{\log M}{Knp^3N}}. \tag{104}
\]
Proof. We first simplify the expression of $\eta$ defined in (68). To show (102), observe that, for any $i \in I_a$, $j \in I_b$ and $a, b \in [K]$, we have
\[
\Theta_{ij} = \frac{1}{n}\sum_{\ell=1}^n\Pi_{i\ell}\Pi_{j\ell} = A_{ia}A_{jb}\frac{1}{n}\sum_{\ell=1}^nW_{a\ell}W_{b\ell} \overset{(53)}{=} \frac{\alpha_i\alpha_j}{p^2}\cdot\frac{1}{n}\langle W_{a\cdot}, W_{b\cdot}\rangle.
\]
As a result, plugging (65) – (67) and the display above into (68) yields
\[
\eta_{ij} \asymp \sqrt{\frac{1}{n}\langle W_{a\cdot}, W_{b\cdot}\rangle}\sqrt{\frac{\alpha_I^3\log M}{np^3N}} + \frac{\alpha_I\log M}{npN} + \sqrt{\frac{\alpha_I\gamma(\log M)^4}{KnpN^3}}. \tag{105}
\]
Since
\[
\sum_{b=1}^K\frac{1}{n}\langle W_{a\cdot}, W_{b\cdot}\rangle = \frac{1}{n}\sum_{\ell=1}^nW_{a\ell}\sum_{b=1}^KW_{b\ell} \overset{(52)}{=} \frac{1}{n}\sum_{\ell=1}^nW_{a\ell} \overset{(53)}{=} \frac{\gamma_a}{K},
\]
summing (105) over $j \in L$, and using the Cauchy–Schwarz inequality to obtain the first term, gives
\[
\max_{i\in L}\sum_{j\in L}\eta_{ij} \lesssim \sqrt{\frac{\alpha_I\log M}{npN}}\Bigg(\sqrt{\frac{\alpha_I^2\gamma}{p^2}} + \sqrt{\frac{\alpha_IK^2\log M}{npN}} + \sqrt{\frac{K\gamma(\log M)^3}{N^2}}\Bigg). \tag{106}
\]
Note that Assumption 2 implies $n \ge K$. Since (26) and (66), together with the fact that $\bar\gamma \ge 1 \ge \underline\gamma$, give
\[
\frac{\alpha_I\bar\gamma}{K} \ge \frac{\alpha_I}{K} \ge \frac{\alpha_I\underline\gamma}{K} \ge \min_{i\in I}\mu_i \ge \frac{2p\log M}{3N}, \tag{107}
\]
these two bounds imply that the first term on the right of (106) dominates the second. Moreover, (67), (77) and (107) imply
\[
\frac{\alpha_I^2}{Kp^2} \ge \frac{2\log M}{3N}\frac{\alpha_I}{p} \ge \frac{2\log M}{3N}\frac{m_{\min}}{p} \ge \frac{6(\log M)^3}{N^2}.
\]
Thus, the first term on the right of (106) also dominates the third. This finishes the proof of (102).
We proceed to show (103). For any $i \in I_a$ and $j \in [p]$, observe that
\[
\Theta_{ij} = \frac{1}{n}\sum_{\ell=1}^n\Pi_{i\ell}\Pi_{j\ell} = A_{ia}\frac{1}{n}\sum_{\ell=1}^nW_{a\ell}\Pi_{j\ell} \overset{(53)}{=} \frac{\alpha_i}{p}\cdot\frac{1}{n}\langle W_{a\cdot}, \Pi_{j\cdot}\rangle.
\]
Plugging (65) – (67) and the display above into (68) yields
\[
\eta_{ij} \lesssim \sqrt{\frac{1}{n}\langle W_{a\cdot}, \Pi_{j\cdot}\rangle}\sqrt{\frac{\alpha_i(\alpha_i+\alpha_j)\log M}{np^2N}} + \frac{(\alpha_i+\alpha_j)\log M}{npN} + \sqrt{\frac{(\alpha_i+\alpha_j)(\log M)^4}{npN^3}}. \tag{108}
\]
Since $\sum_kW_{k\ell} = 1$ and (53) yield
\[
\sum_{j\in J}\frac{1}{n}\langle W_{a\cdot}, \Pi_{j\cdot}\rangle = \sum_{k=1}^K\frac{1}{n}\sum_{\ell=1}^nW_{a\ell}W_{k\ell}\sum_{j\in J}A_{jk} \le \frac{\gamma_a}{K}\|A_J\|_{1,\infty} \le \frac{\gamma_a}{K},
\]
summing (108) over $j \in J$ and using Cauchy–Schwarz yield
\[
\max_{i\in L}\sum_{j\in J}\eta_{ij} \lesssim \sqrt{\frac{S_J\log M}{npN}}\Bigg(\sqrt{\frac{\alpha_I\gamma}{Kp}} + \sqrt{\frac{S_J\log M}{npN}} + \sqrt{\frac{|J|(\log M)^3}{N^2}}\Bigg).
\]
To conclude the proof of (103), it suffices to show that the first term in the parentheses dominates the other two. This is true by noting that
\[
\frac{\alpha_I\gamma N^2}{K|J|p(\log M)^3} \ge \frac{\mu_{\min}N^2}{|J|p(\log M)^3} \gtrsim 1
\]
and
\[
\frac{\gamma nN}{K|J|\log M} \gtrsim 1, \qquad \frac{\alpha_I\gamma nN}{K\sum_{j\in J}\alpha_j\log M} \gtrsim 1,
\]
from the same arguments used to prove (102), together with $|J| \le p$ and $\sum_{j\in J}\alpha_j \le Kp$.

Finally, to show (104), invoking the additional condition and using (105) yield
\[
\max_{i\in L}\sum_{j\in L}\eta_{ij} \lesssim \max_{a\in[K]}\sqrt{C_{aa}}\sqrt{\frac{\alpha_I^3\log M}{np^3N}} + \frac{K\alpha_I\log M}{npN} + \sqrt{\frac{K\alpha_I\gamma(\log M)^4}{npN^3}} \le \sqrt{\frac{\alpha_I^3\gamma\log M}{Knp^3N}} + \frac{K\alpha_I\log M}{npN} + \sqrt{\frac{K\alpha_I\gamma(\log M)^4}{npN^3}},
\]
where we use $C_{aa} \le n^{-1}\|W_{a\cdot}\|_1 \lesssim \gamma/K$ to derive the second inequality. Repeating the same arguments, one can show that the first term on the right-hand side dominates the second and third. This finishes the proof.
D.2.1 Proof of Theorem 7

On the event $\mathcal{E}$, Theorem 4 guarantees that $\widehat I = I$, $\widehat J = J$ and $\widehat L = L$. We work on this event for the remainder of the proof. Our proof consists of three parts, for any $k \in [K]$: bounding (1) $\|\widehat B_{I_k} - B_{I_k}\|_1$, (2) $\|\widehat B_{Jk} - B_{Jk}\|_1$ and (3) $\|\widehat A_{\cdot k} - A_{\cdot k}\|_1$.

For step (1), recall (19) and (38). Then, for any $1 \le k \le K$, it follows that
\[
\|\widehat B_{I_k} - B_{I_k}\|_1 = \sum_{i\in I_k}\Bigg|\frac{\|X_{i\cdot}\|_1}{\|X_{i_k\cdot}\|_1} - \frac{\|\Pi_{i\cdot}\|_1}{\|\Pi_{i_k\cdot}\|_1}\Bigg|
\]
and we obtain
\[
\|\widehat B_{I_k} - B_{I_k}\|_1 = \frac{1}{\|X_{i_k\cdot}\|_1\|\Pi_{i_k\cdot}\|_1}\sum_{i\in I_k}\Big|\|\Pi_{i_k\cdot}\|_1\|X_{i\cdot}\|_1 - \|X_{i_k\cdot}\|_1\|\Pi_{i\cdot}\|_1\Big| \le \frac{1}{\|X_{i_k\cdot}\|_1\|\Pi_{i_k\cdot}\|_1}\Bigg[\|\Pi_{i_k\cdot}\|_1\sum_{i\in I_k}\Big|\|X_{i\cdot}\|_1 - \|\Pi_{i\cdot}\|_1\Big| + \Big|\|X_{i_k\cdot}\|_1 - \|\Pi_{i_k\cdot}\|_1\Big|\sum_{i\in I_k}\|\Pi_{i\cdot}\|_1\Bigg] = \frac{1}{\|X_{i_k\cdot}\|_1}\sum_{i\in I_k}\frac{1}{n}\Bigg|\sum_{j=1}^n\varepsilon_{ij}\Bigg| + \frac{\sum_{i\in I_k}\|\Pi_{i\cdot}\|_1}{\|X_{i_k\cdot}\|_1\|\Pi_{i_k\cdot}\|_1}\frac{1}{n}\Bigg|\sum_{j=1}^n\varepsilon_{i_kj}\Bigg|.
\]
In the last step we used that $X_{i\cdot}$ and $\Pi_{i\cdot}$ have nonnegative entries only, so that indeed
\[
\Big|\|X_{i\cdot}\|_1 - \|\Pi_{i\cdot}\|_1\Big| = \frac{1}{n}\Bigg|\sum_{j=1}^n|X_{ij}| - \sum_{j=1}^n|\Pi_{ij}|\Bigg| = \frac{1}{n}\Bigg|\sum_{j=1}^nX_{ij} - \sum_{j=1}^n\Pi_{ij}\Bigg| = \frac{1}{n}\Bigg|\sum_{j=1}^n\varepsilon_{ij}\Bigg|.
\]
Invoking Lemma 13, $n^{-1}\big|\sum_{j=1}^n\varepsilon_{ij}\big| \lesssim \sqrt{\mu_ip\log M/(nN)}$ on the event $\mathcal{E}_1 \supset \mathcal{E}$. Note further that, for any $i \in [p]$,
\[
\frac{1}{\|X_{i\cdot}\|_1} = (D_X)^{-1}_{ii} \overset{(80)}{\lesssim} \frac{p}{n\mu_i}
\]
and $\|\Pi_{i\cdot}\|_1 = n\mu_i/p$ by the definition in (53). Now deduce that
\[
\|\widehat B_{I_k} - B_{I_k}\|_1 \lesssim \frac{1}{\mu_{i_k}}\sum_{i\in I_k}\sqrt{\frac{\mu_ip\log M}{nN}} + \frac{\sum_{i\in I_k}\mu_i}{\mu_{i_k}^2}\sqrt{\frac{\mu_{i_k}p\log M}{nN}}.
\]
Recall that $\mu_i = \alpha_i\gamma_k/K$ for any $i \in I_k$, from (66). We further have
\[
\frac{1}{\mu_{i_k}}\sum_{i\in I_k}\sqrt{\frac{\mu_ip\log M}{nN}} + \frac{\sum_{i\in I_k}\mu_i}{\mu_{i_k}^2}\sqrt{\frac{\mu_{i_k}p\log M}{nN}} = \frac{1}{\alpha_{i_k}}\sqrt{\frac{pK\log M}{\gamma_knN}}\Bigg(\sum_{i\in I_k}\sqrt{\alpha_i} + \frac{\sum_{i\in I_k}\alpha_i}{\sqrt{\alpha_{i_k}}}\Bigg)
\]
and we conclude that, on the event $\mathcal{E}$,
\[
\|\widehat B_{I_k} - B_{I_k}\|_1 \lesssim \frac{\sum_{i\in I_k}\alpha_i}{\alpha_{i_k}\sqrt{\alpha_I\gamma_k}}\sqrt{\frac{pK\log M}{nN}}. \tag{109}
\]
We proceed to show step (2). Recall that $\widehat\Omega = (\widehat\omega_1, \ldots, \widehat\omega_K)$ is the optimal solution of (41), $\widehat B_J = (\widehat\Theta_{JL}\widehat\Omega)_+$ and $B_J = \Theta_{JL}\Omega$. Fix any $1 \le k \le K$ and denote the canonical basis vectors of $\mathbb{R}^K$ by $e_1, \ldots, e_K$. Since $B$ has only nonnegative entries, we have
\[
\|\widehat B_{Jk} - B_{Jk}\|_1 = \big\|(\widehat\Theta_{JL}\widehat\Omega_{\cdot k})_+ - \Theta_{JL}\Omega_{\cdot k}\big\|_1 \le \big\|\widehat\Theta_{JL}\widehat\Omega_{\cdot k} - \Theta_{JL}\Omega_{\cdot k}\big\|_1 \le \big\|(\widehat\Theta_{JL} - \Theta_{JL})\widehat\Omega_{\cdot k}\big\|_1 + \big\|\Theta_{JL}\widehat\Omega_{\cdot k} - B_{Jk}\big\|_1 \le \|\widehat\Omega_{\cdot k}\|_1\|\widehat\Theta_{JL} - \Theta_{JL}\|_{1,\infty} + \big\|B_J\Theta_{LL}\widehat\Omega_{\cdot k} - B_Je_k\big\|_1 \le \|\widehat\Omega_{\cdot k}\|_1\|\widehat\Theta_{JL} - \Theta_{JL}\|_{1,\infty} + \Big[\|\widehat\Theta_{LL}\widehat\Omega_{\cdot k} - e_k\|_1 + \|(\widehat\Theta_{LL} - \Theta_{LL})\widehat\Omega_{\cdot k}\|_1\Big]\|B_J\|_{1,\infty} \le \|\widehat\omega_k\|_1\|\widehat\Theta_{JL} - \Theta_{JL}\|_{1,\infty} + \Big[\|\widehat\Theta_{LL}\widehat\Omega_{\cdot k} - e_k\|_1 + \|\widehat\omega_k\|_1\|\widehat\Theta_{LL} - \Theta_{LL}\|_{\infty,1}\Big]\|B_J\|_{1,\infty}.
\]
We first study the properties of $\widehat\Omega$. Notice that $\omega_k := \Omega_{\cdot k}$ is feasible for (41) since, on the event $\mathcal{E}$,
\[
\|\widehat\Theta_{LL}\omega_k - e_k\|_1 \le \|\omega_k\|_1\|\widehat\Theta_{LL} - \Theta_{LL}\|_{\infty,1} \overset{\mathcal E}{\le} C_0\|\omega_k\|_1\max_{i\in L}\sum_{j\in L}\eta_{ij} = \|\omega_k\|_1\lambda,
\]
by Proposition 16. The optimality and feasibility of $\widehat\omega_k$ imply
\[
\|\widehat\omega_k\|_1 \le \|\omega_k\|_1, \qquad \|\widehat\Theta_{LL}\widehat\omega_k - e_k\|_1 \le \|\widehat\omega_k\|_1\lambda \le \|\omega_k\|_1\lambda.
\]
Hence, on the event $\mathcal{E}$, we have
\[
\|\widehat B_{Jk} - B_{Jk}\|_1 \le C_0\|\omega_k\|_1\max_{i\in L}\sum_{j\in J}\eta_{ij} + 2\|\omega_k\|_1\lambda\|B_J\|_{1,\infty}.
\]
Recall that $B_J = A_JA_L^{-1}$ with $A_L = \operatorname{diag}(\alpha_{i_1}/p, \ldots, \alpha_{i_K}/p)$, and deduce
\[
\|B_J\|_{1,\infty} = \max_{1\le k\le K}\frac{p}{\alpha_{i_k}}\sum_{j\in J}A_{jk} \le \frac{p}{\alpha_I}\|A_J\|_{1,\infty}. \tag{110}
\]
Recall that $\lambda = C_0\max_{i\in L}\sum_{j\in L}\eta_{ij}$ and notice that
\[
\|\omega_k\|_1 = \|\Omega_{\cdot k}\|_1 = \|\Theta_{LL}^{-1}e_k\|_1 = \|A_L^{-1}C^{-1}A_L^{-1}e_k\|_1 \le \frac{p^2}{\alpha_{i_k}\alpha_I}\|C^{-1}\|_{\infty,1}. \tag{111}
\]
Collecting (110) – (111) gives
\[
\|\widehat B_{Jk} - B_{Jk}\|_1 \le C_0\frac{p^2}{\alpha_{i_k}\alpha_I}\|C^{-1}\|_{\infty,1}\Bigg[\max_{i\in L}\sum_{j\in J}\eta_{ij} + \frac{2p\|A_J\|_{1,\infty}}{\alpha_I}\max_{i\in L}\sum_{j\in L}\eta_{ij}\Bigg].
\]
Invoking (102) – (103) in Lemma 21 yields, on the event $\mathcal{E}$,
\[
\|\widehat B_{Jk} - B_{Jk}\|_1 \lesssim \frac{p^2\|C^{-1}\|_{\infty,1}}{\alpha_{i_k}\alpha_I}\Bigg[\sqrt{\alpha_I\gamma S_J}\sqrt{\frac{\log M}{Knp^2N}} + \frac{\|A_J\|_{1,\infty}}{\alpha_I}\sqrt{\alpha_I^3\gamma}\sqrt{\frac{\log M}{npN}}\Bigg] \le \frac{p}{\alpha_{i_k}}\|C^{-1}\|_{\infty,1}\frac{\bar\alpha_I}{\underline\alpha_I}\sqrt{\frac{\gamma\log M}{KnN}}\Bigg[\sqrt{|J| + \sum_{j\in J}\frac{\alpha_j}{\bar\alpha_I}} + \frac{\bar\alpha_I}{\underline\alpha_I}\sqrt{K\sum_{j\in J}\frac{\alpha_j}{\alpha_I}}\Bigg] \tag{112}
\]
where we recall that $S_J = |J|\bar\alpha_I + \sum_{j\in J}\alpha_j$, and use $\|A_J\|_{1,\infty} \le 1$ and $\|A_J\|_{1,\infty} \le \sum_{j\in J}\alpha_j/p$ in the second inequality. Combining (109) and (112) yields, on the event $\mathcal{E}$,
\[
\|\widehat B_{\cdot k} - B_{\cdot k}\|_1 \lesssim \frac{p}{\alpha_{i_k}}\frac{\sum_{i\in I_k}\alpha_i}{\sqrt{\alpha_I\gamma_k}}\sqrt{\frac{K\log M}{npN}} + \frac{p}{\alpha_{i_k}}\|C^{-1}\|_{\infty,1}\frac{\bar\alpha_I}{\underline\alpha_I}\sqrt{\frac{\gamma\log M}{KnN}}\Bigg[\sqrt{|J| + \sum_{j\in J}\frac{\alpha_j}{\bar\alpha_I}} + \frac{\bar\alpha_I}{\underline\alpha_I}\sqrt{K\sum_{j\in J}\frac{\alpha_j}{\alpha_I}}\Bigg]. \tag{113}
\]
It remains to prove step (3). For any $k \in [K]$, from (44), we have
\[
|\widehat A_{jk} - A_{jk}| = \Bigg|\frac{\widehat B_{jk}}{\|\widehat B_{\cdot k}\|_1} - \frac{B_{jk}}{\|B_{\cdot k}\|_1}\Bigg| \le \frac{\widehat B_{jk}}{\|\widehat B_{\cdot k}\|_1\|B_{\cdot k}\|_1}\Big|\|B_{\cdot k}\|_1 - \|\widehat B_{\cdot k}\|_1\Big| + \frac{|\widehat B_{jk} - B_{jk}|}{\|B_{\cdot k}\|_1} = \frac{\big|\|B_{\cdot k}\|_1 - \|\widehat B_{\cdot k}\|_1\big|}{\|B_{\cdot k}\|_1}\widehat A_{jk} + \frac{|\widehat B_{jk} - B_{jk}|}{\|B_{\cdot k}\|_1},
\]
using $\widehat A_{jk} = \widehat B_{jk}/\|\widehat B_{\cdot k}\|_1$ in the last equality. Since $\|\widehat A_{\cdot k}\|_1 = 1$, summing over $j \in [p]$ yields
\[
\|\widehat A_{\cdot k} - A_{\cdot k}\|_1 \le \frac{2\|\widehat B_{\cdot k} - B_{\cdot k}\|_1}{\|B_{\cdot k}\|_1} = \frac{2\alpha_{i_k}}{p}\|\widehat B_{\cdot k} - B_{\cdot k}\|_1.
\]
The last equality uses $\|B_{\cdot k}\|_1 = p/\alpha_{i_k}$, which follows from $B = AA_L^{-1}$. Invoking (113) concludes the proof.
D.2.2 Proof of Corollary 8

To prove Corollary 8, we need the following lemma, which controls $\|C^{-1}\|_{\infty,1}$.

Lemma 22. Let $\bar\gamma$ and $\underline\gamma$ be defined in (45). Assume $\bar\gamma \asymp \underline\gamma$ and $\sum_{k'\ne k}\sqrt{C_{kk'}} = o(\sqrt{C_{kk}})$ for any $1 \le k \le K$. Then $\|C^{-1}\|_{\infty,1} = O(K)$.

Proof. From the definition of the $(\infty,1)$-norm and the symmetry of $C$, one has
\[
\|C^{-1}\|_{\infty,1} = \sup_{v\ne 0}\frac{\|C^{-1}v\|_1}{\|v\|_1} = \sup_{v\ne 0}\frac{\|v\|_1}{\|Cv\|_1} = \Big(\inf_{\|v\|_1=1}\|Cv\|_1\Big)^{-1}.
\]
It suffices to lower bound $\inf_{\|v\|_1=1}\|Cv\|_1$. We have
\[
\inf_{\|v\|_1=1}\|Cv\|_1 = \inf_{\|v\|_1=1}\sum_{a=1}^K\Bigg|C_{aa}v_a + \sum_{b\ne a}C_{ab}v_b\Bigg| \ge \inf_{\|v\|_1=1}\sum_{a=1}^K\Bigg(C_{aa}|v_a| - \sum_{b\ne a}C_{ab}|v_b|\Bigg) = \inf_{\|v\|_1=1}\Bigg(\sum_{a=1}^KC_{aa}|v_a| - \sum_{b=1}^K|v_b|\sum_{a\ne b}C_{ab}\Bigg).
\]
Note that $\sum_{k'\ne k}\sqrt{C_{kk'}} = o(\sqrt{C_{kk}})$ implies
\[
\sum_{k'\ne k}C_{kk'} \le \Bigg(\sum_{k'\ne k}\sqrt{C_{kk'}}\Bigg)^2 = o(C_{kk}), \quad \text{for any } k \in [K], \tag{114}
\]
and $C_{kk} \le n^{-1}\|W_{k\cdot}\|_1 = \gamma_k/K = O(1/K)$ by using $\bar\gamma \asymp \underline\gamma \asymp 1$. We further obtain
\[
\inf_{\|v\|_1=1}\|Cv\|_1 \gtrsim \inf_{\|v\|_1=1}\sum_{a=1}^K\big(C_{aa}|v_a| - |v_a|\,o(1/K)\big) \gtrsim \min_aC_{aa}.
\]
Observe that
\[
\frac{\gamma_k}{K} - C_{kk} = \sum_{k'=1}^KC_{kk'} - C_{kk} = \sum_{k'\ne k}C_{kk'} = o(C_{kk}).
\]
This implies $C_{kk} \asymp \gamma_k/K \asymp 1/K$, which concludes $\inf_{\|v\|_1=1}\|Cv\|_1 \gtrsim 1/K$ and completes the proof.
Proof of Corollary 8. From (104) in Lemma 21, under $\sum_{k'\ne k}\sqrt{C_{kk'}} = o(\sqrt{C_{kk}})$ for any $1 \le k \le K$, one has
\[
\max_{i\in L}\sum_{j\in L}\eta_{ij} \lesssim \sqrt{\frac{\alpha_I^3\gamma\log M}{Knp^3N}},
\]
which improves the rate in (102) by a factor of $\sqrt{1/K}$. Repeating the arguments used to prove Theorem 7, one can show that
\[
\mathrm{Rem}(J,k) \lesssim \sqrt{\frac{K\log M}{nN}}\cdot\frac{\gamma^{1/2}\|C^{-1}\|_{\infty,1}}{K}\cdot\Bigg[\frac{\bar\alpha_I}{\underline\alpha_I}\sqrt{|J| + \sum_{j\in J}\rho_j} + \frac{\bar\alpha_I}{\underline\alpha_I}\sqrt{\sum_{j\in J}\rho_j}\Bigg].
\]
Invoking conditions (i) – (ii) of Corollary 8, together with $\|C^{-1}\|_{\infty,1} = O(K)$ from Lemma 22, one can obtain
\[
\sum_{k=1}^K\mathrm{Rem}(I,k) \lesssim \sum_{i\in I}\sqrt{\frac{\alpha_iK\log M}{npN}} \le K\sqrt{\frac{|I|\log M}{nN}}
\]
by using the Cauchy–Schwarz inequality with $\sum_{i\in I}\alpha_i \le \sum_{i=1}^p\alpha_i \le pK$, and
\[
\mathrm{Rem}(J,k) \lesssim \sqrt{\frac{|J|K\log M}{nN}}.
\]
This yields the optimal rate for $L_1(\widehat A, A)$. For $L_{1,\infty}(\widehat A, A)$, observe that $\sum_{i\in I_k}\alpha_i/p = \sum_{i\in I_k}A_{ik} \le 1$. This yields
\[
\mathrm{Rem}(I,k) \lesssim \sum_{i\in I_k}\sqrt{\frac{\alpha_iK\log M}{npN}} \le \sqrt{\frac{|I_k|K\log M}{nN}},
\]
which, in conjunction with the rate of $\mathrm{Rem}(J,k)$, concludes the result for $L_{1,\infty}(\widehat A, A)$.
D.2.3 Proof of the statement in Remark 9

We prove the result of Theorem 7 when condition (26) is replaced by (27) and (49). First note that, similarly to (64) and (77), condition (49) implies
\[
\min_{j\in I}\frac{\mu_j}{p} = \min_{j\in I}\frac{1}{n}\sum_{i=1}^n\Pi_{ji} \ge \frac{c\log M}{N}, \qquad \min_{j\in I}\frac{m_j}{p} = \min_{j\in I}\max_{1\le i\le n}\Pi_{ji} \ge \frac{c'(\log M)^2}{N}. \tag{115}
\]
We work on the event $\mathcal{E}'$ defined in the proof of Corollary 18, which holds with probability greater than $1 - O(M^{-1})$ under condition (27). To prove Theorem 7, observe that its proof in Section D.2.1 only uses the results of Lemma 13, Lemma 21 and (80). Since Lemma 13 and (80) are guaranteed by $\mathcal{E}'$, it suffices to prove that Lemma 21 holds for $\eta_{j\ell}$ defined in (82). This is indeed true, since the only terms in the expression of $\eta_{j\ell}$ in (82) that differ from (101) are
\[
\frac{m_j+m_\ell}{p}\vee\frac{\log^2M}{N}, \qquad \frac{\mu_j+\mu_\ell}{p}\vee\frac{\log M}{N}
\]
for any $j \in I_k$ and $\ell \in [p]$, and hence invoking (115) yields
\[
\frac{m_j+m_\ell}{p}\vee\frac{\log^2M}{N} \lesssim \frac{m_j+m_\ell}{p}, \qquad \frac{\mu_j+\mu_\ell}{p}\vee\frac{\log M}{N} \lesssim \frac{\mu_j+\mu_\ell}{p}.
\]
Therefore, Lemma 21 follows, and so does Theorem 7.
E Discussions of Assumptions 2 and 3
E.1 Examples
The following example shows that Assumption 2 does not imply Assumption 3.
Example 2. In the first two instances, (116) and (117) below, Assumptions 2 and 3 both hold. They give insight into why Assumption 3 may fail while Assumption 2 holds, which is shown in the third instance, (118).
We consider the following matrix:
\[
W = \begin{bmatrix} 0.5 & 0.4 & 0 & 0 & 0.4\\ 0.2 & 0.6 & 0.5 & 0.5 & 0\\ 0.3 & 0 & 0.5 & 0.5 & 0.1\\ 0 & 0 & 0 & 0 & 0.5 \end{bmatrix} \implies C = \begin{bmatrix} 0.34 & 0.15 & 0.1 & 0.31\\ 0.15 & 0.28 & 0.22 & 0\\ 0.1 & 0.22 & 0.31 & 0.07\\ 0.31 & 0 & 0.07 & 1 \end{bmatrix}. \tag{116}
\]
Clearly, each diagonal entry of $C$ dominates the non-diagonal ones, row-wise. From (23), Assumption 3 holds. Assumption 2 can be verified as well. We see that topic 4 is rare, in that it only occurs in document 5. However, the probability that the rare topic occurs is not small ($W_{45} = 0.5$). Since each column sums to 1, the closer $W_{45}$ is to 1, the smaller the other entries in the 5th column
must be, and the more uncorrelated topic 4 will be with other topics. Hence, the larger W45, the
more likely it is that Assumption 3 holds. Suppose the rare topic 4 also has a small probability, for
instance
\[
W = \begin{bmatrix} 0.5 & 0.4 & 0 & 0 & 0.3\\ 0.2 & 0.6 & 0.5 & 0.5 & 0.2\\ 0.3 & 0 & 0.5 & 0.5 & 0.4\\ 0 & 0 & 0 & 0 & 0.1 \end{bmatrix} \implies C = \begin{bmatrix} 0.35 & 0.17 & 0.13 & 0.25\\ 0.17 & 0.24 & 0.19 & 0.1\\ 0.13 & 0.19 & 0.26 & 0.24\\ 0.25 & 0.1 & 0.24 & 1 \end{bmatrix}. \tag{117}
\]
While the rare topic 4 now has the small non-zero entry $W_{45} = 0.1$, none of $W_{i5}$, $i = 1, 2, 3$, is dominant, and Assumption 3 is still met. However, it fails in the following scenario:
\[
W = \begin{bmatrix} 0.5 & 0.4 & 0 & 0 & 0.9\\ 0.2 & 0.6 & 0.5 & 0.5 & 0\\ 0.3 & 0 & 0.5 & 0.5 & 0\\ 0 & 0 & 0 & 0 & 0.1 \end{bmatrix} \implies C = \begin{bmatrix} 0.38 & 0.1 & 0.06 & 0.5\\ 0.1 & 0.28 & 0.24 & 0\\ 0.06 & 0.24 & 0.35 & 0\\ 0.5 & 0 & 0 & 1 \end{bmatrix}. \tag{118}
\]
Here, a frequent topic (topic 1) has probability $W_{15} = 0.9$, which dominates all other entries in the fifth column. Topic 4 is hard to detect, since it occurs with probability 0.1 while the frequent topic 1 occurs with probability 0.9 in the single document that may reveal topic 4. We emphasize that $W$ in all three cases above satisfies Assumption 2.
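As a quick sanity check, the criterion can be evaluated directly on the three matrices $C$ displayed in (116) – (118). The sketch below (ours, not part of the paper's code) uses the equivalent form of Assumption 3 stated in Example 3, namely $C_{ii}\wedge C_{jj} > C_{ij}$ for all $i \ne j$:

```python
import numpy as np

def assumption3_holds(C):
    """Check C_ii ∧ C_jj > C_ij for all i != j (the Assumption 3 criterion)."""
    C = np.asarray(C)
    K = C.shape[0]
    return all(min(C[i, i], C[j, j]) > C[i, j]
               for i in range(K) for j in range(K) if i != j)

C116 = [[0.34, 0.15, 0.10, 0.31],
        [0.15, 0.28, 0.22, 0.00],
        [0.10, 0.22, 0.31, 0.07],
        [0.31, 0.00, 0.07, 1.00]]
C117 = [[0.35, 0.17, 0.13, 0.25],
        [0.17, 0.24, 0.19, 0.10],
        [0.13, 0.19, 0.26, 0.24],
        [0.25, 0.10, 0.24, 1.00]]
C118 = [[0.38, 0.10, 0.06, 0.50],
        [0.10, 0.28, 0.24, 0.00],
        [0.06, 0.24, 0.35, 0.00],
        [0.50, 0.00, 0.00, 1.00]]

print(assumption3_holds(C116), assumption3_holds(C117), assumption3_holds(C118))
# → True True False
```

In (118), the failure comes from the pair $(1, 4)$: $C_{11}\wedge C_{44} = 0.38 < C_{14} = 0.5$.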
We give an example below to show that Assumption 3 does not imply Assumption 2, that is,
W satisfies Assumption 3 but does not have rank K.
Example 3. Let $K = n = 5$ and consider
\[
W = \begin{bmatrix} 0.3 & 0.05 & 0.3 & 0 & 0\\ 0.25 & 0.04 & 0.25 & 0.32 & 0.1\\ 0 & 0.5 & 0.15 & 0.1 & 0.3\\ 0.055 & 0.257 & 0.13 & 0.466 & 0.28\\ 0.395 & 0.153 & 0.17 & 0.114 & 0.32 \end{bmatrix}.
\]
Note that $W_{4\cdot} = -0.9W_{1\cdot} + 1.3W_{2\cdot} + 0.5W_{3\cdot}$, which implies $\operatorname{rank}(W) < 5$. However, with $\overline W = D_W^{-1}W$, it follows that
\[
C = n\overline W\,\overline W^\top = \begin{bmatrix} 2.16 & 1.22 & 0.51 & 0.44 & 1.18\\ 1.22 & 1.3 & 0.59 & 1.02 & 0.98\\ 0.51 & 0.59 & 1.69 & 1.12 & 0.87\\ 0.44 & 1.02 & 1.12 & 1.35 & 0.83\\ 1.18 & 0.98 & 0.87 & 0.83 & 1.22 \end{bmatrix}.
\]
Therefore, Assumption 3 is satisfied, as it is equivalent to $C_{ii}\wedge C_{jj} > C_{ij}$ for any $i \ne j \in [K]$, whereas $W$ does not have full rank.
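Example 3 can be verified numerically. The sketch below is our own; it assumes $C = n\overline W\,\overline W^\top$ with $\overline W = D_W^{-1}W$ and $D_W = \operatorname{diag}(\|W_{1\cdot}\|_1, \ldots, \|W_{K\cdot}\|_1)$, as in Lemma 23:

```python
import numpy as np

W = np.array([[0.300, 0.050, 0.30, 0.000, 0.00],
              [0.250, 0.040, 0.25, 0.320, 0.10],
              [0.000, 0.500, 0.15, 0.100, 0.30],
              [0.055, 0.257, 0.13, 0.466, 0.28],
              [0.395, 0.153, 0.17, 0.114, 0.32]])
K, n = W.shape

# W_{4.} = -0.9 W_{1.} + 1.3 W_{2.} + 0.5 W_{3.}, so W is rank deficient
combo = -0.9 * W[0] + 1.3 * W[1] + 0.5 * W[2]
print(np.allclose(combo, W[3]), np.linalg.matrix_rank(W))  # → True 4

# C = n * Wbar Wbar^T with Wbar = D_W^{-1} W (rows rescaled to sum to one)
Wbar = W / W.sum(axis=1, keepdims=True)
C = n * Wbar @ Wbar.T
print(np.round(C[0, 0], 2), np.round(C[0, 1], 2))  # → 2.16 1.22

# Assumption 3 in the form C_ii ∧ C_jj > C_ij for all i != j
ok = all(min(C[i, i], C[j, j]) > C[i, j]
         for i in range(K) for j in range(K) if i != j)
print(ok)  # → True
```

The computed entries of $C$ match the displayed matrix up to rounding to two decimals.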
E.2 Statements and proofs of results invoked in Remark 3

We first show the equivalence between $\nu > 4\delta$ and
\[
\cos(\angle(W_{i\cdot}, W_{j\cdot})) < \Bigg(\frac{\zeta_i}{\zeta_j}\wedge\frac{\zeta_j}{\zeta_i}\Bigg)\Bigg(1 - \frac{4\delta}{n\zeta_i\zeta_j}\Bigg), \quad \text{for all } 1 \le i \ne j \le K. \tag{119}
\]

Lemma 23. Let $\nu$ be defined in (23) with $C = n\overline W\,\overline W^\top$ and $\overline W = D_W^{-1}W$. Then the inequality $\nu > 4\delta$ is equivalent to (119). In particular, $\nu > 0$ is equivalent to Assumption 3.

Proof. Starting from the definition of $\nu$, we have
\[
\nu > 4\delta \iff \frac{n\|W_{i\cdot}\|_2^2}{\|W_{i\cdot}\|_1^2}\wedge\frac{n\|W_{j\cdot}\|_2^2}{\|W_{j\cdot}\|_1^2} - \frac{n\langle W_{i\cdot}, W_{j\cdot}\rangle}{\|W_{i\cdot}\|_1\|W_{j\cdot}\|_1} > 4\delta, \quad \text{for all } i \ne j,
\]
\[
\phantom{\nu > 4\delta} \iff n\zeta_i^2\wedge n\zeta_j^2 - n\zeta_i\zeta_j\cos(\angle(W_{i\cdot}, W_{j\cdot})) > 4\delta, \quad \text{for all } i \ne j,
\]
by using $\zeta_i = \|W_{i\cdot}\|_2/\|W_{i\cdot}\|_1$. The result follows after rearranging terms, and taking $\delta = 0$ we obtain the second assertion as well.
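At $\delta = 0$, the equivalence reduces to: $\nu > 0$ if and only if $\cos(\angle(W_{i\cdot}, W_{j\cdot})) < \zeta_i/\zeta_j\wedge\zeta_j/\zeta_i$ for all $i \ne j$. A small numeric check of this (ours), taking $\nu$ as the smallest gap $\min_{i\ne j}(C_{ii}\wedge C_{jj} - C_{ij})$, which we assume is consistent with (23):

```python
import numpy as np

rng = np.random.default_rng(1)
K, n = 4, 30

for _ in range(200):
    W = rng.random((K, n)) * (rng.random((K, n)) < 0.5)  # sparse nonnegative W
    W += 1e-3                                            # keep every row nonzero
    Wbar = W / W.sum(axis=1, keepdims=True)
    C = n * Wbar @ Wbar.T

    # nu > 0: the smallest gap C_ii ∧ C_jj - C_ij over pairs i != j
    nu = min(min(C[i, i], C[j, j]) - C[i, j]
             for i in range(K) for j in range(K) if i != j)

    # condition (119) at delta = 0, with zeta_i = ||W_i.||_2 / ||W_i.||_1
    zeta = np.linalg.norm(W, axis=1) / np.abs(W).sum(axis=1)
    cond = all(
        np.dot(W[i], W[j]) / (np.linalg.norm(W[i]) * np.linalg.norm(W[j]))
        < min(zeta[i] / zeta[j], zeta[j] / zeta[i])
        for i in range(K) for j in range(K) if i != j)

    assert (nu > 0) == cond
print("nu > 0 matches condition (119) with delta = 0 on 200 random draws")
```

Both sides are the same inequality rescaled by $n\zeta_i\zeta_j > 0$, so they must agree on every draw.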
The following lemma provides an upper bound on $\delta$ defined in (24).

Lemma 24. Assume condition (26) holds. Then there exists some constant $c > 0$ such that
\[
\delta \le c\Bigg\{\max_{\ell\in[p]}\frac{m_\ell}{\mu_\ell}\sqrt{\frac{1}{n}} + \sqrt{\frac{\log M}{n}}\Bigg\} := \varepsilon_n.
\]
If, in addition, $\log M = o(n)$ and condition (33) hold, we have
\[
\frac{\delta}{n\min_i\zeta_i^2} = o(1)
\]
with $\zeta_i = \|W_{i\cdot}\|_2/\|W_{i\cdot}\|_1$.

Proof. First, we notice that $\delta = \max_{1\le j\le p}\delta_{jj}$. For any $1 \le j \le p$, the expressions in (68) – (69) yield
\[
\delta_{jj} \asymp \frac{p^2}{\mu_j^2}\Bigg[\sqrt{\frac{m_j\Theta_{jj}\log M}{npN}} + \frac{m_j\log M}{npN} + \sqrt{\frac{\mu_j(\log M)^4}{npN^3}} + \Theta_{jj}\sqrt{\frac{p\log M}{\mu_jnN}}\Bigg].
\]
Using
\[
\Theta_{jj} = \frac{1}{n}\sum_{i=1}^n\Pi_{ji}^2 \le \frac{1}{n}\sum_{i=1}^n\Pi_{ji}\max_{1\le i\le n}\Pi_{ji} \overset{(53)}{=} \frac{\mu_jm_j}{p^2}, \tag{120}
\]
together with (64), we obtain
\[
\delta_{jj} \lesssim \frac{m_j}{\mu_j}\sqrt{\frac{p\log M}{\mu_jnN}} + \frac{m_jp\log M}{\mu_j^2nN} + \sqrt{\frac{p^3(\log M)^4}{\mu_j^3nN^3}} \lesssim \frac{m_j}{\mu_j}\sqrt{\frac{1}{n}} + \sqrt{\frac{\log M}{n}}. \tag{121}
\]
Taking the maximum over $j \in [p]$ concludes the proof of the upper bound on $\delta$. The second part follows by observing that
\[
\min_{1\le i\le K}n\zeta_i^2 = \min_{1\le i\le K}\frac{n\|W_{i\cdot}\|_2^2}{\|W_{i\cdot}\|_1^2} \ge 1,
\]
by the Cauchy–Schwarz inequality.
E.3 The Dirichlet distribution of W

Suppose the columns of $W = (W_1, \ldots, W_n)$ are i.i.d. samples from the Dirichlet distribution parametrized by the vector $d = (d_1, \ldots, d_K) \in \mathbb{R}^K$. Let $d_0 := \sum_kd_k$. It is well known that, for any $1 \le k \le K$,
\[
a_k := \mathbb{E}[W_{ki}] = \frac{d_k}{d_0}, \qquad \sigma^2_{kk} := \mathrm{Var}(W_{ki}) = \frac{d_k(d_0-d_k)}{d_0^2(d_0+1)}, \tag{122}
\]
and, for any $1 \le k \ne \ell \le K$,
\[
s_{kk} := \mathbb{E}[W_{ki}^2] = \frac{d_k(1+d_k)}{d_0(1+d_0)}, \qquad s_{k\ell} := \mathbb{E}[W_{ki}W_{\ell i}] = \frac{d_kd_\ell}{d_0(1+d_0)}. \tag{123}
\]
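These moment formulas are easy to confirm by simulation. A minimal Monte Carlo sketch (ours, with an arbitrary choice of the parameter $d$):

```python
import numpy as np

rng = np.random.default_rng(2)
d = np.array([0.5, 1.0, 2.0])        # Dirichlet parameter (arbitrary choice)
d0 = d.sum()
n = 200_000
W = rng.dirichlet(d, size=n).T       # K x n, columns are i.i.d. Dirichlet(d)

# theoretical moments from (122) - (123)
a = d / d0                                    # E[W_ki]
s_kk = d * (1 + d) / (d0 * (1 + d0))          # E[W_ki^2]
S = np.outer(d, d) / (d0 * (1 + d0))          # E[W_ki W_li] for k != l

print(np.abs(W.mean(axis=1) - a).max() < 1e-2)       # → True
emp2 = (W * W).mean(axis=1)
print(np.abs(emp2 - s_kk).max() < 1e-2)              # → True
emp_cross = (W[0] * W[1]).mean()
print(abs(emp_cross - S[0, 1]) < 1e-2)               # → True
```

With $n = 2\times10^5$ samples the Monte Carlo error is of order $10^{-3}$, well within the tolerance used.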
Further denote
\[
\widehat a_k := \frac{1}{n}\sum_{i=1}^nW_{ki}, \qquad \widehat s_{k\ell} := \frac{1}{n}\sum_{i=1}^nW_{ki}W_{\ell i}, \quad \text{for any } 1 \le k, \ell \le K,
\]
and the event
\[
\mathcal{E}_{\mathrm{dir}} = \Big\{|\widehat a_k - a_k| \le \varepsilon_1^k,\ |\widehat s_{k\ell} - s_{k\ell}| \le \varepsilon_2^{k\ell}, \ \text{for all } 1 \le k, \ell \le K\Big\}. \tag{124}
\]
From Lemma 23, the condition $\nu > 4\delta$ discussed in Remark 3 is equivalent to
\[
\frac{\widehat s_{kk}}{\widehat a_k^2}\wedge\frac{\widehat s_{\ell\ell}}{\widehat a_\ell^2} - \frac{\widehat s_{k\ell}}{\widehat a_k\widehat a_\ell} > 4\delta, \quad \text{for any } 1 \le k < \ell \le K. \tag{125}
\]
In particular, Assumption 3 corresponds to $\delta = 0$ in (125). The following lemma provides sufficient conditions that guarantee Assumption 3 and $\nu > 4\delta$. Recall that $M := n\vee p\vee\max_{i\in[n]}N_i$.

Lemma 25. Assume the columns of $W$ are i.i.d. from a Dirichlet distribution with parameter $d = (d_1, \ldots, d_K)$. Set $d_0 := d_1+\cdots+d_K$, $\bar d := \max_kd_k$ and $\underline d := \min_kd_k$. Assumption 3 holds with probability greater than $1 - O(M^{-1})$, provided that
\[
\frac{1\vee\bar d}{\underline d}\,d_0(1+d_0) \le c\,\frac{n}{\log M} \tag{126}
\]
holds for some sufficiently small constant $c > 0$. Additionally, if
\[
\frac{\bar d(1+d_0)}{\underline d} \le c\sqrt{\frac{n}{\log M}} \tag{127}
\]
holds for some constant $c > 0$ small enough, then $\mathbb{P}\{\nu > 4\delta\} \ge 1 - O(M^{-1})$.
Proof. Display (121) together with
\[
\frac{m_j}{p} = \max_{1\le t\le n}\Pi_{jt} \le \|A_{j\cdot}\|_1, \qquad \frac{\mu_j}{p} = \frac{1}{n}\sum_{t=1}^n\Pi_{jt} \ge \|A_{j\cdot}\|_1\min_k\frac{1}{n}\|W_{k\cdot}\|_1 = \|A_{j\cdot}\|_1\min_k\widehat a_k
\]
from (65) and (67) yield
\[
\delta \lesssim \frac{1}{\min_k\widehat a_k}\sqrt{\frac{1}{n}} + \sqrt{\frac{\log M}{n}}. \tag{128}
\]
We work on the event $\mathcal{E}_{\mathrm{dir}}$ with $\varepsilon_1^k$ and $\varepsilon_2^{k\ell}$ chosen as
\[
\varepsilon_1^k = c_1\sqrt{\frac{d_k}{d_0(1+d_0)}}\sqrt{\frac{\log M}{n}}, \qquad \varepsilon_2^{k\ell} = c_2\Bigg(\sqrt{\frac{d_kd_\ell}{d_0(1+d_0)}} + \sqrt{\frac{\log M}{n}}\Bigg)\sqrt{\frac{\log M}{n}}, \tag{129}
\]
for any $1 \le k \le \ell \le K$, where $c_1, c_2$ are positive constants such that, by Lemma 26, $\mathbb{P}(\mathcal{E}_{\mathrm{dir}}) \ge 1 - O(M^{-1})$. Note that condition (126) and $\mathcal{E}_{\mathrm{dir}}$ imply
\[
\widehat a_k \le a_k + \varepsilon_1^k = \frac{d_k}{d_0} + c_1\sqrt{\frac{d_k}{d_0(1+d_0)}}\sqrt{\frac{\log M}{n}} \le c_1'a_k; \tag{130}
\]
\[
\widehat a_k \ge a_k - \varepsilon_1^k = \frac{d_k}{d_0} - c_1\sqrt{\frac{d_k}{d_0(1+d_0)}}\sqrt{\frac{\log M}{n}} \ge c_1''a_k \tag{131}
\]
for all $k \in [K]$. In the following, we prove (125) by only showing
\[
\frac{\widehat s_{kk}}{\widehat a_k^2} - \frac{\widehat s_{k\ell}}{\widehat a_k\widehat a_\ell} = \frac{\widehat s_{kk}\widehat a_\ell - \widehat s_{k\ell}\widehat a_k}{\widehat a_k^2\widehat a_\ell} > 4\delta, \tag{132}
\]
since the other part can be deduced by similar arguments. We first show that, on the event $\mathcal{E}_{\mathrm{dir}}$, condition (126) guarantees
\[
\frac{\widehat s_{kk}\widehat a_\ell - \widehat s_{k\ell}\widehat a_k}{\widehat a_k^2\widehat a_\ell} \ge \rho\cdot\frac{d_0}{d_k(1+d_0)}
\]
for some constant $\rho \in (0, 1)$. We bound the numerator from below by
\[
\widehat s_{kk}\widehat a_\ell - \widehat s_{k\ell}\widehat a_k \ge \big(s_{kk} - \varepsilon_2^{kk}\big)\big(a_\ell - \varepsilon_1^\ell\big) - \big(s_{k\ell} + \varepsilon_2^{k\ell}\big)\big(a_k + \varepsilon_1^k\big) \ge s_{kk}a_\ell - s_{k\ell}a_k - \varepsilon_2^{kk}c_1'a_\ell - \varepsilon_1^\ell s_{kk} - \varepsilon_2^{k\ell}c_1'a_k - \varepsilon_1^ks_{k\ell} = \frac{d_kd_\ell}{d_0^2(1+d_0)} - c_1'\varepsilon_2^{kk}\frac{d_\ell}{d_0} - \varepsilon_1^\ell\frac{d_k(1+d_k)}{d_0(1+d_0)} - c_1'\varepsilon_2^{k\ell}\frac{d_k}{d_0} - \varepsilon_1^k\frac{d_kd_\ell}{d_0(1+d_0)}.
\]
We used (130) in the second inequality and (122) – (123) in the last equality. We then show that
\[
\max\Bigg\{c_1'\varepsilon_2^{kk}\frac{d_\ell}{d_0},\ \varepsilon_1^\ell\frac{d_k(1+d_k)}{d_0(1+d_0)},\ c_1'\varepsilon_2^{k\ell}\frac{d_k}{d_0},\ \varepsilon_1^k\frac{d_kd_\ell}{d_0(1+d_0)}\Bigg\} \le \frac{d_kd_\ell}{5d_0^2(1+d_0)}.
\]
From (122) – (123), it suffices to show the individual inequalities
\[
5c_1'\varepsilon_2^{kk} < \frac{d_k}{d_0(1+d_0)}, \qquad 5\varepsilon_1^\ell < \frac{d_\ell}{d_0(1+d_k)}, \qquad 5c_1'\varepsilon_2^{k\ell} < \frac{d_\ell}{d_0(1+d_0)}, \qquad 5\varepsilon_1^k < \frac{1}{d_0}.
\]
These inequalities follow by invoking condition (126) and plugging in the expressions for $\varepsilon_1^k$ and $\varepsilon_2^{k\ell}$ in (129). We thus have
\[
\widehat s_{kk}\widehat a_\ell - \widehat s_{k\ell}\widehat a_k \ge \frac{1}{5}\,\frac{d_kd_\ell}{d_0^2(1+d_0)}.
\]
In conjunction with (130), we conclude that
\[
\frac{\widehat s_{kk}\widehat a_\ell - \widehat s_{k\ell}\widehat a_k}{\widehat a_k^2\widehat a_\ell} \ge \frac{1}{5(c_1')^3}\,\frac{d_0}{d_k(1+d_0)},
\]
which further yields
\[
\frac{\widehat s_{kk}}{\widehat a_k^2}\wedge\frac{\widehat s_{\ell\ell}}{\widehat a_\ell^2} - \frac{\widehat s_{k\ell}}{\widehat a_k\widehat a_\ell} \ge \frac{1}{5(c_1')^3}\,\frac{d_0}{\bar d(1+d_0)}.
\]
When $\delta = 0$, the first statement follows. To prove the statement for $\nu > 4\delta$, it suffices to show that
\[
\frac{1}{5(c_1')^3}\,\frac{d_0}{\bar d(1+d_0)} > 4\delta.
\]
This follows by invoking (127), (128) and (131). The proof is complete.
Remark 16. For a symmetric Dirichlet distribution, that is, $d_1 = \cdots = d_K = d$, conditions (127) and (126) become, respectively,
\[
1 + Kd \le c\sqrt{\frac{n}{\log M}}, \qquad (1\vee d)K(1+Kd) \le c\,\frac{n}{\log M}.
\]
The larger $K$ is, the smaller $d$ needs to be relative to $n$. Note that a small $d$ encourages sparsity of $W$.

Lemma 26. Under condition (126), assume the columns of $W$ are i.i.d. samples from the Dirichlet distribution. Then $\mathbb{P}(\mathcal{E}_{\mathrm{dir}}) \ge 1 - M^{-1}$ with $\varepsilon_1^k$ and $\varepsilon_2^{k\ell}$ chosen as in (129).

Sketch of the proof. Since $W_{ki}$ and $W_{ki}W_{\ell i}$ are bounded by 1, and displays (122) – (123) imply $\mathrm{Var}(W_{ki}W_{\ell i}) \le d_kd_\ell/(d_0(1+d_0))$, using Bernstein's inequality in Lemma 11 with arguments similar to those in Lemmas 13 and 15, together with condition (126), yields the desired result.
F Cross-validation procedure for selecting C1 in Algorithm 3

We give a practical cross-validation procedure for selecting the proportionality constant $C_1$ used in Algorithm 3. Starting with a chosen fine grid $\mathcal{C}$, we randomly split the data into a training set $\mathcal{D}_1$ and a validation set $\mathcal{D}_2$. For each $c \in \mathcal{C}$, we apply Algorithm 3 with $T = 1$, $C_0 = 0.01$ and $C_1 = c$ on $\mathcal{D}_1$ to obtain $\widehat A_c$ and $\widehat C_c$, and then calculate the out-of-sample error
\[
L(c) := \big\|\widehat\Theta^{(2)} - \widehat A_c\widehat C_c\widehat A_c^\top\big\|_1.
\]
Here $\widehat\Theta^{(2)}$ is obtained as in (20) by using $\mathcal{D}_2$, and $\|\cdot\|_1$ denotes the matrix $\ell_1$ norm. The selected $C_1$ is defined as
\[
\widehat C_1 = \operatorname{argmin}_{c\in\mathcal{C}}L(c).
\]
Using the estimates $\widehat I$ and $\widehat A$, and motivated by the structure $\Theta_{II} = A_ICA_I^\top$, we can estimate $C$ by
\[
\widehat C = (\widehat A_I^\top\widehat A_I)^{-1}\widehat A_I^\top\widehat\Theta_{II}\widehat A_I(\widehat A_I^\top\widehat A_I)^{-1}.
\]
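In the noiseless case, where $\widehat\Theta_{II} = A_ICA_I^\top$ exactly and $A_I$ has full column rank, this least-squares formula recovers $C$ without error. A minimal sketch (ours) with synthetic $A_I$ and $C$:

```python
import numpy as np

rng = np.random.default_rng(3)
p_I, K = 12, 4
A_I = rng.random((p_I, K))           # synthetic word-topic block; full column rank a.s.
C = rng.random((K, K))
C = (C + C.T) / 2                    # symmetric, as a matrix of the form n*Wbar Wbar^T would be
Theta_II = A_I @ C @ A_I.T           # noiseless Theta_II

G = np.linalg.inv(A_I.T @ A_I)
C_hat = G @ A_I.T @ Theta_II @ A_I @ G
print(np.allclose(C_hat, C))  # → True
```

With a noisy $\widehat\Theta_{II}$ and estimated $\widehat A_I$, the same formula gives the plug-in estimator $\widehat C$ above.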
G Additional simulation results

Sensitivity to the number of topics K

We study the performance of Recover-L2 and Recover-KL for different values of $K$. We show that, even in a favorable low-dimensional setting ($p = 400$ and true $K_0 = 15$) with $|I_k| = 1$, $\xi = 1/p$ and large sample sizes ($N = 800$, $n = 1000$), the estimation error is seriously affected by a wrong choice of the number of topics $K$.

We generated 50 datasets according to our data generating mechanism and applied Recover-L2 and Recover-KL with different values of $K$ to each dataset to obtain $\widehat A_K$. To quantify the estimation error, we use the criterion
\[
\big\|\widehat A_K\widehat A_K^\top - AA^\top\big\|_1
\]
to evaluate the overall fit of the word-by-word membership matrix $AA^\top \in \mathbb{R}^{p\times p}$. We averaged this loss over the 50 datasets. To further benchmark the result, we use a random guessing method, Uniform, which draws $p\times K$ entries from the Uniform$(0,1)$ distribution and normalizes each column to obtain an "estimate" that is independent of the data. The performance of Recover-L2, Recover-KL and Uniform is shown in Figure 6. It clearly shows that both Recover-L2 and Recover-KL are indeed very sensitive to a correctly specified $K$: when $K$ differs from the true value $K_0$ by more than 2 units, the performance is close to random guessing. This phenomenon continues to hold in various settings, and we conclude that correctly specifying $K$ is critical for both Recover-L2 and Recover-KL.
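The Uniform baseline and the evaluation loss are straightforward to reproduce. The sketch below is ours and uses a toy $A$ with one anchor word per topic, not the paper's full data-generating mechanism:

```python
import numpy as np

rng = np.random.default_rng(4)
p, K, K0 = 400, 15, 15

# a toy "true" A: one anchor word per topic, remaining mass spread uniformly
A = np.full((p, K0), 1.0 / p)
A[:K0, :] = np.eye(K0)
A /= A.sum(axis=0, keepdims=True)

# Uniform baseline: p x K Uniform(0,1) draws, columns normalized to sum to one
A_unif = rng.random((p, K))
A_unif /= A_unif.sum(axis=0, keepdims=True)

# loss used in Figure 6: entrywise l1 norm of A_hat A_hat^T - A A^T
loss = np.abs(A_unif @ A_unif.T - A @ A.T).sum()
print(round(loss, 2))
```

Averaging this loss for a data-driven estimator over repeated datasets, and comparing it with the Uniform value, reproduces the benchmark comparison described above.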
[Figure 6 about here: average $\|\widehat A\widehat A^\top - AA^\top\|_1$ of Recover-L2, Recover-KL and Uniform, plotted against the specified $K \in \{10, \ldots, 19\}$.]

Figure 6: Plot of $\|\widehat A\widehat A^\top - AA^\top\|_1$ for different specified $K$. The black vertical bars are standard errors over 50 datasets.
Comparison of different algorithms for small K
We compare Top10, Recover-L2, Recover-KL and T-score in the benchmark setting N = n = 1500, p = 1000 and ξ = 1/p, but for small values of K. T-score is the procedure of Ke and Wang (2017), who kindly made their code available to us. Since K is small, the cardinality of each column Wi of W is randomly sampled from {1, 2, 3}.

In the first experiment, we considered K = 5 and varied |Ik| over {2, 3, 4, 5, 6}. The estimation error L1(A, Â)/K, averaged over 50 generated datasets, is shown in Figure 7. We see that the performance of Recover-KL is as good as that of Top10, and even better for |Ik| ≤ 3. However, as already verified in Section 6, Recover-KL has the worst performance for large K and is computationally demanding. The plot also demonstrates that T-score needs at least 4 anchor words per group in order to achieve performance comparable with Top10 and Recover-KL. This is as expected, since |Ik| needs to grow as p^2 log^2(n)/(nN) in Ke and Wang (2017).
In the second experiment, we set |Ik| = p/100 and varied K over {5, 6, . . . , 10}. We considered this range of values because, unfortunately, the implementation of T-score that the authors made available to us often crashes for K = 11. The average overall L1 estimation error in Figure 7 shows that Top10 has the smallest error over all K. T-score has similar performance for K ≤ 8, but becomes worse than Top10 as K grows larger.
Figure 7: The left plot shows the averaged overall estimation error when varying |Ik| with K = 5. The right plot shows the error when varying K with |Ik| = p/100.
The New York Times (NYT) dataset
Similar to the procedure used to preprocess the NIPS dataset, we removed common stop words and rare words occurring in fewer than 150 documents. The dataset after preprocessing has n = 299419, p = 3079 and N = 210. Since Arora et al. (2013) mainly considers K = 100, we tune our method to obtain a similar number of topics for our comparative study, for both the semi-synthetic data and the real data.
Semi-synthetic data comparison. To generate the semi-synthetic data, we first apply Top to the preprocessed dataset with C0 = 0.01 and C1 = 15.5³, and obtain the estimated word-topic matrix Â with estimated K = 101 and 206 anchor words. Using this Â, we generate W from the previous three distributions (a) – (c), separately, with n = 40000, 50000, 60000, 70000, and then generate X based on ÂW with N = 250. For each setting, we generate 15 datasets; the averaged overall estimation error and topic-wise estimation error of Top, Recover-L2, Recover-KL and LDA are reported in Figure 8. The running times of the different algorithms are shown in Table 3. We specify the true number of topics for Recover-L2, Recover-KL and LDA, while Top estimates it.
The results have essentially the same pattern as those from the NIPS dataset. For the Dirichlet distribution with parameter 0.03, Recover-KL, Recover-L2 and Top are comparable, whereas Top outperforms the other two when the topics become more correlated. LDA has the worst performance, especially when n is relatively small. Top and Recover-L2 run much faster than Recover-KL and LDA.
[Figure 8: two panels (overall estimation error; topic-wise estimation error) plotted against n, each split by the distribution of W: Dirichlet 0.03, Logistic Normal 0.02 and Logistic Normal 0.2, for the methods KL, L2, LDA and Top.]
Figure 8: Plots of averaged overall estimation error and topic-wise estimation error of Top,
Recover-L2 (L2), Recover-KL (KL) and LDA from the NYT dataset. Top estimates K
while the other algorithms use the true K as input. The bars denote one standard deviation.
Real data comparison from the NYT dataset. We applied Top, Recover-L2, Recover-
KL and LDA to the preprocessed NYT dataset. In order to ensure that all algorithms make use of
a similar number of topics, Top uses C0 = 0.01 and C1 = 15.5, which gives an estimated K = 101,
³ The value of C1 is chosen such that the estimated number of topics is around 100.
Table 3: Running time (seconds) of different algorithms on the NYT dataset
Top Recover-L2 Recover-KL LDA
n = 40000 471.5 795.6 4681.9 35913.7
n = 50000 517.2 841.5 4824.0 44754.1
n = 60000 555.3 894.1 4945.3 53833.8
n = 70000 579.2 957.2 5060.8 62548.8
and the other three algorithms use K = 101 as input. Evaluating algorithms on real data is difficult since we do not know the ground truth. However, there exist some standard ways of evaluating the fitted model. Since our main target is the word-topic matrix A, we consider two widely used metrics for evaluating the semantic quality of the estimated topics⁴ (Arora et al., 2013; Mimno et al., 2011; Stevens et al., 2012):
(1) Topic coherence. Coherence is a measure of the co-occurrence of the high-frequency words for each individual estimated topic. “This metric has been shown to correlate well with human judgments of topic quality. When the topics are perfectly recovered, all the high-probability words in a topic should co-occur frequently, otherwise, the model may be mixing unrelated concepts.” (Arora et al., 2013). Given a set of words W, coherence is calculated as

Coherence(W) := ∑_{w1, w2 ∈ W} log( (D(w1, w2) + ε) / D(w2) )    (133)

where D(w1) denotes the number of documents in which the word w1 occurs, D(w1, w2) denotes the number of documents in which both w1 and w2 occur (Mimno et al., 2011), and the parameter ε = 0.01 is used to avoid taking the log of zero (Stevens et al., 2012).
In the NYT dataset, for each algorithm, we use its estimated word-topic matrix Â and calculate Coherence(Wk) for 1 ≤ k ≤ K, where Wk is the set of 20 most frequent words in topic k. The averaged coherence metric and its standard deviation across topics for Top, Recover and LDA are reported in Table 4.⁵ Top has the largest coherence, suggesting a better topic recovery, while Recover has the smallest coherence.
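A minimal sketch of computing (133) from a binary document-occurrence matrix; summing over ordered pairs w1 ≠ w2 is one convention, and the helper names and toy data are our own.

```python
import numpy as np

def coherence(top_words, doc_occurs, eps=0.01):
    """Coherence(W) = sum over ordered pairs w1 != w2 of
    log((D(w1, w2) + eps) / D(w2)), as in (133).

    top_words  : word indices (e.g. the 20 most frequent words of one topic)
    doc_occurs : (n_docs, p) binary matrix; entry [d, w] = 1 iff word w occurs in doc d
    """
    total = 0.0
    for w1 in top_words:
        for w2 in top_words:
            if w1 == w2:
                continue
            D_w2 = doc_occurs[:, w2].sum()                        # docs containing w2
            D_w1w2 = (doc_occurs[:, w1] * doc_occurs[:, w2]).sum()  # docs containing both
            total += np.log((D_w1w2 + eps) / D_w2)
    return total

# Toy usage: 3 documents over 3 words.
doc_occurs = np.array([[1, 1, 0],
                       [1, 0, 1],
                       [1, 1, 1]])
score = coherence([0, 1], doc_occurs)
```

Less negative scores indicate more coherent topics, since frequently co-occurring word pairs contribute log-ratios near zero.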
(2) Unique words. Since coherence only measures the quality of each individual topic, we also consider the unique words metric (Arora et al., 2013), which reflects the redundancy between topics. For each topic and its T most probable words, we count how many of those T words do not appear among the T most probable words of any other topic. Some overlap across topics is expected due to semantic ambiguity, but lower numbers of unique words indicate less useful models. We consider T = 100, and the averaged unique words and the standard deviation
⁴ We do not consider the held-out probability metric since it requires the estimation of both A and W, while neither Top nor Recover estimates W.
⁵ We only report one of Recover-L2 and Recover-KL since they have the same results.
across topics are reported in Table 4. Top and LDA have more unique words than Recover. In fact, the unique words from Recover are only the anchor words; the other most probable words of the different topics all overlap, indicating redundancy between the estimated topics.
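The unique-words count can be sketched as follows, assuming the estimated matrix stores words on rows and topics on columns; the function name and toy example are illustrative.

```python
import numpy as np

def unique_words(A_hat: np.ndarray, T: int = 100) -> np.ndarray:
    """For each topic, count how many of its top-T most probable words
    do not appear in any other topic's top-T list."""
    p, K = A_hat.shape
    # Top-T word indices per topic, by descending estimated probability.
    tops = [set(np.argsort(A_hat[:, k])[::-1][:T]) for k in range(K)]
    counts = np.zeros(K, dtype=int)
    for k in range(K):
        others = set().union(*(tops[j] for j in range(K) if j != k))
        counts[k] = len(tops[k] - others)
    return counts

# Toy usage: p = 5 words, K = 2 topics sharing only word index 2.
A_toy = np.array([[0.5, 0.0],
                  [0.3, 0.0],
                  [0.2, 0.2],
                  [0.0, 0.5],
                  [0.0, 0.3]])
counts = unique_words(A_toy, T=2)
```

A topic whose top-T list consists almost entirely of words shared with other topics contributes a count near zero, signaling redundancy.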
Table 4: Coherence and unique words of Top, Recover and LDA on the NYT dataset.
The numbers in the parentheses are the standard deviations across topics.
Metric Top Recover LDA
Coherence -328.8(51.3) -464.1(3.8) -340.3(44.8)
Unique words 6.7(4.4) 1.0(0.0) 6.3(3.1)