A fast algorithm with minimax optimal guarantees for topic models
with an unknown number of topics.
Xin Bing∗ Florentina Bunea† Marten Wegkamp‡
September 6, 2019
Abstract
Topic models have become popular for the analysis of data that consists of a collection of n
independent multinomial observations, with parameters Ni ∈ N and Πi ∈ [0, 1]p for i = 1, . . . , n.
The model links all cell probabilities, collected in a p×n matrix Π, via the assumption that Π can
be factorized as the product of two nonnegative matrices A ∈ [0, 1]p×K and W ∈ [0, 1]K×n. Topic
models were originally developed in text mining, where one browses through n documents,
based on a dictionary of p words, and covering K topics. In this terminology, the matrix A
is called the word-topic matrix, and is the main target of estimation. It can be viewed as a
matrix of conditional probabilities, and it is uniquely defined, under appropriate separability
assumptions, discussed in detail in this work. Notably, the unique A is required to satisfy what
is commonly known as the anchor word assumption, under which A has an unknown number of
rows respectively proportional to the canonical basis vectors in RK . The indices of such rows
are referred to as anchor words. Recent computationally feasible algorithms, with theoretical
guarantees, utilize constructively this assumption by linking the estimation of the set of anchor
words with that of estimating the K vertices of a simplex. This crucial step in the estimation
of A requires K to be known, and cannot be easily extended to the more realistic set-up when
K is unknown.
This work takes a different view on anchor word estimation, and on the estimation of A.
We propose a new method of estimation in topic models, that is not a variation on the existing
simplex finding algorithms, and that estimates K from the observed data. We derive new finite
sample minimax lower bounds for the estimation of A, as well as new upper bounds for our
proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our
finite sample analysis is valid for any n,Ni, p and K, and both p and K are allowed to increase
with n, a situation not handled well by previous analyses.
∗Department of Statistical Science, Cornell University, Ithaca, NY. E-mail: [email protected].
†Department of Statistical Science, Cornell University, Ithaca, NY. E-mail: [email protected].
‡Department of Mathematics and Department of Statistical Science, Cornell University, Ithaca, NY. E-mail:
arXiv:1805.06837v3 [stat.ML] 4 Sep 2019
We complement our theoretical results with a detailed simulation study. We illustrate that
the new algorithm is faster and more accurate than the current ones, even though we start with
the computational and theoretical disadvantage of not knowing the correct number of topics K,
while the competing methods are given its correct value in our simulations.
Keywords: Topic model, latent model, overlapping clustering, identification, high dimensional
estimation, minimax estimation, anchor words, separability, nonnegative matrix factorization,
adaptive estimation
1 Introduction
1.1 Background
Topic models have been developed during the last two decades in natural language processing and
machine learning for discovering the themes, or “topics”, that occur in a collection of documents.
They have also been successfully used to explore structures in data from genetics, neuroscience and
computational social science, to name just a few areas of application. Earlier works on versions
of these models, called latent semantic indexing models, appeared mostly in the computer sci-
ence and information science literature, for instance Deerwester et al. (1990); Papadimitriou et al.
(1998); Hofmann (1999); Papadimitriou et al. (2000). Bayesian solutions, involving latent Dirichlet
allocation models, have been introduced in Blei et al. (2003) and MCMC-type solvers have been
considered by Griffiths and Steyvers (2004), to give a very limited number of earlier references. We
refer to Blei (2012) for an in-depth overview of this field. One weakness of the earlier work on topic
models was of computational nature, which motivated further, more recent, research on the devel-
opment of algorithms with polynomial running time, see, for instance, Anandkumar et al. (2012);
Arora et al. (2013); Bansal et al. (2014); Ke and Wang (2017). Despite these recent advances, fast
algorithms leading to estimators with sharp statistical properties are still lacking, which motivates
this work.
We begin by describing the topic model, using the terminology employed for its original usage,
that of text mining. It is assumed that we observe a collection of n independent documents,
and that each document is written using the same dictionary of p words. For each document
i ∈ [n] := {1, . . . , n}, we sample Ni words and record their frequencies in the vector Xi ∈ Rp. It
is further assumed that the probability Πji with which a word j appears in a document i depends
on the topics covered in the document, justifying the following informal application of the Bayes’
theorem,
Πji := Pi(Word j) = ∑_{k=1}^K Pi(Word j | Topic k) Pi(Topic k).
The topic model assumption is that the conditional probability of the occurrence of a word, given
the topic, is the same for all documents. This leads to the topic model specification:
Πji = ∑_{k=1}^K P(Word j | Topic k) Pi(Topic k),   for each j ∈ [p], i ∈ [n].   (1)

We collect the above conditional probabilities in the p × K word-topic matrix A and we let Wi ∈ R^K denote the vector containing the probabilities of each of the K topics occurring in document
i ∈ [n]. With this notation, data generated from topic models are observed count frequencies Xi
corresponding to independent
Yi := Ni Xi ∼ Multinomialp(Ni, AWi),   for each i ∈ [n].   (2)
Let X be the p × n observed data matrix, W be the K × n matrix with columns Wi, and Π be
the p × n matrix with entries Πji satisfying (1). The topic model therefore postulates that the
expectation of the word-document frequency matrix X has the non-negative factorization
Π := E[X] = AW, (3)
and the goal is to borrow strength across the n samples to estimate the common matrix of condi-
tional probabilities, A. Since the columns in Π, A and W are probabilities specified by (1), they
have non-negative entries and satisfy
∑_{j=1}^p Πji = 1,   ∑_{j=1}^p Ajk = 1,   ∑_{k=1}^K Wki = 1,   for any k ∈ [K] and i ∈ [n].   (4)
In Section 2 we discuss in detail separability conditions on A and W that ensure the uniqueness of
the factorization in (3).
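As a concrete illustration of the sampling scheme in (2)-(4), the model can be simulated in a few lines. The sizes p, K, n, N and the Dirichlet draws below are our own illustrative choices, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
p, K, n, N = 30, 3, 100, 200                   # illustrative sizes only

# Word-topic matrix A and topic-weight matrix W, with unit column sums as in (4).
A = rng.dirichlet(np.ones(p), size=K).T        # p x K
W = rng.dirichlet(np.ones(K), size=n).T        # K x n
Pi = A @ W                                     # p x n cell probabilities, as in (3)

# Observed frequencies X_i, with Y_i = N X_i ~ Multinomial_p(N, A W_i), as in (2).
X = np.column_stack([rng.multinomial(N, Pi[:, i]) / N for i in range(n)])
```

Every column of Pi and of X sums to one, matching the constraints in (4).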
In this context, the main goal of this work is to estimate A optimally, both computationally
and from a minimax-rate perspective, in identifiable topic models, with an unknown number K of
topics, that is allowed to depend on n,Ni, p.
1.2 Outline and contributions
In this section we describe the outline of this paper and give a precise summary of our results
which are developed via the following overall strategy: (i) We first show that A can be derived,
uniquely, at the population level, from quantities that can be estimated independently of A. (ii)
We use the constructive procedure in (i) for estimation, and replace population level quantities by
appropriate estimates, tailored to our final goal of minimax optimal estimation of A in (3), via fast
computation.
Recovery of A at the population level. We prove in Propositions 2 and 3 of Section 3 that
the target word-topic matrix A can be uniquely derived from Π, and give the resulting procedure
in Algorithm 1. The proofs require the separability Assumptions 1 and 2, common in the topic
model literature, when K is known. All model assumptions are stated and discussed in Section 2,
and informally described here. Assumption 1 is placed on the word-topic matrix A, and is known
as the anchor-word assumption as it requires the existence of words that are solely associated with
one topic. In Assumption 2 we require that W have full rank.
To the best of our knowledge, parameter identifiability in topic models has received limited
attention. If model (3) and Assumptions 1 and 2 hold, and provided that the index set I
corresponding to anchor words, as well as the number of topics K, are known, Lemma 3.1 of Arora
et al. (2012) shows that A can be constructed uniquely via Π. If I is unknown, but K is known,
Theorem 3.1 of Bittorf et al. (2012) further shows that the matrices A and W can be constructed
uniquely via Π, by connecting the problem of finding I with that of finding the K vertices of an
appropriately defined simplex. Methods that utilize simplex structures are common in the topic
models literature, such as the simplex structure in the word-word co-occurrence matrix (Arora
et al., 2012, 2013), in the original matrix Π (Ding et al., 2013), and in the singular vectors of Π
(Ke and Wang, 2017).
In this work we provide a solution to the open problem of constructing I, and then A, in topic
models, in the more realistic situation when K is unknown. For this, we develop a method that
is not a variation of the existing simplex-based constructions. Under the additional Assumption 3
of Section 2, but without a priori knowledge of K, we recover the index set I of all anchor words,
as well as its partition I. This constitutes Proposition 2. Our proof only requires the existence
of one anchor word for each topic, but we allow for the existence of more, as is typically
the case in practice; see, for instance, Blei (2012). Our method is optimization-free. It involves
comparisons between row and column maxima of a scaled version of the matrix ΠΠT , specifically
of the matrix R given by (11). Example 1 of Section 3 illustrates our procedure, whereas a contrast
with simplex-based approaches is given in Remark 1 of Section 3.
Estimation of A. In Section 5.2, we follow the steps of Algorithm 1 of Section 3 to develop
Algorithms 2 and 3 for estimating A from the data.
We show first how to construct estimators of I, I and K, and summarize this construction
in Algorithm 2 of Section 4, with theoretical guarantees provided in Theorem 4. Since we follow
Algorithm 1, this step of our estimation procedure does not involve any of the previously used
simplex recovery algorithms, such as those mentioned above.
The estimators of I, I and K are employed in the second step of our procedure, summarized
in Algorithm 3 of Section 5.2. This step yields the estimator Â of A, and only requires solving a
system of equations under linear restrictions, which, in turn, requires the estimation of the inverse
of a matrix. For the latter, we develop a fast and stable algorithm, tailored to this model, which
reduces to solving K linear programs, each optimizing over a K-dimensional space. This is less
involved, computationally, than the next best competing estimator of A in Arora et al. (2013),
albeit one developed for known K. After estimating I, their estimate of A requires solving p restricted
convex optimization problems, each optimizing over a K-dimensional parameter space.
We assess the theoretical performance of our estimator Â with respect to the L1,∞ and L1 losses
defined below, by providing finite sample lower and upper bounds on these quantities, that hold
for all p, K, Ni and n. In particular, we allow K and p to grow with n, as we expect that when the
number of available documents n increases, so will the number K of topics that they cover, and
possibly the number p of words used in these documents. Specifically, we let HK denote the set of
all K ×K permutation matrices and define:
‖Â − A‖1,∞ := max_{1≤k≤K} ∑_{j=1}^p |Âjk − Ajk|,   ‖Â − A‖1 := ∑_{j=1}^p ∑_{k=1}^K |Âjk − Ajk|,

L1,∞(Â, A) := min_{P∈HK} ‖Â − AP‖1,∞,   L1(Â, A) := min_{P∈HK} ‖Â − AP‖1.
We provide upper bounds for L1(Â, A) and L1,∞(Â, A) in Theorem 7 of Section 5.3. To benchmark
these upper bounds, Theorem 6 in Section 5.1 shows that the corresponding lower bounds are:
inf_Â sup_A PA( L1,∞(Â, A) ≥ c0 √( K(|Imax| + |J|) / (nN) ) ) ≥ c1,   (5)

inf_Â sup_A PA( L1(Â, A) ≥ c0 K √( (|I| + K|J|) / (nN) ) ) ≥ c1,

for absolute constants c0 > 0 and c1 ∈ (0, 1], assuming N := N1 = · · · = Nn for ease of presentation.
The infimum is taken over all estimators Â, while the supremum is taken over all matrices A in
a prescribed class A, defined in (34). The lower bounds depend on the largest number of anchor
words within each topic (|Imax|), the total number of anchor words (|I|), and the number of non-
anchor words (|J |) with J := [p]\I. In Section 5.3 we discuss conditions under which our estimator
Â is minimax optimal, up to a logarithmic factor, under both losses. To the best of our knowledge,
these lower and upper bounds on the L1,∞ loss of our estimators are new, and valid for growing K
and p. They imply the more commonly studied bounds on the L1 loss.
Our estimation procedure and the analysis of the resulting estimator Â are tailored to count
data, and utilize the restrictions (4) on the parameters of model (3). Consequently, both the
estimation method and the properties of the estimator differ from those developed for general
identifiable latent variable models, for instance those in Bing et al. (2017), and we refer to the
latter for further references and a recent overview of estimation in such models.
To the best of our knowledge, computationally efficient estimators of the word-topic matrix A in
(3), that are also accompanied by a theoretical analysis, have only been developed for the situation
in which K is known in advance. Even in that case, the existing results are limited.
Arora et al. (2012, 2013) are the first to analyze theoretically, from a rate perspective, estimators
of A in the topic model. They establish upper bounds on the global L1 loss of their estimators,
and their analysis allows K and p to grow with n. Unfortunately, these bounds differ by at least a
factor of order p3/2 from the minimax optimal rate given by our Theorem 7, even when K is fixed
and does not grow with n.
The recent work of Ke and Wang (2017) is tailored to topic models with a small, known, number
of topics K, which is independent of the number of documents n. Their procedure makes clever
use of the geometric simplex structure in the singular vectors of Π. To the best of our knowledge,
Ke and Wang (2017) is the first work that proves a minimax lower bound for the estimation of
A in topic models, with respect to the L1 loss, over a different parameter space than the one we
consider. We discuss in detail the corresponding rate over this space, and compare it with ours, in
Remark 5 in Section 5.1. The procedure developed by Ke and Wang (2017) is rate optimal for fixed
K, under suitable conditions tailored to their set-up (see pages 13 – 14 in Ke and Wang (2017)).
We defer a detailed rate comparison with existing results to Remark 5 of Section 5.1 and to
Section 5.3.1.
In Section 6 we present a simulation study, in which we compare numerically the quality of
our estimator with that of the best performing estimator to date, developed in Arora et al. (2013),
which also comes with theoretical guarantees, albeit not minimax optimal. We found that the
competing estimator is generally fast and accurate when K is known, but it is very sensitive to
the misspecification of K, as we illustrate in Appendix G of the supplement. Further, extensive
comparisons are presented in Section 6, in terms of the estimation of I, A and the computational
running time of the algorithms. We found that our procedure dominates on all these counts.
Finally, the proofs of Propositions 1 and 2 of Section 3 and the results of Sections 4 and 5 are
deferred to the appendices.
Summary of new contributions. We propose a new method that estimates
(a) the number of topics K;
(b) the anchor words and their partition;
(c) the word-topic matrix A;
and provide a finite sample analysis that allows K, in addition to Ni and p, to grow
with the sample size (number of documents) n. In this regime,
(d) we establish a minimax lower bound for estimating the word-topic matrix A;
(e) we show that the number of topics can be estimated correctly, with high probability;
(f) we show that A can be estimated at the minimax-optimal rate.
Furthermore,
(g) the estimation of K is optimization free;
(h) the estimation of the anchor words and that of A are scalable in n, Ni, p and K.
To the best of our knowledge, estimators of A that are scalable not only with p, but also with K,
and for which (a), (b) and (d) - (f) hold are new in the literature.
1.3 Notation
The following notation will be used throughout the entire paper.
The integer set {1, . . . , n} is denoted by [n]. For a generic set S, we denote its cardinality by |S|.
For a generic vector v ∈ Rd, we let ‖v‖q denote the vector ℓq norm, for q = 0, 1, 2, . . . , ∞, and
supp(v) denote its support. We denote by diag(v) the d × d diagonal matrix with diagonal elements
equal to v. For a generic matrix Q ∈ Rd×m, we write ‖Q‖∞ = max_{1≤i≤d, 1≤j≤m} |Qij|, ‖Q‖1 = ∑_{1≤i≤d, 1≤j≤m} |Qij| and ‖Q‖∞,1 = max_{1≤i≤d} ∑_{1≤j≤m} |Qij|. We let Qi· and Q·j denote the ith row and jth column of Q. For a set S, we let QS denote the |S| × m submatrix of Q with row indices in S. We write the d × d diagonal matrix

DQ = diag(‖Q1·‖1, . . . , ‖Qd·‖1)

and let (DQ)ii denote its ith diagonal element.
We write an ≲ bn if there exists an absolute constant c > 0 such that an ≤ c bn, and write
an ≍ bn if there exist two absolute constants c, c′ > 0 such that c bn ≤ an ≤ c′ bn.
We let n stand for the number of documents and Ni for the number of randomly drawn words
at document i ∈ [n]. Furthermore, p is the total number of words (dictionary size) and K is the
number of topics. We define M := (maxi Ni) ∨ n ∨ p. Finally, I is the (index) set of anchor words,
and its complement J := [p] \ I forms the (index) set of non-anchor words.
2 Preliminaries
In this section we introduce and discuss the assumptions under which A in model (3) can be
uniquely determined via Π, although W is not observed.
2.1 An information bound perspective on model assumptions
If we had access to W in model (3), then the problem of estimating A would become the more
standard problem of estimation in multivariate response regression under the constraints (4), with
dependent errors. In that case, A is uniquely defined if W has full rank, which is our Assumption 2
below. Since W is not observable, we mentioned earlier that the identifiability of A requires extra
assumptions. We provide insight into their nature, via a classical information bound calculation.
We view W as a nuisance parameter and ask when the estimation of A can be done with the same
precision whether W is known or not. In classical information bound jargon (Cox and Reid, 1987),
we study when the parameters A and W are orthogonal. The latter is equivalent to verifying

E[ −∂²ℓ(X1, . . . , Xn) / (∂Ajk ∂Wk′i) ] = 0   for all j ∈ [p], i ∈ [n] and k, k′ ∈ [K],   (6)

where ℓ(X1, . . . , Xn) is the log-likelihood of n independent multinomial vectors. Proposition 1
below gives necessary and sufficient conditions for parameter orthogonality.
Proposition 1. If X1, . . . , Xn is an independent sample from (2), and (3) holds, then A and W are orthogonal parameters, in the sense of (6) above, if and only if

|supp(Aj·) ∩ supp(W·i)| ≤ 1,   for all j ∈ [p], i ∈ [n].   (7)
We observe that condition (7) is implied by either of the two following extreme conditions:
(1) All rows in A are proportional to canonical vectors in RK , which is equivalent to assuming
that all words are anchor words.
(2) C := n^{-1} W W^T is diagonal.
In the first scenario, each topic is described via words exclusively used for that topic, which is
unrealistic. In the second case, the topics are totally unrelated to one another, an assumption that
is not generally met, but is perhaps more plausible than (1). Proposition 1 above shows that one
cannot expect the estimation of A in (3), when W is not observed, to be as easy as that when W is
observed, unless the very stringent conditions of this proposition hold. However, it points towards
quantities that play a crucial role in the estimation of A: the anchor words and the rank of W .
This motivates the study of this model, with both A and W unknown, under the more realistic
assumptions introduced in the next section and used throughout this paper.
2.2 Main assumptions
We make the following three main assumptions:
Assumption 1. For each topic k = 1, . . . , K, there exists at least one word j such that Ajk > 0
and Ajℓ = 0 for any ℓ ≠ k.
Assumption 2. The matrix W has rank K ≤ n.
Assumption 3. The inequality

cos(∠(Wi·, Wj·)) < (ζi/ζj) ∧ (ζj/ζi),   for all 1 ≤ i ≠ j ≤ K,

holds, with ζi := ‖Wi·‖2 / ‖Wi·‖1.
Conditions on A and W under which A can be uniquely determined from Π are generically
known as separability conditions, and were first introduced by Donoho and Stodden (2004), for the
identifiability of the factors in general nonnegative matrix factorization (NMF) problems. Versions
of such conditions have been subsequently adopted in most of the literature on topic models, which
are particular instances of NMF, see, for instance, Bittorf et al. (2012); Arora et al. (2012, 2013).
In the context and interpretation of the topic model, the commonly accepted Assumption 1
postulates that for each topic k there exists at least one word solely associated with that topic.
Such words are called anchor words, as the appearance of an anchor word is a clear indicator of
the occurrence of its corresponding topic, and typically more than one anchor word is present. For
future reference, for a given word-topic matrix A, we let I := I(A) be the set of anchor words, and
𝓘 be its partition relative to topics:

Ik := {j ∈ [p] : Ajk > 0, Ajℓ = 0 for all ℓ ≠ k},   I := ⋃_{k=1}^K Ik,   𝓘 := {I1, . . . , IK}.   (8)
The earlier work of Anandkumar et al. (2012) proposes a tensor-based approach that does not require
the anchor word assumption, but assumes that the topics are uncorrelated. Li and McCallum
(2006); Blei and Lafferty (2007) showed that, in practice, there is strong evidence against the lack
of correlation between topics. We therefore relax the orthogonality conditions on the matrix W in
our Assumption 2, similar to Arora et al. (2012, 2013). We note that Assumption 2 requires
K ≤ n: the total number K of topics covered by the n documents does not exceed the number
of documents.
Assumption 2 guarantees that the rows of W , viewed as vectors in Rn, are not parallel, and
Assumption 3 strengthens this, by placing a mild condition on the angle between any two rows of
W. If, for instance, WW^T is a diagonal matrix, or if ζi is the same for all i ∈ [K], then Assumption
2 implies Assumption 3. However, the two assumptions are not equivalent, and neither implies
the other, in general. We illustrate this in the examples of Section E.1 in the supplement. It is
worth mentioning that, when the columns of W are i.i.d. samples from the Dirichlet distribution
as commonly assumed in the topic model literature (Blei et al., 2003; Blei and Lafferty, 2007; Blei,
2012), Assumption 3 holds with high probability under mild conditions on the hyper-parameter of
the Dirichlet distribution. We defer their precise expressions to Lemma 25 in Appendix E.3 of the
supplement.
We discuss these assumptions further in Remark 1 of Section 3 and Remark 3 of Section 4
below.
3 Exact recovery of I, I and A at the population level
In this section we construct A via Π. Under Assumptions 1 and 3, we show first that the set of
anchor words I and its partition can be determined from the matrix R given in (11) below. We
begin by re-normalizing the three matrices involved in model (3) such that their rows sum up to 1:

W̄ := D_W^{-1} W,   Π̄ := D_Π^{-1} Π,   Ā := D_Π^{-1} A D_W.   (9)

Then

Π̄ = Ā W̄,   (10)

and

R := n Π̄ Π̄^T = Ā C̄ Ā^T,   (11)

with C̄ = n W̄ W̄^T. This normalization is standard in the topic model literature (Arora et al., 2012, 2013), and it preserves the anchor word structure: the matrices A and Ā have the same support, and Assumption 1 is equivalent with the existence, for each k ∈ [K], of at least one word j such that Ājk = 1 and Ājℓ = 0 for any ℓ ≠ k. Therefore A and Ā have the same I and the same partition of I. We differ from the existing literature in the way we make use of this normalization and explain this in Remark 1 below. Let

Ti := max_{1≤j≤p} Rij,   Si := {j ∈ [p] : Rij = Ti},   for any i ∈ [p].   (12)
In words, Ti is the maximum entry of row i, and Si is the set of column indices of the entries in
row i that equal the row maximum. The following proposition shows the exact recovery
of I and of its partition from R.
Proposition 2. Assume that model (3) and Assumptions 1 and 3 hold. Then:
(a) i ∈ I ⇐⇒ Ti = Tj , for all j ∈ Si.
(b) The anchor word set I can be determined uniquely from R. Moreover, its partition I is
unique and can be determined from R up to label permutations.
The proof of this proposition is given in Appendix 2, and its success relies on the equivalent formulation of Assumption 3,

min_{1≤i<j≤K} ( C̄ii ∧ C̄jj − C̄ij ) > 0.
The short proof of Proposition 3 below gives an explicit construction of A from

Θ := (1/n) Π Π^T,   (13)

using the unique partition of I given by Proposition 2 above.
Proposition 3. Under model (3) and Assumptions 1, 2 and 3, A can be uniquely recovered from
Θ, given the partition of the anchor words, up to column permutations.
Proof. Given the partition of anchor words {I1, . . . , IK}, we construct a set L = {i1, . . . , iK} by selecting one anchor word ik ∈ Ik for each topic k ∈ [K]. We let AL be the diagonal matrix

AL = diag(Ai11, . . . , AiKK).   (14)

We show first that B := A A_L^{-1} can be constructed from Θ. Assuming, for now, that B has been constructed, then A = B AL. The diagonal elements of AL can be readily determined from this relationship, since, via model (3) satisfying (4), the columns of A sum up to 1:

1 = ‖A·k‖1 = Aikk ‖B·k‖1,   (15)

for each k. Therefore, although B is only unique up to the choice of L and of the scaling matrix AL, the matrix A with unit column sums thus constructed is unique.
It remains to construct B from Θ. Let J = {1, . . . , p} \ I. We let BJ denote the |J| × K sub-matrix of B with row indices in J and BI denote the |I| × K sub-matrix of B with row indices in I. Recall that C := n^{-1} W W^T. Model (3) implies the following decomposition of the submatrix of Θ with row and column indices in L ∪ J:

[ ΘLL  ΘLJ ]   [ AL C AL    AL C AJ^T ]
[ ΘJL  ΘJJ ] = [ AJ C AL    AJ C AJ^T ].

In particular, we have

ΘLJ = AL C AJ^T = AL C AL (AL^{-1} AJ^T) = ΘLL (AL^{-1} AJ^T) = ΘLL BJ^T.   (16)
Note that Aikk > 0, for each k ∈ [K], from Assumption 1 which, together with Assumption 2, implies that ΘLL is invertible. We then have

BJ = ΘJL ΘLL^{-1}.   (17)
On the other hand, for any i ∈ Ik, for each k ∈ [K], we have Bik = Aik / Aikk, by the definition of B. Also, model (3) and Assumption 1 imply that, for any i ∈ Ik,

(1/n) ∑_{t=1}^n Πit = Aik ( (1/n) ∑_{t=1}^n Wkt ).   (18)

Therefore, the matrix BI has entries

Bik = ‖Πi·‖1 / ‖Πik·‖1,   for any i ∈ Ik and k ∈ [K].   (19)
This, together with BJ given above, completes the construction of B, and uniquely determines A.
Algorithm 1 Recover the word-topic matrix A from Π

Require: true word-document frequency matrix Π ∈ Rp×n
1: procedure Top(Π)
2:     compute Θ = n^{-1} Π Π^T and R from (11)
3:     recover I via FindAnchorWords(R)
4:     construct L = {i1, . . . , iK} by choosing any ik ∈ Ik, for k ∈ [K]
5:     compute BJ from (17) and BI from (19)
6:     recover A by normalizing B to unit column sums
7:     return I and A

8: procedure FindAnchorWords(R)
9:     initialize I = ∅ and P = [p]
10:    while P ≠ ∅ do
11:        take any i ∈ P, compute Si and Ti from (12)
12:        if ∃ j ∈ Si s.t. Ti ≠ Tj then
13:            P = P \ {i}
14:        else
15:            P = P \ Si and add Si to I
16:    return I
Our approach for recovering both I and A is constructive and can be easily adapted to estima-
tion. For this reason, we summarize our approach in Algorithm 1 and illustrate the algorithm with
a simple example.
Example 1. Let K = 3, p = 6, n = 3 and consider the following A and W:

A =
[ 0.3  0    0   ]
[ 0.2  0    0   ]
[ 0    0.5  0   ]
[ 0    0    0.4 ]
[ 0.2  0.5  0.3 ]
[ 0.3  0    0.3 ]

W =
[ 0.6  0.2  0.2 ]
[ 0.3  0.7  0.0 ]
[ 0.1  0.1  0.8 ]

Π = AW =
[ 0.18  0.06  0.06 ]
[ 0.12  0.04  0.04 ]
[ 0.15  0.35  0.00 ]
[ 0.04  0.04  0.32 ]
[ 0.30  0.42  0.28 ]
[ 0.21  0.09  0.30 ]
Applying FindAnchorWords in Algorithm 1 to R gives I = {1, 2, 3, 4}, with partition {{1, 2}, {3}, {4}}, from

R =
[ 1.32  1.32  0.96  0.72  0.96  1.02 ]
[ 1.32  1.32  0.96  0.72  0.96  1.02 ]
[ 0.96  0.96  1.74  0.30  1.15  0.63 ]
[ 0.72  0.72  0.30  1.98  0.89  1.35 ]
[ 0.96  0.96  1.15  0.89  1.03  0.92 ]
[ 1.02  1.02  0.63  1.35  0.92  1.19 ]

=⇒

T1 = 1.32, S1 = {1, 2}: word 1 is an anchor word ✓
T2 = 1.32, S2 = {1, 2}: word 2 is an anchor word ✓
T3 = 1.74, S3 = {3}: word 3 is an anchor word ✓
T4 = 1.98, S4 = {4}: word 4 is an anchor word ✓
T5 = 1.15, S5 = {3}: word 5 is not an anchor word ✗
T6 = 1.35, S6 = {4}: word 6 is not an anchor word ✗
Based on the recovered partition, the matrix A can be recovered via Proposition 3, which is executed in steps 4-6 of Algorithm 1. Specifically, by taking L = {1, 3, 4} as the representative set of anchor words, it follows from (17) and (19) that
BI =
[ 1    0  0 ]
[ 2/3  0  0 ]
[ 0    1  0 ]
[ 0    0  1 ]

BJ = ΘJL ΘLL^{-1} =
[ 0.03  0.06  0.04 ] [ 0.01  0.02  0.01 ]^{-1}   [ 2/3  1  3/4 ]
[ 0.02  0.02  0.04 ] [ 0.02  0.05  0.01 ]      = [ 1    0  3/4 ].
                     [ 0.01  0.01  0.04 ]
Finally, A is recovered by normalizing B = [BI^T, BJ^T]^T to have unit column sums,

A =
[ 1    0  0   ]
[ 2/3  0  0   ]
[ 0    1  0   ]
[ 0    0  1   ]
[ 2/3  1  3/4 ]
[ 1    0  3/4 ]
· diag(0.3, 0.5, 0.4) =
[ 0.3  0    0   ]
[ 0.2  0    0   ]
[ 0    0.5  0   ]
[ 0    0    0.4 ]
[ 0.2  0.5  0.3 ]
[ 0.3  0    0.3 ].
Remark 1. Contrast with existing results. It is easy to see that the rows in R (or, alternatively,
Π) corresponding to non-anchor words j ∈ J are convex combinations of the rows in R (or Π)
corresponding to anchor words i ∈ I. Therefore, finding K representative anchor words amounts
to finding the K vertices of a simplex. The latter can be accomplished by finding the unique
solution of an appropriate linear program, that uses K as input, as shown by Bittorf et al. (2012).
This result only utilizes Assumption 1 and a relaxation of Assumption 2, in which it is assumed
that no rows of W are convex combinations of the rest. To the best of our knowledge, Theorem
3.1 in Bittorf et al. (2012) is the only result to guarantee that, after representative anchor words
are found, a partition of I in K groups can also be found, for the specified K.
When K is not known, this strategy can no longer be employed, since finding the representative
anchor words requires knowledge of K. However, we showed that this problem can still be solved
under our mild additional Assumption 3. This assumption allows us to provide the if and only if
characterization of I proved in part (a) of Proposition 2. Moreover, part (b) of this proposition
shows that K is in one-to-one correspondence with the number of groups in I, and we exploit this
observation for the estimation of K.
4 Estimation of the anchor word set and of the number of topics
Algorithm 1 above recovers the index set I, its partition I and the number of topics K from the
matrix
R = n Π̄ Π̄^T = (n D_Π^{-1}) Θ (n D_Π^{-1})
Algorithm 2 Estimate the partition of the anchor words I by Î

Require: matrix R̂ ∈ Rp×p, C1 and Q ∈ Rp×p such that Q[j, ℓ] := C1 δjℓ
1: procedure FindAnchorWords(R̂, Q)
2:     initialize Î = ∅
3:     for i ∈ [p] do
4:         âi = argmax_{1≤j≤p} R̂ij
5:         set Î(i) = {ℓ ∈ [p] : R̂iâi − R̂iℓ ≤ Q[i, âi] + Q[i, ℓ]} and Anchor(i) = True
6:         for j ∈ Î(i) do
7:             âj = argmax_{1≤k≤p} R̂jk
8:             if |R̂ij − R̂jâj| > Q[i, j] + Q[j, âj] then
9:                 Anchor(i) = False
10:                break
11:        if Anchor(i) then
12:            Î = Merge(Î(i), Î)
13:    return Î = {Î1, Î2, . . . , ÎK̂}

14: procedure Merge(Î(i), Î)
15:    for G ∈ Î do
16:        if G ∩ Î(i) ≠ ∅ then
17:            replace G in Î by G ∩ Î(i)
18:            return Î
19:    add Î(i) to Î
20:    return Î
with Θ = n^{-1} Π Π^T. Algorithm 2 is a sample version of Algorithm 1. It has O(p²) computational complexity and is optimization-free.

The matrix Π is replaced by the observed frequency data matrix X ∈ Rp×n with independent columns X1, . . . , Xn. Since these are assumed to follow the multinomial model (2), an unbiased estimator of Θ is given by
Θ̂ := (1/n) ∑_{i=1}^n [ (Ni/(Ni − 1)) Xi Xi^T − (1/(Ni − 1)) diag(Xi) ],   (20)
with Ni representing the total number of words in document i. We then estimate R by

R̂ := (n D_X^{-1}) Θ̂ (n D_X^{-1}).   (21)
The quality of our estimator depends on how well we can control the noise level R̂ − R. In the
computer science literature, albeit for different algorithms (Arora et al., 2012; Bittorf et al.,
2012), only global ‖R̂ − R‖∞,1 control is considered, which ultimately has a negative impact on the rate
of convergence of Â. In general latent models with pure variables, the latter being the analogues
of anchor words, Bing et al. (2017) developed an algorithm similar to ours, under a less stringent
‖R̂ − R‖∞ control, which is still not precise enough for sharp estimation in topic models. To see
why, we note that Algorithm 2 involves comparisons between two different entries in a row of R̂.
In these comparisons, we must allow for small entry-wise error margins. These margin levels are
precise bounds C1δjℓ such that |R̂jℓ − Rjℓ| ≤ C1δjℓ for all j, ℓ ∈ [p], with high probability, for
some universal constant C1 > 1. The explicit deterministic bounds are stated in Proposition 16 of
Appendix C.2, while practical data-driven choices are based on Corollary 17 of Appendix C.2 and
given in Section 6.
Since the estimation of I is based on R̂, which is a perturbation of R, one cannot distinguish an anchor word from a non-anchor word that is very close to it without further signal strength conditions on A. Nevertheless, Theorem 4 shows that even without such conditions we can still estimate K consistently. Moreover, we guarantee the recovery of I and of its partition {I_1, . . . , I_K} with minimal mistakes.
Specifically, we denote the set of quasi-anchor words by
J1 :=j ∈ J : there exists k ∈ [K] such that Ajk ≥ 1− 4δ/ν
(22)
where
ν := min1≤i<j≤K
(Cii ∧ Cjj − Cij
)(23)
and
δ := max1≤j,`≤p
δj`. (24)
In the proof of Proposition 2, we argued that the set of anchor words, defined in Assumption 1, coincides with that of the scaled matrix Ā given in (9). The words corresponding to indices in J_1 are almost anchor words: in a row of Ā corresponding to such an index, the largest entry is close to 1 while the other entries are close to 0, provided δ/ν is small.
For the remainder of the paper we make the blanket assumption that all documents have equal length, that is, N_1 = · · · = N_n = N. We make this assumption for ease of presentation only, as all our results continue to hold when the documents have unequal lengths.
Theorem 4. Under model (3) and Assumption 1, assume

ν > 2 max{ 2δ, √(2 ‖C‖_∞ δ) }    (25)

with ν defined in (23), and

min_{1≤j≤p} (1/n) Σ_{i=1}^n Π_{ji} ≥ (2 log M)/(3N),    min_{1≤j≤p} max_{1≤i≤n} Π_{ji} ≥ (3 log M)² / N.    (26)
Then, with probability greater than 1 − 8M^{-1}, we have

K̂ = K,    I ⊆ Î ⊆ I ∪ J_1,    I_{π(k)} ⊆ Î_k ⊆ I_{π(k)} ∪ J_1^{π(k)},    for all 1 ≤ k ≤ K,

where J_1^k := { j ∈ J_1 : Ā_{jk} ≥ 1 − 4δ/ν } and π : [K] → [K] is some label permutation.
If we further impose the signal strength assumption J1 = ∅, the following corollary guarantees
exact recovery of all anchor words.
Corollary 5. Under model (3) and Assumption 1, assume ν > 4δ, (26) and J_1 = ∅. With probability greater than 1 − 8M^{-1}, we have K̂ = K, Î = I and Î_k = I_{π(k)}, for all 1 ≤ k ≤ K and some permutation π.
Remark 2.
(1) Condition (26) is assumed for the analysis only and the implementation of our procedure only
requires N ≥ 2. Furthermore, we emphasize that (26) is assumed to simplify our presentation.
In particular, we used it to obtain the precise expressions of δ_{jℓ} and η_{jℓ} given in (50) – (51) of Section 6. In fact, (26) can be relaxed to

min_{1≤j≤p} (1/n) Σ_{i=1}^n Π_{ji} ≥ (c log M)/(nN)    (27)

for some sufficiently large constant c > 0, under which more complicated expressions of δ′_{jℓ} and η′_{jℓ} can be derived, see Corollary 18 of Appendix C.2. Theorem 4 continues to hold, provided that (25) holds for δ′ = max_{j,ℓ} δ′_{jℓ} in lieu of δ, that is,

ν > 2 max{ 2δ′, √(2 ‖C‖_∞ δ′) }.    (28)
Note that condition (27) implies the restriction

nN ≥ c · p log M,    (29)

by using

min_{1≤j≤p} (1/n) Σ_{i=1}^n Π_{ji} ≤ (1/p) Σ_{j=1}^p (1/n) Σ_{i=1}^n Π_{ji} = 1/p.    (30)
Intuitively, both (26) and (27) preclude the average frequency of each word, over all doc-
uments, from being very small. Otherwise, if a word rarely occurs, one cannot reasonably
expect to detect/sample it: ‖Xj·‖1 will be close to 0, and the estimation of R in (21) becomes
problematic. For this reason, removing rare words or grouping several rare words together to
form a new word are commonly used strategies in data pre-processing (Arora et al., 2013, 2012;
Bansal et al., 2014; Blei et al., 2003), which we also employ in the data analyses presented in
Section 6.
(2) To interpret the requirement J_1 = ∅, by recalling that Ā = D_Π^{-1} A D_W,

Ā_{jk} = ( n^{-1} ‖W_{k·}‖_1 A_{jk} ) / ( n^{-1} ‖Π_{j·}‖_1 )

can be viewed as

P(Topic k | Word j) = P(Topic k) × P(Word j | Topic k) / P(Word j).
If J1 6= ∅, then P(Topic k |Word j) ≈ 1, for a quasi-anchor word j. Then, quasi-anchor words
also determine a topic, and it is hopeless to try to distinguish them exactly from the anchor
words of the same topic. However, Theorem 4 shows that in this case our algorithm places
quasi-anchor words and anchor words for the same topic in the same estimated group, as soon
as (25) of Theorem 4 holds. When we have only anchor words, and no quasi-anchor words,
J1 = ∅, there is no possibility for confusion. Then, we can have less separation between the
rows of W , ν > 4δ, and exact anchor word recovery, as shown in Corollary 5.
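The probabilistic reading of Ā above can be checked numerically: with the columns of A and of W normalized to sum to one, each row of Ā = D_Π^{-1} A D_W is a probability vector over topics, since ‖Π_{j·}‖_1 = Σ_k A_{jk} ‖W_{k·}‖_1. A small sketch of ours, with arbitrary simulated A and W:

```python
import numpy as np

rng = np.random.default_rng(1)

# Numerical check of the identity A_bar = D_Pi^{-1} A D_W: the scaled matrix
# collects the posterior probabilities P(Topic k | Word j), so rows sum to 1.
p, K, n = 6, 3, 10
A = rng.uniform(size=(p, K)); A /= A.sum(axis=0)             # columns sum to 1
W = rng.uniform(size=(K, n)); W /= W.sum(axis=0)             # columns sum to 1
Pi = A @ W                                                   # p x n cell probabilities
A_bar = (A * W.sum(axis=1)) / Pi.sum(axis=1, keepdims=True)  # D_Pi^{-1} A D_W
assert np.allclose(A_bar.sum(axis=1), 1.0)                   # rows are distributions
```

Note that the factors n^{-1} in the numerator and denominator cancel, so they are omitted in the computation.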
Remark 3 (Assumption 3 and condition ν > 4δ).
(1) The exact recovery of anchor words in the noiseless case (Proposition 2) relies on Assumption
3, which requires the angle between two different rows of W not be too small in the following
sense:
cos(∠(W_{i·}, W_{j·})) < (ζ_i/ζ_j) ∧ (ζ_j/ζ_i),    for all 1 ≤ i ≠ j ≤ K,    (31)

with ζ_i := ‖W_{i·}‖_2 / ‖W_{i·}‖_1. Therefore, the more balanced the rows of W are, the less restrictive this assumption becomes. The ideal case is min_i ζ_i / max_i ζ_i → 1, under which (31) holds whenever two rows of W are not parallel, whereas the least favorable case is min_i ζ_i / max_i ζ_i → 0, for which we need the rows of W to be close to orthogonal (the topics are uncorrelated).
Although in this work the matrix W has non-random entries, it is interesting to study when
(31) holds, with high probability, under appropriate distributional assumptions on W . A
popular and widely used distribution of the columns of W is the Dirichlet distribution (Blei
et al., 2003). Lemma 25 in the supplement shows that, when the columns of W are i.i.d. sam-
ples from the Dirichlet distribution, (31) holds with high probability, under mild conditions
on the hyper-parameter of the Dirichlet distribution.
(2) We prove in Lemma 23 that Assumption 3 is equivalent with ν > 0, where we recall that ν
has been defined in (23). For finding the anchor words from the noisy data, we need that
ν > 4δ, a strengthened version of Assumption 3. Furthermore, Lemmas 23 and 24 in the
supplement guarantee that there exists a sequence εn such that ν > 4δ is implied by
cos(∠(W_{i·}, W_{j·})) < ( (ζ_i/ζ_j) ∧ (ζ_j/ζ_i) ) (1 − ε_n),    for all 1 ≤ i ≠ j ≤ K.    (32)
Thus, we need εn more separation between any two different rows of W than what we require
in (31). Under the following balance condition of words across documents,

max_{1≤i≤n} Π_{ji} / ( (1/n) Σ_{i=1}^n Π_{ji} ) = o(√n),    for 1 ≤ j ≤ p,    (33)
Lemma 24 guarantees that εn → 0 as n → ∞. The same interplay between the angle of
rows of W and their balance, as described in part (1) above, holds. We view (33) as a reasonable, mild balance condition, as it effectively asks that the maximum frequency of each particular word across documents not be larger than the average frequency of that word over the n documents, multiplied by √n.
If the columns of W follow the Dirichlet distribution, under mild conditions on the hyper-
parameter, we directly prove, in Lemma 25 in the supplement, that ν > 4δ holds with probability greater than 1 − O(M^{-1}), with M := n ∨ p ∨ N.
5 Estimation of the word-topic membership matrix.
We derive minimax lower bounds for the estimation of A in topic models, with respect to the L_1 and L_{1,∞} losses, in Section 5.1. We follow with a description of our estimator Â of A in Section 5.2. In Section 5.3, we establish upper bounds on L_1(A, Â) and L_{1,∞}(A, Â) for the estimator Â constructed in Section 5.2, and provide conditions under which the bounds are minimax optimal.
5.1 Minimax lower bounds
In this section, we establish lower bounds for model (3), based on L_1(A, Â) and L_{1,∞}(A, Â), valid for any estimator Â of A over the parameter space

A(K, |I|, |J|) := { A ∈ R_+^{p×K} : A^T 1_p = 1_K, A has |I| anchor words },    (34)
where 1_d denotes the d-dimensional vector with all entries equal to 1. Let

W = W^0 + (1/(nN)) 1_K 1_K^T − (K/(nN)) I_K,    W^0 = ( e_1, . . . , e_1, e_2, . . . , e_2, . . . , e_K, . . . , e_K ),    (35)

where e_k is repeated n_k times in W^0, with Σ_{k=1}^K n_k = n and |n_i − n_j| ≤ 1 for 1 ≤ i, j ≤ K. We use e_k and I_d to denote, respectively, the canonical basis vectors in R^K and the identity matrix in R^{d×d}. It is easy to verify that W defined above satisfies Assumptions 2 and 3. Denote by P_A the joint distribution of (X_1, . . . , X_n), under model (3) for this choice of W. Let |I_max| = max_k |I_k|.
Theorem 6. Under model (3), assume (2) and let |I| + K|J| ≤ c(nN), for some universal constant c > 1. Then, there exist c_0 > 0 and c_1 ∈ (0, 1] such that

inf_Â sup_A P_A{ L_1(A, Â) ≥ c_0 K √( (|I| + K|J|) / (nN) ) } ≥ c_1.    (36)

Moreover, if K(|I_max| + |J|) ≤ c(nN) holds, we further have

inf_Â sup_A P_A{ L_{1,∞}(A, Â) ≥ c_0 √( K(|I_max| + |J|) / (nN) ) } ≥ c_1.

The infimum is taken over all estimators Â of A; the supremum is taken over all A ∈ A(K, |I|, |J|).
Remark 4. The product nN is the total number of sampled words, while |I|+K|J | is the number of
unknown parameters in A ∈ A(K, |I|, |J |). Since we do not make any further structural assumptions
on the parameter space, we studied minimax-optimality of estimation in topic models with anchor
words in the regime
nN > c(|I|+K|J |),
in which one can expect to be able to develop procedures for the consistent estimation of the matrix
A.
In order to facilitate the interpretation of the lower bound in the L_1-loss, we can rewrite (36) as

inf_Â sup_{A ∈ A(K, |I|, |J|)} P_A{ L_1(A, Â) / ‖A‖_1 ≥ c_0 √( (|I| + K|J|) / (nN) ) } ≥ c_1,

using the fact that ‖A‖_1 = K. Thus, the right-hand side becomes the square root of the ratio between the number of parameters to estimate and the overall sample size.
Remark 5. When K is known and independent of n or p, Ke and Wang (2017) derived the minimax rate (37) of L_1(A, Â) in their Theorem 2.2:

inf_Â sup_{A ∈ A(p, K)} P{ L_1(A, Â) ≥ c_1 √( p / (nN) ) } ≥ c_2    (37)

for some constants c_1, c_2 > 0. The parameter space considered in Ke and Wang (2017) for the derivation of the lower bound in (37) is

A(p, K) = { A ∈ R_+^{p×K} : A^T 1_p = 1_K, ‖A_{j·}‖_1 ≥ c_3/p, ∀ j ∈ [p] }

for some constant c_3 > 0, and the lower bound is independent of K. In contrast, the lower bound in Theorem 6 holds over A(K, |I|, |J|) ⊆ A(p, K), and the dependency on K in (36) is explicit.
The upper bounds derived for L1(A, A) in both this work and Ke and Wang (2017) correspond
to A ∈ A(K, |I|, |J |), making the latter the appropriate space for discussing attainability of lower
bounds.
Nevertheless, we notice that, when K is treated as a fixed constant, and recalling that |I| + |J| = p, the lower bounds over both spaces have the same order of magnitude, √(p/(nN)). From this perspective, when K is fixed, the result in Ke and Wang (2017) can be viewed as a minimax result over the smaller parameter space.
A non-trivial modification of the proof in Ke and Wang (2017) allowed us to recover the dependency on K that was absent in their original lower bound (37): the corresponding rate is √(pK/(nN)), and it is relative to estimation over the larger parameter space A(p, K). For comparison purposes, we note that this space corresponds to A(K, |I|, |J|), with I = ∅ and |J| = p. In this case, our lower bound (36) becomes K√(pK/(nN)), larger by a factor of K than the bound that can be derived
by modifying arguments in Ke and Wang (2017). Therefore, Theorem 6 improves upon existing
lower bounds for estimation in topic models without anchor words and with a growing number of
topics, and offers the first minimax lower bound for estimation in topic models with anchor words
and a growing K.
5.2 An estimation procedure for A
Our estimation procedure follows the constructive proof of Proposition 3. Given the set of estimated anchor words Î = {Î_1, . . . , Î_K̂}, we begin by selecting a set of representative indices of words per topic, by choosing i_k ∈ Î_k at random, to form L := {i_1, . . . , i_K̂}. As we explained in the proof of Proposition 3, we first estimate a normalized version of A, the matrix B = A A_L^{-1}. We estimate B_I and B_J separately. In light of (19), we estimate the |I| × K matrix B_I by

B̂_{ik} = ‖X_{i·}‖_1 / ‖X_{i_k·}‖_1,  if i ∈ Î_k and 1 ≤ k ≤ K;    B̂_{ik} = 0,  otherwise.    (38)
Recall from (17) that B_J = Θ_{JL} Θ_{LL}^{-1}, and that Assumption 2 ensures that Θ_{LL} := A_L C A_L is invertible, with Θ defined in (13). Since we have already obtained Î, we can estimate J by Ĵ = {1, . . . , p} \ Î. We then use the estimator Θ̂ given in (20) to estimate Θ_{JL} by Θ̂_{ĴL}. It remains to estimate the K × K matrix Ω := Θ_{LL}^{-1}. For this, we solve the linear program

(t̂, Ω̂) = argmin_{t ∈ R_+, Ω ∈ R^{K×K}} t    (39)

subject to

‖Ω Θ̂_{LL} − I‖_{∞,1} ≤ λ t,    ‖Ω‖_{∞,1} ≤ t,    (40)

with λ = C_0 max_{i∈L} Σ_{j∈L} η_{ij}, where η_{ij} is defined such that |Θ̂_{ij} − Θ_{ij}| ≤ C_0 η_{ij} for all i, j ∈ [p], with high probability, and C_0 is a universal constant.
Proposition 16 of Appendix C.2, see also Remark 8 below. To accelerate the computation, we can
decouple the above optimization problem, and solve instead K linear programs separately. We estimate Ω by Ω̂ = (ω̂_1, . . . , ω̂_K) where, for any k = 1, . . . , K,

ω̂_k := argmin_{ω ∈ R^K} ‖ω‖_1    (41)

subject to

‖Θ̂_{LL} ω − e_k‖_1 ≤ λ ‖ω‖_1,    (42)
with e_1, . . . , e_K denoting the canonical basis in R^K. After constructing Ω̂ as above, we estimate B_J by

B̂_Ĵ = ( Θ̂_{ĴL} Ω̂ )_+,    (43)

where the operation (·)_+ = max(0, ·) is applied entry-wise. Recalling that A_L can be determined from B via (15), the combination of (43) with (38) yields B̂ and hence the desired estimator of A:

Â = B̂ · diag( ‖B̂_{·1}‖_1^{-1}, . . . , ‖B̂_{·K}‖_1^{-1} ).    (44)
Remark 6. The decoupled linear programs given by (41) and (42) are computationally attractive
and can be done in parallel. This improvement over (39) becomes significant when K is large.
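One way to cast a decoupled program in LP form, sketched below with scipy.optimize.linprog, is to introduce the usual positive and negative parts ω = u − v, residual bounds r ≥ |Θ̂_{LL} ω − e_k|, and a slack s that upper bounds ‖ω‖_1; the LP then minimizes s subject to Σ r ≤ λs. This is our illustration of the decoupling idea, not the authors' implementation, and the function name is ours.

```python
import numpy as np
from scipy.optimize import linprog

def solve_omega_column(Theta_LL, k, lam):
    """LP sketch of one decoupled program: minimize s subject to
    ||omega||_1 <= s and ||Theta_LL @ omega - e_k||_1 <= lam * s.
    Variables are stacked as [u, v, r, s] with omega = u - v."""
    K = Theta_LL.shape[0]
    e_k = np.zeros(K); e_k[k] = 1.0
    nv = 3 * K + 1
    c = np.zeros(nv); c[-1] = 1.0                      # minimize the slack s
    A_ub, b_ub = [], []
    # +/- (Theta(u - v) - e_k) <= r, i.e. r dominates the residual entry-wise
    for sign in (1.0, -1.0):
        block = np.hstack([sign * Theta_LL, -sign * Theta_LL,
                           -np.eye(K), np.zeros((K, 1))])
        A_ub.append(block); b_ub.append(sign * e_k)
    row = np.zeros(nv); row[2 * K:3 * K] = 1.0; row[-1] = -lam
    A_ub.append(row[None, :]); b_ub.append(np.zeros(1))   # sum(r) <= lam * s
    row = np.zeros(nv); row[:2 * K] = 1.0; row[-1] = -1.0
    A_ub.append(row[None, :]); b_ub.append(np.zeros(1))   # sum(u + v) <= s
    res = linprog(c, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
                  bounds=[(0, None)] * nv, method="highs")
    u, v = res.x[:K], res.x[K:2 * K]
    return u - v
```

For instance, with Θ̂_{LL} = I_2, λ = 1/2 and k = 1, the optimum balances ‖ω‖_1 against the residual and returns ω = (2/3, 0).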
Remark 7. Since we can select all anchor words with high probability, as shown in Theorem 4, in practice we can repeatedly select different sets of representatives L from Î at random, and estimate A via (38) – (44) for each L. The entry-wise average of these estimates inherits, via Jensen's inequality, the same theoretical guarantees shown in Section 5.3, while benefiting from an improved numerical performance.
Remark 8. To preserve the flow of the presentation we refer to Proposition 16 of Appendix C.2
for the precise expressions of ηij used in constructing the tuning parameter λ. The estimates
of ηij , recommended for practical implementation, are shown in (51) based on Corollary 17 in
Appendix C.2. We also note that in precision matrix estimation, λ is proportional, in our notation, to the norm ‖Θ̂_{LL} − Θ_{LL}‖_∞; see, for instance, Bing et al. (2017) and the references therein for
a similar construction, but devoted to general sub-Gaussian distributions. In this work, the data
is multinomial, and we exploited this fact to propose a more refined tuning parameter, based on
entry-wise control.
We summarize our procedure, called Top, in the following algorithm.
Algorithm 3 Estimate the word-topic matrix A

Require: frequency data matrix X ∈ R^{p×n} with document lengths N_1, . . . , N_n; two positive constants C_0, C_1 and a positive integer T
1: procedure Top(X, N_1, . . . , N_n; C_0, C_1)
2:   compute Θ̂ from (20) and R̂ from (21)
3:   compute η_{ij} and Q[i, j] := C_1 δ_{ij} from (50) and (51), for i, j ∈ [p]
4:   estimate Î via FindAnchorWords(R̂, Q)
5:   for i = 1, . . . , T do
6:     randomly select L and solve Ω̂ from (41) by using λ = C_0 max_{i∈L} Σ_{j∈L} η_{ij} in (42)
7:     estimate B̂ from (38) and (43)
8:     compute Â^{(i)} from (44)
9:   return Î = {Î_1, Î_2, . . . , Î_K̂} and Â = T^{-1} Σ_{i=1}^T Â^{(i)}
5.3 Upper bounds of the estimation rate of A
In this section we derive upper bounds for the estimator Â constructed in Section 5.2, under the matrix ‖·‖_1 and ‖·‖_{1,∞} norms. Â is obtained by choosing the tuning parameter λ = C_0 max_{i∈L} Σ_{j∈L} η_{ij} in the optimization (41). To simplify notation and properly adjust the scales, we define

α_j := p max_{1≤k≤K} A_{jk},    γ_k := (K/n) Σ_{i=1}^n W_{ki},    for each j ∈ [p], k ∈ [K],    (45)

such that Σ_{k=1}^K γ_k = K and p ≤ Σ_{j=1}^p α_j ≤ pK from (4). We further set

ᾱ_I = max_{i∈I} α_i,    α̲_I = min_{i∈I} α_i,    ρ_j = α_j / ᾱ_I,    γ̄ = max_{1≤k≤K} γ_k,    γ̲ = min_{1≤k≤K} γ_k.    (46)

For future reference, we note that γ̄ ≥ 1 ≥ γ̲.
Theorem 7. Under model (3), Assumptions 1 and 2, assume ν > 4δ, J_1 = ∅ and (26). Then, with probability 1 − 8M^{-1}, we have

min_{P ∈ H_K} ‖Â_{·k} − (AP)_{·k}‖_1 ≤ Rem(I, k) + Rem(J, k),    for all 1 ≤ k ≤ K,

where H_K is the space of K × K permutation matrices,

Rem(I, k) ≲ √( (K log M) / (npN) ) · Σ_{i∈I_k} α_i / √( α̲_I γ̲ ),

Rem(J, k) ≲ √( (K log M) / (nN) ) · ( γ̄^{1/2} ‖C^{-1}‖_{∞,1} / K ) · [ (ᾱ_I/α̲_I) √( |J| + Σ_{j∈J} ρ_j ) + (ᾱ_I/α̲_I) √( K Σ_{j∈J} ρ_j ) ].

Moreover, summing over 1 ≤ k ≤ K yields

L_1(A, Â) ≲ Σ_{k=1}^K Rem(I, k) + Σ_{k=1}^K Rem(J, k).
In Theorem 7 we explicitly state bounds on Rem(I, k) and Rem(J, k), respectively, which allows
us to separate out the error made in the estimation of the rows of A that correspond to anchor
words from that corresponding to non-anchor words. This facilitates the statement and explanation
of the quantities that play a key role in this rate, and of the conditions under which our estimator
achieves near minimax optimal rate, up to a logarithmic factor of M . We summarize it in the
following corollary and the remarks following it. Recall that C = n−1WW T .
Corollary 8 (Attaining the optimal rate). In addition to the conditions in Theorem 7, suppose that

(i) ᾱ_I ≍ α̲_I and Σ_{j∈J} ρ_j ≲ |J|;

(ii) γ̄ ≍ γ̲ and Σ_{k′≠k} √(C_{kk′}) = o( √(C_{kk}) ) for any 1 ≤ k ≤ K

hold. Then with probability 1 − 8M^{-1}, we have

L_{1,∞}(A, Â) ≲ √( K(|I_max| + |J|) log M / (nN) ),    L_1(A, Â) ≲ K √( (|I| + K|J|) log M / (nN) ).    (47)
Remark 9. The optimal estimation rate depends on the bounds for Θ̂_{jℓ} − Θ_{jℓ} and R̂_{jℓ} − R_{jℓ} derived via a careful analysis in Proposition 16 in Appendix C.2. We rule out quasi-anchor words (J_1 = ∅, see Remark 2 as well) since otherwise the presentation, analysis and proofs would become much more cumbersome.
Remark 10 (Relation between document length N and dictionary size p). Our procedure can be im-
plemented for any N ≥ 2. However, Theorem 7 and Corollary 8 indirectly impose some restrictions
on N and p. Indeed, the restriction
N ≥ (2p log M / 3) ∨ (9p log² M / K)    (48)
is subsumed by (26), via (30) and

min_{1≤j≤p} max_{1≤i≤n} Π_{ji} = min_{1≤j≤p} Σ_{k=1}^K A_{jk} max_{1≤i≤n} W_{ki} ≤ (1/p) Σ_{j=1}^p Σ_{k=1}^K A_{jk} = K/p.
Inequality (48) describes the regime for which we establish the upper bound results in this section,
and are able to show minimax optimality, as the lower bound restriction cnN ≥ |I|+ |J |K for some
c > 1 in Theorem 6 implies N ≥ p/(cn).
We can extend the range (48) of N at the cost of a stronger condition than (25) on ν. Assume (28) holds with δ′ = max_{j,ℓ} δ′_{jℓ} and with δ′_{jℓ} defined in Corollary 18 of Appendix C.2. In that case, as in Remark 2, condition (26) can be relaxed to (27). Provided

min_{j∈I} (1/n) Σ_{i=1}^n Π_{ji} ≥ (c log M)/N,    min_{j∈I} max_{1≤i≤n} Π_{ji} ≥ c′ (log M)² / N    (49)
for some constant c, c′ > 0, we prove in Appendix D.2.3 of the supplement that Theorem 7 and
Corollary 8 still hold. As discussed in Remark 2, condition (27) implies (29),
N ≥ c · (p logM)/n,
which is a much weaker restriction on N and p than (48). Condition (49) in turn is weaker than (26)
as it only restricts the smallest (averaged over documents) frequency of anchor words. As a result, (49) does not necessarily imply the constraint (48). For instance, if min_{j∈I} n^{-1} Σ_{i=1}^n Π_{ji} ≳ 1/|I|, then (49) is implied by N ≳ |I| (log M)². The problem of developing a procedure that can be shown to be minimax-rate optimal in the absence of condition (49) is left open for future research.
Remark 11 (Interpretation of the conditions of Corollary 8).
(1) Conditions regarding anchor words. Condition ᾱ_I ≍ α̲_I implies that all anchor words, across topics, have the same order of frequency. The second condition, Σ_{j∈J} ρ_j ≲ |J|, is equivalent to |J|^{-1} Σ_{j∈J} ‖A_{j·}‖_∞ ≲ max_{i∈I} ‖A_{i·}‖_∞. Thus it holds when the average frequency of the non-anchor words is no greater, in order, than the largest frequency among all anchor words.
(2) Conditions regarding the topic matrix W. Condition (ii) implies that the topics are balanced, through γ̄ ≍ γ̲, and prevents too strong a linear dependency between the rows of W, via Σ_{k′≠k} √(C_{kk′}) = o(√(C_{kk})). As a result, we can show ‖C^{-1}‖_{∞,1} = O(K) (see Lemma 22 in the supplement) and the rate of Rem(J, k) in Theorem 7 can be improved by a factor of 1/√K. The most favorable situation under which condition (ii) holds corresponds to the extreme case when each document contains a prevalent topic k, in that the corresponding W_{ki} ≈ 1, and the topics are approximately balanced across documents, so that approximately n/K documents cover the same prevalent topic. The minimax lower bound is also derived based on this ideal structure of C. At the other extreme, all topics are equally likely to be covered in each document, so that W_{ki} ≈ 1/K, for all i and k. In the latter case, γ̄ ≍ γ̲ ≈ 1, but ‖C^{-1}‖_{∞,1} may be larger, in order, than K, and the rates in Theorem 7 are slower than the optimal rates by at least a factor of √K. When K is fixed or comparatively small, this loss is negligible. Nevertheless, our condition (ii) rules out this extreme case, as in general we do not expect any of the given documents to cover, in the same proportion, all of the K topics we consider.
Remark 12 (Extensions). Both our procedure and the outline of our analysis can be naturally
extended to the more general Nonnegative Matrix Factorization (NMF) setting, and to different
data generating distributions, as long as Assumptions 1, 2 and 3 hold, by adapting the control of
the stochastic error terms ε.
5.3.1 Comparison with the rates of other existing estimators
As mentioned in the Introduction, the rate analysis of estimators in topic models received very little
attention, with the two exceptions discussed below, both assuming that K is known in advance.
An upper bound on L_1(A, Â) has been established in Arora et al. (2012, 2013) for the estimators Â considered in these works, which differ from ours. Since the estimator of Arora et al. (2013) inherits the rate of Arora et al. (2012), we only discuss the latter rate, given below:

L_1(A, Â) ≲ ( a² K³ / (Γ δ_p³) ) · √( log p / (nN) ).
Here a can be viewed as γ̄/γ̲, Γ can be treated as the ℓ_1-condition number of C = n^{-1} W W^T, and δ_p is the smallest non-zero entry among all the anchor words, which corresponds to α̲_I / p in our notation. To understand the order of magnitude of this bound, we evaluate it in the most favorable scenario, that of W = W^0 in (35). Then Γ ≤ √K σ_min(C) ≲ 1/√K, where σ_min(C) is the smallest eigenvalue of C, and γ̄ ≍ γ̲. Since Σ_{j∈J} ρ_j ≲ |J| implies ᾱ_I ≳ p^{-1} Σ_{j=1}^p α_j and p ≤ Σ_{j=1}^p α_j ≤ pK, suppose also ᾱ_I ≥ K. Then, the above rate becomes

L_1(A, Â) ≲ p³ · √( K log p / (nN) ),

which is slower than what we obtained in (47) by at least a factor of (p⁵ log p)^{1/2} / K.
The upper bound on L1(A, A) in Ke and Wang (2017) is derived for K fixed, under a number
of non-trivial assumptions on Π, A and W given in their work. Their rate analysis does not assume
all anchor words have the same order of frequency, but requires that the number of anchor words in each topic grows as p² log²(n)/(nN) at the estimation level. With an abundance of anchor
words, the estimation problem becomes easier, as there will be fewer parameters to estimate. If
this assumption does not hold, the error upper bound established in Theorem 2.1 of Ke and Wang
(2017), for fixed K, may become sub-optimal by factors in p. In contrast, although in our work we
allow for the existence of more anchor words per topic, we only require a minimum of one anchor
word per topic.
To further understand how the number of anchor words per topic affects the estimation rate, we consider the extreme example, used for illustration purposes only, of I = {1, . . . , p} =: [p], when all words are anchor words. Our Theorem 6 immediately shows that in this case the minimax lower bound for L_1(A, Â) becomes

inf_Â sup_{A ∈ A(K, p, 0)} P_A{ L_1(A, Â) ≥ c_0 K √( p / (nN) ) } ≥ c_1

for two universal constants c_0, c_1 > 0, where the infimum is taken over all estimators Â. Theorem 7 shows that our estimator does indeed attain this rate when γ ≍ 1 and min_{i∈I} ‖A_{i·}‖_1 ≳ K/p. This rate becomes faster (by a factor of √K), as expected, since there is only one non-zero entry in each row of A to estimate. These considerations show that when we return to the realistic case in which an unknown subset of the words are anchor words, the bounds on L_1(A, Â), for our estimator Â, increase at most by an optimal factor of √K, and not by factors depending on p.
6 Experimental results
Notation: Recall that n denotes the number of documents, N denotes the number of words drawn from each document, p denotes the dictionary size, K denotes the number of topics, and |I_k| denotes the number of anchor words for topic k. We write ξ := min_{i∈I} K^{-1} Σ_{k=1}^K A_{ik} for the minimal average frequency of the anchor words. The quantity ξ plays the same role in our work as δ_p defined in the separability assumption of Arora et al. (2013). Larger values are more favorable for estimation.
Data generating mechanism: For each document i ∈ [n], we randomly generate the topic vector W_i ∈ R^K according to the following principle. We first randomly choose the cardinality s_i of W_i from the integer set {1, . . . , ⌊K/3⌋}. Then we randomly choose its support of cardinality s_i from [K]. Each entry of the chosen support is then generated from Uniform(0, 1). Finally, we normalize W_i such that it sums to 1. In this way, each document contains a (small) subset of topics instead of all possible topics.

Regarding the word-topic matrix A, we first generate its anchor words by putting A_{ik} := Kξ for any i ∈ I_k and k ∈ [K]. Then, each entry of the non-anchor words is sampled from a Uniform(0, 1) distribution. Finally, we normalize each sub-column A_{Jk} ⊂ A_{·k} to have sum 1 − Σ_{i∈I} A_{ik}.

Given the matrix A and W_i, we generate the p-dimensional column NX_i by drawing from a Multinomial_p(N, AW_i) distribution.
We consider the setting N = 1500, n = 1500, p = 1000, K = 30, |Ik| = p/100 and ξ = 1/p as
our benchmark setting.
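For concreteness, the generating mechanism above can be sketched as follows. This is a simplified illustration of ours (the function name is assumed, and the anchor words of topic k are placed in consecutive rows), not the simulation code used for the reported experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_data(n, N, p, K, anchors_per_topic, xi):
    """Sketch of the simulation design: build W (K x n), A (p x K) and the
    observed frequency matrix X (p x n) with N*X_i ~ Multinomial_p(N, A W_i)."""
    # topic weights: each document covers a small random subset of topics
    W = np.zeros((K, n))
    for i in range(n):
        s = rng.integers(1, max(K // 3, 1) + 1)       # support cardinality s_i
        support = rng.choice(K, size=s, replace=False)
        w = rng.uniform(size=s)
        W[support, i] = w / w.sum()                   # column sums to 1
    # word-topic matrix: anchor rows first, then uniform non-anchor rows
    I = anchors_per_topic * K                         # total number of anchor words
    A = np.zeros((p, K))
    for k in range(K):
        A[k * anchors_per_topic:(k + 1) * anchors_per_topic, k] = K * xi
    A[I:, :] = rng.uniform(size=(p - I, K))
    A[I:, :] *= (1 - anchors_per_topic * K * xi) / A[I:, :].sum(axis=0)
    # observed frequencies: N * X_i ~ Multinomial_p(N, A W_i)
    X = np.column_stack([rng.multinomial(N, A @ W[:, i]) / N for i in range(n)])
    return A, W, X
```

Each column of A sums to one (anchor mass plus rescaled non-anchor mass), and each column of X is a frequency vector summing to one.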
Specification of the tuning parameters in our algorithm. In practice, based on Corollary
17 in Appendix C.2, we recommend the choices
δ_{jℓ} = ( n² / ( ‖X_{j·}‖_1 ‖X_{ℓ·}‖_1 ) ) { η_{jℓ} + 2 Θ̂_{jℓ} √( log M / n ) [ ( n / ‖X_{j·}‖_1 ) ( (1/n) Σ_{i=1}^n X_{ji}/N_i )^{1/2} + ( n / ‖X_{ℓ·}‖_1 ) ( (1/n) Σ_{i=1}^n X_{ℓi}/N_i )^{1/2} ] }    (50)

and

η_{jℓ} = 3√6 ( ‖X_{j·}‖_∞^{1/2} + ‖X_{ℓ·}‖_∞^{1/2} ) √( log M / n ) ( (1/n) Σ_{i=1}^n X_{ji} X_{ℓi} / N_i )^{1/2}
+ (2 log M / n) ( ‖X_{j·}‖_∞ + ‖X_{ℓ·}‖_∞ ) (1/n) Σ_{i=1}^n 1/N_i + 31 √( (log M)⁴ / n ) ( (1/n) Σ_{i=1}^n (X_{ji} + X_{ℓi}) / N_i³ )^{1/2}    (51)
and set C0 = 0.01 and C1 = 1.1 in Algorithm 3. We found that these choices for C0 and C1 not
only give good overall performance, but are robust as well. To verify this claim, we generated 50
datasets under a benchmark setting of N = 1500, n = 1500, p = 1000, K = 30, |Ik| = p/100 and
ξ = 1/p. We first applied our Algorithm 3 with T = 1 to each dataset by setting C1 = 1.1 and
varying C0 within the grid 0.001, 0.003, 0.005, . . . , 0.097, 0.099. The estimation error L1(A, A)/K,
averaged over the 50 datasets, is shown in Figure 1 and clearly demonstrates that our algorithm is
robust to the choice of C0 in terms of overall estimation error. In addition, we applied Algorithm
3 by keeping C0 = 0.01 and varying C1 from 0.1, 0.2, . . . , 11.9, 12. Since C1 mainly controls the
selection of anchor words in Algorithm 2, we averaged the estimated topics number K, sensitivity
|I ∩ I|/ |I| and specificity |Ic ∩ Ic|/ |Ic| of the selected anchor words over the 50 datasets. Figure 2
shows that Algorithm 2 recovers all anchor words by choosing any C1 from the whole range of [1, 10]
and consistently estimates the number of topics for all 0.2 ≤ C1 ≤ 10, which strongly supports the
26
robustness of Algorithm 2 relative to the choice of the tuning parameter C1.
Figure 1: Plots of overall estimation error L_1(A, Â)/K versus C_0. The right plot is zoomed in.
Figure 2: Plots of K̂, sensitivity and specificity versus C_1, when the true K = 30.
Throughout, we consider two versions of our algorithm: Top1 and Top10, described in Algorithm 3 with T = 1 and T = 10, respectively. We compare Top with the best performing algorithm available, that of Arora et al. (2013). We denote this algorithm by Recover-L2 or Recover-KL, depending on which loss function is used for estimating the non-anchor rows in their Algorithm 3. In Appendix G we conducted a small simulation study to compare these two methods, and ours, with the recent procedure of Ke and Wang (2017), using the implementation the authors kindly made available to us. Their method is tailored to topic models with a known, small, number of topics. Our study revealed that, in the "small K" regime, their procedure is comparable to, or outperformed by, existing methods. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is a popular Bayesian approach to
topic models, but is computationally demanding.
The procedures from Arora et al. (2013) have better performance than LDA in terms of overall
loss and computational cost, as evidenced by their simulations. For this reason, we only focus on
the comparison of our method with Recover-L2 and Recover-KL for the synthetic data. The
comparison with LDA is considered in the semi-synthetic data.
We report the findings of our simulation studies in this section by showing that our algorithms
estimate both the number of topics and anchor words consistently, and have superior performance
in terms of estimation error as well as computational time in various settings over the existing
algorithms.
We re-emphasize that in all the comparisons presented below, the existing methods have as input
the true K used to simulate the data, while we also estimate K. In Appendix G, we show that these
algorithms are very sensitive to the choice of K. This demonstrates that correct estimation of K is
indeed highly critical for the estimation of the entire matrix A.
Topics and anchor words recovery
Top10 and Top1 use the same procedure (Algorithm 2) to select the anchor words; likewise for Recover-L2 and Recover-KL. We present in Table 1 the observed sensitivity |Î ∩ I|/|I| and specificity |Îᶜ ∩ Iᶜ|/|Iᶜ| of the selected anchor words in the benchmark setting with |I_k| varying. It is clear that Top recovers all anchor words and estimates the number of topics K consistently. All algorithms perform perfectly in not selecting non-anchor words. We emphasize that the correct K is given to the Recover procedures.
Table 1: Anchor word recovery and topic recovery for varying |I_k|.

                            Top                               Recover
|I_k|               2     4     6     8     10       2     4      6      8      10
sensitivity       100%  100%  100%  100%  100%      50%   25%   16.7%  12.5%   10%
specificity       100%  100%  100%  100%  100%     100%  100%   100%   100%   100%
Number of topics        100% (all settings)                     N/A
Estimation error
In the benchmark setting, we varied N and n over {500, 1000, 1500, 2000, 2500}, p over {500, 800, 1000, 1200, 1500}, K over {20, 25, 30, 35, 40} and |I_k| over {2, 4, 6, 8, 10}, one at a time. For each case, the overall estimation error ‖Â − AP‖_1/K and the topic-wise estimation error ‖Â − AP‖_{1,∞}, averaged over 50 generated datasets for each dimensional setting, were recorded. We used a simple linear program to find the best permutation matrix P which aligns Â with A. Since the two measures had similar patterns for all settings, we only present the overall estimation error in Figure 3, which can be summarized as follows:
- The estimation error of all four algorithms decreases as n or N increases, while it increases
as p or K increases. This confirms our theoretical findings and indicates that A is harder to
estimate when not only p, but K as well, is allowed to grow.
- In all settings, Top10 has the smallest estimation error. Meanwhile, Top1 has better perfor-
mance than Recover-L2 and Recover-KL except for N = 500 and |Ik| = 2. The difference
between Top10 and Top1 decreases as the length N of each sampled document increases.
This is to be expected since the larger the N , the better each column of X approximates the
corresponding column of Π, which lessens the benefit of selecting different representative sets
L of anchor words.
- Recover-KL is more sensitive to the specification of K and |I_k| than the other approaches.
Its performance worsens relative to the other procedures as K increases. On the other hand,
when the sizes |I_k| are small, it performs almost as well as Top10; however, its performance
does not improve as much as that of the other algorithms in the presence of more anchor
words.
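Aligning Â with A over column permutations is an assignment problem. The sketch below uses brute-force enumeration over permutations (adequate for small K; the paper solves a linear program instead, and all names here are ours):

```python
import itertools

import numpy as np


def align_columns(A_hat, A):
    """Align the ground truth A to the estimate A_hat by the column
    permutation P minimizing ||A_hat - A P||_1.  Returns the permuted A,
    the overall error ||A_hat - A P||_1 / K and the topic-wise error
    ||A_hat - A P||_{1,inf}."""
    K = A.shape[1]
    # cost[j, k] = L1 distance between column j of A_hat and column k of A
    cost = np.abs(A_hat[:, :, None] - A[:, None, :]).sum(axis=0)
    best = min(itertools.permutations(range(K)),
               key=lambda perm: sum(cost[j, perm[j]] for j in range(K)))
    A_perm = A[:, list(best)]
    diff = np.abs(A_hat - A_perm)
    return A_perm, diff.sum() / K, diff.sum(axis=0).max()
```

For larger K, one would replace the enumeration with an assignment solver; the error values are unchanged.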
[Figure 3 here: five panels plotting the averaged overall estimation error ‖Â − AP‖₁/K of Top10, Top1, Recover-L2 and Recover-KL against n ∈ {500, ..., 2500}, p ∈ {500, ..., 1500}, N ∈ {500, ..., 2500}, K ∈ {20, ..., 40} and |I_k| ∈ {2, ..., 10}, one parameter varied at a time.]

Figure 3: Plots of averaged overall estimation error, varying one parameter at a time.
Running time
The running time of all four algorithms is shown in Figure 4. As expected, Top1 dominates in terms
of computational efficiency; its computational cost increases only slightly in p or K. Meanwhile,
the running time of Top10 is lower than that of Recover-L2 in most of the settings and becomes
comparable when K is large or p is small. Recover-KL is overall much more computationally
demanding than the others. We see that Top1 and Top10 are nearly independent of n, the number
of documents, and N, the document length, as these parameters only appear in the computation
of the matrix R̂ and the tuning parameters δ̂_ij and η̂_ij. More importantly, as the dictionary size p
increases, the two Recover algorithms become much more computationally expensive than Top.
This difference stems from the fact that our procedure of estimating A is almost independent of
p computationally. Top solves K linear programs in K-dimensional space, while Recover must
solve p convex optimization problems in K-dimensional spaces.
We emphasize again that our Top procedure accurately estimates K in the reported times,
whereas we provide the two Recover versions with the true values of K. In practice, one needs to
resort to various cross-validation schemes to select a value of K for the Recover algorithms, see
Arora et al. (2013). This would dramatically increase the actual running time for these procedures.
[Figure 4 here: five panels plotting the running time (in seconds) of Top10, Top1, Recover-L2 and Recover-KL against n, p, N, K and |I_k|, one parameter varied at a time.]

Figure 4: Plots of running time, varying one parameter at a time.
Semi-synthetic data from the NIPS corpus
In this section, we compare our algorithm with existing competitors on semi-synthetic data, gen-
erated as follows.

We begin with a real-world dataset¹, a corpus of NIPS articles (Dheeru and Karra Taniskidou,
2017), to benchmark our algorithm and compare Top1 with LDA (Blei et al., 2003), Recover-L2
and Recover-KL. We use the LDA code of Riddell et al. (2016), implemented via fast collapsed
Gibbs sampling with the default 1000 iterations. To preprocess the data, following Arora et al.
(2013), we removed common stop words and rare words occurring in fewer than 150 documents,
and discarded documents with fewer than 150 words. The resulting dataset has n = 1480 documents
with dictionary size p = 1253 and mean document length 858.

¹More comparisons based on the New York Times dataset are relegated to the supplement.
To generate semi-synthetic data, we first apply Top to this real dataset, in order to obtain
the estimated word-topic matrix Â, which we then use as the ground truth in our simulation
experiments, performed as follows.² For each document i ∈ [n], we sample W_i from a specified
distribution (see below) and we sample X_i from Multinomial_p(N_i, ÂW_i). The estimated Â from
Top (with C₁ = 4.5 chosen via cross-validation and C₀ = 0.01) contains 178 anchor words and 120
topics. We consider three distributions for W, chosen as in Arora et al. (2013):
(a) symmetric Dirichlet distribution with parameter 0.03;
(b) logistic-normal distribution with block diagonal covariance matrix and ρ = 0.02;
(c) logistic-normal distribution with block diagonal covariance matrix and ρ = 0.2.
Cases (b) and (c) are designed to investigate how the correlation among topics affects the estimation
error. To construct the block-diagonal covariance structure, we divide the 120 topics into 10 groups.
Within each group, the off-diagonal elements of the covariance matrix of the topics are set to ρ while
the diagonal entries are set to 1. The parameter ρ ∈ {0.02, 0.2} reflects the magnitude of correlation
among topics.
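The samplers for cases (b) and (c) can be sketched as follows, assuming the usual logistic-normal construction (exponentiate a multivariate normal draw, then normalize to the simplex); the function name and defaults are ours:

```python
import numpy as np


def sample_logistic_normal(n, K=120, n_groups=10, rho=0.2, seed=0):
    """Draw n topic-weight vectors from a logistic-normal distribution with a
    block-diagonal covariance: K topics split into n_groups equal groups,
    within-group correlation rho, unit variances.  Returns an (n, K) array
    whose rows lie on the probability simplex."""
    rng = np.random.default_rng(seed)
    block_size = K // n_groups
    block = np.full((block_size, block_size), rho)
    np.fill_diagonal(block, 1.0)
    # block-diagonal covariance over all K topics
    Sigma = np.kron(np.eye(n_groups), block)
    Z = rng.multivariate_normal(np.zeros(K), Sigma, size=n)
    W = np.exp(Z)
    W /= W.sum(axis=1, keepdims=True)  # normalize each row to the simplex
    return W
```

Setting rho = 0.02 or rho = 0.2 reproduces cases (b) and (c) respectively, while rho controls how correlated topics within a group are.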
The number of documents n is varied over {2000, 3000, 4000, 5000, 6000} and the document length
is set to N_i = 850 for 1 ≤ i ≤ n. In each setting, we generate 20 datasets and report the
averaged overall estimation error and topic-wise estimation error (defined above) of the
different algorithms in Figure 5. The running time of each algorithm is reported in Table 2.
Overall, LDA is outperformed by the other three methods, though its performance might
improve with more iterations. Top, Recover-KL and Recover-L2 are
comparable when the columns of W are sampled from a symmetric Dirichlet with parameter 0.03,
whereas Top performs better as the correlation among topics increases. Moreover, Top
has the best control of the topic-wise estimation error, as expected, while the comparison between
Recover-KL and Recover-L2 depends on the error metric. In terms of running time,
Top is significantly faster than the other three methods.
Finally, we emphasize that we provide LDA and the two Recover algorithms with the true
K, whereas Top estimates it.
²Arora et al. (2013) use the posterior estimate of A from LDA with K = 100. Since we have no prior
information on K, we instead use Top to estimate it. Moreover, the posterior estimate from LDA does not satisfy the
anchor word assumption, so to evaluate the effect of anchor words one has to add anchor
words manually (Arora et al., 2013). In contrast, the estimated Â from Top automatically contains anchor words.
[Figure 5 here: two rows of panels (overall estimation error; topic-wise estimation error), each with three sub-panels for W drawn from Dirichlet(0.03), logistic-normal with ρ = 0.02 and logistic-normal with ρ = 0.2, plotted against n ∈ {2000, ..., 6000}.]

Figure 5: Plots of averaged overall estimation error and topic-wise estimation error of Top,
Recover-L2 (L2), Recover-KL (KL) and LDA. Top estimates K; the other methods
use the true K as input. The bars denote one standard deviation.
Acknowledgements
Bunea and Wegkamp are supported in part by NSF grant DMS-1712709. We thank the Editor,
Associate Editor and three referees for their constructive remarks.
References
Anandkumar, A., Foster, D. P., Hsu, D. J., Kakade, S. M. and Liu, Y.-k. (2012). A spectral
algorithm for latent dirichlet allocation. In Advances in Neural Information Processing Systems
25 (F. Pereira, C. J. C. Burges, L. Bottou and K. Q. Weinberger, eds.). Curran Associates, Inc.,
917–925.
Arora, S., Ge, R., Halpern, Y., Mimno, D. M., Moitra, A., Sontag, D., Wu, Y. and Zhu,
M. (2013). A practical algorithm for topic modeling with provable guarantees. In ICML (2).
Arora, S., Ge, R. and Moitra, A. (2012). Learning topic models–going beyond svd. In Foun-
dations of Computer Science (FOCS), 2012, IEEE 53rd Annual Symposium. IEEE.
Bansal, T., Bhattacharyya, C. and Kannan, R. (2014). A provable svd-based algorithm for
learning topics in dominant admixture corpus. In Proceedings of the 27th International Confer-
ence on Neural Information Processing Systems - Volume 2. NIPS’14, MIT Press, Cambridge,
MA, USA.
Table 2: Running time (seconds) of different algorithms
Top Recover-L2 Recover-KL LDA
n = 2000 21.4 428.2 2404.5 3052.3
n = 3000 22.3 348.2 1561.8 4649.5
n = 4000 25.3 353.5 1764.8 6051.1
n = 5000 28.5 349.0 1800.4 7113.0
n = 6000 29.5 346.6 1848.1 7318.4
Bing, X., Bunea, F., Yang, N. and Wegkamp, M. (2017). Sparse latent factor models with
pure variables for overlapping clustering. arXiv: 1704.06977 .
Bittorf, V., Recht, B., Re, C. and Tropp, J. A. (2012). Factoring nonnegative matrices with
linear programs. arXiv:1206.1270 .
Blei, D. M. (2012). Introduction to probabilistic topic models. Communications of the ACM 55
77–84.
Blei, D. M. and Lafferty, J. D. (2007). A correlated topic model of science. Ann. Appl. Stat.
1 17–35.
Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of
Machine Learning Research 3 993–1022.
Cox, D. R. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference.
Journal of the Royal Statistical Society. Series B (Methodological) 49 1–39.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. and Harshman, R.
(1990). Indexing by latent semantic analysis. Journal of the American society for information
science 41 391–407.
Dheeru, D. and Karra Taniskidou, E. (2017). UCI machine learning repository.
Ding, W., Rohban, M. H., Ishwar, P. and Saligrama, V. (2013). Topic discovery through
data dependent and random projections. In Proceedings of the 30th International Conference
on Machine Learning (S. Dasgupta and D. McAllester, eds.), vol. 28 of Proceedings of Machine
Learning Research. PMLR, Atlanta, Georgia, USA.
URL http://proceedings.mlr.press/v28/ding13.html
Donoho, D. and Stodden, V. (2004). When does non-negative matrix factorization give a correct
decomposition into parts? In Advances in Neural Information Processing Systems 16 (S. Thrun,
L. K. Saul and P. B. Scholkopf, eds.). MIT Press, 1141–1148.
Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National
Academy of Sciences 101 5228–5235.
Hofmann, T. (1999). Probabilistic latent semantic indexing. Proceedings of the Twenty-Second
Annual International SIGIR Conference .
Ke, T. Z. and Wang, M. (2017). A new SVD approach to optimal topic estimation.
arXiv:1704.07016 .
Li, W. and McCallum, A. (2006). Pachinko allocation: Dag-structured mixture models of topic
correlations. In Proceedings of the 23rd International Conference on Machine Learning. ICML
2006, ACM, New York, NY, USA.
Mimno, D., Wallach, H. M., Talley, E., Leenders, M. and McCallum, A. (2011). Op-
timizing semantic coherence in topic models. In Proceedings of the Conference on Empirical
Methods in Natural Language Processing. EMNLP ’11, Association for Computational Linguis-
tics, Stroudsburg, PA, USA.
Papadimitriou, C. H., Raghavan, P., Tamaki, H. and Vempala, S. (2000). Latent semantic
indexing: A probabilistic analysis. Journal of Computer and System Sciences 61 217–235.
Papadimitriou, C. H., Tamaki, H., Raghavan, P. and Vempala, S. (1998). Latent semantic
indexing: A probabilistic analysis. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-
SIGART Symposium on Principles of Database Systems. PODS ’98, ACM, New York, NY, USA.
Riddell, A., Hopper, T. and Grivas, A. (2016). lda: 1.0.4.
URL https://doi.org/10.5281/zenodo.57927
Stevens, K., Kegelmeyer, P., Andrzejewski, D. and Buttler, D. (2012). Exploring topic
coherence over many models and many topics. In Proceedings of the 2012 Joint Conference
on Empirical Methods in Natural Language Processing and Computational Natural Language
Learning. Association for Computational Linguistics, Jeju Island, Korea.
Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer, New York.
Supplement
From the topic model specifications, the matrices Π, A and W are all scaled as
\[
\sum_{j=1}^p \Pi_{ji} = 1, \qquad \sum_{j=1}^p A_{jk} = 1, \qquad \sum_{k=1}^K W_{ki} = 1 \tag{52}
\]
for any 1 ≤ i ≤ n and 1 ≤ k ≤ K. In order to adjust their scales properly, we denote
\[
m_j = p\max_{1\le i\le n}\Pi_{ji}, \qquad \mu_j = \frac{p}{n}\sum_{i=1}^n\Pi_{ji}, \qquad \alpha_j = p\max_{1\le k\le K}A_{jk}, \qquad \gamma_k = \frac{K}{n}\sum_{i=1}^n W_{ki}, \tag{53}
\]
so that
\[
\sum_{k=1}^K \gamma_k = K, \qquad \sum_{j=1}^p \mu_j = p. \tag{54}
\]
We further denote $m_{\min} = \min_{1\le j\le p} m_j$ and $\mu_{\min} = \min_{1\le j\le p}\mu_j$.
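The scaling quantities in (53) and the identities in (54) can be checked numerically; a minimal sketch (names are ours):

```python
import numpy as np


def scale_quantities(Pi, A, W):
    """Compute the scaling quantities (53): m_j, mu_j, alpha_j and gamma_k
    for a p x n matrix Pi, a p x K matrix A and a K x n matrix W, all with
    unit column sums as in (52)."""
    p, n = Pi.shape
    K = A.shape[1]
    m = p * Pi.max(axis=1)            # m_j     = p * max_i Pi_ji
    mu = (p / n) * Pi.sum(axis=1)     # mu_j    = (p/n) * sum_i Pi_ji
    alpha = p * A.max(axis=1)         # alpha_j = p * max_k A_jk
    gamma = (K / n) * W.sum(axis=1)   # gamma_k = (K/n) * sum_i W_ki
    return m, mu, alpha, gamma
```

With column-stochastic inputs, the outputs satisfy sum(gamma) = K and sum(mu) = p as in (54), and mu_j ≤ m_j ≤ alpha_j entrywise as in (67) below.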
A Proofs of Section 2
Proof of Proposition 1. Recall that $Y_i := N_iX_i \sim \mathrm{Multinomial}_p(N_i;\Pi_i)$ for any $i\in[n]$. The joint
log-likelihood of $(Y_1,\ldots,Y_n)$ is
\[
\ell(Y_1,\ldots,Y_n) = \sum_{i=1}^n\log(N_i!) - \sum_{i=1}^n\sum_{j=1}^p\log(Y_{ji}!) + \sum_{i=1}^n\sum_{j=1}^p Y_{ji}\log\Pi_{ji}
= \sum_{i=1}^n\log(N_i!) - \sum_{i=1}^n\sum_{j=1}^p\log(Y_{ji}!) + \sum_{i=1}^n\sum_{j=1}^p Y_{ji}\log\Bigl(\sum_{k=1}^K A_{jk}W_{ki}\Bigr).
\]
Fix any $j\in[p]$, $k\in[K]$ and $i\in[n]$. It follows that
\[
\frac{\partial\ell(Y_1,\ldots,Y_n)}{\partial A_{jk}} =
\begin{cases}
\sum_{i=1}^n Y_{ji}W_{ki}\big/\bigl(\sum_{t=1}^K A_{jt}W_{ti}\bigr), & \text{if } A_{jk}\neq 0,\ W_{ki}\neq 0,\\
0, & \text{otherwise,}
\end{cases}
\]
from which we further deduce
\[
\frac{\partial^2\ell(Y_1,\ldots,Y_n)}{\partial A_{jk}\,\partial W_{ki}} =
\begin{cases}
\sum_{i=1}^n Y_{ji}\bigl(\sum_{t=1}^K A_{jt}W_{ti} - A_{jk}W_{ki}\bigr)\big/\bigl(\sum_{t=1}^K A_{jt}W_{ti}\bigr)^2, & \text{if } A_{jk}\neq 0,\ W_{ki}\neq 0,\\
0, & \text{otherwise.}
\end{cases}
\]
Since $\mathbb{E}[Y_{ji}] = N_i\Pi_{ji}$, taking expectations yields
\[
\mathbb{E}\Bigl[-\frac{\partial^2\ell(Y_1,\ldots,Y_n)}{\partial A_{jk}\,\partial W_{ki}}\Bigr] =
\begin{cases}
\sum_{i=1}^n N_i\bigl(\sum_{t\neq k}A_{jt}W_{ti}\bigr)\big/\bigl(\sum_{t=1}^K A_{jt}W_{ti}\bigr), & \text{if } A_{jk}\neq 0,\ W_{ki}\neq 0,\\
0, & \text{otherwise.}
\end{cases}
\tag{55}
\]
Similarly, for this $j$, $k$ and $i$ but with any $k'\neq k$, we have
\[
\frac{\partial^2\ell(Y_1,\ldots,Y_n)}{\partial A_{jk}\,\partial W_{k'i}} =
\begin{cases}
-\sum_{i=1}^n Y_{ji}A_{jk'}W_{ki}\big/\bigl(\sum_{t=1}^K A_{jt}W_{ti}\bigr)^2, & \text{if } A_{jk}\neq 0,\ A_{jk'}\neq 0,\ W_{k'i}\neq 0,\ W_{ki}\neq 0,\\
0, & \text{otherwise,}
\end{cases}
\]
and
\[
\mathbb{E}\Bigl[-\frac{\partial^2\ell(Y_1,\ldots,Y_n)}{\partial A_{jk}\,\partial W_{k'i}}\Bigr] =
\begin{cases}
\sum_{i=1}^n N_iA_{jk'}W_{ki}\big/\sum_{t=1}^K A_{jt}W_{ti}, & \text{if } A_{jk}\neq 0,\ A_{jk'}\neq 0,\ W_{k'i}\neq 0,\ W_{ki}\neq 0,\\
0, & \text{otherwise.}
\end{cases}
\tag{56}
\]
From (55) and (56), it is easy to see that condition (7) implies
\[
\mathbb{E}\Bigl[-\frac{\partial^2\ell(Y_1,\ldots,Y_n)}{\partial A_{jk}\,\partial W_{k'i}}\Bigr] = 0 \tag{57}
\]
for any $j\in[p]$, $k,k'\in[K]$ and $i\in[n]$. This proves the sufficiency. To show the necessity, we argue by
contradiction. Suppose (57) holds for any $j\in[p]$, $k,k'\in[K]$ and $i\in[n]$, and that there exist at least one
$j\in[p]$ and $i\in[n]$ such that $\mathrm{supp}(A_{j\cdot})\cap\mathrm{supp}(W_{\cdot i}) = \{k_1,k_2\}$ with $k_1\neq k_2$. Then, (56) implies
\[
\mathbb{E}\Bigl[-\frac{\partial^2\ell(Y_1,\ldots,Y_n)}{\partial A_{jk_1}\,\partial W_{k_2 i}}\Bigr] \ge \frac{N_iA_{jk_1}W_{k_2 i}}{A_{jk_1}W_{k_1 i} + A_{jk_2}W_{k_2 i}} \neq 0.
\]
This contradicts (7) and concludes the proof.
Proof of Proposition 2. Since the columns of $A$ sum up to 1, and Assumption 1 holds, the matrix $\bar A$ satisfies
\[
\bar A_{jk} \ge 0, \qquad \|\bar A_{j\cdot}\|_1 = 1, \qquad \text{for each } j = 1,\ldots,p \text{ and } k = 1,\ldots,K. \tag{58}
\]
Additionally, $\bar A$ has the same sparsity pattern as $A$, and thus $\bar A$ satisfies Assumption 1, with the
same $I$ and $I_k$, $k\in[K]$. We further notice that Assumption 3 is equivalent to
\[
|\langle W_{i\cdot}, W_{j\cdot}\rangle| < \|W_{i\cdot}\|_2^2 \wedge \|W_{j\cdot}\|_2^2, \qquad \text{for all } 1\le i<j\le K,
\]
which is in turn equivalent to $\nu > 0$, with $\nu$ defined in (23).

To finish the proof we invoke Theorem 1 in Bing et al. (2017), slightly adapted to our situation.
Specifically, we consider any matrix $R$ defined in (11) that factorizes as $R = \bar A C\bar A^\top$, where $\bar A$ satisfies
Assumption 1 and $C$ satisfies (23). Note that the quantities $M_i$ and $S_i$ defined on page 9 of Bing
et al. (2017) are replaced by, respectively, $T_i$ and $S_i$ in (12). We proceed to prove (a) and (b) of
Proposition 2.
36
Proof of (a). We first show the sufficiency part. Consider any $i\in[p]$ with $T_i = T_j$ for all
$j\in S_i$. Part (a) of Lemma 10, stated after the proof, shows that there exists a $j\in I_a\cap S_i$ for some
$a\in[K]$. For this $j\in I_a$, we have $T_j = C_{aa}$ from part (b) of Lemma 10. Invoking our premise
$T_j = T_i$ as $j\in S_i$, we conclude that $T_i = C_{aa}$, that is, $\max_{k\in[p]} R_{ik} = C_{aa}$. By Lemma 9, stated after
the proof, this maximum is achieved for any $i\in I_a$; however, if $i\notin I_a$, we have $R_{ik} < C_{aa}$ for
all $k\in[p]$. Hence $i\in I_a$, and this concludes the proof of the sufficiency part.

It remains to prove the necessity part. Let $i\in I_a$ for some $a\in[K]$ and $j\in S_i$. Lemma 10 implies
that $j\in I_a$ and $T_i = C_{aa}$. Since $j\in S_i$, we have $R_{ij} = T_i = C_{aa}$, while $j\in I_a$ yields $R_{jk}\le C_{aa}$ for
all $k\in[p]$, and $R_{jk} = C_{aa}$ for $k\in I_a$, as a result of Lemma 9. Hence, $T_j = \max_{k\in[p]} R_{jk} = C_{aa} = T_i$
for any $j\in S_i$, which proves our claim.
Proof of (b). The proof of (b) follows by the same arguments for proving part (b) of Theorem
1 in Bing et al. (2017). The proof is then complete.
The following two lemmas are used in the above proof. They are adapted from Lemmas 1 and
2 of the supplement of Bing et al. (2017).
Lemma 9. For any a ∈ [K] and i ∈ Ia, we have
(a) Rij = Caa for all j ∈ Ia,
(b) Rij < Caa for all j 6∈ Ia.
Proof. The proof follows the same argument for proving Lemma 1 in Bing et al. (2017).
Lemma 10. Let $T_i$ and $S_i$ be defined in (12). We have

(a) $S_i\cap I \neq \emptyset$, for any $i\in[p]$;

(b) $S_i\cup\{i\} = I_a$ and $T_i = C_{aa}$, for any $i\in I_a$ and $a\in[K]$.

Proof. Lemma 9 implies that, for any $i\in I_a$, $T_i = C_{aa}$ and $S_i\cup\{i\} = I_a$, which proves part (b). Part
(a) follows from the result of part (b) and the same arguments as in the proof of Lemma 2 in Bing
et al. (2017), replacing $M_i$ by $T_i$, $|\Sigma_{ij}|$ by $R_{ij}$, $A$ by $\bar A$ and $C$ by $C$.
B Error bounds of stochastic errors
In this section we present tight bounds on the stochastic error terms that are critical for our later
estimation rates. Recall that $\varepsilon_{ji} := X_{ji} - \Pi_{ji}$, for $1\le i\le n$ and $1\le j\le p$, and assume
$N_1 = \ldots = N_n = N$ for ease of presentation, since similar results for unequal $N_i$ can be derived by
the same arguments. The following results, Lemmas 13–15, control several terms related to the $\varepsilon_{ji}$
under the multinomial assumption (2). We start by stating the well-known Bernstein and Hoeffding
inequalities for bounded random variables, which are used in the sequel.
Lemma 11 (Bernstein's inequality for bounded random variables). For independent random vari-
ables $Y_1,\ldots,Y_n$ with bounded ranges $[-B,B]$ and zero means,
\[
\mathbb{P}\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n Y_i\Bigr| > x\Biggr\} \le 2\exp\Bigl(-\frac{n^2x^2/2}{v + nBx/3}\Bigr), \qquad \text{for any } x\ge 0,
\]
where $v\ge \mathrm{var}(Y_1+\ldots+Y_n)$.
Lemma 12 (Hoeffding’s inequality). Let Y1, . . . , Yn be independent random variables with E[Yi] =
0 and bounded by [ai, bi]: For any t ≥ 0, we have
P
∣∣∣∣∣n∑i=1
Yi
∣∣∣∣∣ > t
≤ 2 exp
(− 2t2∑n
i=1(bi − ai)2
).
Lemma 13. Assume $\mu_{\min}/p \ge 2\log M/(3N)$. With probability $1-2M^{-1}$,
\[
\frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| \le 2\sqrt{\frac{\mu_j\log M}{npN}}\Bigl(1+\sqrt{6}\,n^{-1/2}\Bigr), \qquad \text{uniformly in } 1\le j\le p.
\]

Proof. Fix $1\le j\le p$. From model (3), we know $NX_{ji}\sim\mathrm{Binomial}(N;\Pi_{ji})$. We express this
binomial random variable as a sum of i.i.d. Bernoulli random variables:
\[
\varepsilon_{ji} = X_{ji} - \Pi_{ji} = \frac{1}{N}\sum_{k=1}^N\bigl(B^j_{ik} - \Pi_{ji}\bigr) := \frac{1}{N}\sum_{k=1}^N Z^j_{ik}
\]
with $B^j_{ik}\sim\mathrm{Bernoulli}(\Pi_{ji})$, such that $N\sum_{i=1}^n\varepsilon_{ji} = \sum_{i=1}^n\sum_{k=1}^N Z^j_{ik}$. Note that $|Z^j_{ik}|\le 1$, $\mathbb{E}[Z^j_{ik}] = 0$
and $\mathbb{E}[(Z^j_{ik})^2] = \Pi_{ji}(1-\Pi_{ji})\le\Pi_{ji}$, for all $i\in[n]$ and $k\in[N]$. An application of Bernstein's
inequality (Lemma 11) to the $Z^j_{ik}$ with $v = N\sum_{i=1}^n\Pi_{ji} = Nn\mu_j/p$ and $B = 1$ gives
\[
\mathbb{P}\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| > t\Biggr\} \le 2\exp\Bigl(-\frac{n^2N^2t^2/2}{N\sum_{i=1}^n\Pi_{ji} + nNt/3}\Bigr), \qquad \text{for any } t>0.
\]
This implies, for all $t>0$,
\[
\mathbb{P}\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| > \sqrt{\frac{\mu_jt}{npN}} + \frac{t}{nN}\Biggr\} \le 2e^{-t/2}.
\]
Choosing $t = 4\log M$, and recalling that $M = n\vee N\vee p \ge p$, we find by the union bound
\[
\sum_{j=1}^p\mathbb{P}\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| > 2\sqrt{\frac{\mu_j\log M}{npN}} + \frac{4\log M}{nN}\Biggr\} \le 2pM^{-2} \le 2M^{-1}. \tag{59}
\]
Using $\mu_{\min}/p \ge 2\log M/(3N)$ concludes the proof.
Remark 13. By inspection of the proof of Lemma 13, if instead of the bound $\mathbb{E}[(Z^j_{ik})^2]\le\Pi_{ji}$ we
had used the crude bound $\mathbb{E}[(Z^j_{ik})^2]\le 1$, an application of Bernstein's inequality would yield
\[
\frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| \le c\sqrt{\frac{\log M}{nN}}, \qquad \text{uniformly in } 1\le j\le p.
\]
Summing over $1\le j\le p$ in the above display would further give
\[
\frac{1}{n}\sum_{j=1}^p\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| \le c\,p\sqrt{\frac{\log M}{nN}},
\]
and the right-hand side would be slower by an important $\sqrt p$ factor than what we obtain by
summing the bound in Lemma 13 over $1\le j\le p$, since
\[
2\sum_{j=1}^p\sqrt{\frac{\mu_j\log M}{npN}} \le 2\sqrt{\frac{\log M}{nN}}\sqrt{\sum_{j=1}^p\mu_j} = 2\sqrt{\frac{p\log M}{nN}}.
\]
In this last display, we used the Cauchy-Schwarz inequality in the first step and the constraint
(54) in the last equality. The bound of Lemma 13 is an important intermediate step for deriving
the final bounds of Theorem 7, and the simple calculations presented above show that the constraint of
unit column sums induced by model (3) permits a more refined control of the stochastic errors than
those previously considered. A larger impact of this refinement on the final rate of convergence is
illustrated in Remark 14, following the proof of Lemma 15, in which we control sums of the quadratic,
dependent terms $\varepsilon_{ji}\varepsilon_{\ell i}$.
Lemma 14. With probability $1-2M^{-1}$,
\[
\frac{1}{n}\Bigl|\sum_{i=1}^n\Pi_{\ell i}\varepsilon_{ji}\Bigr| \le \sqrt{\frac{6m_\ell\Theta_{j\ell}\log M}{npN}} + \frac{2m_\ell\log M}{npN}, \qquad \text{uniformly in } 1\le j,\ell\le p.
\]

Proof. As in the proof of Lemma 13, we write
\[
\Pi_{\ell i}\varepsilon_{ji} = \frac{1}{N}\sum_{k=1}^N\Pi_{\ell i}Z^j_{ik},
\]
such that $|\Pi_{\ell i}Z^j_{ik}|\le\Pi_{\ell i}\le m_\ell/p$ by (53), $\mathbb{E}[\Pi_{\ell i}Z^j_{ik}] = 0$ and $\mathbb{E}[\Pi_{\ell i}^2(Z^j_{ik})^2] = \Pi_{\ell i}^2\Pi_{ji}(1-\Pi_{ji}) \le m_\ell\Pi_{\ell i}\Pi_{ji}/p$, for all $i\in[n]$ and $k\in[N]$. Fix $1\le j,\ell\le p$ and recall that $\Theta_{j\ell} = n^{-1}\sum_{i=1}^n\Pi_{ji}\Pi_{\ell i}$.
Applying Bernstein's inequality to $\Pi_{\ell i}Z^j_{ik}$ with $v = nNm_\ell\Theta_{j\ell}/p$ and $B = m_\ell/p$ gives
\[
\mathbb{P}\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\Pi_{\ell i}\varepsilon_{ji}\Bigr| \ge t\Biggr\} \le 2\exp\Bigl(-\frac{n^2N^2t^2/2}{nNm_\ell\Theta_{j\ell}/p + nNm_\ell t/(3p)}\Bigr),
\]
which further implies
\[
\mathbb{P}\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\Pi_{\ell i}\varepsilon_{ji}\Bigr| > \sqrt{\frac{m_\ell\Theta_{j\ell}t}{npN}} + \frac{m_\ell t}{3npN}\Biggr\} \le 2e^{-t/2}.
\]
Taking the union bound over $1\le j,\ell\le p$, and choosing $t = 6\log M$, concludes the proof.
Lemma 15. If $\mu_{\min}/p \ge 2\log M/(3N)$, then with probability $1-4M^{-1}$,
\[
\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i} - \mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr| \le 12\sqrt{6}\sqrt{\Theta_{j\ell} + \frac{(\mu_j+\mu_\ell)\log M}{pN}}\sqrt{\frac{(\log M)^3}{nN^2}} + 4M^{-3}
\]
holds, uniformly in $1\le j,\ell\le p$.

Proof. For any $1\le j\le p$, recall that $\varepsilon_{ji} = X_{ji} - \Pi_{ji}$ and
\[
\varepsilon_{ji} = \frac{1}{N}\sum_{k=1}^N Z^j_{ik},
\]
where $Z^j_{ik} := B^j_{ik} - \Pi_{ji}$ and $B^j_{ik}\sim\mathrm{Bernoulli}(\Pi_{ji})$. By the arguments in the proof of Lemma 13, an application
of Bernstein's inequality to $Z^j_{ik}$ with $v = N\Pi_{ji}$ and $B = 1$ gives
\[
\mathbb{P}\{|\varepsilon_{ji}| > t\} \le 2\exp\Bigl(-\frac{Nt^2/2}{\Pi_{ji} + t/3}\Bigr), \qquad \text{for any } t>0,
\]
which yields
\[
|\varepsilon_{ji}| \le \sqrt{\frac{6\Pi_{ji}\log M}{N}} + \frac{2\log M}{N} := T_{ji}
\]
with probability greater than $1-2M^{-3}$. We define $Y_{ji} = \varepsilon_{ji}1_{S_{ji}}$ with $S_{ji} := \{|\varepsilon_{ji}|\le T_{ji}\}$ for each
$1\le j\le p$ and $1\le i\le n$, and $S := \cap_{j=1}^p\cap_{i=1}^n S_{ji}$. It follows that $\mathbb{P}(S_{ji})\ge 1-2M^{-3}$ for all $1\le i\le n$
and $1\le j\le p$, so that $\mathbb{P}(S)\ge 1-2M^{-1}$ as $M := n\vee p\vee N$. On the event $S$, we have
\[
\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i} - \mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr| \le \underbrace{\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(Y_{ji}Y_{\ell i} - \mathbb{E}[Y_{ji}Y_{\ell i}]\bigr)\Bigr|}_{T_1} + \underbrace{\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}] - \mathbb{E}[Y_{ji}Y_{\ell i}]\bigr)\Bigr|}_{T_2}.
\]
We first study $T_2$. By writing
\[
\mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}] = \mathbb{E}[Y_{ji}Y_{\ell i}] + \mathbb{E}\bigl[Y_{ji}\varepsilon_{\ell i}1_{S^c_{\ell i}}\bigr] + \mathbb{E}\bigl[\varepsilon_{ji}1_{S^c_{ji}}\varepsilon_{\ell i}\bigr],
\]
we have
\[
T_2 = \frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}] - \mathbb{E}[Y_{ji}Y_{\ell i}]\bigr)\Bigr|
\le \frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\mathbb{E}\bigl[Y_{ji}\varepsilon_{\ell i}1_{S^c_{\ell i}}\bigr] + \mathbb{E}\bigl[\varepsilon_{ji}1_{S^c_{ji}}\varepsilon_{\ell i}\bigr]\bigr)\Bigr|
\le \frac{1}{n}\sum_{i=1}^n\bigl(\mathbb{P}(S^c_{ji}) + \mathbb{P}(S^c_{\ell i})\bigr) \le 4M^{-3}, \tag{60}
\]
by using $|Y_{ji}|\le|\varepsilon_{ji}|\le 1$ in the second inequality.

Next we bound $T_1$. Since $|Y_{ji}|\le T_{ji}$, we know $-2T_{ji}T_{\ell i}\le Y_{ji}Y_{\ell i} - \mathbb{E}[Y_{ji}Y_{\ell i}]\le 2T_{ji}T_{\ell i}$ for all
$1\le i\le n$. Applying Hoeffding's inequality (Lemma 12) with $a_i = -2T_{ji}T_{\ell i}$ and $b_i = 2T_{ji}T_{\ell i}$ gives
\[
\mathbb{P}\Biggl\{\Bigl|\sum_{i=1}^n\bigl(Y_{ji}Y_{\ell i} - \mathbb{E}[Y_{ji}Y_{\ell i}]\bigr)\Bigr| \ge t\Biggr\} \le 2\exp\Bigl(-\frac{t^2}{8\sum_{i=1}^n T^2_{ji}T^2_{\ell i}}\Bigr).
\]
Taking $t = \bigl(24\sum_{i=1}^n T^2_{ji}T^2_{\ell i}\log M\bigr)^{1/2}$ yields
\[
T_1 = \frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(Y_{ji}Y_{\ell i} - \mathbb{E}[Y_{ji}Y_{\ell i}]\bigr)\Bigr| \le 2\sqrt{6}\Bigl(\frac{1}{n}\sum_{i=1}^n T^2_{ji}T^2_{\ell i}\cdot\frac{\log M}{n}\Bigr)^{1/2} \tag{61}
\]
with probability greater than $1-2M^{-3}$. Finally, note that
\[
\frac{1}{n}\sum_{i=1}^n T^2_{ji}T^2_{\ell i} = \frac{1}{n}\sum_{i=1}^n\Bigl(\Pi_{ji}\Pi_{\ell i}\frac{36(\log M)^2}{N^2} + \Bigl(\frac{2\log M}{N}\Bigr)^4 + (\Pi_{ji}+\Pi_{\ell i})\frac{24(\log M)^3}{N^3}\Bigr)
= 36\Theta_{j\ell}\Bigl(\frac{\log M}{N}\Bigr)^2 + 16\Bigl(\frac{\log M}{N}\Bigr)^4 + 24\frac{(\mu_j+\mu_\ell)(\log M)^3}{pN^3} \tag{62}
\]
by using (53) and $\Theta_{j\ell} = (n^{-1}\Pi\Pi^\top)_{j\ell}$ in the second equality. Finally, combining (60)–(62) and using
$\mu_{\min}/p \ge 2\log M/(3N)$ concludes the proof.
Remark 14. We illustrate the improvement of our result over a simple application of the Hanson-Wright
inequality. Write
\[
4\varepsilon_{ji}\varepsilon_{\ell i} = (\varepsilon_{ji}+\varepsilon_{\ell i})^2 - (\varepsilon_{ji}-\varepsilon_{\ell i})^2
\]
for each $i\in[n]$. Since $\varepsilon_{ji} = N^{-1}\sum_{k=1}^N Z^j_{ik}$ and $\|\varepsilon_{ji}\pm\varepsilon_{\ell i}\|_{\psi_2}\le 2/\sqrt N$, a direct application of the
Hanson-Wright inequality to the two terms on the right-hand side gives
\[
\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i} - \mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr| \le c\sqrt{\frac{\log M}{nN}}
\]
with high probability. Summing over $1\le j\le p$ and $1\le\ell\le p$ further yields
\[
\sum_{j,\ell=1}^p\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i} - \mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr| \le c\,p^2\sqrt{\frac{\log M}{nN}}. \tag{63}
\]
By contrast, summing the first term in Lemma 15 yields
\[
\sum_{j,\ell=1}^p\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i} - \mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr| \le c\sqrt{\frac{p^2(\log M)^3}{nN^2}}\sqrt{\sum_{1\le j,\ell\le p}\Theta_{j\ell}} + c\sqrt{\frac{p^3(\log M)^4}{nN^3}}
= c\sqrt{\frac{p^2(\log M)^3}{nN^2}} + c\sqrt{\frac{p^3(\log M)^4}{nN^3}},
\]
by using the Cauchy-Schwarz inequality in the first inequality and (52) in the last equality. This is
faster than the result in (63) by a factor of $(p\sqrt N)\wedge(N\sqrt p)$, after ignoring logarithmic terms.
C Proofs of Section 4
Throughout this section, we define the event $\mathcal{E} := \mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$ by
\[
\mathcal{E}_1 = \bigcap_{j=1}^p\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| \le 2\bigl(1+\sqrt{6/n}\bigr)\sqrt{\frac{\mu_j\log M}{npN}}\Biggr\},
\]
\[
\mathcal{E}_2 = \bigcap_{j,\ell=1}^p\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\Pi_{\ell i}\varepsilon_{ji}\Bigr| \le \sqrt{\frac{6m_\ell\Theta_{j\ell}\log M}{npN}} + \frac{2m_\ell\log M}{npN}\Biggr\},
\]
\[
\mathcal{E}_3 = \bigcap_{j,\ell=1}^p\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i} - \mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr| \le 12\sqrt{6}\sqrt{\Theta_{j\ell} + \frac{(\mu_j+\mu_\ell)\log M}{pN}}\sqrt{\frac{(\log M)^3}{nN^2}} + 4M^{-3}\Biggr\}.
\]
Recalling (53) and (54), if
\[
\min_{1\le j\le p}\frac{1}{n}\sum_{i=1}^n\Pi_{ji} \ge \frac{2\log M}{3N}
\]
holds, then
\[
\frac{1}{p} \ge \frac{\mu_{\min}}{p} = \min_{1\le j\le p}\frac{1}{n}\sum_{i=1}^n\Pi_{ji} \ge \frac{2\log M}{3N}. \tag{64}
\]
Therefore, invoking Lemmas 13–15 yields $\mathbb{P}(\mathcal{E})\ge 1-8M^{-1}$.
C.1 Preliminaries

From the model specifications (52) and (53), we first record some useful expressions which are repeatedly
invoked later.

(a) For any $j\in[p]$, by using (53),
\[
\mu_j = \frac{p}{n}\sum_{i=1}^n\Pi_{ji} = \frac{p}{n}\sum_{i=1}^n\sum_{k=1}^K A_{jk}W_{ki} = \frac{p}{K}\sum_{k=1}^K A_{jk}\gamma_k
\;\Rightarrow\; \frac{p}{K}\sum_{k=1}^K A_{jk}\cdot\min_{k\in[K]}\gamma_k \le \mu_j \le \alpha_j. \tag{65}
\]
In particular, for any $j\in I_k$ with any $k\in[K]$,
\[
\mu_j = \frac{p}{n}\sum_{i=1}^n\sum_{k=1}^K A_{jk}W_{ki} = \frac{p}{K}A_{jk}\gamma_k \overset{(53)}{=} \frac{\alpha_j\gamma_k}{K}. \tag{66}
\]

(b) For any $j\in[p]$,
\[
m_j \overset{(53)}{=} p\max_{1\le i\le n}\Pi_{ji} = p\max_{1\le i\le n}\sum_{k=1}^K A_{jk}W_{ki} \le p\max_{1\le k\le K}A_{jk} \overset{(53)}{=} \alpha_j
\;\Rightarrow\; \mu_j \le m_j \le \alpha_j, \tag{67}
\]
by using $0\le W_{ki}\le 1$ and $\sum_k W_{ki} = 1$ for any $i\in[n]$.
C.2 Control of $\hat\Theta - \Theta$ and $\hat R - R$

Proposition 16. Under model (3), assume (26). Let $\hat\Theta$ and $\hat R$ be defined in (20) and (21),
respectively. Then $\hat\Theta$ is an unbiased estimator of $\Theta$. Moreover, with probability greater than
$1-8M^{-1}$,
\[
|\hat\Theta_{j\ell} - \Theta_{j\ell}| \le \eta_{j\ell}, \qquad |\hat R_{j\ell} - R_{j\ell}| \le \delta_{j\ell}
\]
for all $1\le j,\ell\le p$, where
\[
\eta_{j\ell} := 3\sqrt{6}\Bigl(\sqrt{\frac{m_j}{p}} + \sqrt{\frac{m_\ell}{p}}\Bigr)\sqrt{\frac{\Theta_{j\ell}\log M}{nN}} + \frac{2(m_j+m_\ell)}{p}\,\frac{\log M}{nN} + 31(1+\kappa_1)\sqrt{\frac{\mu_j+\mu_\ell}{p}\,\frac{(\log M)^4}{nN^3}} + \kappa_2 \tag{68}
\]
and
\[
\delta_{j\ell} := (1+\kappa_1\kappa_3)\frac{p^2}{\mu_j\mu_\ell}\eta_{j\ell} + \kappa_3\frac{p^2\Theta_{j\ell}}{\mu_j\mu_\ell}\Bigl(\sqrt{\frac{p}{\mu_j}} + \sqrt{\frac{p}{\mu_\ell}}\Bigr)\sqrt{\frac{\log M}{nN}}, \tag{69}
\]
with $\kappa_1 = \sqrt{6/n}$, $\kappa_2 = 4/M^3$ and
\[
\kappa_3 = \frac{2(1+\kappa_1)}{(1-\kappa_1-\kappa_1^2)^2}.
\]
Remark 15. For ease of presentation, we assumed that the document lengths are equal, that is,
$N_i = N$ for all $i\in[n]$. By inspection of the proofs of Lemmas 13–15 and Proposition 16, we may
allow for unequal document lengths $N_i$ by adjusting the quantities $\eta_{j\ell}$ and $\delta_{j\ell}$ to
\[
\eta_{j\ell} := 3\sqrt{6}\bigl(\sqrt{m_j}+\sqrt{m_\ell}\bigr)\sqrt{\frac{\log M}{np}}\Bigl(\frac{1}{n}\sum_{i=1}^n\frac{\Pi_{ji}\Pi_{\ell i}}{N_i}\Bigr)^{1/2} + \frac{2(m_j+m_\ell)\log M}{np}\Bigl(\frac{1}{n}\sum_{i=1}^n\frac{1}{N_i}\Bigr)
+ 31(1+\kappa_1)\sqrt{\frac{(\log M)^4}{n}}\Bigl(\frac{1}{n}\sum_{i=1}^n\frac{\Pi_{ji}+\Pi_{\ell i}}{N_i^3}\Bigr)^{1/2} + \kappa_2 \tag{70}
\]
\[
\delta_{j\ell} := (1+\kappa_1\kappa_3)\frac{p^2\eta_{j\ell}}{\mu_j\mu_\ell} + \kappa_3\frac{p^3\Theta_{j\ell}}{\mu_j\mu_\ell}\sqrt{\frac{\log M}{n}}\Biggl[\Bigl(\frac{1}{n}\sum_{i=1}^n\frac{\Pi_{ji}}{\mu_j^2N_i}\Bigr)^{1/2} + \Bigl(\frac{1}{n}\sum_{i=1}^n\frac{\Pi_{\ell i}}{\mu_\ell^2N_i}\Bigr)^{1/2}\Biggr]. \tag{71}
\]
The quantities $m_j$ and $\mu_j$ appearing in the above rates are related to $\Pi$ and can be directly
estimated from $X$. Let
\[
\frac{\hat m_j}{p} = \max_{1\le i\le n}X_{ji}, \qquad \frac{\hat\mu_j}{p} = \frac{1}{n}\sum_{i=1}^n X_{ji}. \tag{72}
\]
The following corollary gives data-dependent bounds on $\hat\Theta - \Theta$ and $\hat R - R$.

Corollary 17. Under the same conditions as Proposition 16, with probability greater than
$1-8M^{-1}$, we have
\[
|\hat\Theta_{j\ell} - \Theta_{j\ell}| \le \hat\eta_{j\ell}, \qquad |\hat R_{j\ell} - R_{j\ell}| \le \hat\delta_{j\ell}, \qquad \text{for all } 1\le j,\ell\le p.
\]
The quantities $\hat\eta_{j\ell}$ have the same form as (70) and (71), except that $\Theta_{j\ell}$, $m_j/p$ and
$\mu_j/p$ are replaced by $\hat\Theta_{j\ell}+\kappa_5$, $\hat m_j/p+\kappa_4$ and $\hat\mu_j/p+\kappa_5$, respectively, with $\kappa_4 = O(\sqrt{\log M/N})$ and $\kappa_5 = O(\sqrt{\log M/(nN)})$. Similarly, $\hat\delta_{j\ell}$ is obtained in the same way by replacing $\eta_{j\ell}$, $(\mu_j/p)^{-1}$ and
$\Theta_{j\ell}$ by $\hat\eta_{j\ell}$, $(\hat\mu_j/p-\kappa_5)^{-1}$ and $\hat\Theta_{j\ell}+\kappa_5$, respectively.
Proof of Proposition 16. Throughout the proof, we work on the event $\mathcal{E}$. Write $X = (X_1,\ldots,X_n) \in \mathbb{R}^{p\times n}$, and similarly $\varepsilon = (\varepsilon_1,\ldots,\varepsilon_n)$ and $W = (W_1,\ldots,W_n)$. We first show that $\mathbb{E}[\hat\Theta] = \Theta$.
Recall that $X_i = AW_i + \varepsilon_i$ with
\[
\mathbb{E}X_i = AW_i, \qquad \mathrm{Cov}(X_i) = \frac{1}{N_i}\mathrm{diag}(AW_i) - \frac{1}{N_i}AW_iW_i^\top A^\top.
\]
This gives
\[
\mathbb{E}[\hat\Theta] = \frac{1}{n}\sum_{i=1}^n\Bigl[\frac{N_i}{N_i-1}\mathbb{E}[X_iX_i^\top] - \frac{1}{N_i-1}\mathrm{diag}(\mathbb{E}X_i)\Bigr] = \frac{1}{n}\sum_{i=1}^n AW_iW_i^\top A^\top = \Theta.
\]
Next we bound the entry-wise error of $\hat\Theta - \Theta$. Observe that
\[
\hat\Theta = \frac{1}{n}\sum_{i=1}^n\Bigl[\frac{N_i}{N_i-1}(AW_i+\varepsilon_i)(AW_i+\varepsilon_i)^\top - \frac{1}{N_i-1}\mathrm{diag}(AW_i+\varepsilon_i)\Bigr]
\]
\[
= \frac{1}{n}\sum_{i=1}^n\Bigl[\frac{N_i}{N_i-1}\bigl(AW_iW_i^\top A^\top + AW_i\varepsilon_i^\top + \varepsilon_i(AW_i)^\top + \varepsilon_i\varepsilon_i^\top\bigr) - \frac{1}{N_i-1}\mathrm{diag}(AW_i+\varepsilon_i)\Bigr]
\]
\[
= \frac{1}{n}\sum_{i=1}^n\Bigl[AW_iW_i^\top A^\top + \frac{N_i}{N_i-1}\bigl(AW_i\varepsilon_i^\top + \varepsilon_i(AW_i)^\top\bigr) - \frac{\mathrm{diag}(\varepsilon_i)}{N_i-1} + \frac{N_i}{N_i-1}\bigl(\varepsilon_i\varepsilon_i^\top - \mathbb{E}[\varepsilon_i\varepsilon_i^\top]\bigr)\Bigr].
\]
The third equality comes from the fact that
\[
\mathbb{E}[\varepsilon_i\varepsilon_i^\top] = \frac{1}{N_i}\mathrm{diag}(AW_i) - \frac{1}{N_i}AW_iW_i^\top A^\top.
\]
Recall that $\Theta = n^{-1}\sum_{i=1}^n AW_iW_i^\top A^\top$. We have
\[
\bigl|\hat\Theta_{j\ell} - \Theta_{j\ell}\bigr| \le \underbrace{\Bigl|\frac{1}{n}\sum_{i=1}^n\bigl(AW_i\varepsilon_i^\top + \varepsilon_i(AW_i)^\top\bigr)_{j\ell}\Bigr|}_{T_1}
+ \underbrace{\Bigl|\frac{1}{n}\sum_{i=1}^n\frac{1}{N_i}\bigl(\mathrm{diag}(\varepsilon_i)\bigr)_{j\ell}\Bigr|}_{T_2}
+ \underbrace{\Bigl|\frac{1}{n}\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i} - \mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr|}_{T_3}. \tag{73}
\]
It remains to bound $T_1$, $T_2$ and $T_3$. Fix $1\le j,\ell\le p$. For $T_1$, we have
\[
T_1 \le \frac{1}{n}\Bigl|\sum_{i=1}^n\Pi_{ji}\varepsilon_{\ell i}\Bigr| + \frac{1}{n}\Bigl|\sum_{i=1}^n\Pi_{\ell i}\varepsilon_{ji}\Bigr|
\overset{\mathcal{E}}{\le} (\sqrt{m_j}+\sqrt{m_\ell})\sqrt{\frac{6\Theta_{j\ell}\log M}{npN}} + \frac{2(m_j+m_\ell)\log M}{npN}. \tag{74}
\]
For $T_2$, we have
\[
T_2 = \frac{1}{nN}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| \overset{\mathcal{E}}{\le} 2(1+\kappa_1)\sqrt{\frac{\mu_j\log M}{npN^3}}, \tag{75}
\]
if $j = \ell$, while $(T_2)_{j\ell} = 0$ if $j\neq\ell$. Finally, for $T_3$, we obtain
\[
T_3 \le \frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i} - \mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr|
\overset{\mathcal{E}}{\le} 12\sqrt{6}\sqrt{\Theta_{j\ell} + \frac{(\mu_j+\mu_\ell)\log M}{pN}}\sqrt{\frac{(\log M)^3}{nN^2}} + \kappa_2
\le 12\sqrt{\frac{6\Theta_{j\ell}(\log M)^3}{nN^2}} + 12\sqrt{\frac{6(\mu_j+\mu_\ell)(\log M)^4}{npN^3}} + \kappa_2. \tag{76}
\]
Since (26) implies
\[
\frac{m_{\min}}{p} = \min_{1\le j\le p}\max_{1\le i\le n}\Pi_{ji} \ge \frac{(3\log M)^2}{N} \tag{77}
\]
by recalling (53), we have
\[
12\sqrt{\frac{6\Theta_{j\ell}(\log M)^3}{nN^2}} + (\sqrt{m_j}+\sqrt{m_\ell})\sqrt{\frac{6\Theta_{j\ell}\log M}{npN}} \le 3\sqrt{6}\,(\sqrt{m_j}+\sqrt{m_\ell})\sqrt{\frac{\Theta_{j\ell}\log M}{npN}}.
\]
In addition,
\[
2(1+\kappa_1)\sqrt{\frac{\mu_j\log M}{npN^3}}\,1_{\{j=\ell\}} + 12\sqrt{\frac{6(\mu_j+\mu_\ell)(\log M)^4}{npN^3}} \le 31(1+\kappa_1)\sqrt{\frac{(\mu_j+\mu_\ell)(\log M)^4}{npN^3}}.
\]
Combining (74)–(76) concludes the desired rate for $\hat\Theta - \Theta$.
To prove the rate of $\hat R - R$, recall that $R = (nD_\Pi^{-1})\Theta(nD_\Pi^{-1})$. Fix $1\le j,\ell\le p$. By using the
diagonal structure of $D_X$ and $D_\Pi$, it follows that
\[
\bigl[(nD_X^{-1})\hat\Theta(nD_X^{-1}) - R\bigr]_{j\ell}
= \underbrace{n^2\bigl(D_X^{-1}-D_\Pi^{-1}\bigr)_{jj}\hat\Theta_{j\ell}\bigl(D_X^{-1}\bigr)_{\ell\ell}}_{T_4}
+ \underbrace{n^2\bigl(D_\Pi^{-1}\bigr)_{jj}\hat\Theta_{j\ell}\bigl(D_X^{-1}-D_\Pi^{-1}\bigr)_{\ell\ell}}_{T_5}
+ \underbrace{n^2\bigl(D_\Pi^{-1}\bigr)_{jj}\bigl(\hat\Theta-\Theta\bigr)_{j\ell}\bigl(D_\Pi^{-1}\bigr)_{\ell\ell}}_{T_6}.
\]
We first quantify the term $n(D_X^{-1}-D_\Pi^{-1})$. From the definitions,
\[
n\bigl|\bigl(D_X^{-1}-D_\Pi^{-1}\bigr)_{jj}\bigr|
= n\Bigl|\frac{1}{\sum_{i=1}^n\Pi_{ji} + \sum_{i=1}^n\varepsilon_{ji}} - \frac{1}{\sum_{i=1}^n\Pi_{ji}}\Bigr|
\overset{(53)}{\le} \frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr|\Big/\Biggl(\frac{\mu_j}{p}\Bigl|\frac{\mu_j}{p}+\frac{1}{n}\sum_{i=1}^n\varepsilon_{ji}\Bigr|\Biggr)
\overset{\mathcal{E}}{\le} \frac{2(1+\kappa_1)}{1-\kappa_1(1+\kappa_1)}\cdot\frac{p}{\mu_j}\sqrt{\frac{p\log M}{\mu_jnN}}, \tag{78}
\]
where the last inequality uses
\[
\Bigl|\frac{\mu_j}{p}+\frac{1}{n}\sum_{i=1}^n\varepsilon_{ji}\Bigr|
\overset{\mathcal{E}}{\ge} \frac{\mu_j}{p} - 2(1+\kappa_1)\sqrt{\frac{\mu_j\log M}{npN}}
= \frac{\mu_j}{p}\Biggl(1-2(1+\kappa_1)\sqrt{\frac{p\log M}{\mu_jnN}}\Biggr)
\overset{(64)}{\ge} \frac{\mu_j}{p}\Biggl(1-2(1+\kappa_1)\sqrt{\frac{3}{2n}}\Biggr)
= \frac{\mu_j}{p}\bigl(1-\kappa_1(1+\kappa_1)\bigr),
\]
by recalling that $\kappa_1 = \sqrt{6/n}$. Since
\[
\bigl(D_\Pi^{-1}\bigr)_{jj} = \frac{1}{\sum_{i=1}^n(AW)_{ji}} = \frac{1}{\sum_{i=1}^n\Pi_{ji}} = \frac{p}{n\mu_j}, \tag{79}
\]
combined with (78), we find
\[
n\bigl|\bigl(D_X^{-1}\bigr)_{jj}\bigr| \le \frac{p}{\mu_j}\Biggl(1+\frac{2(1+\kappa_1)}{1-\kappa_1(1+\kappa_1)}\sqrt{\frac{p\log M}{\mu_jnN}}\Biggr)
\overset{(64)}{\le} \Bigl(1+\frac{\kappa_1(1+\kappa_1)}{1-\kappa_1(1+\kappa_1)}\Bigr)\frac{p}{\mu_j}
= \frac{1}{1-\kappa_1(1+\kappa_1)}\cdot\frac{p}{\mu_j}. \tag{80}
\]
Finally, since $|\hat\Theta_{j\ell}| \le \Theta_{j\ell} + |\hat\Theta_{j\ell}-\Theta_{j\ell}|$, combining (79) and (80) gives
\[
|T_4|+|T_5| \le \frac{2(1+\kappa_1)}{(1-\kappa_1(1+\kappa_1))^2}\cdot\frac{p^2}{\mu_j\mu_\ell}\Bigl(\sqrt{\frac{p}{\mu_j}}+\sqrt{\frac{p}{\mu_\ell}}\Bigr)\sqrt{\frac{\log M}{nN}}\bigl(\Theta_{j\ell}+|\hat\Theta_{j\ell}-\Theta_{j\ell}|\bigr),
\qquad
|T_6| = \frac{p^2}{\mu_j\mu_\ell}\,|\hat\Theta_{j\ell}-\Theta_{j\ell}|.
\]
Collecting these bounds for $T_4$, $T_5$ and $T_6$ and using (64) again yields
\[
|(\hat R-R)_{j\ell}| \le (1+\kappa_1\kappa_3)\frac{p^2}{\mu_j\mu_\ell}|\hat\Theta_{j\ell}-\Theta_{j\ell}| + \kappa_3\frac{p^2\Theta_{j\ell}}{\mu_j\mu_\ell}\Bigl(\sqrt{\frac{p}{\mu_j}}+\sqrt{\frac{p}{\mu_\ell}}\Bigr)\sqrt{\frac{\log M}{nN}},
\]
with
\[
\kappa_3 = \frac{2(1+\kappa_1)}{(1-\kappa_1-\kappa_1^2)^2}.
\]
This completes the proof of Proposition 16.
Proof of Corollary 17. It suffices to show that, on the event $\mathcal{E}$,
\[
|\hat\Theta_{j\ell}-\Theta_{j\ell}| = O\Bigl(\sqrt{\tfrac{\log M}{nN}}\Bigr), \qquad
\frac{|\hat m_j-m_j|}{p} = O\Bigl(\sqrt{\tfrac{\log M}{N}}\Bigr), \qquad
\frac{|\hat\mu_j-\mu_j|}{p} = O\Bigl(\sqrt{\tfrac{\log M}{nN}}\Bigr),
\]
for all $j,\ell\in[p]$. Recall the definitions in (53). Since
\[
\frac{\mu_j}{p} = \frac{1}{n}\sum_{i=1}^n\Pi_{ji} \le \max_{1\le i\le n}\Pi_{ji} = \frac{m_j}{p} \le 1, \qquad
\Theta_{j\ell} = \frac{1}{n}\sum_{i=1}^n\Pi_{ji}\Pi_{\ell i} \le 1, \tag{81}
\]
for any $j,\ell\in[p]$, display (68) implies
\[
\eta_{j\ell} \le 3\sqrt{\frac{6\log M}{nN}} + \frac{2\log M}{nN} + 31(1+\kappa_1)\sqrt{\frac{2(\log M)^4}{nN^3}} + \kappa_2 = O\bigl(\sqrt{\log M/(nN)}\bigr).
\]
In addition,
\[
\frac{|\hat\mu_j-\mu_j|}{p} = \frac{1}{n}\Bigl|\sum_{i=1}^n(X_{ji}-\Pi_{ji})\Bigr| = \frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr|
\overset{\mathcal{E}}{\le} 2(1+\kappa_1)\sqrt{\frac{\mu_j\log M}{npN}}
\overset{(81)}{\le} 2(1+\kappa_1)\sqrt{\frac{\log M}{nN}} = O\Bigl(\sqrt{\frac{\log M}{nN}}\Bigr).
\]
Finally, we show $|\hat m_j-m_j|/p = O(\sqrt{\log M/N})$. Fix any $j\in[p]$ and define
\[
i^* := \arg\max_{1\le i\le n}\Pi_{ji}, \qquad i' := \arg\max_{1\le i\le n}X_{ji}.
\]
Thus, from the definitions (53) and (72), we have
\[
\frac{|\hat m_j-m_j|}{p} = |X_{ji'}-\Pi_{ji^*}|
\le |X_{ji'}-\Pi_{ji'}| + \Pi_{ji^*}-\Pi_{ji'}
\le 2|X_{ji'}-\Pi_{ji'}| + |X_{ji^*}-\Pi_{ji^*}| + X_{ji^*}-X_{ji'}
\le 2|\varepsilon_{ji'}| + |\varepsilon_{ji^*}|,
\]
using the definition of $i'$ in the last step. From the proof of Lemma 15, we conclude that, on the event
$S_{ji}$ (which holds with probability at least $1-2M^{-3}$),
\[
\frac{|\hat m_j-m_j|}{p} \le 2|\varepsilon_{ji'}| + |\varepsilon_{ji^*}|
\le 3\sqrt{\frac{6\Pi_{ji^*}\log M}{N}} + \frac{6\log M}{N}
\overset{(81)}{=} O\Bigl(\sqrt{\frac{\log M}{N}}\Bigr).
\]
This completes the proof.
The following corollary provides the expressions of $\delta_{j\ell}$ and $\eta_{j\ell}$ under condition (27).

Corollary 18. Under model (3) and (27), with probability greater than $1-O(M^{-1})$,
\[
|\hat\Theta_{j\ell}-\Theta_{j\ell}| \le c_0\eta_{j\ell}, \qquad |\hat R_{j\ell}-R_{j\ell}| \le c_1\delta_{j\ell}, \qquad \text{for all } 1\le j,\ell\le p,
\]
for some constants $c_0, c_1 > 0$, where
\[
\eta_{j\ell} = \sqrt{\frac{\Theta_{j\ell}\log M}{nN}}\sqrt{\frac{m_j+m_\ell}{p}\vee\frac{\log^2M}{N}} + \frac{2(m_j+m_\ell)}{p}\frac{\log M}{nN} + \sqrt{\frac{\log^4M}{nN^3}}\sqrt{\frac{\mu_j+\mu_\ell}{p}\vee\frac{\log M}{N}} \tag{82}
\]
and
\[
\delta_{j\ell} := \frac{p^2\eta_{j\ell}}{\mu_j\mu_\ell} + \frac{p^2\Theta_{j\ell}}{\mu_j\mu_\ell}\Bigl(\sqrt{\frac{p}{\mu_j}} + \sqrt{\frac{p}{\mu_\ell}}\Bigr)\sqrt{\frac{\log M}{nN}}. \tag{83}
\]

Proof. We define the event $\mathcal{E}' := \mathcal{E}'_1\cap\mathcal{E}_2\cap\mathcal{E}'_3$ with
\[
\mathcal{E}'_1 = \bigcap_{j=1}^p\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\varepsilon_{ji}\Bigr| \lesssim \sqrt{\frac{\mu_j\log M}{npN}}\Biggr\},
\]
\[
\mathcal{E}_2 = \bigcap_{j,\ell=1}^p\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\Pi_{\ell i}\varepsilon_{ji}\Bigr| \le \sqrt{\frac{6m_\ell\Theta_{j\ell}\log M}{npN}} + \frac{2m_\ell\log M}{npN}\Biggr\},
\]
\[
\mathcal{E}'_3 = \bigcap_{j,\ell=1}^p\Biggl\{\frac{1}{n}\Bigl|\sum_{i=1}^n\bigl(\varepsilon_{ji}\varepsilon_{\ell i}-\mathbb{E}[\varepsilon_{ji}\varepsilon_{\ell i}]\bigr)\Bigr| \lesssim \sqrt{\frac{\Theta_{j\ell}\log^3M}{nN^2}} + \sqrt{\frac{\log^5M}{nN^4}} + \sqrt{\frac{(\mu_j+\mu_\ell)\log^4M}{npN^3}}\Biggr\}.
\]
From display (59) in the proof of Lemma 13 and condition (27), one has $\mathbb{P}(\mathcal{E}'_1)\ge 1-O(M^{-1})$.
Further invoking Lemma 14 and (60)–(62) in Lemma 15 yields $\mathbb{P}(\mathcal{E}')\ge 1-O(M^{-1})$.

We proceed to work on $\mathcal{E}'$. The bound on $|\hat\Theta_{j\ell}-\Theta_{j\ell}|$ requires upper bounds on $T_1$, $T_2$ and $T_3$
defined in (73). From (74)–(76) and by invoking $\mathcal{E}'$, after a bit of algebra, we have
\[
|\hat\Theta_{j\ell}-\Theta_{j\ell}| \lesssim \sqrt{\frac{\Theta_{j\ell}\log M}{nN}}\sqrt{\frac{\log^2M}{N}\vee\frac{m_j+m_\ell}{p}} + \frac{(m_j+m_\ell)\log M}{npN} + \sqrt{\frac{\log^4M}{nN^3}}\sqrt{\frac{\log M}{N}\vee\frac{\mu_j+\mu_\ell}{p}}.
\]
Finally, the bound on $|\hat R_{j\ell}-R_{j\ell}|$ follows from the same arguments as in the proof of Proposition 16,
invoking condition (27) instead of (64).
C.3 Proof of Theorem 4

We start by stating and proving the following lemma, which is crucial for the proof of Theorem 4. Recall that
\[
J_1^a := \{j \in [p] : A_{ja} \ge 1 - 4\delta/\nu\}, \quad \text{for all } a \in [K].
\]
Let
\[
\widehat a_i := \operatorname{argmax}_{1\le j\le p}\widehat R_{ij}, \quad \text{for any } i \in [p].
\]

Lemma 19. Under the conditions in Theorem 4, for any $i \in I_a$ with some $a \in [K]$, the following inequalities hold on the event $\mathcal{E}$:
\[
\big|\widehat R_{ij} - \widehat R_{ik}\big| \le \delta_{ij} + \delta_{ik}, \quad \text{for all } j, k \in I_a; \tag{84}
\]
\[
\widehat R_{ij} - \widehat R_{ik} > \delta_{ij} + \delta_{ik}, \quad \text{for all } j \in I_a,\ k \notin I_a\cup J_1^a; \tag{85}
\]
\[
\widehat R_{ij} - \widehat R_{ik} < \delta_{ij} + \delta_{ik}, \quad \text{for all } j \in J_1^a \text{ and } k \in I_a. \tag{86}
\]
For any $i \in J_1^a$, we have
\[
\widehat R_{i\widehat a_i} - \widehat R_{ij} \le \delta_{i\widehat a_i} + \delta_{ij}, \quad \text{for any } j \in I_a. \tag{87}
\]
Proof. We work on the event $\mathcal{E}$ so that, in particular, $|\widehat R_{j\ell} - R_{j\ell}| \le \delta_{j\ell}$ for all $1 \le j, \ell \le p$.

To prove (84), fix $i \in I_a$ and $j, k \in I_a$ with some $a \in [K]$. Since $R = ACA^\top$, we have $R_{ij} = R_{ik} = C_{aa}$ and
\[
|\widehat R_{ij} - \widehat R_{ik}| \le |\widehat R_{ij} - R_{ij}| + |\widehat R_{ik} - R_{ik}| \overset{\mathcal E}{\le} \delta_{ij} + \delta_{ik}.
\]
To prove (85), fix $i, j \in I_a$ and $k \in [p]\setminus I_a$. On the one hand,
\[
\widehat R_{ik} \overset{\mathcal E}{\le} \sum_{b=1}^KA_{kb}C_{ab} + \delta_{ik} \overset{(23)}{\le} A_{ka}C_{aa} + (1-A_{ka})(C_{aa}-\nu) + \delta_{ik} = C_{aa} - (1-A_{ka})\nu + \delta_{ik}. \tag{88}
\]
On the other hand, $i, j \in I_a$ implies $R_{ij} = C_{aa}$. Thus, on the event $\mathcal{E}$, (88) gives
\[
\widehat R_{ij} - \widehat R_{ik} \overset{\mathcal E}{\ge} (1-A_{ka})\nu - \delta_{ij} - \delta_{ik}.
\]
If $A_{ka} = 0$, using (24) and $\nu > 4\delta$ gives the desired result. If $A_{ka} > 0$, from the definition of $J_1^a$ we have $A_{ka} < 1 - 4\delta/\nu$, which finishes the proof by using (24) again.
To prove (86), observe that, for any $j \in J_1^a$ and $k \in I_a$,
\[
\widehat R_{ij} \overset{(88)}{\le} C_{aa} - (1-A_{ja})\nu + \delta_{ij} < C_{aa} + \delta_{ij} = R_{ik} + \delta_{ij} \overset{\mathcal E}{\le} \widehat R_{ik} + \delta_{ij} + \delta_{ik}.
\]
It remains to show (87). For any $i \in J_1^a$ and $j \in I_a$, we have, for some $c \in [K]$,
\[
\widehat R_{i\widehat a_i} \overset{\mathcal E}{\le} \max_{k\in[p]}R_{ik} + \delta_{i\widehat a_i} \overset{(*)}{\le} \sum_{b=1}^KA_{ib}C_{bc} + \delta_{i\widehat a_i} \overset{(**)}{\le} \sum_{b=1}^KA_{ib}C_{ba} + \delta_{i\widehat a_i} = R_{ij} + \delta_{i\widehat a_i} \overset{\mathcal E}{\le} \widehat R_{ij} + \delta_{ij} + \delta_{i\widehat a_i}.
\]
Inequality $(*)$ holds since
\[
\max_{k\in[p]}R_{ik} = \max_{k\in[p]}\sum_{b=1}^KA_{kb}\Bigg(\sum_{a=1}^KA_{ia}C_{ab}\Bigg) \le \max_{k\in[p]}\max_{b\in[K]}\sum_{a=1}^KA_{ia}C_{ab} = \sum_{a=1}^KA_{ia}C_{ac}.
\]
Inequality $(**)$ holds since, for any $c \ne a$, we have
\[
\sum_{b=1}^KA_{ib}C_{bc} \le A_{ia}C_{ac} + (1-A_{ia})C_{cc} \overset{(23)}{\le} A_{ia}(C_{aa}-\nu) + (1-A_{ia})C_{cc}
\]
and
\[
\sum_{b=1}^KA_{ib}C_{ab} \overset{(23)}{\ge} A_{ia}C_{aa},
\]
whence
\[
\sum_{b=1}^KA_{ib}C_{ab} - \sum_{b=1}^KA_{ib}C_{bc} \ge A_{ia}\nu - (1-A_{ia})C_{cc} > \nu - 2(1-A_{ia})C_{cc}.
\]
The term on the right is positive, since condition (25) guarantees that
\[
\nu \ge \frac{8\delta}{\nu}C_{cc} \ge 2(1-A_{ia})C_{cc},
\]
where the last inequality is due to the definition of $J_1$. This concludes the proof.

Lemma 19 remains valid under the conditions of Corollary 5, in which case $J_1 = \emptyset$ and we only need $\nu > 4\delta$ to prove (85).
Proof of Theorem 4. We work on the event $\mathcal{E}$ throughout the proof. Without loss of generality, we assume that the label permutation $\pi$ is the identity. We start by presenting three claims which are sufficient to prove the result. Let $\widehat I(i)$ be defined in step 5 of Algorithm 2, for any $i \in [p]$.

(1) For any $i \in J\setminus J_1$, we have $\widehat I(i) \notin \widehat{\mathcal I}$.

(2) For any $i \in I_a$ and $a \in [K]$, we have $i \in \widehat I(i)$, $I_a \subseteq \widehat I(i)$ and $\widehat I(i)\setminus I_a \subseteq J_1^a$.

(3) For any $i \in J_1^a$ and $a \in [K]$, we have $I_a \subseteq \widehat I(i)$.

If we can prove these claims, then (1) and the Merge step in Algorithm 2 guarantee that $\widehat{\mathcal I}\cap(J\setminus J_1) = \emptyset$, and thus enable us to focus on $i \in I\cup J_1$. For any $a \in [K]$, (2) implies that there exists $i \in I_a$ such that, with $\widehat I_a := \widehat I(i)$, $I_a \subseteq \widehat I_a$ and $\widehat I_a\setminus I_a \subseteq J_1^a$. Finally, (3) guarantees that no anchor word is excluded by any $i \in J_1$ in the Merge step. Thus, $\widehat K = K$ and $\widehat{\mathcal I} = \{\widehat I_1, \ldots, \widehat I_K\}$ is the desired partition. We therefore proceed to prove (1) – (3). Recall that $\widehat a_i := \operatorname{argmax}_{1\le j\le p}\widehat R_{ij}$ for all $1 \le i \le p$.

To prove (1), let $i \in J\setminus J_1$ be fixed. We first prove that $\widehat I(i) \notin \widehat{\mathcal I}$ when $\widehat I(i)\cap I \ne \emptyset$. From steps 8 – 10 of Algorithm 2, it suffices to show that there exists $j \in \widehat I(i)$ such that
\[
\big|\widehat R_{ij} - \widehat R_{jk}\big| \le \delta_{ij} + \delta_{jk} \tag{89}
\]
fails for every $k \in \widehat a_j$.
Suppose $\widehat I(i)\cap I \ne \emptyset$, so that there exists $j \in I_b\cap\widehat I(i)$ for some $b \in [K]$. For this $j$, we have $R_{ij} = \sum_aA_{ia}C_{ab}$ and
\[
\widehat R_{ij} \overset{(88)}{\le} C_{bb} - (1-A_{ib})\nu + \delta_{ij}. \tag{90}
\]
On the other hand, for any $k \in \widehat a_j$ and $k' \in I_b$, using the definition of $\widehat a_j$ gives
\[
\widehat R_{jk} \ge \widehat R_{jk'} \overset{\mathcal E}{\ge} R_{jk'} - \delta_{jk'} = C_{bb} - \delta_{jk'}. \tag{91}
\]
Combining (90) with (91) gives
\[
\widehat R_{jk} - \widehat R_{ij} \ge (1-A_{ib})\nu - \delta_{ij} - \delta_{jk'}.
\]
The definition of $J_1$ and (24) with $\nu > 4\delta$ give
\[
\widehat R_{jk} - \widehat R_{ij} > \delta_{jk} + \delta_{ij}.
\]
This shows that, for any $i \in J\setminus J_1$, if $\widehat I(i)\cap I \ne \emptyset$ then $\widehat I(i) \notin \widehat{\mathcal I}$. Therefore, to complete the proof of (1), we show that $\widehat I(i)\cap I = \emptyset$ is impossible when $i \in J\setminus J_1$. For fixed $i \in J\setminus J_1$ and $j \in \widehat a_i$, we have
\[
R_{ij} = \sum_b\sum_aA_{ia}A_{jb}C_{ab} \le \max_{1\le b\le K}\sum_aA_{ia}C_{ab} = \sum_aA_{ia}C_{ab^*} = R_{ik}
\]
for some $b^*$ and any $k \in I_{b^*}$. Therefore,
\[
\widehat R_{ij} - \widehat R_{ik} \overset{\mathcal E}{\le} R_{ij} - R_{ik} + \delta_{ij} + \delta_{ik} \le \delta_{ij} + \delta_{ik}.
\]
On the other hand, assume $\widehat I(i)\cap I = \emptyset$. Since $k \in I_{b^*}$, we know $k \notin \widehat I(i)$, which implies
\[
\widehat R_{ij} - \widehat R_{ik} > \delta_{ij} + \delta_{ik},
\]
from step 5 of Algorithm 2. The last two displays contradict each other, and we conclude that $\widehat I(i)\cap I \ne \emptyset$ for any $i \in J\setminus J_1$. This completes the proof of (1).
To prove (2), fix $i \in I_a$ with $a \in [K]$. From (85) in Lemma 19 and step 5 of Algorithm 2, any $j \in \widehat I(i)$ satisfies $j \in I_a\cup J_1^a$. Thus we write $\widehat I(i) = (\widehat I(i)\cap I_a)\cup(\widehat I(i)\cap J_1^a)$. For any $j \in \widehat I(i)\cap I_a$, by the same reasoning, $\widehat a_j$ belongs to either $I_a$ or $J_1^a$. In both cases, since $i, j \in I_a$ and $i \ne j$, (84) and (87) in Lemma 19 guarantee that (89) holds. On the other hand, for any $j \in \widehat I(i)\cap J_1^a$, (86) in Lemma 19 implies that (89) still holds. Thus we have shown that $i \in \widehat I(i)$ for any $i \in I_a$. To show $I_a \subseteq \widehat I(i)$, take any $j \in I_a$ and observe that $\widehat a_i$ can only belong to $I_a\cup J_1^a$. In both cases, (84) and (87) imply $j \in \widehat I(i)$. Thus $I_a \subseteq \widehat I(i)$. Finally, $\widehat I(i)\setminus I_a \subseteq J_1^a$ follows immediately from (85).

We conclude the proof by noting that (3) follows directly from (86).
D Proofs of Section 5
D.1 Proofs of Lower bounds in Section 5.1
Proof of Theorem 6. We first show the result for the matrix $\ell_1$ norm. Let
\[
A^{(0)} = \begin{bmatrix} e_1\otimes\mathbf{1}_{g_1} & e_2\otimes\mathbf{1}_{g_2} & \cdots & e_K\otimes\mathbf{1}_{g_K}\\ \mathbf{1}_{|J|} & \mathbf{1}_{|J|} & \cdots & \mathbf{1}_{|J|}\end{bmatrix}\times\operatorname{diag}\Bigg(\frac{1}{g_1+|J|}, \ldots, \frac{1}{g_K+|J|}\Bigg) \tag{92}
\]
with $g_k := |I_k|$ for any $1 \le k \le K$ and $|g_k - g_{k+1}| \le 1$. We use $e_k$ to denote the canonical basis vectors of $\mathbb{R}^K$, and $\mathbf{1}_d$ and $\otimes$ to denote, respectively, the $d$-dimensional vector with all entries equal to 1 and the Kronecker product. We start by constructing a set of "hypotheses" of $A$. Assume $g_k + |J|$ is even for $1 \le k \le K$. Let
\[
\mathcal{M} := \{0,1\}^{(|I|+K|J|)/2}.
\]
By the Varshamov–Gilbert bound (Lemma 2.9 in Tsybakov, 2009), there exist $w^{(j)} \in \mathcal{M}$ for $j = 0, 1, \ldots, T$ such that
\[
\big\|w^{(i)} - w^{(j)}\big\|_1 \ge \frac{|I|+K|J|}{16}, \quad \text{for any } 0 \le i \ne j \le T, \tag{93}
\]
with $w^{(0)} = 0$ and
\[
\log(T) \ge \frac{\log 2}{16}(|I|+K|J|). \tag{94}
\]
For each $w^{(j)} \in \mathbb{R}^{(|I|+K|J|)/2}$, we divide it into $K$ chunks as $w^{(j)} = (w^{(j)}_1, \ldots, w^{(j)}_K)$ with $w^{(j)}_k \in \mathbb{R}^{(g_k+|J|)/2}$. For each $w^{(j)}_k$, we write $\widetilde w^{(j)}_k \in \mathbb{R}^p$ for its augmented counterpart, such that $\widetilde w^{(j)}_k(S_k) = [w^{(j)}_k, -w^{(j)}_k]$ and $\widetilde w^{(j)}_k(\ell) = 0$ for any $\ell \notin S_k$, where $S_k := \operatorname{supp}(A^{(0)}_k)$. For $1 \le j \le T$, we choose $A^{(j)}$ as
\[
A^{(j)} = A^{(0)} + \gamma\big[\widetilde w^{(j)}_1 \ \cdots\ \widetilde w^{(j)}_K\big] \tag{95}
\]
with
\[
\gamma = \sqrt{\frac{\log 2}{16^2c^2(1+c_0)}}\sqrt{\frac{K^2}{nN(|I|+K|J|)}} \tag{96}
\]
for some constants $c_0, c > 0$. Under $|I|+K|J| \le c(nN)$, it is easy to verify that $A^{(j)} \in \mathcal{A}(K, |I|, |J|)$ for all $0 \le j \le T$.
In order to apply Theorem 2.5 in Tsybakov (2009), we need to check the following conditions:

(a) $\mathrm{KL}(\mathbb{P}_{A^{(j)}}, \mathbb{P}_{A^{(0)}}) \le \log(T)/16$, for each $j = 1, \ldots, T$.

(b) $L_1(A^{(i)}, A^{(j)}) \ge c_1K\sqrt{(|I|+K|J|)/(nN)}$, for $0 \le i < j \le T$ and some positive constant $c_1$.

(c) $L_1(\cdot\,,\cdot)$ satisfies the triangle inequality.

The expression of the Kullback–Leibler divergence between two multinomial distributions is given in (Ke and Wang, 2017, Lemma 6.7). For completeness, we include it here.

Lemma 20 (Lemma 6.7 of Ke and Wang (2017)). Let $D$ and $D'$ be two $p\times n$ matrices such that each of their columns is a weight vector. Under model (3), let $\mathbb{P}$ and $\mathbb{P}'$ be the probability measures associated with $D$ and $D'$, respectively. Suppose $D$ is a positive matrix. Let
\[
\eta = \max_{1\le j\le p,\,1\le i\le n}\frac{|D'_{ji} - D_{ji}|}{D_{ji}}
\]
and assume $\eta < 1$. There exists a universal constant $c_0 > 0$ such that
\[
\mathrm{KL}(\mathbb{P}', \mathbb{P}) \le (1+c_0\eta)N\sum_{i=1}^n\sum_{j=1}^p\frac{|D'_{ji} - D_{ji}|^2}{D_{ji}}.
\]
Fix $1 \le j \le T$ and choose $D^{(j)} = A^{(j)}W$ with
\[
W = W^0 + \frac{1}{nN}\mathbf{1}_K\mathbf{1}_K^\top - \frac{K}{nN}I_K \in \mathbb{R}_+^{K\times n},
\]
where $W^0$ is defined in (35). The above choice of $W$ perturbs $W^0$ to avoid $D_{ji} = 0$ for any $j \in I$ and $i \in [n]$, in the presence of anchor words. Since there exists some large enough constant $c > 0$ such that $g_k \le c|I|/K$, it then follows that
\[
D^{(0)}_{\ell i} = \sum_{k=1}^KA^{(0)}_{\ell k}W_{ki} = \begin{cases}W_{ki}/(g_k+|J|) \ge cW_{ki}/(|I|/K+|J|), & \text{if }\ell\in I_k,\ k\in[K],\\[2pt] \sum_kW_{ki}/(g_k+|J|) \ge c/(|I|/K+|J|), & \text{if }\ell\in J,\end{cases} \tag{97}
\]
for any $1 \le i \le n$, where we also use $\sum_kW_{ki} = 1$. Similarly, we have
\[
\big|D^{(j)}_{\ell i} - D^{(0)}_{\ell i}\big| = \gamma\Bigg|\sum_{k=1}^K\widetilde w^{(j)}_k(\ell)W_{ki}\Bigg| \le \begin{cases}W_{ki}\gamma, & \text{if }\ell\in I_k,\ k\in[K],\\ \gamma, & \text{if }\ell\in J,\end{cases} \tag{98}
\]
for all $1 \le i \le n$, where $\widetilde w^{(j)}_k(\ell)$ denotes the $\ell$th element of $\widetilde w^{(j)}_k$. Thus, by choosing a proper $c_0$ in (96) and using $|I|+K|J| \le c(nN)$, we have
\[
\max_{1\le\ell\le p,\,1\le i\le n}\frac{|D^{(j)}_{\ell i} - D^{(0)}_{\ell i}|}{D^{(0)}_{\ell i}} \le c\gamma(|I|/K+|J|) < 1,
\]
for any $1 \le j \le T$. Invoking Lemma 20 gives
\[
\mathrm{KL}(\mathbb{P}_{A^{(j)}}, \mathbb{P}_{A^{(0)}}) \le (1+c_0)N\sum_{i=1}^n\sum_{\ell=1}^p\frac{|D^{(j)}_{\ell i} - D^{(0)}_{\ell i}|^2}{D^{(0)}_{\ell i}} \le c(1+c_0)N\Bigg(\frac{|I|}{K}+|J|\Bigg)\sum_{i=1}^n\Bigg(\gamma^2|J| + \sum_{k=1}^K\gamma^2g_kW_{ki}\Bigg) \le c(1+c_0)nN(|I|+K|J|)^2\gamma^2/K^2 \overset{(94)}{\le} \frac{1}{16}\log T,
\]
where the second inequality uses (97) and (98), and the third uses $g_k \le c|I|/K$ and $\sum_kW_{ki} = 1$. This verifies (a).
To show (b), from (95) we obtain
\[
L_1(A^{(j)}, A^{(\ell)}) = \sum_{k=1}^K\big\|A^{(j)}_{\cdot k} - A^{(\ell)}_{\cdot k}\big\|_1 = 2\gamma\sum_{k=1}^K\big\|w^{(j)}_k - w^{(\ell)}_k\big\|_1 = 2\gamma\big\|w^{(j)} - w^{(\ell)}\big\|_1 \overset{(93)}{\ge} \frac{\gamma}{8}(|I|+K|J|).
\]
Plugging in the expression of $\gamma$ verifies (b). Since (c) is already verified in (Ke and Wang, 2017, page 31), invoking (Tsybakov, 2009, Theorem 2.5) concludes the proof for even $g_k+|J|$. The complementary case follows with slight modifications. Specifically, denote $S_{\mathrm{odd}} := \{1 \le k \le K : g_k+|J| \text{ is odd}\}$. Then we change $\mathcal{M} := \{0,1\}^{\mathrm{Card}}$ with
\[
\mathrm{Card} = \sum_{k\in S_{\mathrm{odd}}}\frac{g_k+|J|-1}{2} + \sum_{k\in S_{\mathrm{odd}}^c}\frac{g_k+|J|}{2}.
\]
For each $w^{(j)}$, we write it as $w^{(j)} = (w^{(j)}_1, \ldots, w^{(j)}_K)$, where each $w^{(j)}_k$ has length $(g_k+|J|-1)/2$ if $k \in S_{\mathrm{odd}}$ and $(g_k+|J|)/2$ otherwise. We then construct $A^{(j)}_k = A^{(0)}_k + \gamma\widetilde w^{(j)}_k$, where $\widetilde w^{(j)}_k \in \mathbb{R}^p$ is the same augmented counterpart of $w^{(j)}_k$. The result follows from the same arguments.
To conclude the proof, we show the lower bound for $\|\widehat A - A\|_{1,\infty}$. Let $k^* := \operatorname{argmax}_kg_k$. Then we can repeat the above arguments, changing only the $k^*$th column of $A^{(0)}$. Specifically, assume $|J|+g_{k^*}$ is even and draw $w^{(j)}$ from $\mathcal{M} = \{0,1\}^{(|J|+g_{k^*})/2}$ for $1 \le j \le T$ with
\[
\log(T) \ge \log(2)(|J|+g_{k^*})/16.
\]
Let $\widetilde w^{(j)}$ be its augmented counterpart and choose $A^{(j)}_{k^*} = A^{(0)}_{k^*} + \gamma\widetilde w^{(j)}$ with
\[
\gamma = \sqrt{\frac{\log 2}{16^2c^2(1+c_0)}}\sqrt{\frac{K}{nN(g_{k^*}+|J|)}}
\]
and $\|w^{(j)} - w^{(\ell)}\|_1 \ge (g_{k^*}+|J|)/16$ for any $0 \le j \ne \ell \le T$. Then we need to check:

(a') $\mathrm{KL}(\mathbb{P}_{A^{(j)}}, \mathbb{P}_{A^{(0)}}) \le \log(T)/16$, for each $j = 1, \ldots, T$.

(b') $L_{1,\infty}(A^{(i)}, A^{(j)}) \ge c_1\sqrt{K(g_{k^*}+|J|)/(nN)}$, for $0 \le i < j \le T$ and some positive constant $c_1$.

(c') $L_{1,\infty}(\cdot\,,\cdot)$ satisfies the triangle inequality.

Differently from (97) and (98), we have, for all $1 \le i \le n$,
\[
D^{(0)}_{\ell i} = \sum_{k=1}^KA^{(0)}_{\ell k}W_{ki} = \begin{cases}W_{ki}/(g_k+|J|) \ge W_{ki}/(g_{k^*}+|J|), & \text{if }\ell\in I_k,\ k\in[K],\\[2pt] \sum_kW_{ki}/(g_k+|J|) \ge 1/(g_{k^*}+|J|), & \text{if }\ell\in J,\end{cases} \tag{99}
\]
and
\[
\big|D^{(j)}_{\ell i} - D^{(0)}_{\ell i}\big| = \begin{cases}\gamma W_{k^*i}\,\widetilde w^{(j)}(\ell) \le \gamma W_{k^*i}, & \text{if }\ell\in I_{k^*}\cup J,\\ 0, & \text{otherwise}.\end{cases} \tag{100}
\]
Using the same arguments, one can derive
\[
\mathrm{KL}(\mathbb{P}_{A^{(j)}}, \mathbb{P}_{A^{(0)}}) \le (1+c_0)N\gamma^2(g_{k^*}+|J|)\sum_{i=1}^n\big(W^2_{k^*i}|J| + W_{k^*i}g_{k^*}\big).
\]
Since there exists some constant $c > 0$ such that
\[
\sum_{i=1}^n\big(W^2_{k^*i}|J| + W_{k^*i}g_{k^*}\big) = \sum_{a=1}^n\sum_{i\in n_a}\big(W^2_{k^*i}|J| + W_{k^*i}g_{k^*}\big) \le c(|J|+g_{k^*})n/K,
\]
we conclude that
\[
\mathrm{KL}(\mathbb{P}_{A^{(j)}}, \mathbb{P}_{A^{(0)}}) \le c(1+c_0)\gamma^2Nn(g_{k^*}+|J|)^2/K \le \log(T)/16.
\]
To show (b'), observe that
\[
L_{1,\infty}(A^{(j)}, A^{(\ell)}) = 2\gamma\max_{1\le k\le K}\big\|w^{(j)}_k - w^{(\ell)}_k\big\|_1 \ge \frac{\gamma(g_{k^*}+|J|)}{8}.
\]
Finally, we verify (c') by showing that $L_{1,\infty}(\cdot\,,\cdot)$ satisfies the triangle inequality. Consider a triplet $(A, \widetilde A, \widehat A)$ and observe that
\[
L_{1,\infty}(A, \widehat A) = \min_{P\in\mathcal{H}_K}\|AP - \widehat A\|_{1,\infty} = \min_{P,Q\in\mathcal{H}_K}\|AP - \widehat AQ\|_{1,\infty} \le \min_{P,Q\in\mathcal{H}_K}\big(\|AP - \widetilde A\|_{1,\infty} + \|\widetilde A - \widehat AQ\|_{1,\infty}\big) = \min_{P\in\mathcal{H}_K}\|AP - \widetilde A\|_{1,\infty} + \min_{Q\in\mathcal{H}_K}\|\widetilde A - \widehat AQ\|_{1,\infty} = L_{1,\infty}(A, \widetilde A) + L_{1,\infty}(\widetilde A, \widehat A).
\]
The proof is complete.
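The triangle inequality verified above can also be checked numerically. Below is a minimal sketch (ours, not part of the paper's code), assuming $\mathcal{H}_K$ is the set of $K\times K$ column permutations; the helper `l1_inf` is a hypothetical brute-force evaluation of $L_{1,\infty}$.

```python
import itertools
import numpy as np

def l1_inf(A, B):
    """Brute-force L_{1,inf}(A, B): minimum over column permutations P of
    the maximum column-wise l1 norm of AP - B."""
    K = A.shape[1]
    return min(
        np.abs(A[:, list(perm)] - B).sum(axis=0).max()
        for perm in itertools.permutations(range(K))
    )

rng = np.random.default_rng(0)
p, K = 6, 3
for _ in range(100):
    A, A1, A2 = rng.random((3, p, K))
    # triangle inequality: L(A, A2) <= L(A, A1) + L(A1, A2)
    assert l1_inf(A, A2) <= l1_inf(A, A1) + l1_inf(A1, A2) + 1e-12
print("triangle inequality holds on 100 random triples")
```

The brute force over all $K!$ permutations is only feasible for small $K$; it serves purely as a check of the loss's metric-like behavior.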
D.2 Proofs of upper bounds in Section 5.3

We work on the event $\mathcal{E}$. Recall that under (26) we have $\mathbb{P}(\mathcal{E}) \ge 1 - M^{-3}$. From Theorem 4, we have $\widehat K = K$ and $\widehat I = I$. Without loss of generality, we assume that the label permutation aligning the estimated topics with the true ones is the identity ($\widehat I_k = I_k$). In particular, this implies that any chosen $\widehat L \subseteq \widehat I$ has the correct partition, and we have $\widehat L = L$. We first give a lemma that is crucial for the proof of Theorem 7. From (68), recall that
\[
\eta_{j\ell} \asymp \big(\sqrt{m_j}+\sqrt{m_\ell}\big)\sqrt{\frac{\Theta_{j\ell}\log M}{npN}} + \frac{(m_j+m_\ell)\log M}{npN} + \sqrt{\frac{(\mu_j+\mu_\ell)(\log M)^4}{npN^3}}, \tag{101}
\]
for all $j, \ell \in [p]$.
Lemma 21. Under the conditions of Theorem 7, we have
\[
\max_{i\in L}\sum_{j\in L}\eta_{ij} \lesssim \sqrt{\alpha_I^3\gamma}\sqrt{\frac{\log M}{np^3N}}, \tag{102}
\]
\[
\max_{i\in L}\sum_{j\in J}\eta_{ij} \lesssim \sqrt{\alpha_I\gamma S_J}\sqrt{\frac{\log M}{Knp^2N}}, \tag{103}
\]
with $S_J = |J|\bar\alpha_I + \sum_{j\in J}\alpha_j$, for any $i \in I_k$ and $k \in [K]$. In addition, if $\sum_{k'\ne k}\sqrt{C_{kk'}} = o(\sqrt{C_{kk}})$ for any $1 \le k \le K$, then
\[
\max_{i\in L}\sum_{j\in L}\eta_{ij} \lesssim \sqrt{\alpha_I^3\gamma}\sqrt{\frac{\log M}{Knp^3N}}. \tag{104}
\]
Proof. We first simplify the expression of $\eta$ defined in (68). To show (102), observe that, for any $i \in I_a$, $j \in I_b$ and $a, b \in [K]$, we have
\[
\Theta_{ij} = \frac{1}{n}\sum_{\ell=1}^n\Pi_{i\ell}\Pi_{j\ell} = A_{ia}A_{jb}\frac{1}{n}\sum_{\ell=1}^nW_{a\ell}W_{b\ell} \overset{(53)}{=} \frac{\alpha_i\alpha_j}{p^2}\cdot\frac{1}{n}\langle W_{a\cdot}, W_{b\cdot}\rangle.
\]
As a result, plugging (65) – (67) and the display above into (68) yields
\[
\eta_{ij} \asymp \sqrt{\frac{1}{n}\langle W_{a\cdot}, W_{b\cdot}\rangle}\sqrt{\frac{\alpha_I^3\log M}{np^3N}} + \frac{\alpha_I\log M}{npN} + \sqrt{\frac{\alpha_I\gamma(\log M)^4}{KnpN^3}}. \tag{105}
\]
Since
\[
\sum_{b=1}^K\frac{1}{n}\langle W_{a\cdot}, W_{b\cdot}\rangle = \frac{1}{n}\sum_{\ell=1}^nW_{a\ell}\sum_{b=1}^KW_{b\ell} \overset{(52)}{=} \frac{1}{n}\sum_{\ell=1}^nW_{a\ell} \overset{(53)}{=} \frac{\gamma_a}{K},
\]
summing (105) over $j \in L$, and using the Cauchy–Schwarz inequality to obtain the first term, gives
\[
\max_{i\in L}\sum_{j\in L}\eta_{ij} \lesssim \sqrt{\frac{\alpha_I\log M}{npN}}\Bigg(\sqrt{\frac{\alpha_I^2\gamma}{p^2}} + \sqrt{\frac{\alpha_IK^2\log M}{npN}} + \sqrt{\frac{K\gamma(\log M)^3}{N^2}}\Bigg). \tag{106}
\]
Note that Assumption 2 implies $n \ge K$. Since (26) and (66), together with the fact that $\bar\gamma \ge 1 \ge \underline\gamma$, give
\[
\frac{\alpha_I\bar\gamma}{K} \ge \frac{\alpha_I}{K} \ge \frac{\alpha_I\underline\gamma}{K} \ge \min_{i\in I}\mu_i \ge \frac{2p\log M}{3N}, \tag{107}
\]
these two bounds imply that the first term on the right of (106) dominates the second. Moreover, (67), (77) and (107) imply
\[
\frac{\alpha_I^2}{Kp^2} \ge \frac{2\log M}{3N}\frac{\alpha_I}{p} \ge \frac{2\log M}{3N}\frac{m_{\min}}{p} \ge \frac{6(\log M)^3}{N^2}.
\]
Thus, the first term on the right of (106) also dominates the third. This finishes the proof of (102).
We proceed to show (103). For any $i \in I_a$ and $j \in [p]$, observe that
\[
\Theta_{ij} = \frac{1}{n}\sum_{\ell=1}^n\Pi_{i\ell}\Pi_{j\ell} = A_{ia}\frac{1}{n}\sum_{\ell=1}^nW_{a\ell}\Pi_{j\ell} \overset{(53)}{=} \frac{\alpha_i}{p}\cdot\frac{1}{n}\langle W_{a\cdot}, \Pi_{j\cdot}\rangle.
\]
Plugging (65) – (67) and the display above into (68) yields
\[
\eta_{ij} \lesssim \sqrt{\frac{1}{n}\langle W_{a\cdot}, \Pi_{j\cdot}\rangle}\sqrt{\frac{\alpha_i(\alpha_i+\alpha_j)\log M}{np^2N}} + \frac{(\alpha_i+\alpha_j)\log M}{npN} + \sqrt{\frac{(\alpha_i+\alpha_j)(\log M)^4}{npN^3}}. \tag{108}
\]
Since $\sum_kW_{k\ell} = 1$ and (53) yield
\[
\sum_{j\in J}\frac{1}{n}\langle W_{a\cdot}, \Pi_{j\cdot}\rangle = \sum_{k=1}^K\frac{1}{n}\sum_{\ell=1}^nW_{a\ell}W_{k\ell}\sum_{j\in J}A_{jk} \le \frac{\gamma_a}{K}\|A_J\|_{1,\infty} \le \frac{\gamma_a}{K},
\]
summing (108) over $j \in J$ and using Cauchy–Schwarz yield
\[
\max_{i\in L}\sum_{j\in J}\eta_{ij} \lesssim \sqrt{\frac{S_J\log M}{npN}}\Bigg(\sqrt{\frac{\alpha_I\gamma}{Kp}} + \sqrt{\frac{S_J\log M}{npN}} + \sqrt{\frac{|J|(\log M)^3}{N^2}}\Bigg).
\]
To conclude the proof of (103), it suffices to show that the first term in the parentheses dominates the other two. This is true by noting that
\[
\frac{\alpha_I\gamma N^2}{K|J|p(\log M)^3} \ge \frac{\mu_{\min}N^2}{|J|p(\log M)^3} \gtrsim 1
\]
and
\[
\frac{\gamma nN}{K|J|\log M} \gtrsim 1, \qquad \frac{\alpha_I\gamma nN}{K\sum_{j\in J}\alpha_j\log M} \gtrsim 1,
\]
from the same arguments used to prove (102), together with $|J| \le p$ and $\sum_{j\in J}\alpha_j \le Kp$.

Finally, to show (104), invoking the additional condition and using (105) yield
\[
\max_{i\in L}\sum_{j\in L}\eta_{ij} \lesssim \max_{a\in[K]}\sqrt{C_{aa}}\sqrt{\frac{\alpha_I^3\log M}{np^3N}} + \frac{K\alpha_I\log M}{npN} + \sqrt{\frac{K\alpha_I\gamma(\log M)^4}{npN^3}} \le \sqrt{\frac{\alpha_I^3\gamma\log M}{Knp^3N}} + \frac{K\alpha_I\log M}{npN} + \sqrt{\frac{K\alpha_I\gamma(\log M)^4}{npN^3}},
\]
where we use $C_{aa} \le n^{-1}\|W_{a\cdot}\|_1 \lesssim \gamma/K$ to derive the second inequality. Repeating the same arguments, one can show that the first term on the right-hand side dominates the second and third. This finishes the proof.
D.2.1 Proof of Theorem 7

On the event $\mathcal{E}$, Theorem 4 guarantees that $\widehat I = I$, $\widehat J = J$ and $\widehat L = L$. We work on this event for the remainder of the proof. Our proof consists of three parts, for any $k \in [K]$: bounding (1) $\|\widehat B_{I_k} - B_{I_k}\|_1$, (2) $\|\widehat B_{Jk} - B_{Jk}\|_1$ and (3) $\|\widehat A_{\cdot k} - A_{\cdot k}\|_1$.

For step (1), recall (19) and (38). Then, for any $1 \le k \le K$, it follows that
\[
\|\widehat B_{I_k} - B_{I_k}\|_1 = \sum_{i\in I_k}\Bigg|\frac{\|X_{i\cdot}\|_1}{\|X_{i_k\cdot}\|_1} - \frac{\|\Pi_{i\cdot}\|_1}{\|\Pi_{i_k\cdot}\|_1}\Bigg|
\]
and we obtain
\[
\|\widehat B_{I_k} - B_{I_k}\|_1 = \frac{1}{\|X_{i_k\cdot}\|_1\|\Pi_{i_k\cdot}\|_1}\sum_{i\in I_k}\Big|\|\Pi_{i_k\cdot}\|_1\|X_{i\cdot}\|_1 - \|X_{i_k\cdot}\|_1\|\Pi_{i\cdot}\|_1\Big| \le \frac{1}{\|X_{i_k\cdot}\|_1\|\Pi_{i_k\cdot}\|_1}\Bigg[\|\Pi_{i_k\cdot}\|_1\sum_{i\in I_k}\Big|\|X_{i\cdot}\|_1 - \|\Pi_{i\cdot}\|_1\Big| + \Big|\|X_{i_k\cdot}\|_1 - \|\Pi_{i_k\cdot}\|_1\Big|\sum_{i\in I_k}\|\Pi_{i\cdot}\|_1\Bigg] = \frac{1}{\|X_{i_k\cdot}\|_1}\sum_{i\in I_k}\frac{1}{n}\Bigg|\sum_{j=1}^n\varepsilon_{ij}\Bigg| + \frac{\sum_{i\in I_k}\|\Pi_{i\cdot}\|_1}{\|X_{i_k\cdot}\|_1\|\Pi_{i_k\cdot}\|_1}\frac{1}{n}\Bigg|\sum_{j=1}^n\varepsilon_{i_kj}\Bigg|.
\]
In the last step we used that $X_{i\cdot}$ and $\Pi_{i\cdot}$ have nonnegative entries only, so that indeed
\[
\Big|\|X_{i\cdot}\|_1 - \|\Pi_{i\cdot}\|_1\Big| = \frac{1}{n}\Bigg|\sum_{j=1}^n|X_{ij}| - \sum_{j=1}^n|\Pi_{ij}|\Bigg| = \frac{1}{n}\Bigg|\sum_{j=1}^nX_{ij} - \sum_{j=1}^n\Pi_{ij}\Bigg| = \frac{1}{n}\Bigg|\sum_{j=1}^n\varepsilon_{ij}\Bigg|.
\]
Invoking Lemma 13, $n^{-1}\big|\sum_{j=1}^n\varepsilon_{ij}\big| \lesssim \sqrt{\mu_ip\log M/(nN)}$ on the event $\mathcal{E}_1 \supset \mathcal{E}$. Note further that, for any $i \in [p]$,
\[
\frac{1}{\|X_{i\cdot}\|_1} = (D_X)^{-1}_{ii} \overset{(80)}{\lesssim} \frac{p}{n\mu_i}
\]
and $\|\Pi_{i\cdot}\|_1 = n\mu_i/p$ by the definition in (53). Now deduce that
\[
\|\widehat B_{I_k} - B_{I_k}\|_1 \lesssim \frac{1}{\mu_{i_k}}\sum_{i\in I_k}\sqrt{\frac{\mu_ip\log M}{nN}} + \frac{\sum_{i\in I_k}\mu_i}{\mu_{i_k}^2}\sqrt{\frac{\mu_{i_k}p\log M}{nN}}.
\]
Recall that $\mu_i = \alpha_i\gamma_k/K$ for any $i \in I_k$, from (66). We further have
\[
\frac{1}{\mu_{i_k}}\sum_{i\in I_k}\sqrt{\frac{\mu_ip\log M}{nN}} + \frac{\sum_{i\in I_k}\mu_i}{\mu_{i_k}^2}\sqrt{\frac{\mu_{i_k}p\log M}{nN}} = \frac{1}{\alpha_{i_k}}\sqrt{\frac{pK\log M}{\gamma_knN}}\Bigg(\sum_{i\in I_k}\sqrt{\alpha_i} + \frac{\sum_{i\in I_k}\alpha_i}{\sqrt{\alpha_{i_k}}}\Bigg)
\]
and we conclude that, on the event $\mathcal{E}$,
\[
\|\widehat B_{I_k} - B_{I_k}\|_1 \lesssim \frac{\sum_{i\in I_k}\alpha_i}{\alpha_{i_k}\sqrt{\alpha_I\gamma_k}}\sqrt{\frac{pK\log M}{nN}}. \tag{109}
\]
We proceed to show step (2). Recall that $\widehat\Omega = (\widehat\omega_1, \ldots, \widehat\omega_K)$ is the optimal solution of (41), $\widehat B_J = (\widehat\Theta_{JL}\widehat\Omega)_+$ and $B_J = \Theta_{JL}\Omega$. Fix any $1 \le k \le K$ and denote the canonical basis vectors of $\mathbb{R}^K$ by $e_1, \ldots, e_K$. Since $B$ has only nonnegative entries, we have
\[
\|\widehat B_{Jk} - B_{Jk}\|_1 = \big\|(\widehat\Theta_{JL}\widehat\Omega_{\cdot k})_+ - \Theta_{JL}\Omega_{\cdot k}\big\|_1 \le \big\|\widehat\Theta_{JL}\widehat\Omega_{\cdot k} - \Theta_{JL}\Omega_{\cdot k}\big\|_1 \le \big\|(\widehat\Theta_{JL} - \Theta_{JL})\widehat\Omega_{\cdot k}\big\|_1 + \big\|\Theta_{JL}\widehat\Omega_{\cdot k} - B_{Jk}\big\|_1 \le \|\widehat\Omega_{\cdot k}\|_1\|\widehat\Theta_{JL} - \Theta_{JL}\|_{1,\infty} + \big\|B_J\Theta_{LL}\widehat\Omega_{\cdot k} - B_Je_k\big\|_1 \le \|\widehat\Omega_{\cdot k}\|_1\|\widehat\Theta_{JL} - \Theta_{JL}\|_{1,\infty} + \Big[\|\widehat\Theta_{LL}\widehat\Omega_{\cdot k} - e_k\|_1 + \|(\widehat\Theta_{LL} - \Theta_{LL})\widehat\Omega_{\cdot k}\|_1\Big]\|B_J\|_{1,\infty} \le \|\widehat\omega_k\|_1\|\widehat\Theta_{JL} - \Theta_{JL}\|_{1,\infty} + \Big[\|\widehat\Theta_{LL}\widehat\Omega_{\cdot k} - e_k\|_1 + \|\widehat\omega_k\|_1\|\widehat\Theta_{LL} - \Theta_{LL}\|_{\infty,1}\Big]\|B_J\|_{1,\infty}.
\]
We first study the properties of $\widehat\Omega$. Notice that $\omega_k := \Omega_{\cdot k}$ is feasible for (41) since, on the event $\mathcal{E}$,
\[
\|\widehat\Theta_{LL}\omega_k - e_k\|_1 \le \|\omega_k\|_1\|\widehat\Theta_{LL} - \Theta_{LL}\|_{\infty,1} \overset{\mathcal E}{\le} C_0\|\omega_k\|_1\max_{i\in L}\sum_{j\in L}\eta_{ij} = \|\omega_k\|_1\lambda,
\]
by Proposition 16. The optimality and feasibility of $\widehat\omega_k$ imply
\[
\|\widehat\omega_k\|_1 \le \|\omega_k\|_1, \qquad \|\widehat\Theta_{LL}\widehat\omega_k - e_k\|_1 \le \|\widehat\omega_k\|_1\lambda \le \|\omega_k\|_1\lambda.
\]
Hence, on the event $\mathcal{E}$, we have
\[
\|\widehat B_{Jk} - B_{Jk}\|_1 \le C_0\|\omega_k\|_1\max_{i\in L}\sum_{j\in J}\eta_{ij} + 2\|\omega_k\|_1\lambda\|B_J\|_{1,\infty}.
\]
Recall that $B_J = A_JA_L^{-1}$ with $A_L = \operatorname{diag}(\alpha_{i_1}/p, \ldots, \alpha_{i_K}/p)$, and deduce
\[
\|B_J\|_{1,\infty} = \max_{1\le k\le K}\frac{p}{\alpha_{i_k}}\sum_{j\in J}A_{jk} \le \frac{p}{\alpha_I}\|A_J\|_{1,\infty}. \tag{110}
\]
Recall that $\lambda = C_0\max_{i\in L}\sum_{j\in L}\eta_{ij}$ and notice that
\[
\|\omega_k\|_1 = \|\Omega_{\cdot k}\|_1 = \|\Theta_{LL}^{-1}e_k\|_1 = \|A_L^{-1}C^{-1}A_L^{-1}e_k\|_1 \le \frac{p^2}{\alpha_{i_k}\alpha_I}\|C^{-1}\|_{\infty,1}. \tag{111}
\]
Collecting (110) – (111) gives
\[
\|\widehat B_{Jk} - B_{Jk}\|_1 \le C_0\frac{p^2}{\alpha_{i_k}\alpha_I}\|C^{-1}\|_{\infty,1}\Bigg[\max_{i\in L}\sum_{j\in J}\eta_{ij} + \frac{2p\|A_J\|_{1,\infty}}{\alpha_I}\max_{i\in L}\sum_{j\in L}\eta_{ij}\Bigg].
\]
Invoking (102) – (103) in Lemma 21 yields, on the event $\mathcal{E}$,
\[
\|\widehat B_{Jk} - B_{Jk}\|_1 \lesssim \frac{p^2\|C^{-1}\|_{\infty,1}}{\alpha_{i_k}\alpha_I}\Bigg[\sqrt{\alpha_I\gamma S_J}\sqrt{\frac{\log M}{Knp^2N}} + \frac{\|A_J\|_{1,\infty}}{\alpha_I}\sqrt{\alpha_I^3\gamma}\sqrt{\frac{\log M}{npN}}\Bigg] \le \frac{p}{\alpha_{i_k}}\|C^{-1}\|_{\infty,1}\frac{\bar\alpha_I}{\underline\alpha_I}\sqrt{\frac{\gamma\log M}{KnN}}\Bigg[\sqrt{|J| + \sum_{j\in J}\frac{\alpha_j}{\bar\alpha_I}} + \frac{\bar\alpha_I}{\underline\alpha_I}\sqrt{K\sum_{j\in J}\frac{\alpha_j}{\alpha_I}}\Bigg] \tag{112}
\]
where we recall that $S_J = |J|\bar\alpha_I + \sum_{j\in J}\alpha_j$, and use $\|A_J\|_{1,\infty} \le 1$ and $\|A_J\|_{1,\infty} \le \sum_{j\in J}\alpha_j/p$ in the second inequality. Combining (109) and (112) yields, on the event $\mathcal{E}$,
\[
\|\widehat B_{\cdot k} - B_{\cdot k}\|_1 \lesssim \frac{p}{\alpha_{i_k}}\frac{\sum_{i\in I_k}\alpha_i}{\sqrt{\alpha_I\gamma_k}}\sqrt{\frac{K\log M}{npN}} + \frac{p}{\alpha_{i_k}}\|C^{-1}\|_{\infty,1}\frac{\bar\alpha_I}{\underline\alpha_I}\sqrt{\frac{\gamma\log M}{KnN}}\Bigg[\sqrt{|J| + \sum_{j\in J}\frac{\alpha_j}{\bar\alpha_I}} + \frac{\bar\alpha_I}{\underline\alpha_I}\sqrt{K\sum_{j\in J}\frac{\alpha_j}{\alpha_I}}\Bigg]. \tag{113}
\]
It remains to prove step (3). For any $k \in [K]$, from (44), we have
\[
|\widehat A_{jk} - A_{jk}| = \Bigg|\frac{\widehat B_{jk}}{\|\widehat B_{\cdot k}\|_1} - \frac{B_{jk}}{\|B_{\cdot k}\|_1}\Bigg| \le \frac{\widehat B_{jk}}{\|\widehat B_{\cdot k}\|_1\|B_{\cdot k}\|_1}\Big|\|B_{\cdot k}\|_1 - \|\widehat B_{\cdot k}\|_1\Big| + \frac{|\widehat B_{jk} - B_{jk}|}{\|B_{\cdot k}\|_1} = \frac{\big|\|B_{\cdot k}\|_1 - \|\widehat B_{\cdot k}\|_1\big|}{\|B_{\cdot k}\|_1}\widehat A_{jk} + \frac{|\widehat B_{jk} - B_{jk}|}{\|B_{\cdot k}\|_1},
\]
using $\widehat A_{jk} = \widehat B_{jk}/\|\widehat B_{\cdot k}\|_1$ in the last equality. Since $\|\widehat A_{\cdot k}\|_1 = 1$, summing over $j \in [p]$ yields
\[
\|\widehat A_{\cdot k} - A_{\cdot k}\|_1 \le \frac{2\|\widehat B_{\cdot k} - B_{\cdot k}\|_1}{\|B_{\cdot k}\|_1} = \frac{2\alpha_{i_k}}{p}\|\widehat B_{\cdot k} - B_{\cdot k}\|_1.
\]
The last equality uses $\|B_{\cdot k}\|_1 = p/\alpha_{i_k}$, which follows from $B = AA_L^{-1}$. Invoking (113) concludes the proof.
D.2.2 Proof of Corollary 8

To prove Corollary 8, we need the following lemma, which controls $\|C^{-1}\|_{\infty,1}$.

Lemma 22. Let $\bar\gamma$ and $\underline\gamma$ be defined in (45). Assume $\bar\gamma \asymp \underline\gamma$ and $\sum_{k'\ne k}\sqrt{C_{kk'}} = o(\sqrt{C_{kk}})$ for any $1 \le k \le K$. Then $\|C^{-1}\|_{\infty,1} = O(K)$.

Proof. From the definition of the $(\infty,1)$-norm and the symmetry of $C$, one has
\[
\|C^{-1}\|_{\infty,1} = \sup_{v\ne 0}\frac{\|C^{-1}v\|_1}{\|v\|_1} = \sup_{v\ne 0}\frac{\|v\|_1}{\|Cv\|_1} = \Big(\inf_{\|v\|_1=1}\|Cv\|_1\Big)^{-1}.
\]
It suffices to lower bound $\inf_{\|v\|_1=1}\|Cv\|_1$. We have
\[
\inf_{\|v\|_1=1}\|Cv\|_1 = \inf_{\|v\|_1=1}\sum_{a=1}^K\Bigg|C_{aa}v_a + \sum_{b\ne a}C_{ab}v_b\Bigg| \ge \inf_{\|v\|_1=1}\sum_{a=1}^K\Bigg(C_{aa}|v_a| - \sum_{b\ne a}C_{ab}|v_b|\Bigg) = \inf_{\|v\|_1=1}\Bigg(\sum_{a=1}^KC_{aa}|v_a| - \sum_{b=1}^K|v_b|\sum_{a\ne b}C_{ab}\Bigg).
\]
Note that $\sum_{k'\ne k}\sqrt{C_{kk'}} = o(\sqrt{C_{kk}})$ implies
\[
\sum_{k'\ne k}C_{kk'} \le \Bigg(\sum_{k'\ne k}\sqrt{C_{kk'}}\Bigg)^2 = o(C_{kk}), \quad \text{for any } k \in [K], \tag{114}
\]
and $C_{kk} \le n^{-1}\|W_{k\cdot}\|_1 = \gamma_k/K = O(1/K)$ by using $\bar\gamma \asymp \underline\gamma \asymp 1$. We further obtain
\[
\inf_{\|v\|_1=1}\|Cv\|_1 \gtrsim \inf_{\|v\|_1=1}\sum_{a=1}^K\big(C_{aa}|v_a| - |v_a|\,o(1/K)\big) \gtrsim \min_aC_{aa}.
\]
Observe that
\[
\frac{\gamma_k}{K} - C_{kk} = \sum_{k'=1}^KC_{kk'} - C_{kk} = \sum_{k'\ne k}C_{kk'} = o(C_{kk}).
\]
This implies $C_{kk} \asymp \gamma_k/K \asymp 1/K$, which concludes $\inf_{\|v\|_1=1}\|Cv\|_1 \gtrsim 1/K$ and completes the proof.
Proof of Corollary 8. From (104) in Lemma 21, under $\sum_{k'\ne k}\sqrt{C_{kk'}} = o(\sqrt{C_{kk}})$ for any $1 \le k \le K$, one has
\[
\max_{i\in L}\sum_{j\in L}\eta_{ij} \lesssim \sqrt{\frac{\alpha_I^3\gamma\log M}{Knp^3N}},
\]
which improves the rate in (102) by a factor of $\sqrt{1/K}$. Repeating the arguments used to prove Theorem 7, one can show that
\[
\mathrm{Rem}(J,k) \lesssim \sqrt{\frac{K\log M}{nN}}\cdot\frac{\gamma^{1/2}\|C^{-1}\|_{\infty,1}}{K}\cdot\Bigg[\frac{\bar\alpha_I}{\underline\alpha_I}\sqrt{|J| + \sum_{j\in J}\rho_j} + \frac{\bar\alpha_I}{\underline\alpha_I}\sqrt{\sum_{j\in J}\rho_j}\Bigg].
\]
Invoking conditions (i) – (ii) of Corollary 8, together with $\|C^{-1}\|_{\infty,1} = O(K)$ from Lemma 22, one can obtain
\[
\sum_{k=1}^K\mathrm{Rem}(I,k) \lesssim \sum_{i\in I}\sqrt{\frac{\alpha_iK\log M}{npN}} \le K\sqrt{\frac{|I|\log M}{nN}}
\]
by using the Cauchy–Schwarz inequality with $\sum_{i\in I}\alpha_i \le \sum_{i=1}^p\alpha_i \le pK$, and
\[
\mathrm{Rem}(J,k) \lesssim \sqrt{\frac{|J|K\log M}{nN}}.
\]
This yields the optimal rate for $L_1(\widehat A, A)$. For $L_{1,\infty}(\widehat A, A)$, observe that $\sum_{i\in I_k}\alpha_i/p = \sum_{i\in I_k}A_{ik} \le 1$. This yields
\[
\mathrm{Rem}(I,k) \lesssim \sum_{i\in I_k}\sqrt{\frac{\alpha_iK\log M}{npN}} \le \sqrt{\frac{|I_k|K\log M}{nN}},
\]
which, in conjunction with the rate of $\mathrm{Rem}(J,k)$, concludes the result for $L_{1,\infty}(\widehat A, A)$.
D.2.3 Proof of the statement in Remark 9

We prove the result of Theorem 7 when condition (26) is replaced by (27) and (49). First note that, similarly to (64) and (77), condition (49) implies
\[
\min_{j\in I}\frac{\mu_j}{p} = \min_{j\in I}\frac{1}{n}\sum_{i=1}^n\Pi_{ji} \ge \frac{c\log M}{N}, \qquad \min_{j\in I}\frac{m_j}{p} = \min_{j\in I}\max_{1\le i\le n}\Pi_{ji} \ge \frac{c'(\log M)^2}{N}. \tag{115}
\]
We work on the event $\mathcal{E}'$ defined in the proof of Corollary 18, which holds with probability greater than $1 - O(M^{-1})$ under condition (27). To prove Theorem 7, observe that its proof in Section D.2.1 only uses the results of Lemma 13, Lemma 21 and (80). Since Lemma 13 and (80) are guaranteed by $\mathcal{E}'$, it suffices to prove that Lemma 21 holds for $\eta_{j\ell}$ defined in (82). This is indeed true, since the only terms in the expression of $\eta_{j\ell}$ in (82) that differ from (101) are
\[
\frac{m_j+m_\ell}{p}\vee\frac{\log^2M}{N}, \qquad \frac{\mu_j+\mu_\ell}{p}\vee\frac{\log M}{N}
\]
for any $j \in I_k$ and $\ell \in [p]$, and hence invoking (115) yields
\[
\frac{m_j+m_\ell}{p}\vee\frac{\log^2M}{N} \lesssim \frac{m_j+m_\ell}{p}, \qquad \frac{\mu_j+\mu_\ell}{p}\vee\frac{\log M}{N} \lesssim \frac{\mu_j+\mu_\ell}{p}.
\]
Therefore, Lemma 21 follows, and so does Theorem 7.
E Discussions of Assumptions 2 and 3
E.1 Examples
The following example shows that Assumption 2 does not imply Assumption 3.
Example 2. In the first two instances, (116) and (117) below, Assumptions 2 and 3 both hold. They give insight into why Assumption 3 may fail while Assumption 2 holds, which is shown in the third instance, (118).
We consider the following matrix:
\[
W = \begin{bmatrix} 0.5 & 0.4 & 0 & 0 & 0.4\\ 0.2 & 0.6 & 0.5 & 0.5 & 0\\ 0.3 & 0 & 0.5 & 0.5 & 0.1\\ 0 & 0 & 0 & 0 & 0.5 \end{bmatrix} \implies C = \begin{bmatrix} 0.34 & 0.15 & 0.1 & 0.31\\ 0.15 & 0.28 & 0.22 & 0\\ 0.1 & 0.22 & 0.31 & 0.07\\ 0.31 & 0 & 0.07 & 1 \end{bmatrix}. \tag{116}
\]
Clearly, each diagonal entry of $C$ dominates the non-diagonal ones, row-wise. From (23), Assumption 3 holds. Assumption 2 can be verified as well. We see that topic 4 is rare, in that it only occurs in document 5. However, the probability that the rare topic occurs is not small ($W_{45} = 0.5$). Since each column sums to 1, the closer $W_{45}$ is to 1, the smaller the other entries in the 5th column
must be, and the more uncorrelated topic 4 will be with other topics. Hence, the larger W45, the
more likely it is that Assumption 3 holds. Suppose the rare topic 4 also has a small probability, for
instance
\[
W = \begin{bmatrix} 0.5 & 0.4 & 0 & 0 & 0.3\\ 0.2 & 0.6 & 0.5 & 0.5 & 0.2\\ 0.3 & 0 & 0.5 & 0.5 & 0.4\\ 0 & 0 & 0 & 0 & 0.1 \end{bmatrix} \implies C = \begin{bmatrix} 0.35 & 0.17 & 0.13 & 0.25\\ 0.17 & 0.24 & 0.19 & 0.1\\ 0.13 & 0.19 & 0.26 & 0.24\\ 0.25 & 0.1 & 0.24 & 1 \end{bmatrix}. \tag{117}
\]
While the rare topic 4 now has the small non-zero entry $W_{45} = 0.1$, none of $W_{i5}$, $i = 1, 2, 3$, is dominant, and Assumption 3 is still met. However, it fails in the following scenario:
\[
W = \begin{bmatrix} 0.5 & 0.4 & 0 & 0 & 0.9\\ 0.2 & 0.6 & 0.5 & 0.5 & 0\\ 0.3 & 0 & 0.5 & 0.5 & 0\\ 0 & 0 & 0 & 0 & 0.1 \end{bmatrix} \implies C = \begin{bmatrix} 0.38 & 0.1 & 0.06 & 0.5\\ 0.1 & 0.28 & 0.24 & 0\\ 0.06 & 0.24 & 0.35 & 0\\ 0.5 & 0 & 0 & 1 \end{bmatrix}. \tag{118}
\]
Here, a frequent topic (topic 1) has probability $W_{15} = 0.9$, which dominates all other entries in the fifth column. Topic 4 is hard to detect, since it occurs with probability 0.1 while the frequent topic 1 occurs with probability 0.9 in the single document that may reveal topic 4. We emphasize that $W$ in all three cases above satisfies Assumption 2.
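As a quick sanity check, the criterion can be evaluated directly on the three matrices $C$ displayed in (116) – (118). The sketch below (ours, not part of the paper's code) uses the equivalent form of Assumption 3 stated in Example 3, namely $C_{ii}\wedge C_{jj} > C_{ij}$ for all $i \ne j$:

```python
import numpy as np

def assumption3_holds(C):
    """Check C_ii ∧ C_jj > C_ij for all i != j (the Assumption 3 criterion)."""
    C = np.asarray(C)
    K = C.shape[0]
    return all(min(C[i, i], C[j, j]) > C[i, j]
               for i in range(K) for j in range(K) if i != j)

C116 = [[0.34, 0.15, 0.10, 0.31],
        [0.15, 0.28, 0.22, 0.00],
        [0.10, 0.22, 0.31, 0.07],
        [0.31, 0.00, 0.07, 1.00]]
C117 = [[0.35, 0.17, 0.13, 0.25],
        [0.17, 0.24, 0.19, 0.10],
        [0.13, 0.19, 0.26, 0.24],
        [0.25, 0.10, 0.24, 1.00]]
C118 = [[0.38, 0.10, 0.06, 0.50],
        [0.10, 0.28, 0.24, 0.00],
        [0.06, 0.24, 0.35, 0.00],
        [0.50, 0.00, 0.00, 1.00]]

print(assumption3_holds(C116), assumption3_holds(C117), assumption3_holds(C118))
# → True True False
```

In (118), the failure comes from the pair $(1, 4)$: $C_{11}\wedge C_{44} = 0.38 < C_{14} = 0.5$.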
We give an example below to show that Assumption 3 does not imply Assumption 2, that is,
W satisfies Assumption 3 but does not have rank K.
Example 3. Let $K = n = 5$ and consider
\[
W = \begin{bmatrix} 0.3 & 0.05 & 0.3 & 0 & 0\\ 0.25 & 0.04 & 0.25 & 0.32 & 0.1\\ 0 & 0.5 & 0.15 & 0.1 & 0.3\\ 0.055 & 0.257 & 0.13 & 0.466 & 0.28\\ 0.395 & 0.153 & 0.17 & 0.114 & 0.32 \end{bmatrix}.
\]
Note that $W_{4\cdot} = -0.9W_{1\cdot} + 1.3W_{2\cdot} + 0.5W_{3\cdot}$, which implies $\operatorname{rank}(W) < 5$. However, with $\overline W = D_W^{-1}W$, it follows that
\[
C = n\overline W\,\overline W^\top = \begin{bmatrix} 2.16 & 1.22 & 0.51 & 0.44 & 1.18\\ 1.22 & 1.3 & 0.59 & 1.02 & 0.98\\ 0.51 & 0.59 & 1.69 & 1.12 & 0.87\\ 0.44 & 1.02 & 1.12 & 1.35 & 0.83\\ 1.18 & 0.98 & 0.87 & 0.83 & 1.22 \end{bmatrix}.
\]
Therefore, Assumption 3 is satisfied, as it is equivalent to $C_{ii}\wedge C_{jj} > C_{ij}$ for any $i \ne j \in [K]$, whereas $W$ does not have full rank.
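Example 3 can be verified numerically. The sketch below is our own; it assumes $C = n\overline W\,\overline W^\top$ with $\overline W = D_W^{-1}W$ and $D_W = \operatorname{diag}(\|W_{1\cdot}\|_1, \ldots, \|W_{K\cdot}\|_1)$, as in Lemma 23:

```python
import numpy as np

W = np.array([[0.300, 0.050, 0.30, 0.000, 0.00],
              [0.250, 0.040, 0.25, 0.320, 0.10],
              [0.000, 0.500, 0.15, 0.100, 0.30],
              [0.055, 0.257, 0.13, 0.466, 0.28],
              [0.395, 0.153, 0.17, 0.114, 0.32]])
K, n = W.shape

# W_{4.} = -0.9 W_{1.} + 1.3 W_{2.} + 0.5 W_{3.}, so W is rank deficient
combo = -0.9 * W[0] + 1.3 * W[1] + 0.5 * W[2]
print(np.allclose(combo, W[3]), np.linalg.matrix_rank(W))  # → True 4

# C = n * Wbar Wbar^T with Wbar = D_W^{-1} W (rows rescaled to sum to one)
Wbar = W / W.sum(axis=1, keepdims=True)
C = n * Wbar @ Wbar.T
print(np.round(C[0, 0], 2), np.round(C[0, 1], 2))  # → 2.16 1.22

# Assumption 3 in the form C_ii ∧ C_jj > C_ij for all i != j
ok = all(min(C[i, i], C[j, j]) > C[i, j]
         for i in range(K) for j in range(K) if i != j)
print(ok)  # → True
```

The computed entries of $C$ match the displayed matrix up to rounding to two decimals.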
E.2 Statements and proofs of results invoked in Remark 3

We first show the equivalence between $\nu > 4\delta$ and
\[
\cos(\angle(W_{i\cdot}, W_{j\cdot})) < \Bigg(\frac{\zeta_i}{\zeta_j}\wedge\frac{\zeta_j}{\zeta_i}\Bigg)\Bigg(1 - \frac{4\delta}{n\zeta_i\zeta_j}\Bigg), \quad \text{for all } 1 \le i \ne j \le K. \tag{119}
\]

Lemma 23. Let $\nu$ be defined in (23) with $C = n\overline W\,\overline W^\top$ and $\overline W = D_W^{-1}W$. Then the inequality $\nu > 4\delta$ is equivalent to (119). In particular, $\nu > 0$ is equivalent to Assumption 3.

Proof. Starting from the definition of $\nu$, we have
\[
\nu > 4\delta \iff \frac{n\|W_{i\cdot}\|_2^2}{\|W_{i\cdot}\|_1^2}\wedge\frac{n\|W_{j\cdot}\|_2^2}{\|W_{j\cdot}\|_1^2} - \frac{n\langle W_{i\cdot}, W_{j\cdot}\rangle}{\|W_{i\cdot}\|_1\|W_{j\cdot}\|_1} > 4\delta, \quad \text{for all } i \ne j,
\]
\[
\phantom{\nu > 4\delta} \iff n\zeta_i^2\wedge n\zeta_j^2 - n\zeta_i\zeta_j\cos(\angle(W_{i\cdot}, W_{j\cdot})) > 4\delta, \quad \text{for all } i \ne j,
\]
by using $\zeta_i = \|W_{i\cdot}\|_2/\|W_{i\cdot}\|_1$. The result follows after rearranging terms, and taking $\delta = 0$ we obtain the second assertion as well.
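At $\delta = 0$, the equivalence reduces to: $\nu > 0$ if and only if $\cos(\angle(W_{i\cdot}, W_{j\cdot})) < \zeta_i/\zeta_j\wedge\zeta_j/\zeta_i$ for all $i \ne j$. A small numeric check of this (ours), taking $\nu$ as the smallest gap $\min_{i\ne j}(C_{ii}\wedge C_{jj} - C_{ij})$, which we assume is consistent with (23):

```python
import numpy as np

rng = np.random.default_rng(1)
K, n = 4, 30

for _ in range(200):
    W = rng.random((K, n)) * (rng.random((K, n)) < 0.5)  # sparse nonnegative W
    W += 1e-3                                            # keep every row nonzero
    Wbar = W / W.sum(axis=1, keepdims=True)
    C = n * Wbar @ Wbar.T

    # nu > 0: the smallest gap C_ii ∧ C_jj - C_ij over pairs i != j
    nu = min(min(C[i, i], C[j, j]) - C[i, j]
             for i in range(K) for j in range(K) if i != j)

    # condition (119) at delta = 0, with zeta_i = ||W_i.||_2 / ||W_i.||_1
    zeta = np.linalg.norm(W, axis=1) / np.abs(W).sum(axis=1)
    cond = all(
        np.dot(W[i], W[j]) / (np.linalg.norm(W[i]) * np.linalg.norm(W[j]))
        < min(zeta[i] / zeta[j], zeta[j] / zeta[i])
        for i in range(K) for j in range(K) if i != j)

    assert (nu > 0) == cond
print("nu > 0 matches condition (119) with delta = 0 on 200 random draws")
```

Both sides are the same inequality rescaled by $n\zeta_i\zeta_j > 0$, so they must agree on every draw.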
The following lemma provides an upper bound on $\delta$ defined in (24).

Lemma 24. Assume condition (26) holds. Then there exists some constant $c > 0$ such that
\[
\delta \le c\Bigg\{\max_{\ell\in[p]}\frac{m_\ell}{\mu_\ell}\sqrt{\frac{1}{n}} + \sqrt{\frac{\log M}{n}}\Bigg\} := \varepsilon_n.
\]
If, in addition, $\log M = o(n)$ and condition (33) hold, we have
\[
\frac{\delta}{n\min_i\zeta_i^2} = o(1)
\]
with $\zeta_i = \|W_{i\cdot}\|_2/\|W_{i\cdot}\|_1$.

Proof. First, we notice that $\delta = \max_{1\le j\le p}\delta_{jj}$. For any $1 \le j \le p$, the expressions in (68) – (69) yield
\[
\delta_{jj} \asymp \frac{p^2}{\mu_j^2}\Bigg[\sqrt{\frac{m_j\Theta_{jj}\log M}{npN}} + \frac{m_j\log M}{npN} + \sqrt{\frac{\mu_j(\log M)^4}{npN^3}} + \Theta_{jj}\sqrt{\frac{p\log M}{\mu_jnN}}\Bigg].
\]
Using
\[
\Theta_{jj} = \frac{1}{n}\sum_{i=1}^n\Pi_{ji}^2 \le \frac{1}{n}\sum_{i=1}^n\Pi_{ji}\max_{1\le i\le n}\Pi_{ji} \overset{(53)}{=} \frac{\mu_jm_j}{p^2}, \tag{120}
\]
together with (64), we obtain
\[
\delta_{jj} \lesssim \frac{m_j}{\mu_j}\sqrt{\frac{p\log M}{\mu_jnN}} + \frac{m_jp\log M}{\mu_j^2nN} + \sqrt{\frac{p^3(\log M)^4}{\mu_j^3nN^3}} \lesssim \frac{m_j}{\mu_j}\sqrt{\frac{1}{n}} + \sqrt{\frac{\log M}{n}}. \tag{121}
\]
Taking the maximum over $j \in [p]$ concludes the proof of the upper bound on $\delta$. The second part follows by observing that
\[
\min_{1\le i\le K}n\zeta_i^2 = \min_{1\le i\le K}\frac{n\|W_{i\cdot}\|_2^2}{\|W_{i\cdot}\|_1^2} \ge 1,
\]
by the Cauchy–Schwarz inequality.
E.3 The Dirichlet distribution of W

Suppose the columns of $W = (W_1, \ldots, W_n)$ are i.i.d. samples from the Dirichlet distribution parametrized by the vector $d = (d_1, \ldots, d_K) \in \mathbb{R}^K$. Let $d_0 := \sum_kd_k$. It is well known that, for any $1 \le k \le K$,
\[
a_k := \mathbb{E}[W_{ki}] = \frac{d_k}{d_0}, \qquad \sigma^2_{kk} := \mathrm{Var}(W_{ki}) = \frac{d_k(d_0-d_k)}{d_0^2(d_0+1)}, \tag{122}
\]
and, for any $1 \le k \ne \ell \le K$,
\[
s_{kk} := \mathbb{E}[W_{ki}^2] = \frac{d_k(1+d_k)}{d_0(1+d_0)}, \qquad s_{k\ell} := \mathbb{E}[W_{ki}W_{\ell i}] = \frac{d_kd_\ell}{d_0(1+d_0)}. \tag{123}
\]
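These moment formulas are easy to confirm by simulation. A minimal Monte Carlo sketch (ours, with an arbitrary choice of the parameter $d$):

```python
import numpy as np

rng = np.random.default_rng(2)
d = np.array([0.5, 1.0, 2.0])        # Dirichlet parameter (arbitrary choice)
d0 = d.sum()
n = 200_000
W = rng.dirichlet(d, size=n).T       # K x n, columns are i.i.d. Dirichlet(d)

# theoretical moments from (122) - (123)
a = d / d0                                    # E[W_ki]
s_kk = d * (1 + d) / (d0 * (1 + d0))          # E[W_ki^2]
S = np.outer(d, d) / (d0 * (1 + d0))          # E[W_ki W_li] for k != l

print(np.abs(W.mean(axis=1) - a).max() < 1e-2)       # → True
emp2 = (W * W).mean(axis=1)
print(np.abs(emp2 - s_kk).max() < 1e-2)              # → True
emp_cross = (W[0] * W[1]).mean()
print(abs(emp_cross - S[0, 1]) < 1e-2)               # → True
```

With $n = 2\times10^5$ samples the Monte Carlo error is of order $10^{-3}$, well within the tolerance used.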
Further denote
\[
\widehat a_k := \frac{1}{n}\sum_{i=1}^nW_{ki}, \qquad \widehat s_{k\ell} := \frac{1}{n}\sum_{i=1}^nW_{ki}W_{\ell i}, \quad \text{for any } 1 \le k, \ell \le K,
\]
and the event
\[
\mathcal{E}_{\mathrm{dir}} = \Big\{|\widehat a_k - a_k| \le \varepsilon_1^k,\ |\widehat s_{k\ell} - s_{k\ell}| \le \varepsilon_2^{k\ell}, \ \text{for all } 1 \le k, \ell \le K\Big\}. \tag{124}
\]
From Lemma 23, the condition $\nu > 4\delta$ discussed in Remark 3 is equivalent to
\[
\frac{\widehat s_{kk}}{\widehat a_k^2}\wedge\frac{\widehat s_{\ell\ell}}{\widehat a_\ell^2} - \frac{\widehat s_{k\ell}}{\widehat a_k\widehat a_\ell} > 4\delta, \quad \text{for any } 1 \le k < \ell \le K. \tag{125}
\]
In particular, Assumption 3 corresponds to $\delta = 0$ in (125). The following lemma provides sufficient conditions that guarantee Assumption 3 and $\nu > 4\delta$. Recall that $M := n\vee p\vee\max_{i\in[n]}N_i$.

Lemma 25. Assume the columns of $W$ are i.i.d. from a Dirichlet distribution with parameter $d = (d_1, \ldots, d_K)$. Set $d_0 := d_1+\cdots+d_K$, $\bar d := \max_kd_k$ and $\underline d := \min_kd_k$. Assumption 3 holds with probability greater than $1 - O(M^{-1})$, provided that
\[
\frac{1\vee\bar d}{\underline d}\,d_0(1+d_0) \le c\,\frac{n}{\log M} \tag{126}
\]
holds for some sufficiently small constant $c > 0$. Additionally, if
\[
\frac{\bar d(1+d_0)}{\underline d} \le c\sqrt{\frac{n}{\log M}} \tag{127}
\]
holds for some constant $c > 0$ small enough, then $\mathbb{P}\{\nu > 4\delta\} \ge 1 - O(M^{-1})$.
Proof. Display (121) together with
\[
\frac{m_j}{p} = \max_{1\le t\le n}\Pi_{jt} \le \|A_{j\cdot}\|_1, \qquad \frac{\mu_j}{p} = \frac{1}{n}\sum_{t=1}^n\Pi_{jt} \ge \|A_{j\cdot}\|_1\min_k\frac{1}{n}\|W_{k\cdot}\|_1 = \|A_{j\cdot}\|_1\min_k\widehat a_k
\]
from (65) and (67) yield
\[
\delta \lesssim \frac{1}{\min_k\widehat a_k}\sqrt{\frac{1}{n}} + \sqrt{\frac{\log M}{n}}. \tag{128}
\]
We work on the event $\mathcal{E}_{\mathrm{dir}}$ with $\varepsilon_1^k$ and $\varepsilon_2^{k\ell}$ chosen as
\[
\varepsilon_1^k = c_1\sqrt{\frac{d_k}{d_0(1+d_0)}}\sqrt{\frac{\log M}{n}}, \qquad \varepsilon_2^{k\ell} = c_2\Bigg(\sqrt{\frac{d_kd_\ell}{d_0(1+d_0)}} + \sqrt{\frac{\log M}{n}}\Bigg)\sqrt{\frac{\log M}{n}}, \tag{129}
\]
for any $1 \le k \le \ell \le K$, where $c_1, c_2$ are positive constants such that, by Lemma 26, $\mathbb{P}(\mathcal{E}_{\mathrm{dir}}) \ge 1 - O(M^{-1})$. Note that condition (126) and $\mathcal{E}_{\mathrm{dir}}$ imply
\[
\widehat a_k \le a_k + \varepsilon_1^k = \frac{d_k}{d_0} + c_1\sqrt{\frac{d_k}{d_0(1+d_0)}}\sqrt{\frac{\log M}{n}} \le c_1'a_k; \tag{130}
\]
\[
\widehat a_k \ge a_k - \varepsilon_1^k = \frac{d_k}{d_0} - c_1\sqrt{\frac{d_k}{d_0(1+d_0)}}\sqrt{\frac{\log M}{n}} \ge c_1''a_k \tag{131}
\]
for all $k \in [K]$. In the following, we prove (125) by only showing
\[
\frac{\widehat s_{kk}}{\widehat a_k^2} - \frac{\widehat s_{k\ell}}{\widehat a_k\widehat a_\ell} = \frac{\widehat s_{kk}\widehat a_\ell - \widehat s_{k\ell}\widehat a_k}{\widehat a_k^2\widehat a_\ell} > 4\delta, \tag{132}
\]
since the other part can be deduced by similar arguments. We first show that, on the event $\mathcal{E}_{\mathrm{dir}}$, condition (126) guarantees
\[
\frac{\widehat s_{kk}\widehat a_\ell - \widehat s_{k\ell}\widehat a_k}{\widehat a_k^2\widehat a_\ell} \ge \rho\cdot\frac{d_0}{d_k(1+d_0)}
\]
for some constant $\rho \in (0, 1)$. We bound the numerator from below by
\[
\widehat s_{kk}\widehat a_\ell - \widehat s_{k\ell}\widehat a_k \ge \big(s_{kk} - \varepsilon_2^{kk}\big)\big(a_\ell - \varepsilon_1^\ell\big) - \big(s_{k\ell} + \varepsilon_2^{k\ell}\big)\big(a_k + \varepsilon_1^k\big) \ge s_{kk}a_\ell - s_{k\ell}a_k - \varepsilon_2^{kk}c_1'a_\ell - \varepsilon_1^\ell s_{kk} - \varepsilon_2^{k\ell}c_1'a_k - \varepsilon_1^ks_{k\ell} = \frac{d_kd_\ell}{d_0^2(1+d_0)} - c_1'\varepsilon_2^{kk}\frac{d_\ell}{d_0} - \varepsilon_1^\ell\frac{d_k(1+d_k)}{d_0(1+d_0)} - c_1'\varepsilon_2^{k\ell}\frac{d_k}{d_0} - \varepsilon_1^k\frac{d_kd_\ell}{d_0(1+d_0)}.
\]
We used (130) in the second inequality and (122) – (123) in the last equality. We then show that
\[
\max\Bigg\{c_1'\varepsilon_2^{kk}\frac{d_\ell}{d_0},\ \varepsilon_1^\ell\frac{d_k(1+d_k)}{d_0(1+d_0)},\ c_1'\varepsilon_2^{k\ell}\frac{d_k}{d_0},\ \varepsilon_1^k\frac{d_kd_\ell}{d_0(1+d_0)}\Bigg\} \le \frac{d_kd_\ell}{5d_0^2(1+d_0)}.
\]
From (122) – (123), it suffices to show the individual inequalities
\[
5c_1'\varepsilon_2^{kk} < \frac{d_k}{d_0(1+d_0)}, \qquad 5\varepsilon_1^\ell < \frac{d_\ell}{d_0(1+d_k)}, \qquad 5c_1'\varepsilon_2^{k\ell} < \frac{d_\ell}{d_0(1+d_0)}, \qquad 5\varepsilon_1^k < \frac{1}{d_0}.
\]
These inequalities follow by invoking condition (126) and plugging in the expressions for $\varepsilon_1^k$ and $\varepsilon_2^{k\ell}$ in (129). We thus have
\[
\widehat s_{kk}\widehat a_\ell - \widehat s_{k\ell}\widehat a_k \ge \frac{1}{5}\,\frac{d_kd_\ell}{d_0^2(1+d_0)}.
\]
In conjunction with (130), we conclude that
\[
\frac{\widehat s_{kk}\widehat a_\ell - \widehat s_{k\ell}\widehat a_k}{\widehat a_k^2\widehat a_\ell} \ge \frac{1}{5(c_1')^3}\,\frac{d_0}{d_k(1+d_0)},
\]
which further yields
\[
\frac{\widehat s_{kk}}{\widehat a_k^2}\wedge\frac{\widehat s_{\ell\ell}}{\widehat a_\ell^2} - \frac{\widehat s_{k\ell}}{\widehat a_k\widehat a_\ell} \ge \frac{1}{5(c_1')^3}\,\frac{d_0}{\bar d(1+d_0)}.
\]
When $\delta = 0$, the first statement follows. To prove the statement for $\nu > 4\delta$, it suffices to show that
\[
\frac{1}{5(c_1')^3}\,\frac{d_0}{\bar d(1+d_0)} > 4\delta.
\]
This follows by invoking (127), (128) and (131). The proof is complete.
Remark 16. For a symmetric Dirichlet distribution, that is, $d_1 = \cdots = d_K = d$, conditions (127) and (126) become, respectively,
\[
1 + Kd \le c\sqrt{\frac{n}{\log M}}, \qquad (1\vee d)K(1+Kd) \le c\,\frac{n}{\log M}.
\]
The larger $K$ is, the smaller $d$ needs to be relative to $n$. Note that a small $d$ encourages sparsity of $W$.

Lemma 26. Under condition (126), assume the columns of $W$ are i.i.d. samples from the Dirichlet distribution. Then $\mathbb{P}(\mathcal{E}_{\mathrm{dir}}) \ge 1 - M^{-1}$ with $\varepsilon_1^k$ and $\varepsilon_2^{k\ell}$ chosen as in (129).

Sketch of the proof. Since $W_{ki}$ and $W_{ki}W_{\ell i}$ are bounded by 1, and displays (122) – (123) imply $\mathrm{Var}(W_{ki}W_{\ell i}) \le d_kd_\ell/(d_0(1+d_0))$, using Bernstein's inequality in Lemma 11 with arguments similar to those in Lemmas 13 and 15, together with condition (126), yields the desired result.
F Cross-validation procedure for selecting C1 in Algorithm 3

We give a practical cross-validation procedure for selecting the proportionality constant $C_1$ used in Algorithm 3. Starting with a chosen fine grid $\mathcal{C}$, we randomly split the data into a training set $\mathcal{D}_1$ and a validation set $\mathcal{D}_2$. For each $c \in \mathcal{C}$, we apply Algorithm 3 with $T = 1$, $C_0 = 0.01$ and $C_1 = c$ on $\mathcal{D}_1$ to obtain $\widehat A_c$ and $\widehat C_c$, and then calculate the out-of-sample error
\[
L(c) := \big\|\widehat\Theta^{(2)} - \widehat A_c\widehat C_c\widehat A_c^\top\big\|_1.
\]
Here $\widehat\Theta^{(2)}$ is obtained as in (20) by using $\mathcal{D}_2$, and $\|\cdot\|_1$ denotes the matrix $\ell_1$ norm. The selected $C_1$ is defined as
\[
\widehat C_1 = \operatorname{argmin}_{c\in\mathcal{C}}L(c).
\]
Using the estimates $\widehat I$ and $\widehat A$, and motivated by the structure $\Theta_{II} = A_ICA_I^\top$, we can estimate $C$ by
\[
\widehat C = (\widehat A_I^\top\widehat A_I)^{-1}\widehat A_I^\top\widehat\Theta_{II}\widehat A_I(\widehat A_I^\top\widehat A_I)^{-1}.
\]
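In the noiseless case, where $\widehat\Theta_{II} = A_ICA_I^\top$ exactly and $A_I$ has full column rank, this least-squares formula recovers $C$ without error. A minimal sketch (ours) with synthetic $A_I$ and $C$:

```python
import numpy as np

rng = np.random.default_rng(3)
p_I, K = 12, 4
A_I = rng.random((p_I, K))           # synthetic word-topic block; full column rank a.s.
C = rng.random((K, K))
C = (C + C.T) / 2                    # symmetric, as a matrix of the form n*Wbar Wbar^T would be
Theta_II = A_I @ C @ A_I.T           # noiseless Theta_II

G = np.linalg.inv(A_I.T @ A_I)
C_hat = G @ A_I.T @ Theta_II @ A_I @ G
print(np.allclose(C_hat, C))  # → True
```

With a noisy $\widehat\Theta_{II}$ and estimated $\widehat A_I$, the same formula gives the plug-in estimator $\widehat C$ above.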
G Additional simulation results

Sensitivity to the number of topics K

We study the performance of Recover-L2 and Recover-KL for different values of $K$. We show that, even in a favorable low-dimensional setting ($p = 400$ and true $K_0 = 15$) with $|I_k| = 1$, $\xi = 1/p$ and large sample sizes ($N = 800$, $n = 1000$), the estimation error is seriously affected by a wrong choice of the number of topics $K$.

We generated 50 datasets according to our data generating mechanism and applied Recover-L2 and Recover-KL with different values of $K$ to each dataset to obtain $\widehat A_K$. To quantify the estimation error, we use the criterion
\[
\big\|\widehat A_K\widehat A_K^\top - AA^\top\big\|_1
\]
to evaluate the overall fit of the word-by-word membership matrix $AA^\top \in \mathbb{R}^{p\times p}$. We averaged this loss over the 50 datasets. To further benchmark the result, we use a random guessing method, Uniform, which draws $p\times K$ entries from the Uniform$(0,1)$ distribution and normalizes each column to obtain an "estimate" that is independent of the data. The performance of Recover-L2, Recover-KL and Uniform is shown in Figure 6. It clearly shows that both Recover-L2 and Recover-KL are indeed very sensitive to a correctly specified $K$: when $K$ differs from the true value $K_0$ by more than 2 units, the performance is close to random guessing. This phenomenon continues to hold in various settings, and we conclude that correctly specifying $K$ is critical for both Recover-L2 and Recover-KL.
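The Uniform baseline and the evaluation loss are straightforward to reproduce. The sketch below is ours and uses a toy $A$ with one anchor word per topic, not the paper's full data-generating mechanism:

```python
import numpy as np

rng = np.random.default_rng(4)
p, K, K0 = 400, 15, 15

# a toy "true" A: one anchor word per topic, remaining mass spread uniformly
A = np.full((p, K0), 1.0 / p)
A[:K0, :] = np.eye(K0)
A /= A.sum(axis=0, keepdims=True)

# Uniform baseline: p x K Uniform(0,1) draws, columns normalized to sum to one
A_unif = rng.random((p, K))
A_unif /= A_unif.sum(axis=0, keepdims=True)

# loss used in Figure 6: entrywise l1 norm of A_hat A_hat^T - A A^T
loss = np.abs(A_unif @ A_unif.T - A @ A.T).sum()
print(round(loss, 2))
```

Averaging this loss for a data-driven estimator over repeated datasets, and comparing it with the Uniform value, reproduces the benchmark comparison described above.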
[Figure 6 about here: average $\|\widehat A\widehat A^\top - AA^\top\|_1$ of Recover-L2, Recover-KL and Uniform, plotted against the specified $K \in \{10, \ldots, 19\}$.]

Figure 6: Plot of $\|\widehat A\widehat A^\top - AA^\top\|_1$ for different specified $K$. The black vertical bars are standard errors over 50 datasets.
Comparison of different algorithms for small K
We compare Top10, Recover-L2, Recover-KL and T-score in the benchmark setting N = n = 1500, p = 1000 and ξ = 1/p, but for small values of K. T-score is the procedure of Ke and Wang (2017), who kindly made their code available to us. Since K is small, the cardinality of each column Wi of W is randomly sampled from {1, 2, 3}.

In the first experiment, we considered K = 5 and varied |Ik| over {2, 3, 4, 5, 6}. The estimation error L1(A, Â)/K, averaged over 50 generated datasets, is shown in Figure 7. We see that the performance of Recover-KL is as good as that of Top10, and even better for |Ik| ≤ 3. However, as already verified in Section 6, Recover-KL has the worst performance for large K and is computationally demanding. The plot also demonstrates that T-score needs at least 4 anchor words per group in order to achieve performance comparable with Top10 and Recover-KL. This is as expected, since |Ik| needs to grow as p^2 log^2(n)/(nN) in Ke and Wang (2017).
In the second experiment, we set |Ik| = p/100 and varied K over {5, 6, . . . , 10}. We considered this range of values because, unfortunately, the implementation of T-score that the authors made available to us often crashes for K = 11. The average overall L1 estimation error in Figure 7 shows that Top10 has the smallest error over all K. T-score has similar performance for K ≤ 8, but becomes worse than Top10 as K grows larger.
Figure 7: The left plot shows the averaged overall estimation error when varying |Ik| with K = 5. The right plot shows the error when varying K with |Ik| = p/100.
The New York Times (NYT) dataset
Similar to the procedure used to preprocess the NIPS dataset, we removed common stop words and rare words occurring in fewer than 150 documents. The dataset after preprocessing has n = 299419, p = 3079 and N = 210. Since Arora et al. (2013) mainly considers K = 100, we tune our method to obtain a similar number of topics for our comparative study, for both the semi-synthetic data and the real data.
Semi-synthetic data comparison. To generate the semi-synthetic data, we first apply Top to the preprocessed dataset with C0 = 0.01 and C1 = 15.5³, and obtain the estimated word-topic matrix Â with estimated K = 101 and 206 anchor words. Using this Â, we generate W from the previous three distributions (a) – (c), separately, with n = 40000, 50000, 60000, 70000, and then generate X based on ÂW with N = 250. For each setting, we generate 15 datasets; the averaged overall estimation error and topic-wise estimation error of Top, Recover-L2, Recover-KL and LDA are reported in Figure 8. The running times of the different algorithms are shown in Table 3. We specify the true number of topics for Recover-L2, Recover-KL and LDA, while Top estimates it.
The results have essentially the same pattern as those from the NIPS dataset. For the Dirichlet distribution with parameter 0.03, Recover-KL, Recover-L2 and Top are comparable, whereas Top outperforms the other two when the topics become more correlated. LDA has the worst performance, especially when n is relatively small. Top and Recover-L2 run much faster than Recover-KL and LDA.
[Figure 8: two panels (overall estimation error; topic-wise estimation error) plotted against n, each split by the distribution of W: Dirichlet 0.03, Logistic Normal 0.02 and Logistic Normal 0.2, for the methods KL, L2, LDA and Top.]
Figure 8: Plots of averaged overall estimation error and topic-wise estimation error of Top,
Recover-L2 (L2), Recover-KL (KL) and LDA from the NYT dataset. Top estimates K
while the other algorithms use the true K as input. The bars denote one standard deviation.
Real data comparison from the NYT dataset. We applied Top, Recover-L2, Recover-
KL and LDA to the preprocessed NYT dataset. In order to ensure that all algorithms make use of
a similar number of topics, Top uses C0 = 0.01 and C1 = 15.5, which gives an estimated K = 101,
³ The value of C1 is chosen such that the estimated number of topics is around 100.
Table 3: Running time (seconds) of different algorithms on the NYT dataset
Top Recover-L2 Recover-KL LDA
n = 40000 471.5 795.6 4681.9 35913.7
n = 50000 517.2 841.5 4824.0 44754.1
n = 60000 555.3 894.1 4945.3 53833.8
n = 70000 579.2 957.2 5060.8 62548.8
and the other three algorithms use K = 101 as input. Evaluating algorithms on real data is difficult since we do not know the ground truth. However, there exist some standard ways of evaluating the fitted model. Since our main target is the word-topic matrix A, we consider two widely used metrics for evaluating the semantic quality of the estimated topics⁴ (Arora et al., 2013; Mimno et al., 2011; Stevens et al., 2012):
(1) Topic coherence. Coherence is a measure of the co-occurrence of the high-frequency words for each individual estimated topic. “This metric has been shown to correlate well with human judgments of topic quality. When the topics are perfectly recovered, all the high-probability words in a topic should co-occur frequently, otherwise, the model may be mixing unrelated concepts.” (Arora et al., 2013). Given a set of words W, coherence is calculated as

Coherence(W) := ∑_{w1, w2 ∈ W} log( (D(w1, w2) + ε) / D(w2) )    (133)

where D(w1) denotes the number of documents in which the word w1 occurs, D(w1, w2) denotes the number of documents in which both w1 and w2 occur (Mimno et al., 2011), and the parameter ε = 0.01 is used to avoid taking the log of zero (Stevens et al., 2012).
In the NYT dataset, for each algorithm, we use its estimated word-topic matrix Â and calculate Coherence(Wk) for 1 ≤ k ≤ K, where Wk is the set of 20 most frequent words in topic k. The averaged coherence metric and its standard deviation across topics for Top, Recover and LDA are reported in Table 4.⁵ Top has the largest coherence, suggesting a better topic recovery, while Recover has the smallest coherence.
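A minimal sketch of computing (133) from a binary document-occurrence matrix; summing over ordered pairs w1 ≠ w2 is one convention, and the helper names and toy data are our own.

```python
import numpy as np

def coherence(top_words, doc_occurs, eps=0.01):
    """Coherence(W) = sum over ordered pairs w1 != w2 of
    log((D(w1, w2) + eps) / D(w2)), as in (133).

    top_words  : word indices (e.g. the 20 most frequent words of one topic)
    doc_occurs : (n_docs, p) binary matrix; entry [d, w] = 1 iff word w occurs in doc d
    """
    total = 0.0
    for w1 in top_words:
        for w2 in top_words:
            if w1 == w2:
                continue
            D_w2 = doc_occurs[:, w2].sum()                        # docs containing w2
            D_w1w2 = (doc_occurs[:, w1] * doc_occurs[:, w2]).sum()  # docs containing both
            total += np.log((D_w1w2 + eps) / D_w2)
    return total

# Toy usage: 3 documents over 3 words.
doc_occurs = np.array([[1, 1, 0],
                       [1, 0, 1],
                       [1, 1, 1]])
score = coherence([0, 1], doc_occurs)
```

Less negative scores indicate more coherent topics, since frequently co-occurring word pairs contribute log-ratios near zero.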
(2) Unique words. Since coherence only measures the quality of each individual topic, we also consider the unique words metric (Arora et al., 2013), which reflects the redundancy between topics. For each topic and its T most probable words, we count how many of those T words do not appear among the T most probable words of any other topic. Some overlap across topics is expected due to semantic ambiguity, but lower numbers of unique words indicate less useful models. We consider T = 100, and the averaged unique words and the standard deviation
⁴ We do not consider the held-out probability metric since it requires the estimation of both A and W, while neither Top nor Recover estimates W.
⁵ We only report one of Recover-L2 and Recover-KL since they have the same results.
across topics are reported in Table 4. Top and LDA have more unique words than Recover. In fact, the unique words from Recover are only the anchor words; the other most probable words of the different topics all overlap, indicating redundancy between the estimated topics.
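The unique-words count can be sketched as follows, assuming the estimated matrix stores words on rows and topics on columns; the function name and toy example are illustrative.

```python
import numpy as np

def unique_words(A_hat: np.ndarray, T: int = 100) -> np.ndarray:
    """For each topic, count how many of its top-T most probable words
    do not appear in any other topic's top-T list."""
    p, K = A_hat.shape
    # Top-T word indices per topic, by descending estimated probability.
    tops = [set(np.argsort(A_hat[:, k])[::-1][:T]) for k in range(K)]
    counts = np.zeros(K, dtype=int)
    for k in range(K):
        others = set().union(*(tops[j] for j in range(K) if j != k))
        counts[k] = len(tops[k] - others)
    return counts

# Toy usage: p = 5 words, K = 2 topics sharing only word index 2.
A_toy = np.array([[0.5, 0.0],
                  [0.3, 0.0],
                  [0.2, 0.2],
                  [0.0, 0.5],
                  [0.0, 0.3]])
counts = unique_words(A_toy, T=2)
```

A topic whose top-T list consists almost entirely of words shared with other topics contributes a count near zero, signaling redundancy.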
Table 4: Coherence and unique words of Top, Recover and LDA on the NYT dataset.
The numbers in the parentheses are the standard deviations across topics.
Metric Top Recover LDA
Coherence -328.8(51.3) -464.1(3.8) -340.3(44.8)
Unique words 6.7(4.4) 1.0(0.0) 6.3(3.1)