
Tighter Low-rank Approximation via Sampling the Leveraged Element

Srinadh Bhojanapalli

The University of Texas at Austin

[email protected]

Prateek Jain

Microsoft Research, India

[email protected]

Sujay Sanghavi

The University of Texas at Austin

[email protected]

Abstract

In this work, we propose a new randomized algorithm for computing a low-rank approximation to a given matrix. Taking an approach different from existing literature, our method first involves a specific biased sampling, with an element being chosen based on the leverage scores of its row and column, and then involves weighted alternating minimization over the factored form of the intended low-rank matrix, to minimize error only on these samples. Our method can leverage input sparsity, yet produce approximations in spectral (as opposed to the weaker Frobenius) norm; this combines the best aspects of otherwise disparate current results, but with a dependence on the condition number κ = σ_1/σ_r. In particular we require O(nnz(M) + nκ²r⁵/ε²) computations to generate a rank-r approximation to M in spectral norm. In contrast, the best existing method requires O(nnz(M) + nr²/ε⁴) time to compute an approximation in Frobenius norm. Besides the tightness in spectral norm, we have a better dependence on the error ε. Our method is naturally and highly parallelizable.

Our new approach enables two extensions that are interesting on their own. The first is a new method to directly compute a low-rank approximation (in efficient factored form) to the product of two given matrices; it computes a small random set of entries of the product, and then executes weighted alternating minimization (as before) on these. The sampling strategy is different because now we cannot access leverage scores of the product matrix (but instead have to work with the input matrices). The second extension is an improved algorithm, with smaller communication complexity, for the distributed PCA setting (where each server holds a small set of rows of the matrix, and we want to compute a low-rank approximation with a small amount of communication between servers).

1 Introduction

Finding a low-rank approximation to a matrix is fundamental to a wide array of machine learning techniques. The large sizes of modern data matrices have driven much recent work into efficient (typically randomized) methods to find low-rank approximations that do not exactly minimize the residual, but run much faster / in parallel, with fewer passes over the data. Existing approaches involve either intelligent sampling of a few rows / columns of the matrix, projections onto lower-dimensional spaces, or sampling of entries followed by a top-r SVD of the resulting matrix (with unsampled entries set to 0).

We pursue a different approach: we first sample entries in a specific biased random way, and then minimize the error on these samples over a search space that is the factored form of the low-rank matrix we are trying to find. We note that this is different from approximating a 0-filled matrix; it is instead reminiscent of matrix completion in the sense that it only looks at errors on the sampled entries. Another crucial ingredient is how the sampling is done: we use a combination of ℓ₁ sampling, and of a distribution where the probability of an element is proportional to the sum of the leverage scores of its row and its column.

Both the sampling and the subsequent alternating minimization are naturally fast, parallelizable, and able to utilize sparsity in the input matrix. Existing literature has either focused on running in input sparsity time but approximation in (the weaker) Frobenius norm, or on running in O(n²) time with approximation in spectral norm. Our method provides the best of both worlds: it runs in input sparsity time, with just two passes over the data matrix, and yields an approximation in spectral norm. It does however have a dependence on the ratio of the first to the r-th singular value of the matrix.

Our alternative approach also yields new methods for two related problems: directly finding the low-rank approximation of the product of two given matrices, and distributed PCA.

Our contributions are thus three new methods in this space:

• Low-rank approximation of a general matrix: Our first (and main) contribution is a new method (LELA, Algorithm 1) for low-rank approximation of any given matrix: first draw a random subset of entries in a specific biased way, and then execute a weighted alternating minimization algorithm that minimizes the error on these samples over a factored form of the intended low-rank matrix. The sampling is done with only two passes over the matrix (each in input sparsity time), and both the sampling and the alternating minimization steps are highly parallelizable and compactly stored/manipulated.

For a matrix M, let M_r be the best rank-r approximation (i.e. the matrix corresponding to the top r components of the SVD). Our algorithm finds a rank-r matrix M̂_r in time O(nnz(M) + nκ²r⁵/ε²), while providing approximation in spectral norm: ‖M − M̂_r‖ ≤ ‖M − M_r‖ + ε‖M − M_r‖_F, where κ = σ_1(M)/σ_r(M) is the condition number of M_r. Existing methods either can run in input sparsity time, but provide approximations in (the weaker) Frobenius norm (i.e. with ‖·‖ replaced by ‖·‖_F in the above expression), or run in O(n²) time to approximate in spectral norm, but even then with leading constants larger than 1. Our method however does have a dependence on κ, which these do not. See Table 1 for a detailed comparison to existing results for low-rank approximation.

• Direct approximation of a matrix product: We provide a new method to directly find a low-rank approximation to the product of two matrices, without having to first compute the product itself. To do so, we first choose a small set of entries (in a biased random way) of the product that we will compute, and then again run weighted alternating minimization on these samples. The choice of the biased random distribution is now different from above, as the sampling step does not have access to the product matrix. However, again both the sampling and alternating minimization are highly parallelizable.

For A ∈ R^{n1×d}, B ∈ R^{d×n2}, and n = max(n1, n2), our algorithm first chooses O(nκ²r³ log n/ε²) entries of the product A·B that it needs to sample; each sample takes O(d) time individually, since it is a product of two length-d vectors (though these can be parallelized). The weighted alternating minimization then runs in O(nκ²r⁵/ε²) time (where κ = σ_1(A·B)/σ_r(A·B)). This results in a rank-r approximation ÂB_r of A·B in spectral norm, as given above.

• Distributed PCA: Motivated by applications with really large matrices, recent work has looked at low-rank approximation in a distributed setting where there are s servers – each holding a small set of rows of the matrix – each of which can communicate with a central processor charged with coordinating the algorithm. In this model, one is interested in finding good approximations while minimizing both computation and the communication burden on the center.

We show that our LELA algorithm can be extended to the distributed setting while guaranteeing small communication complexity. In particular, our algorithm guarantees the same error bounds as our non-distributed version, but with a communication complexity of O(ds + nκ²r⁵ log n/ε²) real numbers for computing a rank-r approximation to M ∈ R^{n×d}. For n ≈ d and large s, our analysis guarantees significantly smaller communication complexity than the state-of-the-art method [20], while providing tighter spectral norm bounds.

Notation: A capital letter (e.g., M) typically denotes a matrix. M^i denotes the i-th row of M, M_j denotes the j-th column of M, and M_ij denotes the (i, j)-th element of M. Unless specified otherwise, M ∈ R^{n×d} and M_r is the best rank-r approximation of M. Also, M_r = U*Σ*(V*)^T denotes the SVD of M_r. κ = σ*_1/σ*_r denotes the condition number of M_r, where σ*_i is the i-th singular value of M. ‖u‖ denotes the L₂ norm of a vector u. ‖M‖ = max_{‖x‖=1} ‖Mx‖ denotes the spectral or operator norm of M. ‖M‖_F = √(Σ_ij M_ij²) denotes the Frobenius norm of M. Also, ‖M‖_{1,1} = Σ_ij |M_ij|. dist(X, Y) = ‖X_⊥^T Y‖ denotes the principal-angle-based distance between the subspaces spanned by the orthonormal matrices X and Y. Typically, C denotes a global constant independent of problem parameters that can change from step to step.

Given a set Ω ⊆ [n] × [d], P_Ω(M) is given by: P_Ω(M)(i, j) = M_ij if (i, j) ∈ Ω and 0 otherwise. R_Ω(M) = w .* P_Ω(M) denotes the Hadamard product of w and P_Ω(M). That is, R_Ω(M)(i, j) = w_ij M_ij if (i, j) ∈ Ω and 0 otherwise. Similarly, let R^{1/2}_Ω(M)(i, j) = √(w_ij) M_ij if (i, j) ∈ Ω and 0 otherwise.
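For concreteness, the following is a minimal NumPy sketch (ours, not from the paper) of the masking operators P_Ω and R_Ω defined above; the representation of Ω as index arrays and the function names are our own choices.

```python
import numpy as np

def P_Omega(M, Omega):
    """Keep only the entries of M indexed by Omega; zero elsewhere."""
    out = np.zeros_like(M, dtype=float)
    rows, cols = Omega  # Omega given as (row_indices, col_indices)
    out[rows, cols] = M[rows, cols]
    return out

def R_Omega(M, Omega, w):
    """Reweighted sampling operator: R_Omega(M)_ij = w_ij * M_ij on Omega, 0 elsewhere."""
    out = np.zeros_like(M, dtype=float)
    rows, cols = Omega
    out[rows, cols] = w[rows, cols] * M[rows, cols]
    return out
```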

2 Related results

Low rank approximation: We now briefly review some of the existing work on algorithms for low rank approximation. [12] introduced the problem of computing a low rank approximation of a matrix M with few passes over M. They presented an algorithm that samples a few rows and columns and computes an SVD to obtain a low rank approximation, with additive error guarantees. [8, 9] have extended these results. [1] considered a different approach, based on entrywise sampling and quantization, for low rank approximation and gave additive error bounds.

Reference              | Frobenius norm | Spectral norm       | Computation time
BJS14 (Our Algorithm)  | (1+ε)‖Δ‖_F     | ‖Δ‖ + ε‖Δ‖_F        | O(nnz(M) + nr⁵κ² log(n)/ε²)
CW13 [6]               | (1+ε)‖Δ‖_F     | (1+ε)‖Δ‖_F          | O(nnz(M) + nr²/ε⁴ + r³/ε⁵)
BG13 [2]               | (1+ε)‖Δ‖_F     | c‖Δ‖ + ε‖Δ‖_F       | O(n²(r + log n)/ε² + nr² log(n)/ε⁴)
NDT09 [26]             | (1+ε)‖Δ‖_F     | c‖Δ‖ + ε√n ‖Δ‖      | O(n² log(r log(n)/ε) + nr² log(n)/ε⁴)
WLRT08 [30]            | (1+ε)‖Δ‖_F     | ‖Δ‖ + ε√n ‖Δ‖       | O(n² log(r/ε) + nr⁴/ε⁴)
Sar06 [28]             | (1+ε)‖Δ‖_F     | (1+ε)‖Δ‖_F          | O(nnz(M)·r/ε + nr²/ε²)

Table 1: Comparison of error rates and computation time of some low rank approximation algorithms. Δ = M − M_r.

[16, 28, 10, 7] have given low rank approximation algorithms with relative error guarantees in Frobenius norm. [30, 26] have provided guarantees on error in spectral norm, which were later improved in [15, 2]. The main technique of these algorithms is to use a random Gaussian or Hadamard transform matrix to project the matrix onto a low dimensional subspace and compute the rank-r subspace. [2] have given an algorithm based on the random Hadamard transform that computes a rank-r approximation in time O(n²r/ε²) and gives the spectral norm bound ‖M − M̂_r‖ ≤ c‖M − M_r‖ + ε‖M − M_r‖_F.

One drawback of the Hadamard transform is that it cannot take advantage of sparsity of the input matrix. Recently, [6] gave an algorithm using sparse subspace embeddings that runs in input sparsity time with relative Frobenius norm error guarantees.

We present some results in this area as a comparison with our results in Table 1. This is a heavily subsampled set of existing results on low rank approximations. There is a lot of interesting work on the closely related problems of computing column/row based (CUR) decompositions, matrix sketching, and low rank approximation with streaming data; see [25, 15] for a more detailed discussion and comparison.

Matrix completion: The matrix completion problem is to recover an n × n rank-r matrix from observing a small number (O(nr log(n))) of random entries. Nuclear norm minimization is known to recover the matrix from uniformly random samples if the matrix is incoherent [3, 4, 27, 14]. Similar results have been shown for other algorithms like OptSpace [21] and alternating minimization [19, 17, 18]. Recently, [5] gave guarantees for recovery of any matrix under leverage score sampling from O(nr log²(n)) entries.

Distributed PCA: In distributed PCA, one wants to compute a rank-r approximation of an n × d matrix that is stored across s servers, with small communication between servers. One popular model is the row partition model, where a subset of rows is stored at each server. Algorithms in [11, 23, 13, 22] achieve O(dsr/ε) communication complexity with relative error guarantees in Frobenius norm under this model.

Recently, [20] considered the scenario of arbitrary splitting of an n × d matrix and gave an algorithm that has O(dsr/ε) communication complexity with relative error guarantees in Frobenius norm.

3 Low-rank Approximation of Matrices

In this section we present our main contribution: a new randomized algorithm for computing a low-rank approximation of any given matrix. Our algorithm first samples a few elements from the given matrix M ∈ R^{n×d}, and then a rank-r approximation is computed using only those samples. Algorithm 1 provides detailed pseudo-code; we now comment on each of the two stages:

Sampling: A crucial ingredient of our approach is using the correct sampling distribution. Recent results in matrix completion [5] indicate that a small number (O(nr log²(n))) of samples drawn in a way biased by leverage scores¹ can capture all the information in any exactly low-rank matrix. While this is indicative, here we have neither access to the leverage scores, nor is our matrix exactly low-rank. We approximate the leverage scores with the row and column norms (‖M^i‖² and ‖M_j‖²), and account for the arbitrary high-rank nature of the input by including an L₁ term in the sampling; the distribution is given in eq. (3.2). Computationally, our sampling procedure can be done in two passes and O(m log n + nnz(M)) time.

¹If the SVD of M_r is U*Σ*(V*)^T, then the leverage scores of M_r are ‖(U*)^i‖² and ‖(V*)_j‖² for all i, j.

Weighted alternating minimization: In our second step, we directly optimize over the factored form of the intended low-rank matrix, by minimizing a weighted squared error over the sampled elements from stage 1. That is, we first express the low-rank approximation M̂_r as UV^T and then iterate over U and V alternately to minimize the weighted L₂ error over the sampled entries (see Sub-routine 2). Note that this is different from taking principal components of a 0-filled version of the sampled matrix. The weights give higher emphasis to elements with smaller sampling probabilities. In particular, the goal is to minimize the following objective function:

(3.1)    Err(M̂_r) = Σ_{(i,j)∈Ω} w_ij ( M_ij − (M̂_r)_ij )²,

where w_ij = 1/q_ij when q_ij > 0, and 0 otherwise. For initialization of the WAltMin procedure, we compute the SVD of R_Ω(M) (the reweighted sampled matrix) followed by a trimming step (see Steps 4, 5 of Sub-routine 2). The trimming step sets (Û⁰)^i = 0 if ‖(Û⁰)^i‖ ≥ 4‖M^i‖/‖M‖_F and (Û⁰)^i = (U⁰)^i otherwise; it prevents heavy rows/columns from having undue influence.
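As an illustration, here is a minimal NumPy sketch (ours, not from the paper) of the trimming rule just described, applied row-wise to the initial factor U⁰; the function name and threshold argument are our own.

```python
import numpy as np

def trim_rows(U0, M, threshold_mult=4.0):
    """Zero out rows of U0 whose norm exceeds threshold_mult * ||M^i|| / ||M||_F.

    This mirrors the trimming rule in Step 5 of Sub-routine 2 (WAltMin):
    heavy rows are suppressed so they cannot dominate the initialization.
    """
    U = U0.copy()
    M_fro = np.linalg.norm(M, 'fro')
    row_norms_M = np.linalg.norm(M, axis=1)   # ||M^i||
    row_norms_U = np.linalg.norm(U, axis=1)   # ||(U^0)^i||
    heavy = row_norms_U >= threshold_mult * row_norms_M / M_fro
    U[heavy, :] = 0.0
    return U
```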

We now provide our main result for low-rank approximation and show that Algorithm 1 can provide a tight approximation to M_r while using a small number of samples m = E[|Ω|].

Theorem 3.1. Let M ∈ R^{n×d} be any given matrix (n ≥ d) and let M_r be the best rank-r approximation to M. Set the number of samples

    m = C (nκ²r³/ε²) log(n) log²(‖M‖/ζ),

where C > 0 is any global constant and κ = σ_1/σ_r, where σ_i is the i-th singular value of M. Also, set the number of iterations of the WAltMin procedure to T = log(‖M‖/ζ). Then, with probability greater than 1 − γ for any constant γ > 0, the output M̂_r of Algorithm 1 with the above specified parameters m, T satisfies:

    ‖M − M̂_r‖ ≤ ‖M − M_r‖ + ε‖M − M_r‖_F + ζ.

That is, if T = log(‖M‖/(ε‖M − M_r‖_F)), we have:

    ‖M − M̂_r‖ ≤ ‖M − M_r‖ + 2ε‖M − M_r‖_F.

Note that our time and sample complexity depends quadratically on κ. Recent results in the matrix completion literature show that such a dependence can be improved to log(κ) by using a slightly more involved analysis [18]. We believe a similar analysis can be combined with our techniques to obtain tighter bounds; we leave such a tighter analysis for future research, as the proof would be significantly more tedious and would take away from the key message of this paper.

Algorithm 1 LELA: Leveraged Element Low-rank Approximation

input: matrix M ∈ R^{n×d}, rank r, number of samples m, number of iterations T
1: Sample Ω ⊆ [n] × [d], where each element is sampled independently with probability q̂_ij = min{q_ij, 1}, where

(3.2)    q_ij = m · ( (‖M^i‖² + ‖M_j‖²)/(2(n + d)‖M‖²_F) + |M_ij|/(2‖M‖_{1,1}) ).

   /* See Section 3.1 for details about efficient implementation of this step */
2: Obtain P_Ω(M) using one pass over M
3: M̂_r = WAltMin(P_Ω(M), Ω, r, q, T)
output: M̂_r

Sub-routine 2 WAltMin: Weighted Alternating Minimization

input: P_Ω(M), Ω, r, q, T
1: w_ij = 1/q_ij when q_ij > 0, and 0 otherwise, for all i, j
2: Divide Ω into 2T + 1 equal uniformly random subsets, i.e., Ω = {Ω_0, ..., Ω_{2T}}
3: R_{Ω_0}(M) ← w .* P_{Ω_0}(M)
4: U^(0)Σ^(0)(V^(0))^T = SVD(R_{Ω_0}(M), r)   // best rank-r approximation of R_{Ω_0}(M)
5: Trim U^(0) and let Û^(0) be the output (see Section 3)
6: for t = 0 to T − 1 do
7:    V̂^(t+1) = argmin_V ‖R^{1/2}_{Ω_{2t+1}}(M − Û^(t)V^T)‖²_F, for V ∈ R^{d×r}
8:    Û^(t+1) = argmin_U ‖R^{1/2}_{Ω_{2t+2}}(M − U(V̂^(t+1))^T)‖²_F, for U ∈ R^{n×r}
9: end for
output: Completed matrix M̂_r = Û^(T)(V̂^(T))^T.

3.1 Computation complexity: In the first step we take two passes over the matrix: one to compute the sampling distribution (3.2) and one to sample the entries based on this distribution. It is easy to show that this step requires O(nnz(M) + m log(n)) time. Next, the initialization step of the WAltMin procedure requires computing the rank-r SVD of R_{Ω_0}(M), which has at most m non-zero entries. Hence, this step can be completed in O(mr) time using standard techniques like the power method; note that by Lemma 3.1 we need the top-r singular vectors of R_{Ω_0}(M) only up to a constant-factor approximation. Further, the t-th iteration of alternating minimization takes O(2|Ω_{2t+1}|r²) time. So, the total time complexity of our method is O(nnz(M) + mr²). As shown in Theorem 3.1, our method requires m = O(nκ²r³/ε² · log(n) log²(‖M‖/(ε‖M − M_r‖_F))) samples. Hence, the total run-time of our algorithm is O(nnz(M) + nκ²r⁵/ε² · log(n) log²(‖M‖/(ε‖M − M_r‖_F))).
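 
To make the procedure concrete, the following is a minimal NumPy sketch of WAltMin (our own simplification, not the authors' code): it assumes the samples and probabilities are given as dense arrays, solves each weighted least-squares step row by row, and omits the sample-splitting across iterations and the trimming step used in the analysis.

```python
import numpy as np

def waltmin(P_M, mask, q, r, T):
    """Weighted alternating minimization on sampled entries.

    P_M:  n x d array holding M_ij on sampled entries, 0 elsewhere.
    mask: n x d boolean array, True where (i, j) was sampled.
    q:    n x d array of sampling probabilities q_ij.
    """
    w = np.where(q > 0, 1.0 / np.maximum(q, 1e-12), 0.0) * mask  # w_ij = 1/q_ij on Omega

    # Initialization: rank-r SVD of the reweighted sampled matrix R_Omega(M).
    U, s, Vt = np.linalg.svd(w * P_M, full_matrices=False)
    U = U[:, :r] * s[:r]

    for _ in range(T):
        V = _solve_factor(P_M.T, w.T, U, r)   # update V with U fixed
        U = _solve_factor(P_M, w, V, r)       # update U with V fixed
    return U, V                               # M_hat_r = U @ V.T

def _solve_factor(P_M, W, F, r):
    """Solve min_G sum_ij W_ij (P_M_ij - (G F^T)_ij)^2, one row of G at a time."""
    out = np.zeros((P_M.shape[0], r))
    for i in range(P_M.shape[0]):
        Wi = W[i]                                       # weights for row i
        A = (F * Wi[:, None]).T @ F + 1e-9 * np.eye(r)  # weighted normal equations
        b = (F * Wi[:, None]).T @ P_M[i]
        out[i] = np.linalg.solve(A, b)
    return out
```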


Remarks: We now discuss how to sample entries of M according to (3.2) in O(nnz(M) + m log(n)) time. Consider the following multinomial-based sampling model: sample the number of elements per row (say m_i) by doing m draws from a multinomial distribution over the rows given by { 0.5·( d‖M^i‖²/((n + d)‖M‖²_F) + 1/(n + d) ) + 0.5·‖M^i‖₁/‖M‖_{1,1} }. Then, sample m_i elements of row i, using { 0.5·‖M_j‖²/‖M‖²_F + 0.5·|M_ij|/‖M‖_{1,1} } over j ∈ [d], with replacement.

The failure probability in this model is bounded by 2 times the failure probability when the elements are sampled according to (3.2) [3]. Hence, we can instead use the above multinomial model for sampling. Moreover, ‖M^i‖, ‖M^i‖₁ and ‖M_j‖ can be computed in time O(nnz(M) + n), so the m_i's can be sampled efficiently. Moreover, the multinomial distributions for all the rows can be computed in time O(d + nnz(M)): O(d) work for setting up the first ‖M_j‖² term and nnz(M) work for changing the base distribution wherever M_ij is non-zero. Hence, the total time complexity is O(m + n + nnz(M)).
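Below is a minimal NumPy sketch (ours, assuming a dense matrix for simplicity) of this two-pass sampling scheme: one pass computes the row/column statistics, a second pass draws the multinomial row counts and then samples entries within each row; the function name and normalizations are our own.

```python
import numpy as np

def lela_sample(M, m, rng=np.random.default_rng(0)):
    """Draw ~m entries of M from the hybrid leverage-proxy / L1 distribution (3.2)."""
    n, d = M.shape
    fro2 = np.sum(M ** 2)             # ||M||_F^2
    l1 = np.sum(np.abs(M))            # ||M||_{1,1}
    row2 = np.sum(M ** 2, axis=1)     # ||M^i||^2
    col2 = np.sum(M ** 2, axis=0)     # ||M_j||^2
    row1 = np.sum(np.abs(M), axis=1)  # ||M^i||_1

    # Row marginal: 0.5*(d*||M^i||^2/((n+d)||M||_F^2) + 1/(n+d)) + 0.5*||M^i||_1/||M||_{1,1}
    p_row = 0.5 * (d * row2 / ((n + d) * fro2) + 1.0 / (n + d)) + 0.5 * row1 / l1
    m_i = rng.multinomial(m, p_row / p_row.sum())  # samples per row

    samples = []                                   # list of (i, j, M_ij, q_ij)
    for i in range(n):
        if m_i[i] == 0:
            continue
        # Within-row distribution: 0.5*||M_j||^2/||M||_F^2 + 0.5*|M_ij|/||M||_{1,1}
        p_col = 0.5 * col2 / fro2 + 0.5 * np.abs(M[i]) / l1
        cols = rng.choice(d, size=m_i[i], replace=True, p=p_col / p_col.sum())
        for j in np.unique(cols):
            q_ij = m * ((row2[i] + col2[j]) / (2 * (n + d) * fro2)
                        + np.abs(M[i, j]) / (2 * l1))
            samples.append((i, j, M[i, j], min(q_ij, 1.0)))
    return samples
```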

3.2 Proof Overview: We now present the key steps in our proof of Theorem 3.1. As mentioned in the previous section, our algorithm proceeds in two steps: entry-wise sampling of the given matrix M and then weighted alternating minimization (WAltMin) to obtain a low-rank approximation of M.

Hence, the goal is to analyze the WAltMin procedure, with input samples obtained using (3.2), to obtain the bounds in Theorem 3.1. Now, WAltMin is an iterative procedure solving an inherently non-convex problem, min_{U,V} Σ_{(i,j)∈Ω} w_ij (e_i^T U V^T e_j − M_ij)². Hence, it is prone to local minima or, worse, might not even converge. However, recent results for low-rank matrix completion have shown that alternating minimization (with appropriate initialization) can indeed be analyzed to obtain exact matrix completion guarantees.

Our proof also follows along similar lines: we show that the initialization procedure (Step 4 of Sub-routine 2) provides an accurate enough estimate of M_r, and then at each step we show a geometric decrease in the distance to M_r. However, our proof differs from previous works in two key aspects: a) existing proof techniques for alternating minimization assume that each element is sampled uniformly at random, while we allow biased and approximate sampling; b) existing techniques crucially use the assumption that M_r is incoherent, while our proof avoids this assumption using the weighted version of AltMin.

We now present our bounds for the initialization as well as for each step of the WAltMin procedure. Theorem 3.1 follows easily from the two bounds.

Lemma 3.1. (Initialization) Let the set of entries Ω be generated according to (3.2). Also, let m ≥ C (n/δ²) log(n). Then, the following holds (w.p. ≥ 1 − 2/n^10):

(3.3)    ‖R_Ω(M) − M‖ ≤ δ‖M‖_F.

Also, if ‖M − M_r‖_F ≤ ‖M_r‖_F/(576κr^{1.5}), then the following holds (w.p. ≥ 1 − 2/n^10):

    ‖(Û^(0))^i‖ ≤ 8√r √(‖M^i‖²/‖M‖²_F)   and   dist(Û^(0), U*) ≤ 1/2,

where Û^(0) is the initial iterate obtained using Steps 4, 5 of Sub-routine 2, κ = σ*_1/σ*_r, σ*_i is the i-th singular value of M, and M_r = U*Σ*(V*)^T.

Let P_r(A) be the best rank-r approximation of A. Then, Lemma 3.1 and Weyl's inequality imply that:

(3.4)    ‖M − P_r(R_Ω(M))‖ ≤ ‖M − R_Ω(M)‖ + ‖R_Ω(M) − P_r(R_Ω(M))‖ ≤ ‖M − M_r‖ + 2δ‖M‖_F.

Now, we can have two cases: 1) ‖M − M_r‖_F ≥ ‖M_r‖_F/(576κr^{1.5}): In this case, setting δ = ε/(κr^{1.5}) in (3.4) already implies the required error bounds of Theorem 3.1². 2) ‖M − M_r‖_F ≤ ‖M_r‖_F/(576κr^{1.5}): In this regime, we now show that alternating minimization reduces the error from the initial δ‖M‖_F to ε‖M − M_r‖_F.

²There is a small technicality here: alternating minimization can potentially worsen this bound. But the error after each step of alternating minimization can be effectively checked using a small cross-validation set, and we can stop if the error increases.

Lemma 3.2. (WAltMin Descent) Let the hypotheses of Theorem 3.1 hold. Also, let ‖M − M_r‖_F ≤ ‖M_r‖_F/(576κr√r). Let Û^(t) be the t-th step iterate of Sub-routine 2 (called from Algorithm 1), and let V̂^(t+1) be the (t + 1)-th iterate (for V). Also, let ‖(U^(t))^i‖ ≤ 8√r √(‖M^i‖²/‖M‖²_F + |M_ij|/‖M‖_F) and dist(Û^(t), U*) ≤ 1/2, where U^(t) is a set of orthonormal vectors spanning Û^(t). Then, the following holds (w.p. ≥ 1 − γ/T):

    dist(V̂^(t+1), V*) ≤ (1/2) dist(Û^(t), U*) + ε‖M − M_r‖_F/σ*_r,

and ‖(V^(t+1))^j‖ ≤ 8√r √(‖M_j‖²/‖M‖²_F + |M_ij|/‖M‖_F), where V^(t+1) is a set of orthonormal vectors spanning V̂^(t+1).

The above lemma shows that the distance between V̂^(t+1) and V* (and similarly, Û^(t+1) and U*) decreases geometrically up to ε‖M − M_r‖_F/σ*_r. Hence, after log(‖M‖_F/ζ) steps, the first error term in the bounds above vanishes and the error bound given in Theorem 3.1 is obtained.

Note that the sampling distribution used for our result is a "hybrid" distribution combining leverage scores and L₁-style sampling. However, if M is indeed a rank-r matrix, then our analysis can be extended to handle leverage score based sampling itself (q_ij = m · (‖M^i‖² + ‖M_j‖²)/(2n‖M‖²_F)). Hence our results also show that weighted alternating minimization can be used to solve the coherent matrix completion problem introduced in [5].

3.3 Direct Low-rank Approximation of a Matrix Product. In this section we present a new algorithm for the following problem: suppose we are given two matrices, and desire a low-rank approximation of their product AB; in particular, we are not interested in the actual full matrix product itself (as this may be unwieldy to store and use, and thus wasteful to produce in its entirety). One example setting where this arises is when one wants to calculate the joint counts between two very large sets of entities; for example, web companies routinely come across settings where they need to understand (for example) how many users both searched for a particular query and clicked on a particular advertisement. The number of possible queries and ads is huge, and finding this co-occurrence matrix from user logs involves multiplying two matrices – query-by-user and user-by-ad respectively – each of which is itself large.

We give a method that directly produces a low-rank approximation of the final product, and involves storage and manipulation of only the efficient factored form (i.e. one tall and one fat matrix) of the final intended low-rank matrix. Note that as opposed to the previous section, the matrix does not already exist and hence we do not have access to its row and column norms; so we need a new sampling scheme (and a different proof of correctness).

Algorithm: Suppose we are given an n1 × d matrix A and another d × n2 matrix B, and we wish to calculate a rank-r approximation of the product A·B. Our algorithm proceeds in two stages:

1. Choose a biased random set Ω ⊂ [n1] × [n2] of elements as follows: choose an intended number m of sampled elements (according to Theorem 3.2 below), and then independently include each (i, j) ∈ [n1] × [n2] in Ω with probability given by q̂_ij = min{1, q_ij}, where

(3.5)    q_ij := m · ( ‖A^i‖²/(n2‖A‖²_F) + ‖B_j‖²/(n1‖B‖²_F) ).

Then, find P_Ω(A·B), i.e. only the elements of the product AB that are in this set Ω.

2. Run the alternating minimization procedure WAltMin(P_Ω(A·B), Ω, r, q, T), where T is the number of iterations (again chosen according to Theorem 3.2 below). This produces the low-rank approximation in factored form.

Remarks: Note that the sampling distribution now depends only on the row norms ‖A^i‖² of A and the column norms ‖B_j‖² of B; each of these can be found completely in parallel, with one pass over each row/column of the matrices A / B. A second pass, again parallelizable, calculates the element (A·B)_ij of the product for each (i, j) ∈ Ω. Once this is done, we are again in the setting of doing weighted alternating minimization over a small set of samples – the setting we had before, and as already mentioned this too is highly parallelizable and very fast overall. In particular, the computation complexity of the algorithm is O(|Ω|·(d + r²)) = O(m(d + r²)) = O(nκ²r³/ε² · (d + r²)) (suppressing terms dependent on the norms of A and B), where n = max{n1, n2}. A minimal sketch of this sampling stage is given below; we then present our theorem on the number of samples and iterations needed to make this procedure work with at least a constant probability.
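The following NumPy sketch (ours, not the authors' code) illustrates stage 1: drawing Ω from the distribution (3.5) and computing only those entries of A·B, each via a length-d inner product. For simplicity it materializes the full probability matrix, which a practical implementation would avoid.

```python
import numpy as np

def sample_product_entries(A, B, m, rng=np.random.default_rng(0)):
    """Sample entries of A @ B according to (3.5) without forming the full product."""
    n1, d = A.shape
    d2, n2 = B.shape
    assert d == d2
    rowA2 = np.sum(A ** 2, axis=1)   # ||A^i||^2
    colB2 = np.sum(B ** 2, axis=0)   # ||B_j||^2

    # q_ij = m * ( ||A^i||^2 / (n2 ||A||_F^2) + ||B_j||^2 / (n1 ||B||_F^2) )
    q = m * (rowA2[:, None] / (n2 * rowA2.sum()) + colB2[None, :] / (n1 * colB2.sum()))
    q_hat = np.minimum(q, 1.0)

    mask = rng.random((n1, n2)) < q_hat   # independent Bernoulli draws
    rows, cols = np.nonzero(mask)
    # Compute only the sampled entries of the product: (A @ B)_ij = <A^i, B_j>.
    vals = np.einsum('ij,ij->i', A[rows, :], B[:, cols].T)
    return rows, cols, vals, q_hat
```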

Theorem 3.2. Consider matrices A ∈ R^{n1×d} and B ∈ R^{d×n2}, and let

    m = C · ( (‖A‖²_F + ‖B‖²_F)² / ‖AB‖²_F ) · (nκ²r³/ε²) log(n) log²((‖A‖_F + ‖B‖_F)/ζ),

where κ = σ*_1/σ*_r, σ*_i is the i-th singular value of A·B, and T = log((‖A‖_F + ‖B‖_F)/ζ). Let Ω be sampled using the probability distribution (3.5). Then, the output ÂB_r = WAltMin(P_Ω(A·B), Ω, r, q, T) of Sub-routine 2 satisfies (w.p. ≥ 1 − γ):

    ‖A·B − ÂB_r‖ ≤ ‖A·B − (A·B)_r‖ + ε‖A·B − (A·B)_r‖_F + ζ.

Next, we show an application of our matrix-multiplication approach to approximating covariance matrices M = Y Y^T, where Y is an n × d sample matrix. Note that a rank-r approximation to M can be computed by computing a low rank approximation Ŷ_r of Y, i.e., M̃_r = Ŷ_r Ŷ_r^T. However, as we show below, such an approach leads to weaker bounds compared to computing a low rank approximation of Y Y^T directly:

Corollary 3.1. Let M = Y Y^T ∈ R^{n×n} and let Ω be sampled using the probability distribution (3.5) with m ≥ C (nκ²r³/ε²) log(n) log²(‖Y‖/ζ). Then the output M̂_r of WAltMin(P_Ω(M), Ω, r, q, log(‖Y‖/ζ)) satisfies (w.p. ≥ 1 − γ):

    ‖M̂_r − (Y Y^T)_r‖ ≤ ε‖Y − Y_r‖²_F + ζ.


Further, when ‖Y − Y_r‖_F ≤ ‖Y_r‖_F we get:

    ‖M̂_r − (Y Y^T)_r‖ ≤ ε‖Y Y^T − (Y Y^T)_r‖_F + ζ.

Now, one can compute M̃_r = Ŷ_r Ŷ_r^T in time O(n²r + nκ²r⁵/ε²), with error ‖(Y Y^T)_r − M̃_r‖/‖Y‖² ≤ ε‖Y − Y_r‖_F/‖Y‖. Whereas computing a low rank approximation of Y Y^T directly gives (from Corollary 3.1) ‖(Y Y^T)_r − M̂_r‖/‖Y‖² ≤ ε‖Y Y^T − Y_r Y_r^T‖_F/‖Y‖², which can be much smaller than ε‖Y − Y_r‖_F/‖Y‖. The difference in error is a consequence of the larger gap between the singular values of Y Y^T compared to those of Y. For more related applications see Section 4.5 of [15].
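As a usage illustration (ours, building on the hypothetical sample_product_entries and waltmin sketches given earlier), the covariance case simply calls the product-sampling routine with B = Yᵀ and feeds the result to the weighted alternating minimization step:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((500, 50))   # toy n x d sample matrix
m = 20000                             # target number of sampled entries

# Rank-r approximation of M = Y @ Y.T without forming M explicitly:
rows, cols, vals, q_hat = sample_product_entries(Y, Y.T, m)
# The sampled entries (rows, cols, vals) and q_hat would then be passed to
# waltmin(...) from Section 3 to produce the factored approximation.
```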

4 Distributed Principal Component Analysis

Modern large-scale systems routinely have to compute the PCA of data matrices with millions of data points embedded in a similarly large number of dimensions. Even storing such matrices on a single machine is not possible, and hence most industrial-scale systems use a distributed computing environment to handle such problems. However, the performance of such systems depends not only on computation and storage complexity, but also on the required amount of communication between different servers.

In particular, we consider the following distributed PCA setting: Let M ∈ R^{n×d} be a given matrix (assume n ≥ d but n ≈ d). Also, let M be row-partitioned among s servers and let M^{r_k} ∈ R^{n×d} be the matrix with the rows {r_k} ⊆ [n] of M, stored on the k-th server. Moreover, we assume that one of the servers acts as a Central Processor (CP), and in each round all servers communicate with the CP and the CP communicates back with all the servers. Now, the goal is to compute M̂_r, an estimate of M_r, such that the total communication (i.e. the number of bits transferred) between the CP and the other servers is minimized. Note that such a model is now standard for this problem and was most recently studied by [20].

Recently, several interesting results [11, 13, 22, 20] have given algorithms to compute a rank-r approximation M_r of M in the above mentioned distributed setting. In particular, [20] proposed a method that, for the row-partitioned model, requires O(dsr/ε + sr²/ε⁴) communication to obtain a relative Frobenius norm guarantee,

    ‖M − M̃_r‖_F ≤ (1 + ε)‖M − M_r‖_F.

In contrast, a distributed-setting extension of our LELA algorithm (Algorithm 1) has linear communication complexity O(ds + nκ²r⁵/ε²) and computes a rank-r approximation M̂_r with ‖M − M̂_r‖ ≤ ‖M − M_r‖ + ε‖M − M_r‖_F. Now note that if n ≈ d and if s scales with n (which is a typical requirement), then our communication complexity can be significantly better than that of [20]. Moreover, our method provides spectral norm bounds, as compared to the relatively weaker Frobenius norm bounds mentioned above.

Algorithm: The distributed version of our LELA algorithm depends crucially on the following observation: given V, each row of U can be updated independently. Hence, servers need to communicate only rows of V. There also, we can use the fact that each server requires only O((nr/s) log n) rows of V to update its corresponding U^{r_k}; U^{r_k} denotes the restriction of U to the rows in the set {r_k} (and 0 outside), and similarly V̂^{c_k}, Ŷ^{c_k} denote the restrictions of V, Ŷ to the rows {c_k}.

We now describe the distributed version of each of the critical steps of the LELA algorithm; see Algorithm 3 for detailed pseudo-code. For simplicity, we drop the use of a different set of samples in each iteration; correspondingly, the algorithm would be modified by dividing the samples Ω_k into 2T + 1 buckets and using one bucket in each iteration. This simplification doesn't change the communication complexity. A schematic sketch of the resulting row-wise update pattern is given below.
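As an illustration of why only rows of V need to move, here is a schematic Python sketch (ours; server communication is only simulated and the names are hypothetical) of one local U-update in the row-partition model.

```python
import numpy as np

def local_U_update(M_rows, mask_rows, w_rows, V):
    """One server updates its own rows of U, given only the rows of V it needs.

    M_rows, mask_rows, w_rows: the server's slice of sampled data, sample mask,
    and weights (shape: local_rows x d). V: the r-column factor broadcast by the CP.
    """
    r = V.shape[1]
    U_local = np.zeros((M_rows.shape[0], r))
    for i in range(M_rows.shape[0]):
        Wi = w_rows[i] * mask_rows[i]                   # weights on this row's samples
        A = (V * Wi[:, None]).T @ V + 1e-9 * np.eye(r)
        b = (V * Wi[:, None]).T @ M_rows[i]
        U_local[i] = np.linalg.solve(A, b)              # weighted least squares for row i
    return U_local                                      # stays on the server; never sent

# In each round the CP broadcasts (the needed rows of) V, every server calls
# local_U_update on its slice, and only small aggregate statistics for the next
# V-update are sent back -- not the rows of U themselves.
```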

Sampling: For sampling, we first compute the column norms ‖M_j‖, for all 1 ≤ j ≤ d, and communicate them to each server. This operation requires O(ds) communication. Next, each server (server k) samples elements from its rows {r_k} and stores R_{Ω_k}(M^{r_k}) locally. Note that, because of independence over rows, the servers do not need to transmit their samples to other servers.

Initialization: In the initialization step, our algorithm computes the top r right singular vectors of R_Ω(M) by the iterations Ŷ^(t+1) = R_Ω(M)^T R_Ω(M) Y^t = Σ_k R_{Ω_k}(M)^T R_{Ω_k}(M) Y^t. Now, note that computing R_{Ω_k}(M) Y^t requires server k to access at most |Ω_k| columns of Y^t. Hence, the total communication from the CP to all the servers in this round is O(|Ω|r). Similarly, each column of R_{Ω_k}(M)^T R_{Ω_k}(M) Y^t is only |Ω_k|-sparse. Hence, the total communication from all the servers to the CP in this round is O(|Ω|r). Now, we need only a constant number of rounds to get a constant factor approximation to the SVD of R_Ω(M), which is enough for a good initialization in the WAltMin procedure. Hence, the total communication complexity of the initialization step is O(|Ω|r).

Alternating Minimization Step: For alternating minimization, the update to the rows of U is computed at the corresponding servers and the update to V is computed at the CP. For updating Û^(t+1)_{r_k} at server k, we use the following observation: updating Û^(t+1)_{r_k} requires at most |Ω_k| rows of V̂^(t). Hence, the total communication from the CP to all the servers in the t-th iteration is O(|Ω|r). Next, we make the critical observation that the update V̂^(t+1) can be computed by adding certain messages from each server (see Algorithm 3 for more details). The message from server k to the CP is of size O(|Ω_k|r²). Hence, the total communication in each round is O(|Ω|r²), and the total number of rounds is O(log(‖M‖_F/ζ)).

We now combine the above observations to provide the error bounds and communication complexity of our distributed PCA algorithm:

Theorem 4.1. Let the n × d matrix M be distributed over s servers according to the row-partition model. Let m ≥ C (nκ²r³/ε²) log(n) log²(‖M_r‖/ζ). Then, on completion, Algorithm 3 leaves matrices Û^(t+1)_{r_k} at server k and V̂^(t+1) at the CP such that the following holds (w.p. ≥ 1 − γ): ‖M − Û^(t+1)(V̂^(t+1))^T‖ ≤ ‖M − M_r‖ + ε‖M − M_r‖_F + ζ, where Û^(t+1) = Σ_k Û^(t+1)_{r_k}. The algorithm has a communication complexity of O(ds + |Ω|r²) = O(ds + nκ²r⁵/ε² log²(‖M_r‖/ζ)) real numbers.

As discussed above, each update to V̂^(t) and Û^(t) is computed exactly as given in the WAltMin procedure (Sub-routine 2). Hence, the error bounds for the algorithm follow directly from Theorem 3.1. The communication complexity bound follows by observing that |Ω| ≤ 2m w.h.p.

Remark: The sampling step given above suggests another simple algorithm, where we compute P_Ω(M) in a distributed fashion and communicate the samples to the CP; all computation is then performed at the CP. The total communication complexity would be O(ds + |Ω|) = O(ds + nκ²r³/ε² log(‖M_r‖/ζ)), which is less than the communication complexity of Algorithm 3. However, such an algorithm is not desirable in practice, because it relies completely on one single server to perform all the computation; hence it is slower and more fault-prone. In contrast, our algorithm can be implemented in a peer-to-peer setting as well and is more fault-tolerant.

Also, the communication complexity bound of Theorem 4.1 only bounds the total number of real numbers transferred. However, if each number requires many bits to communicate, then the actual communication can still be very large. Below, we bound each of the real numbers that we transfer, hence providing a bound on the number of bits transferred.

Bit complexity: First we bound w_ij M_ij. Note that we need to bound this only for (i, j) ∈ Ω. Now, |w_ij M_ij| ≤ ‖R_Ω(M)‖_∞ ≤ ‖R_Ω(M)‖ ≤ ‖M‖ + ε‖M‖_F ≤ 2·nd·M_max, where the third inequality follows from Lemma 3.1. Hence, if the entries of the matrix M are initially represented using b bits, then the algorithm needs to use O(b + log(nd)) bits. By the same argument we get a bound of O(b + log(nd)) bits for computing ‖M^i‖², ∀i; ‖M_j‖², ∀j; ‖M‖²_F and ‖M‖_{1,1}.

Further, at any stage of the WAltMin iterations, ‖Û^(t)(V̂^(t+1))^T‖_∞ ≤ ‖Û^(t)(V̂^(t+1))^T‖ ≤ 2‖M‖_F. So this stage also needs O(b + log(n)) bits per number. Hence, overall, the bit complexity of each of the real numbers communicated by Algorithm 3 is O(b + log(nd)), if b bits are needed to represent the matrix entries. That is, the overall communication complexity of the algorithm is O((b + log(nd)) · (ds + nκ²r³/ε² log(‖M_r‖/ζ))).

References

[1] D. Achlioptas and F. McSherry. Fast computation of low rank matrix approximations. In Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages 611–618. ACM, 2001.

[2] C. Boutsidis and A. Gittens. Improved matrix algorithms via the subsampled randomized Hadamard transform. SIAM Journal on Matrix Analysis and Applications, 34(3):1301–1340, 2013.

[3] E. J. Candes and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

[4] E. J. Candes and T. Tao. The power of convex relaxation: Near-optimal matrix completion. Information Theory, IEEE Transactions on, 56(5):2053–2080, 2010.

[5] Y. Chen, S. Bhojanapalli, S. Sanghavi, and R. Ward. Coherent matrix completion. In Proceedings of The 31st International Conference on Machine Learning, pages 674–682, 2014.

[6] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th annual ACM symposium on Symposium on theory of computing, pages 81–90. ACM, 2013.

[7] A. Deshpande and S. Vempala. Adaptive sampling and fast low-rank matrix approximation. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 292–303. Springer, 2006.

[8] P. Drineas and R. Kannan. Pass efficient algorithms for approximating large matrices. In SODA, volume 3, pages 223–232, 2003.

[9] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36(1):158–183, 2006.

[10] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Subspace sampling and relative-error matrix approximation: Column-based methods. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 316–326. Springer, 2006.

[11] D. Feldman, M. Schmidt, and C. Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1434–1453. SIAM, 2013.

[12] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. In Foundations of Computer Science, 1998. Proceedings. 39th Annual Symposium on, pages 370–378, Nov 1998.

[13] M. Ghashami and J. M. Phillips. Relative errors for deterministic low-rank matrix approximations. In SODA, pages 707–717. SIAM, 2014.

[14] D. Gross. Recovering low-rank matrices from few coefficients in any basis. Information Theory, IEEE Transactions on, 57(3):1548–1566, 2011.

[15] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

[16] S. Har-Peled. Low rank matrix approximation in linear time. Manuscript. http://valis.cs.uiuc.edu/sariel/papers/05/lrank/lrank.pdf, 2006.

[17] M. Hardt. Understanding alternating minimization for matrix completion. arXiv preprint arXiv:1312.0925, 2013.

[18] M. Hardt and M. Wootters. Fast matrix completion without the condition number. In Proceedings of The 27th Conference on Learning Theory, pages 638–678, 2014.

[19] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the 45th annual ACM symposium on Symposium on theory of computing, pages 665–674. ACM, 2013.

[20] R. Kannan, S. S. Vempala, and D. P. Woodruff. Principal component analysis and higher correlations for distributed data. In Proceedings of The 27th Conference on Learning Theory, pages 1040–1057, 2014.

[21] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. Information Theory, IEEE Transactions on, 56(6):2980–2998, 2010.

[22] Y. Liang, M.-F. Balcan, and V. Kanchanapally. Distributed PCA and k-means clustering. In The Big Learning Workshop at NIPS, 2013.

[23] E. Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 581–588. ACM, 2013.

[24] L. Mackey, M. I. Jordan, R. Y. Chen, B. Farrell, and J. A. Tropp. Matrix concentration inequalities via the method of exchangeable pairs. arXiv preprint arXiv:1201.6002, 2012.

[25] M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.

[26] N. H. Nguyen, T. T. Do, and T. D. Tran. A fast and efficient algorithm for low-rank approximation of a matrix. In Proceedings of the 41st annual ACM symposium on Theory of computing, pages 215–224. ACM, 2009.

[27] B. Recht. A simpler approach to matrix completion. arXiv preprint arXiv:0910.0651, 2009.

[28] T. Sarlos. Improved approximation algorithms for large matrices via random projections. In Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on, pages 143–152. IEEE, 2006.

[29] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

[30] F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert. A fast randomized algorithm for the approximation of matrices. Applied and Computational Harmonic Analysis, 25(3):335–366, 2008.

A Concentration Inequalities

In this section we review a couple of concentration inequalities that we use in the proofs.

Lemma A.1. (Bernstein's Inequality) Let X_1, ..., X_n be independent scalar random variables. Let |X_i| ≤ L, ∀i, w.p. 1. Then,

(A.1)    P[ | Σ_{i=1}^n X_i − Σ_{i=1}^n E[X_i] | ≥ t ] ≤ 2 exp( (−t²/2) / ( Σ_{i=1}^n Var(X_i) + Lt/3 ) ).

Lemma A.2. (Matrix Bernstein's Inequality [29]) Let X_1, ..., X_p be independent random matrices in R^{n×n}. Assume each matrix has bounded deviation from its mean:

    ‖X_i − E[X_i]‖ ≤ L, ∀i, w.p. 1.

Also let the variance be

    σ² = max{ ‖ E[ Σ_{i=1}^p (X_i − E[X_i])(X_i − E[X_i])^T ] ‖ , ‖ E[ Σ_{i=1}^p (X_i − E[X_i])^T(X_i − E[X_i]) ] ‖ }.

Then,

(A.2)    P[ ‖ Σ_{i=1}^p (X_i − E[X_i]) ‖ ≥ t ] ≤ 2n exp( (−t²/2) / ( σ² + Lt/3 ) ).

Recall that the Schatten-p norm of a matrix X is

    ‖X‖_p = ( Σ_{i=1}^n σ_i(X)^p )^{1/p},

where σ_i(X) is the i-th singular value of X. In particular, for p = 2 the Schatten-2 norm is the Frobenius norm of the matrix.

Lemma A.3. (Matrix Chebyshev Inequality [24]) Let X be a random matrix. For all t > 0,

(A.3)    P[ ‖X‖ ≥ t ] ≤ inf_{p≥1} t^{−p} E[ ‖X‖_p^p ].

B Proofs of section 3

In this section we present the proof of Theorem 3.1. For simplicity we only discuss proofs for the case when the matrix is square; the rectangular case is a simple extension. We provide the proofs of the supporting lemmas first.

We now recall some of the notation. For the square case,

    q_ij = m · ( 0.5·(‖M^i‖² + ‖M_j‖²)/(2n‖M‖²_F) + 0.5·|M_ij|/‖M‖_{1,1} ).

Let q̂_ij = min{q_ij, 1}; this makes sure the probabilities are all at most 1. Recall the definition of the weights: w_ij = 1/q_ij when q_ij > 0 and 0 otherwise. Note that Σ_ij q_ij ≤ m. Also let m ≥ βnr log(n).

Let {δ_ij} be indicator random variables with δ_ij = 1 with probability q̂_ij. Define Ω to be the sampling operator with Ω_ij = δ_ij. Define the weighted sampling operator R_Ω such that R_Ω(M)_ij = δ_ij w_ij M_ij. Throughout the proofs we drop the subscript of Ω that denotes the different sampling sets used at each iteration of WAltMin.

First we abstract out the properties of the sampling distribution (3.2) that we use in the rest of the proof.

Lemma B.1. For Ω generated according to (3.2) and under the assumptions of Lemma 3.1, the following holds for all (i, j) such that q_ij ≤ 1:

(B.4)    M_ij / q_ij ≤ (2n/m) ‖M‖_F,

(B.5)    Σ_{j : q̂_ij = q_ij} M_ij² / q_ij ≤ (4n/m) ‖M‖²_F,

(B.6)    ‖(U*)^i‖² / q_ij ≤ 8nrκ²/m,    and

(B.7)    ‖(U*)^i‖ ‖(V*)^j‖ / q_ij ≤ 8nrκ²/m.

The proof of Lemma B.1 is straightforward from the definition of q_ij.

B.1 Initialization: We now provide the proof of the initialization lemma, Lemma 3.1.

Proof of Lemma 3.1:

Proof. The proof of this lemma has two parts.
1) We show that ‖R_Ω(M) − M‖ ≤ δ‖M‖_F.
2) We show that the trimming step of Sub-routine 2 gives the required row norm bounds on Û^(0):

    ‖(Û^(0))^i‖ ≤ 8√r √(‖M^i‖²/‖M‖²_F)   and   dist(Û^(0), U*) ≤ 1/2.

Proof of the first step: We prove the first part using the matrix Bernstein inequality. Note that the L₁ term in the sampling distribution helps in getting good bounds on the absolute magnitude of the random variables X_ij in this proof.

Let X_ij = (δ_ij − q_ij) w_ij M_ij e_i e_j^T. Note that {X_ij}_{i,j=1}^n are independent zero-mean random matrices. Also R_Ω(M) − E[R_Ω(M)] = Σ_ij X_ij.

First we bound ‖X_ij‖. When q_ij ≥ 1, we have q̂_ij = 1 and δ_ij = 1, so X_ij = 0 with probability 1. Hence we only need to consider the case q̂_ij = q_ij ≤ 1; we assume this in all the proofs without explicitly mentioning it any more.

(B.8)    ‖X_ij‖ = max{ |(1 − q_ij) w_ij M_ij| , |q_ij w_ij M_ij| }.

Recall w_ij = 1/q_ij. Hence

    |(1 − q_ij) w_ij M_ij| ≤ |M_ij / q_ij| ≤ (2n/m)‖M‖_F,

where the last step follows from (B.4). Similarly,

    |q_ij w_ij M_ij| = |M_ij| ≤ |M_ij / q_ij| ≤ (2n/m)‖M‖_F,

using q_ij ≤ 1. Hence ‖X_ij‖ is bounded by L = (2n/m)‖M‖_F. Now we bound the variance:

    ‖ E[ Σ_ij X_ij X_ij^T ] ‖ = ‖ E[ Σ_ij (δ_ij − q_ij)² w_ij² M_ij² e_i e_i^T ] ‖
    = ‖ Σ_ij q_ij(1 − q_ij) w_ij² M_ij² e_i e_i^T ‖ = max_i | Σ_j q_ij(1 − q_ij) w_ij² M_ij² |.

Now,

    Σ_j q_ij(1 − q_ij) w_ij² M_ij² ≤ Σ_j M_ij² / q_ij ≤ (4n/m)‖M‖²_F,

where the last inequality follows from (B.5). Hence

    ‖ E[ Σ_ij X_ij X_ij^T ] ‖ ≤ (4n/m)‖M‖²_F.

We can prove the same bound for ‖E[ Σ_ij X_ij^T X_ij ]‖. Hence σ² = (4n/m)‖M‖²_F. Now using the matrix Bernstein inequality with t = δ‖M‖_F gives, with probability ≥ 1 − 2/n²,

    ‖R_Ω(M) − E[R_Ω(M)]‖ = ‖R_Ω(M) − M‖ ≤ δ‖M‖_F.

Hence we get ‖M − P_r(R_Ω(M))‖ ≤ ‖M − R_Ω(M)‖ + ‖R_Ω(M) − P_r(R_Ω(M))‖ ≤ ‖M − M_r‖ + 2δ‖M‖_F, which implies ‖M_r − P_r(R_Ω(M))‖ ≤ 2‖M − M_r‖ + 2δ‖M‖_F. Let the SVD of P_r(R_Ω(M)) be U^(0)Σ^(0)(V^(0))^T. Hence,

    ‖P_r(R_Ω(M)) − M_r‖² = ‖U^(0)Σ^(0)(V^(0))^T − U*Σ*(V*)^T‖²
    ≥ ‖(I − U^(0)(U^(0))^T) U*Σ*(V*)^T‖² ≥ (σ*_r)² ‖(U^(0)_⊥)^T U*‖².

This implies dist(U^(0), U*) ≤ (2‖M − M_r‖ + 2δ‖M‖_F)/σ*_r ≤ 1/(144r). This follows from the assumptions ‖M − M_r‖_F ≤ ‖M_r‖_F/(576κr^{1.5}) and δ ≤ 1/(576κr^{1.5}), where κ = σ*_1/σ*_r is the condition number of M_r.

Proof of the trimming step: From the previous step we know that ‖R_Ω(M) − M‖ ≤ δ‖M‖_F and consequently dist(U^(0), U*) ≤ δ_2. Let

    l_i = √(4‖M^i‖²/‖M‖²_F)

be the estimates of the left leverage scores of the matrix M. Since ‖M − M_r‖_F ≤ ‖M_r‖_F, we have l_i² ≥ Σ_{k=1}^r (σ*_k)²(U*_ik)² / Σ_{k=1}^r (σ*_k)².

Set the elements bigger than 2l_i in the i-th row of U^(0) to 0 and let Ũ be the new initialization matrix obtained. Also, since dist(U^(0), U*) ≤ δ_2, for every j = 1, ..., r there exists a vector u_j in R^n such that ⟨U^(0)_j, u_j⟩ ≥ √(1 − δ_2²), ‖u_j‖ = 1 and |(u_j)_i|² ≤ Σ_{k=1}^r (σ*_k)²(U*_ik)² / Σ_{k=1}^r (σ*_k)² for all i. Now Ũ_j is the j-th column of Ũ, obtained by setting entries of the j-th column of U^(0) to 0 whenever the i-th entry of U^(0)_j is bigger than 2l_i. For such i,

(B.9)    |(Ũ_j)_i − (u_j)_i| ≤ √( Σ_{k=1}^r (σ*_k)²(U*_ik)² / Σ_{k=1}^r (σ*_k)² ) ≤ |(U^(0)_j)_i − (u_j)_i|,

since |(U^(0)_j)_i − (u_j)_i| ≥ 2l_i − l_i = l_i ≥ √( Σ_{k=1}^r (σ*_k)²(U*_ik)² / Σ_{k=1}^r (σ*_k)² ). For the rest of the coordinates, (Ũ_j)_i = (U^(0)_j)_i. Hence,

    ‖Ũ_j − u_j‖ ≤ ‖U^(0)_j − u_j‖ = √(2 − 2⟨U^(0)_j, u_j⟩) ≤ √2 δ_2.

Hence ‖Ũ_j‖ ≥ 1 − √2 δ_2, and ‖U^(0)_j − Ũ_j‖ ≤ √(1 − ‖Ũ_j‖²) ≤ 2√δ_2, for δ_2 ≤ 1/√2. Also ‖U^(0) − Ũ‖_F ≤ 2√(rδ_2). This gives a bound on the smallest singular value of Ũ:

    σ_min(Ũ) ≥ σ_min(U^(0)) − σ_max(Ũ − U^(0)) ≥ 1 − 2√(rδ_2).

Now let the reduced QR decomposition of Ũ be Ũ = Û^(0)Λ^{-1}, where Û^(0) is the matrix with orthonormal columns. From the bounds above we get

    ‖Λ‖² = 1/σ_min(Û^(0)Λ^{-1})² = 1/σ_min(Ũ)² ≤ 4,

when δ_2 ≤ 1/(16r).

First we show that this trimming step does not increase the distance to U* by much. To bound dist(Û^(0), U*), consider ‖(u*_⊥)^T Û^(0)‖, where u*_⊥ is some unit vector perpendicular to the column span of U*:

    ‖(u*_⊥)^T Û^(0)‖ = ‖(u*_⊥)^T Ũ Λ‖ ≤ ( ‖(u*_⊥)^T U^(0)‖ + ‖(u*_⊥)^T (Ũ − U^(0))‖ )‖Λ‖
    ≤ (δ_2 + 2√(rδ_2))·2 ≤ 6√(rδ_2) ≤ 1/2,

for δ_2 ≤ 1/(144r). Second, we bound ‖(Û^(0))^i‖:

    ‖(Û^(0))^i‖ = ‖e_i^T Û^(0)‖ = ‖e_i^T Ũ Λ‖ ≤ ‖e_i^T Ũ‖ ‖Λ‖ ≤ 2l_i √r · 2 ≤ 8√r √(‖M^i‖²/‖M‖²_F).

Hence we finish the proof of the second part of the lemma.
B.2 Weighted AltMin analysis: We first provideproof of Lemma 3.2 for rank-1 case to explain the mainideas and in next section we will discuss rank-r case.Hence M

1

= �⇤u⇤(v⇤)T . Before the proof of the lemmawe will prove couple of supporting lemmas.

Let ut and vt+1 be the normalized vectors of theiterates but and bvt+1 of WAltMin. We assume thatsamples for each iteration are generated independently.For simplicity we will drop the subscripts on ⌦ thatdenote di↵erent set of samples in each iteration in therest of the proof.

Page 12: Tighter Low-rank Approximation via Sampling the Leveraged ... · providing tighter spectral norm bounds. Notation: Capital letter M typically denotes a matrix. Mi denotes the i-th

The weighted alternating minimization updates atthe t+ 1 iteration are,(B.10)

kbutkbvt+1

j

= �⇤v⇤j

Pi

�ij

wij

ut

i

u⇤iP

i

�ij

wij

(ut

i

)2+

Pi

�ij

wij

ut

i

(M �M1

)ijP

i

�ij

wij

(ut

i

)2.

Writing in terms of power method updates we get,(B.11)kbutkbvt+1 = �⇤hu⇤, utiv⇤��⇤B�1(hut, u⇤iB�C)v⇤+B�1y,

where B and C are diagonal matrices with Bjj

=Pi

�ij

wij

(ut

i

)2 and Cjj

=P

i

�ij

wij

ut

i

u⇤i

and y isthe vector R

(M � M1

)Tut with entries yj

=Pi

�ij

wij

ut

i

(M �M1

)ij

.Now we will bound the error caused by the M�M

1

component in each iteration.

Lemma B.2. For ⌦ generated according to (3.2) andunder the assumptions of Lemma 3.2 the followingholds:(B.12)��(ut)TR

(M �M1

)� (ut)T (M �M1

)�� �kM�M

1

kF

,

with probability greater that 1 � �

T log(n)

, for m �

�n log(n), � � 4c

21T

��

2 . Hence,��(ut)TR

(M �M1

)��

dist(ut, u⇤)kM �M1

k+ � kM �M1

kF

, for constant �.

Proof. Let the random matrices be $X_{ij} = (\delta_{ij} - q_{ij})\,w_{ij}\,(M - M_1)_{ij}\,(u^t)_i\, e_j^T$. Then $\sum_{ij} X_{ij} = (u^t)^T R_\Omega(M - M_1) - (u^t)^T(M - M_1)$, and $\mathbb{E}[X_{ij}] = 0$. We will use the matrix Chebyshev inequality for $p = 2$, so we first bound $\mathbb{E}\big[\|\sum_{ij} X_{ij}\|_2^2\big]$:

$\mathbb{E}\Big[\Big\|\sum_{ij} X_{ij}\Big\|_2^2\Big] = \mathbb{E}\Big[\sum_j\Big(\sum_i(\delta_{ij} - q_{ij})w_{ij}(M - M_1)_{ij}(u^t)_i\Big)^2\Big] \stackrel{\zeta_1}{=} \sum_j\sum_i q_{ij}(1 - q_{ij})(w_{ij})^2(M - M_1)_{ij}^2(u^t_i)^2 \le \sum_{ij} w_{ij}(u^t_i)^2(M - M_1)_{ij}^2 \stackrel{\zeta_2}{\le} \frac{4 n c_1^2}{m}\,\|M - M_1\|_F^2.$

$\zeta_1$ follows from the fact that the $X_{ij}$ are zero-mean independent random variables; $\zeta_2$ follows from (B.15). Applying the matrix Chebyshev inequality for $p = 2$ with $t = \delta\|M - M_1\|_F$ gives the result.
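For intuition, a small simulation (a sketch only, with arbitrary toy dimensions and a uniform $q_{ij}$ standing in for (3.2)) illustrates the unbiasedness used above: the reweighted sampling operator satisfies $\mathbb{E}[R_\Omega(N)] = N$, so $(u^t)^T R_\Omega(N)$ concentrates around $(u^t)^T N$.

import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 8000
N = rng.standard_normal((n, n))              # stands in for M - M_1
u = rng.standard_normal(n); u /= np.linalg.norm(u)
q = np.full((n, n), min(1.0, m / n**2))      # toy uniform sampling probabilities
delta = rng.random((n, n)) < q
R = np.where(delta, N / q, 0.0)              # R_Omega(N): sampled entries reweighted by 1/q_ij
print(np.linalg.norm(u @ R - u @ N), np.linalg.norm(N))  # sampling error vs ||N||_F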

Lemma B.3. For $\Omega$ sampled according to (3.2) and under the assumptions of Lemma 3.2, the following holds:

(B.13) $\Big|\sum_j \delta_{ij} w_{ij}(u^*_j)^2 - \sum_j(u^*_j)^2\Big| \le \delta_1,$

with probability greater than $1 - \frac{2}{n^2}$, for $m \ge \beta n\log(n)$, $\beta \ge \frac{16}{\delta_1^2}$ and $\delta_1 \le \frac{1}{3}$.

Proof. Note that $\big|\sum_j \delta_{ij} w_{ij}(u^*_j)^2 - \sum_j(u^*_j)^2\big| = \big|\sum_j(\delta_{ij} - q_{ij})w_{ij}(u^*_j)^2\big|$. Let the random variable $X_j = (\delta_{ij} - q_{ij})w_{ij}(u^*_j)^2$. Then $\mathbb{E}[X_j] = 0$ and $\mathrm{Var}(X_j) = q_{ij}(1 - q_{ij})\big(w_{ij}(u^*_j)^2\big)^2$. Hence,

$\sum_j\mathrm{Var}(X_j) = \sum_j q_{ij}(1 - q_{ij})\big(w_{ij}(u^*_j)^2\big)^2 = \sum_j\Big(\frac{1}{q_{ij}} - 1\Big)(u^*_j)^4 \le \sum_j\frac{(u^*_j)^2}{q_{ij}}\,(u^*_j)^2 \stackrel{\zeta_1}{\le} \frac{16n}{m}\sum_j(u^*_j)^2 = \frac{16n}{m}.$

$\zeta_1$ follows from (B.6). Also, it is easy to check that $|X_j| \le \frac{16n}{m}$. Applying the Bernstein inequality now gives the result.

Lemma B.4. For $\Omega$ sampled according to (3.2) and under the assumptions of Lemma 3.2, the following holds:

(B.14) $\big\|\big(\langle u^t, u^*\rangle B - C\big)v^*\big\| \le \delta_1\sqrt{1 - \langle u^*, u^t\rangle^2},$

with probability greater than $1 - \frac{2}{n^2}$, for $m \ge \beta n\log(n)$, $\beta \ge \frac{48 c_1^2}{\delta_1^2}$ and $\delta_1 \le \frac{1}{3}$.

Proof. Let $\alpha_i = u^t_i\big(\langle u^*, u^t\rangle u^t_i - u^*_i\big)$. Then the $j$-th coordinate of the error term in equation (B.11) is $\frac{\sum_i\delta_{ij}w_{ij}\alpha_i v^*_j}{\sum_i\delta_{ij}w_{ij}(u^t_i)^2}$. Let $X_{ij} = \delta_{ij}w_{ij}\alpha_i v^*_j\, e_j e_1^T$, for $i, j \in [1, \ldots, n]$; note that the $X_{ij}$ are independent random matrices. Then $\big(\langle u^t, u^*\rangle B - C\big)v^*$ is the first and only column of the matrix $\sum_{i,j=1}^n X_{ij}$. We will bound $\big\|\sum_{i,j=1}^n X_{ij}\big\|$ using the matrix Bernstein inequality.

$\sum_{ij}\mathbb{E}[X_{ij}] = \sum_j\sum_i q_{ij}w_{ij}\alpha_i v^*_j\, e_j e_1^T = \sum_j\sum_i\alpha_i v^*_j\, e_j e_1^T = 0,$


because $\sum_i\alpha_i = 0$. Now we give a bound on $\|X_{ij}\|$:

$\|X_{ij}\| = |\delta_{ij}w_{ij}\alpha_i v^*_j| \le \frac{|v^*_j|\,|u^t_i|}{q_{ij}}\big|\langle u^*, u^t\rangle u^t_i - u^*_i\big| \stackrel{\zeta_1}{\le} \frac{16 n c_1}{m}\sqrt{\sum_i\big(\langle u^*, u^t\rangle u^t_i - u^*_i\big)^2} = \frac{16 n c_1}{m}\sqrt{1 - \langle u^*, u^t\rangle^2}.$

$\zeta_1$ follows from (B.7) and (B.15). Now we bound the variance:

$\Big\|\mathbb{E}\Big[\sum_{ij}\big(X_{ij} - \mathbb{E}[X_{ij}]\big)^T\big(X_{ij} - \mathbb{E}[X_{ij}]\big)\Big]\Big\| = \Big\|\mathbb{E}\Big[\sum_{ij}\big(X_{ij}^T X_{ij} - \mathbb{E}[X_{ij}]^T\mathbb{E}[X_{ij}]\big)\Big]\Big\| = \Big\|\sum_j\sum_i q_{ij}(1 - q_{ij})\big(w_{ij}\alpha_i v^*_j\big)^2 e_1 e_1^T\Big\|.$

Then,

$\sum_i q_{ij}(1 - q_{ij})\big(w_{ij}\alpha_i v^*_j\big)^2 \le \sum_i w_{ij}(u^t_i)^2\big(\langle u^*, u^t\rangle u^t_i - u^*_i\big)^2(v^*_j)^2 \le \frac{4 n c_1^2}{m}(v^*_j)^2\sum_i\big(\langle u^*, u^t\rangle u^t_i - u^*_i\big)^2 \le \frac{4 n c_1^2}{m}(v^*_j)^2\big(1 - \langle u^*, u^t\rangle^2\big).$

Hence,

$\Big\|\mathbb{E}\Big[\sum_{ij}\big(X_{ij} - \mathbb{E}[X_{ij}]\big)^T\big(X_{ij} - \mathbb{E}[X_{ij}]\big)\Big]\Big\| \le \frac{4 n c_1^2}{m}\big(1 - \langle u^*, u^t\rangle^2\big).$

The lemma follows by applying the matrix Bernstein inequality.

Now we provide the proof of Lemma 3.2.

Proof of Lemma 3.2 [Rank-1 case]:

Proof. We prove that the distance between $u^t$, $v^{t+1}$ and $u^*$, $v^*$ decreases with each iteration. Recall from the assumptions of the lemma that we have the following row norm bounds for $u^t$:

(B.15) $|u^t_i| \le c_1\sqrt{\|M^i\|^2/\|M\|_F^2 + |M_{ij}|/\|M\|_F}.$

First we prove that $\mathrm{dist}(u^t, u^*)$ decreases in each iteration, and second we prove that $v^{t+1}$ satisfies this condition.

Bounding $\langle v^{t+1}, v^*\rangle$: Using Lemma B.3, Lemma B.4 and equation (B.11) we get

(B.16) $\|\widehat u^t\|\,\langle\widehat v^{t+1}, v^*\rangle \ge \sigma^*\langle u^t, u^*\rangle - \sigma^*\frac{\delta_1}{1 - \delta_1}\sqrt{1 - \langle u^*, u^t\rangle^2} - \frac{1}{1 - \delta_1}\|y^T v^*\|$

and

(B.17) $\|\widehat u^t\|\,\langle\widehat v^{t+1}, v^*_\perp\rangle \le \sigma^*\frac{\delta_1}{1 - \delta_1}\sqrt{1 - \langle u^*, u^t\rangle^2} + \frac{1}{1 - \delta_1}\|y\|.$

Hence, applying the noise bound of Lemma B.2 we get

$\mathrm{dist}(v^{t+1}, v^*)^2 = 1 - \langle v^{t+1}, v^*\rangle^2 = \frac{\langle\widehat v^{t+1}, v^*_\perp\rangle^2}{\langle\widehat v^{t+1}, v^*_\perp\rangle^2 + \langle\widehat v^{t+1}, v^*\rangle^2} \le \frac{\langle\widehat v^{t+1}, v^*_\perp\rangle^2}{\langle\widehat v^{t+1}, v^*\rangle^2} \stackrel{\zeta_1}{\le} 25\big(\delta_1\,\mathrm{dist}(u^t, u^*) + \mathrm{dist}(u^t, u^*)\|M - M_1\|/\sigma^* + \delta\,\|M - M_1\|_F/\sigma^*\big)^2.$

$\zeta_1$ follows from $\delta_1 \le \frac{1}{2}$, $\langle u^t, u^*\rangle \ge \langle u^0, u^*\rangle$, $\langle u^*, u^0\rangle - 2\delta_1\sqrt{1 - \langle u^*, u^0\rangle^2} \ge \frac{1}{2}$, $\delta \le \frac{1}{20}$ and $\delta_1 \le \frac{1}{20}$. Hence,

$\mathrm{dist}(v^{t+1}, v^*) \le \tfrac{1}{2}\,\mathrm{dist}(u^t, u^*) + 5\delta\,\|M - M_1\|_F/\sigma^*.$

Bounding $|v^{t+1}_j|$: From Lemma B.3 and (B.15) we get that $\big|\sum_i\delta_{ij}w_{ij}(u^t_i)^2 - 1\big| \le \delta_1$ and $\big|\sum_i\delta_{ij}w_{ij}u^*_i u^t_i - \langle u^*, u^t\rangle\big| \le \delta_1$, when $\beta \ge \frac{16 c_1^2}{\delta_1^2}$. Hence,

(B.18) $1 - \delta_1 \le B_{jj} = \sum_i\delta_{ij}w_{ij}(u^t_i)^2 \le 1 + \delta_1,$

and

(B.19) $C_{jj} = \sum_i\delta_{ij}w_{ij}u^t_i u^*_i \le \langle u^t, u^*\rangle + \delta_1.$

Recall that

$\|\widehat u^t\|\,\big|\widehat v^{t+1}_j\big| = \bigg|\frac{\sum_i\delta_{ij}w_{ij}u^t_i M_{ij}}{\sum_i\delta_{ij}w_{ij}(u^t_i)^2}\bigg| \le \frac{1}{1 - \delta_1}\sum_i\delta_{ij}w_{ij}u^t_i M_{ij}.$

We bound $\sum_i\delta_{ij}w_{ij}u^t_i M_{ij}$ using the Bernstein inequality. Let $X_i = (\delta_{ij} - q_{ij})w_{ij}u^t_i M_{ij}$. Then $\sum_i\mathbb{E}[X_i] = 0$ and $\sum_i u^t_i M_{ij} \le \|M_j\|$ by the Cauchy-Schwarz inequality.


Moreover, $\sum_i\mathrm{Var}(X_i) = \sum_i q_{ij}(1 - q_{ij})(w_{ij})^2(u^t_i)^2 M_{ij}^2 \le \sum_i w_{ij}(u^t_i)^2 M_{ij}^2 \le \frac{4 n c_1^2}{m}\|M_j\|^2$. Finally, $|X_i| \le |w_{ij}u^t_i M_{ij}| \le \frac{4 n c_1}{m}\,|M_{ij}|\big/\sqrt{|M_{ij}|/\|M\|_F} \le \frac{4 n c_1}{m}\sqrt{|M_{ij}|\,\|M\|_F}$. Hence applying the Bernstein inequality with $t = \delta_1\sqrt{\|M_j\|^2 + |M_{ij}|\,\|M\|_F}$ gives

$\sum_i\delta_{ij}w_{ij}u^t_i M_{ij} \le (1 + \delta_1)\sqrt{\|M_j\|^2 + |M_{ij}|\,\|M\|_F},$

with probability greater than $1 - \frac{2}{n^3}$, when $m \ge \frac{24 c_1^2}{\delta_1^2}n\log(n)$. Since $\delta_1 \le \frac{1}{20}$, we get $\|\widehat u^t\|\,\big|\widehat v^{t+1}_j\big| \le \frac{21}{19}\sqrt{\|M_j\|^2 + |M_{ij}|\,\|M\|_F}$. Now we bound $\|\widehat v^{t+1}\|$.

$\|\widehat u^t\|\,\|\widehat v^{t+1}\| \ge \|\widehat u^t\|\,\langle\widehat v^{t+1}, v^*\rangle \stackrel{\zeta_1}{\ge} \sigma^*\langle u^t, u^*\rangle - \sigma^*\frac{\delta_1}{1 - \delta_1}\sqrt{1 - \langle u^*, u^t\rangle^2} - \frac{1}{1 - \delta_1}\|y^T v^*\| \stackrel{\zeta_2}{\ge} \sigma^*\langle\widehat u^0, u^*\rangle - 2\sigma^*\delta_1\sqrt{1 - \langle u^*, \widehat u^0\rangle^2} - 2\delta\,\|M - M_1\| \stackrel{\zeta_3}{\ge} \tfrac{2}{5}\sigma^*.$

$\zeta_1$ follows from Lemma B.4 and equations (B.11) and (B.18). $\zeta_2$ follows from $\langle u^*, \widehat u^0\rangle \le \langle u^*, u^t\rangle$ and $\delta_1 \le \frac{1}{20}$. $\zeta_3$ follows from the argument that, for $\delta_1 \le \frac{1}{16}$, $\langle\widehat u^0, u^*\rangle - 2\delta_1\sqrt{1 - \langle u^*, \widehat u^0\rangle^2}$ is greater than $\frac{1}{2}$ whenever $\langle\widehat u^0, u^*\rangle \ge \frac{3}{5}$; this holds because $\mathrm{dist}(u^*, \widehat u^0) \le \frac{4}{5}$ from Lemma 3.1. Hence we get

$v^{t+1}_j = \frac{\widehat v^{t+1}_j}{\|\widehat v^{t+1}\|} \le \frac{3\sqrt{\|M_j\|^2 + |M_{ij}|\,\|M\|_F}}{\sigma^*} \le c_1\sqrt{\|M_j\|^2/\|M\|_F^2 + |M_{ij}|/\|M\|_F},$

for $c_1 = 6$. Hence we have shown that $v^{t+1}$ satisfies the row norm bounds. From Lemma B.2 we have, in each iteration with probability greater than $1 - \frac{\gamma}{T\log(n)}$, that $\big\|(u^t)^T R_\Omega(M - M_1)\big\| \le \mathrm{dist}(u^t, u^*)\|M - M_1\| + \delta\,\|M - M_1\|_F$. Hence the probability of failure over the $T$ iterations is less than $\gamma$. The lemma now follows from the assumption on $m$.

Now we have all the elements needed for the proof of Theorem 3.1.

Proof of Theorem 3.1 [Rank-1 case]:

Proof. Lemma 3.1 shows that $\widehat u^0$ satisfies the row norm bound condition. From Lemma 3.2 we get $\mathrm{dist}(v^{t+1}, v^*) \le \frac{1}{2}\mathrm{dist}(u^t, u^*) + 5\delta\|M - M_1\|_F/\sigma^*$. Hence $\mathrm{dist}(v^{t+1}, v^*) \le \frac{1}{4^t}\mathrm{dist}(\widehat u^0, u^*) + 10\delta\|M - M_1\|_F/\sigma^*$. After $t = O(\log(\frac{1}{\zeta}))$ iterations we get $\mathrm{dist}(v^{t+1}, v^*) \le \zeta + 10\delta\|M - M_1\|_F/\sigma^*$ and $\mathrm{dist}(u^t, u^*) \le \zeta + 10\delta\|M - M_1\|_F/\sigma^*$. Hence,

$\|M_1 - \widehat u^t(\widehat v^{t+1})^T\| \le \|(I - u^t(u^t)^T)M_1\| + \big\|u^t\big((u^t)^T M_1 - (\widehat v^{t+1})^T\big)\big\| \stackrel{\zeta_1}{\le} \sigma^*_1\,\mathrm{dist}(u^t, u^*) + \big\|\sigma^* B^{-1}\big(\langle u^t, u^*\rangle B - C\big)v^*\big\| + \|B^{-1}y\| \stackrel{\zeta_2}{\le} \sigma^*_1\,\mathrm{dist}(u^t, u^*) + 2\delta_1\|M\|\,\mathrm{dist}(u^t, u^*) + 2\,\mathrm{dist}(u^t, u^*)\|M - M_1\| + 2\delta\,\|M - M_1\|_F \le c\,\sigma^*_1\,\zeta + \epsilon\,\|M - M_1\|_F.$

$\zeta_1$ follows from equation (B.11), and $\zeta_2$ from $\|B^{-1}\| \le \frac{1}{1 - \delta_1} \le 2$, which follows from Lemma B.3. From Lemma B.2 we have, in each iteration with probability greater than $1 - \frac{\gamma}{T\log(n)}$, that $\big\|(u^t)^T R_\Omega(M - M_1)\big\| \le \mathrm{dist}(u^t, u^*)\|M - M_1\| + \delta\,\|M - M_1\|_F$. Hence the probability of failure over the $T$ iterations is less than $\gamma$.

B.3 Rank-r proofs: Let the SVD of $M_r$ be $U^*\Sigma^*(V^*)^T$, where $U^*, V^*$ are $n\times r$ orthonormal matrices and $\Sigma^*$ is an $r\times r$ diagonal matrix with $\Sigma^*_{ii} = \sigma^*_i$. In this section we present the rank-$r$ proof of Lemma 3.2. Before that, we present rank-$r$ versions of the supporting lemmas.

Now, as shown in [19], we analyze an equivalent version of Algorithm 2 in which the iterates are orthogonalized at each step; this makes the analysis significantly simpler to present. Let $\widehat U^{(t)} = U^{(t)}R^{(t)}$ and $\widehat V^{(t+1)} = V^{(t+1)}R^{(t+1)}$ be the respective QR factorizations. Then we replace step 7 of Algorithm 2 with

$\widehat V^{(t+1)} = \arg\min_{V\in\mathbb{R}^{n\times r}}\big\|R_{\Omega_{2t+1}}\big(M - U^{(t)}V^T\big)\big\|_F^2,$

and we similarly change step 8. We also assume that the samples for each iteration are generated independently, and for simplicity we drop the subscripts on $\Omega$ that denote the different sets of samples in each iteration in the rest of the proof. The weighted alternating minimization updates at the $(t+1)$-th iteration are

(B.20) $(\widehat V^{(t+1)})^j = (B^j)^{-1}\Big(C^j\,\Sigma^*(V^*)^j + (U^{(t)})^T R_\Omega(M - M_r)_j\Big),$

where $B^j$ and $C^j$ are the $r\times r$ matrices

$B^j = \sum_i\delta_{ij}w_{ij}\,(U^{(t)})^i\big((U^{(t)})^i\big)^T, \qquad C^j = \sum_i\delta_{ij}w_{ij}\,(U^{(t)})^i\big((U^*)^i\big)^T.$


Writing this in terms of power method updates we get

(B.21) $(\widehat V^{(t+1)})^j = \Big((U^{(t)})^T U^* - (B^j)^{-1}\big(B^j(U^{(t)})^T U^* - C^j\big)\Big)\Sigma^*(V^*)^j + (B^j)^{-1}(U^{(t)})^T R_\Omega(M - M_r)_j.$

Hence,

$(\widehat V^{(t+1)})^T = (U^{(t)})^T U^*\Sigma^*(V^*)^T - F + \sum_{j=1}^n(B^j)^{-1}(U^{(t)})^T R_\Omega(M - M_r)_j\, e_j^T,$

where the $j$-th column of $F$ is $F_j = (B^j)^{-1}\big(B^j(U^{(t)})^T U^* - C^j\big)\Sigma^*(V^*)^j$.
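A minimal numpy sketch of the per-column weighted least squares behind (B.20) (illustrative only; `delta`, `w` and `U` denote the assumed sampling indicators, weights and current $n\times r$ factor):

import numpy as np

def waltmin_rankr_step(M, delta, w, U):
    # For each column j, solve  min_x sum_i delta_ij w_ij (M_ij - <U^i, x>)^2,
    # i.e.  V^j = (B^j)^{-1} sum_i delta_ij w_ij M_ij U^i,  B^j = sum_i delta_ij w_ij U^i (U^i)^T.
    n, r = U.shape
    d = M.shape[1]
    V = np.zeros((d, r))
    for j in range(d):
        s = delta[:, j] * w[:, j]
        Bj = (U * s[:, None]).T @ U
        bj = U.T @ (s * M[:, j])
        V[j] = np.linalg.solve(Bj + 1e-12 * np.eye(r), bj)
    Q, _ = np.linalg.qr(V)   # orthogonalized iterate, as in the analyzed variant
    return V, Q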

First we bound $\|B^j\|$ using the matrix Bernstein inequality.

Lemma B.5. For $\Omega$ generated according to (3.2) and under the assumptions of Lemma 3.2, the following holds:

(B.22) $\|B^j - I\| \le \delta_2 \quad\text{and}\quad \|C^j - (U^{(t)})^T U^*\| \le \delta_2,$

with probability greater than $1 - \frac{2}{n^2}$, for $m \ge \beta n r\log(n)$, $\beta \ge \frac{4\cdot 48\, c_1^2}{\delta_2^2}$ and $\delta_2 \le \frac{1}{3r}$.

Now we bound the error caused by the $M - M_r$ component in each iteration.

Lemma B.6. For $\Omega$ generated according to (3.2) and under the assumptions of Lemma 3.2, the following holds:

(B.23) $\big\|(U^{(t)})^T R_\Omega(M - M_r) - (U^{(t)})^T(M - M_r)\big\| \le \delta\,\|M - M_r\|_F,$

with probability greater than $1 - \frac{1}{c_2\log(n)}$, for $m \ge \beta n r\log(n)$, $\beta \ge \frac{4 c_1^2 c_2}{\delta^2}$. Hence $\big\|(U^{(t)})^T R_\Omega(M - M_r)\big\| \le \mathrm{dist}(U^{(t)}, U^*)\,\|M - M_r\| + \delta\,\|M - M_r\|_F$, for constant $\delta$.

Since $\|B^j\|$ is bounded by the previous lemma, to control the norm of the error term $\|F\|$ in equation (B.21) it suffices to bound $\|\widetilde F\|$, where $\widetilde F_j = \big(B^j(U^{(t)})^T U^* - C^j\big)\Sigma^*(V^*)^j$ (so that $F_j = (B^j)^{-1}\widetilde F_j$).

Lemma B.7. For $\Omega$ generated according to (3.2) and under the assumptions of Lemma 3.2, the following holds:

(B.24) $\|\widetilde F\| \le \delta_2\,\sigma^*_1\,\mathrm{dist}(U^{(t)}, U^*),$

with probability greater than $1 - \frac{2}{n^2}$, for $m \ge \beta n\log(n)$, $\beta \ge \frac{32 r^3\kappa^2}{\delta_2^2}$, $c_1 \le 8\sqrt r$ and $\delta_2 \le \frac{1}{2}$.

The proofs of the above three lemmas are similar to the rank-1 proofs presented in the previous section and we skip them in this version.

Now, since $\widehat V^{(t+1)} = V^{(t+1)}R^{(t+1)}$,

(B.25) $\sigma_{\min}(R^{(t+1)}) = \sigma_{\min}(\widehat V^{(t+1)}) \stackrel{\zeta_1}{\ge} \sigma_{\min}\big((U^{(t)})^T U^*\Sigma^*(V^*)^T\big) - \|F\| - \Big\|\sum_{j=1}^n(B^j)^{-1}(U^{(t)})^T R_\Omega(M - M_r)_j\, e_j^T\Big\|.$

$\zeta_1$ follows from (B.21). Now $\sigma_{\min}\big((U^{(t)})^T U^*\Sigma^*(V^*)^T\big) \ge \sigma^*_r\sqrt{1 - \mathrm{dist}(U^{(t)}, U^*)^2}$, and $\|F\| \le \frac{1}{1 - \delta_2}\|\widetilde F\| \le \frac{\delta_2}{1 - \delta_2}\,\sigma^*_1\,\mathrm{dist}(U^{(t)}, U^*)$.

Furthermore,

$\Big\|\sum_{j=1}^n(B^j)^{-1}(U^{(t)})^T R_\Omega(M - M_r)_j\, e_j^T\Big\| \le \frac{1}{1 - \delta_2}\big\|(U^{(t)})^T R_\Omega(M - M_r)\big\|_F \le \frac{1}{1 - \delta_2}\,\mathrm{dist}(U^{(t)}, U^*)\,\|M - M_r\| + \frac{\delta}{1 - \delta_2}\,\|M - M_r\|_F,$

from Lemma B.6. Hence,

$\sigma_{\min}(R^{(t+1)}) \ge \sigma^*_r\Big(\sqrt{1 - \mathrm{dist}(U^{(t)}, U^*)^2} - \frac{\delta_2\kappa}{1 - \delta_2}\,\mathrm{dist}(U^{(t)}, U^*) - \frac{2}{1 - \delta_2}\,\|M - M_r\|_F/\sigma^*_r\Big) \ge \frac{\sigma^*_r}{2},$

for a large enough number of samples $m$. Now we are ready to present the proof of Lemma 3.2 for the rank-$r$ case.

Proof of Lemma 3.2:

Proof. As in the rank-1 case, the proof has two steps. In the first step we show that $\mathrm{dist}(V^{(t+1)}, V^*)$ decreases in each iteration; in the second step we show the row norm bounds. Recall from the assumptions of the lemma that we have the following row norm bounds for $U^{(t)}$:

(B.26) $\|(U^{(t)})^i\| \le c_1\sqrt{\|M^i\|^2/\|M\|_F^2 + |M_{ij}|/\|M\|_F}, \qquad \forall i.$

We will prove that $V^{(t+1)}$ satisfies this condition.


Bounding $\mathrm{dist}(V^{(t+1)}, V^*)$:

$\mathrm{dist}(V^{(t+1)}, V^*) = \big\|(V^{(t+1)})^T V^*_\perp\big\| \stackrel{\zeta_1}{\le} \big\|\big((R^{(t+1)})^{-1}\big)^T F\, V^*_\perp\big\| + \frac{1}{1 - \delta_2}\big\|\big((R^{(t+1)})^{-1}\big)^T(U^{(t)})^T R_\Omega(M - M_r)V^*_\perp\big\|_F \stackrel{\zeta_2}{\le} \frac{1}{\sigma_{\min}(R^{(t+1)})}\Big(\|F\| + \frac{1}{1 - \delta_2}\,\mathrm{dist}(U^{(t)}, U^*)\,\|M - M_r\| + \frac{\delta}{1 - \delta_2}\,\|M - M_r\|_F\Big) \le \frac{2}{\sigma^*_r}\Big(\frac{\delta_2}{1 - \delta_2}\,\|M\|\,\mathrm{dist}(U^{(t)}, U^*) + \frac{1}{1 - \delta_2}\,\mathrm{dist}(U^{(t)}, U^*)\,\|M - M_r\| + \frac{\delta}{1 - \delta_2}\,\|M - M_r\|_F\Big) \le \frac{1}{2}\,\mathrm{dist}(U^{(t)}, U^*) + 5\delta\,\|M - M_r\|_F/\sigma^*_r,$

for $\delta_2 \le \frac{1}{16}$. $\zeta_1$ follows from (B.21); $\zeta_2$ follows from Lemma B.6.

Bounding $\|(V^{(t+1)})^j\|$: From Lemma B.5 and (B.26) we get that $\sigma_{\min}(B^j) \ge 1 - \delta_2$ and $\sigma_{\max}(C^j) \le 1 + \delta_2$. Recall that

(B.27) $(V^{(t+1)})^j = \big((R^{(t+1)})^{-1}\big)^T\Big((B^j)^{-1}(U^{(t)})^T R_\Omega(M)_j\Big).$

Hence,

$\|(V^{(t+1)})^j\| \le \frac{1}{\sigma_{\min}(R^{(t+1)})}\Big(\frac{1}{1 - \delta_2}\big\|(U^{(t)})^T R_\Omega(M)_j\big\|\Big).$

We bound $\|(U^{(t)})^T R_\Omega(M)_j\|$ using the matrix Bernstein inequality. Let $X_i = (\delta_{ij} - q_{ij})w_{ij}M_{ij}(U^{(t)})^i e_j^T$. Then $\mathbb{E}[X_i] = 0$ and $\sum_i X_i = (U^{(t)})^T R_\Omega(M)_j - (U^{(t)})^T M_j$. Now $\|X_i\| \le \frac{2 n c_1}{m}\sqrt{|M_{ij}|\,\|M\|_F}$ and $\big\|\mathbb{E}\big[\sum_i X_i X_i^T\big]\big\| \le \frac{8 c_1^2 n}{m}\|M_j\|^2$. Hence applying the matrix Bernstein inequality with $t = \delta_2\sqrt{\|M_j\|^2 + |M_{ij}|\,\|M\|_F}$ implies

$\big\|(U^{(t)})^T R_\Omega(M)_j\big\| \le \|M_j\| + \delta_2\sqrt{\|M_j\|^2 + |M_{ij}|\,\|M\|_F} \le (1 + \delta_2)\sqrt{\|M_j\|^2 + |M_{ij}|\,\|M\|_F},$

with probability greater than $1 - \frac{2}{n^2}$ for $m \ge \frac{24 c_1^2}{\delta_2^2}n\log(n)$. Hence

$\|(V^{(t+1)})^j\| \le 8\sqrt r\,\sqrt{\frac{\|M_j\|^2}{\|M\|_F^2} + \frac{|M_{ij}|}{\|M\|_F}}.$

Hence we have shown that $(V^{(t+1)})^j$ satisfies the corresponding row norm bounds. This completes the proof.

Now we have all the elements needed for the proof of Theorem 3.1; the proof is similar to the rank-1 case and follows from Lemmas 3.1 and 3.2.

C Proofs of Section 3.3

We now discuss the proof of Theorem 3.2. The proof follows the same structure as the proof of Theorem 3.1, with a few key changes due to the absence of the $L_1$ term in the sampling and the special structure $M = AB$. Again, for simplicity we present the proofs only for the case $n_1 = n_2 = n$.

Recall that $\hat q_{ij} = \min(1, q_{ij})$, where $q_{ij} = m\cdot\Big(\frac{\|A^i\|^2}{n\|A\|_F^2} + \frac{\|B_j\|^2}{n\|B\|_F^2}\Big)$. Also, let $w_{ij} = 1/\hat q_{ij}$.
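As a concrete illustration of this distribution, here is a minimal numpy sketch (an illustrative sketch under assumptions, not the paper's code) that draws $\Omega$ and the weights $w_{ij}$ directly from the row norms of $A$ and the column norms of $B$, computing only the sampled entries of the product:

import numpy as np

def sample_product_entries(A, B, m, rng=np.random.default_rng(0)):
    # q_ij = m * (||A^i||^2/(n ||A||_F^2) + ||B_j||^2/(n ||B||_F^2)), capped at 1 (square case).
    n = A.shape[0]
    row = np.linalg.norm(A, axis=1) ** 2 / np.linalg.norm(A) ** 2
    col = np.linalg.norm(B, axis=0) ** 2 / np.linalg.norm(B) ** 2
    q = np.minimum(1.0, m * (row[:, None] + col[None, :]) / n)   # \hat q_ij
    delta = rng.random((n, n)) < q                                # indicator of Omega
    w = np.where(delta, 1.0 / q, 0.0)                             # weights w_ij = 1/\hat q_ij
    I, J = np.nonzero(delta)
    vals = np.einsum('ij,ji->i', A[I, :], B[:, J])                # (AB)_ij only on Omega
    return delta, w, I, J, vals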

First we abstract out the properties of the sampling distribution (3.5) that we use in the rest of the proof. Also, let $C_{AB} = \frac{\big(\|A\|_F^2 + \|B\|_F^2\big)^2}{\|AB\|_F^2}$.

Lemma C.1. For $\Omega$ generated according to (3.5) and under the assumptions of Lemma C.2, the following holds for all $(i, j)$ such that $\hat q_{ij} = q_{ij} \le 1$:

(C.28) $\frac{M_{ij}}{q_{ij}} \le \frac{n}{2m}\big(\|A\|_F^2 + \|B\|_F^2\big),$

(C.29) $\sum_{\{j:\,\hat q_{ij} = q_{ij}\}}\frac{M_{ij}^2}{q_{ij}} \le \frac{n}{m}\big(\|A\|_F^2 + \|B\|_F^2\big)^2,$

(C.30) $\frac{\|(U^*)^i\|^2}{q_{ij}} \le \frac{n}{m}\,\frac{\big(\|A\|_F^2 + \|B\|_F^2\big)^2}{\|A\cdot B\|_F^2},$

and

(C.31) $\frac{\|(U^*)^i\|\,\|(V^*)^j\|}{q_{ij}} \le \frac{n}{m}\,\frac{\big(\|A\|_F^2 + \|B\|_F^2\big)^2}{\|A\cdot B\|_F^2}.$
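These bounds follow directly from the definition of $q_{ij}$ and can be sanity-checked numerically; a small illustrative sketch for (C.28), with arbitrary toy sizes (not from the paper):

import numpy as np

rng = np.random.default_rng(1)
n, m = 300, 5000
A = rng.standard_normal((n, 20)); B = rng.standard_normal((20, n))
M = A @ B
row = np.linalg.norm(A, axis=1) ** 2 / np.linalg.norm(A) ** 2
col = np.linalg.norm(B, axis=0) ** 2 / np.linalg.norm(B) ** 2
q = m * (row[:, None] + col[None, :]) / n
mask = q <= 1                                        # entries where the cap is inactive
lhs = np.abs(M[mask]) / q[mask]
rhs = n / (2 * m) * (np.linalg.norm(A) ** 2 + np.linalg.norm(B) ** 2)
print(lhs.max() <= rhs)                              # (C.28) holds on this instance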

The proof of Lemma C.1 is straightforward from the definition of $q_{ij}$. Now, similar to the proof of Theorem 3.1, we divide our analysis into two parts: initialization analysis and weighted alternating minimization analysis.

C.1 Initialization

Lemma C.2. (Initialization) Let the set of entries $\Omega$ be generated according to $\hat q_{ij}$ (3.5). Also, let $m \ge C\,C_{AB}\,\frac{n}{\delta^2}\log(n)$. Then, the following holds (w.p. $\ge 1 - \frac{2}{n^{10}}$):

(C.32) $\|R_\Omega(AB) - AB\| \le \delta\,\|AB\|_F.$

Also, if $\|AB - (AB)_r\|_F \le \frac{1}{576\,\kappa\, r^{1.5}}\|(AB)_r\|_F$, then the following holds (w.p. $\ge 1 - \frac{2}{n^{10}}$):

$\|(\widehat U^{(0)})^i\| \le 8\sqrt r\,\sqrt{\|A^i\|^2/\|A\|_F^2} \quad\text{and}\quad \mathrm{dist}(\widehat U^{(0)}, U^*) \le \frac{1}{2},$

where $\widehat U^{(0)}$ is the initial iterate obtained using Steps 4, 5 of Sub-Procedure 2, $\kappa = \sigma^*_1/\sigma^*_r$, $\sigma^*_i$ is the $i$-th singular value of $AB$, and $(AB)_r = U^*\Sigma^*(V^*)^T$.


Proof. First we show that $R_\Omega(AB)$ is a good approximation of $AB$. Let $M = AB$. We prove this part of the lemma using the matrix Bernstein inequality. Let $X_{ij} = (\delta_{ij} - \hat q_{ij})\,w_{ij}\,M_{ij}\, e_i e_j^T$. Note that $\{X_{ij}\}_{i,j=1}^n$ are independent zero-mean random matrices, and $R_\Omega(AB) - \mathbb{E}[R_\Omega(AB)] = \sum_{ij}X_{ij}$.

First we bound $\|X_{ij}\|$. When $q_{ij} \ge 1$, we have $\hat q_{ij} = 1$ and $\delta_{ij} = 1$, so $X_{ij} = 0$ with probability 1. Hence we only need to consider the cases where $\hat q_{ij} = q_{ij} \le 1$; we assume this in all the proofs without explicitly mentioning it again.

$\|X_{ij}\| = \max\big\{\big|(1 - \hat q_{ij})w_{ij}M_{ij}\big|,\ \big|\hat q_{ij}w_{ij}M_{ij}\big|\big\}.$

Recall $w_{ij} = 1/\hat q_{ij}$. Hence

$\big|(1 - \hat q_{ij})w_{ij}M_{ij}\big| \le \Big|\frac{M_{ij}}{q_{ij}}\Big| \stackrel{\zeta_1}{\le} \frac{n}{2m}\big(\|A\|_F^2 + \|B\|_F^2\big),$

where $\zeta_1$ follows from (C.28), and

$\big|\hat q_{ij}w_{ij}M_{ij}\big| = |M_{ij}| \stackrel{\zeta_1}{\le} \Big|\frac{M_{ij}}{q_{ij}}\Big| \le \frac{n}{2m}\big(\|A\|_F^2 + \|B\|_F^2\big),$

where $\zeta_1$ follows from $q_{ij} \le 1$. Hence $\|X_{ij}\|$ is bounded by $L = \frac{n}{2m}\big(\|A\|_F^2 + \|B\|_F^2\big)$. Recall that this is the step in the proof of Lemma 3.1 that required the $L_1$ term in the sampling; we do not need it here because of the structure $AB$ of the matrix. Now we bound the variance. Recall,

$\Big\|\mathbb{E}\Big[\sum_{ij}X_{ij}X_{ij}^T\Big]\Big\| = \max_i\Big|\sum_j q_{ij}(1 - q_{ij})\,w_{ij}^2 M_{ij}^2\Big|.$

Now,

$\sum_j q_{ij}(1 - q_{ij})\,w_{ij}^2 M_{ij}^2 \le \sum_j\frac{M_{ij}^2}{q_{ij}} \stackrel{\zeta_1}{\le} \frac{n}{m}\big(\|A\|_F^2 + \|B\|_F^2\big)^2.$

$\zeta_1$ follows from (C.29). Hence

$\Big\|\mathbb{E}\Big[\sum_{ij}X_{ij}X_{ij}^T\Big]\Big\| \le \frac{n}{m}\big(\|A\|_F^2 + \|B\|_F^2\big)^2.$

We can prove the same bound for $\big\|\mathbb{E}\big[\sum_{ij}X_{ij}^T X_{ij}\big]\big\|$. Hence $\sigma^2 = \frac{n}{m}\big(\|A\|_F^2 + \|B\|_F^2\big)^2$. Now using the matrix Bernstein inequality with $t = \delta\|AB\|_F$ gives, with probability $\ge 1 - \frac{2}{n^2}$,

$\|R_\Omega(AB) - \mathbb{E}[R_\Omega(AB)]\| = \|R_\Omega(AB) - AB\| \le \delta\,\|AB\|_F.$

Once we have $\|R_\Omega(M) - M\| \le \delta\|M\|_F$, the proof that the trimming step guarantees

$\|(\widehat U^{(0)})^i\| \le 8\sqrt r\,\sqrt{\frac{\|A^i\|^2}{\|A\|_F^2}} \quad\text{and}\quad \mathrm{dist}(\widehat U^{(0)}, U^*) \le \frac{1}{2}$

follows from the same argument as in Lemma 3.1.
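For intuition, a minimal sketch of this initialization on the product (illustrative; it reuses the hypothetical `sample_product_entries` helper sketched above, and uses $A$'s row norms as the trimming proxy): build the reweighted sparse matrix $R_\Omega(AB)$ from the sampled entries only, take its top-$r$ SVD, then trim and orthonormalize as in Lemma 3.1.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def initialize_from_product(A, B, m, r):
    delta, w, I, J, vals = sample_product_entries(A, B, m)   # assumed helper from above
    n = A.shape[0]
    R = csr_matrix((vals * w[I, J], (I, J)), shape=(n, n))   # R_Omega(AB), reweighted
    U0, _, _ = svds(R, k=r)                                   # top-r left singular vectors
    l = np.linalg.norm(A, axis=1) / np.linalg.norm(A)         # trimming proxy (assumption)
    U0[np.abs(U0) > 2 * l[:, None]] = 0.0
    Q, _ = np.linalg.qr(U0)
    return Q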

C.2 Weighted AltMin Analysis

Lemma C.3. (WAltMin Descent) Let the hypotheses of Theorem 3.2 hold. Also, let $\|AB - (AB)_r\|_F \le \frac{1}{576\,\kappa\, r\sqrt r}\|(AB)_r\|_F$. Let $\widehat U^{(t)}$ be the $t$-th step iterate of Sub-Procedure 2 (called from WAltMin($P_\Omega(A\cdot B), \Omega, q, T$)), and let $\widehat V^{(t+1)}$ be the $(t+1)$-th iterate (for $V$). Also, let $\|(U^{(t)})^i\| \le 8\sqrt r\,\sqrt{\|A^i\|^2/\|A\|_F^2}$ and $\mathrm{dist}(\widehat U^{(t)}, U^*) \le \frac{1}{2}$, where $U^{(t)}$ is a set of orthonormal vectors spanning $\widehat U^{(t)}$. Then, the following holds (w.p. $\ge 1 - \gamma/T$):

$\mathrm{dist}(\widehat V^{(t+1)}, V^*) \le \frac{1}{2}\,\mathrm{dist}(\widehat U^{(t)}, U^*) + \epsilon\,\frac{\|AB - (AB)_r\|_F}{\sigma^*_r},$

and $\|(V^{(t+1)})^j\| \le 8\sqrt r\,\sqrt{\|B_j\|^2/\|B\|_F^2}$, where $V^{(t+1)}$ is a set of orthonormal vectors spanning $\widehat V^{(t+1)}$.

For the sake of simplicity, we discuss the proof of the rank-1 ($r = 1$) case for this part of the algorithm; the rank-$r$ proof follows by combining the analysis below with the rank-$r$ analysis of Lemma 3.2 (see Section B.3). Before presenting the proof of this lemma, we state a couple of supporting lemmas, whose proofs follow very closely the ones in Section B.2.

Lemma C.4. For $\Omega$ sampled according to (3.5), the following holds:

(C.33) $\Big|\sum_j\delta_{ij}w_{ij}(u^*_j)^2 - \sum_j(u^*_j)^2\Big| \le \delta_1,$

with probability greater than $1 - \frac{2}{n^2}$, for $m \ge \beta\, C_{AB}\, n\log(n)$, $\beta \ge \frac{16}{\delta_1^2}$ and $\delta_1 \le \frac{1}{3}$.

We assume that the samples for each iteration are generated independently, and for simplicity we drop the subscripts on $\Omega$ that denote the different sets of samples in each iteration in the rest of the proof. The weighted alternating minimization updates at the $(t+1)$-th iteration are

(C.34) $\|\widehat u^t\|\,\widehat v^{t+1}_j = \sigma^* v^*_j\,\frac{\sum_i\delta_{ij}w_{ij}u^t_i u^*_i}{\sum_i\delta_{ij}w_{ij}(u^t_i)^2} + \frac{\sum_i\delta_{ij}w_{ij}u^t_i(M - M_1)_{ij}}{\sum_i\delta_{ij}w_{ij}(u^t_i)^2}.$


Writing this in terms of power method updates we get

(C.35) $\|\widehat u^t\|\,\widehat v^{t+1} = \sigma^*\langle u^*, u^t\rangle v^* - \sigma^* P^{-1}\big(\langle u^t, u^*\rangle P - Q\big)v^* + P^{-1}y,$

where $P$ and $Q$ are diagonal matrices with $P_{jj} = \sum_i\delta_{ij}w_{ij}(u^t_i)^2$ and $Q_{jj} = \sum_i\delta_{ij}w_{ij}u^t_i u^*_i$, and $y$ is the vector $R_\Omega(M - M_1)^T u^t$ with entries $y_j = \sum_i\delta_{ij}w_{ij}u^t_i(M - M_1)_{ij}$. Now we bound the error caused by the $M - M_r$ component in each iteration.

Lemma C.5. For $\Omega$ generated according to (3.5), the following holds:

(C.36) $\big\|(U^{(t)})^T R_\Omega(M - M_r) - (U^{(t)})^T(M - M_r)\big\| \le \delta\,\|M - M_r\|_F,$

with probability greater than $1 - \frac{1}{c_2\log(n)}$, for $m \ge \beta n r\log(n)$, $\beta \ge \frac{4 c_1^2 c_2}{\delta^2}$. Hence $\big\|(U^{(t)})^T R_\Omega(M - M_r)\big\| \le \mathrm{dist}(U^{(t)}, U^*)\,\|M - M_r\| + \delta\,\|M - M_r\|_F$, for constant $\delta$.

Lemma C.6. For $\Omega$ sampled according to (3.5) and for $u^t$ satisfying equation (C.38), the following holds:

(C.37) $\big\|\big(\langle u^t, u^*\rangle P - Q\big)v^*\big\| \le \delta_1\sqrt{1 - \langle u^*, u^t\rangle^2},$

with probability greater than $1 - \frac{2}{n^2}$, for $m \ge \beta\, C_{AB}\, n\log(n)$, $\beta \ge \frac{48 c_1^2}{\delta_1^2}$ and $\delta_1 \le \frac{1}{3}$.

Now we provide the proof of Lemma C.3.

Proof of Lemma C.3 [Rank-1 case]:

Proof. Let $u^t$ and $v^{t+1}$ be the normalized vectors of the iterates $\widehat u^t$ and $\widehat v^{t+1}$. In the first step we prove that the distance between $u^t$, $v^{t+1}$ and $u^*$, $v^*$ decreases with each iteration. From the assumptions of the lemma we have

(C.38) $|u^t_i| \le c_1\sqrt{\|A^i\|^2/\|A\|_F^2}.$

In the second step we prove that $v^{t+1}$ satisfies $|v^{t+1}_j| \le c_1\sqrt{\|B_j\|^2/\|B\|_F^2}$.

Bounding $\langle v^{t+1}, v^*\rangle$: The proof of this part is the same as the corresponding proofs in the previous section, and hence we skip it.

Bounding $v^{t+1}_j$: From Lemma C.4 and (C.38) we get that $\big|\sum_i\delta_{ij}w_{ij}(u^t_i)^2 - 1\big| \le \delta_1$ and $\big|\sum_i\delta_{ij}w_{ij}u^*_i u^t_i - \langle u^*, u^t\rangle\big| \le \delta_1$, when $\beta \ge \frac{16 c_1^2}{\delta_1^2}$. Hence,

(C.39) $1 - \delta_1 \le P_{jj} = \sum_i\delta_{ij}w_{ij}(u^t_i)^2 \le 1 + \delta_1,$

and

(C.40) $Q_{jj} = \sum_i\delta_{ij}w_{ij}u^t_i u^*_i \le \langle u^t, u^*\rangle + \delta_1.$

Recall that

$\|\widehat u^t\|\,\big|\widehat v^{t+1}_j\big| = \bigg|\frac{\sum_i\delta_{ij}w_{ij}u^t_i M_{ij}}{\sum_i\delta_{ij}w_{ij}(u^t_i)^2}\bigg| \le \frac{1}{1 - \delta_1}\sum_i\delta_{ij}w_{ij}u^t_i M_{ij}.$

We bound $\sum_i\delta_{ij}w_{ij}u^t_i M_{ij}$ using the Bernstein inequality. Let $X_i = (\delta_{ij} - q_{ij})w_{ij}u^t_i M_{ij}$. Then $\sum_i\mathbb{E}[X_i] = 0$ and $\sum_i u^t_i M_{ij} \le \|M_j\|$ by the Cauchy-Schwarz inequality.

Moreover, $\sum_i\mathrm{Var}(X_i) = \sum_i q_{ij}(1 - q_{ij})(w_{ij})^2(u^t_i)^2 M_{ij}^2 \le \sum_i w_{ij}(u^t_i)^2 M_{ij}^2 \le \frac{n c_1^2}{m}\|M_j\|^2 \le \frac{n c_1^2}{m}\|B_j\|^2\|A\|_F^2$. Finally, $|X_i| \le |w_{ij}u^t_i M_{ij}| \le \frac{n c_1}{m}\|A\|_F\,\|B_j\|$. Hence applying the Bernstein inequality with $t = \delta_1\frac{\|B_j\|}{\|B\|_F}\|AB\|_F$ gives

$\sum_i\delta_{ij}w_{ij}u^t_i M_{ij} \le (1 + \delta_1)\,\frac{\|B_j\|}{\|B\|_F}\,\|AB\|_F,$

with probability greater than $1 - \frac{2}{n^3}$, when $m \ge \frac{24 c_1^2}{\delta_1^2}\,C_{AB}\, n\log(n)$. For $\delta_1 \le \frac{1}{20}$ we get

$\|\widehat u^t\|\,\big|\widehat v^{t+1}_j\big| \le \frac{21}{19}\,\frac{\|B_j\|}{\|B\|_F}\,\|AB\|_F.$

Now we bound $\|\widehat v^{t+1}\|$.

$\|\widehat u^t\|\,\|\widehat v^{t+1}\| \ge \|\widehat u^t\|\,\langle\widehat v^{t+1}, v^*\rangle \stackrel{\zeta_1}{\ge} \sigma^*\langle u^t, u^*\rangle - \sigma^*\frac{\delta_1}{1 - \delta_1}\sqrt{1 - \langle u^*, u^t\rangle^2} - \frac{1}{1 - \delta_1}\|y^T v^*\| \stackrel{\zeta_2}{\ge} \sigma^*\langle u^0, u^*\rangle - 2\sigma^*\delta_1\sqrt{1 - \langle u^*, u^0\rangle^2} - 2\delta\,\|M - M_1\| \stackrel{\zeta_3}{\ge} \tfrac{2}{5}\sigma^*.$

$\zeta_1$ follows from Lemma C.6 and equations (C.35) and (C.39). $\zeta_2$ follows from $\langle u^*, u^0\rangle \le \langle u^*, u^t\rangle$ and $\delta_1 \le \frac{1}{20}$. $\zeta_3$ follows from the initialization and from the assumption on $m$ with a large enough constant $C > 0$. Hence we get

$v^{t+1}_j = \frac{\widehat v^{t+1}_j}{\|\widehat v^{t+1}\|} \le \frac{3\cdot\frac{21}{19}\,\frac{\|B_j\|}{\|B\|_F}\,\|AB\|_F}{\sigma^*} \le c_1\,\frac{\|B_j\|}{\|B\|_F},$

for $c_1 = 6$. Hence we have shown that $v^{t+1}$ satisfies the row norm bounds. This completes the proof.

The proof of Theorem 3.2 now follows from Lemma C.2 and Lemma C.3, in a manner similar to the proof of Theorem 3.1.


Algorithm 3 Distributed low rank approximation algorithm

input Matrix $M_{r_k}$ at server $k$, rank $r$, number of samples $m$, and number of iterations $T$.
1: Sampling: Each server $k$ computes the column norms of $M_{r_k}$ and $\|M_{r_k}\|_{1,1}$ and communicates them to the CP. The CP computes the column norms of $M$, $\|M\|_{1,1}$ and $\|M\|_F$, and communicates them to all servers.
2: Each server $k$ samples the $(i,j)$-th entry, for rows $\{r_k\}$, with probability $\min\Big\{m\Big(\frac{\|M^i_{r_k}\|^2 + \|M_j\|^2}{2n\|M\|_F^2} + \frac{|(M_{r_k})_{ij}|}{\|M\|_{1,1}}\Big),\, 1\Big\}$ and generates $\Omega^k$.
3: Each server $k$ sends the list of columns $\{c_k\}\subset[d]$ in which $R_{\Omega^k}(M_{r_k})$ has sampled entries to the CP.
4: Initialization: The CP generates a random $n\times r$ matrix $Y^{(0)}$ and communicates $Y^{(0)}_{c_k}$ to server $k$.
5: for $t = 0$ to $\log(1/\zeta)$ do
6: Each server $k$ computes $\widehat Y^{(t+1)}_{c_k} = R_{\Omega^k}(M_{r_k})^T R_{\Omega^k}(M_{r_k})\, Y^{(t)}_{c_k}$ and communicates it to the CP.
7: The CP computes $Y^{(t+1)} = \sum_k\widehat Y^{(t+1)}_{c_k}$, normalizes it, and communicates $Y^{(t+1)}_{c_k}$ to server $k$.
8: end for
9: WAltMin: Each server $k$ sets $\widehat V^{(0)}_{c_k} = Y^{(t+1)}_{c_k}$.
10: for $t = 0$ to $T-1$ do
11: Each server $k$ computes $(\widehat U^{(t+1)})^i = \arg\min_{x\in\mathbb{R}^r}\sum_{j:(i,j)\in\Omega^k}w_{ij}\big(M_{ij} - x^T(\widehat V^{(t)})^j\big)^2$, for all $i\in\{r_k\}$.
12: Each server $k$ sends to the CP: $z^k_j = \big(\widehat U^{(t+1)}_{r_k}\big)^T R_{\Omega^k}(M_{r_k})\,e_j$ and $B^k_j = \sum_{i:(i,j)\in\Omega^k}w_{ij}\,uu^T$ with $u = (\widehat U^{(t+1)})^i$, for all $j\in\{c_k\}$.
13: The CP computes $B_j = \sum_k B^k_j$ and $(\widehat V^{(t+1)})^j = B_j^{-1}\sum_k z^k_j$ for $j = 1,\cdots,d$, and communicates $(\widehat V^{(t+1)})^{c_k}$ to server $k$.
14: end for
output Server $k$ has $\widehat U^{(t+1)}_{r_k}$ and the CP has $\widehat V^{(t+1)}$.
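A minimal, single-process numpy sketch (illustrative only; the server blocks, their weight matrices `W_k`, and reweighted sampled blocks `R_k` are assumptions) of the two communication-heavy pieces above: the sampled power iteration of step 6, and the per-column aggregation $B_j = \sum_k B^k_j$, $(\widehat V^{(t+1)})^j = B_j^{-1}\sum_k z^k_j$ of steps 12-13.

import numpy as np

def server_power_step(R_k, Y):
    # Step 6: R_k is server k's reweighted sampled block R_{Omega^k}(M_{r_k}).
    return R_k.T @ (R_k @ Y)

def cp_aggregate_V(U_blocks, R_blocks, W_blocks, r):
    # Steps 12-13: each server contributes z_j^k = U_k^T R_k e_j and
    # B_j^k = sum_{i:(i,j) in Omega^k} w_ij u_i u_i^T; the CP solves B_j V^j = sum_k z_j^k.
    d = R_blocks[0].shape[1]
    V = np.zeros((d, r))
    for j in range(d):
        Bj, zj = np.zeros((r, r)), np.zeros(r)
        for U_k, R_k, W_k in zip(U_blocks, R_blocks, W_blocks):
            Bj += (U_k * W_k[:, j][:, None]).T @ U_k   # B_j^k (W_k is 0 off Omega^k)
            zj += U_k.T @ R_k[:, j]                     # z_j^k
        V[j] = np.linalg.solve(Bj + 1e-12 * np.eye(r), zj)
    return V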