Page 1

Faster least squares approximation

Tamas Sarlos

Yahoo! Research

IPAM 23 Oct 2007

http://www.ilab.sztaki.hu/~stamas

Joint work with P. Drineas, M. Mahoney, and S. Muthukrishnan.

Page 2

Outline

Randomized methods for:

• Least squares approximation
• Lp regression
• Regularized regression
• Low rank matrix approximation

Page 3

Models, curve fitting, and data analysis

In MANY applications (in statistical data analysis and scientific computation), one has n observations (values of a dependent variable y measured at values of an independent variable t):

Model y(t) by a linear combination of d basis functions:

A is an n x d “design matrix” with elements:

In matrix-vector notation:
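In standard notation, the setup above reads (a reconstruction; the basis functions φ_j and sample points t_i are the usual choices, not quoted from the slide):

```latex
y(t_i) \;\approx\; \sum_{j=1}^{d} x_j\,\phi_j(t_i), \quad i = 1,\dots,n,
\qquad A_{ij} = \phi_j(t_i), \qquad b_i = y(t_i),
\qquad \text{so that } A x \approx b .
```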

Page 4

Least Squares (LS) Approximation

We are interested in over-constrained L2 regression problems, n >> d.

Typically, there is no x such that Ax = b.

Want to find the “best” x such that Ax ≈ b.

Ubiquitous in applications & central to theory:

Statistical interpretation: best linear unbiased estimator.

Geometric interpretation: orthogonally project b onto span(A).
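For later reference, the over-constrained problem and its optimum can be written as (standard formulation):

```latex
\mathrm{OPT} \;=\; \min_{x \in \mathbb{R}^{d}} \|A x - b\|_{2},
\qquad
x_{\mathrm{OPT}} \;=\; A^{+} b,
\qquad A \in \mathbb{R}^{n \times d},\ n \gg d .
```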

Page 5

Many applications of this!

• Astronomy: Predicting the orbit of the asteroid Ceres (in 1801!).

Gauss (1809) -- see also Legendre (1805) and Adrain (1808).

First application of “least squares optimization”, and it runs in O(nd^2) time!

• Bioinformatics: Dimension reduction for classification of gene expression microarray data.

• Medicine: Inverse treatment planning and fast intensity-modulated radiation therapy.

• Engineering: Finite element methods for solving the Poisson equation, etc.

• Control theory: Optimal design and control theory problems.

• Economics: Restricted maximum-likelihood estimation in econometrics.

• Image Analysis: Array signal and image processing.

• Computer Science: Computer vision, document and information retrieval.

• Internet Analysis: Filtering and de-noising of noisy internet data.

• Data analysis: Fit parameters of a biological, chemical, economic, social, internet, etc. model to experimental data.

Page 6

Exact solution to LS Approximation

Cholesky Decomposition: If A is full rank and well-conditioned,

decompose A^T A = R^T R, where R is upper triangular, and

solve the normal equations: R^T R x = A^T b.

QR Decomposition: Slower but numerically stable, esp. if A is rank-deficient.

Write A = QR, and solve Rx = Q^T b.

Singular Value Decomposition: Most expensive, but best if A is very ill-conditioned.

Write A = U·Σ·V^T, in which case: x_OPT = A^+ b = V·Σ^{-1}·U^T·b.

Complexity is O(nd^2) for all of these, but constant factors differ.

(Here A·x_OPT = A·A^+·b is the projection of b onto the subspace spanned by the columns of A, and A^+ is the pseudoinverse of A.)
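As an illustration of these three approaches, a minimal numpy sketch (not from the talk; it assumes A has full column rank):

```python
import numpy as np

def ls_cholesky(A, b):
    # Normal equations A^T A x = A^T b, solved via a Cholesky factorization.
    G = A.T @ A
    L = np.linalg.cholesky(G)            # G = L L^T
    y = np.linalg.solve(L, A.T @ b)
    return np.linalg.solve(L.T, y)

def ls_qr(A, b):
    # Thin QR: A = QR, then solve R x = Q^T b.
    Q, R = np.linalg.qr(A)
    return np.linalg.solve(R, Q.T @ b)

def ls_svd(A, b):
    # SVD: A = U Sigma V^T, so x_OPT = V Sigma^{-1} U^T b (pseudoinverse solution).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt.T @ ((U.T @ b) / s)

A = np.random.randn(1000, 20)
b = np.random.randn(1000)
assert np.allclose(ls_cholesky(A, b), ls_qr(A, b))
assert np.allclose(ls_qr(A, b), ls_svd(A, b))
```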

Page 7

Reduction of LS Approximation to fast matrix multiplication

Theorem: For any n-by-d matrix A and n-dimensional vector b, where n > d, we can compute x* = A^+·b in O((n/d)·M(d)) time. Here M(d) denotes the time needed to multiply two d-by-d matrices.

• Current best: M(d) = d^ω, with ω = 2.376…, Coppersmith-Winograd (1990)

Proof: Ibarra et al. (1982) generalize the LUP decomposition to rectangular matrices.

Direct proof:

• Recall the normal equations ATAx = ATb

• Group the rows of A into blocks of d and compute A^T A as the sum of n/d products of d-by-d matrices.

• Solve the normal equations

• LS in O(nd^{1.376}) time, but impractical. We take O(nd^2) as the baseline below.
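A minimal sketch of the blocking step in the direct proof (plain numpy; an actual fast d-by-d multiplication routine would replace the ordinary block product):

```python
import numpy as np

def gram_blocked(A):
    """Compute A^T A as a sum of n/d products of (at most) d-by-d blocks.
    Each block product is where a fast (Strassen-like) routine would be used."""
    n, d = A.shape
    G = np.zeros((d, d))
    for start in range(0, n, d):
        block = A[start:start + d]   # up to d rows of A
        G += block.T @ block         # one d-by-d product
    return G

A = np.random.randn(1003, 17)
assert np.allclose(gram_blocked(A), A.T @ A)
```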

Page 8

Original (expensive) sampling algorithm

Algorithm

1. Fix a set of probabilities p_i, i = 1, …, n.

2. Pick the i-th row of A and the i-th element of b with probability

min{1, r·p_i},

and rescale by (1/min{1, r·p_i})^{1/2}.

3. Solve the induced problem.

Drineas, Mahoney, and Muthukrishnan (SODA, 2006)

• These probabilities p_i are statistical leverage scores!

• These p_i are “expensive” to compute: O(nd^2) time!

Theorem:

Let U be the matrix of left singular vectors of A, with i-th row U_(i). If the p_i satisfy

p_i ≥ β·||U_(i)||_2^2 / d

for some β ∈ (0,1], then with probability ≥ 1-δ the solution x̃ of the induced problem satisfies

||A·x̃ - b||_2 ≤ (1+ε)·||A·x_OPT - b||_2.
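A rough numpy sketch of this expensive sampling scheme (illustrative only: the oversampling parameter r is left free, and the exact leverage scores are computed from a thin SVD, which itself already costs O(nd^2)):

```python
import numpy as np

def leverage_score_sample_ls(A, b, r, rng=np.random.default_rng(0)):
    """Leverage-score row sampling for least squares (the expensive version)."""
    n, d = A.shape
    U, _, _ = np.linalg.svd(A, full_matrices=False)   # left singular vectors, n x d
    lev = np.sum(U * U, axis=1)                       # leverage scores, sum to d
    p = lev / d                                       # sampling probabilities
    keep = rng.random(n) < np.minimum(1.0, r * p)     # keep row i w.p. min{1, r p_i}
    w = 1.0 / np.sqrt(np.minimum(1.0, r * p[keep]))   # rescale the kept rows
    x_tilde, *_ = np.linalg.lstsq(w[:, None] * A[keep], w * b[keep], rcond=None)
    return x_tilde

A = np.random.randn(4000, 10)
b = A @ np.random.randn(10) + 0.1 * np.random.randn(4000)
print(leverage_score_sample_ls(A, b, r=400))
```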

Page 9

Fast LS via uniform sampling

Algorithm

1. Pre-process A and b with a “randomized Hadamard transform”.

2. Uniformly sample

r = O(d·log(n)·log(d·log(n))/ε)

constraints.

3. Solve the induced problem:

Drineas, Mahoney, Muthukrishnan, and Sarlos (2007)

• Non-uniform sampling will work, but it is “slow.”

• Uniform sampling is “fast,” but will it work?
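A compact numpy sketch of the whole pipeline (assumptions: n is a power of two, so a plain normalized Walsh-Hadamard transform plays the role of the randomized Hadamard transform, and the sample size r is left as a parameter rather than the exact bound above):

```python
import numpy as np

def fwht(X):
    """Normalized fast Walsh-Hadamard transform along axis 0 (length a power of 2)."""
    X = X.copy().astype(float)
    n = X.shape[0]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = X[i:i + h].copy()
            b = X[i + h:i + 2 * h].copy()
            X[i:i + h] = a + b
            X[i + h:i + 2 * h] = a - b
        h *= 2
    return X / np.sqrt(n)                 # normalization keeps the transform orthogonal

def fast_ls_uniform(A, b, r, rng=np.random.default_rng(0)):
    """Randomized Hadamard preprocessing followed by uniform row sampling."""
    n, d = A.shape
    signs = rng.choice([-1.0, 1.0], size=n)      # D: random +/-1 diagonal
    HDA = fwht(signs[:, None] * A)               # H D A
    HDb = fwht(signs * b)                        # H D b
    idx = rng.choice(n, size=r, replace=False)   # uniform sample of r rows
    scale = np.sqrt(n / r)                       # common rescaling (doesn't change argmin)
    x_tilde, *_ = np.linalg.lstsq(scale * HDA[idx], scale * HDb[idx], rcond=None)
    return x_tilde

A = np.random.randn(4096, 20)
b = A @ np.random.randn(20) + 0.01 * np.random.randn(4096)
print(fast_ls_uniform(A, b, r=800))
```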

Page 10

A structural lemma

Approximate min_x ||Ax - b||_2 by solving min_x ||X(Ax - b)||_2, where X is any matrix; call the result x̃.

Let U_A be the matrix of left singular vectors of A,

and assume that a γ-fraction of the mass of b lies in span(A).

Lemma: Assume that:

(1) the singular values of X·U_A are all close to 1, and (2) X·U_A is nearly orthogonal to the part of b outside span(A).

Then, we get a relative-error approximation:

||A·x̃ - b||_2 ≤ (1+ε)·min_x ||Ax - b||_2.

Page 11

Randomized Hadamard preprocessing

Let H_n be an n-by-n deterministic Hadamard matrix, and let D_n be an n-by-n random diagonal matrix with ±1 entries chosen uniformly at random.

Fact 1: Multiplication by H_n D_n doesn't change the solution (since H_n and D_n are orthogonal matrices).

Fact 2: Multiplication by H_n D_n is fast: only O(n log(r)) time, where r is the number of elements of the output vector we need to “touch”.

Fact 3: Multiplication by H_n D_n approximately uniformizes all leverage scores:

Facts implicit or explicit in: Ailon & Chazelle (2006), or Ailon and Liberty (2008).

Page 12

A matrix perturbation theorem

Page 13

Establishing structural conditions

Lemma: |1 - σ_i^2(SHU_A)| ≤ 1/10 for all i, with constant probability,

i.e., the singular values of SHU_A are close to 1.

Proof: Apply spectral norm matrix perturbation bound.

Lemma: ||(SHU_A)^T SHb|| ≤ ε·OPT, with constant probability,

i.e., SHU_A is nearly orthogonal to b.

Proof: Apply the Frobenius-norm matrix perturbation bound.

Bottom line:

• The fast algorithm approximately uniformizes the leverage of each data point.

• Uniform sampling is then optimal up to an O(1/log(n)) factor, so over-sample a bit.

• Overall running time: O(nd·log(…) + d^3·log(…)/ε).

Page 14

Fast LS via sparse projection

Algorithm

1. Pre-process A and b with a randomized Hadamard transform.

2. Multiply the preprocessed input by a sparse random k x n matrix T, where

k = O(d/ε) and q = O(d·log^2(n)/n + d^2·log(n)/n).

3. Solve the induced problem:

Drineas, Mahoney, Muthukrishnan, and Sarlos (2007) - sparse projection matrix from Matousek’s version of Ailon-Chazelle 2006

• Dense projections will work, but they are “slow.”

• Sparse projection is “fast,” but will it work?

-> YES! The sparsity parameter q of T is related to the non-uniformity of the leverage scores!
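A rough sketch of one standard way to realize such a sparse ±1 projection (the exact entry distribution and scaling in the paper may differ; k and q are the parameters above):

```python
import numpy as np

def sparse_projection(k, n, q, rng=np.random.default_rng(0)):
    """Sparse k x n projection: each entry is non-zero with probability q,
    with value +/- 1/sqrt(q*k), so that E[||T x||_2^2] = ||x||_2^2."""
    mask = rng.random((k, n)) < q
    signs = rng.choice([-1.0, 1.0], size=(k, n))
    return (mask * signs) / np.sqrt(q * k)

n, k, q = 8192, 400, 0.05
x = np.random.randn(n)
T = sparse_projection(k, n, q)
print(np.linalg.norm(T @ x) / np.linalg.norm(x))   # close to 1
```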

Page 15

Lp regression problems

We are interested in over-constrained Lp regression problems, n >> d.

Lp regression problems are convex programs (or better!).

There exist poly-time algorithms.

We want to solve them faster!

Dasgupta, Drineas, Harb, Kumar, and Mahoney (2008)

Page 16

Lp norms and their unit balls

Recall the Lp norm for p ∈ [1, ∞): ||x||_p = (Σ_i |x_i|^p)^{1/p}; for p = ∞, ||x||_∞ = max_i |x_i|.

Some inequality relationships include:
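For reference, with x in R^n and 1 ≤ p ≤ q ≤ ∞, the standard relationships are:

```latex
\|x\|_{q} \;\le\; \|x\|_{p} \;\le\; n^{\,1/p - 1/q}\,\|x\|_{q},
\qquad
\|x\|_{\infty} \;\le\; \|x\|_{p} \;\le\; n^{1/p}\,\|x\|_{\infty}.
```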

Page 17

Solution to Lp regression

For p=2, Least-squares approximation (minimize ||Ax-b||2):

For p=∞, Chebyshev or mini-max approximation (minimize ||Ax-b||∞):

For p=1, Sum of absolute residuals approximation (minimize ||Ax-b||1):

Lp regression can be cast as a convex program for all p ∈ [1, ∞].
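For example, the p = 1 and p = ∞ cases can be written as linear programs (standard formulations):

```latex
p = 1:\quad \min_{x,\,t}\ \textstyle\sum_{i=1}^{n} t_i
\quad \text{s.t.}\quad -t \;\le\; Ax - b \;\le\; t;
\qquad
p = \infty:\quad \min_{x,\,t}\ t
\quad \text{s.t.}\quad -t\mathbf{1} \;\le\; Ax - b \;\le\; t\mathbf{1}.
```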

Page 18

What made the L2 result work?

The L2 sampling algorithm worked because:

• For p=2, an orthogonal basis (from SVD, QR, etc.) is a “good” or “well-conditioned” basis.

(This came for free, since orthogonal bases are the obvious choice.)

• Sampling w.r.t. the “good” basis allowed us to perform “subspace-preserving sampling.”

(This allowed us to preserve the rank of the matrix and distances in its span.)

Can we generalize these two ideas to p ≠ 2?

Page 19

Sufficient condition

Easy: Approximate min_x ||b - Ax||_p by solving min_x ||S(b - Ax)||_p, giving x'.

Assume that the matrix S is an isometry over the span of A and b: for any vector y in span([A b]) we have (1-ε)·||y||_p ≤ ||Sy||_p ≤ (1+ε)·||y||_p.

Then, we get a relative-error approximation:

||b - Ax'||_p ≤ ((1+ε)/(1-ε)) · min_x ||b - Ax||_p.

(This requires S to have O(d/ε^2) rows for L2 regression.)

How to construct an Lp isometry?

Page 20

p-well-conditioned basis (definition)

Let A be an n x m matrix of rank d << n, let p ∈ [1, ∞), and let q be its dual (1/p + 1/q = 1).

Definition: An n x d matrix U is an (α, β, p)-well-conditioned basis for span(A) if:

(1) |||U|||_p ≤ α, where |||U|||_p = (Σ_ij |U_ij|^p)^{1/p}, and

(2) for all z in R^d, ||z||_q ≤ β·||Uz||_p.

U is a p-well-conditioned basis if α, β = d^{O(1)}, independent of m and n.

Page 21

p-well-conditioned basis (existence)

Let A be an n x m matrix of rank d << n, let p ∈ [1, ∞), and let q be its dual.

Theorem: There exists an (α, β, p)-well-conditioned basis U for span(A) such that:

if p < 2, then α = d^{1/p + 1/2} and β = 1,
if p = 2, then α = d^{1/2} and β = 1,
if p > 2, then α = d^{1/p + 1/2} and β = d^{1/q - 1/2}.

U can be computed in O(nmd + nd^5·log n) time (or just O(nmd) if p = 2).

Page 22

p-well-conditioned basis (construction)

Algorithm:

• Let A=QR be any QR decomposition of A.

(Stop if p=2.)

• Define a norm on R^d by ||z||_{Q,p} := ||Qz||_p.

• Let C be the unit ball of the norm ||·||_{Q,p}.

• Let the d x d matrix F define the Löwner-John ellipsoid of C.

• Decompose F = G^T G,

where G is full rank and upper triangular.

• Return U = Q·G^{-1}

as the p-well-conditioned basis.

Page 23

Subspace-preserving sampling

Let A be an n x m matrix of rank d << n, and let p ∈ [1, ∞).

Let U be an (α, β, p)-well-conditioned basis for span(A).

Theorem: Randomly sample rows of A according to the probability distribution:

where:

Then, with probability 1-δ, the following holds for all x in R^m:
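In this line of work the sampling probabilities are typically of the form below (a plausible reconstruction; the exact choice of r is in the paper), where U_(i) denotes the i-th row of U:

```latex
p_i \;=\; \min\!\left\{ 1,\; r \cdot \frac{\|U_{(i)}\|_{p}^{p}}{\,|||U|||_{p}^{p}} \right\},
\qquad i = 1,\dots,n .
```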

Page 24

Regularized regression

• If you don't trust your data (entirely)

• Minimize ||b - Ax||_p + λ·||x||_r, where the norm index r is arbitrary

• The Lasso is especially popular: min_x ||b - Ax||_2 + λ·||x||_1

• Sampled Lp regression works for the regularized version too

• We want more: imagine

Page 25

Ridge regression

• Tikhonov regularization, back to L2

• Minimize ||b - Ax||_2^2 + λ^2·||x||_2^2

• Equivalent to augmenting the n-by-d matrix A with λ·I_d and b with d zeros

• SVD of the augmented matrix: Ũ·Diag(sqrt(σ_i^2 + λ^2))·V^T, where A = U·Diag(σ_i)·V^T

• Let D = Σ_i sqrt(…) << d (an effective dimension determined by the σ_i and λ)

• For a (1+ε) relative error:
  • precondition with a randomized FFT,
  • sample r = O(D·log(n)·log(D·log(n))/ε) rows uniformly.

Page 26

SVD of a matrix

U (V): orthogonal matrix containing the left (right) singular vectors of A.

Σ: diagonal matrix containing the singular values of A, ordered non-increasingly.

ρ: rank of A, the number of non-zero singular values.

Exact computation of the SVD takes O(min{mn^2, m^2n}) time.

Any m x n matrix A can be decomposed as A = U·Σ·V^T.

Page 27

SVD and low-rank approximations

U_k (V_k): orthogonal matrix containing the top k left (right) singular vectors of A.

Σ_k: diagonal matrix containing the top k singular values of A.

• A_k is the “best” approximation to A among all rank-k matrices w.r.t. the spectral and Frobenius norms.

• This optimality property is very useful in, e.g., Principal Component Analysis (PCA), LSI, etc.

Truncate the SVD by keeping the top k ≤ ρ terms: A_k = U_k·Σ_k·V_k^T.
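A small numpy illustration of the truncation (sketch only):

```python
import numpy as np

def truncated_svd(A, k):
    """Best rank-k approximation A_k = U_k Sigma_k V_k^T via the full SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

A = np.random.randn(200, 50)
A10 = truncated_svd(A, 10)
print(np.linalg.matrix_rank(A10))   # 10
```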

Page 28

Approximate SVD

• Starting with the seminal work of Frieze et al. (1998) and Papadimitriou et al. (1998), until 2005 several results had the form

||A - A_k'||_F ≤ ||A - A_k||_F + t·||A||_F

• ||A||_F might be significantly larger than ||A - A_k||_F

• Four independent results [H-P, DV, MRT, S 06] on

||A - A_k'||_F ≤ (1+ε)·||A - A_k||_F

• This year, Shyamalkumar and Deshpande …

Page 29

Relative error SVD via random projections

Theorem: Let S be a random matrix with ±1 entries and r = O(k/ε) rows. Project A onto the subspace spanned by the rows of SA, and then compute A_k' as the rank-k SVD of the projection. Then ||A - A_k'||_F ≤ (1+ε)·||A - A_k||_F with probability at least 1/3.

• By keeping an orthonormal basis of SA, the total running time is O(M·r + (n+m)·r^2), where M is the number of non-zeros of A.

• We need to make only 2 passes over the data.

Fast random projections can also be used, avoiding the projection onto SA; spectral-norm versions exist as well (talks by R and T).
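A minimal numpy sketch of this projection-based approach (the ±1 test matrix and r = O(k/ε) follow the theorem above; everything else is illustrative):

```python
import numpy as np

def randomized_low_rank(A, k, eps=0.5, rng=np.random.default_rng(0)):
    """Project A onto the row space of SA, then take a rank-k SVD there."""
    m, n = A.shape
    r = int(np.ceil(k / eps))                  # sketch size, r = O(k/eps)
    S = rng.choice([-1.0, 1.0], size=(r, m))   # random +/-1 test matrix
    Q, _ = np.linalg.qr((S @ A).T)             # orthonormal basis of rowspace(SA), n x r
    B = A @ Q                                  # coordinates of A in that subspace, m x r
    U, s, Wt = np.linalg.svd(B, full_matrices=False)
    return (U[:, :k] * s[:k]) @ (Wt[:k] @ Q.T) # rank-k approximation of A

A = np.random.randn(300, 200)
Ak = randomized_low_rank(A, k=10)
print(np.linalg.norm(A - Ak, 'fro'))
```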

Page 30

SVD proof sketch

Can show:

||A - A_k'||_F^2 ≤ ||A - A_k||_F^2 + ||A_k - A_k·(SA_k)^+·(SA)||_F^2

For each column j of A, consider the regression A^(j) ≈ A_k·x:

• Exact solution: A_k^(j), with error ||A^(j) - A_k^(j)||_2.

• Approximate solution: A_k·(SA_k)^+·S·A^(j).

Page 31

Conclusion

Fast methods based on randomized preprocessing and sampling or sparse projections for:

• Least squares approximation
• Regularized regression
• Low rank matrix approximation

Page 32

Thank you!