Acceleration of Randomized Kaczmarz Methoddeanna/Banfftalk.pdf · Deanna Needell [Joint work with Y. Eldar] Stanford University BIRS Banﬀ, March 2011 D. Needell Acceleration of

Introduction Kaczmarz Randomized Kaczmarz Modified RK Evidence

Acceleration of Randomized Kaczmarz Method

Deanna Needell [Joint work with Y. Eldar]

Stanford University

BIRS Banff, March 2011

D. Needell Acceleration of Randomized Kaczmarz Method


Problem Background

Setup

Setup

Let Ax = b be an overdetermined consistent system of equations



Problem Background

Setup

Setup




Problem Background

Setup

Setup




Problem Background

Setup

Setup


Goal

From A and b we wish to recover unknown x . Assume m ≫ n.



Kaczmarz Method

Method

Kaczmarz

The Kaczmarz method is an iterative method used to solveAx = b.

Due to its speed and simplicity, it’s used in a variety ofapplications.



Kaczmarz Method

Method

Kaczmarz





Kaczmarz Method

Method

Kaczmarz





Kaczmarz Method

Method

Kaczmarz

1 Start with initial guess x02 xk+1 = xk +

b[i ]−〈ai ,xk〉‖ai‖

22

ai where i = (k mod m) + 1

3 Repeat (2)



Kaczmarz Method

Method

Kaczmarz



22


3 Repeat (2)



Kaczmarz Method

Method

Kaczmarz



22


3 Repeat (2)



Kaczmarz Method

Method

Kaczmarz



22


3 Repeat (2)



Kaczmarz Method

Geometrically

Denote Hi = {w : 〈ai ,w〉 = b[i ]}.



Kaczmarz Method

Geometrically




Kaczmarz Method

Geometrically




Kaczmarz Method

Geometrically




Kaczmarz Method

But what if...




Kaczmarz Method

But what if...




Kaczmarz Method

But what if...




Kaczmarz Method

But what if...




Kaczmarz Method

But what if...




Kaczmarz Method

But what if...




Kaczmarz Method

But what if...




Kaczmarz Method

But what if...




Kaczmarz Method

But what if...




Randomized Version

Randomized Kaczmarz

Kaczmarz



22

ai where i is chosen randomly

3 Repeat (2)



Randomized Version

Randomized Kaczmarz

Kaczmarz



22

ai where i is chosen randomly

3 Repeat (2)



Randomized Version

Randomized Kaczmarz

Strohmer-Vershynin

1 Start with initial guess x0

2 xk+1 = xk +bp−〈ap,xk〉

‖ap‖22ap where P(p = i) =

‖ai‖22

‖A‖2F

3 Repeat (2)



Randomized Version

Randomized Kaczmarz

Strohmer-Vershynin

1 Start with initial guess x0

2 xk+1 = xk +bp−〈ap,xk〉

‖ap‖22ap where P(p = i) =

‖ai‖22

‖A‖2F

3 Repeat (2)



Randomized Version

Randomized Kaczmarz (RK)

Strohmer-Vershynin

Let R = ‖A−1‖2‖A‖2F (‖A−1‖def

= inf{M : M‖Ax‖2 ≥ ‖x‖2 for all x})

Then E‖xk − x‖22 ≤(

1− 1R

)k‖x0 − x‖22

Well conditioned A → Convergence in O(n) iterations → O(n2) total runtime.

Better than O(mn2) runtime for Gaussian elimination and empirically oftenfaster than Conjugate Gradient.



Randomized Version


Strohmer-Vershynin

Let R = ‖A−1‖2‖A‖2F (‖A−1‖def



1− 1R

)k‖x0 − x‖22





Randomized Version


Strohmer-Vershynin

Let R = ‖A−1‖2‖A‖2F (‖A−1‖def



1− 1R

)k‖x0 − x‖22





Randomized Version


Strohmer-Vershynin

Let R = ‖A−1‖2‖A‖2F (‖A−1‖def



1− 1R

)k‖x0 − x‖22





Randomized Version

Randomized Kaczmarz (RK) with noise

System with noise

We now consider the consistent system Ax = b corrupted by noiseto form the possibly inconsistent system Ax ≈ b + z .



Randomized Version


Theorem [N]

Let Ax = b be corrupted with noise: Ax ≈ b + z . Then

E‖xk − x‖2 ≤(

1− 1

R

)k/2‖x0 − x‖2 +

√Rγ,

where γ = maxi|z [i ]|‖ai‖2

.

This bound is sharp and attained in simple examples.



Randomized Version


Theorem [N]

Let Ax = b be corrupted with noise: Ax ≈ b + z . Then

E‖xk − x‖2 ≤(

1− 1

R

)k/2‖x0 − x‖2 +

√Rγ,

where γ = maxi|z [i ]|‖ai‖2

.

This bound is sharp and attained in simple examples.



Randomized Version


0 20 40 60 80 1000.01

0.015

0.02

0.025

0.03

0.035

0.04

0.045

0.05

0.055

Trials

Err

or

Error in estimation: Gaussian 2000 by 100 after 800 iterations.

ErrorThreshold

0 100 200 300 400 500 600 700 8000

0.2

0.4

0.6

0.8

1

Iterations

Err

or

Error in estimation: Gaussian 2000 by 100

0 20 40 60 80 1000

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

Trials

Err

or

Error in estimation: Partial Fourier 700 by 101 after 1000 iterations.

ErrorThreshold

0 20 40 60 80 100

0.02

0.03

0.04

0.05

0.06

0.07

Trials

Err

or

Error in estimation: Bernoulli 2000 by 100 after 750 iterations.

ErrorThreshold

Figure: Comparison between actual error (blue) and predicted threshold(pink). Scatter plot shows exponential convergence over several trials.



Modified RK

Even better convergence? : Noiseless case revisited

Recall xk+1 = xk +b[i ]−〈ai ,xk〉

‖ai‖22

ai

Since these projections are orthogonal, the optimal projectionis one that maximizes ‖xk+1 − xk‖2.Therefore we choose i maximizing |b[i ]−〈ai ,xk〉

‖ai‖2|.

Too costly → Project onto low dimensional subspace.

Use the low dimensional representations to predict the optimalprojection.



Modified RK



‖ai‖22

ai


‖ai‖2|.





Modified RK



‖ai‖22

ai


‖ai‖2|.





Modified RK



‖ai‖22

ai


‖ai‖2|.





Modified RK



‖ai‖22

ai


‖ai‖2|.





Modified RK



‖ai‖22

ai


‖ai‖2|.





Modified RK

JL Dimension Reduction

Johnson-Lindenstrauss Lemma

Let δ > 0 and let S be a finite set of points in Rn. Then for any d

satisfying

d ≥ Clog |S |δ2

, (1)

there exists a Lipschitz mapping Φ : Rn → Rd such that

(1− δ)‖si − sj‖22 ≤ ‖Φ(si )− Φ(sj)‖22 ≤ (1 + δ)‖si − sj‖22, (2)

for all si , sj ∈ S .



Modified RK


Moreover

In the proof of the JL Lemma the map Φ is chosen as theprojection onto a random d -dimensional subspace of Rn. Nowmany known distributions will yield such a projection.

Recently, transforms with fast multiplies have also been shownto satisfy the JL Lemma [Ailon-Chazelle, Hinrichs-Vybiral,Ailon-Liberty, Krahmer-Ward, ...]

Perform Reduction

Choose such a d × n projector Φ and during preprocessing setαi = Φai .



Modified RK


Moreover



Perform Reduction




Modified RK


Moreover



Perform Reduction




Modified RK

RK via Johnson-Lindenstrauss (RKJL) [N-Eldar]

Select: Select n rows so that each row ai is chosen withprobability ‖ai‖22/‖A‖2F . For each set

γi =|b[i ]− 〈αi ,Φxk〉|

‖αi‖2,

and set j = argmaxi γi .Test: For aj and the first row al selected set

γ∗j =

|b[j ] − 〈aj , xk 〉|

‖aj‖2

and γ∗l =

|b[l ] − 〈al , xk 〉|

‖al‖2

.

If γ∗l

> γ∗j, set j = l .

Project: Set

xk+1 = xk +b[j ] − 〈aj , xk 〉

‖aj‖22

aj .

Update: Set k = k + 1 and repeat.



Modified RK


Select: Select n rows so that each row ai is chosen with probability ‖ai‖22/‖A‖

2F . For each set

γi =|b[i ] − 〈αi , Φxk 〉|

‖αi‖2

,

and set j = argmaxi γi .

Test: For aj and the first row al selected set

γ∗j =|b[j]− 〈aj , xk〉|

‖aj‖2and γ∗l =

|b[l ]− 〈al , xk〉|‖al‖2

.

If γ∗l > γ∗j , set j = l .Project: Set

xk+1 = xk +b[j ] − 〈aj , xk 〉

‖aj‖22

aj .




Modified RK



2F . For each set

γi =|b[i ] − 〈αi , Φxk 〉|

‖αi‖2

,



γ∗j =

|b[j ] − 〈aj , xk 〉|

‖aj‖2

and γ∗l =

|b[l ] − 〈al , xk 〉|

‖al‖2

.

If γ∗l > γ∗

j , set j = l .

Project: Set

xk+1 = xk +b[j]− 〈aj , xk〉

‖aj‖22aj .




Modified RK



2F . For each set

γi =|b[i ] − 〈αi , Φxk 〉|

‖αi‖2

,



γ∗j =

|b[j ] − 〈aj , xk 〉|

‖aj‖2

and γ∗l =

|b[l ] − 〈al , xk 〉|

‖al‖2

.

If γ∗l > γ∗

j , set j = l .

Project: Set

xk+1 = xk +b[j ] − 〈aj , xk 〉

‖aj‖22

aj .




Modified RK

Runtime

Select: Calculate Φxk : In general O(nd)Calculate γi for each i (of n): O(nd)

Test: Calculate γ∗j and γ∗l : O(n)

Project: Calculate xk+1: O(n)

Overall Runtime

Since each iteration takes O(nd), we have convergence in O(n2d).



Modified RK

Choosing parameter d

Lemma: Choice of d

Let Φ be the n× d (Gaussian) matrix with d = Cδ−2 log(n) as inthe RKJL method. Set γi = 〈Φai ,Φxk〉 also as in the method.Then |γi − 〈ai , xk〉| ≤ 2δ for all i and k in the first O(n) iterationsof RKJL.

Low Risk

This shows worst case expected convergence in at mostO(n2 log n) time, and of course in most cases one expects farfaster convergence.



Modified RK


Lemma: Choice of d


Low Risk




Modified RK


Lemma: Choice of d


Low Risk




Modified RK


Lemma: Choice of d


Low Risk




Justification

Analytical Justification

Theorem [Assuming row normalization]

Fix an estimation xk and denote by xk+1 and x∗k+1 the next estimationsusing the RKJL and the standard RK method, respectively. Setγ∗

j = |〈aj , xk〉|2 and reorder these so that γ∗

1 ≥ γ∗

2 ≥ . . . ≥ γ∗

m. Then

when d = Cδ−2 log n,

E‖xk+1−x‖22 ≤ min

E‖x∗k+1 − x‖22 −m∑

j=1

(

pj −1

m

)

γ∗

j + 2δ, E‖x∗k+1 − x‖22

where

pj =

{

(m−j

n−1)(mn)

, j ≤ m − n + 1

0, j > m − n + 1

are non-negative values satisfying∑m

j=1 pj = 1 andp1 ≥ p2 ≥ . . . ≥ pm = 0.



Justification





1 ≥ γ∗

2 ≥ . . . ≥ γ∗

m. Then



E‖x∗k+1 − x‖22 −m∑

j=1

(

pj −1

m

)

γ∗

j + 2δ, E‖x∗k+1 − x‖22

where

pj =

{

(m−j

n−1)(mn)

, j ≤ m − n + 1

0, j > m − n + 1


j=1 pj = 1 andp1 ≥ p2 ≥ . . . ≥ pm = 0.



Justification





1 ≥ γ∗

2 ≥ . . . ≥ γ∗

m. Then



E‖x∗k+1 − x‖22 −m∑

j=1

(

pj −1

m

)

γ∗

j + 2δ, E‖x∗k+1 − x‖22

where

pj =

{

(m−j

n−1)(mn)

, j ≤ m − n + 1

0, j > m − n + 1


j=1 pj = 1 andp1 ≥ p2 ≥ . . . ≥ pm = 0.



Justification





1 ≥ γ∗

2 ≥ . . . ≥ γ∗

m. Then



E‖x∗k+1 − x‖22 −m∑

j=1

(

pj −1

m

)

γ∗

j + 2δ, E‖x∗k+1 − x‖22

where

pj =

{

(m−j

n−1)(mn)

, j ≤ m − n + 1

0, j > m − n + 1


j=1 pj = 1 andp1 ≥ p2 ≥ . . . ≥ pm = 0.



Justification


Corollary

Fix an estimation xk and denote by xk+1 and x∗k+1 the nextestimations using the RKJL and the standard method, respectively.Set γ∗j = |〈aj , xk〉|2 and reorder these so that γ∗1 ≥ γ∗2 ≥ . . . ≥ γ∗m.Then when exact geometry is preserved (δ → 0),

E‖xk+1 − x‖22 ≤ E‖x∗k+1 − x‖22 −m∑

j=1

(

pj −1

m

)

γ∗j .



Justification


Corollary


E‖xk+1 − x‖22 ≤ E‖x∗k+1 − x‖22 −m∑

j=1

(

pj −1

m

)

γ∗j .



Justification


Corollary


E‖xk+1 − x‖22 ≤ E‖x∗k+1 − x‖22 −m∑

j=1

(

pj −1

m

)

γ∗j .



Justification


Corollary


E‖xk+1 − x‖22 ≤ E‖x∗k+1 − x‖22 −m∑

j=1

(

pj −1

m

)

γ∗j .



Justification

Empirical Evidence

500 1000 1500 2000 2500 30000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

RKRKJL

Figure: ℓ2-Error (y-axis) as a function of the iterations (x-axis). Thedashed line is standard Randomized Kaczmarz, and the solid line is themodified one, without a Johnson-Lindenstrauss projection. Instead, thebest move out of the randomly chosen n rows is used. Note that wecannot afford to do this computationally.



Justification

Empirical Evidence

500 1000 1500 2000 2500 30000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

RKRKJL, d=1000RKJL, d=500RKJL, d=100RKJL, d=10

Figure: ℓ2-Error (y-axis) as a function of the iterations (x-axis) forvarious values of d with m = 60000 and n = 1000.



Thank you

For more information

E-mail:

[email protected]

Web: http://www-stat.stanford.edu/~dneedell

References:

Eldar, Needell, “Acceleration of Randomized Kaczmarz Method via theJohnson-Lindenstrauss Lemma”, Num. Algorithms, to appear.

Needell, “Randomized Kaczmarz solver for noisy linear systems”, BITNum. Math., 50(2) 395–403.

Strohmer, Vershynin, “ A randomized Kaczmarz algorithm withexponential convergence”, J. Four. Ana. and App. 15 262–278.


[email protected]

http://www-stat.stanford.edu/~dneedell

Documents

Acceleration of Randomized Kaczmarz Methoddeanna/Banfftalk.pdf · Deanna Needell [Joint work with Y. Eldar] Stanford University BIRS Banﬀ, March 2011 D. Needell Acceleration of