
Stochastic Newton and quasi-Newton Methods for Large-Scale Convex and Non-Convex Optimization

Donald Goldfarb

Department of Industrial Engineering and Operations Research, Columbia University

Joint with Robert Gower, Peter Richtárik, Shiqian Ma, Xiao Wang and Wei Liu

U.S.-Mexico Workshop on Optimization and its Applications, 2016

Merida, January 4, 2016


Outline

Newton-like and quasi-Newton methods for convex stochastic optimization problems using limited memory block BFGS updates.

Quasi-Newton methods for nonconvex stochastic optimization problems using damped limited memory BFGS updates.

In both cases the objective functions can be expressed as the sum of a huge number of functions of an extremely large number of variables.

We present numerical results on problems from machine learning.


Related work on L-BFGS for Stochastic Optimization

P1 N.N. Schraudolph, J. Yu and S. Günter. A stochastic quasi-Newton method for online convex optim. Int'l. Conf. AI & Stat., 2007.

P2 A. Bordes, L. Bottou and P. Gallinari. SGD-QN: Careful quasi-Newton stochastic gradient descent. JMLR vol. 10, 2009.

P3 R.H. Byrd, S.L. Hansen, J. Nocedal and Y. Singer. A stochastic quasi-Newton method for large-scale optim. arXiv:1401.7020v2, 2014.

P4 A. Mokhtari and A. Ribeiro. RES: Regularized stochastic BFGS algorithm. IEEE Trans. Signal Process., no. 10, 2014.

P5 A. Mokhtari and A. Ribeiro. Global convergence of online limited memory BFGS. To appear in J. Mach. Learn. Res., 2015.

P6 P. Moritz, R. Nishihara and M.I. Jordan. A linearly-convergent stochastic L-BFGS algorithm, 2015. arXiv:1508.02087v1.

P7 X. Wang, S. Ma, D. Goldfarb and W. Liu. Stochastic quasi-Newton methods for nonconvex stochastic optim., 2015, submitted.

(The first six papers are for strongly convex problems; the last one is for nonconvex problems.)


Stochastic optimization

$\min f(x) = \mathbb{E}[f(x,\xi)]$, where $\xi$ is a random variable

Or finite sum (with $f_i(x) \equiv f(x,\xi_i)$ for $i = 1,\dots,n$ and very large $n$):

$$\min f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$$

$f$ and $\nabla f$ are very expensive to evaluate; e.g., SGD methods choose a random subset $S \subset [n]$ and evaluate

$$f_S(x) = \frac{1}{|S|}\sum_{i\in S} f_i(x) \quad\text{and}\quad \nabla f_S(x) = \frac{1}{|S|}\sum_{i\in S} \nabla f_i(x)$$

(a small numpy sketch of this appears below).

Essentially, only noisy info about $f$, $\nabla f$ and $\nabla^2 f$ is available

Challenge: how to design a method that takes advantage of noisy 2nd-order information?
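For concreteness, a minimal numpy sketch of the subsampled gradient estimate $\nabla f_S$ above; the per-example loss (logistic) and the data are hypothetical stand-ins, not from the talk:

```python
import numpy as np

# Hypothetical finite sum: f_i(x) = logistic loss on a synthetic example (a_i, b_i).
rng = np.random.default_rng(0)
n, d = 1000, 50
A = rng.standard_normal((n, d))
b = rng.choice([-1.0, 1.0], size=n)

def grad_fi(x, i):
    """Gradient of f_i(x) = log(1 + exp(-b_i <a_i, x>))."""
    z = -b[i] * (A[i] @ x)
    return (-b[i] * A[i]) / (1.0 + np.exp(-z))

def subsampled_grad(x, S):
    """Noisy estimate: grad f_S(x) = (1/|S|) * sum_{i in S} grad f_i(x)."""
    return sum(grad_fi(x, i) for i in S) / len(S)

x = np.zeros(d)
S = rng.choice(n, size=32, replace=False)   # random subset S of [n]
g_S = subsampled_grad(x, S)                 # used in place of the full gradient
```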


Part 1: Using 2nd-order information

Assumption: $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$ is strongly convex and twice continuously differentiable.

Choose (compute) a sketching matrix $S_k$ (the columns of $S_k$ are a set of directions).

Following Byrd, Hansen, Nocedal and Singer, we do not use differences in noisy gradients to estimate curvature, but rather compute the action of the sub-sampled Hessian on $S_k$, i.e., compute

$$Y_k = \frac{1}{|T|}\sum_{i\in T} \nabla^2 f_i(x)\, S_k, \quad\text{where } T \subset [n].$$

We choose $T = S$.
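A small sketch of forming the curvature block $Y_k$ above; the per-example Hessians here are hypothetical diagonal stand-ins (in practice one would use subsampled Hessian-vector products against each column of $S_k$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, q = 500, 40, 5                        # q = number of sketch directions (columns of S_k)
# Stand-in per-example Hessians: constant positive diagonals, purely illustrative.
hess_diag = rng.uniform(0.5, 2.0, size=(n, d))

def hessian_times_block(x, i, V):
    """Return (nabla^2 f_i(x)) @ V for one example."""
    return hess_diag[i][:, None] * V

def sketched_curvature(x, T, S_k):
    """Y_k = (1/|T|) * sum_{i in T} (nabla^2 f_i(x)) S_k."""
    return sum(hessian_times_block(x, i, S_k) for i in T) / len(T)

x = np.zeros(d)
S_k = rng.standard_normal((d, q))           # e.g., the Gaussian sketching choice
T = rng.choice(n, size=64, replace=False)   # in the method we take T = S
Y_k = sketched_curvature(x, T, S_k)         # d-by-q curvature block
```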


block BFGS

Given $H_k = B_k^{-1}$, the block BFGS method computes a "least change" update to the current approximation $H_k$ to the inverse of the Hessian matrix $\nabla^2 f(x)$ at the current point $x$, by solving

$$\min \|H - H_k\| \quad \text{s.t.} \quad H = H^\top,\; H Y_k = S_k.$$

This gives the updating formula (analogous to the updates derived by Broyden, Fletcher, Goldfarb and Shanno):

$$H_{k+1} = \left(I - S_k[S_k^\top Y_k]^{-1}Y_k^\top\right) H_k \left(I - Y_k[S_k^\top Y_k]^{-1}S_k^\top\right) + S_k[S_k^\top Y_k]^{-1}S_k^\top$$

or, by the Sherman-Morrison-Woodbury formula:

$$B_{k+1} = B_k - B_k S_k[S_k^\top B_k S_k]^{-1}S_k^\top B_k + Y_k[S_k^\top Y_k]^{-1}Y_k^\top$$
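A minimal sketch of the inverse update above, plus a check that the result satisfies the block secant condition $H_{k+1} Y_k = S_k$; the SPD matrix standing in for the subsampled Hessian is synthetic:

```python
import numpy as np

def block_bfgs_update(H, S, Y):
    """Inverse update: H+ = (I - S L Y^T) H (I - Y L S^T) + S L S^T, with L = (S^T Y)^{-1}."""
    d = H.shape[0]
    Lam = np.linalg.inv(S.T @ Y)
    W = np.eye(d) - S @ Lam @ Y.T
    return W @ H @ (np.eye(d) - Y @ Lam @ S.T) + S @ Lam @ S.T

# Quick check on a fixed SPD "Hessian": the update satisfies H+ Y = S.
rng = np.random.default_rng(2)
d, q = 30, 4
M = rng.standard_normal((d, d))
B_true = M @ M.T + np.eye(d)
S = rng.standard_normal((d, q))
Y = B_true @ S
H_next = block_bfgs_update(np.eye(d), S, Y)
assert np.allclose(H_next @ Y, S)
```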


Limited Memory Block BFGS

After $M$ block BFGS steps starting from $H_{k+1-M}$, one can express $H_{k+1}$ as

$$\begin{aligned}
H_{k+1} &= V_k H_k V_k^\top + S_k \Lambda_k S_k^\top \\
&= V_k V_{k-1} H_{k-1} V_{k-1}^\top V_k^\top + V_k S_{k-1}\Lambda_{k-1} S_{k-1}^\top V_k^\top + S_k \Lambda_k S_k^\top \\
&\;\;\vdots \\
&= V_{k:k+1-M}\, H_{k+1-M}\, V_{k:k+1-M}^\top + \sum_{i=k+1-M}^{k} V_{k:i+1} S_i \Lambda_i S_i^\top V_{k:i+1}^\top,
\end{aligned}$$

where
$$V_k = I - S_k \Lambda_k Y_k^\top \tag{1}$$
and $\Lambda_k = (S_k^\top Y_k)^{-1}$ and $V_{k:i} = V_k \cdots V_i$.


Limited Memory Block BFGS

Hence, when the number of variables $d$ is large, instead of storing the $d \times d$ matrix $H_k$, we store the previous $M$ block curvature pairs

$$(S_{k+1-M}, Y_{k+1-M}),\; \ldots,\; (S_k, Y_k).$$

Then, analogously to the standard L-BFGS method, for any vector $v \in \mathbb{R}^d$, $H_k v$ can be computed efficiently using a two-loop block recursion (in $O(Mp(d+p) + p^3)$ operations), if all $S_i \in \mathbb{R}^{d\times p}$.
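A direct (recursive, unoptimized) sketch of applying the limited-memory $H_k$ to a vector from the stored block pairs, following the compact recursion on the previous slide; the curvature blocks below are synthetic:

```python
import numpy as np

def apply_Hv(v, pairs, h0=1.0):
    """Compute H v from stored block pairs [(S_1, Y_1), ..., (S_M, Y_M)] (newest last),
    using H+ v = V (H_prev (V^T v)) + S L (S^T v), V = I - S L Y^T, L = (S^T Y)^{-1},
    and H_0 = h0 * I at the bottom of the recursion."""
    if not pairs:
        return h0 * v
    S, Y = pairs[-1]
    Lam = np.linalg.inv(S.T @ Y)
    q = v - Y @ (Lam.T @ (S.T @ v))       # V^T v
    r = apply_Hv(q, pairs[:-1], h0)       # H_prev (V^T v)
    r = r - S @ (Lam @ (Y.T @ r))         # V r
    return r + S @ (Lam @ (S.T @ v))      # + S L S^T v

# Hypothetical usage: M stored d-by-p blocks; a production version would cache each L_i
# and organize the same arithmetic as the two-loop block recursion.
rng = np.random.default_rng(3)
d, p, M = 100, 3, 5
B_true = np.diag(rng.uniform(1.0, 2.0, d))   # stand-in Hessian
pairs = []
for _ in range(M):
    S = rng.standard_normal((d, p))
    pairs.append((S, B_true @ S))            # Y = (stand-in Hessian) @ S
v = rng.standard_normal(d)
Hv = apply_Hv(v, pairs)
```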

Intuition

Limited memory - least change aspect of BFGS is important

Each block update acts like a sketching procedure.


Choices for the Sketching Matrix $S_k$

We employ one of the following strategies

Gaussian: $S_k \sim N(0, I)$ has Gaussian entries sampled i.i.d. at each iteration.

Previous search directions $s_i$, delayed: store the previous $L$ search directions $S_k = [s_{k+1-L}, \ldots, s_k]$, then update $H_k$ only once every $L$ iterations.

Self-conditioning: sample the columns of the Cholesky factor $L_k$ of $H_k$ (i.e., $L_k L_k^\top = H_k$) uniformly at random. Fortunately, we can maintain and update $L_k$ efficiently with limited memory.

The matrix $S$ is a sketching matrix, in the sense that we are sketching the (possibly very large) equation $\nabla^2 f(x)\, H = I$, whose solution is the inverse Hessian. Left-multiplying by $S^\top$ compresses/sketches the equation, yielding $S^\top \nabla^2 f(x)\, H = S^\top$.
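A sketch of the three choices above (the self-conditioning option is shown naively, by refactorizing $H_k$ rather than maintaining $L_k$ in limited memory as the slide suggests):

```python
import numpy as np

rng = np.random.default_rng(4)

def sketch_gaussian(d, q):
    """Fresh i.i.d. Gaussian sketch directions at each iteration."""
    return rng.standard_normal((d, q))

def sketch_prev_directions(search_dirs):
    """Stack the previous L search directions s_{k+1-L}, ..., s_k as columns."""
    return np.column_stack(search_dirs)

def sketch_cholesky_columns(H, q):
    """Sample q columns of L_k, where L_k L_k^T = H_k (naive: refactorize each time)."""
    L = np.linalg.cholesky(H)
    cols = rng.choice(H.shape[0], size=q, replace=False)
    return L[:, cols]
```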


Stochastic Variance Reduced Gradients

Stochastic methods converge slowly near the optimum due to the variance of the gradient estimates $\nabla f_S(x)$; hence they require a decreasing step size.

We use the control variates approach of Johnson and Zhang (2013) from their SGD method SVRG.

It uses $\nabla f_S(x_t) - \nabla f_S(w_k) + \nabla f(w_k)$, where $w_k$ is a reference point, in place of $\nabla f_S(x_t)$ (see the sketch below).

$w_k$, and the full gradient, are computed after each full pass over the data, hence doubling the work of computing stochastic gradients.

Other SGD variance reduction techniques have recently been proposed, including SAG, SAGA, SDCA and S2GD.
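A sketch of the variance-reduced gradient above; the per-example gradient callback is assumed (e.g., the one from the earlier sketch):

```python
def svrg_gradient(grad_fi, x_t, w_k, mu, S):
    """Control-variate estimate grad f_S(x_t) - grad f_S(w_k) + mu, where
    mu = grad f(w_k) is the full gradient at the reference point w_k.
    grad_fi(x, i) is any per-example gradient; the estimate remains unbiased."""
    g_x = sum(grad_fi(x_t, i) for i in S) / len(S)
    g_w = sum(grad_fi(w_k, i) for i in S) / len(S)
    return g_x - g_w + mu
```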


The Basic Algorithm


Algorithm 0.1: Stochastic Variable Metric Learning with SVRG

Input: $H_{-1} \in \mathbb{R}^{d\times d}$, $w_0 \in \mathbb{R}^d$, $\eta \in \mathbb{R}_+$, $s$ = subsample size, $q$ = sample action size, and $m$ (inner loop length)

for k = 0, ..., max_iter do
    μ = ∇f(w_k)
    x_0 = w_k
    for t = 0, ..., m − 1 do
        Sample S_t, T_t ⊆ [n] i.i.d. from a distribution S
        Compute the sketching matrix S_t ∈ R^{d×q}
        Compute ∇²f_T(x_t) S_t
        H_t = update_metric(H_{t−1}, S_t, ∇²f_T(x_t) S_t)
        d_t = −H_t (∇f_S(x_t) − ∇f_S(w_k) + μ)
        x_{t+1} = x_t + η d_t
    end
    Option I: w_{k+1} = x_m
    Option II: w_{k+1} = x_i, with i selected uniformly at random from [m]
end
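A compressed, purely illustrative Python transcription of this loop (Option I, Gaussian sketch, a tiny synthetic least-squares objective, and arbitrarily chosen constants); it mirrors the structure of Algorithm 0.1 rather than reproducing the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, q = 200, 20, 3                                          # toy sizes; q = sketch columns
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
grad_fi = lambda x, i: (A[i] @ x - b[i]) * A[i]               # least-squares f_i
hess_fi_block = lambda x, i, V: np.outer(A[i], A[i] @ V)      # (nabla^2 f_i) @ V

def update_metric(H, S, Y):
    """Block BFGS inverse update (as on the earlier slide)."""
    Lam = np.linalg.inv(S.T @ Y)
    W = np.eye(H.shape[0]) - S @ Lam @ Y.T
    return W @ H @ (np.eye(H.shape[0]) - Y @ Lam @ S.T) + S @ Lam @ S.T

eta, m, batch, max_iter = 0.1, 50, 32, 10
w = np.zeros(d)
H = np.eye(d)
for k in range(max_iter):
    mu = sum(grad_fi(w, i) for i in range(n)) / n             # full gradient at w_k
    x = w.copy()
    for t in range(m):
        idx = rng.choice(n, batch, replace=False)             # sample S_t (here T_t = S_t)
        S_t = rng.standard_normal((d, q))                     # Gaussian sketching matrix
        Y_t = sum(hess_fi_block(x, i, S_t) for i in idx) / batch
        H = update_metric(H, S_t, Y_t)
        g = sum(grad_fi(x, i) - grad_fi(w, i) for i in idx) / batch + mu
        x = x - eta * (H @ g)
    w = x                                                     # Option I
```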


Convergence - Assumptions

There exist constants $\lambda, \Lambda \in \mathbb{R}_+$ such that

$f$ is $\lambda$-strongly convex:
$$f(w) \ge f(x) + \nabla f(x)^\top (w - x) + \frac{\lambda}{2}\|w - x\|_2^2, \tag{2}$$

$f$ is $\Lambda$-smooth:
$$f(w) \le f(x) + \nabla f(x)^\top (w - x) + \frac{\Lambda}{2}\|w - x\|_2^2. \tag{3}$$

These assumptions imply that
$$\lambda I \preceq \nabla^2 f_S(w) \preceq \Lambda I, \quad \text{for all } w \in \mathbb{R}^d,\; S \subseteq [n], \tag{4}$$

from which we can prove that there exist constants $\gamma, \Gamma \in \mathbb{R}_+$ such that for all $k$ we have
$$\gamma I \preceq H_k \preceq \Gamma I. \tag{5}$$


Linear Convergence

Theorem

Suppose that the Assumptions hold. Let $w_*$ be the unique minimizer of $f(w)$. Then in our Algorithm, we have for all $k \ge 0$ that
$$\mathbb{E} f(w_k) - f(w_*) \le \rho^k \left(\mathbb{E} f(w_0) - f(w_*)\right),$$
where the convergence rate is given by
$$\rho = \frac{1/(2m\eta) + \eta\Gamma^2\Lambda(\Lambda-\lambda)}{\gamma\lambda - \eta\Gamma^2\Lambda^2} < 1,$$
assuming we have chosen $\eta < \gamma\lambda/(2\Gamma^2\Lambda^2)$ and that we choose $m$ large enough to satisfy
$$m \ge \frac{1}{2\eta\left(\gamma\lambda - \eta\Gamma^2\Lambda(2\Lambda - \lambda)\right)},$$
which is a positive lower bound given our restriction on $\eta$.
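As a quick numerical sanity check of these conditions, with illustrative constants only (not values from the talk):

```python
# Illustrative constants only: lam, Lam bound the subsampled Hessian; gamma, Gamma bound H_k.
lam, Lam, gamma, Gamma = 1.0, 10.0, 0.05, 1.0
eta = 0.5 * gamma * lam / (2 * Gamma**2 * Lam**2)        # satisfies eta < gamma*lam / (2 Gamma^2 Lam^2)
m_min = 1 / (2 * eta * (gamma * lam - eta * Gamma**2 * Lam * (2 * Lam - lam)))
m = int(10 * m_min)                                      # comfortably above the lower bound
rho = (1 / (2 * m * eta) + eta * Gamma**2 * Lam * (Lam - lam)) / (gamma * lam - eta * Gamma**2 * Lam**2)
print(f"eta = {eta:.2e}, m = {m}, rho = {rho:.3f}")      # here rho is about 0.37 < 1, so the bound contracts
```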


gisette-scale d = 5,000, n = 6,000

[Figure: error vs. time (s) and vs. data passes for LML_gauss_18_M_5, LMLpdd_18_M_5, LMLF_18_M_5, MNJ_b_78_bH_330 and SVRG]

Figure: limited-gisette


covtype-libsvm-binary d = 54, n = 581,012

[Figure: error vs. time (s) and vs. data passes for LML_gauss_8_M_5, LMLpdd_8_M_5, LMLF_8_M_5, MNJ_b_763_bH_3815 and SVRG]

Figure: limited-covtype-libsvm-binary


Higgs d = 28, n = 11,000,000

[Figure: error vs. time (s) and vs. data passes for LML_gauss_4_M_5, LMLpdd_4_M_5, LMLF_4_M_5, MNJ_b_3317_bH_16585 and SVRG]

Figure: HIGGS


SUSY d = 18, n = 3,548,466

[Figure: error vs. time (s) and vs. data passes for MLgauss_3_M_5, MLprev_3_M_5, MLfact_3_M_5, MNJ and SVRG]

Figure: SUSY


epsilon-normalized d = 2,000, n = 400,000

[Figure: error vs. time (s) and vs. data passes for LML_gauss_45_M_5, LMLpdd_45_M_5, LMLF_45_M_5, MNJ_b_633_bH_3165 and SVRG]

Figure: epsilon-normalized


rcv1-training d = 47,236, n = 20,242

[Figure: error vs. time (s) and vs. data passes for MLgauss_15_M_5, MLprev_15_M_5, MLFact_15_M_5, MNJ_b_1000_bH_1000 and SVRG]

Figure: rcv1-train


url-combined d = 3,231,961, n = 2,396,130

[Figure: error vs. time (s) and vs. data passes for MLgauss_10_M_5, MLprev_10_M_5, MLFact_10_M_3, MNJ_b_1548_bH_7740 and SVRG]


zero-real-sim-L2 d = 20,958, n = 72,309

[Figure: error vs. time (s) and vs. data passes for MLgauss_13_M_1, MLgauss_13_M_5, MLprev_13_M_1, MLprev_13_M_5, MLFact_13_M_1, MLFact_13_M_5, MNJ_b_269_bH_1000, MNJ_b_1000_bH_1000 and SVRG]


Contributions

New metric learning framework: a block BFGS framework for gradually learning the metric of the underlying function using a sketched form of the subsampled Hessian matrix.

New limited memory block BFGS method, which may also be of interest for non-stochastic optimization.

Several sketching matrix possibilities.


Part 2: Nonconvex stochastic optimization

Most stochastic quasi-Newton optimization methods are for strongly convex problems; this is needed to ensure a curvature condition required for the positive definiteness of $B_k$ ($H_k$).

This is not possible for nonconvex problems.

In the deterministic setting, one can do a line search to guarantee the curvature condition, and hence the positive definiteness of $B_k$ ($H_k$).

Line search is not possible for stochastic optimization

To address these issues we develop a stochastic damped L-BFGS method:


Stochastic quasi-Newton (SQN) for nonconvex problems

$$\min f(x) \equiv \mathbb{E}[F(x,\xi)]$$

Assumptions

[AS1] $f$ is continuously differentiable; $f$ is bounded below; $\nabla f$ is Lipschitz continuous with constant $L$.

[AS2] For any iteration $k$, the stochastic gradient satisfies
$$\mathbb{E}_{\xi_k}[\nabla f(x_k,\xi_k)] = \nabla f(x_k), \qquad \mathbb{E}_{\xi_k}\left[\|\nabla f(x_k,\xi_k) - \nabla f(x_k)\|^2\right] \le \sigma^2.$$

[AS3] There exist positive constants $C_l$, $C_u$ such that
$$C_l I \preceq H_k \preceq C_u I, \quad \text{for any } k.$$

[AS4] $H_k$ depends only on $\xi_{[k-1]}$, i.e., on all the random samples in iterations $1, 2, \ldots, k-1$.


How to generate $H_k$ to satisfy AS3 and AS4?

Let $y_k = \frac{1}{m}\sum_{i=1}^{m}\left(\nabla f(x_{k+1},\xi_{k,i}) - \nabla f(x_k,\xi_{k,i})\right)$ and define
$$\bar{y}_k = \theta_k y_k + (1-\theta_k) B_k s_k,$$
where $\theta_k$ is calculated through
$$\theta_k = \begin{cases} 1, & \text{if } s_k^\top y_k \ge 0.25\, s_k^\top B_k s_k, \\[2pt] \dfrac{0.75\, s_k^\top B_k s_k}{s_k^\top B_k s_k - s_k^\top y_k}, & \text{if } s_k^\top y_k < 0.25\, s_k^\top B_k s_k. \end{cases}$$

Update $H_k$ (replacing $y_k$ by $\bar{y}_k$):
$$H_{k+1} = (I - \rho_k s_k \bar{y}_k^\top)\, H_k\, (I - \rho_k \bar{y}_k s_k^\top) + \rho_k s_k s_k^\top,$$
where $\rho_k = 1/(s_k^\top \bar{y}_k)$.

Implemented in a limited memory version (SdLBFGS).
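A dense (non-limited-memory) sketch of the damping computation above; here $s_k = x_{k+1} - x_k$ and $y_k$ is the mini-batch gradient difference defined above, and the limited-memory SdLBFGS variant organizes the same quantities without forming $B_k$ or $H_k$ explicitly:

```python
import numpy as np

def damped_bfgs_update(H, B, s, y):
    """Damped inverse update: compute theta, y_bar, then
    H+ = (I - rho s y_bar^T) H (I - rho y_bar s^T) + rho s s^T with rho = 1/(s^T y_bar)."""
    sBs = s @ B @ s
    sy = s @ y
    theta = 1.0 if sy >= 0.25 * sBs else (0.75 * sBs) / (sBs - sy)
    y_bar = theta * y + (1.0 - theta) * (B @ s)
    rho = 1.0 / (s @ y_bar)       # well defined: damping gives s^T y_bar >= 0.25 s^T B s > 0
    V = np.eye(len(s)) - rho * np.outer(s, y_bar)
    return V @ H @ V.T + rho * np.outer(s, s)
```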


Numerical Experiments

A nonconvex SVM problem with a sigmoid loss function

$$\min_{x\in\mathbb{R}^n} f(x) := \mathbb{E}_{u,v}\left[1 - \tanh(v\langle x, u\rangle)\right] + \lambda\|x\|_2^2,$$

$u \in \mathbb{R}^n$: feature vector; $v \in \{-1, 1\}$: corresponding label.

$\lambda = 10^{-4}$ in our experiment.
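A sketch of evaluating this nonconvex loss and its mini-batch stochastic gradient (synthetic features and labels stand in for the RCV1 data described below):

```python
import numpy as np

lam_reg = 1e-4
rng = np.random.default_rng(7)

def sampled_loss_and_grad(x, U, v):
    """Mini-batch value and gradient of f(x) = E[1 - tanh(v <x, u>)] + lam ||x||_2^2."""
    z = v * (U @ x)                                   # v_i <x, u_i> for each example in the batch
    loss = np.mean(1.0 - np.tanh(z)) + lam_reg * (x @ x)
    # d/dx [-tanh(z_i)] = -(1 - tanh(z_i)^2) * v_i * u_i
    grad = -np.mean(((1.0 - np.tanh(z) ** 2) * v)[:, None] * U, axis=0) + 2.0 * lam_reg * x
    return loss, grad

# Hypothetical mini-batch of m examples with n_feat-dimensional features and +/-1 labels.
m, n_feat = 100, 300
U = rng.standard_normal((m, n_feat))
v = rng.choice([-1.0, 1.0], size=m)
val, g = sampled_loss_and_grad(np.zeros(n_feat), U, v)
```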

RCV1 dataset: Reuters newswire articles from 1996-1997.

A simplified version: 9625 articles classified into four categories "C15", "ECAT", "GCAT" and "MCAT", each with 2022, 2064, 2901 and 2638 articles, respectively.

Binary classification: predict if an article is in "MCAT" or "ECAT".

Label: 1 if a given article is in "MCAT" or "ECAT", −1 otherwise.

60% of the articles - training data; 40% - testing data.

Problem dimension: 29992 (number of distinct words)


[Figure: squared norm of gradient and percent correctly classified vs. iteration for SGD (β = 10, 20) and SdLBFGS (p = 0, 1, 3, 5, 10, 20)]

Figure: Comparison of SdLBFGS variants with different memory sizes on the RCV1 dataset. The step size of SdLBFGS is $\alpha_k = 10/k$ and the batch size is $m = 100$.


[Figure: squared norm of gradient vs. iteration and vs. SFO-calls for SGD (m = 1, 50, 100) and SdLBFGS (m = 1, 50, 75, 100)]

Figure: Comparison of SGD and SdLBFGS with different batch sizes on the RCV1 dataset. For SdLBFGS the step size is $\alpha_k = 10/k$ and the memory size is $p = 10$. For SGD the step size is $\alpha_k = 20/k$.


[Figure: percent correctly classified vs. iteration for SGD (m = 50, 100) and SdLBFGS (m = 50, 75, 100)]

Figure: Comparison of correct classification percentage by SGD and SdLBFGS with different batch sizes on the RCV1 dataset. For SdLBFGS the step size is $\alpha_k = 10/k$ and the memory size is $p = 10$. For SGD the step size is $\alpha_k = 20/k$.


[Figure: number of damped steps vs. memory size (batch size m = 50) and vs. batch size (memory size p = 10)]

Figure: The average number of damped steps over 10 runs of SdLBFGS. The maximum number of iterations is set to 1000 and the step size is $10/k$.


Contributions

Our contributions:

A general framework of SQN for nonconvex problems
Convergence guarantee
Complexity analysis for random output and constant step size
Stochastic damped L-BFGS falls into the framework

Future work for nonconvex problems:

Develop a damped limited memory block BFGS method
Variance reduction techniques?
