Stochastic Optimization for Large-scale Machine Learning
Wei-Cheng Chang
Language Technologies Institute, Carnegie Mellon University
September 13, 2016
Wei-Cheng Chang (LTI, CMU), Stochastic Optimization for Large-scale Machine Learning, September 13, 2016
Outline

1 Introduction
2 Analysis of Stochastic Gradient Methods
  Preliminaries
  Two Assumptions and Fundamental Lemmas
  SG for Strongly Convex Objectives
  SG for General Objectives
3 Noise Reduction Methods
  Dynamic Sampling
  Gradient Aggregation
4 Second-Order Methods
  Hessian-Free Methods
  Stochastic Quasi-Newton Methods
  Gauss-Newton
  Diagonal Scaling
5 Conclusions
Problem Settings and Notations
Expected Risk: Consider a prediction function h parameterized by w ∈ R^d and some loss function ℓ(·, ·),

R(w) = ∫ ℓ(h(x; w), y) dP(x, y) = E[ℓ(h(x; w), y)].
Empirical Risk: In supervised learning, we only have access to a set of n i.i.d. samples {(x_i, y_i)} ⊆ R^{d_x} × R^{d_y}, and

R_n(w) = (1/n) Σ_{i=1}^n ℓ(h(x_i; w), y_i).
We consider R_n as the objective function from now on. Note that all methods in the subsequent slides apply readily when a smooth regularization term is added.
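As a concrete sketch of R_n (my own illustration, assuming a linear prediction function h(x; w) = w^T x and the squared loss, neither of which is fixed by the slides):

```python
import numpy as np

def empirical_risk(w, X, y):
    """R_n(w) = (1/n) sum_i l(h(x_i; w), y_i) for h(x; w) = w^T x
    and squared loss l(p, y) = 0.5 * (p - y)**2."""
    preds = X @ w                          # h(x_i; w) for every sample
    return 0.5 * np.mean((preds - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                             # noiseless labels
print(empirical_risk(w_true, X, y))        # 0.0 at the generating parameters
```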
Problem Settings and Notations (Cont’d)
A set of samples is represented by a random seed ξ. For example, a realization of ξ may be a single sample (x, y) or a set of samples {(x_i, y_i)}_{i∈S}.
We refer to the loss incurred for a given (w, ξ) as f(w; ξ). That is, f is the composition of the loss function ℓ and the prediction function h.
Then we have

R(w) = E[f(w; ξ)],
R_n(w) = (1/n) Σ_{i=1}^n f_i(w),

where f_i(w) ≡ f(w; ξ[i]) and {ξ[i]}_{i=1}^n is a set of realizations of ξ corresponding to a sample set {(x_i, y_i)}.
Why Stochastic Optimization in Large-scale Machine Learning?
Batch gradient methods can achieve a linear rate of convergence,

R_n(w_k) − R_n^* ≤ O(ρ^k).

To obtain ε-optimality, the total work is proportional to n log(1/ε).
SG achieves a sublinear rate of convergence in expectation,

E[R_n(w_k) − R_n^*] = O(1/k).

To obtain ε-optimality, the total work is O(1/ε), independent of n. When n is large and the computational time budget is limited, SG is favorable. (More on later slides.)
Due to the stochastic nature of SG, it enjoys the same rate of convergence for the generalization error R − R^*.
Rate of Convergence
Consider the optimization problem

f^* = min_x f(x).

A linear rate of convergence means that, for some c ∈ (0, 1),

f(x_k) − f^* ≤ c (f(x_{k−1}) − f^*) ≤ ... ≤ c^k (f(x_0) − f^*) ≤ ε.

To achieve ε-optimality, the number of iterations needed is at most

k ≤ log((f(x_0) − f^*)/ε) / log(1/c) = O(log(1/ε)).
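A quick numerical check of this iteration bound (my own sketch; the contraction factor c = 0.5 is an arbitrary illustrative choice):

```python
import math

def iters_to_eps(gap0, c, eps):
    """Smallest k with c**k * gap0 <= eps, i.e. k >= log(gap0/eps) / log(1/c)."""
    return math.ceil(math.log(gap0 / eps) / math.log(1.0 / c))

# Halving the gap each iteration (c = 0.5) needs about log2(1/eps) iterations.
k = iters_to_eps(1.0, 0.5, 1e-6)
print(k, 0.5 ** k <= 1e-6)   # 20 True
```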
Stochastic Gradient (SG) Method
The objective function F: R^d → R represents either

R(w) = E[f(w; ξ)] or R_n(w) = (1/n) Σ_{i=1}^n f_i(w).
SG updates: w_{k+1} ← w_k − α_k g(w_k, ξ_k).

Our analyses cover the choices

g(w_k, ξ_k) = ∇f(w_k; ξ_k), or
g(w_k, ξ_k) = (1/n_k) Σ_{i=1}^{n_k} ∇f(w_k; ξ_{k,i}), or
g(w_k, ξ_k) = H_k (1/n_k) Σ_{i=1}^{n_k} ∇f(w_k; ξ_{k,i}),

where H_k is PSD and n_k is the mini-batch size.
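The second choice (a mini-batch average of sample gradients) can be sketched as follows; the least-squares f_i and all constants here are my own assumptions for illustration:

```python
import numpy as np

def sg_step(w, X, y, batch_idx, alpha):
    """One SG update w <- w - alpha * g(w, xi), where g averages the
    gradients of f_i(w) = 0.5*(x_i^T w - y_i)**2 over the mini-batch."""
    Xb, yb = X[batch_idx], y[batch_idx]
    g = Xb.T @ (Xb @ w - yb) / len(batch_idx)   # (1/n_k) sum_i grad f_i(w)
    return w - alpha * g

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true                                  # noiseless regression data
w = np.zeros(5)
for _ in range(500):
    idx = rng.integers(0, 200, size=16)         # mini-batch, n_k = 16
    w = sg_step(w, X, y, idx, alpha=0.05)
print(np.linalg.norm(w - w_true))               # small: SG approaches w_true
```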
Some Intuitions of SG
{ξk} is a sequence of independent random variables.
Sample uniformly from training set, SG optimizes the empirical riskF = Rn.
Picking samples according to P, SG optimizes the expected riskF = R.
Our goal is to show the expected angle between g(wk , ξk) and∇F (wk) is sufficiently close.
Smoothness of objective function
Assumption 4.1 (Lipschitz-continuous objective gradients)

F is continuously differentiable and the gradient function of F, namely ∇F, is Lipschitz continuous with Lipschitz constant L > 0, i.e.,

‖∇F(w) − ∇F(w̄)‖_2 ≤ L ‖w − w̄‖_2 for all {w, w̄} ⊂ R^d.

This assumption ensures that the gradient of F does not change arbitrarily quickly with respect to the parameter w. From this assumption we can derive

F(w) ≤ F(w̄) + ∇F(w̄)^T (w − w̄) + (L/2) ‖w − w̄‖_2^2. (1)
Proof of equation (1)
Define g(t) = F(w̄ + t(w − w̄)) and notice that

g'(t) = ∇F(w̄ + t(w − w̄))^T (w − w̄).

Then

F(w) = g(1) = g(0) + ∫_0^1 g'(t) dt
= F(w̄) + ∫_0^1 ∇F(w̄ + t(w − w̄))^T (w − w̄) dt
= F(w̄) + ∇F(w̄)^T (w − w̄) + ∫_0^1 [∇F(w̄ + t(w − w̄)) − ∇F(w̄)]^T (w − w̄) dt
≤ F(w̄) + ∇F(w̄)^T (w − w̄) + ∫_0^1 L ‖t(w − w̄)‖_2 ‖w − w̄‖_2 dt
= F(w̄) + ∇F(w̄)^T (w − w̄) + (L/2) ‖w − w̄‖_2^2.
Bound expected function decrease by gradient information
Lemma 4.2

Under Assumption 4.1, the iterates of SG satisfy

E_{ξ_k}[F(w_{k+1})] − F(w_k) ≤ −α_k ∇F(w_k)^T E_{ξ_k}[g(w_k, ξ_k)] + (1/2) α_k^2 L E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2]. (2)

Proof: By Assumption 4.1, the iterates of SG satisfy

F(w_{k+1}) − F(w_k) ≤ ∇F(w_k)^T (w_{k+1} − w_k) + (L/2) ‖w_{k+1} − w_k‖_2^2
= −α_k ∇F(w_k)^T g(w_k, ξ_k) + (1/2) α_k^2 L ‖g(w_k, ξ_k)‖_2^2.

Taking expectations with respect to the distribution of ξ_k, we obtain the desired result.
Intuitions of Lemma 4.2
The RHS of (2) depends on the first and second moments of g(w_k, ξ_k).
If g(w_k, ξ_k) is an unbiased estimate of ∇F(w_k), then

E_{ξ_k}[F(w_{k+1})] − F(w_k) ≤ −α_k ‖∇F(w_k)‖_2^2 + (1/2) α_k^2 L E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2].

Convergence is guaranteed if the RHS of (2) can be bounded.
We therefore need to restrict the variance of g(w_k, ξ_k), i.e.,

V_{ξ_k}[g(w_k, ξ_k)] := E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2] − ‖E_{ξ_k}[g(w_k, ξ_k)]‖_2^2. (3)
First and Second Moment limits
Assumption 4.3 (First and Second Moment Limits)

The objective function and SG satisfy the following:

1. The sequence of iterates {w_k} is contained in an open set over which F is bounded below by F_inf.
2. There exist scalars μ_G ≥ μ > 0 such that

∇F(w_k)^T E_{ξ_k}[g(w_k, ξ_k)] ≥ μ ‖∇F(w_k)‖_2^2, (4)
‖E_{ξ_k}[g(w_k, ξ_k)]‖_2 ≤ μ_G ‖∇F(w_k)‖_2. (5)

3. There exist scalars M > 0 and M_V > 0 such that

V_{ξ_k}[g(w_k, ξ_k)] ≤ M + M_V ‖∇F(w_k)‖_2^2. (6)
More on Assumption 4.3
The properties in Assumption 4.3-(2) immediately hold if g(w_k, ξ_k) is an unbiased estimate of ∇F(w_k), and are maintained if such an unbiased estimate is multiplied by a PSD matrix H_k.
Combining the definition of variance (3) with Assumption 4.3,

E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2] ≤ M + M_G ‖∇F(w_k)‖_2^2, (7)

where M_G := M_V + μ_G^2 ≥ μ^2 > 0.
Bounding the Expected Function Decrease by the Gradient

Lemma 4.4

Under Assumptions 4.1 and 4.3, the iterates of SG satisfy

E_{ξ_k}[F(w_{k+1})] − F(w_k) ≤ −(μ − (1/2) α_k L M_G) α_k ‖∇F(w_k)‖_2^2 + (1/2) α_k^2 L M.

Proof: By Lemma 4.2 and (4), it follows that

E_{ξ_k}[F(w_{k+1})] − F(w_k) ≤ −μ α_k ‖∇F(w_k)‖_2^2 + (1/2) α_k^2 L E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2].

Together with (7), this yields the desired result.
Strong Convexity
Assumption 4.5 (Strong Convexity)

The objective function F is strongly convex: there exists a constant c > 0 such that

∇^2 F(w) ⪰ cI for all w ∈ R^d.

Strong convexity has several interesting consequences.

A better lower bound on F(w):

F(w) ≥ F(w̄) + ∇F(w̄)^T (w − w̄) + (c/2) ‖w − w̄‖_2^2.

A bound on the optimality gap F(w) − F(w^*):

2c (F(w) − F(w^*)) ≤ ‖∇F(w)‖_2^2 for all w ∈ R^d. (8)
Strong Convexity (Cont’d)
For any w,

F(w) ≥ F(w̄) + ∇F(w̄)^T (w − w̄) + (c/2) ‖w − w̄‖_2^2
≥ F(w̄) + ∇F(w̄)^T (w̃ − w̄) + (c/2) ‖w̃ − w̄‖_2^2
= F(w̄) − (1/2c) ‖∇F(w̄)‖_2^2,

where w̃ = w̄ − (1/c) ∇F(w̄) is the minimizer of the quadratic model. Since this holds for any w ∈ R^d, in particular

F(w^*) ≥ F(w̄) − (1/2c) ‖∇F(w̄)‖_2^2,

which rearranges to (8).
Strongly Convex Objective for Fixed Stepsizes SG
Theorem 4.6 (Strongly Convex Objective, Fixed Stepsize)

Under Assumptions 4.1, 4.3, and 4.5, suppose that the SG method is run with a fixed stepsize α_k = α satisfying

0 < α ≤ μ / (L M_G). (9)

Then the expected optimality gap satisfies

E[F(w_k) − F^*] ≤ αLM/(2cμ) + (1 − αcμ)^{k−1} (F(w_1) − F^* − αLM/(2cμ)) (10)
→ αLM/(2cμ) as k → ∞, (11)

where E[F(w_k)] = E_{ξ_1} E_{ξ_2} ... E_{ξ_{k−1}}[F(w_k)].
Intuition of Theorem 4.6
If there is no noise (M = 0), Theorem 4.6 gives a linear rate of convergence, recovering the standard result for full-gradient descent.
The stepsize α must be small: a smaller α lowers the residual noise floor αLM/(2cμ) in (11), at the price of a slower contraction factor (1 − αcμ) in (10).
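A 1-D sanity check of the noise floor αLM/(2cμ) (my own toy construction: F(w) = (c/2)w^2 with unbiased gradients plus N(0, M) noise, so L = c and μ = μ_G = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
c, M, alpha = 1.0, 1.0, 0.1
w, gaps = 5.0, []
for k in range(20000):
    g = c * w + np.sqrt(M) * rng.normal()   # unbiased gradient plus noise
    w -= alpha * g                          # fixed-stepsize SG
    gaps.append(0.5 * c * w ** 2)           # optimality gap F(w_k) - F*
floor = alpha * c * M / (2 * c * 1.0)       # alpha*L*M/(2*c*mu) with L = c, mu = 1
print(np.mean(gaps[-5000:]) <= floor)       # True: gap settles below the bound
```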
Strongly Convex Objective for Diminishing Stepsizes SG
Theorem 4.7 (Strongly Convex Objective, Diminishing Stepsizes)

Under Assumptions 4.1, 4.3, and 4.5, suppose that the SG method is run with a stepsize sequence such that

α_k = β/(γ + k) for some β > 1/(cμ) and γ > 0 such that α_1 ≤ μ/(L M_G). (12)

Then the expected optimality gap satisfies

E[F(w_k) − F^*] ≤ ν/(γ + k), (13)

where

ν := max{ β^2 L M / (2(βcμ − 1)), (γ + 1)(F(w_1) − F^*) }. (14)
Intuition of Theorem 4.7

With a constant stepsize, α does not depend on c; with diminishing stepsizes, the initial stepsize α_1 depends on c.
With a constant stepsize, the initial point eventually does not matter; with diminishing stepsizes, the initial point enters the bound through ν.
The effect of mini-batching is "subtle".
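A small simulation of the O(1/k) rate (my own toy setup: a 1-D quadratic F(w) = (c/2)w^2 with N(0, M) gradient noise, averaged over 500 independent runs; β and γ satisfy (12) for c = μ = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
c, M = 1.0, 1.0
beta, gamma = 2.0, 10.0                     # beta > 1/(c*mu), gamma > 0
w = np.full(500, 5.0)                       # 500 independent SG runs
mean_gap = {}
for k in range(1, 20001):
    alpha = beta / (gamma + k)              # diminishing stepsizes (12)
    g = c * w + np.sqrt(M) * rng.normal(size=w.size)
    w -= alpha * g
    if k in (2000, 20000):
        mean_gap[k] = float(np.mean(0.5 * c * w ** 2))
# 10x more iterations => roughly 10x smaller expected gap, matching O(1/k)
print(mean_gap[2000] / mean_gap[20000])
```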
Nonconvex Objective for Fixed Stepsizes SG
Theorem 4.8 (Nonconvex Objective, Fixed Stepsize)

Under Assumptions 4.1 and 4.3, suppose that the SG method is run with a fixed stepsize α_k = α satisfying

0 < α ≤ μ / (L M_G). (15)

Then the expected sum-of-squares and average-squared gradients of F satisfy

E[ Σ_{k=1}^K ‖∇F(w_k)‖_2^2 ] ≤ K αLM/μ + 2(F(w_1) − F_inf)/(μα), (16)
E[ (1/K) Σ_{k=1}^K ‖∇F(w_k)‖_2^2 ] ≤ αLM/μ + 2(F(w_1) − F_inf)/(Kμα) (17)
→ αLM/μ as K → ∞, (18)

where E[F(w_k)] = E_{ξ_1} E_{ξ_2} ... E_{ξ_{k−1}}[F(w_k)].
Nonconvex Objective for Diminishing Stepsizes SG

Theorem 4.9 (Nonconvex Objective, Diminishing Stepsizes)

Under Assumptions 4.1 and 4.3, suppose SG is run with a stepsize sequence satisfying Σ_{k=1}^∞ α_k = ∞ and Σ_{k=1}^∞ α_k^2 < ∞. Then, with A_K = Σ_{k=1}^K α_k,

E[ Σ_{k=1}^K α_k ‖∇F(w_k)‖_2^2 ] < ∞, (19)
E[ (1/A_K) Σ_{k=1}^K α_k ‖∇F(w_k)‖_2^2 ] → 0 as K → ∞. (20)
An Ideal Case: Reducing Noise at a Geometric Rate

Recall from Lemma 4.2 that

E_{ξ_k}[F(w_{k+1})] − F(w_k) ≤ −α_k ∇F(w_k)^T E_{ξ_k}[g(w_k, ξ_k)] + (L/2) α_k^2 E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2].

If g(w_k, ξ_k) is a descent direction in expectation and we can decrease E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2] fast enough, then a good convergence rate is still possible.
Linear Convergence Rate for Ideal Noise Reduction
Theorem 5.1 (Strongly Convex Objective, Noise Reduction)

Suppose that Assumptions 4.1, 4.3, and 4.5 hold, and that

V_{ξ_k}[g(w_k, ξ_k)] ≤ M ζ^{k−1} for some ζ ∈ (0, 1). (21)

In addition, suppose SG is run with a fixed stepsize α_k = α such that

0 < α ≤ min{ μ/(L μ_G^2), 1/(cμ) }.

Then the expected optimality gap satisfies

E[F(w_k) − F^*] ≤ ω ρ^{k−1},

where

ω := max{ αLM/(cμ), F(w_1) − F^* } and ρ := max{ 1 − αcμ/2, ζ }.
Dynamic Sampling
Consider the iteration

w_{k+1} ← w_k − (α/n_k) Σ_{i∈S_k} ∇f(w_k; ξ_{k,i})

with n_k := |S_k| = ⌈τ^{k−1}⌉ for some τ > 1. This update falls under the class of methods in Theorem 5.1, because

V_{ξ_k}[g(w_k, ξ_k)] ≤ V_{ξ_k}[∇f(w_k; ξ_{k,i})]/n_k ≤ M/n_k ≤ M (1/τ)^{k−1},

so (21) holds with ζ = 1/τ.
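A sketch of dynamic sampling on a toy 1-D quadratic F(w) = (c/2)w^2 (my own construction; per-sample gradient noise has variance M, so the batch average has variance M/n_k):

```python
import numpy as np

rng = np.random.default_rng(0)
c, M, alpha, tau = 1.0, 1.0, 0.5, 1.5
w, gaps = 5.0, []
for k in range(1, 41):
    nk = int(np.ceil(tau ** (k - 1)))        # |S_k| grows geometrically
    # batch gradient: average of nk noisy per-sample gradients
    g = c * w + np.sqrt(M) * rng.normal(size=nk).mean()
    w -= alpha * g
    gaps.append(0.5 * c * w ** 2)
print(gaps[-1] < 1e-2)                       # True: near-linear convergence
```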
Intuition
Rather than computing ever more new stochastic gradients, is it possible to achieve lower variance by reusing and/or revising previously computed stochastic gradient information?

Stochastic variance reduced gradient (SVRG)
Stochastic average gradient (SAG)
Iterative averaging:

w_{k+1} ← w_k − α_k g(w_k, ξ_k),
w̄_{k+1} ← (1/(k+1)) Σ_{j=1}^{k+1} w_j.
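A sketch of iterative averaging on a toy 1-D quadratic with noisy gradients (my own construction): the raw SG iterate w_k rattles around w^* = 0 with variance O(α), while the running average w̄_k is far more accurate.

```python
import numpy as np

rng = np.random.default_rng(0)
c, M, alpha = 1.0, 1.0, 0.2
w, wbar = 5.0, 5.0
recent_sq = []
for k in range(1, 20001):
    g = c * w + np.sqrt(M) * rng.normal()    # noisy gradient of 0.5*c*w**2
    w -= alpha * g                           # plain SG iterate
    wbar += (w - wbar) / (k + 1)             # running average of iterates
    if k > 19000:
        recent_sq.append(w ** 2)
print(wbar ** 2 < np.mean(recent_sq) / 10)   # True: average is much closer to 0
```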
Preliminaries
To address the adverse effects of high nonlinearity and ill-conditioning of the objective function, at each iteration solve a quadratic model

q_k(s) = F(w_k) + ∇F(w_k)^T s + (1/2) s^T H_k s.

The Hessian matrix H_k ∈ R^{d×d} (i.e., ∇^2 F(w_k)) can be very large, and we need to solve the linear system

H_k s = −∇F(w_k). (22)

The update: w_{k+1} ← w_k + α_k s_k.
Hessian Free Inexact Newton
When d is large (especially in deep learning), it is impossible to explicitly compute H_k.
Apply the conjugate gradient (CG) method to iteratively solve the linear system (22).
Only matrix-vector products Hs are needed, each about as expensive as computing a gradient.
CG needs at most d iterations to solve the linear system.
This approach is still not practical for deep neural nets...
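A minimal sketch of the Hessian-free idea (my own toy quadratic; the Hessian-vector product here is formed by finite differences of gradients, one of several standard choices):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 6))
A = Q @ Q.T + 6 * np.eye(6)                  # SPD Hessian of a test quadratic
b = rng.normal(size=6)
grad = lambda w: A @ w - b                   # grad of F(w) = 0.5 w^T A w - b^T w

def hess_vec(w, v, eps=1e-6):
    """Hessian-vector product without forming H: (grad(w+eps*v)-grad(w))/eps."""
    return (grad(w + eps * v) - grad(w)) / eps

def cg(hv, rhs, iters):
    """Conjugate gradient for H s = rhs using only H-vector products."""
    s, r = np.zeros_like(rhs), rhs.copy()
    p = r.copy()
    for _ in range(iters):
        Hp = hv(p)
        a = (r @ r) / (p @ Hp)
        s = s + a * p
        r_new = r - a * Hp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return s

w = rng.normal(size=6)
s = cg(lambda v: hess_vec(w, v), -grad(w), iters=6)   # <= d iterations suffice
print(np.linalg.norm(A @ s + grad(w)))                # residual near zero
```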
Hessian Free Inexact Newton (Cont’d)
Inexact Newton analysis suggests the method is more tolerant to noise in the Hessian estimate than to noise in the gradient estimate.
One can therefore employ subsampled Hessian-free inexact Newton:

∇^2 f_{S_k^H}(w_k; ξ_k^H) s = −∇f_{S_k}(w_k; ξ_k),

where S_k^H ⊂ S_k.
The gradient sample size |S_k| needs to be large enough that taking subsamples for the Hessian estimate is sensible.
Quasi-Newton (BFGS)
Approximate the Hessian using only gradient information.
BFGS is one of the most successful quasi-Newton methods. Defining

s_k := w_{k+1} − w_k and v_k := ∇F(w_{k+1}) − ∇F(w_k),

the update rule of BFGS is

w_{k+1} ← w_k − H_k ∇F(w_k),
H_{k+1} ← (I − (v_k s_k^T)/(s_k^T v_k))^T H_k (I − (v_k s_k^T)/(s_k^T v_k)) + (s_k s_k^T)/(s_k^T v_k),

where H_k is a PSD approximation of (∇^2 F(w_k))^{−1}.
BFGS (Cont’d)
This update ensures that the secant equation H_{k+1}^{−1} s_k = v_k holds, meaning that a second-order Taylor expansion is satisfied along the most recent displacement.
BFGS enjoys a local superlinear rate of convergence.
The matrices H_k are dense, even if the exact Hessians are sparse.
L-BFGS to the rescue: do not explicitly form H_k, but compute H_k ∇F(w_k) from a short history of displacement pairs {(s_j, v_j)}. L-BFGS only achieves a linear rate of convergence, though...
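The standard way to compute H_k ∇F(w_k) from the stored pairs is the two-loop recursion. The sketch below (my own, on a quadratic test problem) checks the secant property H_k v_last = s_last, which the BFGS update guarantees for the most recent pair:

```python
import numpy as np

def two_loop(grad_w, pairs):
    """L-BFGS two-loop recursion: returns H_k @ grad_w using only the
    displacement pairs (s_j, v_j), oldest first, never forming H_k."""
    q = grad_w.astype(float).copy()
    alphas = []
    for s, v in reversed(pairs):             # first loop: newest to oldest
        a = (s @ q) / (s @ v)
        q -= a * v
        alphas.append(a)
    s, v = pairs[-1]
    q *= (s @ v) / (v @ v)                   # initial scaling H_k^0 = gamma*I
    for (s, v), a in zip(pairs, reversed(alphas)):   # second loop: oldest first
        beta = (v @ q) / (s @ v)
        q += (a - beta) * s
    return q

# Pairs from a quadratic F(w) = 0.5 w^T A w, where v_j = A s_j exactly.
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 5))
A = Q @ Q.T + 5 * np.eye(5)                  # SPD, so s^T v > 0 for all pairs
pairs = [(s, A @ s) for s in rng.normal(size=(4, 5))]
s_last, v_last = pairs[-1]
print(np.allclose(two_loop(v_last, pairs), s_last))   # secant equation: True
```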
How about Stochastic L-BFGS?

Iteration update: w_{k+1} ← w_k − α_k H_k g(w_k, ξ_k).
Stochastic L-BFGS can only attain a sublinear rate, like SG. What is the benefit? Improving the constant in the rate, from Hessian-dependent to Hessian-independent!
Each iteration costs O(4md); with m = 5, that is about 20 times the cost of a stochastic gradient estimate. This is not a big problem with mini-batch SG, though.
The effects of even a single bad update may linger for numerous iterations.
Online L-BFGS
Replace gradients with stochastic gradients:

s_k := w_{k+1} − w_k and v_k := ∇f_{S_k}(w_{k+1}; ξ_k) − ∇f_{S_k}(w_k; ξ_k).

Note the use of the same seed ξ_k in the two gradient estimates.
Updating the inverse Hessian approximation at every step may not be warranted when the sample size |S_k| is small.
Alternatively, since ∇F(w_{k+1}) − ∇F(w_k) ≈ ∇^2 F(w_k)(w_{k+1} − w_k), define

v_k := ∇^2 f_{S_k^H}(w_k; ξ_k^H) s_k,

where ∇^2 f_{S_k^H}(w_k; ξ_k^H) is a subsampled Hessian.
Intuition

Construct an approximation to the Hessian using only first-order information.
It is always PSD, even when the Hessian is not (useful for deep learning).
It ignores second-order interactions between elements of the parameter vector w.
Classical Gauss-Newton Method

Consider the least-squares problem

f(w; ξ) = ℓ(h(x_ξ; w), y_ξ) = (1/2) ‖h(x_ξ; w) − y_ξ‖_2^2.

The Gauss-Newton approximation is an affine approximation of the prediction function inside the quadratic loss function,

h(x_ξ; w) ≈ h(x_ξ; w_k) + J_h(w_k; ξ)(w − w_k),

where J_h(·; ξ) represents the Jacobian of h(x_ξ; ·) with respect to w. This leads to

f(w; ξ) ≈ (1/2) ‖h(x_ξ; w_k) + J_h(w_k; ξ)(w − w_k) − y_ξ‖_2^2
= (1/2) ‖h(x_ξ; w_k) − y_ξ‖_2^2 + (h(x_ξ; w_k) − y_ξ)^T J_h(w_k; ξ)(w − w_k) + (1/2)(w − w_k)^T J_h(w_k; ξ)^T J_h(w_k; ξ)(w − w_k).
Gauss-Newton
We can replace the subsampled Hessian with the Gauss-Newton matrix

G_{S_k^H}(w_k; ξ_k^H) = (1/|S_k^H|) Σ_{i∈S_k^H} J_h(w_k; ξ_{k,i})^T J_h(w_k; ξ_{k,i}). (23)

For the least-squares loss, the Gauss-Newton matrix is always PSD, so it can serve as another Hessian approximation in stochastic L-BFGS and inexact Newton methods.
It also generalizes to other loss functions:

G_{S_k^H}(w_k; ξ_k^H) = (1/|S_k^H|) Σ_{i∈S_k^H} J_h(w_k; ξ_{k,i})^T H_ℓ(w_k; ξ_{k,i}) J_h(w_k; ξ_{k,i}),

where H_ℓ(w_k; ξ) = ∂^2 ℓ(h(x_ξ; w), y_ξ) / ∂h^2.
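As a small sketch (my own toy model h(x; w) = tanh(x^T w) with scalar output, not from the slides), the matrix in (23) is just an average of per-sample J^T J terms and is PSD by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                  # a mini-batch S_k^H of 8 samples
w = rng.normal(size=3)

def jacobians(w):
    """Per-sample Jacobians of h(x_i; w) = tanh(x_i^T w); row i is dh_i/dw."""
    z = X @ w
    return (1.0 - np.tanh(z) ** 2)[:, None] * X      # chain rule, shape (8, 3)

J = jacobians(w)
G = J.T @ J / X.shape[0]                     # (1/|S|) sum_i J_i^T J_i, as in (23)
print(np.linalg.eigvalsh(G).min() >= -1e-12) # True: Gauss-Newton matrix is PSD
```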
Diagonal Scaling Intuition
Iteration update:

w_{k+1} ← w_k − α_k D_k g(w_k, ξ_k), with D_k diagonal.

D_k^{−1} ≈ diag(Hessian) or diag(Gauss-Newton).
D_k^{−1} ≈ a running average/sum of the gradient components.
An Example: RMSPROP and ADAGRAD
RMSPROP estimates the average magnitude of each element of the stochastic gradient vector g(w_k, ξ_k) by maintaining the running average

R_k = (1 − λ) R_{k−1} + λ (g(w_k, ξ_k) ◦ g(w_k, ξ_k)),
w_{k+1} = w_k − (α / (√R_k + μ)) ◦ g(w_k, ξ_k),

where ◦ represents the element-wise product (with the square root and division also taken element-wise).
ADAGRAD simply replaces the running average with a sum:

R_k = R_{k−1} + g(w_k, ξ_k) ◦ g(w_k, ξ_k).
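These two updates can be sketched directly from the formulas above (my own toy demo on a badly scaled quadratic; α, λ, and μ are arbitrary illustrative values):

```python
import numpy as np

def rmsprop_step(w, g, R, alpha=0.01, lam=0.1, mu=1e-8):
    """R_k = (1-lam)*R_{k-1} + lam*(g o g); element-wise scaled step."""
    R = (1.0 - lam) * R + lam * g * g
    return w - alpha / (np.sqrt(R) + mu) * g, R

def adagrad_step(w, g, R, alpha=0.01, mu=1e-8):
    """Same step, but R_k accumulates the sum of squared gradients."""
    R = R + g * g
    return w - alpha / (np.sqrt(R) + mu) * g, R

H = np.array([100.0, 1.0])                   # F(w) = 0.5 * sum_i H_i * w_i**2
w, R = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(3000):
    w, R = rmsprop_step(w, H * w, R)         # deterministic gradient H o w
print(np.linalg.norm(w) < 0.1)               # True: hovers near w* = 0
```

Note how the element-wise scaling handles the 100:1 curvature mismatch that a single scalar stepsize cannot.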
Take-Home Message

RMSPROP and ADAGRAD are considered state-of-the-art stochastic optimization methods in practical deep learning.
Note that they are just approximating the diagonal of the Hessian.
There is still big room for stochastic second-order methods to be explored with deep learning models!