A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent

The talk is based on the paper [2] and the slides [6].

Eduard Gorbunov¹, Filip Hanzely², Peter Richtarik²

¹Moscow Institute of Physics and Technology, Russia
²King Abdullah University of Science and Technology, Saudi Arabia

INRIA, Paris, 18 October, 2019

Table of Contents

1 Introduction

2 SGD

3 General analysis

4 Special cases

5 Discussion

6 Bibliography

The Problem

min_{x∈R^d} f(x) + R(x)   (1)

∙ f(x) — convex function with Lipschitz gradient.
∙ R : R^d → R ∪ {+∞} — proximable (proper closed convex) regularizer

Structure of f in supervised learning theory and practice

We focus on the following situations:

∙ f(x) = E_{𝜉∼𝒟}[f_𝜉(x)]   (2)

  f(x) = (1/n) ∑_{i=1}^n f_i(x).   (3)

∙ f(x) = (1/n) ∑_{i=1}^n f_i(x) and f_i(x) = (1/m) ∑_{j=1}^m f_{ij}(x).   (4)

Typical case: computing the exact gradient ∇f(x) is too expensive, but an unbiased estimator of ∇f(x) can be computed efficiently.
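For instance, in the finite-sum setting (3), averaging ∇f_i over a uniformly sampled minibatch gives such an unbiased estimator. A minimal NumPy sketch (the least-squares components and all names are illustrative assumptions, not from the slides):

    import numpy as np

    # Illustrative finite sum: f_i(x) = 0.5*(a_i^T x - b_i)^2, so f(x) = (1/n) * sum_i f_i(x).
    rng = np.random.default_rng(0)
    n, d = 1000, 20
    A_mat, b = rng.standard_normal((n, d)), rng.standard_normal(n)

    def full_grad(x):                        # grad f(x) = (1/n) * sum_i a_i (a_i^T x - b_i)
        return A_mat.T @ (A_mat @ x - b) / n

    def minibatch_grad(x, batch=32):         # unbiased estimator: average of grad f_i over a uniform minibatch
        idx = rng.choice(n, size=batch, replace=False)
        return A_mat[idx].T @ (A_mat[idx] @ x - b[idx]) / batch

    x = rng.standard_normal(d)
    est = np.mean([minibatch_grad(x) for _ in range(2000)], axis=0)
    print(np.linalg.norm(est - full_grad(x)))   # close to zero: the estimator is unbiased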

Popular approach to solve problem (1)

x^{k+1} = prox_{γR}(x^k − γ g^k).   (5)

∙ g^k is an unbiased gradient estimator: E[g^k | x^k] = ∇f(x^k)
∙ prox_R(x) := argmin_u { (1/2)‖u − x‖^2 + R(u) }
∙ ‖·‖ — standard Euclidean norm

The prox operator:
∙ x ↦ prox_R(x) is a (single-valued) function
∙ ‖prox_R(x) − prox_R(y)‖ ≤ ‖x − y‖ (nonexpansiveness)
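A minimal sketch of one update (5) for the common special case R(x) = λ‖x‖_1, whose prox is coordinate-wise soft-thresholding (the choice of R and all names are illustrative assumptions):

    import numpy as np

    def prox_l1(v, tau):
        # prox_{tau*||.||_1}(v): coordinate-wise soft-thresholding
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def prox_sgd_step(x, g, gamma, lam):
        # One step of (5): x^{k+1} = prox_{gamma R}(x^k - gamma g^k) with R = lam*||.||_1
        return prox_l1(x - gamma * g, gamma * lam)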

Stochastic gradient

There are infinitely many ways of getting an unbiased estimator with “good” properties.

∙ Flexibility to construct stochastic gradients in order to target desirable properties:
  ∙ convergence speed
  ∙ iteration cost
  ∙ overall complexity
  ∙ parallelizability
  ∙ communication cost, etc.
∙ Too many methods
∙ Hard to keep up with new results
∙ Challenges in terms of analysis
∙ Some problems with fair comparison: different assumptions are used

Bregman divergence

By D_f(x, y) we denote the Bregman divergence associated with f:

D_f(x, y) := f(x) − f(y) − ⟨∇f(y), x − y⟩.
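A quick sanity check (an illustrative sketch, not from the slides): for f(x) = 0.5‖x‖^2 the Bregman divergence reduces to 0.5‖x − y‖^2.

    import numpy as np

    def bregman(f, grad_f, x, y):
        # D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>
        return f(x) - f(y) - grad_f(y) @ (x - y)

    f = lambda x: 0.5 * x @ x            # f(x) = 0.5*||x||^2, so D_f(x, y) = 0.5*||x - y||^2
    grad_f = lambda x: x
    x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(bregman(f, grad_f, x, y), 0.5 * np.linalg.norm(x - y) ** 2)   # both print 6.5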

Key assumption

Assumption 1

Let {x^k} be the random iterates produced by proximal SGD.

1 The stochastic gradients g^k are unbiased:

E[g^k | x^k] = ∇f(x^k)   ∀k ≥ 0.   (6)

2 There exist non-negative constants A, B, C, D_1, D_2, ρ and a (possibly) random sequence {σ_k^2}_{k≥0} such that the following two relations hold:

E[‖g^k − ∇f(x^*)‖^2 | x^k] ≤ 2A D_f(x^k, x^*) + B σ_k^2 + D_1,   (7)

E[σ_{k+1}^2 | σ_k^2] ≤ (1 − ρ) σ_k^2 + 2C D_f(x^k, x^*) + D_2,   ∀k ≥ 0.   (8)

Gradient Descent satisfies Assumption 1

Assumption 2
Assume that f is convex, i.e.

D_f(x, y) ≥ 0   ∀x, y ∈ R^d,

and L-smooth, i.e.

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖   ∀x, y ∈ R^d.

Assumption 2 implies that (see Nesterov’s book [4])

‖∇f(x) − ∇f(y)‖^2 ≤ 2L D_f(x, y)   ∀x, y ∈ R^d.

Therefore, if f satisfies Assumption 2, then gradient descent satisfies Assumption 1 with

A = L, B = 0, D_1 = 0, σ_k = 0, ρ = 1, C = 0, D_2 = 0.

Other assumptions

Assumption 3 (Unique solution)
The problem (1) has a unique minimizer x^*.

Assumption 4 (μ-strong quasi-convexity)
There exists μ > 0 such that f : R^d → R is μ-strongly quasi-convex, i.e.

f(x^*) ≥ f(x) + ⟨∇f(x), x^* − x⟩ + (μ/2)‖x^* − x‖^2   ∀x ∈ R^d.   (9)

Main result

Theorem 1
Let Assumptions 1, 3 and 4 be satisfied. Choose a constant M such that M > B/ρ. Choose a stepsize satisfying

0 < γ ≤ min{ 1/μ, 1/(A + CM) }.   (10)

Then the iterates {x^k}_{k≥0} of proximal SGD satisfy

E[V^k] ≤ max{ (1 − γμ)^k, (1 + B/M − ρ)^k } V^0 + (D_1 + M D_2) γ^2 / min{ γμ, ρ − B/M },   (11)

where the Lyapunov function V^k is defined by V^k := ‖x^k − x^*‖^2 + M γ^2 σ_k^2.
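The bound (11) is straightforward to evaluate numerically. A small helper (an illustrative sketch; the function name and the example call are ours, not from the slides) that checks the conditions of Theorem 1 and returns the right-hand side of (11):

    def theorem1_bound(k, V0, A, B, C, D1, D2, rho, mu, gamma, M):
        # Right-hand side of (11), valid when M > B/rho and gamma satisfies (10).
        assert M > B / rho, "need M > B/rho"
        assert 0 < gamma <= min(1 / mu, 1 / (A + C * M)), "stepsize must satisfy (10)"
        contraction = max((1 - gamma * mu) ** k, (1 + B / M - rho) ** k)
        tail = (D1 + M * D2) * gamma ** 2 / min(gamma * mu, rho - B / M)
        return contraction * V0 + tail

    # Gradient descent case (A = L, all other constants zero, rho = 1, M = 1): linear rate, no tail term.
    print(theorem1_bound(k=100, V0=1.0, A=10.0, B=0.0, C=0.0, D1=0.0, D2=0.0, rho=1.0, mu=1.0, gamma=0.1, M=1.0))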

Methods

[Table: the special-case methods covered by the unified framework.]

Parameters

[Table: the parameter choices A, B, ρ, C, D_1, D_2 for each special case.]

Proximal Gradient Descent

We checked that GD satisfies Assumption 1 when f is convex and L-smooth, with

A = L, B = 0, D_1 = 0, σ_k = 0, ρ = 1, C = 0, D_2 = 0.

We can choose M = 1 in Theorem 1 and get
∙ 0 < γ ≤ 1/L (since μ ≤ L and C = 0)
∙ V^k = ‖x^k − x^*‖^2 (since σ_k = 0)
∙ The rate: ‖x^k − x^*‖^2 ≤ (1 − γμ)^k ‖x^0 − x^*‖^2
∙ In particular, for γ = 1/L we recover the standard rate for proximal gradient descent:

k ≥ (L/μ) log(1/ε)  =⇒  ‖x^k − x^*‖^2 ≤ ε ‖x^0 − x^*‖^2.
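As a concrete illustration (our own numerical example, not from the slides): with condition number L/μ = 10^3 and target accuracy ε = 10^{-6}, the bound asks for k ≥ 10^3 · ln(10^6) ≈ 1.4 × 10^4 iterations.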

SGD-SR (see Gower et al. (2019), [3])

Stochastic reformulation

min_{x∈R^d} f(x) + R(x),   f(x) = E_𝒟[f_𝜉(x)],   f_𝜉(x) := (1/n) ∑_{i=1}^n 𝜉_i f_i(x)   (12)

1 𝜉 ∼ 𝒟: E_𝒟[𝜉_i] = 1 for all i
2 f_i (for all i) is a smooth, possibly non-convex function

Algorithm 1 SGD-SR
Input: learning rate γ > 0, starting point x^0 ∈ R^d, distribution 𝒟 over 𝜉 ∈ R^n such that E_𝒟[𝜉] is the vector of all ones
1: for k = 0, 1, 2, ... do
2:   Sample 𝜉 ∼ 𝒟
3:   g^k = ∇f_𝜉(x^k)
4:   x^{k+1} = prox_{γR}(x^k − γ g^k)
5: end for
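A minimal NumPy sketch of Algorithm 1 for the special case of single-element sampling, i.e. 𝜉 = e_i / p_i with i drawn with probability p_i (so E_𝒟[𝜉_i] = 1), with R = 0 and a user-supplied grad_fi(i, x) returning ∇f_i(x); all names are illustrative assumptions:

    import numpy as np

    def sgd_sr(grad_fi, n, x0, gamma, num_iters, probs=None, rng=np.random.default_rng(0)):
        # With xi = e_i / p_i we get f_xi(x) = f_i(x) / (n p_i), so grad f_xi(x) is unbiased for grad f(x).
        p = np.full(n, 1.0 / n) if probs is None else np.asarray(probs)
        x = np.asarray(x0, dtype=float).copy()
        for _ in range(num_iters):
            i = rng.choice(n, p=p)               # Sample xi ~ D          (line 2)
            g = grad_fi(i, x) / (n * p[i])       # g^k = grad f_xi(x^k)   (line 3)
            x = x - gamma * g                    # prox step with R = 0   (line 4)
        return x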

Expected smoothness
We say that f is ℒ-smooth in expectation with respect to the distribution 𝒟 if there exists ℒ = ℒ(f, 𝒟) > 0 such that

E_𝒟[‖∇f_𝜉(x) − ∇f_𝜉(x^*)‖^2] ≤ 2ℒ D_f(x, x^*)   (13)

for all x ∈ R^d, and we write (f, 𝒟) ∼ ES(ℒ).

Lemma 1 (Generalization of Lemma 2.4, [3])
If (f, 𝒟) ∼ ES(ℒ), then

E_𝒟[‖∇f_𝜉(x) − ∇f(x^*)‖^2] ≤ 4ℒ D_f(x, x^*) + 2σ^2,   (14)

where σ^2 := E_𝒟[‖∇f_𝜉(x^*) − ∇f(x^*)‖^2].

That is, SGD-SR satisfies Assumption 1 with

A = 2ℒ, B = 0, D_1 = 2σ^2, σ_k = 0, ρ = 1, C = 0, D_2 = 0.

If γ_k ≡ γ ≤ 1/(2ℒ), then Theorem 1 implies that

E‖x^k − x^*‖^2 ≤ (1 − γμ)^k ‖x^0 − x^*‖^2 + 2γσ^2/μ.
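In words: with a constant stepsize, SGD-SR converges linearly, but only up to a neighborhood of x^* of size 2γσ^2/μ; shrinking γ shrinks the neighborhood (it vanishes when σ^2 = 0), at the price of a slower linear rate.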

SAGA (see Defazio, Bach & Lacoste-Julien (2014), [1])

Consider the finite-sum minimization problem

min_{x∈R^d} f(x) + R(x),   f(x) = (1/n) ∑_{i=1}^n f_i(x),   (15)

where each f_i is convex and L-smooth, and f is μ-strongly convex.

Algorithm 2 SAGA [1]
Input: learning rate γ > 0, starting point x^0 ∈ R^d
1: Set φ_j^0 = x^0 for each j ∈ [n]
2: for k = 0, 1, 2, ... do
3:   Sample j ∈ [n] uniformly at random
4:   Set φ_j^{k+1} = x^k and φ_i^{k+1} = φ_i^k for i ≠ j
5:   g^k = ∇f_j(φ_j^{k+1}) − ∇f_j(φ_j^k) + (1/n) ∑_{i=1}^n ∇f_i(φ_i^k)
6:   x^{k+1} = prox_{γR}(x^k − γ g^k)
7: end for
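A compact NumPy sketch of Algorithm 2 with R = 0 (so the prox step is the identity); grad_fi(i, x) returning ∇f_i(x) is an assumed user-supplied function:

    import numpy as np

    def saga(grad_fi, n, x0, gamma, num_iters, rng=np.random.default_rng(0)):
        x = np.asarray(x0, dtype=float).copy()
        table = np.stack([grad_fi(i, x) for i in range(n)])   # stored gradients grad f_i(phi_i^k)
        avg = table.mean(axis=0)                              # their running average
        for _ in range(num_iters):
            j = rng.integers(n)                               # Sample j uniformly              (line 3)
            g_new = grad_fi(j, x)                             # grad f_j(phi_j^{k+1}), phi_j^{k+1} = x^k
            g = g_new - table[j] + avg                        # SAGA gradient estimator         (line 5)
            x = x - gamma * g                                 # prox step with R = 0            (line 6)
            avg += (g_new - table[j]) / n                     # keep the average in sync
            table[j] = g_new
        return x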

Lemma 2
We have

E[‖g^k − ∇f(x^*)‖^2 | x^k] ≤ 4L D_f(x^k, x^*) + 2σ_k^2   (16)

and

E[σ_{k+1}^2 | x^k] ≤ (1 − 1/n) σ_k^2 + (2L/n) D_f(x^k, x^*),   (17)

where σ_k^2 = (1/n) ∑_{i=1}^n ‖∇f_i(φ_i^k) − ∇f_i(x^*)‖^2.

That is, SAGA satisfies Assumption 1 with

A = 2L, B = 1, D_1 = 0, ρ = 1/n, C = L/n, D_2 = 0.

Theorem 1 with M = 4n implies that, for γ = 1/(6L),

E[V^k] ≤ (1 − min{ μ/(6L), 1/(2n) })^k V^0.
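Equivalently, k ≥ max{6L/μ, 2n} ln(1/ε) iterations guarantee E[V^k] ≤ ε V^0, i.e. the familiar O((n + L/μ) log(1/ε)) iteration complexity of SAGA.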

SGD-star

Consider the same situation as for SGD-SR. Recall that for SGD-SR we got

E‖x^k − x^*‖^2 ≤ (1 − γμ)^k ‖x^0 − x^*‖^2 + 2γσ^2/μ,   γ ∈ (0, 1/(2ℒ)].

Algorithm 3 SGD-star
Input: learning rate γ > 0, starting point x^0 ∈ R^d, distribution 𝒟 over 𝜉 ∈ R^n such that E_𝒟[𝜉] is the vector of all ones
1: for k = 0, 1, 2, ... do
2:   Sample 𝜉 ∼ 𝒟
3:   g^k = ∇f_𝜉(x^k) − ∇f_𝜉(x^*) + ∇f(x^*)
4:   x^{k+1} = prox_{γR}(x^k − γ g^k)
5: end for
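Note that line 3 requires the gradients at the (unknown) solution x^*, so SGD-star is of conceptual rather than practical interest. A minimal sketch of its estimator, assuming single-element sampling and precomputed ∇f_i(x^*) (all names are illustrative):

    import numpy as np

    def sgd_star_step(x, i, p, grad_fi, grads_at_opt, grad_f_opt, gamma):
        # One step with xi = e_i / p_i and R = 0; grads_at_opt[i] = grad f_i(x*), grad_f_opt = grad f(x*).
        n = len(p)
        g = (grad_fi(i, x) - grads_at_opt[i]) / (n * p[i]) + grad_f_opt   # line 3 of Algorithm 3
        return x - gamma * g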

Lemma 3 (Lemma 2.4 from [3])
If (f, 𝒟) ∼ ES(ℒ), then

E_𝒟[‖g^k − ∇f(x^*)‖^2] ≤ 4ℒ D_f(x^k, x^*).   (18)

Thus, SGD-star satisfies Assumption 1 with

A = 2ℒ, B = 0, D_1 = 0, σ_k = 0, ρ = 1, C = 0, D_2 = 0.

If γ_k ≡ γ ≤ 1/(2ℒ), then Theorem 1 implies that

E‖x^k − x^*‖^2 ≤ (1 − γμ)^k ‖x^0 − x^*‖^2.

Quantitative definition of variance reduction

Variance-reduced methods
We say that SGD with stochastic gradients satisfying Assumption 1 is variance-reduced if D_1 = D_2 = 0.

∙ Variance-reduced methods converge to the exact solution.
∙ The sequence {σ_k^2}_{k≥0} reflects the progress of the variance-reduction process.

Limitations and Extensions

1 We consider only the L-smooth, μ-strongly quasi-convex case.

2 It would be interesting to unify the theory for biased gradient estimators (e.g. SAG [7], SARAH [5], zero-order optimization, etc.).

3 Our analysis does not recover the best known rates for RCD-type methods with importance sampling.

4 An extension to iteration-dependent parameters A, B, C, D_1, D_2, ρ would cover new methods, such as SGD with decreasing stepsizes.

5 We consider only non-accelerated methods. It would be interesting to provide a unified analysis of stochastic methods with acceleration and momentum.

Bibliography

[1] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[2] Eduard Gorbunov, Filip Hanzely, and Peter Richtarik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. arXiv preprint arXiv:1905.11261, 2019.


[3] Robert M. Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtarik. SGD: General analysis and improved rates. arXiv preprint arXiv:1901.09401, 2019.

[4] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. 2004.

[5] Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takac. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2613–2621. PMLR, 2017.


[6] Peter Richtarik. A guided walk through the zoo of stochastic gradient descent methods. ICCOPT 2019 Summer School.

[7] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.
