A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent

The talk is based on the paper [2] and the slides [6].

Eduard Gorbunov¹, Filip Hanzely², Peter Richtarik²

¹Moscow Institute of Physics and Technology, Russia
²King Abdullah University of Science and Technology, Saudi Arabia

INRIA, Paris, 18 October, 2019

Table of Contents

1 Introduction

2 SGD

3 General analysis

4 Special cases

5 Discussion

6 Bibliography

The Problem

min_{x∈R^d} f(x) + R(x)   (1)

∙ f(x) — convex function with Lipschitz gradient.
∙ R : R^d → R ∪ {+∞} — proximable (proper closed convex) regularizer

Structure of f in supervised learning theory and practice

We focus on the following situations:

∙ f(x) = E_{𝜉∼𝒟}[f_𝜉(x)]   (2)

  f(x) = (1/n) ∑_{i=1}^n f_i(x).   (3)

∙ f(x) = (1/n) ∑_{i=1}^n f_i(x) and f_i(x) = (1/m) ∑_{j=1}^m f_{ij}(x).   (4)

Typical case: computing the exact gradient ∇f(x) is too expensive, but an unbiased estimator of ∇f(x) can be computed efficiently.
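For instance, in the finite-sum setting (3), averaging ∇f_i over a uniformly sampled minibatch gives such an unbiased estimator. A minimal NumPy sketch (the least-squares components and all names are illustrative assumptions, not from the slides):

    import numpy as np

    # Illustrative finite sum: f_i(x) = 0.5*(a_i^T x - b_i)^2, so f(x) = (1/n) * sum_i f_i(x).
    rng = np.random.default_rng(0)
    n, d = 1000, 20
    A_mat, b = rng.standard_normal((n, d)), rng.standard_normal(n)

    def full_grad(x):                        # grad f(x) = (1/n) * sum_i a_i (a_i^T x - b_i)
        return A_mat.T @ (A_mat @ x - b) / n

    def minibatch_grad(x, batch=32):         # unbiased estimator: average of grad f_i over a uniform minibatch
        idx = rng.choice(n, size=batch, replace=False)
        return A_mat[idx].T @ (A_mat[idx] @ x - b[idx]) / batch

    x = rng.standard_normal(d)
    est = np.mean([minibatch_grad(x) for _ in range(2000)], axis=0)
    print(np.linalg.norm(est - full_grad(x)))   # close to zero: the estimator is unbiased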

Popular approach to solve problem (1)

x^{k+1} = prox_{γR}(x^k − γ g^k).   (5)

∙ g^k is an unbiased gradient estimator: E[g^k | x^k] = ∇f(x^k)
∙ prox_R(x) := argmin_u { (1/2)‖u − x‖^2 + R(u) }
∙ ‖·‖ — standard Euclidean norm

The prox operator:
∙ x ↦ prox_R(x) is a (single-valued) function
∙ ‖prox_R(x) − prox_R(y)‖ ≤ ‖x − y‖ (nonexpansiveness)
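A minimal sketch of one update (5) for the common special case R(x) = λ‖x‖_1, whose prox is coordinate-wise soft-thresholding (the choice of R and all names are illustrative assumptions):

    import numpy as np

    def prox_l1(v, tau):
        # prox_{tau*||.||_1}(v): coordinate-wise soft-thresholding
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def prox_sgd_step(x, g, gamma, lam):
        # One step of (5): x^{k+1} = prox_{gamma R}(x^k - gamma g^k) with R = lam*||.||_1
        return prox_l1(x - gamma * g, gamma * lam)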

Stochastic gradient

There are infinitely many ways of getting an unbiased estimator with “good” properties.

∙ Flexibility to construct stochastic gradients in order to target desirable properties:
  ∙ convergence speed
  ∙ iteration cost
  ∙ overall complexity
  ∙ parallelizability
  ∙ communication cost, etc.
∙ Too many methods
∙ Hard to keep up with new results
∙ Challenges in terms of analysis
∙ Some problems with fair comparison: different assumptions are used

Bregman divergence

By D_f(x, y) we denote the Bregman divergence associated with f:

D_f(x, y) := f(x) − f(y) − ⟨∇f(y), x − y⟩.
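A quick sanity check (an illustrative sketch, not from the slides): for f(x) = 0.5‖x‖^2 the Bregman divergence reduces to 0.5‖x − y‖^2.

    import numpy as np

    def bregman(f, grad_f, x, y):
        # D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>
        return f(x) - f(y) - grad_f(y) @ (x - y)

    f = lambda x: 0.5 * x @ x            # f(x) = 0.5*||x||^2, so D_f(x, y) = 0.5*||x - y||^2
    grad_f = lambda x: x
    x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(bregman(f, grad_f, x, y), 0.5 * np.linalg.norm(x - y) ** 2)   # both print 6.5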

Key assumption

Assumption 1

Let {x^k} be the random iterates produced by proximal SGD.

1 The stochastic gradients g^k are unbiased:

E[g^k | x^k] = ∇f(x^k)   ∀k ≥ 0.   (6)

2 There exist non-negative constants A, B, C, D_1, D_2, ρ and a (possibly) random sequence {σ_k^2}_{k≥0} such that the following two relations hold:

E[‖g^k − ∇f(x^*)‖^2 | x^k] ≤ 2A D_f(x^k, x^*) + B σ_k^2 + D_1,   (7)

E[σ_{k+1}^2 | σ_k^2] ≤ (1 − ρ) σ_k^2 + 2C D_f(x^k, x^*) + D_2,   ∀k ≥ 0.   (8)

Gradient Descent satisfies Assumption 1

Assumption 2
Assume that f is convex, i.e.

D_f(x, y) ≥ 0   ∀x, y ∈ R^d,

and L-smooth, i.e.

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖   ∀x, y ∈ R^d.

Assumption 2 implies that (see Nesterov’s book [4])

‖∇f(x) − ∇f(y)‖^2 ≤ 2L D_f(x, y)   ∀x, y ∈ R^d.

Therefore, if f satisfies Assumption 2, then gradient descent satisfies Assumption 1 with

A = L, B = 0, D_1 = 0, σ_k = 0, ρ = 1, C = 0, D_2 = 0.

Other assumptions

Assumption 3 (Unique solution)
The problem (1) has a unique minimizer x^*.

Assumption 4 (μ-strong quasi-convexity)
There exists μ > 0 such that f : R^d → R is μ-strongly quasi-convex, i.e.

f(x^*) ≥ f(x) + ⟨∇f(x), x^* − x⟩ + (μ/2)‖x^* − x‖^2   ∀x ∈ R^d.   (9)

Main result

Theorem 1
Let Assumptions 1, 3 and 4 be satisfied. Choose a constant M such that M > B/ρ. Choose a stepsize satisfying

0 < γ ≤ min{ 1/μ, 1/(A + CM) }.   (10)

Then the iterates {x^k}_{k≥0} of proximal SGD satisfy

E[V^k] ≤ max{ (1 − γμ)^k, (1 + B/M − ρ)^k } V^0 + (D_1 + M D_2) γ^2 / min{ γμ, ρ − B/M },   (11)

where the Lyapunov function V^k is defined by V^k := ‖x^k − x^*‖^2 + M γ^2 σ_k^2.
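The bound (11) is straightforward to evaluate numerically. A small helper (an illustrative sketch; the function name and the example call are ours, not from the slides) that checks the conditions of Theorem 1 and returns the right-hand side of (11):

    def theorem1_bound(k, V0, A, B, C, D1, D2, rho, mu, gamma, M):
        # Right-hand side of (11), valid when M > B/rho and gamma satisfies (10).
        assert M > B / rho, "need M > B/rho"
        assert 0 < gamma <= min(1 / mu, 1 / (A + C * M)), "stepsize must satisfy (10)"
        contraction = max((1 - gamma * mu) ** k, (1 + B / M - rho) ** k)
        tail = (D1 + M * D2) * gamma ** 2 / min(gamma * mu, rho - B / M)
        return contraction * V0 + tail

    # Gradient descent case (A = L, all other constants zero, rho = 1, M = 1): linear rate, no tail term.
    print(theorem1_bound(k=100, V0=1.0, A=10.0, B=0.0, C=0.0, D1=0.0, D2=0.0, rho=1.0, mu=1.0, gamma=0.1, M=1.0))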

Methods

[Table: the special-case methods covered by the unified framework.]

Parameters

[Table: the parameter choices A, B, ρ, C, D_1, D_2 for each special case.]

Proximal Gradient Descent

We checked that GD satisfies Assumption 1 when f is convex and L-smooth, with

A = L, B = 0, D_1 = 0, σ_k = 0, ρ = 1, C = 0, D_2 = 0.

We can choose M = 1 in Theorem 1 and get
∙ 0 < γ ≤ 1/L (since μ ≤ L and C = 0)
∙ V^k = ‖x^k − x^*‖^2 (since σ_k = 0)
∙ The rate: ‖x^k − x^*‖^2 ≤ (1 − γμ)^k ‖x^0 − x^*‖^2
∙ In particular, for γ = 1/L we recover the standard rate for proximal gradient descent:

k ≥ (L/μ) log(1/ε)  =⇒  ‖x^k − x^*‖^2 ≤ ε ‖x^0 − x^*‖^2.
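As a concrete illustration (our own numerical example, not from the slides): with condition number L/μ = 10^3 and target accuracy ε = 10^{-6}, the bound asks for k ≥ 10^3 · ln(10^6) ≈ 1.4 × 10^4 iterations.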

SGD-SR (see Gower et al. (2019), [3])

Stochastic reformulation

min_{x∈R^d} f(x) + R(x),   f(x) = E_𝒟[f_𝜉(x)],   f_𝜉(x) := (1/n) ∑_{i=1}^n 𝜉_i f_i(x)   (12)

1 𝜉 ∼ 𝒟: E_𝒟[𝜉_i] = 1 for all i
2 f_i (for all i) is a smooth, possibly non-convex function

Algorithm 1 SGD-SR
Input: learning rate γ > 0, starting point x^0 ∈ R^d, distribution 𝒟 over 𝜉 ∈ R^n such that E_𝒟[𝜉] is the vector of all ones
1: for k = 0, 1, 2, ... do
2:   Sample 𝜉 ∼ 𝒟
3:   g^k = ∇f_𝜉(x^k)
4:   x^{k+1} = prox_{γR}(x^k − γ g^k)
5: end for
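A minimal NumPy sketch of Algorithm 1 for the special case of single-element sampling, i.e. 𝜉 = e_i / p_i with i drawn with probability p_i (so E_𝒟[𝜉_i] = 1), with R = 0 and a user-supplied grad_fi(i, x) returning ∇f_i(x); all names are illustrative assumptions:

    import numpy as np

    def sgd_sr(grad_fi, n, x0, gamma, num_iters, probs=None, rng=np.random.default_rng(0)):
        # With xi = e_i / p_i we get f_xi(x) = f_i(x) / (n p_i), so grad f_xi(x) is unbiased for grad f(x).
        p = np.full(n, 1.0 / n) if probs is None else np.asarray(probs)
        x = np.asarray(x0, dtype=float).copy()
        for _ in range(num_iters):
            i = rng.choice(n, p=p)               # Sample xi ~ D          (line 2)
            g = grad_fi(i, x) / (n * p[i])       # g^k = grad f_xi(x^k)   (line 3)
            x = x - gamma * g                    # prox step with R = 0   (line 4)
        return x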

Expected smoothness
We say that f is ℒ-smooth in expectation with respect to the distribution 𝒟 if there exists ℒ = ℒ(f, 𝒟) > 0 such that

E_𝒟[‖∇f_𝜉(x) − ∇f_𝜉(x^*)‖^2] ≤ 2ℒ D_f(x, x^*)   (13)

for all x ∈ R^d, and we write (f, 𝒟) ∼ ES(ℒ).

Lemma 1 (Generalization of Lemma 2.4, [3])
If (f, 𝒟) ∼ ES(ℒ), then

E_𝒟[‖∇f_𝜉(x) − ∇f(x^*)‖^2] ≤ 4ℒ D_f(x, x^*) + 2σ^2,   (14)

where σ^2 := E_𝒟[‖∇f_𝜉(x^*) − ∇f(x^*)‖^2].

That is, SGD-SR satisfies Assumption 1 with

A = 2ℒ, B = 0, D_1 = 2σ^2, σ_k = 0, ρ = 1, C = 0, D_2 = 0.

If γ_k ≡ γ ≤ 1/(2ℒ), then Theorem 1 implies that

E‖x^k − x^*‖^2 ≤ (1 − γμ)^k ‖x^0 − x^*‖^2 + 2γσ^2/μ.
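In words: with a constant stepsize, SGD-SR converges linearly, but only up to a neighborhood of x^* of size 2γσ^2/μ; shrinking γ shrinks the neighborhood (it vanishes when σ^2 = 0), at the price of a slower linear rate.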

SAGA (see Defazio, Bach & Lacoste-Julien (2014), [1])

Consider the finite-sum minimization problem

min_{x∈R^d} f(x) + R(x),   f(x) = (1/n) ∑_{i=1}^n f_i(x),   (15)

where each f_i is convex and L-smooth, and f is μ-strongly convex.

Algorithm 2 SAGA [1]
Input: learning rate γ > 0, starting point x^0 ∈ R^d
1: Set φ_j^0 = x^0 for each j ∈ [n]
2: for k = 0, 1, 2, ... do
3:   Sample j ∈ [n] uniformly at random
4:   Set φ_j^{k+1} = x^k and φ_i^{k+1} = φ_i^k for i ≠ j
5:   g^k = ∇f_j(φ_j^{k+1}) − ∇f_j(φ_j^k) + (1/n) ∑_{i=1}^n ∇f_i(φ_i^k)
6:   x^{k+1} = prox_{γR}(x^k − γ g^k)
7: end for
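A compact NumPy sketch of Algorithm 2 with R = 0 (so the prox step is the identity); grad_fi(i, x) returning ∇f_i(x) is an assumed user-supplied function:

    import numpy as np

    def saga(grad_fi, n, x0, gamma, num_iters, rng=np.random.default_rng(0)):
        x = np.asarray(x0, dtype=float).copy()
        table = np.stack([grad_fi(i, x) for i in range(n)])   # stored gradients grad f_i(phi_i^k)
        avg = table.mean(axis=0)                              # their running average
        for _ in range(num_iters):
            j = rng.integers(n)                               # Sample j uniformly              (line 3)
            g_new = grad_fi(j, x)                             # grad f_j(phi_j^{k+1}), phi_j^{k+1} = x^k
            g = g_new - table[j] + avg                        # SAGA gradient estimator         (line 5)
            x = x - gamma * g                                 # prox step with R = 0            (line 6)
            avg += (g_new - table[j]) / n                     # keep the average in sync
            table[j] = g_new
        return x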

Lemma 2
We have

E[‖g^k − ∇f(x^*)‖^2 | x^k] ≤ 4L D_f(x^k, x^*) + 2σ_k^2   (16)

and

E[σ_{k+1}^2 | x^k] ≤ (1 − 1/n) σ_k^2 + (2L/n) D_f(x^k, x^*),   (17)

where σ_k^2 = (1/n) ∑_{i=1}^n ‖∇f_i(φ_i^k) − ∇f_i(x^*)‖^2.

That is, SAGA satisfies Assumption 1 with

A = 2L, B = 1, D_1 = 0, ρ = 1/n, C = L/n, D_2 = 0.

Theorem 1 with M = 4n implies that, for γ = 1/(6L),

E[V^k] ≤ (1 − min{ μ/(6L), 1/(2n) })^k V^0.
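Equivalently, k ≥ max{6L/μ, 2n} ln(1/ε) iterations guarantee E[V^k] ≤ ε V^0, i.e. the familiar O((n + L/μ) log(1/ε)) iteration complexity of SAGA.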

SGD-star

Consider the same situation as for SGD-SR. Recall that for SGD-SR we got

E‖x^k − x^*‖^2 ≤ (1 − γμ)^k ‖x^0 − x^*‖^2 + 2γσ^2/μ,   γ ∈ (0, 1/(2ℒ)].

Algorithm 3 SGD-star
Input: learning rate γ > 0, starting point x^0 ∈ R^d, distribution 𝒟 over 𝜉 ∈ R^n such that E_𝒟[𝜉] is the vector of all ones
1: for k = 0, 1, 2, ... do
2:   Sample 𝜉 ∼ 𝒟
3:   g^k = ∇f_𝜉(x^k) − ∇f_𝜉(x^*) + ∇f(x^*)
4:   x^{k+1} = prox_{γR}(x^k − γ g^k)
5: end for
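Note that line 3 requires the gradients at the (unknown) solution x^*, so SGD-star is of conceptual rather than practical interest. A minimal sketch of its estimator, assuming single-element sampling and precomputed ∇f_i(x^*) (all names are illustrative):

    import numpy as np

    def sgd_star_step(x, i, p, grad_fi, grads_at_opt, grad_f_opt, gamma):
        # One step with xi = e_i / p_i and R = 0; grads_at_opt[i] = grad f_i(x*), grad_f_opt = grad f(x*).
        n = len(p)
        g = (grad_fi(i, x) - grads_at_opt[i]) / (n * p[i]) + grad_f_opt   # line 3 of Algorithm 3
        return x - gamma * g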

Lemma 3 (Lemma 2.4 from [3])
If (f, 𝒟) ∼ ES(ℒ), then

E_𝒟[‖g^k − ∇f(x^*)‖^2] ≤ 4ℒ D_f(x^k, x^*).   (18)

Thus, SGD-star satisfies Assumption 1 with

A = 2ℒ, B = 0, D_1 = 0, σ_k = 0, ρ = 1, C = 0, D_2 = 0.

If γ_k ≡ γ ≤ 1/(2ℒ), then Theorem 1 implies that

E‖x^k − x^*‖^2 ≤ (1 − γμ)^k ‖x^0 − x^*‖^2.

Quantitative definition of variance reduction

Variance-reduced methods
We say that SGD with stochastic gradients satisfying Assumption 1 is variance-reduced if D_1 = D_2 = 0.

∙ Variance-reduced methods converge to the exact solution.
∙ The sequence {σ_k^2}_{k≥0} reflects the progress of the variance-reduction process.

Limitations and Extensions

1 We consider only the L-smooth, μ-strongly quasi-convex case.

2 It would be interesting to unify the theory for biased gradient estimators (e.g. SAG [7], SARAH [5], zero-order optimization, etc.).

3 Our analysis does not recover the best known rates for RCD-type methods with importance sampling.

4 An extension to iteration-dependent parameters A, B, C, D_1, D_2, ρ would cover new methods, such as SGD with decreasing stepsizes.

5 We consider only non-accelerated methods. It would be interesting to provide a unified analysis of stochastic methods with acceleration and momentum.

Bibliography

[1] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[2] Eduard Gorbunov, Filip Hanzely, and Peter Richtarik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. arXiv preprint arXiv:1905.11261, 2019.


[3] Robert M. Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtarik. SGD: General analysis and improved rates. arXiv preprint arXiv:1901.09401, 2019.

[4] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. 2004.

[5] Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takac. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2613–2621. PMLR, 2017.


[6] Peter Richtarik. A guided walk through the zoo of stochastic gradient descent methods. ICCOPT 2019 Summer School.

[7] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.
