Stochastic Variance Reduced Gradient Methods · Convex Optimization: Algorithms and Complexity How...

Optimization for Machine Learning

Stochastic Variance Reduced Gradient Methods

Lecturers: Francis Bach & Robert M. Gower

Tutorials: Hadrien Hendrikx, Rui Yuan, Nidham Gazagnadou

African Master's in Machine Intelligence (AMMI), Kigali

References for this class

Section 6.3:

M. Schmidt, N. Le Roux, F. Bach (2016), Mathematical Programming Minimizing Finite Sums with the Stochastic Average Gradient.

Sébastien Bubeck (2015)Foundations and TrendsConvex Optimization: Algorithms and Complexity

How to transform convergence results into

iteration complexity

Section 1.3.5, R.M. Gower, Ph.d thesis: Sketch and Project: Randomized Iterative Methods for Linear Systems and Inverting Matrices University of Edinburgh, 2016

RMG, P. Richtárik and Francis Bach (2018)Stochastic quasi-gradient methods: variance reduction via Jacobian sketching

Solving the Finite Sum Training Problem

A Datum Function

Finite Sum Training Problem

Optimization Sum of Terms

SGD shrinking stepsize

Convergence for Strongly Convex● f(w) is - strongly convex● Subgradients bounded

SGD Theory

SGD recap

SGD initially fast, slow later

SGD suffers from high variance

Can we get best of both?

Today we learn about methods like this one

Variance reduced methods

Build an Estimate of the Gradient

Instead of using directlyUse to update estimate

Similar

Converges in L2

We would like gradient estimate such that:

Similar

Converges in L2

Similar

Converges in L2

Covariates

Let x and z be random variables. We say that x and z are covariates if:

Variance Reduced Estimate:

Covariates

SVRG: Stochastic Variance Reduced Gradients

grad estimate

Reference point

Sample

SVRG: Stochastic Variance Reduced Gradients

Freeze reference point for m iterations

SAGA: Stochastic Average Gradient unbiased version

grad estimate

Store gradient

Sample

SAGA: Stochastic Average Gradient

Stores a matrixNo inner loop, rolling update

SAG: Stochastic Average Gradient (Biased version)

grad estimate

Store gradient

Sample

SAG: Stochastic Average Gradient

EXE: Introduce a variable . Re-write the SAG algorithm so G is updated efficiently at each iteration.

SAG: Stochastic Average Gradient

Stores a matrixVery easy to implement

EXE: Introduce a variable . Re-write the SAG algorithm so G is updated efficiently at each iteration.

The Stochastic Average Gradient

How to prove this converges? Is this the only option?

Stochastic Gradient Descent α =0.5

Convergence Theorems

Smoothness + convexity

Strong Convexity

Assumptions for Convergence

Convergence SVRG

Theorem

In practice use

Johnson, R. & Zhang, T. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, NIPS 2013

Convergence SAG

Theorem SAG

A practical convergence result!

M. Schmidt, N. Le Roux, F. Bach (2016)Mathematical Programming Minimizing Finite Sums with the Stochastic Average Gradient.

Because of biased gradients, difficult proof that relies on computer assisted steps

Convergence SAGA

Theorem SAGA

An even more practical convergence result!

A. Defazio, F. Bach and J. Lacoste-Julien (2014)NIPS, SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives.

Much easier proof due to unbiased gradients

Comparisons in complexity for strongly convex

Gradient descent

Approximate solution

SVRG/SAGA/SAG

Variance reduction faster than GD when

How did I get these complexity results from the convergence results?

Section 1.3.5, R.M. Gower, Ph.d thesis: Sketch and Project: Randomized Iterative Methods for Linear Systems and Inverting Matrices University of Edinburgh, 2016

Practicals implementation of SAG for Linear Classifiers

L2 regularizor + linear hypothesis

Nonlinear in w

Linear in w

Only store real number

Nonlinear in w

Linear in w

Stoch. gradient estimate

Full gradient estimate

Reduce Storage to O(n)

Take for home Variance Reduction

● Variance reduced methods use only one stochastic gradient per iteration and converge linearly on strongly convex functions

● Choice of fixed stepsize possible

● SAGA only needs to know the smoothness parameter to work, but requires storing n past stochastic gradients

● SVRG only has O(d) storage, but requires full gradient computations every so often. Has an extra “number of inner iterations” parameter to tune

Proving Convergence of SVRG

Proof:

Unbiased estimatorTaking expectation with respect to j

Must control this!

Smoothness Consequences ISmoothness

EXE: Lemma 1

Proof:

Smoothness

Smoothness Consequences II

EXE: Lemma 2

Proof:

Smoothness

EXE: Lemma 2

Proof:

Smoothness

EXE: Lemma 2

Proof:

Lemma 1

Bounding gradient estimate

EXE: Lemma 3

Proof:

EXE: Lemma 3

Proof:

EXE: Lemma 3

Proof:

Lemma 2

Proof:

Must control this!

Proof (continued I):

Proof (continued II):

Jensen’s inequality

Re-arranging again

Proof (continued II):

Jensen’s inequality

Re-arranging again

Stochastic Variance Reduced Gradient Methods · Convex Optimization: Algorithms and Complexity How...

Documents

Gower Street Hotels in London

1.3.5 heat exhaustion

2008 Gower Timetable

Dylan Thomas' Secret Gower

Gower Books 2009 (ROW)

PostGIS 1.3.5 Manual

Hammond Gower 2013 Catalogue

Gower e-News - Issue 11: 30th March 2012 - Explore Gower This Easter

Gower Walking Festival 2012

Gower Walking Festival Brochure 2014

Gower Rock

Discovery Day Gower

Release 1.3.5 Jonathan Peoples, Colton Williams

Gower Activity Centres

Gower Ambassadors...1.3.5 Ambassador project branding and promotion 1.3 Results of consultations with ambassadors 1.3.1 Pilot Project - Lessons learnt 1.4 Current status of the Pilot

Gower Coast Path West - visitswanseabay.com

MIRtoolbox UserGuide 1.3.5

1.3.5. COFOG nomenklatura_hr.pdf

SALT MAP BY ROWEN GOWER

Smart genie ver 1.3.5-ppsx3