Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning. Part I
Katya Scheinberg, jointly with Frank Curtis
INFORMS Tutorial, 10/23/17
Source: coral.ise.lehigh.edu/frankecurtis/files/talks/informs_tutorial_17.pdf


Page 1: Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning. Part I

Katya Scheinberg, jointly with Frank Curtis
INFORMS Tutorial, 10/23/17

Page 2: ML applications

• Computer vision
• Machine translation
• Speech recognition
• Text categorization
• Recommender systems
• Ranking web search results
• Next word prediction
• Video content classification
• Anomaly detection

Page 3: Popular applications of deep learning models

Page 4: Data on Google scale today

• ImageNet Large Scale Visual Recognition Competition: 1.2 million 224x224 images
• 300 hours of video are uploaded to YouTube every minute! 819,417,600 hours of video total > 93,000 years.
• Google's big initiative is: next billion users.
• Next word prediction in texts.
• Machine translation
• Image classification and recognition

Page 5: LEARNING PROBLEM, SETUP

Page 6: Supervised learning problem

For example, x is the image of a letter and y is the letter label. What should p(w,x) be?

Page 7: Binary classification problem, y ∈ {-1,+1}

Two sets of labeled points (figure: '+' and '-' classes).

Page 8: Linear classifier, y ∈ {-1,+1}

Like this: (figure: a linear classifier separating the '+' and '-' points).

Page 9: Binary Classification Objective

Expected risk: ideal objective. Usually an intractable problem.

Page 10: Binary Classification Objective

Expected risk: ideal objective. Empirical risk: realizable objective. Finite, but an NP-hard problem.

Page 11: Handling outliers, logistic regression

(figure: '+' and '-' classes)

Page 12: Logistic Regression Model

Expected loss: ideal objective. Empirical loss: realizable objective.

This is a convex function when p(w,x) is linear in w.
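
As a concrete stand-in for the loss formulas (which are not reproduced in this transcript), a minimal numpy sketch of the standard empirical logistic loss and its gradient for a linear predictor p(w, x) = w^T x with labels y_i in {-1, +1}; the data arrays X, y are hypothetical.

import numpy as np

def logistic_loss(w, X, y):
    # Empirical loss: (1/n) sum_i log(1 + exp(-y_i * w^T x_i))
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))    # numerically stable log(1 + exp(.))

def logistic_grad(w, X, y):
    # Gradient of the empirical logistic loss with respect to w
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))           # derivative of each term w.r.t. its margin, times y_i
    return X.T @ coeff / X.shape[0]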

Pages 13-15: Linear classifier, generalization

(figures: labeled '+' and '-' points)

Pages 16-21: Learning guarantees

The problem is well defined, even if it may be difficult to solve. However, we are really interested in how … behaves on the unseen data, and in how … and … relate. Here … is a measure of complexity of the class of predictors.

Page 22: Learning guarantees via Vapnik-Chervonenkis (VC) dimension

• VC(p(·,·)), the VC dimension of a set of classifiers, is the maximum number of points x such that any labeling can be separated by a classifier from this set.

Page 23: Vapnik-Chervonenkis (VC) dimension, example

Page 24: Learning guarantees via Vapnik-Chervonenkis (VC) dimension

• VC(p(·,·)), the VC dimension of a set of classifiers, is the maximum number of points x such that any labeling can be separated by a classifier from this set.
• VC(linear classifiers) = m+1
• Conclusion: a large dimension of w requires large data sets.

Page 25: OPTIMIZATION METHODS FOR LOGISTIC REGRESSION

Pages 26-27: Gradient descent with line search

Convergence rate O(log(1/ε)) (strongly convex case).
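
A minimal sketch (not from the slides) of gradient descent with a backtracking (Armijo) line search; f and grad_f are assumed callables such as the logistic loss and gradient above, and the constants 1e-4 and 0.5 are conventional illustrative choices.

import numpy as np

def gradient_descent(f, grad_f, w0, tol=1e-6, max_iter=1000):
    # Gradient descent with a backtracking line search
    w = w0.copy()
    for _ in range(max_iter):
        g = grad_f(w)
        if np.linalg.norm(g) <= tol:
            break
        fw, alpha = f(w), 1.0
        # Backtrack until the sufficient-decrease (Armijo) condition holds
        while f(w - alpha * g) > fw - 1e-4 * alpha * (g @ g):
            alpha *= 0.5
        w = w - alpha * g
    return w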

Pages 28-29: Stochastic Gradient Descent

Choose a subset S_k of {1, …, n} uniformly at random.

Work per iteration is O(sm) << O(nm), but convergence is sensitive to η_k and s.

Convergence rate O(1/ε) (strongly convex case).
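
A sketch of the minibatch stochastic gradient iteration, assuming a hypothetical helper grad_batch(w, Sk) that returns the average gradient over the sample indices Sk; the diminishing step-size schedule η_k = eta0/(1+k) is one common choice, not prescribed by the slides.

import numpy as np

def sgd(grad_batch, w0, n, s=32, eta0=0.1, max_iter=10000, seed=0):
    # Each iteration samples S_k, a size-s subset of {0, ..., n-1}, uniformly at random
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for k in range(max_iter):
        Sk = rng.choice(n, size=s, replace=False)
        eta_k = eta0 / (1.0 + k)             # diminishing step size
        w = w - eta_k * grad_batch(w, Sk)    # step along the sampled gradient
    return w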

Page 30: Stochastic Gradient Descent with Momentum

Work per iteration is O(sm), less sensitive to η, but no convergence theory.
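
A sketch of a momentum (heavy-ball style) variant of the loop above; beta is the momentum constant and, like the other parameters, is an illustrative choice rather than a value from the slides.

import numpy as np

def sgd_momentum(grad_batch, w0, n, s=32, eta=0.01, beta=0.9, max_iter=10000, seed=0):
    # Accumulate a velocity v as an exponentially weighted sum of sampled gradients
    rng = np.random.default_rng(seed)
    w, v = w0.copy(), np.zeros_like(w0)
    for _ in range(max_iter):
        Sk = rng.choice(n, size=s, replace=False)
        v = beta * v + grad_batch(w, Sk)
        w = w - eta * v                      # step along the momentum direction
    return w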

Page 31: Stochastic Variance Reduced Gradient (SVRG) method

Outer loop / inner loop structure. [Johnson & Zhang, 2013]
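
A sketch of SVRG's outer/inner loop structure, assuming hypothetical helpers grad_i(w, i) (gradient of the i-th loss term) and grad_full(w) (full gradient); the inner-loop length and step size are illustrative.

import numpy as np

def svrg(grad_i, grad_full, w0, n, eta=0.01, m=None, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    m = m if m is not None else 2 * n       # inner-loop length
    w_snap = w0.copy()
    for _ in range(epochs):                 # outer loop
        mu = grad_full(w_snap)              # full gradient at the snapshot
        w = w_snap.copy()
        for _ in range(m):                  # inner loop
            i = rng.integers(n)
            v = grad_i(w, i) - grad_i(w_snap, i) + mu   # variance-reduced gradient estimate
            w = w - eta * v
        w_snap = w                          # take the last inner iterate as the new snapshot
    return w_snap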

Page 32: SARAH (momentum version of SVRG)

Outer loop / inner loop structure. [Nguyen, Liu, Scheinberg & Takac, 2017]
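
For comparison, a sketch of SARAH's loops: unlike SVRG, the inner gradient estimate is updated recursively from the previous iterate rather than recentered on a fixed snapshot; helper names and parameters are the same illustrative ones as above.

import numpy as np

def sarah(grad_i, grad_full, w0, n, eta=0.01, m=None, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    m = m if m is not None else 2 * n
    w = w0.copy()
    for _ in range(epochs):                 # outer loop
        v = grad_full(w)                    # start the recursion from a full gradient
        w_prev, w = w, w - eta * v
        for _ in range(m):                  # inner loop
            i = rng.integers(n)
            v = grad_i(w, i) - grad_i(w_prev, i) + v    # recursive gradient estimate
            w_prev, w = w, w - eta * v
    return w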

Page 33: Regular SG vs. Momentum SG

(figure: numerical comparison)

Page 34: Convergence rates comparisons

For strongly convex functions, κ is the condition number. For convex functions: …

Pages 35-38: What's so good about the stochastic gradient method?

Strongly convex case and convex case. [Bousquet and Bottou '08]

Page 39: Is stochastic gradient the best we can do? And what about nonconvex problems...? To be continued by Frank Curtis.

Page 40: Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning, Part II

Frank E. Curtis, Lehigh University
joint work with
Katya Scheinberg, Lehigh University

INFORMS Annual Meeting, Houston, TX, USA
23 October 2017

Pages 41-42: Outline

Deep Neural Networks
Nonconvex Optimization
Second-Order Methods
Thanks

Pages 43-46: What is a neural network?

• A computer brain (artificial intelligence!)
• A computational graph ... defined using neuroscience jargon (e.g., node ≡ neuron)
• A function! ... defined by some parameters

(figures: a small graph computing 0.5·x1 + 0.5·x2 from inputs x1, x2, x3; and x (input) → (deep) neural network with parameters w → p(w, x) (output))

http://www.bbc.com/future/tags/brain
https://www.extremetech.com/extreme/151696-ibm-on-track-to-building-artificial-synapses

Page 47: Learning

Neural networks do not learn on their own.

• In supervised learning, we train them by giving them inputs...
• ...and use optimization to better match their outputs to known outputs.
• (Afterwards, we hope they give the right outputs when those are unknown!)

We optimize the parameters ≡ weights ≡ decision variables.

Pages 48-51: Feed forward network, fully connected

(figure: input layer x1, ..., x5; hidden layers h11, ..., h14 and h21, ..., h24; output layer p1, p2, p3; weight matrices W1, W2, W3)

h1 = s1(W1 x + ω1)

p(w, x) = s3(W3(s2(W2(s1(W1 x + ω1)) + ω2)) + ω3)

Page 52: Training

As before, we have an optimization problem of the form

min_{w ∈ W} E[ℓ(p(w, x), y)]

or, with training data, of the form

min_{w ∈ W} (1/n) Σ_{i=1}^{n} ℓ(p(w, x_i), y_i)

where p(w, x) = s3(W3(s2(W2(s1(W1 x + ω1)) + ω2)) + ω3).
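
A minimal numpy sketch of this forward pass and the empirical objective; the biases b_l play the role of the ω_l terms above, and the tanh hidden activations and identity output are assumptions since the slides leave s1, s2, s3 generic.

import numpy as np

def forward(params, x):
    # p(w, x) = s3(W3(s2(W2(s1(W1 x + omega1)) + omega2)) + omega3)
    W1, b1, W2, b2, W3, b3 = params
    h1 = np.tanh(W1 @ x + b1)       # s1
    h2 = np.tanh(W2 @ h1 + b2)      # s2
    return W3 @ h2 + b3             # s3 taken to be the identity here

def empirical_loss(params, X, Y, loss):
    # (1/n) sum_i loss(p(w, x_i), y_i)
    return np.mean([loss(forward(params, x), y) for x, y in zip(X, Y)])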

Pages 53-54: Example: Image classification

Humans can easily determine digits/letters from arrangements of pixels ... for the most part. (I'm told there's a number there!)

Bottou et al., Optimization Methods for Large-Scale Machine Learning, SIAM Review (to appear)
https://colormax.org/color-blind-test/

Pages 55-56: Convolutional neural networks

A modern tool for image classification is a convolutional neural network (CNN).

• These work by trying to capture spatial relationships between input values.
• For example, in the example below, a filter is applied (computing the sum of elementwise products) to look for a diagonal pattern (see the numpy sketch after this list).

Input (4x4):      Filter (2x2):     Output (3x3):
1 0 9 2           1 0                9  0 17
2 8 0 8           0 1                3 15  0
9 1 7 0                             17  1  9
1 8 0 2

• Here, the data is a matrix, but these operations can be translated to vector operations.
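
The filter computation above can be reproduced directly; a short numpy sketch of the sum-of-elementwise-products sliding window with the 2x2 diagonal filter:

import numpy as np

X = np.array([[1, 0, 9, 2],
              [2, 8, 0, 8],
              [9, 1, 7, 0],
              [1, 8, 0, 2]])
F = np.array([[1, 0],
              [0, 1]])     # filter looking for a diagonal pattern

out = np.array([[np.sum(X[i:i+2, j:j+2] * F) for j in range(3)]
                for i in range(3)])
print(out)   # [[ 9  0 17] [ 3 15  0] [17  1  9]], matching the slide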

Pages 57-58: Anjelica Huston (not Houston!)

A random filter simply blurs the data, which doesn't help ... but certain filters can reveal edges and other features. (There are plenty of Python tools for playing around like this.)

https://www.filmcomment.com/article/interview-anjelica-huston

Page 59: Large-scale network

A full large-scale network involves various other components/tools:

• rectification, normalization, pooling, regularization, etc.

This network involves over 60 million parameters. Need good algorithms!

Bottou et al., Optimization Methods for Large-Scale Machine Learning, SIAM Review (to appear)

Page 60: Recurrent neural networks

These try to capture temporal relationships between input values.

(figure: inputs x1, x2, x3 each fed through a (deep) neural network with parameters w, producing outputs p1, p2, p3)

• Video classification, speech recognition, text classification, etc.

Page 61: Outline

Deep Neural Networks
Nonconvex Optimization
Second-Order Methods
Thanks

Page 62: Back propagation

How do we optimize?

• Same as always!
• Compute derivatives, but how?
• Back propagation, i.e., automatic differentiation (a small hand-coded sketch follows below).
• Then we need an optimization algorithm in which to use them.

Main challenges:

• A "full gradient" involves a loop over all data, which is expensive ... so consider stochastic methods, as previously mentioned.
• However, these problems are large-scale and nonconvex.
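
To make "compute derivatives" concrete, a hand-written reverse-mode (back propagation) sketch for a single hidden layer with tanh activation and a squared loss; this architecture and loss are illustrative assumptions, not the slides' example.

import numpy as np

def loss_and_grads(W1, b1, W2, b2, x, y):
    # Forward pass
    z1 = W1 @ x + b1
    h1 = np.tanh(z1)
    p = W2 @ h1 + b2
    loss = 0.5 * np.sum((p - y) ** 2)
    # Backward pass: propagate derivatives in reverse order of the forward computations
    dp = p - y                          # d loss / d p
    dW2 = np.outer(dp, h1)
    db2 = dp
    dh1 = W2.T @ dp
    dz1 = dh1 * (1.0 - h1 ** 2)         # tanh'(z1) = 1 - tanh(z1)^2
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return loss, (dW1, db1, dW2, db2)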

Page 63: Global minima, local minima, saddle points, etc.

(figure: a one-dimensional f(x) with a strict local minimum, local minima, and a strict global minimum)

• These textbook illustrations might be misleading.
• The "landscape" of the objective function defined by a deep neural network is something of great interest these days.

https://upload.wikimedia.org/wikipedia/commons/4/40/Saddle point.png

Pages 64-65: How to optimize?

It is not clear where a gradient-based method might converge.

• However, (stochastic) gradient-based methods seem to work well!
• They provably avoid saddle points with high probability
• ... and often converge to "good" stationary points.

Open questions:

• How to characterize the behavior of different methods?
• How to characterize the generalization properties of solutions?
• What algorithms are the most effective at finding points with good generalization properties?

We will not answer these; instead, we'll simply describe/motivate some methods.

Page 66: Outline

Deep Neural Networks
Nonconvex Optimization
Second-Order Methods
Thanks

Pages 67-68: First- versus (quasi-)second-order

First-order methods follow a steepest descent methodology:

w_{k+1} ← w_k - α_k ∇f(w_k)

Second-order methods follow Newton's methodology:

w_{k+1} ← w_k - α_k [∇²f(w_k)]^{-1} ∇f(w_k),

which one should view as minimizing a quadratic model of f at w_k:

f(w_k) + ∇f(w_k)^T (w - w_k) + (1/2)(w - w_k)^T ∇²f(w_k) (w - w_k)

One might also replace the Hessian with an approximation H_k with inverse M_k, giving

w_{k+1} ← w_k - α_k M_k ∇f(w_k)

and the quadratic model f(w_k) + ∇f(w_k)^T (w - w_k) + (1/2)(w - w_k)^T H_k (w - w_k).
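
A sketch of the two update rules in numpy, assuming callables grad_f and hess_f; in practice one solves the Newton system rather than forming an explicit inverse, and hess_f could equally return an approximation H_k.

import numpy as np

def gradient_step(w, alpha, grad_f):
    # First-order: w_{k+1} = w_k - alpha_k * grad f(w_k)
    return w - alpha * grad_f(w)

def newton_step(w, alpha, grad_f, hess_f):
    # Second-order: w_{k+1} = w_k - alpha_k * [hess f(w_k)]^{-1} grad f(w_k),
    # i.e., a step toward the minimizer of the local quadratic model at w_k
    d = np.linalg.solve(hess_f(w), grad_f(w))
    return w - alpha * d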

Page 69: Why second-order?

Second-order methods are expensive!

• Yes, but judicious use of second-order information can help
• ... and the resulting methods can be made nearly as cheap as SG.

Overall, there are various ways to improve upon SG...

Pages 70-71: What can be improved?

(schematic: starting from the stochastic gradient method, one can aim for a better rate, a better constant, or both)

Page 72: Two-dimensional schematic of methods

(schematic: stochastic gradient, batch gradient, stochastic Newton, and batch Newton placed on two axes: noise reduction and second-order)

Page 73: 2D schematic: Noise reduction methods

Along the noise reduction axis, from stochastic gradient toward batch gradient: dynamic sampling, gradient aggregation, iterate averaging.

Page 74: 2D schematic: Second-order methods

Along the second-order axis, from stochastic gradient toward stochastic Newton: diagonal scaling, natural gradient, Gauss-Newton, quasi-Newton, Hessian-free Newton.

Pages 75-77: So, why second-order?

Traditional motivation: fast local convergence guarantees

• Hard to achieve in large-scale stochastic settings

Recent motivation (last few years): better complexity properties

• Many are no better than first-order methods in terms of complexity
• ... and ones with better complexity aren't necessarily best in practice (yet)

Other reasons?

• Adaptive, natural scaling (gradient descent step size ≈ 1/L while Newton ≈ 1)
• Mitigate effects of ill-conditioning
• Easier to tune parameters(?)
• Better at avoiding saddle points(?)
• Better trade-off in parallel and distributed computing settings
• New algorithms! Not analyzing the same old

Page 78: Framework #1: Matrix-free (Gauss-)Newton

Compute each step by applying an iterative method to solve

H_k s_k = -g_k

potentially with regularization, within a trust region, etc.

This can be computationally efficient since

• H_k can be defined by a subsample of data.
• Matrix-vector products can be computed without forming the matrix ... using similar principles as in back propagation.
• The linear system need not be solved exactly (see the sketch below).
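
A sketch of the matrix-free idea: approximately solve H_k s_k = -g_k with conjugate gradients using only Hessian-vector products hvp(v), never forming H_k; the truncation tolerance and the negative-curvature exit are illustrative choices.

import numpy as np

def newton_cg_step(g, hvp, max_cg_iter=50, tol=1e-4):
    # Conjugate gradient on H s = -g, given only the map v -> H v
    s = np.zeros_like(g)
    r = -g.copy()                        # residual of H s = -g at s = 0
    p = r.copy()
    for _ in range(max_cg_iter):
        if np.linalg.norm(r) <= tol * np.linalg.norm(g):
            break                        # inexact solve is acceptable
        Hp = hvp(p)
        pHp = p @ Hp
        if pHp <= 0.0:
            break                        # negative curvature: stop (nonconvex case)
        alpha = (r @ r) / pHp
        s = s + alpha * p
        r_new = r - alpha * Hp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return s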

Page 79: Quasi-Newton (deterministic setting)

Only approximate second-order information with gradient displacements:

(figure: iterates x_k and x_{k+1} on a curve)

Secant equation H_k y_k = s_k to match the gradient of f at w_k, where

s_k := w_{k+1} - w_k  and  y_k := ∇f(w_{k+1}) - ∇f(w_k)

Page 80: Framework #2: Quasi-Newton

How can this idea be adapted to the stochastic setting?

• Idea #1: Replace y_k by a displacement using the same sample, i.e., ∇_{S_k} f(w_{k+1}) - ∇_{S_k} f(w_k). (This doubles the number of stochastic gradients, but maybe worthwhile? See the sketch below.)
• Idea #2: Replace y_k by the action of a (subsampled) Hessian, i.e., ∇²_{S_k^H} f(w_{k+1}) s_k. (This requires matrix-vector products with a Hessian.)
• ... other ideas?
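
A small sketch of Idea #1: build the quasi-Newton curvature pair with the same sample S_k on both iterates, using the hypothetical helper grad_batch(w, Sk) from the earlier sketches.

def curvature_pair_same_sample(grad_batch, w_prev, w_new, Sk):
    # s_k = w_{k+1} - w_k,  y_k = grad_{S_k} f(w_{k+1}) - grad_{S_k} f(w_k)
    s_k = w_new - w_prev
    y_k = grad_batch(w_new, Sk) - grad_batch(w_prev, Sk)
    return s_k, y_k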

Page 81: Outline

Deep Neural Networks
Nonconvex Optimization
Second-Order Methods
Thanks

Page 82: OptML @ Lehigh

Please visit the OptML @ Lehigh website!

• http://optml.lehigh.edu