
DNN: Final Presentation

Ryan Jeong, Will Barton, Maricela Ramirez

Summer@ICERM

June-July 2020


Outline

1 Deep Learning/Neural Networks Fundamentals

2 Universal Approximation Results with Neural Nets

3 Piecewise Linear Networks

4 On Expressivity vs. Learnability


Motivation

Figure sources: Jay Alammar, How GPT3 Works; European Go Federation


DL Fundamentals

Figure: Source: Virtual Labs, Multilayer Feedforward networks

General idea of a feedforward network:

$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$

$a^{(l)} = \mathrm{ReLU}(z^{(l)})$
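As a concrete illustration of the layer update above, here is a minimal NumPy sketch of a forward pass; the layer sizes and random weights are arbitrary placeholders, not the networks used later in the talk.

```python
import numpy as np

def relu(z):
    # Elementwise ReLU nonlinearity
    return np.maximum(z, 0.0)

def forward(x, weights, biases):
    """Iterate z^(l) = W^(l) a^(l-1) + b^(l), a^(l) = ReLU(z^(l))."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = relu(z)
    return a

# Toy example: 4-dimensional input, hidden layers of width 5 and 3
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5))]
biases = [np.zeros(5), np.zeros(3)]
print(forward(rng.normal(size=4), weights, biases))
```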


DL Fundamentals

Activation function

Cost function

Minimize the cost function through backpropagation.

Gradient Descent:

$\dfrac{\partial C_0}{\partial w^{(l)}} = \dfrac{\partial C_0}{\partial a^{(l)}} \cdot \dfrac{\partial a^{(l)}}{\partial z^{(l)}} \cdot \dfrac{\partial z^{(l)}}{\partial w^{(l)}}$

$\dfrac{\partial C_0}{\partial b^{(l)}} = \dfrac{\partial C_0}{\partial a^{(l)}} \cdot \dfrac{\partial a^{(l)}}{\partial z^{(l)}} \cdot \dfrac{\partial z^{(l)}}{\partial b^{(l)}}$

Stochastic Gradient Descent
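A minimal sketch of these chain-rule gradients and an SGD step for a single ReLU layer with squared-error cost; the learning rate and dimensions are illustrative assumptions, not values from our experiments.

```python
import numpy as np

def sgd_step(W, b, x, y, lr=0.1):
    # Forward pass: z = Wx + b, a = ReLU(z), cost C0 = ||a - y||^2 / 2
    z = W @ x + b
    a = np.maximum(z, 0.0)
    # Backward pass via the chain rule
    dC_da = a - y                   # dC0/da
    da_dz = (z > 0).astype(float)   # ReLU derivative
    delta = dC_da * da_dz           # dC0/dz
    dC_dW = np.outer(delta, x)      # dC0/dW, since dz/dW involves x
    dC_db = delta                   # dC0/db, since dz/db = 1
    # Gradient descent update
    return W - lr * dC_dW, b - lr * dC_db

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), np.zeros(3)
x, y = rng.normal(size=4), rng.normal(size=3)
W, b = sgd_step(W, b, x, y)
```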


Convolutional Networks

Motivation: computational issues with feedforward networks

Figure: Source: missinglink.ai

For image recognition problems, to which CNNs are largely applied:

Convolution layer: slide a small kernel of fixed weights along the image, performing the same computation every time; an activation is performed on the output of a convolution layer.

Pooling layer: divides the result of the convolution into regions and computes a function on each one (usually just the max); intuitively, this summarizes information and compresses dimension.

Finishes with fully connected layers.
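A minimal PyTorch sketch of this pattern (convolution + pooling blocks followed by fully connected layers). The channel counts and layer widths here are illustrative assumptions, not necessarily the exact architecture we trained.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Two (convolution + max pooling) blocks, then three fully connected layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # slide a small kernel over the image
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling summarizes each 2x2 region
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 128), nn.ReLU(),       # 28x28 input -> 7x7 after two poolings
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_classes),                  # output layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example: a batch of four 28x28 grayscale images (MNIST-sized)
logits = SmallCNN()(torch.randn(4, 1, 28, 28))
print(logits.shape)  # torch.Size([4, 10])
```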


SVD classification model

Initial model from linear algebraic techniques as a baseline

MNIST dataset: handwritten digit classification for digits 0 to 9

Split the data into 10 classes, by the label of each data point

Compute the SVD for each of the 10 new design matrices; the number of columns kept is a hyperparameter; this yields a matrix $U_c$ for class $c$.

Key idea behind the algorithm: PCA can be computed via the SVD.

PCA: finding a lower-dimensional representation of the data that preserves the most variance.

The basis vectors come from the SVD.

Classification objective: $\arg\min_{c=0,1,\dots,9} \; \| x - U_c (U_c^T x) \|_2$
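A minimal NumPy sketch of this baseline, assuming each class's training images are stacked as columns of a design matrix; the number of singular vectors kept (`k`) is the hyperparameter mentioned above.

```python
import numpy as np

def fit_class_bases(class_matrices, k=20):
    """For each class, keep the first k left singular vectors as U_c."""
    bases = []
    for A in class_matrices:          # A: (num_pixels, num_examples_in_class)
        U, _, _ = np.linalg.svd(A, full_matrices=False)
        bases.append(U[:, :k])
    return bases

def classify(x, bases):
    """argmin_c || x - U_c (U_c^T x) ||_2: pick the class whose subspace reconstructs x best."""
    residuals = [np.linalg.norm(x - U @ (U.T @ x)) for U in bases]
    return int(np.argmin(residuals))

# Toy usage with random data standing in for the 10 MNIST classes (784 = 28*28 pixels)
rng = np.random.default_rng(0)
fake_classes = [rng.normal(size=(784, 100)) for _ in range(10)]
bases = fit_class_bases(fake_classes, k=20)
print(classify(rng.normal(size=784), bases))
```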


Results from SVD Classification Algorithm


Neural networks for MNIST

Feedforward neural networks: different architectures. Best performance: one hidden layer with 128 units, trained over 12 epochs, reaching 97.8% accuracy (see the sketch below).

Convolutional neural network: two convolutional layers (convolution + max pooling) and three fully connected layers (one of them being the output).

The CNN was the best model, achieving 98.35 percent accuracy after 3 epochs (nearing 99 percent accuracy after roughly 10 epochs).
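A sketch of the best-performing feedforward configuration (one hidden layer of 128 units, 12 epochs); the optimizer, learning rate, and batch here are assumptions for illustration, since the slide only fixes the architecture, epoch count, and accuracy.

```python
import torch
import torch.nn as nn

# One hidden layer with 128 units, as in the best feedforward run
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # optimizer choice is an assumption
loss_fn = nn.CrossEntropyLoss()

# Stand-in batch (replace with MNIST batches from a DataLoader in practice)
images = torch.randn(64, 1, 28, 28)
labels = torch.randint(0, 10, (64,))

for epoch in range(12):                      # 12 epochs, matching the slide
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```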


Feedforward Network Results


Motivating questions in NN approximation theory

What class of functions can be approximated/expressed by a standard feedforward neural network of depth k?

How do we think about the shift from sigmoidal functions to rectified activations, which perform well in practice?

What subset of all functions that are provably approximable by neural networks are actually learnable in practice?


Classical Results: Shallow Neural Networks as Universal Approximators

Definition: let $\epsilon$-approximation of a function $f(x)$ by another function $F(x)$ with a shared domain $X$ denote that, for arbitrary $\epsilon > 0$,

$\sup_{x \in X} |f(x) - F(x)| < \epsilon$

Definition: call a function $\sigma(t)$ sigmoidal if

$\sigma(t) \to \begin{cases} 1 & \text{as } t \to \infty \\ 0 & \text{as } t \to -\infty \end{cases}$

Cybenko (1989): Shallow neural networks (one hidden layer) with continuous sigmoidal activations are universal approximators, i.e. capable of $\epsilon$-approximating any continuous function defined on the unit hypercube.
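As a toy illustration of the flavor of Cybenko's result (not a proof), the sketch below fits a one-hidden-layer network with sigmoid activations to a continuous target on $[0, 1]$; the target function, hidden width, optimizer, and step count are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# One hidden layer of sigmoids fit to a continuous function on [0, 1]
torch.manual_seed(0)
x = torch.linspace(0, 1, 256).unsqueeze(1)
y = torch.sin(4 * torch.pi * x)              # arbitrary continuous target

net = nn.Sequential(nn.Linear(1, 50), nn.Sigmoid(), nn.Linear(50, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(3000):
    opt.zero_grad()
    loss = ((net(x) - y) ** 2).mean()
    loss.backward()
    opt.step()

print(f"sup-norm error on the grid: {(net(x) - y).abs().max().item():.3f}")
```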


Classical Results: Shallow Neural Networks as Universal Approximators

Loosening the restrictions on the activation function while still yielding the notion of $\epsilon$-approximation:

Hornik (1990): extension to any continuous and bounded activation function; the supported domain extends beyond just the unit hypercube.

Leshno (1993): extension to nonpolynomial activation functions.

By extension, deeper neural networks of depth $k$ also enjoy the same theoretical guarantees.


Improved Expressivity Results with Depth

Let $k$ denote the depth of a network. The following results are due to Rolnick and Tegmark (2018).

Let $f(x)$ be a multivariate polynomial of finite degree $d$, and let $N(x)$ be a network whose nonlinear activation has nonzero Taylor coefficients up to degree $d$. Then there exists a number of neurons $m_k(f)$ that suffices to $\epsilon$-approximate $f(x)$, where $m_k$ is independent of $\epsilon$.

Let $f(x) = x_1^{r_1} x_2^{r_2} \cdots x_n^{r_n}$ be a monomial, let the network $N(x)$ have a nonlinear activation with nonzero Taylor coefficients up to degree $2d$, and let $m_k(f)$ be defined as above. Then $m_1(f)$ is exponential, but becomes linear in a log-depth network:

$m_1(f) = \prod_{i=1}^{n} (r_i + 1)$

$m(f) \le \sum_{i=1}^{n} \left( 7 \lceil \log_2(r_i) \rceil + 4 \right)$
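A quick numerical comparison of these two neuron counts for an example monomial; the particular exponents are arbitrary.

```python
import math

def shallow_count(exponents):
    # m_1(f) = prod_i (r_i + 1): neurons needed by a single hidden layer
    return math.prod(r + 1 for r in exponents)

def deep_bound(exponents):
    # m(f) <= sum_i (7*ceil(log2 r_i) + 4): bound achievable with a log-depth network
    return sum(7 * math.ceil(math.log2(r)) + 4 for r in exponents)

exponents = [8, 8, 8, 8]          # f(x) = x1^8 x2^8 x3^8 x4^8 (illustrative)
print(shallow_count(exponents))   # 6561: exponential in the exponents
print(deep_bound(exponents))      # 100: linear-sized bound for a deeper network
```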


ReLU Networks as Partitioning Input Space

The ReLU activation function is used between affine transformations to introduce nonlinearities in the learned function.


Review: ReLU Networks as Partitioning Input Space

Known theorems:

ReLU networks are in bijection with the appropriate class of piecewise linear functions (up to isomorphism of the network).

For one-hidden-layer neural networks ($p$-dimensional input, $q$-dimensional hidden layer), one can combinatorially show an upper bound on the number of piecewise linear regions:

$r(q, p) = \sum_{i=0}^{p} \binom{q}{i}$

Montufar et al. (2014): The maximal number of linear regions of the functions computed by a NN with $n_0$ input units and $L$ hidden layers, with $n_i \ge n_0$ rectifiers at the $i$th layer, is lower bounded by

$\left( \prod_{i=1}^{L-1} \left\lfloor \frac{n_i}{n_0} \right\rfloor^{n_0} \right) \left( \sum_{j=0}^{n_0} \binom{n_L}{j} \right)$

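A small sketch evaluating both bounds above; the layer widths chosen are arbitrary examples.

```python
from math import comb, floor, prod

def shallow_upper_bound(q, p):
    # r(q, p) = sum_{i=0}^{p} C(q, i): max regions for one hidden layer of width q, input dim p
    return sum(comb(q, i) for i in range(p + 1))

def montufar_lower_bound(n0, widths):
    # (prod_{i=1}^{L-1} floor(n_i / n0)^{n0}) * (sum_{j=0}^{n0} C(n_L, j))
    *hidden, nL = widths
    return prod(floor(n / n0) ** n0 for n in hidden) * sum(comb(nL, j) for j in range(n0 + 1))

print(shallow_upper_bound(q=20, p=2))                   # 211 regions at most for one hidden layer
print(montufar_lower_bound(n0=2, widths=[20, 20, 20]))  # 2110000: deep lower bound is far larger
```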


Visualizing the hyperplanes to depth k

First layer: hyperplanes through input space

All subsequent layers: "bent hyperplanes" that bend at the established bent hyperplanes of previous layers

Figure 1 of Raghu et al. (2017)


Expressibility vs. Learnability: Theorems

The following is from the work of Hanin and Rolnick (2019a,b) and concerns ReLU networks. The results are stated informally.

For any line segment through input space, the average number of regions intersected is linear, not exponential, in the number of neurons.

Both the number of regions and the distance to the nearest region boundary stay roughly constant during training.


Expressibility vs. Learnability: Graphs

Figure: Normalization by the squared number of neurons. (a) Over epochs. (b) Over test accuracy.

Across different networks, the number of piecewise linear regions is $O(n^2)$ in the number of neurons, and this does not change with greater depth.

Upshot: the empirical success of depth on certain problems is not because deep nets learn a complex function inaccessible to shallow networks.


Bridging the Gap

Two next questions (in the context of ReLU networks):

Are networks that learn an exponential number of linear regions "usable", or are they purely a theoretical guarantee?

How can we adjust the traditional deep learning pipeline to allow learning of piecewise linear functions in the exponential regime?


Sawtooth functions/triangular wave functions

A type of function that achieves a number of regions exponential in the number of nodes/depth.

Figure: Sawtooth functions. (a) No perturbation. (b) Random perturbation.


Sawtooth functions as feedforward neural networks

To construct a feedforward neural network that represents a sawtooth function with $2^n$ affine regions using $2n$ hidden layers:

Mirror map, defined on $[0, 1]$:

$f(x) = \begin{cases} 2x & 0 \le x \le \tfrac{1}{2} \\ 2(1 - x) & \tfrac{1}{2} \le x \le 1 \end{cases}$

As a two-layer neural network:

$f(x) = \mathrm{ReLU}\!\left( 2\,\mathrm{ReLU}(x) - 4\,\mathrm{ReLU}\!\left(x - \tfrac{1}{2}\right) \right)$

Composing the mirror map with itself $n$ times yields a sawtooth function with $2^n$ equally spaced affine components on $[0, 1]$.
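A minimal NumPy sketch of this construction: compose the ReLU form of the mirror map n times and count the affine pieces by looking at slope sign changes on a fine grid; the grid resolution is an arbitrary choice.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mirror(x):
    # Two-layer ReLU form of the mirror map on [0, 1]
    return relu(2 * relu(x) - 4 * relu(x - 0.5))

def sawtooth(x, n):
    # Compose the mirror map with itself n times
    for _ in range(n):
        x = mirror(x)
    return x

n = 4
x = np.linspace(0.0, 1.0, 100001)
y = sawtooth(x, n)
# The sawtooth alternates slope sign at each breakpoint, so counting sign changes
# of the finite-difference slope counts the affine pieces.
slopes = np.diff(y) / np.diff(x)
pieces = 1 + int(np.count_nonzero(np.diff(np.sign(slopes))))
print(pieces, 2 ** n)  # expect both to be 16
```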


Questions regarding sawtooth functions

How robust are sawtooth functions to multiplicative weight perturbation of the form $w(1 + \epsilon)$? (The perturbations $\epsilon$ are zero-mean Gaussians, and the experiments varied the variance; a sketch of this perturbation appears below.)

Can randomly initialized or perturbed networks re-learn the sawtooth function?
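A sketch of the multiplicative perturbation applied to a weight matrix, assuming NumPy weights as in the earlier sketches; the variance value shown is just one of the settings examined.

```python
import numpy as np

def perturb_multiplicative(W, variance, rng):
    """Return W * (1 + eps) elementwise, with eps ~ N(0, variance)."""
    eps = rng.normal(loc=0.0, scale=np.sqrt(variance), size=W.shape)
    return W * (1.0 + eps)

rng = np.random.default_rng(0)
W = np.array([[2.0, -4.0], [1.0, 0.0]])      # e.g. weights from a mirror-map layer
print(perturb_multiplicative(W, variance=0.05, rng=rng))
```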


Perturbing the parameters

Figure: Perturbing the parameters. (a) All weights and biases. (b) All weights. (c) All biases. (d) First layer only.


Re-learning sawtooth networks from different initializations

Figure: (a) Mean squared error. (b) Number of linear regions.


Re-learning sawtooth networks from different initializations

Figure: (c) Variance = 0.01. (d) Variance = 0.05. (e) Variance = 0.1. (f) Variance = 0.5.


Implications

"Sawtooth networks" fall apart with mild variance on the weights.

Even when starting from perturbed versions of the original function, the original sawtooth network is not retained, implying that the set of parameters yielding exponential nets in the loss landscape is localized and challenging to learn.


Learning functions in the exponential regime (sketch)

How can we adjust the DL framework to encourage a feedforward model to learn functions currently in $\mathcal{F}_{\text{express}} \setminus \mathcal{F}_{\text{learn}}$?

First attempt: add terms to the objective function that encourage the network to learn more complex functions ("anti-regularization").

For 2D-input, two-hidden-layer ReLU nets, one can think about encouraging all hyperplanes/bent hyperplanes to intersect:

One can think of an inactive ReLU unit as replacing the appropriate column of the outer weight vector by zeros (checked numerically in the sketch below):

$w_2 \, \mathrm{ReLU}(W_1 x + b_1) + b_2 = w_2' (W_1 x + b_1) + b_2$

For a fixed first-layer activation pattern, this yields a square matrix equation and a collection of inequalities, and one can add "perceptron-like" error terms to the objective function to encourage regions to intersect.
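A small NumPy check of this identity (for a single hidden layer, for simplicity): zeroing the columns of the outer weights that correspond to inactive hidden units reproduces the ReLU network's output at that input. The dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 2)), rng.normal(size=5)   # first layer: 2D input, 5 hidden units
w2, b2 = rng.normal(size=(1, 5)), rng.normal(size=1)   # outer weights

x = rng.normal(size=2)
pre = W1 @ x + b1
relu_out = w2 @ np.maximum(pre, 0.0) + b2

active = pre > 0                      # first-layer activation pattern at x
w2_prime = w2 * active                # zero the columns for inactive units
linear_out = w2_prime @ pre + b2      # affine map that agrees with the network at x

print(np.allclose(relu_out, linear_out))  # True
```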


Conclusion

Thank you for listening!
