Deep Learning: An Engineer’s Dream and a Mathematician’s Nightmare
John McKay
Information Processing & Algorithms Laboratory, Dept. of Electrical Engineering, Pennsylvania State University
Dartmouth Applied Math Seminar, 2017
Contents

1 Intro
2 Neural Network Design
    Convolutional Neural Networks
3 Training Neural Networks
    Backpropagation
    Stochastic Gradient Descent
    Optimization Issues
4 Transfer Learning
    Pretraining
    Fine Tuning
Why Give a Talk to Mathematicians on DL?

- Deep learning is dominating machine learning (image classification, object detection, text parsing, etc.).
- Industry is keeping pace with, or beating, academia in terms of breakthroughs and adoption. We need mathematicians to provide rigor to modern strategies.
Figure: Per-month publication totals of papers on arXiv, 1990–2014: “convolutional neural networks” (Computer Science; 86K articles, 946M words) vs. “topology” (Mathematics; 303K articles, 4B words; orange, provided as a reference).
Gatys et al. (2015), Taigman et al. (2014)
Courtesy Nvidia, Krause et al. (2016)
This talk will...

- Provide a comprehensive introduction to Deep Learning
- Cover most of the modern ideas and trends
- Illuminate the issues underpinning popular strategies
Traditional Supervised Image Classification

Training:
1. Input Labeled Data
   ↓
2. Extract Features
   ↓
3. Train Classifier

Training

(figure from http://www.cs.trincoll.edu)
Feature Extraction

- Popular features: scale-invariant feature transforms (SIFT), speeded-up robust features (SURF), sparse reconstruction-based classification (SRC) coefficients.
- They are hand-crafted from analytical work trying to decipher invariant (translation, rotation, scale, etc.) and discriminatory image characteristics.
- Why not have the computer do it?

Lowe (1999)
Single Hidden Layer Network

- Input x ∈ R^d, NN output a^(2):

$$a^{(2)} = f_2(f_1(x)) = \overbrace{W_2\,\underbrace{\sigma(W_1 x + b_1)}_{\text{1st layer (hidden)}} + b_2}^{\text{2nd layer (output)}}$$

- W_1, W_2 are the weights and b_1, b_2 the biases.
- σ is an activation function (more on this later).
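To make the formula concrete, here is a minimal NumPy sketch of the forward pass; the dimensions and the tanh activation are illustrative choices, not from the slides.

```python
import numpy as np

def forward(x, W1, b1, W2, b2, sigma=np.tanh):
    """Single hidden layer: a2 = W2 @ sigma(W1 @ x + b1) + b2."""
    hidden = sigma(W1 @ x + b1)    # 1st layer (hidden)
    return W2 @ hidden + b2        # 2nd layer (output, linear)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((8, 4)), np.zeros(8)   # d = 4 inputs, 8 hidden units
W2, b2 = rng.standard_normal((3, 8)), np.zeros(3)   # 3 outputs
a2 = forward(rng.standard_normal(4), W1, b1, W2, b2)
```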
Nielsen (2015)
Universal Approximation Theorem

- Universal Approximation Theorem: given enough hidden units and an activation σ from a suitable class of nonlinear functions, a NN with linear output can represent any Borel measurable function mapping one finite-dimensional space to another.
- Nice, but representation ≠ prediction.

Hornik et al. (1989), Leshno et al. (1993)
More Parameters, More Descriptiveness

Li/Karpathy/Johnson, Stanford CS231n
Activation Functions

- Feature spaces require nonlinear activation functions.
- There is no one “best” choice of activation function, to say nothing of how many layers should share one specific activation function.
- Since 2009, the go-to has been the Rectified Linear Unit (ReLU):

$$\sigma(z) = a \quad \text{such that} \quad a_i = \max(0, z_i)$$

imiloainf.wordpress.com
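For reference, a small sketch of ReLU alongside the older sigmoidal choice (purely illustrative):

```python
import numpy as np

# Common activation choices, applied elementwise.
def relu(z):    return np.maximum(0.0, z)         # max(0, z_i); the post-2009 go-to
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)

z = np.linspace(-3.0, 3.0, 7)
print(relu(z))      # zero on the negative side, identity on the positive side
print(sigmoid(z))   # smooth, saturates at both ends
```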
Output Layer

- Never mind, sigmoids are okay (at the output layer).
- For a binary problem, the sigmoid acts as a probability.
- For multinomial problems, it is popular to use softmax:

$$\mathrm{softmax}(z_i) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$

- softmax ≈ argmax function (approximately a one-hot vector output).
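A minimal sketch of softmax as written above; the max-shift is a standard numerical-stability trick, not something the slide specifies:

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j), shifted by max(z) for stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 8.0])))  # ~[0.001, 0.002, 0.997]: close to one-hot argmax
```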
Convolutional Neural Networks

- Convolutions with filters are a way to do weight sharing and exploit sparse interactions.
- Fewer weights? Less to train/save.
- Modern CNNs are the most accurate image classifiers humans have ever devised (by a bit).

LeCun et al. (1998)
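A sketch of the weight-sharing idea: one small kernel (9 weights here) is slid across the whole image, instead of a dense layer with one weight per input–output pair. The edge filter and image are made-up examples.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D cross-correlation: the same small kernel is applied at
    every spatial position (the CNN weight-sharing idea)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

edge = np.array([[1.0, 0.0, -1.0]] * 3)     # a hand-picked vertical-edge filter
img = np.zeros((6, 6)); img[:, :3] = 1.0    # left half bright, right half dark
print(conv2d(img, edge))                    # strong response along the edge
```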
Convolutional Neural Network
Li/Karpathy/Johnson, Stanford CS231n
Recurrent Neural Networks
Regularization

Li/Karpathy/Johnson, Stanford CS231n; Gal and Ghahramani (2015)
How to Train Neural Networks?

- Suppose we are given a set x_1, ..., x_n with known labels y(x_i) = y_i. Training entails tuning parameters to minimize the error between y and the model’s outputs for each x_i.
- Cost functions are a surrogate for classification error; we improve classification performance only indirectly.
- In minimizing the cost function, gradient descent methods have become the go-to strategy. We will later discuss issues with this.
- What is the gradient of a NN w.r.t. its parameters?
Backpropagation

- Calculating the gradient was a significant challenge until 1986, when Rumelhart et al. proposed backpropagation.
- What was the problem before then?
  - Nested nonlinear activation functions and nontrivial cost functions are messy and require a lot of work for slight tweaks.
  - NNs with even a little depth involve several matrix multiplications. If one were to use a finite difference scheme, the many forward passes through the network get expensive.

Rumelhart et al. (1986)
Backpropagation

- As we will show, backpropagation allows us to compute the gradient of a NN at the cost of roughly two forward passes, regardless of the architecture.
- It is essentially a systematic application of the chain rule.
Example Network: C cost function, L number of layers, training inputs x ∈ R^m (n in total), y(x) desired output.

$$a^l_j(x) = a^l_j = \sigma\Big(\sum_k w^l_{jk}\, a^{l-1}_k + b^l_j\Big), \qquad a^1_j = x_j$$

C has two requirements: it can be written as a function of the network outputs a^L, and it can be arranged as a sum of cost functions over the training samples. Example:

$$C(a^L) = \frac{1}{2n} \sum_x \|y(x) - a^L(x)\|_2^2$$

Abbreviating the weighted input as $z^l = W^l a^{l-1} + b^l$, the layer map vectorizes to

$$a^l = \sigma\big(W^l a^{l-1} + b^l\big) = \sigma(z^l)$$

Goal: find $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.

Nielsen (2015)
Recall $a^l = \sigma(W^l a^{l-1} + b^l) = \sigma(z^l)$, and let $\delta^l_j = \partial C / \partial z^l_j$.

Error in the output layer:

$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j}\, \sigma'(z^L_j) \tag{1}$$

$\delta^l$ in terms of $\delta^{l+1}$:

$$\delta^l = \big((W^{l+1})^T \delta^{l+1}\big) \odot \sigma'(z^l) \tag{2}$$

Partial w.r.t. bias:

$$\frac{\partial C}{\partial b^l_j} = \delta^l_j \tag{3}$$

Partial w.r.t. weights:

$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\, \delta^l_j \tag{4}$$
Backpropagation Algorithm

1. Input x and solve for a^1.
2. Feedforward through the network, i.e. for l = 2, ..., L find
   $$z^l = W^l a^{l-1} + b^l, \qquad a^l = \sigma(z^l)$$
3. Find $\delta^L = \nabla_{a^L} C \odot \sigma'(z^L)$.
4. Backpropagate the error: for l = L−1, L−2, ..., 2, calculate
   $$\delta^l = \big((W^{l+1})^T \delta^{l+1}\big) \odot \sigma'(z^l)$$
5. The gradients of the cost function are
   $$\frac{\partial C}{\partial b^l_j} = \delta^l_j, \qquad \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\, \delta^l_j$$

Nielsen (2015)
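The algorithm above, as a minimal NumPy sketch for one training sample and the quadratic cost C = ½‖y − a^L‖²; tanh and the 0-indexed parameter lists are implementation choices of this sketch, not from the slides.

```python
import numpy as np

def sigma(z):  return np.tanh(z)
def dsigma(z): return 1.0 - np.tanh(z) ** 2

def backprop(x, y, Ws, bs):
    """Gradients of C = 0.5 * ||y - a^L||^2 for one sample, following
    steps 1-5 above. Ws[l], bs[l] hold W^{l+1}, b^{l+1} (0-indexed)."""
    a, zs, activations = x, [], [x]
    for W, b in zip(Ws, bs):                     # steps 1-2: feedforward
        z = W @ a + b
        zs.append(z)
        a = sigma(z)
        activations.append(a)
    delta = (a - y) * dsigma(zs[-1])             # step 3: output error
    grads_W, grads_b = [None] * len(Ws), [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):         # steps 4-5: backward pass
        grads_b[l] = delta                       # eq. (3)
        grads_W[l] = np.outer(delta, activations[l])   # eq. (4)
        if l > 0:
            delta = (Ws[l].T @ delta) * dsigma(zs[l - 1])   # eq. (2)
    return grads_W, grads_b
```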
Learned Filters

Courtesy Yann LeCun
Gradient Descent

- Backpropagation → gradient.
- (Batch) gradient descent:

$$W^l = W^l - \alpha_W \frac{\partial C}{\partial W^l}, \qquad b^l = b^l - \alpha_b \frac{\partial C}{\partial b^l}$$

- Issue: C is the sum of costs associated with each training sample.
- Many problems use millions of training samples...
Stochastic Gradient Descent

- Randomly choose m < n training samples x_1, ..., x_m and form

$$C^{(m)} = \sum_{i=1}^{m} C_{x_i}$$

- Update step:

$$W^l_{k+1} = W^l_k - \alpha_W \underbrace{\frac{n}{m} \sum_{i=1}^{m} \frac{\partial C_{x_i}}{\partial W^l_k}}_{g_W}, \qquad b^l_{k+1} = b^l_k - \alpha_b \frac{n}{m} \sum_{i=1}^{m} \frac{\partial C_{x_i}}{\partial b^l_k}$$

- $g_W$ is an unbiased estimate of the true gradient: $\mathbb{E}[g_W] = \nabla_{W^l} C$.
- The learning rate α_W is typically fixed and “small” (but not “too small”).
- Hence the scare quotes in stochastic gradient “descent”: each step follows a noisy estimate of the gradient, so the cost need not decrease at every iteration.
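A skeletal minibatch SGD loop matching the update above. Here `grad_fn`, the learning rate, and the batch size are stand-ins (assumptions of this sketch); the constant factor n is absorbed into the learning rate by assuming `grad_fn` averages over the minibatch.

```python
import numpy as np

def sgd(params, grad_fn, data, lr=0.01, batch_size=32, epochs=10):
    """Minimal minibatch SGD: shuffle, split into batches, step downhill.
    grad_fn(params, batch) is assumed to return per-parameter gradients
    averaged over the given minibatch."""
    rng = np.random.default_rng(0)
    n = len(data)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            grads = grad_fn(params, [data[i] for i in idx])
            for p, g in zip(params, grads):
                p -= lr * g        # in-place update, one step per minibatch
    return params
```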
Stochastic Gradient Descent

Figure: GD on the left, SGD on the right. Still acceptable — we just want “close enough.”

Ng, Stanford Machine Learning (Fall 2011)
Summarize Training
Quick Note on C

- Least squares is useful for examples, but usually not suitable in practice (slow with sigmoid output).
- Cross-entropy:

$$C = -\frac{1}{n} \sum_x \Big[ y^T \ln\big(a^L(x)\big) + (1 - y)^T \ln\big(1 - a^L(x)\big) \Big]$$

- Chosen for “nice” attributes, but with no idea of the resulting cost topology, which influences everything.
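Why cross-entropy is “faster” than least squares with a sigmoid output, in one numeric check (a sketch; the single saturated neuron is a toy case): the quadratic-cost error signal carries a σ′(z) factor that vanishes at saturation, while the cross-entropy signal (a − y) does not.

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

z, y = 8.0, 0.0              # confidently wrong neuron: z large, true label 0
a = sigmoid(z)
dsig = a * (1.0 - a)         # sigma'(z), nearly zero when saturated
print("quadratic delta:", (a - y) * dsig)   # ~0.0003: learning stalls
print("cross-entropy delta:", a - y)        # ~1.0: learning proceeds
```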
Optimization Issues

- Training NNs remains one of the least analytically sound aspects of the field.
- What does the cost function look like?
- Where are its local minima (if there are any)?
- How will SGD jump around?
Local Minima Problem

Sontag and Sussmann (1989), Brady et al. (1989), Gori and Tesi (1992)
Saddle Points

- Suppose $H = \nabla^2 C$ with eigenvalues $\lambda_i$.
- If the signs are symmetric coin flips, $P(\lambda_i > 0) = \tfrac{1}{2} \implies P(\lambda_1 > 0, \dots, \lambda_K > 0) = \big(\tfrac{1}{2}\big)^K$.
- $\big(\tfrac{1}{2}\big)^K \to 0$ as $K \to \infty$, meaning the probability that a critical point is a local minimum vanishes: in high dimensions, almost every critical point is a saddle point.

Dauphin et al. (2014)
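A quick numerical companion to the heuristic (a sketch): draw random symmetric “Hessian-like” matrices and count how often all eigenvalues are positive. The independence assumption above is only heuristic; for the correlated eigenvalues of random symmetric matrices, minima are rarer still.

```python
import numpy as np

rng = np.random.default_rng(0)
for K in (2, 5, 10):
    trials = 2000
    minima = 0
    for _ in range(trials):
        A = rng.standard_normal((K, K))
        H = (A + A.T) / 2.0                      # random symmetric "Hessian"
        minima += bool(np.all(np.linalg.eigvalsh(H) > 0))
    print(K, minima / trials, 0.5 ** K)          # observed vs. the (1/2)^K heuristic
```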
No Minima Problem
In Practice...

- Even in cases where a local minimum is not found, results can be state-of-the-art; we just want SGD to find a “very small value” of the cost.
- Initialization of weights is a major area of research, trying to place the model in a “good” region for finding local minima (more on this later).
- Structural elements have adapted to (experimentally) fix these issues (ReLU, cross-entropy cost, convolutional layers, etc.).

Goodfellow et al. (2016)
Unstable Gradient vs. Architecture

Layers (Parameters*)   Multi-Class Error   Top-5 Error
11 (133)               29.6                10.4
13 (133)               28.7                 9.9
16 (134)               28.1                 9.4
16 (138)               27.3                 8.8
19 (144)               25.5                 8.0

* Number in millions

Simonyan and Zisserman (2014)
Unstable Gradient

- Consider a simplified model (Nielsen’s chain of single-neuron layers).
- Typical initialization: $w_i \sim \mathcal{N}(0, 1)$, $b_i = 1$.

Nielsen (2015)
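A sketch of the mechanism in the chain example: the gradient reaching the first layer is a product of $w_i\,\sigma'(z_i)$ terms, and with a sigmoid $|\sigma'| \le 1/4$, so the product typically shrinks with depth. The random pre-activations z_i here are stand-ins, not a faithful forward pass.

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
for depth in (2, 5, 10, 20):
    prods = []
    for _ in range(1000):
        w = rng.standard_normal(depth)           # w_i ~ N(0, 1), as on the slide
        z = rng.standard_normal(depth)           # stand-in pre-activations
        s = sigmoid(z)
        prods.append(np.abs(np.prod(w * s * (1 - s))))   # product of w_i * sigma'(z_i)
    print(depth, np.median(prods))               # shrinks rapidly with depth
```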
Vanishing Gradient

- First layers learn slower than later ones.

Nielsen (2015)
Unstable Gradients for FFN

Nielsen (2015)
Overcoming Unstable Gradients

- Initialization (again):
  1. Random Walk Initialization¹: $\tilde{W}^\ell = \omega W^\ell$ where $\omega_{\mathrm{ReLU}} = \sqrt{2}\, \exp\Big(\dfrac{1.2}{\max(N_\ell, 6) - 2.4}\Big)$
  2. Orthonormal Initialization²
  3. Unit Variance Initialization³,⁴

Sussillo and Abbott (2014)¹, Saxe et al. (2013)², He et al. (2015)³, Mishkin and Matas (2015)⁴
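A sketch of the random-walk gain formula above in code. How the gain is combined with the base Gaussian (here, scaled by $1/\sqrt{N_\ell}$ so the gain sets the per-layer signal variance) is this sketch’s assumption about the formula’s intended use, not something the slide specifies.

```python
import numpy as np

def rw_gain_relu(N):
    """The slide's ReLU gain: sqrt(2) * exp(1.2 / (max(N, 6) - 2.4))."""
    return np.sqrt(2.0) * np.exp(1.2 / (max(N, 6) - 2.4))

def init_layer(n_out, n_in, rng):
    # Assumption: N(0, 1) entries scaled by g / sqrt(n_in), so that the gain g
    # controls how signal magnitude evolves from layer to layer.
    return (rw_gain_relu(n_in) / np.sqrt(n_in)) * rng.standard_normal((n_out, n_in))

rng = np.random.default_rng(0)
Ws = [init_layer(256, 256, rng) for _ in range(20)]   # a deep stack of layers
```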
Overcoming Unstable Gradients

- Student-teacher model
- Drop in layers as you go
- Change the cost function and/or activation functions?

Romero et al. (2014)
Transfer Learning & Optimization Problems

- Transfer learning: taking a model primed for task X and applying it to an unrelated problem Y. For example, taking a CNN designed to discern among vehicles and applying it to classifying different animals without manipulating parameters.
- Transfer learning and its predecessor, pretraining, are unintentional solutions to the initialization problems.
Pretraining

- Pretraining revived deep neural networks from obscurity.
- Without convolutions or recursion, deep fully connected networks could be trained while avoiding local minima problems.
- Unsupervised pretraining has mostly been abandoned, but it inspired much of the modern work in transfer learning.

Bengio et al. (2007)
Unsupervised Pretraining

Let f be the identity function, X the input data matrix (one row per example), and K the number of layers. For k ∈ {1, ..., K}:

    f^(k) = L(X)        (train an unsupervised learner L on the current X)
    f = f^(k) ∘ f
    X = f^(k)(X)

The above is done for each layer; each layer’s input is the previous layer’s output.

Goodfellow et al. (2016)
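A runnable sketch of the greedy loop, with PCA + ReLU standing in for the abstract unsupervised learner L (the slide does not fix a particular L):

```python
import numpy as np

def pca_layer(X, n_hidden):
    """One unsupervised learner L(X): a PCA projection followed by ReLU,
    an illustrative stand-in for the slide's abstract L."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:n_hidden].T                              # top principal directions
    return lambda Z: np.maximum(0.0, (Z - mu) @ W)

def greedy_pretrain(X, layer_sizes):
    """The slide's loop: learn f^(k) = L(X), compose it onto f, re-encode X."""
    layers = []
    for n_hidden in layer_sizes:
        f_k = pca_layer(X, n_hidden)                 # f^(k) = L(X)
        layers.append(f_k)                           # f = f^(k) o f
        X = f_k(X)                                   # X = f^(k)(X)
    def f(Z):
        for g in layers:
            Z = g(Z)
        return Z
    return f

encoder = greedy_pretrain(np.random.default_rng(0).standard_normal((200, 32)), [16, 8])
codes = encoder(np.random.default_rng(1).standard_normal((5, 32)))   # shape (5, 8)
```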
Unsupervised Pretraining

Unsupervised pretraining is thought to work because:

1. A model is sensitive to its initializations.
   - This is the least mathematically understood part of pretraining.
   - It may start the model at local minima otherwise inaccessible, given the shape of the cost function.
   - How much of the initialization survives training?
2. Learning the input distribution is useful.
   - Suppose a car/motorcycle model; will pretraining spot wheels? Will its representation of a wheel be helpful?
   - There is no coherent theory as to if/how/why/when unsupervised features help with NN training.

Goodfellow et al. (2016)
Supervised Pretraining

Train a shallow network, then use its hidden layer as input to the next stage:

$$x \to \underbrace{\sigma(W^{(1)}x + b^{(1)})}_{x^{(1)}} \to y(x)$$

$$x^{(1)} \to \sigma(W^{(2)}x^{(1)} + b^{(2)}) \to y(x)$$

$$\vdots$$

$$x \to \sigma(W^{(1)}x + b^{(1)}) \to \sigma(W^{(2)}a^{(1)} + b^{(2)}) \to \cdots \to y(x)$$

Bengio et al. (2007)
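A sketch of the greedy supervised variant, using scikit-learn’s MLPClassifier as a stand-in trainer for each shallow stage (an assumption of this sketch, not the talk’s method); only the hidden layer is kept at each step.

```python
import numpy as np
from scipy.special import expit                      # logistic sigma
from sklearn.neural_network import MLPClassifier

def supervised_pretrain(X, y, layer_sizes):
    """Repeatedly fit a one-hidden-layer net x -> sigma(Wx + b) -> y(x),
    keep the hidden layer, and feed its activations to the next stage."""
    Ws, bs = [], []
    for n_hidden in layer_sizes:
        net = MLPClassifier(hidden_layer_sizes=(n_hidden,),
                            activation="logistic", max_iter=500).fit(X, y)
        W, b = net.coefs_[0], net.intercepts_[0]     # discard the output layer
        Ws.append(W); bs.append(b)
        X = expit(X @ W + b)                         # x^(k) for the next stage
    return Ws, bs

rng = np.random.default_rng(0)
Ws, bs = supervised_pretrain(rng.standard_normal((200, 20)),
                             rng.integers(0, 2, 200), [16, 8])
```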
Oquab et al. (2014)
Finer Details

Figure: three-panel fine-tuning diagram — (1) a trained network (input image → output layer), (2) the same architecture on the new task, (3) an SVM trained on the transferred features. McKay 2017
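A sketch of panel (3)’s idea: freeze the transferred layers, use an inner layer’s activations as features, and train only an SVM on the new task. All names here (Ws, bs, X_new, y_new) are assumptions of this sketch, not from the talk.

```python
import numpy as np
from sklearn.svm import SVC

def features(x, Ws, bs):
    """Forward pass through all but the last layer of a trained net,
    with the transferred weights frozen."""
    a = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = np.maximum(0.0, W @ a + b)       # ReLU layers, weights untouched
    return a

# X_new, y_new: the (small) labeled set for the new task -- assumed given.
# Z = np.stack([features(x, Ws, bs) for x in X_new])
# clf = SVC().fit(Z, y_new)                  # train only the SVM, not the net
```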
Figure: precision vs. recall (both 70–100) for VGG-f, VGG16, VGG19, and Alex(Net) features. McKay 2017
Thank You

Acknowledgements
- Dr. Anne Gelb
- Dr. Suren Jayasuriya
- iPal NN group: Tiep Vu, Hojjat Mousavi, Tiantong Guo
References I

Bengio, Y. et al. (2007). “Greedy layer-wise training of deep networks”. In: NIPS.

Brady, M. L., R. Raghavan, and J. Slawny (1989). “Back propagation fails to separate where perceptrons succeed”. In: IEEE Transactions on Circuits and Systems 36.5.

Dauphin, Y. N. et al. (2014). “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”. In: Advances in Neural Information Processing Systems.

Gal, Y. and Z. Ghahramani (2015). “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning”. In: arXiv preprint arXiv:1506.02142.

Gatys, L. A., A. S. Ecker, and M. Bethge (2015). “A neural algorithm of artistic style”. In: arXiv preprint arXiv:1508.06576.

Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep Learning. http://www.deeplearningbook.org. MIT Press.

Gori, M. and A. Tesi (1992). “On the problem of local minima in backpropagation”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 14.1.

He, K. et al. (2015). “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification”. In: Proceedings of the IEEE International Conference on Computer Vision.

Hornik, K., M. Stinchcombe, and H. White (1989). “Multilayer feedforward networks are universal approximators”. In: Neural Networks 2.5.
References II

Krause, J. et al. (2016). “A Hierarchical Approach for Generating Descriptive Image Paragraphs”. In: arXiv preprint arXiv:1611.06607.

LeCun, Y. et al. (1998). “Gradient-based learning applied to document recognition”. In: Proceedings of the IEEE 86.11.

Leshno, M. et al. (1993). “Multilayer feedforward networks with a nonpolynomial activation function can approximate any function”. In: Neural Networks 6.6.

Lowe, D. G. (1999). “Object recognition from local scale-invariant features”. In: Proceedings of the Seventh IEEE International Conference on Computer Vision. Vol. 2. IEEE.

Mishkin, D. and J. Matas (2015). “All you need is a good init”. In: arXiv preprint arXiv:1511.06422.

Nielsen, M. A. (2015). Neural Networks and Deep Learning. http://neuralnetworksanddeeplearning.com. Determination Press.

Oquab, M. et al. (2014). “Learning and transferring mid-level image representations using convolutional neural networks”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Romero, A. et al. (2014). “FitNets: Hints for thin deep nets”. In: arXiv preprint arXiv:1412.6550.

Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). “Learning representations by back-propagating errors”. In: Nature 323.
References III

Saxe, A. M., J. L. McClelland, and S. Ganguli (2013). “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks”. In: arXiv preprint arXiv:1312.6120.

Simonyan, K. and A. Zisserman (2014). “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556.

Sontag, E. D. and H. J. Sussmann (1989). “Backpropagation can give rise to spurious local minima even for networks without hidden layers”. In: Complex Systems 3.1.

Sussillo, D. and L. Abbott (2014). “Random walk initialization for training very deep feedforward networks”. In: arXiv preprint arXiv:1412.6558.

Taigman, Y. et al. (2014). “DeepFace: Closing the gap to human-level performance in face verification”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.