Neural Networks
Oliver Schulte - CMPT 726
Bishop PRML Ch. 5
Neural Networks
• Neural networks arise from attempts to model human/animal brains
• Many models, many claims of biological plausibility
• We will focus on multi-layer perceptrons
  • Mathematical properties rather than plausibility
  • Prof. Hadley, CMPT 418
Uses of Neural Networks
• Pros
  • Good for continuous input variables.
  • General continuous function approximators.
  • Highly non-linear.
  • Learn feature functions.
  • Good to use in continuous domains with little knowledge:
    • When you don't know good features.
    • When you don't know the form of a good functional model.
• Cons
  • Not interpretable, "black box".
  • Learning is slow.
  • Good generalization can require many data points.
Applications
There are many, many applications.
• World-Champion Backgammon Player.
• No Hands Across America Tour.
• Digit Recognition with 99.26% accuracy.
• ...
Outline
Feed-forward Networks
Network Training
Error Backpropagation
Applications
Feed-forward Networks
• We have looked at generalized linear models of the form:

  y(x, w) = f\left( \sum_{j=1}^{M} w_j \phi_j(x) \right)

  for fixed non-linear basis functions \phi(\cdot)
• We now extend this model by allowing adaptive basis functions, and learning their parameters
• In feed-forward networks (a.k.a. multi-layer perceptrons) we let each basis function be another non-linear function of a linear combination of the inputs:

  \phi_j(x) = f\left( \sum_{j=1}^{M} \dots \right)
Feed-forward Networks
• Starting with input x = (x_1, \dots, x_D), construct linear combinations:

  a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}

  These a_j are known as activations
• Pass through an activation function h(\cdot) to get output z_j = h(a_j)
• Model of an individual neuron (figure from Russell and Norvig, AIMA2e)
Activation Functions
• Can use a variety of activation functions
  • Sigmoidal (S-shaped)
    • Logistic sigmoid 1/(1 + \exp(-a)) (useful for binary classification)
    • Hyperbolic tangent tanh
  • Radial basis function z_j = \sum_i (x_i - w_{ji})^2
  • Softmax
    • Useful for multi-class classification
  • Identity
    • Useful for regression
  • Threshold
  • ...
• Needs to be differentiable for gradient-based learning (later)
• Can use different activation functions in each unit
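
As a concrete illustration, here is a minimal NumPy sketch of the activation functions listed above (the function names are ours, not from the slides):

```python
import numpy as np

def logistic(a):
    # Logistic sigmoid: maps activations into (0, 1); useful for binary classification.
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    # Hyperbolic tangent: maps activations into (-1, 1).
    return np.tanh(a)

def softmax(a):
    # Softmax over a vector of activations; useful for multi-class classification.
    e = np.exp(a - np.max(a))  # subtract the max for numerical stability
    return e / e.sum()

def identity(a):
    # Identity: useful for regression outputs.
    return a

def threshold(a):
    # Threshold (step) function: NOT differentiable, so unsuitable
    # for the gradient-based learning discussed later.
    return (a > 0).astype(float)
```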
Feed-forward Networks
[Figure: two-layer feed-forward network. Inputs x_0, x_1, ..., x_D; hidden units z_0, z_1, ..., z_M; outputs y_1, ..., y_K; weights labelled w^{(1)}_{MD}, w^{(2)}_{KM}, w^{(2)}_{10}.]
• Connect together a number of these units into a feed-forward network (DAG)
• Above shows a network with one layer of hidden units
• Implements the function:

  y_k(x, w) = h\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)
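
As a sketch of this computation (our own variable names; assuming tanh hidden units and identity output units, whereas the formula above writes a generic h at both layers), the forward pass is just two matrix-vector products:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    # x  : input vector, shape (D,)
    # W1 : first-layer weights w^(1)_ji, shape (M, D); b1: biases w^(1)_j0, shape (M,)
    # W2 : second-layer weights w^(2)_kj, shape (K, M); b2: biases w^(2)_k0, shape (K,)
    a = W1 @ x + b1    # hidden activations a_j
    z = np.tanh(a)     # hidden outputs z_j = h(a_j)
    y = W2 @ z + b2    # output activations (identity output, e.g. for regression)
    return y
```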
Hidden Units Compute Basis Functions
• red dots = network function
• dashed line = hidden unit activation function
• blue dots = data points
Network Training
• Given a specified network structure, how do we set its parameters (weights)?
• As usual, we define a criterion to measure how well our network performs, then optimize against it
• For regression, training data are (x_n, t_n), t_n \in \mathbb{R}
  • Squared error naturally arises:

    E(w) = \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2

• For binary classification, this is another discriminative model, ML:

  p(t|w) = \prod_{n=1}^{N} y_n^{t_n} \{ 1 - y_n \}^{1 - t_n}

  E(w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}
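
A minimal sketch of the two criteria (assuming y and t are NumPy arrays holding the network outputs y_n and targets t_n):

```python
import numpy as np

def squared_error(y, t):
    # Regression criterion: E(w) = sum_n { y(x_n, w) - t_n }^2
    return np.sum((y - t) ** 2)

def binary_cross_entropy(y, t, eps=1e-12):
    # Classification criterion:
    # E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ]
    y = np.clip(y, eps, 1.0 - eps)  # guard against log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
```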
Parameter Optimization
[Figure: error surface E(w) over weight space (w_1, w_2), with stationary points w_A, w_B, w_C and the local gradient \nabla E.]
• For either of these problems, the error function E(w) is nasty
  • Nasty = non-convex
  • Non-convex = has local minima
Descent Methods
• The typical strategy for optimization problems of this sort is a descent method:

  w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}

• These come in many flavours
  • Gradient descent \nabla E(w^{(\tau)})
  • Stochastic gradient descent \nabla E_n(w^{(\tau)})
  • Newton-Raphson (second order) \nabla^2
• All of these can be used here; stochastic gradient descent is particularly effective
  • Redundancy in training data, escaping local minima
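
A schematic stochastic-gradient-descent loop, as a sketch assuming a user-supplied grad_En(w, n) that returns \nabla E_n for a single training case (e.g. computed by backpropagation, described later):

```python
import numpy as np

def sgd(w, grad_En, N, eta=0.01, epochs=10, seed=0):
    # w^(tau+1) = w^(tau) + Delta w^(tau), with Delta w^(tau) = -eta * grad E_n
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for n in rng.permutation(N):   # visit the N training cases in random order
            w = w - eta * grad_En(w, n)
    return w
```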
Computing Gradients
• The function y(x_n, w) implemented by a network is complicated
• It isn't obvious how to compute error function derivatives with respect to weights
• Numerical method for calculating error derivatives: use finite differences:

  \frac{\partial E_n}{\partial w_{ji}} \approx \frac{E_n(w_{ji} + \varepsilon) - E_n(w_{ji} - \varepsilon)}{2\varepsilon}

• How much computation would this take with W weights in the network?
  • O(W) per derivative, O(W^2) total per gradient descent step
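
A sketch of this central-difference computation (assuming En(w) evaluates the per-case error at weight vector w); each of the W weights needs two O(W) network evaluations, hence the O(W^2) total:

```python
import numpy as np

def numerical_gradient(En, w, eps=1e-6):
    # dEn/dw_i ~ (En(w + eps*e_i) - En(w - eps*e_i)) / (2*eps)
    g = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (En(w_plus) - En(w_minus)) / (2.0 * eps)
    return g
```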
Error Backpropagation
• Backprop is an efficient method for computing error derivatives \partial E_n / \partial w_{ji}
  • O(W) to compute derivatives wrt all weights
• First, feed training example x_n forward through the network, storing all activations a_j
• Calculating derivatives for weights connected to output nodes is easy
  • e.g. for linear output nodes y_k = \sum_i w_{ki} z_i:

    \frac{\partial E_n}{\partial w_{ki}} = \frac{\partial}{\partial w_{ki}} \frac{1}{2} (y_n - t_n)^2 = (y_n - t_n) z_{ni}

• For hidden layers, propagate error backwards from the output nodes
Chain Rule for Partial Derivatives
• A "reminder"
• For f(x, y), with f differentiable wrt x and y, and x and y differentiable wrt u and v:

  \frac{\partial f}{\partial u} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial u} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial u}

  and

  \frac{\partial f}{\partial v} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial v} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial v}
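
A quick worked instance (ours, not from the slides): take f(x, y) = xy with x = u + v and y = uv. Then

  \frac{\partial f}{\partial u} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial u} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial u} = y \cdot 1 + x \cdot v = uv + (u + v)v = 2uv + v^2,

which matches differentiating f = u^2 v + u v^2 directly.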
Error Backpropagation
• We can write

  \frac{\partial E_n}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} E_n(a_{j_1}, a_{j_2}, \dots, a_{j_m})

  where \{ j_i \} are the indices of the nodes in the same layer as node j
• Using the chain rule:

  \frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}} + \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial w_{ji}}

  where \sum_k runs over all other nodes k in the same layer as node j
• Since a_k does not depend on w_{ji}, all terms in the summation go to 0:

  \frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}}
Error Backpropagation cont.
• Introduce error \delta_j \equiv \frac{\partial E_n}{\partial a_j}:

  \frac{\partial E_n}{\partial w_{ji}} = \delta_j \frac{\partial a_j}{\partial w_{ji}}

• Other factor is:

  \frac{\partial a_j}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} \sum_k w_{jk} z_k = z_i
Error Backpropagation cont.
• Error \delta_j can also be computed using the chain rule:

  \delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \underbrace{\frac{\partial E_n}{\partial a_k}}_{\delta_k} \frac{\partial a_k}{\partial a_j}

  where \sum_k runs over all nodes k in the layer after node j
• Eventually:

  \delta_j = h'(a_j) \sum_k w_{kj} \delta_k

• A weighted sum of the later error "caused" by this node
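
Putting the pieces together, a minimal sketch of one backprop step for the one-hidden-layer regression network from earlier (our own names; assuming tanh hidden units, linear outputs, and the error \frac{1}{2}(y - t)^2, so the output delta is y - t):

```python
import numpy as np

def backprop(x, t, W1, b1, W2, b2):
    # Forward pass, storing all activations
    a = W1 @ x + b1    # hidden activations a_j
    z = np.tanh(a)     # hidden outputs z_j = h(a_j)
    y = W2 @ z + b2    # linear output units y_k

    # Backward pass: deltas
    delta_out = y - t                              # delta_k for linear output + squared error
    delta_hid = (1.0 - z**2) * (W2.T @ delta_out)  # delta_j = h'(a_j) * sum_k w_kj delta_k
                                                   # (h = tanh, so h'(a_j) = 1 - z_j^2)

    # Gradients: dEn/dw_ji = delta_j * z_i (z_i = x_i for first-layer weights)
    grad_W2, grad_b2 = np.outer(delta_out, z), delta_out
    grad_W1, grad_b1 = np.outer(delta_hid, x), delta_hid
    return grad_W1, grad_b1, grad_W2, grad_b2
```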
Applications of Neural Networks
• Many success stories for neural networks
  • Credit card fraud detection
  • Hand-written digit recognition
  • Face detection
  • Autonomous driving (CMU ALVINN)
Hand-written Digit Recognition
• MNIST - standard dataset for hand-written digit recognition
• 60,000 training, 10,000 test images
LeNet-5
[Figure: LeNet-5 architecture. INPUT 32x32 → C1: feature maps 6@28x28 (convolutions) → S2: f. maps 6@14x14 (subsampling) → C3: f. maps 16@10x10 (convolutions) → S4: f. maps 16@5x5 (subsampling) → C5: layer 120 (full connection) → F6: layer 84 (full connection) → OUTPUT 10 (Gaussian connections).]
• LeNet developed by Yann LeCun et al.
• Convolutional neural network
  • Local receptive fields (5x5 connectivity)
  • Subsampling (2x2)
  • Shared weights (reuse same 5x5 "filter")
  • Breaking symmetry
• See http://www.codeproject.com/KB/library/NeuralNetRecognition.aspx
[Figure: the 82 MNIST test digits misclassified by LeNet-5, each labelled true→predicted, e.g. 4→6, 3→5, 8→2.]
• The 82 errors made by LeNet-5 (0.82% test error rate)
Conclusion
• Readings: Ch. 5.1, 5.2, 5.3
• Feed-forward networks can be used for regression or classification
  • Similar to linear models, except with adaptive non-linear basis functions
  • These allow us to do more than e.g. linear decision boundaries
• Different error functions
  • Learning is more difficult, error function is not convex
  • Use stochastic gradient descent, obtain a (good?) local minimum
• Backpropagation for efficient gradient computation