Multi-layer Neural Networks
Steve Renals
Informatics 2B: Learning and Data, Lecture 13, 8 March 2011
Overview
- Multi-layer neural networks
- Multi-layer perceptrons (MLPs)
- The credit assignment problem for hidden units
- Back-propagation of error (backprop) training
Limitations of single-layer neural networks
Single-layer neural networks have many advantages:
- Easy to set up and train
- Explicit link to statistical models: shared-covariance Gaussian density functions; sigmoid output functions allow a link to posterior probabilities
- Outputs are a weighted sum of the inputs: an interpretable representation

But some big limitations:
- Can only represent a limited set of functions
- Decision boundaries must be hyperplanes
- Can only perfectly separate linearly separable data
Generalised Linear Network
Generalises linear discriminants by adding another, non-adaptive layer:

y_k(x) = \sum_{j=0}^{M} w_{kj} \phi_j(x)

- The input vector x is transformed using a set of M predefined non-linear functions, \phi_j(x), called basis functions
- This allows a much larger class of discriminant functions (in fact, such networks can approximate any continuous function to arbitrary accuracy)
- Multi-layer neural networks employ adaptive basis functions, with parameters (weights) that may be estimated from the training data
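To make the idea concrete, here is a minimal NumPy sketch of a generalised linear network. The Gaussian basis functions, their centres, and their width are illustrative assumptions (the slides do not fix a particular basis); only the weight matrix W would be estimated from data:

```python
import numpy as np

def gaussian_basis(X, centres, width=1.0):
    """Predefined basis phi_j(x) = exp(-||x - c_j||^2 / (2 width^2)),
    with a constant phi_0(x) = 1 prepended for the bias weight w_k0."""
    sq_dists = np.sum((X[:, None, :] - centres[None, :, :]) ** 2, axis=-1)
    phi = np.exp(-sq_dists / (2.0 * width ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), phi])

def gln_output(X, centres, W):
    """y_k(x) = sum_{j=0}^{M} w_kj phi_j(x); W has shape (K, M + 1)."""
    return gaussian_basis(X, centres) @ W.T

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))             # five 2-d input vectors
centres = rng.normal(size=(3, 2))       # M = 3 fixed basis centres
W = rng.normal(size=(2, 4))             # K = 2 outputs, M + 1 = 4 weights each
print(gln_output(X, centres, W).shape)  # (5, 2)
```

Because the \phi_j are fixed, fitting W is a linear problem in the transformed features; only in the multi-layer networks below do the basis functions themselves adapt.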
Multi-layer neural networks
- Construct more general networks by considering layers of processing units
- Unlike generalised linear discriminants, each layer has a set of adaptive weights
- Layers that are neither input nor output are referred to as hidden
- Such networks are often called multi-layer perceptrons (MLPs)
Multi-layer Perceptron
[Figure: network diagram with inputs x_0 (bias), x_1, ..., x_d; hidden units z_0 (bias), z_1, ..., z_M; outputs y_1, ..., y_K; first-layer weights w^{(1)}_{ji} (e.g. w^{(1)}_{10}, w^{(1)}_{Md}) and second-layer weights w^{(2)}_{kj} (e.g. w^{(2)}_{10}, w^{(2)}_{KM})]
Building up the MLP (1)
First we form M linear combinations of the d-dimensional inputs:

b_j = \sum_{i=0}^{d} w^{(1)}_{ji} x_i

- b_j: activations
- w^{(1)}_{ji}: first layer of weights

The activations are transformed by a nonlinear activation function h (e.g. a sigmoid):

z_j = h(b_j) = \frac{1}{1 + \exp(-b_j)}

- z_j: hidden unit outputs
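A minimal NumPy sketch of this first layer, assuming the bias is absorbed as a constant input x_0 = 1 (which is why the sum starts at i = 0):

```python
import numpy as np

def sigmoid(b):
    """Sigmoid activation h(b) = 1 / (1 + exp(-b))."""
    return 1.0 / (1.0 + np.exp(-b))

def hidden_layer(x, W1):
    """First layer: activations b_j = sum_i w1[j, i] x_i, outputs z_j = h(b_j).

    x  : input vector with bias element x_0 = 1, shape (d + 1,)
    W1 : first-layer weights, shape (M, d + 1)
    """
    b = W1 @ x        # activations
    z = sigmoid(b)    # hidden unit outputs
    return b, z
```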
Building up the MLP (2)
The outputs of the hidden units are linearly combined to give the activations of the output units:

a_k = \sum_{j=0}^{M} w^{(2)}_{kj} z_j

The output units are transformed using an activation function (e.g. a sigmoid):

y_k = g(a_k) = \frac{1}{1 + \exp(-a_k)}

For multiclass problems, a softmax may be used. Combining the two layers gives the overall forward propagation equation for the network:

y_k = g\left( \sum_{j=0}^{M} w^{(2)}_{kj} \, h\left( \sum_{i=0}^{d} w^{(1)}_{ji} x_i \right) \right)
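Putting the two layers together, a sketch of the full forward propagation under the same conventions (bias units x_0 = z_0 = 1, sigmoids for both h and g):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2):
    """y_k = g( sum_j w2[k, j] h( sum_i w1[j, i] x_i ) ).

    x  : input vector with bias element x_0 = 1, shape (d + 1,)
    W1 : input-to-hidden weights, shape (M, d + 1)
    W2 : hidden-to-output weights, shape (K, M + 1)
    """
    b = W1 @ x                               # hidden activations b_j
    z = np.concatenate(([1.0], sigmoid(b)))  # hidden outputs, bias z_0 = 1
    a = W2 @ z                               # output activations a_k
    y = sigmoid(a)                           # network outputs y_k
    return y, (b, z, a)                      # keep intermediates for backprop
```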
Training MLPs: Credit assignment
- Hidden units make training the weights more complicated
- The gradients for a single-layer neural network have a simple form:

\frac{\partial E^n}{\partial w_{ki}} = \delta_k x_i

- For a multi-layer neural network: what is the "error" of a hidden unit? How important is input-to-hidden weight w^{(1)}_{ji} to output unit k?
- Solution: we need to define derivatives of the error with respect to each weight
- Algorithm: back-propagation of error (backprop)
- Backprop gives a way to compute the derivatives; these derivatives are then used by an optimisation algorithm (e.g. gradient descent) to train the weights
Training MLPs: Error function and required gradients
Sum-of-squares error function, obtained by summing over a training set of N examples:

E = \sum_{n=1}^{N} E^n, \qquad E^n = \frac{1}{2} \sum_{k=1}^{K} (y^n_k - t^n_k)^2

To obtain the overall error gradients, we sum over the training examples:

\frac{\partial E}{\partial w_{kj}} = \sum_{n=1}^{N} \frac{\partial E^n}{\partial w_{kj}} \qquad \frac{\partial E}{\partial w_{ji}} = \sum_{n=1}^{N} \frac{\partial E^n}{\partial w_{ji}}
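A direct NumPy transcription of this error function, assuming the outputs and targets for the N examples are stacked as rows:

```python
import numpy as np

def sum_of_squares_error(Y, T):
    """E = sum_n 0.5 * sum_k (y_k^n - t_k^n)^2 for outputs Y and targets T,
    both of shape (N, K)."""
    return 0.5 * np.sum((Y - T) ** 2)
```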
Training MLPs: Hidden-to-output weights
Write E^n in terms of the hidden-to-output weights:

E^n = \frac{1}{2} \sum_{k=1}^{K} \left( g(a^n_k) - t^n_k \right)^2 = \frac{1}{2} \sum_{k=1}^{K} \left( g\left( \sum_{j=0}^{M} w_{kj} z^n_j \right) - t^n_k \right)^2

Break down the error derivatives using the chain rule:

\frac{\partial E^n}{\partial w_{kj}} = \frac{\partial E^n}{\partial a^n_k} \frac{\partial a^n_k}{\partial w_{kj}}

The term \partial E^n / \partial a^n_k is often referred to as the error signal, \delta^n_k:

\delta^n_k = \frac{\partial E^n}{\partial a^n_k} = \frac{\partial E^n}{\partial y^n_k} \cdot \frac{\partial y^n_k}{\partial a^n_k} = (y^n_k - t^n_k) \, g'(a^n_k)

Since \partial a^n_k / \partial w_{kj} = z^n_j, we obtain:

\frac{\partial E^n}{\partial w_{kj}} = \delta^n_k z^n_j

This is similar to the gradient for a single-layer neural network with a nonlinear activation function.
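A NumPy sketch of these per-example quantities, using the standard sigmoid identity g'(a) = g(a)(1 - g(a)):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def output_layer_grads(y, t, a, z):
    """Error signals delta_k = (y_k - t_k) g'(a_k) and gradients
    dE^n/dw_kj = delta_k z_j for one training example.

    y, t, a : outputs, targets, output activations, each shape (K,)
    z       : hidden outputs including bias z_0 = 1, shape (M + 1,)
    """
    g_prime = sigmoid(a) * (1.0 - sigmoid(a))  # g'(a_k) for a sigmoid g
    delta_k = (y - t) * g_prime                # output error signals
    return np.outer(delta_k, z), delta_k       # gradient matrix, shape (K, M + 1)
```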
Training MLPs: Input-to-hidden weights
To compute the error gradients for the input-to-hidden weights, we must take into account all the ways in which hidden unit j (and hence weight w_{ji}) can influence the error. Consider \delta^n_j, the error signal for hidden unit j, which sums over all the output units' contributions:

\delta^n_j = \frac{\partial E^n}{\partial b^n_j} = \sum_{k=1}^{K} \frac{\partial E^n}{\partial a^n_k} \frac{\partial a^n_k}{\partial b^n_j} = \sum_{k=1}^{K} \delta^n_k \frac{\partial a^n_k}{\partial b^n_j}

The remaining derivative is:

\frac{\partial a^n_k}{\partial b^n_j} = \frac{\partial a^n_k}{\partial z^n_j} \frac{\partial z^n_j}{\partial b^n_j} = w_{kj} \, h'(b^n_j)

Substituting in, we obtain:

\delta^n_j = h'(b^n_j) \sum_{k=1}^{K} \delta^n_k w_{kj}

This is the famous back-propagation of error (backprop) equation.
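The backprop equation as a NumPy sketch, assuming the layout of the earlier forward-pass sketch, where column 0 of W2 multiplies the constant bias unit z_0 (no error signal flows back through it):

```python
import numpy as np

def hidden_deltas(delta_k, W2, b):
    """delta_j = h'(b_j) * sum_k delta_k w2[k, j] for sigmoid hidden units.

    delta_k : output error signals, shape (K,)
    W2      : hidden-to-output weights, shape (K, M + 1)
    b       : hidden activations, shape (M,)
    """
    z = 1.0 / (1.0 + np.exp(-b))              # hidden outputs h(b_j)
    h_prime = z * (1.0 - z)                   # h'(b_j) for a sigmoid h
    return h_prime * (W2[:, 1:].T @ delta_k)  # skip bias column: z_0 is constant
```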
Back-propagation of error: hidden unit error signal

[Figure: hidden unit z_j receives input x_i through weight w^{(1)}_{ji} and feeds outputs y_1, ..., y_\ell, ..., y_K through weights w^{(2)}_{1j}, ..., w^{(2)}_{\ell j}, ..., w^{(2)}_{Kj}; the output error signals \delta_1, ..., \delta_K flow back along these weights, giving \delta_j = h'(b_j) \sum_\ell \delta_\ell w^{(2)}_{\ell j}]
Back-propagation of error
The derivatives with respect to the input-to-hidden weights can thus be evaluated using:

\frac{\partial E^n}{\partial w_{ji}} = \frac{\partial E^n}{\partial b^n_j} \frac{\partial b^n_j}{\partial w_{ji}} = \delta^n_j x^n_i

The back-propagation of error algorithm is summarised as follows:
1. Apply the N input vectors from the training set, x^n, to the network and forward propagate to obtain the set of output vectors y^n
2. Using the target vectors t^n, compute the error E
3. Evaluate the error signals \delta^n_k for each output unit
4. Evaluate the error signals \delta^n_j for each hidden unit using back-propagation of error
5. Evaluate the derivatives for each training pattern, summing to obtain the overall derivatives
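A sketch tying the five steps together, reusing the conventions of the earlier snippets (bias column x_0 = 1, sigmoid activations at both layers):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_grads(X, T, W1, W2):
    """Accumulate dE/dW1 and dE/dW2 over the N training examples.

    X : inputs with bias column x_0 = 1, shape (N, d + 1)
    T : target vectors, shape (N, K)
    """
    grad_W1, grad_W2 = np.zeros_like(W1), np.zeros_like(W2)
    for x, t in zip(X, T):
        b = W1 @ x                                   # 1. forward propagate
        z = np.concatenate(([1.0], sigmoid(b)))
        a = W2 @ z
        y = sigmoid(a)
        delta_k = (y - t) * y * (1.0 - y)            # 3. output error signals
        h_prime = z[1:] * (1.0 - z[1:])
        delta_j = h_prime * (W2[:, 1:].T @ delta_k)  # 4. backprop to hidden units
        grad_W2 += np.outer(delta_k, z)              # 5. sum per-pattern derivatives
        grad_W1 += np.outer(delta_j, x)
    return grad_W1, grad_W2
```

Step 2 (computing E itself) is omitted here, since only the derivatives are needed to update the weights; the sum_of_squares_error sketch above covers it.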
Gradient descent training
Operation of gradient descent:
1. Start with a guess for the weight matrix W (small random numbers)
2. Update the weights by adjusting the weight matrix in the direction of -\nabla_W E
3. Recompute the error, and iterate

The update for weight w_{ki} at iteration \tau + 1 is:

w^{\tau+1}_{ki} = w^{\tau}_{ki} - \eta \frac{\partial E}{\partial w_{ki}}

The parameter \eta is the learning rate.
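A sketch of the resulting training loop, assuming backprop_grads from the previous snippet; the toy data, learning rate, and iteration count are illustrative stand-ins, not values from the lecture (apart from the 6 hidden units and 2 outputs used in the example below):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, M, K = 200, 2, 6, 2                             # 6 hidden units, 2 outputs
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # bias column x_0 = 1
labels = rng.integers(0, 2, size=(N, 1))              # random binary labels (toy data)
T = np.hstack([labels, 1 - labels]).astype(float)     # 1-of-K target vectors

W1 = 0.1 * rng.standard_normal((M, d + 1))            # small random initial weights
W2 = 0.1 * rng.standard_normal((K, M + 1))
eta = 0.1                                             # learning rate
for tau in range(100):
    grad_W1, grad_W2 = backprop_grads(X, T, W1, W2)
    W1 -= eta * grad_W1                               # w(tau+1) = w(tau) - eta dE/dw
    W2 -= eta * grad_W2
```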
MLP Example
- Netlab demmlp2
- MLP trained as a classifier on data from a known distribution
- Train the MLP: 6 hidden units, 2 output units
- Compare with a single-layer network (SLN)
- Classify test data
Training data
[Figure: the sampled training data (left) and the true probability density p(x) (right)]
Training data
[Figure: four panels showing the class-conditional densities p(x|red) and p(x|yellow), and the posterior probabilities p(red|x) and p(yellow|x)]
MLP Network Output

[Figure: left, the training data with the Bayes and network decision boundaries; right, the MLP network output over the input space]
SLN Network Output

[Figure: left, the training data with the Bayes and SLN decision boundaries; right, the SLN output over the input space]
Decision boundaries
[Figure: test data with the Bayes decision boundary and the SLN decision boundary]
MLP Example: confusion matrices on the test data

Optimal:  [79 28; 10 83]   81% correct
MLP:      [80 27; 13 80]   80% correct
SLN:      [79 38; 16 77]   73% correct
Summary
- Multi-layer perceptrons
- Multi-layer neural networks
- Back-propagation of error training
- Example