Feedforward Networks
Gradient Descent Learning and Backpropagation
CPSC 533 — Fall 2003
Christian Jacob ©
Dept. of Computer Science, University of Calgary
Adaptive "Programming" of ANNs through Learning
ANN Learning
A learning algorithm is an adaptive method by which a network of computing units self-organizes to implement the desired behavior.
Figure 1. Learning process in a parametric system (changing network parameters, testing input/output examples, calculating network errors).
In some learning algorithms, examples of the desired input-output mapping are presented to the network.
A correction step is executed iteratively until the network learns to produce the desired response.
Learning Schemes
‡ Unsupervised Learning
For a given input, the exact numerical output a network should produce is unknown. Since no "teacher" is available, the network must organize itself (e.g., in order to associate clusters with units).
Examples: Clustering with self-organizing feature maps, Kohonen networks.
Figure 2. Three clusters and a classifier network
‡ Supervised Learning
Some input vectors are collected and presented to the network. The output computed by the network is observed and the deviation from the expected answer is measured. The weights are corrected (= learning algorithm) according to the magnitude of the error.
Ë Error-correction Learning:
The magnitude of the error, together with the input vector, determines the magnitude of the corrections to the weights.
Examples: Perceptron learning, backpropagation.
Ë Reinforcement Learning:
After each presentation of an input-output example we only know whether the network produces the desired result or not. The weights are updated based on this Boolean decision (true or false).
Examples: Learning how to ride a bike.
Learning by Gradient Descent
Definition of the Learning Problem
Let us start with the simple case of linear cells, which we have introduced as perceptron units.
The linear network should learn mappings (for $m = 1, \ldots, P$) between
Ë an input pattern $x^m = (x_1^m, \ldots, x_N^m)$ and
Ë an associated target pattern $T^m$.
Figure 3. Perceptron
The output $O_i^m$ of cell $i$ for the input pattern $x^m$ is calculated as

(1)  $O_i^m = \sum_k w_{ki}\, x_k^m$

The goal of the learning procedure is that eventually the output $O_i^m$ for input pattern $x^m$ corresponds to the desired output $T_i^m$:

(2)  $O_i^m \overset{!}{=} T_i^m = \sum_k w_{ki}\, x_k^m$
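As a minimal numerical illustration of Equation (1) (the weights and input values here are made up for demonstration), the outputs of all cells can be computed at once as a matrix-vector product in Mathematica:

(* Hypothetical linear network: N = 3 inputs, 2 output cells *)
w = {{0.2, -0.5}, {0.4, 0.1}, {-0.3, 0.7}};  (* w[[k, i]]: weight from input k to output i *)
x = {1.0, 0.5, -1.0};                        (* one input pattern x^m *)
o = Transpose[w].x                           (* O_i^m = Sum_k w_ki x_k^m; here {0.7, -1.15} *)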
‡ Example: Letter Classification
Note: This letter classification will only work with non-linear (sigmoidal) processing units.
Explicit Solution (Linear Network)*
For a linear network, the weights that satisfy Equation (2) can be calculated explicitly using the pseudo-inverse:

(3)  $w_{ik} = \frac{1}{P} \sum_{m,l} T_i^m \,(Q^{-1})_{ml}\; x_k^l$

(4)  $Q_{ml} = \frac{1}{P} \sum_k x_k^m\, x_k^l$
‡ Correlation Matrix
Here $Q_{ml}$ is a component of the correlation matrix $Q = \frac{1}{P} \sum_k Q_k$ of the input patterns, where

(5)  $Q_k = \begin{pmatrix} x_k^1 x_k^1 & x_k^1 x_k^2 & \cdots & x_k^1 x_k^P \\ \vdots & \vdots & & \vdots \\ x_k^P x_k^1 & \cdots & \cdots & x_k^P x_k^P \end{pmatrix}$

You can check that this is indeed a solution by verifying

(6)  $\sum_k w_{ik}\, x_k^m = T_i^m$.
‡ Caveat
Note that $Q^{-1}$ only exists for linearly independent input patterns.
That means, if there are coefficients $a_1, \ldots, a_P$, not all zero, such that for all $k = 1, \ldots, N$

(7)  $a_1 x_k^1 + a_2 x_k^2 + \ldots + a_P x_k^P = 0$,

then the outputs $O_i^m$ cannot be selected independently from each other, and the problem is NOT solvable.
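A quick numerical check of the explicit solution (input and target patterns invented for illustration; PseudoInverse is used here for convenience, which amounts to the same construction for linearly independent patterns):

(* Hypothetical example: N = 3 inputs, M = 2 outputs, P = 2 linearly independent patterns *)
X = Transpose[{{1., 0., 1.}, {0., 1., 1.}}];   (* columns are the input patterns x^m *)
T = Transpose[{{1., 0.}, {0., 1.}}];           (* columns are the target patterns T^m *)
W = T.PseudoInverse[X];                        (* explicit weight matrix *)
W.X  (* reproduces T, i.e. Equation (6) holds for every pattern *)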
Learning by Gradient Descent (Linear Network)
Let us now try to find a learning rule for a linear network with $M$ output units.
Starting from a random initial weight setting $\vec{w}_0$, the learning procedure should find a solution weight matrix for Equation (2).
‡ Error Function
For this purpose, we define a cost or error function $E(\vec{w})$:

(8)
$E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \bigl( T_i^m - O_i^m \bigr)^2$

$E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \Bigl( T_i^m - \sum_k w_{ki}\, x_k^m \Bigr)^2$

$E(\vec{w}) \geq 0$ approaches zero as $\vec{w} = \{w_{ki}\}$ comes to satisfy Equation (2).

This cost function is a quadratic function in weight space.
‡ Paraboloid
Therefore, $E(\vec{w})$ is a paraboloid with a single global minimum.
<< RealTime3D`
Plot3D[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];
ContourPlot[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];
If the pattern vectors are linearly independent—i.e., a solution for Equation (2) exists—the minimum is at E = 0.
‡ Graphical Illustration: Following the Gradient
‡ Finding the Minimum: Following the Gradient
We can find the minimum of $E(\vec{w})$ in weight space by following the negative gradient

(9)  $-\nabla_{\vec{w}} E(\vec{w}) = -\dfrac{\partial E(\vec{w})}{\partial \vec{w}}$

We can implement this gradient strategy as follows:

‡ Changing a Weight

Each weight $w_{ki} \in \vec{w}$ is changed by $\Delta w_{ki}$, proportional to the $E$ gradient at the current weight position (i.e., the current settings of all the weights):

(10)  $\Delta w_{ki} = -\eta\, \dfrac{\partial E(\vec{w})}{\partial w_{ki}}$
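As a toy illustration of Equation (10), gradient descent on the paraboloid $x^2 + y^2$ plotted above (the step size and starting point are arbitrary choices of ours):

(* Hypothetical gradient descent on E(x, y) = x^2 + y^2 *)
step[{x_, y_}, eta_] := {x, y} - eta {2 x, 2 y};   (* Dw = -eta * gradient of E *)
NestList[step[#, 0.2] &, {4., -3.}, 10]            (* the iterates approach the minimum {0, 0} *)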
‡ Steps Towards the Solution
(11)
$\Delta w_{ki} = -\eta\, \dfrac{\partial}{\partial w_{ki}} \left( \frac{1}{2} \sum_{j=1}^{M} \sum_{m=1}^{P} \Bigl( T_j^m - \sum_n w_{nj}\, x_n^m \Bigr)^2 \right)$

$\Delta w_{ki} = -\eta\, \frac{1}{2} \sum_{m=1}^{P} \dfrac{\partial}{\partial w_{ki}} \sum_{j=1}^{M} \Bigl( T_j^m - \sum_n w_{nj}\, x_n^m \Bigr)^2$

$\Delta w_{ki} = -\eta\, \frac{1}{2} \sum_{m=1}^{P} 2 \Bigl( T_i^m - \sum_n w_{ni}\, x_n^m \Bigr) \bigl( -x_k^m \bigr)$

‡ Weight Adaptation Rule

(12)  $\Delta w_{ki} = \eta \sum_{m=1}^{P} \bigl( T_i^m - O_i^m \bigr)\, x_k^m$
The parameter $\eta$ is usually referred to as the learning rate.
In this formula, the weight adaptations are accumulated over all patterns.
‡ Delta, LMS Learning
If we change the weights after each presentation of an input pattern to the network, we get a simpler form for the weight update term:
(13)  $\Delta w_{ki} = \eta\, \bigl( T_i^m - O_i^m \bigr)\, x_k^m$

or

(14)  $\Delta w_{ki} = \eta\, \delta_i^m\, x_k^m$

with

(15)  $\delta_i^m = T_i^m - O_i^m$.
This learning rule has several names:
Ë Delta rule
Ë Adaline rule
Ë Widrow-Hoff rule
Ë LMS (least mean square) rule.
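A minimal sketch of this online delta rule for a single linear unit (the training data, initial weights, and learning rate are all invented for this example):

(* Hypothetical delta-rule training of one linear unit O = w.x *)
patterns = {{{1., 0.}, 1.}, {{0., 1.}, -1.}, {{1., 1.}, 0.}};  (* {x^m, T^m} pairs *)
w = {0.1, -0.2};   (* initial weights *)
eta = 0.25;        (* learning rate *)
Do[
  {x, t} = pat;
  o = w.x;                  (* Equation (1): output of the cell *)
  w = w + eta (t - o) x,    (* Equations (13)-(15): delta rule *)
  {epoch, 1, 50}, {pat, patterns}];
w   (* approaches the exact solution {1, -1} for these patterns *)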
Gradient Descent Learning with Nonlinear Cells
We will now extend the gradient descent technique to the case of nonlinear cells, that is, where the activation/output function is a general nonlinear function $g(x)$.
† The input function is denoted by $h(x)$.
† The activation/output function $g(h(x))$ is assumed to be differentiable in $x$.
‡ Remember:
‡ Rewriting the Error Function
The definition of the error function (Equation (8)) can be simply rewritten as follows:
(16)
$E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \bigl( T_i^m - O_i^m \bigr)^2$

$E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \Bigl( T_i^m - g\Bigl( \sum_k w_{ki}\, x_k^m \Bigr) \Bigr)^2$
‡ Weight Gradients
Consequently, we can compute the $w_{ki}$ gradients:

(17)  $\dfrac{\partial E(\vec{w})}{\partial w_{ki}} = -\sum_{m=1}^{P} \bigl( T_i^m - g(h_i^m) \bigr) \cdot g'(h_i^m) \cdot x_k^m$
‡ From Weight Gradients to the Learning Rule
This eventually (after some more calculations) shows us that the adaptation term $\Delta w_{ki}$ for $w_{ki}$ has the same form as in Equations (10), (13), and (14), namely:

(18)  $\Delta w_{ki} = \eta\, \delta_i^m\, x_k^m$

where

(19)  $\delta_i^m = \bigl( T_i^m - O_i^m \bigr) \cdot g'(h_i^m)$
Suitable Activation Functions
The calculation of the above δ terms is easy for the following functions g, which are commonly used as activation functions:
‡ Hyperbolic Tangent:

(20)  $g(x) = \tanh(\beta x)$
$g'(x) = \beta\, \bigl( 1 - g^2(x) \bigr)$
Hyperbolic tangent plot:
Plot[Tanh[x], {x, -5, 5}];
Plot of the first derivative:
Plot[Tanh'[x], {x, -5, 5}];
Check for equality with $1 - \tanh^2 x$:
Plot[1 - Tanh[x]^2, {x, -5, 5}];
Influence of the β parameter:
p1[b_] := Plot[Tanh[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
p2[b_] := Plot[Tanh'[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];
Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];
‡ Sigmoid:
(21)  $g(x) = \dfrac{1}{1 + e^{-2 \beta x}}$
$g'(x) = 2 \beta\, g(x)\, \bigl( 1 - g(x) \bigr)$

Sigmoid plot:
sigmoid[x_, b_] := 1/(1 + E^(-2 b x))
Plot[sigmoid[x, 1], {x, -5, 5}];
Plot of the first derivative:
D[sigmoid[x, b], x]

$\dfrac{2\, b\, e^{-2 b x}}{\bigl( 1 + e^{-2 b x} \bigr)^2}$
Plot[D[sigmoid[x, 1], x] // Evaluate, {x, -5, 5}];
Check for equality with $2\, g\, (1 - g)$:
Plot[2 sigmoid[x, 1] (1 - sigmoid[x, 1]), {x, -5, 5}];
Influence of the β parameter:
p1[b_] := Plot[sigmoid[x, b], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
p2[b_] := Plot[D[sigmoid[x, b], x] // Evaluate, {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];
Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];
δ Update Rule for Sigmoid Units

Using the sigmoidal activation function, the δ update rule takes the simple form

(22)  $\delta_i^m = O_i^m\, \bigl( 1 - O_i^m \bigr)\, \bigl( T_i^m - O_i^m \bigr)$,

which is used in the weight update rule:

(23)  $\Delta w_{ki} = \eta\, \delta_i^m\, x_k^m$
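A small sketch of this rule in use, training a single sigmoid unit (the patterns, targets, initial weights, and learning rate below are our own illustrative choices; the targets are picked so they are exactly realizable without a bias term):

(* Hypothetical training of one sigmoid unit with the delta rule (22)-(23) *)
g[h_] := 1/(1 + E^(-h));
patterns = {{{1., 0.}, 0.8}, {{0., 1.}, 0.2}, {{1., 1.}, 0.5}};  (* {x^m, T^m} pairs *)
w = {0.1, -0.1};   (* initial weights *)
eta = 0.5;         (* learning rate *)
Do[
  {x, t} = pat;
  o = g[w.x];                (* forward pass *)
  d = o (1 - o) (t - o);     (* Equation (22) *)
  w = w + eta d x,           (* Equation (23) *)
  {epoch, 1, 2000}, {pat, patterns}];
g[w.#[[1]]] & /@ patterns    (* outputs approach the targets {0.8, 0.2, 0.5} *)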
Learning in Multilayer Networks
Multilayer networks with nonlinear processing elements have a wider capability for solving classification tasks.
Learning by error backpropagation is a common method to train multilayer networks.
Error Backpropagation
The backpropagation (BP) algorithm describes an update procedure for the set of weights $\vec{w}$ in a feedforward multilayer network.
The network has to learn input-output patterns $\{x_k^m, T_i^m\}$.
The basis for BP learning is, again, a gradient descent technique similar to the one used for perceptron learning, as described above.
‡ Notation
We use the following notation:
Ë $x_k^m$: value of input unit $k$ for training pattern $m$; $k = 1, \ldots, N$; $m = 1, \ldots, P$
Ë $H_j$: output of hidden unit $j$
Ë $O_i$: output of output unit $i$, $i = 1, \ldots, M$
Ë $w_{kj}$: weight of the link from input unit $k$ to hidden unit $j$
Ë $W_{ji}$: weight of the link from hidden unit $j$ to output unit $i$
‡ Propagating the input through the network
For pattern m the hidden unit j receives the input
(24)  $h_j^m = \sum_{k=1}^{N} w_{kj}\, x_k^m$

and generates the output

(25)  $H_j^m = g(h_j^m) = g\Bigl( \sum_{k=1}^{N} w_{kj}\, x_k^m \Bigr)$.

These signals are propagated to the output cells, which receive the signals

(26)  $h_i^m = \sum_j W_{ij}\, H_j^m = \sum_j W_{ij}\, g\Bigl( \sum_{k=1}^{N} w_{kj}\, x_k^m \Bigr)$

and generate the output

(27)  $O_i^m = g(h_i^m) = g\Bigl( \sum_j W_{ij}\, g\Bigl( \sum_{k=1}^{N} w_{kj}\, x_k^m \Bigr) \Bigr)$
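A compact numerical check of Equations (24)-(27) (the layer sizes, weights, and input pattern are made up; g is the logistic sigmoid defined earlier):

(* Hypothetical forward pass through one hidden layer *)
g[h_] := 1/(1 + E^(-h));
w = {{0.5, -0.4}, {0.3, 0.8}, {-0.6, 0.2}};   (* w[[k, j]]: input k -> hidden j *)
W = {{1.2, -0.7}};                            (* W[[i, j]]: hidden j -> output i *)
x = {1., 0.5, -1.};                           (* one input pattern x^m *)
h = Transpose[w].x;        (* Equation (24): net inputs of the hidden units *)
H = g /@ h;                (* Equation (25): hidden-unit outputs *)
o = g /@ (W.H)             (* Equations (26)-(27): network output *)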
‡ Error function
We use the known quadratic function as our error function:
(28)  $E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \bigl( T_i^m - O_i^m \bigr)^2$

Continuing the calculations, we get:

(29)
$E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \bigl( T_i^m - g(h_i^m) \bigr)^2$

$E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \Bigl( T_i^m - g\Bigl( \sum_j W_{ij}\, g\Bigl( \sum_{k=1}^{N} w_{kj}\, x_k^m \Bigr) \Bigr) \Bigr)^2$

$E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \Bigl( T_i^m - g\Bigl( \sum_j W_{ij}\, H_j^m \Bigr) \Bigr)^2$
‡ Updating the weights: hidden—output layer
For the connections from hidden to output cells we can use the delta weight update rule:
(30)
$\Delta W_{ji} = -\eta\, \dfrac{\partial E}{\partial W_{ji}}$

$\Delta W_{ji} = \eta \sum_m \bigl( T_i^m - O_i^m \bigr)\, g'(h_i^m)\, H_j^m$

$\Delta W_{ji} = \eta \sum_m \delta_i^m\, H_j^m$

with

(31)  $\delta_i^m = g'(h_i^m)\, \bigl( T_i^m - O_i^m \bigr)$

‡ Updating the weights: input—hidden layer
(32)
$\Delta w_{kj} = -\eta\, \dfrac{\partial E}{\partial w_{kj}}$

$\Delta w_{kj} = -\eta \sum_m \dfrac{\partial E}{\partial H_j^m} \cdot \dfrac{\partial H_j^m}{\partial w_{kj}}$

After a few more calculations we get the following weight update rule:

(33)  $\Delta w_{kj} = \eta \sum_m \delta_j^m\, x_k^m$

with

(34)  $\delta_j^m = g'(h_j^m) \sum_i W_{ji}\, \delta_i^m$
The Backpropagation Algorithm
For the BP algorithm we use the following notation:
Ë $V_i^m$: output of cell $i$ in layer $m$
Ë $V_i^0$: corresponds to $x_i$, the $i$-th input component
Ë $w_{ji}^m$: the connection from $V_j^{m-1}$ to $V_i^m$
‡ Backpropagation Algorithm
Ï Step 1: Initialize all weights with random values.
Ï Step 2: Select a pattern $x^\mu$ and attach it to the input layer ($m = 0$):

(35)  $V_j^0 = x_j^\mu \quad \forall j$

Ï Step 3: Propagate the signals through all layers:

(36)  $V_i^m = g(h_i^m) = g\Bigl( \sum_j w_{ji}^m\, V_j^{m-1} \Bigr) \quad \forall i, \forall m$

Ï Step 4: Calculate the δ's of the output layer:

(37)  $\delta_i^M = g'(h_i^M)\, \bigl( T_i^\mu - V_i^M \bigr)$

Ï Step 5: Calculate the δ's for the inner layers by error backpropagation:

(38)  $\delta_i^{m-1} = g'(h_i^{m-1}) \sum_j w_{ij}^m\, \delta_j^m, \qquad m = M, M-1, \ldots, 2$

Ï Step 6: Adapt all connection weights:

(39)  $w_{ji}^{\text{new}} = w_{ji}^{\text{old}} + \Delta w_{ji} \quad \text{with} \quad \Delta w_{ji}^m = \eta\, \delta_i^m\, V_j^{m-1}$
Ï Step 7: Go back to Step 2 for the next training pattern.
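To tie the seven steps together, here is a compact sketch of the full procedure for one hidden layer, learning XOR. Everything in it (network size, learning rate, number of epochs, the folded-in bias units, the logistic activation) is our own illustrative choice, not prescribed by the notes:

(* Hypothetical BP sketch: 2 inputs, 3 hidden units, 1 sigmoid output, XOR data.
   Bias terms are handled by appending a constant 1 to the input and hidden vectors. *)
g[h_] := 1/(1 + E^(-h));
data = {{{0., 0.}, 0.}, {{0., 1.}, 1.}, {{1., 0.}, 1.}, {{1., 1.}, 0.}};
eta = 0.5;
w = RandomReal[{-0.5, 0.5}, {3, 3}];   (* Step 1: input(+bias) -> hidden weights *)
W = RandomReal[{-0.5, 0.5}, 4];        (* Step 1: hidden(+bias) -> output weights *)
Do[
  {x, t} = pat;                             (* Step 2: attach pattern to input layer *)
  V0 = Append[x, 1.];
  V1 = Append[g /@ (Transpose[w].V0), 1.];  (* Step 3: propagate to hidden layer *)
  o = g[W.V1];                              (* Step 3: propagate to output layer *)
  dOut = o (1 - o) (t - o);                 (* Step 4: output delta, Equations (22)/(37) *)
  dHid = V1 (1 - V1) (W dOut);              (* Step 5: backpropagated deltas, Equation (38) *)
  W = W + eta dOut V1;                      (* Step 6: hidden -> output update *)
  w = w + eta Outer[Times, V0, Most[dHid]], (* Step 6: input -> hidden update *)
  {epoch, 1, 5000}, {pat, data}];           (* Step 7: loop over patterns and epochs *)
g[W.Append[g /@ (Transpose[w].Append[First[#], 1.]), 1.]] & /@ data
(* for most random initializations the outputs approach {0, 1, 1, 0} *)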