Feedforward Networks
Gradient Descent Learning and Backpropagation
CPSC 533 — Fall 2003
Christian Jacob ©
Dept. of Computer Science, University of Calgary
Adaptive "Programming" of ANNs through Learning
ANN Learning
A learning algorithm is an adaptive method by which a network of computing units self-organizes to implement the desired behavior.
Figure 1. Learning process in a parametric system (changing network parameters, testing input/output examples, calculating network errors).
In some learning algorithms, examples of the desired input-output mapping are presented to the network.
A correction step is executed iteratively until the network learns to produce the desired response.
Learning Schemes
‡ Unsupervised Learning
For a given input, the exact numerical output a network should produce is unknown. Since no "teacher" is available, the network must organize itself (e.g., in order to associate clusters with units).
Examples: Clustering with self-organizing feature maps, Kohonen networks.
Figure 2. Three clusters and a classifier network
‡ Supervised Learning
Some input vectors are collected and presented to the network. The output computed by the network is observed and the deviation from the expected answer is measured. The weights are corrected (= learning algorithm) according to the magnitude of the error.
Ë Error-correction Learning:
The magnitude of the error, together with the input vector, determines the magnitude of the corrections to the weights.
Examples: Perceptron learning, backpropagation.
Ë Reinforcement Learning:
After each presentation of an input-output example we only know whether the network produces the desired result or not. The weights are updated based on this Boolean decision (true or false).
Examples: Learning how to ride a bike.
Learning by Gradient Descent
Definition of the Learning Problem
Let us start with the simple case of linear cells, which we have introduced as perceptron units.
The linear network should learn mappings (for $m = 1, \ldots, P$) between
Ë an input pattern $x^m = (x_1^m, \ldots, x_N^m)$ and
Ë an associated target pattern $T^m$.
Figure 3. Perceptron
The output $O_i^m$ of cell $i$ for the input pattern $x^m$ is calculated as

(1)  $O_i^m = \sum_k w_{ki}\, x_k^m$

The goal of the learning procedure is that eventually the output $O_i^m$ for input pattern $x^m$ corresponds to the desired output $T_i^m$:

(2)  $O_i^m \overset{!}{=} T_i^m = \sum_k w_{ki}\, x_k^m$
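As a minimal numerical illustration of Equation (1) (the weights and input values here are made up for demonstration), the outputs of all cells can be computed at once as a matrix-vector product in Mathematica:

(* Hypothetical linear network: N = 3 inputs, 2 output cells *)
w = {{0.2, -0.5}, {0.4, 0.1}, {-0.3, 0.7}};  (* w[[k, i]]: weight from input k to output i *)
x = {1.0, 0.5, -1.0};                        (* one input pattern x^m *)
o = Transpose[w].x                           (* O_i^m = Sum_k w_ki x_k^m; here {0.7, -1.15} *)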
‡ Example: Letter Classification
Note: This letter classification will only work with non-linear (sigmoidal) processing units.
Explicit Solution (Linear Network)*
For a linear network, the weights that satisfy Equation (2) can be calculated explicitly using the pseudo-inverse:

(3)  $w_{ik} = \frac{1}{P} \sum_{m,l} T_i^m \,(Q^{-1})_{ml}\; x_k^l$

(4)  $Q_{ml} = \frac{1}{P} \sum_k x_k^m\, x_k^l$
‡ Correlation Matrix
Here $Q_{ml}$ is a component of the correlation matrix $Q = \frac{1}{P} \sum_k Q_k$ of the input patterns, where

(5)  $Q_k = \begin{pmatrix} x_k^1 x_k^1 & x_k^1 x_k^2 & \cdots & x_k^1 x_k^P \\ \vdots & \vdots & & \vdots \\ x_k^P x_k^1 & \cdots & \cdots & x_k^P x_k^P \end{pmatrix}$

You can check that this is indeed a solution by verifying

(6)  $\sum_k w_{ik}\, x_k^m = T_i^m$.
‡ Caveat
Note that $Q^{-1}$ only exists for linearly independent input patterns.
That means, if there are coefficients $a_1, \ldots, a_P$, not all zero, such that for all $k = 1, \ldots, N$

(7)  $a_1 x_k^1 + a_2 x_k^2 + \ldots + a_P x_k^P = 0$,

then the outputs $O_i^m$ cannot be selected independently from each other, and the problem is NOT solvable.
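A quick numerical check of the explicit solution (input and target patterns invented for illustration; PseudoInverse is used here for convenience, which amounts to the same construction for linearly independent patterns):

(* Hypothetical example: N = 3 inputs, M = 2 outputs, P = 2 linearly independent patterns *)
X = Transpose[{{1., 0., 1.}, {0., 1., 1.}}];   (* columns are the input patterns x^m *)
T = Transpose[{{1., 0.}, {0., 1.}}];           (* columns are the target patterns T^m *)
W = T.PseudoInverse[X];                        (* explicit weight matrix *)
W.X  (* reproduces T, i.e. Equation (6) holds for every pattern *)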
Learning by Gradient Descent (Linear Network)
Let us now try to find a learning rule for a linear network with $M$ output units.
Starting from a random initial weight setting $\vec{w}_0$, the learning procedure should find a solution weight matrix for Equation (2).
‡ Error Function
For this purpose, we define a cost or error function $E(\vec{w})$:

(8)
$E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \bigl( T_i^m - O_i^m \bigr)^2$

$E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \Bigl( T_i^m - \sum_k w_{ki}\, x_k^m \Bigr)^2$

$E(\vec{w}) \geq 0$ approaches zero as $\vec{w} = \{w_{ki}\}$ comes to satisfy Equation (2).

This cost function is a quadratic function in weight space.
‡ Paraboloid
Therefore, $E(\vec{w})$ is a paraboloid with a single global minimum.
<< RealTime3D`
Plot3D[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];
ContourPlot[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];
If the pattern vectors are linearly independent—i.e., a solution for Equation (2) exists—the minimum is at E = 0.
‡ Graphical Illustration: Following the Gradient
‡ Finding the Minimum: Following the Gradient
We can find the minimum of $E(\vec{w})$ in weight space by following the negative gradient

(9)  $-\nabla_{\vec{w}} E(\vec{w}) = -\dfrac{\partial E(\vec{w})}{\partial \vec{w}}$

We can implement this gradient strategy as follows:

‡ Changing a Weight

Each weight $w_{ki} \in \vec{w}$ is changed by $\Delta w_{ki}$, proportional to the $E$ gradient at the current weight position (i.e., the current settings of all the weights):

(10)  $\Delta w_{ki} = -\eta\, \dfrac{\partial E(\vec{w})}{\partial w_{ki}}$
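As a toy illustration of Equation (10), gradient descent on the paraboloid $x^2 + y^2$ plotted above (the step size and starting point are arbitrary choices of ours):

(* Hypothetical gradient descent on E(x, y) = x^2 + y^2 *)
step[{x_, y_}, eta_] := {x, y} - eta {2 x, 2 y};   (* Dw = -eta * gradient of E *)
NestList[step[#, 0.2] &, {4., -3.}, 10]            (* the iterates approach the minimum {0, 0} *)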
‡ Steps Towards the Solution
(11)
$\Delta w_{ki} = -\eta\, \dfrac{\partial}{\partial w_{ki}} \left( \frac{1}{2} \sum_{j=1}^{M} \sum_{m=1}^{P} \Bigl( T_j^m - \sum_n w_{nj}\, x_n^m \Bigr)^2 \right)$

$\Delta w_{ki} = -\eta\, \frac{1}{2} \sum_{m=1}^{P} \dfrac{\partial}{\partial w_{ki}} \sum_{j=1}^{M} \Bigl( T_j^m - \sum_n w_{nj}\, x_n^m \Bigr)^2$

$\Delta w_{ki} = -\eta\, \frac{1}{2} \sum_{m=1}^{P} 2 \Bigl( T_i^m - \sum_n w_{ni}\, x_n^m \Bigr) \bigl( -x_k^m \bigr)$

‡ Weight Adaptation Rule

(12)  $\Delta w_{ki} = \eta \sum_{m=1}^{P} \bigl( T_i^m - O_i^m \bigr)\, x_k^m$
The parameter $\eta$ is usually referred to as the learning rate.
In this formula, the weight adaptations are accumulated over all patterns.
‡ Delta, LMS Learning
If we change the weights after each presentation of an input pattern to the network, we get a simpler form for the weight update term:
(13)  $\Delta w_{ki} = \eta\, \bigl( T_i^m - O_i^m \bigr)\, x_k^m$

or

(14)  $\Delta w_{ki} = \eta\, \delta_i^m\, x_k^m$

with

(15)  $\delta_i^m = T_i^m - O_i^m$.
This learning rule has several names:
Ë Delta rule
Ë Adaline rule
Ë Widrow-Hoff rule
Ë LMS (least mean square) rule.
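A minimal sketch of this online delta rule for a single linear unit (the training data, initial weights, and learning rate are all invented for this example):

(* Hypothetical delta-rule training of one linear unit O = w.x *)
patterns = {{{1., 0.}, 1.}, {{0., 1.}, -1.}, {{1., 1.}, 0.}};  (* {x^m, T^m} pairs *)
w = {0.1, -0.2};   (* initial weights *)
eta = 0.25;        (* learning rate *)
Do[
  {x, t} = pat;
  o = w.x;                  (* Equation (1): output of the cell *)
  w = w + eta (t - o) x,    (* Equations (13)-(15): delta rule *)
  {epoch, 1, 50}, {pat, patterns}];
w   (* approaches the exact solution {1, -1} for these patterns *)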
Gradient Descent Learning with Nonlinear Cells
We will now extend the gradient descent technique to the case of nonlinear cells, that is, where the activation/output function is a general nonlinear function $g(x)$.
† The input function is denoted by $h(x)$.
† The activation/output function $g(h(x))$ is assumed to be differentiable in $x$.
‡ Remember:
‡ Rewriting the Error Function
The definition of the error function (Equation (8)) can be simply rewritten as follows:
(16)
$E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \bigl( T_i^m - O_i^m \bigr)^2$

$E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \Bigl( T_i^m - g\Bigl( \sum_k w_{ki}\, x_k^m \Bigr) \Bigr)^2$
‡ Weight Gradients
Consequently, we can compute the $w_{ki}$ gradients:

(17)  $\dfrac{\partial E(\vec{w})}{\partial w_{ki}} = -\sum_{m=1}^{P} \bigl( T_i^m - g(h_i^m) \bigr) \cdot g'(h_i^m) \cdot x_k^m$
‡ From Weight Gradients to the Learning Rule
This eventually (after some more calculations) shows us that the adaptation term $\Delta w_{ki}$ for $w_{ki}$ has the same form as in Equations (10), (13), and (14), namely:

(18)  $\Delta w_{ki} = \eta\, \delta_i^m\, x_k^m$

where

(19)  $\delta_i^m = \bigl( T_i^m - O_i^m \bigr) \cdot g'(h_i^m)$
Suitable Activation Functions
The calculation of the above δ terms is easy for the following functions g, which are commonly used as activation functions:
‡ Hyperbolic Tangent:

(20)  $g(x) = \tanh(\beta x)$
$g'(x) = \beta\, \bigl( 1 - g^2(x) \bigr)$
Hyperbolic tangent plot:
Plot[Tanh[x], {x, -5, 5}];
Plot of the first derivative:
Plot[Tanh'[x], {x, -5, 5}];
Check for equality with $1 - \tanh^2 x$:
Plot[1 - Tanh[x]^2, {x, -5, 5}];
Influence of the β parameter:
p1[b_] := Plot[Tanh[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
p2[b_] := Plot[Tanh'[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];
Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];
‡ Sigmoid:
(21)  $g(x) = \dfrac{1}{1 + e^{-2 \beta x}}$
$g'(x) = 2 \beta\, g(x)\, \bigl( 1 - g(x) \bigr)$

Sigmoid plot:
sigmoid[x_, b_] := 1/(1 + E^(-2 b x))
Plot[sigmoid[x, 1], {x, -5, 5}];
Plot of the first derivative:
D[sigmoid[x, b], x]

$\dfrac{2\, b\, e^{-2 b x}}{\bigl( 1 + e^{-2 b x} \bigr)^2}$
Plot[D[sigmoid[x, 1], x] // Evaluate, {x, -5, 5}];
Check for equality with $2\, g\, (1 - g)$:
Plot[2 sigmoid[x, 1] (1 - sigmoid[x, 1]), {x, -5, 5}];
Influence of the β parameter:
p1[b_] := Plot[sigmoid[x, b], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
p2[b_] := Plot[D[sigmoid[x, b], x] // Evaluate, {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];
Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];
δ Update Rule for Sigmoid Units

Using the sigmoidal activation function, the δ update rule takes the simple form

(22)  $\delta_i^m = O_i^m\, \bigl( 1 - O_i^m \bigr)\, \bigl( T_i^m - O_i^m \bigr)$,

which is used in the weight update rule:

(23)  $\Delta w_{ki} = \eta\, \delta_i^m\, x_k^m$
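A small sketch of this rule in use, training a single sigmoid unit (the patterns, targets, initial weights, and learning rate below are our own illustrative choices; the targets are picked so they are exactly realizable without a bias term):

(* Hypothetical training of one sigmoid unit with the delta rule (22)-(23) *)
g[h_] := 1/(1 + E^(-h));
patterns = {{{1., 0.}, 0.8}, {{0., 1.}, 0.2}, {{1., 1.}, 0.5}};  (* {x^m, T^m} pairs *)
w = {0.1, -0.1};   (* initial weights *)
eta = 0.5;         (* learning rate *)
Do[
  {x, t} = pat;
  o = g[w.x];                (* forward pass *)
  d = o (1 - o) (t - o);     (* Equation (22) *)
  w = w + eta d x,           (* Equation (23) *)
  {epoch, 1, 2000}, {pat, patterns}];
g[w.#[[1]]] & /@ patterns    (* outputs approach the targets {0.8, 0.2, 0.5} *)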
Learning in Multilayer Networks
Multilayer networks with nonlinear processing elements have a wider capability for solving classification tasks.
Learning by error backpropagation is a common method to train multilayer networks.
Error Backpropagation
The backpropagation (BP) algorithm describes an update procedure for the set of weights $\vec{w}$ in a feedforward multilayer network.
The network has to learn input-output patterns $\{x_k^m, T_i^m\}$.
The basis for BP learning is, again, a gradient descent technique similar to the one used for perceptron learning, as described above.
‡ Notation
We use the following notation:
Ë $x_k^m$: value of input unit $k$ for training pattern $m$; $k = 1, \ldots, N$; $m = 1, \ldots, P$
Ë $H_j$: output of hidden unit $j$
Ë $O_i$: output of output unit $i$, $i = 1, \ldots, M$
Ë $w_{kj}$: weight of the link from input unit $k$ to hidden unit $j$
Ë $W_{ji}$: weight of the link from hidden unit $j$ to output unit $i$
‡ Propagating the input through the network
For pattern m the hidden unit j receives the input
(24)  $h_j^m = \sum_{k=1}^{N} w_{kj}\, x_k^m$

and generates the output

(25)  $H_j^m = g(h_j^m) = g\Bigl( \sum_{k=1}^{N} w_{kj}\, x_k^m \Bigr)$.

These signals are propagated to the output cells, which receive the signals

(26)  $h_i^m = \sum_j W_{ij}\, H_j^m = \sum_j W_{ij}\, g\Bigl( \sum_{k=1}^{N} w_{kj}\, x_k^m \Bigr)$

and generate the output

(27)  $O_i^m = g(h_i^m) = g\Bigl( \sum_j W_{ij}\, g\Bigl( \sum_{k=1}^{N} w_{kj}\, x_k^m \Bigr) \Bigr)$
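A compact numerical check of Equations (24)-(27) (the layer sizes, weights, and input pattern are made up; g is the logistic sigmoid defined earlier):

(* Hypothetical forward pass through one hidden layer *)
g[h_] := 1/(1 + E^(-h));
w = {{0.5, -0.4}, {0.3, 0.8}, {-0.6, 0.2}};   (* w[[k, j]]: input k -> hidden j *)
W = {{1.2, -0.7}};                            (* W[[i, j]]: hidden j -> output i *)
x = {1., 0.5, -1.};                           (* one input pattern x^m *)
h = Transpose[w].x;        (* Equation (24): net inputs of the hidden units *)
H = g /@ h;                (* Equation (25): hidden-unit outputs *)
o = g /@ (W.H)             (* Equations (26)-(27): network output *)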
‡ Error function
We use the known quadratic function as our error function:
(28)  $E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \bigl( T_i^m - O_i^m \bigr)^2$

Continuing the calculations, we get:

(29)
$E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \bigl( T_i^m - g(h_i^m) \bigr)^2$

$E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \Bigl( T_i^m - g\Bigl( \sum_j W_{ij}\, g\Bigl( \sum_{k=1}^{N} w_{kj}\, x_k^m \Bigr) \Bigr) \Bigr)^2$

$E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \Bigl( T_i^m - g\Bigl( \sum_j W_{ij}\, H_j^m \Bigr) \Bigr)^2$
‡ Updating the weights: hidden—output layer
For the connections from hidden to output cells we can use the delta weight update rule:
(30)
$\Delta W_{ji} = -\eta\, \dfrac{\partial E}{\partial W_{ji}}$

$\Delta W_{ji} = \eta \sum_m \bigl( T_i^m - O_i^m \bigr)\, g'(h_i^m)\, H_j^m$

$\Delta W_{ji} = \eta \sum_m \delta_i^m\, H_j^m$

with

(31)  $\delta_i^m = g'(h_i^m)\, \bigl( T_i^m - O_i^m \bigr)$

‡ Updating the weights: input—hidden layer
(32)
$\Delta w_{kj} = -\eta\, \dfrac{\partial E}{\partial w_{kj}}$

$\Delta w_{kj} = -\eta \sum_m \dfrac{\partial E}{\partial H_j^m} \cdot \dfrac{\partial H_j^m}{\partial w_{kj}}$

After a few more calculations we get the following weight update rule:

(33)  $\Delta w_{kj} = \eta \sum_m \delta_j^m\, x_k^m$

with

(34)  $\delta_j^m = g'(h_j^m) \sum_i W_{ji}\, \delta_i^m$
The Backpropagation Algorithm
For the BP algorithm we use the following notation:
Ë $V_i^m$: output of cell $i$ in layer $m$
Ë $V_i^0$: corresponds to $x_i$, the $i$-th input component
Ë $w_{ji}^m$: the connection from $V_j^{m-1}$ to $V_i^m$
‡ Backpropagation Algorithm
Ï Step 1: Initialize all weights with random values.
Ï Step 2: Select a pattern $x^\mu$ and attach it to the input layer ($m = 0$):

(35)  $V_j^0 = x_j^\mu \quad \forall j$

Ï Step 3: Propagate the signals through all layers:

(36)  $V_i^m = g(h_i^m) = g\Bigl( \sum_j w_{ji}^m\, V_j^{m-1} \Bigr) \quad \forall i, \forall m$

Ï Step 4: Calculate the δ's of the output layer:

(37)  $\delta_i^M = g'(h_i^M)\, \bigl( T_i^\mu - V_i^M \bigr)$

Ï Step 5: Calculate the δ's for the inner layers by error backpropagation:

(38)  $\delta_i^{m-1} = g'(h_i^{m-1}) \sum_j w_{ij}^m\, \delta_j^m, \qquad m = M, M-1, \ldots, 2$

Ï Step 6: Adapt all connection weights:

(39)  $w_{ji}^{\text{new}} = w_{ji}^{\text{old}} + \Delta w_{ji} \quad \text{with} \quad \Delta w_{ji}^m = \eta\, \delta_i^m\, V_j^{m-1}$
Ï Step 7: Go back to Step 2 for the next training pattern.
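To tie the seven steps together, here is a compact sketch of the full procedure for one hidden layer, learning XOR. Everything in it (network size, learning rate, number of epochs, the folded-in bias units, the logistic activation) is our own illustrative choice, not prescribed by the notes:

(* Hypothetical BP sketch: 2 inputs, 3 hidden units, 1 sigmoid output, XOR data.
   Bias terms are handled by appending a constant 1 to the input and hidden vectors. *)
g[h_] := 1/(1 + E^(-h));
data = {{{0., 0.}, 0.}, {{0., 1.}, 1.}, {{1., 0.}, 1.}, {{1., 1.}, 0.}};
eta = 0.5;
w = RandomReal[{-0.5, 0.5}, {3, 3}];   (* Step 1: input(+bias) -> hidden weights *)
W = RandomReal[{-0.5, 0.5}, 4];        (* Step 1: hidden(+bias) -> output weights *)
Do[
  {x, t} = pat;                             (* Step 2: attach pattern to input layer *)
  V0 = Append[x, 1.];
  V1 = Append[g /@ (Transpose[w].V0), 1.];  (* Step 3: propagate to hidden layer *)
  o = g[W.V1];                              (* Step 3: propagate to output layer *)
  dOut = o (1 - o) (t - o);                 (* Step 4: output delta, Equations (22)/(37) *)
  dHid = V1 (1 - V1) (W dOut);              (* Step 5: backpropagated deltas, Equation (38) *)
  W = W + eta dOut V1;                      (* Step 6: hidden -> output update *)
  w = w + eta Outer[Times, V0, Most[dHid]], (* Step 6: input -> hidden update *)
  {epoch, 1, 5000}, {pat, data}];           (* Step 7: loop over patterns and epochs *)
g[W.Append[g /@ (Transpose[w].Append[First[#], 1.]), 1.]] & /@ data
(* for most random initializations the outputs approach {0, 1, 1, 0} *)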