Lecture 4: Multi-Layer Perceptrons




Review of Gradient Descent Learning

1. The purpose of neural network training is to minimize the output errors on a particular set of training data by adjusting the network weights w

2. We define a Cost Function E(w) that measures how far the current network’s output is from the desired one

3. Partial derivatives of the cost function ∂E(w)/∂w tell us which direction we need to move in weight space to reduce the error

4. The learning rate η specifies the step sizes we take in weight space for each iteration of the weight update equation

5. We keep stepping through weight space until the errors are ‘small enough’.
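To make these steps concrete, here is a minimal sketch of gradient descent on a one-dimensional cost function (the quadratic cost, starting weight, learning rate, and stopping tolerance are all illustrative assumptions, not taken from the slides):

```python
# Minimal gradient descent sketch: minimize E(w) = (w - 2)^2.
# The cost function, starting point, eta and tolerance are illustrative.

def cost(w):
    return (w - 2.0) ** 2

def grad(w):
    return 2.0 * (w - 2.0)      # dE/dw

eta = 0.1                       # learning rate: step size in weight space
w = 5.0                         # arbitrary starting weight
while cost(w) > 1e-6:           # keep stepping until 'small enough'
    w -= eta * grad(w)          # weight update: w <- w - eta * dE/dw

print(w)                        # ends close to the minimum at w = 2
```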


Graphical Representation of GDR

[Figure: total error plotted against a weight wi, showing a local minimum and the global minimum at the ideal weight value.]


Review of Perceptron Training

1. Generate a training pair or pattern x that you wish your network to learn

2. Set up your network with N input units fully connected to M output units.

3. Initialize weights, w, at random

4. Select an appropriate error function E(w) and learning rate η

5. Apply the weight change Δw = −η ∂E(w)/∂w to each weight w for each training pattern p. One set of updates for all the weights for all the training patterns is called one epoch of training.

6. Repeat step 5 until the network error function is ‘small enough’
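As a minimal sketch of this procedure, here is perceptron training on a linearly separable problem (the AND training set, weight range, learning rate and epoch limit are illustrative assumptions):

```python
import random

# Sketch of single-layer perceptron training; data and parameters illustrative.
patterns = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]  # logical AND

N = 2                                               # step 2: N input units
w = [random.uniform(-0.5, 0.5) for _ in range(N)]   # step 3: random weights
b = random.uniform(-0.5, 0.5)                       # bias (threshold) weight
eta = 0.1                                           # step 4: learning rate
step = lambda a: 1 if a > 0 else 0                  # threshold activation

for epoch in range(100):            # one pass over all patterns = one epoch
    errors = 0
    for x, target in patterns:
        y = step(sum(wi * xi for wi, xi in zip(w, x)) + b)
        delta = target - y          # error for this pattern
        errors += abs(delta)
        for i in range(N):          # step 5: dw_i = eta * delta * x_i
            w[i] += eta * delta * x[i]
        b += eta * delta
    if errors == 0:                 # step 6: 'small enough' (here: zero)
        break
```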


Review of XOR and Linear Separability

Recall that it is not possible to find weights that enable Single Layer Perceptrons to deal with non-linearly separable problems like XOR

XOR

in1  in2  out
 0    0    0
 0    1    1
 1    0    1
 1    1    0

[Figure: the four XOR patterns plotted in the I1–I2 plane; no single straight line can separate out = 0 from out = 1.]

The proposed solution was to use a more complex network that is able to generate more complex decision boundaries. That network is the Multi-Layer Perceptron.


Multi-Layer Perceptrons (MLPs)

[Figure: network diagram of a two-layer feed-forward network, with sigmoidal units in the hidden and output layers.]

Output layer, k:   Yk = f( Σj wkj Oj )

Hidden layer, j:   Oj = f( Σi wji Xi )

Input layer, i:    X1, X2, X3, …, Xi
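A minimal sketch of this forward pass (the sigmoid activation, layer sizes and weight values are illustrative assumptions):

```python
import math

# Forward pass sketch: Oj = f(sum_i wji*Xi), then Yk = f(sum_j wkj*Oj).
# Layer sizes and all weight values below are illustrative.

def f(a):
    return 1.0 / (1.0 + math.exp(-a))     # sigmoid activation

def layer(inputs, weights):
    # weights[u] holds the incoming weights of unit u in this layer
    return [f(sum(wv * xv for wv, xv in zip(row, inputs))) for row in weights]

X = [0.5, 0.1, 0.9]                            # input layer, i
w_ji = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.6]]    # input-to-hidden weights
w_kj = [[0.5, -0.2]]                           # hidden-to-output weights

O = layer(X, w_ji)                             # hidden layer, j
Y = layer(O, w_kj)                             # output layer, k
print(Y)
```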


Can We Use a Generalized Form of the PLR/Delta Rule to Train the MLP?

Recall the PLR/Delta rule: Adjust neuron weights to reduce error at neuron’s output:

w_new = w_old + η δ x,   where   δ = (y_desired − y)

Main problem: How to adjust the weights in the hidden layer, so they reduce the error in the output layer, when there is no specified target response in the hidden layer?

Solution: Alter the non-linear Perceptron (discrete threshold) activation function to make it differentiable and hence help derive the Generalized DR for MLP training.

[Figure: the discrete threshold function (stepping from −1 to +1) alongside the smooth sigmoid function that replaces it.]

Sigmoid (S-shaped) Function Properties

• Approximates the threshold function
• Smoothly differentiable (everywhere), and hence the DR is applicable
• Positive slope
• A popular choice is:

y = f(a) = 1 / (1 + e^(−a))

• The derivative of the sigmoidal function is:

f′(a) = f(a) (1 − f(a))

[Figure: plot of the sigmoid rising from 0 to 1 over −5 ≤ a ≤ 5.]
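A quick numerical check of these two formulas (a minimal sketch; the sample points are arbitrary):

```python
import math

def f(a):
    return 1.0 / (1.0 + math.exp(-a))    # logistic sigmoid

def f_prime(a):
    return f(a) * (1.0 - f(a))           # derivative in terms of f itself

for a in (-5.0, -1.0, 0.0, 1.0, 5.0):
    h = 1e-6                             # finite-difference check of f'(a)
    numeric = (f(a + h) - f(a - h)) / (2 * h)
    print(a, round(f(a), 4), round(f_prime(a), 6), round(numeric, 6))
```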


Weight Update Rule

Generally, the weight change from any unit j to unit k by gradient descent (i.e. weight change by a small increment in the negative direction of the gradient) is now called the Generalized Delta Rule (GDR or Backpropagation):

w_new = w_old + Δw = w_old − η ∂E/∂w = w_old + η δ x

Now the “delta” is more complicated because of the sigmoid function:

E(w) = ½ Σk (ytar,k − yk)²        yk = f(ak) = 1 / (1 + e^(−ak))        ak = Σj wkj oj

Applying the chain rule:

∂E/∂wkj = (∂E/∂yk) (∂yk/∂ak) (∂ak/∂wkj) = −(ytar,k − yk) · yk (1 − yk) · oj


Weight Update Rule (2)

For the output units, the “delta” is the output error multiplied by a gradient term:

δk = (ytar,k − yk) yk (1 − yk)        Δwkj = η δk oj

For hidden units we also need a value of “error”. A suitable quantity to use is the weighted sum of the output “deltas” from a hidden unit:

δj = oj (1 − oj) Σk δk wkj

And again the weight change is:

Δwji = η δj xi
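Assembled as code, the backward pass for one training pattern might look like this (a minimal sketch; the activations are assumed to come from a forward pass like the earlier one, and all numeric values are illustrative):

```python
# One backward pass of the Generalized Delta Rule for a single pattern.
# x, o, y would normally come from a forward pass; values are illustrative.

eta = 0.5
x = [0.5, 0.1, 0.9]                           # input activations, i
o = [0.6, 0.4]                                # hidden activations, j
y = [0.7]                                     # output activations, k
target = [1.0]
w_kj = [[0.5, -0.2]]                          # hidden-to-output weights
w_ji = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.6]]   # input-to-hidden weights

# output deltas: delta_k = (ytar,k - yk) * yk * (1 - yk)
d_k = [(t - yk) * yk * (1 - yk) for t, yk in zip(target, y)]

# hidden deltas: delta_j = oj * (1 - oj) * sum_k delta_k * wkj
d_j = [oj * (1 - oj) * sum(dk * row[j] for dk, row in zip(d_k, w_kj))
       for j, oj in enumerate(o)]

# weight changes: dwkj = eta * delta_k * oj,  dwji = eta * delta_j * xi
for k, dk in enumerate(d_k):
    for j, oj in enumerate(o):
        w_kj[k][j] += eta * dk * oj
for j, dj in enumerate(d_j):
    for i, xi in enumerate(x):
        w_ji[j][i] += eta * dj * xi
```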


Training of a 2-Layer Feed Forward Network

1. Take the set of training patterns you wish the network to learn

2. Set up the network with N input units fully connected to M non-linear hidden units via connections with weights wji, which in turn are fully connected to P output units via connections with weights wkj

3. Generate random initial weights, e.g. from range [-wt, +wt]

4. Select appropriate error function E(wkj) and learning rate η

5. Apply the weight update equation Δwkj = −η ∂E(wkj)/∂wkj to each weight wkj for each training pattern p.

6. Do the same to all hidden layers.

7. Repeat steps 5–6 until the network error function is ‘small enough’
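The sketch below assembles steps 1–7 into a complete run on the XOR problem from earlier (the network sizes, weight range wt, learning rate, bias handling and stopping criterion are all illustrative assumptions):

```python
import math
import random

# Sketch: train a 2-layer feed-forward network on XOR by backpropagation.
# All sizes and parameter values are illustrative choices.

patterns = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
N, M, P = 2, 2, 1                       # input, hidden, output unit counts
wt, eta = 1.0, 0.5                      # weight range and learning rate

f = lambda a: 1.0 / (1.0 + math.exp(-a))
w_ji = [[random.uniform(-wt, wt) for _ in range(N + 1)] for _ in range(M)]
w_kj = [[random.uniform(-wt, wt) for _ in range(M + 1)] for _ in range(P)]
# the extra "+ 1" weight on each unit acts as a bias fed by a constant 1.0

for epoch in range(20000):
    sse = 0.0
    for x, target in patterns:
        xb = list(x) + [1.0]                                 # forward pass
        o = [f(sum(w * v for w, v in zip(row, xb))) for row in w_ji]
        ob = o + [1.0]
        y = [f(sum(w * v for w, v in zip(row, ob))) for row in w_kj]

        d_k = [(t - yk) * yk * (1 - yk) for t, yk in zip(target, y)]
        d_j = [oj * (1 - oj) * sum(dk * row[j] for dk, row in zip(d_k, w_kj))
               for j, oj in enumerate(o)]

        for k, dk in enumerate(d_k):                         # weight updates
            for j, v in enumerate(ob):
                w_kj[k][j] += eta * dk * v
        for j, dj in enumerate(d_j):
            for i, v in enumerate(xb):
                w_ji[j][i] += eta * dj * v

        sse += sum((t - yk) ** 2 for t, yk in zip(target, y))
    if sse < 0.05:                                           # 'small enough'
        break

print(epoch, sse)
```

Runs starting from an unlucky set of initial weights can get stuck in a local minimum, which is exactly the issue discussed later in this lecture.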


Practical Considerations for Learning Rules

There are a number of important issues about training multi-layer neural networks that need further resolving:

1. Do we need to pre-process the training data? If so, how?
2. How do we choose the initial weights from which we start the training?
3. How do we choose an appropriate learning rate η?
4. Should we change the weights after each training pattern, or after the whole set?
5. Are some activation/transfer functions better than others?
6. How do we avoid local minima in the error function?
7. How do we know when we should stop the training?
8. How many hidden units do we need?
9. Should we have different learning rates for the different layers?

We shall now consider each of these issues one by one.


Pre-processing of the Training Data

In principle, we can just use any raw input-output data to train our networks. However, in practice, it often helps the network to learn appropriately if we carry out some pre-processing of the training data before feeding it to the network.

We should make sure that the training data is representative – it should not contain too many examples of one type at the expense of another. On the other hand, if one class of pattern is easy to learn, having large numbers of patterns from that class in the training set will only slow down the over-all learning process.


Choosing the Initial Weight Values

The gradient descent learning algorithm treats all the weights in the same way, so if we start them all off with the same values, all the hidden units will end up doing the same thing and the network will never learn properly.

For that reason, we generally start off all the weights with small random values. Usually we take them from a flat distribution around zero [–wt, +wt], or from a Gaussian distribution around zero with standard deviation wt.

Choosing a good value of wt can be difficult. Generally, it is a good idea to make it as large as you can without saturating any of the sigmoids.

We usually hope that the final network performance will be independent of the choice of initial weights, but we need to check this by training the network from a number of different random initial weight sets.
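Both schemes are one-liners in Python (a sketch; the value of wt and the layer shape are arbitrary):

```python
import random

wt = 0.5                     # illustrative weight scale
M, N = 4, 3                  # hidden units x input units (arbitrary shape)

# flat (uniform) distribution around zero, [-wt, +wt]:
w_flat = [[random.uniform(-wt, wt) for _ in range(N)] for _ in range(M)]

# Gaussian distribution around zero with standard deviation wt:
w_gauss = [[random.gauss(0.0, wt) for _ in range(N)] for _ in range(M)]
```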


Choosing the Learning Rate

Choosing a good value for the learning rate η is constrained by two opposing facts:
1. If η is too small, it will take too long to get anywhere near the minimum of the error function.
2. If η is too large, the weight updates will over-shoot the error minimum and the weights will oscillate, or even diverge.

Unfortunately, the optimal value is very problem and network dependent, so one cannot formulate reliable general prescriptions.

Generally, one should try a range of different values (e.g. η = 0.1, 0.01, 1.0, 0.0001) and use the results as a guide.


Batch Training vs. On-line Training

Batch Training: update the weights after all training patterns have been presented.

On-line Training (or Sequential Training): A natural alternative is to update all the weights immediately after processing each training pattern.

On-line learning does not perform true gradient descent, and the individual weight changes can be rather erratic. Normally a much lower learning rate η will be necessary than for batch learning. However, because each weight now has N updates (where N is the number of patterns) per epoch, rather than just one, overall the learning is often much quicker. This is particularly true if there is a lot of redundancy in the training data, i.e. many training patterns containing similar information.
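The structural difference is simply where the weight update sits inside the epoch loop (a schematic sketch; grad(w, p), returning the error gradient for pattern p at weights w, is a hypothetical helper):

```python
# Schematic contrast between batch and on-line training.
# grad(w, p) is a hypothetical per-pattern gradient function.

def batch_epoch(w, patterns, eta, grad):
    # accumulate the gradient over ALL patterns, then update once per epoch
    total = [0.0] * len(w)
    for p in patterns:
        total = [t + g for t, g in zip(total, grad(w, p))]
    return [wi - eta * g for wi, g in zip(w, total)]

def online_epoch(w, patterns, eta, grad):
    # update immediately after EACH pattern: N small updates per epoch
    for p in patterns:
        w = [wi - eta * g for wi, g in zip(w, grad(w, p))]
    return w
```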


Choosing the Transfer Function

We have already seen that having a differentiable transfer/activation function is important for the gradient descent algorithm to work. We have also seen that, in terms of computational efficiency, the standard sigmoid (i.e. the logistic function) is a particularly convenient replacement for the step function of the Simple Perceptron.

The logistic function ranges from 0 to 1. There is some evidence that an anti-symmetric transfer function (e.g. tanh), i.e. one that satisfies f(–x) = –f(x), enables the gradient descent algorithm to learn faster.

When the outputs are required to be non-binary, i.e. continuous real values, having sigmoidal transfer functions no longer makes sense. In these cases, a simple linear transfer function f(x) = x is appropriate.


Local Minima

Cost functions can quite easily have more than one minimum:

[Figure: an error surface with both a shallow local minimum and a deeper global minimum.]

If we start off in the vicinity of the local minimum, we may end up at the local minimum rather than the global minimum. Starting with a range of different initial weight sets increases our chances of finding the global minimum. Any variation from true gradient descent will also increase our chances of stepping into the deeper valley.


When to Stop Training

The Sigmoid(x) function only takes on its extreme values of 0 and 1 at x = ±∞. In effect, this means that the network can only achieve its binary targets when at least some of its weights reach ±∞. So, given finite gradient descent step sizes, our networks will never reach their binary targets.

Even if we offset the targets (to 0.1 and 0.9 say) we will generally require an infinite number of increasingly small gradient descent steps to achieve those targets.

Clearly, if the training algorithm can never actually reach the minimum, we have to stop the training process when it is ‘near enough’. What constitutes ‘near enough’ depends on the problem. If we have binary targets, it might be enough that all outputs are within 0.1 (say) of their targets. Or, it might be easier to stop the training when the sum squared error function becomes less than a particular small value (0.2 say).


How Many Hidden Units?

The best number of hidden units depends in a complex way on many factors, including:

1. The number of training patterns
2. The numbers of input and output units
3. The amount of noise in the training data
4. The complexity of the function or classification to be learned
5. The type of hidden unit activation function
6. The training algorithm

Too few hidden units will generally leave high training and generalisation errors due to under-fitting. Too many hidden units will result in low training errors, but will make the training unnecessarily slow, and will result in poor generalisation unless some other technique (such as regularisation) is used to prevent over-fitting.

Virtually all “rules of thumb” you hear about are actually nonsense. A sensible strategy is to try a range of numbers of hidden units and see which works best.


Different Learning Rates for Different Layers?

A network as a whole will usually learn most efficiently if all its neurons are learning at roughly the same speed. So maybe different parts of the network should have different learning rates η. There are a number of factors that may affect the choices:

1. The later network layers (nearer the outputs) will tend to have larger local gradients (deltas) than the earlier layers (nearer the inputs).

2. The activations of units with many connections feeding into or out of them tend to change faster than units with fewer connections.

3. Activations required for linear units will be different for Sigmoidal units.
4. There is empirical evidence that it helps to have different learning rates η for the thresholds/biases compared with the real connection weights.

In practice, it is often quicker to just use the same rates η for all the weights and thresholds, rather than spending time trying to work out appropriate differences. A very powerful approach is to use evolutionary strategies to determine good learning rates.
