
1

Artificial Neural Networks (Cont.)

Chapter 4

• Perceptron
• Gradient Descent
• Multilayer Networks
• Backpropagation Algorithm

2

Review: The Main Idea of Gradient Descent

Goal: minimizing the error:
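The error formula on this slide did not survive extraction; a sketch of the sum-of-squares training error it presumably refers to (an assumption, following Mitchell's Chapter 4 notation, where t_d and o_d are the target and actual outputs for training example d):

```latex
% Assumed form of the training error over the training set D
E(\vec{w}) \equiv \tfrac{1}{2} \sum_{d \in D} (t_d - o_d)^2
```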

3

Review: The Main Idea of Gradient Descent

4

Gradient Descent

Derive the equations for updating the weights of the simple neuron using the gradient descent technique
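The derivation itself is not shown on the slide; a sketch for a single linear unit o_d = \vec{w} \cdot \vec{x}_d under the sum-of-squares error above (an assumed reconstruction, not the slide's own steps):

```latex
% Gradient of E with respect to a single weight w_i (linear unit)
\frac{\partial E}{\partial w_i}
  = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d}(t_d - o_d)^2
  = \sum_{d}(t_d - o_d)\,\frac{\partial}{\partial w_i}\bigl(t_d - \vec{w}\cdot\vec{x}_d\bigr)
  = -\sum_{d}(t_d - o_d)\,x_{id}

% Gradient-descent weight update with learning rate \eta
\Delta w_i = -\eta\,\frac{\partial E}{\partial w_i} = \eta \sum_{d}(t_d - o_d)\,x_{id}
```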

5

Gradient Descent (with Linear Transfer function)

6

Review: Batch vs. Incremental Gradient Descent

7

Gradient Descent (with Linear Transfer function)

Is it batch or incremental mode?
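As a reminder of the distinction, a minimal Python sketch for a single linear unit (the array names and shapes are assumptions, not the slide's code): batch mode sums the gradient over the whole training set before one update, while incremental (stochastic) mode updates after every example.

```python
import numpy as np

def batch_update(w, X, t, eta):
    """One batch-mode step: accumulate the gradient over ALL examples, then update once."""
    o = X @ w                    # outputs for every training example
    grad = -(t - o) @ X          # dE/dw summed over the whole training set
    return w - eta * grad

def incremental_update(w, X, t, eta):
    """One pass of incremental (stochastic) mode: update after EACH example."""
    for x_d, t_d in zip(X, t):
        o_d = x_d @ w
        w = w + eta * (t_d - o_d) * x_d   # delta rule for a single example
    return w
```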

8

Forward Equation

f: the transfer function of the units

o^{(l)}: the output of layer l

X: the input array

W^{(1)}: the weight matrix for the first layer

W^{(2)}: the weight matrix for the second layer
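A minimal sketch of the forward pass in this notation, assuming sigmoid transfer functions and one-row-per-unit weight matrices (the slide's exact transfer function and matrix dimensions did not survive extraction):

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def forward(X, W1, W2):
    """Forward pass of the two-layer network.
    X  : input array, shape (n,)
    W1 : first-layer weight matrix, shape (p, n)   -> p hidden units
    W2 : second-layer weight matrix, shape (m, p)  -> m output units
    """
    o1 = sigmoid(W1 @ X)    # hidden-layer outputs o^(1)
    o2 = sigmoid(W2 @ o1)   # output-layer outputs o^(2)
    return o1, o2
```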

9

Backpropagation Learning Rule

• Each weight is changed by (if all the neurons are sigmoid units):

\Delta w_{ji}^{(l)} = \eta\,\delta_j^{(l)}\,o_i^{(l-1)}

\delta_j^{(2)} = o_j^{(2)}\,(1 - o_j^{(2)})\,(t_j - o_j^{(2)}) \quad \text{if } j \text{ is an output unit}

\delta_j^{(1)} = o_j^{(1)}\,(1 - o_j^{(1)}) \sum_k \delta_k^{(2)} w_{kj}^{(2)} \quad \text{if } j \text{ is a hidden unit}

where l is the layer number, \eta is a constant called the learning rate, t_j is the correct teacher output for output unit j, and \delta_j^{(l)} is the error measure for unit j in the l-th layer.

Usually the output neurons are linear and the hidden neurons are sigmoid units. Then, what will change in the above equations?
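A minimal Python sketch of these update rules for a single training example, assuming sigmoid units in both layers and no bias weights; variable names follow the slide's notation:

```python
import numpy as np

def backprop_step(X, t, W1, W2, eta):
    """One backpropagation update for a single example (all-sigmoid network)."""
    sigmoid = lambda net: 1.0 / (1.0 + np.exp(-net))

    # Forward pass
    o1 = sigmoid(W1 @ X)         # hidden outputs o^(1)
    o2 = sigmoid(W2 @ o1)        # network outputs o^(2)

    # Error terms per the slide's equations
    delta2 = o2 * (1 - o2) * (t - o2)          # output units
    delta1 = o1 * (1 - o1) * (W2.T @ delta2)   # hidden units

    # Weight changes: Delta w_ji^(l) = eta * delta_j^(l) * o_i^(l-1)
    W2 = W2 + eta * np.outer(delta2, o1)
    W1 = W1 + eta * np.outer(delta1, X)
    return W1, W2
```

(For the question above: with linear output neurons, the o_j^{(2)}(1 - o_j^{(2)}) factor drops out of the output-unit error term, leaving delta2 = t - o2.)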

10

Backpropagation Learning Rule

• Compute the error terms and weight updates using the BP learning rule for the following condition:

\Delta w_{ji}^{(l)} = \eta\,\delta_j^{(l)}\,o_i^{(l-1)}

\delta_j^{(2)} = o_j^{(2)}\,(1 - o_j^{(2)})\,(t_j - o_j^{(2)}) \quad \text{if } j \text{ is an output unit}

\delta_j^{(1)} = o_j^{(1)}\,(1 - o_j^{(1)}) \sum_k \delta_k^{(2)} w_{kj}^{(2)} \quad \text{if } j \text{ is a hidden unit}

Current output: o_j = 0.2; Correct output: t_j = 1.0

[Figure: network diagram with input, hidden, and output layers]

11

Error Backpropagation

• First calculate error of output units and use this to change the top layer of weights.

Current output: o_j^{(2)} = 0.2; Correct output: t_j = 1.0

\delta_j^{(2)} = o_j^{(2)}\,(1 - o_j^{(2)})\,(t_j - o_j^{(2)}) = 0.2\,(1 - 0.2)(1.0 - 0.2) = 0.128


Update the weights into unit j:

\Delta w_{ji}^{(2)} = \eta\,\delta_j^{(2)}\,o_i^{(1)}

\delta_j^{(2)} = o_j^{(2)}\,(1 - o_j^{(2)})\,(t_j - o_j^{(2)})
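A quick numeric check of this step in Python; the learning rate and the hidden-unit output o_i^{(1)} are not given on the slide, so hypothetical values are used just to show the arithmetic:

```python
eta = 0.1            # hypothetical learning rate (not given on the slide)
o_j, t_j = 0.2, 1.0  # current and correct output from the slide

delta_j = o_j * (1 - o_j) * (t_j - o_j)   # 0.2 * 0.8 * 0.8 = 0.128
o_i = 0.5                                 # hypothetical hidden-unit output
dw_ji = eta * delta_j * o_i               # weight change into output unit j
print(delta_j, dw_ji)                     # 0.128 0.0064
```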

12

Error Backpropagation

• Next calculate the error for hidden units based on the errors of the output units they feed into.


\delta_j^{(1)} = o_j^{(1)}\,(1 - o_j^{(1)}) \sum_k \delta_k^{(2)} w_{kj}^{(2)}

13

Error Backpropagation

• Finally update bottom layer of weights based on errors calculated for hidden units.


Update the weights into unit j:

\Delta w_{ji}^{(1)} = \eta\,\delta_j^{(1)}\,o_i^{(0)}

\delta_j^{(1)} = o_j^{(1)}\,(1 - o_j^{(1)}) \sum_k \delta_k^{(2)} w_{kj}^{(2)}

o_i^{(0)} = x_i \quad \text{(the network input)}

14

Comments on Training Algorithm

• Not guaranteed to converge to zero training error, may converge to local optima or oscillate indefinitely.

• However, in practice, does converge to low error for many large networks on real data.

• Many epochs (thousands) may be required, hours or days of training for large networks.

• To avoid local-minima problems, run several trials starting with different random weights (random restarts).

– Take results of trial with lowest training set error.

– Build a committee of results from multiple trials.

15

Representational Power

• Boolean functions: Any boolean function can be represented by a two-layer network with sufficient hidden units.

• Continuous functions: Any bounded continuous function can be approximated with arbitrarily small error by a two-layer network.

– Sigmoid functions can act as a set of basis functions for composing more complex functions, like sine waves in Fourier analysis.

• Arbitrary function: Any function can be approximated to arbitrary accuracy by a three-layer network.

16

How many hidden layers and hidden units per layer?

• Theoretically, one hidden layer (possibly with many hidden units) is sufficient for most problems

• There are no theoretical results on the minimum necessary # of hidden units (either problem-dependent or problem-independent)

• Practical rule of thumb:

–n = # of input units; p = # of hidden units

–For binary/bipolar data: p = 2n

–For real data: p >> 2n

• Multiple hidden layers with fewer units may be trained faster for similar quality in some applications

17

Data sets to handle over-fitting & # of hidden neurons

• Training set:

– A set of examples used for learning, that is to fit the parameters [i.e., weights] of the classifier.

• Validation set:

– A set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network or handle over-fitting.

• Test set (completely unseen data):

– A set of examples used only to assess the performance [generalization] of a fully specified classifier.

• Usually the whole data set is divided 60%-20%-20% or 70%-20%-10% (training-validation-test)
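A minimal sketch of such a split (60%-20%-20% here); the array names and the use of NumPy are assumptions:

```python
import numpy as np

def split_dataset(X, y, rng=np.random.default_rng(0)):
    """Split a dataset into 60% training, 20% validation, 20% test."""
    n = len(X)
    idx = rng.permutation(n)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    train = idx[:n_train]
    val   = idx[n_train:n_train + n_val]
    test  = idx[n_train + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```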

18

Over-training/over-fitting

• The meaning of over-fitting:
– The trained net fits the training samples very well (total error almost zero), but not new input patterns.

• Over-training may become serious if:
– Training samples were not obtained properly
– Training samples contain noise

• Controlling over-training for better generalization:
– Cross-validation: divide the samples into two sets,
- 80% into a training set: used to train the network
- 20% into a test set: used to validate the training results
– Periodically test the trained net with the test samples, stop training when the test results start to deteriorate, repeat the process many times, and report the average results (see the sketch below).
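A sketch of the repeated 80/20 procedure just described; train_network and evaluate are hypothetical placeholders standing in for the actual training and error-measurement code:

```python
import numpy as np

def repeated_holdout(X, y, train_network, evaluate, n_repeats=10):
    """Repeat the 80/20 split several times and report the average test error."""
    rng = np.random.default_rng(0)
    errors = []
    for _ in range(n_repeats):
        idx = rng.permutation(len(X))
        cut = int(0.8 * len(X))
        train_idx, test_idx = idx[:cut], idx[cut:]
        net = train_network(X[train_idx], y[train_idx])          # hypothetical trainer
        errors.append(evaluate(net, X[test_idx], y[test_idx]))   # hypothetical error measure
    return float(np.mean(errors))
```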

19

Over-Training Prevention

• Running too many epochs can result in over-fitting.

• Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error.

[Figure: error vs. # training epochs, plotted separately for the training data and the test data]
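A minimal early-stopping loop following this rule; train_one_epoch and validation_error are hypothetical helpers, and the patience parameter (stop only after several consecutive bad epochs) is a common refinement, not something stated on the slide:

```python
def train_with_early_stopping(net, train_data, val_data,
                              train_one_epoch, validation_error,
                              max_epochs=1000, patience=5):
    """Stop training when validation error has not improved for `patience` epochs."""
    best_err, best_net, bad_epochs = float("inf"), net, 0
    for epoch in range(max_epochs):
        net = train_one_epoch(net, train_data)    # hypothetical: one pass over the training data
        err = validation_error(net, val_data)     # hypothetical: error on the hold-out set
        if err < best_err:
            best_err, best_net, bad_epochs = err, net, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                             # validation error keeps rising: stop
    return best_net
```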

20

Determining the Best Number of Hidden Units

• Too few hidden units prevents the network from adequately fitting the data.

• Too many hidden units can result in over-fitting.

• Use internal cross-validation to empirically determine an optimal number of hidden units.

[Figure: error vs. # hidden units, plotted separately for the training data and the test data]
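A sketch of choosing the number of hidden units by internal cross-validation; train_network and validation_error are hypothetical helpers, and the candidate sizes are arbitrary:

```python
def choose_hidden_units(train_data, val_data, train_network, validation_error,
                        candidates=(2, 4, 8, 16, 32)):
    """Train one network per candidate size and keep the one with the lowest validation error."""
    results = {}
    for p in candidates:
        net = train_network(train_data, n_hidden=p)      # hypothetical trainer
        results[p] = validation_error(net, val_data)     # hypothetical error measure
    best_p = min(results, key=results.get)
    return best_p, results
```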

21

Learning Rate

• Adaptive learning rate to speed up the training process
• There are many different approaches
• One of them:

[Figure: training error plotted against the weights; annotations (translated from Persian): decrease the weight changes, increase the gradient changes]
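The rule shown on the slide did not survive extraction; as one example of such an approach, here is a sketch of the common "bold driver" heuristic (an assumption, not necessarily the slide's rule):

```python
def adapt_learning_rate(eta, prev_error, new_error, up=1.05, down=0.7):
    """'Bold driver'-style adaptation (illustrative, not necessarily the slide's rule):
    grow the learning rate while the training error keeps falling,
    shrink it sharply when the error increases."""
    if new_error < prev_error:
        return eta * up      # training error fell: take slightly bigger steps
    return eta * down        # training error rose: back off
```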

22

Momentum

• Improves gradient descent's ability to escape local minima
• Adds a percentage of the last movement to the current movement

\Delta w_{ij}(k) = -\eta\,\frac{\partial E}{\partial w_{ij}} + \alpha\,\Delta w_{ij}(k-1)
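A minimal sketch of one gradient-descent step with momentum, following the equation above (grad stands for the gradient of E with respect to the weights):

```python
def momentum_step(w, grad, prev_dw, eta=0.1, alpha=0.9):
    """One update with momentum:
    Delta w(k) = -eta * dE/dw + alpha * Delta w(k-1)."""
    dw = -eta * grad + alpha * prev_dw   # current step = gradient step + fraction of last step
    return w + dw, dw                    # return updated weights and the step for next time
```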

23

Typical Learning Curve

[Figure: "Sum-Squared Network Error for 224 Epochs"; sum-squared error on a log scale (10^-4 to 10^1) vs. epoch (0 to 200)]

24

Typical learning with adaptive learning rate

[Figure: "Training for 103 Epochs"; two panels: sum-squared error on a log scale (10^-4 to 10^2) vs. epoch, and learning rate (0 to 1) vs. epoch]

25

Typical Learning with adaptive learning rate plus momentum

[Figure: "Training for 85 Epochs"; two panels: sum-squared error on a log scale (10^-4 to 10^2) vs. epoch, and learning rate (0 to 2.5) vs. epoch]

26

Hidden Unit Representations

• Trained hidden units can be seen as newly constructed features that make the target concept linearly separable in the transformed space.

• On many real domains, hidden units can be interpreted as representing meaningful features such as vowel detectors or edge detectors.

• However, the hidden layer can also become a distributed representation of the input in which each individual unit is not easily interpretable as a meaningful feature.

27

Appropriate problems for NN

• Instances are represented by many attribute-value pairs

• The target function output may be discrete-valued, real-valued or a vector of several real- or discrete-valued attributes

• The dataset may contain errors

• Long training times are acceptable

• Fast evaluation (test phase) is required

• The ability of humans to understand the learned target function is not important

28

Successful Applications

• Text to Speech (NetTalk)

• Fraud detection

• Financial Applications

• Chemical Plant Control

• Automated Vehicles

• Game Playing

– Neurogammon (a neural-network backgammon program)

• Handwriting recognition

29

More Issues in Neural Nets

• More efficient training methods:
– Quickprop
– Conjugate gradient (exploits 2nd derivative)

• Learning the proper network architecture:
– Grow network until able to fit data
– Shrink large network until unable to fit data

• Recurrent networks that use feedback and can learn finite state machines with “backpropagation through time.”

30

More Issues in Neural Nets (cont.)

• More biologically plausible learning algorithms based on Hebbian learning.

• Unsupervised Learning

– Self-Organizing Feature Maps (SOMs)

• Reinforcement Learning

– Frequently used as function approximators for learning value functions.

31

Assignment 2

• Tom Mitchell problems: 4.1, 4.2, 4.5, 4.7, 4.8, 4.10

32

Remaining Course Plan

• Chapter 9. Genetic Algorithms

• Chapter 13. Reinforcement Learning

• Clustering

• Dimension Reduction Algorithms

• Support Vector Machine

• Cellular Automata

• Other biologically-inspired optimization algorithms (Ant Colony, PSO, Simulated Annealing, …)

• Active Learning
