
1

Artificial Neural Networks (Cont.)

Chapter 4

• Perceptron
• Gradient Descent
• Multilayer Networks
• Backpropagation Algorithm

2

Review: The Main Idea of Gradient Descent

Goal: minimizing the error:
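The error formula on this slide did not survive extraction; a sketch of the sum-of-squares training error it presumably refers to (an assumption, following Mitchell's Chapter 4 notation, where t_d and o_d are the target and actual outputs for training example d):

```latex
% Assumed form of the training error over the training set D
E(\vec{w}) \equiv \tfrac{1}{2} \sum_{d \in D} (t_d - o_d)^2
```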

3

Review: The Main Idea of Gradient Descent

4

Gradient Descent

Derive the equations for updating the weights of the simple neuron using the gradient descent technique
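The derivation itself is not shown on the slide; a sketch for a single linear unit o_d = \vec{w} \cdot \vec{x}_d under the sum-of-squares error above (an assumed reconstruction, not the slide's own steps):

```latex
% Gradient of E with respect to a single weight w_i (linear unit)
\frac{\partial E}{\partial w_i}
  = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d}(t_d - o_d)^2
  = \sum_{d}(t_d - o_d)\,\frac{\partial}{\partial w_i}\bigl(t_d - \vec{w}\cdot\vec{x}_d\bigr)
  = -\sum_{d}(t_d - o_d)\,x_{id}

% Gradient-descent weight update with learning rate \eta
\Delta w_i = -\eta\,\frac{\partial E}{\partial w_i} = \eta \sum_{d}(t_d - o_d)\,x_{id}
```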

5

Gradient Descent (with Linear Transfer function)

6

Review: Batch vs. Incremental Gradient Descent

7

Gradient Descent (with Linear Transfer function)

Is it batch or incremental mode?
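As a reminder of the distinction, a minimal Python sketch for a single linear unit (the array names and shapes are assumptions, not the slide's code): batch mode sums the gradient over the whole training set before one update, while incremental (stochastic) mode updates after every example.

```python
import numpy as np

def batch_update(w, X, t, eta):
    """One batch-mode step: accumulate the gradient over ALL examples, then update once."""
    o = X @ w                    # outputs for every training example
    grad = -(t - o) @ X          # dE/dw summed over the whole training set
    return w - eta * grad

def incremental_update(w, X, t, eta):
    """One pass of incremental (stochastic) mode: update after EACH example."""
    for x_d, t_d in zip(X, t):
        o_d = x_d @ w
        w = w + eta * (t_d - o_d) * x_d   # delta rule for a single example
    return w
```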

8

Forward Equation

f: the transfer function of the units

o^{(l)}: the output of layer l

X: the input array

W^{(1)}: the weight matrix for the first layer

W^{(2)}: the weight matrix for the second layer
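A minimal sketch of the forward pass in this notation, assuming sigmoid transfer functions and one-row-per-unit weight matrices (the slide's exact transfer function and matrix dimensions did not survive extraction):

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def forward(X, W1, W2):
    """Forward pass of the two-layer network.
    X  : input array, shape (n,)
    W1 : first-layer weight matrix, shape (p, n)   -> p hidden units
    W2 : second-layer weight matrix, shape (m, p)  -> m output units
    """
    o1 = sigmoid(W1 @ X)    # hidden-layer outputs o^(1)
    o2 = sigmoid(W2 @ o1)   # output-layer outputs o^(2)
    return o1, o2
```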

9

Backpropagation Learning Rule

• Each weight is changed by (if all the neurons are sigmoid units):

\Delta w_{ji}^{(l)} = \eta\,\delta_j^{(l)}\,o_i^{(l-1)}

\delta_j^{(2)} = o_j^{(2)}\,(1 - o_j^{(2)})\,(t_j - o_j^{(2)}) \quad \text{if } j \text{ is an output unit}

\delta_j^{(1)} = o_j^{(1)}\,(1 - o_j^{(1)}) \sum_k \delta_k^{(2)} w_{kj}^{(2)} \quad \text{if } j \text{ is a hidden unit}

where l is the layer number, \eta is a constant called the learning rate, t_j is the correct teacher output for output unit j, and \delta_j^{(l)} is the error measure for unit j in the l-th layer.

Usually the output neurons are linear and the hidden neurons are sigmoid units. Then, what will change in the above equations?
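A minimal Python sketch of these update rules for a single training example, assuming sigmoid units in both layers and no bias weights; variable names follow the slide's notation:

```python
import numpy as np

def backprop_step(X, t, W1, W2, eta):
    """One backpropagation update for a single example (all-sigmoid network)."""
    sigmoid = lambda net: 1.0 / (1.0 + np.exp(-net))

    # Forward pass
    o1 = sigmoid(W1 @ X)         # hidden outputs o^(1)
    o2 = sigmoid(W2 @ o1)        # network outputs o^(2)

    # Error terms per the slide's equations
    delta2 = o2 * (1 - o2) * (t - o2)          # output units
    delta1 = o1 * (1 - o1) * (W2.T @ delta2)   # hidden units

    # Weight changes: Delta w_ji^(l) = eta * delta_j^(l) * o_i^(l-1)
    W2 = W2 + eta * np.outer(delta2, o1)
    W1 = W1 + eta * np.outer(delta1, X)
    return W1, W2
```

(For the question above: with linear output neurons, the o_j^{(2)}(1 - o_j^{(2)}) factor drops out of the output-unit error term, leaving delta2 = t - o2.)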

10

Backpropagation Learning Rule

• Compute the error terms and weight updates using the BP learning rule for the following condition:

\Delta w_{ji}^{(l)} = \eta\,\delta_j^{(l)}\,o_i^{(l-1)}

\delta_j^{(2)} = o_j^{(2)}\,(1 - o_j^{(2)})\,(t_j - o_j^{(2)}) \quad \text{if } j \text{ is an output unit}

\delta_j^{(1)} = o_j^{(1)}\,(1 - o_j^{(1)}) \sum_k \delta_k^{(2)} w_{kj}^{(2)} \quad \text{if } j \text{ is a hidden unit}

Current output: o_j = 0.2; Correct output: t_j = 1.0

[Figure: network diagram with input, hidden, and output layers]

11

Error Backpropagation

• First calculate error of output units and use this to change the top layer of weights.

Current output: o_j^{(2)} = 0.2; Correct output: t_j = 1.0

\delta_j^{(2)} = o_j^{(2)}\,(1 - o_j^{(2)})\,(t_j - o_j^{(2)}) = 0.2\,(1 - 0.2)(1.0 - 0.2) = 0.128


Update the weights into unit j:

\Delta w_{ji}^{(2)} = \eta\,\delta_j^{(2)}\,o_i^{(1)}

\delta_j^{(2)} = o_j^{(2)}\,(1 - o_j^{(2)})\,(t_j - o_j^{(2)})
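A quick numeric check of this step in Python; the learning rate and the hidden-unit output o_i^{(1)} are not given on the slide, so hypothetical values are used just to show the arithmetic:

```python
eta = 0.1            # hypothetical learning rate (not given on the slide)
o_j, t_j = 0.2, 1.0  # current and correct output from the slide

delta_j = o_j * (1 - o_j) * (t_j - o_j)   # 0.2 * 0.8 * 0.8 = 0.128
o_i = 0.5                                 # hypothetical hidden-unit output
dw_ji = eta * delta_j * o_i               # weight change into output unit j
print(delta_j, dw_ji)                     # 0.128 0.0064
```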

12

Error Backpropagation

• Next calculate the error for hidden units based on the errors of the output units they feed into.


\delta_j^{(1)} = o_j^{(1)}\,(1 - o_j^{(1)}) \sum_k \delta_k^{(2)} w_{kj}^{(2)}

13

Error Backpropagation

• Finally update bottom layer of weights based on errors calculated for hidden units.


Update the weights into unit j:

\Delta w_{ji}^{(1)} = \eta\,\delta_j^{(1)}\,o_i^{(0)}

\delta_j^{(1)} = o_j^{(1)}\,(1 - o_j^{(1)}) \sum_k \delta_k^{(2)} w_{kj}^{(2)}

o_i^{(0)} = x_i \quad \text{(the network input)}

14

Comments on Training Algorithm

• Not guaranteed to converge to zero training error, may converge to local optima or oscillate indefinitely.

• However, in practice, does converge to low error for many large networks on real data.

• Many epochs (thousands) may be required, hours or days of training for large networks.

• To avoid local-minima problems, run several trials starting with different random weights (random restarts).

– Take results of trial with lowest training set error.

– Build a committee of results from multiple trials.

15

Representational Power

• Boolean functions: Any boolean function can be represented by a two-layer network with sufficient hidden units.

• Continuous functions: Any bounded continuous function can be approximated with arbitrarily small error by a two-layer network.

– Sigmoid functions can act as a set of basis functions for composing more complex functions, like sine waves in Fourier analysis.

• Arbitrary function: Any function can be approximated to arbitrary accuracy by a three-layer network.

16

How many hidden layers and hidden units per layer?

• Theoretically, one hidden layer (possibly with many hidden units) is sufficient for most problems

• There are no theoretical results on the minimum necessary # of hidden units (either problem-dependent or problem-independent)

• Practical rule of thumb:

–n = # of input units; p = # of hidden units

–For binary/bipolar data: p = 2n

–For real data: p >> 2n

• Multiple hidden layers with fewer units may be trained faster for similar quality in some applications

17

Data sets to handle over-fitting & # of hidden neurons

• Training set:

– A set of examples used for learning, that is to fit the parameters [i.e., weights] of the classifier.

• Validation set:

– A set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network or handle over-fitting.

• Test set (completely unseen data):

– A set of examples used only to assess the performance [generalization] of a fully specified classifier.

• Usually the whole data set is divided 60%-20%-20% or 70%-20%-10% (training-validation-test)
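A minimal sketch of such a split (60%-20%-20% here); the array names and the use of NumPy are assumptions:

```python
import numpy as np

def split_dataset(X, y, rng=np.random.default_rng(0)):
    """Split a dataset into 60% training, 20% validation, 20% test."""
    n = len(X)
    idx = rng.permutation(n)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    train = idx[:n_train]
    val   = idx[n_train:n_train + n_val]
    test  = idx[n_train + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```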

18

Over-training/over-fitting

• The meaning of over-fitting:
– The trained net fits the training samples very well (total error almost zero), but not new input patterns.

• Over-training may become serious if:
– Training samples were not obtained properly
– Training samples contain noise

• Controlling over-training for better generalization:
– Cross-validation: divide the samples into two sets,
- 80% into a training set: used to train the network
- 20% into a test set: used to validate the training results
– Periodically test the trained net with the test samples, stop training when the test results start to deteriorate, repeat the process many times, and report the average results (see the sketch below).
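A sketch of the repeated 80/20 procedure just described; train_network and evaluate are hypothetical placeholders standing in for the actual training and error-measurement code:

```python
import numpy as np

def repeated_holdout(X, y, train_network, evaluate, n_repeats=10):
    """Repeat the 80/20 split several times and report the average test error."""
    rng = np.random.default_rng(0)
    errors = []
    for _ in range(n_repeats):
        idx = rng.permutation(len(X))
        cut = int(0.8 * len(X))
        train_idx, test_idx = idx[:cut], idx[cut:]
        net = train_network(X[train_idx], y[train_idx])          # hypothetical trainer
        errors.append(evaluate(net, X[test_idx], y[test_idx]))   # hypothetical error measure
    return float(np.mean(errors))
```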

19

Over-Training Prevention

• Running too many epochs can result in over-fitting.

• Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error.

[Figure: error vs. # training epochs, plotted separately for the training data and the test data]
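A minimal early-stopping loop following this rule; train_one_epoch and validation_error are hypothetical helpers, and the patience parameter (stop only after several consecutive bad epochs) is a common refinement, not something stated on the slide:

```python
def train_with_early_stopping(net, train_data, val_data,
                              train_one_epoch, validation_error,
                              max_epochs=1000, patience=5):
    """Stop training when validation error has not improved for `patience` epochs."""
    best_err, best_net, bad_epochs = float("inf"), net, 0
    for epoch in range(max_epochs):
        net = train_one_epoch(net, train_data)    # hypothetical: one pass over the training data
        err = validation_error(net, val_data)     # hypothetical: error on the hold-out set
        if err < best_err:
            best_err, best_net, bad_epochs = err, net, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                             # validation error keeps rising: stop
    return best_net
```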

20

Determining the Best Number of Hidden Units

• Too few hidden units prevents the network from adequately fitting the data.

• Too many hidden units can result in over-fitting.

• Use internal cross-validation to empirically determine an optimal number of hidden units.

[Figure: error vs. # hidden units, plotted separately for the training data and the test data]
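A sketch of choosing the number of hidden units by internal cross-validation; train_network and validation_error are hypothetical helpers, and the candidate sizes are arbitrary:

```python
def choose_hidden_units(train_data, val_data, train_network, validation_error,
                        candidates=(2, 4, 8, 16, 32)):
    """Train one network per candidate size and keep the one with the lowest validation error."""
    results = {}
    for p in candidates:
        net = train_network(train_data, n_hidden=p)      # hypothetical trainer
        results[p] = validation_error(net, val_data)     # hypothetical error measure
    best_p = min(results, key=results.get)
    return best_p, results
```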

21

Learning Rate

• Adaptive learning rate to speed up the training process
• There are many different approaches
• One of them:

[Figure: training error plotted against the weights; annotations (translated from Persian): decrease the weight changes, increase the gradient changes]
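The rule shown on the slide did not survive extraction; as one example of such an approach, here is a sketch of the common "bold driver" heuristic (an assumption, not necessarily the slide's rule):

```python
def adapt_learning_rate(eta, prev_error, new_error, up=1.05, down=0.7):
    """'Bold driver'-style adaptation (illustrative, not necessarily the slide's rule):
    grow the learning rate while the training error keeps falling,
    shrink it sharply when the error increases."""
    if new_error < prev_error:
        return eta * up      # training error fell: take slightly bigger steps
    return eta * down        # training error rose: back off
```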

22

Momentum

• Improves gradient descent's ability to escape local minima
• Adds a percentage of the last movement to the current movement

\Delta w_{ij}(k) = -\eta\,\frac{\partial E}{\partial w_{ij}} + \alpha\,\Delta w_{ij}(k-1)
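A minimal sketch of one gradient-descent step with momentum, following the equation above (grad stands for the gradient of E with respect to the weights):

```python
def momentum_step(w, grad, prev_dw, eta=0.1, alpha=0.9):
    """One update with momentum:
    Delta w(k) = -eta * dE/dw + alpha * Delta w(k-1)."""
    dw = -eta * grad + alpha * prev_dw   # current step = gradient step + fraction of last step
    return w + dw, dw                    # return updated weights and the step for next time
```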

23

Typical Learning Curve

[Figure: "Sum-Squared Network Error for 224 Epochs"; sum-squared error on a log scale (10^-4 to 10^1) vs. epoch (0 to 200)]

24

Typical learning with adaptive learning rate

[Figure: "Training for 103 Epochs"; two panels: sum-squared error on a log scale (10^-4 to 10^2) vs. epoch, and learning rate (0 to 1) vs. epoch]

25

Typical Learning with adaptive learning rate plus momentum

[Figure: "Training for 85 Epochs"; two panels: sum-squared error on a log scale (10^-4 to 10^2) vs. epoch, and learning rate (0 to 2.5) vs. epoch]

26

Hidden Unit Representations

• Trained hidden units can be seen as newly constructed features that make the target concept linearly separable in the transformed space.

• On many real domains, hidden units can be interpreted as representing meaningful features such as vowel detectors or edge detectors.

• However, the hidden layer can also become a distributed representation of the input in which each individual unit is not easily interpretable as a meaningful feature.

27

Appropriate problems for NN

• Instances are represented by many attribute-value pairs

• The target function output may be discrete-valued, real-valued or a vector of several real- or discrete-valued attributes

• The dataset may contain errors

• Long training times are acceptable

• Fast evaluation (test phase) is required

• The ability of humans to understand the learned target function is not important

28

Successful Applications

• Text to Speech (NetTalk)

• Fraud detection

• Financial Applications

• Chemical Plant Control

• Automated Vehicles

• Game Playing

– Neurogammon (a neural-network backgammon program)

• Handwriting recognition

29

More Issues in Neural Nets

• More efficient training methods:
– Quickprop
– Conjugate gradient (exploits 2nd derivative)

• Learning the proper network architecture:
– Grow network until able to fit data
– Shrink large network until unable to fit data

• Recurrent networks that use feedback and can learn finite state machines with “backpropagation through time.”

30

More Issues in Neural Nets (cont.)

• More biologically plausible learning algorithms based on Hebbian learning.

• Unsupervised Learning

– Self-Organizing Feature Maps (SOMs)

• Reinforcement Learning

– Frequently used as function approximators for learning value functions.

31

Assignment 2

• Tom Mitchell problems: 4.1, 4.2, 4.5, 4.7, 4.8, 4.10

32

Remaining Course Plan

• Chapter 9. Genetic Algorithms

• Chapter 13. Reinforcement Learning

• Clustering

• Dimension Reduction Algorithms

• Support Vector Machine

• Cellular Automata

• Other biologically-inspired optimization algorithms (Ant Colony, PSO, Simulated Annealing, …)

• Active Learning
