Neural Networks Dr. G. Bharadwaja Kumar


Page 1:

Neural Networks

Dr. G. Bharadwaja Kumar

Page 2:

Agenda

• Introduction

• An Overview of Neural Networks

• Learning Methods

• Applications

Page 3:

Artificial Neural Networks (ANNs)

• An ANN is an abstract simulation of a real nervous system: a collection of neuron units communicating with each other via axon connections.

• Dr. Robert Hecht-Nielsen defines a neural network as: “a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs”.

Page 4:

Neurons

• The basic computational unit in the nervous system is the nerve cell, or neuron. A "typical" neuron has four distinct parts (or regions). The first part is the cell body (or soma). Incoming signals from other neurons are (typically) received through its dendrites. The outgoing signal to other neurons flows along its axon. A neuron may have many thousands of dendrites, but it will have only one axon. The fourth distinct part of a neuron lies at the end of the axon: the synapses. These are the structures that contain neurotransmitters.

Page 5:

• So, neuron receives information from other neurons, processes it and then relays this information to other neurons.

• We still don't know the exact answer to the question as to what happens in a biological neuron.

• What we do know is that a neuron integrates the signals that arrive, and when this integration exceeds a certain limit, the neuron in turn emits an electro-chemical signal.

Page 6:

• The electro-chemical signal released by a particular neurotransmitter may be such as to encourage the receiving cell to also fire, or to inhibit or prevent it from firing.

• Different neurotransmitters tend to act as excitatory (e.g. acetylcholine, glutamate, aspartate, noradrenaline, histamine) or inhibitory (e.g. GABA, glycine, serotonin), while some (e.g. dopamine) may be either.

• Subtle variations in the mechanisms of neurotransmission allow the brain to respond to the various demands made on it, including the encoding, consolidation, storage and retrieval of memories.

Page 7:

• The human brain consists of about 10^11 neurons.

• On average, each neuron is connected to other neurons through about 10,000 synapses.

• There are about 10^15 connections (synapses) between the cells.

• Neurons typically operate at a maximum rate of about 100 Hz, while a conventional CPU performs operations at gigahertz rates.

• Neuronal computation is parallel and distributed, amounting to roughly 10^18 operations per second, many times greater than what today's computers offer.

Page 8:

Learning in the Brain

• Brains learn
– Altering strength between neurons
– Creating/deleting connections

• Hebb's Postulate (Hebbian Learning)
– When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.

• Long Term Potentiation (LTP)
– Cellular basis for learning and memory
– LTP is the long-lasting strengthening of the connection between two nerve cells in response to stimulation
– Discovered in many regions of the cortex

Page 9:

• The following properties of nervous systems will be of particular interest in our neurally-inspired models:

– parallel, distributed information processing

– high degree of connectivity among basic units

– connections are modifiable based on experience

– learning is a constant process, and usually unsupervised

– learning is based only on local information

– performance degrades gracefully if some units are removed

– etc..........

Page 10:

Learning Methods

Page 11:

Artificial Neurons

[Figure: the McCulloch and Pitts (1943) artificial neuron. Inputs x1 … xn arrive over weighted connections w1 … wn (the "synapses"/dendrites), together with a bias b. The cell body (soma) sums them into the cell potential, and an activation function produces the output sent along the "axon" to other neurons: f(x) = w·x + b.]
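A minimal sketch (not part of the slides) of the neuron above in Python: the cell potential is w·x + b and, as an assumption, a step activation stands in for f. Names and values are illustrative.

```python
import numpy as np

def neuron_output(x, w, b):
    """Illustrative neuron: step activation applied to the cell potential w.x + b."""
    v = np.dot(w, x) + b            # cell potential (weighted sum plus bias)
    return 1.0 if v >= 0 else 0.0   # assumed step activation

# Example with made-up weights and inputs.
print(neuron_output(np.array([1.0, 0.5]), np.array([0.4, -0.2]), 0.1))  # -> 1.0
```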

Page 12:

Analogy between biological and artificial neural networks

Biological Neural Network → Artificial Neural Network
Soma → Neuron
Dendrite → Input
Axon → Output
Synapse → Weight

Page 13:

What are connectionist neural networks?

• Connectionism refers to a computer modeling approach to computation that is loosely based upon the architecture of the brain.

• Many different models, but all include:
– Multiple, individual "nodes" or "units" that operate at the same time (in parallel)
– A network that connects the nodes together
– Information is stored in a distributed fashion among the links that connect the nodes
– Learning can occur with gradual changes in connection strength

Page 14:

A Neural Network generally maps a set of inputs to a set of outputs

Number of inputs/outputs is variable

The Network itself is composed of an arbitrary number of nodes with an arbitrary topology

Page 15:

Architecture of neural networks

Feed-forward networks

• Feed-forward ANNs allow signals to travel one way only: from input to output. There is no feedback (loops), i.e. the output of any layer does not affect that same layer.

• Feed-forward ANNs tend to be straightforward networks that associate inputs with outputs. They are extensively used in pattern recognition.

• This type of organization is also referred to as bottom-up or top-down.

Page 16:

• Feedback networks

• Feedback networks can have signals travelling in both directions by introducing loops in the network. Feedback networks are very powerful and can get extremely complicated.

• Feedback networks are dynamic; their 'state' is changing continuously until they reach an equilibrium point. They remain at the equilibrium point until the input changes and a new equilibrium needs to be found.

• Feedback architectures are also referred to as interactive or recurrent, although the latter term is often used to denote feedback connections in single-layer organizations.

Page 17:

Topologies of Neural Networks

• completely connected
• feed-forward (directed, acyclic)
• recurrent (feedback connections)

Page 18:

Learning paradigms

• Two Major Categories Based On Input Format

– Binary-valued (0s and 1s)

– Continuous-valued

• Basic Learning Categories

– Supervised Learning

– Unsupervised Learning

– Reinforcement learning

Page 19:

Input Data Types

• Discrete data can only take particular values. There may potentially be an infinite number of those values, but each is distinct and there's no grey area in between. Discrete data can be numeric -- like numbers of apples -- but it can also be categorical -- like red or blue, or male or female, or good or bad.

• Continuous data are not restricted to defined separate values, but can occupy any value over a continuous range. Between any two continuous data values there may be an infinite number of others. Continuous data are always essentially numeric.

Page 20:

Learning Algorithms

Page 21:

Supervised learning

• the aim is to infer a function from labeled training data.

• The training data consist of a set of training examples, where each example is a pair consisting of an input object (typically a vector) and a desired output value.

• The aim of supervised training is then to adjust the weight values such that the error between the real output, o, of the neuron and the target output, t, is minimized.

• The parallel task in human and animal psychology is often referred to as concept learning.

• Neural Network models: perceptron, feed-forward, radial basis function

Page 22:

Unsupervised learning

• the aim is to discover patterns or to find hidden structure in unlabeled data.

• Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution.

• Among neural network models, the self-organizing map (SOM) and adaptive resonance theory (ART) are commonly used unsupervised learning algorithms.

Page 23:

Reinforcement learning

• the aim is to reward the neuron (or parts of a NN) for good performance, and to penalize the neuron for bad performance.

• Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected.

Page 24:

Learning in Neural Networks

• Theoretically, a neural network could learn by

1. developing new connections,

2. deleting existing connections,

3. changing connecting weights,

4. changing the threshold values of neurons,

5. varying one or more of the three neuron functions (remember: activation function, propagation function and output function),

6. developing new neurons, or

7. deleting existing neurons (and so, of course, existing connections).

Page 25:

Network layers

• The commonest type of artificial neural network consists of three groups, or layers, of units: a layer of "input" units is connected to a layer of "hidden" units, which is connected to a layer of "output" units.

– The activity of the input units represents the raw information that is fed into the network.

– The activity of each hidden unit is determined by the activities of the input units and the weights on the connections between the input and the hidden units.

– The behaviour of the output units depends on the activity of the hidden units and the weights between the hidden and output units.

Page 26:

Different non-linearly separable problems

Page 27:

Activation Function

• The activation function is used to calculate the output response of the neuron.

• It converts the net input value to the net output value.

Page 28:

• If you want the outputs of a network to be interpretable as posterior probabilities for a categorical target variable, it is highly desirable for those outputs to lie between zero and one and to sum to one.

• The purpose of the softmax activation function is to enforce these constraints on the outputs.
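An illustrative Python sketch of the softmax described above (NumPy assumed); subtracting the maximum is a standard numerical-stability trick, not something stated on the slide.

```python
import numpy as np

def softmax(z):
    # illustrative sketch: outputs lie in (0, 1) and sum to one
    e = np.exp(z - np.max(z))   # shift by the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659, 0.242, 0.099]
```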

Page 29:

Activation functions of a neuron

• Step function:  Y_step = 1 if X ≥ 0, 0 if X < 0

• Sign function:  Y_sign = +1 if X ≥ 0, −1 if X < 0

• Sigmoid function:  Y_sigmoid = 1 / (1 + e^(−X))

• Linear function:  Y_linear = X
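An illustrative Python sketch of the four activation functions above; the function names are my own.

```python
import numpy as np

def step(x):       # 1 if x >= 0, else 0
    return np.where(x >= 0, 1.0, 0.0)

def sign_fn(x):    # +1 if x >= 0, else -1
    return np.where(x >= 0, 1.0, -1.0)

def sigmoid(x):    # squashes any input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def linear(x):     # identity
    return x
```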

Page 30:

Activation Functions

Page 31:

Learning Methods

Page 32:

The Perceptron

The operation of Rosenblatt’s perceptron is based on the McCulloch and Pitts neuron model. The model consists of a linear combiner followed by a hard limiter.

The weighted sum of the inputs is applied to the hard limiter, which produces an output equal to +1 if its input is positive and -1 if it is negative.

Page 33:

• The perceptron is used for supervised classification of an input into one of several possible non-binary outputs.

• It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector describing a given input; the weights are learned using the delta rule.

Page 34:

Perceptron: Limitations

• The XOR function is not linearly separable.

• It is impossible to separate the classes C1 and C2 with only one line.

[Figure: the XOR points plotted in the (x1, x2) plane with classes C1 and C2; no single line separates the two classes.]

Page 35:

• Training time can be exponential in number of features.

• An epoch is a single pass through the entire data set.

• Convergence can take exponentially many epochs, but it is guaranteed to work.

Page 36:

Perceptron Model

• Uses a non-linear (McCulloch–Pitts) model of a neuron:

[Figure: inputs x1 … xm with weights w1 … wm and bias b are summed into v; the output is y = φ(v).]

• φ is the sign function:

φ(v) = +1 if v ≥ 0, −1 if v < 0

Page 37:

• The perceptron is used for classification: classify correctly a set of examples into one of the two classes C1, C2:

If the output of the perceptron is +1, the input is assigned to class C1.

If the output is −1, the input is assigned to class C2.

Page 38:

Perceptron: Classification

• The equation below describes a hyperplane in the input space. This hyperplane is used to separate the two classes C1 and C2:

  Σ_{i=1}^{m} w_i x_i + b = 0

[Figure: in two dimensions the decision boundary is the line w1·x1 + w2·x2 + b = 0; the decision region for C1 is w1·x1 + w2·x2 + b > 0 and the decision region for C2 is w1·x1 + w2·x2 + b ≤ 0.]

Page 39:

Perceptron: Limitations

• The perceptron can only model linearly separable functions.

• The perceptron can be used to model the following Boolean functions:

• AND

• OR

• COMPLEMENT

• But it cannot model the XOR. Why?

Page 40:

Linear Separability

• A hyperplane generalizes a plane to higher-dimensional spaces
o in 2 dimensions it is a line

• Each output neuron in a perceptron corresponds to one hyperplane

• Dimensionality of the hyperplane is determined by the number of incoming connections

• The position of the hyperplane depends on the weight values

Page 41:

• Without a threshold the hyperplane will pass through the origin

o causes problems with learning

• Threshold / bias allows the hyperplane to be positioned away from the origin

• Learning involves positioning the hyperplane

• If the hyperplane cannot be positioned to separate all training examples, then the algorithm will never converge

o never reach 0 errors

Page 42:

Perceptron Learning

• Step 1: Initialize weights and bias. Start with the all-zeroes weight vector w1 = 0, and set the learning rate α (0 < α ≤ 1)

• Step 2: While stopping condition false, do steps 3 to 7.

• Step 3: For each training pair S: t, do steps 4 to 6.

• Step 4: Set the input activations xi = si

Page 43:

• Step 5: Compute the net input to the perceptron and the output response of the perceptron:

  y_in = b + Σ_{i=1}^{n} x_i w_i

  y = 1 if y_in > θ;  0 if −θ ≤ y_in ≤ θ;  −1 if y_in < −θ

Page 44:

• Step 6: Update the bias and weights if the target is not equal to the output response:

if t ≠ y and xi ≠ 0:
  wi(new) = wi(old) + α t xi
  b(new) = b(old) + α t
else: no change in weights

• Step 7: Test for stopping condition. If no weight change occurred in step 3, stop; else continue.

Page 45:

• Conditions:

– Weights connecting to active input units are updated i. e. xi ≠ 0

– Weights are updated for patterns that do not produce the correct output value.
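The sketch below illustrates Steps 1–7 in Python, assuming bipolar inputs/targets and threshold θ = 0 (so the output is just the sign of the net input); it is not the lecturer's code, and the AND example is my own.

```python
import numpy as np

def train_perceptron(X, t, alpha=1.0, max_epochs=100):
    # illustrative sketch of the perceptron rule (Steps 1-7), theta assumed 0
    w = np.zeros(X.shape[1])            # Step 1: all-zero weights
    b = 0.0                             # and bias; alpha is the learning rate
    for _ in range(max_epochs):         # Step 2: repeat while weights change
        changed = False
        for x_i, t_i in zip(X, t):      # Step 3: each training pair s : t
            y_in = b + np.dot(x_i, w)   # Step 5: net input
            y = 1.0 if y_in >= 0 else -1.0
            if y != t_i:                # Step 6: update only on error
                w += alpha * t_i * x_i  # zero inputs get a zero update
                b += alpha * t_i
                changed = True
        if not changed:                 # Step 7: stop when nothing changed
            break
    return w, b

# Example: the linearly separable AND function in bipolar coding.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
t = np.array([1, -1, -1, -1], dtype=float)
print(train_perceptron(X, t))
```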

Page 46:

• Advantages of Perceptrons
– Simple models
– Efficient
– Guaranteed to converge over linearly separable problems
– Easy to analyse
– Well known

• Disadvantages of Perceptrons
– Cannot model non-linearly separable problems
– Some problems may require a bias or threshold parameter
– Not taken seriously anymore

Page 47:

Adaline

• An Adaline is a single-unit neuron that receives input from several units.

• It can be trained using delta rule or Widrow-Hoff learning rule.

• The linear networks (ADALINE) are similar to the perceptron, but their transfer function is linear rather than hard-limiting.

• This allows their outputs to take on any value, whereas the perceptron output is limited to either 0 or 1.

• Linear networks, like the perceptron, can only solve linearly separable problems.

Page 48:

Learning Methods

• We can always train the network to have a minimum error by using the Least Mean Squares (Widrow-Hoff) algorithm.

• The error is the difference between an output vector and its target vector.

• Find values for the network weights and biases such that the sum of the squares of the errors is minimized or below a specific value.

Page 49:

Least Mean Square Minimization

• Errsq = Σ_k (d(k) − W x(k))²

• Grad(Errsq) = 2 Σ_k (d(k) − W x(k)) (−x(k))

• W(new) = W(old) − η Grad(Errsq)

• To ease calculations, use the per-example error Err(k) = d(k) − W x(k) in place of Errsq

• W(new) = W(old) + η Err(k) x(k)

• b(new) = b(old) + η Err(k)

• Continue with the next choice of k

Page 50:

ADALINE Learning Algorithm

• Step 1: Assign random synaptic weight values in the range −1 to 1

• Step 2: While stopping condition false, do steps 3 to 7.

• Step 3: For each bipolar training pair S: t, do steps 4 to 7.

• Step 4: Set the input activations x0 = 1, xi = si

where (i = 1,2, .. n)

Page 51:

Learning Methods

• Step 5: Compute the net input to the neuron:

  y_in = b + Σ_{i=1}^{n} x_i w_i

• Step 6: Update the bias and weights:

  W(new) = W(old) + η Err(k) x(k)
  b(new) = b(old) + η Err(k)
  where k = 1, 2, .. n

Page 52:

• Step 7: Test for stopping condition. If the largest weight change that occurred in step 3 is smaller than a specified value, stop else continue.
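A hedged Python sketch of the ADALINE / Widrow–Hoff updates in Steps 1–7 above: the error is taken on the raw linear net input rather than a thresholded output. The learning rate, epoch count and random seed are arbitrary choices.

```python
import numpy as np

def train_adaline(X, t, alpha=0.01, epochs=50):
    # illustrative sketch of the Widrow-Hoff (LMS) rule
    rng = np.random.default_rng(0)
    w = rng.uniform(-1, 1, X.shape[1])   # Step 1: random weights in [-1, 1]
    b = rng.uniform(-1, 1)
    for _ in range(epochs):
        for x_k, t_k in zip(X, t):       # each bipolar training pair s : t
            y_in = b + np.dot(x_k, w)    # Step 5: linear net input
            err = t_k - y_in             # error of the linear output
            w += alpha * err * x_k       # Step 6: W(new) = W(old) + eta*Err*x
            b += alpha * err             # b(new) = b(old) + eta*Err
    return w, b
```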

Page 53:

Multi Layer Neural Network

• A single-layer neural network has many restrictions and it can accomplish very limited classes of tasks.

• Minsky and Papert (1969) showed that a two-layer feed-forward network can overcome many restrictions, but they did not present a solution to the problem of "how to adjust the weights from input to hidden layer".

• Rumelhart, Hinton and Williams in 1986 proposed the solution that the errors for the units of the hidden layer are determined by back-propagating the errors of the units of the output layer. This method is often called the Back-propagation learning rule.

Page 54:

• Back-propagation can also be considered as a generalization of the delta rule for non-linear activation functions and multi-layer networks.

• Back-propagation is a systematic method of training multi-layer artificial neural networks

Page 55:

BPN

• A BackProp network consists of at least three layers of units:

– an input layer
– at least one intermediate hidden layer
– an output layer

• Typically, units are connected in a feed-forward fashion with input units fully connected to units in the hidden layer and hidden units fully connected to units in the output layer.

Page 56:

• A generalization of the LMS algorithm, called backpropagation, can be used to train multilayer networks.

• Backpropagation is an approximate steepest descent algorithm, in which the performance index is mean square error.

• In order to calculate the derivatives, we need to use the chain rule of calculus.

Page 57:

Back Propagation Algorithm (BPA)

• Since backpropagation uses the gradient descent method, one needs to calculate the derivative of the squared error function with respect to the weights of the network:

  E = ½ (t − y)²

E = the squared error
t = target output
y = actual output of the output neuron

Therefore the error E depends on the output y.

Page 58:

• However, the output y depends on the weighted sum of all its inputs:

  net = Σ_{i=1}^{n} w_i x_i

• For a linear activation function:

  y = Σ_{i=1}^{n} w_i x_i

n = the number of input units to the neuron
w_i = the i-th weight
x_i = the i-th input value to the neuron

• In general, a non-linear, differentiable activation function φ is used:

  y = φ(net)

Page 59:

• We calculate the partial derivative of the error with respect to a weight using the chain rule:

  ∂E/∂w_i = (dE/dy) · (dy/dnet) · (∂net/∂w_i)

∂E/∂w_i = how the error changes when the weights are changed
dE/dy = how the error changes when the output is changed
dy/dnet = how the output changes when the weighted sum changes
∂net/∂w_i = how the weighted sum changes as the weights change

Page 60:

• Since the weighted sum net is just the sum over all products w_i x_i, the partial derivative of the sum with respect to a weight w_i is just the corresponding input x_i:

  ∂net/∂w_i = x_i

• The derivative of the output y with respect to the weighted sum net is simply the derivative of the activation function φ:

  dy/dnet = dφ/dnet

Page 61:

• This is the reason why backpropagation requires the activation function to be differentiable.

• A commonly used activation function is the sigmoid function:

  y = 1 / (1 + e^(−z))

which has a nice derivative of:

  dy/dnet = y (1 − y)

Page 62:

• Finally, the derivative of the error E with respect to the output y is:

  dE/dy = d/dy [ ½ (t − y)² ] = (y − t)

• Putting it all together:

  ∂E/∂w_i = (y − t) · y (1 − y) · x_i

Page 63:

• To update the weight w_i using gradient descent, one must choose a learning rate η.

• The change in weight after learning is then the product of the learning rate and the (negative) gradient:

  Δw_i = −η ∂E/∂w_i = η (t − y) · y (1 − y) · x_i

Page 64:

• For a linear neuron, the derivative of the activation function is 1, which yields:

  Δw_i = η (t − y) x_i

• This is exactly the delta rule for perceptron learning, which is why the backpropagation algorithm is a generalization of the delta rule.

• In backpropagation and perceptron learning, when the output y matches the desired output t, the change in weight w_i is zero, which is exactly what is desired.
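A small Python sketch of the resulting update for a single sigmoid output neuron: E = ½(t − y)², so ∂E/∂w_i = (y − t)·y(1 − y)·x_i and one gradient-descent step moves w against that gradient. Names and values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(x, w, t, eta=0.1):
    # illustrative single-neuron delta-rule step
    y = sigmoid(np.dot(w, x))              # forward pass
    grad = (y - t) * y * (1.0 - y) * x     # dE/dw from the chain rule above
    return w - eta * grad                  # w(new) = w(old) - eta * dE/dw

print(gradient_step(np.array([1.0, 2.0]), np.array([0.1, -0.3]), t=1.0))
```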

Page 65:

Strengths of BP Learning

• Great representation power
– Any L2 function can be represented by a BP net
– Many such functions can be approximated by BP learning (gradient descent approach)

• Easy to apply
– Only requires that a good set of training samples is available
– Does not require substantial prior knowledge or deep understanding of the domain itself
– Tolerates noise and missing data in training samples (graceful degrading)

• Easy to implement the core of the learning algorithm

• Good generalization power
– Often produces accurate results for inputs outside the training set

Page 66:

Drawbacks

• Learning often takes a long time to converge
– Complex functions often need hundreds or thousands of epochs

• Generalization is not guaranteed even if the error is reduced to zero

• The network is essentially a black box
– It may provide a desired mapping between input and output vectors (x, o) but does not explain why a particular x is mapped to a particular o.

Page 67:

Problem with the gradient descent approach

• Only guarantees to reduce the total error to a local minimum.
• Cannot escape from the local-minimum error state.
• How bad this is depends on the shape of the error surface. Too many valleys/wells will make it easy to be trapped in local minima.
• Not every function that is representable can be learned.

• Possible remedies:
– Try nets with different numbers of hidden layers and hidden nodes (they may lead to different error surfaces, some might be better than others)
– Try different initial weights (different starting points on the surface)
– Forced escape from local minima by random perturbation (e.g., simulated annealing)

Page 68:

Learning rate

An important consideration is the learning rate η, which determines by how much we change the weights w at each step. If η is too small, the algorithm will take a long time to converge.

Page 69:

Learning rate

Conversely, if η is too large, we may end up bouncing around the error surface out of control: the algorithm diverges. This usually ends with an overflow error in the computer's floating-point arithmetic.

Page 70:

Activation Function Properties

• Differentiable (known and computable derivative)
• Continuous (derivative defined everywhere)
• Complexity (nonlinear for higher-order tasks)
• Monotonic (derivative positive)
• Bounded (activation output and its derivative are finite)
• Polarity (bipolar preferred to positive)

Page 71:

Multi-Layer Feedforward Networks

• Boolean functions:
– Every Boolean function can be represented by a network with a single hidden layer
– But it might require exponentially many (in the number of inputs) hidden units

• Continuous functions:
– Every bounded continuous function can be approximated with arbitrarily small error by a network with a single hidden layer [Cybenko 1989; Hornik et al. 1989]
– Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].

Page 72:

Topology of MLP

• The MLP topology is said to be feed-forward, i.e. the neurons in each layer feed their output forward to the next layer until we get the final output from the neural network.

• There can be an arbitrary number of hidden layers in an MLP.

• The number of neurons can be completely arbitrary.

• Adding too many hidden layers increases the computational complexity of the network

• One hidden layer is usually enough to allow the MLP to be a universal approximator capable of approximating any continuous function

Page 73:

• In some cases, there may be many interdependencies among the input variables, and adding an extra hidden layer can be helpful.

• Adding hidden layers can sometimes reduce the total number of weights needed for a suitable approximation.

• An MLP with two hidden layers can approximate any non-continuous function.
– A continuous function is a function for which, intuitively, "small" changes in the input result in "small" changes in the output; otherwise it is discontinuous.

Page 74:

Learning Methods

Learning in a multilayer network proceeds the same way as for a perceptron.

A training set of input patterns is presented to the network.

The network computes its output pattern, and if there is an error - or in other words a difference between actual and desired output patterns - the weights are adjusted to reduce this error.

Page 75:

Learning Methods

In a back-propagation neural network, the learning algorithm has two phases.

First, a training input pattern is presented to the network input layer. The network propagates the input pattern from layer to layer until the output pattern is generated by the output layer.

If this pattern is different from the desired output, an error is calculated and then propagated backwards through the network from the output layer to the input layer. The weights are modified as the error is propagated.

Page 76:

What do the middle layers hide?

• A hidden layer "hides" its desired output. Neurons in the hidden layer cannot be observed through the input/output behaviour of the network. There is no obvious way to know what the desired output of the hidden layer should be.

Page 77:

Activation Functions

• Activation functions for the input and output layers are usually one of the following:
– Step, Linear, Threshold logic, Sigmoid

• Hidden layer activation functions might be one of the following:
– Sigmoid: sig(x) = 1/(1 + e^(−x))
– Hyperbolic tangent
– Bipolar Sigmoid: sigb(x) = 2/(1 + e^(−x)) − 1

Page 78:

MLP

Page 79:

What do each of the layers do?

• 1st layer draws linear boundaries
• 2nd layer combines the boundaries
• 3rd layer can generate arbitrarily complex boundaries

Page 80:

Learning Algorithm

• Step 1: Assign random weight values

• Step 2: While stopping condition false, do steps 3 to 10.

• Step 3: For each training pair x: t, do steps 4 to 9.

• Step 4: Each input unit Xi where (i = 1,2, .. n), receives the input signal, xi , and broadcasts it to the next layer.

Page 81:

• Step 5: For each hidden layer neuron, denoted Zj, j = 1, 2, …, p:

  z_inj = v_0j + Σ_i x_i v_ij
  z_j = f(z_inj)

Broadcast z_j to the next layer.

• Step 6: For each output neuron Yk, k = 1, 2, …, m:

  y_ink = w_0k + Σ_j z_j w_jk
  y_k = f(y_ink)

Page 82:

• Step 7: Compute δk for each output neuron Yk:

  δ_k = (t_k − y_k) f′(y_ink)
  Δw_jk = α δ_k z_j
  Δw_0k = α δ_k   (since z_0 = 1)

• Step 8: For each hidden neuron Zj, j = 1, 2, …, p:

  δ_inj = Σ_{k=1}^{m} δ_k w_jk
  δ_j = δ_inj f′(z_inj)
  Δv_ij = α δ_j x_i
  Δv_0j = α δ_j

Page 83:

• Step 9: Update the weights:

  w_jk(new) = w_jk(old) + Δw_jk
  v_ij(new) = v_ij(old) + Δv_ij

• Step 10: Test for the stopping condition.
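A compact Python sketch of Steps 4–10 for one hidden layer of sigmoid units, using the slide's naming (v, v0 for input→hidden weights and biases, w, w0 for hidden→output); it is an illustration under my own assumptions, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, v, v0, w, w0, alpha=0.5):
    # illustrative one-hidden-layer backprop step (not the lecturer's code)
    z = sigmoid(v0 + x @ v)                  # Steps 4-5: hidden activations z_j
    y = sigmoid(w0 + z @ w)                  # Step 6: output activations y_k
    delta_k = (t - y) * y * (1 - y)          # Step 7: (t_k - y_k) f'(y_ink)
    delta_j = (w @ delta_k) * z * (1 - z)    # Step 8: (sum_k delta_k w_jk) f'(z_inj)
    w += alpha * np.outer(z, delta_k)        # Step 9: weight and bias updates
    w0 += alpha * delta_k
    v += alpha * np.outer(x, delta_j)
    v0 += alpha * delta_j
    return v, v0, w, w0, y
```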

Page 84:

Variations in BP

• Heuristic modifications

– Momentum and rescaling variables

– Variable learning rate

• Standard numerical optimization

– Conjugate gradient

– Newton’s method (Levenberg-Marquardt)

Page 85:

Momentum term

• Adding a momentum term (to speed up learning)

– The weight update at time t+1 contains the momentum of the previous updates, e.g.,

  Δw_{k,j}(t+1) = η δ_k(t) x_j(t) + μ Δw_{k,j}(t),  where 0 ≤ μ < 1

– Avoids sudden changes of direction of the weight update (smoothing the learning process)

– Error is no longer monotonically decreasing
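A tiny Python sketch of the momentum update above; the symbol mu and the value 0.9 are illustrative stand-ins for the momentum coefficient.

```python
def momentum_update(w, grad_step, prev_delta, mu=0.9):
    # illustrative: new step = current gradient step + mu * previous step
    delta = grad_step + mu * prev_delta
    return w + delta, delta               # new weights and the step just taken
```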

Page 86:

Variations on learning rate η

– Fixed rate much smaller than 1
– Start with a large η, gradually decrease its value
– Start with a small η, steadily double it until the MSE starts to increase
– Give known under-represented samples higher rates
– Find the maximum safe step size at each stage of learning (to avoid overshooting the minimum E when increasing η)
– Adaptive learning rate (delta-bar-delta method)
• Each weight w_{k,j} has its own rate η_{k,j}
• If Δw_{k,j} remains in the same direction, increase η_{k,j} (E has a smooth curve in the vicinity of the current w)
• If Δw_{k,j} changes direction, decrease η_{k,j} (E has a rough curve in the vicinity of the current w)

Page 87:

LMA

• The Levenberg–Marquardt algorithm (LMA), also known as the damped least-squares (DLS) method, provides a numerical solution to the problem of minimizing a function, generally nonlinear, over a space of parameters of the function. These minimization problems arise especially in least squares curve fitting and nonlinear programming.

• The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent. The LMA is more robust than the GNA, which means that in many cases it finds a solution even if it starts very far off the final minimum. For well-behaved functions and reasonable starting parameters, the LMA tends to be a bit slower than the GNA. LMA can also be viewed as Gauss–Newton using a trust region approach.

Page 88:

Other Methods

• Quickprop algorithm of Fahlman (1988). (It assumes that the error surface is parabolic and concave upward around the minimum point and that the effect of each weight can be considered independently.)

• SuperSAB algorithm of Tollenaere (1990). (It has more complex rules for adjusting the learning rates.)

Page 89:

QuickProp

• One method to speed up the learning is to use information about the curvature of the error surface. This requires the computation of the second order derivatives of the error function.

• Quickprop assumes the error surface to be locally quadratic and attempts to jump in one step from the current position directly into the minimum of the parabola.

Page 90:

• Quickprop computes the derivatives in the direction of each weight. After computing the first gradient with regular backpropagation, a direct step to the error minimum is attempted by

  Δw_ij(t+1) = [ S(t+1) / (S(t) − S(t+1)) ] · Δw_ij(t)

w_ij: weight between units i and j
Δw_ij(t+1): actual weight change
S(t+1): partial derivative of the error function with respect to w_ij
S(t): the last partial derivative
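A one-line Python sketch of the Quickprop step above (argument names are mine): the next change is the previous change scaled by S(t+1) / (S(t) − S(t+1)), a secant jump toward the minimum of the assumed local parabola.

```python
def quickprop_step(prev_delta_w, S_prev, S_curr):
    # illustrative: delta_w(t+1) = S(t+1) / (S(t) - S(t+1)) * delta_w(t)
    return prev_delta_w * S_curr / (S_prev - S_curr)
```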

Page 91:

Batch Backpropagation

• We can use a variation of the standard algorithm, called batching.

• In batching mode the parameters are updated only after the entire training set has been presented.

• The gradients calculated at each training example are averaged together to produce a more accurate estimate of the gradient.
– Smoothing the training sample outliers
– Learning independent of the order of sample presentations
– Usually slower than in sequential mode
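A small Python sketch of batching as described above: per-example gradients are averaged and the weights are updated once per pass over the training set (names and the learning rate are illustrative).

```python
import numpy as np

def batch_update(w, per_example_grads, eta=0.1):
    # illustrative batch-mode update: one step per epoch
    g = np.mean(per_example_grads, axis=0)   # average gradient over the batch
    return w - eta * g
```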

Page 92:

Learning Methods

• If the squared error (over the entire training set) increases by more than some set percentage after a weight update, then the weight update is discarded, the learning rate is multiplied by some factor between 0 and 1, and the momentum coefficient is set to zero.

• If the squared error decreases after a weight update, then the weight update is accepted and the learning rate is multiplied by some factor greater than 1. If the momentum coefficient has previously been set to zero, it is reset to its original value.

• If the squared error increases by less than the set percentage, then the weight update is accepted, but the learning rate and the momentum coefficient are unchanged.

Page 93:

Strengths of Neural Networks

• ability to learn from experience in order to improve their performance and to adapt themselves to changes in the environment.

• able to deal with incomplete information or noisy data and can be very effective especially in situations where it is not possible to define the rules or steps that lead to the solution of a problem.

• In principle, NNs can compute any computable function; in particular, anything that can be represented as a mapping between vector spaces can be approximated to arbitrary precision.

Page 94:

Limitations

• Lack explanation capabilities

• Limitations and expense of hardware technology restrict most applications to software simulations

• Training time can be excessive and tedious

• Usually requires large amounts of training and test data

Page 95: