Artificial Neural Networks
Introduction
Biological neural networks
• There are about 10^11 neurons in the human brain, of more than 100
different kinds, forming more than 10^14 connections.
• When a neuron is activated (the stimuli received through the dendrites
exceed a threshold), the impulses generated by the excitation of the
soma propagate through the axon to the other neurons.
[Figure: structure of a biological neuron — nucleus, cell body (soma), dendrites, axon]
• The contact points between neurons are called synapses and
can be either inhibitory or excitatory, according to the effect
a stimulus received through them has on the neuron.
• At the synaptic level, the terminals of two connected
neurons are separated by a gap across which the nerve
impulse train is transmitted electro-chemically (by releasing
a substance, acetylcholine, which acts as a synaptic mediator).
• The maximum frequency of the impulses is ≤ 1 kHz, so
information transmission/processing is rather slow. The
power of the nervous system is hypothesized to derive from
its parallel, distributed processing of information.
Biological neural networks
Composed of two cascaded stages:
• A linear adder, which produces the so-called net input:
  net = Σ_k w_k i_k , where w_k is the weight associated with the kth input
• A non-linear, threshold-like activation function: o = f(net)
Artificial neuron
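The two-stage neuron above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides; the weights and threshold chosen here (which make the neuron compute a logical AND) are illustrative.

```python
def neuron_output(inputs, weights, theta=0.0):
    """Two-stage artificial neuron: linear adder followed by a step activation."""
    net = sum(w * i for w, i in zip(weights, inputs))  # net = sum_k w_k * i_k
    return 1 if (net - theta) > 0 else 0               # step activation with bias theta

# Example: with these weights and threshold the neuron computes logical AND
print(neuron_output([1, 1], [0.6, 0.6], theta=1.0))  # 1
print(neuron_output([1, 0], [0.6, 0.6], theta=1.0))  # 0
```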
Possible activation functions (the step function or continuous
approximations of it):
• Step function:      o = 1 if (x − θ) > 0, o = 0 if (x − θ) ≤ 0
• Bipolar step:       o = +1 if (x − θ) > 0, o = −1 if (x − θ) ≤ 0
• Sigmoid/logistic:   o = 1 / (1 + e^−(x−θ))   (continuous approximation of the step)
• Hyperbolic tangent: o = tanh(x − θ)          (continuous approximation of the bipolar step)
θ is a constant (bias) that acts as a threshold (it shifts f
along the x axis). It is equivalent to a weight associated
with a constant input whose value is 1.
Artificial neuron
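The four activation functions above can be written directly as Python functions (the function names are illustrative):

```python
import math

def step(x, theta=0.0):           # binary step: 1 if (x - theta) > 0 else 0
    return 1 if (x - theta) > 0 else 0

def bipolar_step(x, theta=0.0):   # bipolar step: +1 or -1
    return 1 if (x - theta) > 0 else -1

def sigmoid(x, theta=0.0):        # logistic: 1 / (1 + e^-(x - theta))
    return 1.0 / (1.0 + math.exp(-(x - theta)))

def tanh_act(x, theta=0.0):       # hyperbolic tangent of (x - theta)
    return math.tanh(x - theta)

# Far from the threshold, the continuous functions approximate the steps:
print(round(sigmoid(5.0), 3), step(5.0))  # 0.993 1
```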
Multi-layer architecture:
• Input layer
• Hidden layer(s) (not directly
accessible from ‘outside’)
• Output layer
A layer is a set of topologically
equivalent neurons (i.e., all are
connected to neurons which
are in turn topologically
equivalent)
[Figure: a multi-layer perceptron, with connections running from the input (IN) side to the output (OUT) side]
Artificial neural network (example: multi-layer perceptron)
A weight w_ij is associated with each connection between
neuron i and neuron j, and is used in the first-stage adder of
the neuron that receives data through it.
The ‘behavior’ of a neural network therefore depends on:
• Number of neurons
• Topology
• Values of the weights associated with its connections
Artificial neural network
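A forward pass through such a network can be sketched as follows. The layer sizes and weight values are illustrative choices, used only to show how the output depends on topology and weights:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, layers):
    """layers: list of (weight_matrix, bias_vector) pairs, one per layer."""
    for W, b in layers:
        # each neuron j computes o_j = f(sum_i w_ij * x_i - theta_j)
        x = [sigmoid(sum(w * xi for w, xi in zip(row, x)) - bj)
             for row, bj in zip(W, b)]
    return x

# 2 inputs -> 2 hidden neurons -> 1 output neuron
hidden = ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])
output = ([[1.0, 1.0]], [0.5])
print(forward([0.3, 0.7], [hidden, output]))
```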
Problems that can be solved by different net configurations (R. P. Lippmann, “An Introduction to Computing with Neural Nets”, IEEE ASSP Magazine, 1987)
Artificial neural networks
Computability
Supervised learning:
[Diagram: an input pattern (example) is fed to the neural network; a comparator compares the network output with the teaching input, and the resulting error drives the weight adaptation]
Artificial neural networks: training
Unsupervised learning:
[Diagram: an input pattern (example) is fed to the neural network; the weight adaptation is driven by the inputs alone, with no teaching input]
Artificial neural networks: training
Supervised Learning
Training can be turned into an optimization task:
minimize a function with a high-dimensional domain (as many
dimensions as there are weights in the network) which measures the
difference between the teaching inputs in the training set and the
actual outputs of the network (the error function).
An iterative procedure (gradient descent) moves the network
weights along the direction of the negative gradient of the
error function, thus guaranteeing that, at each step, the value
of the error function is decreased, until a local minimum is
reached.
Gradient descent is one of the so-called trajectory-based
optimization methods: a single point moves within the search
space; each step aims at moving it to a better location.
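The trajectory-based idea can be sketched in one dimension: a single point repeatedly moves opposite to the gradient until it settles in the local minimum nearest to its starting location. The function and step size below are illustrative.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Move a single point along the negative gradient for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)  # step opposite to the gradient direction
    return x

# E(x) = (x - 2)^2 has its minimum at x = 2; its gradient is E'(x) = 2(x - 2)
x_min = gradient_descent(lambda x: 2 * (x - 2), x0=0.0)
print(round(x_min, 4))  # 2.0
```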
[Figure: a single-layer network processing pattern p — the inputs i_1 … i_M feed the neurons n_1 … n_N; each neuron j receives the inputs through the weights w_ij, computes the net input net_p,j and produces the output o_p,j; the outputs form the vector O_p]
Given:
• a single-layer network with linear activations: o_j(W) = Σ_i w_ij i_i
• a training set T = { (x_p, t_p) : p = 1, …, P } , P = number of examples
• a squared error function computed on the pth pattern:
  E_p = (1/2) Σ_{j=1..N} (t_pj − o_pj)² , N = number of output neurons,
  o_pj, t_pj = output/teaching input for neuron j
• a global error function:
  E = Σ_{p=1..P} E_p = E(W) , W = matrix of the weights w_ij
  associated with the connections i→j (from neuron i to neuron j)
E can be minimized using gradient descent, converging to the local
minimum of E which is closest to the initial conditions
(corresponding to the values at which the weights are initialized).
Supervised learning: Delta Rule (Widrow & Hoff’s rule)
[Figure: an error surface with two local minima, x and y, and their respective basins of attraction]
Supervised learning: gradient descent
The gradient of E has the partial derivatives ∂E/∂w_ij as components:
∂E/∂w_ij = Σ_p ∂E_p/∂w_ij
From the differentiation rule for composite functions (E = E(O(W))):
∂E_p/∂w_ij = ∂E_p/∂o_pj · ∂o_pj/∂w_ij
From the definition of the error function:
−∂E_p/∂o_pj = (t_pj − o_pj) = δ_pj (the error produced by neuron j
when pattern p is input)
Since the neuron activations are linear (o_pj = Σ_i w_ij i_pi):
∂o_pj/∂w_ij = i_pi , therefore ∂E_p/∂w_ij = −δ_pj i_pi
and thus ∂E/∂w_ij = −Σ_p δ_pj i_pi
Supervised learning: Delta Rule (Widrow & Hoff’s rule)
If we apply gradient descent:
Δw_ij = −ε ∂E/∂w_ij = −Σ_p ε ∂E_p/∂w_ij = ε Σ_p δ_pj i_pi = Σ_p Δ_p w_ij
(the last equality defines Δ_p w_ij)
If ε is small enough, one can modify the weights after each single
pattern according to the rule:
Δ_p w_ij = ε δ_pj i_pi
NB: all these quantities are easily computed.
Supervised learning: Delta Rule (Widrow & Hoff’s rule)
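The online delta rule above can be sketched for a single linear neuron. The training data (a linear target t = 2·x1 − x2), learning rate and epoch count are illustrative choices:

```python
import random

def train_delta_rule(patterns, targets, n_inputs, eps=0.05, epochs=200):
    """Online delta rule: after each pattern p, w_i += eps * delta_p * i_pi."""
    random.seed(0)  # small random positive initial weights
    w = [random.uniform(0.05, 0.1) for _ in range(n_inputs)]
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            o = sum(wi * xi for wi, xi in zip(w, x))   # linear activation
            delta = t - o                              # delta_p = t_p - o_p
            w = [wi + eps * delta * xi for wi, xi in zip(w, x)]
    return w

# Learn t = 2*x1 - x2 from a few examples
X = [[1, 0], [0, 1], [1, 1], [2, 1]]
T = [2, -1, 1, 3]
w = train_delta_rule(X, T, n_inputs=2)
print([round(wi, 2) for wi in w])  # near [2.0, -1.0]
```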
Supervised learning: gradient descent
I. Initialize the weights
II. Repeat
For each pattern in the training set:
a. Compute the output produced by the present
configuration of the network
b. Compute the error function
c. Modify weights by moving them in a direction
that is opposite to the direction of the error
function gradient with respect to the weights
until the error function reaches a pre-set value OR
a preset maximum number of iterations is reached.
Problems
• It is not possible, in general, to compute the gradient of the
error function with respect to all the weights for any
network configuration. However, some configurations (see
the following slides) do allow gradients to be computed or
approximated.
• Even when this is possible, gradient descent will find a local
minimum, which may be very far from the global one.
Supervised learning: gradient descent
The delta rule can be applied only to a particular neural net
(single-layer with linear activation functions).
The delta rule can be generalized and applied to multi-layer
networks with non-linear activation functions.
This is possible for feedforward networks (aka Multi-layer
Perceptrons or MLP) where a topological ordering of neurons
can be defined, as well as a sequential order of activation.
The activation function f(net) of all neurons must be continuous,
differentiable and non-decreasing.
net_pj = Σ_i w_ij o_pi for a multi-layer network
(i = index of a neuron forward-connected to j, i.e., i < j if the neurons
are ordered starting from the input layer)
Supervised learning:
Generalized Delta Rule (Backpropagation)
We want to apply gradient descent to minimize the
total squared error (this holds for any topology):
Δ_p w_ij = −ε ∂E_p/∂w_ij
Notice that E_p = E_p(O_p) = E_p(f(net_p(W))).
Applying the differentiation rule for composite functions:
∂E_p/∂w_ij = ∂E_p/∂o_pj · ∂o_pj/∂net_pj · ∂net_pj/∂w_ij
(the product of the first two factors is ∂E_p/∂net_pj)
∂net_pj/∂w_ij = ∂/∂w_ij (Σ_k w_kj o_pk) = o_pi
Supervised learning:
Generalized Delta Rule (Backpropagation)
If we define δ_pj = −∂E_p/∂net_pj , we obtain:
Δ_p w_ij = ε δ_pj o_pi
(the same formulation as for the Delta rule)
We know o_pi; we need to compute δ_pj
(δ_pj is defined as in the delta rule: in fact, for a linear neuron,
o_pj = net_pj).
Supervised learning:
Generalized Delta Rule (Backpropagation)
δ_pj = −∂E_p/∂net_pj = −∂E_p/∂o_pj · ∂o_pj/∂net_pj
Notice that o_pj = f_j(net_pj) and
f depends on a single variable, thus
∂o_pj/∂net_pj = do_pj/dnet_pj = f′_j(net_pj)
If the jth neuron is an output neuron, then
∂E_p/∂o_pj = −(t_pj − o_pj)
Therefore, for such neurons, δ_pj = (t_pj − o_pj) f′_j(net_pj)
Supervised learning:
Generalized Delta Rule (Backpropagation)
Instead, if the jth neuron is a hidden neuron (here is where
topology becomes important):
∂E_p/∂o_pj = Σ_k ∂E_p/∂net_pk · ∂net_pk/∂o_pj (with k > j)
= Σ_k ∂E_p/∂net_pk · ∂/∂o_pj (Σ_i w_ik o_pi)
= Σ_k ∂E_p/∂net_pk · w_jk = −Σ_k δ_pk w_jk
This means that, for the hidden neurons:
δ_pj = f′_j(net_pj) · Σ_k δ_pk w_jk (with k > j)
Therefore, for the hidden neurons, δ_pj can be computed
recursively starting from the output neurons (error
backpropagation).
Supervised learning:
Generalized Delta Rule (Backpropagation)
In an MLP we know the values of all these terms!
1. The weights are initialized.
2. The pth pattern is given as input:
 • the corresponding network outputs o_pj are computed;
 • the deltas are computed for the output layer:
   δ_pj = (t_pj − o_pj) f′_j(net_pj)
 • the deltas for the hidden layers,
   δ_pj = f′_j(net_pj) · Σ_k δ_pk w_jk ,
   are computed iteratively, starting from the hidden
   layer closest to the outputs.
3. The weights are modified as follows: Δ_p w_ij = ε δ_pj o_pi
4. Steps 2 and 3 are repeated until convergence.
Supervised learning:
Backpropagation algorithm
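The steps above can be sketched for a network with one hidden layer, a single sigmoid output, and sigmoid hidden units (for which f′(net) = o·(1 − o)). The XOR data, layer sizes, learning rate and epoch count are illustrative choices, not from the slides:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w, bh, v, bo):
    """Forward pass: inputs -> sigmoid hidden layer -> single sigmoid output."""
    h = [sigmoid(sum(w[i][j] * x[i] for i in range(len(x))) + bh[j])
         for j in range(len(bh))]
    o = sigmoid(sum(v[j] * h[j] for j in range(len(h))) + bo)
    return h, o

def train_backprop(data, n_hidden=4, eps=0.5, epochs=3000, seed=0):
    random.seed(seed)
    n_in = len(data[0][0])
    # small random initial weights; biases start at 0
    w = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    bh = [0.0] * n_hidden
    v = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    bo = 0.0
    for _ in range(epochs):
        for x, t in data:
            h, o = forward(x, w, bh, v, bo)
            d_o = (t - o) * o * (1 - o)            # output delta: (t - o) * f'(net)
            d_h = [h[j] * (1 - h[j]) * d_o * v[j]  # hidden deltas, backpropagated
                   for j in range(n_hidden)]
            for j in range(n_hidden):              # online updates: eps * delta * o_i
                v[j] += eps * d_o * h[j]
                bh[j] += eps * d_h[j]
                for i in range(n_in):
                    w[i][j] += eps * d_h[j] * x[i]
            bo += eps * d_o
    return w, bh, v, bo

xor = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
params = train_backprop(xor)
for x, t in xor:
    _, o = forward(x, *params)
    print(x, t, round(o, 2))
```

XOR is the classic example here because a single-layer network cannot represent it, so the hidden-layer deltas are actually exercised.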
Weights should be updated after processing all training patterns
(batch learning). In fact: Δw_ij = Σ_p Δ_p w_ij = ε Σ_p δ_pj o_pi
If ε is small, the same result is obtained by updating the weights
after processing each pattern (online learning).
If the weights are initialized to 0, convergence problems occur: usually
small positive random values (in a range such as 0.05–0.1) are used
for weight initialization.
For this procedure to be an actual gradient descent ε should be
infinitesimal, but the smaller ε, the slower the convergence.
However, if ε is too large one might ‘fly over’ a minimum.
We may even want this to happen, if it is a local minimum (see the
next slides).
Gradient descent is not computationally efficient.
Supervised learning: Backpropagation
• Setting the network structure: how many neurons in how
many layers? The only constraints are imposed on the number of
neurons in the input and output layers.
Usually, the layers closer to the input are larger (they include
more neurons), to allow enough degrees of freedom to
recombine the inputs appropriately; the following layers are
generally smaller, to favor generalization and limit overfitting.
Incremental algorithms exist that start from very few neurons
and add them step by step until good performance is
reached, or start from a large network and then ‘prune’ it by
removing connections associated with weights that are smaller
than a threshold.
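The magnitude-based pruning just mentioned can be sketched as follows; marking a pruned connection with None is an illustrative representation choice:

```python
def prune(weight_matrix, threshold):
    """Remove connections whose weights are smaller (in absolute value)
    than the threshold, marking them as None."""
    return [[w if abs(w) >= threshold else None for w in row]
            for row in weight_matrix]

W = [[0.8, -0.02, 0.4], [0.01, -0.9, 0.03]]
print(prune(W, 0.05))  # [[0.8, None, 0.4], [None, -0.9, None]]
```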
Supervised learning: Practical problems when using the MLP
• Setting the learning rate: a large learning rate makes it
easier for the network to escape local minima (which often
have a narrow ‘basin of attraction’) and jump into others; a
small learning rate produces a more literal gradient descent
into the closest local minimum. Usually we want to first
explore the space (large ε) and then, once we have reached a
good point, refine our search (small ε).
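One common way to get this explore-then-refine behavior is to decay ε over the epochs. The hyperbolic schedule below is an illustrative choice, not from the slides:

```python
def decayed_eps(eps0, epoch, decay=0.01):
    """Shrink the learning rate as training proceeds: eps0 / (1 + decay * epoch)."""
    return eps0 / (1.0 + decay * epoch)

print(round(decayed_eps(0.5, 0), 3))    # 0.5  (large: explore)
print(round(decayed_eps(0.5, 900), 3))  # 0.05 (small: refine)
```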
Supervised learning: Practical problems when using the MLP
• Setting a stopping condition: typically, a maximum number of
iterations must be set. Restarting from the same weight
configuration is always possible; going back is not!
A good practice is to save the network configuration whenever a
new minimum is reached (NB: you need a validation set here!)
and to finally stop the algorithm when overfitting
appears.
In fact, usually, during the first iterations (NB: there could be many!)
one can observe a decrease in the error over both the training and the
validation set. After a certain number of iterations the network
will start overfitting the training data, i.e., the error will still be
decreasing on the training set, but will start increasing on the
validation set.
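The save-the-best-configuration policy above can be sketched as an early-stopping loop. `train_one_epoch` (returning the current weights) and `validation_error` are hypothetical callables standing in for a real training procedure:

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, patience=20):
    """Keep the configuration with the lowest validation error; stop once
    the validation error has not improved for `patience` epochs."""
    best_err, best_epoch, best_snapshot = float("inf"), 0, None
    for epoch in range(max_epochs):
        snapshot = train_one_epoch()          # hypothetical: returns current weights
        err = validation_error(snapshot)      # hypothetical: error on validation set
        if err < best_err:                    # new minimum: save the configuration
            best_err, best_epoch, best_snapshot = err, epoch, snapshot
        elif epoch - best_epoch >= patience:  # overfitting has appeared: stop
            break
    return best_snapshot, best_err
```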
Supervised learning: Practical problems when using the MLP
• Gradient descent is a deterministic algorithm, but weight
initialization is a random process, which means that the results
obtained by running the network several times on the same data
with different random sequences are instances of a random
variable, and must be treated as such! Hence, it makes no sense
to compare different network configurations based on a single run
of each. Appropriate statistics must be collected by running the
BP algorithm several times for each configuration, and appropriate
statistical tests must be applied to show that the distributions of
the results, and especially their mean values, are significantly
different.
Never forget that drawing the training/validation/test
sets from a data set is already enough to make the whole training
procedure stochastic.
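The evaluation protocol above can be sketched as follows. `run_training` is a hypothetical callable returning a test-set error for one run of a given configuration; here it is simulated with noisy draws so the sketch is self-contained:

```python
import random
import statistics

def compare_configurations(run_training, configs, n_runs=30, seed=0):
    """Run each configuration several times and collect mean/stdev of the
    resulting errors, instead of relying on a single run."""
    random.seed(seed)
    results = {}
    for name in configs:
        errors = [run_training(name) for _ in range(n_runs)]
        results[name] = (statistics.mean(errors), statistics.stdev(errors))
    return results

# Simulated example: configuration "B" has a lower mean error, both are noisy
def fake_run(name):
    return random.gauss(0.20 if name == "A" else 0.15, 0.03)

stats = compare_configurations(fake_run, ["A", "B"])
for name, (mean, std) in stats.items():
    print(f"{name}: mean={mean:.3f} std={std:.3f}")
```

A real comparison would then apply a statistical test (e.g. on the two samples of errors) rather than just comparing the means.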
Supervised learning: Practical problems when using the MLP