Contents of the 8th lecture
1. Introduction to Soft Control: Definition and Limitations, Basics of "Intelligent" Systems
2. Knowledge Representation and Knowledge Processing (Symbolic AI)
Application: Expert Systems
3. Fuzzy Systems: Dealing with Fuzzy Knowledge
Application: Fuzzy Control
4. Connectionist Systems: Neural Networks
Application: Identification and Neural Control
1. Basics
2. Learning
5. Genetic Algorithms: Stochastic Optimization
Application: Optimization
6. Summary & Literature References
Contents of the 7th Lecture
Learning in Neural Networks
Supervised learning
Fixed learning task:
Given: input E, output A
Example: Backpropagation
Unsupervised learning
Free learning task:
Given: input E
Example: Competitive Learning
Unsupervised Learning
Learning in Neural Networks
Supervised learning
Fixed learning task:
Given: input E, output A
Example: Backpropagation
Unsupervised learning
Free learning task:
Given: input E
Example: Competitive Learning
Source: Carola Huthmacher
Principle of Competitive Learning for the Clustering Problem
Objectives of clustering:
• Differences between objects within a cluster are minimal
• Differences between objects of different clusters are maximal
Learning through competition:
• Competition principle
• Objective: each group activates exactly one output neuron (binary)
Architecture of a Competitive Learning Network
(Figure: fully connected network with an input layer of n neurons and a competitive layer of m neurons)
Input: x = (1 0 1 1) ∈ Rn
Output: y = (1 0) ∈ Bm
Processes in the Competitive Layer
(Figure: neuron j of the competitive layer with weights wj1, wj2, ..., wjn receiving the input x = (x1 x2 ... xn) ∈ Rn)
• Measure of the distance (displacement) between input and weight vector:
Sj = Σi wij·xi = |wj||x|·cos φ
Sj is large for a small displacement (small angle φ)
• Winner: neuron j with Sj > Sk for all k ≠ j
• Output:
y_winner = 1
y_loser = 0
("winner takes all")
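As a small illustration (not from the slides), the winner selection can be sketched in Python, assuming a weight matrix W whose rows are the weight vectors wj:

import numpy as np

def winner_take_all(W, x):
    """Binary output of a competitive layer.
    W: (m, n) weight matrix, one row per competitive neuron.
    x: (n,) input vector."""
    S = W @ x                      # Sj = sum_i wij * xi for every neuron j
    y = np.zeros(len(S))
    y[np.argmax(S)] = 1.0          # only the winner fires ("winner takes all")
    return y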
Unsupervised Learning Algorithms
• Initialization:
Either random weights (normalized weight vectors)
or vectors taken from the training inputs (normalized) as initial weights
• Competitive process
• Learning:
Input is a vector x
Recalculate the weights of the winner neuron j:
wj(t+1) = wj(t) + η(t)·[x − wj(t)]
η(t) is the learning rate (0.01 – 0.3)
during the process the learning rate is gradually reduced
Normalization of the weight vector
• Termination:
Stop when a termination criterion is fulfilled
(Figure: on the unit circle, the update η(t)·[x − wj(t)] moves the weight vector wj(t) a fraction of the way toward the input x, giving wj(t+1))
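A minimal Python sketch of this algorithm (not from the slides), assuming the rows of X are normalized input vectors; all names are illustrative:

import numpy as np

def competitive_learning(X, m, eta0=0.3, eta_min=0.01, epochs=50, seed=0):
    """Cluster the rows of X with m competitive neurons (winner takes all)."""
    rng = np.random.default_rng(seed)
    # Initialization: randomly chosen training inputs as initial (normalized) weights
    W = X[rng.choice(len(X), size=m, replace=False)].copy()
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    for e in range(epochs):
        eta = max(eta_min, eta0 * (1 - e / epochs))   # gradually reduced learning rate
        for x in X:
            j = np.argmax(W @ x)                      # competition: winner neuron
            W[j] += eta * (x - W[j])                  # wj(t+1) = wj(t) + eta*[x - wj(t)]
            W[j] /= np.linalg.norm(W[j])              # re-normalize the winner's weights
    return W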
Advantages and Disadvantages
• Disadvantages:
Difficult to find a good initialization
Unstable
Problem: choosing the number of neurons in the competitive layer
• Advantages:
Good clustering
Simple and fast algorithm
Building block for more complex networks
Supervised Learning
Learning in Neural Networks
Supervised learning
Fixed learning task:
Given: input E, output A
Example: Backpropagation
Unsupervised learning
Free learning task:
Given: input E
Example: Competitive Learning
Source: Dr. Van Bang Le
The Backpropagation Learning Algorithm
History
• Werbos (1974)
• Rumelhart, Hinton, Williams (1986)
• The most important and best-known supervised learning procedure for feed-forward networks
Idea:
• Minimize the error function by gradient descent
Consequences
• Backpropagation is a gradient-based procedure
• Learning here is mathematics, not biologically motivated!
Task and Objective of Backpropagation Learning
• Learning task:
Set of input/output examples (training set):
L = {(x1, t1), ..., (xk, tk)}, where:
xi = input example (input pattern)
ti = solution (desired output, target) for input xi
• Learning objective:
Each task (x, t) from L should be computed by the network with as little error as possible.
General Approach to Learning with Backpropagation
• Subdivision of the available data into
Training data
Validation data
• Training until the desired error is reached
• Validation
• Problem: optimal stopping point for training (see the sketch below)
Underfitting
Overfitting
(Figure: error over training iterations; the training error keeps decreasing while the validation error starts rising again after the optimal stopping point)
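A minimal early-stopping sketch (not from the slides) of how the optimal stopping point can be detected; train_step and val_error are assumed callables supplied by the user:

def train_with_early_stopping(weights, train_step, val_error, max_epochs=1000, patience=10):
    """train_step(weights) -> weights after one training epoch (assumed callable).
    val_error(weights)  -> error on the validation data (assumed callable)."""
    best_w, best_err, bad = weights, float("inf"), 0
    for epoch in range(max_epochs):
        weights = train_step(weights)
        err = val_error(weights)
        if err < best_err:
            best_w, best_err, bad = weights, err, 0
        else:
            bad += 1
            if bad >= patience:      # validation error keeps rising: overfitting has begun
                break
    return best_w                    # weights at the optimal stopping point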
The Backpropagation Learning Algorithm
• Error measurement:
Let (x, t) ∈ L and let y be the actual output of the network for input x.
• Error for the pair (x, t):
Ex,t = ½ Σi (ti − yi)² = ½ ||t − y||²
• Total error:
E = Σ((x,t)∈L) Ex,t
• Note:
The factor ½ is not essential (||t − y||² is minimal exactly when ½ ||t − y||² is minimal), but it simplifies the formulas later on.
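For illustration (a sketch, not part of the slides), these two error measures can be written directly; the forward argument is an assumed placeholder that maps an input x to the network output y:

import numpy as np

def pattern_error(t, y):
    """E_{x,t} = 1/2 * ||t - y||^2 for one training pair."""
    return 0.5 * np.sum((np.asarray(t) - np.asarray(y)) ** 2)

def total_error(pairs, forward):
    """E = sum of E_{x,t} over the training set L (list of (x, t) pairs)."""
    return sum(pattern_error(t, forward(x)) for x, t in pairs)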
The Gradient Method
1. Consider the error as a function of the weights.
2. The weight vector w = (w11, w12, ...) corresponds to the point (w, E(w)) on the error surface.
3. Since E is differentiable, the gradient of the error surface can be computed at the point w; descend a small fraction along the negative gradient to obtain a new weight vector w'.
4. Repeat the procedure at the point w' ...
(Figure: error E over the weights, with a descent step from w to w' on the error surface)
Gradient
Let f : ℝn → ℝ be a real-valued function.
• ∇f(x1, ..., xn) points "in the direction of the steepest ascent" of f at the point (x1, ..., xn).
Direction of descent: −∇f
Example: f(x1, x2) = ½ x1² − x2 , ∇f(x1, x2) = (x1, −1)
• Partial derivative of f with respect to xi: ∂f/∂xi
• Gradient of f: ∇f = (∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xn)
• Direction of descent in the xi-direction: −∂f/∂xi
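As a small sketch (not from the slides), the gradient of the example f(x1, x2) = ½ x1² − x2 can be checked numerically and one descent step taken; the step size 0.1 is an arbitrary illustrative choice:

import numpy as np

def f(x):
    return 0.5 * x[0] ** 2 - x[1]

def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x by central differences."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([2.0, 1.0])
g = numerical_gradient(f, x)        # approximately (x1, -1) = (2, -1)
x_new = x - 0.1 * g                 # one step in the descent direction -grad f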
Backpropagation for Multilayer Networks
We consider multilayer networks without shortcuts
(pure feed-forward networks with connections between successive layers only).
Notation:
For the input x the network is evaluated completely (forward pass)!
• A := {i : i is an output neuron}, the set of output neurons
For (x, t) ∈ L the output for input x is then y = (oi), i ∈ A
• Output of neuron i: oi
• Input of neuron j: netj := Σ(i: i→j) oi·wij
(wij is the weight of the connection i → j)
Backpropagation for Multilayer Networks: Notation, Error Function
Error function:
Ex,t = ½ Σ(j∈A) (tj − oj)²   and   E = Σ((x,t)∈L) Ex,t
• oj = f(netj), where f is the activation function of the neurons.
• netj = Σ(i: i→j) oi·wij
If f is differentiable, then Ex,t and E are also differentiable, and the gradient descent method can be applied!
Offline version: weight change after calculation of the total error E (batch learning)
Online version: weight change after calculation of the current error Ex,t
Sigmoid as the Activation Function
Until now, the activation function f was the step function, which is not differentiable everywhere.
As the activation function for all neurons we now use the sigmoid function s(x) = s1(x), which is differentiable everywhere:
sc(x) = 1 / (1 + e^(−cx))
It holds: s'(x) = s(x)·(1 − s(x))
(Figure: step function; sigmoid curves s1 and s2 for different steepness c)
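A small sketch (not from the slides) of the sigmoid and a numerical check of the identity s'(x) = s(x)·(1 − s(x)):

import numpy as np

def sigmoid(x, c=1.0):
    """s_c(x) = 1 / (1 + exp(-c*x)); c = 1 gives the s1 used on this slide."""
    return 1.0 / (1.0 + np.exp(-c * x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # s'(x) = s(x) * (1 - s(x))

x = 0.7
numeric = (sigmoid(x + 1e-6) - sigmoid(x - 1e-6)) / 2e-6
assert abs(numeric - sigmoid_prime(x)) < 1e-6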
The Backpropagation Learning Algorithm: Online Version
(1) Initialize the weights wij with random values
(2) Choose a pair (x, t) ∈ L
(3) Calculate the output y for input x
(4) Consider the error Ex,t as a function of the weights:
Ex,t = ½ ||t − y||² = Ex,t(w11, w12, ...)
(5) Change each wij by a small fraction (learning rate η) in the direction of steepest descent of the error:
wij := wij + η·(−∂Ex,t/∂wij)
(6) If the termination criterion is not met, repeat from (2)
The Backpropagation Learning Algorithm: Online Version (2)
For a fixed pair i, j, Ex,t is considered as a function of wij
(all other weights are treated as constants in this calculation).
• Ex,t depends on the network output y (i.e. on oj, j ∈ A)
• oj, j ∈ A, depends on the input of neuron j, netj
• netj depends on wkj and ok, for all connections k → j
• ...
Backpropagation: calculation of −∂Ex,t/∂wij for the connection i → j
So −∂Ex,t/∂wij is determined backward through the network!
The Backpropagation Learning Algorithm: Online Version (3)
Calculation of −∂Ex,t/∂wij for the connection i → j.
Dependency: Ex,t depends on netj, and netj depends on wij.
Application of the chain rule:
−∂Ex,t/∂wij = (−∂Ex,t/∂netj) · (∂netj/∂wij)
with ∂netj/∂wij = oi
and δj := −∂Ex,t/∂netj, the "error signal" of neuron j
The Backpropagation Learning Algorithm: Online Version (4)
Dependency: Ex,t depends on oj, and oj depends on netj.
Application of the chain rule:
−∂Ex,t/∂netj = (−∂Ex,t/∂oj) · (∂oj/∂netj)
• ∂oj/∂netj = f'(netj) = ...
For the sigmoid activation function f = s it further holds:
... = s'(netj) = s(netj)·(1 − s(netj)) = oj·(1 − oj)
The Backpropagation Learning Algorithm: Online Version (5)
Calculation of ∂Ex,t/∂oj for the connection i → j.
Case 1: j is an output neuron.
∂Ex,t/∂oj = ∂/∂oj [ ½ Σ(k∈A) (tk − ok)² ]
= 2 · ½ · (tj − oj) · (−1)
= −(tj − oj)
The Backpropagation Learning Algorithm: Online Version (6)
Case 2: j is not an output neuron.
Calculation of ∂Ex,t/∂oj for the connection i → j.
Dependency: oj is passed on to all successor neurons k of j, and Ex,t depends on their inputs netk!
Application of the chain rule:
∂Ex,t/∂oj = Σ(k: j→k) (∂Ex,t/∂netk) · (∂netk/∂oj)
with ∂netk/∂oj = wjk, so
−∂Ex,t/∂oj = Σ(k: j→k) δk·wjk
The Backpropagation Learning Algorithm: Online Version (7)
Summary:
Error signal:
δj = oj·(1 − oj)·(tj − oj)               if j ∈ A
δj = oj·(1 − oj)·Σ(k: j→k) δk·wjk        otherwise
Descent direction for wij: −∂Ex,t/∂wij = oi · δj
Correction for wij: wij = wij + η · oi · δj
To calculate δj, all δk must be known for all connections j → k
→ Backpropagation
The Backpropagation Learning Algorithm: Online Version (8)
• Initialize the weights with random values
• Determine the abort criterion ε for the total error E
• Determine the maximum number of epochs emax
e := 1
repeat
E := 0
for all (x, t) ∈ L do
• compute Ex,t = ½ Σ(j∈A) (tj − oj)²
• E := E + Ex,t
• calculate the error signals δj backward, layer by layer, starting with the output layer
• wij := wij + η · oi · δj
endfor
e := e + 1
until (E meets the abort criterion ε) or (e > emax)
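A compact Python sketch (not from the slides) of this online version for a network with one hidden layer, using the sigmoid activation and the error signals δj derived above; layer sizes, parameter names and the absence of bias weights are illustrative choices:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_online(L, n_in, n_hidden, n_out, eta=0.5, eps=1e-3, e_max=10000, seed=0):
    """Online backpropagation for a feed-forward net with one hidden layer.
    L: list of (x, t) pairs as 1-D numpy arrays."""
    rng = np.random.default_rng(seed)
    # small random initial weights (symmetry breaking, see below)
    W1 = rng.uniform(-0.5, 0.5, (n_hidden, n_in))   # input -> hidden
    W2 = rng.uniform(-0.5, 0.5, (n_out, n_hidden))  # hidden -> output
    for epoch in range(e_max):
        E = 0.0
        for x, t in L:
            # forward pass
            o_h = sigmoid(W1 @ x)                    # hidden outputs
            o_y = sigmoid(W2 @ o_h)                  # network outputs
            E += 0.5 * np.sum((t - o_y) ** 2)        # E_{x,t}
            # error signals delta, backward from the output layer
            delta_y = o_y * (1 - o_y) * (t - o_y)            # output neurons
            delta_h = o_h * (1 - o_h) * (W2.T @ delta_y)     # hidden neurons
            # weight corrections w_ij := w_ij + eta * o_i * delta_j
            W2 += eta * np.outer(delta_y, o_h)
            W1 += eta * np.outer(delta_h, x)
        if E < eps:                                  # abort criterion on the total error
            break
    return W1, W2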
The Backpropagation Learning Algorithm: Offline Version
Offline means that the total error over all input data is minimized.
In this mode, the weights are modified only after the presentation of all tasks (x, t) ∈ L:
wij := wij + Δwij
Δwij = η·(−∂E/∂wij) = Σ((x,t)∈L) η·(−∂Ex,t/∂wij) = Σ((x,t)∈L) η·oi(x)·δj(x)
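A minimal sketch (not from the slides) of one offline (batch) epoch: the corrections are accumulated over all pairs and applied only at the end; the network structure and names match the illustrative online sketch above:

import numpy as np

def backprop_offline_epoch(L, W1, W2, eta=0.5):
    """One batch-learning epoch for the same two-layer sigmoid network."""
    dW1 = np.zeros_like(W1)
    dW2 = np.zeros_like(W2)
    for x, t in L:
        o_h = 1.0 / (1.0 + np.exp(-(W1 @ x)))         # forward pass (sigmoid)
        o_y = 1.0 / (1.0 + np.exp(-(W2 @ o_h)))
        delta_y = o_y * (1 - o_y) * (t - o_y)          # error signals
        delta_h = o_h * (1 - o_h) * (W2.T @ delta_y)
        dW2 += eta * np.outer(delta_y, o_h)            # accumulate, do not apply yet
        dW1 += eta * np.outer(delta_h, x)
    return W1 + dW1, W2 + dW2                          # apply the summed corrections once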
Online vs. Offline
• In offline learning (batch learning), one correction step optimizes the total error function (over all data).
• The descent follows the true gradient direction of the total error function.
• In online learning, the weights are adapted immediately after the presentation of each training example.
• The direction of adjustment therefore does not, in general, agree with the gradient direction.
• If the examples are selected in a random order, the gradient is followed on average.
• The online version is necessary if not all pairs (x, t) are known at the beginning of learning (adaptation to new data, adaptive systems), or if the offline version is too expensive.
Problems of Backpropagation: Symmetry Breaking
In fully connected, layered feed-forward networks, the weights must not all be initialized with the same value. Otherwise, backpropagation will always assign the same values to the weights between two layers.
(Figure: example network with input neurons 1-3, hidden neurons 4-6 and output neurons 7-8)
Initialization: wij = a for all i, j
After the forward phase:
o4 = o5 = o6 and δ4 = δ5 = δ6, hence
w14 = w15 = w16, w24 = w25 = w26,
w34 = w35 = w36, w47 = w57 = w67,
w48 = w58 = w68
This situation recurs after every forward phase. Such an initialization therefore creates a symmetry that can no longer be broken!
Solution: small, random values for the initial weights.
The net input neti is then almost zero for all neurons i, so s'(neti) is large and the network adapts quickly.
Problems of Backpropagation: Local Minima
As with all gradient methods, backpropagation may get stuck in a local minimum of the error surface:
(Figure: error E over a weight w with several local minima at w0, w1, w2, w3)
There is no guarantee that a global minimum (optimal weights) will be found.
With a growing number of connections (a high-dimensional weight space) the error surface becomes increasingly jagged, and landing in a local minimum becomes likely!
Way out:
• Do not choose the learning rate too small
• Try several different initializations of the weights
Experience shows that one of the minima found is usually an acceptable solution for the concrete application.
Problems of Backpropagation: Leaving Good Minima
Leaving good minima:
• The size of the weight change depends on the magnitude of the gradient.
• If a good minimum lies in a steep valley, the magnitude of the gradient can be so large that the good minimum is skipped and a worse minimum in the vicinity is reached instead:
(Figure: error E over a weight w, showing a narrow deep valley being jumped over)
Way out:
• Do not choose the learning rate too large
• Try several different initializations of the weights
Experience shows that one of the minima found is usually an acceptable solution for the concrete application.
Problems of Backpropagation: Flat Plateaus
Flat plateaus:
• On a very flat part of the error surface the gradient is small, and the weights change only marginally.
• Many iteration steps are needed (long training time).
• In extreme cases the weights practically do not change at all!
(Figure: error E over a weight w with an extended flat plateau)
Problems of Backpropagation: Oscillation
Oscillation:
• In steep ravines of the error surface the procedure can oscillate.
• At the edges of a steep ravine the weight change throws the weights from one side to the other, because the gradient has the same magnitude but the opposite sign on both sides:
(Figure: error E over a weight w with a steep ravine and weights jumping between its walls)
Modifications of Backpropagation
• There are many modifications that remedy the problems addressed. All are based on heuristics: in many cases they considerably accelerate convergence.
• But there are also cases where the assumptions behind the heuristics do not hold, and the result is worse than with traditional backpropagation.
• Some popular modifications (a momentum sketch follows below):
Momentum term (also conjugate gradient descent): addresses the problems of flat plateaus and steep ravines. Idea: effectively increase the learning rate on flat plateaus and reduce it in steep valleys.
Weight decay: large weights are neurobiologically implausible and cause steep, rugged error surfaces. The error function is modified so that the weights are minimized at the same time (weight decay).
Quickprop heuristic: a valley of the error surface (around a local minimum) can be approximately described by an upward-open parabola. Idea: jump in one step toward the vertex of the parabola (the expected minimum of the error function).
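A minimal sketch (not from the slides) of the momentum term: the previous weight change is added to the current one, which speeds up progress on flat plateaus and damps oscillation in ravines. The parameter names eta and alpha are illustrative:

import numpy as np

def momentum_step(grad, prev_dw, eta=0.1, alpha=0.9):
    """One weight update with momentum.
    grad:    current gradient dE/dw (numpy array)
    prev_dw: weight change of the previous step
    Returns dw = -eta*grad + alpha*prev_dw."""
    return -eta * grad + alpha * prev_dw

# usage sketch: dw = momentum_step(grad, prev_dw); w += dw; prev_dw = dw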
Summary and Learning Objectives of the 8th Lecture
Know the basic forms of learning in neural networks
Supervised
Unsupervised
Know the idea of learning without a teacher, based on competitive learning
Know the idea of learning by minimizing errors (with a "teacher")
Example: Backpropagation
Know backpropagation
Procedure
Possible problems