
8. Lecture

Neural Networks

Learning Process

Soft Control

(AT 3, RMA)


Contents of the 8th lecture

1. Introduction to Soft Control: definition and limitations, basics of "intelligent" systems

2. Knowledge representation and knowledge processing (symbolic AI)
   Application: Expert systems

3. Fuzzy systems: dealing with fuzzy knowledge
   Application: Fuzzy control

4. Connectionist systems: neural networks
   Application: Identification and neural control
   1. Basics
   2. Learning

5. Genetic algorithms: stochastic optimization
   Application: Optimization

6. Summary & literature references


Contents of the 7th lecture

Learning in neural networks:

• Supervised learning: fixed learning task. Given: inputs E and desired outputs A. Example: backpropagation.

• Unsupervised learning: free learning task. Given: inputs E only. Example: competitive learning.


Unsupervised Learning

Recap: unsupervised learning corresponds to the free learning task, where only the inputs E are given. The example treated here is competitive learning.

Source: Carola Huthmacher


Principle of Competitive Learning for the Clustering Problem

Objectives of the clustering:

• Differences between objects within a cluster are minimal

• Differences between objects of different clusters are maximal

Learning through competition:

• Competition principle

• Objective: each cluster activates exactly one output neuron (binary output)


Architecture of a Competitive Learning Network

[Figure: architecture of a competitive learning network. An input layer with n neurons receives the input vector, e.g. (1 0 1 1) = x ∈ ℝⁿ; the competitive layer with m neurons produces the binary output, e.g. (1 0) = y ∈ 𝔹ᵐ.]


Processes in the Competitive Layer

[Figure: a competitive neuron j with inputs x1, x2, ..., xn and weights wj1, wj2, ..., wjn; the input vector is (x1 x2 ... xn) = x ∈ ℝⁿ.]

• Measure of the displacement (offset) between input and weight vector:
  Sj = Σi wij·xi = |wj|·|x|·cos φ
  Sj is large when the displacement (the angle φ between the vectors) is small.

• Winner: the neuron j with Sj > Sk for all k ≠ j

• Output: y_winner = 1, y_loser = 0 („winner takes all“)
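A minimal sketch of this winner-takes-all step in Python (the function and variable names are illustrative, not from the slides): each competitive neuron computes the similarity Sj between its weight vector and the input, and only the neuron with the largest Sj fires.

```python
import numpy as np

def winner_take_all(W, x):
    """W: (m, n) weight matrix, one row per competitive neuron; x: input of length n.
    Returns the binary output y and the index of the winning neuron."""
    S = W @ x                      # S_j = sum_i w_ij * x_i for every neuron j
    winner = int(np.argmax(S))     # the neuron with the largest similarity wins
    y = np.zeros(len(S))
    y[winner] = 1.0                # "winner takes all": y_winner = 1, all others 0
    return y, winner

# small illustrative call with two (roughly normalized) weight vectors
W = np.array([[0.8, 0.6], [0.6, -0.8]])
print(winner_take_all(W, np.array([1.0, 0.5])))
```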


Unsupervised Learning Algorithms

• Initialization:
  Either random weights (normalized weight vectors)
  or vectors from the training inputs (normalized) as initial weights

• Competitive process (winner determination as above)

• Learning:
  The input is a vector x.
  Recalculate the weights of the winner neuron j:
  wj(t+1) = wj(t) + η(t)·[x − wj(t)]
  η(t) is the learning rate (0.01 to 0.3); it is gradually reduced during learning.
  Afterwards: normalization of wj(t+1).

• Termination: when a termination criterion is fulfilled

[Figure: geometric interpretation on the unit circle: the winner's weight vector wj(t) is shifted by the fraction η(t)·[x − wj(t)] toward the input x, giving wj(t+1).]
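The learning step above can be sketched as follows (a minimal illustration; the function name, the explicit re-normalization to unit length, and the matrix layout are my assumptions, not prescribed by the slides):

```python
import numpy as np

def competitive_step(W, x, eta):
    """One online competitive-learning step.
    W: (m, n) matrix of normalized weight vectors, x: input, eta: learning rate (0.01 to 0.3)."""
    j = int(np.argmax(W @ x))           # competition: determine the winner
    W[j] = W[j] + eta * (x - W[j])      # w_j(t+1) = w_j(t) + eta(t) * [x - w_j(t)]
    W[j] /= np.linalg.norm(W[j])        # normalization of the winner's weight vector
    return W, j
```

In a full training loop, eta would be reduced gradually (for example multiplied by a factor slightly below 1 after each epoch) until the termination criterion is met.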


Advantages and Disadvantages

• Disadvantages:
  A good initialization is difficult to find.
  The procedure can be unstable.
  Problem: choosing the number of neurons in the competitive layer.

• Advantages:
  Good clustering
  Simple and fast algorithm
  Building block for more complex networks


Supervised Learning

Recap: supervised learning corresponds to the fixed learning task, where inputs E and desired outputs A are given. The example treated here is backpropagation.

Source: Dr. Van Bang Le


The Backpropagation Learning Algorithm

History:

• Werbos (1974)

• Rumelhart, Hinton, Williams (1986)

• A very important and well-known supervised learning procedure for feed-forward networks

Idea:

• Minimize the error function by gradient descent

Consequence:

• Backpropagation is a gradient-based procedure.

• Learning here is mathematically motivated, not biologically!


Task and aims of backpropagation learning

• Learning task:
  A set of input/output examples (training set), see the small example below:
  L = {(x1, t1), ..., (xk, tk)}, where
  xi = input example (input pattern)
  ti = desired output (target) for input xi

• Learning objective:
  Each pair (x, t) from L should be computed by the network with as little error as possible.
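As an illustration, such a training set can be written as a list of input/target pairs; the variable names and the choice of the XOR task are mine, not from the lecture.

```python
import numpy as np

# Hypothetical training set L = {(x1, t1), ..., (xk, tk)} for a network
# with 2 inputs and 1 output; here the XOR task serves as a small example.
L = [
    (np.array([0.0, 0.0]), np.array([0.0])),
    (np.array([0.0, 1.0]), np.array([1.0])),
    (np.array([1.0, 0.0]), np.array([1.0])),
    (np.array([1.0, 1.0]), np.array([0.0])),
]
```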


General approach to learning with BP

• Subdivide the available data into
  training data and
  validation data

• Train until the desired error is reached

• Validate on the validation data

• Problem: finding the optimal stopping point for training
  Underfitting
  Overfitting

[Figure: training and validation error over the training iterations; the training error keeps decreasing, while the validation error eventually rises again (overfitting).]


The Backpropagation Learning Algorithm

• Error measurement:
  Let (x, t) ∈ L and let y be the actual output of the network for input x.

• Error for the pair (x, t):
  Ex,t = ½·||t − y||² = ½·Σi (ti − yi)²

• Total error:
  E = Σ(x,t)∈L Ex,t

• Note:
  The factor ½ is not essential (||t − y||² is minimal exactly when ½·||t − y||² is minimal), but it simplifies the later formulas.
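Both error measures translate directly into code. A minimal sketch (the `predict` argument stands for the network's forward pass and is a placeholder of my own, not something defined in the slides):

```python
import numpy as np

def pattern_error(t, y):
    """E_{x,t} = 1/2 * ||t - y||^2 for a single training pair."""
    return 0.5 * np.sum((t - y) ** 2)

def total_error(L, predict):
    """E = sum of E_{x,t} over all (x, t) in L; `predict` maps an input x to the output y."""
    return sum(pattern_error(t, predict(x)) for x, t in L)
```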


The gradient method

1. Consider the error E as a function of the weights.

2. To the weight vector w = (w11, w12, ...) belongs the point (w, E(w)) on the error surface.

3. Since E is differentiable, the gradient of the error surface can be computed at the point w; descending a fraction of the way along the negative gradient yields a new weight vector w′.

4. Repeat the procedure at the point w′ ...

[Figure: error surface E(w) over the weights; one descent step leads from w to w′.]


Gradient

Let f: ℝⁿ → ℝ be a real-valued function.

• Partial derivative of f with respect to xi: ∂f/∂xi

• Gradient of f: ∇f = (∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xn)

• ∇f(x1, ..., xn) points „in the direction of the steepest ascent“ of f at the point (x1, ..., xn).
  Direction of descent: −∇f
  Direction of descent in the xi direction: −∂f/∂xi

• Example: f(x1, x2) = ½·x1² − x2, with ∇f(x1, x2) = (x1, −1)
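A small sketch of gradient descent on exactly this example function (the starting point, the step size 0.1 and the step count are arbitrary choices of mine for illustration):

```python
import numpy as np

def grad_f(x):
    """Gradient of f(x1, x2) = 0.5*x1**2 - x2, i.e. (x1, -1)."""
    return np.array([x[0], -1.0])

x = np.array([2.0, 0.0])      # starting point
eta = 0.1                     # step size (learning rate)
for _ in range(50):
    x = x - eta * grad_f(x)   # move a fraction in the steepest-descent direction -grad f
print(x)                      # x1 is driven toward 0; x2 keeps growing (f has no minimum in x2)
```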


BP for multilayer networks

Notation (assume the network has been evaluated completely for the input x):

• A := {i : i is an output neuron}, the set of output neurons.
  For (x, t) ∈ L, y = (oi)i∈A is then the output for input x.

• Output of neuron i: oi

• Net input of neuron j: netj := Σi: i→j oi·wij

We consider multilayer networks without shortcut connections (pure feed-forward networks with connections only between successive layers).


BP for multilayer networks: Notation: Error function

Error function:
E = Σ(x,t)∈L Ex,t   with   Ex,t = ½·Σj∈A (tj − oj)²

• oj = f(netj), where f is the activation function of the neurons.

• netj = Σi: i→j oi·wij

f is differentiable, so Ex,t and E are also differentiable, and the gradient descent method can be applied!

Offline version: weight change after calculation of the total error E (batch learning)

Online version: weight change already after calculation of the current error Ex,t


Sigmoid as the activation function

Until now the activation function f was the step function, which is not everywhere differentiable.

From now on the sigmoid function s(x) = s1(x) is used as the activation function for all neurons; it is everywhere differentiable.

Function: sc(x) = 1 / (1 + e^(−c·x))

It holds (for c = 1): s′(x) = s(x)·(1 − s(x))

[Figure: the step function compared with two sigmoid curves s1 and s2 of different steepness c.]
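A minimal sketch of the sigmoid and its derivative, with the steepness c as a parameter; for c = 1 the derivative reduces to s(x)·(1 − s(x)) as stated above.

```python
import numpy as np

def sigmoid(x, c=1.0):
    """s_c(x) = 1 / (1 + exp(-c*x))"""
    return 1.0 / (1.0 + np.exp(-c * x))

def sigmoid_prime(x, c=1.0):
    """s_c'(x) = c * s_c(x) * (1 - s_c(x)); for c = 1 this is s(x) * (1 - s(x))."""
    s = sigmoid(x, c)
    return c * s * (1.0 - s)
```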


The Backpropagation Learning Algorithm: Online Version

(1) Initialize the weights wij with random values.

(2) Choose a pair (x, t) ∈ L.

(3) Calculate the output y for input x.

(4) Consider the error Ex,t as a function of the weights:
    Ex,t = ½·||t − y||² = Ex,t(w11, w12, ...)

(5) Change wij by a fraction η (learning rate) in the direction of steepest descent of the error:
    wij := wij + η·(−∂Ex,t/∂wij)

(6) If no termination criterion is fulfilled, repeat from (2).


The Backpropagation Learning Algorithm: Online Version (2)

For a fixed pair i, j, Ex,t is considered as a function of wij
(all other weights are treated as constants in this calculation).

• Ex,t depends on the network output y (i.e. on oj, j ∈ A)

• oj, j ∈ A, depends on the net input netj of neuron j

• netj depends on wkj and ok for all connections k→j

• ...

The derivative −∂Ex,t/∂wij is therefore determined backward through the network: backpropagation.


The Backpropagation Learning Algorithm: Online Version (3)

Calculation of −∂Ex,t/∂wij for the connection i→j with weight wij.

Dependency: Ex,t(wij) depends on netj, and netj depends on wij.

Application of the chain rule:

∂Ex,t/∂wij = (∂Ex,t/∂netj)·(∂netj/∂wij)

∂netj/∂wij = oi

δj := −∂Ex,t/∂netj   („error signal“)

Hence: −∂Ex,t/∂wij = oi·δj


The Backpropagation Learning Algorithm: Online Version (4)

Dependency: Ex,t(netj) depends on oj, and oj depends on netj.

Application of the chain rule:

∂Ex,t/∂netj = (∂Ex,t/∂oj)·(∂oj/∂netj)

• ∂oj/∂netj = f′(netj) = ...

For the sigmoid activation function s this becomes:

... = s′(netj) = s(netj)·(1 − s(netj)) = oj·(1 − oj)


The Backpropagation Learning Algorithm: Online Version (5)

Calculation of ∂Ex,t/∂oj

Case 1: j is an output neuron.

∂Ex,t/∂oj = ∂/∂oj [½·Σk∈A (tk − ok)²] = 2·½·(tj − oj)·(−1) = −(tj − oj)


The Backpropagation Learning Algorithm: Online Version (6)

Calculation of ∂Ex,t/∂oj

Case 2: j is not an output neuron.

Dependency: oj is passed on to all successor neurons k of j, and Ex,t depends on their net inputs netk.

Application of the chain rule:

∂Ex,t/∂oj = Σk: j→k (∂Ex,t/∂netk)·(∂netk/∂oj) = Σk: j→k (−δk)·wjk = −Σk: j→k δk·wjk


The Backpropagation Learning Algorithm: Online Version (7)

Summary:

Error signal:
δj = oj·(1 − oj)·(tj − oj)              if j ∈ A
δj = oj·(1 − oj)·Σk: j→k δk·wjk         otherwise

Descent direction for wij: −∂Ex,t/∂wij = oi·δj

Correction for wij: Δwij = η·oi·δj, i.e. wij := wij + η·oi·δj

To calculate δj, all δk for the connections j→k must already be known: backpropagation.


The Backpropagation Learning Algorithm: Online Version (8)

• Initialize the weights with random values.

• Determine the termination criterion ε for the total error E.

• Determine the maximum number of epochs emax.

E := 0; e := 1
repeat
    E := 0
    for all (x, t) ∈ L do
        compute Ex,t = ½·Σj∈A (tj − oj)²
        E := E + Ex,t
        calculate backward, layer by layer starting with the output layer, the error signals δj
        wij := wij + η·oi·δj
    endfor
    e := e + 1
until (E ≤ ε) or (e > emax)
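A compact Python sketch of this online algorithm for a network with one hidden layer and sigmoid activations. It is a minimal illustration under my own assumptions (layer sizes, explicit bias terms, the XOR data set, and the concrete values of η, ε and emax); it is not the lecture's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Training set L: the XOR task as a small stand-in example (an assumption, not from the slides).
L = [(np.array([0., 0.]), np.array([0.])),
     (np.array([0., 1.]), np.array([1.])),
     (np.array([1., 0.]), np.array([1.])),
     (np.array([1., 1.]), np.array([0.]))]

n_in, n_hid, n_out = 2, 3, 1          # layer sizes (assumed for this sketch)
eta, eps, e_max = 0.5, 0.01, 20000    # learning rate, error bound, maximum epochs

# Initialize the weights (and biases) with small random values.
W1 = rng.uniform(-0.5, 0.5, (n_hid, n_in)); b1 = rng.uniform(-0.5, 0.5, n_hid)
W2 = rng.uniform(-0.5, 0.5, (n_out, n_hid)); b2 = rng.uniform(-0.5, 0.5, n_out)

e = 1
while True:
    E = 0.0
    for x, t in L:
        o_hid = sigmoid(W1 @ x + b1)          # forward pass: o_j = s(net_j)
        o_out = sigmoid(W2 @ o_hid + b2)
        E += 0.5 * np.sum((t - o_out) ** 2)   # add E_{x,t} to the total error
        # error signals delta_j, computed backward starting with the output layer
        delta_out = o_out * (1 - o_out) * (t - o_out)
        delta_hid = o_hid * (1 - o_hid) * (W2.T @ delta_out)
        # online weight correction: w_ij := w_ij + eta * o_i * delta_j
        W2 += eta * np.outer(delta_out, o_hid); b2 += eta * delta_out
        W1 += eta * np.outer(delta_hid, x);     b1 += eta * delta_hid
    e += 1
    if E <= eps or e > e_max:
        break

print("epochs used:", e - 1, "final total error:", round(float(E), 4))
```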


The Backpropagation Learning Algorithm: Offline Version

Offline means that the total error over all input data is minimized.

In this mode the weights are modified only after the presentation of all pairs (x, t) ∈ L:

wij := wij + Δwij(E)   with
Δwij(E) = Σ(x,t)∈L Δwij(Ex,t) = η·Σ(x,t)∈L δj(x)·oi(x)
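The offline (batch) variant changes only the inner loop of the online sketch above: the per-pattern corrections are accumulated over all of L and applied once per epoch. A minimal sketch of one such epoch, reusing the two-layer shapes from that example (again my own assumption, not the lecture's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_epoch(W1, b1, W2, b2, L, eta):
    """One offline (batch) epoch: accumulate the corrections over all pairs, then update once."""
    dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)
    dW2 = np.zeros_like(W2); db2 = np.zeros_like(b2)
    for x, t in L:
        o_hid = sigmoid(W1 @ x + b1)                           # forward pass
        o_out = sigmoid(W2 @ o_hid + b2)
        delta_out = o_out * (1 - o_out) * (t - o_out)          # output error signals
        delta_hid = o_hid * (1 - o_hid) * (W2.T @ delta_out)   # hidden error signals
        dW2 += np.outer(delta_out, o_hid); db2 += delta_out    # sum of delta_j(x) * o_i(x)
        dW1 += np.outer(delta_hid, x);     db1 += delta_hid
    W2 += eta * dW2; b2 += eta * db2   # w_ij := w_ij + eta * sum over L of delta_j(x) * o_i(x)
    W1 += eta * dW1; b1 += eta * db1
    return W1, b1, W2, b2
```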


Online vs. Offline

• In offline learning (batch learning), one correction step optimizes the total error function (over all data).

• The descent follows the true gradient direction of the total error function.

• In online learning the weights are adapted immediately after the presentation of each training pattern.

• The direction of adjustment is in general not in agreement with the gradient direction of the total error.

• If the patterns are presented in random order, the gradient is followed on average.

• The online version is necessary if not all pairs (x, t) are known at the beginning of learning (adaptation to new data, adaptive systems), or if the offline version is too expensive.


Problems of Backpropagation: Symmetry Breaking

In fully connected layered feed-forward networks the weights must not all be initialized with the same value. Otherwise backpropagation will always assign the same values to the weights between two layers.

[Example network: input neurons 1, 2, 3; hidden neurons 4, 5, 6; output neurons 7, 8.]

Init: wij = a for all i, j

After the forward phase:
o4 = o5 = o6  ⇒  δ4 = δ5 = δ6  ⇒
w14 = w15 = w16, w24 = w25 = w26, w34 = w35 = w36,
w47 = w57 = w67, w48 = w58 = w68

This situation recurs after every forward phase. Such an initialization therefore creates a symmetry that can never be broken!

Solution: small random values for the initial weights.
Then the net input neti for all neurons i is almost zero, so s′(neti) is large, and the network adapts quickly.
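A one-line sketch of the proposed remedy (the interval ±0.5 and the layer shapes are arbitrary examples of "small", not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng()
# Small random initial weights break the symmetry; with inputs of moderate size the
# net inputs net_i stay close to 0, where the sigmoid is steepest and learning is fast.
W1 = rng.uniform(-0.5, 0.5, size=(3, 2))   # hidden-layer weights (example shape)
W2 = rng.uniform(-0.5, 0.5, size=(2, 3))   # output-layer weights (example shape)
```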


Problems of Backpropagation: Local minima

As with all gradient methods, backpropagation can get stuck in a local minimum of the error surface:

[Figure: error E over weight w with several minima at w0, w1, w2, w3.]

There is no guarantee that a global minimum (optimal weights) will be found.

With a growing number of connections (the dimension of the weight space becomes large) the error surface becomes more jagged, and landing in a local minimum becomes more likely!

Way out:

• Do not choose the learning rate too small.

• Try several different initializations of the weights.

Experience shows that a minimum found in this way is an acceptable solution for the concrete application.


Problems of Backpropagation: Leaving (skipping over) good minima

• The size of the weight change depends on the magnitude of the gradient.

• If a good minimum lies in a steep valley, the magnitude of the gradient can be so large that the good minimum is skipped and a worse minimum in its vicinity is reached instead:

[Figure: error E over weight w; a large step skips a good (deep) minimum.]

Way out:

• Do not choose the learning rate too large.

• Try several different initializations of the weights.

Experience shows that a minimum found in this way is an acceptable solution for the concrete application.


Problems of Backpropagation: Flat plateaus

• On very flat regions of the error surface the gradient is small and the weights change only marginally.

• Many iteration steps are needed (long training time).

• In the extreme case the weights do not change at all!

[Figure: error E over weight w with an extended flat plateau.]


Problems of Backpropagation: Oscillation

• In steep ravines (gorges) the procedure can oscillate.

• At the walls of a steep ravine the weight change throws the weights from one side to the other, because the gradient there has the same magnitude but opposite sign:

[Figure: error E over weight w; successive steps jump back and forth between the walls of a steep ravine.]


Modifications of Backpropagation

• There are many modifications that try to remedy the problems discussed above. All are based on heuristics: in many cases they considerably accelerate convergence.

• But there are also cases in which the assumptions behind a heuristic do not hold, and the result is then worse than with the classical backpropagation procedure.

• Some popular modifications:

  Momentum term (related: conjugate gradient descent): addresses the problems on flat plateaus and in steep ravines. Idea: effectively increase the learning rate on flat plateaus and reduce it in narrow valleys (see the sketch below).

  Weight decay: large weights are neurobiologically implausible and cause steep, rugged error surfaces. The error function is therefore extended so that the weights are minimized at the same time (weight decay).

  Quickprop: heuristic: a valley of the error surface (around a local minimum) can be approximately described by an upward-open parabola. Idea: jump in one step toward the vertex of the parabola (the expected minimum of the error function).
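A minimal sketch of the momentum-term modification (the momentum factor alpha ≈ 0.9 and the function name are my own common choices, not prescribed by the lecture): the previous weight change is added to the current one, which speeds up movement across flat plateaus and damps oscillation in steep ravines.

```python
import numpy as np

def momentum_update(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """Weight update with momentum term:
    Delta w(t) = -eta * grad E(w) + alpha * Delta w(t-1)."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta   # return new weights and the change, to be reused in the next step
```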


Summary and learning objectives of the 8th lecture

To know the basic forms of learning in neural networks:
  Supervised
  Unsupervised

To know the idea of learning without a teacher, based on competitive learning

To know the idea of learning by error minimization (with a "teacher"), example: backpropagation

To know backpropagation:
  Procedure
  Possible problems

Recommended