
8. Lecture

Neural Networks

Learning Process

Soft Control

(AT 3, RMA)


Contents of the 8th lecture

1. Introduction to Soft Control: definition and limitations, basics of "intelligent" systems

2. Knowledge representation and knowledge processing (symbolic AI)
   Application: Expert systems

3. Fuzzy systems: dealing with fuzzy knowledge
   Application: Fuzzy control

4. Connectionist systems: neural networks
   Application: Identification and neural control
   1. Basics
   2. Learning

5. Genetic algorithms: stochastic optimization
   Application: Optimization

6. Summary & literature references


Contents of the 7th lecture

Learning in neural networks:

• Supervised learning: fixed learning task. Given: inputs E and desired outputs A. Example: backpropagation.

• Unsupervised learning: free learning task. Given: inputs E only. Example: competitive learning.


Unsupervised Learning

Recap: unsupervised learning corresponds to the free learning task, where only the inputs E are given. The example treated here is competitive learning.

Source: Carola Huthmacher


Principle of Competitive Learning for the Clustering Problem

Objectives of the clustering:

• Differences between objects within a cluster are minimal

• Differences between objects of different clusters are maximal

Learning through competition:

• Competition principle

• Objective: each cluster activates exactly one output neuron (binary output)


Architecture of a Competitive Learning Network

[Figure: architecture of a competitive learning network. An input layer with n neurons receives the input vector, e.g. (1 0 1 1) = x ∈ ℝⁿ; the competitive layer with m neurons produces the binary output, e.g. (1 0) = y ∈ 𝔹ᵐ.]


Processes in the Competitive Layer

[Figure: a competitive neuron j with inputs x1, x2, ..., xn and weights wj1, wj2, ..., wjn; the input vector is (x1 x2 ... xn) = x ∈ ℝⁿ.]

• Measure of the displacement (offset) between input and weight vector:
  Sj = Σi wij·xi = |wj|·|x|·cos φ
  Sj is large when the displacement (the angle φ between the vectors) is small.

• Winner: the neuron j with Sj > Sk for all k ≠ j

• Output: y_winner = 1, y_loser = 0 („winner takes all“)
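A minimal sketch of this winner-takes-all step in Python (the function and variable names are illustrative, not from the slides): each competitive neuron computes the similarity Sj between its weight vector and the input, and only the neuron with the largest Sj fires.

```python
import numpy as np

def winner_take_all(W, x):
    """W: (m, n) weight matrix, one row per competitive neuron; x: input of length n.
    Returns the binary output y and the index of the winning neuron."""
    S = W @ x                      # S_j = sum_i w_ij * x_i for every neuron j
    winner = int(np.argmax(S))     # the neuron with the largest similarity wins
    y = np.zeros(len(S))
    y[winner] = 1.0                # "winner takes all": y_winner = 1, all others 0
    return y, winner

# small illustrative call with two (roughly normalized) weight vectors
W = np.array([[0.8, 0.6], [0.6, -0.8]])
print(winner_take_all(W, np.array([1.0, 0.5])))
```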


Unsupervised Learning Algorithms

• Initialization:
  Either random weights (normalized weight vectors)
  or vectors from the training inputs (normalized) as initial weights

• Competitive process (winner determination as above)

• Learning:
  The input is a vector x.
  Recalculate the weights of the winner neuron j:
  wj(t+1) = wj(t) + η(t)·[x − wj(t)]
  η(t) is the learning rate (0.01 to 0.3); it is gradually reduced during learning.
  Afterwards: normalization of wj(t+1).

• Termination: when a termination criterion is fulfilled

[Figure: geometric interpretation on the unit circle: the winner's weight vector wj(t) is shifted by the fraction η(t)·[x − wj(t)] toward the input x, giving wj(t+1).]
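The learning step above can be sketched as follows (a minimal illustration; the function name, the explicit re-normalization to unit length, and the matrix layout are my assumptions, not prescribed by the slides):

```python
import numpy as np

def competitive_step(W, x, eta):
    """One online competitive-learning step.
    W: (m, n) matrix of normalized weight vectors, x: input, eta: learning rate (0.01 to 0.3)."""
    j = int(np.argmax(W @ x))           # competition: determine the winner
    W[j] = W[j] + eta * (x - W[j])      # w_j(t+1) = w_j(t) + eta(t) * [x - w_j(t)]
    W[j] /= np.linalg.norm(W[j])        # normalization of the winner's weight vector
    return W, j
```

In a full training loop, eta would be reduced gradually (for example multiplied by a factor slightly below 1 after each epoch) until the termination criterion is met.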


Advantages and Disadvantages

• Disadvantages:
  A good initialization is difficult to find.
  The procedure can be unstable.
  Problem: choosing the number of neurons in the competitive layer.

• Advantages:
  Good clustering
  Simple and fast algorithm
  Building block for more complex networks


Supervised Learning

Recap: supervised learning corresponds to the fixed learning task, where inputs E and desired outputs A are given. The example treated here is backpropagation.

Source: Dr. Van Bang Le


The Backpropagation Learning Algorithm

History:

• Werbos (1974)

• Rumelhart, Hinton, Williams (1986)

• A very important and well-known supervised learning procedure for feed-forward networks

Idea:

• Minimize the error function by gradient descent

Consequence:

• Backpropagation is a gradient-based procedure.

• Learning here is mathematically motivated, not biologically!


Task and aims of backpropagation learning

• Learning task:
  A set of input/output examples (training set), see the small example below:
  L = {(x1, t1), ..., (xk, tk)}, where
  xi = input example (input pattern)
  ti = desired output (target) for input xi

• Learning objective:
  Each pair (x, t) from L should be computed by the network with as little error as possible.
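As an illustration, such a training set can be written as a list of input/target pairs; the variable names and the choice of the XOR task are mine, not from the lecture.

```python
import numpy as np

# Hypothetical training set L = {(x1, t1), ..., (xk, tk)} for a network
# with 2 inputs and 1 output; here the XOR task serves as a small example.
L = [
    (np.array([0.0, 0.0]), np.array([0.0])),
    (np.array([0.0, 1.0]), np.array([1.0])),
    (np.array([1.0, 0.0]), np.array([1.0])),
    (np.array([1.0, 1.0]), np.array([0.0])),
]
```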


General approach to learning with BP

• Subdivide the available data into
  training data and
  validation data

• Train until the desired error is reached

• Validate on the validation data

• Problem: finding the optimal stopping point for training
  Underfitting
  Overfitting

[Figure: training and validation error over the training iterations; the training error keeps decreasing, while the validation error eventually rises again (overfitting).]


The Backpropagation Learning Algorithm

• Error measurement:
  Let (x, t) ∈ L and let y be the actual output of the network for input x.

• Error for the pair (x, t):
  Ex,t = ½·||t − y||² = ½·Σi (ti − yi)²

• Total error:
  E = Σ(x,t)∈L Ex,t

• Note:
  The factor ½ is not essential (||t − y||² is minimal exactly when ½·||t − y||² is minimal), but it simplifies the later formulas.
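Both error measures translate directly into code. A minimal sketch (the `predict` argument stands for the network's forward pass and is a placeholder of my own, not something defined in the slides):

```python
import numpy as np

def pattern_error(t, y):
    """E_{x,t} = 1/2 * ||t - y||^2 for a single training pair."""
    return 0.5 * np.sum((t - y) ** 2)

def total_error(L, predict):
    """E = sum of E_{x,t} over all (x, t) in L; `predict` maps an input x to the output y."""
    return sum(pattern_error(t, predict(x)) for x, t in L)
```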


The gradient method

1. Consider the error E as a function of the weights.

2. To the weight vector w = (w11, w12, ...) belongs the point (w, E(w)) on the error surface.

3. Since E is differentiable, the gradient of the error surface can be computed at the point w; descending a fraction of the way along the negative gradient yields a new weight vector w′.

4. Repeat the procedure at the point w′ ...

[Figure: error surface E(w) over the weights; one descent step leads from w to w′.]


Gradient

Let f: ℝⁿ → ℝ be a real-valued function.

• Partial derivative of f with respect to xi: ∂f/∂xi

• Gradient of f: ∇f = (∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xn)

• ∇f(x1, ..., xn) points „in the direction of the steepest ascent“ of f at the point (x1, ..., xn).
  Direction of descent: −∇f
  Direction of descent in the xi direction: −∂f/∂xi

• Example: f(x1, x2) = ½·x1² − x2, with ∇f(x1, x2) = (x1, −1)
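A small sketch of gradient descent on exactly this example function (the starting point, the step size 0.1 and the step count are arbitrary choices of mine for illustration):

```python
import numpy as np

def grad_f(x):
    """Gradient of f(x1, x2) = 0.5*x1**2 - x2, i.e. (x1, -1)."""
    return np.array([x[0], -1.0])

x = np.array([2.0, 0.0])      # starting point
eta = 0.1                     # step size (learning rate)
for _ in range(50):
    x = x - eta * grad_f(x)   # move a fraction in the steepest-descent direction -grad f
print(x)                      # x1 is driven toward 0; x2 keeps growing (f has no minimum in x2)
```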


BP for multilayer networks

Notation (assume the network has been evaluated completely for the input x):

• A := {i : i is an output neuron}, the set of output neurons.
  For (x, t) ∈ L, y = (oi)i∈A is then the output for input x.

• Output of neuron i: oi

• Net input of neuron j: netj := Σi: i→j oi·wij

We consider multilayer networks without shortcut connections (pure feed-forward networks with connections only between successive layers).


BP for multilayer networks: Notation: Error function

Error function:
E = Σ(x,t)∈L Ex,t   with   Ex,t = ½·Σj∈A (tj − oj)²

• oj = f(netj), where f is the activation function of the neurons.

• netj = Σi: i→j oi·wij

f is differentiable, so Ex,t and E are also differentiable, and the gradient descent method can be applied!

Offline version: weight change after calculation of the total error E (batch learning)

Online version: weight change already after calculation of the current error Ex,t


Sigmoid as the activation function

Until now the activation function f was the step function, which is not everywhere differentiable.

From now on the sigmoid function s(x) = s1(x) is used as the activation function for all neurons; it is everywhere differentiable.

Function: sc(x) = 1 / (1 + e^(−c·x))

It holds (for c = 1): s′(x) = s(x)·(1 − s(x))

[Figure: the step function compared with two sigmoid curves s1 and s2 of different steepness c.]
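A minimal sketch of the sigmoid and its derivative, with the steepness c as a parameter; for c = 1 the derivative reduces to s(x)·(1 − s(x)) as stated above.

```python
import numpy as np

def sigmoid(x, c=1.0):
    """s_c(x) = 1 / (1 + exp(-c*x))"""
    return 1.0 / (1.0 + np.exp(-c * x))

def sigmoid_prime(x, c=1.0):
    """s_c'(x) = c * s_c(x) * (1 - s_c(x)); for c = 1 this is s(x) * (1 - s(x))."""
    s = sigmoid(x, c)
    return c * s * (1.0 - s)
```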


The Backpropagation Learning Algorithm: Online Version

(1) Initialize the weights wij with random values.

(2) Choose a pair (x, t) ∈ L.

(3) Calculate the output y for input x.

(4) Consider the error Ex,t as a function of the weights:
    Ex,t = ½·||t − y||² = Ex,t(w11, w12, ...)

(5) Change wij by a fraction η (learning rate) in the direction of steepest descent of the error:
    wij := wij + η·(−∂Ex,t/∂wij)

(6) If no termination criterion is fulfilled, repeat from (2).


The Backpropagation Learning Algorithm: Online Version (2)

For a fixed pair i, j, Ex,t is considered as a function of wij
(all other weights are treated as constants in this calculation).

• Ex,t depends on the network output y (i.e. on oj, j ∈ A)

• oj, j ∈ A, depends on the net input netj of neuron j

• netj depends on wkj and ok for all connections k→j

• ...

The derivative −∂Ex,t/∂wij is therefore determined backward through the network: backpropagation.


The Backpropagation Learning Algorithm: Online Version (3)

Calculation of −∂Ex,t/∂wij for the connection i→j with weight wij.

Dependency: Ex,t(wij) depends on netj, and netj depends on wij.

Application of the chain rule:

∂Ex,t/∂wij = (∂Ex,t/∂netj)·(∂netj/∂wij)

∂netj/∂wij = oi

δj := −∂Ex,t/∂netj   („error signal“)

Hence: −∂Ex,t/∂wij = oi·δj


The Backpropagation Learning Algorithm: Online Version (4)

Dependency: Ex,t(netj) depends on oj, and oj depends on netj.

Application of the chain rule:

∂Ex,t/∂netj = (∂Ex,t/∂oj)·(∂oj/∂netj)

• ∂oj/∂netj = f′(netj) = ...

For the sigmoid activation function s this becomes:

... = s′(netj) = s(netj)·(1 − s(netj)) = oj·(1 − oj)


The Backpropagation Learning Algorithm: Online Version (5)

Calculation of ∂Ex,t/∂oj

Case 1: j is an output neuron.

∂Ex,t/∂oj = ∂/∂oj [½·Σk∈A (tk − ok)²] = 2·½·(tj − oj)·(−1) = −(tj − oj)


The Backpropagation Learning Algorithm: Online Version (6)

Calculation of ∂Ex,t/∂oj

Case 2: j is not an output neuron.

Dependency: oj is passed on to all successor neurons k of j, and Ex,t depends on their net inputs netk.

Application of the chain rule:

∂Ex,t/∂oj = Σk: j→k (∂Ex,t/∂netk)·(∂netk/∂oj) = Σk: j→k (−δk)·wjk = −Σk: j→k δk·wjk


The Backpropagation Learning Algorithm: Online Version (7)

Summary:

Error signal:
δj = oj·(1 − oj)·(tj − oj)              if j ∈ A
δj = oj·(1 − oj)·Σk: j→k δk·wjk         otherwise

Descent direction for wij: −∂Ex,t/∂wij = oi·δj

Correction for wij: Δwij = η·oi·δj, i.e. wij := wij + η·oi·δj

To calculate δj, all δk for the connections j→k must already be known: backpropagation.


The Backpropagation Learning Algorithm: Online Version (8)

• Initialize the weights with random values.

• Determine the termination criterion ε for the total error E.

• Determine the maximum number of epochs emax.

E := 0; e := 1
repeat
    E := 0
    for all (x, t) ∈ L do
        compute Ex,t = ½·Σj∈A (tj − oj)²
        E := E + Ex,t
        calculate backward, layer by layer starting with the output layer, the error signals δj
        wij := wij + η·oi·δj
    endfor
    e := e + 1
until (E ≤ ε) or (e > emax)
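A compact Python sketch of this online algorithm for a network with one hidden layer and sigmoid activations. It is a minimal illustration under my own assumptions (layer sizes, explicit bias terms, the XOR data set, and the concrete values of η, ε and emax); it is not the lecture's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Training set L: the XOR task as a small stand-in example (an assumption, not from the slides).
L = [(np.array([0., 0.]), np.array([0.])),
     (np.array([0., 1.]), np.array([1.])),
     (np.array([1., 0.]), np.array([1.])),
     (np.array([1., 1.]), np.array([0.]))]

n_in, n_hid, n_out = 2, 3, 1          # layer sizes (assumed for this sketch)
eta, eps, e_max = 0.5, 0.01, 20000    # learning rate, error bound, maximum epochs

# Initialize the weights (and biases) with small random values.
W1 = rng.uniform(-0.5, 0.5, (n_hid, n_in)); b1 = rng.uniform(-0.5, 0.5, n_hid)
W2 = rng.uniform(-0.5, 0.5, (n_out, n_hid)); b2 = rng.uniform(-0.5, 0.5, n_out)

e = 1
while True:
    E = 0.0
    for x, t in L:
        o_hid = sigmoid(W1 @ x + b1)          # forward pass: o_j = s(net_j)
        o_out = sigmoid(W2 @ o_hid + b2)
        E += 0.5 * np.sum((t - o_out) ** 2)   # add E_{x,t} to the total error
        # error signals delta_j, computed backward starting with the output layer
        delta_out = o_out * (1 - o_out) * (t - o_out)
        delta_hid = o_hid * (1 - o_hid) * (W2.T @ delta_out)
        # online weight correction: w_ij := w_ij + eta * o_i * delta_j
        W2 += eta * np.outer(delta_out, o_hid); b2 += eta * delta_out
        W1 += eta * np.outer(delta_hid, x);     b1 += eta * delta_hid
    e += 1
    if E <= eps or e > e_max:
        break

print("epochs used:", e - 1, "final total error:", round(float(E), 4))
```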


The Backpropagation Learning Algorithm: Offline Version

Offline means that the total error over all input data is minimized.

In this mode the weights are modified only after the presentation of all pairs (x, t) ∈ L:

wij := wij + Δwij(E)   with
Δwij(E) = Σ(x,t)∈L Δwij(Ex,t) = η·Σ(x,t)∈L δj(x)·oi(x)
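The offline (batch) variant changes only the inner loop of the online sketch above: the per-pattern corrections are accumulated over all of L and applied once per epoch. A minimal sketch of one such epoch, reusing the two-layer shapes from that example (again my own assumption, not the lecture's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_epoch(W1, b1, W2, b2, L, eta):
    """One offline (batch) epoch: accumulate the corrections over all pairs, then update once."""
    dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)
    dW2 = np.zeros_like(W2); db2 = np.zeros_like(b2)
    for x, t in L:
        o_hid = sigmoid(W1 @ x + b1)                           # forward pass
        o_out = sigmoid(W2 @ o_hid + b2)
        delta_out = o_out * (1 - o_out) * (t - o_out)          # output error signals
        delta_hid = o_hid * (1 - o_hid) * (W2.T @ delta_out)   # hidden error signals
        dW2 += np.outer(delta_out, o_hid); db2 += delta_out    # sum of delta_j(x) * o_i(x)
        dW1 += np.outer(delta_hid, x);     db1 += delta_hid
    W2 += eta * dW2; b2 += eta * db2   # w_ij := w_ij + eta * sum over L of delta_j(x) * o_i(x)
    W1 += eta * dW1; b1 += eta * db1
    return W1, b1, W2, b2
```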


Online vs. Offline

• In offline learning (batch learning), one correction step optimizes the total error function (over all data).

• The descent follows the true gradient direction of the total error function.

• In online learning the weights are adapted immediately after the presentation of each training pattern.

• The direction of adjustment is in general not in agreement with the gradient direction of the total error.

• If the patterns are presented in random order, the gradient is followed on average.

• The online version is necessary if not all pairs (x, t) are known at the beginning of learning (adaptation to new data, adaptive systems), or if the offline version is too expensive.


Problems of Backpropagation: Symmetry Breaking

In fully connected layered feed-forward networks the weights must not all be initialized with the same value. Otherwise backpropagation will always assign the same values to the weights between two layers.

[Example network: input neurons 1, 2, 3; hidden neurons 4, 5, 6; output neurons 7, 8.]

Init: wij = a for all i, j

After the forward phase:
o4 = o5 = o6  ⇒  δ4 = δ5 = δ6  ⇒
w14 = w15 = w16, w24 = w25 = w26, w34 = w35 = w36,
w47 = w57 = w67, w48 = w58 = w68

This situation recurs after every forward phase. Such an initialization therefore creates a symmetry that can never be broken!

Solution: small random values for the initial weights.
Then the net input neti for all neurons i is almost zero, so s′(neti) is large, and the network adapts quickly.
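A one-line sketch of the proposed remedy (the interval ±0.5 and the layer shapes are arbitrary examples of "small", not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng()
# Small random initial weights break the symmetry; with inputs of moderate size the
# net inputs net_i stay close to 0, where the sigmoid is steepest and learning is fast.
W1 = rng.uniform(-0.5, 0.5, size=(3, 2))   # hidden-layer weights (example shape)
W2 = rng.uniform(-0.5, 0.5, size=(2, 3))   # output-layer weights (example shape)
```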


Problems of Backpropagation: Local minima

As with all gradient methods, backpropagation can get stuck in a local minimum of the error surface:

[Figure: error E over weight w with several minima at w0, w1, w2, w3.]

There is no guarantee that a global minimum (optimal weights) will be found.

With a growing number of connections (the dimension of the weight space becomes large) the error surface becomes more jagged, and landing in a local minimum becomes more likely!

Way out:

• Do not choose the learning rate too small.

• Try several different initializations of the weights.

Experience shows that a minimum found in this way is an acceptable solution for the concrete application.


Problems of Backpropagation: Leaving (skipping over) good minima

• The size of the weight change depends on the magnitude of the gradient.

• If a good minimum lies in a steep valley, the magnitude of the gradient can be so large that the good minimum is skipped and a worse minimum in its vicinity is reached instead:

[Figure: error E over weight w; a large step skips a good (deep) minimum.]

Way out:

• Do not choose the learning rate too large.

• Try several different initializations of the weights.

Experience shows that a minimum found in this way is an acceptable solution for the concrete application.


Problems of Backpropagation: Flat plateaus

• On very flat regions of the error surface the gradient is small and the weights change only marginally.

• Many iteration steps are needed (long training time).

• In the extreme case the weights do not change at all!

[Figure: error E over weight w with an extended flat plateau.]


Problems of Backpropagation: Oscillation

• In steep ravines (gorges) the procedure can oscillate.

• At the walls of a steep ravine the weight change throws the weights from one side to the other, because the gradient there has the same magnitude but opposite sign:

[Figure: error E over weight w; successive steps jump back and forth between the walls of a steep ravine.]


Modifications of Backpropagation

• There are many modifications that try to remedy the problems discussed above. All are based on heuristics: in many cases they considerably accelerate convergence.

• But there are also cases in which the assumptions behind a heuristic do not hold, and the result is then worse than with the classical backpropagation procedure.

• Some popular modifications:

  Momentum term (related: conjugate gradient descent): addresses the problems on flat plateaus and in steep ravines. Idea: effectively increase the learning rate on flat plateaus and reduce it in narrow valleys (see the sketch below).

  Weight decay: large weights are neurobiologically implausible and cause steep, rugged error surfaces. The error function is therefore extended so that the weights are minimized at the same time (weight decay).

  Quickprop: heuristic: a valley of the error surface (around a local minimum) can be approximately described by an upward-open parabola. Idea: jump in one step toward the vertex of the parabola (the expected minimum of the error function).
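A minimal sketch of the momentum-term modification (the momentum factor alpha ≈ 0.9 and the function name are my own common choices, not prescribed by the lecture): the previous weight change is added to the current one, which speeds up movement across flat plateaus and damps oscillation in steep ravines.

```python
import numpy as np

def momentum_update(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """Weight update with momentum term:
    Delta w(t) = -eta * grad E(w) + alpha * Delta w(t-1)."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta   # return new weights and the change, to be reused in the next step
```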


Summary and learning objectives of the 8th lecture

To know the basic forms of learning in neural networks:
  Supervised
  Unsupervised

To know the idea of learning without a teacher, based on competitive learning

To know the idea of learning by error minimization (with a "teacher"), example: backpropagation

To know backpropagation:
  Procedure
  Possible problems

Recommended