Page 1

Neural Networks and Machine Learning Applications
CSC 563

Prof. Mohamed Batouche

Computer Science Department, CCIS – King Saud University

Riyadh, Saudi Arabia

[email protected]

Page 2

Artificial Complex Systems

Artificial Neural Networks: Perceptrons and Multilayer Perceptrons (MLP)

Page 3

Artificial Neural Networks

Perceptron

Page 4

The Perceptron

[Figure: a perceptron with inputs x1, …, x5, weights w0, w1, …, w5, a summation unit Σ, and output y]

y = sgn( Σ i=0..n  wi xi ), with x0 = 1

Initialisation: wi ← 0 for i = 0, …, n

The first model of a biological neuron

Page 5

Artificial Neuron: Perceptron

It’s a step function based on a linear combination of real-valued inputs. If the combination is above a threshold, it outputs 1; otherwise it outputs –1.

[Figure: a perceptron with inputs x0 = 1, x1, x2, …, xn, weights w0, w1, w2, …, wn, a summation unit Σ, and an output in {1 or –1}]

Page 6

Perceptron: activation rule

O(x1, x2, …, xn) = 1 if w0 + w1 x1 + w2 x2 + … + wn xn > 0
                 = –1 otherwise

To simplify, we can represent the function as follows:

O(X) = sgn(WᵀX), where

sgn(y) = 1 if y > 0
       = –1 otherwise

Activation rule: linear threshold (step) unit
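For illustration (not part of the original slides), this linear-threshold activation can be written directly in Matlab; the weights and input below are arbitrary placeholder values, with w(1) playing the role of the bias weight w0 applied to the constant input x0 = 1.

w = [-0.5 0.8 0.3];          % [w0 w1 w2], arbitrary example weights
x = [1 0.6 0.2];             % [x0 x1 x2], with x0 = 1
net_input = w * x';          % w0 + w1*x1 + w2*x2
if net_input > 0
    o = 1;                   % output 1 above the threshold
else
    o = -1;                  % output -1 otherwise
end
disp(o)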

Page 7

What does a perceptron do?

For a perceptron with two input variables, x1 and x2, the equation WᵀX = 0 determines a line separating positive from negative examples.

[Figure: the line w1 x1 + w2 x2 + w0 = 0 in the (x1, x2) plane, and the corresponding perceptron with inputs x1, x2, weights w0, w1, w2, summation unit Σ, and output y]

y = sgn(w1 x1 + w2 x2 + w0)

Page 8

What does a perceptron do?

For a perceptron with n input variables, it draws a hyperplane as the decision boundary over the (n-dimensional) input space. It classifies input patterns into two classes.

The perceptron outputs 1 for instances lying on one side of the hyperplane and outputs –1 for instances on the other side.

[Figure: the plane w1 x1 + w2 x2 + w3 x3 + w0 = 0 separating points in the (x1, x2, x3) space]

Page 9

What can be represented using Perceptrons?

[Figure: decision lines for the AND and OR functions over two inputs]

Representation theorem: perceptrons can only represent linearly separable functions. Examples: AND, OR, NOT.

Page 10

Limits of the Perceptron

A perceptron can learn only examples that are called “linearly separable”. These are examples that can be perfectly separated by a hyperplane.

[Figure: a linearly separable set of + and – examples (separable by a straight line) and a non-linearly separable set]

Page 11

Functions for Perceptron

Perceptrons can learn many boolean functions: AND, OR, NAND, NOR, but not XOR.

AND:

[Figure: perceptron computing AND, with inputs x0 = 1, x1, x2 and weights w0 = –0.8, w1 = 0.5, w2 = 0.5]
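As a quick check of these weights: for input (1, 1), w0 + w1 x1 + w2 x2 = –0.8 + 0.5 + 0.5 = 0.2 > 0, so the output is 1; for (1, 0) or (0, 1) the sum is –0.8 + 0.5 = –0.3 < 0, and for (0, 0) it is –0.8 < 0, so the output is –1 in those three cases, matching the AND truth table.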

Page 12

Learning Perceptrons

• Learning is a process by which the free parameters of a neural network are adapted through a process of stimulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place.

• In the case of perceptrons, we use supervised learning.

• Learning a perceptron means finding the right values for W that satisfy the input examples {(inputi, targeti)*}.

• The hypothesis space of a perceptron is the space of all weight vectors.

Page 13

Learning Perceptrons

Principle of learning using the perceptron rule:

1. A set of training examples is given: {(x, t)*}, where x is the input and t the target output [supervised learning].

2. Examples are presented to the network.

3. For each example, the network gives an output o.

4. If there is an error, the hyperplane is moved in order to correct the output error.

5. When all training examples are correctly classified, stop learning.

Page 14

Learning Perceptrons

More formally, the algorithm for learning perceptrons is as follows:

1. Assign random values to the weight vector.

2. Apply the perceptron rule to every training example.

3. Are all training examples correctly classified?
   Yes: quit.  No: go back to Step 2.

Page 15

Perceptron Training Rule

The perceptron training rule:

For a new training example [X = (x1, x2, …, xn), t], update each weight according to this rule:

wi = wi + Δwi

where Δwi = η (t – o) xi

t: target output
o: output generated by the perceptron
η: constant called the learning rate (e.g., 0.1)
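A minimal Matlab sketch of this rule (an illustration, not the toolbox code used later in these slides), assuming targets in {–1, 1}, a constant bias input of 1, and the AND examples as training data:

% Perceptron training rule: w_i <- w_i + eta*(t - o)*x_i
X   = [1 0 0; 1 0 1; 1 1 0; 1 1 1];    % one example per row, leading 1 = bias input
T   = [-1; -1; -1; 1];                  % AND targets in {-1, 1}
w   = 0.1 * randn(3, 1);                % small random initial weights
eta = 0.1;                              % learning rate
for epoch = 1:100
    errors = 0;
    for k = 1:size(X, 1)
        o = sign(X(k, :) * w);
        if o == 0, o = -1; end          % treat the boundary case as -1
        if o ~= T(k)
            w = w + eta * (T(k) - o) * X(k, :)';   % perceptron rule
            errors = errors + 1;
        end
    end
    if errors == 0, break; end          % all examples correctly classified: stop
end
disp(w')

The loop stops as soon as a full pass over the data produces no misclassification, mirroring the stopping condition of the algorithm above.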

Page 16

Perceptron Training Rule

Comments about the perceptron training rule:

• If the example is correctly classified, the term (t – o) equals zero, and no update on the weight is necessary.

• If the perceptron outputs –1 and the real answer is 1, the weight is increased.

• If the perceptron outputs 1 and the real answer is –1, the weight is decreased.

• Provided the examples are linearly separable and a small value for η is used, the rule is proved to classify all training examples correctly.

Page 17

Perceptron Training Rule

Consider the following example (two classes: red and green):

Page 18

Perceptron Training Rule

Random initialization of perceptron weights …

Page 19

Perceptron Training Rule

Apply the perceptron training rule iteratively to the different examples:

Page 20

Perceptron Training Rule

Apply the perceptron training rule iteratively to the different examples:

Page 21

Perceptron Training Rule

Apply the perceptron training rule iteratively to the different examples:

Page 22

Perceptron Training Rule

Apply the perceptron training rule iteratively to the different examples:

Page 23

Perceptron Training Rule

All examples are correctly classified … stop learning.

Page 24

Perceptron Training Rule

The straight line w1 x + w2 y + w0 = 0 separates the two classes.

[Figure: the learned line w1 x + w2 y + w0 = 0 drawn between the two classes]

Page 25

Matlab Demo

• Perceptron training rule demo
• Learning the AND, OR functions
• Try to learn XOR with a perceptron

Page 26

Learning AND/OR operations

P = [ 0 0 1 1; ...     % input patterns (one column per example)
      0 1 0 1 ];
T = [ 0 1 1 1 ];       % desired outputs (here: the OR function)

net = newp([0 1; 0 1], 1);      % perceptron with two inputs in [0,1] and one output
net.adaptParam.passes = 35;     % number of passes over the training data
net = adapt(net, P, T);         % adapt the weights with the perceptron rule

x = [1; 1];                     % test input
y = sim(net, x);                % simulate the trained perceptron
display(y);
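Note that the targets shown, T = [0 1 1 1], correspond to the OR function; to learn AND with the same code, T would be [0 0 0 1].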

[Figure: the corresponding perceptron with inputs x1, x2, weights w0, w1, w2, summation unit Σ, and output y]

Page 27

Artificial Neural Networks

MultiLayer Perceptron (MLP)

Page 28

Solution for XOR: add a hidden layer!

[Figure: a multilayer network with input nodes x1, x2, internal (hidden) nodes, and an output node computing x1 XOR x2]

Page 29

Solution for XOR: add a hidden layer!

[Figure: the same network with inputs x1, x2, hidden nodes, and output x1 XOR x2]

The problem is: how to learn multilayer perceptrons?

Solution: the Backpropagation algorithm, invented by Rumelhart and colleagues in 1986.

Page 30

MultiLayer Perceptron

In contrast to perceptrons, multilayer networks can learn not only multiple decision boundaries, but the boundaries may be nonlinear.

[Figure: a network with input nodes, internal (hidden) nodes, and output nodes]

Page 31

MultiLayer Perceptron: Decision Boundaries

[Figure: examples of decision regions separating classes A and B for networks of increasing depth]

• Single-layer: half plane bounded by a hyperplane
• Two-layer: convex open or closed regions
• Three-layer: arbitrary regions (complexity limited by the number of neurons)

Page 32

Example

[Figure: example decision regions over inputs x1 and x2]

Page 33

One single unit

To make nonlinear partitions of the space, we need to define each unit as a nonlinear function (unlike the perceptron). One solution is to use the sigmoid unit.

[Figure: a sigmoid unit with inputs x0 = 1, x1, x2, …, xn, weights w0, w1, w2, …, wn, net input net = Σi wi xi, and output O]

O = σ(net) = 1 / (1 + e^(–net))

Page 34

Sigmoid or logistic function

O(x1, x2, …, xn) = σ(WX)

where: σ(WX) = 1 / (1 + e^(–WX))

Function σ is called the sigmoid or logistic function.

This function is easy to differentiate and has the following property:

dσ(y)/dy = σ(y) (1 – σ(y))
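As a quick numerical check (not from the slides), a central finite difference can be compared with σ(y)(1 – σ(y)) at an arbitrary test point:

% Numerical check that d(sigma)/dy = sigma(y)*(1 - sigma(y))
sigma = @(y) 1 ./ (1 + exp(-y));
y = 0.7;                                              % arbitrary test point
h = 1e-6;
numeric  = (sigma(y + h) - sigma(y - h)) / (2 * h);   % central difference
analytic = sigma(y) * (1 - sigma(y));
fprintf('numeric = %.6f, analytic = %.6f\n', numeric, analytic);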

Page 35

Hyperbolic Tangent activation function: Tanh

O(x1, x2, …, xn) = tanh(WX)

where: tanh(WX) = (e^(WX) – e^(–WX)) / (e^(WX) + e^(–WX))

Function tanh is called the hyperbolic tangent function.

This function is easy to differentiate and has the following property:

d tanh(y)/dy = (1 + tanh(y)) (1 – tanh(y))

Page 36

Learning MultiLayer Perceptron

BackPropagation Algorithm:

Goal: to learn the weights for all links in an interconnected multilayer network.

We begin by defining our measure of error:

E(W) = ½ Σd Σk (tkd – okd)² = ½ Σexamples (t – o)² = ½ Err²

where k varies over the output nodes and d over the training examples.

The idea is to use gradient descent over the space of weights to find a global minimum (no guarantee).

Page 37

Gradient Descent

Page 38

Minimizing Error Using Steepest Descent

The main idea: find the way downhill and take a step.

[Figure: error E plotted against a weight x, with the minimum marked]

downhill direction = –dE/dx

η = step size

x ← x – η dE/dx
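A minimal Matlab sketch of this update, using the arbitrary one-dimensional error function E(x) = (x – 3)², whose derivative is dE/dx = 2(x – 3):

% Gradient descent on E(x) = (x - 3)^2
dEdx = @(x) 2 * (x - 3);
x    = 0;            % starting point
eta  = 0.1;          % step size (learning rate)
for step = 1:50
    x = x - eta * dEdx(x);    % x <- x - eta * dE/dx
end
disp(x)              % should be close to the minimum at x = 3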

Page 39

Reduction of Squared Error

Gradient descent reduces the squared error by calculating the partial derivative of E with respect to each weight:

E = ½ Err², where Err = t – g(in) and in = Σj Wj xj (this weighted sum is called “in”)

∂E/∂Wj = Err · ∂Err/∂Wj                 (chain rule for derivatives)
       = Err · ∂(t – g(in))/∂Wj         (expand the second Err to (t – g(in)))
       = –Err · g'(in) · xj             (∂t/∂Wj = 0 because t is a constant, and chain rule)

So, moving against the gradient, each weight is updated as:

Wj ← Wj + η · Err · g'(in) · xj

The weight is updated by η times this gradient of the error in weight space. The fact that the weight is updated in the correct direction (+/–) can be verified with examples.

The learning rate, η, is typically set to a small value such as 0.1.
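A minimal Matlab sketch of this update for a single sigmoid unit (an illustration, not code from the slides), assuming targets in {0, 1}, the OR examples as training data, and g'(in) = g(in)(1 – g(in)) from the sigmoid slide:

% Single sigmoid unit trained with W_j <- W_j + eta*Err*g'(in)*x_j
g      = @(in) 1 ./ (1 + exp(-in));      % sigmoid g
gprime = @(in) g(in) .* (1 - g(in));     % g'(in) = g(in)*(1 - g(in))
X   = [1 0 0; 1 0 1; 1 1 0; 1 1 1];      % OR examples, leading 1 = bias input
T   = [0; 1; 1; 1];
W   = 0.1 * randn(3, 1);
eta = 0.5;
for epoch = 1:5000
    for k = 1:size(X, 1)
        in  = X(k, :) * W;               % "in" = weighted sum of inputs
        Err = T(k) - g(in);              % Err = t - g(in)
        W   = W + eta * Err * gprime(in) * X(k, :)';
    end
end
disp(round(g(X * W)'))                   % should print 0 1 1 1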

Page 40

BackPropagation Algorithm

• Create a network with nin input nodes, nhidden internal nodes, and nout output nodes.

• Initialize all weights to small random numbers in the range of –0.5 to 0.5.

• Until the error is small, do:
  For each example X do:
    Propagate example X forward through the network.
    Propagate errors backward through the network.

Page 41

BackPropagation Algorithm

[Figure: a training pair (X, D); the input X = (x1, …, x5) is propagated forward through the network to produce outputs y1, …, y4, which are compared with the desired output D to obtain errors e1, …, e4 that are propagated backward]

In the classification phase, only the forward propagation step is used to classify patterns.

Page 42

The Backpropagation Algorithm for Three-Layer Networks with Sigmoid Units

Initialize all weights in the network to small random numbers.

Until the weights converge (this may take thousands of iterations) do:

  For each training example do:
    Compute the network output vector o.

    For each output unit i do (error gradient):
      δi = oi (1 – oi) (ti – oi)

    For each hidden unit j do (error backpropagation):
      δj = oj (1 – oj) Σ i∈outputs Wij δi

    Update each network weight from hidden unit j to output unit i:
      Wij ← Wij + η δi aj

    Update each network weight from input k to hidden unit j:
      Wjk ← Wjk + η δj ak
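A compact Matlab sketch of these updates applied to the XOR problem (an illustration, not code from the slides); the hidden layer size, learning rate, and epoch count are arbitrary choices:

% Minimal backpropagation: 2 inputs, 3 hidden sigmoid units, 1 sigmoid output
sigma = @(z) 1 ./ (1 + exp(-z));
P = [0 0 1 1; 0 1 0 1];                 % input patterns (one per column)
T = [0 1 1 0];                          % XOR targets
W1 = 0.5 * randn(3, 3);                 % hidden weights: 3 units x [bias + 2 inputs]
W2 = 0.5 * randn(1, 4);                 % output weights: 1 unit x [bias + 3 hidden]
eta = 0.5;
for epoch = 1:20000
    for k = 1:4
        a0 = [1; P(:, k)];              % input activation with bias term
        a1 = [1; sigma(W1 * a0)];       % hidden activations with bias term
        o  = sigma(W2 * a1);            % network output
        % error terms (deltas) as in the slide:
        d_out = o * (1 - o) * (T(k) - o);                               % output unit
        d_hid = a1(2:end) .* (1 - a1(2:end)) .* (W2(2:end)' * d_out);   % hidden units
        % weight updates:
        W2 = W2 + eta * d_out * a1';
        W1 = W1 + eta * d_hid * a0';
    end
end
% typically prints 0 1 1 0; as noted earlier, there is no guarantee of a global minimum
disp(round(sigma(W2 * [ones(1,4); sigma(W1 * [ones(1,4); P])])))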

Page 43

The problem of overfitting …

Approximation of the function y = f(x):

[Figure: fits obtained with 2, 5, and 40 neurons in the hidden layer, plotted as y against x]

Overfitting is not detectable in the learning phase …

so use cross-validation.

Page 44

Application of ANNs

The general scheme when using ANNs is as follows:

[Figure: a stimulus is encoded into an input pattern (a bit vector), fed to the network, which produces an output pattern that is decoded into the response]

Page 45

Application: Digit Recognition

Page 46

Matlab Demo

• Function approximation
• Pattern recognition

Page 47

Learning XOR Operation: Matlab Code

P = [ 0 0 1 1; ...     % input patterns (one column per example)
      0 1 0 1 ];
T = [ 0 1 1 0 ];       % desired outputs (XOR)

net = newff([0 1; 0 1], [6 1], {'tansig' 'tansig'});   % 2 inputs, 6 hidden units, 1 output

net.trainParam.epochs = 4850;    % maximum number of training epochs
net = train(net, P, T);          % train with backpropagation

X = [0; 1];                      % test input (a column vector: one pattern)
Y = sim(net, X);                 % simulate the trained network
display(Y);

Page 48

Function Approximation: Learning the Sine Function

P = 0:0.1:10;                    % input samples
T = sin(P) * 10.0;               % target outputs: a scaled sine wave

net = newff([0.0 10.0], [8 1], {'tansig' 'purelin'});  % 1 input, 8 hidden units, 1 linear output
plot(P, T); pause;
Y = sim(net, P);                 % output of the untrained network
plot(P, T, P, Y, 'o'); pause;

net.trainParam.epochs = 4850;
net = train(net, P, T);          % train with backpropagation

Y = sim(net, P);                 % output of the trained network
plot(P, T, P, Y, 'o');