Cooperating Intelligent Systems

Statistical learning methods, Chapter 20, AIMA (only ANNs & SVMs)


Artificial neural networks

The brain is a pretty intelligent system.

Can we "copy" it?

There are approx. 10^11 neurons in the brain.

There are approx. 23·10^9 neurons in the male cortex (females have about 15% fewer).

The simple model

• The McCulloch-Pitts model (1943)

Image from Neuroscience: Exploring the brain by Bear, Connors, and Paradiso

y = g(w0+w1x1+w2x2+w3x3)


Transfer functions g(z)

• The Heaviside (step) function
• The logistic (sigmoid) function
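A minimal Java sketch (not part of the course code) of such a unit, computing y = g(w0 + w1x1 + w2x2) with either transfer function; the weights and inputs are arbitrary illustration values:

// Minimal sketch of a McCulloch-Pitts-style unit: y = g(w0 + w1*x1 + ... + wD*xD).
public class Neuron {

    // Heaviside step function: 1 if z >= 0, otherwise 0.
    static double heaviside(double z) {
        return z >= 0 ? 1.0 : 0.0;
    }

    // Logistic (sigmoid) function: a smooth, differentiable alternative.
    static double logistic(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Weighted sum w0 + w1*x1 + ... + wD*xD (w[0] is the bias weight).
    static double net(double[] w, double[] x) {
        double z = w[0];
        for (int i = 0; i < x.length; i++) z += w[i + 1] * x[i];
        return z;
    }

    public static void main(String[] args) {
        double[] w = {-0.5, 1.0, 1.0};   // w0, w1, w2 (arbitrary values)
        double[] x = {1.0, 0.0};         // x1, x2
        double z = net(w, x);
        System.out.println("Heaviside output: " + heaviside(z));
        System.out.println("Logistic output:  " + logistic(z));
    }
}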

The simple perceptron

With {-1,+1} representation

Traditionally (early 1960s) trained with perceptron learning.

y(x) = sgn[w^T x] = +1 if w^T x >= 0
                    -1 if w^T x < 0

where w^T x = w0 + w1x1 + w2x2

Perceptron learning

Repeat until no errors are made anymore:

1. Pick a random example [x(n), f(n)]

2. If the classification is correct, i.e. if y(x(n)) = f(n), then do nothing

3. If the classification is wrong, then do the following update to the parameters (η, the learning rate, is a small positive number)

Desired output:
f(n) = +1 if x(n) belongs to class A
f(n) = -1 if x(n) belongs to class B

Update rule:
wi ← wi + η f(n) xi(n)

Example: Perceptron learning

The AND function

x1  x2 | f
 0   0 | -1
 0   1 | -1
 1   0 | -1
 1   1 | +1

Initial values: η = 0.3, w = (w0, w1, w2) = (-0.5, 1, 1)

Example: Perceptron learning

The AND function

With w = (-0.5, 1, 1): the chosen example is correctly classified, no action.

Example: Perceptron learning

The AND function

The example (x1, x2) = (0, 1) with f = -1 is incorrectly classified (w^T x = -0.5 + 1 = 0.5 ≥ 0, so y = +1): learning action.

w0 ← w0 + η f(n)·1  = -0.5 - 0.3   = -0.8
w1 ← w1 + η f(n) x1 =  1   - 0.3·0 =  1
w2 ← w2 + η f(n) x2 =  1   - 0.3·1 =  0.7

After the update: w = (-0.8, 1, 0.7)

Example: Perceptron learning

The AND function

With w = (-0.8, 1, 0.7): the chosen example is correctly classified, no action.

Example: Perceptron learning

The AND function

The example (x1, x2) = (1, 0) with f = -1 is incorrectly classified (w^T x = -0.8 + 1 = 0.2 ≥ 0, so y = +1): learning action.

w0 ← w0 + η f(n)·1  = -0.8 - 0.3   = -1.1
w1 ← w1 + η f(n) x1 =  1   - 0.3·1 =  0.7
w2 ← w2 + η f(n) x2 =  0.7 - 0.3·0 =  0.7

After the update: w = (-1.1, 0.7, 0.7)

Example: Perceptron learning

The AND function

With w = (-1.1, 0.7, 0.7) all four examples are correctly classified.

Final solution: w = (-1.1, 0.7, 0.7)
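The perceptron learning procedure above can be written in a few lines of Java. This is a minimal sketch (not part of the lab code); with the examples presented cyclically in table order instead of randomly, it reaches the same final solution w = (-1.1, 0.7, 0.7) as the example:

// Perceptron learning on the AND function with {-1,+1} targets.
// eta = 0.3 and w = (w0, w1, w2) = (-0.5, 1, 1), as in the example above.
public class PerceptronAnd {

    public static void main(String[] args) {
        double[][] x = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        int[] f = {-1, -1, -1, +1};          // desired outputs
        double[] w = {-0.5, 1.0, 1.0};       // w0 (bias), w1, w2
        double eta = 0.3;

        boolean errors = true;
        while (errors) {                     // repeat until no errors are made anymore
            errors = false;
            for (int n = 0; n < x.length; n++) {
                double s = w[0] + w[1] * x[n][0] + w[2] * x[n][1];
                int y = s >= 0 ? +1 : -1;    // y = sgn(w^T x)
                if (y != f[n]) {             // wrong: update the parameters
                    w[0] += eta * f[n];
                    w[1] += eta * f[n] * x[n][0];
                    w[2] += eta * f[n] * x[n][1];
                    errors = true;
                }
            }
        }
        System.out.printf("w = (%.2f, %.2f, %.2f)%n", w[0], w[1], w[2]);
    }
}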

Perceptron learning

• Perceptron learning is guaranteed to find a solution in finite time, if a solution exists.

• Perceptron learning cannot be generalized to more complex networks.

• Better to use gradient descent, based on formulating a differentiable error function, e.g.

E(W) = Σn=1..N [ f(n) - y(n, W) ]^2

Gradient search

ΔW = -η ∇W E(W)

"Go downhill" on the error surface E(W).

The "learning rate" (η) is set heuristically.

W(k+1) = W(k) + ΔW(k)

The Multilayer Perceptron (MLP)

• Combine several single layer perceptrons.

• Each single layer perceptron uses a sigmoid transfer function σ(z), e.g.

σ(z) = [1 + exp(-z)]^(-1)   or   σ(z) = tanh(z)

[Figure: MLP with inputs xk, hidden layers hj and hi, and outputs yl.]

Can be trained using gradient descent

Example: One hidden layer

• Can approximate any continuous function

φ(z) = sigmoid or linear, σ(z) = sigmoid.

hj(x) = σ( wj0 + Σk=1..D wjk xk )

yi(x) = φ( vi0 + Σj=1..J vij hj(x) )
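A minimal Java sketch (an illustration, not the lab's NN class) of the forward pass of such a one-hidden-layer network, with σ(z) = tanh(z) in the hidden layer and a linear output; all weight values are arbitrary:

// Forward pass of a one-hidden-layer network:
// h_j = sigma(w_j0 + sum_k w_jk * x_k),  y_i = v_i0 + sum_j v_ij * h_j (linear output).
public class MlpForward {

    // w[j][0] is the bias of hidden unit j; v[i][0] is the bias of output unit i.
    static double[] forward(double[][] w, double[][] v, double[] x) {
        double[] h = new double[w.length];
        for (int j = 0; j < w.length; j++) {
            double z = w[j][0];
            for (int k = 0; k < x.length; k++) z += w[j][k + 1] * x[k];
            h[j] = Math.tanh(z);                 // sigma(z) = tanh(z)
        }
        double[] y = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            double z = v[i][0];
            for (int j = 0; j < h.length; j++) z += v[i][j + 1] * h[j];
            y[i] = z;                            // phi(z) = z (linear output)
        }
        return y;
    }

    public static void main(String[] args) {
        double[][] w = {{0.1, 0.5, -0.3}, {-0.2, 0.8, 0.4}}; // 2 hidden units, 2 inputs
        double[][] v = {{0.0, 1.0, -1.0}};                   // 1 output unit
        double[] y = forward(w, v, new double[]{1.0, 0.5});
        System.out.println("y = " + y[0]);
    }
}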

Example of computing the gradient

ΔW = -η ∇W E(W)

E(W) = MSE = (1/N) Σn=1..N [ ŷ(W, x(n)) - y(n) ]^2 = (1/N) Σn=1..N e(n)^2

∇W E(W) = (1/N) Σn=1..N ∇W e(n)^2 = (2/N) Σn=1..N e(n) ∇W e(n) = (2/N) Σn=1..N e(n) ∇W ŷ(n)

What we need to do is to compute ∇W ŷ.

We have the complete equation for the network:

ŷ(x) = v0 + Σj=1..J vj h( wj0 + Σk=1..K wjk xk )

Example of computing the gradient

The gradient has one component per parameter:

∇W ŷ = ( ∂ŷ/∂v0, ∂ŷ/∂vj, ∂ŷ/∂wj0, ∂ŷ/∂wjk )

∂ŷ/∂v0  = 1

∂ŷ/∂vj  = h( wj0 + Σk wjk xk ) = hj

∂ŷ/∂wj0 = vj h'( wj0 + Σk wjk xk )

∂ŷ/∂wjk = vj h'( wj0 + Σk wjk xk ) xk

With h(z) = tanh(z) we get h'(z) = 1 - h(z)^2.
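A minimal Java sketch (an illustration, not the lab's NN class) that applies these derivative formulas to compute the gradient of the squared error for a single training example, using h(z) = tanh(z) and h'(z) = 1 - h(z)^2; the weights, input, and target are arbitrary illustration values:

// Gradient of e^2 = (yhat - y)^2 for one example, using the derivatives above:
// d yhat/d v0 = 1, d yhat/d vj = h_j, d yhat/d w_j0 = v_j h'(z_j), d yhat/d w_jk = v_j h'(z_j) x_k.
public class MlpGradient {

    public static void main(String[] args) {
        double[] x = {1.0, 0.5};                              // input (K = 2)
        double target = 0.2;                                  // desired output y
        double[][] w = {{0.1, 0.5, -0.3}, {-0.2, 0.8, 0.4}};  // hidden weights (J = 2)
        double[] v = {0.0, 1.0, -1.0};                        // output weights v0, v1, v2

        // Forward pass.
        int J = w.length, K = x.length;
        double[] h = new double[J];
        double yhat = v[0];
        for (int j = 0; j < J; j++) {
            double z = w[j][0];
            for (int k = 0; k < K; k++) z += w[j][k + 1] * x[k];
            h[j] = Math.tanh(z);
            yhat += v[j + 1] * h[j];
        }
        double e = yhat - target;                             // error e(n)

        // Backward pass: gradient of e^2 is 2*e times d yhat/d parameter.
        double[] dv = new double[J + 1];
        double[][] dw = new double[J][K + 1];
        dv[0] = 2 * e;                                        // d(e^2)/d v0
        for (int j = 0; j < J; j++) {
            dv[j + 1] = 2 * e * h[j];                         // d(e^2)/d vj
            double hprime = 1 - h[j] * h[j];                  // h'(z_j) for tanh units
            dw[j][0] = 2 * e * v[j + 1] * hprime;             // d(e^2)/d w_j0
            for (int k = 0; k < K; k++)
                dw[j][k + 1] = 2 * e * v[j + 1] * hprime * x[k]; // d(e^2)/d w_jk
        }
        System.out.println("dE/dv0 = " + dv[0]);
    }
}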

When should you stop learning?

• After a set number of learning epochs
• When the change in the gradient becomes smaller than a certain number
• Validation data - "early stopping"

[Figure: classification error vs. training epochs; the training error keeps decreasing while the validation (test) error reaches a minimum at the preferred model.]
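A minimal sketch of the validation-based "early stopping" idea; trainEpoch and validationError are hypothetical callbacks standing in for whatever training and evaluation routines are actually used:

import java.util.function.DoubleSupplier;

// Train for up to maxEpochs, but remember the epoch with the lowest validation
// error and report it as the preferred model.
public class EarlyStopping {

    static int earlyStopping(int maxEpochs, Runnable trainEpoch, DoubleSupplier validationError) {
        double bestErr = Double.MAX_VALUE;
        int bestEpoch = -1;
        for (int e = 0; e < maxEpochs; e++) {
            trainEpoch.run();                            // one pass over the training data
            double err = validationError.getAsDouble();  // error on held-out validation data
            if (err < bestErr) {                         // new preferred model
                bestErr = err;
                bestEpoch = e;
            }
        }
        return bestEpoch;  // the caller keeps (or reloads) the weights from this epoch
    }

    public static void main(String[] args) {
        // Toy demo: a "validation error" that first decreases, then increases again.
        final int[] t = {0};
        int best = earlyStopping(20, () -> t[0]++, () -> Math.pow(t[0] - 8, 2));
        System.out.println("Preferred model found at epoch " + best);
    }
}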

RPROP (Resilient PROPagation)

Parameter update rule:

ΔWi(t) = -ηi(t) · sign( ∂E(W)/∂Wi )

Learning rate update rule:

ηi(t) = 1.2 · ηi(t-1)  if  [∂E/∂Wi](t) · [∂E/∂Wi](t-1) > 0
ηi(t) = 0.5 · ηi(t-1)  if  [∂E/∂Wi](t) · [∂E/∂Wi](t-1) < 0

No parameter tuning, unlike standard backpropagation!
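A minimal Java sketch of the RPROP rules above (full RPROP implementations usually also bound ηi between a minimum and maximum step size, which the slide omits); the toy error function in main is only for illustration:

// One RPROP step applied to a weight vector W.
// eta[i] is the per-weight step size; grad and prevGrad hold dE/dW_i at t and t-1.
public class Rprop {

    static void rpropStep(double[] W, double[] eta, double[] grad, double[] prevGrad) {
        for (int i = 0; i < W.length; i++) {
            double product = grad[i] * prevGrad[i];
            if (product > 0) eta[i] *= 1.2;        // same sign: increase step size
            else if (product < 0) eta[i] *= 0.5;   // sign change: decrease step size
            W[i] -= eta[i] * Math.signum(grad[i]); // Delta W_i = -eta_i * sign(dE/dW_i)
            prevGrad[i] = grad[i];                 // remember the gradient for next step
        }
    }

    public static void main(String[] args) {
        // Toy demo: minimize E(W) = W0^2 + 10*W1^2 starting from (3, 2).
        double[] W = {3.0, 2.0};
        double[] eta = {0.1, 0.1};
        double[] prevGrad = {0.0, 0.0};
        for (int t = 0; t < 50; t++) {
            double[] grad = {2 * W[0], 20 * W[1]};
            rpropStep(W, eta, grad, prevGrad);
        }
        System.out.printf("W = (%.3f, %.3f)%n", W[0], W[1]);
    }
}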

Model selection

[Figure: histograms (PDFs) of the classification error [%] for model type A and model type B, and an errorbar plot comparing the two model types.]

Use this to determine:
• Number of hidden nodes
• Which input signals to use
• If a pre-processing strategy is good or not
• Etc.

Variability typically induced by:
• Varying training and test data sets
• Random initial model parameters

Support vector machines

Linear classifier on a linearly separable problem

There are infinitely many lines that have zero training error.

Which line should we choose?

Choose the line with the largest margin: the "large margin classifier".


Linear classifier on a linearly separable problem

The data points lying on the margin planes are the "support vectors".

The plane separating the two classes is defined by w^T x = a.
The dashed (margin) planes are given by w^T x = a + b and w^T x = a - b.

Computing the margin

Divide by b, and define new w = w/b and θ = a/b.

Computing the margin

We have

w^T x / b = a/b + 1
w^T x / b = a/b - 1

which gives (with the new w and θ)

w^T x - θ = +1
w^T x - θ = -1

We have defined a scale for w and b.

Computing the margin

Take a point x on the plane w^T x - θ = -1 and move along w to the point x + λw on the plane w^T x - θ = +1. Then

w^T(x + λw) - θ = +1  ⇒  λ w^T w = 2  ⇒  λ = 2 / ||w||^2

so the margin is

margin = ||λw|| = 2 / ||w||

Maximizing the margin is equal to minimizing ||w||, subject to the constraints

w^T x(n) - θ ≥ +1 for all n with y(n) = +1
w^T x(n) - θ ≤ -1 for all n with y(n) = -1

Quadratic programming problem; the constraints can be included with Lagrange multipliers.

Linear classifier on a linearly separable problem

Minimize the cost (Lagrangian)

Lp = (1/2) ||w||^2 - Σn=1..N α(n) [ y(n)(w^T x(n) - θ) - 1 ]

The minimum of Lp occurs at the maximum of (the Wolfe dual)

LD = Σn=1..N α(n) - (1/2) Σn=1..N Σm=1..N α(n) α(m) y(n) y(m) x(n)^T x(m)

Quadratic programming problem.

Only scalar products x(n)^T x(m) appear in the cost. IMPORTANT!

Linear Support Vector Machine

Test phase, the predicted output:

ŷ(x) = sgn( w^T x - θ ) = sgn( Σn∈SV α(n) y(n) x(n)^T x - θ )

Still only scalar products in the expression.
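A minimal Java sketch of this test-phase expression; the support vectors, multipliers α(n), and threshold θ below are arbitrary illustration values, not a trained machine:

// Test phase of a linear SVM: yhat(x) = sgn( sum_n alpha_n * y_n * x_n^T x - theta ),
// where the sum runs over the support vectors only.
public class LinearSvm {

    static int predict(double[][] sv, int[] y, double[] alpha, double theta, double[] x) {
        double s = -theta;
        for (int n = 0; n < sv.length; n++) {
            double dot = 0;                       // scalar product x_n^T x
            for (int k = 0; k < x.length; k++) dot += sv[n][k] * x[k];
            s += alpha[n] * y[n] * dot;
        }
        return s >= 0 ? +1 : -1;
    }

    public static void main(String[] args) {
        double[][] sv = {{1.0, 1.0}, {0.0, 0.0}}; // illustration support vectors
        int[] y = {+1, -1};
        double[] alpha = {1.0, 1.0};
        double theta = 1.0;
        System.out.println(predict(sv, y, alpha, theta, new double[]{1.0, 0.5}));
    }
}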

Example: Robot color vision (Competition 1999)

Classify the Lego pieces into red, blue, and yellow. Classify white balls, black sideboard, and green carpet.

[Figure: what the camera sees (RGB space), with clusters labelled Yellow, Green, and Red.]

Mapping RGB (3D) to rgb (2D):

r = R / (R + G + B)
g = G / (R + G + B)
b = B / (R + G + B)
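A minimal Java sketch of this normalization (the guard against division by zero for black pixels is an added assumption):

public class RgbNormalizer {
    // Map an RGB pixel to normalized rgb: r = R/(R+G+B), g = G/(R+G+B), b = B/(R+G+B).
    // Since r + g + b = 1, two components already carry all the information (2D input).
    static double[] normalize(double R, double G, double B) {
        double sum = R + G + B;
        if (sum == 0) return new double[]{0, 0, 0};  // guard for black pixels (assumption)
        return new double[]{R / sum, G / sum, B / sum};
    }
}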

Lego in normalized rgb space

Input is 2D: x = (x1, x2) ∈ R^2

Output is 6D: c ∈ {red, blue, yellow, green, black, white}

MLP classifier

E_train = 0.21%, E_test = 0.24%

2-3-1 MLP, trained with Levenberg-Marquardt.

Training time (150 epochs): 51 seconds.

SVM classifier

E_train = 0.19%, E_test = 0.20%

SVM with kernel parameter γ = 1000 and kernel

K(x, y) = exp( -γ ||x - y||^2 )

Training time: 22 seconds.
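The kernel on this slide can be written as a small Java method; in a kernel SVM it replaces the scalar product x(n)^T x in the decision function. The name gamma is used here for the constant in the exponent (an assumption about what the quoted value 1000 refers to):

// Gaussian (RBF) kernel from the slide: K(x, y) = exp(-gamma * ||x - y||^2).
public class RbfKernel {
    static double k(double[] x, double[] y, double gamma) {
        double d2 = 0;                        // squared Euclidean distance ||x - y||^2
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            d2 += d * d;
        }
        return Math.exp(-gamma * d2);
    }
}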

Lab 4: Digit recognition

• Inputs (digits) are provided as 32x32 bitmaps. The task is to investigate how well these handwritten digits can be recognized by neural networks.

• The assignment includes changing the program code to answer:

1. How good is the generalization performance? (test data error)
2. Can pre-processing improve performance?
3. What is the best configuration of the network?

public AppTrain() {
    // create a new network of given size
    nn = new NN(32*32, 10, seed);

    // each row contains 32*32+1 integers
    // create the matrix holding the data
    // read data into the matrix
    file = new TFile("digits.dat");
    System.out.println(file.rows() + " digits have been loaded");

    double[] input = new double[32*32];
    double[] target = new double[10];

    // the training session (below) is iterative
    for (int e = 0; e < nEpochs; e++) {
        // reset the error accumulated over each training epoch
        double err = 0;
        // in each epoch, go through all examples/tuples/digits
        // note: all examples are here used for training, consequently no systematic testing
        // you may consider dividing the data set into training, testing and validation sets
        for (int p = 0; p < file.rows(); p++) {
            for (int i = 0; i < 32*32; i++)
                input[i] = file.values[p][i];
            // the last value on each row contains the target (0-9)
            // convert it to a double[] target vector
            for (int i = 0; i < 10; i++) {
                if (file.values[p][32*32] == i) target[i] = 1;
                else target[i] = 0;
            }
            // present a sample and calculate errors and adjust weights
            err += nn.train(input, target, eta);
        }
        System.out.println("Epoch " + e + " finished with error " + err / file.rows());
    }

    // save network weights in a file for later use, e.g. in AppDigits
    nn.save("network.m");
}

/**
 * classify
 * @param map the bitmap on the screen
 * @return int the most likely digit (0-9) according to the network
 */
public int classify(boolean[][] map) {
    double[] input = new double[32*32];
    for (int c = 0; c < map.length; c++) {
        for (int r = 0; r < map[c].length; r++) {
            if (map[c][r]) // bit set
                input[r * map[r].length + c] = 1;
            else
                input[r * map[r].length + c] = 0;
        }
    }
    // activate the network, produce output vector
    double[] output = nn.feedforward(input);
    // alternative version assumes that the network has been trained on an 8x8 map
    // double[] output = nn.feedforward(to8x8(input));
    double highscore = 0;
    int highscoreIndex = 0;
    // print out each output value (gives an idea of the network's support for each digit)
    System.out.println("--------------");
    for (int k = 0; k < 10; k++) {
        System.out.println(k + ":" + (double)((int)(output[k]*1000)/1000.0));
        if (output[k] > highscore) {
            highscore = output[k];
            highscoreIndex = k;
        }
    }
    System.out.println("--------------");
    return highscoreIndex;
}
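The commented-out alternative in classify() refers to a to8x8 pre-processing step. The actual helper is not shown in this excerpt; a possible implementation (a sketch, assuming simple block averaging) that down-samples the 32x32 bitmap to 8x8 is:

/**
 * Possible pre-processing step (a sketch; the actual to8x8 used in the lab may differ):
 * down-sample the 32x32 bitmap to 8x8 by averaging each 4x4 block of pixels.
 * @param input 32*32 values in row-major order (0 or 1)
 * @return 8*8 averaged values in [0,1]
 */
public double[] to8x8(double[] input) {
    double[] out = new double[8 * 8];
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            double sum = 0;
            for (int dr = 0; dr < 4; dr++)
                for (int dc = 0; dc < 4; dc++)
                    sum += input[(r * 4 + dr) * 32 + (c * 4 + dc)];
            out[r * 8 + c] = sum / 16.0;   // average of the 4x4 block
        }
    }
    return out;
}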