39
Concept Learning Concept Learning Algorithms Algorithms Come from many different Come from many different theoretical backgrounds and theoretical backgrounds and motivations motivations Behaviors related to human Behaviors related to human learning learning Some biologically inspired, Some biologically inspired, others not others not Biologically- Inspired Utilitarian (just get good result) Neural Networks Tree Learners Nearest Neighbor © Jude Shavlik 2006 © Jude Shavlik 2006 David Page 2010 David Page 2010 CS 760 – Machine Learning (UW- CS 760 – Machine Learning (UW- Madison) Madison)

Concept Learning Algorithms

Embed Size (px)

DESCRIPTION

Concept Learning Algorithms. Come from many different theoretical backgrounds and motivations Behaviors related to human learning Some biologically inspired, others not. Neural Networks. Nearest Neighbor. Tree Learners. Utilitarian (just get good result). Biologically- Inspired. - PowerPoint PPT Presentation

Citation preview

Page 1: Concept Learning Algorithms

Concept Learning Concept Learning AlgorithmsAlgorithms

bull Come from many different Come from many different theoretical backgrounds and theoretical backgrounds and motivationsmotivations

bull Behaviors related to human Behaviors related to human learninglearning

bull Some biologically inspired others Some biologically inspired others notnot

Biologically-Inspired

Utilitarian(just getgood result)

Neural Networks Tree LearnersNearest Neighbor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

TodayrsquosTodayrsquos TopicsTopics

bull PerceptronsPerceptronsbull Artificial Neural Networks (ANNs)Artificial Neural Networks (ANNs)bull BackpropagationBackpropagationbull Weight SpaceWeight Space

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ConnectionismConnectionism

PERCEPTRONS (Rosenblatt 1957)PERCEPTRONS (Rosenblatt 1957)bull among earliest work in machine among earliest work in machine

learninglearningbull died out in 1960rsquos (Minsky amp Papert died out in 1960rsquos (Minsky amp Papert

book)book)J

K

L

I

wij

wik

wil

Outputi = F(Wij outputj + Wik outputk + Wil outputl )

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Perceptron as Perceptron as ClassifierClassifierbull Output for N example X is sign(WOutput for N example X is sign(WX) X)

where sign is -1 or +1 (or use threshold where sign is -1 or +1 (or use threshold and 01)and 01)

bull Candidate Hypotheses real-valued weight Candidate Hypotheses real-valued weight vectorsvectors

bull Training Update W for each misclassified Training Update W for each misclassified example X (target class example X (target class tt predicted predicted oo) by) bybull WWii W Wii + + ((tt--oo)X)Xii

bull Here Here is learning rate parameteris learning rate parameter

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent Gradient Descent for the Perceptronfor the Perceptron(Assume no threshold for now and (Assume no threshold for now and

start with a common error measure)start with a common error measure) Error Error frac12 ( t ndash o ) frac12 ( t ndash o )

2

Networkrsquos output

Teacherrsquos answer (a constant wrt the weights) EE

WWkk

ΔΔWWjj - η

= (t ndash o)

EE

WWkk

(t ndash o)(t ndash o)

WWkk

= -(t ndash o) oo

WW kk

Remember o = WmiddotX

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Continuation of Continuation of DerivationDerivation

EE WWkk

= -(t ndash o) WWkk

(( sumsumk k ww k k x x kk))

= -(t ndash o) x k

So ΔWk = η (t ndash o) xk The Perceptron Rule

Stick in formula for

output

Also known as the delta rule and other names (with small variations in calc)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

As it looks in your text As it looks in your text (processing all data at once)hellip(processing all data at once)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Linear Separability Linear Separability

Consider a perceptron its output is

1 If W1X1+W2X2 + hellip + WnXn gt 0 otherwise

In terms of feature space W1X1 + W2X2 =

X2 = = W1X1

W2

-W1 W2 W2

X1+

+ + + + + + - + - - + + + + - + + - - -+ + - -+ - - -

- -

Hence can only classify examples if a ldquolinerdquo (hyerplane) can separate them

y = mx + b

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Perceptron Convergence Perceptron Convergence

TheoremTheorem (Rosemblatt 1957)(Rosemblatt 1957)

Perceptron no Hidden Units

If a set of examples is learnable the perceptron training rule will eventually find the necessary weightsHowever a perceptron can only learnrepresent linearly separable datasetcopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The (Infamous) XOR The (Infamous) XOR ProblemProblem

Input

0 00 11 01 1

Output

0110

a)b)c)d)

Exclusive OR (XOR)Not linearly separable

b

a c

d

0 1

1

A Neural Network SolutionX1

X2

X1

X2

1

1

-1-1

1

1 Let = 0 for all nodescopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Need for Hidden The Need for Hidden UnitsUnits

If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)

This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Hidden UnitsHidden Units

One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Advantages of Neural NetworksNeural Networks

Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems

Being supplanted by Being supplanted by SVMrsquosSVMrsquos

Can represent a rich Can represent a rich class of conceptsclass of concepts

PositivenegativePositive

Saturday 40 chance of rainSunday 25 chance of rain

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Overview of ANNsOverview of ANNs

Recurrentlink

Output units

Input units

Hidden units

error

weight

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule

bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units

bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)

error signal

E Wij

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Weight SpaceWeight Space

bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace

bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork

bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data

bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 2: Concept Learning Algorithms

TodayrsquosTodayrsquos TopicsTopics

bull PerceptronsPerceptronsbull Artificial Neural Networks (ANNs)Artificial Neural Networks (ANNs)bull BackpropagationBackpropagationbull Weight SpaceWeight Space

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ConnectionismConnectionism

PERCEPTRONS (Rosenblatt 1957)PERCEPTRONS (Rosenblatt 1957)bull among earliest work in machine among earliest work in machine

learninglearningbull died out in 1960rsquos (Minsky amp Papert died out in 1960rsquos (Minsky amp Papert

book)book)J

K

L

I

wij

wik

wil

Outputi = F(Wij outputj + Wik outputk + Wil outputl )

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Perceptron as Perceptron as ClassifierClassifierbull Output for N example X is sign(WOutput for N example X is sign(WX) X)

where sign is -1 or +1 (or use threshold where sign is -1 or +1 (or use threshold and 01)and 01)

bull Candidate Hypotheses real-valued weight Candidate Hypotheses real-valued weight vectorsvectors

bull Training Update W for each misclassified Training Update W for each misclassified example X (target class example X (target class tt predicted predicted oo) by) bybull WWii W Wii + + ((tt--oo)X)Xii

bull Here Here is learning rate parameteris learning rate parameter

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent Gradient Descent for the Perceptronfor the Perceptron(Assume no threshold for now and (Assume no threshold for now and

start with a common error measure)start with a common error measure) Error Error frac12 ( t ndash o ) frac12 ( t ndash o )

2

Networkrsquos output

Teacherrsquos answer (a constant wrt the weights) EE

WWkk

ΔΔWWjj - η

= (t ndash o)

EE

WWkk

(t ndash o)(t ndash o)

WWkk

= -(t ndash o) oo

WW kk

Remember o = WmiddotX

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Continuation of Continuation of DerivationDerivation

EE WWkk

= -(t ndash o) WWkk

(( sumsumk k ww k k x x kk))

= -(t ndash o) x k

So ΔWk = η (t ndash o) xk The Perceptron Rule

Stick in formula for

output

Also known as the delta rule and other names (with small variations in calc)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

As it looks in your text As it looks in your text (processing all data at once)hellip(processing all data at once)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Linear Separability Linear Separability

Consider a perceptron its output is

1 If W1X1+W2X2 + hellip + WnXn gt 0 otherwise

In terms of feature space W1X1 + W2X2 =

X2 = = W1X1

W2

-W1 W2 W2

X1+

+ + + + + + - + - - + + + + - + + - - -+ + - -+ - - -

- -

Hence can only classify examples if a ldquolinerdquo (hyerplane) can separate them

y = mx + b

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Perceptron Convergence Perceptron Convergence

TheoremTheorem (Rosemblatt 1957)(Rosemblatt 1957)

Perceptron no Hidden Units

If a set of examples is learnable the perceptron training rule will eventually find the necessary weightsHowever a perceptron can only learnrepresent linearly separable datasetcopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The (Infamous) XOR The (Infamous) XOR ProblemProblem

Input

0 00 11 01 1

Output

0110

a)b)c)d)

Exclusive OR (XOR)Not linearly separable

b

a c

d

0 1

1

A Neural Network SolutionX1

X2

X1

X2

1

1

-1-1

1

1 Let = 0 for all nodescopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Need for Hidden The Need for Hidden UnitsUnits

If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)

This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Hidden UnitsHidden Units

One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Advantages of Neural NetworksNeural Networks

Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems

Being supplanted by Being supplanted by SVMrsquosSVMrsquos

Can represent a rich Can represent a rich class of conceptsclass of concepts

PositivenegativePositive

Saturday 40 chance of rainSunday 25 chance of rain

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Overview of ANNsOverview of ANNs

Recurrentlink

Output units

Input units

Hidden units

error

weight

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule

bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units

bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)

error signal

E Wij

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Weight SpaceWeight Space

bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace

bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork

bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data

bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 3: Concept Learning Algorithms

ConnectionismConnectionism

PERCEPTRONS (Rosenblatt 1957)PERCEPTRONS (Rosenblatt 1957)bull among earliest work in machine among earliest work in machine

learninglearningbull died out in 1960rsquos (Minsky amp Papert died out in 1960rsquos (Minsky amp Papert

book)book)J

K

L

I

wij

wik

wil

Outputi = F(Wij outputj + Wik outputk + Wil outputl )

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Perceptron as Perceptron as ClassifierClassifierbull Output for N example X is sign(WOutput for N example X is sign(WX) X)

where sign is -1 or +1 (or use threshold where sign is -1 or +1 (or use threshold and 01)and 01)

bull Candidate Hypotheses real-valued weight Candidate Hypotheses real-valued weight vectorsvectors

bull Training Update W for each misclassified Training Update W for each misclassified example X (target class example X (target class tt predicted predicted oo) by) bybull WWii W Wii + + ((tt--oo)X)Xii

bull Here Here is learning rate parameteris learning rate parameter

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent Gradient Descent for the Perceptronfor the Perceptron(Assume no threshold for now and (Assume no threshold for now and

start with a common error measure)start with a common error measure) Error Error frac12 ( t ndash o ) frac12 ( t ndash o )

2

Networkrsquos output

Teacherrsquos answer (a constant wrt the weights) EE

WWkk

ΔΔWWjj - η

= (t ndash o)

EE

WWkk

(t ndash o)(t ndash o)

WWkk

= -(t ndash o) oo

WW kk

Remember o = WmiddotX

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Continuation of Continuation of DerivationDerivation

EE WWkk

= -(t ndash o) WWkk

(( sumsumk k ww k k x x kk))

= -(t ndash o) x k

So ΔWk = η (t ndash o) xk The Perceptron Rule

Stick in formula for

output

Also known as the delta rule and other names (with small variations in calc)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

As it looks in your text As it looks in your text (processing all data at once)hellip(processing all data at once)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Linear Separability Linear Separability

Consider a perceptron its output is

1 If W1X1+W2X2 + hellip + WnXn gt 0 otherwise

In terms of feature space W1X1 + W2X2 =

X2 = = W1X1

W2

-W1 W2 W2

X1+

+ + + + + + - + - - + + + + - + + - - -+ + - -+ - - -

- -

Hence can only classify examples if a ldquolinerdquo (hyerplane) can separate them

y = mx + b

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Perceptron Convergence Perceptron Convergence

TheoremTheorem (Rosemblatt 1957)(Rosemblatt 1957)

Perceptron no Hidden Units

If a set of examples is learnable the perceptron training rule will eventually find the necessary weightsHowever a perceptron can only learnrepresent linearly separable datasetcopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The (Infamous) XOR The (Infamous) XOR ProblemProblem

Input

0 00 11 01 1

Output

0110

a)b)c)d)

Exclusive OR (XOR)Not linearly separable

b

a c

d

0 1

1

A Neural Network SolutionX1

X2

X1

X2

1

1

-1-1

1

1 Let = 0 for all nodescopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Need for Hidden The Need for Hidden UnitsUnits

If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)

This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Hidden UnitsHidden Units

One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Advantages of Neural NetworksNeural Networks

Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems

Being supplanted by Being supplanted by SVMrsquosSVMrsquos

Can represent a rich Can represent a rich class of conceptsclass of concepts

PositivenegativePositive

Saturday 40 chance of rainSunday 25 chance of rain

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Overview of ANNsOverview of ANNs

Recurrentlink

Output units

Input units

Hidden units

error

weight

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule

bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units

bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)

error signal

E Wij

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Weight SpaceWeight Space

bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace

bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork

bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data

bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 4: Concept Learning Algorithms

Perceptron as Perceptron as ClassifierClassifierbull Output for N example X is sign(WOutput for N example X is sign(WX) X)

where sign is -1 or +1 (or use threshold where sign is -1 or +1 (or use threshold and 01)and 01)

bull Candidate Hypotheses real-valued weight Candidate Hypotheses real-valued weight vectorsvectors

bull Training Update W for each misclassified Training Update W for each misclassified example X (target class example X (target class tt predicted predicted oo) by) bybull WWii W Wii + + ((tt--oo)X)Xii

bull Here Here is learning rate parameteris learning rate parameter

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent Gradient Descent for the Perceptronfor the Perceptron(Assume no threshold for now and (Assume no threshold for now and

start with a common error measure)start with a common error measure) Error Error frac12 ( t ndash o ) frac12 ( t ndash o )

2

Networkrsquos output

Teacherrsquos answer (a constant wrt the weights) EE

WWkk

ΔΔWWjj - η

= (t ndash o)

EE

WWkk

(t ndash o)(t ndash o)

WWkk

= -(t ndash o) oo

WW kk

Remember o = WmiddotX

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Continuation of Continuation of DerivationDerivation

EE WWkk

= -(t ndash o) WWkk

(( sumsumk k ww k k x x kk))

= -(t ndash o) x k

So ΔWk = η (t ndash o) xk The Perceptron Rule

Stick in formula for

output

Also known as the delta rule and other names (with small variations in calc)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

As it looks in your text As it looks in your text (processing all data at once)hellip(processing all data at once)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Linear Separability Linear Separability

Consider a perceptron its output is

1 If W1X1+W2X2 + hellip + WnXn gt 0 otherwise

In terms of feature space W1X1 + W2X2 =

X2 = = W1X1

W2

-W1 W2 W2

X1+

+ + + + + + - + - - + + + + - + + - - -+ + - -+ - - -

- -

Hence can only classify examples if a ldquolinerdquo (hyerplane) can separate them

y = mx + b

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Perceptron Convergence Perceptron Convergence

TheoremTheorem (Rosemblatt 1957)(Rosemblatt 1957)

Perceptron no Hidden Units

If a set of examples is learnable the perceptron training rule will eventually find the necessary weightsHowever a perceptron can only learnrepresent linearly separable datasetcopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The (Infamous) XOR The (Infamous) XOR ProblemProblem

Input

0 00 11 01 1

Output

0110

a)b)c)d)

Exclusive OR (XOR)Not linearly separable

b

a c

d

0 1

1

A Neural Network SolutionX1

X2

X1

X2

1

1

-1-1

1

1 Let = 0 for all nodescopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Need for Hidden The Need for Hidden UnitsUnits

If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)

This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Hidden UnitsHidden Units

One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Advantages of Neural NetworksNeural Networks

Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems

Being supplanted by Being supplanted by SVMrsquosSVMrsquos

Can represent a rich Can represent a rich class of conceptsclass of concepts

PositivenegativePositive

Saturday 40 chance of rainSunday 25 chance of rain

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Overview of ANNsOverview of ANNs

Recurrentlink

Output units

Input units

Hidden units

error

weight

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule

bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units

bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)

error signal

E Wij

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Weight SpaceWeight Space

bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace

bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork

bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data

bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 5: Concept Learning Algorithms

Gradient Descent Gradient Descent for the Perceptronfor the Perceptron(Assume no threshold for now and (Assume no threshold for now and

start with a common error measure)start with a common error measure) Error Error frac12 ( t ndash o ) frac12 ( t ndash o )

2

Networkrsquos output

Teacherrsquos answer (a constant wrt the weights) EE

WWkk

ΔΔWWjj - η

= (t ndash o)

EE

WWkk

(t ndash o)(t ndash o)

WWkk

= -(t ndash o) oo

WW kk

Remember o = WmiddotX

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Continuation of Continuation of DerivationDerivation

EE WWkk

= -(t ndash o) WWkk

(( sumsumk k ww k k x x kk))

= -(t ndash o) x k

So ΔWk = η (t ndash o) xk The Perceptron Rule

Stick in formula for

output

Also known as the delta rule and other names (with small variations in calc)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

As it looks in your text As it looks in your text (processing all data at once)hellip(processing all data at once)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Linear Separability Linear Separability

Consider a perceptron its output is

1 If W1X1+W2X2 + hellip + WnXn gt 0 otherwise

In terms of feature space W1X1 + W2X2 =

X2 = = W1X1

W2

-W1 W2 W2

X1+

+ + + + + + - + - - + + + + - + + - - -+ + - -+ - - -

- -

Hence can only classify examples if a ldquolinerdquo (hyerplane) can separate them

y = mx + b

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Perceptron Convergence Perceptron Convergence

TheoremTheorem (Rosemblatt 1957)(Rosemblatt 1957)

Perceptron no Hidden Units

If a set of examples is learnable the perceptron training rule will eventually find the necessary weightsHowever a perceptron can only learnrepresent linearly separable datasetcopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The (Infamous) XOR The (Infamous) XOR ProblemProblem

Input

0 00 11 01 1

Output

0110

a)b)c)d)

Exclusive OR (XOR)Not linearly separable

b

a c

d

0 1

1

A Neural Network SolutionX1

X2

X1

X2

1

1

-1-1

1

1 Let = 0 for all nodescopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Need for Hidden The Need for Hidden UnitsUnits

If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)

This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Hidden UnitsHidden Units

One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Advantages of Neural NetworksNeural Networks

Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems

Being supplanted by Being supplanted by SVMrsquosSVMrsquos

Can represent a rich Can represent a rich class of conceptsclass of concepts

PositivenegativePositive

Saturday 40 chance of rainSunday 25 chance of rain

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Overview of ANNsOverview of ANNs

Recurrentlink

Output units

Input units

Hidden units

error

weight

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule

bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units

bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)

error signal

E Wij

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Weight SpaceWeight Space

bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace

bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork

bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data

bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 6: Concept Learning Algorithms

Continuation of Continuation of DerivationDerivation

EE WWkk

= -(t ndash o) WWkk

(( sumsumk k ww k k x x kk))

= -(t ndash o) x k

So ΔWk = η (t ndash o) xk The Perceptron Rule

Stick in formula for

output

Also known as the delta rule and other names (with small variations in calc)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

As it looks in your text As it looks in your text (processing all data at once)hellip(processing all data at once)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Linear Separability Linear Separability

Consider a perceptron its output is

1 If W1X1+W2X2 + hellip + WnXn gt 0 otherwise

In terms of feature space W1X1 + W2X2 =

X2 = = W1X1

W2

-W1 W2 W2

X1+

+ + + + + + - + - - + + + + - + + - - -+ + - -+ - - -

- -

Hence can only classify examples if a ldquolinerdquo (hyerplane) can separate them

y = mx + b

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Perceptron Convergence Perceptron Convergence

TheoremTheorem (Rosemblatt 1957)(Rosemblatt 1957)

Perceptron no Hidden Units

If a set of examples is learnable the perceptron training rule will eventually find the necessary weightsHowever a perceptron can only learnrepresent linearly separable datasetcopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The (Infamous) XOR The (Infamous) XOR ProblemProblem

Input

0 00 11 01 1

Output

0110

a)b)c)d)

Exclusive OR (XOR)Not linearly separable

b

a c

d

0 1

1

A Neural Network SolutionX1

X2

X1

X2

1

1

-1-1

1

1 Let = 0 for all nodescopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Need for Hidden The Need for Hidden UnitsUnits

If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)

This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Hidden UnitsHidden Units

One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Advantages of Neural NetworksNeural Networks

Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems

Being supplanted by Being supplanted by SVMrsquosSVMrsquos

Can represent a rich Can represent a rich class of conceptsclass of concepts

PositivenegativePositive

Saturday 40 chance of rainSunday 25 chance of rain

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Overview of ANNsOverview of ANNs

Recurrentlink

Output units

Input units

Hidden units

error

weight

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule

bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units

bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)

error signal

E Wij

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Weight SpaceWeight Space

bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace

bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork

bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data

bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 7: Concept Learning Algorithms

As it looks in your text As it looks in your text (processing all data at once)hellip(processing all data at once)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Linear Separability Linear Separability

Consider a perceptron its output is

1 If W1X1+W2X2 + hellip + WnXn gt 0 otherwise

In terms of feature space W1X1 + W2X2 =

X2 = = W1X1

W2

-W1 W2 W2

X1+

+ + + + + + - + - - + + + + - + + - - -+ + - -+ - - -

- -

Hence can only classify examples if a ldquolinerdquo (hyerplane) can separate them

y = mx + b

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Perceptron Convergence Perceptron Convergence

TheoremTheorem (Rosemblatt 1957)(Rosemblatt 1957)

Perceptron no Hidden Units

If a set of examples is learnable the perceptron training rule will eventually find the necessary weightsHowever a perceptron can only learnrepresent linearly separable datasetcopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The (Infamous) XOR The (Infamous) XOR ProblemProblem

Input

0 00 11 01 1

Output

0110

a)b)c)d)

Exclusive OR (XOR)Not linearly separable

b

a c

d

0 1

1

A Neural Network SolutionX1

X2

X1

X2

1

1

-1-1

1

1 Let = 0 for all nodescopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Need for Hidden The Need for Hidden UnitsUnits

If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)

This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Hidden UnitsHidden Units

One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Advantages of Neural NetworksNeural Networks

Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems

Being supplanted by Being supplanted by SVMrsquosSVMrsquos

Can represent a rich Can represent a rich class of conceptsclass of concepts

PositivenegativePositive

Saturday 40 chance of rainSunday 25 chance of rain

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Overview of ANNsOverview of ANNs

Recurrentlink

Output units

Input units

Hidden units

error

weight

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule

bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units

bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)

error signal

E Wij

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Weight SpaceWeight Space

bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace

bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork

bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data

bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 8: Concept Learning Algorithms

Linear Separability Linear Separability

Consider a perceptron its output is

1 If W1X1+W2X2 + hellip + WnXn gt 0 otherwise

In terms of feature space W1X1 + W2X2 =

X2 = = W1X1

W2

-W1 W2 W2

X1+

+ + + + + + - + - - + + + + - + + - - -+ + - -+ - - -

- -

Hence can only classify examples if a ldquolinerdquo (hyerplane) can separate them

y = mx + b

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Perceptron Convergence Perceptron Convergence

TheoremTheorem (Rosemblatt 1957)(Rosemblatt 1957)

Perceptron no Hidden Units

If a set of examples is learnable the perceptron training rule will eventually find the necessary weightsHowever a perceptron can only learnrepresent linearly separable datasetcopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The (Infamous) XOR The (Infamous) XOR ProblemProblem

Input

0 00 11 01 1

Output

0110

a)b)c)d)

Exclusive OR (XOR)Not linearly separable

b

a c

d

0 1

1

A Neural Network SolutionX1

X2

X1

X2

1

1

-1-1

1

1 Let = 0 for all nodescopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Need for Hidden The Need for Hidden UnitsUnits

If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)

This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Hidden UnitsHidden Units

One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Advantages of Neural NetworksNeural Networks

Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems

Being supplanted by Being supplanted by SVMrsquosSVMrsquos

Can represent a rich Can represent a rich class of conceptsclass of concepts

PositivenegativePositive

Saturday 40 chance of rainSunday 25 chance of rain

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Overview of ANNsOverview of ANNs

Recurrentlink

Output units

Input units

Hidden units

error

weight

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule

bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units

bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)

error signal

E Wij

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Weight SpaceWeight Space

bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace

bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork

bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data

bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 9: Concept Learning Algorithms

Perceptron Convergence Perceptron Convergence

TheoremTheorem (Rosemblatt 1957)(Rosemblatt 1957)

Perceptron no Hidden Units

If a set of examples is learnable the perceptron training rule will eventually find the necessary weightsHowever a perceptron can only learnrepresent linearly separable datasetcopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The (Infamous) XOR The (Infamous) XOR ProblemProblem

Input

0 00 11 01 1

Output

0110

a)b)c)d)

Exclusive OR (XOR)Not linearly separable

b

a c

d

0 1

1

A Neural Network SolutionX1

X2

X1

X2

1

1

-1-1

1

1 Let = 0 for all nodescopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Need for Hidden The Need for Hidden UnitsUnits

If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)

This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Hidden UnitsHidden Units

One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Advantages of Neural NetworksNeural Networks

Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems

Being supplanted by Being supplanted by SVMrsquosSVMrsquos

Can represent a rich Can represent a rich class of conceptsclass of concepts

PositivenegativePositive

Saturday 40 chance of rainSunday 25 chance of rain

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Overview of ANNsOverview of ANNs

Recurrentlink

Output units

Input units

Hidden units

error

weight

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule

bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units

bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)

error signal

E Wij

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Weight SpaceWeight Space

bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace

bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork

bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data

bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 10: Concept Learning Algorithms

The (Infamous) XOR The (Infamous) XOR ProblemProblem

Input

0 00 11 01 1

Output

0110

a)b)c)d)

Exclusive OR (XOR)Not linearly separable

b

a c

d

0 1

1

A Neural Network SolutionX1

X2

X1

X2

1

1

-1-1

1

1 Let = 0 for all nodescopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Need for Hidden The Need for Hidden UnitsUnits

If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)

This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Hidden UnitsHidden Units

One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Advantages of Neural NetworksNeural Networks

Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems

Being supplanted by Being supplanted by SVMrsquosSVMrsquos

Can represent a rich Can represent a rich class of conceptsclass of concepts

PositivenegativePositive

Saturday 40 chance of rainSunday 25 chance of rain

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Overview of ANNsOverview of ANNs

Recurrentlink

Output units

Input units

Hidden units

error

weight

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule

bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units

bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)

error signal

E Wij

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Weight SpaceWeight Space

bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace

bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork

bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data

bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 11: Concept Learning Algorithms

The Need for Hidden The Need for Hidden UnitsUnits

If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)

This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Hidden UnitsHidden Units

One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Advantages of Neural NetworksNeural Networks

Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems

Being supplanted by Being supplanted by SVMrsquosSVMrsquos

Can represent a rich Can represent a rich class of conceptsclass of concepts

PositivenegativePositive

Saturday 40 chance of rainSunday 25 chance of rain

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Overview of ANNsOverview of ANNs

Recurrentlink

Output units

Input units

Hidden units

error

weight

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule

bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units

bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)

error signal

E Wij

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Weight SpaceWeight Space

bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace

bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork

bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data

bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 12: Concept Learning Algorithms

Hidden UnitsHidden Units

One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Advantages of Neural NetworksNeural Networks

Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems

Being supplanted by Being supplanted by SVMrsquosSVMrsquos

Can represent a rich Can represent a rich class of conceptsclass of concepts

PositivenegativePositive

Saturday 40 chance of rainSunday 25 chance of rain

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Overview of ANNsOverview of ANNs

Recurrentlink

Output units

Input units

Hidden units

error

weight

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule

bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units

bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)

error signal

E Wij

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Weight SpaceWeight Space

bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace

bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork

bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data

bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 13: Concept Learning Algorithms

Advantages of Advantages of Neural NetworksNeural Networks

Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems

Being supplanted by Being supplanted by SVMrsquosSVMrsquos

Can represent a rich Can represent a rich class of conceptsclass of concepts

PositivenegativePositive

Saturday 40 chance of rainSunday 25 chance of rain

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Overview of ANNsOverview of ANNs

Recurrentlink

Output units

Input units

Hidden units

error

weight

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule

bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units

bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)

error signal

E Wij

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Weight SpaceWeight Space

bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace

bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork

bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data

bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 14: Concept Learning Algorithms

Overview of ANNsOverview of ANNs

Recurrentlink

Output units

Input units

Hidden units

error

weight

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule

bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units

bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)

error signal

E Wij

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Weight SpaceWeight Space

bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace

bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork

bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data

bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 15: Concept Learning Algorithms

BackpropagationBackpropagation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

BackpropagationBackpropagation

bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule

bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units

bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)

error signal

E Wij

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Weight SpaceWeight Space

bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace

bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork

bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data

bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 16: Concept Learning Algorithms

BackpropagationBackpropagation

bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule

bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units

bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)

error signal

E Wij

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Weight SpaceWeight Space

bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace

bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork

bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data

bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 17: Concept Learning Algorithms

Weight SpaceWeight Space

bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace

bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork

bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data

bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 18: Concept Learning Algorithms

Gradient Descent in Weight Gradient Descent in Weight SpaceSpace

E

W1

W2

Ew

W1

W2

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 19: Concept Learning Algorithms

The Gradient-Descent The Gradient-Descent RuleRule

E(w) [ ]Ew0

Ew1

Ew

2

EwN

hellip hellip hellip _

The ldquogradien

trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space

E

W1

W2

w = - E ( w )

or wi = - Ewi

ldquodeltardquo = change to

w

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 20: Concept Learning Algorithms

ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop

bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)

bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line

Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 21: Concept Learning Algorithms

ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)

BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace

ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)

E

wi

w1

w3w2

w

w1

w2

w3

Final locations in space need not be the same for BATCH and ON-LINE w

N

ote

w

iB

ATC

H

w

i O

N-L

INE

for

i gt

1

E

w

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 22: Concept Learning Algorithms

Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units

bias

output

input

output i= F(weight ij x output j)

Where

F(input i) =

j

1

1+e -(input i ndash bias i)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 23: Concept Learning Algorithms

Differentiating the Differentiating the Logistic FunctionLogistic Function

out i =

1

1 + e - ( wji x outj)

F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0

12

Wj x outj

F(wgtrsquoed in)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 24: Concept Learning Algorithms

Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)

11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22

22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22

33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output

kk)])))]))22

DetermineDetermine

recallrecall

BP CalculationsBP Calculations

Error Wij

Error Wjk

= (use equation 2)

= (use equation 3)

See Table 42 in Mitchell for resultswxy = - ( E wxy )

k j i

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 25: Concept Learning Algorithms

Derivation in MitchellDerivation in Mitchell

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 26: Concept Learning Algorithms

Some NotationSome Notation

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 27: Concept Learning Algorithms

By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 28: Concept Learning Algorithms

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 29: Concept Learning Algorithms

Also remember this for later ndashWersquoll call it -δj

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 30: Concept Learning Algorithms

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 31: Concept Learning Algorithms

Remember thatoj is xkj outputfrom j is inputto k

Remember netk =wk1 xk1 + hellip+ wkN xkN

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 32: Concept Learning Algorithms

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 33: Concept Learning Algorithms

Using BP to Train Using BP to Train ANNrsquosANNrsquos

11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])

22 Randomize order of Randomize order of training examples for training examples for each doeach do

a)a) Propagate activity Propagate activity forwardforward to output unitsto output units

k j i

outi = F( wij x outj )j

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 34: Concept Learning Algorithms

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits

c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits

d)d) Update weightsUpdate weights

i = F rsquo( neti ) x (Teacheri-outi)

ij = F rsquo( netj ) x ( wij x

i)

wij = x i x out j

wjk = x j x out k

F rsquo( netj ) = F(neti) neti

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 35: Concept Learning Algorithms

Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)

33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)

Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)

44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 36: Concept Learning Algorithms

Advantages of Neural Advantages of Neural NetworksNetworks

bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)

bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with

numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs

bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 37: Concept Learning Algorithms

DisadvantagesDisadvantages

bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of

hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 38: Concept Learning Algorithms

Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as

modifying modifying weights on data points weights on data points rather rather than featuresthan features

bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors

bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)

Page 39: Concept Learning Algorithms

Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid

copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010

CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)