Concept Learning Algorithms
• Come from many different theoretical backgrounds and motivations
• Behaviors related to human learning
• Some biologically inspired, others not
[Diagram: a spectrum from Biologically-Inspired (Neural Networks) to Utilitarian, i.e. "just get a good result" (Nearest Neighbor, Tree Learners)]
© Jude Shavlik 2006, David Page 2010 – CS 760: Machine Learning (UW-Madison)
Today's Topics
• Perceptrons
• Artificial Neural Networks (ANNs)
• Backpropagation
• Weight Space
Connectionism
PERCEPTRONS (Rosenblatt, 1957)
• among earliest work in machine learning
• died out in 1960's (Minsky & Papert book)
[Diagram: unit i receives inputs from units j, k, l via weights w_ij, w_ik, w_il]

   Output_i = F( W_ij · output_j + W_ik · output_k + W_il · output_l )
Perceptron as Classifier
• Output for example X is sign(W·X), where sign is -1 or +1 (or use threshold θ and 0/1)
• Candidate hypotheses: real-valued weight vectors
• Training: update W for each misclassified example X (target class t, predicted o) by
     W_i ← W_i + η (t – o) X_i
• Here η is the learning-rate parameter
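A minimal sketch of this training rule in Python (illustrative helper, assuming numeric feature vectors, ±1 labels, and no separate threshold):

import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
    X: (n_examples, n_features) array; t: array of target labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) > 0 else -1   # sign(W . X)
            if o != target:                      # update only misclassified examples
                w += eta * (target - o) * x
    return w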
Gradient Descent for the Perceptron
(Assume no threshold for now, and start with a common error measure)

   Error ≡ ½ (t – o)²

where o is the network's output and t is the teacher's answer (a constant w.r.t. the weights).

   ΔW_k ≡ -η ∂E/∂W_k

   ∂E/∂W_k = (t – o) · ∂(t – o)/∂W_k = -(t – o) · ∂o/∂W_k

Remember: o = W·X
Continuation of Derivation

   ∂E/∂W_k = -(t – o) · ∂/∂W_k ( Σ_k w_k x_k )     [stick in formula for output]
           = -(t – o) x_k

   So ΔW_k = η (t – o) x_k      The Perceptron Rule

Also known as the delta rule, and other names (with small variations in the calculation).
As it looks in your text (processing all data at once)…
Linear Separability
Consider a perceptron; its output is
   1 if W1X1 + W2X2 + … + WnXn > θ, 0 otherwise

In terms of feature space, the decision boundary is W1X1 + W2X2 = θ, i.e.
   X2 = (θ – W1X1) / W2 = (-W1/W2) X1 + θ/W2     (compare y = mx + b)

[Plot: positive (+) and negative (-) examples scattered in feature space, with a separating line]

Hence, can only classify examples if a "line" (hyperplane) can separate them.
Perceptron Convergence Theorem (Rosenblatt, 1957)
Perceptron = no hidden units

If a set of examples is learnable, the perceptron training rule will eventually find the necessary weights.
However, a perceptron can only learn/represent linearly separable datasets.
The (Infamous) XOR Problem

      Input    Output
  a)   0 0       0
  b)   0 1       1
  c)   1 0       1
  d)   1 1       0

Exclusive OR (XOR): not linearly separable.
[Plot: points a, b, c, d at the corners of the unit square; no single line separates the 1's from the 0's]

A Neural Network Solution
[Diagram: inputs X1 and X2 feed two hidden units, which feed one output unit; the connection weights shown are 1, 1, -1, -1, 1, 1. Let θ = 0 for all nodes.]
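The diagram's weights (1, 1, -1, -1, 1, 1 with θ = 0) admit a wiring that computes XOR; the one assumed in this sketch is hidden weights (1, -1) and (-1, 1) and output weights (1, 1), which is an assumption about the exact connections, not a reading of the slide:

def step(net, theta=0.0):
    """Threshold unit: fires (1) when its net input exceeds theta."""
    return 1 if net > theta else 0

def xor_net(x1, x2):
    # Hidden layer: detectors for "x1 and not x2" and "x2 and not x1"
    h1 = step(1 * x1 + -1 * x2)
    h2 = step(-1 * x1 + 1 * x2)
    # Output unit fires if either hidden unit fires
    return step(1 * h1 + 1 * h2)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_net(a, b))   # prints 0, 1, 1, 0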
The Need for Hidden Units
If there is one layer of enough hidden units (possibly 2^N for Boolean functions), the input can be recoded (N = number of input units).
This recoding allows any mapping to be represented (Minsky & Papert).
Question: How to provide an error signal to the interior units?
Hidden Units
One view: allow a system to create its own internal representation – for which problem solving is easy.
[Diagram: a multi-layer network; the output layer is labeled "a perceptron"]
Advantages of Neural Networks
• Provide best predictive accuracy for some problems (though being supplanted by SVMs)
• Can represent a rich class of concepts
  (e.g., positive/negative labels, or "Saturday: 40% chance of rain, Sunday: 25% chance of rain")
Overview of ANNs
[Diagram: a network of input units, hidden units, and output units, with a labeled weight, an error signal, and a recurrent link]
Backpropagation
Backpropagation
• Backpropagation involves a generalization of the perceptron rule
• Rumelhart, Parker, and Le Cun (and Bryson & Ho 1969, Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of interior ("hidden") units
• Derivation involves partial derivatives (hence, threshold function must be differentiable)
[Diagram: the error signal ∂E/∂W_ij propagated back to an interior weight]
Weight Space
• Given a neural-network layout, the weights are free parameters that define a space
• Each point in this weight space specifies a network
• Associated with each point is an error rate, E, over the training data
• Backprop performs gradient descent in weight space
Gradient Descent in Weight Space
[Figure: the error surface E over weights W1 and W2, with the gradient ∇E(w) and a downhill step]
The Gradient-Descent Rule

   ∇E(w) ≡ [ ∂E/∂w_0, ∂E/∂w_1, ∂E/∂w_2, …, ∂E/∂w_N ]

The "gradient": this is an N+1-dimensional vector (i.e., the 'slope' in weight space).
Since we want to reduce errors, we want to go "downhill".
We'll take a finite step in weight space:

   Δw = -η ∇E(w)      or      Δw_i = -η ∂E/∂w_i      ("delta" = change to w)

[Figure: a downhill step on the error surface E over W1, W2]
"On Line" vs. "Batch" Backprop
• Technically, we should look at the error gradient for the entire training set before taking a step in weight space ("batch" Backprop)
• However, as presented, we take a step after each example ("on-line" Backprop)
  • Much faster convergence
  • Can reduce overfitting (since on-line Backprop is "noisy" gradient descent)
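For concreteness, a sketch of the two regimes for a single linear unit using the delta rule from earlier (illustrative code, not from the course):

import numpy as np

def batch_epoch(w, X, t, eta):
    """BATCH: accumulate delta-w over every example, then 'move' once."""
    delta_w = np.zeros_like(w)
    for x, target in zip(X, t):
        o = np.dot(w, x)
        delta_w += eta * (target - o) * x    # sum of per-example delta-w vectors
    return w + delta_w

def online_epoch(w, X, t, eta):
    """ON-LINE (stochastic gradient descent): 'move' after each example."""
    for x, target in zip(X, t):
        o = np.dot(w, x)
        w = w + eta * (target - o) * x
    return w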
"On Line" vs. "Batch" BP (continued)
• BATCH – add Δw vectors for every training example, then 'move' in weight space
• ON-LINE – "move" after each example (a.k.a. stochastic gradient descent)

[Figure: two trajectories through weight space (steps Δw1, Δw2, Δw3, …); the batch and on-line paths differ]

Final locations in weight space need not be the same for BATCH and ON-LINE.
Note: w_i(BATCH) ≠ w_i(ON-LINE) for i > 1
Need Derivatives: Replace Step (Threshold) by Sigmoid
Individual units:
[Diagram: a unit with inputs, a bias, and an output]

   output_i = F( Σ_j weight_ij × output_j )

where

   F(input_i) = 1 / (1 + e^-(input_i – bias_i))
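A small sketch of such a sigmoid unit in Python (names are illustrative):

import numpy as np

def sigmoid(net, bias=0.0):
    """Logistic activation F(net) = 1 / (1 + exp(-(net - bias)))."""
    return 1.0 / (1.0 + np.exp(-(net - bias)))

def unit_output(weights, inputs, bias=0.0):
    """output_i = F(sum_j w_ij * output_j), with a smooth (differentiable) F."""
    return sigmoid(np.dot(weights, inputs), bias)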
Differentiating the Logistic Function

   out_i = 1 / (1 + e^-( Σ_j w_ji × out_j ))

   F'(wgt'ed in) = out_i (1 – out_i)

[Plot: the sigmoid F(wgt'ed in) vs. Σ_j w_j × out_j, rising from 0 through ½ toward 1]
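The derivative quoted above follows directly from the quotient/chain rule; this is the standard algebra the backup slide at the end refers to, filled in here since that slide's content did not survive:

\[
F(x) = \frac{1}{1+e^{-x}}, \qquad
F'(x) = \frac{e^{-x}}{(1+e^{-x})^{2}}
      = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}}
      = F(x)\,\bigl(1 - F(x)\bigr).
\]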
BP Calculations
Assume one layer of hidden units (std. topology); layers are indexed k → j → i.

1. Error ≡ ½ Σ_i ( Teacher_i – Output_i )²
2.       = ½ Σ_i ( Teacher_i – F( Σ_j [W_ij × Output_j] ) )²
3.       = ½ Σ_i ( Teacher_i – F( Σ_j [W_ij × F( Σ_k W_jk × Output_k )] ) )²

Determine   ∂Error/∂W_ij = (use equation 2)
            ∂Error/∂W_jk = (use equation 3)

Recall Δw_xy = -η (∂E/∂w_xy). See Table 4.2 in Mitchell for results.
Derivation in Mitchell
[The equations on the following derivation slides were images; only the titles and margin notes survive]
Some Notation
By Chain Rule (since W_ji influences the rest of the network only by its influence on Net_j)…
Also remember this for later – we'll call it -δ_j
Remember that o_j is x_kj: output from j is input to k.
Remember: net_k = w_k1 x_k1 + … + w_kN x_kN
Using BP to Train ANNs
1. Initialize weights & bias to small random values (e.g., in [-0.3, 0.3])
2. Randomize order of training examples; for each, do:
   a) Propagate activity forward to output units (layers k → j → i):
         out_i = F( Σ_j w_ij × out_j )
Using BP to Train ANNs (continued)
   b) Compute "deviation" for output units:
         δ_i = F'(net_i) × (Teacher_i – out_i)
   c) Compute "deviation" for hidden units:
         δ_j = F'(net_j) × ( Σ_i w_ij × δ_i )
   d) Update weights:
         Δw_ij = η × δ_i × out_j
         Δw_jk = η × δ_j × out_k

   where F'(net) ≡ ∂F(net)/∂net
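A compact sketch of steps (a)–(d) for one hidden layer in Python (illustrative names, not the course's code; sigmoid units as above, so F'(net) = out(1 – out); biases omitted):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_step(x, teacher, W_jk, W_ij, eta=0.1):
    """One on-line backprop step for a k -> j -> i network."""
    # a) forward pass
    out_j = sigmoid(W_jk @ x)            # hidden activations
    out_i = sigmoid(W_ij @ out_j)        # output activations
    # b) "deviation" for output units: delta_i = F'(net_i) * (Teacher_i - out_i)
    delta_i = out_i * (1 - out_i) * (teacher - out_i)
    # c) "deviation" for hidden units: delta_j = F'(net_j) * sum_i w_ij * delta_i
    delta_j = out_j * (1 - out_j) * (W_ij.T @ delta_i)
    # d) update weights: delta_w = eta * delta * input-to-that-layer
    W_ij = W_ij + eta * np.outer(delta_i, out_j)
    W_jk = W_jk + eta * np.outer(delta_j, x)
    return W_jk, W_ij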
Using BP to Train ANNs (continued)
3. Repeat until training-set error rate is small enough (or until tuning-set error rate begins to rise – see later slide)
   Should use "early stopping" (i.e., minimize error on the tuning set; more details later)
4. Measure accuracy on the test set to estimate generalization (future accuracy)
Advantages of Neural Networks
• Universal representation (provided enough hidden units)
• Less greedy than tree learners
• In practice, good for problems with numeric inputs, and can also handle numeric outputs
• PHD: for many years, the best protein secondary-structure predictor
Disadvantages
• Models not very comprehensible
• Long training times
• Very sensitive to number of hidden units… as a result, largely being supplanted by SVMs (SVMs take a very different approach to getting non-linearity)
Looking Ahead
• The perceptron rule can also be thought of as modifying weights on data points, rather than on features
• Instead of processing all data (batch) vs. one at a time, could imagine processing 2 data points at a time, adjusting their relative weights based on their relative errors
• This is what Platt's SMO does (the SVM implementation in Weka)
Backup Slide to help with Derivative of Sigmoid
[The algebra on this slide was an image; see the derivation given after the logistic-function slide above]
TodayrsquosTodayrsquos TopicsTopics
bull PerceptronsPerceptronsbull Artificial Neural Networks (ANNs)Artificial Neural Networks (ANNs)bull BackpropagationBackpropagationbull Weight SpaceWeight Space
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ConnectionismConnectionism
PERCEPTRONS (Rosenblatt 1957)PERCEPTRONS (Rosenblatt 1957)bull among earliest work in machine among earliest work in machine
learninglearningbull died out in 1960rsquos (Minsky amp Papert died out in 1960rsquos (Minsky amp Papert
book)book)J
K
L
I
wij
wik
wil
Outputi = F(Wij outputj + Wik outputk + Wil outputl )
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Perceptron as Perceptron as ClassifierClassifierbull Output for N example X is sign(WOutput for N example X is sign(WX) X)
where sign is -1 or +1 (or use threshold where sign is -1 or +1 (or use threshold and 01)and 01)
bull Candidate Hypotheses real-valued weight Candidate Hypotheses real-valued weight vectorsvectors
bull Training Update W for each misclassified Training Update W for each misclassified example X (target class example X (target class tt predicted predicted oo) by) bybull WWii W Wii + + ((tt--oo)X)Xii
bull Here Here is learning rate parameteris learning rate parameter
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Gradient Descent Gradient Descent for the Perceptronfor the Perceptron(Assume no threshold for now and (Assume no threshold for now and
start with a common error measure)start with a common error measure) Error Error frac12 ( t ndash o ) frac12 ( t ndash o )
2
Networkrsquos output
Teacherrsquos answer (a constant wrt the weights) EE
WWkk
ΔΔWWjj - η
= (t ndash o)
EE
WWkk
(t ndash o)(t ndash o)
WWkk
= -(t ndash o) oo
WW kk
Remember o = WmiddotX
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Continuation of Continuation of DerivationDerivation
EE WWkk
= -(t ndash o) WWkk
(( sumsumk k ww k k x x kk))
= -(t ndash o) x k
So ΔWk = η (t ndash o) xk The Perceptron Rule
Stick in formula for
output
Also known as the delta rule and other names (with small variations in calc)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
As it looks in your text As it looks in your text (processing all data at once)hellip(processing all data at once)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Linear Separability Linear Separability
Consider a perceptron its output is
1 If W1X1+W2X2 + hellip + WnXn gt 0 otherwise
In terms of feature space W1X1 + W2X2 =
X2 = = W1X1
W2
-W1 W2 W2
X1+
+ + + + + + - + - - + + + + - + + - - -+ + - -+ - - -
- -
Hence can only classify examples if a ldquolinerdquo (hyerplane) can separate them
y = mx + b
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Perceptron Convergence Perceptron Convergence
TheoremTheorem (Rosemblatt 1957)(Rosemblatt 1957)
Perceptron no Hidden Units
If a set of examples is learnable the perceptron training rule will eventually find the necessary weightsHowever a perceptron can only learnrepresent linearly separable datasetcopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The (Infamous) XOR The (Infamous) XOR ProblemProblem
Input
0 00 11 01 1
Output
0110
a)b)c)d)
Exclusive OR (XOR)Not linearly separable
b
a c
d
0 1
1
A Neural Network SolutionX1
X2
X1
X2
1
1
-1-1
1
1 Let = 0 for all nodescopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Need for Hidden The Need for Hidden UnitsUnits
If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)
This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Hidden UnitsHidden Units
One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Advantages of Neural NetworksNeural Networks
Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems
Being supplanted by Being supplanted by SVMrsquosSVMrsquos
Can represent a rich Can represent a rich class of conceptsclass of concepts
PositivenegativePositive
Saturday 40 chance of rainSunday 25 chance of rain
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Overview of ANNsOverview of ANNs
Recurrentlink
Output units
Input units
Hidden units
error
weight
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule
bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units
bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)
error signal
E Wij
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Weight SpaceWeight Space
bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace
bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork
bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data
bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Gradient Descent in Weight Gradient Descent in Weight SpaceSpace
E
W1
W2
Ew
W1
W2
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Gradient-Descent The Gradient-Descent RuleRule
E(w) [ ]Ew0
Ew1
Ew
2
EwN
hellip hellip hellip _
The ldquogradien
trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space
E
W1
W2
w = - E ( w )
or wi = - Ewi
ldquodeltardquo = change to
w
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop
bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)
bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line
Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)
BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace
ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)
E
wi
w1
w3w2
w
w1
w2
w3
Final locations in space need not be the same for BATCH and ON-LINE w
N
ote
w
iB
ATC
H
w
i O
N-L
INE
for
i gt
1
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ConnectionismConnectionism
PERCEPTRONS (Rosenblatt 1957)PERCEPTRONS (Rosenblatt 1957)bull among earliest work in machine among earliest work in machine
learninglearningbull died out in 1960rsquos (Minsky amp Papert died out in 1960rsquos (Minsky amp Papert
book)book)J
K
L
I
wij
wik
wil
Outputi = F(Wij outputj + Wik outputk + Wil outputl )
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Perceptron as Perceptron as ClassifierClassifierbull Output for N example X is sign(WOutput for N example X is sign(WX) X)
where sign is -1 or +1 (or use threshold where sign is -1 or +1 (or use threshold and 01)and 01)
bull Candidate Hypotheses real-valued weight Candidate Hypotheses real-valued weight vectorsvectors
bull Training Update W for each misclassified Training Update W for each misclassified example X (target class example X (target class tt predicted predicted oo) by) bybull WWii W Wii + + ((tt--oo)X)Xii
bull Here Here is learning rate parameteris learning rate parameter
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Gradient Descent Gradient Descent for the Perceptronfor the Perceptron(Assume no threshold for now and (Assume no threshold for now and
start with a common error measure)start with a common error measure) Error Error frac12 ( t ndash o ) frac12 ( t ndash o )
2
Networkrsquos output
Teacherrsquos answer (a constant wrt the weights) EE
WWkk
ΔΔWWjj - η
= (t ndash o)
EE
WWkk
(t ndash o)(t ndash o)
WWkk
= -(t ndash o) oo
WW kk
Remember o = WmiddotX
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Continuation of Continuation of DerivationDerivation
EE WWkk
= -(t ndash o) WWkk
(( sumsumk k ww k k x x kk))
= -(t ndash o) x k
So ΔWk = η (t ndash o) xk The Perceptron Rule
Stick in formula for
output
Also known as the delta rule and other names (with small variations in calc)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
As it looks in your text As it looks in your text (processing all data at once)hellip(processing all data at once)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Linear Separability Linear Separability
Consider a perceptron its output is
1 If W1X1+W2X2 + hellip + WnXn gt 0 otherwise
In terms of feature space W1X1 + W2X2 =
X2 = = W1X1
W2
-W1 W2 W2
X1+
+ + + + + + - + - - + + + + - + + - - -+ + - -+ - - -
- -
Hence can only classify examples if a ldquolinerdquo (hyerplane) can separate them
y = mx + b
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Perceptron Convergence Perceptron Convergence
TheoremTheorem (Rosemblatt 1957)(Rosemblatt 1957)
Perceptron no Hidden Units
If a set of examples is learnable the perceptron training rule will eventually find the necessary weightsHowever a perceptron can only learnrepresent linearly separable datasetcopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The (Infamous) XOR The (Infamous) XOR ProblemProblem
Input
0 00 11 01 1
Output
0110
a)b)c)d)
Exclusive OR (XOR)Not linearly separable
b
a c
d
0 1
1
A Neural Network SolutionX1
X2
X1
X2
1
1
-1-1
1
1 Let = 0 for all nodescopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Need for Hidden The Need for Hidden UnitsUnits
If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)
This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Hidden UnitsHidden Units
One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Advantages of Neural NetworksNeural Networks
Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems
Being supplanted by Being supplanted by SVMrsquosSVMrsquos
Can represent a rich Can represent a rich class of conceptsclass of concepts
PositivenegativePositive
Saturday 40 chance of rainSunday 25 chance of rain
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Overview of ANNsOverview of ANNs
Recurrentlink
Output units
Input units
Hidden units
error
weight
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule
bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units
bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)
error signal
E Wij
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Weight SpaceWeight Space
bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace
bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork
bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data
bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Gradient Descent in Weight Gradient Descent in Weight SpaceSpace
E
W1
W2
Ew
W1
W2
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Gradient-Descent The Gradient-Descent RuleRule
E(w) [ ]Ew0
Ew1
Ew
2
EwN
hellip hellip hellip _
The ldquogradien
trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space
E
W1
W2
w = - E ( w )
or wi = - Ewi
ldquodeltardquo = change to
w
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop
bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)
bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line
Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)
BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace
ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)
E
wi
w1
w3w2
w
w1
w2
w3
Final locations in space need not be the same for BATCH and ON-LINE w
N
ote
w
iB
ATC
H
w
i O
N-L
INE
for
i gt
1
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Perceptron as Perceptron as ClassifierClassifierbull Output for N example X is sign(WOutput for N example X is sign(WX) X)
where sign is -1 or +1 (or use threshold where sign is -1 or +1 (or use threshold and 01)and 01)
bull Candidate Hypotheses real-valued weight Candidate Hypotheses real-valued weight vectorsvectors
bull Training Update W for each misclassified Training Update W for each misclassified example X (target class example X (target class tt predicted predicted oo) by) bybull WWii W Wii + + ((tt--oo)X)Xii
bull Here Here is learning rate parameteris learning rate parameter
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Gradient Descent Gradient Descent for the Perceptronfor the Perceptron(Assume no threshold for now and (Assume no threshold for now and
start with a common error measure)start with a common error measure) Error Error frac12 ( t ndash o ) frac12 ( t ndash o )
2
Networkrsquos output
Teacherrsquos answer (a constant wrt the weights) EE
WWkk
ΔΔWWjj - η
= (t ndash o)
EE
WWkk
(t ndash o)(t ndash o)
WWkk
= -(t ndash o) oo
WW kk
Remember o = WmiddotX
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Continuation of Continuation of DerivationDerivation
EE WWkk
= -(t ndash o) WWkk
(( sumsumk k ww k k x x kk))
= -(t ndash o) x k
So ΔWk = η (t ndash o) xk The Perceptron Rule
Stick in formula for
output
Also known as the delta rule and other names (with small variations in calc)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
As it looks in your text As it looks in your text (processing all data at once)hellip(processing all data at once)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Linear Separability Linear Separability
Consider a perceptron its output is
1 If W1X1+W2X2 + hellip + WnXn gt 0 otherwise
In terms of feature space W1X1 + W2X2 =
X2 = = W1X1
W2
-W1 W2 W2
X1+
+ + + + + + - + - - + + + + - + + - - -+ + - -+ - - -
- -
Hence can only classify examples if a ldquolinerdquo (hyerplane) can separate them
y = mx + b
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Perceptron Convergence Perceptron Convergence
TheoremTheorem (Rosemblatt 1957)(Rosemblatt 1957)
Perceptron no Hidden Units
If a set of examples is learnable the perceptron training rule will eventually find the necessary weightsHowever a perceptron can only learnrepresent linearly separable datasetcopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The (Infamous) XOR Problem

  Input     Output
  a) 0 0      0
  b) 0 1      1
  c) 1 0      1
  d) 1 1      0

Exclusive OR (XOR): not linearly separable.

[figure: the four points a–d plotted at the corners of the unit square; no single line separates {b, c} from {a, d}]

A Neural Network Solution
[figure: a two-layer network on inputs X1, X2 with connection weights 1, 1, −1, −1, 1, 1; let the threshold θ = 0 for all nodes]
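For concreteness, here is one hand-wired two-layer network that computes XOR in Python (the particular weights and thresholds below are illustrative choices for this sketch and are not necessarily the ones drawn on the slide):

  def step(z):
      return 1 if z > 0 else 0

  def xor_net(x1, x2):
      h_or  = step(x1 + x2 - 0.5)           # hidden unit that fires for OR
      h_and = step(x1 + x2 - 1.5)           # hidden unit that fires for AND
      return step(h_or - 2 * h_and - 0.5)   # output: OR and not AND, i.e. XOR

  for a in (0, 1):
      for b in (0, 1):
          print(a, b, xor_net(a, b))        # prints 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0

The point is simply that one hidden layer lets the network re-code the inputs so that the output unit faces a linearly separable problem.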
The Need for Hidden Units

If there is one layer of enough hidden units (possibly 2^N for Boolean functions), the input can be recoded (N = number of input units).
This recoding allows any mapping to be represented (Minsky & Papert).

Question: How do we provide an error signal to the interior units?
Hidden Units

One view: allow a system to create its own internal representation, one for which problem solving is easy.
[figure: a network whose top layer is labeled "A perceptron"]
Advantages of Neural Networks

• Provide the best predictive accuracy for some problems (though they are being supplanted by SVMs)
• Can represent a rich class of concepts, e.g. not just positive/negative labels but graded outputs such as "Saturday: 40% chance of rain; Sunday: 25% chance of rain"
Overview of ANNs

[figure: a layered network of input units, hidden units, and output units connected by weights; error signals flow back from the outputs, and a recurrent link is also shown]
Backpropagation

• Backpropagation involves a generalization of the perceptron rule
• Rumelhart, Parker, and Le Cun (and Bryson & Ho, 1969; Werbos, 1974) independently developed (1985) a technique for determining how to adjust the weights of interior ("hidden") units
• The derivation involves partial derivatives (hence, the threshold function must be differentiable)

[figure: the error signal ∂E/∂W_ij propagated back into the network]
Weight Space

• Given a neural-network layout, the weights are free parameters that define a space
• Each point in this weight space specifies a network
• Associated with each point is an error rate, E, over the training data
• Backprop performs gradient descent in weight space
Gradient Descent in Weight Space

[figure: the error surface E plotted over two weights, W1 and W2, with the gradient ∇E_w indicating the uphill direction]
The Gradient-Descent Rule

  ∇E(w) ≡ [ ∂E/∂w_0, ∂E/∂w_1, ∂E/∂w_2, …, ∂E/∂w_N ]

This is the "gradient", an N+1-dimensional vector (i.e., the 'slope' in weight space). Since we want to reduce errors, we want to go "downhill". We'll take a finite step in weight space:

  Δw = − η ∇E(w)        or        Δw_i = − η ∂E/∂w_i

("delta" = the change to w)

[figure: a downhill step on the error surface over W1 and W2]
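A toy numerical version of this rule (a sketch; the quadratic error surface below is invented purely so the behaviour is easy to verify):

  import numpy as np

  def E(w):                      # toy error surface with its minimum at w = (3, 3)
      return np.sum((w - 3.0) ** 2)

  def grad_E(w):                 # its gradient
      return 2.0 * (w - 3.0)

  w = np.array([0.0, 6.0])
  for _ in range(50):
      w = w - 0.1 * grad_E(w)    # delta_w = -eta * grad E(w), with eta = 0.1
  print(w)                       # converges toward [3. 3.]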
"On-Line" vs. "Batch" Backprop

• Technically, we should look at the error gradient for the entire training set before taking a step in weight space ("batch" Backprop)
• However, as presented, we take a step after each example ("on-line" Backprop)
  • Much faster convergence
  • Can reduce overfitting (since on-line Backprop is "noisy" gradient descent)
"On-Line" vs. "Batch" BP (continued)

• BATCH – add up the Δw vectors for every training example, then 'move' in weight space
• ON-LINE – "move" after each example (a.k.a. stochastic gradient descent)

[figure: two trajectories through weight space; note that Δw_i(BATCH) ≠ Δw_i(ON-LINE) for i > 1, and the final locations in weight space need not be the same for BATCH and ON-LINE]
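In code, the difference is only where the update is applied (a sketch reusing the delta rule from earlier; function and variable names are illustrative):

  import numpy as np

  def batch_epoch(W, X, T, eta=0.1):
      total = np.zeros_like(W)
      for x, t in zip(X, T):
          total += eta * (t - W @ x) * x      # accumulate delta-w over every example ...
      return W + total                         # ... then take a single step

  def online_epoch(W, X, T, eta=0.1):
      for x, t in zip(X, T):
          W = W + eta * (t - W @ x) * x        # step immediately after each example
      return W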
Need Derivatives: Replace Step (Threshold) by Sigmoid

Individual units:

  output_i = F( Σ_j weight_ij × output_j )

where

  F(input_i) = 1 / (1 + e^−(input_i − bias_i))

[figure: a single unit with its inputs, bias, and output]
Differentiating the Logistic Function

  out_i = 1 / (1 + e^−(Σ_j w_ji × out_j))

  F′(weighted in) = out_i (1 − out_i)

[figure: the logistic curve F(weighted in), rising from 0 toward 1 and passing through ½ where Σ_j W_j × out_j = 0]
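A quick numerical check of that identity (a sketch; assumes NumPy):

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  z = 0.7
  out = sigmoid(z)
  analytic = out * (1.0 - out)                                # F'(z) = out (1 - out)
  numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6    # finite-difference estimate
  print(analytic, numeric)                                    # the two agree to ~6 decimals

This identity is what makes the backprop "deviation" terms on the following slides cheap to compute: the derivative needs only the unit's own output.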
BP Calculations

Assume one layer of hidden units (standard topology), with layers k → j → i:

  1. Error ≡ ½ Σ ( Teacher_i − Output_i )²
  2.       = ½ Σ ( Teacher_i − F( Σ [ W_ij × Output_j ] ) )²
  3.       = ½ Σ ( Teacher_i − F( Σ [ W_ij × F( Σ W_jk × Output_k ) ] ) )²

Determine ∂Error/∂W_ij (use equation 2) and ∂Error/∂W_jk (use equation 3).

Recall Δw_xy = − η ( ∂E/∂w_xy ). See Table 4.2 in Mitchell for the results.
Derivation in Mitchell
Some Notation
By Chain Rule (since W_ji influences the rest of the network only by its influence on Net_j)…
Also remember this for later – we'll call it −δ_j
Remember that o_j is x_kj: the output from j is the input to k.
Remember net_k = w_k1 x_k1 + … + w_kN x_kN
Using BP to Train ANNs

1. Initialize the weights & biases to small random values (e.g., in [-0.3, 0.3])
2. Randomize the order of the training examples; for each one do:
   a) Propagate activity forward to the output units (layers k → j → i):

      out_i = F( Σ_j w_ij × out_j )
Using BP to Train ANNs (continued)

   b) Compute the "deviation" for the output units:

      δ_i = F′(net_i) × (Teacher_i − out_i)

   c) Compute the "deviation" for the hidden units:

      δ_j = F′(net_j) × Σ_i ( w_ij × δ_i )

   d) Update the weights:

      Δw_ij = η × δ_i × out_j
      Δw_jk = η × δ_j × out_k

   where F′(net_i) = ∂F(net_i)/∂net_i
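Steps (a)–(d) for a single training example can be written out as a minimal NumPy sketch (one hidden layer of sigmoid units; bias terms are omitted for brevity and all names are illustrative):

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def backprop_step(x, teacher, W_jk, W_ij, eta=0.1):
      # (a) forward pass through layers k (inputs) -> j (hidden) -> i (outputs)
      out_j = sigmoid(W_jk @ x)
      out_i = sigmoid(W_ij @ out_j)
      # (b) deviation for output units: delta_i = F'(net_i) * (teacher_i - out_i)
      delta_i = out_i * (1 - out_i) * (teacher - out_i)
      # (c) deviation for hidden units: delta_j = F'(net_j) * sum_i w_ij * delta_i
      delta_j = out_j * (1 - out_j) * (W_ij.T @ delta_i)
      # (d) weight updates: delta_w = eta * delta * upstream output
      W_ij = W_ij + eta * np.outer(delta_i, out_j)
      W_jk = W_jk + eta * np.outer(delta_j, x)
      return W_jk, W_ij

Here W_jk has one row per hidden unit and one column per input, and W_ij one row per output unit and one column per hidden unit, so the matrix products implement the Σ terms on the slides.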
Using BP to Train ANNs (continued)

3. Repeat until the training-set error rate is small enough (or until the tuning-set error rate begins to rise – see later slide).
   Should use "early stopping" (i.e., minimize error on the tuning set; more details later)
4. Measure accuracy on the test set to estimate generalization (future accuracy)
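The stopping criterion in step 3 might be organized like this sketch (train_one_epoch and error_rate stand in for routines the slides do not spell out; patience and max_epochs are illustrative parameters):

  def train_with_early_stopping(W, train_one_epoch, error_rate,
                                train_set, tune_set,
                                max_epochs=200, patience=10):
      best_err, best_W, best_epoch = float("inf"), W, 0
      for epoch in range(max_epochs):
          W = train_one_epoch(W, train_set)      # one pass of on-line Backprop
          err = error_rate(W, tune_set)          # early stopping watches the TUNING set
          if err < best_err:
              best_err, best_W, best_epoch = err, W, epoch
          if epoch - best_epoch >= patience:     # tuning-set error no longer improving
              break
      return best_W                              # evaluate this network on the test set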
Advantages of Neural Networks

• Universal representation (provided enough hidden units)
• Less greedy than tree learners
• In practice, good for problems with numeric inputs; can also handle numeric outputs
• PHD: for many years the best protein secondary-structure predictor
Disadvantages

• Models are not very comprehensible
• Long training times
• Very sensitive to the number of hidden units… as a result, largely being supplanted by SVMs (which take a very different approach to getting non-linearity)
Looking Ahead

• The perceptron rule can also be thought of as modifying weights on data points rather than on features
• Instead of processing all the data (batch) vs. one example at a time, one could imagine processing 2 data points at a time, adjusting their relative weights based on their relative errors
• This is what Platt's SMO does (the SVM implementation in Weka)
Backup Slide to help with Derivative of Sigmoid
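Since the backup slide itself is an image, here is a standard sketch of the derivative it refers to (reconstructed, not copied from the slide): for out = 1 / (1 + e^(−net)),

  d(out)/d(net) = e^(−net) / (1 + e^(−net))²
                = [ 1 / (1 + e^(−net)) ] × [ e^(−net) / (1 + e^(−net)) ]
                = out × (1 − out),

which is exactly the F′(net) = out (1 − out) identity used in the δ computations above.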
Gradient Descent Gradient Descent for the Perceptronfor the Perceptron(Assume no threshold for now and (Assume no threshold for now and
start with a common error measure)start with a common error measure) Error Error frac12 ( t ndash o ) frac12 ( t ndash o )
2
Networkrsquos output
Teacherrsquos answer (a constant wrt the weights) EE
WWkk
ΔΔWWjj - η
= (t ndash o)
EE
WWkk
(t ndash o)(t ndash o)
WWkk
= -(t ndash o) oo
WW kk
Remember o = WmiddotX
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Continuation of Continuation of DerivationDerivation
EE WWkk
= -(t ndash o) WWkk
(( sumsumk k ww k k x x kk))
= -(t ndash o) x k
So ΔWk = η (t ndash o) xk The Perceptron Rule
Stick in formula for
output
Also known as the delta rule and other names (with small variations in calc)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
As it looks in your text As it looks in your text (processing all data at once)hellip(processing all data at once)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Linear Separability Linear Separability
Consider a perceptron its output is
1 If W1X1+W2X2 + hellip + WnXn gt 0 otherwise
In terms of feature space W1X1 + W2X2 =
X2 = = W1X1
W2
-W1 W2 W2
X1+
+ + + + + + - + - - + + + + - + + - - -+ + - -+ - - -
- -
Hence can only classify examples if a ldquolinerdquo (hyerplane) can separate them
y = mx + b
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Perceptron Convergence Perceptron Convergence
TheoremTheorem (Rosemblatt 1957)(Rosemblatt 1957)
Perceptron no Hidden Units
If a set of examples is learnable the perceptron training rule will eventually find the necessary weightsHowever a perceptron can only learnrepresent linearly separable datasetcopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The (Infamous) XOR The (Infamous) XOR ProblemProblem
Input
0 00 11 01 1
Output
0110
a)b)c)d)
Exclusive OR (XOR)Not linearly separable
b
a c
d
0 1
1
A Neural Network SolutionX1
X2
X1
X2
1
1
-1-1
1
1 Let = 0 for all nodescopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Need for Hidden The Need for Hidden UnitsUnits
If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)
This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Hidden UnitsHidden Units
One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Advantages of Neural NetworksNeural Networks
Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems
Being supplanted by Being supplanted by SVMrsquosSVMrsquos
Can represent a rich Can represent a rich class of conceptsclass of concepts
PositivenegativePositive
Saturday 40 chance of rainSunday 25 chance of rain
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Overview of ANNsOverview of ANNs
Recurrentlink
Output units
Input units
Hidden units
error
weight
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule
bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units
bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)
error signal
E Wij
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Weight SpaceWeight Space
bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace
bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork
bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data
bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Gradient Descent in Weight Gradient Descent in Weight SpaceSpace
E
W1
W2
Ew
W1
W2
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Gradient-Descent The Gradient-Descent RuleRule
E(w) [ ]Ew0
Ew1
Ew
2
EwN
hellip hellip hellip _
The ldquogradien
trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space
E
W1
W2
w = - E ( w )
or wi = - Ewi
ldquodeltardquo = change to
w
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop
bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)
bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line
Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)
BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace
ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)
E
wi
w1
w3w2
w
w1
w2
w3
Final locations in space need not be the same for BATCH and ON-LINE w
N
ote
w
iB
ATC
H
w
i O
N-L
INE
for
i gt
1
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Continuation of Continuation of DerivationDerivation
EE WWkk
= -(t ndash o) WWkk
(( sumsumk k ww k k x x kk))
= -(t ndash o) x k
So ΔWk = η (t ndash o) xk The Perceptron Rule
Stick in formula for
output
Also known as the delta rule and other names (with small variations in calc)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
As it looks in your text As it looks in your text (processing all data at once)hellip(processing all data at once)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Linear Separability Linear Separability
Consider a perceptron its output is
1 If W1X1+W2X2 + hellip + WnXn gt 0 otherwise
In terms of feature space W1X1 + W2X2 =
X2 = = W1X1
W2
-W1 W2 W2
X1+
+ + + + + + - + - - + + + + - + + - - -+ + - -+ - - -
- -
Hence can only classify examples if a ldquolinerdquo (hyerplane) can separate them
y = mx + b
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Perceptron Convergence Perceptron Convergence
TheoremTheorem (Rosemblatt 1957)(Rosemblatt 1957)
Perceptron no Hidden Units
If a set of examples is learnable the perceptron training rule will eventually find the necessary weightsHowever a perceptron can only learnrepresent linearly separable datasetcopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The (Infamous) XOR The (Infamous) XOR ProblemProblem
Input
0 00 11 01 1
Output
0110
a)b)c)d)
Exclusive OR (XOR)Not linearly separable
b
a c
d
0 1
1
A Neural Network SolutionX1
X2
X1
X2
1
1
-1-1
1
1 Let = 0 for all nodescopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Need for Hidden The Need for Hidden UnitsUnits
If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)
This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Hidden UnitsHidden Units
One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Advantages of Neural NetworksNeural Networks
Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems
Being supplanted by Being supplanted by SVMrsquosSVMrsquos
Can represent a rich Can represent a rich class of conceptsclass of concepts
PositivenegativePositive
Saturday 40 chance of rainSunday 25 chance of rain
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Overview of ANNsOverview of ANNs
Recurrentlink
Output units
Input units
Hidden units
error
weight
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule
bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units
bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)
error signal
E Wij
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Weight SpaceWeight Space
bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace
bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork
bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data
bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Gradient Descent in Weight Gradient Descent in Weight SpaceSpace
E
W1
W2
Ew
W1
W2
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Gradient-Descent The Gradient-Descent RuleRule
E(w) [ ]Ew0
Ew1
Ew
2
EwN
hellip hellip hellip _
The ldquogradien
trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space
E
W1
W2
w = - E ( w )
or wi = - Ewi
ldquodeltardquo = change to
w
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop
bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)
bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line
Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)
BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace
ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)
E
wi
w1
w3w2
w
w1
w2
w3
Final locations in space need not be the same for BATCH and ON-LINE w
N
ote
w
iB
ATC
H
w
i O
N-L
INE
for
i gt
1
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
As it looks in your text As it looks in your text (processing all data at once)hellip(processing all data at once)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Linear Separability Linear Separability
Consider a perceptron its output is
1 If W1X1+W2X2 + hellip + WnXn gt 0 otherwise
In terms of feature space W1X1 + W2X2 =
X2 = = W1X1
W2
-W1 W2 W2
X1+
+ + + + + + - + - - + + + + - + + - - -+ + - -+ - - -
- -
Hence can only classify examples if a ldquolinerdquo (hyerplane) can separate them
y = mx + b
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Perceptron Convergence Perceptron Convergence
TheoremTheorem (Rosemblatt 1957)(Rosemblatt 1957)
Perceptron no Hidden Units
If a set of examples is learnable the perceptron training rule will eventually find the necessary weightsHowever a perceptron can only learnrepresent linearly separable datasetcopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The (Infamous) XOR The (Infamous) XOR ProblemProblem
Input
0 00 11 01 1
Output
0110
a)b)c)d)
Exclusive OR (XOR)Not linearly separable
b
a c
d
0 1
1
A Neural Network SolutionX1
X2
X1
X2
1
1
-1-1
1
1 Let = 0 for all nodescopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Need for Hidden The Need for Hidden UnitsUnits
If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)
This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Hidden UnitsHidden Units
One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Advantages of Neural NetworksNeural Networks
Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems
Being supplanted by Being supplanted by SVMrsquosSVMrsquos
Can represent a rich Can represent a rich class of conceptsclass of concepts
PositivenegativePositive
Saturday 40 chance of rainSunday 25 chance of rain
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Overview of ANNsOverview of ANNs
Recurrentlink
Output units
Input units
Hidden units
error
weight
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule
bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units
bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)
error signal
E Wij
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Weight SpaceWeight Space
bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace
bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork
bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data
bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Gradient Descent in Weight Gradient Descent in Weight SpaceSpace
E
W1
W2
Ew
W1
W2
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Gradient-Descent The Gradient-Descent RuleRule
E(w) [ ]Ew0
Ew1
Ew
2
EwN
hellip hellip hellip _
The ldquogradien
trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space
E
W1
W2
w = - E ( w )
or wi = - Ewi
ldquodeltardquo = change to
w
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop
bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)
bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line
Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)
BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace
ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)
E
wi
w1
w3w2
w
w1
w2
w3
Final locations in space need not be the same for BATCH and ON-LINE w
N
ote
w
iB
ATC
H
w
i O
N-L
INE
for
i gt
1
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train ANNs (continued)

   b) Compute the "deviation" for output units:
      δ_i = F'(net_i) × (Teacher_i − out_i)
   c) Compute the "deviation" for hidden units:
      δ_j = F'(net_j) × Σ_i (w_ij × δ_i)
   d) Update the weights:
      Δw_ij = η × δ_i × out_j
      Δw_jk = η × δ_j × out_k

where F'(net_i) = ∂F(net_i) / ∂net_i
Using BP to Train ANNs (continued)

3. Repeat until the training-set error rate is small enough (or until the tuning-set error rate begins to rise; see a later slide). Should use "early stopping" (i.e., minimize error on the tuning set; more details later).
4. Measure accuracy on the test set to estimate generalization (future accuracy).
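Putting steps 1–4 together, a compact on-line Backprop sketch for a single hidden layer of sigmoid units might look like the following. It follows the update rules above, but the dataset, learning rate, epoch count, and stopping test are placeholder choices of my own rather than anything prescribed by the slides.

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def train_bp(X, T, n_hidden=4, eta=0.5, epochs=5000, seed=0):
    """On-line Backprop, one hidden layer; biases folded in as weights on a constant-1 input."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])                   # constant 1 -> hidden-layer biases
    W_jk = rng.uniform(-0.3, 0.3, (n_hidden, Xb.shape[1]))      # step 1: small random weights
    W_ij = rng.uniform(-0.3, 0.3, (T.shape[1], n_hidden + 1))   # extra column -> output-layer biases
    for _ in range(epochs):
        for idx in rng.permutation(len(Xb)):                    # step 2: randomized example order
            x, t = Xb[idx], T[idx]
            out_j = sigmoid(W_jk @ x)                           # step 2a: forward pass
            out_jb = np.append(out_j, 1.0)
            out_i = sigmoid(W_ij @ out_jb)
            delta_i = out_i * (1 - out_i) * (t - out_i)                  # step 2b: output deviation
            delta_j = out_j * (1 - out_j) * (W_ij[:, :-1].T @ delta_i)   # step 2c: hidden deviation
            W_ij += eta * np.outer(delta_i, out_jb)             # step 2d: weight updates
            W_jk += eta * np.outer(delta_j, x)
    return W_jk, W_ij

def predict(W_jk, W_ij, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    out_j = sigmoid(Xb @ W_jk.T)
    out_jb = np.hstack([out_j, np.ones((len(X), 1))])
    return sigmoid(out_jb @ W_ij.T)

# Example: XOR. Predictions should move toward 0/1/1/0, though convergence is
# not guaranteed for every random seed or learning rate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
print(np.round(predict(*train_bp(X, T), X), 2))

With a tuning set available, the epoch loop would instead stop when tuning-set error starts to rise (early stopping), as described in step 3.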
Advantages of Neural Networks

• Universal representation (provided enough hidden units)
• Less greedy than tree learners
• In practice, good for problems with numeric inputs; can also handle numeric outputs
• PHD: for many years the best protein secondary-structure predictor
Disadvantages

• Models not very comprehensible
• Long training times
• Very sensitive to the number of hidden units… as a result, largely being supplanted by SVMs (SVMs take a very different approach to getting non-linearity)
Looking Ahead

• The perceptron rule can also be thought of as modifying weights on data points rather than on features
• Instead of processing all the data (batch) vs. one example at a time, one could imagine processing 2 data points at a time, adjusting their relative weights based on their relative errors
• This is what Platt's SMO does (the SVM implementation in Weka)
Backup Slide to Help with Derivative of Sigmoid
Linear Separability

Consider a perceptron; its output is
  1 if W_1X_1 + W_2X_2 + … + W_nX_n > θ
  0 otherwise

In terms of feature space:
  W_1X_1 + W_2X_2 = θ
  X_2 = (θ − W_1X_1) / W_2 = (−W_1 / W_2) X_1 + θ / W_2     (cf. y = mx + b)

Hence a perceptron can only classify examples if a "line" (hyperplane) can separate them.

[Figure: feature space with + and − examples on either side of the separating line]
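As a small illustration (the weights and points below are invented), the perceptron's decision is just a thresholded dot product, and in two dimensions the boundary W_1X_1 + W_2X_2 = θ is exactly the line X_2 = (−W_1/W_2) X_1 + θ/W_2:

import numpy as np

W = np.array([2.0, -1.0])    # W1, W2 (hypothetical)
theta = 0.5                  # threshold

def perceptron_output(x):
    """1 if W.x exceeds the threshold, else 0."""
    return int(np.dot(W, x) > theta)

# Slope and intercept of the separating line X2 = (-W1/W2) X1 + theta/W2
slope, intercept = -W[0] / W[1], theta / W[1]
print(slope, intercept)                          # 2.0, -0.5
print(perceptron_output(np.array([1.0, 0.0])))   # 2.0 > 0.5  -> 1
print(perceptron_output(np.array([0.0, 1.0])))   # -1.0 > 0.5 -> 0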
Perceptron Convergence Theorem (Rosenblatt, 1957)

Perceptron = no hidden units.

If a set of examples is learnable, the perceptron training rule will eventually find the necessary weights. However, a perceptron can only learn/represent linearly separable datasets.
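To see the convergence theorem in action, here is a sketch of the perceptron training rule on a small linearly separable set (the data, learning rate, and epoch cap are mine); on separable data the loop stops once every example is classified correctly, as the theorem guarantees.

import numpy as np

# Toy linearly separable data: label 1 when x1 + x2 > 1, else 0
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [0, 2]], dtype=float)
t = np.array([0, 0, 0, 1, 1, 1])

Xb = np.hstack([X, np.ones((len(X), 1))])   # fold the threshold in as a bias weight
w = np.zeros(Xb.shape[1])
eta = 0.1

for epoch in range(100):
    errors = 0
    for x, target in zip(Xb, t):
        o = int(np.dot(w, x) > 0)
        if o != target:
            w += eta * (target - o) * x      # perceptron rule: w <- w + eta (t - o) x
            errors += 1
    if errors == 0:                          # converged: all examples classified correctly
        break
print(epoch, w)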
The (Infamous) XOR Problem

Exclusive OR (XOR) is not linearly separable:

     Input    Output
  a)  0 0       0
  b)  0 1       1
  c)  1 0       1
  d)  1 1       0

[Figure: the four points a, b, c, d on the unit square; no single line separates the 1s from the 0s]

A Neural Network Solution:

[Figure: X1 and X2 feed two hidden units, which feed the output unit; the weights shown are 1, 1, −1, −1, 1, 1. Let θ = 0 for all nodes.]
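Reading the figure as the usual two-hidden-unit construction (one unit detecting X1 AND NOT X2, the other detecting NOT X1 AND X2, OR'd together at the output) is an interpretation of the ±1 weights shown, not something stated explicitly on the slide; under that reading, a quick check with step units and θ = 0 reproduces the XOR truth table:

def step(net, theta=0.0):
    """Threshold unit: fire iff the net input exceeds theta (theta = 0 on this slide)."""
    return 1 if net > theta else 0

def xor_net(x1, x2):
    h1 = step(1 * x1 + (-1) * x2)   # assumed wiring: detects x1 AND NOT x2
    h2 = step((-1) * x1 + 1 * x2)   # assumed wiring: detects NOT x1 AND x2
    return step(1 * h1 + 1 * h2)    # OR of the two hidden units

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))  # 0, 1, 1, 0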
The Need for Hidden Units

If there is one layer of enough hidden units (possibly 2^N for Boolean functions), the input can be recoded (N = number of input units).

This recoding allows any mapping to be represented (Minsky & Papert).

Question: how do we provide an error signal to the interior units?
Hidden Units

One view: allow a system to create its own internal representation, one for which problem solving is easy.

[Figure: a perceptron]
Advantages of Neural Networks

• Provide the best predictive accuracy for some problems
• Being supplanted by SVMs
• Can represent a rich class of concepts

[Figure: example outputs – positive/negative labels; "Saturday: 40% chance of rain, Sunday: 25% chance of rain"]
Overview of ANNs

[Figure: input units feeding hidden units feeding output units; weights label the links, an error signal is applied at the outputs, and a recurrent link is shown]
Backpropagation
Backpropagation

• Backpropagation involves a generalization of the perceptron rule
• Rumelhart, Parker, and Le Cun (and Bryson & Ho, 1969; Werbos, 1974) independently developed (1985) a technique for determining how to adjust the weights of interior ("hidden") units
• The derivation involves partial derivatives (hence the threshold function must be differentiable)

[Figure: the error signal ∂E/∂W_ij propagated back through the network]
Weight Space

• Given a neural-network layout, the weights are free parameters that define a space
• Each point in this weight space specifies a network
• Associated with each point is an error rate, E, over the training data
• Backprop performs gradient descent in weight space
Gradient Descent in Weight Space

[Figure: the error surface E over weights W1 and W2, with the gradient ∂E/∂w and the descent step moving downhill]
The Gradient-Descent Rule

∇E(w) = [ ∂E/∂w_0, ∂E/∂w_1, ∂E/∂w_2, …, ∂E/∂w_N ]

The "gradient" is an N+1-dimensional vector (i.e., the 'slope' in weight space). Since we want to reduce errors, we want to go "downhill". We'll take a finite step in weight space:

  Δw = -η ∇E(w)     or     Δw_i = -η ∂E/∂w_i

("delta" = the change to w)

[Figure: error surface over W1 and W2 with a step taken down the slope]
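As a toy illustration of Δw_i = -η ∂E/∂w_i (the quadratic error surface and step size below are invented for the example), each update moves the weight vector a short distance downhill until it settles near the minimum:

import numpy as np

def E(w):
    """A toy error surface whose minimum is at w = (1, -2)."""
    return 0.5 * ((w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2)

def grad_E(w):
    """Gradient of the toy error surface."""
    return np.array([w[0] - 1.0, w[1] + 2.0])

eta = 0.1
w = np.array([4.0, 3.0])        # arbitrary starting point in weight space
for _ in range(100):
    w = w - eta * grad_E(w)     # delta-w = -eta * gradient: a finite step downhill
print(np.round(w, 3), round(E(w), 6))   # w approaches (1, -2); E(w) approaches 0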
"On-Line" vs. "Batch" Backprop

• Technically, we should look at the error gradient for the entire training set before taking a step in weight space ("batch" Backprop)
• However, as presented, we take a step after each example ("on-line" Backprop)
  • Much faster convergence
  • Can reduce overfitting (since on-line Backprop is "noisy" gradient descent)
"On-Line" vs. "Batch" BP (continued)

BATCH – add the Δw vectors for every training example, then 'move' in weight space.

ON-LINE – "move" after each example (a.k.a. stochastic gradient descent).

[Figure: two trajectories through weight space, w1 → w2 → w3, one for BATCH and one for ON-LINE. Note that w_i(BATCH) ≠ w_i(ON-LINE) for i > 1; the final locations in weight space need not be the same for BATCH and ON-LINE.]
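A minimal sketch of the two schedules (a linear unit with squared error is used only to keep the gradient simple; everything here is illustrative): BATCH sums the per-example Δw contributions before moving, while ON-LINE moves immediately after each example, so the two trace different paths through weight space.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # 20 illustrative examples, 3 features
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true                          # noiseless targets for a linear unit

def grad(w, x, target):
    """Gradient of 1/2 (target - w.x)^2 with respect to w, for one example."""
    return -(target - w @ x) * x

eta = 0.02
w_batch = np.zeros(3)
w_online = np.zeros(3)
for _ in range(50):
    # BATCH: accumulate the delta-w of every example, then take a single step
    total_step = np.zeros(3)
    for x, target in zip(X, t):
        total_step += -eta * grad(w_batch, x, target)
    w_batch += total_step
    # ON-LINE (stochastic): step immediately after each example
    for x, target in zip(X, t):
        w_online += -eta * grad(w_online, x, target)

print(np.round(w_batch, 3), np.round(w_online, 3))  # both drift toward w_true along different paths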
Need Derivatives: Replace Step (Threshold) by Sigmoid

Individual units:

  output_i = F(Σ_j weight_ij × output_j)

where

  F(input_i) = 1 / (1 + e^-(input_i − bias_i))

[Figure: a single unit receiving inputs from units j, with its bias, input, and output labeled]
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Perceptron Convergence Perceptron Convergence
TheoremTheorem (Rosemblatt 1957)(Rosemblatt 1957)
Perceptron no Hidden Units
If a set of examples is learnable the perceptron training rule will eventually find the necessary weightsHowever a perceptron can only learnrepresent linearly separable datasetcopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The (Infamous) XOR The (Infamous) XOR ProblemProblem
Input
0 00 11 01 1
Output
0110
a)b)c)d)
Exclusive OR (XOR)Not linearly separable
b
a c
d
0 1
1
A Neural Network SolutionX1
X2
X1
X2
1
1
-1-1
1
1 Let = 0 for all nodescopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Need for Hidden The Need for Hidden UnitsUnits
If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)
This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Hidden UnitsHidden Units
One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Advantages of Neural NetworksNeural Networks
Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems
Being supplanted by Being supplanted by SVMrsquosSVMrsquos
Can represent a rich Can represent a rich class of conceptsclass of concepts
PositivenegativePositive
Saturday 40 chance of rainSunday 25 chance of rain
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Overview of ANNsOverview of ANNs
Recurrentlink
Output units
Input units
Hidden units
error
weight
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule
bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units
bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)
error signal
E Wij
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Weight SpaceWeight Space
bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace
bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork
bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data
bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Gradient Descent in Weight Gradient Descent in Weight SpaceSpace
E
W1
W2
Ew
W1
W2
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Gradient-Descent The Gradient-Descent RuleRule
E(w) [ ]Ew0
Ew1
Ew
2
EwN
hellip hellip hellip _
The ldquogradien
trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space
E
W1
W2
w = - E ( w )
or wi = - Ewi
ldquodeltardquo = change to
w
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop
bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)
bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line
Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)
BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace
ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)
E
wi
w1
w3w2
w
w1
w2
w3
Final locations in space need not be the same for BATCH and ON-LINE w
N
ote
w
iB
ATC
H
w
i O
N-L
INE
for
i gt
1
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The (Infamous) XOR The (Infamous) XOR ProblemProblem
Input
0 00 11 01 1
Output
0110
a)b)c)d)
Exclusive OR (XOR)Not linearly separable
b
a c
d
0 1
1
A Neural Network SolutionX1
X2
X1
X2
1
1
-1-1
1
1 Let = 0 for all nodescopy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Need for Hidden The Need for Hidden UnitsUnits
If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)
This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Hidden UnitsHidden Units
One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Advantages of Neural NetworksNeural Networks
Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems
Being supplanted by Being supplanted by SVMrsquosSVMrsquos
Can represent a rich Can represent a rich class of conceptsclass of concepts
PositivenegativePositive
Saturday 40 chance of rainSunday 25 chance of rain
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Overview of ANNsOverview of ANNs
Recurrentlink
Output units
Input units
Hidden units
error
weight
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule
bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units
bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)
error signal
E Wij
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Weight SpaceWeight Space
bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace
bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork
bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data
bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Gradient Descent in Weight Gradient Descent in Weight SpaceSpace
E
W1
W2
Ew
W1
W2
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Gradient-Descent The Gradient-Descent RuleRule
E(w) [ ]Ew0
Ew1
Ew
2
EwN
hellip hellip hellip _
The ldquogradien
trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space
E
W1
W2
w = - E ( w )
or wi = - Ewi
ldquodeltardquo = change to
w
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop
bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)
bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line
Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)
BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace
ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)
E
wi
w1
w3w2
w
w1
w2
w3
Final locations in space need not be the same for BATCH and ON-LINE w
N
ote
w
iB
ATC
H
w
i O
N-L
INE
for
i gt
1
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Need for Hidden The Need for Hidden UnitsUnits
If there is one layer of enough hidden units (possibly 2N for Boolean functions) the input can be recoded (N = number of input units)
This recoding allows any mapping to be represented (Minsky amp Papert)Question How to provide an error signal to the interior units
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Hidden UnitsHidden Units
One ViewAllow a system to create its own internal representation ndash for which problem solving is easy A perceptron
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Advantages of Neural NetworksNeural Networks
Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems
Being supplanted by Being supplanted by SVMrsquosSVMrsquos
Can represent a rich Can represent a rich class of conceptsclass of concepts
PositivenegativePositive
Saturday 40 chance of rainSunday 25 chance of rain
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Overview of ANNsOverview of ANNs
Recurrentlink
Output units
Input units
Hidden units
error
weight
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule
bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units
bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)
error signal
E Wij
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Weight SpaceWeight Space
bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace
bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork
bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data
bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Gradient Descent in Weight Gradient Descent in Weight SpaceSpace
E
W1
W2
Ew
W1
W2
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Gradient-Descent The Gradient-Descent RuleRule
E(w) [ ]Ew0
Ew1
Ew
2
EwN
hellip hellip hellip _
The ldquogradien
trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space
E
W1
W2
w = - E ( w )
or wi = - Ewi
ldquodeltardquo = change to
w
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop
bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)
bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line
Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)
BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace
ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)
E
wi
w1
w3w2
w
w1
w2
w3
Final locations in space need not be the same for BATCH and ON-LINE w
N
ote
w
iB
ATC
H
w
i O
N-L
INE
for
i gt
1
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train ANNs (continued)
3. Repeat until the training-set error rate is small enough (or until the tuning-set error rate begins to rise – see later slide). Should use "early stopping" (i.e., minimize error on the tuning set; more details later).
4. Measure accuracy on the test set to estimate generalization (future accuracy).
(A minimal code sketch of steps 1–4 follows.)
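To make steps 1–4 concrete, here is a minimal on-line Backprop sketch (an illustrative toy only: the 2-3-1 topology, the learning rate of 0.1, the random placeholder data, and the omission of bias terms are assumptions made here for brevity, not prescriptions from the slides).

import numpy as np

rng = np.random.default_rng(0)

def F(z):                                        # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: initialize weights to small random values in [-0.3, 0.3] (biases omitted for brevity)
W_jk = rng.uniform(-0.3, 0.3, size=(3, 2))       # input layer k  -> hidden layer j
W_ij = rng.uniform(-0.3, 0.3, size=(1, 3))       # hidden layer j -> output layer i
eta = 0.1                                        # learning rate (an assumed value)

def train_epoch(X, T):
    global W_jk, W_ij
    # Step 2: randomize example order; update after each example (on-line Backprop)
    for n in rng.permutation(len(X)):
        x, t = X[n], T[n]
        out_j = F(W_jk @ x)                                  # step 2a: forward propagation
        out_i = F(W_ij @ out_j)
        delta_i = out_i * (1 - out_i) * (t - out_i)          # step 2b: output deviations
        delta_j = out_j * (1 - out_j) * (W_ij.T @ delta_i)   # step 2c: hidden deviations
        W_ij += eta * np.outer(delta_i, out_j)               # step 2d: weight updates
        W_jk += eta * np.outer(delta_j, x)

def error_rate(X, T):
    preds = [F(W_ij @ F(W_jk @ x))[0] > 0.5 for x in X]
    return float(np.mean([int(p) != int(t) for p, t in zip(preds, T)]))

# Random placeholder data standing in for real train / tune / test splits
X_train, T_train = rng.normal(size=(40, 2)), rng.integers(0, 2, size=40)
X_tune,  T_tune  = rng.normal(size=(20, 2)), rng.integers(0, 2, size=20)
X_test,  T_test  = rng.normal(size=(20, 2)), rng.integers(0, 2, size=20)

# Step 3: repeat, stopping early when the tuning-set error stops improving
best, since_best = np.inf, 0
for epoch in range(500):
    train_epoch(X_train, T_train)
    e = error_rate(X_tune, T_tune)
    if e < best:
        best, since_best = e, 0
    else:
        since_best += 1
    if since_best > 10:
        break

# Step 4: estimate generalization on the held-out test set
print("estimated future error rate:", error_rate(X_test, T_test))

A fuller version would also keep a copy of the weights that achieved the best tuning-set error and restore them before the test-set measurement.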
Advantages of Neural Networks
• Universal representation (provided enough hidden units)
• Less greedy than tree learners
• In practice, good for problems with numeric inputs, and can also handle numeric outputs
• PHD: for many years the best protein secondary-structure predictor
Disadvantages
• Models not very comprehensible
• Long training times
• Very sensitive to the number of hidden units… as a result, largely being supplanted by SVMs (SVMs take a very different approach to getting non-linearity)
Looking Ahead
• The perceptron rule can also be thought of as modifying weights on data points rather than features
• Instead of processing all the data (batch) vs. one example at a time, one could imagine processing 2 data points at a time, adjusting their relative weights based on their relative errors
• This is what Platt's SMO does (the SVM implementation in Weka)
Backup Slide to help with Derivative of Sigmoid
© Jude Shavlik 2006, David Page 2010. CS 760 – Machine Learning (UW-Madison)
Advantages of Advantages of Neural NetworksNeural Networks
Provide best predictive Provide best predictive accuracy for some accuracy for some problemsproblems
Being supplanted by Being supplanted by SVMrsquosSVMrsquos
Can represent a rich Can represent a rich class of conceptsclass of concepts
PositivenegativePositive
Saturday 40 chance of rainSunday 25 chance of rain
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Overview of ANNsOverview of ANNs
Recurrentlink
Output units
Input units
Hidden units
error
weight
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule
bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units
bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)
error signal
E Wij
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Weight SpaceWeight Space
bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace
bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork
bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data
bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Gradient Descent in Weight Gradient Descent in Weight SpaceSpace
E
W1
W2
Ew
W1
W2
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Gradient-Descent The Gradient-Descent RuleRule
E(w) [ ]Ew0
Ew1
Ew
2
EwN
hellip hellip hellip _
The ldquogradien
trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space
E
W1
W2
w = - E ( w )
or wi = - Ewi
ldquodeltardquo = change to
w
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop
bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)
bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line
Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)
BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace
ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)
E
wi
w1
w3w2
w
w1
w2
w3
Final locations in space need not be the same for BATCH and ON-LINE w
N
ote
w
iB
ATC
H
w
i O
N-L
INE
for
i gt
1
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Overview of ANNsOverview of ANNs
Recurrentlink
Output units
Input units
Hidden units
error
weight
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule
bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units
bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)
error signal
E Wij
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Weight SpaceWeight Space
bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace
bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork
bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data
bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Gradient Descent in Weight Gradient Descent in Weight SpaceSpace
E
W1
W2
Ew
W1
W2
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Gradient-Descent The Gradient-Descent RuleRule
E(w) [ ]Ew0
Ew1
Ew
2
EwN
hellip hellip hellip _
The ldquogradien
trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space
E
W1
W2
w = - E ( w )
or wi = - Ewi
ldquodeltardquo = change to
w
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop
bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)
bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line
Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)
BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace
ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)
E
wi
w1
w3w2
w
w1
w2
w3
Final locations in space need not be the same for BATCH and ON-LINE w
N
ote
w
iB
ATC
H
w
i O
N-L
INE
for
i gt
1
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule
bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units
bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)
error signal
E Wij
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Weight SpaceWeight Space
bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace
bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork
bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data
bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Gradient Descent in Weight Gradient Descent in Weight SpaceSpace
E
W1
W2
Ew
W1
W2
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Gradient-Descent The Gradient-Descent RuleRule
E(w) [ ]Ew0
Ew1
Ew
2
EwN
hellip hellip hellip _
The ldquogradien
trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space
E
W1
W2
w = - E ( w )
or wi = - Ewi
ldquodeltardquo = change to
w
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop
bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)
bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line
Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)
BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace
ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)
E
wi
w1
w3w2
w
w1
w2
w3
Final locations in space need not be the same for BATCH and ON-LINE w
N
ote
w
iB
ATC
H
w
i O
N-L
INE
for
i gt
1
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
BackpropagationBackpropagation
bull Backpropagation involves a generalization of the Backpropagation involves a generalization of the perceptron ruleperceptron rule
bull Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Rumelhart Parker and Le Cun (and Bryson amp Ho 1969) Werbos 1974) independently developed (1985) a Werbos 1974) independently developed (1985) a technique for determining how to adjust weights of technique for determining how to adjust weights of interior (ldquohiddenrdquo) unitsinterior (ldquohiddenrdquo) units
bull Derivation involves partial derivatives Derivation involves partial derivatives (hence threshold function must be differentiable)(hence threshold function must be differentiable)
error signal
E Wij
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Weight SpaceWeight Space
bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace
bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork
bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data
bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Gradient Descent in Weight Gradient Descent in Weight SpaceSpace
E
W1
W2
Ew
W1
W2
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Gradient-Descent The Gradient-Descent RuleRule
E(w) [ ]Ew0
Ew1
Ew
2
EwN
hellip hellip hellip _
The ldquogradien
trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space
E
W1
W2
w = - E ( w )
or wi = - Ewi
ldquodeltardquo = change to
w
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop
bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)
bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line
Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)
BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace
ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)
E
wi
w1
w3w2
w
w1
w2
w3
Final locations in space need not be the same for BATCH and ON-LINE w
N
ote
w
iB
ATC
H
w
i O
N-L
INE
for
i gt
1
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Logistic Function

out i = 1 / (1 + e^-(Σ wji x outj))

F '(wgt'ed in) = out i ( 1 - out i )

[Plot: F(wgt'ed in) as a function of Σ wj x outj, rising from 0 through ½ toward 1.]
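A small check of this identity (an illustrative sketch with assumed helper names): the derivative expressed through the output, out(1 - out), matches a numerical derivative of the logistic.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """F'(weighted input) written through the unit's output: out * (1 - out)."""
    out = sigmoid(z)
    return out * (1.0 - out)

z = np.array([-2.0, 0.0, 3.0])
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6   # central difference
print(np.allclose(numeric, sigmoid_prime(z)))               # True
```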
BP Calculations

Assume one layer of hidden units (std topology)

1. Error ≡ ½ Σ ( Teacheri – Outputi )²
2.       = ½ Σ ( Teacheri – F( Σ [Wij x Outputj] ) )²
3.       = ½ Σ ( Teacheri – F( Σ [Wij x F( Σ (Wjk x Outputk) )] ) )²

Determine  ∂Error/∂Wij   (use equation 2)
           ∂Error/∂Wjk   (use equation 3)

recall  Δwxy = -η ( ∂E / ∂wxy )

See Table 4.2 in Mitchell for results.

[Diagram: network layers labeled k → j → i.]
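As a brief sketch of where equation 2 leads for a single output-layer weight (this worked step is added for readability and is not one of the original slides; it uses the chain rule and the logistic identity F' = out(1 – out)):

```latex
\frac{\partial E}{\partial W_{ij}}
  = \frac{\partial E}{\partial net_i}\,\frac{\partial net_i}{\partial W_{ij}}
  = -\,(Teacher_i - out_i)\,F'(net_i)\; out_j ,
\qquad
\Delta W_{ij} = -\eta\,\frac{\partial E}{\partial W_{ij}}
             = \eta\,\underbrace{F'(net_i)\,(Teacher_i - out_i)}_{\delta_i}\; out_j .
```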
Derivation in Mitchell
Some Notation
By Chain Rule (since Wji influences rest of network only by its influence on Netj)…
Also remember this for later – We'll call it -δj
Remember that oj is xkj (output from j is input to k)
Remember: netk = wk1 xk1 + … + wkN xkN
Using BP to Train ANN's

1. Initiate weights & bias to small random values (eg, in [-0.3, 0.3])
2. Randomize order of training examples; for each, do:
   a) Propagate activity forward to output units:
      outi = F( Σj wij x outj )

[Diagram: network layers labeled k → j → i.]
Using BP to Train ANN's (continued)

b) Compute "deviation" for output units:   δi = F '( neti ) x (Teacheri - outi)
c) Compute "deviation" for hidden units:   δj = F '( netj ) x Σi ( wij x δi )
d) Update weights:                         Δwij = η x δi x outj
                                           Δwjk = η x δj x outk

where F '( net ) = ∂F(net) / ∂net
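Pulling steps a)–d) together, a minimal on-line Backprop update for one example might look like the sketch below. It is an illustration under assumptions (no bias terms, one hidden layer, invented array shapes and names), not code from the deck.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_one_example(x, teacher, W_jk, W_ij, eta=0.1):
    """One on-line update for a k -> j -> i network of sigmoid units."""
    # a) propagate activity forward to the output units
    net_j = W_jk @ x                      # weighted input reaching each hidden unit j
    out_j = sigmoid(net_j)
    net_i = W_ij @ out_j                  # weighted input reaching each output unit i
    out_i = sigmoid(net_i)

    # b) "deviation" for output units: delta_i = F'(net_i) * (teacher_i - out_i)
    delta_i = out_i * (1.0 - out_i) * (teacher - out_i)

    # c) "deviation" for hidden units: delta_j = F'(net_j) * sum_i(w_ij * delta_i)
    delta_j = out_j * (1.0 - out_j) * (W_ij.T @ delta_i)

    # d) update weights: delta_w = eta * delta * out
    W_ij = W_ij + eta * np.outer(delta_i, out_j)
    W_jk = W_jk + eta * np.outer(delta_j, x)
    return W_jk, W_ij, out_i

# Hypothetical shapes: 3 inputs (k), 4 hidden units (j), 2 outputs (i); small random init.
W_jk = np.random.uniform(-0.3, 0.3, size=(4, 3))
W_ij = np.random.uniform(-0.3, 0.3, size=(2, 4))
W_jk, W_ij, out = backprop_one_example(np.array([1.0, 0.5, -1.0]), np.array([1.0, 0.0]), W_jk, W_ij)
print(out)
```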
Using BP to Train ANN's (continued)

3. Repeat until training-set error rate small enough (or until tuning-set error rate begins to rise – see later slide)
   Should use "early stopping" (ie, minimize error on the tuning set; more details later)
4. Measure accuracy on test set to estimate generalization (future accuracy)
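A rough sketch of this outer training loop with early stopping on the tuning set; the helpers `one_backprop_epoch` and `error_rate`, and the patience cutoff, are hypothetical placeholders rather than anything specified on the slides.

```python
def train_with_early_stopping(weights, train_set, tune_set,
                              one_backprop_epoch, error_rate,
                              max_epochs=200, patience=10):
    """Keep the weights that minimize tuning-set error; stop once tuning error keeps rising."""
    best_weights, best_err, epochs_since_best = weights, float("inf"), 0
    for _ in range(max_epochs):
        weights = one_backprop_epoch(weights, train_set)   # one randomized pass over the data
        tune_err = error_rate(weights, tune_set)
        if tune_err < best_err:
            best_weights, best_err, epochs_since_best = weights, tune_err, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:              # tuning-set error has begun to rise
                break
    return best_weights
```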
Advantages of Neural Networks
• Universal representation (provided enough hidden units)
• Less greedy than tree learners
• In practice, good for problems with numeric inputs, and can also handle numeric outputs
• PHD: for many years the best protein secondary structure predictor
Disadvantages
• Models not very comprehensible
• Long training times
• Very sensitive to number of hidden units… as a result, largely being supplanted by SVMs (SVMs take a very different approach to getting non-linearity)
Looking Ahead
• Perceptron rule can also be thought of as modifying weights on data points rather than features
• Instead of processing all data (batch) vs one-at-a-time, could imagine processing 2 data points at a time, adjusting their relative weights based on their relative errors
• This is what Platt's SMO does (the SVM implementation in Weka)
Backup Slide to help with Derivative of Sigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Weight SpaceWeight Space
bull Given a neural-network layout the weights Given a neural-network layout the weights are free parameters that are free parameters that define a define a spacespace
bull Each pointEach point in this in this Weight SpaceWeight Space specifies a specifies a networknetwork
bull Associated with each point is an Associated with each point is an error rateerror rate EE over the training dataover the training data
bull Backprop performs Backprop performs gradient descentgradient descent in weight in weight spacespace
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Gradient Descent in Weight Gradient Descent in Weight SpaceSpace
E
W1
W2
Ew
W1
W2
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Gradient-Descent The Gradient-Descent RuleRule
E(w) [ ]Ew0
Ew1
Ew
2
EwN
hellip hellip hellip _
The ldquogradien
trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space
E
W1
W2
w = - E ( w )
or wi = - Ewi
ldquodeltardquo = change to
w
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop
bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)
bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line
Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)
BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace
ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)
E
wi
w1
w3w2
w
w1
w2
w3
Final locations in space need not be the same for BATCH and ON-LINE w
N
ote
w
iB
ATC
H
w
i O
N-L
INE
for
i gt
1
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Gradient Descent in Weight Gradient Descent in Weight SpaceSpace
E
W1
W2
Ew
W1
W2
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Gradient-Descent The Gradient-Descent RuleRule
E(w) [ ]Ew0
Ew1
Ew
2
EwN
hellip hellip hellip _
The ldquogradien
trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space
E
W1
W2
w = - E ( w )
or wi = - Ewi
ldquodeltardquo = change to
w
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop
bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)
bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line
Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)
BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace
ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)
E
wi
w1
w3w2
w
w1
w2
w3
Final locations in space need not be the same for BATCH and ON-LINE w
N
ote
w
iB
ATC
H
w
i O
N-L
INE
for
i gt
1
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
The Gradient-Descent The Gradient-Descent RuleRule
E(w) [ ]Ew0
Ew1
Ew
2
EwN
hellip hellip hellip _
The ldquogradien
trdquoThis is a N+1 dimensional vector (ie the lsquoslopersquo in weight space)Since we want to reduce errors we want to go ldquodown hillrdquoWersquoll take a finite step in weight space
E
W1
W2
w = - E ( w )
or wi = - Ewi
ldquodeltardquo = change to
w
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop
bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)
bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line
Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)
BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace
ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)
E
wi
w1
w3w2
w
w1
w2
w3
Final locations in space need not be the same for BATCH and ON-LINE w
N
ote
w
iB
ATC
H
w
i O
N-L
INE
for
i gt
1
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo On Linerdquo vs ldquoBatchrdquo BackpropBackprop
bull Technically we should look at the error Technically we should look at the error gradient for the entire training set gradient for the entire training set before taking a step in weight space before taking a step in weight space (ldquo(ldquobatchbatchrdquo Backprop)rdquo Backprop)
bull HoweverHowever as presented we take a step as presented we take a step after each example (ldquoafter each example (ldquoon-lineon-linerdquo Backprop)rdquo Backprop)bull Much faster convergenceMuch faster convergencebull Can reduce overfitting (since on-line Can reduce overfitting (since on-line
Backprop is ldquonoisyrdquo gradient descent)Backprop is ldquonoisyrdquo gradient descent)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)
BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace
ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)
E
wi
w1
w3w2
w
w1
w2
w3
Final locations in space need not be the same for BATCH and ON-LINE w
N
ote
w
iB
ATC
H
w
i O
N-L
INE
for
i gt
1
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
ldquoldquoOn Linerdquo vs ldquoBatchrdquo BP On Linerdquo vs ldquoBatchrdquo BP (continued)(continued)
BATCHBATCH ndash add ndash add w w vectors for vectors for everyevery training example training example thenthen lsquomoversquo in weight lsquomoversquo in weight spacespace
ON-LINEON-LINE ndash ldquomoverdquo ndash ldquomoverdquo after after eacheach example example (aka (aka stochasticstochastic gradient descent)gradient descent)
E
wi
w1
w3w2
w
w1
w2
w3
Final locations in space need not be the same for BATCH and ON-LINE w
N
ote
w
iB
ATC
H
w
i O
N-L
INE
for
i gt
1
E
w
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Need Derivatives Replace Need Derivatives Replace Step (Threshold) by Step (Threshold) by SigmoidSigmoidIndividual units
bias
output
input
output i= F(weight ij x output j)
Where
F(input i) =
j
1
1+e -(input i ndash bias i)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
Some Notation
By Chain Rule (since W_ji influences the rest of the network only by its influence on Net_j)…
Also remember this for later – we'll call it −δ_j
Remember that o_j is x_kj: the output from j is the input to k.

Remember: net_k = w_k1 x_k1 + … + w_kN x_kN
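The equations on the preceding few slides survive only as images in this transcript, so here is a hedged reconstruction of the chain-rule step they describe, in Mitchell's notation:

  ∂E/∂w_ji = (∂E/∂net_j) · (∂net_j/∂w_ji)

The first factor is the quantity named above, ∂E/∂net_j = −δ_j. For the second factor, since net_j = w_j1 x_j1 + … + w_jN x_jN, we get ∂net_j/∂w_ji = x_ji, the input arriving along that weight (i.e., the output of the upstream unit). Hence ∂E/∂w_ji = −δ_j x_ji, and the gradient-descent step is Δw_ji = η δ_j x_ji.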
Using BP to Train ANNs

1. Initialize weights & biases to small random values (e.g., in [-0.3, 0.3])
2. Randomize the order of the training examples; for each, do:
   a) Propagate activity forward to the output units (layers k → j → i):

        out_i = F( Σ_j w_ij × out_j )
Using BP to Train ANNs (continued)

   b) Compute "deviation" for output units:   δ_i = F'( net_i ) × (Teacher_i − out_i)
   c) Compute "deviation" for hidden units:   δ_j = F'( net_j ) × Σ_i ( w_ij × δ_i )
   d) Update weights:
        Δw_ij = η × δ_i × out_j
        Δw_jk = η × δ_j × out_k

      where F'( net_i ) = ∂F(net_i) / ∂net_i
Using BP to Train ANNs (continued)

3. Repeat until training-set error rate is small enough (or until tuning-set error rate begins to rise – see later slide).
   Should use "early stopping" (i.e., minimize error on the tuning set; more details later).
4. Measure accuracy on test set to estimate generalization (future accuracy).
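Putting steps 1–4 together, here is a minimal Python/NumPy sketch of the training loop for a one-hidden-layer network of sigmoid units. The function and variable names (train_bp, W_jk, W_ij, etc.) are made up for illustration, and this is not the course's reference implementation; early stopping and the test-set measurement of step 4 are only noted in comments.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_bp(X, T, n_hidden=5, eta=0.1, n_epochs=100, seed=0):
    """Backprop for layers k (inputs) -> j (hidden) -> i (outputs).
    X: (n_examples, n_in) array of inputs; T: (n_examples, n_out) array of targets in [0, 1]."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    # 1. Initialize weights & biases to small random values, e.g. in [-0.3, 0.3]
    #    (biases written as additive terms, equivalent to the slides' "input - bias" up to sign)
    W_jk = rng.uniform(-0.3, 0.3, (n_hidden, n_in))    # input -> hidden weights
    b_j  = rng.uniform(-0.3, 0.3, n_hidden)
    W_ij = rng.uniform(-0.3, 0.3, (n_out, n_hidden))   # hidden -> output weights
    b_i  = rng.uniform(-0.3, 0.3, n_out)

    for epoch in range(n_epochs):
        # 2. Randomize the order of the training examples; process one at a time
        for n in rng.permutation(len(X)):
            x, t = X[n], T[n]
            # a) Propagate activity forward to the output units
            out_j = sigmoid(W_jk @ x + b_j)
            out_i = sigmoid(W_ij @ out_j + b_i)
            # b) "Deviation" for output units: delta_i = F'(net_i) * (Teacher_i - out_i)
            delta_i = out_i * (1.0 - out_i) * (t - out_i)
            # c) "Deviation" for hidden units: delta_j = F'(net_j) * sum_i(w_ij * delta_i)
            delta_j = out_j * (1.0 - out_j) * (W_ij.T @ delta_i)
            # d) Update weights (and biases): change = eta * deviation * input to that weight
            W_ij += eta * np.outer(delta_i, out_j)
            b_i  += eta * delta_i
            W_jk += eta * np.outer(delta_j, x)
            b_j  += eta * delta_j
        # 3. In practice, stop early when tuning-set error starts to rise (omitted here),
        # 4. then measure accuracy on a held-out test set to estimate generalization.
    return W_jk, b_j, W_ij, b_i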
© Jude Shavlik 2006, David Page 2010 · CS 760 – Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Differentiating the Differentiating the Logistic FunctionLogistic Function
out i =
1
1 + e - ( wji x outj)
F rsquo(wgtrsquoed in) = out i ( 1- out i ) 0
12
Wj x outj
F(wgtrsquoed in)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Assume one layer of hidden units (std topology)Assume one layer of hidden units (std topology)
11 Error Error frac12 frac12 ( Teacher ( Teacherii ndash ndash OutputOutput ii ) ) 22
22 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x x OutputOutput jj] )] )22
33 = frac12 = frac12 (Teacher(Teacherii ndash F ndash F (( [[WWijij x F x F ((WWjkjk x Output x Output
kk)])))]))22
DetermineDetermine
recallrecall
BP CalculationsBP Calculations
Error Wij
Error Wjk
= (use equation 2)
= (use equation 3)
See Table 42 in Mitchell for resultswxy = - ( E wxy )
k j i
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Derivation in MitchellDerivation in Mitchell
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Some NotationSome Notation
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
By Chain Rule By Chain Rule (since (since WWjiji influences rest of network only influences rest of network only by its influence on by its influence on NetNetjj)hellip)hellip
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Also remember this for later ndashWersquoll call it -δj
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Looking AheadLooking Aheadbull Perceptron rule can also be thought of as Perceptron rule can also be thought of as
modifying modifying weights on data points weights on data points rather rather than featuresthan features
bull Instead of process all data (batch) vs Instead of process all data (batch) vs one-at-a-time could imagine processing 2 one-at-a-time could imagine processing 2 data points at a time adjusting their data points at a time adjusting their relative weights based on their relative relative weights based on their relative errorserrors
bull This is what Plattrsquos SMO does (the SVM This is what Plattrsquos SMO does (the SVM implementation in Weka)implementation in Weka)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Backup Slide to help Backup Slide to help with Derivative of with Derivative of SigmoidSigmoid
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Remember thatoj is xkj outputfrom j is inputto k
Remember netk =wk1 xk1 + hellip+ wkN xkN
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquosANNrsquos
11 Initiate weights amp bias to Initiate weights amp bias to small random values small random values (eg in [-03 03])(eg in [-03 03])
22 Randomize order of Randomize order of training examples for training examples for each doeach do
a)a) Propagate activity Propagate activity forwardforward to output unitsto output units
k j i
outi = F( wij x outj )j
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
b)b) Compute ldquodeviationrdquo for output Compute ldquodeviationrdquo for output unitsunits
c)c) Compute ldquodeviationrdquo for hidden Compute ldquodeviationrdquo for hidden unitsunits
d)d) Update weightsUpdate weights
i = F rsquo( neti ) x (Teacheri-outi)
ij = F rsquo( netj ) x ( wij x
i)
wij = x i x out j
wjk = x j x out k
F rsquo( netj ) = F(neti) neti
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Using BP to Train Using BP to Train ANNrsquos ANNrsquos (continued)(continued)
33 Repeat until training-set error rate Repeat until training-set error rate small enough (or until tuning-set error small enough (or until tuning-set error rate begins to rise ndash see later slide)rate begins to rise ndash see later slide)
Should use ldquoearly stoppingrdquo (ie Should use ldquoearly stoppingrdquo (ie minimize error on the tuning set more minimize error on the tuning set more details later)details later)
44 Measure accuracy on test set to Measure accuracy on test set to estimate estimate generalizationgeneralization (future (future accuracy)accuracy)
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
Advantages of Neural Advantages of Neural NetworksNetworks
bull Universal representation (provided Universal representation (provided enough hidden units)enough hidden units)
bull Less greedy than tree learnersLess greedy than tree learnersbull In practice good for problems with In practice good for problems with
numeric inputs and can also numeric inputs and can also handle numeric outputshandle numeric outputs
bull PHD for many years best protein PHD for many years best protein secondary structure predictorsecondary structure predictor
copy Jude Shavlik 2006 copy Jude Shavlik 2006 David Page 2010 David Page 2010
CS 760 ndash Machine Learning (UW-Madison)CS 760 ndash Machine Learning (UW-Madison)
DisadvantagesDisadvantages
bull Models not very comprehensibleModels not very comprehensiblebull Long training timesLong training timesbull Very sensitive to number of Very sensitive to number of
hidden unitshellip as a result largely hidden unitshellip as a result largely being supplanted by SVMs (SVMs being supplanted by SVMs (SVMs take very different approach to take very different approach to getting non-linearity)getting non-linearity)
Looking Ahead

• The perceptron rule can also be thought of as modifying weights on data points rather than on features (see the sketch after this list)
• Instead of processing all data at once (batch) vs. one example at a time, one could imagine processing 2 data points at a time, adjusting their relative weights based on their relative errors
• This is what Platt's SMO does (the SVM implementation in Weka)
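To make the "weights on data points" view concrete, the following sketch shows the standard dual-form (kernel) perceptron, in which each training example carries its own coefficient alpha_n instead of the model carrying per-feature weights. This is the representational idea SMO builds on, though SMO itself solves the SVM optimization rather than running this update rule; all names below are illustrative:

import numpy as np

def dual_perceptron(X, t, kernel=np.dot, epochs=10):
    # X: (N, d) array of examples; t: (N,) labels in {-1, +1}
    # learns one weight (alpha) per data point rather than one per feature
    N = len(X)
    alpha = np.zeros(N)
    for _ in range(epochs):
        for n in range(N):
            # prediction is a weighted vote of the training points themselves
            score = sum(alpha[m] * t[m] * kernel(X[m], X[n]) for m in range(N))
            if np.sign(score) != t[n]:
                alpha[n] += 1.0        # misclassified: increase this point's weight
    return alpha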
Backup Slide to help with Derivative of Sigmoid
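The figure on this backup slide is not reproduced in this text export; for reference, the standard derivation it refers to (assuming F is the sigmoid used above) is:

F(net) = \frac{1}{1 + e^{-net}}

F'(net) = \frac{e^{-net}}{(1 + e^{-net})^{2}}
        = F(net)\,\bigl(1 - F(net)\bigr)
        = out\,(1 - out)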
© Jude Shavlik 2006, David Page 2010
CS 760 – Machine Learning (UW-Madison)