
Page 1:

Announcements

• HW1 assigned, due Monday
• HW2 may be assigned prior to next class, check email
• Reading Assignment
  • Chapter 3 of the textbook (decision trees)
• Midterm exam
  • In class on Oct 25

Page 2:

Last Time

• How accurate is algorithm A? What are its 95% confidence intervals?
  • Apply the CLT to get a Gaussian PDF for the accuracy
  • Find the Z statistic to get the CI
• Do the performances of algorithms A & B differ?
  • Use 10-fold CV to get deltas
  • Estimate the mean and std dev of the deltas
  • Compute the t statistic and get a CI for the delta
  • Does it contain 0?
• How can we evaluate algorithms if FN and FP costs differ?
  • ROC curves

Page 3:

Today's Topics

• Finish ROC + precision-recall curves
• Our next SL algorithm, Logistic Regression
  • Discriminative vs. generative
• Perceptrons, neural networks

Page 4:

Plot ROC Curve Example

ML Algo Output (Sorted)   Correct Category
Ex 9     .99              +
Ex 7     .98              +
Ex 1     .72              -
Ex 2     .70              +
Ex 6     .65              +
Ex 10    .51              -
Ex 3     .39              -
Ex 5     .24              +
Ex 4     .11              -
Ex 8     .01              -

[Figure: ROC plot with TP rate = Prob(alg outputs + | + is correct) on the y-axis and FP rate = Prob(alg outputs + | - is correct) on the x-axis, both running from 0 to 1.0.]

Points traced as the threshold is lowered:
TPR = 2/5, FPR = 0/5
TPR = 2/5, FPR = 1/5
TPR = 4/5, FPR = 1/5
TPR = 4/5, FPR = 3/5
TPR = 5/5, FPR = 3/5
TPR = 5/5, FPR = 5/5
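A minimal sketch (my own Python, not from the lecture) of how such TPR/FPR points are produced by sweeping a threshold down the sorted scores; the function and variable names are illustrative only:

```python
# Sweep a threshold down the sorted scores and record (FPR, TPR) after each example.

def roc_points(scored_examples):
    """scored_examples: list of (score, label) with label '+' or '-',
    assumed sorted by score, highest first."""
    n_pos = sum(1 for _, label in scored_examples if label == '+')
    n_neg = len(scored_examples) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]                        # threshold above the highest score
    for _, label in scored_examples:
        if label == '+':
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))  # (FPR, TPR)
    return points

examples = [(0.99, '+'), (0.98, '+'), (0.72, '-'), (0.70, '+'), (0.65, '+'),
            (0.51, '-'), (0.39, '-'), (0.24, '+'), (0.11, '-'), (0.01, '-')]
print(roc_points(examples))
# Includes (0/5, 2/5), (1/5, 2/5), (1/5, 4/5), (3/5, 4/5), (3/5, 5/5), (5/5, 5/5)
```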

Page 5:

Area Under ROC Curve

• A common metric for experiments is to numerically integrate the ROC curve

[Figure: ROC curve with TP rate on the y-axis and FP rate on the x-axis (both from 0 to 1.0); the area under the curve is the metric.]
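A minimal sketch of that numerical integration, assuming the ROC points are given as (FPR, TPR) pairs like those computed above (trapezoid rule; names are mine):

```python
# Numerically integrate an ROC curve with the trapezoid rule to get the AUC.

def auc(points):
    pts = sorted(points)                          # order by FPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0       # trapezoid between successive points
    return area

roc = [(0.0, 0.0), (0.0, 0.4), (0.2, 0.4), (0.2, 0.8),
       (0.6, 0.8), (0.6, 1.0), (1.0, 1.0)]
print(auc(roc))   # ~0.80 for the example on the previous slide
```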

Page 6:

Precision vs Recall

• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• Notice that TN is not used in either formula
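A one-line sketch of the two formulas; the counts below are hypothetical, chosen only for illustration:

```python
# Precision and recall from confusion-matrix counts.
# Note that TN never appears, matching the formulas above.

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

print(precision_recall(tp=4, fp=1, fn=1))   # (0.8, 0.8)
```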

Page 7:

ROC vs Recall-Precision

• You can get very different visual results on the same data.

[Figure: the same data plotted as an ROC curve (P(+ | + is correct) vs. P(+ | - is correct)) and as a recall-precision curve (precision vs. recall).]

The reason for this is that there may be lots of negative ('-') examples.

Page 8:

Recall-Precision Curve

• You cannot simply connect the dots in recall-precision curves.
• See Goadrich, Oliphant, & Shavlik, ILP '04

[Figure: a recall-precision curve (precision on the y-axis, recall on the x-axis), with an interpolated point marked x.]

Page 9:

Experimental Methodology Wrapup

• Never train on test sets (use tuning sets)
• Use the central-limit theorem to place confidence intervals on measurements
• Paired t-tests provide a sensitive way to judge whether two algorithms perform differently
• The t-test is a useful heuristic for guiding research
• Use a two-tailed test
• ROC curves are better than accuracy

Page 10:

Next Topic: Logistic Regression

Page 11:

The Logistic Function (also called the sigmoid)

Logistic function:

$$y = \frac{1}{1 + \exp(-x)}$$

[Figure: plot of y against x showing the S-shaped sigmoid.]

• The sigmoid dates back to the 19th century
• It was originally used to model the growth of populations
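A minimal sketch of the function itself (names are mine):

```python
# The logistic (sigmoid) function defined above.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(-5), sigmoid(0), sigmoid(5))   # ~0.0067, 0.5, ~0.9933
```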

Page 12:

The Logistic Function (also called the sigmoid)

$$\Pr(C = 1 \mid F) = \frac{1}{1 + \exp(-g(F))}$$

• Logistic Regression assumes the conditional Pr(C=1|F) is a sigmoid, where g is a linear function of the features and F is a real-valued feature vector

[Figure: the same sigmoid plot of y against x.]

Page 13:

Logistic Regression

$$\Pr(C = 1 \mid F) = \frac{1}{1 + \exp(-g(F))}$$

$$\Pr(C = 0 \mid F) = \frac{\exp(-g(F))}{1 + \exp(-g(F))}$$

This gives us Pr(C=0|F), since Pr(C=1|F) + Pr(C=0|F) = 1.0, where

$$g(F) = w_0 + w_1 f_1 + \ldots + w_N f_N$$

So the odds are

$$\frac{\Pr(C = 1 \mid F)}{\Pr(C = 0 \mid F)} = \frac{1}{\exp(-g(F))} = \exp(g(F))$$

And

$$\ln\!\left(\frac{\Pr(C = 1 \mid F)}{\Pr(C = 0 \mid F)}\right) = w_0 + w_1 f_1 + \ldots + w_N f_N$$
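A small numeric sketch of these relationships; the weights and feature values below are made up purely for illustration:

```python
# Compute g(F), Pr(C=1|F), Pr(C=0|F), and check that the log-odds equal g(F).
import math

def g(weights, features):                        # g(F) = w0 + w1*f1 + ... + wN*fN
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))

def pr_c1(weights, features):
    return 1.0 / (1.0 + math.exp(-g(weights, features)))

w = [0.5, 1.0, -2.0]                             # hypothetical w0, w1, w2
f = [1.0, 0.25]                                  # hypothetical f1, f2
p1 = pr_c1(w, f)
p0 = 1.0 - p1                                    # Pr(C=0|F), since the two sum to 1
print(math.log(p1 / p0), g(w, f))                # both equal 1.0
```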

Page 14:

LR Decision Rule

Predict that the class is + if

$$T < w_0 + w_1 f_1 + \ldots + w_N f_N$$

where $w_0 + w_1 f_1 + \ldots + w_N f_N = \ln\!\left(\frac{\Pr(C = 1 \mid F)}{\Pr(C = 0 \mid F)}\right)$ and T is a threshold (0 if FP and FN costs are equal).
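A sketch of this rule as code; the weights, features, and the nonzero threshold are hypothetical:

```python
# LR decision rule: predict '+' when the linear score exceeds the threshold T
# (T = 0 if FP and FN costs are equal).

def predict(weights, features, T=0.0):
    score = weights[0] + sum(w * f for w, f in zip(weights[1:], features))
    return '+' if score > T else '-'

print(predict([0.5, 1.0, -2.0], [1.0, 0.25]))          # '+'  (score = 1.0 > 0)
print(predict([0.5, 1.0, -2.0], [1.0, 0.25], T=2.0))   # '-'  (a costlier FP raises T)
```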

Page 15:

The Decision Boundary of Logistic Regression is a hyperplane (a line in 2D)

If $T < w_0 + w_1 f_1 + \ldots + w_N f_N$, predict +; otherwise predict -.

In 2D the boundary is the line $w_0 + w_1 f_1 + w_2 f_2 - T = 0$.

[Figure: positive (+) and negative (-) examples in the (f1, f2) plane, separated by this line.]

Page 16:

Encoding Nominal Features

• Logistic Regression requires that examples be represented as a vector of real values (as do perceptrons, neural nets, SVMs, ...)
• How can we transform feature vectors with nominal features into real values?

Page 17:

Two Possibilities

For a nominal feature with M possible values:

1) Assign each value an integer between 1 and M
   • Color = {Red, Green, Blue} → Color = {1, 2, 3}
   • Not a good idea (why?)
2) Create M binary features where, for each example (without missing features), exactly one derived feature has the value 1 and the other M-1 features have the value 0
   • Color = {Red, Green, Blue}
   • isRed = {0,1}, isGreen = {0,1}, isBlue = {0,1}
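A minimal sketch of possibility 2 (one-of-M, or "one-hot", encoding); names are mine:

```python
# One-hot encoding of a nominal feature: exactly one derived feature is 1.

def one_hot(value, possible_values):
    return [1 if value == v else 0 for v in possible_values]

colors = ['Red', 'Green', 'Blue']
print(one_hot('Green', colors))   # [0, 1, 0] -> isRed=0, isGreen=1, isBlue=0
```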

Page 18:

Representation of Pr(C|F) is the same for LR and Naïve Bayes with nominal features

Logistic Regression:

$$\ln\!\left(\frac{\Pr(C = 1 \mid F)}{\Pr(C = 0 \mid F)}\right) = w_0 + w_1 f_1 + \ldots + w_N f_N$$

Naïve Bayes (sum over features):

$$\ln\!\left(\frac{\Pr(C = 1 \mid F)}{\Pr(C = 0 \mid F)}\right) = \ln\frac{\Pr(C = 1)}{\Pr(C = 0)} + \sum_i \ln\frac{\Pr(f_i \mid C = 1)}{\Pr(f_i \mid C = 0)}$$

Rewriting the sum over features as a sum over features and values, using derived binary features $\tilde f_{i,v}$:

$$= \ln\frac{\Pr(C = 1)}{\Pr(C = 0)} + \sum_i \sum_v \tilde f_{i,v} \, \ln\frac{\Pr(f_i = v \mid C = 1)}{\Pr(f_i = v \mid C = 0)}$$


Page 20:

What about real-valued features?

If

$$\Pr(f_i \mid C = 1) \sim N(\mu_{i1}, \sigma_i) \quad \text{and} \quad \Pr(f_i \mid C = 0) \sim N(\mu_{i0}, \sigma_i) \qquad \text{(same variance)}$$

then

$$w_i = \frac{\mu_{i1} - \mu_{i0}}{\sigma_i^2}$$

(see the text for details). The log of the ratio of two Gaussians with equal variance is linear in $f_i$.
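A short sketch of the derivation the slide defers to the text (the Gaussian normalizing constants cancel because the variances are equal, and the quadratic terms in $f_i$ cancel as well):

$$\ln\frac{\Pr(f_i \mid C=1)}{\Pr(f_i \mid C=0)} = -\frac{(f_i - \mu_{i1})^2}{2\sigma_i^2} + \frac{(f_i - \mu_{i0})^2}{2\sigma_i^2} = \underbrace{\frac{\mu_{i1} - \mu_{i0}}{\sigma_i^2}}_{w_i}\, f_i \;+\; \frac{\mu_{i0}^2 - \mu_{i1}^2}{2\sigma_i^2}$$

So the coefficient of $f_i$ is exactly the $w_i$ stated above, and the constant term folds into $w_0$.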

Page 21:

Learning Task

• Given: labeled examples {(C, F)}
• Do: find a good setting of the weights W

Page 22:

Learning Parameters for Logistic Regression

• Typically, we want the weights W that maximize the conditional log-likelihood

$$\hat W \leftarrow \arg\max_W L(W)$$

$$L(W) = \sum_k \ln\big(\Pr(C^k \mid F^k)\big) \qquad \text{(sum over training examples)}$$

Since each setting of the weights has an associated likelihood, we can view the likelihood as a function of the weights.

Page 23:

"Weight Space"

• Given the feature representation, the weights W are free parameters that define a space
• Each point in "weight space" corresponds to an LR model
• Associated with each point is a conditional log-likelihood
• One way to do LR learning is to perform "gradient ascent" in weight space

For LR, L(W) is a concave function (it has a single global maximum), so we are guaranteed to find the global maximum.

[Figure: L(W) plotted against W, with the goal at the single peak.]

Page 24:

The Gradient-Ascent Rule

$$\nabla L(W) = \left[\frac{\partial L}{\partial w_0}, \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_N}\right] \qquad \text{(the "gradient")}$$

− The direction of the gradient at W is the direction of fastest increase
− The magnitude of the gradient at W is the rate of fastest increase
− Since we want to increase L(W), we want to go "up hill"
− We'll take a finite step in weight space:

$$\Delta W = \eta \, \nabla L(W) \qquad \text{or} \qquad \Delta w_i = \eta \, \frac{\partial L}{\partial w_i}$$

where Δ ("delta") is the change to W and η is the step size.

[Figure: the surface L(W) over (W1, W2), with an arrow showing an uphill step.]

Page 25:

"On Line" vs. "Batch" Updates

• We can either update W after each example is examined (on-line / stochastic updates) or after the entire training set is examined (batch updates)
  • On-line is typically much faster
  • But it is dependent on the order of the examples
  • Will it converge to the same spot?
• For non-concave "objective functions," on-line and batch processing will typically end up in different places

Page 26:

Computing the LR Gradient

Page 11

Page 27:

Logistic Regression Update Rule

$$w_i \leftarrow w_i + \eta \sum_k f_i^{\,k} \left(y^k - \Pr(y^k = 1 \mid F^k, W)\right)$$

The sum is over training examples, and $\left(y^k - \Pr(y^k = 1 \mid F^k, W)\right)$ is the prediction error for example k. Note that this is the batch update rule.
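A minimal sketch (my own code, not the course's) of batch gradient ascent using the update rule above; labels $y^k$ are 0/1 and each feature vector is prefixed with $f_0 = 1$ so that $w_0$ acts as the intercept:

```python
# Batch gradient ascent on the conditional log-likelihood of logistic regression.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_lr(examples, n_features, eta=0.1, epochs=1000):
    """examples: list of (features, y) with y in {0, 1}."""
    w = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for features, y in examples:
            f = [1.0] + list(features)                           # f0 = 1
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)))    # Pr(y=1 | F, W)
            for i in range(len(w)):
                grad[i] += f[i] * (y - p)                        # prediction error * f_i
        w = [wi + eta * gi for wi, gi in zip(w, grad)]           # batch update
    return w

# Toy data in which the class depends only on f1.
data = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
print(train_lr(data, n_features=2))   # w1 grows positive; w2 stays near 0
```

The on-line (stochastic) variant discussed on the previous slide simply applies the same correction after each example instead of accumulating it over the whole training set.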

Page 28:

Two Decisions Needed for a Learning Procedure

1) What are we trying to optimize?
2) How are we going to carry out the optimization?

Page 29:

Discriminant Models

• Can classify instances into categories
  • Captures differences between categories
  • May not describe all features
  • Example: decision trees (covered later)
• Efficient and simple

Page 30:

Generative Models

• Can create complete input feature vectors
  • Describes the distributions of all features
  • Stochastically creates a plausible vector
  • Example: Bayes net (from above)

Page 31:

Using Generative Models

• Make a model to generate positives
• Make a model to generate negatives
• Classify a test example based on which is more likely to generate it
  • The Naïve Bayes ratio does this

Page 32:

Some Properties (Ng & Jordan, NIPS '02)

• If the NB assumption holds, asymptotic accuracy is the same
  • Otherwise LR accuracy > NB accuracy as the number of training examples increases
• NB converges to its asymptotic performance with fewer examples; LR takes more
• NB is faster to train

Page 33:

Neural Nets

Page 34:

Perceptrons

[Figure: a perceptron with input units f1, f2, ..., fN plus a bias input f0 = 1, connected to a single output unit through weights w0, w1, w2, ..., wN.]

Decision rule:

$$T < w_0 + w_1 f_1 + \ldots + w_N f_N$$

The decision rule for perceptrons has the same form as the decision rule for logistic regression and naïve Bayes, so perceptrons are linear separators.

Page 35:

Training Perceptrons

• Perceptron training rule:

$$w_i \leftarrow w_i + \eta \, (\text{target} - \text{output}) \times f_i$$

  • Almost identical to the LR on-line training rule
  • If the training data is not linearly separable, training may not converge
• Delta Rule
  • A gradient-descent rule derived from an objective function based on minimizing the squared error of a linear output unit
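A minimal sketch of the perceptron training rule above, with a step-function output and the threshold folded in as the bias weight $w_0$ (the names and toy data are mine):

```python
# Perceptron training rule: w_i <- w_i + eta * (target - output) * f_i

def train_perceptron(examples, n_features, eta=0.1, epochs=100):
    """examples: list of (features, target) with target in {0, 1}."""
    w = [0.0] * (n_features + 1)
    for _ in range(epochs):
        for features, target in examples:
            f = [1.0] + list(features)                              # f0 = 1
            output = 1 if sum(wi * fi for wi, fi in zip(w, f)) > 0 else 0
            w = [wi + eta * (target - output) * fi for wi, fi in zip(w, f)]
    return w

# Linearly separable data (logical OR) converges; XOR would not.
or_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
print(train_perceptron(or_data, n_features=2))
```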

Page 36:

Should you?

"Fenwick here is biding his time waiting for neural networks."

Page 37:

Concept Learning

Learning systems differ in how they represent concepts:

[Figure: training examples flow into different learners — Backpropagation, C4.5 / CART, AQ / FOIL, ... — each producing its own representation of the learned concept (e.g., X ∧ Y → Z).]

Page 38:

Advantages of Neural Networks

• Provide the best predictive accuracy for some problems
  − Being supplanted by SVMs?
• Can represent a rich class of concepts

[Figure: example predictions, e.g., positive/negative labels and "Saturday: 40% chance of rain, Sunday: 25% chance of rain."]

Page 39:

Artificial Neural Networks (ANNs)

Networks

[Figure: a network of input units, hidden units, and output units connected by weights, with error propagated back through the network; a recurrent link is also shown.]

Page 40:

ANNs (continued)

Individual units:

$$\text{output}_i = F\Big(\sum_j \text{weight}_{i,j} \times \text{output}_j\Big)$$

where

$$F(\text{input}_i) = \frac{1}{1 + e^{-(\text{input}_i - \text{bias}_i)}}$$

[Figure: a single unit with several inputs, a bias, and an output.]

Page 41:

Perceptron Convergence Theorem (Rosenblatt, 1957)

Perceptron = no hidden units

If a set of examples is learnable, the DELTA rule will eventually find the necessary weights.

However, a perceptron can only learn/represent linearly separable datasets.

Page 42:

Linear Separability

Consider a perceptron. Its output is 1 if

$$w_1 x_1 + w_2 x_2 + \ldots + w_n x_n > \Theta$$

and 0 otherwise.

In terms of feature space, the boundary is $w_i x_i + w_j x_j = \Theta$, i.e.,

$$x_j = -\frac{w_i}{w_j} x_i + \frac{\Theta}{w_j} \qquad [\,y = mx + b\,]$$

[Figure: + and - examples scattered in the feature plane.]

Hence, a perceptron can only classify examples if a "line" (hyperplane) can separate them.

Page 43:

The XOR Problem

Exclusive OR (XOR):

     Input    Output
a)    0 0       0
b)    0 1       1
c)    1 0       1
d)    1 1       0

Not linearly separable: [Figure: the four points a, b, c, d at the corners of the unit square; no single line separates the 1-outputs from the 0-outputs.]

A Neural Network Solution: [Figure: a network over inputs X1 and X2 with a hidden unit, using weights 1, 1, -1, -1, 1, 1 and thresholds set to 0 ("Let Θ = 0!").]
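The exact weights on the slide are hard to recover from the extracted text, so the sketch below uses one standard two-layer solution with step units instead: a hidden unit computes X1 AND X2, and the output unit fires when X1 + X2 is on but the AND unit is not.

```python
# One hidden unit (AND) is enough to make XOR linearly separable for the output unit.

def step(x, theta):
    return 1 if x > theta else 0

def xor_net(x1, x2):
    h_and = step(x1 + x2, 1.5)                 # hidden unit: X1 AND X2
    return step(x1 + x2 - 2 * h_and, 0.5)      # output: (X1 + X2) - 2*AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))   # 0, 1, 1, 0
```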

Page 44:

The Need for Hidden Units

If there is one layer of enough hidden units (possibly 2^N for Boolean functions, where N = the number of input units), the input can be recoded.

This recoding allows any mapping to be represented (Minsky & Papert).

Question: How do we provide an error signal to the interior units?

Page 45:

Hidden Units

One view: allow a system to create its own internal representation – one for which problem solving is easy.

[Figure: a perceptron.]

Page 46:

Reformulating XOR

[Figure: XOR expressed over inputs X1, X2, and a derived input X3 = X1 ∧ X2; or, equivalently, a network over X1 and X2 in which a hidden unit computes X3.]

So, if a hidden unit can learn to represent X1 ∧ X2, the solution is easy.

Page 47:

Backpropagation

Page 48:

Backpropagation

• Backpropagation involves a generalization of the delta rule
• Rumelhart, Parker, and Le Cun (and Bryson & Ho (1969), Werbos (1974)) independently developed (1985) a technique for determining how to adjust the weights of interior ("hidden") units
• The derivation involves partial derivatives $\partial E / \partial W_{i,j}$ (the error signal); hence, the threshold function must be differentiable

Page 49:

Weight Space

• Given a network layout, the weights and biases are free parameters that define a space
• Each point in this Weight Space (w) specifies a network
• Associated with each point is an error rate, E, over the training data
• BackProp performs gradient descent in weight space

Page 50:

Gradient descent in weight space

[Figure: the error surface E plotted over the weights (W1, W2), with the gradient ∇E and a downhill step Δw in weight space.]

Page 51:

The Gradient-Descent Rule

$$\nabla E(w) = \left[\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_N}\right] \qquad \text{(the "gradient")}$$

− This is an N+1 dimensional vector (i.e., the 'slope' in weight space)
− Since we want to reduce errors, we want to go "down hill"
− We'll take a finite step in weight space:

$$\Delta w = -\eta \, \nabla E(w) \qquad \text{or} \qquad \Delta w_i = -\eta \, \frac{\partial E}{\partial w_i}$$

where Δ ("delta") is the change to w and η is the step size.

[Figure: the error surface E over (W1, W2), with a step taken downhill.]

Page 52:

"On Line" vs. "Batch" BP (continued)

• BATCH – add the Δw vectors for each training example, then 'move' in weight space.
• ON-LINE – "move" after each example (aka stochastic gradient descent)

[Figure: in weight space, batch takes one combined step per pass over the data, while on-line takes a sequence of smaller steps Δw1, Δw2, Δw3 along the error surface E.]

* The final locations in weight space need not be the same for BATCH and ON-LINE
* Note that Δw_i,BATCH ≠ Δw_i,ON-LINE for i > 1

Page 53:

Backprop Calculations

• Assume one layer of hidden units (standard topology), with the layers indexed k (inputs), j (hidden), and i (outputs)

1. $\text{Error} = \tfrac{1}{2}\sum_i \big(\text{Teacher}_i - \text{Output}_i\big)^2$
2. $\phantom{\text{Error}} = \tfrac{1}{2}\sum_i \big(\text{Teacher}_i - f\big[\sum_j W_{i,j} \times \text{Output}_j\big]\big)^2$
3. $\phantom{\text{Error}} = \tfrac{1}{2}\sum_i \big(\text{Teacher}_i - f\big[\sum_j W_{i,j} \times f\big(\sum_k W_{j,k} \times \text{Output}_k\big)\big]\big)^2$

• Determine $\partial \text{Error} / \partial W_{i,j}$ (use equation 2) and $\partial \text{Error} / \partial W_{j,k}$ (use equation 3)
• Recall $\Delta w_{x,y} = -\eta \,(\partial E / \partial w_{x,y})$
* See Table 4.2 for the results

Page 54:

Differentiating the Logistic Function

$$\text{out}_i = \frac{1}{1 + e^{-\left(\sum_j w_{j,i}\, \text{out}_j \;-\; \text{bias}_i\right)}}$$

$$f'(\text{weighted input}) = \text{out}_i \,(1 - \text{out}_i)$$

[Figure: f(weighted input) rises from 0 through 1/2 to 1, while f'(weighted input) peaks at 1/4 (where out_i = 1/2) and falls toward 0 at both extremes.]

Notice that when a unit is saturated, there is no (or very little) change in its weights, even if its output is totally wrong.
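A quick numerical check of the identity $f'(x) = f(x)\,(1 - f(x))$ and of the saturation behavior noted above (names are mine):

```python
# Compare a numerical derivative of the sigmoid against out * (1 - out).
import math

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

def numeric_deriv(x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-6.0, 0.0, 6.0):
    print(x, numeric_deriv(x), f(x) * (1 - f(x)))
# At x = 0 both are 0.25 (the maximum); at x = +/-6 both are ~0.0025,
# so a saturated unit's weights barely change even when its output is wrong.
```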

Page 55:

The Need for Symmetry Breaking

Assume all weights are initially the same

Can the corresponding (mirror-image) weight ever differ? - NO

WHY? - by symmetry

Solution - randomize initial weights

Page 56:

Using BP to Train ANNs

1. Initialize the weights & biases to small random values (e.g., in [-0.3, 0.3])
2. Randomize the order of the training examples; for each, do:
   a) Propagate activity forward to the output units (layers indexed k → j → i):

$$\text{out}_i = f\Big(\sum_j w_{i,j} \times \text{out}_j\Big)$$

Page 57:

Using BP to Train ANNs (continued)

   b) Compute the "error" for the output units:

$$\delta_i = f'(\text{net}_i) \times (\text{Teacher}_i - \text{out}_i)$$

   c) Compute the "error" for the hidden units:

$$\delta_j = f'(\text{net}_j) \times \sum_i \big(w_{i,j} \times \delta_i\big)$$

   d) Update the weights:

$$\Delta w_{i,j} = \eta \times \delta_i \times \text{out}_j \qquad \Delta w_{j,k} = \eta \times \delta_j \times \text{out}_k$$

where $f'(\text{net}_i) = \partial f(\text{net}_i) / \partial \text{net}_i$
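A minimal sketch, under the slide's assumptions (one hidden layer of sigmoid units, on-line updates; biases omitted for brevity), of steps a)–d) for a single training example. The function and variable names are mine, not the course's:

```python
# One on-line backprop step for a tiny 2-input, 2-hidden, 1-output network.
import math, random

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_one_example(W_ji, W_ij, x, teacher, eta=0.5):
    """W_ji[j][k]: input->hidden weights; W_ij[i][j]: hidden->output weights.
    x: input activations (out_k); teacher: target outputs."""
    # a) forward pass
    out_j = [f(sum(w * xk for w, xk in zip(row, x))) for row in W_ji]
    out_i = [f(sum(w * oj for w, oj in zip(row, out_j))) for row in W_ij]
    # b) output-unit errors: delta_i = f'(net_i) * (Teacher_i - out_i), f' = out*(1-out)
    delta_i = [oi * (1 - oi) * (t - oi) for oi, t in zip(out_i, teacher)]
    # c) hidden-unit errors: delta_j = f'(net_j) * sum_i (w_ij * delta_i)
    delta_j = [oj * (1 - oj) * sum(W_ij[i][j] * delta_i[i] for i in range(len(delta_i)))
               for j, oj in enumerate(out_j)]
    # d) weight updates: delta-w = eta * delta * upstream output
    for i, row in enumerate(W_ij):
        for j in range(len(row)):
            row[j] += eta * delta_i[i] * out_j[j]
    for j, row in enumerate(W_ji):
        for k in range(len(row)):
            row[k] += eta * delta_j[j] * x[k]

random.seed(0)
W_ji = [[random.uniform(-0.3, 0.3) for _ in range(2)] for _ in range(2)]   # step 1
W_ij = [[random.uniform(-0.3, 0.3) for _ in range(2)]]
backprop_one_example(W_ji, W_ij, x=[1.0, 0.0], teacher=[1.0])
print(W_ji, W_ij)
```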

Page 58:

Using BP to Train ANNs (continued)

3. Repeat until the training-set error rate is small enough (or until the tuning-set error rate begins to rise – see a later slide)
   − Should use "early stopping" (i.e., minimize error on the tuning set; more details later)
4. Measure accuracy on the test set to estimate generalization (future accuracy)