Announcements

• HW1 assigned, due Monday
• HW2 may be assigned prior to next class; check email
• Reading assignment
  • Chapter 3 of the textbook (decision trees)
• Midterm exam
  • In class on Oct 25

Last Time

• How accurate is algorithm A? What are its 95% confidence intervals?
  • Apply CLT to get Gaussian PDF for acc.
  • Find Z statistic to get CI
• Do performances of Algos. A & B differ?
  • Use 10-fold CV to get deltas
  • Estimate the mean and std dev of delta
  • Compute t stat and get CI for delta
  • Does it contain 0?
• How can we eval. algorithms if FN and FP costs differ?
  • ROC curves

Today's Topics

• Finish ROC + Precision-Recall curves
• Our next SL algorithm, Logistic Regression
  • Discriminative vs Generative
• Perceptrons, Neural networks

Plot ROC Curve Example

Example   ML Algo Output (Sorted)   Correct Category
Ex 9      .99                       +
Ex 7      .98                       +
Ex 1      .72                       -
Ex 2      .70                       +
Ex 6      .65                       +
Ex 10     .51                       -
Ex 3      .39                       -
Ex 5      .24                       +
Ex 4      .11                       -
Ex 8      .01                       -

[Figure: ROC curve. x-axis: FP rate = Prob (alg outputs + | - is correct), 0 to 1.0. y-axis: TP rate = Prob (alg outputs + | + is correct), 0 to 1.0.]

Points on the curve, moving the threshold down the sorted list:

TPR=(2/5) FPR=(0/5)
TPR=(2/5) FPR=(1/5)
TPR=(4/5) FPR=(1/5)
TPR=(4/5) FPR=(3/5)
TPR=(5/5) FPR=(3/5)
TPR=(5/5) FPR=(5/5)
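The points above can be reproduced with a short sketch. It assumes the ten (output, category) pairs from the table and lowers the threshold one example at a time; the function name `roc_points` is hypothetical and ties in the outputs are not handled.

```python
from fractions import Fraction

# (algorithm output, correct category is +) pairs copied from the slide's table.
examples = [(0.99, True), (0.98, True), (0.72, False), (0.70, True), (0.65, True),
            (0.51, False), (0.39, False), (0.24, True), (0.11, False), (0.01, False)]

def roc_points(scored):
    """Return (FPR, TPR) pairs, one per threshold step, starting at (0, 0)."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, is_pos in ranked if is_pos)
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(Fraction(0), Fraction(0))]
    for _, is_pos in ranked:
        # Lowering the threshold past this example makes the algorithm call it +.
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((Fraction(fp, n_neg), Fraction(tp, n_pos)))
    return points

points = roc_points(examples)
for fpr, tpr in points:
    print(f"TPR={tpr} FPR={fpr}")
```

The slide lists only the corner points of this staircase; the sketch emits every step, so the slide's six points are a subset of its output.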

Area Under ROC Curve

• A common metric for experiments is to numerically integrate the ROC curve

[Figure: ROC curve. x-axis: FP rate, 0 to 1.0. y-axis: TP rate, 0 to 1.0.]
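One way to do the numerical integration is the trapezoid rule. The sketch below assumes the (FPR, TPR) corner points from the earlier ROC example, with an added starting point at (0, 0); `area_under_curve` is a hypothetical helper name.

```python
from fractions import Fraction

# (FPR, TPR) corner points from the ROC example, preceded by the origin (an assumption).
corners = [(Fraction(0), Fraction(0)), (Fraction(0), Fraction(2, 5)),
           (Fraction(1, 5), Fraction(2, 5)), (Fraction(1, 5), Fraction(4, 5)),
           (Fraction(3, 5), Fraction(4, 5)), (Fraction(3, 5), Fraction(1)),
           (Fraction(1), Fraction(1))]

def area_under_curve(points):
    """Trapezoid-rule integral of TPR over FPR; points must be sorted by FPR."""
    area = Fraction(0)
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

auc = area_under_curve(corners)
print(auc)
```

Exact fractions are used only to keep the arithmetic easy to check by hand; floats would work equally well.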

Precision vs Recall

• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• Notice that TN is not used in either formula
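As a small illustration, the two formulas can be written directly. The counts below are an assumption: they correspond to calling the top five outputs of the earlier ROC example "+" (4 true positives, 1 false positive, 1 false negative). The sketch also assumes neither denominator is zero.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); TN does not appear in either formula."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Threshold placed between outputs .65 and .51 in the ROC example (assumed setting).
precision, recall = precision_recall(tp=4, fp=1, fn=1)
print(precision, recall)
```

Because TN is absent, adding many more correctly rejected negatives would leave both numbers unchanged.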

ROC vs Recall-Precision

• You can get very different visual results on the same data.

[Figure: ROC curve, P(+|+) against P(+|-), vs. Recall-Precision curve, Precision against Recall]

The reason for this is that there may be lots of - (negative) examples.

Recall-Precision Curve

• You cannot simply connect the dots in Recall-Precision curves.
• See Goadrich, Oliphant, & Shavlik, ILP'04

[Figure: Recall-Precision plot with points marked x. x-axis: Recall. y-axis: Precision.]

Exp Methodology Wrapup

• Never train on test sets (use tune sets).
• Use the central-limit theorem to place confidence intervals on measurements
• Paired t-tests provide a sensitive way to judge whether two algorithms perform differently.
  • t-test is a useful heuristic for guiding research
  • Use a two-tailed test
• ROC curves are better than accuracy

Next Topic: Logistic Regression

The Logistic Function (also called sigmoid)

Logistic function:

$$y = \frac{1}{1 + \exp(-x)}$$

[Figure: S-shaped plot of y against x]

• Sigmoid dates back to the 19th century
• Originally used to model growth of populations
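A minimal sketch of the logistic function follows. The branch on the sign of x is an implementation choice (not from the slides) that avoids overflow in `math.exp` for large negative inputs.

```python
import math

def sigmoid(x):
    """Logistic function y = 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x < 0, so exp(x) is in (0, 1)
    return z / (1.0 + z)

for x in (-5.0, 0.0, 5.0):
    print(x, sigmoid(x))
```

The output is 1/2 at x = 0 and approaches 0 and 1 at the two extremes, which is the S shape in the plot.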

The Logistic Function (also called sigmoid)

$$\Pr(C = 1 \mid F) = \frac{1}{1 + \exp(-g(F))}$$

where g(F) is a linear function of the features and F is a real-valued feature vector.

• Logistic Regression assumes the conditional Pr(C=1|F) is a sigmoid

[Figure: plot of y against x]

Logistic Regression

$$\Pr(C = 1 \mid F) = \frac{1}{1 + \exp(-g(F))}$$

$$\Pr(C = 0 \mid F) = \frac{\exp(-g(F))}{1 + \exp(-g(F))}$$

This gives us Pr(C=0|F) since Pr(C=1|F) + Pr(C=0|F) = 1.0

$$g(F) = w_0 + w_1 f_1 + \dots + w_N f_N$$

So the odds are

$$\frac{\Pr(C = 1 \mid F)}{\Pr(C = 0 \mid F)} = \frac{1}{\exp(-g(F))} = \exp(g(F))$$

And

$$\ln\left(\frac{\Pr(C = 1 \mid F)}{\Pr(C = 0 \mid F)}\right) = w_0 + w_1 f_1 + \dots + w_N f_N$$

LR Decision Rule

$$\ln\left(\frac{\Pr(C = 1 \mid F)}{\Pr(C = 0 \mid F)}\right) = w_0 + w_1 f_1 + \dots + w_N f_N$$

Predict class is + if

$$T < w_0 + w_1 f_1 + \dots + w_N f_N$$

where T is a threshold (0 if equal FP and FN costs).
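The rule can be sketched in a few lines. The weights and feature values below are made up for illustration, and `predict` is a hypothetical name; `weights[0]` plays the role of w0.

```python
def predict(weights, features, threshold=0.0):
    """Predict '+' if T < w0 + w1*f1 + ... + wN*fN, otherwise '-'."""
    g = weights[0] + sum(w * f for w, f in zip(weights[1:], features))
    return "+" if threshold < g else "-"

weights = [-1.0, 2.0, 0.5]  # illustrative w0, w1, w2 (not learned from data)
print(predict(weights, [1.0, 0.0]))                 # g = 1.0
print(predict(weights, [0.0, 1.0]))                 # g = -0.5
print(predict(weights, [1.0, 0.0], threshold=2.0))  # stricter threshold
```

Raising the threshold above 0 makes the classifier more reluctant to predict +, which is how unequal FP and FN costs would enter.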

The Decision Boundary of Logistic Regression is a hyperplane (line in 2D)

If $T < w_0 + w_1 f_1 + \dots + w_N f_N$, predict +; otherwise predict -.

[Figure: + and - examples in the ($f_1$, $f_2$) plane, separated by the line $w_0 + w_1 f_1 + w_2 f_2 - T = 0$]

Encoding Nominal Features

• Logistic Regression requires that examples be represented as a vector of real values (also perceptrons, Neural Nets, SVMs, …)
• How can we transform FVs (feature vectors) with nominal features to real values?

Two Possibilities

• For a nominal feature with M possible values:
  1) Assign each value an integer between 1 and M
     • Color = {Red, Green, Blue} → Color = {1, 2, 3}
     • Not a good idea (why?)
  2) Create M binary features where for each example (without missing features) exactly one derived feature has a value of 1 and M-1 features have a value of 0
     • Color = {Red, Green, Blue}
     • isRed={0,1}, isGreen={0,1}, isBlue={0,1}
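The second possibility (M derived binary features) can be sketched as follows, using the Color example. The function name `one_hot` and the fixed ordering of values are assumptions for illustration.

```python
COLORS = ["Red", "Green", "Blue"]  # the M = 3 possible values, in a fixed order

def one_hot(value, possible_values):
    """Return M binary features; exactly one is 1 when value is a known value."""
    return [1 if value == candidate else 0 for candidate in possible_values]

# Derived features are ordered as [isRed, isGreen, isBlue].
for color in COLORS:
    print(color, one_hot(color, COLORS))
```

Unlike the integer encoding, this does not impose an ordering or spacing on the values, so a linear model is not forced to treat Green as lying "between" Red and Blue.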

Representation of Pr(C|F) same for LR and Naive Bayes with nominal features

Logistic Regression:

$$\ln\left(\frac{\Pr(C = 1 \mid F)}{\Pr(C = 0 \mid F)}\right) = w_0 + w_1 f_1 + \dots + w_N f_N$$

Naive Bayes (sum over features):

$$\ln\left(\frac{\Pr(C = 1 \mid F)}{\Pr(C = 0 \mid F)}\right) = \ln\frac{\Pr(C = 1)}{\Pr(C = 0)} + \sum_i \ln\frac{\Pr(f_i \mid C = 1)}{\Pr(f_i \mid C = 0)}$$

$$= \ln\frac{\Pr(C = 1)}{\Pr(C = 0)} + \sum_i \sum_v \ln\frac{\Pr(f_i = v \mid C = 1)}{\Pr(f_i = v \mid C = 0)} \, \tilde{f}_{i,v}$$

In the last line the outer sum is over features, the inner sum is over values, and $\tilde{f}_{i,v}$ is the derived binary feature that is 1 when feature i has value v.


What about real valued features?

$$\ln\left(\frac{\Pr(C = 1 \mid F)}{\Pr(C = 0 \mid F)}\right) = \ln\frac{\Pr(C = 1)}{\Pr(C = 0)} + \sum_i \ln\frac{\Pr(f_i \mid C = 1)}{\Pr(f_i \mid C = 0)}$$

If

$$\Pr(f_i \mid C = 1) \sim N(\mu_{i1}, \sigma_i) \quad\text{and}\quad \Pr(f_i \mid C = 0) \sim N(\mu_{i0}, \sigma_i) \qquad \text{(same variance)}$$

then

$$w_i = \frac{\mu_{i1} - \mu_{i0}}{\sigma_i^2}$$

(see text for details)

The log of the ratio of two Gaussians with equal variance is a line
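A short derivation sketch of that last statement, in the slide's notation for a single real-valued feature $f_i$. The normalizing constants cancel because both Gaussians share $\sigma_i$; the grouping into slope and intercept is added here for illustration.

```latex
\begin{align*}
\ln\frac{\Pr(f_i \mid C = 1)}{\Pr(f_i \mid C = 0)}
  &= -\frac{(f_i - \mu_{i1})^2}{2\sigma_i^2} + \frac{(f_i - \mu_{i0})^2}{2\sigma_i^2} \\
  &= \frac{2 f_i (\mu_{i1} - \mu_{i0}) - (\mu_{i1}^2 - \mu_{i0}^2)}{2\sigma_i^2} \\
  &= \underbrace{\frac{\mu_{i1} - \mu_{i0}}{\sigma_i^2}}_{w_i} \, f_i
     \; - \; \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}
\end{align*}
```

The quadratic terms in $f_i$ cancel, leaving a line in $f_i$ whose slope is the $w_i$ given above; the constant term is absorbed into $w_0$ together with the log prior ratio.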

Learning Task

• Given:
  • Labeled examples {(C,F)}
• Do:
  • Find a good setting of the weights W

Learning Parameters for Logistic Regression

• Typically, we want the weights W that maximize the conditional log-likelihood

$$\hat{W} \leftarrow \arg\max_W L(W)$$

$$L(W) = \sum_k \ln \Pr(C^k \mid F^k)$$

where the sum is over training examples.

Since each setting of the weights has an associated likelihood, we can view the likelihood as a function of the weights.
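To make "likelihood as a function of the weights" concrete, here is a minimal sketch that evaluates L(W) for a given weight setting. The two training examples and both weight settings are invented for illustration, and `log_likelihood` is a hypothetical name.

```python
import math

def log_likelihood(weights, examples):
    """Conditional log-likelihood: sum over examples k of ln Pr(C^k | F^k)."""
    total = 0.0
    for c, features in examples:
        g = weights[0] + sum(w * f for w, f in zip(weights[1:], features))
        p1 = 1.0 / (1.0 + math.exp(-g))  # Pr(C = 1 | F)
        total += math.log(p1 if c == 1 else 1.0 - p1)
    return total

examples = [(1, [2.0]), (0, [-1.0])]  # (class C, feature vector F), illustrative
print(log_likelihood([0.0, 0.0], examples))  # every prediction is 0.5
print(log_likelihood([0.0, 1.0], examples))  # a better setting scores higher
```

Each weight setting yields one number, so comparing settings amounts to comparing points on the L(W) surface.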

“Weight Space”

• Given feature representations, the weights W are free parameters that define a space
• Each point in “weight space” corresponds to an LR model
• Associated with each point is a conditional log likelihood
• One way to do LR learning is to perform “gradient ascent” in the weight space

[Figure: L(W) plotted against W, with the goal marked at the maximum]

For LR, L(W) is a concave function (it has a single global maximum), so we are guaranteed to find the global maximum

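The Gradient-Ascent Rule

The “gradient”:

$$\nabla L(W) \equiv \left[ \frac{\partial L}{\partial w_0}, \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \dots, \frac{\partial L}{\partial w_N} \right]$$

− The direction of the gradient at W is the direction of fastest increase
− The magnitude of the gradient at W is the rate of fastest increase
− Since we want to increase L(W), we want to go “up hill”
− We'll take a finite step in weight space:

$$\Delta W = \eta \, \nabla L(W) \quad\text{or}\quad \Delta w_i = \eta \, \frac{\partial L}{\partial w_i}$$

where “delta” = change to W and η is the step size.

[Figure: surface L over weights W1 and W2]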

“On Line” vs. “Batch” Updates

• We can either update W after each example is examined (on-line / stochastic updates) or after the entire training set is examined (batch updates)
  • On-line is typically much faster
  • But is dependent on the order of the examples
  • Will it converge to the same spot?
• For non-concave “objective functions”, online and batch processing will typically end up in different places

Computing the LR Gradient

Page 11
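For reference, a sketch of the calculation, not necessarily the one on the cited page. It assumes $y^k \in \{0, 1\}$ is the class label of example k, $f_0^k = 1$ so that $w_0$ is handled like the other weights, and $p^k$ is shorthand (introduced here) for $\Pr(y^k = 1 \mid F^k, W)$.

```latex
\begin{align*}
L(W) &= \sum_k \left[ y^k \ln p^k + (1 - y^k) \ln (1 - p^k) \right],
  \qquad p^k = \frac{1}{1 + \exp\!\left(-\sum_i w_i f_i^k\right)} \\
\frac{\partial p^k}{\partial w_i} &= p^k (1 - p^k) \, f_i^k \\
\frac{\partial L}{\partial w_i}
  &= \sum_k \left[ \frac{y^k}{p^k} - \frac{1 - y^k}{1 - p^k} \right] p^k (1 - p^k) \, f_i^k
   = \sum_k f_i^k \left( y^k - p^k \right)
\end{align*}
```

Each example contributes its feature value times its prediction error, which is the quantity that a gradient-ascent step scales by η.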

Logistic Regression Update Rule

$$w_i \leftarrow w_i + \eta \sum_k f_i^k \left( y^k - \Pr(y^k = 1 \mid F^k, W) \right)$$

• The sum is over training examples
• $y^k - \Pr(y^k = 1 \mid F^k, W)$ is the prediction error for example k

Note that this is the batch update rule
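A minimal sketch of the batch rule in action. The four one-feature training examples, the step size η = 0.1, and the 200 iterations are all assumptions chosen for illustration; each feature vector starts with a constant 1 so that `weights[0]` acts as w0.

```python
import math

def prob_positive(weights, features):
    """Pr(y = 1 | F, W) under logistic regression; features[0] is the constant 1."""
    g = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-g))

def log_likelihood(weights, data):
    total = 0.0
    for features, y in data:
        p = prob_positive(weights, features)
        total += math.log(p if y == 1 else 1.0 - p)
    return total

def batch_update(weights, data, eta):
    """One batch step: w_i <- w_i + eta * sum_k f_i^k * (y^k - Pr(y^k = 1 | F^k, W))."""
    errors = [y - prob_positive(weights, features) for features, y in data]
    return [w + eta * sum(features[i] * err for (features, _), err in zip(data, errors))
            for i, w in enumerate(weights)]

# Illustrative training set: ([f0 = 1, f1], y).
data = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
weights = [0.0, 0.0]
ll_start = log_likelihood(weights, data)
for _ in range(200):
    weights = batch_update(weights, data, eta=0.1)
ll_end = log_likelihood(weights, data)
print(weights, ll_start, ll_end)
```

All prediction errors are computed before any weight changes, which is what makes this the batch version; an on-line version would update the weights inside the loop over examples.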

Two Decisions Needed for Learning Procedure

1) What are we trying to optimize?
2) How are we going to carry out the optimization?

Discriminant Models

• Can classify instances into categories
  • Captures differences between categories
  • May not describe all features
  • Example: Decision trees (covered later)
• Efficient and simple

Generative Models

• Can create complete input feature vectors
  • Describes distributions of all features
  • Stochastically creates a plausible vector
  • Example: Bayes net (from above)

Using Generative Models

• Make a model to generate positives
• Make a model to generate negatives
• Classify a test example based on which is more likely to generate it
  • The Naïve Bayes ratio does this

Some Properties (Ng & Jordan, NIPS '02)

• If NB assumption holds, asymptotic accuracy is the same
  • Otherwise LR acc > NB acc as the number of training examples increases
• NB converges to asymptotic performance with fewer examples, LR takes more
• NB is faster to train

Neural Nets

Perceptrons

[Figure: input units $f_1, f_2, \dots, f_N$ with weights $w_1, w_2, \dots, w_N$, plus a constant input $f_0 = 1$ with weight $w_0$, all feeding a summation (∑) output unit]

$$T < w_0 + w_1 f_1 + \dots + w_N f_N$$

The decision rule for perceptrons has the same form as the decision rule for logistic regression and naïve Bayes

So perceptrons are linear separators

Training Perceptrons

• Perceptron training rule:

$$w_i \leftarrow w_i + \eta \, (\text{target} - \text{output}) \times f_i$$

  Almost identical to the LR on-line training rule
• If training data is not linearly separable, training may not converge
• Delta Rule
  • gradient descent rule derived from an objective function based on minimizing the squared error of a linear output unit.
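A minimal sketch of the perceptron training rule, applied to the Boolean AND function (a linearly separable task chosen here for illustration). It assumes a threshold of 0, a constant first feature f0 = 1, initial weights of 0, and η = 1; none of these settings come from the slides.

```python
def output(weights, features):
    """Thresholded unit: 1 if the weighted sum exceeds 0, else 0."""
    return 1 if sum(w * f for w, f in zip(weights, features)) > 0 else 0

def train_perceptron(data, eta=1.0, max_epochs=100):
    """Apply w_i <- w_i + eta * (target - output) * f_i after each example."""
    weights = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for features, target in data:
            error = target - output(weights, features)
            if error != 0:
                mistakes += 1
                weights = [w + eta * error * f for w, f in zip(weights, features)]
        if mistakes == 0:  # a full pass with no errors: training has converged
            break
    return weights

# AND over two inputs; each feature vector is [f0 = 1, f1, f2].
and_data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
weights = train_perceptron(and_data)
print(weights, [output(weights, f) for f, _ in and_data])
```

The weights change only when the thresholded output is wrong. Replacing the thresholded output with the sigmoid probability gives the on-line LR rule, which is the sense in which the two are almost identical.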

Should you?

[Cartoon caption: "Fenwick here is biding his time waiting for neural networks."]

Concept Learning

Learning systems differ in how they represent concepts:

[Figure: Training Examples feed several learners (Backpropagation; C4.5, CART; AQ, FOIL; …), each with its own concept representation (e.g., a rule over X ^ Y and Z)]

Advantages of Neural Networks

• Provide best predictive accuracy for some problems
  − Being supplanted by SVM's?
• Can represent a rich class of concepts

[Figure: regions labeled Positive, negative, Positive; example outputs "Saturday: 40% chance of rain" and "Sunday: 25% chance of rain"]

Artificial Neural Networks (ANNs)

[Figure: a network with labeled parts: input units, hidden units, output units, weight, error, recurrent link]

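ANNs (continued)

Individual units

[Figure: a single unit with inputs, a bias, and outputs]

$$\text{output}_i = F\Big(\sum_j \text{weight}_{i,j} \times \text{output}_j\Big)$$

where

$$F(\text{input}_i) = \frac{1}{1 + e^{-(\text{input}_i - \text{bias}_i)}}$$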

Perceptron Convergence Theorem (Rosenblatt, 1957)

Perceptron = no Hidden Units

If a set of examples is learnable, the DELTA rule will eventually find the necessary weights

However, a perceptron can only learn/represent linearly separable datasets.

Linear Separability

Consider a perceptron. Its output is

  1 if $W_1 X_1 + W_2 X_2 + \dots + W_n X_n > \Theta$
  0 otherwise

where Θ is the unit's threshold.

In terms of feature space:

$$W_i X_i + W_j X_j = \Theta$$

$$X_j = \frac{\Theta - W_i X_i}{W_j} = -\frac{W_i}{W_j} X_i + \frac{\Theta}{W_j} \qquad [\, y = mx + b \,]$$

[Figure: + and - examples scattered in the ($X_i$, $X_j$) plane]

Hence, can only classify examples if a “line” (hyperplane) can separate them

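The XOR Problem

Exclusive OR (XOR)

     Input   Output
a)   0 0     0
b)   0 1     1
c)   1 0     1
d)   1 1     0

Not linearly separable:

[Figure: points a, b, c, d plotted at the four input combinations; both axes run from 0 to 1]

A Neural Network Solution

[Figure: network over inputs X1 and X2 with weights 1, 1, -1, -1, 1, 1; let Θ = 0]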
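One network consistent with the weights shown (1, 1, -1, -1, 1, 1) and a threshold of 0 is sketched below. The exact wiring is an assumption, since the diagram did not survive extraction: two hidden threshold units compute "X1 and not X2" and "X2 and not X1", and the output unit fires if either hidden unit fires.

```python
def step(weighted_input, theta=0.0):
    """Threshold unit: output 1 if the weighted input exceeds theta, else 0."""
    return 1 if weighted_input > theta else 0

def xor_net(x1, x2):
    h1 = step(1 * x1 + -1 * x2)   # fires only for (1, 0)
    h2 = step(-1 * x1 + 1 * x2)   # fires only for (0, 1)
    return step(1 * h1 + 1 * h2)  # fires if either hidden unit fires

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))
```

Each hidden unit is itself a linear separator; XOR becomes representable only because the output unit works on the recoded inputs (h1, h2) rather than on (X1, X2) directly.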

The Need for Hidden Units

If there is one layer of enough hidden units (possibly 2^N for Boolean functions), the input can be recoded. (N = number of input units)

This recoding allows any mapping to be represented (Minsky & Papert)

Question: How to provide an error signal to the interior units?

Hidden Units

One View: Allow a system to create its own internal representation – for which problem solving is easy.

[Figure: a perceptron]

Reformulating XOR

[Figure: a unit with inputs X1, X2, and a third input X3 = X1 ^ X2; or, alternatively, a network with inputs X1 and X2 in which X3 is a hidden unit]

So, if a hidden unit can learn to represent X1 ^ X2, the solution is easy

Backpropagation

• Backpropagation involves a generalization of the delta rule
• Rumelhart, Parker, and Le Cun (and Bryson & Ho (1969), Werbos (1974)) independently developed (1985) a technique for determining how to adjust weights of interior (“hidden”) units
• Derivation involves partial derivatives (hence, threshold function must be differentiable)

error signal: $\frac{\partial E}{\partial W_{i,j}}$

Weight Space

• Given a network layout, the weights and biases are free parameters that define a Space.
• Each point in this Weight Space (w) specifies a network
• Associated with each point is an error rate, E, over the training data
• BackProp performs gradient descent in weight space

Gradient descent in weight space

[Figure: error E plotted over weights W1 and W2]

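The Gradient-Descent Rule

The “gradient”:

$$\nabla E(w) \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \dots, \frac{\partial E}{\partial w_N} \right]$$

− This is an N+1 dimensional vector (i.e., the ‘slope’ in weight space)
− Since we want to reduce errors, we want to go “down hill”
− We'll take a finite step in weight space:

$$\Delta w = -\eta \, \nabla E(w) \quad\text{or}\quad \Delta w_i = -\eta \, \frac{\partial E}{\partial w_i}$$

where “delta” = change to w and η is the step size.

[Figure: error surface E over weights W1 and W2]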

“On Line” vs. “Batch” BP (continued)

• BATCH – add Δw vectors for each training example, then ‘move’ in weight space.
• ON-LINE – “move” after each example (aka, stochastic gradient descent)

[Figure: two paths in weight space (axes E and w), each built from steps Δw1, Δw2, Δw3, one for BATCH and one for ON-LINE]

* Final locations in space need not be the same for BATCH and ON-LINE
* Note Δw_{i,BATCH} ≠ Δw_{i,ON-LINE}, for i > 1

Backprop Calculations

• Assume one layer of hidden units (std. topology)

[Figure: layers k → j → i]

1. $\text{Error} = \frac{1}{2} \sum_i (\text{Teacher}_i - \text{Output}_i)^2$
2. $= \frac{1}{2} \sum_i \big(\text{Teacher}_i - f\big[\sum_j W_{i,j} \times \text{Output}_j\big]\big)^2$
3. $= \frac{1}{2} \sum_i \big(\text{Teacher}_i - f\big[\sum_j W_{i,j} \times f\big(\sum_k W_{j,k} \times \text{Output}_k\big)\big]\big)^2$

• Determine
  $\frac{\partial \text{Error}}{\partial W_{i,j}}$ = (use equation 2)
  $\frac{\partial \text{Error}}{\partial W_{j,k}}$ = (use equation 3)
• Recall $\Delta w_{x,y} = -\eta \, (\partial E / \partial w_{x,y})$

* See table 4.2 for results

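Differentiating the Logistic Function

$$\text{out}_i = \frac{1}{1 + e^{-(\sum_j w_{j,i} \times \text{out}_j - \Theta_i)}}$$

where $\Theta_i$ is the bias (threshold) of unit i.

$$f'(\text{weighted input}) = \frac{\partial \, \text{out}_i}{\partial \big(\sum_j w_{j,i} \times \text{out}_j\big)} = \text{out}_i \, (1 - \text{out}_i)$$

[Figure: f(weighted input) rises from 0 to 1, passing through 1/2; f'(weighted input) peaks at 1/4]

Notice that even if totally wrong, no (or very little) change in weights: when a unit's output is near 0 or 1, f' is near 0.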
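The identity f' = out (1 - out) can be checked numerically. The sketch below compares it with a central finite difference; the helper names and the step size `eps` are assumptions for illustration.

```python
import math

def f(x):
    """Logistic function of the weighted input (bias already subtracted)."""
    return 1.0 / (1.0 + math.exp(-x))

def f_prime(x):
    """Derivative written in terms of the unit's own output: out * (1 - out)."""
    out = f(x)
    return out * (1.0 - out)

def numeric_derivative(x, eps=1e-6):
    """Central finite-difference estimate of the slope of f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, f_prime(x), numeric_derivative(x))
```

The slope peaks at 1/4 when the weighted input is 0 and is tiny at ±10, which is why a saturated unit barely changes its weights even when its output is badly wrong.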

The Need for Symmetry Breaking

Assume all weights are initially the same

Can the corresponding (mirror-image) weight ever differ? - NO

WHY? - by symmetry

Solution - randomize initial weights

Using BP to Train ANN's

1. Initialize weights & biases to small random values (e.g., in [-0.3, 0.3])
2. Randomize order of training examples; for each do:
   a) Propagate activity forward to output units

      $$\text{out}_i = f\Big(\sum_j w_{i,j} \times \text{out}_j\Big)$$

[Figure: layers k → j → i]

Using BP to Train ANN's (continued)

   b) Compute “error” for output units

      $$\delta_i = f'(\text{net}_i) \times (\text{Teacher}_i - \text{out}_i)$$

   c) Compute “error” for hidden units

      $$\delta_j = f'(\text{net}_j) \times \sum_i (w_{i,j} \times \delta_i)$$

   d) Update weights

      $$\Delta w_{i,j} = \eta \times \delta_i \times \text{out}_j$$

      $$\Delta w_{j,k} = \eta \times \delta_j \times \text{out}_k$$

where $f'(\text{net}_i) = \dfrac{\partial f(\text{net}_i)}{\partial \, \text{net}_i}$

Using BP to Train ANN's (continued)

3. Repeat until training-set error rate small enough (or until tuning-set error rate begins to rise – see later slide)
   − Should use “early stopping” (i.e., minimize error on the tuning set; more details later)
4. Measure accuracy on test set to estimate generalization (future accuracy)
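Steps (a) through (d) for a single training example can be sketched as below. This is an illustration under stated assumptions, not the course's implementation: one hidden layer, logistic units, biases handled by a constant input of 1, squared error, and made-up sizes, input, teacher value, and η. All function names are hypothetical.

```python
import math
import random

def f(x):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(w_jk, w_ij, x):
    """(a) Propagate input x (layer k) through hidden layer j to output layer i."""
    out_j = [f(sum(w * xk for w, xk in zip(row, x))) for row in w_jk]
    out_i = [f(sum(w * oj for w, oj in zip(row, out_j))) for row in w_ij]
    return out_j, out_i

def backprop_step(w_jk, w_ij, x, teacher, eta):
    """Return the weight changes (dw_jk, dw_ij) for one example."""
    out_j, out_i = forward(w_jk, w_ij, x)
    # (b) Output-unit error: f'(net_i) * (Teacher_i - out_i), with f' = out * (1 - out).
    d_i = [o * (1 - o) * (t - o) for o, t in zip(out_i, teacher)]
    # (c) Hidden-unit error: f'(net_j) * sum_i w_ij * delta_i.
    d_j = [oj * (1 - oj) * sum(w_ij[i][j] * d_i[i] for i in range(len(d_i)))
           for j, oj in enumerate(out_j)]
    # (d) Weight changes: eta * delta * (output of the unit feeding the weight).
    dw_ij = [[eta * di * oj for oj in out_j] for di in d_i]
    dw_jk = [[eta * dj * xk for xk in x] for dj in d_j]
    return dw_jk, dw_ij

def error(w_jk, w_ij, x, teacher):
    """Squared error: 1/2 * sum_i (Teacher_i - Output_i)^2."""
    _, out_i = forward(w_jk, w_ij, x)
    return 0.5 * sum((t - o) ** 2 for t, o in zip(teacher, out_i))

rng = random.Random(0)
x = [1.0, 0.0, 1.0]  # illustrative input; the last entry, fixed at 1, acts as a bias
teacher = [1.0]
w_jk = [[rng.uniform(-0.3, 0.3) for _ in x] for _ in range(2)]  # 2 hidden units
w_ij = [[rng.uniform(-0.3, 0.3) for _ in range(2)]]             # 1 output unit
dw_jk, dw_ij = backprop_step(w_jk, w_ij, x, teacher, eta=0.5)
print(dw_jk, dw_ij)
```

A full training loop would add these changes to the weights after each example (on-line) or after summing over all examples (batch), and would watch tuning-set error to decide when to stop.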
