charlotte-booth
Announcements
• HW1 assigned, due Monday
• HW2 may be assigned prior to next class; check email
• Reading assignment
  • Chapter 3 of the textbook (decision trees)
• Midterm exam
  • In class on Oct 25
Last Time
• How accurate is algorithm A? What are its 95% confidence intervals?
  • Apply CLT to get Gaussian PDF for acc.
  • Find Z statistic to get CI
• Do performances of Algos. A & B differ?
  • Use 10-fold CV to get deltas
  • Estimate the mean and std dev of delta
  • Compute t stat and get CI for delta
  • Does it contain 0?
• How can we eval. algorithms if FN and FP costs differ?
  • ROC curves
Today's Topics
• Finish ROC + Precision-Recall curves
• Our next SL algorithm, Logistic Regression
• Discriminative vs Generative
• Perceptrons, Neural networks
Plot ROC Curve Example

ML Algo Output (Sorted)    Correct Category
Ex 9     .99               +
Ex 7     .98               +
Ex 1     .72               -
Ex 2     .7                +
Ex 6     .65               +
Ex 10    .51               -
Ex 3     .39               -
Ex 5     .24               +
Ex 4     .11               -
Ex 8     .01               -

[Figure: ROC curve. x-axis: FP rate = Prob (alg outputs + | - is correct), from 0 to 1.0; y-axis: TP rate = Prob (alg outputs + | + is correct), from 0 to 1.0]

Points on the curve as the threshold is lowered:
TPR=(2/5) FPR=(0/5)
TPR=(2/5) FPR=(1/5)
TPR=(4/5) FPR=(1/5)
TPR=(4/5) FPR=(3/5)
TPR=(5/5) FPR=(3/5)
TPR=(5/5) FPR=(5/5)
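The sweep that produces these points can be sketched in a few lines of Python. This is only an illustration: the function name `roc_points` is hypothetical, and the sketch assumes there are no tied scores (ties would need to be grouped before a point is emitted).

```python
def roc_points(scored):
    """Return (FPR, TPR) pairs, lowering the threshold one example at a time.

    `scored` is a list of (score, label) pairs with label "+" or "-".
    Assumes no tied scores.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, label in ranked if label == "+")
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]          # threshold above every score
    for _, label in ranked:
        if label == "+":
            tp += 1                # one more true positive
        else:
            fp += 1                # one more false positive
        points.append((fp / n_neg, tp / n_pos))
    return points

# Scores and correct categories from the slide's table.
examples = [(.99, "+"), (.98, "+"), (.72, "-"), (.7, "+"), (.65, "+"),
            (.51, "-"), (.39, "-"), (.24, "+"), (.11, "-"), (.01, "-")]
points = roc_points(examples)
```

The corner points listed on the slide are a subset of `points`; the remaining entries lie on the straight segments between them.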
Area Under ROC Curve
• A common metric for experiments is to numerically integrate the ROC Curve
[Figure: ROC curve with the area beneath it shaded. x-axis: FP Rate, 0 to 1.0; y-axis: TP Rate, 0 to 1.0]
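One way to do the numerical integration is the trapezoid rule, sketched here. The helper name `auc_trapezoid` is an assumption for illustration; the curve is the set of corner points from the ROC example, plus the origin.

```python
def auc_trapezoid(points):
    """Integrate a ROC curve given as (FPR, TPR) points using the trapezoid rule."""
    pts = sorted(points)           # order by FPR, then TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Corner points of the example ROC curve, starting from the origin.
curve = [(0.0, 0.0), (0.0, 0.4), (0.2, 0.4), (0.2, 0.8),
         (0.6, 0.8), (0.6, 1.0), (1.0, 1.0)]
auc = auc_trapezoid(curve)         # approximately 0.8 for this curve
```

For this example the area agrees with the fraction of (positive, negative) pairs that the algorithm ranks correctly: 20 of 25.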
Precision vs Recall
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• Notice that TN is not used in either formula
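A minimal sketch of the two formulas follows. The function name is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention, not something the slides specify.

```python
def precision_recall(tp, fp, fn):
    """Compute (precision, recall) from confusion-matrix counts.

    Assumed convention: a ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# ROC example with the threshold set just below the fifth-ranked example:
# TP = 4, FP = 1, FN = 1 (TN = 4 is never needed).
pr = precision_recall(tp=4, fp=1, fn=1)
```

Because TN does not appear in the signature, adding more correctly rejected negatives leaves both numbers unchanged.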
ROC vs Recall-Precision
• You can get very different visual results on the same data.
[Figure: an ROC curve (x-axis: P ( + | - ), y-axis: P ( + | + )) vs. a recall-precision curve (x-axis: Recall, y-axis: Precision) for the same data]
The reason is that there may be lots of - examples.
Recall-Precision Curve
• You cannot simply connect the dots in Recall-Precision curves.
• See Goadrich, Oliphant, & Shavlik, ILP '04
[Figure: recall-precision curve. x-axis: Recall; y-axis: Precision]
Exp Methodology Wrapup
• Never train on test sets (use tune sets).
• Use the central-limit theorem to place confidence intervals on measurements.
• Paired t-tests provide a sensitive way to judge whether two algorithms perform differently.
  • The t-test is a useful heuristic for guiding research
  • Use a two-tailed test
• ROC curves are better than accuracy
Next Topic: Logistic Regression
The Logistic Function (also called sigmoid)
Logistic function:
$$ y = \frac{1}{1 + \exp(-x)} $$
[Figure: plot of y against x]
• Sigmoid dates back to the 19th century
• Originally used to model growth of populations
The Logistic Function (also called sigmoid)
$$ \Pr(C = 1 \mid F) = \frac{1}{1 + \exp(-g(F))} $$
• Logistic Regression assumes the conditional Pr(C=1|F) is a sigmoid
• g(F) is a linear function of the features
• F is a real-valued feature vector
[Figure: plot of y against x]
Logistic Regression
$$ \Pr(C = 1 \mid F) = \frac{1}{1 + \exp(-g(F))} $$
$$ \Pr(C = 0 \mid F) = \frac{\exp(-g(F))}{1 + \exp(-g(F))} $$
This gives us Pr(C=0|F) since Pr(C=1|F) + Pr(C=0|F) = 1.0
$$ g(F) = w_0 + w_1 f_1 + \dots + w_N f_N $$
So the odds are
$$ \frac{\Pr(C = 1 \mid F)}{\Pr(C = 0 \mid F)} = \frac{1}{\exp(-g(F))} = \exp(g(F)) $$
And
$$ \ln\left(\frac{\Pr(C = 1 \mid F)}{\Pr(C = 0 \mid F)}\right) = w_0 + w_1 f_1 + \dots + w_N f_N $$
LR Decision Rule
$$ \ln\left(\frac{\Pr(C = 1 \mid F)}{\Pr(C = 0 \mid F)}\right) = w_0 + w_1 f_1 + \dots + w_N f_N $$
Predict class is + if
$$ T < w_0 + w_1 f_1 + \dots + w_N f_N $$
where T is the threshold (0 if equal FP and FN costs)
The Decision Boundary of Logistic Regression is a hyperplane (line in 2D)
If
$$ T < w_0 + w_1 f_1 + \dots + w_N f_N $$
predict +, otherwise predict -
In 2D the boundary is the line
$$ w_0 + w_1 f_1 + w_2 f_2 - T = 0 $$
[Figure: + and - examples in the (f_1, f_2) plane, separated by this line]
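The decision rule can be sketched directly from these formulas. The function names and the weight values below are hypothetical, chosen only to make the example concrete.

```python
import math

def log_odds(weights, features):
    """g(F) = w0 + w1*f1 + ... + wN*fN, where weights[0] is w0."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))

def prob_positive(weights, features):
    """Pr(C=1 | F) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-log_odds(weights, features)))

def predict(weights, features, threshold=0.0):
    """Predict + when T < g(F); T = 0 corresponds to equal FP and FN costs."""
    return "+" if threshold < log_odds(weights, features) else "-"

weights = [-1.0, 2.0, 0.5]   # hypothetical w0, w1, w2
```

With T = 0 the rule is the same as predicting + whenever Pr(C=1|F) exceeds 0.5; raising T makes the classifier more reluctant to output +.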
Encoding Nominal Features
• Logistic Regression requires that examples be represented as a vector of real values (also perceptrons, Neural Nets, SVMs, …)
• How can we transform feature vectors (FVs) with nominal features to real values?
Two Possibilities
• For a nominal feature with M possible values:
1) Assign each value an integer between 1 and M
  • Color = {Red, Green, Blue} → Color = {1, 2, 3}
  • Not a good idea (why?)
2) Create M binary features where for each example (without missing features) exactly one derived feature has a value of 1 and M-1 features have a value of 0
  • Color = {Red, Green, Blue} →
  • isRed={0,1}, isGreen={0,1}, isBlue={0,1}
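Option 2 can be sketched as below; `one_hot` and `COLORS` are illustrative names, not from the slides. One likely answer to the "why?" on option 1 is that integer codes impose an order and a distance (Blue = 3 × Red) that the nominal values do not have, and a linear model would treat that as meaningful.

```python
def one_hot(value, possible_values):
    """Encode a nominal value as M binary features, exactly one of which is 1."""
    if value not in possible_values:
        raise ValueError(f"unknown value: {value!r}")
    return [1 if value == candidate else 0 for candidate in possible_values]

COLORS = ["Red", "Green", "Blue"]   # isRed, isGreen, isBlue
green = one_hot("Green", COLORS)    # [0, 1, 0]
```

Missing values are not handled in this sketch; the function simply raises an error for anything outside the known value set.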
Representation of Pr(C|F) same for LR and Naïve Bayes with nominal features

Logistic Regression:
$$ \ln\left(\frac{\Pr(C = 1 \mid F)}{\Pr(C = 0 \mid F)}\right) = w_0 + w_1 f_1 + \dots + w_N f_N $$

Naïve Bayes (sum over features):
$$ \ln\left(\frac{\Pr(C = 1 \mid F)}{\Pr(C = 0 \mid F)}\right) = \ln\frac{\Pr(C = 1)}{\Pr(C = 0)} + \sum_i \ln\frac{\Pr(f_i \mid C = 1)}{\Pr(f_i \mid C = 0)} $$
$$ = \ln\frac{\Pr(C = 1)}{\Pr(C = 0)} + \sum_i \sum_v \ln\frac{\Pr(f_i = v \mid C = 1)}{\Pr(f_i = v \mid C = 0)} \, \tilde{f}_{i,v} $$
Here the outer sum is over features, the inner sum is over values, and $\tilde{f}_{i,v}$ is the derived binary feature.
What about real-valued features?

Naïve Bayes:
$$ \ln\left(\frac{\Pr(C = 1 \mid F)}{\Pr(C = 0 \mid F)}\right) = \ln\frac{\Pr(C = 1)}{\Pr(C = 0)} + \sum_i \ln\frac{\Pr(f_i \mid C = 1)}{\Pr(f_i \mid C = 0)} $$
If (same variance for both classes)
$$ \Pr(f_i \mid C = 1) \sim N(\mu_{i1}, \sigma_i) \quad\text{and}\quad \Pr(f_i \mid C = 0) \sim N(\mu_{i0}, \sigma_i) $$
then
$$ w_i = \frac{\mu_{i1} - \mu_{i0}}{\sigma_i^2} $$
The log of the ratio of two Gaussians with equal variance is a line (see text for details)
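The step behind this claim can be written out as follows. This is a sketch of the standard derivation, assuming $\sigma_i$ denotes the shared standard deviation; the normalizing constants of the two Gaussians cancel because the variances are equal.

```latex
\ln\frac{N(f_i;\,\mu_{i1},\sigma_i)}{N(f_i;\,\mu_{i0},\sigma_i)}
  = \frac{(f_i-\mu_{i0})^2 - (f_i-\mu_{i1})^2}{2\sigma_i^2}
  = \underbrace{\frac{\mu_{i1}-\mu_{i0}}{\sigma_i^2}}_{w_i}\, f_i
    \;+\; \frac{\mu_{i0}^2-\mu_{i1}^2}{2\sigma_i^2}
```

The quadratic terms in $f_i$ cancel, leaving a linear function of $f_i$. Its slope is the weight $w_i$ given above, and the constant term is absorbed into $w_0$.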
Learning Task
• Given:
  • Labeled examples {(C,F)}
• Do:
  • Find a good setting of the weights W
Learning Parameters for Logistic Regression
• Typically, we want the weights W that maximize the conditional log-likelihood
$$ \hat{W} \leftarrow \arg\max_W L(W) $$
$$ L(W) = \sum_k \ln \Pr(C^k \mid F^k) $$
(sum over training examples)
Since each setting of the weights has an associated likelihood, we can view the likelihood as a function of the weights
"Weight Space"
[Figure: L(W) plotted against W, with the goal marked at the maximum]
For LR, L(W) is a concave function, so any local maximum is the global maximum and gradient ascent cannot get trapped short of it
• Given feature representations, the weights W are free parameters that define a space
• Each point in "weight space" corresponds to an LR model
• Associated with each point is a conditional log likelihood
• One way to do LR learning is to perform "gradient ascent" in the weight space
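The Gradient-Ascent Rule
$$ \nabla L(W) \equiv \left[ \frac{\partial L}{\partial w_0}, \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \dots, \frac{\partial L}{\partial w_N} \right] $$
(the "gradient")
− The direction of the gradient at W is the direction of fastest increase
− The magnitude of the gradient at W is the rate of fastest increase
− Since we want to increase L(W), we want to go "up hill"
− We'll take a finite step in weight space:
$$ \Delta W = \eta \, \nabla L(W) \quad\text{or}\quad \Delta w_i = \eta \, \frac{\partial L}{\partial w_i} $$
("delta" = change to W)
[Figure: surface L(W) over weights W1 and W2, with a step ΔW pointing up hill]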
"On Line" vs. "Batch" Updates
• We can either update W after each example is examined (on-line / stochastic updates) or after the entire training set is examined (batch updates)
  • On-line is typically much faster
  • But it is dependent on the order of the examples
  • Will it converge to the same spot?
• For non-concave "objective functions", on-line and batch processing will typically end up in different places
Computing the LR Gradient
Page 11
Logistic Regression Update Rule
$$ w_i \leftarrow w_i + \eta \sum_k f_i^k \left( y^k - \Pr(y^k = 1 \mid F^k, W) \right) $$
• The sum is over training examples
• $y^k - \Pr(y^k = 1 \mid F^k, W)$ is the prediction error for example k
• Note that this is the batch update rule
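The batch rule can be sketched as follows. The toy data, the step size `eta=0.1`, and the iteration count are all assumptions made for illustration. The data are deliberately not linearly separable, so the maximum-likelihood weights are finite.

```python
import math

def prob_one(weights, features):
    """Pr(y=1 | F, W) for logistic regression; weights[0] is w0."""
    z = weights[0] + sum(w * f for w, f in zip(weights[1:], features))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(weights, data):
    """Conditional log-likelihood L(W), summed over training examples."""
    total = 0.0
    for features, y in data:
        p = prob_one(weights, features)
        total += math.log(p if y == 1 else 1.0 - p)
    return total

def batch_update(weights, data, eta):
    """One batch step: w_i <- w_i + eta * sum_k f_i^k (y^k - Pr(y^k=1 | F^k, W))."""
    gradient = [0.0] * len(weights)
    for features, y in data:
        error = y - prob_one(weights, features)   # prediction error for example k
        augmented = [1.0] + list(features)        # f_0 = 1 pairs with w_0
        for i, f in enumerate(augmented):
            gradient[i] += f * error
    return [w + eta * g for w, g in zip(weights, gradient)]

# Hypothetical one-feature training set (feature vector, label).
data = [([0.0], 0), ([1.0], 0), ([2.0], 1), ([1.5], 0),
        ([3.0], 1), ([2.5], 1), ([1.0], 1)]
weights = [0.0, 0.0]
history = [log_likelihood(weights, data)]
for _ in range(200):
    weights = batch_update(weights, data, eta=0.1)
    history.append(log_likelihood(weights, data))
```

An on-line variant would apply the same correction after each example instead of accumulating the sum first. Too large a step size can make either variant overshoot.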
Two Decisions Needed for Learning Procedure
1) What are we trying to optimize?
2) How are we going to carry out the optimization?
Discriminant Models
• Can classify instances into categories
  • Captures differences between categories
  • May not describe all features
  • Example: Decision trees (covered later)
• Efficient and simple
Generative Models
• Can create complete input feature vectors
  • Describes distributions of all features
  • Stochastically creates a plausible vector
  • Example: Bayes net (from above)
Using Generative Models
• Make a model to generate positives
• Make a model to generate negatives
• Classify a test example based on which is more likely to generate it
  • The Naïve Bayes ratio does this
Some Properties (Ng & Jordan, NIPS '02)
• If the NB assumption holds, asymptotic accuracy is the same
  • Otherwise LR acc > NB acc as the number of training examples increases
• NB converges to asymptotic performance with fewer examples; LR takes more
• NB is faster to train
Neural Nets
Perceptrons
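[Figure: a perceptron. Input units f_1, f_2, …, f_N, plus a constant input f_0 = 1, feed a single summing (∑) output unit through weights w_1, w_2, …, w_N and w_0]
$$ T < w_0 + w_1 f_1 + \dots + w_N f_N $$
The decision rule for perceptrons has the same form as the decision rule for logistic regression and naïve Bayes
So perceptrons are linear separators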
Training Perceptrons
• Perceptron training rule:
$$ w_i \leftarrow w_i + \eta \, (\text{target} - \text{output}) \times f_i $$
  • Almost identical to the LR on-line training rule
• If training data is not linearly separable, training may not converge
• Delta Rule
  • Gradient descent rule derived from an objective function based on minimizing the squared error of a linear output unit
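A minimal on-line sketch of perceptron training, assuming a thresholded 0/1 output, a threshold of 0, and zero initial weights (none of which the slide fixes). The AND data set is a hypothetical example chosen because it is linearly separable.

```python
def perceptron_output(weights, features, threshold=0.0):
    """Output 1 if w0 + w1*f1 + ... + wN*fN exceeds the threshold, else 0."""
    total = weights[0] + sum(w * f for w, f in zip(weights[1:], features))
    return 1 if total > threshold else 0

def train_perceptron(data, eta=1.0, max_epochs=100):
    """Apply w_i <- w_i + eta * (target - output) * f_i after each example."""
    weights = [0.0] * (len(data[0][0]) + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for features, target in data:
            error = target - perceptron_output(weights, features)
            if error != 0:
                mistakes += 1
                augmented = [1.0] + list(features)   # f_0 = 1 pairs with w_0
                weights = [w + eta * error * f
                           for w, f in zip(weights, augmented)]
        if mistakes == 0:        # every example classified correctly
            break
    return weights

AND_DATA = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
and_weights = train_perceptron(AND_DATA)
```

On data that is not linearly separable the loop would simply run until `max_epochs`, which reflects the slide's warning that training may not converge.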
Should you?
[Cartoon caption: "Fenwick here is biding his time waiting for neural networks."]
Concept Learning
Learning systems differ in how they represent concepts:
[Figure: Training Examples feed several learners (Backpropagation; C4.5, CART; AQ, FOIL; …), each producing its own representation, e.g. a rule such as X ^ Y → Z]
Advantages of Neural Networks
• Provide best predictive accuracy for some problems
  − Being supplanted by SVM's?
• Can represent a rich class of concepts
[Figure: example outputs, with regions labeled Positive / negative / Positive and forecasts "Saturday: 40% chance of rain", "Sunday: 25% chance of rain"]
Artificial Neural Networks (ANNs)
[Figure: a network with input units, hidden units, and output units; labels mark a weight, the error, and a recurrent link]
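ANNs (continued)
Individual units
[Figure: a single unit with its inputs, bias, and outputs]
$$ \text{output}_i = F\left( \sum_j \text{weight}_{i,j} \times \text{output}_j \right) $$
where
$$ F(\text{input}_i) = \frac{1}{1 + e^{-(\text{input}_i - \text{bias}_i)}} $$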
Perceptron Convergence Theorem (Rosenblatt, 1957)
Perceptron = no Hidden Units
If a set of examples is learnable, the DELTA rule will eventually find the necessary weights
However, a perceptron can only learn/represent linearly separable datasets.
Linear Separability
Consider a perceptron
Its output is 1 if $W_1 X_1 + W_2 X_2 + \dots + W_n X_n > \Theta$, and 0 otherwise (Θ is the threshold)
In terms of feature space:
$$ W_i X_i + W_j X_j = \Theta $$
$$ X_j = \frac{\Theta - W_i X_i}{W_j} = -\frac{W_i}{W_j} X_i + \frac{\Theta}{W_j} \qquad [\, y = mx + b \,] $$
[Figure: + and - examples scattered in the plane, with a line dividing them]
Hence, can only classify examples if a "line" (hyperplane) can separate them
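The XOR Problem
Exclusive OR (XOR)

      Input    Output
a)    0 0      0
b)    0 1      1
c)    1 0      1
d)    1 1      0

Not linearly separable:
[Figure: points a, b, c, d plotted on the unit square (axes from 0 to 1); b and c have output 1, a and d have output 0]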
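A Neural Network Solution
[Figure: a network over inputs X1 and X2 with hidden units, whose connection weights are 1, 1, -1, -1, 1, 1]
Let Θ = 0! (Θ is each unit's threshold)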
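The figure's exact wiring is not recoverable from the text, so the following is only one plausible reading of the listed weights: two hidden threshold units that compute X1 − X2 and X2 − X1, and an output unit that sums them with weights 1 and 1, all with threshold 0. The names `step` and `xor_network` are illustrative.

```python
def step(total, theta=0.0):
    """Threshold unit: output 1 if the weighted input exceeds theta, else 0."""
    return 1 if total > theta else 0

def xor_network(x1, x2):
    """Assumed wiring: hidden weights (1, -1) and (-1, 1), output weights (1, 1)."""
    h1 = step(1 * x1 + -1 * x2)      # fires only for X1=1, X2=0
    h2 = step(-1 * x1 + 1 * x2)      # fires only for X1=0, X2=1
    return step(1 * h1 + 1 * h2)     # fires if either hidden unit fires

truth_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
```

Under this assumed wiring the table matches XOR: each hidden unit is a linear separator, and the output unit combines two regions that no single line could capture.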
The Need for Hidden Units
If there is one layer of enough hidden units (possibly 2^N for Boolean functions), the input can be recoded. (N = number of input units)
This recoding allows any mapping to be represented (Minsky & Papert)
Question: How to provide an error signal to the interior units?
Hidden Units
One View: Allow a system to create its own internal representation, for which problem solving is easy.
[Figure label: A perceptron]
Reformulating XOR
X3 = X1 ^ X2
[Figure: a unit with inputs X1, X2, and X3; or, alternatively, a network with inputs X1 and X2 only]
So, if a hidden unit can learn to represent X1 ^ X2, the solution is easy
Backpropagation
• Backpropagation involves a generalization of the delta rule
• Rumelhart, Parker, and Le Cun (and Bryson & Ho (1969), Werbos (1974)) independently developed (1985) a technique for determining how to adjust weights of interior ("hidden") units
• Derivation involves partial derivatives (hence, the threshold function must be differentiable)
[Figure: the error signal $\partial E / \partial W_{i,j}$]
Weight Space
• Given a network layout, the weights and biases are free parameters that define a Space.
• Each point in this Weight Space (w) specifies a network
• Associated with each point is an error rate, E, over the training data
• BackProp performs gradient descent in weight space
Gradient descent in weight space
[Figure: error surface E over weights W1 and W2, with a step Δw moving down the surface]
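The Gradient-Descent Rule
$$ \nabla E(w) \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \dots, \frac{\partial E}{\partial w_N} \right] $$
(the "gradient")
− This is an N+1 dimensional vector (i.e., the 'slope' in weight space)
− Since we want to reduce errors, we want to go "down hill"
− We'll take a finite step in weight space:
$$ \Delta w = -\eta \, \nabla E(w) \quad\text{or}\quad \Delta w_i = -\eta \, \frac{\partial E}{\partial w_i} $$
("delta" = change to w)
[Figure: error surface E over weights W1 and W2, with a step Δw pointing down hill]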
"On Line" vs. "Batch" BP (continued)
• BATCH: add Δw vectors for each training example, then 'move' in weight space.
• ON-LINE: "move" after each example (aka, stochastic gradient descent)
[Figure: two plots of E against w. In the batch plot the steps Δw_1, Δw_2, Δw_3 are computed at the same point and summed; in the on-line plot each Δw_i is taken from the point reached by the previous step]
* Final locations in space need not be the same for BATCH and ON-LINE
* Note $\Delta w_{i,\text{BATCH}} \neq \Delta w_{i,\text{ON-LINE}}$, for i > 1
Backprop Calculations
• Assume one layer of hidden units (std. topology), with layers indexed k → j → i
1. $\text{Error} = \frac{1}{2} \sum_i (\text{Teacher}_i - \text{Output}_i)^2$
2. $= \frac{1}{2} \sum_i \left( \text{Teacher}_i - f\left[ \sum_j W_{i,j} \times \text{Output}_j \right] \right)^2$
3. $= \frac{1}{2} \sum_i \left( \text{Teacher}_i - f\left[ \sum_j W_{i,j} \times f\left( \sum_k W_{j,k} \times \text{Output}_k \right) \right] \right)^2$
• Determine
  $\partial \text{Error} / \partial W_{i,j}$ = (use equation 2)
  $\partial \text{Error} / \partial W_{j,k}$ = (use equation 3)
• Recall $\Delta w_{x,y} = -\eta \, (\partial E / \partial w_{x,y})$
* See table 4.2 for results
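Differentiating the Logistic Function
$$ \text{out}_i = \frac{1}{1 + e^{-\left( \sum_j w_{i,j} \times \text{out}_j - \theta_i \right)}} $$
$$ f'(\text{weighted input}) = \frac{\partial\, \text{out}_i}{\partial \left( \sum_j w_{i,j} \times \text{out}_j \right)} = \text{out}_i \, (1 - \text{out}_i) $$
Here $\theta_i$ is the threshold (bias) of unit i.
[Figure: f(weighted input) rises from 0 to 1 and passes through 1/2; f'(weighted input) peaks at 1/4 and falls toward 0 at both extremes]
Notice that even if totally wrong, no (or very little) change in weights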
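The identity f' = out (1 − out) can be checked numerically with a short sketch. The helper names are illustrative, and the finite-difference step `h` is an arbitrary small value.

```python
import math

def logistic(x):
    """f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_derivative(x):
    """f'(x) written in terms of the unit's own output: out * (1 - out)."""
    out = logistic(x)
    return out * (1.0 - out)

def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x), used only as a check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

peak = logistic_derivative(0.0)          # 0.25, where out = 1/2
saturated = logistic_derivative(10.0)    # close to 0: unit is saturated
```

The saturated case is the slide's warning: when a unit's output is pinned near 0 or 1, the derivative factor is tiny, so the weight change is tiny even if the output is completely wrong.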
The Need for Symmetry Breaking
Assume all weights are initially the same
Can the corresponding (mirror-image) weight ever differ? - NO
WHY? - by symmetry
Solution - randomize initial weights
Using BP to Train ANN's
1. Initiate weights & bias to small random values (e.g., in [-0.3, 0.3])
2. Randomize order of training examples; for each do:
   a) Propagate activity forward to output units (layers k → j → i)
      $$ \text{out}_i = f\left( \sum_j w_{i,j} \times \text{out}_j \right) $$
Using BP to Train ANN's (continued)
   b) Compute "error" for output units
      $$ \delta_i = f'(\text{net}_i) \times (\text{Teacher}_i - \text{out}_i) $$
   c) Compute "error" for hidden units
      $$ \delta_j = f'(\text{net}_j) \times \sum_i (w_{i,j} \times \delta_i) $$
   d) Update weights
      $$ \Delta w_{i,j} = \eta \times \delta_i \times \text{out}_j $$
      $$ \Delta w_{j,k} = \eta \times \delta_j \times \text{out}_k $$
   where
      $$ f'(\text{net}_i) = \frac{\partial f(\text{net}_i)}{\partial\, \text{net}_i} $$
Using BP to Train ANN's (continued)
3. Repeat until training-set error rate is small enough (or until tuning-set error rate begins to rise; see later slide)
   − Should use "early stopping" (i.e., minimize error on the tuning set; more details later)
4. Measure accuracy on test set to estimate generalization (future accuracy)
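Steps 1 and 2 can be sketched in plain Python for a network with one hidden layer. Several things are assumptions made only for this sketch: the function names, storing each bias as a weight on a constant input of 1.0, the two hidden units, η = 0.5, the fixed 2000 passes, and XOR as the training set. Early stopping and the tuning and test sets from steps 3 and 4 are omitted. Whether a given run actually learns XOR depends on the random initial weights.

```python
import math
import random

def f(net):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-net))

def forward(w_hidden, w_out, inputs):
    """a) Propagate activity forward: layer k (inputs) -> j (hidden) -> i (outputs)."""
    out_k = [1.0] + list(inputs)     # slot 0 is a constant input carrying the bias
    out_j = [1.0] + [f(sum(w * o for w, o in zip(row, out_k))) for row in w_hidden]
    out_i = [f(sum(w * o for w, o in zip(row, out_j))) for row in w_out]
    return out_k, out_j, out_i

def error(w_hidden, w_out, inputs, teacher):
    """Error = 1/2 * sum_i (Teacher_i - Output_i)^2 for one example."""
    out_i = forward(w_hidden, w_out, inputs)[2]
    return 0.5 * sum((t - o) ** 2 for t, o in zip(teacher, out_i))

def backprop_step(w_hidden, w_out, inputs, teacher, eta):
    """Steps b), c), d) for one example; returns the updated weight matrices."""
    out_k, out_j, out_i = forward(w_hidden, w_out, inputs)
    # b) delta_i = f'(net_i) * (Teacher_i - out_i), using f' = out * (1 - out)
    delta_i = [o * (1 - o) * (t - o) for o, t in zip(out_i, teacher)]
    # c) delta_j = f'(net_j) * sum_i (w_ij * delta_i); index 0 of out_j is the bias slot
    delta_j = [out_j[j] * (1 - out_j[j])
               * sum(w_out[i][j] * delta_i[i] for i in range(len(w_out)))
               for j in range(1, len(out_j))]
    # d) delta_w_ij = eta * delta_i * out_j and delta_w_jk = eta * delta_j * out_k
    new_w_out = [[w + eta * delta_i[i] * out_j[j] for j, w in enumerate(row)]
                 for i, row in enumerate(w_out)]
    new_w_hidden = [[w + eta * delta_j[j] * out_k[k] for k, w in enumerate(row)]
                    for j, row in enumerate(w_hidden)]
    return new_w_hidden, new_w_out

# 1. Small random initial weights in [-0.3, 0.3] (this also breaks symmetry).
random.seed(0)
w_hidden = [[random.uniform(-0.3, 0.3) for _ in range(3)] for _ in range(2)]
w_out = [[random.uniform(-0.3, 0.3) for _ in range(3)]]

# 2. Randomize the example order on each pass; update after every example (on-line).
XOR = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
for _ in range(2000):
    for inputs, teacher in random.sample(XOR, len(XOR)):
        w_hidden, w_out = backprop_step(w_hidden, w_out, inputs, teacher, eta=0.5)
```

A batch variant would accumulate the Δw values over all examples before applying them.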