Classification and K-Nearest Neighbors
Administrivia
o Reminder: Homework 1 is due by 5pm Friday on Moodle
o Reading Quiz associated with today’s lecture. Due before class Wednesday.
2
Regression vs Classification
3
So far, we've taken a vector of features $(x_1, x_2, \ldots, x_p)$ and predicted a numerical response:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$
Example: $\$360{,}000 = \beta_0 + \beta_1 \times (2000~\text{sq-ft}) + \beta_2 \times (3~\text{bathrooms})$
Regression deals with quantitative predictions
But what if we're interested in making qualitative predictions?
Classification Examples
4
Text Classification: Spam or Not Spam
Classification Examples
5
Text Classification: Fake News or Real News
Classification Examples
6
Text Classification: Sentiment Analysis (Tweets, Amazon Reviews)
Classification Examples
7
Image Classification: Handwritten Digits
Classification Examples
8
Image Classification: Stop Sign or No Stop Sign (Autonomous Vehicles)
Classification Examples
9
Image Classification: Hot Dog or Not Hot Dog (No real application whatsoever)
Classification Examples
10
Image Classification: Labradoodle or Fried Chicken (Important for not eating your dog)
Classification with Probability
11
In classification problems our outputs are discrete classes
Goal: Given features $x = (x_1, x_2, \ldots, x_p)$, predict which class $Y = k$ the example $x$ belongs to
Classification with Probability
12
In classification problems our outputs are discrete classes
Goal: Given features $x = (x_1, x_2, \ldots, x_p)$, predict which class $Y = k$ the example $x$ belongs to
Conditional probability gives a nice framework for many classification models
Suppose our data can belong to one of $K$ classes, $k = 1, 2, \ldots, K$
We model the conditional probability $p(Y = k \mid x)$ for each $k = 1, 2, \ldots, K$
and we make the prediction $\hat{y} = \arg\max_k \, p(Y = k \mid x)$
Classification with Probability
13
In classification problems our outputs are discrete classes
Goal: Given features $x = (x_1, x_2, \ldots, x_p)$, predict which class $Y = k$ the example $x$ belongs to
Example: In Spam Classification, the features are the words in the email
The classes are SPAM and HAM
We estimate $p(\text{SPAM} \mid \text{email})$ and $p(\text{HAM} \mid \text{email})$
e.g. if $p(\text{SPAM} \mid \text{email}) = 0.75$ and $p(\text{HAM} \mid \text{email}) = 0.25$, then $\hat{y} = \text{SPAM}$
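A minimal sketch of this decision rule in Python; the dictionary of probabilities is a hypothetical stand-in for whatever model produced the estimates (the 0.75 / 0.25 values echo the annotated example above, they are not real outputs):

```python
# Decision rule: y_hat = argmax_k p(Y = k | x)
probs = {"SPAM": 0.75, "HAM": 0.25}   # hypothetical estimates of p(Y = k | email)
y_hat = max(probs, key=probs.get)     # pick the class with the largest probability
print(y_hat)                          # -> SPAM
```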
K-Nearest Neighbors
14
[Figure: All Labeled Data split into a Training Set and a Validation Set]
Our first classifier is a very simple one
Assume we have a labeled training set
Features can be anything: words, pixel intensities, numerical quantities
Instead of a continuous response, we now have discrete labels (encoded as 0, 1, 2, etc.)
K-Nearest Neighbors
15
General Idea: Suppose we have an example $x$ that we want to classify
o Look in the training set for the K examples that are "nearest" to $x$
o Predict the label for $x$ using a majority vote of the labels of the K nearest neighbors
Questions:
o What does "nearest" mean?
o What if there's a tie in the vote?
K-Nearest Neighbors
16
KNN:
o Find $N_K(x)$: the set of K training examples nearest to $x$
o Predict $\hat{y}$ to be the majority label in $N_K(x)$
Question:
o What does "nearest" mean?
Answer:
o Measure closeness using Euclidean Distance: $\|x_i - x\|_2$, where $x_i$ is a training vector and $x$ is the test example
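A minimal NumPy sketch of the classifier described on this slide; the function name knn_predict and the toy data are made up for illustration, not taken from the lecture:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, K):
    """Classify one query point x by majority vote over its K nearest
    training examples, with 'nearest' measured by Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)   # ||x_i - x||_2 for every training row
    nearest = np.argsort(dists)[:K]               # indices of the K closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority label

# Toy usage with hypothetical 2D data; labels: 0 = RED, 1 = BLUE
X_train = np.array([[0.0, 2.0], [1.0, 4.0], [2.0, 4.0], [5.0, 1.0]])
y_train = np.array([0, 1, 1, 0])
print(knn_predict(X_train, y_train, np.array([1.0, 3.0]), K=3))  # -> 1 (BLUE)
```

Note that each prediction scans the whole training set; this is the naive per-query approach whose cost is discussed later in the slides.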
K-Nearest Neighbors
17
Euclidean Distance: $\|x_i - x\|_2$
o 1-NN predicts: _________
o 2-NN predicts: _________
o 5-NN predicts: _________
Classes = {RED, BLUE}
K-Nearest Neighbors
18
Euclidean Distance: $\|x_i - x\|_2$
o 1-NN predicts: BLUE
o 2-NN predicts: _________
o 5-NN predicts: _________
K-Nearest Neighbors
19
Euclidean Distance: $\|x_i - x\|_2$
o 1-NN predicts: blue
o 2-NN predicts: unclear
o 5-NN predicts: _________
K-Nearest Neighbors
20
Euclidean Distance: $\|x_i - x\|_2$
o 1-NN predicts: blue
o 2-NN predicts: unclear
o 5-NN predicts: RED
K-Nearest Neighbors
21
Euclidean Distance: $\|x_i - x\|_2$
o 1-NN predicts: blue
o 2-NN predicts: unclear
o 5-NN predicts: red
What About Probability?
22
o 5-NN: $p(Y = k \mid x) = \frac{1}{K} \sum_{i \in N_K(x)} I(y_i = k)$
where $I$ is the INDICATOR function: $I(b) = 1$ if $b$ is true, $0$ if $b$ is false
In the pictured example: $p(Y = \text{BLUE} \mid x) = 2/5$ and $p(Y = \text{RED} \mid x) = 3/5$
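A small sketch of this probability estimate in the same style as before (the helper name knn_proba and the data are again hypothetical):

```python
import numpy as np

def knn_proba(X_train, y_train, x, K, classes):
    """Estimate p(Y = k | x) as the fraction of the K nearest neighbors with label k."""
    dists = np.linalg.norm(X_train - x, axis=1)
    labels = y_train[np.argsort(dists)[:K]]                     # labels of the K nearest neighbors
    return {k: float(np.mean(labels == k)) for k in classes}    # (1/K) * sum_i I(y_i = k)

# e.g. if 3 of the 5 neighbors are RED, this returns {"RED": 0.6, "BLUE": 0.4}
```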
What About Ties?
23
Lots of reasonable strategies
Example: If there is a tie:
o reduce K by 1 and try again
o repeat until the tie is broken (sketched below)
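A hedged sketch of that shrink-K strategy (the function name is illustrative only, not from the lecture):

```python
import numpy as np
from collections import Counter

def knn_predict_tiebreak(X_train, y_train, x, K):
    """Majority-vote KNN; on a tie, reduce K by 1 and re-vote until the tie breaks."""
    order = np.argsort(np.linalg.norm(X_train - x, axis=1))   # neighbors sorted by distance
    while K >= 1:
        counts = Counter(y_train[order[:K]]).most_common()
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            return counts[0][0]        # unique winner, no tie
        K -= 1                         # tie: drop the farthest neighbor and try again
```

With K = 1 there can be no tie, so the loop always terminates with a prediction.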
K-Nearest Neighbors
24
Predict $\hat{y}$ with K = 1
Closest Points: $N_1(x) = \{(0, 2)\}$
Prediction: $\hat{y}$ = RED
K-Nearest Neighbors
25
Predict $\hat{y}$ with K = 2
Closest Points: $N_2(x) = \{(0, 2), (1, 4)\}$
Prediction: tie between RED and BLUE; reduce to $N_1(x) = \{(0, 2)\}$, so $\hat{y}$ = RED
K-Nearest Neighbors
26
Predict $\hat{y}$ with K = 3
Closest Points: $N_3(x) = \{(0, 2), (1, 4), (2, 4)\}$
Prediction: $\hat{y}$ = BLUE
A Subtle Distinction
27
o KNN is an instance-based method
o Given an unseen query point, look directly at training data and then predict
o Compare to regression where we use training data to learn parameters
o When we learn parameters we call the model a parametric model
o Instance-based methods like KNN are called non-parametric models
Question: When would parametric vs non-parametric make a big difference?
Naïve Implementation
28
How much does KNN cost?
Suppose your training set has n examples, each with p features
Now you classify a single example $x$
Naïve Method: Loop over each training example, compute distances $\|x_i - x\|_2$, cache min indices
o The naïve method is $O(pn)$ in time
o The naïve method is $O(pn)$ in space
Tree-Based Implementation
29
Build a KD- or Ball-tree on the training data
Tree-Based Implementation
30
Build a KD- or Ball-tree on the training data
Higher up-front cost in time and storage to build data structure
Up-front cost is $O(pn \log n)$ in time and $O(pn)$ in space
Now you classify a single example $x$
Classification time for a single query is $O(p \log n)$
Good strategy if lots of data and need to do lots of queries
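A sketch of both strategies with scikit-learn's KNeighborsClassifier, which exposes the search algorithm as a parameter; the random data exists only to make the example runnable:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))           # n = 10,000 examples, p = 5 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # arbitrary binary labels

# Naive strategy: every query scans all n training points, roughly O(pn) per query
brute = KNeighborsClassifier(n_neighbors=5, algorithm="brute").fit(X, y)

# Tree strategy: pay the build cost up front, then queries are roughly O(p log n)
tree = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree").fit(X, y)

query = rng.normal(size=(1, 5))
print(brute.predict(query), tree.predict(query))   # same prediction, different query cost
```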
Performance Metrics and Choosing K
31
Question: How do we evaluate our KNN classifier?
Answer: Perform prediction on labeled data, count how many we got wrong
Question: How do we choose K?
Error Rate $= \frac{1}{n} \sum_{i=1}^{n} I(\hat{y}_i \neq y_i) = \frac{\#\text{ wrong predictions}}{\#\text{ predictions}}$
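The same error rate as a couple of lines of NumPy (the label arrays are hypothetical placeholders):

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])       # true labels
y_pred = np.array([0, 1, 0, 0, 1])       # hypothetical KNN predictions
error_rate = np.mean(y_pred != y_true)   # fraction of wrong predictions
print(error_rate)                        # -> 0.2 (1 wrong out of 5)
```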
Performance Metrics and Choosing K
32
Like all models, KNN can overfit and underfit, depending on choice of K
[Figure: KNN decision boundaries for K = 1 and K = 15]
Performance Metrics and Choosing K
33
Choose optimal K by varying K and computing error rate on train and validation set
o Note: Plot vs 1/K
o Lower K leads to more flexibility
o Choose K to minimize validation error
[Figure: Error Rate vs 1/K for the Training and Validation sets]
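A sketch of this selection procedure with scikit-learn; the split fraction and the candidate K values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labeled data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] > 1).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

best_K, best_err = None, np.inf
for K in [1, 3, 5, 9, 15, 25, 51]:                    # larger 1/K = more flexible model
    knn = KNeighborsClassifier(n_neighbors=K).fit(X_train, y_train)
    err = np.mean(knn.predict(X_val) != y_val)        # validation error rate
    if err < best_err:
        best_K, best_err = K, err
print(best_K, best_err)
```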
Feature Standardization
34
o Important to normalize/standardize features to similar sizes
o Consider doing 2-NN on unscaled and scaled data
Feature Standardization
35
o Important to normalize/standardize features to similar sizes
o Consider doing 2-NN on unscaled and scaled data
[Figure: the 2-NN prediction flips between $\hat{y}$ = BLUE and $\hat{y}$ = RED depending on whether the features are standardized]
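A minimal sketch of standardizing before KNN with scikit-learn's StandardScaler; the toy data, in which one feature is orders of magnitude larger than the other, is made up to mimic the picture:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Feature 0 ranges over thousands, feature 1 over single digits
X_train = np.array([[1000.0, 1.0], [1100.0, 9.0], [5000.0, 2.0], [5200.0, 8.0]])
y_train = np.array([0, 1, 0, 1])

# Rescale each feature to zero mean / unit variance, then run 2-NN on the scaled features
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=2))
model.fit(X_train, y_train)
print(model.predict(np.array([[1050.0, 8.5]])))
```

Without the scaler, feature 0 dominates the Euclidean distance and the prediction can flip, which is exactly the failure mode the slide is warning about.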
Generalization Error and Analysis
36
Bias-Variance Trade-Off:
o Pretty much analogous to our experience with polynomial regression
o As $1/K \to 1$ (i.e. as K shrinks toward 1), variance increases and bias decreases
Reducible and Irreducible Errors:
o This is more interesting, and more complicated
o What follows applies to ALL classification methods, not just KNN
The Optimal Bayes Classifier
37
Suppose we knew the true probabilistic distribution of our data: $p(Y = k \mid X)$
Example:
o Equal prob a 2D point is in Q1 or Q3
o If point in Q1, 0.9 probability it's red
o If point in Q3, 0.9 probability it's blue
o If classifier uses true model to predict...
o will be wrong 10% of time on average
o Bayes Error Rate = 0.1
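A short worked version of the arithmetic behind those blanks, assuming the Bayes classifier predicts the majority color in each quadrant:

$P(\text{wrong}) = P(Q1)\,P(\text{blue} \mid Q1) + P(Q3)\,P(\text{red} \mid Q3) = 0.5 \times 0.1 + 0.5 \times 0.1 = 0.1$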
The Optimal Bayes Classifier
38
The Bayes Error Rate is the theoretical measure of Irreducible Error
o In long run, can never do better than the Bayes Error Rate on validation data
o Of course, just like in the regression setting, we don’t actually know what this error is. We just know it’s there!
[Figure: Error Rate vs 1/K for the Training and Validation sets, with the Bayes Error Rate drawn as a horizontal floor, analogous to Var($\epsilon$) in regression]
K-Nearest Neighbors Wrap-Up
39
o Talked about the general classification setting
o Learned how to do simple KNN
o Looked at possible implementations and their complexity
Next Time:
o Look at our first parametric classifier: The Perceptron
o Kind of a weak model, but it forms the basis for Neural Networks
If-Time Bonus: Weighted KNN
40
Question: What should 5-NN predict in the following picture?
If-Time Bonus: Weighted KNN
41
Question: What should 5-NN predict in the following picture?
If-Time Bonus: Weighted KNN
42
Weighted-KNN:
o Find $N_K(x)$: the set of K training examples nearest to $x$
o Predict $\hat{y}$ to be the weighted-majority label in $N_K(x)$, weighted by inverse-distance
Improvements over KNN:
o Gives more weight to examples that are very close to the query point
o Less tie-breaking required
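A minimal sketch of inverse-distance weighting; the small eps guard against zero distances is my own addition, and scikit-learn's weights="distance" option gives the same behavior off the shelf:

```python
import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x, K, eps=1e-12):
    """Each of the K nearest neighbors votes for its label with weight 1 / distance."""
    dists = np.linalg.norm(X_train - x, axis=1)
    votes = defaultdict(float)
    for i in np.argsort(dists)[:K]:
        votes[y_train[i]] += 1.0 / (dists[i] + eps)   # closer neighbors count more
    return max(votes, key=votes.get)                  # weighted-majority label

# scikit-learn equivalent:
# KNeighborsClassifier(n_neighbors=5, weights="distance")
```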
If-Time Bonus: Weighted KNN
43
Question: What should 5-NN predict in the following picture?
Red Distances:
Blue Distances:
Red Weighted-Majority Vote:
Blue Weighted-Majority Vote:
Prediction: