
Classification and K-Nearest Neighbors

Administrivia

o Reminder: Homework 1 is due by 5pm Friday on Moodle

o Reading Quiz associated with today’s lecture. Due before class Wednesday.


Regression vs Classification

So far, we've taken a vector of features (x₁, x₂, …, xₚ) and predicted a numerical response:

y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₚxₚ

Example: $360,000 = β₀ + β₁ × (2000 sq-ft) + β₂ × (3 bathrooms)

Regression deals with quantitative predictions

But what if we're interested in making qualitative predictions?

Classification Examples

o Text Classification: Spam or Not Spam

o Text Classification: Fake News or Real News

o Text Classification: Sentiment Analysis (Tweets, Amazon Reviews)

o Image Classification: Handwritten Digits

o Image Classification: Stop Sign or No Stop Sign (Autonomous Vehicles)

o Image Classification: Hot Dog or Not Hot Dog (No real application whatsoever)

o Image Classification: Labradoodle or Fried Chicken (Important for not eating your dog)

Classification with Probability

In classification problems our outputs are discrete classes

Goal: Given features x = (x₁, x₂, …, xₚ), predict which class Y = k the example x belongs to

Conditional probability gives a nice framework for many classification models:

o Suppose our data can belong to one of K classes, k = 1, 2, …, K

o We model the conditional probability p(Y = k | x) for each k = 1, 2, …, K

o We make the prediction ŷ = argmax_k p(Y = k | x)

Example: In Spam Classification, the features are the words in the email

o The classes are SPAM and HAM

o We estimate p(SPAM | email) and p(HAM | email), e.g. if p(SPAM | email) = 0.75 and p(HAM | email) = 0.25, we predict ŷ = SPAM
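The argmax rule above is easy to state in code. Below is a minimal sketch, assuming the class probabilities p(Y = k | x) have already been estimated somehow; the helper name `predict_class` is illustrative, not from the slides.

```python
def predict_class(class_probs):
    """Return the class k that maximizes p(Y = k | x).

    class_probs: dict mapping class label -> estimated conditional probability.
    """
    return max(class_probs, key=class_probs.get)

# Spam example from the slide: p(SPAM | email) = 0.75, p(HAM | email) = 0.25
print(predict_class({"SPAM": 0.75, "HAM": 0.25}))  # -> SPAM
```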

K-Nearest Neighbors

[Figure: All Labeled Data split into a Training Set and a Validation Set]

Our first classifier is a very simple one

Assume we have a labeled training set

Features can be anything: words, pixel intensities, numerical quantities

Instead of a continuous response, we now have discrete labels (encoded as 0, 1, 2, etc.)

K-Nearest Neighbors

General Idea: Suppose we have an example x that we want to classify

o Look in the training set for the K examples that are "nearest" to x

o Predict the label for x using a majority vote of the labels of its K nearest neighbors

Questions:

o What does "nearest" mean?

o What if there's a tie in the vote?

K-Nearest Neighbors

KNN:

o Find N_K(x): the set of K training examples nearest to x

o Predict ŷ to be the majority label in N_K(x)

Question: What does "nearest" mean?

Answer: Measure closeness using the Euclidean distance ||x_i − x||₂, where x is the test example and x_i is a training vector
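A minimal sketch of this algorithm in Python, assuming a numeric feature matrix; the helper name `knn_predict` and its argument names are illustrative, not from the slides.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, K):
    """Predict the majority label among the K training points nearest to x.

    X_train: (n, p) array of training features
    y_train: length-n array of labels
    x:       length-p query point
    """
    # Euclidean distances ||x_i - x||_2 to every training example
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the K nearest neighbors, N_K(x)
    nearest = np.argsort(dists)[:K]
    # Majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```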

Worked example (red and blue training points on a scatter plot, Classes = {RED, BLUE}), using Euclidean distance ||x_i − x||₂:

o 1-NN predicts: blue

o 2-NN predicts: unclear (the vote is tied)

o 5-NN predicts: red

What About Probability?

o 5-NN gives the estimate p(Y = k | x) = (1/K) Σ_{i ∈ N_K(x)} I(y_i = k)

o Here I is the indicator function: I(b) = 1 if b is true, 0 if b is false

o In the example above: p(Y = BLUE | x) = 2/5 and p(Y = RED | x) = 3/5
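The neighbor-counting estimate above is a short numpy computation; this is a sketch under the same assumptions as the earlier `knn_predict` helper (names are illustrative).

```python
import numpy as np

def knn_probs(X_train, y_train, x, K):
    """Estimate p(Y = k | x) as the fraction of the K nearest neighbors with label k."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return dict(zip(labels, counts / K))
```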

What About Ties?

Lots of reasonable strategies

Example: If there is a tie (see the sketch below):

o reduce K by 1 and try again

o repeat until the tie is broken
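A sketch of that tie-breaking strategy, reusing the conventions of the earlier `knn_predict` helper (the function name is illustrative, not from the slides):

```python
import numpy as np
from collections import Counter

def knn_predict_tiebreak(X_train, y_train, x, K):
    """Majority vote over the K nearest neighbors; on a tie, shrink K and retry."""
    dists = np.linalg.norm(X_train - x, axis=1)
    while K >= 1:
        nearest = np.argsort(dists)[:K]
        counts = Counter(y_train[nearest]).most_common()
        # Unique winner (K = 1 can never tie), otherwise reduce K and try again
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            return counts[0][0]
        K -= 1
```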

K-Nearest Neighbors

Another worked example, predicting ŷ for a query point x against a small labeled training set:

o K = 1: the single closest point N₁(x) votes red, so ŷ = RED

o K = 2: the two closest points N₂(x), prediction ŷ = RED

o K = 3: the three closest points N₃(x) = {(0,2), (1,4), (2,4)}, majority vote gives ŷ = BLUE

A Subtle Distinction


o KNN is an instance-based method

o Given an unseen query point, look directly at training data and then predict

o Compare to regression where we use training data to learn parameters

o When we learn parameters we call the model a parametric model

o Instance-based methods like KNN are called non-parametric models

Question: When would parametric vs non-parametric make a big difference?


Naïve Implementation

How much does KNN cost?

Suppose your training set has n examples, each with p features

Now you classify a single example x

Naïve Method: Loop over each training example, compute the distances ||x_i − x||₂, and cache the indices of the smallest

o The naïve method is O(pn) in time

o The naïve method is O(pn) in space

Tree-Based Implementation

Build a KD-tree or Ball-tree on the training data

o Higher up-front cost in time and storage to build the data structure

o The up-front cost is O(pn log n) in time and O(pn) in space

Now you classify a single example x

o Classification time for a single query is O(p log n)

o Good strategy if you have lots of data and need to do lots of queries
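For reference, scikit-learn exposes these structures directly; below is a small sketch using sklearn.neighbors.KDTree on a made-up training set (the data and variable names are illustrative, not from the slides).

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))          # 1000 examples, 2 features
y_train = rng.integers(0, 2, size=1000)       # binary labels

tree = KDTree(X_train)                        # up-front build cost
dist, ind = tree.query([[0.0, 0.0]], k=5)     # cheap per-query neighbor lookup
print(np.bincount(y_train[ind[0]]).argmax())  # majority vote among the 5 neighbors
```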

Performance Metrics and Choosing K

Question: How do we evaluate our KNN classifier?

Answer: Perform prediction on labeled data and count how many we got wrong:

Error Rate = (1/n) Σ_{i=1}^{n} I(ŷ_i ≠ y_i) = (# wrong predictions) / (# predictions)

Question: How do we choose K?
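Before turning to the choice of K: the error rate above is essentially one line of numpy, sketched here with an illustrative helper name.

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of predictions that disagree with the true labels."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))
```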

Performance Metrics and Choosing K

Like all models, KNN can overfit or underfit, depending on the choice of K

[Figure: decision boundaries for K = 1 and K = 15]

Performance Metrics and Choosing K

Choose the optimal K by varying K and computing the error rate on the training and validation sets (see the sketch below)

o Note: Plot error vs 1/K

o Lower K leads to more flexibility

o Choose K to minimize validation error

[Figure: training and validation error rate vs 1/K]
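A sketch of that model-selection loop using scikit-learn's KNeighborsClassifier; X_train, y_train, X_val, y_val are assumed to come from an existing train/validation split, and the function name is illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def choose_k(X_train, y_train, X_val, y_val, k_values=range(1, 51)):
    """Fit KNN for each candidate K and return the K with the lowest validation error."""
    errors = {}
    for k in k_values:
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        errors[k] = np.mean(clf.predict(X_val) != y_val)
    return min(errors, key=errors.get), errors
```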

Feature Standardization

o Important to normalize/standardize features to similar scales

o Consider doing 2-NN on unscaled and scaled data

[Figure: 2-NN on unscaled vs scaled versions of the same data gives different predictions (BLUE vs RED)]
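One common way to do this is a scikit-learn pipeline that standardizes each feature to zero mean and unit variance before the distance computation; the toy data below is made up purely to illustrate the scaling effect.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier

# Toy data: the first feature separates the classes, the second is large-scale noise
X = np.array([[1.0, 1000.0], [1.2, 1050.0], [3.0, 1100.0], [3.2, 980.0]])
y = np.array([0, 0, 1, 1])
query = np.array([[2.9, 1020.0]])

raw = KNeighborsClassifier(n_neighbors=2).fit(X, y)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=2)).fit(X, y)

print(raw.predict(query))     # noise feature dominates the distance -> predicts 0
print(scaled.predict(query))  # after standardization -> predicts 1
```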

Generalization Error and Analysis

Bias-Variance Trade-Off:

o Pretty much analogous to our experience with polynomial regression

o As 1/K → 1 (i.e. as K decreases toward 1), variance increases and bias decreases

Reducible and Irreducible Errors:

o This is more interesting, and more complicated

o What follows applies to ALL classification methods, not just KNN

The Optimal Bayes Classifier

Suppose we knew the true probability distribution of our data: p(Y = k | X)

Example:

o Equal probability a 2D point is in Q1 or Q3

o If the point is in Q1, 0.9 probability it's red

o If the point is in Q3, 0.9 probability it's blue

o If a classifier uses the true model to predict…

o it will be wrong 10% of the time on average (in each quadrant it predicts the majority color and misses the 0.1 minority)

o Bayes Error Rate = 0.10

The Optimal Bayes Classifier

The Bayes Error Rate is the theoretical measure of Irreducible Error

o In the long run, we can never do better than the Bayes Error Rate on validation data

o Of course, just like in the regression setting, we don't actually know what this error is. We just know it's there!

[Figure: training and test error rate vs 1/K, with a horizontal line marking the irreducible error (Var(ε) in the regression analogy)]

K-Nearest Neighbors Wrap-Up


o Talked about the general classification setting

o Learned how to do simple KNN

o Looked at possible implementations and their complexity

Next Time: o Look at our first parametric classifier: The Perceptron

o Kind of a weak model, but it forms the basis for Neural Networks

If-Time Bonus: Weighted KNN

Question: What should 5-NN predict in the following picture?

Standard KNN would predict red, but it seems like there are almost as many blues and they're closer.

If-Time Bonus: Weighted KNN

Weighted-KNN:

o Find N_K(x): the set of K training examples nearest to x

o Predict ŷ to be the weighted-majority label in N_K(x), weighting each neighbor's vote by its inverse distance

Improvements over KNN:

o Gives more weight to examples that are very close to the query point

o Less tie-breaking required

Question: What should weighted 5-NN predict in the picture above? (a code sketch follows this list)

o Red Distances: 1, 1, 1

o Blue Distances: 0.5, 0.5

o Red Weighted-Majority Vote: 1/1 + 1/1 + 1/1 = 3

o Blue Weighted-Majority Vote: 1/0.5 + 1/0.5 = 2 + 2 = 4

o Prediction: 4 > 3, so predict Blue
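A minimal sketch of weighted KNN with inverse-distance votes, in the same style as the earlier helpers (names are illustrative, not from the slides); scikit-learn offers the same behavior via KNeighborsClassifier(weights='distance').

```python
import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x, K, eps=1e-12):
    """Each of the K nearest neighbors votes with weight 1 / distance."""
    dists = np.linalg.norm(X_train - x, axis=1)
    votes = defaultdict(float)
    for i in np.argsort(dists)[:K]:
        votes[y_train[i]] += 1.0 / (dists[i] + eps)  # eps guards against a zero distance
    return max(votes, key=votes.get)

# Slide example: 3 red neighbors at distance 1 (vote 3) vs 2 blue at 0.5 (vote 4) -> blue
```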
