
Instance-Based Learning and Clustering

R&N 20.4, a bit of 20.3


Different kinds of Inductive Learning

• Supervised learning
  – Basic idea: Learn an approximation to a function y = f(x) from labelled examples { (x1,y1), (x2,y2), …, (xn,yn) }

– E.g. Decision Trees, Bayes classifiers, Instance-based learning methods

• Unsupervised learning


Instance-based learning

• Idea: For every test data point, search database of training data for ‘similar’ points and predict according to those points

• Four elements of an instance-based learner:
  – How do we define ‘similarity’?
  – How many similar data points (neighbors) do we use?
  – (Optional) What weights do we give these neighbors?
  – How do we predict using these neighbors?

One-nearest-neighbor (1-NN)

• The simplest instance-based learning method
• Four elements of 1-NN:
  – How do we define ‘similarity’? Euclidean distance metric
  – How many similar data points (neighbors) do we use? One
  – (Optional) What weights do we give these neighbors? Unused
  – How do we predict using these neighbors? Predict the same value as the nearest neighbor
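
As a concrete illustration (not from the original slides), here is a minimal 1-NN classifier in Python/NumPy; the function and variable names are our own, chosen for clarity.

```python
import numpy as np

def one_nn_predict(X_train, y_train, x_query):
    """Predict the label of x_query as the label of its single
    nearest training point under Euclidean distance."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to every training point
    return y_train[np.argmin(dists)]                   # copy the nearest neighbor's label

# Tiny example: two classes in 2-D
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y_train = np.array(["A", "A", "B", "B"])
print(one_nn_predict(X_train, y_train, np.array([2.8, 3.2])))  # -> "B"
```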


1-NN Prediction

1. Classification (predicting discrete-valued labels)

[Figure: training points from class A and class B plotted against features p1 and p2; a test point is assigned the class of its single nearest neighbor (its prediction).]

1-NN Prediction

1. Classification (predicting discrete-valued labels)

• Three classes; the background color indicates the predicted class in each region
• Solid lines are the decision boundaries between classes

[ignore the dashed purple line]


1-NN Prediction

2. Regression (predicting real-valued labels)

K-nearest-neighbor (K-NN)

• A generalization of 1-NN to multiple neighbors
• Four elements of K-NN:
  – How do we define ‘similarity’? Euclidean distance metric
  – How many similar data points (neighbors) do we use? K
  – (Optional) What weights do we give these neighbors? Unused
  – How do we predict using these neighbors?
    • Classification: predict the majority label among the neighbors
    • Regression: predict the average value among the neighbors
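
A hedged sketch of K-NN showing both prediction modes (our own NumPy code and naming, not the lecture's):

```python
import numpy as np
from collections import Counter

def knn_neighbors(X_train, x_query, k):
    """Indices of the k training points nearest to x_query (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    return np.argsort(dists)[:k]

def knn_classify(X_train, y_train, x_query, k):
    """Classification: predict the majority label among the k nearest neighbors."""
    idx = knn_neighbors(X_train, x_query, k)
    return Counter(y_train[idx]).most_common(1)[0][0]

def knn_regress(X_train, y_train, x_query, k):
    """Regression: predict the average target value among the k nearest neighbors."""
    idx = knn_neighbors(X_train, x_query, k)
    return y_train[idx].mean()

# Example: 3-NN classification of a point near the "B" cluster
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_classify(X, y, np.array([4.5, 5.0]), k=3))  # -> "B"
```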


K-NN Prediction

1. Classification (K=3)

[Figure: training points from class A and class B plotted against features p1 and p2; a test point is assigned the majority class among its 3 nearest neighbors (its prediction).]


K-NN Prediction

1. Classification (K=15)

• Three classes; the background color indicates the predicted class in each region
• Solid lines are the decision boundaries between classes

[ignore the dashed purple line]

K-NN Prediction

[Figure: decision boundaries obtained with K=1 vs. K=15; the larger K gives smoother boundaries.]


K-NN Prediction

2. Regression (with K=9)

K-NN Prediction

[Figure: K-NN regression fits with K=1 vs. K=9; averaging over more neighbors gives a smoother fit.]


Example: Recognition of handwritten digits

• Each 30 x 20 pixel image of a digit is flattened into a single 600-dimensional data point
• N sets of handwritten digit samples give 10 x N 600-dimensional training points (each color in the figure represents samples of a particular digit)
• A new handwritten sample is classified by K-NN in the 600-dimensional space, using the training data
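
The slide's 30 x 20 digit scans are not available here, but the same pipeline can be sketched with scikit-learn's bundled 8 x 8 digits dataset as a stand-in (an assumption of this sketch, not the lecture's data): flatten each image into a vector and classify with K-NN.

```python
# Sketch only: scikit-learn's 8x8 digits stand in for the slide's 30x20 scans.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()                      # each image is flattened to a 64-dim vector
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)   # K-NN in the flattened pixel space
knn.fit(X_train, y_train)                   # "training" is just storing the data
print("test accuracy:", knn.score(X_test, y_test))
```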


K-NN vs. other techniques

• Most instance-based methods work only for real-valued inputs
• Instance-based methods do not need a training phase, unlike decision trees and Bayes classifiers
• However, the nearest-neighbor search step can be expensive for large or high-dimensional datasets
• Instance-based learning is non-parametric, i.e. it makes no prior model assumptions
• There is no foolproof way to pre-select K … one must try different values and pick one that works well
• Discontinuities and edge effects in K-NN regression … can be addressed by introducing weights for data points that are proportional to closeness
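
The last bullet mentions weighting neighbors by closeness; one common way to do this (a sketch with our own naming, using inverse-distance weights as an illustrative choice) is:

```python
import numpy as np

def weighted_knn_regress(X_train, y_train, x_query, k, eps=1e-12):
    """K-NN regression where each neighbor's vote is weighted by inverse distance,
    so closer neighbors influence the prediction more (smoothing out edge effects)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    idx = np.argsort(dists)[:k]                 # the k nearest neighbors
    w = 1.0 / (dists[idx] + eps)                # inverse-distance weights (one common choice)
    return np.sum(w * y_train[idx]) / np.sum(w)
```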

Unsupervised Learning (a.k.a. Clustering)

• Unsupervised learning
  – Basic idea: Learn an approximation to a function y = f(x) from unlabelled examples { x1, x2, …, xn }

– The goal is to uncover distinct classes of data points (clusters), which might then lead to a supervised learning scenario

– E.g. K-means, hierarchical clustering

The following slides are adapted from Andrew Moore’s slides at http://www.autonlab.org/tutorials/kmeans.html


K-means

• Even if we have no labels for a data set, there might still be interesting structure in the data in the form of distinct clusters/clumps.

• K-means is an iterative algorithm to find such clusters given the assumption that exactly K clusters exist.

K-means

1. Ask the user how many clusters they'd like (e.g. k = 5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it's closest to (thus each Center "owns" a set of datapoints)
4. Each Center finds the centroid of the points it owns…
5. …and jumps there
6. …Repeat (steps 3–5) until terminated!
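
A minimal NumPy sketch of this loop (our own code; it initializes the Centers by sampling k data points and stops when the ownership assignment no longer changes):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: assign each point to its nearest Center, move each Center
    to the centroid of the points it owns, repeat until nothing changes."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()   # step 2: initial Centers
    assign = None
    for _ in range(n_iter):
        # step 3: each datapoint finds the Center it is closest to
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                                                    # step 6: terminated
        assign = new_assign
        # steps 4-5: each Center jumps to the centroid of the points it owns
        for j in range(k):
            owned = X[assign == j]
            if len(owned) > 0:
                centers[j] = owned.mean(axis=0)
    return centers, assign
```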


K-means Questions

• What is it trying to optimize?
• Are we sure it will terminate?
• Are we sure it will find an optimal clustering?
• How should we start it?

Distortion

Given…

• an encoder function: ENCODE : ℝ^m → [1..k]
• a decoder function: DECODE : [1..k] → ℝ^m

Define…

Distortion = Σ_{i=1}^{R} ( x_i − DECODE[ENCODE(x_i)] )²

We may as well write DECODE[j] = c_j , so

Distortion = Σ_{i=1}^{R} ( x_i − c_{ENCODE(x_i)} )²
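
To make the definitions concrete, a small NumPy sketch (our own naming, 0-based indices rather than the slide's [1..k]) where ENCODE maps a point to the index of its nearest center and DECODE looks that center up:

```python
import numpy as np

def encode(x, centers):
    """ENCODE: R^m -> {0, ..., k-1}: index of the nearest center."""
    return int(np.argmin(np.linalg.norm(centers - x, axis=1)))

def decode(j, centers):
    """DECODE: {0, ..., k-1} -> R^m: the center with index j."""
    return centers[j]

def distortion(X, centers):
    """Distortion = sum_i (x_i - DECODE[ENCODE(x_i)])^2  (squared Euclidean)."""
    return sum(np.sum((x - decode(encode(x, centers), centers)) ** 2) for x in X)
```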

The Minimal Distortion

What properties must centers c_1, c_2, …, c_k have when the distortion

Distortion = Σ_{i=1}^{R} ( x_i − c_{ENCODE(x_i)} )²

is minimized?


The Minimal Distortion (1)

What properties must centers c_1, c_2, …, c_k have when

Distortion = Σ_{i=1}^{R} ( x_i − c_{ENCODE(x_i)} )²

is minimized?

(1) Each x_i must be encoded by its nearest center, i.e. at the minimal distortion

c_{ENCODE(x_i)} = argmin_{c_j ∈ {c_1, c_2, …, c_k}} ( x_i − c_j )²

…why? Because otherwise the distortion could be reduced by replacing ENCODE[x_i] with the index of the nearest center.


The Minimal Distortion (2)

What properties must centers c_1, c_2, …, c_k have when distortion is minimized?

(2) The partial derivative of Distortion with respect to each center location must be zero.

Let OwnedBy(c_j) = the set of records owned by Center c_j. Then

Distortion = Σ_{i=1}^{R} ( x_i − c_{ENCODE(x_i)} )²
           = Σ_{j=1}^{k} Σ_{i ∈ OwnedBy(c_j)} ( x_i − c_j )²

∂Distortion/∂c_j = ∂/∂c_j Σ_{i ∈ OwnedBy(c_j)} ( x_i − c_j )²
                 = −2 Σ_{i ∈ OwnedBy(c_j)} ( x_i − c_j )
                 = 0  (for a minimum)


Thus, at a minimum, each center satisfies

c_j = (1 / |OwnedBy(c_j)|) Σ_{i ∈ OwnedBy(c_j)} x_i

i.e. each Center sits at the centroid of the points it owns.

At the minimum distortion

What properties must centers c_1, c_2, …, c_k have when

Distortion = Σ_{i=1}^{R} ( x_i − c_{ENCODE(x_i)} )²

is minimized?

(1) Each x_i must be encoded by its nearest center
(2) Each Center must be at the centroid of the points it owns


Improving a suboptimal configuration…

What can be changed about centers c_1, c_2, …, c_k when

Distortion = Σ_{i=1}^{R} ( x_i − c_{ENCODE(x_i)} )²

is not minimized?

(1) Change the encoding so that each x_i is encoded by its nearest center
(2) Set each Center to the centroid of the points it owns

There's no point applying either operation twice in succession, but it can be profitable to alternate.

…And that's K-means!


Will we find the optimal configuration?

• Not necessarily.
• Can you invent a configuration that has converged, but does not have the minimum distortion?


Trying to find good optima

• Idea 1: Be careful about where you start
• Idea 2: Do many runs of k-means, each from a different random start configuration
• Many other ideas are floating around.
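
Idea 2 in code form: a hedged sketch that reuses the kmeans and distortion helpers sketched earlier (hypothetical names from those sketches), running several random restarts and keeping the configuration with the lowest distortion.

```python
def kmeans_restarts(X, k, n_restarts=10):
    """Run K-means from several random initializations and keep the best result."""
    best = None
    for seed in range(n_restarts):
        centers, assign = kmeans(X, k, seed=seed)   # from the earlier sketch
        d = distortion(X, centers)                  # from the earlier sketch
        if best is None or d < best[0]:
            best = (d, centers, assign)
    return best   # (distortion, centers, assignment) of the best run
```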

Other distance metrics

• Note that we could have used the Manhattan distance metric instead of the one above.

• If so,

Distortion = Σ_{i=1}^{R} | x_i − c_{ENCODE(x_i)} |

How would you find the distortion-minimizing centers in this case?


Example: Image Segmentation

• Once K-means has been performed, the resulting cluster centers can be thought of as K labelled data points for 1-NN on the entire training set, so that each data point is labelled with its nearest center. This is called Vector Quantization.

[Figures: vector quantization on pixel intensities; vector quantization on pixel colors.]
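
A sketch of vector quantization on pixel colors (assumes an RGB image stored as a NumPy array; scikit-learn's KMeans is used here as a convenience, not the lecture's own code):

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_colors(image, k=8):
    """Vector quantization of pixel colors: cluster the pixels' RGB values with
    K-means, then replace every pixel by its nearest cluster center."""
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(float)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    quantized = km.cluster_centers_[km.labels_]     # 1-NN of each pixel to the K centers
    return quantized.reshape(h, w, c).astype(image.dtype)
```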

Common uses of K-means

• Often used as an exploratory data analysis tool
• In one dimension, a good way to quantize real-valued variables into k non-uniform buckets
• Used on acoustic data in speech understanding to convert waveforms into one of k categories (i.e. Vector Quantization)
• Also used for choosing color palettes on old-fashioned graphical display devices!
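
For the one-dimensional use above, a quick sketch (again leaning on scikit-learn's KMeans as an assumption) that buckets a real-valued variable into k non-uniform ranges:

```python
import numpy as np
from sklearn.cluster import KMeans

values = np.random.default_rng(0).lognormal(size=1000)          # skewed 1-D data
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(values.reshape(-1, 1))
buckets = km.labels_                                            # bucket index for each value
print(np.sort(km.cluster_centers_.ravel()))                     # non-uniform bucket centers
```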


Single Linkage Hierarchical Clustering

1. Say "Every point is its own cluster"
2. Find the "most similar" pair of clusters
3. Merge it into a parent cluster
4. Repeat… until you've merged the whole dataset into one cluster

You're left with a nice dendrogram, or taxonomy, or hierarchy of datapoints.

How do we define similarity between clusters?

• Minimum distance between points in clusters

• Maximum distance between points in clusters

• Average distance between points in clusters


Hierarchical Clustering Comments

• It's nice that you get a hierarchy instead of an amorphous collection of groups
• If you want k groups, just cut the (k−1) longest links
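
A hedged SciPy sketch of single-linkage clustering (our choice of library; method='single' corresponds to the minimum-distance similarity above, and fcluster's maxclust criterion performs the "cut into k groups" step):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(0).normal(size=(30, 2))      # toy 2-D data
Z = linkage(X, method='single')                        # single linkage: min distance between clusters
labels = fcluster(Z, t=3, criterion='maxclust')        # cut the hierarchy into k=3 groups
# dendrogram(Z)   # with matplotlib, draws the full hierarchy
print(labels)
```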
