
Instance-Based Learning and Clustering

R&N 20.4, a bit of 20.3


Different kinds of Inductive Learning

• Supervised learning
  – Basic idea: Learn an approximation for a function y = f(x) based on labelled examples { (x1,y1), (x2,y2), …, (xn,yn) }
  – E.g. Decision Trees, Bayes classifiers, Instance-based learning methods

• Unsupervised learning



Instance-based learning

• Idea: For every test data point, search database of training data for ‘similar’ points and predict according to those points

• Four elements of an instance-based learner:
  – How do we define ‘similarity’?
  – How many similar data points (neighbors) do we use?
  – (Optional) What weights do we give these neighbors?
  – How do we predict using these neighbors?

One-nearest-neighbor (1-NN)

• Simplest instance-based learning method
• Four elements of 1-NN:
  – How do we define ‘similarity’? • Euclidean distance metric
  – How many similar data points (neighbors) do we use? • One
  – (Optional) What weights do we give these neighbors? • Unused
  – How do we predict using these neighbors? • Predict the same value as the nearest neighbor
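To make the 1-NN recipe concrete, here is a minimal Python sketch (the slides give no code; the function and variable names are illustrative):

```python
import numpy as np

def one_nn_predict(X_train, y_train, x_query):
    """Predict for x_query the label of its single nearest training
    point under the Euclidean distance metric."""
    dists = np.sum((X_train - x_query) ** 2, axis=1)  # squared Euclidean distances
    return y_train[np.argmin(dists)]                  # label of the nearest neighbor

# Tiny made-up example: two class-A points, one class-B point
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array(["A", "A", "B"])
print(one_nn_predict(X_train, y_train, np.array([0.9, 1.2])))  # -> "A"
```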


1-NN Prediction

1. Classification (predicting discrete-valued labels)

[Figure: training points from class A and class B plotted against features p1 and p2; the test point is assigned the label of its single nearest neighbor as its prediction.]


1-NN Prediction

1. Classification (predicting discrete-valued labels)

• Three classes
• Background color indicates the predicted class in each region
• Solid lines are the decision boundaries between classes

[Figure: 1-NN decision regions for three classes; ignore the dashed purple line.]


1-NN Prediction

2. Regression (predicting real-valued labels)

K-nearest-neighbor (K-NN)

• A generalization of 1-NN to multiple neighbors
• Four elements of K-NN:
  – How do we define ‘similarity’? • Euclidean distance metric
  – How many similar data points (neighbors) do we use? • K
  – (Optional) What weights do we give these neighbors? • Unused
  – How do we predict using these neighbors?
    • Classification: predict the majority label among the neighbors
    • Regression: predict the average value among the neighbors
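As a hedged sketch of how those four choices translate into code (again illustrative names, not from the slides), majority vote covers classification and averaging covers regression:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """K-NN prediction: majority label among the k nearest neighbors for
    classification, or their average value for regression."""
    dists = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                            # indices of the k closest points
    if task == "classification":
        return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote
    return float(np.mean(y_train[nearest]))                    # average for regression
```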


K-NN Prediction

1. Classification (K=3)

[Figure: class A and class B training points plotted against features p1 and p2; the test point is assigned the majority label among its three nearest neighbors.]


K-NN Prediction

1. Classification (K=15)

• Three classes
• Background color indicates the predicted class in each region
• Solid lines are the decision boundaries between classes

[Figure: K-NN decision regions for three classes with K=15; ignore the dashed purple line.]

K-NN Prediction

[Figure: decision boundaries compared for K=1 and K=15.]


K-NN Prediction

2. Regression (with K=9)

[Figure: K-NN regression fits compared for K=1 and K=9.]


Example: Recognition of handwritten digits

• Each handwritten digit is a 20 × 30 pixel image, i.e. a 600-dimensional data point.
• N sets of handwritten digit samples give 10×N 600-dimensional training points.
• A new handwritten sample is classified by K-NN in 600-dimensional space on the training data.

[Figure: training digits plotted as points; each color represents samples of a particular digit.]
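A small sketch of the data representation described above, assuming the digit images arrive as 30 × 20 NumPy arrays (the slides only fix the 600-dimensional total, so the exact layout is an assumption):

```python
import numpy as np

def image_to_vector(img):
    """Flatten a 30x20 pixel digit image into a 600-dimensional data point."""
    assert img.shape == (30, 20)          # assumed layout: 30 rows by 20 columns
    return img.reshape(-1).astype(float)  # 600-dimensional vector

# Stacking 10*N such vectors gives the training matrix; a new sample is then
# classified with K-NN (e.g. the knn_predict sketch above) in this 600-D space.
```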


K-NN vs. other techniques

• Most instance-based methods work only for real-valued inputs.
• Instance-based methods do not need a training phase, unlike decision trees and Bayes classifiers.
• However, the nearest-neighbors-search step can be expensive for large/high-dimensional datasets.
• Instance-based learning is non-parametric, i.e. it makes no prior model assumptions.
• There is no foolproof way to pre-select K … one must try different values and pick one that works well.
• Problems of discontinuities and edge effects in K-NN regression … can be addressed by introducing weights for data points that are proportional to closeness (see the sketch below).
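The last bullet suggests closeness-proportional weights; one common concrete choice is inverse-distance weighting, sketched here for K-NN regression (the specific weighting function is my assumption, not prescribed by the slides):

```python
import numpy as np

def weighted_knn_regress(X_train, y_train, x_query, k=9, eps=1e-8):
    """K-NN regression where nearer neighbors get larger weights, which
    smooths the discontinuities of plain K-NN regression."""
    dists = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)   # weight proportional to closeness
    return float(np.sum(weights * y_train[nearest]) / np.sum(weights))
```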

Unsupervised Learning (a.k.a. Clustering)

• Unsupervised learning
  – Basic idea: Learn an approximation for a function y = f(x) based on unlabelled examples { x1, x2, …, xn }
  – The goal is to uncover distinct classes of data points (clusters), which might then lead to a supervised learning scenario
  – E.g. K-means, hierarchical clustering

The following slides are adapted from Andrew Moore’s slides at http://www.autonlab.org/tutorials/kmeans.html


K-means

• Even if we have no labels for a data set, there might still be interesting structure in the data in the form of distinct clusters/clumps.
• K-means is an iterative algorithm to find such clusters given the assumption that exactly K clusters exist.




K-means

1. Ask the user how many clusters they’d like (e.g. k=5).
2. Randomly guess k cluster Center locations.
3. Each datapoint finds out which Center it’s closest to. (Thus each Center “owns” a set of datapoints.)
4. Each Center finds the centroid of the points it owns…
5. …and jumps there.
6. …Repeat until terminated!
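A minimal K-means sketch following these six steps (my own code, not from the slides; a production version would handle empty clusters and convergence more carefully):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: assign each point to its nearest Center, then move
    each Center to the centroid of the points it owns, and repeat."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # step 2: random initial Centers
    for _ in range(n_iters):
        # step 3: each datapoint finds the Center it is closest to
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        owner = np.argmin(dists, axis=1)
        # steps 4-5: each Center jumps to the centroid of the points it owns
        new_centers = np.array([X[owner == j].mean(axis=0) if np.any(owner == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                     # step 6: repeat until converged
            break
        centers = new_centers
    return centers, owner
```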


K-means Questions

• What is it trying to optimize?
• Are we sure it will terminate?
• Are we sure it will find an optimal clustering?
• How should we start it?

Distortion

Given:

• an encoder function ENCODE : ℜ^m → [1..k]
• a decoder function DECODE : [1..k] → ℜ^m

Define:

$$\text{Distortion} = \sum_{i=1}^{R} \bigl(\mathbf{x}_i - \text{DECODE}[\text{ENCODE}(\mathbf{x}_i)]\bigr)^2$$


We may as well write $\mathbf{c}_j = \text{DECODE}[j]$, so

$$\text{Distortion} = \sum_{i=1}^{R} \bigl(\mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)}\bigr)^2$$
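A hedged one-liner computing this quantity in code, assuming a data matrix X (R × m), a centers array (k × m), and the owner/ENCODE assignment vector returned by the kmeans sketch earlier:

```python
import numpy as np

def distortion(X, centers, owner):
    """Sum of squared distances from each point to the Center that encodes it."""
    return float(np.sum((X - centers[owner]) ** 2))
```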

The Minimal Distortion

What properties must the centers $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k$ have when the distortion

$$\text{Distortion} = \sum_{i=1}^{R} \bigl(\mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)}\bigr)^2$$

is minimized?


The Minimal Distortion (1)

What properties must the centers $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k$ have when the distortion

$$\text{Distortion} = \sum_{i=1}^{R} \bigl(\mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)}\bigr)^2$$

is minimized?

(1) $\mathbf{x}_i$ must be encoded by its nearest center … why?

At the minimal distortion,

$$\mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} = \arg\min_{\mathbf{c}_j \in \{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k\}} \bigl(\mathbf{x}_i - \mathbf{c}_j\bigr)^2$$

Otherwise the distortion could be reduced by replacing ENCODE[$\mathbf{x}_i$] with the index of the nearest center.


The Minimal Distortion (2)

What properties must the centers $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k$ have when distortion is minimized?

(2) The partial derivative of Distortion with respect to each center location must be zero.

$$\text{Distortion} = \sum_{i=1}^{R} \bigl(\mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)}\bigr)^2 = \sum_{j=1}^{k} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \bigl(\mathbf{x}_i - \mathbf{c}_j\bigr)^2$$

$$\frac{\partial\,\text{Distortion}}{\partial \mathbf{c}_j} = \frac{\partial}{\partial \mathbf{c}_j} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \bigl(\mathbf{x}_i - \mathbf{c}_j\bigr)^2 = -2 \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \bigl(\mathbf{x}_i - \mathbf{c}_j\bigr) = 0 \quad \text{(for a minimum)}$$

OwnedBy($\mathbf{c}_j$) = the set of records owned by Center $\mathbf{c}_j$.


Thus, at a minimum:

$$\mathbf{c}_j = \frac{1}{|\text{OwnedBy}(\mathbf{c}_j)|} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \mathbf{x}_i$$

At the minimum distortion

What properties must the centers $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k$ have when the distortion

$$\text{Distortion} = \sum_{i=1}^{R} \bigl(\mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)}\bigr)^2$$

is minimized?

(1) $\mathbf{x}_i$ must be encoded by its nearest center.

(2) Each Center must be at the centroid of the points it owns.


Improving a suboptimal configuration…

What can we change about the centers $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k$ when the distortion

$$\text{Distortion} = \sum_{i=1}^{R} \bigl(\mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)}\bigr)^2$$

is not minimized?

(1) Change the encoding so that $\mathbf{x}_i$ is encoded by its nearest center.

(2) Set each Center to the centroid of the points it owns.

There’s no point applying either operation twice in succession, but it can be profitable to alternate.

…And that’s K-means!


Will we find the optimal configuration?

• Not necessarily.
• Can you invent a configuration that has converged, but does not have the minimum distortion?


Trying to find good optima

• Idea 1: Be careful about where you start.
• Idea 2: Do many runs of k-means, each from a different random start configuration (see the sketch below).
• Many other ideas floating around.
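Idea 2 might look like the following, reusing the kmeans and distortion sketches from earlier (the number of restarts is arbitrary):

```python
def kmeans_multi_restart(X, k, n_restarts=10):
    """Run K-means from several random start configurations and keep
    the run with the lowest distortion."""
    best = None
    for seed in range(n_restarts):
        centers, owner = kmeans(X, k, seed=seed)   # kmeans sketched earlier
        d = distortion(X, centers, owner)          # distortion sketched earlier
        if best is None or d < best[0]:
            best = (d, centers, owner)
    return best  # (distortion, centers, owner) of the best run
```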

Other distance metrics

• Note that we could have used the Manhattan distance metric instead of the one above.
• If so,

$$\text{Distortion} = \sum_{i=1}^{R} \bigl|\mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)}\bigr|$$

How would you find the distortion-minimizing centers in this case?


Example: Image Segmentation

• Once K-means is performed, the resulting cluster centers can be thought of as K labelled data points for 1-NN on the entire training set, such that each data point is labelled with its nearest center. This is called Vector Quantization.


[Figure: image segmentation examples, showing vector quantization on pixel intensities and vector quantization on pixel colors.]
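A sketch of vector quantization on pixel colors, assuming an H × W × 3 image array and the kmeans function sketched earlier (real segmentation pipelines do more, e.g. include pixel positions as features):

```python
import numpy as np

def quantize_colors(image, k=8):
    """Cluster pixel colors with K-means, then replace every pixel by its
    nearest cluster center (vector quantization / crude segmentation)."""
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(float)  # each pixel is a point in color space
    centers, owner = kmeans(pixels, k)           # kmeans sketched earlier
    return centers[owner].reshape(h, w, c)       # 1-NN to the centers
```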

Common uses of K-means

• Often used as an exploratory data analysis tool.
• In one dimension, a good way to quantize real-valued variables into k non-uniform buckets.
• Used on acoustic data in speech understanding to convert waveforms into one of k categories (i.e. Vector Quantization).
• Also used for choosing color palettes on old-fashioned graphical display devices!






Single Linkage Hierarchical Clustering

1. Say “Every point is its own cluster.”
2. Find the “most similar” pair of clusters.
3. Merge them into a parent cluster.
4. Repeat… until you’ve merged the whole dataset into one cluster.

You’re left with a nice dendrogram, or taxonomy, or hierarchy of datapoints.

How do we define similarity between clusters?

• Minimum distance between points in the clusters
• Maximum distance between points in the clusters
• Average distance between points in the clusters

(A sketch of the minimum-distance, i.e. single-linkage, version follows.)
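A naive single-linkage sketch under the minimum-distance criterion (illustrative only and inefficient; practical code would use a dedicated library routine):

```python
import numpy as np

def single_linkage(X, target_clusters=1):
    """Agglomerative clustering: start with every point as its own cluster and
    repeatedly merge the pair of clusters with the smallest minimum
    point-to-point distance (single linkage)."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > target_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])      # minimum distance between clusters
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]          # merge into a parent cluster
        del clusters[b]
    return clusters
```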


Hierarchical Clustering Comments

• It’s nice that you get a hierarchy instead of an amorphous collection of groups.
• If you want k groups, just cut the (k−1) longest links.