
Page 1: K-means++: The Advantages of Careful Seeding

K-means++: The Advantages of Careful Seeding

Sergei Vassilvitskii and David Arthur

(Stanford University)

Page 2: K-means++: The Advantages of Careful Seeding

Clustering

Given $n$ points in $\mathbb{R}^d$, split them into $k$ similar groups.

Page 3: K-means++: The Advantages of Careful Seeding

Clustering

Given $n$ points in $\mathbb{R}^d$, split them into $k$ similar groups.

This talk: k-means clustering.

Find $k$ centers $C$ that minimize

$$\sum_{x \in X} \min_{c \in C} \|x - c\|_2^2$$
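In code, this objective can be evaluated directly (a minimal NumPy sketch; the function name and array layout are my own, not from the talk):

```python
import numpy as np

def kmeans_cost(X, centers):
    """k-means objective: sum over points of squared distance to the nearest center.

    X: (n, d) array of points; centers: (k, d) array."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
    return d2.min(axis=1).sum()
```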

Page 4: K-means++: The Advantages of Careful Seeding

Why Means?

Objective: Find $k$ centers $C$ that minimize

$$\sum_{x \in X} \min_{c \in C} \|x - c\|_2^2$$

For one cluster: find $y$ that minimizes

$$\sum_{x \in X} \|x - y\|_2^2$$

Easy!

$$y = \frac{1}{|X|} \sum_{x \in X} x$$
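A one-line justification of this step (the expansion is standard, not spelled out on the slide): writing $\bar{x} = \frac{1}{|X|}\sum_{x \in X} x$,

$$\sum_{x \in X} \|x - y\|_2^2 = \sum_{x \in X} \|x - \bar{x}\|_2^2 + |X|\,\|\bar{x} - y\|_2^2,$$

since the cross term $2\big\langle \sum_{x \in X}(x - \bar{x}),\ \bar{x} - y \big\rangle$ vanishes; the right-hand side is minimized exactly at $y = \bar{x}$.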

Page 5: K-means++: The Advantages of Careful Seeding

Lloyd’s Method: k-means

Initialize with random clusters

Page 6: K-means++: The Advantages of Careful Seeding

Lloyd’s Method: k-means

Assign each point to nearest center

Page 7: K-means++: The Advantages of Careful Seeding

Lloyd’s Method: k-means

Recompute optimum centers (means)

Page 8: K-means++: The Advantages of Careful Seeding

Lloyd’s Method: k-means

Repeat: Assign points to nearest center

Page 9: K-means++: The Advantages of Careful Seeding

Lloyd’s Method: k-means

Repeat: Recompute centers

Page 10: K-means++: The Advantages of Careful Seeding

Lloyd’s Method: k-means

Repeat...

Page 11: K-means++: The Advantages of Careful Seeding

Lloyd’s Method: k-means

Repeat... until the clustering does not change.
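A minimal sketch of the iteration shown on these slides (NumPy; function and variable names are my own):

```python
import numpy as np

def lloyd(X, centers, max_iter=100):
    """Lloyd's method: alternate nearest-center assignment and mean updates until stable."""
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each center as the mean of its assigned points
        # (keep the old center if a cluster goes empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```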

Page 12: K-means++: The Advantages of Careful Seeding

Analysis

How good is this algorithm?

Finds a local optimum

That can be arbitrarily worse than the optimal solution.
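A tiny constructed example of such a bad local optimum (hypothetical data, not from the talk), using the lloyd sketch above:

```python
import numpy as np

# Three well-separated pairs on a line; the optimum puts one center per pair.
X = np.array([[0.0], [1.0], [100.0], [101.0], [200.0], [201.0]])

# A bad seeding: two centers inside the first pair, one between the other two pairs.
bad_seed = np.array([[0.0], [1.0], [150.0]])
centers, labels = lloyd(X, bad_seed)
# Lloyd's method converges to centers {0, 1, 150.5} with kmeans_cost(X, centers) = 10001.0,
# while the optimal centers {0.5, 100.5, 200.5} have cost 1.5.
# Stretching the separation between the pairs makes the ratio arbitrarily large.
```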

Page 13: K-means++: The Advantages of Careful Seeding

Approximating k-means

• Mount et al.: $(9 + \epsilon)$-approximation in time $O(n^3 / \epsilon^d)$

• Har-Peled et al.: $(1 + \epsilon)$-approximation in time $O(n + k^{k+2} \epsilon^{-2dk} \log^k(n/\epsilon))$

• Kumar et al.: $(1 + \epsilon)$-approximation in time $2^{(k/\epsilon)^{O(1)}} nd$

Page 14: K-means++: The Advantages of Careful Seeding

Approximating k-means

• Mount et al.: $(9 + \epsilon)$-approximation in time $O(n^3 / \epsilon^d)$

• Har-Peled et al.: $(1 + \epsilon)$-approximation in time $O(n + k^{k+2} \epsilon^{-2dk} \log^k(n/\epsilon))$

• Kumar et al.: $(1 + \epsilon)$-approximation in time $2^{(k/\epsilon)^{O(1)}} nd$

Lloyd's method:

• Worst-case time complexity: $2^{\Omega(\sqrt{n})}$

• Smoothed complexity: $n^{O(k)}$

Page 15: K-means++: The Advantages of Careful Seeding

Approximating k-means

• Mount et al.: $(9 + \epsilon)$-approximation in time $O(n^3 / \epsilon^d)$

• Har-Peled et al.: $(1 + \epsilon)$-approximation in time $O(n + k^{k+2} \epsilon^{-2dk} \log^k(n/\epsilon))$

• Kumar et al.: $(1 + \epsilon)$-approximation in time $2^{(k/\epsilon)^{O(1)}} nd$

Lloyd's method:

For example, on the Digit Recognition dataset (UCI): $n = 60{,}000$, $d = 600$, convergence to a local optimum in 60 iterations.

Page 16: K-means++: The Advantages of Careful Seeding

Challenge

Develop an approximation algorithm for k-means clustering that is competitive with the k-means method in speed and solution quality.

Easiest line of attack: focus on the initial center positions.

Classical k-means: pick $k$ points at random.

Page 17: K-means++: The Advantages of Careful Seeding

k-means on Gaussians

Page 18: K-means++: The Advantages of Careful Seeding

k-means on Gaussians

Page 19: K-means++: The Advantages of Careful Seeding

Easy Fix

Select centers using a furthest point algorithm (2-approximation to k-Center clustering).

Page 20: K-means++: The Advantages of Careful Seeding

Easy Fix

Select centers using a furthest point algorithm (2-approximation to k-Center clustering).

Page 21: K-means++: The Advantages of Careful Seeding

Easy Fix

Select centers using a furthest point algorithm (2-approximation to k-Center clustering).

Page 22: K-means++: The Advantages of Careful Seeding

Easy Fix

Select centers using a furthest point algorithm (2-approximation to k-Center clustering).

Page 23: K-means++: The Advantages of Careful Seeding

Easy Fix

Select centers using a furthest point algorithm (2-approximation to k-Center clustering).
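A minimal sketch of the furthest-point (greedy k-Center) seeding these slides refer to (the function name and NumPy layout are my own):

```python
import numpy as np

def furthest_point_seed(X, k, rng=None):
    """Greedy furthest-point seeding: each new center is the point
    furthest from all centers chosen so far."""
    rng = np.random.default_rng() if rng is None else rng
    centers = [X[rng.integers(X.shape[0])]]  # start from a random point
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2), axis=1)
        centers.append(X[d2.argmax()])       # deterministic: chase the furthest point
    return np.array(centers)
```

Because the argmax always chases the single most distant point, one outlier is enough to capture a center, which is exactly the sensitivity the next slides illustrate.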

Page 24: K-means++: The Advantages of Careful Seeding

Sensitive to Outliers

Page 25: K-means++: The Advantages of Careful Seeding

Sensitive to Outliers

Page 26: K-means++: The Advantages of Careful Seeding

Sensitive to Outliers

Page 27: K-means++: The Advantages of Careful Seeding

k-means++

Interpolate between the two methods:

Let $D(x)$ be the distance between $x$ and the nearest cluster center. Sample the next center proportionally to $D^{\alpha}(x)$.

Original Lloyd's: $\alpha = 0$

Furthest Point: $\alpha = \infty$

k-means++: $\alpha = 2$ (i.e., proportionally to the contribution of $x$ to the overall error)
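A minimal sketch of the $D^2$ sampling described on this slide ($\alpha = 2$); the function name and NumPy layout are my own:

```python
import numpy as np

def kmeanspp_seed(X, k, rng=None):
    """k-means++ seeding: sample each new center with probability proportional to D(x)^2."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its nearest chosen center.
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2), axis=1)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```

The seeds are then handed to Lloyd's method as usual, e.g. lloyd(X, kmeanspp_seed(X, k)) with the earlier sketch.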

Page 28: K-means++: The Advantages of Careful Seeding

k-Means++

Page 29: K-means++: The Advantages of Careful Seeding

k-Means++

Theorem: k-means++ is $O(\log k)$-approximate in expectation.

Ostrovsky et al. [06]: a similar method is $O(1)$-approximate under some data-distribution assumptions.

Page 30: K-means++: The Advantages of Careful Seeding

Proof - 1st cluster

Fix an optimal clustering $C^*$.

Pick first center uniformly at random

Bound the total error of that cluster.

Page 31: K-means++: The Advantages of Careful Seeding

Proof - 1st cluster

Let $A$ be the cluster. Each point $a_0 \in A$ is equally likely to be the chosen center.

Expected Error:

$$\mathbb{E}[\phi(A)] = \sum_{a_0 \in A} \frac{1}{|A|} \sum_{a \in A} \|a - a_0\|^2 = 2 \sum_{a \in A} \|a - \bar{A}\|^2 = 2\,\phi^*(A)$$

(where $\bar{A}$ is the centroid of $A$).
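The second equality is the same expansion around the centroid used on the "Why Means?" slide: for any fixed $a_0$,

$$\sum_{a \in A} \|a - a_0\|^2 = \sum_{a \in A} \|a - \bar{A}\|^2 + |A|\,\|\bar{A} - a_0\|^2,$$

and averaging over the uniform choice of $a_0 \in A$ gives $\sum_{a \in A} \|a - \bar{A}\|^2 + \sum_{a_0 \in A} \|\bar{A} - a_0\|^2 = 2\,\phi^*(A)$.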

Page 32: K-means++: The Advantages of Careful Seeding

Proof - Other Clusters

Suppose the next center comes from a new cluster in OPT.

Bound the total error of that cluster.

Page 33: K-means++: The Advantages of Careful Seeding

Other Clusters

Let $B$ be this cluster, and $b_0$ the point selected. Then:

$$\mathbb{E}[\phi(B)] = \sum_{b_0 \in B} \frac{D^2(b_0)}{\sum_{b \in B} D^2(b)} \cdot \sum_{b \in B} \min(D(b), \|b - b_0\|)^2$$

Key step (triangle inequality through the center nearest to $b$):

$$D(b_0) \le D(b) + \|b - b_0\|$$

Page 34: K-means++: The Advantages of Careful Seeding

Cont.

For any $b$ (squaring the key step and using $(u + v)^2 \le 2u^2 + 2v^2$):

$$D^2(b_0) \le 2 D^2(b) + 2\|b - b_0\|^2$$

Averaging over all $b \in B$:

$$D^2(b_0) \le \frac{2}{|B|} \sum_{b \in B} D^2(b) + \frac{2}{|B|} \sum_{b \in B} \|b - b_0\|^2$$

The first term is the same for all $b_0$; the second is the cost of $b_0$ under uniform sampling.

Page 35: K-means++: The Advantages of Careful Seeding

Cont.

For any $b$:

$$D^2(b_0) \le 2 D^2(b) + 2\|b - b_0\|^2$$

Averaging over all $b \in B$:

$$D^2(b_0) \le \frac{2}{|B|} \sum_{b \in B} D^2(b) + \frac{2}{|B|} \sum_{b \in B} \|b - b_0\|^2$$

Recall:

$$\mathbb{E}[\phi(B)] = \sum_{b_0 \in B} \frac{D^2(b_0)}{\sum_{b \in B} D^2(b)} \cdot \sum_{b \in B} \min(D(b), \|b - b_0\|)^2 \le \frac{4}{|B|} \sum_{b_0 \in B} \sum_{b \in B} \|b - b_0\|^2 = 8\,\phi^*(B)$$

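The last inequality follows by substituting the averaged bound and splitting the expectation into two terms (a reconstruction of the standard argument, which the slide compresses):

$$\mathbb{E}[\phi(B)] \le \sum_{b_0 \in B} \frac{\frac{2}{|B|}\sum_{b} D^2(b)}{\sum_{b} D^2(b)} \sum_{b \in B} \min(D(b), \|b - b_0\|)^2 + \sum_{b_0 \in B} \frac{\frac{2}{|B|}\sum_{b} \|b - b_0\|^2}{\sum_{b} D^2(b)} \sum_{b \in B} \min(D(b), \|b - b_0\|)^2.$$

In the first term bound $\min(\cdot)^2 \le \|b - b_0\|^2$; in the second bound $\min(\cdot)^2 \le D^2(b)$, which cancels the denominator. Each term is then at most $\frac{2}{|B|}\sum_{b_0 \in B}\sum_{b \in B} \|b - b_0\|^2$, and the identity from the first-cluster slide gives the total $\frac{4}{|B|}\sum_{b_0 \in B}\sum_{b \in B} \|b - b_0\|^2 = 8\,\phi^*(B)$.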
Page 36: K-means++: The Advantages of Careful Seeding

Wrap Up

If clusters are well separated, and we always pick a center from a new optimal cluster, the algorithm is 8-competitive.

Page 37: K-means++: The Advantages of Careful Seeding

Wrap Up

If clusters are well separated, and we always pick a center from a new optimal cluster, the algorithm is 8-competitive.

Intuition: if no points from a cluster are picked, then it probably does not contribute much to the overall error.

Page 38: K-means++: The Advantages of Careful Seeding

Wrap Up

If clusters are well separated, and we always pick a center from a new optimal cluster, the algorithm is 8-competitive.

Intuition: if no points from a cluster are picked, then it probably does not contribute much to the overall error.

Formally, an inductive proof shows this method is $O(\log k)$-competitive.

Page 39: K-means++: The Advantages of Careful Seeding

Experiments

Tested on several datasets:

Synthetic

• 10k points, 3 dimensions

Cloud Cover (UCI Repository)

• 10k points, 54 dimensions

Color Quantization

• 16k points, 16 dimensions

Intrusion Detection (KDD Cup)

• 500k points, 35 dimensions

Page 40: K-means++: The Advantages of Careful Seeding

Typical Run: KM++ vs. KM vs. KM-Hybrid

[Plot: error (roughly 600 to 1300) vs. stage (0 to 500) for LLOYD, HYBRID, and KM++.]

Page 41: K-means++: The Advantages of Careful Seeding

Experiments

Total Error (k-means / km-Hybrid / k-means++):

Synthetic: 6.02 × 10^5 / 5.95 × 10^5 / 6.06 × 10^5

Cloud Cover: 670 / 741 / 712

Color: 0.016 / 0.015 / 0.014

Intrusion: 32.9 × 10^3 / 3.4 × 10^3 (third value not recovered)

Time: k-means++ is 1% slower due to initialization.

Page 42: K-means++: The Advantages of Careful Seeding

Final Message

Friends don’t let friends use k-means.

Page 43: K-means++: The Advantages of Careful Seeding

Thank You! Any Questions?