Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16


Teaching k-Means New Tricks

Sergei Vassilvitskii, Google

k-Means Algorithm

The k-Means Algorithm [Lloyd '57]
– Clusters points into groups
– Remains a workhorse of machine learning even in the age of deep networks

Lloyd's Method: k-means

– Initialize with random clusters
– Assign each point to the nearest center
– Recompute optimum centers (means)
– Repeat: assign points to the nearest center, recompute centers
– ...until the clustering does not change

Total error is reduced at every step, so the method is guaranteed to converge.

Minimizes:

φ(X, C) = Σ_{x ∈ X} d(x, C)²
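To make the loop above concrete, here is a minimal NumPy sketch of Lloyd's method (an illustration, not code from the talk): random initialization, assign each point to its nearest center, recompute the means, and stop when the assignment no longer changes.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Minimal Lloyd's method: random init, then alternate assign / recompute."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialize with k points chosen uniformly at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Assign each point to the nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when the clustering does not change.
        if labels is not None and np.array_equal(labels, new_labels):
            break
        labels = new_labels
        # Recompute optimum centers (means) of each cluster.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # The objective being minimized: phi(X, C) = sum over x of d(x, C)^2.
    cost = float((dists.min(axis=1) ** 2).sum())
    return centers, labels, cost
```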

New Tricks for k-Means

Initialization:
– Is random initialization a good idea?

Large data:
– Clustering many points (in parallel)
– Clustering into many clusters

k-means Initialization

Random? A bad idea, even with many random restarts!

Easy Fix

Select centers using a furthest point algorithm (2-approximation to k-Center clustering).

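For illustration, a small sketch of the furthest-point rule (assumed names, not the speaker's code): after a random first center, each new center is the point currently furthest from all chosen centers.

```python
import numpy as np

def furthest_point_init(X, k, seed=0):
    """Furthest-point (Gonzalez-style) seeding: repeatedly pick the point
    that is furthest from every center chosen so far."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = [X[rng.integers(len(X))]]
    # Distance from each point to its nearest chosen center.
    dist = np.linalg.norm(X - centers[0], axis=1)
    for _ in range(1, k):
        nxt = int(dist.argmax())                      # current furthest point
        centers.append(X[nxt])
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(centers)
```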

Sensitive to Outliers

The furthest-point rule chases outliers: an outlier is by definition far from every chosen center, so it will be selected as a center.

k-means++

Interpolate between the two methods. Give preference to further points.

Let D(p) be the distance between p and the nearest cluster center. Sample the next center proportionally to D^α(p):

– α = 0: original Lloyd's (uniform at random)
– α = ∞: furthest point
– α = 2: k-means++

kmeans++:
  Select first point uniformly at random
  for (int i = 1; i < k; ++i) {
    Select next point p with probability D^α(p) / Σ_x D^α(x);
    UpdateDistances();
  }
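A minimal sketch of the D^α sampling loop above (the function name is mine): α = 2 is the k-means++ rule, α = 0 reduces to uniform random seeding, and a very large α approaches the furthest-point rule.

```python
import numpy as np

def dalpha_seeding(X, k, alpha=2.0, seed=0):
    """Pick k centers, sampling each next center with probability
    proportional to D(p)**alpha, where D(p) is the distance to the
    nearest already-chosen center."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    centers = [X[rng.integers(n)]]            # first center uniformly at random
    dist = np.linalg.norm(X - centers[0], axis=1)
    for _ in range(1, k):
        weights = dist ** alpha
        probs = weights / weights.sum()       # sample proportionally to D^alpha(p)
        idx = rng.choice(n, p=probs)
        centers.append(X[idx])
        # UpdateDistances(): distance to the nearest center chosen so far.
        dist = np.minimum(dist, np.linalg.norm(X - X[idx], axis=1))
    return np.array(centers)
```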

Theorem [AV '07]: k-means++ guarantees a Θ(log k) approximation.

New Tricks for k-Means

Initialization:
– Is random initialization a good idea?

Large data:
– Clustering many points (in parallel)
– Clustering into many clusters

Dealing with large data

The new initialization approach:
– Leads to very good clusterings
– But is very sequential!
  • Must select one center at a time, then update the distribution we are sampling from
– How to adapt it to the world of parallel computing?

Speeding up initialization

Initialization:

kmeans++:
  Select first point uniformly at random
  for (int i = 1; i < k; ++i) {
    Select next point p with probability D²(p) / Σ_x D²(x);
    UpdateDistances();
  }

Improving the speed:
– Instead of selecting a single point, sample many points at a time
– Oversample: select more than k centers, then select the best k out of them

k-means||

kmeans||:
  Select first point c uniformly at random
  for (int i = 1; i < log_ℓ(φ(X, c)); ++i) {
    Select each point p independently with probability k·ℓ·D^α(p) / Σ_x D^α(x);
    UpdateDistances();
  }
  Prune to k points total by clustering the clusters

– Independent selection makes each round an easy MapReduce step
– ℓ is the oversampling parameter
– The final pruning is a re-clustering step over the (much smaller) candidate set
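A rough sketch of the oversampling loop under my reading of the slide (the names, the fixed number of rounds, and the simple pruning rule are assumptions, not the paper's exact procedure): in each round every point is kept independently with probability driven by its squared distance to the current candidates, and the oversampled candidate set is pruned back to k centers at the end.

```python
import numpy as np

def kmeans_parallel_init(X, k, ell=2.0, rounds=5, seed=0):
    """Sketch of k-means||-style oversampling: a few rounds of independent
    sampling, then prune the candidate set back to k centers."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    candidates = [X[rng.integers(n)]]                      # first center uniformly at random
    dist2 = ((X - candidates[0]) ** 2).sum(axis=1)         # D^2(p) to the nearest candidate
    for _ in range(rounds):
        phi = dist2.sum()
        if phi == 0:
            break
        # Every point is selected independently -- this is the easy MapReduce step.
        p_select = np.minimum(1.0, k * ell * dist2 / phi)  # assumed oversampling probability
        for c in X[rng.random(n) < p_select]:
            candidates.append(c)
            dist2 = np.minimum(dist2, ((X - c) ** 2).sum(axis=1))
    # Prune to k points total by clustering the clusters: weight each candidate
    # by the points it serves, then keep k of them (a stand-in for the real
    # re-clustering step, which clusters the weighted candidates).
    C = np.array(candidates)
    owner = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    weights = np.bincount(owner, minlength=len(C))
    return C[np.argsort(-weights)[:k]]
```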

k-means||: Analysis

How many rounds?
– Theorem: after O(log_ℓ(nΔ)) rounds, guarantees an O(1) approximation
– In practice: fewer iterations are needed
– Need to re-cluster the O(kℓ·log_ℓ(nΔ)) intermediate centers

Discussion:
– Number of rounds is independent of k
– Tradeoff between the number of rounds and memory

How well does this work?

[Plots: clustering cost vs. number of rounds on the KDD dataset for k = 17, 33, 65, and 129, comparing random initialization, k-means++, and k-means|| with ℓ/k = 1, 2, 4.]


Performance vs. k-means++

– Even better on small datasets: 4600 points, 50 dimensions (SPAM)

– Accuracy:

– Time (iterations):


New Tricks for k-Means

Initialization:
– Is random initialization a good idea?

Large data:
– Clustering many points (in parallel)
– Clustering into many clusters

Large k

How do you run k-means when k is large?
– For every point, need to find the nearest center
– Naive approach: linear scan over the centers
– Better approach [Elkan]:
  • Use the triangle inequality to see if a center could possibly have gotten closer (see the sketch below)
  • Still expensive when k is large
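The pruning fact behind Elkan-style acceleration is the triangle inequality: if d(c, c′) ≥ 2·d(x, c), then d(x, c′) ≥ d(x, c), so center c′ cannot be closer to x and its distance never needs to be computed. Below is a minimal sketch of just that check (not Elkan's full upper/lower-bound bookkeeping):

```python
import numpy as np

def assign_with_pruning(X, centers):
    """Assign each point to its nearest center, skipping centers ruled out by
    the triangle inequality: d(c, c') >= 2 * d(x, c) implies d(x, c') >= d(x, c)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Precompute center-to-center distances once per assignment pass.
    cc = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        best, best_d = 0, np.linalg.norm(x - centers[0])
        for j in range(1, len(centers)):
            # If the two centers are far apart relative to the current best
            # distance, center j cannot possibly be closer -- skip it.
            if cc[best, j] >= 2 * best_d:
                continue
            d = np.linalg.norm(x - centers[j])
            if d < best_d:
                best, best_d = j, d
        labels[i] = best
    return labels
```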

Using Nearest Neighbor Data Structures

Expensive step of k-means:
– For every point, find the nearest center

But we have many algorithms for nearest neighbors!

First idea:
– Index the centers, then query this data structure for every point
– Need to rebuild the NN data structure every iteration, since the centers move

Better idea:
– Index the points!
– For every center, query the nearest points (see the sketch below)
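A minimal sketch of the "index the points" idea, using scikit-learn's NearestNeighbors as the index (my choice for illustration; the talk's system and the ranked-retrieval paper do more careful bookkeeping): the point index is built once, each center retrieves its nearest points, a point retrieved by several centers keeps the closest one, and points no center retrieved fall back to a brute-force pass. The assignment is approximate unless per_center is large enough.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def assign_via_point_index(X, centers, per_center=1000):
    """Index the points once; every center queries its nearest points."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    n, k = len(X), len(centers)
    index = NearestNeighbors().fit(X)                     # index over the points, not the centers
    m = min(per_center, n)
    dist, idx = index.kneighbors(centers, n_neighbors=m)  # each center retrieves m nearest points
    best_d = np.full(n, np.inf)
    labels = np.full(n, -1, dtype=int)
    for c in range(k):
        for d, p in zip(dist[c], idx[c]):
            if d < best_d[p]:                             # keep the closest retrieving center
                best_d[p], labels[p] = d, c
    # Brute-force fallback for points that no center retrieved.
    missed = np.where(labels == -1)[0]
    if len(missed) > 0:
        d_all = np.linalg.norm(X[missed, None, :] - centers[None, :, :], axis=2)
        labels[missed] = d_all.argmin(axis=1)
    return labels
```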

Performance

Two large datasets:
– 1M points in each
– 7-25M features in each (very high dimensionality)
– Clustering into k = 1000 clusters

Index-based k-means:
– Simple implementation: 2-7x faster than traditional k-means
– No degradation in quality (same objective function value)
– More complex implementation: an additional 8-50x speed improvement!

K-Means Algorithm

Almost 60 years on, still an incredibly popular and useful approach. It has gotten better with age:
– Better initialization approaches that are fast and accurate
– Parallel implementations to handle large datasets
– New implementations that handle points in many dimensions and clustering into many clusters
– New approaches for online clustering

More work remains!
– Non-spherical clusters
– Other metric spaces
– Dealing with outliers

Thank You.

Arthur, D., Vassilvitskii, S. k-means++: The advantages of careful seeding. SODA 2007.

Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S. Scalable k-means++. VLDB 2012.

Broder, A., Garcia, L., Josifovski, V., Vassilvitskii, S., Venkatesan, S. Scalable k-means by ranked retrieval. WSDM 2014.