
K-Clustering Coresets

Sagi Hed

9th Dec 2008


• “Bi-Criteria Linear-Time Approximations for Generalized k-Mean/Median/Center” (Danny Feldman, Amos Fiat, Micha Sharir, Danny Segev)

• “Smaller Coresets for k-Median and k-Means Clustering” (Sariel Har-Peled, Akash Kushal)

Coresets Approach

• Not exactly a sublinear-time algorithm! Instead of working on the input, work on an alternative input that is very small but indistinguishable from the original.

• Create the small coreset with a streaming algorithm. The coreset will then support arbitrarily many queries, each answered in sublinear time.


Streaming

What is a streaming algorithm?
• Data is continuously generated and the total volume is extremely large
• The algorithm examines each arriving item once, then discards it
• The internal state is updated quickly per new item (O(1) to polylog(n))
• Small space is used (polylog(n))

In other words, the algorithm can run whenever new data arrives, even if n is too large for memory.

Coresets

What is a coreset?
A subset of the input such that we get a good approximation to the original input by solving the problem directly on the coreset. Since the coreset is very small, typically of poly-logarithmic size, we can do this even with exhaustive search.

K-Clustering

P ≡ a set of n points in R^d
C ≡ a set of k points in R^d
dist(p, C) ≡ the distance from p to the nearest point in C

K-Median clustering: Cost_P(C) ≡ ∑_{p∈P} dist(p, C)
K-Means clustering: Cost_P(C) ≡ ∑_{p∈P} [dist(p, C)]²
K-Center clustering: Cost_P(C) ≡ max_{p∈P} { dist(p, C) }

K-Clustering

P ≡ a set of n points in R^d

• K-Center/Median/Mean clustering: find C, |C| = k, such that Cost_P(C) is minimal…
• K-Center/Median/Mean queries: given C, what is Cost_P(C)?

Exact k-clustering is NP-hard… (running time exponential in k)
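For concreteness, here is a minimal Python/NumPy sketch of the three cost functions (the function names are ours, for illustration only):

```python
import numpy as np

def dist_to_centers(P, C):
    # Distance from each point in P (n x d) to its nearest center in C (k x d).
    return np.linalg.norm(P[:, None, :] - C[None, :, :], axis=2).min(axis=1)

def cost_median(P, C):
    # k-median cost: sum of distances to nearest centers.
    return dist_to_centers(P, C).sum()

def cost_means(P, C):
    # k-means cost: sum of squared distances to nearest centers.
    return (dist_to_centers(P, C) ** 2).sum()

def cost_center(P, C):
    # k-center cost: maximum distance to a nearest center.
    return dist_to_centers(P, C).max()
```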

(k, ε)-Coreset

(k, ε)-coreset ≡ a set S of points in R^d such that for every C with |C| = k:

(1-ε)·Cost_P(C) ≤ Cost_S(C) ≤ (1+ε)·Cost_P(C)

That is, for the coreset S, Cost_S(C) ≈ Cost_P(C).

Cost_S(OPT(S)) ≤ Cost_S(OPT(P)) ≤ (1+ε)·Cost_P(OPT(P))
But OPT(S) ≠ OPT(P)…

Our Goal

A streaming algorithm for a (k, ε)-coreset with O(k·ε^{-d}·log(n)) points.

With this small coreset we can answer k-queries or find a k-clustering in sublinear time, without ever reading the original input again. We assume k and d are constant.

Answering queries on the coreset with exhaustive search:
• Answer k-queries in O(k²·d·ε^{-d}·log(n)) time
• Find a k-clustering in roughly O([k·ε^{-d}·log(n)]^{kd+1}) time (the number of Voronoi partitions is bounded)
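A minimal sketch of answering a k-query and of exhaustive k-clustering on a small coreset S (Python/NumPy; restricting candidate centers to points of the coreset itself is our simplification of the Voronoi-partition enumeration above, and costs an extra constant factor in the approximation):

```python
import itertools
import numpy as np

def cost_center(P, C):
    # k-center cost of centers C on point set P.
    return np.linalg.norm(P[:, None, :] - C[None, :, :], axis=2).min(axis=1).max()

def k_query(S, C):
    # A k-query: evaluate the cost on the coreset S instead of the full input.
    return cost_center(S, C)

def exhaustive_k_clustering(S, k):
    # Try every k-subset of coreset points as candidate centers; feasible
    # because |S| = O(k * eps^-d * log n) is small.
    best_C, best_cost = None, np.inf
    for idx in itertools.combinations(range(len(S)), k):
        C = S[list(idx)]
        c = k_query(S, C)
        if c < best_cost:
            best_C, best_cost = C, c
    return best_C, best_cost
```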

K-Clustering Coresets

Order of business:
1. Coreset streaming reduction
2. Bi-criteria k-clustering approximation
3. K-center coresets using bi-criteria k-clustering
4. K-means/median coresets using hand waving

Coreset Streaming Reduction

Input: a (k, ε)-coreset construction algorithm for k-clustering, creating coresets of size O(k·(1/ε)^d) in time O(d·n·k).

Output: a streaming (k, ε)-coreset construction algorithm for k-center clustering, creating coresets of size O(k·(1/ε)^d·log(n)), that spends O(polylog(k, 1/ε)) time per insertion.

Coreset Streaming

Lemma 1: if C1 is a (k, ε)-coreset for P1 and C2 is a (k, ε)-coreset for P2, then C1 ∪ C2 is a (k, ε)-coreset for P1 ∪ P2.

Lemma 2: if C1 is a (k, ε1)-coreset for P and C2 is a (k, ε2)-coreset for C1, then C2 is a (k, O(ε1+ε2))-coreset for P.

Coreset Streaming

• Run the coreset construction on existing coresets
• Maintain log(n) coresets at a time: Q1 = a (k, p1)-coreset, Q2 = a (k, p2)-coreset, Q3 = a (k, p3)-coreset, …, each of size O(k·(1/ε)^d)

p_j = ε / [c·(j+1)²]
∏_j (1+p_j) ≤ 1 + ε/2 for a large enough c
=> ∪_j Q_j is a (k, ε/2)-coreset

Coreset Streaming

Space complexity: the space used by the algorithm is ∑_{1 ≤ j ≤ log(n)} O([c·(j+1)²]^d / ε^d) = O(polylog(n)).

Coreset size: build R_j = a (k, ε/6)-coreset from every Q_j. Then ∪_j R_j is still a (k, ε)-coreset for all the points, and |∪_j R_j| = ∑_{1 ≤ j ≤ log(n)} O((6/ε)^d) = O((1/ε)^d·log(n)) => the actual coreset (the one fed to exhaustive search) is smaller.

Time complexity: requires an amortized analysis similar to that of a binary counter => O(polylog(k, 1/ε)) per insertion.
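A sketch of the merge-and-reduce bookkeeping behind this reduction (Python; `build_coreset` is a stand-in for the assumed offline (k, ε)-coreset algorithm and here just subsamples, so this shows only the binary-counter structure, not a real coreset guarantee; for simplicity a fixed ε is used instead of the per-level p_j = ε/[c·(j+1)²]):

```python
import numpy as np

rng = np.random.default_rng(0)

def build_coreset(points, k, eps):
    # Stand-in for the offline (k, eps)-coreset construction.
    # A real implementation would guarantee size O(k * (1/eps)^d).
    target = max(1, int(k / eps))
    if len(points) <= target:
        return points
    idx = rng.choice(len(points), size=target, replace=False)
    return points[idx]

class StreamingCoreset:
    def __init__(self, k, eps, block=64):
        self.k, self.eps, self.block = k, eps, block
        self.buffer = []   # raw points awaiting their first reduction
        self.levels = []   # levels[j]: one coreset or None, like a binary counter

    def insert(self, p):
        self.buffer.append(p)
        if len(self.buffer) == self.block:
            carry = build_coreset(np.array(self.buffer), self.k, self.eps)
            self.buffer = []
            j = 0
            # Increment the "counter": merge equal-level coresets (Lemma 1)
            # and reduce the union with a fresh coreset (Lemma 2), carrying up.
            while j < len(self.levels) and self.levels[j] is not None:
                carry = build_coreset(
                    np.vstack([self.levels[j], carry]), self.k, self.eps)
                self.levels[j] = None
                j += 1
            if j == len(self.levels):
                self.levels.append(None)
            self.levels[j] = carry

    def coreset(self):
        # Union of all maintained coresets plus any still-buffered raw points.
        parts = [c for c in self.levels if c is not None]
        if self.buffer:
            parts.append(np.array(self.buffer))
        return np.vstack(parts) if parts else np.empty((0, 0))
```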

Bi-Criteria K-Clustering

Clustering approximation: C is a β-approximation for clustering ≡ Cost_P(C) ≤ β·Cost_P(OPT).

Goal: with high probability, obtain a 2-approximation clustering C with |C| = O(k·log²(n)), in time O(d·n·k·log(n)) (the simplified analysis below yields a constant factor of 4). With meticulous calculations this can be improved to |C| = O(k·log(n)·loglog(n)) and time O(d·n·k).

Algorithm:
1. F = Ø, t = 1
2. Until we cover all points:
   1. F_t = sample O(k·log(n)) points at random
   2. F = F ∪ F_t
   3. P_t = the |P|/2 points from P that are closest to F_t
   4. Remove P_t from P
   5. t++
3. Output F

Bi-Criteria K-Clustering

Algorithm (restated):
1. While P is not empty (t++):
   1. F = F ∪ F_t (where F_t ≡ a sample of O(k·log(n)) points at random)
   2. Remove P_t = the |P|/2 points from P that are closest to F_t
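A minimal sketch of this sampling loop (Python/NumPy; the constant c in the O(k·log n) sample size is illustrative):

```python
import numpy as np

def bicriteria_clustering(P, k, c=4, seed=0):
    # Returns F with O(k * log^2 n) centers; w.h.p. a constant-factor
    # approximation to the optimal k-clustering cost.
    rng = np.random.default_rng(seed)
    n, d = P.shape
    m = max(1, int(c * k * np.log2(n + 1)))   # O(k log n) samples per round
    F = np.empty((0, d))
    while len(P) > 0:
        idx = rng.choice(len(P), size=min(m, len(P)), replace=False)
        Ft = P[idx]
        F = np.vstack([F, Ft])
        # Discard the half of the remaining points closest to Ft.
        dist = np.linalg.norm(P[:, None, :] - Ft[None, :, :], axis=2).min(axis=1)
        half = len(P) // 2
        order = np.argsort(dist)
        P = P[order[half:]] if half > 0 else P[:0]   # last point: discard all
    return F
```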

Bi-Criteria K-Clustering

Analysis
Running time: O(n·d·k·log(n))
|F| = O(k·log²(n))

Bi-Criteria K-Clustering

Correctness
F* ≡ the optimal centers for clustering, |F*| = k
p is “bad” for F_t ≡ dist(p, F_t) > 2·dist(p, F*)
p is “good” for F_t ≡ dist(p, F_t) ≤ 2·dist(p, F*)

Bi-Criteria K-Clustering

Lemma 1: with high probability, the number of bad points for F_t is ≤ |P_t| / 8.

Lemma 2: with high probability, for every bad point for F_t which is discarded with P_t, we can match a distinct good point for F_t which is outside P_t. In fact, this point is discarded with P_{t+1}.

Bi-Criteria K-Clustering

Proof of Lemma 2
#{points in P_{t+1} good for F_t} ≥ |P_{t+1}| - #{points bad for F_t}.
Using Lemma 1, this is ≥ |P_t|/2 - |P_t|/8 ≥ (3/8)·|P_t| ≥ #{points bad for F_t},
so there are enough good points outside P_t to match each bad point.
Since each matched point is discarded with P_{t+1}, it is distinct.

Proof? (Hint: use the previous lemma..)

Correctness
For any bad point b for some F_t, let g be its matching distinct good point. Then:
dist(b, F) ≤ dist(b, F_t) ≤ dist(g, F_t) ≤ 2·dist(g, F*)

K-Median Clustering
Cost_P(F) = ∑_{p∈P} dist(p, F)
= ∑_g dist(g, F) + ∑_b dist(b, F)
≤ ∑_g 2·dist(g, F*) + ∑_g 2·dist(g, F*)
= 4·∑_g dist(g, F*) ≤ 4·Cost_P(F*) □
(Each good point is charged at most twice: once for itself and once as the match of a bad point.)

K-Means/Center Clustering: the same proof…

Bi-Criteria K-Clustering

Proof of Lemma 1

During iteration t:
f*_i ≡ the i-th center in F*
Q_i ≡ the |P_t|/(8k) points closest to f*_i

The probability that a single sampled point is not in Q_i is (1 - 1/(8k))
=> the probability that none of the c·8k·log(n) sampled points is in Q_i is (1 - 1/(8k))^{8k·log(n)·c} ≤ e^{-c·log(n)} = 1/n^c (using 1-x ≤ e^{-x}).

The probability that at least one Q_i has no sampled point in at least one of the log(n) iterations is, by a union bound, ≤ k·log(n)/n^c ≤ 1/n^{c-2}
=> the probability that all Q_i have sampled points in all iterations is ≥ 1 - 1/n^{c-2}.

For a large enough n, this expression is arbitrarily close to 1.

Bi-Criteria K-Clustering

Proof of Lemma 1 (continued)
q ≡ a point outside the |P_t|/(8k) closest points of every f*_i
f* ≡ q's closest center in F*
p ≡ a sampled point near f* (within its |P_t|/(8k) closest points)

dist(q, F_t) ≤ dist(q, p) ≤ dist(q, f*) + dist(f*, p) ≤ 2·dist(q, f*) = 2·dist(q, F*)
=> q is a “good” point!
=> there are at most |P_t|/8 “bad” points!

Careful

The success probability can be amplified further by making several attempts and choosing the F with the lowest Cost_P(F)..

One needs to be careful with the probabilities in the streaming reduction, though..

Bi-Criteria K-Clustering

This generalizes to j-flats in d dimensions instead of points (it is an open problem to construct coresets for these).

K-Center Clustering Coreset

Goal
Obtain a (k, ε) center-clustering coreset S:
(1-ε)·Cost_P(C) ≤ Cost_S(C) ≤ (1+ε)·Cost_P(C)
where Cost_A(B) ≡ max_{p∈A} { dist(p, B) }
|S| = O(k·(√d/ε)^d)

One can show it is impossible to avoid a size exponential in d.. Just for k-clustering it is possible to get a coreset of size O(k·ε^{-2}).

K-Center Clustering Coreset

Coreset construction
Use the bi-criteria (k·log(n))-clustering on P to create C:
|C| = O(k·log²(n))
Cost_P(C) ≤ 8·Cost_P(OPT)

K-Center Clustering Coreset

Coreset construction
Build a 2·Cost_P(C) × 2·Cost_P(C) × … d-dimensional grid around each point in C. Divide it into “squares” of side ε·Cost_P(C)/(8√d) × ε·Cost_P(C)/(8√d) × …

K-Center Clustering Coreset

Coreset construction
Cost_P(C) = max_{p∈P} { dist(p, C) } => all the points are contained in the grids.
The running time is O(n).

K-Center Clustering Coreset

Coreset construction
Coreset = pick one representative from every square.
|Coreset| = O[(√d/ε)^d·k·log²(n)]
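A sketch of the grid-snapping step (Python/NumPy; C is assumed to come from the bi-criteria approximation above; as a simplification we use one global grid with the slide's cell size rather than a separate grid per center, which hits the same cells since every point lies within Cost_P(C) of C):

```python
import numpy as np

def kcenter_grid_coreset(P, C, eps):
    # One representative per grid cell of side eps * Cost_P(C) / (8 * sqrt(d)).
    n, d = P.shape
    r = np.linalg.norm(P[:, None, :] - C[None, :, :], axis=2).min(axis=1).max()
    if r == 0:                     # all points already coincide with centers
        return np.unique(P, axis=0)
    side = eps * r / (8.0 * np.sqrt(d))
    reps = {}
    for p in P:
        cell = tuple(np.floor(p / side).astype(int))
        if cell not in reps:       # keep the first point seen in each cell
            reps[cell] = p
    return np.array(list(reps.values()))
```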

K-Center Clustering Coreset

Coreset correctness
S ≡ the coreset, p ≡ the point maximizing dist(p, A), q ≡ p's representative.

Cost_P(A) = dist(p, A) ≤ dist(q, A) + (√d)·(ε/(8√d))·Cost_P(C)
≤ Cost_S(A) + ε·Cost_P(OPT) ≤ Cost_S(A) + ε·Cost_P(A)

=> (1-ε)·Cost_P(A) ≤ Cost_S(A)

K-Center Clustering Coreset

Coreset correctness (other direction)
Now let q be the coreset point maximizing dist(q, A), and p a point it represents:

Cost_P(A) ≥ dist(p, A) ≥ dist(q, A) - (√d)·(ε/(8√d))·Cost_P(C)
≥ Cost_S(A) - ε·Cost_P(OPT) ≥ Cost_S(A) - ε·Cost_P(A)

=> (1+ε)·Cost_P(A) ≥ Cost_S(A)

K-Center Clustering Coreset

|Coreset| = O[(√d/ε)^d·k·log²(n)]

How do we get rid of the log²(n)? It was inherited from the number of cluster centers in the bi-criteria clustering approximation… Find a (1+ε)-approximate k-center clustering using the current coreset, then feed that clustering (this time with exactly k centers) back into the coreset construction…

K-Means/Median Clustering Coreset

Can also create a (k, ε) median/means-clustering coreset S:
(1-ε)·Cost_P(C) ≤ Cost_S(C) ≤ (1+ε)·Cost_P(C)
where Cost_A(B) ≡ ∑_{p∈A} dist(p, B) or Cost_A(B) ≡ ∑_{p∈A} [dist(p, B)]²
|S| = O(k²·ε^{-d})

This was later improved to |S| = O(poly(d, k, 1/ε))!! Again, just for k-clustering it is possible to get a coreset of size independent of n or d.

K-Means/Median Clustering Coreset

Construction
In k-means/median the coresets have to be weighted… Similarly to k-center, start with a bi-criteria clustering and create squares around each center, with the squares growing by factors of 2. Sample points from every square, but give each sample a weight according to the number of points in that square.
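Weighted coresets change only how costs are evaluated: each sampled point carries the number of input points it represents. A minimal sketch of the weighted k-median/means cost (Python/NumPy; names are ours):

```python
import numpy as np

def weighted_cost(S, w, C, squared=False):
    # S: m x d coreset points, w: length-m weights (points represented),
    # C: k x d candidate centers. squared=True gives the k-means variant.
    dist = np.linalg.norm(S[:, None, :] - C[None, :, :], axis=2).min(axis=1)
    if squared:
        dist = dist ** 2
    return float(np.dot(w, dist))   # each sample stands in for w[i] points
```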

Questions ??

Coreset Exercises

1. Consider the case of points in R². Suppose you have an O(log(n))-size coreset. Figure out how to find its optimal clustering using exhaustive search in O(log^{4k}(n)) time.

2. Consider bi-criteria k-clustering for the case of lines in R², i.e., we want to find a set C of O(k^c·log^c(n)) lines such that Cost_P(C) ≤ c'·Cost_P(OPT), where distances are now measured from points to their nearest line.

Hint: the algorithm stays almost the same. When sampling points, think how you should transform them into lines…