88
Introduction Insights from BIRCH Coreset Theory BICO Experiments End BICO: BIRCH meets Coresets for k -means Hendrik Fichtenberger, Marc Gill´ e, Melanie Schmidt, Chris Schwiegelshohn, Christian Sohler BICO: BIRCH meets Coresets for k -means

BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO: BIRCH meets Coresets for k -means

Hendrik Fichtenberger, Marc Gille, Melanie Schmidt, ChrisSchwiegelshohn, Christian Sohler

BICO: BIRCH meets Coresets for k -means

Page 2: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Clustering Algorithms: Practice and Theory

The k -means ProblemGiven a point set P ⊆ Rd ,compute a set C ⊆ Rd

with |C| = k centerswhich minimizescost(P,C)

=∑p∈P

minc∈C||c − p||2,

the sum of the squareddistances.

BICO: BIRCH meets Coresets for k -means

Page 3: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Clustering Algorithms: Practice and Theory

Popular k -means algorithms. . .Lloyd’s algorithm (1982)k -means++ (2007)several approximation algorithms (recent)

. . . for Big DataBIRCH (1996)MacQueen’s k -means (1967)several approximations using coresets (recent)

Implementability, Speed and good quality?

Stream-KM++ (2010)BICO

BICO: BIRCH meets Coresets for k -means

Page 4: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Clustering Algorithms: Practice and Theory

Popular k -means algorithms. . .Lloyd’s algorithm (1982)k -means++ (2007)several approximation algorithms (recent)

. . . for Big DataBIRCH (1996)MacQueen’s k -means (1967)several approximations using coresets (recent)

Implementability, Speed and good quality?

Stream-KM++ (2010)BICO

BICO: BIRCH meets Coresets for k -means

Page 5: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Clustering Algorithms: Practice and Theory

Popular k -means algorithms. . .Lloyd’s algorithm (1982)k -means++ (2007)several approximation algorithms (recent)

. . . for Big DataBIRCH (1996)MacQueen’s k -means (1967)several approximations using coresets (recent)

Implementability, Speed and good quality?

Stream-KM++ (2010)BICO

BICO: BIRCH meets Coresets for k -means

Page 6: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Clustering Algorithms: Practice and Theory

Popular k -means algorithms. . .Lloyd’s algorithm (1982) moderate speedk -means++ (2007) moderate speed & qualityseveral approximation algorithms (recent) quality

. . . for Big DataBIRCH (1996) fastMacQueen’s k -means (1967) fastseveral approximations using coresets (recent) quality

Implementability, Speed and good quality?

Stream-KM++ (2010) next slidesBICO next slides

BICO: BIRCH meets Coresets for k -means

Page 7: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Clustering Algorithms: Practice and Theory

Popular k -means algorithms. . .Lloyd’s algorithm (1982) moderate speedk -means++ (2007) moderate speed & qualityseveral approximation algorithms (recent) quality

. . . for Big DataBIRCH (1996) fastMacQueen’s k -means (1967) fastseveral approximations using coresets (recent) quality

Implementability, Speed and good quality?Stream-KM++ (2010) next slides

BICO next slides

BICO: BIRCH meets Coresets for k -means

Page 8: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Clustering Algorithms: Practice and Theory

Popular k -means algorithms. . .Lloyd’s algorithm (1982) moderate speedk -means++ (2007) moderate speed & qualityseveral approximation algorithms (recent) quality

. . . for Big DataBIRCH (1996) fastMacQueen’s k -means (1967) fastseveral approximations using coresets (recent) quality

Implementability, Speed and good quality?Stream-KM++ (2010) next slidesBICO next slides

BICO: BIRCH meets Coresets for k -means

Page 9: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Clustering Algorithms: Practice and Theory

Costs

3e+012

3.5e+012

4e+012

4.5e+012

5e+012

5.5e+012

6e+012

6.5e+012

7e+012

15 20 25 30

Cos

ts

Number of Centers

BigCross

StreamKM++BICO

MacQueenBIRCH

BICO: BIRCH meets Coresets for k -means

Page 10: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Clustering Algorithms: Practice and Theory

Running Time

0

200

400

600

800

1000

1200

10 20 30 40 50

Run

ning

Tim

e

Number of Centers

Census

BICO: BIRCH meets Coresets for k -means

Page 11: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

How BIRCH computes a summary of the data

Ideastart with BIRCH for the basic design because it is very fastanalyze its flawsdevelop an improved algorithm based ontheoretical observations

Now: Description of BIRCH

WarningBIRCH has several phaseswe are only interested in the main phase(and a little in the rebuilding phase)

BICO: BIRCH meets Coresets for k -means

Page 12: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

How BIRCH computes a summary of the data

Ideastart with BIRCH for the basic design because it is very fastanalyze its flawsdevelop an improved algorithm based ontheoretical observations

Now: Description of BIRCH

WarningBIRCH has several phaseswe are only interested in the main phase(and a little in the rebuilding phase)

BICO: BIRCH meets Coresets for k -means

Page 13: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

How BIRCH computes a summary of the data

Ideastart with BIRCH for the basic design because it is very fastanalyze its flawsdevelop an improved algorithm based ontheoretical observations

Now: Description of BIRCH

WarningBIRCH has several phaseswe are only interested in the main phase(and a little in the rebuilding phase)

BICO: BIRCH meets Coresets for k -means

Page 14: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

How BIRCH computes a summary of the data

BIRCH

stores points in a treeeach node represents a subset of the input point setsubset is summarized by the number of points, the centroidof the set and the squared distances to the centroidall points in the same subset get the same center

BICO: BIRCH meets Coresets for k -means

Page 15: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

How BIRCH computes a summary of the data

BIRCHstores points in a tree

each node represents a subset of the input point setsubset is summarized by the number of points, the centroidof the set and the squared distances to the centroidall points in the same subset get the same center

BICO: BIRCH meets Coresets for k -means

Page 16: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

How BIRCH computes a summary of the data

BIRCHstores points in a treeeach node represents a subset of the input point set

subset is summarized by the number of points, the centroidof the set and the squared distances to the centroidall points in the same subset get the same center

BICO: BIRCH meets Coresets for k -means

Page 17: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

How BIRCH computes a summary of the data

BIRCHstores points in a treeeach node represents a subset of the input point setsubset is summarized by the number of points, the centroidof the set and the squared distances to the centroid

all points in the same subset get the same center

BICO: BIRCH meets Coresets for k -means

Page 18: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

How BIRCH computes a summary of the data

BIRCHstores points in a treeeach node represents a subset of the input point setsubset is summarized by the number of points, the centroidof the set and the squared distances to the centroidall points in the same subset get the same center

BICO: BIRCH meets Coresets for k -means

Page 19: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

How BIRCH computes a summary of the data

BIRCHnodes in the tree represent subsets of pointspoints at the same node get the same center

Insertion of a new pointWhen a new point p is added to the tree

BIRCH searches for the ‘closest’ node according to∑q∈(S∪{p})

(q − µ(S ∪ {p}))2 −∑

q∈S(q − µ(S))2

p is added to the node representing subset S∗ if∑q∈(S∗∪{p})

(q − µS)2/(|S∗|+ 1) ≤ T 2 for a given threshold T

BICO: BIRCH meets Coresets for k -means

Page 20: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

How BIRCH computes a summary of the data

BIRCHnodes in the tree represent subsets of pointspoints at the same node get the same center

Insertion of a new pointWhen a new point p is added to the tree

BIRCH searches for the ‘closest’ node according to∑q∈(S∪{p})

(q − µ(S ∪ {p}))2 −∑

q∈S(q − µ(S))2

p is added to the node representing subset S∗ if∑q∈(S∗∪{p})

(q − µS)2/(|S∗|+ 1) ≤ T 2 for a given threshold T

BICO: BIRCH meets Coresets for k -means

Page 21: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

How BIRCH computes a summary of the data

BIRCHnodes in the tree represent subsets of pointspoints at the same node get the same center

Insertion of a new pointWhen a new point p is added to the tree

BIRCH searches for the ‘closest’ node according to∑q∈(S∪{p})

(q − µ(S ∪ {p}))2 −∑

q∈S(q − µ(S))2

p is added to the node representing subset S∗ if∑q∈(S∗∪{p})

(q − µS)2/(|S∗|+ 1) ≤ T 2 for a given threshold T

BICO: BIRCH meets Coresets for k -means

Page 22: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

How BIRCH computes a summary of the data

BIRCHnodes in the tree represent subsets of pointspoints at the same node get the same center

Insertion of a new pointWhen a new point p is added to the tree

BIRCH searches for the ‘closest’ node according to∑q∈(S∪{p})

(q − µ(S ∪ {p}))2 −∑

q∈S(q − µ(S))2

p is added to the node representing subset S∗ if∑q∈(S∗∪{p})

(q − µS)2/(|S∗|+ 1) ≤ T 2 for a given threshold T

BICO: BIRCH meets Coresets for k -means

Page 23: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

How BIRCH computes a summary of the data

150 points drawn uniformly around (−0.5,0) and (0,0.5)75 points drawn uniformly from [−4,−2]× [4,2] as noiseCenters and partitions computed by BIRCH and BICO

BICO: BIRCH meets Coresets for k -means

Page 24: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

How BIRCH computes a summary of the data

150 points drawn uniformly around (−0.5,0) and (0,0.5)75 points drawn uniformly from [−4,−2]× [4,2] as noiseCenters and partitions computed by BIRCH and BICO

BICO: BIRCH meets Coresets for k -means

Page 25: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

How BIRCH computes a summary of the data

150 points drawn uniformly around (−0.5,0) and (0,0.5)75 points drawn uniformly from [−4,−2]× [4,2] as noiseCenters and partitions computed by BIRCH and BICO

BICO: BIRCH meets Coresets for k -means

Page 26: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

How BIRCH computes a summary of the data

Insights from BIRCHFast point by point updatesTree structure

Insertion decision should be improved

BICO: BIRCH meets Coresets for k -means

Page 27: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

How BIRCH computes a summary of the data

Insights from BIRCHFast point by point updatesTree structure

Insertion decision should be improved

BICO: BIRCH meets Coresets for k -means

Page 28: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Coresets

Coresetssmall summary of given datatypically of constant or polylogarithmic sizecan be used to approximate the cost of the original data

BICO: BIRCH meets Coresets for k -means

Page 29: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Coresets

CoresetsGiven a set of points P, a weighted subset S ⊂ P is a(k , ε)-coreset if for all sets C ⊂ C of k centers it holds

|costw (S,C)− cost(P,C)| ≤ ε cost(P,C)

where costw (S,C) =∑

p∈S minc∈C w(p)(.p, c).

3 2

2

4

BICO: BIRCH meets Coresets for k -means

Page 30: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Coresets

CoresetsGiven a set of points P, a weighted subset S ⊂ P is a(k , ε)-coreset if for all sets C ⊂ C of k centers it holds

|costw (S,C)− cost(P,C)| ≤ ε cost(P,C)

where costw (S,C) =∑

p∈S minc∈C w(p)(.p, c).

3 2

2

4

BICO: BIRCH meets Coresets for k -means

Page 31: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Coresets

CoresetsGiven a set of points P, a weighted subset S ⊂ P is a(k , ε)-coreset if for all sets C ⊂ C of k centers it holds

|costw (S,C)− cost(P,C)| ≤ ε cost(P,C)

where costw (S,C) =∑

p∈S minc∈C w(p)(.p, c).

3 2

2

4

BICO: BIRCH meets Coresets for k -means

Page 32: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Coresets

CoresetsGiven a set of points P, a weighted subset S ⊂ P is a(k , ε)-coreset if for all sets C ⊂ C of k centers it holds

|costw (S,C)− cost(P,C)| ≤ ε cost(P,C)

where costw (S,C) =∑

p∈S minc∈C w(p)(.p, c).

3 2

2

4

BICO: BIRCH meets Coresets for k -means

Page 33: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Coresets

CoresetsGiven a set of points P, a weighted subset S ⊂ P is a(k , ε)-coreset if for all sets C ⊂ C of k centers it holds

|costw (S,C)− cost(P,C)| ≤ ε cost(P,C)

where costw (S,C) =∑

p∈S minc∈C w(p)(.p, c).

3 2

2

4

BICO: BIRCH meets Coresets for k -means

Page 34: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Coresets

CoresetsGiven a set of points P, a weighted subset S ⊂ P is a(k , ε)-coreset if for all sets C ⊂ C of k centers it holds

|costw (S,C)− cost(P,C)| ≤ ε cost(P,C)

where costw (S,C) =∑

p∈S minc∈C w(p)(.p, c).

3 2

2

4

BICO: BIRCH meets Coresets for k -means

Page 35: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Coresets

Merge & Reduceread data in blockscompute a coresetfor each block→ smerge coresets in atree fashion space s · log n

Runtime: No asymptotic increase, but overhead in practice

BIRCH uses point-wise updates :-)

BICO: BIRCH meets Coresets for k -means

Page 36: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Coresets

Merge & Reduceread data in blockscompute a coresetfor each block→ smerge coresets in atree fashion space s · log n

Runtime: No asymptotic increase, but overhead in practice

BIRCH uses point-wise updates :-)

BICO: BIRCH meets Coresets for k -means

Page 37: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Coresets

Insights from Coreset ThreoryLimit the induced error

Goal: Point set P ′ in each node should induce at mostε · cost(P ′,C) error (for an optimal solution C)

Base insertion decision on induced errorReplacing all points in a node by the (weighted) centroid islike moving all points to the centroidInduced error is connected to the 1-means cost of the set

Side noteAvoiding Merge & Reduce is a good idea

BICO: BIRCH meets Coresets for k -means

Page 38: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

Coresets

Insights from Coreset ThreoryLimit the induced error

Goal: Point set P ′ in each node should induce at mostε · cost(P ′,C) error (for an optimal solution C)

Base insertion decision on induced errorReplacing all points in a node by the (weighted) centroid islike moving all points to the centroidInduced error is connected to the 1-means cost of the set

Side noteAvoiding Merge & Reduce is a good idea

BICO: BIRCH meets Coresets for k -means

Page 39: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

BIRCHnodes in the tree represent subsets of pointspoints at the same node get the same centerimprove insertion decision

AdjustmentsNodes additionally have a reference point and a range

Closest is now determined by Euclidean distanceWe say a node is full with regard to a point p if adding p tothe node increases its 1-means cost above a threshold T

BICO: BIRCH meets Coresets for k -means

Page 40: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

BIRCHnodes in the tree represent subsets of pointspoints at the same node get the same centerimprove insertion decision

AdjustmentsNodes additionally have a reference point and a range

Closest is now determined by Euclidean distanceWe say a node is full with regard to a point p if adding p tothe node increases its 1-means cost above a threshold T

BICO: BIRCH meets Coresets for k -means

Page 41: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

BIRCHnodes in the tree represent subsets of pointspoints at the same node get the same centerimprove insertion decision

AdjustmentsNodes additionally have a reference point and a rangeClosest is now determined by Euclidean distance

We say a node is full with regard to a point p if adding p tothe node increases its 1-means cost above a threshold T

BICO: BIRCH meets Coresets for k -means

Page 42: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

BIRCHnodes in the tree represent subsets of pointspoints at the same node get the same centerimprove insertion decision

AdjustmentsNodes additionally have a reference point and a rangeClosest is now determined by Euclidean distanceWe say a node is full with regard to a point p if adding p tothe node increases its 1-means cost above a threshold T

BICO: BIRCH meets Coresets for k -means

Page 43: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

1. Find closest reference point2. If node is not in range3. Then create a new node4. Else add to node if possible5. If not, go one level down,6. And Find closest child, goto 2.

Threshold TRadius Ri

BICO: BIRCH meets Coresets for k -means

Page 44: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

1. Find closest reference point2. If node is not in range3. Then create a new node4. Else add to node if possible5. If not, go one level down,6. And Find closest child, goto 2.

Threshold TRadius Ri

BICO: BIRCH meets Coresets for k -means

Page 45: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

1. Find closest reference point2. If node is not in range3. Then create a new node4. Else add to node if possible5. If not, go one level down,6. And Find closest child, goto 2.

Threshold TRadius Ri

BICO: BIRCH meets Coresets for k -means

Page 46: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

1. Find closest reference point2. If node is not in range3. Then create a new node4. Else add to node if possible5. If not, go one level down,6. And Find closest child, goto 2.

Threshold TRadius Ri

BICO: BIRCH meets Coresets for k -means

Page 47: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

1. Find closest reference point2. If node is not in range3. Then create a new node4. Else add to node if possible5. If not, go one level down,6. And Find closest child, goto 2.

Threshold TRadius Ri

BICO: BIRCH meets Coresets for k -means

Page 48: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

1. Find closest reference point2. If node is not in range3. Then create a new node4. Else add to node if possible5. If not, go one level down,6. And Find closest child, goto 2.

Threshold TRadius Ri

BICO: BIRCH meets Coresets for k -means

Page 49: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

Theorem

For T ≈ OPT/(k · log n · 8d · εd+2) and Ri :=√

T/(8 · 2i),the set of centroids weighted by the number of points in thesubset is a (1 + ε)-coresetfor constant d , the number of nodes is O(k · log n · ε−(d+2))

ProblemWe do not know OPT and thus cannot compute T !

BICO: BIRCH meets Coresets for k -means

Page 50: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

Theorem

For T ≈ OPT/(k · log n · 8d · εd+2) and Ri :=√

T/(8 · 2i),the set of centroids weighted by the number of points in thesubset is a (1 + ε)-coresetfor constant d , the number of nodes is O(k · log n · ε−(d+2))

ProblemWe do not know OPT and thus cannot compute T !

BICO: BIRCH meets Coresets for k -means

Page 51: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

Rebuilding algorithmDouble T when maximum number of nodes is reached‘Rebuild’ the tree according to new T

Let T ′ and R′i be before and T and Ri be after the doublingMove all nodes one level down and create empty first levelNotice that Ri =

√T/(8 · 2i) =

√T ′/8 · 2i−1 = R′i−1

⇒ Radius doesn’t change!

BICO: BIRCH meets Coresets for k -means

Page 52: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

Rebuilding algorithmDouble T when maximum number of nodes is reached‘Rebuild’ the tree according to new T

Let T ′ and R′i be before and T and Ri be after the doublingMove all nodes one level down and create empty first levelNotice that Ri =

√T/(8 · 2i) =

√T ′/8 · 2i−1 = R′i−1

⇒ Radius doesn’t change!

BICO: BIRCH meets Coresets for k -means

Page 53: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

For every node in the tree1. Find ref. point one level higher2. If none in range, move node

and children one level up3. If it is in range, check T4. If sum is smaller than T ,

merge nodes6. Else make it a child

BICO: BIRCH meets Coresets for k -means

Page 54: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

For every node in the tree1. Find ref. point one level higher2. If none in range, move node

and children one level up3. If it is in range, check T4. If sum is smaller than T ,

merge nodes6. Else make it a child

BICO: BIRCH meets Coresets for k -means

Page 55: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

For every node in the tree1. Find ref. point one level higher2. If none in range, move node

and children one level up3. If it is in range, check T4. If sum is smaller than T ,

merge nodes6. Else make it a child

BICO: BIRCH meets Coresets for k -means

Page 56: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

For every node in the tree1. Find ref. point one level higher2. If none in range, move node

and children one level up3. If it is in range, check T4. If sum is smaller than T ,

merge nodes6. Else make it a child

BICO: BIRCH meets Coresets for k -means

Page 57: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

For every node in the tree1. Find ref. point one level higher2. If none in range, move node

and children one level up3. If it is in range, check T4. If sum is smaller than T ,

merge nodes6. Else make it a child

BICO: BIRCH meets Coresets for k -means

Page 58: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

For every node in the tree1. Find ref. point one level higher2. If none in range, move node

and children one level up3. If it is in range, check T4. If sum is smaller than T ,

merge nodes6. Else make it a child

BICO: BIRCH meets Coresets for k -means

Page 59: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

For every node in the tree1. Find ref. point one level higher2. If none in range, move node

and children one level up3. If it is in range, check T4. If sum is smaller than T ,

merge nodes6. Else make it a child

BICO: BIRCH meets Coresets for k -means

Page 60: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

For every node in the tree1. Find ref. point one level higher2. If none in range, move node

and children one level up3. If it is in range, check T4. If sum is smaller than T ,

merge nodes6. Else make it a child

BICO: BIRCH meets Coresets for k -means

Page 61: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

For every node in the tree1. Find ref. point one level higher2. If none in range, move node

and children one level up3. If it is in range, check T4. If sum is smaller than T ,

merge nodes6. Else make it a child

BICO: BIRCH meets Coresets for k -means

Page 62: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

For every node in the tree1. Find ref. point one level higher2. If none in range, move node

and children one level up3. If it is in range, check T4. If sum is smaller than T ,

merge nodes6. Else make it a child

BICO: BIRCH meets Coresets for k -means

Page 63: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

For every node in the tree1. Find ref. point one level higher2. If none in range, move node

and children one level up3. If it is in range, check T4. If sum is smaller than T ,

merge nodes6. Else make it a child

BICO: BIRCH meets Coresets for k -means

Page 64: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

Merging might result in violations of the range of nodes

Increases the coreset size by a constant factorBut does not destroy the coreset property

⇒ BICO computes a coreset in the data stream setting ¨

. . . if we compute lower bound on T

BICO: BIRCH meets Coresets for k -means

Page 65: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

Merging might result in violations of the range of nodes

Increases the coreset size by a constant factorBut does not destroy the coreset property

⇒ BICO computes a coreset in the data stream setting ¨

. . . if we compute lower bound on T

BICO: BIRCH meets Coresets for k -means

Page 66: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

Merging might result in violations of the range of nodesIncreases the coreset size by a constant factor

But does not destroy the coreset property

⇒ BICO computes a coreset in the data stream setting ¨

. . . if we compute lower bound on T

BICO: BIRCH meets Coresets for k -means

Page 67: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

Merging might result in violations of the range of nodesIncreases the coreset size by a constant factorBut does not destroy the coreset property

⇒ BICO computes a coreset in the data stream setting ¨

. . . if we compute lower bound on T

BICO: BIRCH meets Coresets for k -means

Page 68: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

Merging might result in violations of the range of nodesIncreases the coreset size by a constant factorBut does not destroy the coreset property

⇒ BICO computes a coreset in the data stream setting ¨

. . . if we compute lower bound on T

BICO: BIRCH meets Coresets for k -means

Page 69: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH meets Coresets

Merging might result in violations of the range of nodesIncreases the coreset size by a constant factorBut does not destroy the coreset property

⇒ BICO computes a coreset in the data stream setting ¨

. . . if we compute lower bound on T

BICO: BIRCH meets Coresets for k -means

Page 70: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

The actual solution is computed with k -means++.

Adjustments

Add heuristic speed-up to find closest reference point

Speed-upProject all ref. points to d random 1-dim. subspacesProject new point p to the same subspacesCount how many ref. points are in range of p in everysubspaceSearch nearest neighbor in the shortest list

BICO: BIRCH meets Coresets for k -means

Page 71: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

The actual solution is computed with k -means++.

AdjustmentsSet coreset size to 200k

Add heuristic speed-up to find closest reference point

Speed-upProject all ref. points to d random 1-dim. subspacesProject new point p to the same subspacesCount how many ref. points are in range of p in everysubspaceSearch nearest neighbor in the shortest list

BICO: BIRCH meets Coresets for k -means

Page 72: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

The actual solution is computed with k -means++.

AdjustmentsSet coreset size to 200kAdd heuristic speed-up to find closest reference point

Speed-upProject all ref. points to d random 1-dim. subspacesProject new point p to the same subspacesCount how many ref. points are in range of p in everysubspaceSearch nearest neighbor in the shortest list

BICO: BIRCH meets Coresets for k -means

Page 73: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

The actual solution is computed with k -means++.

AdjustmentsSet coreset size to 200kAdd heuristic speed-up to find closest reference point

Speed-upProject all ref. points to d random 1-dim. subspacesProject new point p to the same subspacesCount how many ref. points are in range of p in everysubspaceSearch nearest neighbor in the shortest list

BICO: BIRCH meets Coresets for k -means

Page 74: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

The actual solution is computed with k -means++.

AdjustmentsSet coreset size to 200kAdd heuristic speed-up to find closest reference point

Speed-upProject all ref. points to d random 1-dim. subspacesProject new point p to the same subspacesCount how many ref. points are in range of p in everysubspaceSearch nearest neighbor in the shortest list

BICO: BIRCH meets Coresets for k -means

Page 75: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

Data SetsData Sets used in StreamKM++ paper from UCI repository:Tower, CoverType, Census and BigCross (cross product)CalTech128 by Rene Grzeszick, group of Prof. Finkconsists of 128 SIFT descriptors of an object database

Data Set SizesBigCross CalTech128 Census CoverType Tower

n 11620300 3168383 2458285 581012 4915200d 57 128 68 55 3

n · d 662357100 405553024 167163380 31955660 14745600

BICO: BIRCH meets Coresets for k -means

Page 76: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

Data SetsData Sets used in StreamKM++ paper from UCI repository:Tower, CoverType, Census and BigCross (cross product)CalTech128 by Rene Grzeszick, group of Prof. Finkconsists of 128 SIFT descriptors of an object database

Data Set SizesBigCross CalTech128 Census CoverType Tower

n 11620300 3168383 2458285 581012 4915200d 57 128 68 55 3

n · d 662357100 405553024 167163380 31955660 14745600

BICO: BIRCH meets Coresets for k -means

Page 77: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

ImplementationsAuthor’s implementations for StreamKM++ and BIRCHimplementation for MacQueen’s k -means fromESMERALDA (framework by group of Prof. Fink)

ExperimentsExperiments done on mud1-6 and mud8100 runs for every test instancevalues shown in the diagrams are mean values

BICO: BIRCH meets Coresets for k -means

Page 78: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

ImplementationsAuthor’s implementations for StreamKM++ and BIRCHimplementation for MacQueen’s k -means fromESMERALDA (framework by group of Prof. Fink)

ExperimentsExperiments done on mud1-6 and mud8100 runs for every test instancevalues shown in the diagrams are mean values

BICO: BIRCH meets Coresets for k -means

Page 79: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

Costs

3e+012

3.5e+012

4e+012

4.5e+012

5e+012

5.5e+012

6e+012

6.5e+012

7e+012

15 20 25 30

Cos

ts

Number of Centers

BigCross

StreamKM++BICO

MacQueenBIRCH

BICO: BIRCH meets Coresets for k -means

Page 80: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

Costs

1e+011

1.5e+011

2e+011

2.5e+011

3e+011

3.5e+011

4e+011

4.5e+011

5e+011

5.5e+011

6e+011

10 20 30 40 50

Cos

ts

Number of Centers

Covertype

BICO: BIRCH meets Coresets for k -means

Page 81: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

Costs

1e+012

1.02e+012

1.04e+012

1.06e+012

1.08e+012

1.1e+012

1.12e+012

1.14e+012

1.16e+012

250

Costs

Number of Centers

BigCross

StreamKM++BICO

MacQueenBIRCH

1e+011

1.5e+011

2e+011

2.5e+011

3e+011

250

Number of Centers

CalTech

1e+011

1.2e+011

1.4e+011

1.6e+011

1.8e+011

2e+011

2.2e+011

2.4e+011

2.6e+011

1000

Number of Centers

CalTech

BICO: BIRCH meets Coresets for k -means

Page 82: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

Running time

0

200

400

600

800

1000

1200

10 20 30 40 50

Run

ning

Tim

e

Number of Centers

Census

BICO: BIRCH meets Coresets for k -means

Page 83: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

Running time

0

1000

2000

3000

4000

5000

6000

7000

250

Ru

nn

ing

Tim

e

Number of Centers

Census

StreamKM++

BICOBICO KMeans++

MacQueen

BIRCH

0

20000

40000

60000

80000

100000

120000

140000

160000

180000

1000

Number of Centers

Census

BICO: BIRCH meets Coresets for k -means

Page 84: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

Running time

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

250

Number of Centers

CalTech

0

100000

200000

300000

400000

500000

600000

1000

Number of Centers

CalTech

BICO: BIRCH meets Coresets for k -means

Page 85: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

Running time

0

2000

4000

6000

8000

10000

12000

14000

250

Ru

nn

ing

Tim

e

Number of Centers

BigCross

StreamKM++

BICOBICO KMeans++

MacQueen

BIRCH

0

20000

40000

60000

80000

100000

120000

140000

160000

180000

1000

Number of Centers

BigCross

BICO: BIRCH meets Coresets for k -means

Page 86: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

Trade-OffBIRCH MQ m = 200k 100k 25k

Time 616 4241 20271 5444 618Costs 58 · 1010 72 · 1010 51 · 1010 52 ·1010 55 · 1010

Tests run on on BigCross with k = 1000BICO is less than 1.5 times slower than MacQueen withm = 100k while still computing reasonable costsfaster than BIRCH for m = 25k , still much better cost thanBIRCH and MacQueen

BICO: BIRCH meets Coresets for k -means

Page 87: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BICO is cool :-)

ZieleImplementierung in bekanntem Framework / Anbindung

Baustein fur andere AlgorithmenVergleich verschiedener Strategien fur Nearest Neighbor

BICO: BIRCH meets Coresets for k -means

Page 88: BICO: BIRCH meets Coresets for k-meansls2- · Introduction Insights from BIRCH Coreset Theory BICO ExperimentsEnd Clustering Algorithms: Practice and Theory The k-means Problem Given

Introduction Insights from BIRCH Coreset Theory BICO Experiments End

BIRCH MQ m = 200k 100k 25kTime 616 4241 20271 5444 618Costs 58 · 1010 72 · 1010 51 · 1010 52 ·1010 55 · 1010

Thank you for your attention!

BICO: BIRCH meets Coresets for k -means