BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
By Tian Zhang, Raghu Ramakrishnan
Presented by Vladimir Jelić 3218/10 (e-mail: [email protected])
What is Data Clustering?
A cluster is a closely packed group: a collection of data objects that are similar to one another and treated collectively as a group.
Data clustering is the partitioning of a dataset into such clusters.
Data Clustering
Helps understand the natural grouping or structure in a dataset
Given a large set of multidimensional data:
– Data space is usually not uniformly occupied
– Identify the sparse and the crowded regions
– Helps visualization
Some Clustering Applications
Biology – building groups of genes with related patterns
Marketing – partitioning the population of consumers into market segments
WWW – division of web pages into genres
Image segmentation – for object recognition
Land use – identification of areas of similar land use from satellite images
Clustering Problems
Today many datasets are too large to fit into main memory
The dominating cost of any clustering algorithm is I/O, because seek times on disk are orders of magnitude higher than RAM access times
Previous Work
Two classes of clustering algorithms:
Probability-based – examples: COBWEB and CLASSIT
Distance-based – examples: KMEANS, KMEDOIDS, and CLARANS
Previous Work: COBWEB
Probabilistic approach to making decisions
Clusters are represented with a probabilistic description
Probabilistic representations of clusters are expensive
Every instance (data point) translates into a terminal node in the hierarchy, so large hierarchies tend to overfit the data
Previous Work: KMeans
Distance-based approach, so there must be a distance measure between any two instances
Sensitive to instance order
Instances must be stored in memory
All instances must be initially available
May have exponential run time
Previous Work: CLARANS
Also a distance-based approach, so there must be a distance measure between any two instances
Computational complexity of CLARANS is about O(n²)
Sensitive to instance order
Ignores the fact that not all data points in the dataset are equally important
Contributions of BIRCH
Each clustering decision is made without scanning all data points
BIRCH exploits the observation that the data space is usually not uniformly occupied, and hence not every data point is equally important for clustering purposes
BIRCH makes full use of available memory to derive the finest possible subclusters (to ensure accuracy) while minimizing I/O costs (to ensure efficiency)
Background Knowledge (1)
Given a cluster of N instances $\{\vec{x}_i\}$, $i = 1, \dots, N$, we define:

Centroid: $\vec{x}_0 = \frac{\sum_{i=1}^{N} \vec{x}_i}{N}$

Radius: $R = \left( \frac{\sum_{i=1}^{N} (\vec{x}_i - \vec{x}_0)^2}{N} \right)^{1/2}$

Diameter: $D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{x}_i - \vec{x}_j)^2}{N(N-1)} \right)^{1/2}$
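R is the average distance from a member point to the centroid, and D is the average pairwise distance within the cluster. As a minimal sketch (plain NumPy; function names are illustrative), these statistics can be computed directly from the points:

```python
import numpy as np

def centroid(points):
    """x0: the mean of the points."""
    return points.mean(axis=0)

def radius(points):
    """R: root mean squared distance from the points to the centroid."""
    x0 = centroid(points)
    return np.sqrt(((points - x0) ** 2).sum(axis=1).mean())

def diameter(points):
    """D: root mean squared pairwise distance, averaged over N(N-1) ordered pairs."""
    n = len(points)
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs ** 2).sum() / (n * (n - 1)))

pts = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]], dtype=float)
print(centroid(pts), radius(pts), diameter(pts))
```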
Background Knowledge (2)
Given two clusters, one with $N_1$ points $\{\vec{x}_i\}$, $i = 1, \dots, N_1$, and one with $N_2$ points $\{\vec{x}_j\}$, $j = N_1+1, \dots, N_1+N_2$, with centroids $\vec{x}_{01}$ and $\vec{x}_{02}$, five alternative distances are defined:

Centroid Euclidean distance: $D0 = \left( (\vec{x}_{01} - \vec{x}_{02})^2 \right)^{1/2}$

Centroid Manhattan distance: $D1 = |\vec{x}_{01} - \vec{x}_{02}| = \sum_{i=1}^{d} |x_{01}^{(i)} - x_{02}^{(i)}|$

Average inter-cluster distance: $D2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (\vec{x}_i - \vec{x}_j)^2}{N_1 N_2} \right)^{1/2}$

Average intra-cluster distance: $D3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (\vec{x}_i - \vec{x}_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}$

Variance increase distance: $D4 = \sum_{k=1}^{N_1+N_2} \left( \vec{x}_k - \frac{\sum_{l=1}^{N_1+N_2} \vec{x}_l}{N_1+N_2} \right)^2 - \sum_{i=1}^{N_1} \left( \vec{x}_i - \frac{\sum_{l=1}^{N_1} \vec{x}_l}{N_1} \right)^2 - \sum_{j=N_1+1}^{N_1+N_2} \left( \vec{x}_j - \frac{\sum_{l=N_1+1}^{N_1+N_2} \vec{x}_l}{N_2} \right)^2$
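A minimal NumPy sketch of the first three distances between two clusters of points (function names are illustrative):

```python
import numpy as np

def d0(c1, c2):
    """Centroid Euclidean distance."""
    return np.linalg.norm(c1.mean(axis=0) - c2.mean(axis=0))

def d1(c1, c2):
    """Centroid Manhattan distance."""
    return np.abs(c1.mean(axis=0) - c2.mean(axis=0)).sum()

def d2(c1, c2):
    """Average inter-cluster distance: RMS over all cross-cluster pairs."""
    diffs = c1[:, None, :] - c2[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1).mean())
```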
Clustering Features (CF)
The Birch algorithm builds a dendrogram called clustering feature tree (CF tree) while scanning the data set.
Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple: (N, LS, SS), where N is the number of objects in the cluster and LS, SS are defined in the following.
Clustering Feature (CF)
Given N d-dimensional data points in a cluster {Xi}, i = 1, 2, …, N:
CF = (N, LS, SS)
N is the number of data points in the cluster, LS is the linear sum of the N data points, and SS is the square sum of the N data points.
CF Additivity Theorem (1)
If CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint sub-clusters, then the CF entry of the sub-cluster formed by merging the two disjoint sub-clusters is:
CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
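A key consequence is that CFs can be built and merged incrementally without revisiting the underlying points. A minimal sketch, keeping LS and SS per dimension as in the example on the next slide (names are illustrative):

```python
import numpy as np

def cf(points):
    """Clustering feature (N, LS, SS) of a set of points:
    count, per-dimension linear sum, per-dimension square sum."""
    points = np.asarray(points, dtype=float)
    return len(points), points.sum(axis=0), (points ** 2).sum(axis=0)

def cf_merge(cf1, cf2):
    """CF additivity: the CF of the union of two disjoint sub-clusters."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, ls1 + ls2, ss1 + ss2
```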
CF Additivity Theorem (2)
Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8):
CF = (5, (16,30), (54,190))
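Using the cf helper from the previous sketch, the example can be checked directly:

```python
n, ls, ss = cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(n, ls, ss)  # 5 [16. 30.] [ 54. 190.]
```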
Properties of CF-Tree
Each non-leaf node has at most B entries
Each leaf node has at most L CF entries, each of which satisfies threshold T
Node size is determined by the dimensionality of the data space and the input parameter P (page size)
CF Tree Insertion
Identifying the appropriate leaf: recursively descend the CF tree, choosing the closest child node according to a chosen distance metric
Modifying the leaf: test whether the leaf can absorb the new entry without violating the threshold; if there is no room, split the node
Modifying the path: update the CF information up the path (see the sketch below)
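A simplified sketch of the descent-and-absorb logic (node splitting is omitted; class and field names are illustrative, not the paper's):

```python
import numpy as np

class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.entries = []    # CF triples (n, ls, ss)
        self.children = []   # child Nodes, parallel to entries (non-leaf only)

def merge(a, b):
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

def radius(cf):
    """Cluster radius from the CF alone: R^2 = SS/N - ||LS/N||^2."""
    n, ls, ss = cf
    return np.sqrt(max(ss.sum() / n - ((ls / n) ** 2).sum(), 0.0))

def closest(entries, x):
    """Index of the entry whose centroid LS/N is nearest to x."""
    return min(range(len(entries)),
               key=lambda k: np.linalg.norm(entries[k][1] / entries[k][0] - x))

def insert(node, x, threshold, max_entries):
    new_cf = (1, x.copy(), x ** 2)
    if not node.leaf:
        i = closest(node.entries, x)
        insert(node.children[i], x, threshold, max_entries)
        node.entries[i] = merge(node.entries[i], new_cf)   # update CFs up the path
        return
    if node.entries:
        i = closest(node.entries, x)
        absorbed = merge(node.entries[i], new_cf)
        if radius(absorbed) <= threshold:                  # leaf entry absorbs the point
            node.entries[i] = absorbed
            return
    if len(node.entries) < max_entries:                    # room for a new CF entry
        node.entries.append(new_cf)
    # else: a full implementation would split this leaf and propagate the split upward
```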
Example of the BIRCH Algorithm
[Figure: a CF tree whose root points to leaf nodes LN1, LN2, and LN3, holding subclusters sc1–sc8; a new subcluster is inserted into the tree.]
Merge Operation in BIRCH
[Figure: inserting sc8 causes LN1 to split into LN1' and LN1''; the root now points to LN1', LN1'', LN2, and LN3.]
Because the branching factor of a leaf node cannot exceed 3, LN1 is split.
Merge Operation in BIRCH
[Figure: after the leaf split, the root holds four entries; it is split into non-leaf nodes NLN1 and NLN2 under a new root.]
Because the branching factor of a non-leaf node cannot exceed 3, the root is split and the height of the CF tree increases by one.
Merge Operation in BIRCH
[Figure: a CF tree whose root points to leaf nodes LN1 and LN2, holding subclusters sc1–sc6.]
Assume that the subclusters are numbered according to the order of formation.

[Figure: inserting into LN2 forces it to split into LN2' and LN2''.]
Because the branching factor of a leaf node cannot exceed 3, LN2 is split.

[Figure: LN2' and LN1 are merged into a single node, which is immediately split into LN3' and LN3''.]
LN2' and LN1 will be merged, and the newly formed node will be split immediately.
Birch Clustering Algorithm (1)
Phase 1: Scan all data and build an initial in-memory CF tree
Phase 2: Condense into a desirable size by building a smaller CF tree
Phase 3: Global clustering
Phase 4: Cluster refining – optional; requires additional passes over the data to refine the results
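For experimentation, scikit-learn ships an implementation of this pipeline; a minimal usage sketch (parameter values are illustrative):

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)

# threshold plays the role of T, branching_factor of B;
# n_clusters drives the global clustering phase
model = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
labels = model.fit_predict(X)
```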
Birch Clustering Algorithm (2)

[Figure: flow diagram of the four BIRCH phases.]
Birch – Phase 1
Start with an initial threshold and insert points into the tree
If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting the entries from the old tree, followed by the remaining values
A good initial threshold is important but hard to figure out
Outlier removal – outliers can be removed when the tree is rebuilt
Birch – Phase 2

Optional
The global algorithm used in Phase 3 sometimes has an input size range in which it performs well, so Phase 2 prepares the tree for Phase 3
BIRCH rescans the leaf entries of the CF tree to build a smaller CF tree, removing sparse clusters as outliers and grouping dense clusters into larger ones
Birch – Phase 3
Problems after Phase 1:
– Input order affects results
– Splitting is triggered by node size
Phase 3:
– Cluster all leaf entries on their CF values using an existing algorithm (see the sketch below)
– Algorithm used here: agglomerative hierarchical clustering
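A minimal sketch of the idea behind the global phase: treat each leaf CF entry as a point (its centroid LS/N) and cluster those. Here scikit-learn's AgglomerativeClustering stands in for the paper's agglomerative method, and the CF-weighted distance refinements are omitted:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def global_cluster(leaf_cfs, k):
    """Cluster leaf CF entries by their centroids."""
    centroids = np.array([ls / n for n, ls, _ in leaf_cfs])
    return AgglomerativeClustering(n_clusters=k).fit_predict(centroids)
```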
Birch – Phase 4
Optional
Do additional passes over the dataset and reassign data points to the closest centroid from Phase 3
Recalculate the centroids and redistribute the items
Always converges (no matter how many times Phase 4 is repeated)
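One refinement pass is essentially the assignment step of k-means, seeded with the Phase 3 centroids. A minimal sketch (names are illustrative; assumes every centroid keeps at least one point):

```python
import numpy as np

def refine(points, centroids):
    """One Phase 4 pass: assign each point to its closest centroid,
    then recompute the centroids from the assignments."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
    labels = dists.argmin(axis=1)
    new_centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids
```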
Conclusions (1)
BIRCH performs faster than existing algorithms (CLARANS and KMEANS) on large datasets
Scans the whole dataset only once
Handles outliers better
Superior to other algorithms in stability and scalability
Conclusions (2)
Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user may consider a natural cluster. Moreover, if the clusters are not spherical, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.
References
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD '96.
J. Oberst. Efficient Data Clustering and How to Groom Fast-Growing Trees.
P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining.