BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
By Tian Zhang, Raghu Ramakrishnan
Presented by Vladimir Jelić 3218/10 (e-mail: [email protected])
What is Data Clustering?
A cluster is a closely packed group: a collection of data objects that are similar to one another and treated collectively as a group.
Data clustering is the partitioning of a dataset into such clusters.
Data Clustering
Helps understand the natural grouping or structure in a dataset
Given a large set of multidimensional data:
– Data space is usually not uniformly occupied
– Identify the sparse and the crowded regions
– Helps visualization
Some Clustering Applications
Biology – building groups of genes with related patterns
Marketing – partitioning the population of consumers into market segments
WWW – division of web pages into genres
Image segmentation – for object recognition
Land use – identification of areas of similar land use from satellite images
Clustering Problems
Today many datasets are too large to fit into main memory
The dominating cost of any clustering algorithm is I/O, because seek times on disk are orders of magnitude higher than RAM access times
Previous Work
Two classes of clustering algorithms:
Probability-based – examples: COBWEB and CLASSIT
Distance-based – examples: KMEANS, KMEDOIDS, and CLARANS
Previous Work: COBWEB
Probabilistic approach to making decisions
Clusters are represented with a probabilistic description
Probabilistic representations of clusters are expensive
Every instance (data point) translates into a terminal node in the hierarchy, so large hierarchies tend to overfit the data
Previous Work: KMeans
Distance-based approach, so there must be a distance measure between any two instances
Sensitive to instance order
Instances must be stored in memory
All instances must be initially available
May have exponential run time
Previous Work: CLARANS
Also a distance-based approach, so there must be a distance measure between any two instances
Computational complexity of CLARANS is about O(n²)
Sensitive to instance order
Ignores the fact that not all data points in the dataset are equally important
Contributions of BIRCH
Each clustering decision is made without scanning all data points
BIRCH exploits the observation that the data space is usually not uniformly occupied, and hence not every data point is equally important for clustering purposes
BIRCH makes full use of available memory to derive the finest possible subclusters (to ensure accuracy) while minimizing I/O costs (to ensure efficiency)
Background Knowledge (1)
Given a cluster of N instances $\{\vec{x}_i\}$, $i = 1, \dots, N$, we define:

Centroid: $\vec{x}_0 = \frac{\sum_{i=1}^{N} \vec{x}_i}{N}$

Radius: $R = \left( \frac{\sum_{i=1}^{N} (\vec{x}_i - \vec{x}_0)^2}{N} \right)^{1/2}$

Diameter: $D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{x}_i - \vec{x}_j)^2}{N(N-1)} \right)^{1/2}$
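R is the average distance from a member point to the centroid, and D is the average pairwise distance within the cluster. As a minimal sketch (plain NumPy; function names are illustrative), these statistics can be computed directly from the points:

```python
import numpy as np

def centroid(points):
    """x0: the mean of the points."""
    return points.mean(axis=0)

def radius(points):
    """R: root mean squared distance from the points to the centroid."""
    x0 = centroid(points)
    return np.sqrt(((points - x0) ** 2).sum(axis=1).mean())

def diameter(points):
    """D: root mean squared pairwise distance, averaged over N(N-1) ordered pairs."""
    n = len(points)
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs ** 2).sum() / (n * (n - 1)))

pts = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]], dtype=float)
print(centroid(pts), radius(pts), diameter(pts))
```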
Background Knowledge (2)
Given two clusters, one with $N_1$ points $\{\vec{x}_i\}$, $i = 1, \dots, N_1$, and one with $N_2$ points $\{\vec{x}_j\}$, $j = N_1+1, \dots, N_1+N_2$, with centroids $\vec{x}_{01}$ and $\vec{x}_{02}$, five alternative distances are defined:

Centroid Euclidean distance: $D0 = \left( (\vec{x}_{01} - \vec{x}_{02})^2 \right)^{1/2}$

Centroid Manhattan distance: $D1 = |\vec{x}_{01} - \vec{x}_{02}| = \sum_{i=1}^{d} |x_{01}^{(i)} - x_{02}^{(i)}|$

Average inter-cluster distance: $D2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (\vec{x}_i - \vec{x}_j)^2}{N_1 N_2} \right)^{1/2}$

Average intra-cluster distance: $D3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (\vec{x}_i - \vec{x}_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}$

Variance increase distance: $D4 = \sum_{k=1}^{N_1+N_2} \left( \vec{x}_k - \frac{\sum_{l=1}^{N_1+N_2} \vec{x}_l}{N_1+N_2} \right)^2 - \sum_{i=1}^{N_1} \left( \vec{x}_i - \frac{\sum_{l=1}^{N_1} \vec{x}_l}{N_1} \right)^2 - \sum_{j=N_1+1}^{N_1+N_2} \left( \vec{x}_j - \frac{\sum_{l=N_1+1}^{N_1+N_2} \vec{x}_l}{N_2} \right)^2$
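A minimal NumPy sketch of the first three distances between two clusters of points (function names are illustrative):

```python
import numpy as np

def d0(c1, c2):
    """Centroid Euclidean distance."""
    return np.linalg.norm(c1.mean(axis=0) - c2.mean(axis=0))

def d1(c1, c2):
    """Centroid Manhattan distance."""
    return np.abs(c1.mean(axis=0) - c2.mean(axis=0)).sum()

def d2(c1, c2):
    """Average inter-cluster distance: RMS over all cross-cluster pairs."""
    diffs = c1[:, None, :] - c2[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1).mean())
```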
Clustering Features (CF)
The Birch algorithm builds a dendrogram called clustering feature tree (CF tree) while scanning the data set.
Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple: (N, LS, SS), where N is the number of objects in the cluster and LS, SS are defined in the following.
Clustering Feature (CF)
Given N d-dimensional data points in a cluster {Xi}, i = 1, 2, …, N:
CF = (N, LS, SS)
N is the number of data points in the cluster, LS is the linear sum of the N data points, and SS is the square sum of the N data points.
CF Additivity Theorem (1)
If CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint sub-clusters, then the CF entry of the sub-cluster formed by merging the two disjoint sub-clusters is:
CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
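A key consequence is that CFs can be built and merged incrementally without revisiting the underlying points. A minimal sketch, keeping LS and SS per dimension as in the example on the next slide (names are illustrative):

```python
import numpy as np

def cf(points):
    """Clustering feature (N, LS, SS) of a set of points:
    count, per-dimension linear sum, per-dimension square sum."""
    points = np.asarray(points, dtype=float)
    return len(points), points.sum(axis=0), (points ** 2).sum(axis=0)

def cf_merge(cf1, cf2):
    """CF additivity: the CF of the union of two disjoint sub-clusters."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, ls1 + ls2, ss1 + ss2
```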
CF Additivity Theorem (2)
Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8):
CF = (5, (16,30), (54,190))
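Using the cf helper from the previous sketch, the example can be checked directly:

```python
n, ls, ss = cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(n, ls, ss)  # 5 [16. 30.] [ 54. 190.]
```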
Properties of CF-Tree
Each non-leaf node has at most B entries
Each leaf node has at most L CF entries, each of which satisfies threshold T
Node size is determined by the dimensionality of the data space and the input parameter P (page size)
CF Tree Insertion
Identifying the appropriate leaf: recursively descend the CF tree, choosing the closest child node according to a chosen distance metric
Modifying the leaf: test whether the leaf can absorb the new entry without violating the threshold; if there is no room, split the node
Modifying the path: update the CF information up the path (see the sketch below)
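A simplified sketch of the descent-and-absorb logic (node splitting is omitted; class and field names are illustrative, not the paper's):

```python
import numpy as np

class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.entries = []    # CF triples (n, ls, ss)
        self.children = []   # child Nodes, parallel to entries (non-leaf only)

def merge(a, b):
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

def radius(cf):
    """Cluster radius from the CF alone: R^2 = SS/N - ||LS/N||^2."""
    n, ls, ss = cf
    return np.sqrt(max(ss.sum() / n - ((ls / n) ** 2).sum(), 0.0))

def closest(entries, x):
    """Index of the entry whose centroid LS/N is nearest to x."""
    return min(range(len(entries)),
               key=lambda k: np.linalg.norm(entries[k][1] / entries[k][0] - x))

def insert(node, x, threshold, max_entries):
    new_cf = (1, x.copy(), x ** 2)
    if not node.leaf:
        i = closest(node.entries, x)
        insert(node.children[i], x, threshold, max_entries)
        node.entries[i] = merge(node.entries[i], new_cf)   # update CFs up the path
        return
    if node.entries:
        i = closest(node.entries, x)
        absorbed = merge(node.entries[i], new_cf)
        if radius(absorbed) <= threshold:                  # leaf entry absorbs the point
            node.entries[i] = absorbed
            return
    if len(node.entries) < max_entries:                    # room for a new CF entry
        node.entries.append(new_cf)
    # else: a full implementation would split this leaf and propagate the split upward
```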
Example of the BIRCH Algorithm
[Figure: a CF tree whose root points to leaf nodes LN1, LN2, and LN3, holding subclusters sc1–sc8; a new subcluster is inserted into the tree.]
Merge Operation in BIRCH
[Figure: inserting sc8 causes LN1 to split into LN1' and LN1''; the root now points to LN1', LN1'', LN2, and LN3.]
Because the branching factor of a leaf node cannot exceed 3, LN1 is split.
Merge Operation in BIRCH
[Figure: after the leaf split, the root holds four entries; it is split into non-leaf nodes NLN1 and NLN2 under a new root.]
Because the branching factor of a non-leaf node cannot exceed 3, the root is split and the height of the CF tree increases by one.
Merge Operation in BIRCH
[Figure: a CF tree whose root points to leaf nodes LN1 and LN2, holding subclusters sc1–sc6.]
Assume that the subclusters are numbered according to the order of formation.

[Figure: inserting into LN2 forces it to split into LN2' and LN2''.]
Because the branching factor of a leaf node cannot exceed 3, LN2 is split.

[Figure: LN2' and LN1 are merged into a single node, which is immediately split into LN3' and LN3''.]
LN2' and LN1 will be merged, and the newly formed node will be split immediately.
Birch Clustering Algorithm (1)
Phase 1: Scan all data and build an initial in-memory CF tree
Phase 2: Condense into a desirable size by building a smaller CF tree
Phase 3: Global clustering
Phase 4: Cluster refining – optional; requires additional passes over the data to refine the results
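For experimentation, scikit-learn ships an implementation of this pipeline; a minimal usage sketch (parameter values are illustrative):

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)

# threshold plays the role of T, branching_factor of B;
# n_clusters drives the global clustering phase
model = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
labels = model.fit_predict(X)
```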
Birch Clustering Algorithm (2)

[Figure: flow diagram of the four BIRCH phases.]
Birch – Phase 1
Start with an initial threshold and insert points into the tree
If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting the entries from the old tree, followed by the remaining values
A good initial threshold is important but hard to figure out
Outlier removal – outliers can be removed when the tree is rebuilt
Birch – Phase 2

Optional
The global algorithm used in Phase 3 sometimes has an input size range in which it performs well, so Phase 2 prepares the tree for Phase 3
BIRCH rescans the leaf entries of the CF tree to build a smaller CF tree, removing sparse clusters as outliers and grouping dense clusters into larger ones
Birch – Phase 3
Problems after Phase 1:
– Input order affects results
– Splitting is triggered by node size
Phase 3:
– Cluster all leaf entries on their CF values using an existing algorithm (see the sketch below)
– Algorithm used here: agglomerative hierarchical clustering
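A minimal sketch of the idea behind the global phase: treat each leaf CF entry as a point (its centroid LS/N) and cluster those. Here scikit-learn's AgglomerativeClustering stands in for the paper's agglomerative method, and the CF-weighted distance refinements are omitted:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def global_cluster(leaf_cfs, k):
    """Cluster leaf CF entries by their centroids."""
    centroids = np.array([ls / n for n, ls, _ in leaf_cfs])
    return AgglomerativeClustering(n_clusters=k).fit_predict(centroids)
```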
Birch – Phase 4
Optional
Do additional passes over the dataset and reassign data points to the closest centroid from Phase 3
Recalculate the centroids and redistribute the items
Always converges (no matter how many times Phase 4 is repeated)
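One refinement pass is essentially the assignment step of k-means, seeded with the Phase 3 centroids. A minimal sketch (names are illustrative; assumes every centroid keeps at least one point):

```python
import numpy as np

def refine(points, centroids):
    """One Phase 4 pass: assign each point to its closest centroid,
    then recompute the centroids from the assignments."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
    labels = dists.argmin(axis=1)
    new_centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids
```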
Conclusions (1)
BIRCH performs faster than existing algorithms (CLARANS and KMEANS) on large datasets
Scans the whole dataset only once
Handles outliers better
Superior to other algorithms in stability and scalability
Conclusions (2)
Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user may consider a natural cluster. Moreover, if the clusters are not spherical, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.
References
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD '96.
J. Oberst. Efficient Data Clustering and How to Groom Fast-Growing Trees.
P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining.