
Page 1: CSE 634 Data Mining Techniques

CSE 634 Data Mining Techniques

CLUSTERING Part 2 (Group no: 1)

By: Anushree Shibani Shivaprakash & Fatima Zarinni

Spring 2006
Professor Anita Wasilewska

SUNY Stony Brook

Page 2: CSE 634 Data Mining Techniques

References

Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8). Morgan Kaufmann, 2002.

M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD'96. http://ifsc.ualr.edu/xwxu/publications/kdd-96.pdf

How to explain hierarchical clustering. http://www.analytictech.com/networks/hiclus.htm

Tian Zhang, Raghu Ramakrishnan, Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD'96.

Margaret H. Dunham. Data Mining.

http://cs.sunysb.edu/~cse634/ Presentation 9 – Cluster Analysis

Page 3: CSE 634 Data Mining Techniques

Introduction

Major clustering methods

Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods

Page 4: CSE 634 Data Mining Techniques

Hierarchical methods

Here we group data objects into a tree of clusters.

There are two types of hierarchical clustering:

1. Agglomerative hierarchical clustering
2. Divisive hierarchical clustering

Page 5: CSE 634 Data Mining Techniques

Agglomerative hierarchical clustering

Groups data objects in a bottom-up fashion. Initially each data object is in its own cluster.

We then merge these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied.

A user can specify the desired number of clusters as a termination condition.

Page 6: CSE 634 Data Mining Techniques

Divisive hierarchical clustering

Groups data objects in a top-down fashion.

Initially all data objects are in one cluster.

We then subdivide the cluster into smaller and smaller clusters, until each object forms a cluster on its own or certain termination conditions are satisfied, such as a desired number of clusters being obtained.

Page 7: CSE 634 Data Mining Techniques

AGNES & DIANA

Application of AGNES (AGglomerative NESting) and DIANA (DIvisive ANAlysis) to a data set of five objects, {a, b, c, d, e}.

[Figure: AGNES proceeds bottom-up over steps 0 to 4, merging {a} and {b}, then {d} and {e}, then {c} with {d, e}, until all five objects form the single cluster {a, b, c, d, e}; DIANA runs the same steps in reverse (top-down), splitting {a, b, c, d, e} back into singletons.]

Page 8: CSE 634 Data Mining Techniques

AGNES-Explored

Given a set of N items to be clustered, and an N x N distance (or similarity) matrix, the basic process of Johnson's (1967) hierarchical clustering is this:

1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.

2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.

Page 9: CSE 634 Data Mining Techniques

AGNES

3. Compute distances (similarities) between the new cluster and each of the old clusters.

4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

Step 3 can be done in different ways, which is what distinguishes single-link from complete-link and average-link clustering.

Page 10: CSE 634 Data Mining Techniques

Similarity/Distance metrics

Single-link clustering: distance = the shortest distance from any member of one cluster to any member of the other cluster.

Complete-link clustering: distance = the longest distance from any member of one cluster to any member of the other cluster.

Average-link clustering: distance = the average distance from any member of one cluster to any member of the other cluster.
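As a concrete illustration of how the linkage choice changes the result, here is a minimal sketch using SciPy's hierarchical clustering routines (the sample points are made up for the example):

```python
# Minimal sketch: Johnson-style agglomerative clustering with three linkage
# criteria, via SciPy. The points below are illustrative only.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0],
                   [5.2, 4.8], [9.0, 1.0]])

for method in ("single", "complete", "average"):
    # `method` controls how the distance between the newly merged cluster
    # and the old clusters is computed (step 3 above).
    Z = linkage(points, method=method)
    # Cut the resulting tree into a desired number of clusters (here 2).
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, labels)
```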

Page 11: CSE 634 Data Mining Techniques

Single Linkage Hierarchical Clustering

1. Say "Every point is its own cluster"
2. Find "most similar" pair of clusters
3. Merge it into a parent cluster
4. Repeat


Page 16: CSE 634 Data Mining Techniques

DIANA (Divisive Analysis)

Introduced in Kaufmann and Rousseeuw (1990)

Inverse order of AGNES

Eventually each node forms a cluster on its own

[Figure: three scatter plots (both axes 0 to 10) illustrating the successive splits on a small sample data set.]

Page 17: CSE 634 Data Mining Techniques

Overview

Divisive Clustering starts by placing all objects into a single group. Before we start the procedure, we need to decide on a threshold distance. The procedure is as follows:

1. The distance between all pairs of objects within the same group is determined, and the pair with the largest distance is selected.

Page 18: CSE 634 Data Mining Techniques

Overview (contd.)

2. This maximum distance is compared to the threshold distance. If it is larger than the threshold, the group is divided in two: the selected pair is placed into different groups and used as seed points, every other object in the group is examined and placed into the new group with the closer seed point, and the procedure then returns to Step 1.

3. If the distance between the selected objects is less than the threshold, the divisive clustering stops.

To run a divisive clustering, you simply need to decide upon a method of measuring the distance between two objects.
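A rough sketch of this threshold-based procedure, under the assumption of Euclidean distance as the chosen measure (my own reading of the steps above; points and threshold are illustrative):

```python
# Sketch of threshold-based divisive clustering: keep splitting any group whose
# widest pair of objects is farther apart than the threshold.
import numpy as np

def divisive(points, threshold):
    groups = [list(range(len(points)))]   # start: one group holding every object
    done = []
    while groups:
        g = groups.pop()
        if len(g) < 2:
            done.append(g)
            continue
        # Step 1: find the pair with the largest within-group distance.
        a, b = max(((i, j) for i in g for j in g if i < j),
                   key=lambda p: np.linalg.norm(points[p[0]] - points[p[1]]))
        # Step 3: if even the widest pair is within the threshold, stop splitting.
        if np.linalg.norm(points[a] - points[b]) <= threshold:
            done.append(g)
            continue
        # Step 2: use the two far-apart objects as seeds and assign every other
        # object to the group of the closer seed, then re-examine both groups.
        ga, gb = [a], [b]
        for i in g:
            if i not in (a, b):
                da = np.linalg.norm(points[i] - points[a])
                db = np.linalg.norm(points[i] - points[b])
                (ga if da <= db else gb).append(i)
        groups.extend([ga, gb])
    return done

print(divisive(np.array([[0, 0], [0, 1], [10, 10], [10, 11]]), threshold=3.0))
# e.g. [[3, 2], [0, 1]] -- two tight groups of object indices
```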

Page 19: CSE 634 Data Mining Techniques

DIANA- Explored

In DIANA, a divisive hierarchical clustering method, all of the objects form one cluster.

The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster.

The cluster splitting process repeats until, eventually, each new cluster contains a single object or a termination condition is met.

Page 20: CSE 634 Data Mining Techniques

Difficulties with Hierarchical clustering

It encounters difficulties regarding the selection of merge and split points.

Such a decision is critical because once a group of objects is merged or split, the process at the next step operates on the newly generated clusters.

It will not undo what was done previously. Thus, split or merge decisions that are not well chosen at some step may lead to low-quality clusters.

Page 21: CSE 634 Data Mining Techniques

Solution to improve hierarchical clustering

One promising direction for improving the clustering quality of hierarchical methods is to integrate hierarchical clustering with other clustering techniques. A few such methods are:

1. BIRCH
2. CURE
3. Chameleon

Page 22: CSE 634 Data Mining Techniques

BIRCH: An Efficient Data Clustering Method for Very Large Databases

Paper by:

Tian Zhang

Computer Sciences Dept.

University of Wisconsin- Madison

[email protected]

Raghu Ramakrishnan
Computer Sciences Dept.
University of Wisconsin-Madison
[email protected]

Miron Livny
Computer Sciences Dept.
University of Wisconsin-Madison
[email protected]

In Proceedings of the International Conference on Management of Data (ACM SIGMOD), pages 103-114, Montreal, Canada, June 1996.

Page 23: CSE 634 Data Mining Techniques

Reference For Paper

www2.informatik.huberlin.de/wm/mldm2004/zhang96birch.pdf

Page 24: CSE 634 Data Mining Techniques

Birch (Balanced Iterative Reducing and Clustering Using Hierarchies)

A hierarchical clustering method. It introduces two concepts:

1. Clustering feature (CF)
2. Clustering feature tree (CF tree)

These structures help the clustering method achieve good speed and scalability in large databases.

Page 25: CSE 634 Data Mining Techniques

Clustering Feature Definition

Given N d-dimensional data points in a cluster {Xi}, i = 1, 2, ..., N:

CF = (N, LS, SS)

where N is the number of data points in the cluster, LS is the linear sum of the N data points, and SS is the square sum of the N data points.

Note: a clustering feature (CF) is a triplet summarizing information about a subcluster of objects.
Page 26: CSE 634 Data Mining Techniques

Clustering feature concepts

Each record (data object) is a tuple of attribute values and is here called a vector: we write Oi = (Vi1, ..., Vid).

Linear sum definition:

LS = ∑ Oi = (∑ Vi1, ∑ Vi2, ..., ∑ Vid), where each sum runs over i = 1, ..., N.

Page 27: CSE 634 Data Mining Techniques

Square sum

Definition:

SS = ∑ Oi² = (∑ Vi1², ∑ Vi2², ..., ∑ Vid²), where each sum runs over i = 1, ..., N.

Page 28: CSE 634 Data Mining Techniques

Example of a case

Assume N = 5 and d = 2.

Linear sum: LS = ∑ Oi = (∑ Vi1, ∑ Vi2), with each sum over i = 1, ..., 5.

Square sum: SS = (∑ Vi1², ∑ Vi2²), with each sum over i = 1, ..., 5.

Page 29: CSE 634 Data Mining Techniques

Example 2

[Figure: scatter plot (both axes 0 to 10) of the five objects listed below.]

CF = (5, (16,30),(54,190))

Object Attribute1 Attribute2

O1 3 4

O2 2 6

O3 4 5

O4 4 7

O5 3 8

Clustering feature: CF = (N, LS, SS)

N = 5

LS = (16, 30)

SS = ( 54, 190)

Note: so a CF is essentially a summary of the statistics for the given subcluster.
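A small sketch (my own code, not from the slides) that recomputes the CF triple for the five objects above:

```python
# Recompute CF = (N, LS, SS) for the five 2-dimensional objects O1..O5.
import numpy as np

objects = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])

N = len(objects)
LS = objects.sum(axis=0)           # linear sum, component-wise
SS = (objects ** 2).sum(axis=0)    # square sum, component-wise

print(N, LS.tolist(), SS.tolist())   # -> 5 [16, 30] [54, 190]
```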
Page 30: CSE 634 Data Mining Techniques

CF-Tree

A CF-tree is a height-balanced tree with two parameters: branching factor (B for nonleaf node and L for leaf node) and threshold T.

The entry in each nonleaf node has the form [CFi, childi]

The entry in each leaf node is a CF; each leaf node has two pointers, 'prev' and 'next'.

The CF tree is basically a tree used to store all the clustering features.
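One reason CFs fit so naturally into a tree is that they are additive: a parent entry's CF is the component-wise sum of its children's CFs (the additivity property from the BIRCH paper). A minimal sketch, reusing the numbers from the earlier example (the class and names are illustrative):

```python
# Sketch of CF additivity: merging two subcluster CFs is just addition.
from dataclasses import dataclass
import numpy as np

@dataclass
class CF:
    n: int            # number of points
    ls: np.ndarray    # linear sum
    ss: np.ndarray    # square sum

    def merge(self, other: "CF") -> "CF":
        # A parent entry's CF is the sum of its children's CFs.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self) -> np.ndarray:
        return self.ls / self.n

cf1 = CF(3, np.array([10.0, 17.0]), np.array([34.0, 105.0]))  # {O1, O3, O5}
cf2 = CF(2, np.array([6.0, 13.0]), np.array([20.0, 85.0]))    # {O2, O4}
print(cf1.merge(cf2))   # -> parent CF with n=5, LS=(16, 30), SS=(54, 190)
```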

Page 31: CSE 634 Data Mining Techniques

CF Tree

[Figure: a CF tree. The root holds entries of the form [CFi, childi] (CF1 through CF6); a non-leaf node below it holds its own [CFi, childi] entries; the leaf nodes hold CF entries and are chained together by 'prev' and 'next' pointers.]

Page 32: CSE 634 Data Mining Techniques

BIRCH Clustering

Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)

Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
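For reference, scikit-learn ships a BIRCH implementation organized around the same two phases (a CF tree built in one scan, then a global clustering of the leaf entries). The parameter values below are illustrative, not tuned:

```python
# BIRCH via scikit-learn: threshold ~ T, branching_factor ~ B/L of the CF tree;
# n_clusters drives the Phase-2 global clustering of the leaf entries.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(100, 2))
               for loc in ((0, 0), (5, 5), (0, 5))])   # three synthetic blobs

model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(np.bincount(labels))   # roughly 100 points per cluster
```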

Page 33: CSE 634 Data Mining Techniques

BIRCH Algorithm Overview

Page 34: CSE 634 Data Mining Techniques

Summary of Birch

Scales linearly: a single scan yields a good clustering, and the quality improves with a few additional scans.

It handles noise (data points that are not part of the underlying pattern) effectively.

Page 35: CSE 634 Data Mining Techniques

Density-Based Clustering Methods

Clustering based on density (density-connected points) rather than on a distance metric alone. A cluster is a set of "density-connected" points.

Major features:
Discovers clusters of arbitrary shape
Handles noise
Needs "density parameters" as a termination condition (when no new objects can be added to a cluster)

Examples: DBSCAN (Ester et al., 1996), OPTICS (Ankerst et al., 1999), DENCLUE (Hinneburg & Keim, 1998)

Page 36: CSE 634 Data Mining Techniques

Density-Based Clustering: Background

Eps-neighborhood: the neighborhood within a radius Eps of a given object.

MinPts: the minimum number of points required in the Eps-neighborhood of that object.

Core object: if the Eps-neighborhood of an object contains at least MinPts points, the object is a core object.

Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps and MinPts if
1) p is within the Eps-neighborhood of q, and
2) q is a core object.

[Figure: points p and q, with q's Eps-neighborhood drawn; MinPts = 5, Eps = 1.]

Page 37: CSE 634 Data Mining Techniques

Figure showing the density reachability and density connectivity in density based clustering

M, P, O, R, and S are core objects since each has an Eps-neighborhood containing at least 3 points.

MinPts = 3; Eps = radius of the circles.

Page 38: CSE 634 Data Mining Techniques

Directly density reachable

Q is directly density reachable from M. M is directly density reachable from P and vice versa.

Page 39: CSE 634 Data Mining Techniques

Indirectly density reachable

Q is indirectly density-reachable from P, since Q is directly density-reachable from M and M is directly density-reachable from P. But P is not density-reachable from Q, since Q is not a core object.

Page 40: CSE 634 Data Mining Techniques

Core, border, and noise points

DBSCAN is a density-based algorithm. Density = the number of points within a specified radius (Eps).

A point is a core point if it has more than a specified number of points (MinPts) within Eps. These are points in the interior of a cluster.

A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.

A noise point is any point that is neither a core point nor a border point.
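A brute-force sketch (not from the slides) that labels each point as core, border, or noise for a given Eps and MinPts:

```python
# Classify points as core / border / noise using pairwise distances.
import numpy as np

def classify(points, eps, min_pts):
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbors = d <= eps                      # Eps-neighborhood (includes the point itself)
    core = neighbors.sum(axis=1) >= min_pts   # core: at least MinPts points within Eps
    border = ~core & (neighbors & core[None, :]).any(axis=1)  # within Eps of some core point
    noise = ~core & ~border
    return core, border, noise

pts = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [5, 5]])
core, border, noise = classify(pts, eps=1.2, min_pts=5)
print(core)    # only the central point (0.5, 0.5) is core
print(border)  # the four surrounding points are border points
print(noise)   # the far-away point (5, 5) is noise
```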

Page 41: CSE 634 Data Mining Techniques

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): The Algorithm

Arbitrarily select a point p.

Retrieve all points density-reachable from p wrt Eps and MinPts.

If p is a core point, a cluster is formed.

If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.

Continue the process until all of the points have been processed.
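As a hedged illustration, scikit-learn's DBSCAN takes the same two parameters (named eps and min_samples); the data below are synthetic, and the label -1 marks noise points:

```python
# DBSCAN on a ring plus a blob plus one outlier: it typically finds the two
# arbitrary-shape clusters and flags the outlier as noise.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
angles = rng.uniform(0, 2 * np.pi, 300)
ring = 5 * np.c_[np.cos(angles), np.sin(angles)] + rng.normal(0, 0.1, (300, 2))
blob = rng.normal(0, 0.3, (100, 2))
X = np.vstack([ring, blob, [[20.0, 20.0]]])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(sorted(set(labels)))   # typically [-1, 0, 1]: noise plus two clusters
```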

Page 42: CSE 634 Data Mining Techniques

Conclusions

We discussed two hierarchical clustering methods – Agglomerative and Divisive.

We also discussed BIRCH, a hierarchical clustering method that produces a good clustering in a single scan and improves it with a few additional scans.

DBSCAN is a density-based clustering algorithm that discovers clusters of arbitrary shape; unlike the hierarchical methods, it is driven by density rather than by a distance metric between clusters.

Page 43: CSE 634 Data Mining Techniques

GRID-BASED CLUSTERING METHODS

This is the approach in which we quantize the space into a finite number of cells that form a grid structure, on which all of the clustering operations are performed.

For example, assume we have a set of records that we want to cluster with respect to two attributes; we divide the related space (a plane) into a grid structure and then find the clusters.

Page 44: CSE 634 Data Mining Techniques

[Figure: a grid over the plane with Age (20 to 60) on the horizontal axis and Salary (in units of 10,000, from 0 to 8) on the vertical axis; our "space" is this plane.]

Page 45: CSE 634 Data Mining Techniques

Techniques for Grid-Based Clustering

The following are some techniques that are used to perform grid-based clustering:

CLIQUE (CLustering In QUest)
STING (STatistical Information Grid)
WaveCluster

Page 46: CSE 634 Data Mining Techniques

Looking at CLIQUE as an Example

CLIQUE is used for the clustering of high-dimensional data present in large tables. By high-dimensional data we mean records that have many attributes.

CLIQUE identifies the dense units in the subspaces of high dimensional data space, and uses these subspaces to provide more efficient clustering.

Page 47: CSE 634 Data Mining Techniques

Definitions That Need to Be Known

Unit : After forming a grid structure on the space, each rectangular cell is called a Unit.

Dense: A unit is dense, if the fraction of total data points contained in the unit exceeds the input model parameter.

Cluster: A cluster is defined as a maximal set of connected dense units.
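A minimal sketch of the unit/dense definitions: bin the points into a grid and keep the cells whose fraction of the points exceeds a density threshold. The grid resolution and the threshold name tau are my own choices for the illustration:

```python
# Count points per rectangular unit and keep the dense ones.
from collections import Counter
import numpy as np

def dense_units(points, intervals_per_dim, tau):
    lo, hi = points.min(axis=0), points.max(axis=0)
    # Map every point to the index tuple of the unit that contains it.
    idx = np.floor((points - lo) / (hi - lo + 1e-12) * intervals_per_dim).astype(int)
    idx = np.minimum(idx, intervals_per_dim - 1)       # keep max values in the last unit
    counts = Counter(map(tuple, idx.tolist()))
    # A unit is dense if its fraction of all points exceeds tau.
    return {cell for cell, c in counts.items() if c / len(points) > tau}

pts = np.random.default_rng(2).normal([[0, 0]] * 80 + [[5, 5]] * 80, 0.4)
print(dense_units(pts, intervals_per_dim=10, tau=0.05))
```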

Page 48: CSE 634 Data Mining Techniques

How Does CLIQUE Work?

Let us say that we have a set of records that we would like to cluster in terms of n attributes, so we are dealing with an n-dimensional space.

MAJOR STEPS:

CLIQUE partitions each 1-dimensional subspace into the same number of equal-length intervals.

Using this as a basis, it partitions the n-dimensional data space into non-overlapping rectangular units.

Page 49: CSE 634 Data Mining Techniques

CLIQUE: Major Steps (Cont.)

Now CLIQUE's goal is to identify the dense n-dimensional units. It does this in the following way:

CLIQUE finds dense units of higher dimensionality by first finding the dense units in the subspaces. So, for example, if we are dealing with a 3-dimensional space, CLIQUE finds the dense units in the 3 related planes (2-dimensional subspaces).

It then intersects the extensions of the subspaces representing the dense units to form a candidate search space in which dense units of higher dimensionality may exist.

Page 50: CSE 634 Data Mining Techniques

CLIQUE: Major Steps. (Cont.)

Each maximal set of connected dense units is considered a cluster.

Using this definition, the dense units in the subspaces are examined in order to find clusters in the subspaces.

The information of the subspaces is then used to find clusters in the n-dimensional space.

It must be noted that all cluster boundaries are either horizontal or vertical. This is due to the nature of the rectangular grid cells.

Page 51: CSE 634 Data Mining Techniques

Example for CLIQUE

Let us say that we want to cluster a set of records that have three attributes, namely, salary, vacation and age.

The data space for this data would be 3-dimensional.

[Figure: a 3-dimensional data space with axes salary, vacation, and age.]

Page 52: CSE 634 Data Mining Techniques

Example (Cont.)

After plotting the data objects, each dimension, (i.e., salary, vacation and age) is split into intervals of equal length.

Then we form a 3-dimensional grid on the space, each unit of which would be a 3-D rectangle.

Now, our goal is to find the dense 3-D rectangular units.

Page 53: CSE 634 Data Mining Techniques

Example (Cont.)

To do this, we find the dense units of the subspaces of this 3-d space.

So, we find the dense units with respect to age for salary. This means that we look at the salary-age plane and find all the 2-D rectangular units that are dense.

We also find the dense 2-D rectangular units for the vacation-age plane.

Page 54: CSE 634 Data Mining Techniques

Example 1

[Figure: two 2-D grids with age (20 to 60) on the horizontal axis, one for Salary (10,000) and one for Vacation (weeks), showing the dense rectangular units in the salary-age and vacation-age planes.]

Page 55: CSE 634 Data Mining Techniques

Example (Cont.)

Now let us try to visualize the dense units of the two planes on the following 3-D figure:

[Figure: the dense units of the vacation-age and salary-age planes shown on the 3-D (salary, vacation, age) space, extending roughly over the age interval 30 to 50.]

Page 56: CSE 634 Data Mining Techniques

Example (Cont.)

We can extend the dense areas in the vacation-age plane inwards.

We can extend the dense areas in the salary-age plane upwards.

The intersection of these two spaces would give us a candidate search space in which 3-dimensional dense units exist.

We then find the dense units in the salary-vacation plane and we form an extension of the subspace that represents these dense units.

Page 57: CSE 634 Data Mining Techniques

Example (Cont.)

Now, we perform an intersection of the candidate search space with the extension of the dense units of the salary-vacation plane, in order to get all the 3-d dense units.

So, what was the main idea? We used the dense units in the subspaces in order to find the dense units in the 3-dimensional space.

After finding the dense units, it is very easy to find clusters.

Page 58: CSE 634 Data Mining Techniques

Reflecting upon CLIQUE

Why does CLIQUE confine its search for dense units in high dimensions to the intersection of dense units in subspaces?

Because the Apriori property employs prior knowledge of the items in the search space so that portions of the space can be pruned.

The property for CLIQUE says that if a k-dimensional unit is dense then so are its projections in the (k-1) dimensional space.
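A tiny sketch of the pruning this property permits: a candidate 2-dimensional unit only needs to have its points counted if both of its 1-dimensional projections are dense. The dense 1-D units below are hypothetical:

```python
# Generate candidate 2-D units from dense 1-D units (Apriori-style pruning):
# any 2-D unit whose projection onto dimension 0 or 1 is not dense can be skipped.
from itertools import product

dense_1d = {0: {2, 3, 7}, 1: {1, 2}}   # hypothetical dense interval indices per dimension

candidate_2d = set(product(dense_1d[0], dense_1d[1]))
print(sorted(candidate_2d))            # only these cells need an actual density count
```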

Page 59: CSE 634 Data Mining Techniques

Strengths and Weaknesses of CLIQUE

Strengths:
It automatically finds the subspaces of highest dimensionality in which high-density clusters exist.
It is quite efficient.
It is insensitive to the order of the input records and does not presume any canonical data distribution.
It scales linearly with the size of the input and scales well as the number of dimensions in the data increases.

Weakness:
The accuracy of the clustering result may be degraded at the expense of the simplicity of the method.

Page 60: CSE 634 Data Mining Techniques

STING: A Statistical Information Grid Approach to Spatial Data Mining

Paper by:

Wei Wang
Department of Computer Science
University of California, Los Angeles
CA 90095, U.S.A.
[email protected]

Jiong Yang
Department of Computer Science
University of California, Los Angeles
CA 90095, U.S.A.
[email protected]

Richard Muntz
Department of Computer Science
University of California, Los Angeles
CA 90095, U.S.A.
[email protected]

In Proceedings of the VLDB Conference, Athens, Greece, 1997.

Page 61: CSE 634 Data Mining Techniques

Reference For Paper

http://georges.gardarin.free.fr/Cours_XMLDM_Master2/Sting.PDF

Page 62: CSE 634 Data Mining Techniques

Definitions That Need to Be Known

Spatial data: data that have a spatial or location component. These are objects that are themselves located in physical space. Examples: my house, Lake Geneva, New York City, etc.

Spatial area: the area that encompasses the locations of all the spatial data is called the spatial area.

Page 63: CSE 634 Data Mining Techniques

STING (Introduction)

STING is used for performing clustering on spatial data.

STING uses a hierarchical multi resolution grid data structure to partition the spatial area.

STING's big benefit is that it processes many common "region-oriented" queries on a set of points efficiently.

We want to cluster the records that are in a spatial table in terms of location.

Placement of a record in a grid cell is completely determined by its physical location.

Page 64: CSE 634 Data Mining Techniques

Hierarchical Structure of Each Grid Cell

The spatial area is divided into rectangular cells. (Using latitude and longitude.)

Each cell forms a hierarchical structure: each cell at a higher level is further partitioned into 4 smaller cells at the next lower level.

In other words, each cell at the i-th level (except the leaves) has 4 children at the (i+1)-th level.

The union of the 4 child cells gives back the parent cell in the level above them.

Page 65: CSE 634 Data Mining Techniques

Hierarchical Structure of Cells (Cont.)

The size of the leaf level cells and the number of layers depends upon how much granularity the user wants.

So, Why do we have a hierarchical structure for cells?

We have them in order to provide a better granularity, or higher resolution.

Page 66: CSE 634 Data Mining Techniques

A Hierarchical Structure for STING Clustering

Page 67: CSE 634 Data Mining Techniques

Statistical Parameters Stored in each Cell

For each cell in each layer we have attribute-dependent and attribute-independent parameters.

Attribute-independent parameter:
Count: the number of records in this cell.

Attribute-dependent parameters (we assume the attribute values are real numbers):

Page 68: CSE 634 Data Mining Techniques

Statistical Parameters (Cont.)

For each attribute of each cell we store the following parameters:

M: the mean of all values of the attribute in this cell.

S: the standard deviation of all values of the attribute in this cell.

Min: the minimum value of the attribute in this cell.

Max: the maximum value of the attribute in this cell.

Distribution: the type of distribution that the attribute values in this cell follow (e.g., normal, exponential, etc.). "None" is assigned to Distribution if the distribution is unknown.

Page 69: CSE 634 Data Mining Techniques

Storing of Statistical Parameters

Statistical information regarding the attributes in each grid cell, for each layer, is pre-computed and stored beforehand.

The statistical parameters for the cells in the lowest layer are computed directly from the values that are present in the table.

The statistical parameters for the cells in all the other levels are computed from their respective child cells in the lower level.
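A sketch (my own, not code from the paper) of how a parent cell's parameters for one attribute can be derived from its four children, following the description above:

```python
# Combine the count, mean, std, min, and max of four child cells into the parent's.
import numpy as np

def parent_params(children):
    """children: list of dicts with keys n, m (mean), s (std), min, max."""
    n = sum(c["n"] for c in children)
    m = sum(c["n"] * c["m"] for c in children) / n          # weighted mean
    # Combine standard deviations via E[x^2] of the pooled points.
    ex2 = sum(c["n"] * (c["s"] ** 2 + c["m"] ** 2) for c in children) / n
    s = np.sqrt(max(ex2 - m ** 2, 0.0))
    return {"n": n, "m": m, "s": s,
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}

kids = [{"n": 10, "m": 2.0, "s": 0.5, "min": 1.0, "max": 3.0},
        {"n": 30, "m": 5.0, "s": 1.0, "min": 3.0, "max": 8.0},
        {"n": 20, "m": 4.0, "s": 0.8, "min": 2.5, "max": 6.0},
        {"n": 40, "m": 3.0, "s": 0.6, "min": 1.5, "max": 5.0}]
print(parent_params(kids))
```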

Page 70: CSE 634 Data Mining Techniques

How Are Queries Processed?

STING can answer many queries (especially region queries) efficiently, because we don't have to access the full database.

How are spatial data queries processed? We use a top-down approach:

Start from a pre-selected layer, typically one with a small number of cells. The pre-selected layer does not have to be the topmost layer.

For each cell in the current layer, compute the confidence interval (or estimated range of probability) reflecting the cell's relevance to the given query.

Page 71: CSE 634 Data Mining Techniques

Query Processing (Cont.)

The confidence interval is calculated by using the statistical parameters of each cell.

Remove irrelevant cells from further consideration.

When finished with the current layer, proceed to the next lower level.

Processing of the next lower level examines only the remaining relevant cells.

Repeat this process until the bottom layer is reached.
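A schematic sketch of this top-down loop. The cell layout and the relevance test are placeholders; a real implementation would derive relevance from each cell's stored statistics and the query's confidence interval:

```python
# Top-down query processing: keep only relevant cells at each layer and descend
# into their children until the bottom layer is reached.
def answer_query(start_layer_cells, relevant, children_of):
    layer = [c for c in start_layer_cells if relevant(c)]
    while True:
        next_layer = [child for c in layer for child in children_of(c)]
        if not next_layer:        # bottom layer reached
            return layer          # the surviving cells answer the query
        layer = [c for c in next_layer if relevant(c)]

# Toy two-level grid: cells are (level, index) pairs; "relevant" here just
# means the cell's hypothetical count statistic is non-zero.
counts = {("L1", 0): 40, ("L1", 1): 0, ("L2", 0): 15, ("L2", 1): 25}
kids = {("L1", 0): [("L2", 0), ("L2", 1)], ("L1", 1): []}
print(answer_query([("L1", 0), ("L1", 1)],
                   relevant=lambda c: counts.get(c, 0) > 0,
                   children_of=lambda c: kids.get(c, [])))
# -> [('L2', 0), ('L2', 1)]
```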

Page 72: CSE 634 Data Mining Techniques

Different Grid Levels during Query Processing.

Page 73: CSE 634 Data Mining Techniques

Sample Query Example

Assume that the spatial area is the map of the regions of Long Island, Brooklyn, and Queens. Our records represent apartments located throughout this region.

Query: "Find all the apartments that are for rent near Stony Brook University and have a rent in the range $800 to $1000."

The query depends upon the parameter "near." For our example, near means within 15 miles of Stony Brook University.

Page 74: CSE 634 Data Mining Techniques

Advantages and Disadvantages of STING

ADVANTAGES:
Very efficient: the computational complexity is O(k), where k is the number of grid cells at the lowest level. Usually k << N, where N is the number of records.
STING is a query-independent approach, since the statistical information exists independently of queries.
Incremental update.

DISADVANTAGES:
All cluster boundaries are either horizontal or vertical; no diagonal boundary is selected.

Page 75: CSE 634 Data Mining Techniques

Thank you !