Clustering Methods Professor: Dr. Mansouri Presented by: Muhammad Abouei & Mohsen Ghahremani Manesh

Clustering Methods


Page 1: Clustering Methods

Clustering Methods

Professor: Dr. Mansouri

Presented by: Muhammad Abouei & Mohsen Ghahremani Manesh

Page 2: Clustering Methods

2

Clustering Methods

Density-Based Clustering Methods

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

OPTICS (Ordering Points To Identify the Clustering Structure)

DENCLUE (DENsity-based CLUstEring)

Grid-based Clustering

Page 3: Clustering Methods

3

Density Based Clustering

Page 4: Clustering Methods

4

DBSCAN Concepts

ε-neighborhood: the points within distance ε (the radius) of a point.

MinPts: the minimum number of points required in the ε-neighborhood of a point.

ε-neighborhood of q

ε-neighborhood of p

MinPts = 5

where ε and MinPts are user-defined parameters.

Page 5: Clustering Methods

5

DBSCAN Concepts

Density : number of points within a specified radius (ε)

Density(p)=5
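The density of a point can be computed directly from the definition. The following is a minimal sketch, assuming Euclidean distance and assuming that a point counts as a member of its own ε-neighborhood; the function names and sample data are illustrative only.

```python
import math

def eps_neighborhood(points, p, eps):
    """Return every point within distance eps of p (p itself included)."""
    return [q for q in points if math.dist(p, q) <= eps]

def density(points, p, eps):
    """Density of p: the number of points in its eps-neighborhood."""
    return len(eps_neighborhood(points, p, eps))

# Hypothetical sample data: four points around the origin and one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5), (3.0, 3.0)]

print(density(points, (0.0, 0.0), eps=0.9))  # 5
print(density(points, (3.0, 3.0), eps=0.9))  # 1
```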

Page 6: Clustering Methods

6

DBSCAN Concepts

Core point: A point is a core point if it has at least a specified number of points (MinPts) within ε, counting the point itself. These are points in the interior of a cluster.

ε-neighborhood of q

ε-neighborhood of p

p is a core point (MinPts = 5)

q is not a core point.

Page 7: Clustering Methods

7

DBSCAN Concepts

Directly density-reachable : point p is directly density-reachable from a point q w.r.t. ε , MinPts if

1. p belongs to ε -neighborhood of q,

2. q is a core point,

MinPts = 4

p is DDR from q.

q is not DDR from p!

DDR is not a symmetric relation in general (it is symmetric only between two core points).
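The two conditions for direct density-reachability, and the fact that the relation need not hold in both directions, can be checked with a short sketch. It assumes Euclidean distance and that a point counts toward its own neighborhood; the helper names and the data are hypothetical.

```python
import math

def is_core(points, p, eps, min_pts):
    """p is a core point if its eps-neighborhood holds at least min_pts points."""
    return sum(1 for q in points if math.dist(p, q) <= eps) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p is directly density-reachable from q."""
    return math.dist(p, q) <= eps and is_core(points, q, eps, min_pts)

# Hypothetical data: q sits in a dense group, p lies on its edge.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (1.4, 0.0)]
q, p = (0.5, 0.0), (1.4, 0.0)

print(directly_density_reachable(points, p, q, eps=1.1, min_pts=4))  # True
print(directly_density_reachable(points, q, p, eps=1.1, min_pts=4))  # False
```

Here q is a core point and p is not, so p is DDR from q while q is not DDR from p.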

Page 8: Clustering Methods

8

DBSCAN Concepts

Density-reachable: A point p is density-reachable from a point q w.r.t. ε, MinPts if there is a chain of points P1, …, Pn, with P1 = q and Pn = p, such that Pi+1 is directly density-reachable from Pi.

Or, point p is density-reachable from q if there is a path (chain of points) from q to p in which every point, except possibly p itself, is a core point.

MinPts = 4

p is DR from q.

q is not DR from p!

p is not core.

DR is not a symmetric relation in general (it is symmetric only between core points).

Page 9: Clustering Methods

9

DBSCAN Concepts

Density-connectivity: point p is density-connected to point q w.r.t. ε , MinPts if there is a point r such that both, p and q are density-reachable from r w.r.t. ε and MinPts.

MinPts = 4

p and q are density-connected.

DC is a symmetric relation.

Page 10: Clustering Methods

10

DBSCAN Concepts

Border point: A border point has fewer than MinPts points within ε, but lies in the ε-neighborhood of a core point.

MinPts =5

ε = circle radius

Page 11: Clustering Methods

11

DBSCAN Concepts

Noise (outlier) point: any point that is neither a core point nor a border point.

MinPts =5

ε = circle radius
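The three point types can be assigned in a single pass once the neighborhoods are known. This is a minimal sketch under the assumptions that distance is Euclidean and that a point counts toward its own ε-neighborhood; `classify_points` and the sample data are illustrative names, not part of the original slides.

```python
import math

def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    # Indices of all points within eps of each point (the point itself included).
    neighbors = [[j for j, q in enumerate(points) if math.dist(p, q) <= eps]
                 for p in points]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")   # not core, but near a core point
        else:
            labels.append("noise")
    return labels

# Hypothetical data: a dense group, one point on its edge, one isolated point.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5),
          (1.4, 0.0), (5.0, 5.0)]
labels = classify_points(points, eps=1.1, min_pts=5)
print(labels)
```

With these values the five grouped points come out as core, (1.4, 0.0) as border, and (5.0, 5.0) as noise.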

Page 12: Clustering Methods

12

DBSCAN Concepts

DBSCAN relies on a density-based notion of cluster.

Cluster: a cluster C is a non-empty set of density-connected points that is maximal w.r.t. density-reachability.

Maximality: for all p, q: if q ∈ C and p is density-reachable from q w.r.t. ε and MinPts, then also p ∈ C.

MinPts = 3

ε = circle radius

Page 13: Clustering Methods

13

DBSCAN Algorithm

1. Arbitrarily select a point p.

2. Retrieve all points density-reachable from p w.r.t. ε and MinPts.

3. If p is a core point, a cluster is formed.

4. If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.

5. Continue the process until all of the points have been processed.
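The procedure above can be sketched in a few lines. This is an illustrative brute-force version, not a reference implementation: it assumes Euclidean distance, precomputes all neighborhoods in O(n²) without a spatial index, counts each point in its own neighborhood, and assigns a border point to the first cluster that reaches it. The names `dbscan` and `NOISE` are assumptions of this sketch.

```python
import math
from collections import deque

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    n = len(points)
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n              # None means "not visited yet"
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE        # may be relabelled as a border point later
            continue
        # i is a core point: start a new cluster and expand it.
        labels[i] = cluster_id
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id       # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])   # j is core, so keep expanding
        cluster_id += 1
    return labels

# Hypothetical data: two square groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, min_pts=4))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```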

Page 14: Clustering Methods

14

DBSCAN

MinPts = 4

Page 15: Clustering Methods

15

DBSCAN

DBSCAN is Sensitive to Parameters. MinPts = 4

Page 16: Clustering Methods

16

DBSCAN

Core, Border and Noise Points: MinPts = 4, ε = 10

Figure panels: original points; point types (core, border, and noise)

Page 17: Clustering Methods

17

DBSCAN

When DBSCAN works well:

Resistant to noise

Can handle clusters of different shapes and sizes

Original Points Clusters

Page 18: Clustering Methods

18

DBSCAN

When DBSCAN does not work well:

Varying densities

High-dimensional data

Page 19: Clustering Methods

19

DBSCAN Complexity

If a spatial index (e.g., kd-tree, R*-tree) is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects. Otherwise, it is O(n²).

Page 20: Clustering Methods

20

OPTICS

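Core distance of an object p: the smallest distance ε′ ≤ ε that makes p a core object. If p is not a core object w.r.t. ε and MinPts, its core distance is undefined.

In the figure, the core distance ε′ of p is the distance between p and its 4th nearest neighbor (p itself counts toward MinPts).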

MinPts = 5

ε = 3 cm

Page 21: Clustering Methods

21

OPTICS

Reachability distance of r w.r.t. p: the greater of the core distance of p and the Euclidean distance between p and r. If p is not a core object, the reachability distance of r w.r.t. p is undefined.

reachability-distance ε, MinPts(p, r) = ε′

reachability-distance ε, MinPts(p, r′) = d(p, r′ )

MinPts = 5

ε = 3 cm
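Both OPTICS distances follow directly from their definitions. The sketch below assumes Euclidean distance, assumes that p is one of the stored points and counts toward MinPts (so with MinPts = 5 the core distance is the distance to the 4th nearest neighbor), and returns `None` where the distance is undefined. The function names and sample data are hypothetical.

```python
import math

def core_distance(points, p, eps, min_pts):
    """Distance from p to its min_pts-th closest point (p included), or None."""
    dists = sorted(math.dist(p, q) for q in points)
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None                      # p is not a core object w.r.t. eps
    return dists[min_pts - 1]

def reachability_distance(points, p, r, eps, min_pts):
    """Reachability distance of r w.r.t. p, or None if p is not a core object."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(p, r))

# Hypothetical data with p at the origin.
points = [(0, 0), (1, 0), (0, 1), (2, 0), (0, 2), (3, 0), (20, 20)]
p = (0, 0)

print(core_distance(points, p, eps=3, min_pts=5))                  # 2.0
print(reachability_distance(points, p, (1, 0), eps=3, min_pts=5))  # 2.0
print(reachability_distance(points, p, (3, 0), eps=3, min_pts=5))  # 3.0
print(core_distance(points, (20, 20), eps=3, min_pts=5))           # None
```

A point closer to p than the core distance gets the core distance as its reachability distance; a point farther away gets its actual distance.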

Page 22: Clustering Methods

22

OPTICS

Page 32: Clustering Methods

32

OPTICS

Color image segmentation using density-based clustering

Page 33: Clustering Methods

33

DENCLUE

DENCLUE (DENsity-based CLUstEring)

Major features

Solid mathematical foundation

Good for data sets with large amounts of noise

Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets

Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)

But needs a large number of parameters

Page 34: Clustering Methods

34

DENCLUE

Technical Essence

Uses grid cells, but only keeps information about grid cells that actually contain data points, and manages these cells in a tree-based access structure.

Page 35: Clustering Methods

35

DENCLUE

Technical Essence

DENCLUE is based on the following concepts:

Influence function

Density function

Density attractors.

Page 36: Clustering Methods

36

DENCLUE

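Influence function: The influence function $f^{y}(x)$ of a point $y$ (in the data space) at a point $x$ is a positive function that decays to zero as $x$ “moves away” from $y$.

Typical examples are:

$$f^{y}(x) = \begin{cases} 1, & d(x, y) < \sigma \\ 0, & \text{otherwise} \end{cases}$$

and

$$f^{y}(x) = \exp\left(-\frac{d(x, y)^{2}}{2\sigma^{2}}\right)$$

where σ is a user-defined parameter and d(x, y) is the distance between x and y.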
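Two influence functions commonly used with DENCLUE are the square wave and the Gaussian. The sketch below assumes those two forms and Euclidean distance; the function names are illustrative.

```python
import math

def square_wave_influence(x, y, sigma):
    """1 inside the ball of radius sigma around y, 0 outside."""
    return 1.0 if math.dist(x, y) < sigma else 0.0

def gaussian_influence(x, y, sigma):
    """Gaussian bump centred on y with width sigma."""
    d = math.dist(x, y)
    return math.exp(-(d * d) / (2.0 * sigma * sigma))

y = (0.0, 0.0)
print(square_wave_influence((0.5, 0.0), y, sigma=1.0))  # 1.0
print(square_wave_influence((3.0, 0.0), y, sigma=1.0))  # 0.0
print(gaussian_influence((0.0, 0.0), y, sigma=1.0))     # 1.0
```

Both functions are largest at y and fall off with distance; the Gaussian does so smoothly, which is what makes gradient-based hill climbing possible.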

Page 37: Clustering Methods

37

DENCLUE

Density function: The density function at x based on a data set of N points, D = {x1, …, xN}, is defined as the sum of the influence functions of all data points at x:

$$f^{D}(x) = \sum_{i=1}^{N} f^{x_i}(x)$$

The goal of the definition:

Identify all “significant” local maxima xj*, j = 1, …, m, of f^D(x).

Create a cluster Cj for each xj* and assign to Cj all points of D that lie within the “region of attraction” of xj*.
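The density function is just a sum of influences, one per data point. The following sketch assumes a Gaussian influence function and Euclidean distance; `gaussian_density` and the sample data are hypothetical.

```python
import math

def gaussian_density(x, data, sigma):
    """Sum of the Gaussian influences of all data points at x."""
    return sum(math.exp(-math.dist(x, xi) ** 2 / (2.0 * sigma ** 2))
               for xi in data)

# Hypothetical data: three nearby points and one isolated point.
data = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (5.0, 5.0)]

inside = gaussian_density((0.2, 0.2), data, sigma=1.0)
isolated = gaussian_density((5.0, 5.0), data, sigma=1.0)
print(inside > isolated)  # True
```

A location surrounded by several points receives a higher density than a location near a single isolated point, which is why local maxima of the density mark cluster centers.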

Page 38: Clustering Methods

38

DENCLUE

Example: Density Computation

D={x1,x2,x3,x4}

f^D_Gaussian(x) = influence(x1) + influence(x2) + influence(x3) + influence(x4)

= 0.04 + 0.06 + 0.08 + 0.6 = 0.78

Remark: the density value of y would be larger than the one for x.

Page 39: Clustering Methods

39

DENCLUE

Density attractors: Density attractors are local maxima of the overall density function f^D(x).

Clusters can then be determined mathematically by identifying density attractors.

A hill-climbing algorithm guided by the gradient can be used to determine the density attractor of a set of data points.
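A gradient-guided hill climb can be sketched as follows. The sketch assumes a Gaussian influence function, a fixed step length along the normalized gradient, and a stopping rule that halts as soon as a step no longer increases the density; `step`, `max_iter`, and the function names are assumptions of this example rather than values from the slides.

```python
import math

def density(x, data, sigma):
    """Gaussian density at x."""
    return sum(math.exp(-math.dist(x, xi) ** 2 / (2 * sigma ** 2)) for xi in data)

def gradient(x, data, sigma):
    """Gradient of the Gaussian density at x."""
    grad = [0.0] * len(x)
    for xi in data:
        w = math.exp(-math.dist(x, xi) ** 2 / (2 * sigma ** 2)) / sigma ** 2
        for k in range(len(x)):
            grad[k] += w * (xi[k] - x[k])
    return grad

def climb_to_attractor(x, data, sigma, step=0.05, max_iter=1000):
    """Follow the gradient uphill until the density stops increasing."""
    x = tuple(x)
    for _ in range(max_iter):
        g = gradient(x, data, sigma)
        norm = math.hypot(*g)
        if norm == 0.0:
            break
        candidate = tuple(xk + step * gk / norm for xk, gk in zip(x, g))
        if density(candidate, data, sigma) <= density(x, data, sigma):
            break                      # the last point is the approximate attractor
        x = candidate
    return x

# Hypothetical data: two small groups of points.
data = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0), (5.0, 5.2)]
a = climb_to_attractor((0.5, 0.5), data, sigma=0.5)
b = climb_to_attractor((4.5, 4.5), data, sigma=0.5)
print(a, b)
```

Starting points near different groups end at different attractors, so points can be grouped by the attractor their climb reaches. The result is accurate only to within roughly one step length.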

Page 40: Clustering Methods

40

DENCLUE

Density-attracted: A point x is density-attracted to a density attractor x* if there exists a sequence of points x0, x1, …, xk such that x0 = x, xk = x*, and the gradient of the density function at xi−1 points in the direction of xi for 0 < i < k.

Page 41: Clustering Methods

41

DENCLUE

Center-defined cluster: A center-defined cluster (w.r.t. σ, ε) for a density attractor x* is a subset C ⊆ D, with every x ∈ C being density-attracted by x* and f^D(x*) ≥ ε.

Outlier: A point x ∈ D is called an outlier if it is density-attracted by a local maximum xo* with f^D(xo*) < ε.

Page 42: Clustering Methods

42

DENCLUE

Multicenter-defined clusters: a set of center-defined clusters linked by a path of significance.

Page 43: Clustering Methods

43

DENCLUE

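Arbitrary-shape cluster: An arbitrary-shape cluster (w.r.t. σ, ε) for a set of density attractors X is a subset C ⊆ D, where

1. ∀x ∈ C ∃x* ∈ X such that f^D(x*) ≥ ε and x is density-attracted to x*, and

2. ∀x1*, x2* ∈ X there exists a path P from x1* to x2* with f^D(p) ≥ ε for every p ∈ P.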

Page 44: Clustering Methods

44

DENCLUE

Note that the number of clusters found by DENCLUE varies depending on σ and ε.

Page 45: Clustering Methods

45

DENCLUE

DENCLUE is able to detect arbitrarily shaped clusters.

The algorithm handles noise very satisfactorily.

The worst-case time complexity of DENCLUE is O(N log₂ N).

Experimental results indicate that the average time complexity is O(log₂ N).

It works efficiently with high-dimensional data.

DENCLUE needs at least 3 parameters to be determined, i.e.

σ, .

Page 46: Clustering Methods

46

Grid-based

Uses a multi-resolution grid data structure.

Clustering complexity depends on the number of populated grid cells and not on the number of objects in the dataset.

Several interesting methods: CS Tree (Clustering Statistical Tree), STING, WaveCluster.

Page 47: Clustering Methods

47

Grid-based

Basic Grid-based Algorithm

1. Define a set of grid cells.

2. Assign objects to the appropriate grid cell and compute the density of each cell.

3. Eliminate cells whose density is below a certain threshold τ.

4. Form clusters from contiguous (adjacent) groups of dense cells (usually minimizing a given objective function).
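The four steps can be sketched for two-dimensional data as follows. The sketch makes several simplifying assumptions: square cells of a fixed `cell_size`, cell density measured as a raw point count, cells touching by edge or corner treated as adjacent, and clusters formed as plain connected components with no objective function. All names are illustrative.

```python
import math
from collections import defaultdict, deque

def grid_cluster(points, cell_size, tau):
    """Cluster 2-D points by merging adjacent dense grid cells; -1 marks unclustered."""
    # Steps 1 and 2: map each point to its cell and collect the members per cell.
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(math.floor(x / cell_size), math.floor(y / cell_size))].append(idx)
    # Step 3: keep only the cells holding at least tau points.
    dense = {key for key, members in cells.items() if len(members) >= tau}
    # Step 4: connected components over adjacent dense cells.
    labels = [-1] * len(points)
    seen = set()
    cluster_id = 0
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for idx in cells[(cx, cy)]:
                labels[idx] = cluster_id
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        cluster_id += 1
    return labels

# Hypothetical data: two groups and one isolated point.
points = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (1.1, 0.2), (1.3, 0.4),
          (5.1, 5.2), (5.3, 5.4), (9.5, 0.5)]
print(grid_cluster(points, cell_size=1.0, tau=2))  # [0, 0, 0, 0, 0, 1, 1, -1]
```

The work after the initial assignment is done per populated cell, never per pair of points.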

Page 48: Clustering Methods

48

Grid-based

Fast: no distance computations.

Clustering is performed on summaries and not individual objects; complexity is usually O(no_of_populated_grid_cells) and not O(no_of_objects).

Easy to determine which clusters are neighboring.


Page 50: Clustering Methods

50

?