
DBSCAN



Data Mining

Presented By: Sunawar Khan
Reg No: 813-MSCS-F14

Clustering

• Clustering is the process of partitioning a set of data objects into meaningful subclasses, called clusters.

• A cluster is a collection of objects that are similar to one another.

• Unsupervised classification (no predefined classes).

Example

Clustering Algorithms

• Clustering algorithms are attractive for the task of class identification. The main families are:

1. Partitioning Methods
2. Hierarchical Methods
3. Density-Based Methods
4. Grid-Based Methods
5. Model-Based Methods

Density-Based Methods

• Based on the notion of density.

• A density-based clustering algorithm grows regions with sufficiently high density into clusters.

• The idea is to keep growing a given cluster as long as the density (number of data points) in its neighborhood exceeds some threshold; that is, the neighborhood of a given radius has to contain at least a minimum number of objects.

• Such methods can discover clusters of arbitrary shape and handle noise.

Density-Based Methods

• Clustering based on density (local cluster criterion), such as density-connected points

• Major features:
  – Discover clusters of arbitrary shape
  – Handle noise
  – One scan
  – Need density parameters as a termination condition

• Several interesting studies:
  – DBSCAN: Ester et al. (KDD'96)
  – OPTICS: Ankerst et al. (SIGMOD'99)
  – DENCLUE: Hinneburg & Keim (KDD'98)
  – CLIQUE: Agrawal et al. (SIGMOD'98) (more grid-based)


Density-Based Notion of Clusters

• Def. 1 (Eps-neighborhood of a point)

• The Eps-neighborhood of a point p, denoted by N_Eps(p), is defined as:

N_Eps(p) = { q ∈ D | dist(p, q) ≤ Eps }

• A naive approach could require for each point in a cluster that there are at least a minimum number (MinPts) of points in an Eps-neighborhood of that point.
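
To make Def. 1 and the MinPts condition concrete, here is a minimal Python sketch (the names region_query and is_core are ours, and Euclidean distance is assumed):

```python
import numpy as np

def region_query(D, p, eps):
    """Eps-neighborhood of point p: indices of all points in the
    (n, d) array D within Euclidean distance eps of D[p]."""
    return np.where(np.linalg.norm(D - D[p], axis=1) <= eps)[0]

def is_core(D, p, eps, min_pts):
    """Core point condition: |N_Eps(p)| >= MinPts.
    Note that p counts as a member of its own neighborhood."""
    return len(region_query(D, p, eps)) >= min_pts
```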

Def. 2 (directly density-reachable)

• A point p is directly density-reachable from a point q wrt. Eps and MinPts if

• 1) p ∈ N_Eps(q), and
• 2) |N_Eps(q)| ≥ MinPts (core point condition).

• Def. 3 (density-reachable)
• A point p is density-reachable from a point q wrt. Eps and MinPts if there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that p_(i+1) is directly density-reachable from p_i.

• Def. 4 (density-connected)
• A point p is density-connected to a point q wrt. Eps and MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts. Density-connectivity is a symmetric relation. We can now define the density-based notion of a cluster: a cluster is a set of density-connected points which is maximal wrt. density-reachability. Noise is simply the set of points in D not belonging to any of its clusters.
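
Def. 2 transcribes directly into code; the following sketch (names ours, Euclidean distance assumed) also makes the asymmetry visible, since the core point condition is checked on q, not on p:

```python
import numpy as np

def directly_density_reachable(D, p, q, eps, min_pts):
    """True iff point p is directly density-reachable from point q
    wrt. eps and min_pts (Def. 2): p must lie in q's Eps-neighborhood
    and q must satisfy the core point condition."""
    n_q = np.where(np.linalg.norm(D - D[q], axis=1) <= eps)[0]
    return p in n_q and len(n_q) >= min_pts
```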

Def. 5 (Cluster) Let D be a database of points. A cluster C wrt. Eps and MinPts is a non-empty subset of D satisfying the following conditions:

1) ∀ p, q: if p ∈ C and q is density-reachable from p wrt. Eps and MinPts, then q ∈ C. (Maximality)

2) ∀ p, q ∈ C: p is density-connected to q wrt. Eps and MinPts. (Connectivity)

Def. 6 (Noise) Let C_1, ..., C_k be the clusters of the database D wrt. parameters Eps_i and MinPts_i, i = 1, ..., k. Then we define the noise as the set of points in D not belonging to any cluster C_i, i.e. noise = { p ∈ D | ∀ i: p ∉ C_i }.

Lemmas for validating the correctness of the clustering algorithm

Lemma 1: Let p be a point in D with |N_Eps(p)| ≥ MinPts. Then the set O = { o | o ∈ D and o is density-reachable from p wrt. Eps and MinPts } is a cluster wrt. Eps and MinPts.

• It is not obvious that a cluster C wrt. Eps and MinPts is uniquely determined by any of its core points. However, each point in C is density-reachable from any of the core points of C and, therefore, a cluster C contains exactly the points which are density-reachable from an arbitrary core point of C.

Lemmas for validating the correctness of the clustering algorithm

Lemma 2:
• Let C be a cluster wrt. Eps and MinPts and let p be any point in C with |N_Eps(p)| ≥ MinPts.

• Then C equals the set O = { o | o is density-reachable from p wrt. Eps and MinPts }.

Algorithm

• Arbitrarily select a point p.

• Retrieve all points density-reachable from p wrt. Eps and MinPts.

• If p is a core point, a cluster is formed.

• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.

• Continue the process until all of the points have been processed (a runnable sketch follows below).

• If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects. Otherwise, the complexity is O(n²).
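
The steps above translate almost line-for-line into code. Below is a self-contained Python sketch (names and label conventions are ours; it uses the naive O(n) region query, so the overall runtime is O(n²)):

```python
import numpy as np

UNVISITED, NOISE = 0, -1

def region_query(D, p, eps):
    """Eps-neighborhood of point p (naive linear scan)."""
    return np.where(np.linalg.norm(D - D[p], axis=1) <= eps)[0]

def dbscan(D, eps, min_pts):
    """Label every point with a cluster id (1, 2, ...) or NOISE (-1)."""
    labels = np.full(len(D), UNVISITED, dtype=int)
    cluster_id = 0
    for p in range(len(D)):
        if labels[p] != UNVISITED:
            continue                       # already assigned or noise
        neighbors = region_query(D, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE              # may become a border point later
            continue
        cluster_id += 1                    # p is a core point: new cluster
        labels[p] = cluster_id
        seeds = list(neighbors)
        while seeds:                       # grow the cluster outward
            q = seeds.pop()
            if labels[q] == NOISE:
                labels[q] = cluster_id     # noise reached here = border point
            if labels[q] != UNVISITED:
                continue
            labels[q] = cluster_id
            q_neighbors = region_query(D, q, eps)
            if len(q_neighbors) >= min_pts:
                seeds.extend(q_neighbors)  # only core points expand further
    return labels
```

Calling dbscan(X, eps=0.5, min_pts=5) on an (n, d) NumPy array X returns one integer label per row, with clusters numbered from 1 and -1 marking noise.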

Comparisons (DBSCAN vs. CLARANS)

• Here the DBSCAN algorithm is compared to another clustering algorithm, CLARANS (Clustering Large Applications based on RANdomized Search).

• It is an improvement of the k-medoid algorithms.

• Compared to k-medoid methods, CLARANS works efficiently for databases of about a thousand objects. When the database grows larger, CLARANS falls behind because the algorithm temporarily stores all the objects in main memory, i.e. the run time increases.

Complexity

• DBSCAN visits each point of the database, possibly multiple times. For practical purposes, the time complexity is mostly governed by the number of regionQuery invocations. DBSCAN executes exactly one such query for each point, and if an indexing structure is used that answers such a neighborhood query in O(log n), an overall runtime complexity of O(n log n) is obtained.

• Without an accelerating index structure, the run time complexity is O(n²). Often the distance matrix of size (n² − n)/2 is materialized to avoid distance recomputations; this, however, also needs O(n²) memory, whereas a non-matrix-based implementation only needs O(n) memory.
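
To make the index-accelerated case concrete, here is a short sketch using scikit-learn's NearestNeighbors as the spatial index (our choice for illustration; any structure supporting radius queries, such as an R*-tree, would serve the same role):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 2))          # example data

# Build a k-d tree once; each subsequent regionQuery then costs roughly
# O(log n) for low-dimensional data instead of a full O(n) scan.
index = NearestNeighbors(radius=0.1, algorithm="kd_tree").fit(X)

# Eps-neighborhoods of all points in one call; neighborhoods[i] holds
# the indices of the points within radius 0.1 of X[i].
neighborhoods = index.radius_neighbors(X, return_distance=False)
print(len(neighborhoods[0]))              # size of the first neighborhood
```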

Advantages

• DBSCAN does not require one to specify the number of clusters in the data a priori, as opposed to k-means.

• DBSCAN can find arbitrarily shaped clusters.

• DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database.

• DBSCAN has a notion of noise, and is robust to outliers.

• DBSCAN is designed for use with databases that can accelerate region queries, e.g. using an R*-tree.
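
Because the algorithm is driven by just those two parameters, using an off-the-shelf implementation is brief. A minimal example with scikit-learn (the data and parameter values are illustrative only):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [1.1, 1.0],   # one dense group
              [8.0, 8.1], [8.2, 7.9], [8.1, 8.0],   # another dense group
              [25.0, 25.0]])                        # an isolated outlier

# eps is the neighborhood radius, min_samples the MinPts threshold.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # [0 0 0 1 1 1 -1]; -1 marks noise
```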

Disadvantages

• DBSCAN is not entirely deterministic: border points that are reachable from more than one cluster can end up in either cluster, depending on the order in which the data is processed. Fortunately, this situation does not arise often and has little impact on the clustering result: on core points and noise points, DBSCAN is deterministic.

• The quality of DBSCAN depends on the distance measure used in the function regionQuery(P, ε). The most common distance metric is the Euclidean distance; especially for high-dimensional data it can be difficult to find an appropriate value for ε. This effect, however, is also present in any other algorithm based on Euclidean distance.

• DBSCAN cannot cluster data sets with large differences in densities well, since a single ε-MinPts combination cannot then be chosen appropriately for all clusters.

Extensions

• Generalized DBSCAN (GDBSCAN) is a generalization by the same authors to arbitrary "neighborhood" and "dense" predicates.

• Various extensions to the DBSCAN algorithm have been proposed, including methods for parallelization, parameter estimation, and support for uncertain data. The basic idea has been extended to hierarchical clustering by the OPTICS algorithm.

• HDBSCAN is a hierarchical version of DBSCAN which is also faster than OPTICS, and from which a flat partition consisting of the most prominent clusters can be extracted.
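
For completeness, a short usage sketch of HDBSCAN, assuming scikit-learn 1.3 or newer (which ships an HDBSCAN estimator; the standalone hdbscan package offers a similar interface):

```python
import numpy as np
from sklearn.cluster import HDBSCAN   # available since scikit-learn 1.3

rng = np.random.default_rng(0)
# Two blobs of different densities plus uniform background noise,
# the kind of data where a single global eps would struggle.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(100, 2)),
    rng.normal(loc=5.0, scale=1.0, size=(100, 2)),
    rng.uniform(low=-5, high=10, size=(30, 2)),
])

# No global eps; min_cluster_size sets the smallest grouping to report,
# and -1 again marks noise.
labels = HDBSCAN(min_cluster_size=10).fit_predict(X)
print(np.unique(labels))
```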