Clustering Algorithms for Numerical Data Sets


Page 1

Clustering Algorithms for Numerical Data Sets

Page 2

Contents

1. Data Clustering Introduction
2. Hierarchical Clustering Algorithms
3. Partitional Data Clustering Algorithms
   • K-means clustering
4. Density-Based Clustering Algorithms
   • Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Page 3

What is clustering?

• Clustering: the process of grouping a set of objects into classes of similar objects

• Most common form of unsupervised learning
• Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given

Page 4

Clustering

Page 5

Clustering

Page 6

Clustering Applications

• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

• Land use: Identification of areas of similar land use in an earth observation database

• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

• City-planning: Identifying groups of houses according to their house type, value, and geographical location

• Earthquake studies: observed earthquake epicenters should be clustered along continental faults


Page 7

Examples of Knowledge Extracted by Data Clustering

• For intelligent web search, data clustering can be conducted in advance on the terms contained in a set of training documents.

• Then, when the user submits a search term, the intelligent search engine can expand the query according to the term clusters.

Page 8

• For example, when the user submits “Federal Reserve Board”, the search engine automatically expands the query to include additional search terms as follows: {“Greenspan”, “FED”}.

• The search engine may further rank the documents retrieved based on their correlation to the search terms.

Page 9

Clustering – Reference matching

• Fahlman, Scott & Lebiere, Christian (1989). The cascade-correlation learning architecture. In Touretzky, D., editor, Advances in Neural Information Processing Systems (volume 2), (pp. 524-532), San Mateo, CA. Morgan Kaufmann.
• Fahlman, S.E. and Lebiere, C., “The Cascade Correlation Learning Architecture,” NIPS, Vol. 2, pp. 524-532, Morgan Kaufmann, 1990.
• Fahlman, S. E. (1991) The recurrent cascade-correlation learning architecture. In Lippman, R.P., Moody, J.E., and Touretzky, D.S., editors, NIPS 3, 190-205.


Page 11

Citation ranking

Page 12

Clustering: Navigation of search results

• For grouping search results thematically

– clusty.com / Vivisimo

Page 13

Clustering: Corpus browsing

(Figure: a slice of the www.yahoo.com/Science directory hierarchy, with top-level categories agriculture, biology, physics, CS, and space (30 in all), and subcategories such as dairy, crops, agronomy, forestry, botany, evolution, cell, magnetism, relativity, AI, HCI, craft, missions, and courses.)

Page 14

Clustering considerations

• What does it mean for objects to be similar?
• What algorithm and approach do we take?
  – Top-down: k-means
  – Bottom-up: hierarchical agglomerative clustering
• Do we need a hierarchical arrangement of clusters?
• How many clusters?
• Can we label or name the clusters?
• How do we make it efficient and scalable?

Page 15

Hierarchical Clustering

Dendrogram

Decompose the data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
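As a concrete sketch of this (assuming SciPy and a small synthetic data set; neither is prescribed by the slides), the dendrogram can be built with `linkage` and cut at a distance level with `fcluster`:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Small synthetic data set: three well-separated pairs of points
X = np.array([[1.0, 1.1], [1.2, 0.9],
              [5.0, 5.2], [5.1, 4.8],
              [9.0, 9.1], [8.8, 9.3]])

# Build the tree of clusters (dendrogram); single-link here
Z = linkage(X, method="single")

# Cut the dendrogram at distance 2.0: each connected component
# below the cut forms one cluster
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)  # three clusters, e.g. [1 1 2 2 3 3]
```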

Page 16

Hierarchical Clustering

• Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.

(Figure: taxonomy tree – animal splits into vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean).)

Page 17

Hierarchical Clustering algorithms

• Agglomerative (bottom-up):
  – Start with each instance being a single cluster.
  – Eventually all instances belong to the same cluster.
• Divisive (top-down):
  – Start with all instances belonging to the same cluster.
  – Eventually each instance forms a cluster on its own.
• Does not require the number of clusters k in advance.
• Needs a termination/readout condition.

Page 18

Hierarchical Agglomerative Clustering (HAC)

• Assumes a similarity function for determining the similarity of two instances.

• Starts with each instance in a separate cluster and then repeatedly joins the two clusters that are most similar, until there is only one cluster.

• The history of merging forms a binary tree or hierarchy.
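The loop itself is simple; below is a minimal, unoptimized from-scratch sketch (assuming similarity = negative Euclidean distance; the names `hac` and `single_link_sim` are illustrative, not from the slides):

```python
import numpy as np

def single_link_sim(X, ca, cb):
    # Single-link: similarity of the closest pair (negative distance)
    return max(-np.linalg.norm(X[i] - X[j]) for i in ca for j in cb)

def hac(X, sim=single_link_sim):
    clusters = [[i] for i in range(len(X))]  # each instance starts alone
    history = []                             # the merge history = the tree
    while len(clusters) > 1:
        # Find the most similar pair of clusters (O(n^2) per merge)
        _, a, b = max((sim(X, clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        history.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]  # join the two clusters
        del clusters[b]
    return history
```

Each entry of `history` records one merge, which is exactly the binary tree of merges described above.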

Page 19

Dendrogram: Hierarchical Clustering

• A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Page 20

Hierarchical Agglomerative Clustering (HAC)

• Starts with each doc in a separate cluster.
  – Then repeatedly joins the closest pair of clusters, until there is only one cluster.
• The history of merging forms a binary tree or hierarchy.

How do we measure the distance between clusters?

Page 21

Closest pair of clusters

Many variants of defining the closest pair of clusters:
• Single-link
  – Distance of the “closest” points
• Complete-link
  – Distance of the “furthest” points
• Centroid
  – Distance of the centroids (centers of gravity)
• (Average-link)
  – Average distance between pairs of elements
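In SciPy, for instance, these variants are selected by the `method` argument of `linkage` (a sketch; the library choice is an assumption, but these method names do exist in `scipy.cluster.hierarchy`):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.rand(20, 2)  # 20 random 2-D points

Z_single   = linkage(X, method="single")    # closest points
Z_complete = linkage(X, method="complete")  # furthest points
Z_centroid = linkage(X, method="centroid")  # distance of centroids (Euclidean)
Z_average  = linkage(X, method="average")   # average pairwise distance
```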

Page 22

Single Link Agglomerative Clustering

• Use maximum similarity of pairs:

$$\operatorname{sim}(c_i, c_j) = \max_{x \in c_i,\; y \in c_j} \operatorname{sim}(x, y)$$

• Can result in “straggly” (long and thin) clusters due to the chaining effect.

• After merging $c_i$ and $c_j$, the similarity of the resulting cluster to another cluster, $c_k$, is:

$$\operatorname{sim}\bigl((c_i \cup c_j), c_k\bigr) = \max\bigl(\operatorname{sim}(c_i, c_k),\, \operatorname{sim}(c_j, c_k)\bigr)$$

Page 23

Single Link Example

Page 24

Complete Link Agglomerative Clustering

• Use minimum similarity of pairs:

$$\operatorname{sim}(c_i, c_j) = \min_{x \in c_i,\; y \in c_j} \operatorname{sim}(x, y)$$

• Makes “tighter,” spherical clusters that are typically preferable.

• After merging $c_i$ and $c_j$, the similarity of the resulting cluster to another cluster, $c_k$, is:

$$\operatorname{sim}\bigl((c_i \cup c_j), c_k\bigr) = \min\bigl(\operatorname{sim}(c_i, c_k),\, \operatorname{sim}(c_j, c_k)\bigr)$$

(Figure: clusters $c_i$, $c_j$, $c_k$.)

Page 25

Complete Link Example

Page 26

Key notion: cluster representative

• We want a notion of a representative point in a cluster.
• The representative should be some sort of “typical” or central point in the cluster, e.g.,
  – the point inducing the smallest radius to the docs in the cluster
  – the point with the smallest squared distances, etc.
  – the point that is the “average” of all docs in the cluster
• Centroid or center of gravity

Page 27

Centroid-based Similarity

• Always maintain the average of the vectors in each cluster:

$$s(c_j) = \frac{1}{|c_j|}\sum_{x \in c_j} \vec{x}$$

• Compute the similarity of two clusters by:

$$\operatorname{sim}(c_i, c_j) = \operatorname{sim}\bigl(s(c_i), s(c_j)\bigr)$$

• For non-vector data, we can’t always make a centroid.
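A small sketch of centroid-based similarity, assuming cosine similarity between centroids (a common choice for document vectors; the helper names are illustrative, not from the slides):

```python
import numpy as np

def centroid(cluster_vectors):
    # s(c): the average of the vectors in the cluster (rows of a 2-D array)
    return cluster_vectors.mean(axis=0)

def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

ci = np.array([[1.0, 0.0], [0.9, 0.1]])
cj = np.array([[0.7, 0.7], [0.6, 0.8]])

# sim(ci, cj) = sim(s(ci), s(cj))
print(cosine_sim(centroid(ci), centroid(cj)))
```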

Page 28

Partitioning Algorithms

• Partitioning method: construct a partition of n documents into a set of K clusters.
• Given: a set of documents and the number K.
• Find: a partition into K clusters that optimizes the chosen partitioning criterion.
  – Globally optimal: exhaustively enumerate all partitions.
  – Effective heuristic method: the K-means algorithm.

Page 29

K-Means

• Assumes instances are real-valued vectors.
• Clusters are based on the centroids (a.k.a. the center of gravity or mean) of the points in a cluster c:

$$\mu(c) = \frac{1}{|c|}\sum_{x \in c} \vec{x}$$

• Reassignment of instances to clusters is based on distance to the current cluster centroids.

Page 30

K-Means Algorithm

Select K random seeds s_1, …, s_K.
Until the clustering converges (or another stopping criterion is met):
  For each doc d_i:
    Assign d_i to the cluster c_j such that dist(d_i, s_j) is minimal.
  (Update the seeds to the centroid of each cluster. How? With the centroid formula μ(c) above.)
  For each cluster c_j:
    s_j = μ(c_j)
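A minimal NumPy sketch of this algorithm (assumptions: Euclidean distance, seeds drawn from the data, and stopping when the centroids stop moving; the name `kmeans` is illustrative):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Select K random data points as the initial seeds
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each doc to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each seed to the centroid (mean) of its cluster
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j)
             else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroid positions no longer change: converged
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```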

Page 31

K-Means Example (K = 2)

(Figure: pick seeds → reassign clusters → compute centroids (marked ×) → reassign clusters → compute centroids → reassign clusters → converged!)

Page 32

Termination conditions

• Several possibilities, e.g.,
  – A fixed number of iterations.
  – The partition is unchanged.
  – The centroid positions don’t change.

Does this mean that the instances in a cluster are unchanged?

Page 33

Convergence

• Why should the K-means algorithm ever reach a fixed point?
  – A state in which the clusters don’t change.
• K-means is a special case of a general procedure known as the Expectation-Maximization (EM) algorithm.
  – EM is known to converge.
  – Theoretically, the number of iterations could be large.
  – In practice, it typically converges quickly.

Page 34

K-means Clustering: Step 1

Decide on a value for k.

(Figure: 2-D scatter plot of the data, axes 0–5, with three centers k1, k2, k3.)

Page 35

K-means Clustering: Step 2

Initialize the k cluster centers.

(Figure: the same scatter plot with initial centers k1, k2, k3 placed among the points.)

Page 36

K-means Clustering: Step 3

Decide the class memberships of the N objects by assigning them to the nearest cluster center.

(Figure: each point assigned to its nearest center k1, k2, or k3.)

Page 37

K-means Clustering: Step 4

Re-estimate the k cluster centers, by assuming the memberships found above are correct.

(Figure: centers k1, k2, k3 moved to the means of their assigned points.)

Page 38

K-means Clustering: Step 5

If none of the N objects changed membership in the last iteration, exit. Otherwise go to Step 3.

(Figure: the converged clustering; y-axis labeled “expression in condition 2”, centers k1, k2, k3.)

Page 39

How Many Clusters?

• Number of clusters K is given
  – Partition n docs into a predetermined number of clusters
• Finding the “right” number of clusters is part of the problem
  – Given data, partition into an “appropriate” number of subsets
  – E.g., for query results, the ideal value of K is not known up front, though the UI may impose limits
• Can usually take an algorithm for one flavor and convert to the other

Page 40

K not specified in advance

• Say, the results of a query.
• Solve an optimization problem: penalize having lots of clusters
  – Application dependent, e.g., a compressed summary of the search-results list
• Trade-off between having more clusters (better focus within each cluster) and having too many clusters

Page 41

K not specified in advance

• Given a clustering, define the Benefit for a doc to be some inverse distance to its centroid

• Define the Total Benefit to be the sum of the individual doc Benefits.

Page 42

Penalize lots of clusters

• For each cluster, we have a Cost C.
• Thus, for a clustering with K clusters, the Total Cost is KC.
• Define the Value of a clustering to be:
  Value = Total Benefit − Total Cost.
• Find the clustering of highest Value, over all choices of K.
  – Total Benefit increases with increasing K, but we can stop when it doesn’t increase by “much”; the Cost term enforces this.
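A sketch of this selection criterion, reusing the `kmeans` sketch from earlier (assumptions: Benefit of a doc = negative Euclidean distance to its centroid, and a hypothetical per-cluster cost C):

```python
import numpy as np

def clustering_value(X, labels, centroids, C=5.0):
    # Total Benefit: sum over docs of an inverse (here: negative) distance
    total_benefit = -np.linalg.norm(X - centroids[labels], axis=1).sum()
    total_cost = C * len(centroids)  # Total Cost = K * C
    return total_benefit - total_cost

# Evaluate Value over candidate K and keep the best
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
scores = {k: clustering_value(X, *kmeans(X, k)) for k in range(1, 8)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```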

Page 43

Density-based Clustering

• Why density-based clustering methods?
• They discover clusters of arbitrary shape.
• Clusters are dense regions of objects separated by regions of low density.
• Examples:
  – DBSCAN – the first density-based clustering algorithm
  – OPTICS – density-based cluster ordering
  – DENCLUE – a general density-based description of clusters and clustering

Page 44

Density-Based Clustering

• Why density-based clustering?

(Figure: results of a k-medoid algorithm for k = 4.)

• Basic idea: clusters are dense regions in the data space, separated by regions of lower object density.
• Different density-based approaches exist; here we discuss the ideas underlying the DBSCAN algorithm.

Page 45

DBSCAN: Density Based Spatial Clustering of Applications with Noise

• Proposed by Ester, Kriegel, Sander, and Xu (KDD ’96).
• Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.
• Discovers clusters of arbitrary shape in spatial databases with noise.

Page 46

DBSCAN

Density-based clustering locates regions of high density that are separated from one another by regions of low density.
  – Density = number of points within a specified radius (Eps)

• DBSCAN is a density-based algorithm.
  – A point is a core point if it has more than a specified number of points (MinPts) within Eps.
    • These are points in the interior of a cluster.
  – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.

Page 47

DBSCAN

– A noise point is any point that is not a core point or a border point.
– Any two core points that are close enough – within a distance Eps of one another – are put in the same cluster.
– Any border point that is close enough to a core point is put in the same cluster as that core point.
– Noise points are discarded.

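A short sketch of these rules using scikit-learn's `DBSCAN` (a library assumption, since the slides use Weka in the exercises; in scikit-learn, label -1 marks noise points):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),    # dense blob 1
               rng.normal(5.0, 0.3, (50, 2)),    # dense blob 2
               rng.uniform(-2.0, 7.0, (5, 2))])  # scattered noise

db = DBSCAN(eps=0.5, min_samples=5).fit(X)       # Eps and MinPts
labels = db.labels_                              # -1 = noise (discarded)

core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True             # core points
border = (labels != -1) & ~core                  # in a cluster, but not core
```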

Page 48

Border & Core

(Figure: points labeled Core, Border, and Outlier; ε = 1 unit, MinPts = 5.)

Page 49

Concepts: ε-Neighborhood

• ε-Neighborhood – the objects within a radius of ε from an object (the “epsilon-neighborhood”).
• Core object – an object whose ε-neighborhood contains at least MinPts objects.

(Figure: the ε-neighborhoods of p and q; p is a core object (MinPts = 4), q is not a core object.)

Page 50

Concepts: Reachability

• Directly density-reachable:
  – An object q is directly density-reachable from an object p if q is within the ε-neighborhood of p and p is a core object.

(Figure: q is directly density-reachable from p. Is p directly density-reachable from q?)

Page 51

Concepts: Reachability

• Density-reachable:
  – An object p is density-reachable from q w.r.t. ε and MinPts if there is a chain of objects p_1, …, p_n, with p_1 = q and p_n = p, such that p_{i+1} is directly density-reachable from p_i w.r.t. ε and MinPts for all 1 ≤ i < n.
  – Density-reachability is the transitive closure of direct density-reachability, and it is asymmetric.

(Figure: q is density-reachable from p. Is p density-reachable from q?)
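Density-reachability can be checked by a breadth-first search that only expands from core objects; a from-scratch sketch (illustrative function names, Euclidean distance assumed):

```python
import numpy as np
from collections import deque

def eps_neighborhood(X, i, eps):
    return [j for j in range(len(X)) if np.linalg.norm(X[i] - X[j]) <= eps]

def is_core(X, i, eps, min_pts):
    return len(eps_neighborhood(X, i, eps)) >= min_pts

def density_reachable(X, p, q, eps, min_pts):
    """Is p density-reachable from q? Walk the chain q = p1, ..., pn = p,
    where each step is directly density-reachable (i.e., from a core object)."""
    seen, queue = {q}, deque([q])
    while queue:
        cur = queue.popleft()
        if not is_core(X, cur, eps, min_pts):
            continue  # only core objects can directly reach new objects
        for nxt in eps_neighborhood(X, cur, eps):
            if nxt == p:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

Note the asymmetry the slide points out: swapping p and q can change the answer when only one of the two is a core object.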

Page 52

Concepts: Connectivity

• Density-connectivity:
  – An object p is density-connected to an object q w.r.t. ε and MinPts if there is an object o such that both p and q are density-reachable from o w.r.t. ε and MinPts.

(Figure: p and q are density-connected to each other through r.)

• Density-connectivity is symmetric.

Page 53

Concepts: cluster & noise

• Cluster: a cluster C in a set of objects D w.r.t. ε and MinPts is a non-empty subset of D satisfying:
  – Maximality: for all p, q, if p ∈ C and q is density-reachable from p w.r.t. ε and MinPts, then also q ∈ C.
  – Connectivity: for all p, q ∈ C, p is density-connected to q w.r.t. ε and MinPts in D.
  – Note: a cluster contains core objects as well as border objects.
• Noise: the objects which are not directly density-reachable from at least one core object.

Page 54

(Indirectly) Density-reachable:

(Figure: q is density-reachable from p via an intermediate point p1.)

Density-connected:

(Figure: p and q are both density-reachable from o, hence density-connected.)

Page 55

DBSCAN: The Algorithm

– Select a point p.
– Retrieve all points density-reachable from p w.r.t. ε and MinPts.
– If p is a core point, a cluster is formed.
– If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
– Continue the process until all of the points have been processed.

The result is independent of the order in which the points are processed (up to the assignment of border points reachable from more than one cluster).

Page 56

An Example (MinPts = 4)

(Figure: DBSCAN grows a single cluster, labeled C1, across three successive snapshots.)

Page 57

DBSCAN: Determining Eps and MinPts

• The idea is that, for points in a cluster, the k-th nearest neighbors are at roughly the same distance.
• Noise points have their k-th nearest neighbor at a farther distance.
• So, plot the sorted distance of every point to its k-th nearest neighbor.

Page 58

DBSCAN: Determining Eps and MinPts

• The distance from a point to its k-th nearest neighbor is called the k-dist.
• For points that belong to some cluster, the value of k-dist will be small if k is not larger than the cluster size.
• For points that are not in a cluster, such as noise points, the k-dist will be relatively large.
• Compute the k-dist for all points for some k, sort the values in increasing order, and plot them.
• A sharp change in the plot occurs at the value of k-dist that corresponds to a suitable value of Eps, with that k as MinPts.
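A sketch of this k-dist plot with scikit-learn and matplotlib (the library choice is an assumption, not from the slides):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 4  # candidate MinPts
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (100, 2)),
               rng.normal(5.0, 0.3, (100, 2)),
               rng.uniform(-2.0, 7.0, (10, 2))])  # blobs plus noise

# k + 1 neighbors, because each point is its own 0-th neighbor
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, k])  # distance to the k-th nearest neighbor

plt.plot(k_dist)  # the sharp bend suggests Eps, with this k as MinPts
plt.xlabel("points sorted by k-dist")
plt.ylabel(f"{k}-dist")
plt.show()
```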

Page 59

DBSCAN: Determining Eps and MinPts

• A sharp change occurs at the value of k-dist that corresponds to a suitable value of Eps, with that k as MinPts.
  – Points for which k-dist is less than Eps will be labeled as core points, while other points will be labeled as noise or border points.
• If k is too large, small clusters (of size less than k) are likely to be labeled as noise.
• If k is too small, even a small number of closely spaced points that are noise or outliers will be incorrectly labeled as clusters.

Page 60

Clusters Identified by the DBSCAN Algorithm

• A density-based cluster is a set of density-connected objects that is maximal with respect to density-reachability.

• An object not contained in any cluster is considered to be noise.

Page 61

On Class Exercise 1

• Data
  – Iris.arff and your own data (if applicable)
• Method
  – Hierarchical algorithms
  – Parameter: number of clusters = 3
• Software
  – Weka 3.7.3
• Steps
  – Explorer -> Cluster -> Clusterer (HierarchicalClusterer)

Page 62

On Class Exercise 2

• Data
  – Iris.arff and your own data (if applicable)
• Method
  – K-means
  – Parameter: number of clusters = 3
• Software
  – Weka 3.7.3
• Steps
  – Explorer -> Cluster -> Clusterer (SimpleKMeans)

Page 63

On Class Exercise 3

• Data
  – Iris.arff and your own data (if applicable)
• Method
  – DBSCAN
  – Parameter: number of clusters = 3
• Software
  – Weka 3.7.3
• Steps
  – Explorer -> Cluster -> Clusterer (DBScan)