Clustering Algorithms for Numerical Data Sets
Contents
1. Data Clustering Introduction
2. Hierarchical Clustering Algorithms
3. Partitional Data Clustering Algorithms
   • K-means clustering
4. Density-based Clustering Algorithms
   • Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
What is clustering?
• Clustering: the process of grouping a set of objects into classes of similar objects
• The most common form of unsupervised learning
• Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of the examples is given
Clustering
Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• City-planning: Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should cluster along continental faults
Introduction
Examples of Knowledge Extracted by Data Clustering
• For intelligent web search, data clustering can be conducted in advance on the terms contained in a set of training documents.
• When the user submits a search term, the intelligent search engine can expand the query according to the term clusters
• For example, when the user submits “Federal Reserve Board”, the search engine automatically expands the query to include additional search terms such as {“Greenspan”, “FED”}.
• The search engine may further rank the documents retrieved based on their correlation to the search terms.
Clustering – Reference matching
• Fahlman, Scott & Lebiere, Christian (1989). The cascade-correlation learning architecture. In Touretzky, D., editor, Advances in Neural Information Processing Systems (volume 2), (pp. 524-532), San Mateo, CA. Morgan Kaufmann.
• Fahlman, S.E. and Lebiere, C., “The Cascade Correlation Learning Architecture,” NIPS, Vol. 2, pp. 524-532, Morgan Kaufmann, 1990.
• Fahlman, S. E. (1991) The recurrent cascade-correlation learning architecture. In Lippman, R.P. Moody, J.E., and Touretzky, D.S., editors, NIPS 3, 190-205.
Citation ranking
Clustering: Navigation of search results
• For grouping search results thematically
– clusty.com / Vivisimo
Clustering: Corpus browsing
[Figure: a slice of the www.yahoo.com/Science directory hierarchy — top-level categories such as agriculture, biology, physics, CS, and space, with subcategories like dairy, crops, agronomy, forestry, AI, HCI, craft, missions, botany, cell, evolution, magnetism, relativity, and courses.]
Clustering considerations
• What does it mean for objects to be similar?
• What algorithm and approach do we take?
  – Top-down: k-means
  – Bottom-up: hierarchical agglomerative clustering
• Do we need a hierarchical arrangement of clusters?
• How many clusters?
• Can we label or name the clusters?
• How do we make it efficient and scalable?
Hierarchical Clustering
Dendrogram
Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
animal
├─ vertebrate: fish, reptile, amphibian, mammal
└─ invertebrate: worm, insect, crustacean
• Agglomerative (bottom-up):
  – Start with each instance as a single cluster.
  – Eventually all instances belong to the same cluster.
• Divisive (top-down):
  – Start with all instances in the same cluster.
  – Eventually each instance forms a cluster of its own.
• Does not require the number of clusters k in advance
• Needs a termination/readout condition
Hierarchical Clustering algorithms
Hierarchical Agglomerative Clustering (HAC)
• Assumes a similarity function for determining the similarity of two instances.
• Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster.
• The history of merging forms a binary tree or hierarchy.
• The clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
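The bottom-up loop described above can be sketched in a few lines of pure Python. This is an illustrative single-link version under my own naming (`hac`, the sample points, and stopping at k clusters are assumptions, not the lecture's exact procedure); `math.dist` requires Python 3.8+.

```python
import math

def hac(points, k):
    """Bottom-up agglomerative clustering: start with singleton clusters and
    repeatedly merge the closest pair (single-link) until k clusters remain."""
    clusters = [[p] for p in points]                  # each point starts alone
    while len(clusters) > k:
        best, best_d = (0, 1), float("inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link: distance of the closest pair of points
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if d < best_d:
                    best_d, best = d, (i, j)
        i, j = best
        clusters[i] += clusters[j]                    # merge the closest pair
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(sorted(len(c) for c in hac(pts, 2)))            # -> [2, 3]
```

Stopping when k clusters remain corresponds to cutting the dendrogram at the level with k connected components.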
Dendrogram: Hierarchical Clustering
Hierarchical Agglomerative Clustering (HAC)
• Starts with each doc in a separate cluster
  – then repeatedly joins the closest pair of clusters, until there is only one cluster.
• The history of merging forms a binary tree or hierarchy.
How do we measure the distance between clusters?
Closest pair of clusters
Many variants of defining the closest pair of clusters:
• Single-link
  – Distance of the “closest” points
• Complete-link
  – Distance of the “furthest” points
• Centroid
  – Distance of the centroids (centers of gravity)
• (Average-link)
  – Average distance between pairs of elements
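The four variants can be written directly as functions over point sets. A hedged pure-Python sketch (the function names and sample clusters are mine; `math.dist` requires Python 3.8+):

```python
import math

def single_link(ci, cj):       # distance of the closest pair
    return min(math.dist(x, y) for x in ci for y in cj)

def complete_link(ci, cj):     # distance of the furthest pair
    return max(math.dist(x, y) for x in ci for y in cj)

def centroid_link(ci, cj):     # distance of the centroids
    mi = [sum(v) / len(ci) for v in zip(*ci)]
    mj = [sum(v) / len(cj) for v in zip(*cj)]
    return math.dist(mi, mj)

def average_link(ci, cj):      # mean distance over all pairs
    return (sum(math.dist(x, y) for x in ci for y in cj)
            / (len(ci) * len(cj)))

ci, cj = [(0, 0), (0, 2)], [(3, 0), (5, 0)]
print(single_link(ci, cj))     # -> 3.0
print(complete_link(ci, cj))   # -> sqrt(29), about 5.385
```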
Single Link Agglomerative Clustering
• Use the maximum similarity of pairs:

  sim(ci, cj) = max over x ∈ ci, y ∈ cj of sim(x, y)

• Can result in “straggly” (long and thin) clusters due to the chaining effect.
• After merging ci and cj, the similarity of the resulting cluster to another cluster ck is:

  sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))
Single Link Example
Complete Link Agglomerative Clustering
• Use the minimum similarity of pairs:

  sim(ci, cj) = min over x ∈ ci, y ∈ cj of sim(x, y)

• Makes “tighter,” spherical clusters that are typically preferable.
• After merging ci and cj, the similarity of the resulting cluster to another cluster ck is:

  sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))
Complete Link Example
Key notion: cluster representative
• We want a notion of a representative point in a cluster
• The representative should be some sort of “typical” or central point in the cluster, e.g.,
  – the point inducing the smallest radius to the docs in the cluster
  – the point with the smallest sum of squared distances, etc.
  – the point that is the “average” of all docs in the cluster
• Centroid or center of gravity
Centroid-based Similarity
• Always maintain the average of the vectors in each cluster:

  s(cj) = (1 / |cj|) · Σ over x ∈ cj of x

• Compute the similarity of two clusters by:

  sim(ci, cj) = sim(s(ci), s(cj))

• For non-vector data, we can’t always form a centroid.
Partitioning Algorithms
• Partitioning method: construct a partition of n documents into a set of K clusters
• Given: a set of documents and the number K
• Find: a partition into K clusters that optimizes the chosen partitioning criterion
  – Globally optimal: exhaustively enumerate all partitions
  – Effective heuristic method: the K-means algorithm
K-Means
• Assumes instances are real-valued vectors.
• Clusters are based on the centroids (aka the center of gravity or mean) of the points in a cluster c:

  μ(c) = (1 / |c|) · Σ over x ∈ c of x

• Reassignment of instances to clusters is based on distance to the current cluster centroids.
K-Means Algorithm
Select K random seeds s1, …, sK.
Until the clustering converges (or another stopping criterion is met):
  For each doc di:
    Assign di to the cluster cj such that dist(di, sj) is minimal.
  For each cluster cj:                    (update the seeds)
    sj = μ(cj), the centroid of cluster cj.
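The loop above can be sketched in plain Python. One assumption beyond the slides: the seeds are taken as the first K points rather than random ones, for reproducibility; the function name and sample data are also mine.

```python
import math

def kmeans(points, k, iters=100):
    centers = [list(p) for p in points[:k]]   # seeds: first K points (assumption)
    assign = None
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        new = [min(range(k), key=lambda j: math.dist(p, centers[j]))
               for p in points]
        if new == assign:                     # partition unchanged: converged
            break
        assign = new
        # update step: each center becomes the centroid of its cluster
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:                       # guard against empty clusters
                centers[j] = [sum(v) / len(members) for v in zip(*members)]
    return assign, centers

pts = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9), (9, 10)]
labels, centers = kmeans(pts, 2)
print(labels)   # -> [0, 0, 0, 1, 1, 1]
```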
K-Means Example (K=2)
[Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!]
Termination conditions
• Several possibilities, e.g.,
  – A fixed number of iterations.
  – The partition is unchanged.
  – Centroid positions don’t change.
Does this mean that the instances in a cluster are unchanged?
Convergence
• Why should the K-means algorithm ever reach a fixed point?
  – A state in which clusters don’t change.
• K-means is a special case of a general procedure known as the Expectation-Maximization (EM) algorithm.
  – EM is known to converge.
  – Theoretically, the number of iterations could be large.
  – In practice, it typically converges quickly.
K-means Clustering: Step 1
Decide on a value for k.
K-means Clustering: Step 2
Initialize the k cluster centers
K-means Clustering: Step 3
Decide the class memberships of the N objects by assigning them to the nearest cluster center.
K-means Clustering: Step 4
Re-estimate the k cluster centers, assuming the memberships found above are correct.
K-means Clustering: Step 5
If none of the N objects changed membership in the last iteration, exit. Otherwise go to Step 3.
How Many Clusters?
• The number of clusters K is given
  – Partition n docs into a predetermined number of clusters
• Finding the “right” number of clusters is part of the problem
  – Given data, partition into an “appropriate” number of subsets.
  – E.g., for query results, the ideal value of K is not known up front, though the UI may impose limits.
• Can usually take an algorithm for one flavor and convert it to the other.
K not specified in advance
• Say, the results of a query.
• Solve an optimization problem: penalize having lots of clusters
  – application dependent, e.g., a compressed summary of a search results list.
• Tradeoff between having more clusters (better focus within each cluster) and having too many clusters
K not specified in advance
• Given a clustering, define the Benefit for a doc to be some inverse distance to its centroid
• Define the Total Benefit to be the sum of the individual doc Benefits.
Penalize lots of clusters
• For each cluster, we have a Cost C.
• Thus for a clustering with K clusters, the Total Cost is K·C.
• Define the Value of a clustering to be:

  Value = Total Benefit − Total Cost.

• Find the clustering of highest Value, over all choices of K.
  – Total Benefit increases with increasing K, but we can stop when it doesn’t increase by “much”. The Cost term enforces this.
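The trade-off can be sketched numerically. The particular “inverse distance” 1/(1+d), the fixed cost C = 1, and the two hand-built clusterings below are assumptions for illustration, not values from the lecture.

```python
import math

def total_benefit(points, labels, centers):
    # Benefit of a point: 1 / (1 + distance to its centroid) -- one possible
    # "inverse distance" (an assumption; the slides leave it unspecified).
    return sum(1.0 / (1.0 + math.dist(p, centers[l]))
               for p, l in zip(points, labels))

def value(points, labels, centers, C=1.0):
    # Value = Total Benefit - Total Cost, where Total Cost = K * C
    return total_benefit(points, labels, centers) - C * len(centers)

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]

# K = 1: everything in one cluster around the global centroid
v1 = value(pts, [0, 0, 0, 0], [(5.0, 5.5)])
# K = 2: the two natural groups, each with its own centroid
v2 = value(pts, [0, 0, 1, 1], [(0.0, 0.5), (10.0, 10.5)])
print(v2 > v1)   # -> True: the benefit gain outweighs the extra cluster's cost
```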
Density-based Clustering
• Why density-based clustering methods?
  – They discover clusters of arbitrary shape.
  – Clusters are dense regions of objects separated by regions of low density.
• Representative algorithms:
  – DBSCAN: the first density-based clustering algorithm
  – OPTICS: density-based cluster ordering
  – DENCLUE: a general density-based description of clusters and clustering
Density-Based Clustering
• Why density-based clustering?
[Figure: results of a k-medoid algorithm for k=4]
• Basic idea: clusters are dense regions in the data space, separated by regions of lower object density.
• Different density-based approaches exist.
• Here we discuss the ideas underlying the DBSCAN algorithm.
DBSCAN: Density Based Spatial Clustering of Applications with Noise
• Proposed by Ester, Kriegel, Sander, and Xu (KDD’96)
• Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.
• Discovers clusters of arbitrary shape in spatial databases with noise.
DBSCAN
Density-based clustering locates regions of high density that are separated from one another by regions of low density.
• Density = the number of points within a specified radius (Eps)
• DBSCAN is a density-based algorithm:
  – A point is a core point if it has at least a specified number of points (MinPts) within Eps.
    • These are the points in the interior of a cluster.
  – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
  – A noise point is any point that is neither a core point nor a border point.
  – Any two core points that are close enough (within a distance Eps of one another) are put in the same cluster.
  – Any border point that is close enough to a core point is put in the same cluster as the core point.
  – Noise points are discarded.
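The core/border/noise rules above can be applied directly, before any clusters are formed. A pure-Python sketch (function name, Euclidean distance, and sample data are my assumptions):

```python
import math

def classify(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' per the DBSCAN rules."""
    # Eps-neighborhood of each point; note a point counts in its own neighborhood
    neigh = {p: [q for q in points if math.dist(p, q) <= eps] for p in points}
    core = {p for p in points if len(neigh[p]) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neigh[p]):
            labels[p] = "border"          # not core, but near a core point
        else:
            labels[p] = "noise"
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (9, 9)]
lab = classify(pts, eps=1.0, min_pts=3)
print(lab[(0, 0)], lab[(2, 1)], lab[(9, 9)])   # -> core border noise
```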
DBSCAN
Border & Core
[Figure: core, border, and outlier points illustrated with Eps = 1 unit and MinPts = 5]
Concepts: ε-Neighborhood
• ε-Neighborhood: the objects within a radius of ε from an object (the epsilon-neighborhood).
• Core object: an object whose ε-neighborhood contains at least MinPts objects.
[Figure: the ε-neighborhoods of p and q; p is a core object (MinPts = 4), q is not a core object]
Concepts: Reachability
• Directly density-reachable:
  – An object q is directly density-reachable from an object p if q is within the ε-neighborhood of p and p is a core object.
[Figure: q is directly density-reachable from p; p is not directly density-reachable from q, since q is not a core object]
Concepts: Reachability
• Density-reachable:
  – An object p is density-reachable from q w.r.t. ε and MinPts if there is a chain of objects p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi w.r.t. ε and MinPts for all 1 ≤ i < n.
  – Density-reachability is the transitive closure of direct density-reachability, and it is asymmetric.
[Figure: q is density-reachable from p; p is not density-reachable from q]
Concepts: Connectivity
• Density-connectivity:
  – An object p is density-connected to an object q w.r.t. ε and MinPts if there is an object o such that both p and q are density-reachable from o w.r.t. ε and MinPts.
[Figure: p and q are density-connected to each other through r]
• Density-connectivity is symmetric.
Concepts: cluster & noise
• Cluster: a cluster C in a set of objects D w.r.t. ε and MinPts is a non-empty subset of D satisfying:
  – Maximality: for all p, q, if p ∈ C and q is density-reachable from p w.r.t. ε and MinPts, then q ∈ C.
  – Connectivity: for all p, q ∈ C, p is density-connected to q w.r.t. ε and MinPts in D.
  – Note: a cluster contains core objects as well as border objects.
• Noise: objects which are not directly density-reachable from at least one core object.
[Figure: p is (indirectly) density-reachable from q via the intermediate point p1; p and q are density-connected through o]
DBSCAN: The Algorithm
– Select a point p.
– Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
– If p is a core point, a cluster is formed.
– If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
– Continue the process until all of the points have been processed.
The result is largely independent of the order in which the points are processed (a border point reachable from two clusters may be assigned to either).
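The steps above can be turned into a compact pure-Python sketch. This is for illustration only (O(n²) neighborhood queries; the function name, cluster-id convention, and sample data are my assumptions):

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (1, 2, ...) or -1 for noise."""
    def neighbors(p):
        return [q for q in points if math.dist(p, q) <= eps]

    labels = {p: None for p in points}        # None = not yet visited
    cid = 0
    for p in points:
        if labels[p] is not None:
            continue
        n = neighbors(p)
        if len(n) < min_pts:
            labels[p] = -1                    # provisionally noise
            continue
        cid += 1                              # p is core: start a new cluster
        labels[p] = cid
        seeds = list(n)
        while seeds:                          # expand via density-reachability
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cid               # noise reclassified as border
            if labels[q] is not None:
                continue
            labels[q] = cid
            qn = neighbors(q)
            if len(qn) >= min_pts:            # q is core too: keep expanding
                seeds.extend(qn)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
lab = dbscan(pts, eps=1.5, min_pts=3)
print(lab[(5, 5)])                            # -> -1 (noise)
```

Note that border points (reached but not core) are labeled but never expanded, exactly as in the loop above.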
An Example (MinPts = 4)
[Figure: cluster C1 grown in three successive stages from a core point]
DBSCAN: Determining Eps and MinPts
• The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance.
• Noise points have their kth nearest neighbor at a farther distance.
• So, plot the sorted distance of every point to its kth nearest neighbor.
• The distance from a point to its kth nearest neighbor is called its k-dist.
• For points that belong to some cluster, the value of k-dist will be small if k is not larger than the cluster size.
• For points that are not in any cluster, such as noise points, the k-dist will be relatively large.
• Compute k-dist for all points for some k, sort the values in increasing order, and plot them.
• A sharp change in the plot occurs at the value of k-dist that corresponds to a suitable value of Eps, with the chosen k as MinPts.
DBSCAN: Determining Eps and MinPts (cont.)
• Points for which k-dist is less than Eps will be labeled as core points, while other points will be labeled as noise or border points.
• If k is too large, small clusters (of size less than k) are likely to be labeled as noise.
• If k is too small, even a small number of closely spaced points that are noise or outliers will be incorrectly labeled as clusters.
Clusters Identified by the DBSCAN Algorithm
• A density-based cluster is a set of density-connected objects that is maximal with respect to density-reachability.
• An object not contained in any cluster is considered to be noise.
On Class Exercise 1
• Data
  – Iris.arff and your own data (if applicable)
• Method
  – Hierarchical algorithms
  – Parameter: number of clusters = 3
• Software
  – Weka 3.7.3
• Steps
  – Explorer -> Cluster -> Clusterer (HierarchicalClusterer)
On Class Exercise 2
• Data
  – Iris.arff and your own data (if applicable)
• Method
  – K-means
  – Parameter: number of clusters = 3
• Software
  – Weka 3.7.3
• Steps
  – Explorer -> Cluster -> Clusterer (SimpleKMeans)
On Class Exercise 3
• Data
  – Iris.arff and your own data (if applicable)
• Method
  – DBSCAN
  – Parameters: epsilon and minPoints (DBSCAN does not take a number of clusters)
• Software
  – Weka 3.7.3
• Steps
  – Explorer -> Cluster -> Clusterer (DBScan)