Unsupervised Learning: Clustering

Some material adapted from slides by Andrew Moore, CMU. Visit http://www.autonlab.org/tutorials/ for Andrew’s repository of Data Mining tutorials.


Unsupervised Learning

• Supervised learning used labeled data pairs (x, y) to learn a function f : X→Y. But what if we don’t have labels?

• No labels = unsupervised learning

• Only some points are labeled = semi-supervised learning

  • Labels may be expensive to obtain, so we only get a few.

• Clustering is the unsupervised grouping of data points. It can be used for knowledge discovery.

Clustering Data

K-Means Clustering

K-Means(k, data)

• Randomly choose k cluster center locations (centroids).

• Loop until convergence

  • Assign each point to the cluster of the closest centroid.

  • Re-estimate the cluster centroids based on the data assigned to each.
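The steps above can be sketched in plain Python. This is a minimal illustration, not the implementation behind the slides: it assumes Euclidean distance, picks the initial centroids from the data points themselves, treats "convergence" as "no assignment changed", and leaves an empty cluster's centroid where it was. The names `k_means`, `squared_distance`, `max_iters`, and `seed` are made up for this example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(k, data, max_iters=100, seed=None):
    """Cluster `data` (a list of numeric tuples) into k groups.

    Returns (centroids, assignments), where assignments[i] is the
    index of the centroid closest to data[i].
    """
    rng = random.Random(seed)
    # Randomly choose k cluster center locations (assumption: k distinct data points).
    centroids = [tuple(p) for p in rng.sample(data, k)]
    assignments = None
    for _ in range(max_iters):
        # Assign each point to the cluster of the closest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in data
        ]
        # Convergence test (assumption): no point changed cluster since the last pass.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Re-estimate each centroid as the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(data, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments

# Toy data: two visually obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = k_means(2, points, seed=0)
```

Depending on which points are drawn as initial centroids, even this toy data set can end up in a poor clustering, so the result should not be read as the single "right" answer.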


K-Means Animation

Example generated by Andrew Moore using Dan Pelleg’s super-duper fast K-means system:

Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999.

Problems with K-Means

• Very sensitive to the initial points.

  • Do many runs of k-Means, each with different initial centroids.

  • Seed the centroids using a better method than random (e.g. farthest-first sampling).

• Must manually choose k.

  • Learn the optimal k for the clustering. (Note that this requires a performance measure.)
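The two remedies for initialization sensitivity can be combined in a short sketch. Everything here is an assumption made for illustration: farthest-first sampling is taken to mean "start from a random point, then repeatedly add the point farthest from the centroids chosen so far", the performance measure is taken to be distortion (the sum of squared distances from each point to its centroid), and the function names are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first(data, k, rng):
    """Pick one centroid at random, then repeatedly add the data point
    whose distance to its nearest chosen centroid is largest."""
    centroids = [rng.choice(data)]
    while len(centroids) < k:
        centroids.append(
            max(data, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def run_k_means(data, centroids, max_iters=100):
    """Run the assign / re-estimate loop from the given starting centroids."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iters):
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in data
        ]
        if new_labels == labels:  # no assignment changed: converged
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, a in zip(data, labels) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

def distortion(data, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(sq_dist(p, centroids[a]) for p, a in zip(data, labels))

def best_of_restarts(data, k, n_restarts=10, seed=0):
    """Run k-means several times from different seeds; keep the lowest distortion."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        centroids, labels = run_k_means(data, farthest_first(data, k, rng))
        score = distortion(data, centroids, labels)
        if best is None or score < best[0]:
            best = (score, centroids, labels)
    return best

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
score, centroids, labels = best_of_restarts(points, k=2)
```

Distortion is enough to compare runs that share the same k, but on its own it is a poor guide for choosing k: the best achievable distortion never increases as k grows, so some penalized or held-out measure would be needed for that.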

Problems with K-Means (continued)

• How do you tell it which clustering you want?

  • Constrained clustering techniques

    • Same-cluster constraint (must-link)

    • Different-cluster constraint (cannot-link)
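The slides do not say how the constraints are enforced. One plausible approach, used by constrained k-means variants such as COP-KMeans, is to check the constraints during the assignment step and skip any cluster that would break one. The helper below is a hypothetical sketch of that check only; the representation of constraints as pairs of point indices is an assumption.

```python
def violates_constraints(i, cluster, labels, must_link, cannot_link):
    """Return True if putting point i into `cluster` breaks a constraint.

    labels[j] is the cluster of point j, or None if j is not assigned yet.
    must_link and cannot_link are lists of (a, b) point-index pairs.
    """
    for a, b in must_link:
        if i in (a, b):
            other = b if i == a else a
            # A must-link partner already sits in a different cluster.
            if labels[other] is not None and labels[other] != cluster:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if i == a else a
            # A cannot-link partner already sits in this cluster.
            if labels[other] == cluster:
                return True
    return False

labels = [0, None, 1, None]   # points 0 and 2 are assigned, 1 and 3 are not
must_link = [(0, 1)]          # points 0 and 1 belong together
cannot_link = [(1, 2)]        # points 1 and 2 must be separated

print(violates_constraints(1, 0, labels, must_link, cannot_link))  # False
print(violates_constraints(1, 1, labels, must_link, cannot_link))  # True
```

In a full algorithm, the assignment step would try clusters in order of increasing distance and take the first one for which this check returns False.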
