Upload
megan-dawson
View
228
Download
1
Tags:
Embed Size (px)
Citation preview
ClusteringUNSUPERVISED LEARNING INTRODUCTION
Machine Learning
Andrew Ng
Supervised learning
Training set:
Andrew Ng
Unsupervised learning
Training set:
Andrew Ng
Applications of clustering
Organize computing clusters
Social network analysis
Image credit: NASA/JPL-Caltech/E. Churchwell (Univ. of Wisconsin, Madison)
Astronomical data analysis
Market segmentation
ClusteringVARIANT
Clustering Category• Based on the Clustering Algorithms, clustering are categorized into Four
Major Category:– Partitional (Centroid Based)
Try to cluster data into k number of cluster.Example: K-Means, K-Means++, Fuzzy C-Means.
– Hierarchical• Agglomerative
Start with all data as an individual cluster • Divisive
Start with the entire data as a single cluster.
– Distribution BasedThe clustering model most closely related to statistics is based on distribution
models.Example: EM-clusteringUnpopular because tend to overfitting
– Density Based• In density-based clustering, clusters are defined as areas of higher density than the
remainder of the data set.
Based on the data• Clustering are categorized into:
– Numerical data clustering– Categorical data clustering
Clustering
K-MEANS ALGORITHM
Andrew Ng
Andrew Ng
Andrew Ng
Andrew Ng
Andrew Ng
Andrew Ng
Andrew Ng
Andrew Ng
Andrew Ng
Andrew Ng
Input:- (number of clusters)- Training set
(drop convention)
K-means algorithm
Andrew Ng
Randomly initialize cluster centroids
K-means algorithm
Repeat {for = 1 to
:= index (from 1 to ) of cluster centroid closest to
for = 1 to := average (mean) of points assigned to cluster
}
Andrew Ng
K-means for non-separated clusters
T-shirt sizing
Height
Wei
ght
Clustering
OPTIMIZATION OBJECTIVE
Machine Learning
Andrew Ng
K-means optimization objective
= index of cluster (1,2,…, ) to which example is currently assigned
= cluster centroid ( )= cluster centroid of cluster to which example has been
assignedOptimization objective:
Andrew Ng
Randomly initialize cluster centroids
K-means algorithm
Repeat {for = 1 to
:= index (from 1 to ) of cluster centroid closest to
for = 1 to := average (mean) of points assigned to cluster
}
Clustering
RANDOM INITIALIZATION
Machine Learning
Andrew Ng
Randomly initialize cluster centroids
K-means algorithm
Repeat {for = 1 to
:= index (from 1 to ) of cluster centroid closest to
for = 1 to := average (mean) of points assigned to cluster
}
Andrew Ng
Random initialization
Should have
Randomly pick training examples.
Set equal to these examples.
Andrew Ng
Local optima
Andrew Ng
For i = 1 to 100 {
Randomly initialize K-means.Run K-means. Get .Compute cost function (distortion)
}
Pick clustering that gave lowest cost
Random initialization
ClusteringCHOOSING THE NUMBER OF CLUSTERS
Machine Learning
Andrew Ng
What is the right value of K?
Andrew Ng
Choosing the value of K
Elbow method:
1 2 3 4 5 6 7 8
Cost
func
tion
(no. of clusters)1 2 3 4 5 6 7 8
Cost
func
tion
(no. of clusters)
Andrew Ng
Choosing the value of KSometimes, you’re running K-means to get clusters to use for some later/downstream purpose. Evaluate K-means based on a metric for how well it performs for that later purpose.
E.g. T-shirt sizing
Height
Wei
ght
T-shirt sizing
HeightW
eigh
t