K-MEANS ALGORITHM
Jelena Vukovic 53/07
Elektrotehnički fakultet u Beogradu
Introduction
• Basic idea of the k-means algorithm
• Detailed explanation
• Most common problems of the algorithm
• Applications
• Possible improvements
Basic principles of the algorithm
• Given a set of n points (x1, x2, … , xn)
• Partition the n points into k sets (S1, S2, … , Sk), with n > k
• The goal is to minimize the within-cluster sum of squares
• µi is the mean of the points in Si
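In the usual notation, the within-cluster sum of squares objective can be written as below. This is the standard textbook form, given as a sketch; the norm is assumed to be Euclidean, and |S_i| denotes the number of points in S_i.

```latex
\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```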
The algorithm
• Choose the number of means (k) and initialize their positions
• Iterate:
1. Assign each point to the nearest mean
2. Move each mean to the center of its cluster
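The two iterated steps can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names, the random choice of initial means, the stopping rule (stop when assignments no longer change) and the handling of empty clusters are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iterations=100, seed=0):
    """Cluster `points` (a list of tuples) into k groups.

    Returns (means, labels), where labels[j] is the index of the mean
    closest to points[j].
    """
    rng = random.Random(seed)
    # Assumption: initial means are k points drawn at random without replacement.
    means = rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        # Step 1: assign each point to the nearest mean.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, means[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the means would not move
        labels = new_labels
        # Step 2: move each mean to the center of its cluster.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous mean
                means[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, labels
```

For example, `k_means([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2)` separates the three points near the origin from the three points near (10, 10).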
[Figure panels: Assign points to nearest mean; Move means]
• The complexity is O(n * k * I * d)
• n – number of points
• k – number of clusters
• I – number of iterations
• d – number of attributes
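To make the bound concrete, the sketch below multiplies out the four factors for one workload. The function name and the sample values are hypothetical and chosen only for illustration.

```python
def assignment_cost(n, k, iterations, d):
    """Approximate number of coordinate operations over all iterations.

    Each iteration compares n points against k means, and each
    comparison touches d attributes, giving n * k * d per iteration.
    """
    return n * k * iterations * d


# Hypothetical workload: 10,000 points, 5 clusters, 20 iterations, 3 attributes.
print(assignment_cost(n=10_000, k=5, iterations=20, d=3))  # 3000000
```

Doubling any one of the four factors doubles the estimate, which is what the linear dependence on n, k, I and d means in practice.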
[Figure: Re-assign points]
K nearest neighbors
• A related distance-based algorithm, used for classification rather than clustering
• The decision is made based on the simple majority of the closest k neighbors
• In k-means the Euclidean distance measure is used
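The majority-vote decision of k nearest neighbors can be sketched as below. The function name and the data layout (a list of `(point, label)` pairs) are assumptions for this example, and Euclidean distance is assumed for ranking the neighbors.

```python
from collections import Counter


def knn_classify(labeled_points, query, k):
    """Predict a label for `query` by majority vote of its k nearest neighbors.

    `labeled_points` is a list of (point, label) pairs; distances are
    squared Euclidean, which ranks neighbors the same as Euclidean.
    """
    by_distance = sorted(
        labeled_points,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    # most_common(1) returns the label with the most votes; ties go to the
    # label encountered first among the nearest neighbors.
    return votes.most_common(1)[0][0]
```

Unlike k-means, this needs labeled training points: with three points labeled "a" near the origin and two labeled "b" near (5, 5), a query at (0.5, 0.5) with k = 3 is classified as "a".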
Some limitations of the algorithm
• The number of clusters needs to be known in advance
• Initialization of the positions of the means
• Problems appear when clusters have different
  • Shapes
  • Sizes
  • Density
Initial centroids problem
• Random distribution (the most common)
• Multiple runs
• Testing on a data sample
• Analyze the data
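One way to combine random initialization with multiple runs is to repeat the clustering from different random starts and keep the run with the lowest within-cluster sum of squares (WCSS). The sketch below assumes that selection criterion; the function names, the default of 10 runs and the convergence test are illustrative choices, not part of the slides.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def run_k_means(points, k, rng, max_iterations=100):
    """One k-means run from a random initialization; returns (means, wcss)."""
    means = rng.sample(points, k)  # random initial centroids
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, means[i]))
            clusters[nearest].append(p)
        new_means = []
        for i, members in enumerate(clusters):
            if members:
                new_means.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_means.append(means[i])  # empty cluster keeps its old mean
        if new_means == means:
            break  # means stopped moving
        means = new_means
    wcss = sum(min(sq_dist(p, m) for m in means) for p in points)
    return means, wcss


def best_of_runs(points, k, runs=10, seed=0):
    """Repeat k-means with different random starts; keep the lowest WCSS."""
    rng = random.Random(seed)
    results = [run_k_means(points, k, rng) for _ in range(runs)]
    return min(results, key=lambda result: result[1])
```

Because each run is independent, the runs could also be executed in parallel; the sequential loop is used here only to keep the example short.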
Different density
[Figure panels: Original points; 3 Clusters]
Non-globular shapes
[Figure panels: Original points; 2 Clusters]
Pros and cons
Pros
• Simple to implement
• Fast
• Not demanding on computing resources
Cons
• K needs to be known
• Roughly spherical (globular) cluster shape is assumed
• Requires some knowledge about the data in advance
• May run many iterations without significant changes in the clusters
Applications of the algorithm
• Many different uses
  • Computer vision
  • Market segmentation
  • Geostatistics
  • Astronomy
  • etc.
Improvements
• Pre-process the data in order to better estimate k
• Perform multiple runs in parallel, each with a different centroid initialization
• Ignore possible errors to avoid non-standard cluster shapes
Thank you!