K-MEANS ALGORITHM
Jelena Vukovic 53/07
jeca.zr@gmail.com


Elektrotehnički fakultet u Beogradu

Introduction
• Basic idea of the k-means algorithm
• Detailed explanation
• Most common problems of the algorithm
• Applications
• Possible improvements


Basic principles of the algorithm


• Given a set of points (x1, x2, … , xn)
• Partition the n points into k sets (S1, S2, … , Sk), with n > k
• The goal is to minimize the within-cluster sum of squares
• µi is the mean of the points in Si
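
In the notation of the bullets above, the within-cluster sum of squares objective is usually written as follows. This is the standard textbook form, given here as an assumption about what the slide displayed; the definition of µi as the arithmetic mean of Si is spelled out alongside it.

```latex
\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j
```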


The algorithm

• Choose the number of means (k) and initialize their positions
• Iterate:
  1. Assign each point to the nearest mean
  2. Move each mean to the center of its cluster
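
The two steps above can be sketched in plain Python. This is a minimal illustration, not the presenter's implementation: the function names, the `max_iters` and `seed` parameters, the choice of random input points as starting means, and the rule that an empty cluster keeps its previous mean are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iters=100, seed=0):
    """Return (means, labels), iterating until the assignments stop changing."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # assumption: start from k randomly chosen input points
    labels = None
    for _ in range(max_iters):
        # Step 1: assign each point to the nearest mean.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, means[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # no assignment changed, so the means would not move either
        labels = new_labels
        # Step 2: move each mean to the center of its cluster.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # assumption: an empty cluster keeps its previous mean
                means[i] = mean_point(members)
    return means, labels

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
means, labels = kmeans(points, k=2)
print(means, labels)
```

On this small, well-separated data set the first three points end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random start.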


The algorithm

[Figure: two panels, "Assign points to nearest mean" and "Move means"]


The algorithm
• The complexity is O(n * k * I * d)
• n – number of points
• k – number of clusters
• I – number of iterations
• d – number of attributes
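
The product comes from the assignment step: in each of the I iterations, every one of the n points is compared with each of the k means, and one comparison touches d attributes. As a rough worked example (the function name and the input sizes are made up for illustration):

```python
def kmeans_operation_estimate(n, k, iterations, d):
    """Rough count of attribute-level operations: n * k * I * d."""
    return n * k * iterations * d

# 10,000 points, 5 clusters, 20 iterations, 3 attributes per point
estimate = kmeans_operation_estimate(n=10_000, k=5, iterations=20, d=3)
print(estimate)  # 3000000
```

The count grows linearly in each factor, so doubling the number of points or the number of clusters doubles the estimate.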

[Figure: "Re-assign points"]


K nearest neighbors

• A related distance-based algorithm, but used to classify labeled data rather than to cluster
• The decision is made by simple majority vote among the k closest neighbors
• In k-means, the Euclidean distance measure is used
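
For comparison with k-means, the majority-vote rule can be sketched as below. The function names and the toy training set are hypothetical, and Euclidean distance is assumed for the neighbor search.

```python
from collections import Counter

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def knn_classify(query, labeled_points, k):
    """Label `query` by simple majority among its k nearest labeled points."""
    nearest = sorted(
        labeled_points,
        key=lambda item: squared_distance(query, item[0]),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((1.0, 1.5), "a"),
    ((8.0, 8.0), "b"), ((9.0, 8.5), "b"),
]
print(knn_classify((2.0, 2.0), training, k=3))  # a
```

Unlike k-means, this needs labels for the training points and does not iterate: each query is answered directly from its neighbors.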


Some limitations of the algorithm
• The number of clusters needs to be known in advance
• The result depends on the initial positions of the means
• Problems appear when clusters have different
  • Shapes
  • Sizes
  • Densities


Initial centroids problem

• Random distribution (the most common)
• Multiple runs
• Testing on a data sample
• Analyze the data
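
Random initialization and multiple runs are often combined: run k-means several times from different random starts and keep the run with the lowest within-cluster sum of squares. The sketch below assumes that selection rule; the names `best_of_runs` and `wcss` and the default of 10 runs are illustrative choices, not taken from the slides.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, max_iters=100):
    """One k-means run, starting from k randomly chosen input points."""
    means = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda i: squared_distance(p, means[i]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous mean
                means[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, labels

def wcss(points, means, labels):
    """Within-cluster sum of squares of a finished run."""
    return sum(squared_distance(p, means[lab]) for p, lab in zip(points, labels))

def best_of_runs(points, k, runs=10, seed=0):
    """Repeat k-means from different random starts; keep the lowest-WCSS result."""
    rng = random.Random(seed)
    results = [kmeans(points, k, rng) for _ in range(runs)]
    return min(results, key=lambda r: wcss(points, r[0], r[1]))

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
means, labels = best_of_runs(points, k=3)
print(means, wcss(points, means, labels))
```

On this data a single unlucky start can leave two means inside the same pair of points, which is the kind of poor local optimum that repeated runs are meant to avoid.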


Different density

[Figure: two panels, "Original points" and "3 Clusters"]


Non-globular shapes

[Figure: two panels, "Original points" and "2 Clusters"]


Pros and cons

Pros
• Simple to implement
• Fast
• Not highly demanding

Cons
• K needs to be known
• Ellipsoid shape is assumed
• Requires some knowledge about the data in advance
• Possibility of many loop iterations without significant changes in clusters
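
One common way to limit the last drawback is to stop as soon as the means barely move between iterations. The check below is a sketch of such a rule; the function names and the tolerance value are hypothetical and not part of the slides.

```python
import math

def largest_shift(old_means, new_means):
    """Largest Euclidean distance any mean moved between two iterations."""
    return max(math.dist(a, b) for a, b in zip(old_means, new_means))

def should_stop(old_means, new_means, tolerance=1e-4):
    """Stop iterating once no mean moved by more than `tolerance`."""
    return largest_shift(old_means, new_means) <= tolerance

old = [(1.0, 1.0), (8.0, 8.0)]
barely_moved = [(1.0, 1.00005), (8.0, 8.0)]
clearly_moved = [(1.5, 1.0), (8.0, 8.0)]
print(should_stop(old, barely_moved))   # True
print(should_stop(old, clearly_moved))  # False
```

A larger tolerance ends the loop earlier at the cost of slightly less settled clusters, so the value is a trade-off to tune per data set.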


Applications of the algorithm
• Many different uses
• Computer vision
• Market segmentation
• Geostatistics
• Astronomy
• etc.


Improvements
• Pre-process the data in order to better estimate k
• Run the algorithm several times in parallel, each with a different centroid initialization
• Ignore possible errors (such as outliers) to avoid non-standard cluster shapes
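
The slides do not say how k should be estimated. One common heuristic, swapped in here purely as an illustration, is the elbow method: compute the within-cluster sum of squares for several candidate values of k and look for the point where it stops dropping sharply. The function names, the number of restarts, and the toy data are assumptions.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_wcss(points, k, rng, max_iters=100):
    """Run k-means once and return the final within-cluster sum of squares."""
    means = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda i: squared_distance(p, means[i]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous mean
                means[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(squared_distance(p, means[lab]) for p, lab in zip(points, labels))

def wcss_by_k(points, candidate_ks, restarts=5, seed=0):
    """Best WCSS found for each candidate k, using a few random restarts per k."""
    rng = random.Random(seed)
    return {k: min(kmeans_wcss(points, k, rng) for _ in range(restarts))
            for k in candidate_ks}

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
curve = wcss_by_k(points, candidate_ks=range(1, 7))
for k, value in curve.items():
    print(k, round(value, 2))
```

The value for k = 1 is the total spread of the data, and it falls to zero once every point has its own mean, so the useful information lies in where the curve bends in between.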


Thank you!
