Clustering: K-means
Industrial AI Lab.

12 Clustering - K-means (i-systems.github.io/HSE545/machine learning all/05...)


Supervised vs. Unsupervised Learning

• Supervised Learning: building a model from labeled data
• Unsupervised Learning: clustering from unlabeled data


Data Clustering

• Data clustering is an unsupervised learning problem
• Given:
  – $m$ unlabeled examples $\{x^{(1)}, x^{(2)}, \cdots, x^{(m)}\}$
  – the number of partitions $k$
• Goal: group the examples into $k$ partitions


Data Clustering: Similarity

• The only information clustering uses is the mutual similarity between samples
• A good clustering is one that achieves:
  – high within-cluster similarity
  – low inter-cluster similarity
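To make the criterion concrete, here is a minimal sketch that scores a candidate clustering by comparing average distances within and between clusters. It assumes Euclidean distance as the (inverse) measure of similarity; the toy points and labels are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: two tight groups in 2-D (values chosen for illustration).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])  # a candidate clustering

# Pairwise Euclidean distances (small distance = high similarity).
diff = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

same = labels[:, None] == labels[None, :]   # True where both points share a cluster
off_diag = ~np.eye(len(X), dtype=bool)      # ignore each point's distance to itself

within = dist[same & off_diag].mean()       # average within-cluster distance
between = dist[~same].mean()                # average inter-cluster distance
print(f"within: {within:.3f}, between: {between:.3f}")
```

A small within-cluster distance together with a large inter-cluster distance corresponds to the "good clustering" described above.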


K-means: (Iterative) Algorithm
1) Initialization

• Input
  – $k$: the number of clusters
  – Training set $\{x^{(1)}, x^{(2)}, \cdots, x^{(m)}\}$
• Randomly initialize cluster centers anywhere in $\mathbb{R}^n$


K-means: (Iterative) Algorithm
2) Iteration

• Repeat until convergence (a possible convergence criterion: the cluster centers no longer change)
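A sketch of what each pass of the loop typically does, written in the standard K-means formulation. It assumes $c_i$ denotes the cluster index assigned to example $x^{(i)}$, $\mu_j$ denotes the center of cluster $j$, and distance is squared Euclidean:

```latex
% Assignment step: label each example with the index of its nearest center
c_i := \arg\min_{j \in \{1,\dots,k\}} \left\| x^{(i)} - \mu_j \right\|^2, \qquad i = 1,\dots,m

% Update step: move each center to the mean of the examples assigned to it
\mu_j := \frac{1}{\left|\{ i : c_i = j \}\right|} \sum_{i \,:\, c_i = j} x^{(i)}, \qquad j = 1,\dots,k
```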


K-means: (Iterative) Algorithm
3) Output

• $c$ (label): index (1 to $k$) of the cluster centroid (center) assigned to each example
• $\mu$: averages (means) of the points assigned to each cluster, $\mu_1, \mu_2, \cdots, \mu_k$


Initialization ($k = 2$)

[Figures: after the initialization, the slides alternate "Assigning Points" and "Recomputing the Cluster Centers", four rounds in total.]


Summary: K-means Clustering

• (Iterative) Algorithm


K-means: Optimization Point of View (optional)

• $c_i$ = index of the cluster $(1, 2, \cdots, k)$ to which example $x^{(i)}$ is currently assigned
• $\mu_k$ = cluster centroid
• $\mu_{c_i}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned
• Optimization objective:
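For reference, a hedged reconstruction of the objective, assuming the standard K-means distortion function in the notation above (the $1/m$ factor is a common convention and does not change the minimizer):

```latex
J(c_1,\dots,c_m,\ \mu_1,\dots,\mu_k)
  = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c_i} \right\|^2

\min_{c_1,\dots,c_m,\ \mu_1,\dots,\mu_k} \; J(c_1,\dots,c_m,\ \mu_1,\dots,\mu_k)
```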


Expectation Maximization Algorithm

• It is a "chicken and egg" problem (dilemma)
  – Q: if we knew the cluster centers $\mu_k$, how would we determine which points to associate with each cluster center?
  – A: for each point $x^{(i)}$, choose the closest center (this sets $c_i$)
  – Q: if we knew the cluster memberships $c_i$, how would we get the centers?
  – A: choose each $\mu_k$ to be the mean of all points in its cluster
• Extension of the K-means algorithm
  – K-means is a special case of the Expectation Maximization (EM) algorithm
  – It is also a special (hard-assignment) case of fitting a Gaussian Mixture Model (GMM)
  – Won't be discussed in this course
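The two answers can be alternated, and the K-means objective is then tracked after each half-step. The following is a minimal 1-D sketch; the data and the starting centers are hypothetical, and `c` holds memberships while `mu` holds centers (an assumed reading of the notation).

```python
import numpy as np

# Hypothetical 1-D data and starting centers, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
mu = np.array([0.0, 5.0])

def objective(x, c, mu):
    """Mean squared distance from each point to its assigned center."""
    return np.mean((x - mu[c]) ** 2)

history = []
for _ in range(3):
    # Step 1: centers fixed, pick the closest center for every point.
    c = np.argmin(np.abs(x[:, None] - mu[None, :]), axis=1)
    history.append(objective(x, c, mu))
    # Step 2: memberships fixed, move each center to the mean of its points
    # (a center with no points keeps its old value).
    mu = np.array([x[c == j].mean() if np.any(c == j) else mu[j]
                   for j in range(len(mu))])
    history.append(objective(x, c, mu))

print(history)  # the objective never increases from one half-step to the next
print(mu)
```

Neither step can increase the objective, which is why the alternation settles on a fixed point (though not necessarily the global optimum).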


Python: Data Generation
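A minimal sketch of one way to generate clustered test data with NumPy. The cluster means, the noise level, and the sample counts are assumptions made for illustration, not values taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable

# Hypothetical cluster means; 100 points per cluster with unit-variance Gaussian noise.
true_means = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 5.0]])
points_per_cluster = 100

X = np.vstack([mean + rng.standard_normal((points_per_cluster, 2))
               for mean in true_means])

print(X.shape)  # (300, 2)
```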


Python: Random Initialization
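A minimal sketch of random initialization. As a practical assumption, the centers are drawn uniformly inside the bounding box of the data rather than anywhere in the whole space; `random_centers` is a hypothetical helper name and the data is a placeholder.

```python
import numpy as np

def random_centers(X, k, rng):
    """Draw k centers uniformly inside the bounding box of the data."""
    low, high = X.min(axis=0), X.max(axis=0)
    return rng.uniform(low, high, size=(k, X.shape[1]))

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))           # placeholder data for the demonstration
centers = random_centers(X, k=3, rng=rng)
print(centers.shape)  # (3, 2)
```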


Python: K-Means
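A minimal NumPy sketch of the iterative algorithm, under a few assumptions: the initial centers are randomly chosen examples, an empty cluster keeps its previous center, and the loop stops once the centers no longer move. The function name `kmeans` and the test data are illustrative.

```python
import numpy as np

def kmeans(X, k, rng, max_iter=100):
    """Plain K-means: returns (labels c, centers mu)."""
    # Start from k distinct examples picked at random (one common choice).
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every example.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = d2.argmin(axis=1)
        # Update step: mean of the examples in each cluster
        # (an empty cluster keeps its previous center).
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):    # centers stopped moving: converged
            break
        mu = new_mu
    return c, mu

rng = np.random.default_rng(0)
# Two well-separated hypothetical blobs.
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
c, mu = kmeans(X, k=2, rng=rng)
print(mu)
```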


Python: Result


Python: K-Means in Scikit-learn
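A minimal sketch using scikit-learn's `KMeans` estimator. The data and the parameter values (`n_clusters=2`, `n_init=10`, `random_state=0`) are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated hypothetical blobs.
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)       # cluster index for every example
centers = model.cluster_centers_    # one row per cluster center
print(centers)
print(model.inertia_)               # sum of squared distances to the closest center
```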


Initialization Issues

• K-means is extremely sensitive to cluster center initialization
• Bad initialization can lead to
  – Poor convergence speed
  – Bad overall clustering
• Safeguarding measures:
  – Choose the first center as one of the examples, the second as the example farthest from the first, the third as the example farthest from both, and so on
  – Try multiple initializations and choose the best result
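The first safeguard can be sketched as a farthest-first selection. This version reads "farthest from both" as "largest distance to the nearest center chosen so far", which is an assumption; `farthest_first_centers` is a hypothetical helper name.

```python
import numpy as np

def farthest_first_centers(X, k, rng):
    """Pick k examples as centers, each new one far from those already chosen."""
    chosen = [int(rng.integers(len(X)))]            # first center: a random example
    # Distance from every example to its nearest chosen center so far.
    nearest = np.linalg.norm(X - X[chosen[0]], axis=1)
    for _ in range(1, k):
        idx = int(nearest.argmax())                 # farthest from all chosen centers
        chosen.append(idx)
        nearest = np.minimum(nearest, np.linalg.norm(X - X[idx], axis=1))
    return X[chosen]

rng = np.random.default_rng(0)
# Three hypothetical, well-separated blobs around (0, 0), (4, 4) and (8, 8).
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in (0.0, 4.0, 8.0)])
centers = farthest_first_centers(X, k=3, rng=rng)
print(centers)
```

For the second safeguard, scikit-learn's `KMeans` accepts an `n_init` parameter that reruns the algorithm from several initializations and keeps the run with the lowest objective.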


Choosing the Number of Clusters

• Idea: stop adding clusters when another cluster does not give a much better model of the data
• One way to select $k$ for the K-means algorithm is to try different values of $k$, plot the K-means objective versus $k$, and look for the "elbow point" in the plot
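A minimal sketch of the elbow procedure, assuming scikit-learn's `KMeans` and using its `inertia_` attribute (sum of squared distances to the nearest center) as the objective. The three-group data is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data with three well-separated groups.
X = np.vstack([rng.normal(m, 0.4, size=(60, 2)) for m in (0.0, 5.0, 10.0)])

ks = list(range(1, 8))
objective = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    objective.append(model.inertia_)   # sum of squared distances to nearest center

for k, value in zip(ks, objective):
    print(k, round(value, 1))
# The drop from k=2 to k=3 should be large and the drops after k=3 small;
# plotting `objective` against `ks` makes the elbow visible.
```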


K-means: Limitations

• Makes hard assignments of points to clusters
  – A point either belongs completely to a cluster or does not belong at all
  – No notion of a soft assignment (i.e., a probability of being assigned to each cluster)
  – Gaussian mixture models (we will study later) and Fuzzy K-means allow soft assignments
• Sensitive to outlier examples
  – The K-medians algorithm is a more robust alternative for data with outliers
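The outlier sensitivity comes from the mean itself. A small sketch with made-up 1-D values shows how a single outlier moves the mean (the K-means center) far more than the median (the K-medians center):

```python
import numpy as np

# Hypothetical 1-D cluster with a single outlier appended.
cluster = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.append(cluster, 100.0)

print(np.mean(cluster), np.median(cluster))            # 3.0 3.0
print(np.mean(with_outlier), np.median(with_outlier))  # mean jumps, median barely moves
```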


K-means: Limitations

• Works well only for round-shaped clusters of roughly equal size and density
• Does badly if the clusters have non-convex shapes
  – Spectral clustering (we will study later) and Kernelized K-means can be alternatives
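A minimal sketch comparing K-means with spectral clustering on a non-convex data set. It assumes scikit-learn's `make_moons` generator and `SpectralClustering` with a nearest-neighbor affinity; the parameter values are illustrative choices, not taken from the slides.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: a standard non-convex test case.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=0).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, km_labels)
ari_spectral = adjusted_rand_score(y_true, sc_labels)
print(ari_kmeans, ari_spectral)
```

On data like this, K-means tends to cut each moon in half with a straight boundary, while the connectivity-based method can follow the curved shapes.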


K-means: Limitations

• Non-convex/non-round-shaped clusters: standard K-means fails!
• (optional) Connectivity → networks → spectral partitioning


K-means: Limitations

• Clusters with different densities
