28
Cluster Analysis Organize observations into similar groups

Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Cluster AnalysisOrganize observations into similar groups

Page 2: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Outline

• Visual clustering

• Algorithmic clustering

• Hierarchical clustering

• Self-organising maps

Page 3: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Graphical clustering

How many clusters are cluster-unknown.csv?

Use brush and spin to identify them

Page 4: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Spin and brush

“book”2007/7/19page 109!

!!

!

!!

!!

5.2 Purely graphics 109

Initial projection...

!

! !

!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!!

!

!!

!

!

!

!

!

!

!

!!

! !

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!!

!!

! !!

!

!

!

!!

!

!

!! !

!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!!

!

!!

!

!

!!!

!

!!

!

!

!

!

!!

!!

!

!

!

!

!

!

!

!

!!

!!

!

!

!!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

! !!

!

!

!

!

!

!

!

!

!

!

!

! !

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

! !

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!

!

!

!

!

!!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!

!!

!

!

!

!

!

!!

PC1

PC2

Spin,Stop,Brush,...

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!!!

! !!

!

!

!

!

!!

!

!

!!!

!!

!

!

!

!

!

!

!

!

!

!

!

!!!

!

!

!

!

!!!!

!!

!

!!!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!!

!

!

!

!!!

!

!

!

!

!

!

!

!

!

!

!!!

!!

!

!

!!

!

!

!

!

!

!

!

!!!!

!

!

!

!!

!

!!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!!

!

!!

!

!!! !

!!

!

!

!

!

!

!

!

!!

!!

!

!

!

!

!!

!

!!!

!

!!

!

!

!!!

!

!

!!!!

!!

!

!!

!!

!

!

!!

! !

!!

!!

!!

!

!!

!

!

!

!

!

!

!

!

!

!!

!!!

!

!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!!

!

!!

!!

!

!

!

! !!

!!

!

!!!

!

!

!!!

!

!

!

!!

!

!

!

!!

!!!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!!

!

!!

!

!! !

!!

!

!!

!!

!!

!

!

!

!

!

!!!

!

!

!

!!!

!!!

!

!

!

!

!!!

!

!

!!

!

!!!

!

!!

!

!!!

!

!

!!

!

!!!

!

!!

!

!!!

!

!!

!

!

!

!!!!!

!

!

!

!

!!

!!

!

!

!!

!

!

!

!

!!

!

!

!

!!

!

!

!!

!

!

!

!!!

!

!

!

!

!

!

! !

!

!!

PC2

PC3

PC4

PC5

PC6

PC7

Brush,Brush,Spin...

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!!!

! !!

!

!

!

!

!!

!

!

!!!

!!

!

!

!

!

!

!

!

!

!

!

!

!!!

!

!

!

!

!!!!

!!

!

!!!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!!

!

!

!

!!!

!

!

!

!

!

!

!

!

!

!

!!!

!!

!

!

!!

!

!

!

!

!

!

!

!!!!

!

!

!

!!

!

!!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!!

!

!!

!

!!! !

!!

!

!

!

!

!

!

!

!!

!!

!

!

!

!

!!

!

!!!

!

!!

!

!

!!!

!

!

!!!!

!!

!

!!

!!

!

!

!!

! !

!!

!!

!!

!

!!

!

!

!

!

!

!

!

!

!

!!

!!!

!

!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!!

!

!!

!!

!

!

!

! !!

!!

!

!!!

!

!

!!!

!

!

!

!!

!

!

!

!!

!!!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!!

!

!!

!

!! !

!!

!

!!

!!

!!

!

!

!

!

!

!!!

!

!

!

!!!

!!!

!

!

!

!

!!!

!

!

!!

!

!!!

!

!!

!

!!!

!

!

!!

!

!!!

!

!!

!

!!!

!

!!

!

!

!

!!!!!

!

!

!

!

!!

!!

!

!

!!

!

!

!

!

!!

!

!

!

!!

!

!

!!

!

!

!

!!!

!

!

!

!

!

!

! !

!

!!

PC2

PC3

PC4

PC5

PC6

PC7

...Brush...

!

!

!!

!

! !!

!!

!!!

!!

!!

! !!

!

!!!

!!

!

!

!

!!

!

!!!

!!

! !!

!

!!

!

!

!! !

!

!

!

!!

!!

!!

!

!!

!

!!

!

!!

!!

!!

!

!

!!

!!

!!!!

! !

!

!

!

!

!!

!!

!!

!!!!!

!

!

!!

!!

!! !

!

!

!!

!

!

!

!

!

!

!!! !

!! !!!

!

!

!!!

!

!

!!

!

!

!

!!

!!

!

!

!!

!

!

!!

!! !

!

!

!

!

!! !

!

!

!!

!

!

!

!!

!

!

!

!

!!

!

!!

!

!!

! !!!!

!! !!

!

!!!

!

!

!!

!!

!

!

!

!

!

!!! !

!!

! !

!

!

!

!

!

!

!!

!

!

!! !!

!!

!!! !!

!

!

!

!!

!! !!

!!

!

!!

!!

!

!!

!!

!!

! !!

!!!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!! !

!!

!

!!

!

!

!!!!!

!!

!!

!

!!

!!!!!

!

!

!!

! !

! !

!!!! !! !

!

!! !

!

!

!

!

!! !

!!

!

!

!

!!

!

!!

!

!

!

!

!

!!

!

!

!

!

!

!!

!!!

!!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!!

!

!

!

! ! !

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!!

!!

!

!

!!

! !

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

! !!

!

!

!

!

!

!

!!

!!

!

!

!

!

!

!

!

!

!!

PC1PC3

PC4 PC6PC7

Hide,Brush,Spin...

!!

! !

!

!

!!

!

!!

!

! !

!

!

!

!!

!

! !!!!

!

!

!

!! !

!!

!

! !

!

!!

!!

!

!!

!

!

!

!

!

!

!!

!

!

! !!

!!

!

!!!

!!

!!

!

! !!

!

!

!!!

!

!

!

!

!!

!

!

!!

!

!!!

!!

!

!

!!

!

!!

!

!

!!

!

!

!

!

!

!

!

!!

!!

!!

!

!

! !

!!

!!

!!

!

!!

!!

!

!!

!

!

!!

!!

!!!!

!

!

! !

!

!

!

!

!

!!

!!

!!

!

!!!!!

!

!

!

!!

!!

!

!

! !

!

!

!!

!

!

!

!

!

!

!

!!!

!

!

!

!

!! !!

!!

!

!

!!!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!

!

!

!

!!

!

!

!!

!! !

!

!

!

!

!

!

!

!! !

!

!

!!

!

!

! !

!

!!

!

!

!

!

!

!

!

!

!!

!

!!

! !!!!

!

!

! !!!

!

!

!!!

!

!

!!

!!

!

!

! !

!

!

!

!!! !

!!

! !

!

!

!

!

!

!

!

!

!!

!

!

!!

!

!! !!

!!

!!

!! !!

!

!

!

!

!

!

!

!!

!

!

!!! !

!

!

!!!

!

!!

!

!!

!

!

!

!

!!

!!

!!

! !!

!!!

!

!

!

!

!!

!!

!

!

!!

!

!

!

!

!

!

!

!

!

! !!

!

!

!!

!

!

!

!

!!

!

!!!

!

!

!

!

!!

!

!

!!

!!

!

!!!

!

!

!

!

!

! !

! !

!

!

!!! !! !

!

!

!

!

!

! !

!

!

!

!

!! !

!!

!

! !

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!

!

!!

!

!!

PC1PC3

PC4 PC6PC7

...Brush...

!!!!

!

!

!!

!

!

!

!

!

!

!!!

!

!!

!!

!

!

!!

!

!

!!

!!

!

!!

!!

!

!!

!

!

!

!!

!

!

!! !

!

!!!!

!

!

!

!

!

!

!

!!

!!

!

!

!

!!

!

!

!!

!

!!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

! ! !

!

!

!

! !!!

!

!!! !

!!

!

!

!!

!!!

!

!

!!!!!

!

!!

! !! !

!

! !

!

!

!

!

!!

!!

!

!!

!

!

!

!!

!!! !

!

!

!!

!

!!

!!

!

!!

!!

!!

!

! !

!

!

!

!!

!!

!

!

!

!!

!

!

!

!

!

!

!

!

!!!

!

!

!

!

!

!

!!

!

!

!!

!

!!

!

!

!

!

!

!

!!!!

!! !

!

!

!

!

!

!

!!

!

!

!

!!!

!!

!

!!

!!!

!

!

!

!!

! !

!

!!

!

!!

!

!

!!

!!!!

!

!

!

!

!

!

!

!

!

!!!

!!!

!

! !!

!

!

!

!

!!

!!

!

!!

!

!!

!

!

!

!!

!

!!

!!

!

!

!

!

!

!!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!

!!

!!

!

!!!

!

! !

!

!

!

!

! !!

!

!

!

!

!

!

!!

!

!

!

!

!!

! !!

!

!

!

!!

!

! !!

!

!

!

!

!

!

! !!!

!!

!

!

!

!

!

!

!!

!

!

!!!!

!!!

!

!!

!

!!!

!

!

! !!

!

!

!!!!

!

!

!

!

!

!!!

!!!!!!

!

!

!!!

!

!

!

!

!

!

!!!!!

!

!

!

!

!!!

!

!

!!

!

!!

!

!!

!

!!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!!

!

!!

!!

PC3

PC5PC6

PC7

Connect Dots

!

!

!

!!

!

!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

! !

!

!

!

!

! !

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!! ! !

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

PC1PC3

PC5PC7

Show,Connect

!

!!

!

!

!

!

!!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!!

!

!

!

!

!

!!!

!

!

!!

!

!!!

!

!!

!

!

!

!!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!!

!

!!

!!

!

!

!

!!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!!

!

!!

!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!!

!

!!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!!

!

!

!

!

!

! !

!

!

!!

!

!!!

!

!

!

!

!

!

!

!

PC1

PC2PC3

PC4PC5

PC6

...Finished!

!

!

!!

!

!

!

!!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!!

!

! !

!

!!

!!

!

!

!

!!

!!

!

!

!!

!

!

!

!!

!

! !

! !

!!!

!

!

! !!

!!

!!

! !!!

!

!

!!

!!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!!!

!

!!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!

!

!

!

!

!

!

!!

!

!!

!!

!

!

!

!

!!!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

! !

!

!

!

!

!

!

! !

!

!

!!

!!!

!

! !

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!!

!

! !

!

!

!!

!

!

!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!!

!!

!

!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!

!!!

!!!

!!

!!

!

!

!!

!!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

X3

X4X5

X6

Fig. 5.3. Stages of spin and brush on PRIM7. The high-dimensional geometryemerges as the clusters are painted.

The results at this stage are summarized by the bottom row of plots.There is a very visible triangular component (in gray) revealed when all ofthe colored points are hidden. We check the shape of this cluster by drawinglines between outer points to contain the inner ones. Touring after the linesare drawn helps to check how well they match the shape of the clusters. Thecolored groups pair up at each vertex, and we draw in the shape of these too— a single line matches the structures reasonably well.

The final step of the spin and brush clustering is to clean up this solution,touching up the color groups by continuing to tour, and repainting a point

Tour, paint a cluster, continue touring, until no more clusters are revealed.

Page 5: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

“book”2007/7/19page 105!

!!

!

!!

!!

5.1 Background 105

had not been found using numerical clustering methods, and to find a varietyof structures in high-dimensional space. Section 5.3 describes methods for re-ducing the interpoint distance matrix to an intercluster distance matrix usinghierarchical algorithms and model-based clustering, and shows how graphicaltools are used to assess the results of numerical methods. Section 5.5 summa-rizes the chapter and revisits the data analysis strategies used in the examples.A good companion to the material presented in this chapter is Venables &Ripley (2002), which provides data and code for practical examples of clusteranalysis using R. Section 5.5 summarizes the chapter and revisits the dataanalysis strategies used in the examples.

5.1 Background

Before we can begin finding groups of cases that are similar, we needto decide on a definition of similarity. How is similarity defined? Consider adataset with three cases and four variables, described in matrix format as

X =

!

"X1

X2

X3

#

$ =

!

"7.3 7.6 7.7 8.07.4 7.2 7.3 7.24.1 4.6 4.6 4.8

#

$

which is plotted in Fig. 5.2. The Euclidean distance between two cases (rowsof the matrix) is defined as

dEuc(Xi,Xj) = ||Xi !Xj || i, j = 1, . . . , n,

where ||Xi|| =%

X2i1 + X2

i2 + . . . + X2ip. For example, the Euclidean distance

between cases 1 and 2 in the above data, is&

(7.3! 7.4)2 + (7.6! 7.2)2 + (7.7! 7.3)2 + (8.0! 7.2)2 = 1.0.

For the three cases, the interpoint Euclidean distance matrix is

dEuc =

!

"0.01.0 0.06.3 5.5 0.0

#

$X1

X2

X3

Cases 1 and 2 are more similar to each other than they are to case 3, becausethe Euclidean distance between cases 1 and 2 is much smaller than the distancebetween cases 1 and 3 and between cases 2 and 3.

What is similar?

“book”2007/7/19page 105!

!!

!

!!

!!

5.1 Background 105

had not been found using numerical clustering methods, and to find a varietyof structures in high-dimensional space. Section 5.3 describes methods for re-ducing the interpoint distance matrix to an intercluster distance matrix usinghierarchical algorithms and model-based clustering, and shows how graphicaltools are used to assess the results of numerical methods. Section 5.5 summa-rizes the chapter and revisits the data analysis strategies used in the examples.A good companion to the material presented in this chapter is Venables &Ripley (2002), which provides data and code for practical examples of clusteranalysis using R. Section 5.5 summarizes the chapter and revisits the dataanalysis strategies used in the examples.

5.1 Background

Before we can begin finding groups of cases that are similar, we needto decide on a definition of similarity. How is similarity defined? Consider adataset with three cases and four variables, described in matrix format as

X =

!

"X1

X2

X3

#

$ =

!

"7.3 7.6 7.7 8.07.4 7.2 7.3 7.24.1 4.6 4.6 4.8

#

$

which is plotted in Fig. 5.2. The Euclidean distance between two cases (rowsof the matrix) is defined as

dEuc(Xi,Xj) = ||Xi !Xj || i, j = 1, . . . , n,

where ||Xi|| =%

X2i1 + X2

i2 + . . . + X2ip. For example, the Euclidean distance

between cases 1 and 2 in the above data, is&

(7.3! 7.4)2 + (7.6! 7.2)2 + (7.7! 7.3)2 + (8.0! 7.2)2 = 1.0.

For the three cases, the interpoint Euclidean distance matrix is

dEuc =

!

"0.01.0 0.06.3 5.5 0.0

#

$X1

X2

X3

Cases 1 and 2 are more similar to each other than they are to case 3, becausethe Euclidean distance between cases 1 and 2 is much smaller than the distancebetween cases 1 and 3 and between cases 2 and 3.

“book”2007/7/19page 105!

!!

!

!!

!!

5.1 Background 105

had not been found using numerical clustering methods, and to find a varietyof structures in high-dimensional space. Section 5.3 describes methods for re-ducing the interpoint distance matrix to an intercluster distance matrix usinghierarchical algorithms and model-based clustering, and shows how graphicaltools are used to assess the results of numerical methods. Section 5.5 summa-rizes the chapter and revisits the data analysis strategies used in the examples.A good companion to the material presented in this chapter is Venables &Ripley (2002), which provides data and code for practical examples of clusteranalysis using R. Section 5.5 summarizes the chapter and revisits the dataanalysis strategies used in the examples.

5.1 Background

Before we can begin finding groups of cases that are similar, we needto decide on a definition of similarity. How is similarity defined? Consider adataset with three cases and four variables, described in matrix format as

X =

!

"X1

X2

X3

#

$ =

!

"7.3 7.6 7.7 8.07.4 7.2 7.3 7.24.1 4.6 4.6 4.8

#

$

which is plotted in Fig. 5.2. The Euclidean distance between two cases (rowsof the matrix) is defined as

dEuc(Xi,Xj) = ||Xi !Xj || i, j = 1, . . . , n,

where ||Xi|| =%

X2i1 + X2

i2 + . . . + X2ip. For example, the Euclidean distance

between cases 1 and 2 in the above data, is&

(7.3! 7.4)2 + (7.6! 7.2)2 + (7.7! 7.3)2 + (8.0! 7.2)2 = 1.0.

For the three cases, the interpoint Euclidean distance matrix is

dEuc =

!

"0.01.0 0.06.3 5.5 0.0

#

$X1

X2

X3

Cases 1 and 2 are more similar to each other than they are to case 3, becausethe Euclidean distance between cases 1 and 2 is much smaller than the distancebetween cases 1 and 3 and between cases 2 and 3.

Page 6: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Hierarchical clustering> library(rggobi)> d.prim7 <- read.csv("prim7.csv")> d.prim7.dist <- dist(d.prim7)> d.prim7.dend <- hclust(d.prim7.dist, method="average")> plot(d.prim7.dend)

“book”2007/7/19page 112!

!!

!

!!

!!

112 5 Cluster Analysis119

485

108

410

306

44

319

493 22

64

271

170

461

52

285

449

451 197

80

416 1

92

138

382

148

371

424

439

224

472 2

4247

184

328

337

25 7

36

304

135

435

103

146 45

426

256

287 31

387

173

190

290

89

258 29

254

309

324

293

174

168

12

475 181

466

134

179

462

123

83

237

341

359

437

447

65

145

428

434

440

253

23 3

286

480

238

405

139

218 2

244

478

487

379

383

210

467

204

430

203

331

113

444

419

54

296

185

265

346

282

388

473

311

463

222

325

368

396

167

261 1

307

221

122

308

357

151

355

458

154

206

117

187

464

294

494

124

219

333

101

160

432 79

234

116

66

121 161

220

281

377

120 4

342 352

267

131

196

298

233

360

402

407

34

317

299

385 43

58

107

14

72 92

90

484 77

413

88

351

102

63

105

127

277

335

89

431

295

230

193

349

421

250

251 5

1358

338

339 98

401

268

381 474

363

111

367

411

81

32932

199

279

399

406

441

384

443

156

344

479

262

188

165

442

71

162 322

19

13

96

320

274

115

378 6

2499

21

429

245

452

490

240

415

53

266

455

278

86

152 91

106

177

158

126

414

400

433

140

393

453

147

284

195

46

418

476 26

297

292

109

438

37

100

180

200

389

94

133 7

8456

75

471

372

252

305 74

270

110

194

482

249

454

369

469

137

495

144

289

404

364

225

354

136

216

243

242

353

149

272

376

27

395 175

291

48

483

47

235

422

186

155

205

213

191

460

500

201

239

39

423

255

301 5

348

326

327

11

202 56

231

207

208 33

366

118

446

60

20

496 61

232 87

318 198

392

217

227

340

457

302

171

486

10

241 8

468

373

82

288 38

214

394

276

459

313

189

347 492

67

215

350 6

40

112

303

28

182

128

336

163

330

183

73

248

310

176

209

226

212

321

172

273

125

30

104

477

489

257

59

409

498

275

228

260

142

99

488

130

166

356

465

97

450

398

236

211

427 50

169

70

334

93

365 18

390

129

17

263

315

481

314

391

380

468

76

361 132

412

35

345

420

16

143

408

15

55

150

445

114

264

95

69

164

386

323

374

300

362 85

343

159

229

425

49

153

497

316

312

259

283

370

397

157

417

41

280

448

491

470

332

141

223

436

246

375

403

178

269

42

57

05

10

15

20

25

hclust (*, "average")

d.prim7.dist

Heig

ht

Dendrogram for Hierarchical Clustering with Average Linkage

Cluster 1

!

!

!

!!

!

!

!

!

!!

! !

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !!

!

!

!!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!! !

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!!!

!

!

!

!! !!

!!

!

!!

!!

!

!!!

! !

!

!

!

!

!!

!!

!

!

!!

!!

!

!

!

!!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

! !

!!!

!

!

!!

!!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

X3

X5

Cluster 2

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!!

!!

!

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!!

!!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!!

!

!

!

!

!

!

!!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!!

!

!

!

!

!

!

! !

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!!

!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

! !

!!!!

!

!

!

!

!

!

!!

!

!!

!

!

!

!!

!

!

!

!

!

!!!

!

!

!! !!

!

!

!!

!

!

!!

!

!!

!

! !

!

!

!

!!

!

!

!

!!

!

!

!!

!

!

!!

!

!!

!

!!

!!

!

!!

!!

!

!

!!

!

!

!

!!

!

!

!!

!

!

!!!

!

!!! !

!!

!

!

!

!

!

!

! !! !

!

!

!

!

!

!

!

!

!

!

!!

!! !!!

!

!

!!

!

!!

!

!

!

X3

X5

Cluster 3

!

!

!

!

!

!

!

!!

!

!

!

!

!

!!

! !

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !!

!

!

!

!

!

!

! !

!!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

! !!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!!

!

!!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!! !

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!!

!!

!

!!

!

!

!

!

!!!! !!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!!

!! !

!!!

!!!!!!!

!

!

!

!

!!

!

!

!

!!!

!!

!

!!

!

!!

!!

!

!!

!

!

!

!

!!

!!

!!!

!

!

!

!

!!

!!!

!

!!

!

!

!

!!

X3

X5

Cluster 5

!

!!

!

!

!

!

!

!

!

!

!

!!!!

!

!

!

!

!

!

!

!

!

!!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

! !

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

! !

!!

!

!

!!

!

!

!

!!

!!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!!

!

!

!!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!! !

!

!!

!

!

!!

!!

!

!

!

!

!

!

!

!!

!!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!

!

!

!

!

!

! !!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!!

!

! !

!!

!

!

!!

!!!

!

!!!!

!

!!

!!!!

!

!

!

!!

! !!!

!!

!

!!!!!!!!

!

!

!!

!!

!!

!

!

! !!!!!!!

X1 X3

X5

X6

X7

Cluster 6

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!!!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!!

!

!

!!

!

!

!

!!

!!

!

!!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!!

!

!

!!

!

!

!

!!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!!

!

!

!!

!

!

!!

!!!!!

!

!!!

!

!

!!! !

X1 X3

X5

X6

X7

Cluster 7

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!!

!

!

!!

!

!

!

!!

!!

!

!!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!!

!!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!!

!

!

!

!!

!

!

!

!!

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!!

!

! !

!!

!

!

!!

!

X1 X3

X5

X6

X7

Fig. 5.5. Hierarchical clustering of the particle physics data. The dendrogram showsthe results of clustering the data using average linkage. Clusters 1, 2, and 3 carveup the base triangle of the data; clusters 5 and 6 divide one of the arms; and cluster7 is a singleton.

The top three plots show, respectively, clusters 1, 2, and 3: These clustersroughly divide the main triangular section of the data into three. The bottomrow of plots show clusters labeled 5, and 6, which lie along the linear pieces,and cluster 7, which is a singleton cluster corresponding to an outlier in thedata.

The results are reasonably easy to interpret. Recall that the basic geometryunderlying this data is that there is a 2D triangle with two linear strandsextending from each vertex. The hierarchical average linkage clustering ofthe particle physics data using nine clusters essentially divides the data into

Page 7: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

> gd <- ggobi(d.prim7)[1]> clust9 <- cutree(d.prim7.dend, k=9)> glyph_color(gd)[clust9==1] <- 9 # highlight triangle> glyph_color(gd)[clust9==1] <- 1 # reset color> glyph_color(gd)[clust9==2] <- 9 # highlight cluster 2

“book”2007/7/19page 112!

!!

!

!!

!!

112 5 Cluster Analysis

119

485

108

410

30

644

319

493 22

64

27

1170

46

152

285

44

945

1 197

80

416 1

92

13

83

82

14

83

71

42

443

92

24

472 2

42

47

18

43

28

337

25 7

36

30

41

35

435

103

146 45

42

62

56

287 31

387

17

319

029

08

92

58 29

254

30

932

42

93

17

41

68

12

475 18

14

66

13

41

79

46

21

23

83

23

73

41

359

437

44

76

51

45

428

434

440

25

323 3

286

480

23

840

51

39

21

8 224

447

84

87

37

938

321

046

72

04

430

203

331

11

344

44

19

54

29

618

52

65

346

28

238

847

33

11

463

22

23

25

368

396

16

726

1 13

07

22

112

230

83

57

15

135

54

58

15

42

06

11

71

87

46

42

94

49

412

421

93

33

101

160

432 79

234

116

66

12

1 161

220

28

13

77

12

0 43

42 352

267

13

11

96

298

233

360

402

407

34

317

29

938

5 43

58

10

71

47

2 92

90

48

4 77

413

88

351

10

26

31

05

127

277

335

89

43

129

52

30

193

349

421

25

025

1 51

35

83

38

33

9 98

40

12

68

381 47

43

63

11

136

74

11

81

32

93

21

99

279

399

406

441

384

443

15

63

44

479

262

18

81

65

44

27

11

62 322

19

13

96

320

27

41

15

37

8 62

499

21

429

245

452

490

24

041

553

26

64

55

278

86

15

2 91

106

177

158

12

641

44

00

433

14

039

34

53

14

728

41

95

46

41

847

6 26

29

729

21

09

43

83

71

00

18

020

03

89

94

13

3 78

456

75

47

137

225

230

5 74

27

01

10

19

44

82

249

454

36

94

69

137

495

144

289

404

364

225

354

13

62

16

243

24

235

31

49

272

376

27

395 17

52

91

48

483

47

23

54

22

186

155

20

52

13

19

146

05

00

20

12

39

39

42

325

530

1 53

48

32

63

27

11

20

2 56

231

207

208 33

36

61

18

44

66

02

04

96 61

232 87

318 19

839

22

17

227

340

45

73

02

17

14

86

10

241 8

46

837

382

28

8 38

214

394

27

64

59

31

31

89

347 49

26

721

535

0 64

01

12

30

328

182

128

336

163

330

183

73

24

83

10

17

620

922

621

232

11

72

27

31

25

30

10

44

77

489

25

75

94

09

49

82

75

22

826

014

29

94

88

13

016

635

646

59

74

50

398

23

621

142

7 50

16

97

03

34

93

36

5 18

390

129

17

26

33

15

481

314

391

38

046

87

636

1 132

41

235

345

420

16

143

408

15

55

15

044

51

14

264

95

69

16

43

86

323

374

300

362 85

34

315

922

942

54

91

53

49

73

16

312

259

283

37

039

71

57

417

41

280

448

491

470

332

141

223

436

246

375

403

178

269

42

57

05

10

15

20

25

hclust (*, "average")

d.prim7.dist

Heig

ht

Dendrogram for Hierarchical Clustering with Average Linkage

Cluster 1

!

!

!

!!

!

!

!

!

!!

! !

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !!

!

!

!!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!! !

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!!!

!

!

!

!! !!

!!

!

!!

!!

!

!!!

! !

!

!

!

!

!!

!!

!

!

!!

!!

!

!

!

!!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

! !

!!!

!

!

!!

!!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

X3

X5

Cluster 2

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!!

!!

!

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!!

!!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!!

!

!

!

!

!

!

!!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!!

!

!

!

!

!

!

! !

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!!

!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

! !

!!!!

!

!

!

!

!

!

!!

!

!!

!

!

!

!!

!

!

!

!

!

!!!

!

!

!! !!

!

!

!!

!

!

!!

!

!!

!

! !

!

!

!

!!

!

!

!

!!

!

!

!!

!

!

!!

!

!!

!

!!

!!

!

!!

!!

!

!

!!

!

!

!

!!

!

!

!!

!

!

!!!

!

!!! !

!!

!

!

!

!

!

!

! !! !

!

!

!

!

!

!

!

!

!

!

!!

!! !!!

!

!

!!

!

!!

!

!

!

X3

X5

Cluster 3

!

!

!

!

!

!

!

!!

!

!

!

!

!

!!

! !

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !!

!

!

!

!

!

!

! !

!!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

! !!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!!

!

!!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!! !

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!!

!!

!

!!

!

!

!

!

!!!! !!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!!

!! !

!!!

!!!!!!!

!

!

!

!

!!

!

!

!

!!!

!!

!

!!

!

!!

!!

!

!!

!

!

!

!

!!

!!

!!!

!

!

!

!

!!

!!!

!

!!

!

!

!

!!

X3

X5

Cluster 5

!

!!

!

!

!

!

!

!

!

!

!

!!!!

!

!

!

!

!

!

!

!

!

!!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

! !

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

! !

!!

!

!

!!

!

!

!

!!

!!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!!

!

!

!!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!! !

!

!!

!

!

!!

!!

!

!

!

!

!

!

!

!!

!!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!

!

!

!

!

!

! !!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!!

!

! !

!!

!

!

!!

!!!

!

!!!!

!

!!

!!!!

!

!

!

!!

! !!!

!!

!

!!!!!!!!

!

!

!!

!!

!!

!

!

! !!!!!!!

X1 X3

X5

X6

X7

Cluster 6

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!!!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!!

!

!

!!

!

!

!

!!

!!

!

!!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!!

!

!

!!

!

!

!

!!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!!

!

!

!!

!

!

!!

!!!!!

!

!!!

!

!

!!! !

X1 X3

X5

X6

X7

Cluster 7

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!!

!

!

!!

!

!

!

!!

!!

!

!!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!!

!!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!!

!

!

!

!!

!

!

!

!!

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!!

!

! !

!!

!

!

!!

!

X1 X3

X5

X6

X7

Fig. 5.5. Hierarchical clustering of the particle physics data. The dendrogram showsthe results of clustering the data using average linkage. Clusters 1, 2, and 3 carveup the base triangle of the data; clusters 5 and 6 divide one of the arms; and cluster7 is a singleton.

The top three plots show, respectively, clusters 1, 2, and 3: These clustersroughly divide the main triangular section of the data into three. The bottomrow of plots show clusters labeled 5, and 6, which lie along the linear pieces,and cluster 7, which is a singleton cluster corresponding to an outlier in thedata.

The results are reasonably easy to interpret. Recall that the basic geometryunderlying this data is that there is a 2D triangle with two linear strandsextending from each vertex. The hierarchical average linkage clustering ofthe particle physics data using nine clusters essentially divides the data into

Clusters 1, 2, 3 carve up the base triangle.

Clusters 5, 6 divide an arm.

Cluster 7 is a singleton cluster containing an outlier.

Page 8: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Self-organizing maps

• A constrained k-means algorithm.

• A 1D or 2D net is stretched through the data. The knots in the net form the cluster centres, and the points closest to the knot are considered to belong to that cluster.

• Net can be unwrapped and laid out flat to give a visual representation of the clusters.

Page 9: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Music data

• Descriptive statistics for 62 songs: Variance, Mean, Max, Energy & Frequency

• 32 rock (Abba, Beatles, Eels) 27 classical (Beethoven, Mozart, Vivaldi)3 new wave (Enya)

• Can a computer identify the different types?

Page 10: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Can you hear the difference?

• One

• Two

• Three

Page 11: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Can you hear the difference?

• One

• Two

• Three

Page 12: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Can you hear the difference?

"Rock" - Abba• One

• Two

• Three

Page 13: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Can you hear the difference?

"Rock" - Abba• One

• Two

• Three

Page 14: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Can you hear the difference?

"Rock" - Abba

New wave - Enya

• One

• Two

• Three

Page 15: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Can you hear the difference?

"Rock" - Abba

New wave - Enya

• One

• Two

• Three

Page 16: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Can you hear the difference?

"Rock" - Abba

New wave - Enya

Classical - Vivaldi

• One

• Two

• Three

Page 17: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Can you see the difference?

Is rock different from classical?

Page 18: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Can you see the difference?

Is rock different from classical?

Can also see the shape in 5d: outliers and non-linearity

Page 19: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Model & data

1

2

3

4

5

6

1 2 3 4 5 6

● ● ● ● ● ●

● ● ● ● ● ●

● ● ● ● ● ●

● ● ● ● ● ●

● ● ● ● ● ●

● ● ● ● ● ●

● ●●

●●

●●

●●●

●●

● ●

GirlB3 AgonyV2 V1 B4

M2M4 Saturday Morning

I Want to Hold Your Hand

Love of the Loveless

V10

M1 B8 HelpM5The Good Old Days

Ticket to RideB7V5 Restraining All in a Days Work

Rock Hard Times

YesterdayV9B6I Have A Dream Eleanor Rigby Yellow SubmarineV3 Penny LaneWrong About BobbyI Feel Fine

B5

Lone WolfLove Me DoCant Buy Me LoveHey Jude

The WinnerThe Memory of Trees Anywhere IsV4 B2 V13

M3

Money WaterlooV12

Super TrouperMamma MiaTake a ChanceKnowing MePax Deorum Lay All YouV11

V7

V8 Dancing QueenSOS

V6M6 B1

map

2

map1

data in the model space

How well does the model fit the data?

SOM

Page 20: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

−0.2

0.0

0.2

0.4

0.6

●●

●● ●

●●

●●

●●●

● ●●

●●

● ●

●●

V8

Lay All YouDancing QueenSuper TrouperTake a ChanceAnywhere Is The WinnerMoney

WaterlooLove Me Do V4

V11

Eleanor RigbyYellow Submarine B2All in a Days Work I Have A DreamV9Penny LaneLone WolfI Feel Fine Restraining

Pax Deorum

V3Cant Buy Me Love V5Ticket to RideWrong About Bobby B7 Hey JudeYesterday B6M5

SOS

The Good Old DaysHelpSaturday Morning M1B8Love of the LovelessM4

I Want to Hold Your Hand Agony B1

Mamma Mia

V1 V10B3Rock Hard Times M2B5V6 M3 M6

Girl

V13

B4

V2

V7

The Memory of TreesKnowing Me V12

Com

p.2

Comp.1

PCA

Model & data

1

2

3

4

5

6

1 2 3 4 5 6

● ● ● ● ● ●

● ● ● ● ● ●

● ● ● ● ● ●

● ● ● ● ● ●

● ● ● ● ● ●

● ● ● ● ● ●

● ●●

●●

●●

●●●

●●

● ●

GirlB3 AgonyV2 V1 B4

M2M4 Saturday Morning

I Want to Hold Your Hand

Love of the Loveless

V10

M1 B8 HelpM5The Good Old Days

Ticket to RideB7V5 Restraining All in a Days Work

Rock Hard Times

YesterdayV9B6I Have A Dream Eleanor Rigby Yellow SubmarineV3 Penny LaneWrong About BobbyI Feel Fine

B5

Lone WolfLove Me DoCant Buy Me LoveHey Jude

The WinnerThe Memory of Trees Anywhere IsV4 B2 V13

M3

Money WaterlooV12

Super TrouperMamma MiaTake a ChanceKnowing MePax Deorum Lay All YouV11

V7

V8 Dancing QueenSOS

V6M6 B1

map

2

map1

data in the model space

How well does the model fit the data?

SOM

Page 21: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

−0.2

0.0

0.2

0.4

0.6

●●

●● ●

●●

●●

●●●

● ●●

●●

● ●

●●

V8

Lay All YouDancing QueenSuper TrouperTake a ChanceAnywhere Is The WinnerMoney

WaterlooLove Me Do V4

V11

Eleanor RigbyYellow Submarine B2All in a Days Work I Have A DreamV9Penny LaneLone WolfI Feel Fine Restraining

Pax Deorum

V3Cant Buy Me Love V5Ticket to RideWrong About Bobby B7 Hey JudeYesterday B6M5

SOS

The Good Old DaysHelpSaturday Morning M1B8Love of the LovelessM4

I Want to Hold Your Hand Agony B1

Mamma Mia

V1 V10B3Rock Hard Times M2B5V6 M3 M6

Girl

V13

B4

V2

V7

The Memory of TreesKnowing Me V12

Com

p.2

Comp.1

PCA

Model & data

1

2

3

4

5

6

1 2 3 4 5 6

● ● ● ● ● ●

● ● ● ● ● ●

● ● ● ● ● ●

● ● ● ● ● ●

● ● ● ● ● ●

● ● ● ● ● ●

● ●●

●●

●●

●●●

●●

● ●

GirlB3 AgonyV2 V1 B4

M2M4 Saturday Morning

I Want to Hold Your Hand

Love of the Loveless

V10

M1 B8 HelpM5The Good Old Days

Ticket to RideB7V5 Restraining All in a Days Work

Rock Hard Times

YesterdayV9B6I Have A Dream Eleanor Rigby Yellow SubmarineV3 Penny LaneWrong About BobbyI Feel Fine

B5

Lone WolfLove Me DoCant Buy Me LoveHey Jude

The WinnerThe Memory of Trees Anywhere IsV4 B2 V13

M3

Money WaterlooV12

Super TrouperMamma MiaTake a ChanceKnowing MePax Deorum Lay All YouV11

V7

V8 Dancing QueenSOS

V6M6 B1

map

2

map1

data in the model space

How well does the model fit the data?

1

2

3

4

5

6

1 2 3 4 5 6

● ● ● ● ● ●

● ● ● ● ● ●

● ● ● ● ● ●

● ● ● ● ● ●

● ● ● ● ● ●

● ● ● ● ● ●

● ●●

●●

●●

●●●

●●

●●●

●●

● ●

map2

map1

−0.2

0.0

0.2

0.4

0.6

● ●

●● ●

●●

●●

● ●

●●●

● ●●

●●

● ●

●●

●●

Com

p.2

Comp.1

SOM

Page 22: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Model & datamodel in the data space

som

Page 23: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

“book”2007/7/19page 122!

!!

!

!!

!!

122 5 Cluster Analysis

points are spread evenly through the grid, with rock tracks (purple) at theupper right, classical tracks (green) at the lower left, and new wave tracks(the three black rectangles) in between. The tour view in the same figure,however, shows the fit to be inadequate. The net is a flat rectangle in the 5Dspace and has not su!ciently wrapped through the data. This is the result ofstopping the algorithm too soon, thus failing to let it converge fully.

! ! ! ! ! !

! ! ! ! ! !

! ! ! ! ! !

! ! ! ! ! !

! ! ! ! ! !

! ! ! ! ! !!!

!

!!!

!!

!

!

!

!

!!

!

!

!

!

!!!

!

!!

!

!!

!

!!

!

!

!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!!

!!

!!

!!

!!

!!

!!

!!

!!

!!

!!

!!

!!

!!

!!

!!

!!

!!

!

!

!!!

!

!

!

!

!

!

!

!!

!

!!

!!!

!

!

! !!

!!

!!

!

!

!

!

!

!!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!!

lvar lave

lmax lfreq

Fig. 5.9. Unsuccessful SOM fit shown in a 2D map view and a tour projection.Although the fit looks good in the map view, the view of the fit in the 5D dataspace shows that the net has not su!ciently wrapped into the data: The algorithmhas not converged fully.

Figure 5.10 shows our favorite fit to the data. The data was standardized,we used a 6!6 net, and we ran the SOM algorithm for 1,000 iterations. Themap is at the top left, and it matches the map already shown in Fig. 5.8,except for the small jittering of points in the earlier figure. The other threeplots show di"erent projections from the grand tour. The upper right plotshows how the net curves with the nonlinear dependency in the data: The netis warped in some directions to fit the variance pattern. At the bottom rightwe see that one side of the net collects a long separated cluster of the Abbatracks. We can also see that the net has not been stretched out to the fullextent of the range of the data. It is tempting to manually manipulate the netto stretch it in di"erent directions and update the fit.

It turns out that the PCA view of the data more accurately reflects thestructure in the data than the map view. The music pieces really are clumpedtogether in the 5D space, and there are a few outliers.

5.3.4 Comparing methods

To compare the results of two methods we commonly compute a confusiontable. For example, Table 5.1 is the confusion table for five-cluster solutions

First attempt at model fitting doesn’t work. Model appears to be unfinished, flat in the data space rather than wrapped into the points.

Page 24: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

“book”2007/7/19page 123!

!!

!

!!

!!

5.3 Numerical methods 123

! ! ! ! ! !

! ! ! ! ! !

! ! ! ! ! !

! ! ! ! ! !

! ! ! ! ! !

! ! ! ! ! !!

!

!

!

!!!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!!

!

!

!

! !

!

!

!

!

!!!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

! !!

! !!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

! ! !

!

!

!

! ! !

!!

!

!!

!

!

!

!

!!

!!!

!

!!!

!

!

!

!

!

!

!!

!

!

!

!!

!

! !!

!!

!!!

!

!!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

lvar

lavelmax

lfenerlfreq

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!!!!!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

! !

!

!!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

lvarlmaxlfenerlfreq

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!!

!

!!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!

!

!

!

!

lvarlavelmaxlfener

lfreq

Fig. 5.10. Successful SOM fit shown in a 2D map view and tour projections. Herewe see a more successful SOM fit, using standardized data. The net wraps throughthe nonlinear dependencies in the data, but some outliers remain.

for the Music data from k-means and Ward’s linkage hierarchical clustering,generated by:

> d.music.dist <- dist(subset(d.music.std,select=c(lvar:lfreq)))

> d.music.hc <- hclust(d.music.dist, method="ward")> cl5 <- cutree(d.music.hc,5)> d.music.km <- kmeans(subset(d.music.std,

select=c(lvar:lfreq)), 5)> table(d.music.km$cluster, cl5)> d.music.clustcompare <-

cbind(d.music.std,cl5,d.music.km$cluster)> names(d.music.clustcompare)[8] <- "Wards"> names(d.music.clustcompare)[9] <- "km"> gd <- ggobi(d.music.clustcompare)[1]

Next attempt looks better.

Problem was solved by increasing the number of iterations used.

Page 25: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Deciding on a solution• One part in coming to a decision about the

best solution is to compare the results from different methods.

• This can be done using a confusion table.

“book”2007/7/19page 124!

!!

!

!!

!!

124 5 Cluster Analysis

The numerical labels of clusters are arbitrary, so these can be rearranged tobetter digest the table. There is a lot of agreement between the two methods:Both methods agree on the cluster for 48 tracks out of 62, or 77% of the time.We want to explore the data space to see where the agreement occurs andwhere the two methods disagree.

Table 5.1. Tables showing the agreement between two five-cluster solutions for theMusic data, showing a lot of agreement between k-means and Ward’s linkage hierar-chical clustering. The rows have been rearranged to make the table more readable.

Ward’sk-means 1 2 3 4 5

1 0 0 3 0 142 0 0 1 0 03 0 9 5 0 04 8 2 1 0 05 0 0 3 16 0

Rearrange rows !

Ward’sk-means 1 2 3 4 5

4 8 2 1 0 03 0 9 5 0 02 0 0 1 0 05 0 0 3 16 01 0 0 3 0 14

In Fig. 5.11, we link jittered plots of the confusion table for the two clus-tering methods with 2D tour plots of the data. The first column contains twojittered plots of the confusion table. In the top row of the figure, we have high-lighted a group of 14 points that both methods agree form a cluster, paintingthem as orange triangles. From the plot at the right, we see that this clus-ter is a closely grouped set of points in the data space. From the tour axeswe see that lvar has the largest axis pointing in the direction of the clusterseparation, which suggests that music pieces in this cluster are characterizedby high values on lvar (variable 3 in the data); that is, they have large vari-ance in frequency. By further investigating which tracks are in this cluster,we can learn that it consists of a mix of tracks by the Beatles (“Penny Lane,”“Help,” “Yellow Submarine,” ...) and the Eels (“Saturday Morning,” “Loveof the Loveless,” ...).

In the bottom row of the figure, we have highlighted a second group oftracks that were clustered together by both methods, painting them againusing orange triangles. In the plot to the right, we see that this cluster isclosely grouped in the data space. Despite that, this cluster is a bit moredi!cult to characterize. It is oriented mostly in the negative direction of lave(variable 4), so it would have smaller values on this variable. But this verticaldirection in the plot also has large contributions from variables 3 (lvar) and 7(lfreq). If you label these eight points on your own, you will see that they areall Abba songs (“Dancing Queen,” “Waterloo,” “Mamma Mia,” ...).

We have explored two groups of tracks where the methods agree. In asimilar fashion, we could also explore the tracks where the methods disagree.

Agree on 48/62 cases = 77%

Page 26: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

...and graphically“book”2007/7/19page 125!

!!

!

!!

!!

5.4 Characterizing clusters 125

1

2

3

4

5

1 2 3 4 5

!!

!!

!

!

!

!!

!!!!!!!!

!

!!!!!

!!!!

!!!!!! !

!

! !

!

!

!

!!

!

!

!

KM!5

HC!W5

!

!

!

!

!

!

!

!

!

!!!

!!

!

!!

!

!!

!

!

!

!

!!

!

!

!!!!!

!

!

!

!

!!

!

!

!

!

!

!

lvar

lavelmax

lfener

1

2

3

4

5

1 2 3 4 5

!!

!!

!

!

!

! !

!!!!!!!!

!

!!!!!

!!!!

!

!

!

!!! !

!

!

!

!

!!! !!!!!! !

!

!

!

KM!5

HC!W5

!

!

!!

!

!

!

!!

!

!

!

! !!

!

!

!

!!

!!

!! !

!

!!

!

!!

!

!!

!

!

!

!

!! !

!

!!

!!!!!!

!

lvarlave

lmaxlfener

lfreq

Fig. 5.11. Comparing two five-cluster models of the Music data using confusiontables linked to tour plots. In the confusion tables, k-means cluster identifiers foreach plot are plotted against Ward’s linkage hierarchical clustering ids. (The valueshave been jittered.) Two di!erent areas of agreement have been highlighted, and thetour projections show the tightness of each cluster where the methods agree.

5.4 Characterizing clusters

The final step in a cluster analysis is to characterize the clusters. Actually, wehave engaged in cluster characterization throughout the examples, becauseit is an intrinsic part of assessing the results of any cluster analysis. If wecannot detect any numerical or qualitative di!erences between clusters, thenour analysis was not successful, and we start over with a di!erent distancemetric or algorithm.

However, once we are satisfied that we have found a set of clusters that canbe di!erentiated from one another, we want to describe them more formally,both quantitatively and qualitatively. We characterize them quantitatively bycomputing such statistics as cluster means and standard deviations for eachvariable. We can look at these results in tables and in plots, and we can refine

Two different areas of agreement have been highlighted, and the tour projections show the tightness of each cluster where the methods agree.

Page 27: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Your turn

Run hierarchical clustering with average linkage on the Flea Beetles data (excluding species).

• Cut the tree at three clusters and append a cluster id to the dataset.

• How well do the clusters correspond to the species? (Plot cluster id vs species} and use jittering if necessary.)

• Using brushing in a plot of cluster id linked to a tour plot of the six variables, examine the beetles that are misclassified.

• Now cut the tree at four clusters, and repeat the last part.

• Which is the better solution, three or four clusters? Why?

Page 28: Cluster Analysis - GGobiggobi.org/book/2007-infovis/05-clustering.pdf · hierarc hical algorithms and mo del-based clustering, and sho ws ho w graphical to ols are used to assess

Timeline20 mins Toolbox

30 mins Missing values

45 minsSupervised

Classification

45 minsUnsupervised Classification

30 mins Inference

Break