Hierarchical clustering David M. Blei COS424 Princeton University February 28, 2008 D. Blei Clustering 02 1 / 21


Hierarchical clustering

David M. Blei

COS424, Princeton University

February 28, 2008


Hierarchical clustering

• Hierarchical clustering is a widely used data analysis tool.

• The idea is to build a binary tree of the data that successively merges similar groups of points

• Visualizing this tree provides a useful summary of the data


Hierarchical clustering vs. k-means

• Recall that k-means or k-medoids requires

  • A number of clusters k
  • An initial assignment of data to clusters
  • A distance measure between data d(x_n, x_m)

• Hierarchical clustering only requires a measure of similarity between groups of data points.


Agglomerative clustering

• We will talk about agglomerative clustering.

• Algorithm:

  1. Place each data point into its own singleton group
  2. Repeat: iteratively merge the two closest groups
  3. Until: all the data are merged into a single cluster
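The three steps can be sketched in a few lines of Python. This is a minimal, naive sketch and not an efficient implementation: it assumes 1-D points and uses the closest-pair distance between groups, and the names `agglomerate` and `single_linkage` are illustrative choices, not part of the slides.

```python
from itertools import combinations


def single_linkage(g, h):
    """Distance between groups g and h of 1-D points: the closest pair of members."""
    return min(abs(a - b) for a in g for b in h)


def agglomerate(points, group_distance=single_linkage):
    """Naive agglomerative clustering; returns the merges in the order they happen."""
    # Step 1: place each data point into its own singleton group.
    groups = [(p,) for p in points]
    merges = []
    # Step 3 (stopping rule): continue until a single cluster remains.
    while len(groups) > 1:
        # Step 2: find the two closest groups (index pairs always satisfy i < j).
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: group_distance(groups[ij[0]], groups[ij[1]]),
        )
        g, h = groups[i], groups[j]
        merges.append((group_distance(g, h), g, h))
        # Delete the larger index first so the smaller one stays valid.
        del groups[j], groups[i]
        groups.append(g + h)
    return merges


merges = agglomerate([1, 2, 6, 7, 15])
for height, g, h in merges:
    print(f"merge {g} + {h} at distance {height}")
```

Each pass rescans every pair of groups, which is fine for a handful of points but far too slow for real data sets.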


Example

[Figure: scatter plot of the example data (axes V1 and V2), followed by one frame per iteration of the algorithm, iteration 001 through iteration 024.]

Agglomerative clustering

• Each level of the resulting tree is a segmentation of the data

• The algorithm results in a sequence of groupings

• It is up to the user to choose a “natural” clustering from this sequence
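To make the "sequence of groupings" concrete, here is a small sketch that yields the partition at every level of the tree, from n singleton clusters down to one. It assumes 1-D points and furthest-pair distance between groups; the generator name `groupings` and the toy data are hypothetical.

```python
from itertools import combinations


def complete_linkage(g, h):
    """Distance between groups of 1-D points: the furthest pair of members."""
    return max(abs(a - b) for a in g for b in h)


def groupings(points, group_distance=complete_linkage):
    """Yield the partition at every level of the tree, from n clusters down to 1."""
    groups = [(p,) for p in points]
    yield sorted(groups)
    while len(groups) > 1:
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: group_distance(groups[ij[0]], groups[ij[1]]),
        )
        merged = tuple(sorted(groups[i] + groups[j]))
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
        yield sorted(groups)


levels = list(groupings([1, 2, 6, 7, 15]))
# The user picks one level, for example the one with three clusters.
three_clusters = next(p for p in levels if len(p) == 3)
print(three_clusters)
```

The algorithm itself never says which level is the right one; that choice is left to the analyst.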


Dendrogram

• Agglomerative clustering is monotonic

• The similarity between merged clusters is monotone decreasing with the level of the merge.

• Dendrogram: Plot each merge at the (negative) similarity between the two merged groups

• Provides an interpretable visualization of the algorithm and data

• Useful summarization tool, part of why hierarchical clustering is popular
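The monotonicity property can be checked numerically. The sketch below records the distance at which each merge happens (the height at which a dendrogram would draw it) and verifies that the heights never decrease, which is the same as similarity never increasing. It assumes 1-D points and furthest-pair distance between groups; `merge_heights` is a hypothetical helper name.

```python
from itertools import combinations


def merge_heights(points, group_distance):
    """Run naive agglomerative clustering on 1-D points; return each merge's distance."""
    groups = [(p,) for p in points]
    heights = []
    while len(groups) > 1:
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: group_distance(groups[ij[0]], groups[ij[1]]),
        )
        heights.append(group_distance(groups[i], groups[j]))
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return heights


def complete(g, h):
    """Furthest-pair distance between two groups of 1-D points."""
    return max(abs(a - b) for a in g for b in h)


heights = merge_heights([0, 10, 21, 33, 46, 60], complete)
# Distances never decrease from one merge to the next, so each merge can be
# drawn above the merges that produced its two subgroups.
is_monotone = all(a <= b for a, b in zip(heights, heights[1:]))
print(heights, is_monotone)
```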


Dendrogram of example data

[Figure: “Cluster Dendrogram” of the example data, produced by hclust (*, “complete”) on dist(x); vertical axis: Height.]

Groups that merge at high values relative to the merger values of their subgroups are candidates for natural clusters. (Tibshirani et al., 2001)
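One simple way to act on this remark is to look for the largest jump between consecutive merge heights and cut the dendrogram there. This is only a rough heuristic, assumed here for illustration; it is not the procedure of the cited work. The function name and the merge heights are hypothetical.

```python
def clusters_at_largest_gap(heights):
    """Given merge heights in merge order, return (number of clusters, cut height).

    Heuristic (an assumption of this sketch): cut the dendrogram halfway
    across the largest jump between consecutive merge heights.
    """
    n_points = len(heights) + 1
    gaps = [b - a for a, b in zip(heights, heights[1:])]
    k = max(range(len(gaps)), key=gaps.__getitem__)  # index of the largest gap
    cut = (heights[k] + heights[k + 1]) / 2
    # After k + 1 merges, n_points - (k + 1) clusters remain.
    return n_points - (k + 1), cut


# Illustrative merge heights for six data points.
n_clusters, cut = clusters_at_largest_gap([10, 12, 14, 33, 60])
print(n_clusters, cut)
```

A large gap is a hint, not a proof, that the clusters below the cut are natural.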


Group similarity

• Given a distance measure between points, the user has many choices for how to define intergroup similarity.

• Three most popular choices

  • Single linkage: the similarity of the closest pair

    \[ d_{SL}(G, H) = \min_{i \in G,\, j \in H} d_{i,j} \]

  • Complete linkage: the similarity of the furthest pair

    \[ d_{CL}(G, H) = \max_{i \in G,\, j \in H} d_{i,j} \]

  • Group average: the average similarity between groups

    \[ d_{GA}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{j \in H} d_{i,j} \]
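The three definitions translate almost directly into code. In this sketch `G` and `H` are lists of points and `d` is any pointwise distance function; Euclidean distance via `math.dist` and the four toy points are assumptions made for the example.

```python
import math


def single_linkage(G, H, d):
    """d_SL(G, H): distance of the closest pair."""
    return min(d(i, j) for i in G for j in H)


def complete_linkage(G, H, d):
    """d_CL(G, H): distance of the furthest pair."""
    return max(d(i, j) for i in G for j in H)


def group_average(G, H, d):
    """d_GA(G, H): mean distance over all N_G * N_H cross-group pairs."""
    return sum(d(i, j) for i in G for j in H) / (len(G) * len(H))


G = [(0, 0), (0, 3)]
H = [(4, 0), (4, 3)]
d_sl = single_linkage(G, H, math.dist)
d_cl = complete_linkage(G, H, math.dist)
d_ga = group_average(G, H, math.dist)
print(d_sl, d_cl, d_ga)
```

For any two groups the three values satisfy d_SL ≤ d_GA ≤ d_CL, since a mean always lies between the minimum and the maximum.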


Properties of intergroup similarity

• Single linkage can produce “chaining,” where a sequence of close observations in different groups causes early merges of those groups

• Complete linkage has the opposite problem. It might not merge close groups because of outlier members that are far apart.

• Group average represents a natural compromise, but depends on the scale of the similarities. Applying a monotone transformation to the similarities can change the results.
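The scale dependence of group average can be seen on a tiny example. The sketch below asks which of two candidate groups is closest to a group G, first with absolute distance and then with squared distance (a monotone transformation of it). The groups and all function names are made up for illustration.

```python
def single(G, H, d):
    return min(d(a, b) for a in G for b in H)


def complete(G, H, d):
    return max(d(a, b) for a in G for b in H)


def average(G, H, d):
    return sum(d(a, b) for a in G for b in H) / (len(G) * len(H))


def nearest(linkage, G, candidates, d):
    """Name of the candidate group closest to G under the given linkage."""
    return min(candidates, key=lambda name: linkage(G, candidates[name], d))


def absolute(a, b):
    return abs(a - b)


def squared(a, b):
    """A monotone transformation of the absolute distance."""
    return (a - b) ** 2


G = [0.0]
candidates = {"H": [1.0, 5.0], "K": [-3.5]}

for linkage in (single, complete, average):
    before = nearest(linkage, G, candidates, absolute)
    after = nearest(linkage, G, candidates, squared)
    print(f"{linkage.__name__}: {before} -> {after}")
```

Single and complete linkage depend only on the ordering of the pairwise distances, so their choice is unchanged. Group average moves from H (mean 3 versus 3.5) to K (mean 13 versus 12.25) once the distances are squared.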


Caveats

• Hierarchical clustering should be treated with caution.

• Different decisions about group similarities can lead to vastly different dendrograms.

• The algorithm imposes a hierarchical structure on the data, even data for which such structure is not appropriate.
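A small experiment shows how much the choice of group similarity matters. The sketch below clusters the same six 1-D points into two groups, once with closest-pair and once with furthest-pair distance. The data are invented so that the gaps between neighbors grow slowly, and `cluster` is a hypothetical helper.

```python
from itertools import combinations


def cluster(points, linkage, n_clusters):
    """Naive agglomerative clustering of 1-D points, stopped at n_clusters groups.

    `linkage` reduces the cross-group pairwise distances to one number
    (min for single linkage, max for complete linkage).
    """
    groups = [(p,) for p in points]
    while len(groups) > n_clusters:
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: linkage(
                abs(a - b) for a in groups[ij[0]] for b in groups[ij[1]]
            ),
        )
        merged = tuple(sorted(groups[i] + groups[j]))
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return sorted(groups)


points = [0, 10, 21, 33, 46, 60]  # gaps grow slowly: 10, 11, 12, 13, 14
single = cluster(points, min, 2)    # single linkage: closest pair
complete = cluster(points, max, 2)  # complete linkage: furthest pair
print("single:  ", single)
print("complete:", complete)
```

Single linkage chains the first five points together and leaves 60 on its own, while complete linkage splits the data into {0, 10, 21, 33} and {46, 60}. Neither answer is wrong; they reflect different definitions of "close".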


Examples

• “Repeated Observation of Breast Tumor Subtypes in Independent Gene Expression Data Sets” (Sorlie et al., 2003)

• Hierarchical clustering of gene expression data led to new theories

• Later, theories tested in the lab.


Examples

• “The Balance of Roger de Piles” (Studdert-Kennedy and Davenport, 1974)

• Roger de Piles rated 57 paintings along different dimensions.

• These authors cluster them using different methods, including hierarchical clustering

• They discuss the different clusters. (They are art critics.)


Good: They are cautious. “The value of this analysis...will depend on any interesting speculation it may provoke.”


Examples

• “Similarity Grouping of Australian Universities” (Stanley and Reynolds, 1994)

• Use hierarchical clustering on Australian universities

• Use features such as

  • # of staff in different departments
  • entry scores
  • funding
  • evaluations


One dendrogram

D. Blei Clustering 02 17 / 21

Another dendrogram—notice that it’s different

D. Blei Clustering 02 18 / 21

• Split values: They notice that there’s no kink and conclude that there is no cluster structure in Australian universities.

• Good: Cautious interpretation of clustering, analysis of clustering based on multiple subsets of the features.

• Bad: Their conclusion (we can’t cluster Australian universities) ignores all the algorithmic choices that were made.

D. Blei Clustering 02 19 / 21
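A small sketch of why the "no kink" reading depends on algorithmic choices. It assumes that the split values are the distances at which successive merges happen, and that a kink is a sudden jump in that sequence. The 1-D toy data and the helper names (`merge_heights`, `max_jump`) are made up for illustration.

```python
def merge_heights(xs, linkage):
    """Agglomerative clustering of 1-D points; returns the merge distances in order.

    `linkage` is min for single linkage or max for complete linkage.
    """
    clusters = [[x] for x in xs]
    heights = []
    while len(clusters) > 1:
        # Pick the pair of groups with the smallest linkage distance.
        d, i, j = min(
            (linkage(abs(a - b) for a in clusters[i] for b in clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        heights.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return heights

def max_jump(heights):
    """Largest increase between consecutive merge heights; a crude kink score."""
    return max(b - a for a, b in zip(heights, heights[1:]))

grouped = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]  # two well-separated groups
even = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]        # no group structure

for label, data in [("grouped", grouped), ("even", even)]:
    for name, linkage in [("single", min), ("complete", max)]:
        h = merge_heights(data, linkage)
        print(label, name, h, "max jump:", max_jump(h))
```

On the evenly spaced points, single linkage gives a flat sequence of heights, while complete linkage gives heights that climb toward the end, even though the data are the same. The shape of the split-value plot is therefore a property of the data and the linkage together, not of the data alone.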

Examples

• “Comovement of International Equity Markets: A Taxonomic Approach” (Panton et al., 1976)

• Data: weekly rates of return for stocks in twelve countries

• Run agglomerative clustering year by year

• Interpret the structure and examine stability over different time periods

D. Blei Clustering 02 20 / 21
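A rough sketch of the year-by-year procedure, under several assumptions not stated on the slide: similarity between two markets is the correlation of their returns, distance is one minus that correlation, groups are merged with average linkage, and stability is checked by comparing the two-group partitions across years. The four markets "A" to "D" and their six returns per year are invented toy data, not the paper's twelve countries.

```python
import math

def correlation(x, y):
    """Pearson correlation of two equal-length return series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_markets(returns, k):
    """Average-linkage agglomerative clustering on 1 - correlation, stopped at k groups."""
    clusters = [[name] for name in sorted(returns)]

    def dist(g, h):
        pairs = [1 - correlation(returns[a], returns[b]) for a in g for b in h]
        return sum(pairs) / len(pairs)

    while len(clusters) > k:
        _, i, j = min((dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return {frozenset(c) for c in clusters}

# Made-up returns for four hypothetical markets in two hypothetical years.
returns_by_year = {
    "year_1": {
        "A": [0.5, -0.2, 0.3, 0.1, -0.4, 0.6],
        "B": [0.4, -0.1, 0.2, 0.2, -0.3, 0.5],
        "C": [-0.3, 0.4, -0.1, 0.2, 0.5, -0.2],
        "D": [-0.2, 0.5, -0.2, 0.1, 0.4, -0.3],
    },
    "year_2": {
        "A": [0.2, 0.3, -0.1, 0.4, -0.2, 0.1],
        "B": [-0.1, 0.2, 0.3, -0.2, 0.4, 0.0],
        "C": [-0.2, 0.1, 0.4, -0.3, 0.5, 0.1],
        "D": [0.3, 0.2, -0.2, 0.5, -0.1, 0.2],
    },
}

# Cluster each year separately, then compare the resulting partitions.
groups = {year: cluster_markets(r, 2) for year, r in returns_by_year.items()}
stable = groups["year_1"] == groups["year_2"]
for year, partition in groups.items():
    print(year, sorted(sorted(g) for g in partition))
print("same grouping in both years:", stable)
```

Comparing exact partitions is a blunt stability check. It only says whether the grouping changed, not by how much.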

Examples

Good: Cautious. “This study is only descriptive... A logical subsequent research area is to explain observed structural properties and the causes of structural change.”

D. Blei Clustering 02 21 / 21