129
Computer-Assisted Clustering and Conceptualization Gary King Institute for Quantitative Social Science Harvard University Parthemos Lecture at University of Georgia, 3/4/2011 1 Based on joint work with Justin Grimmer (Harvard Stanford) Gary King (Harvard IQSS) Quantitative Discovery Parthemos Lecture at University of / 20

discov-uga_1.pdf

Embed Size (px)

Citation preview

Page 1: discov-uga_1.pdf

Computer-Assisted Clustering and Conceptualization

Gary King

Institute for Quantitative Social ScienceHarvard University

Parthemos Lecture at University of Georgia, 3/4/2011

1Based on joint work with Justin Grimmer (Harvard Stanford)

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 1

/ 20

Page 2: discov-uga_1.pdf

A Method for Computer Assisted Conceptualization

Conceptualization through Classification: “one of the most centraland generic of all our conceptual exercises. . . . the foundation notonly for conceptualization, language, and speech, but also formathematics, statistics, and data analysis. . . . Without classification,there could be no advanced conceptualization, reasoning, language,data analysis or,for that matter, social science research.” (Bailey,1994).

Cluster Analysis: simultaneously (1) invents categories and (2)assigns documents to categories

We focus on unstructured text; methods apply more broadly.

Main goal: Switch from Fully Automated to Computer Assisted

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 2

/ 20

Page 3: discov-uga_1.pdf

A Method for Computer Assisted Conceptualization

Conceptualization through Classification: “one of the most centraland generic of all our conceptual exercises. . . . the foundation notonly for conceptualization, language, and speech, but also formathematics, statistics, and data analysis. . . . Without classification,there could be no advanced conceptualization, reasoning, language,data analysis or,for that matter, social science research.” (Bailey,1994).

Cluster Analysis: simultaneously (1) invents categories and (2)assigns documents to categories

We focus on unstructured text; methods apply more broadly.

Main goal: Switch from Fully Automated to Computer Assisted

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 2

/ 20

Page 4: discov-uga_1.pdf

A Method for Computer Assisted Conceptualization

Conceptualization through Classification: “one of the most centraland generic of all our conceptual exercises. . . . the foundation notonly for conceptualization, language, and speech, but also formathematics, statistics, and data analysis. . . . Without classification,there could be no advanced conceptualization, reasoning, language,data analysis or,for that matter, social science research.” (Bailey,1994).

Cluster Analysis: simultaneously (1) invents categories and (2)assigns documents to categories

We focus on unstructured text; methods apply more broadly.

Main goal: Switch from Fully Automated to Computer Assisted

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 2

/ 20

Page 5: discov-uga_1.pdf

A Method for Computer Assisted Conceptualization

Conceptualization through Classification: “one of the most centraland generic of all our conceptual exercises. . . . the foundation notonly for conceptualization, language, and speech, but also formathematics, statistics, and data analysis. . . . Without classification,there could be no advanced conceptualization, reasoning, language,data analysis or,for that matter, social science research.” (Bailey,1994).

Cluster Analysis: simultaneously (1) invents categories and (2)assigns documents to categories

We focus on unstructured text; methods apply more broadly.

Main goal: Switch from Fully Automated to Computer Assisted

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 2

/ 20

Page 6: discov-uga_1.pdf

What’s Hard about Clustering?

Clustering seems easy; its not!

Bell(n) = number of ways of partitioning n objects

Bell(2) = 2 (AB, A B)

Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)

Bell(5) = 52

Bell(100) ≈

1028× Number of elementary particles in the universe

Now imagine choosing the optimal classification scheme by hand!

Fully automated algorithms can help, but which ones?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 3

/ 20

Page 7: discov-uga_1.pdf

What’s Hard about Clustering?(aka Why Johnny Can’t Classify)

Clustering seems easy; its not!

Bell(n) = number of ways of partitioning n objects

Bell(2) = 2 (AB, A B)

Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)

Bell(5) = 52

Bell(100) ≈

1028× Number of elementary particles in the universe

Now imagine choosing the optimal classification scheme by hand!

Fully automated algorithms can help, but which ones?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 3

/ 20

Page 8: discov-uga_1.pdf

What’s Hard about Clustering?(aka Why Johnny Can’t Classify)

Clustering seems easy; its not!

Bell(n) = number of ways of partitioning n objects

Bell(2) = 2 (AB, A B)

Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)

Bell(5) = 52

Bell(100) ≈

1028× Number of elementary particles in the universe

Now imagine choosing the optimal classification scheme by hand!

Fully automated algorithms can help, but which ones?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 3

/ 20

Page 9: discov-uga_1.pdf

What’s Hard about Clustering?(aka Why Johnny Can’t Classify)

Clustering seems easy; its not!

Bell(n) = number of ways of partitioning n objects

Bell(2) = 2 (AB, A B)

Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)

Bell(5) = 52

Bell(100) ≈

1028× Number of elementary particles in the universe

Now imagine choosing the optimal classification scheme by hand!

Fully automated algorithms can help, but which ones?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 3

/ 20

Page 10: discov-uga_1.pdf

What’s Hard about Clustering?(aka Why Johnny Can’t Classify)

Clustering seems easy; its not!

Bell(n) = number of ways of partitioning n objects

Bell(2) = 2 (AB, A B)

Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)

Bell(5) = 52

Bell(100) ≈

1028× Number of elementary particles in the universe

Now imagine choosing the optimal classification scheme by hand!

Fully automated algorithms can help, but which ones?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 3

/ 20

Page 11: discov-uga_1.pdf

What’s Hard about Clustering?(aka Why Johnny Can’t Classify)

Clustering seems easy; its not!

Bell(n) = number of ways of partitioning n objects

Bell(2) = 2 (AB, A B)

Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)

Bell(5) = 52

Bell(100) ≈

1028× Number of elementary particles in the universe

Now imagine choosing the optimal classification scheme by hand!

Fully automated algorithms can help, but which ones?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 3

/ 20

Page 12: discov-uga_1.pdf

What’s Hard about Clustering?(aka Why Johnny Can’t Classify)

Clustering seems easy; its not!

Bell(n) = number of ways of partitioning n objects

Bell(2) = 2 (AB, A B)

Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)

Bell(5) = 52

Bell(100) ≈

1028× Number of elementary particles in the universe

Now imagine choosing the optimal classification scheme by hand!

Fully automated algorithms can help, but which ones?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 3

/ 20

Page 13: discov-uga_1.pdf

What’s Hard about Clustering?(aka Why Johnny Can’t Classify)

Clustering seems easy; its not!

Bell(n) = number of ways of partitioning n objects

Bell(2) = 2 (AB, A B)

Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)

Bell(5) = 52

Bell(100) ≈

1028× Number of elementary particles in the universe

Now imagine choosing the optimal classification scheme by hand!

Fully automated algorithms can help, but which ones?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 3

/ 20

Page 14: discov-uga_1.pdf

What’s Hard about Clustering?(aka Why Johnny Can’t Classify)

Clustering seems easy; its not!

Bell(n) = number of ways of partitioning n objects

Bell(2) = 2 (AB, A B)

Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)

Bell(5) = 52

Bell(100) ≈ 1028× Number of elementary particles in the universe

Now imagine choosing the optimal classification scheme by hand!

Fully automated algorithms can help, but which ones?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 3

/ 20

Page 15: discov-uga_1.pdf

What’s Hard about Clustering?(aka Why Johnny Can’t Classify)

Clustering seems easy; its not!

Bell(n) = number of ways of partitioning n objects

Bell(2) = 2 (AB, A B)

Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)

Bell(5) = 52

Bell(100) ≈ 1028× Number of elementary particles in the universe

Now imagine choosing the optimal classification scheme by hand!

Fully automated algorithms can help, but which ones?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 3

/ 20

Page 16: discov-uga_1.pdf

What’s Hard about Clustering?(aka Why Johnny Can’t Classify)

Clustering seems easy; its not!

Bell(n) = number of ways of partitioning n objects

Bell(2) = 2 (AB, A B)

Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)

Bell(5) = 52

Bell(100) ≈ 1028× Number of elementary particles in the universe

Now imagine choosing the optimal classification scheme by hand!

Fully automated algorithms can help, but which ones?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 3

/ 20

Page 17: discov-uga_1.pdf

The Problem with Fully Automated Clustering

The (Impossible) Goal: optimal, fully automated,application-independent cluster analysis

No free lunch theorem: every possible clustering method performsequally well on average over all possible substantive applications

Existing methods:

Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .Well-defined statistical, data analytic, or machine learning foundationsHow to add substantive knowledge: With few exceptions, unclearThe literature: little guidance on when methods applyDeriving such guidance: difficult or impossible

Deep problem: full automation requires more information

No surprise: everyone’s tried cluster analysis; very few are satisfied

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 4

/ 20

Page 18: discov-uga_1.pdf

The Problem with Fully Automated Clustering

The (Impossible) Goal: optimal, fully automated,application-independent cluster analysis

No free lunch theorem: every possible clustering method performsequally well on average over all possible substantive applications

Existing methods:

Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .Well-defined statistical, data analytic, or machine learning foundationsHow to add substantive knowledge: With few exceptions, unclearThe literature: little guidance on when methods applyDeriving such guidance: difficult or impossible

Deep problem: full automation requires more information

No surprise: everyone’s tried cluster analysis; very few are satisfied

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 4

/ 20

Page 19: discov-uga_1.pdf

The Problem with Fully Automated Clustering

The (Impossible) Goal: optimal, fully automated,application-independent cluster analysis

No free lunch theorem: every possible clustering method performsequally well on average over all possible substantive applications

Existing methods:

Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .Well-defined statistical, data analytic, or machine learning foundationsHow to add substantive knowledge: With few exceptions, unclearThe literature: little guidance on when methods applyDeriving such guidance: difficult or impossible

Deep problem: full automation requires more information

No surprise: everyone’s tried cluster analysis; very few are satisfied

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 4

/ 20

Page 20: discov-uga_1.pdf

The Problem with Fully Automated Clustering

The (Impossible) Goal: optimal, fully automated,application-independent cluster analysis

No free lunch theorem: every possible clustering method performsequally well on average over all possible substantive applications

Existing methods:

Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .Well-defined statistical, data analytic, or machine learning foundationsHow to add substantive knowledge: With few exceptions, unclearThe literature: little guidance on when methods applyDeriving such guidance: difficult or impossible

Deep problem: full automation requires more information

No surprise: everyone’s tried cluster analysis; very few are satisfied

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 4

/ 20

Page 21: discov-uga_1.pdf

The Problem with Fully Automated Clustering

The (Impossible) Goal: optimal, fully automated,application-independent cluster analysis

No free lunch theorem: every possible clustering method performsequally well on average over all possible substantive applications

Existing methods:

Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .

Well-defined statistical, data analytic, or machine learning foundationsHow to add substantive knowledge: With few exceptions, unclearThe literature: little guidance on when methods applyDeriving such guidance: difficult or impossible

Deep problem: full automation requires more information

No surprise: everyone’s tried cluster analysis; very few are satisfied

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 4

/ 20

Page 22: discov-uga_1.pdf

The Problem with Fully Automated Clustering

The (Impossible) Goal: optimal, fully automated,application-independent cluster analysis

No free lunch theorem: every possible clustering method performsequally well on average over all possible substantive applications

Existing methods:

Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .Well-defined statistical, data analytic, or machine learning foundations

How to add substantive knowledge: With few exceptions, unclearThe literature: little guidance on when methods applyDeriving such guidance: difficult or impossible

Deep problem: full automation requires more information

No surprise: everyone’s tried cluster analysis; very few are satisfied

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 4

/ 20

Page 23: discov-uga_1.pdf

The Problem with Fully Automated Clustering

The (Impossible) Goal: optimal, fully automated,application-independent cluster analysis

No free lunch theorem: every possible clustering method performsequally well on average over all possible substantive applications

Existing methods:

Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .Well-defined statistical, data analytic, or machine learning foundationsHow to add substantive knowledge: With few exceptions, unclear

The literature: little guidance on when methods applyDeriving such guidance: difficult or impossible

Deep problem: full automation requires more information

No surprise: everyone’s tried cluster analysis; very few are satisfied

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 4

/ 20

Page 24: discov-uga_1.pdf

The Problem with Fully Automated Clustering

The (Impossible) Goal: optimal, fully automated,application-independent cluster analysis

No free lunch theorem: every possible clustering method performsequally well on average over all possible substantive applications

Existing methods:

Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .Well-defined statistical, data analytic, or machine learning foundationsHow to add substantive knowledge: With few exceptions, unclearThe literature: little guidance on when methods apply

Deriving such guidance: difficult or impossible

Deep problem: full automation requires more information

No surprise: everyone’s tried cluster analysis; very few are satisfied

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 4

/ 20

Page 25: discov-uga_1.pdf

The Problem with Fully Automated Clustering

The (Impossible) Goal: optimal, fully automated,application-independent cluster analysis

No free lunch theorem: every possible clustering method performsequally well on average over all possible substantive applications

Existing methods:

Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .Well-defined statistical, data analytic, or machine learning foundationsHow to add substantive knowledge: With few exceptions, unclearThe literature: little guidance on when methods applyDeriving such guidance: difficult or impossible

Deep problem: full automation requires more information

No surprise: everyone’s tried cluster analysis; very few are satisfied

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 4

/ 20

Page 26: discov-uga_1.pdf

The Problem with Fully Automated Clustering

The (Impossible) Goal: optimal, fully automated,application-independent cluster analysis

No free lunch theorem: every possible clustering method performsequally well on average over all possible substantive applications

Existing methods:

Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .Well-defined statistical, data analytic, or machine learning foundationsHow to add substantive knowledge: With few exceptions, unclearThe literature: little guidance on when methods applyDeriving such guidance: difficult or impossible

Deep problem: full automation requires more information

No surprise: everyone’s tried cluster analysis; very few are satisfied

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 4

/ 20

Page 27: discov-uga_1.pdf

The Problem with Fully Automated Clustering

The (Impossible) Goal: optimal, fully automated,application-independent cluster analysis

No free lunch theorem: every possible clustering method performsequally well on average over all possible substantive applications

Existing methods:

Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps,. . .Well-defined statistical, data analytic, or machine learning foundationsHow to add substantive knowledge: With few exceptions, unclearThe literature: little guidance on when methods applyDeriving such guidance: difficult or impossible

Deep problem: full automation requires more information

No surprise: everyone’s tried cluster analysis; very few are satisfied

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 4

/ 20

Page 28: discov-uga_1.pdf

Switch from Fully Automated to Computer Assisted

Fully Automated Clustering may succeed sometimes, but fails ingeneral: too hard to understand when each model applies

An alternative: Computer-Assisted Clustering

Easy in theory: list all clusterings; choose the bestImpossible in practice: Too hard for us mere humans!An organized list will make the search possibleInsight: Many clusterings are perceptually identicalE.g.,: consider two clusterings that differ only because one document(of 10,000) moves from category 5 to 6

Question: How to organize clusterings so humans can understand?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 5

/ 20

Page 29: discov-uga_1.pdf

Switch from Fully Automated to Computer Assisted

Fully Automated Clustering may succeed sometimes, but fails ingeneral: too hard to understand when each model applies

An alternative: Computer-Assisted Clustering

Easy in theory: list all clusterings; choose the bestImpossible in practice: Too hard for us mere humans!An organized list will make the search possibleInsight: Many clusterings are perceptually identicalE.g.,: consider two clusterings that differ only because one document(of 10,000) moves from category 5 to 6

Question: How to organize clusterings so humans can understand?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 5

/ 20

Page 30: discov-uga_1.pdf

Switch from Fully Automated to Computer Assisted

Fully Automated Clustering may succeed sometimes, but fails ingeneral: too hard to understand when each model applies

An alternative: Computer-Assisted Clustering

Easy in theory: list all clusterings; choose the bestImpossible in practice: Too hard for us mere humans!An organized list will make the search possibleInsight: Many clusterings are perceptually identicalE.g.,: consider two clusterings that differ only because one document(of 10,000) moves from category 5 to 6

Question: How to organize clusterings so humans can understand?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 5

/ 20

Page 31: discov-uga_1.pdf

Switch from Fully Automated to Computer Assisted

Fully Automated Clustering may succeed sometimes, but fails ingeneral: too hard to understand when each model applies

An alternative: Computer-Assisted Clustering

Easy in theory: list all clusterings; choose the best

Impossible in practice: Too hard for us mere humans!An organized list will make the search possibleInsight: Many clusterings are perceptually identicalE.g.,: consider two clusterings that differ only because one document(of 10,000) moves from category 5 to 6

Question: How to organize clusterings so humans can understand?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 5

/ 20

Page 32: discov-uga_1.pdf

Switch from Fully Automated to Computer Assisted

Fully Automated Clustering may succeed sometimes, but fails ingeneral: too hard to understand when each model applies

An alternative: Computer-Assisted Clustering

Easy in theory: list all clusterings; choose the bestImpossible in practice: Too hard for us mere humans!

An organized list will make the search possibleInsight: Many clusterings are perceptually identicalE.g.,: consider two clusterings that differ only because one document(of 10,000) moves from category 5 to 6

Question: How to organize clusterings so humans can understand?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 5

/ 20

Page 33: discov-uga_1.pdf

Switch from Fully Automated to Computer Assisted

Fully Automated Clustering may succeed sometimes, but fails ingeneral: too hard to understand when each model applies

An alternative: Computer-Assisted Clustering

Easy in theory: list all clusterings; choose the bestImpossible in practice: Too hard for us mere humans!An organized list will make the search possible

Insight: Many clusterings are perceptually identicalE.g.,: consider two clusterings that differ only because one document(of 10,000) moves from category 5 to 6

Question: How to organize clusterings so humans can understand?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 5

/ 20

Page 34: discov-uga_1.pdf

Switch from Fully Automated to Computer Assisted

Fully Automated Clustering may succeed sometimes, but fails ingeneral: too hard to understand when each model applies

An alternative: Computer-Assisted Clustering

Easy in theory: list all clusterings; choose the bestImpossible in practice: Too hard for us mere humans!An organized list will make the search possibleInsight: Many clusterings are perceptually identical

E.g.,: consider two clusterings that differ only because one document(of 10,000) moves from category 5 to 6

Question: How to organize clusterings so humans can understand?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 5

/ 20

Page 35: discov-uga_1.pdf

Switch from Fully Automated to Computer Assisted

Fully Automated Clustering may succeed sometimes, but fails ingeneral: too hard to understand when each model applies

An alternative: Computer-Assisted Clustering

Easy in theory: list all clusterings; choose the bestImpossible in practice: Too hard for us mere humans!An organized list will make the search possibleInsight: Many clusterings are perceptually identicalE.g.,: consider two clusterings that differ only because one document(of 10,000) moves from category 5 to 6

Question: How to organize clusterings so humans can understand?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 5

/ 20

Page 36: discov-uga_1.pdf

Switch from Fully Automated to Computer Assisted

Fully Automated Clustering may succeed sometimes, but fails ingeneral: too hard to understand when each model applies

An alternative: Computer-Assisted Clustering

Easy in theory: list all clusterings; choose the bestImpossible in practice: Too hard for us mere humans!An organized list will make the search possibleInsight: Many clusterings are perceptually identicalE.g.,: consider two clusterings that differ only because one document(of 10,000) moves from category 5 to 6

Question: How to organize clusterings so humans can understand?

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 5

/ 20

Page 37: discov-uga_1.pdf

Our Idea: Meaning Through Geography

Set of clusterings

≈A list of unconnected addresses

We develop a (conceptual) geography of clusterings

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 6

/ 20

Page 38: discov-uga_1.pdf

Our Idea: Meaning Through Geography

Set of clusterings ≈A list of unconnected addresses

We develop a (conceptual) geography of clusterings

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 6

/ 20

Page 39: discov-uga_1.pdf

Our Idea: Meaning Through Geography

Set of clusterings ≈A list of unconnected addresses

We develop a (conceptual) geography of clusterings

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 6

/ 20

Page 40: discov-uga_1.pdf

Our Idea: Meaning Through Geography

Set of clusterings ≈A list of unconnected addresses

We develop a (conceptual) geography of clusterings

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 6

/ 20

Page 41: discov-uga_1.pdf

A New StrategyMake it easy to choose best clustering from millions of choices

1 Code text as numbers (in one or more of several ways)

2 Apply all clustering methods we can find to the data — eachrepresenting different (unstated) substantive assumptions (<15 mins)

3 (Too much for a person to understand, but organization will help)

4 Develop an application-independent distance metric betweenclusterings, a metric space of clusterings, and a 2-D projection

5 “Local cluster ensemble” creates a new clustering at any point, basedon weighted average of nearby clusterings

6 A new animated visualization to explore the space of clusterings(smoothly morphing from one into others)

7 Millions of clusterings, easily comprehended

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 7

/ 20

Page 42: discov-uga_1.pdf

A New StrategyMake it easy to choose best clustering from millions of choices

1 Code text as numbers (in one or more of several ways)

2 Apply all clustering methods we can find to the data — eachrepresenting different (unstated) substantive assumptions (<15 mins)

3 (Too much for a person to understand, but organization will help)

4 Develop an application-independent distance metric betweenclusterings, a metric space of clusterings, and a 2-D projection

5 “Local cluster ensemble” creates a new clustering at any point, basedon weighted average of nearby clusterings

6 A new animated visualization to explore the space of clusterings(smoothly morphing from one into others)

7 Millions of clusterings, easily comprehended

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 7

/ 20

Page 43: discov-uga_1.pdf

A New StrategyMake it easy to choose best clustering from millions of choices

1 Code text as numbers (in one or more of several ways)

2 Apply all clustering methods we can find to the data — eachrepresenting different (unstated) substantive assumptions (<15 mins)

3 (Too much for a person to understand, but organization will help)

4 Develop an application-independent distance metric betweenclusterings, a metric space of clusterings, and a 2-D projection

5 “Local cluster ensemble” creates a new clustering at any point, basedon weighted average of nearby clusterings

6 A new animated visualization to explore the space of clusterings(smoothly morphing from one into others)

7 Millions of clusterings, easily comprehended

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 7

/ 20

Page 44: discov-uga_1.pdf

A New StrategyMake it easy to choose best clustering from millions of choices

1 Code text as numbers (in one or more of several ways)

2 Apply all clustering methods we can find to the data — eachrepresenting different (unstated) substantive assumptions (<15 mins)

3 (Too much for a person to understand, but organization will help)

4 Develop an application-independent distance metric betweenclusterings, a metric space of clusterings, and a 2-D projection

5 “Local cluster ensemble” creates a new clustering at any point, basedon weighted average of nearby clusterings

6 A new animated visualization to explore the space of clusterings(smoothly morphing from one into others)

7 Millions of clusterings, easily comprehended

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 7

/ 20

Page 45: discov-uga_1.pdf

A New StrategyMake it easy to choose best clustering from millions of choices

1 Code text as numbers (in one or more of several ways)

2 Apply all clustering methods we can find to the data — eachrepresenting different (unstated) substantive assumptions (<15 mins)

3 (Too much for a person to understand, but organization will help)

4 Develop an application-independent distance metric betweenclusterings, a metric space of clusterings, and a 2-D projection

5 “Local cluster ensemble” creates a new clustering at any point, basedon weighted average of nearby clusterings

6 A new animated visualization to explore the space of clusterings(smoothly morphing from one into others)

7 Millions of clusterings, easily comprehended

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 7

/ 20

Page 46: discov-uga_1.pdf

A New StrategyMake it easy to choose best clustering from millions of choices

1 Code text as numbers (in one or more of several ways)

2 Apply all clustering methods we can find to the data — eachrepresenting different (unstated) substantive assumptions (<15 mins)

3 (Too much for a person to understand, but organization will help)

4 Develop an application-independent distance metric betweenclusterings, a metric space of clusterings, and a 2-D projection

5 “Local cluster ensemble” creates a new clustering at any point, basedon weighted average of nearby clusterings

6 A new animated visualization to explore the space of clusterings(smoothly morphing from one into others)

7 Millions of clusterings, easily comprehended

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 7

/ 20

Page 47: discov-uga_1.pdf

A New StrategyMake it easy to choose best clustering from millions of choices

1 Code text as numbers (in one or more of several ways)

2 Apply all clustering methods we can find to the data — eachrepresenting different (unstated) substantive assumptions (<15 mins)

3 (Too much for a person to understand, but organization will help)

4 Develop an application-independent distance metric betweenclusterings, a metric space of clusterings, and a 2-D projection

5 “Local cluster ensemble” creates a new clustering at any point, basedon weighted average of nearby clusterings

6 A new animated visualization to explore the space of clusterings(smoothly morphing from one into others)

7 Millions of clusterings, easily comprehended

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 7

/ 20

Page 48: discov-uga_1.pdf

A New StrategyMake it easy to choose best clustering from millions of choices

1 Code text as numbers (in one or more of several ways)

2 Apply all clustering methods we can find to the data — eachrepresenting different (unstated) substantive assumptions (<15 mins)

3 (Too much for a person to understand, but organization will help)

4 Develop an application-independent distance metric betweenclusterings, a metric space of clusterings, and a 2-D projection

5 “Local cluster ensemble” creates a new clustering at any point, basedon weighted average of nearby clusterings

6 A new animated visualization to explore the space of clusterings(smoothly morphing from one into others)

7 Millions of clusterings, easily comprehended

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 7

/ 20

Page 49: discov-uga_1.pdf

Many Thousands of Clusterings, Sorted & OrganizedYou choose one (or more), based on insight, discovery, useful information,. . .

Space of Clusterings

biclust_spectral

clust_convex

mult_dirproc

dismea

rock

som

spec_cos spec_eucspec_man

spec_mink

spec_max

spec_canb

mspec_cos

mspec_euc

mspec_man

mspec_mink

mspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euc

kmedoids euclidean

kmedoids manhattan

mixvmf

mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattan

kmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearman

kmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean average

hclust euclidean mcquitty

hclust euclidean median

hclust euclidean centroidhclust maximum ward

hclust maximum single

hclust maximum completehclust maximum averagehclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan averagehclust manhattan mcquittyhclust manhattan median

hclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquittyhclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquitty

hclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson complete

hclust pearson averagehclust pearson mcquitty

hclust pearson medianhclust pearson centroid

hclust correlation ward

hclust correlation single

hclust correlation complete

hclust correlation averagehclust correlation mcquitty

hclust correlation medianhclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman average

hclust spearman mcquitty

hclust spearman median

hclust spearman centroid

hclust kendall ward

hclust kendall single

hclust kendall complete

hclust kendall average

hclust kendall mcquitty

hclust kendall median

hclust kendall centroid

Clustering 1

Carter

Clinton

Eisenhower

Ford

Johnson

Kennedy

Nixon

Obama

Roosevelt

Truman

Bush

HWBush

Reagan

``Other Presidents ''

``Reagan Republicans''

Clustering 2

Carter

Eisenhower

Ford

Johnson

Kennedy

Nixon

Roosevelt

Truman

Bush

Clinton

HWBush

Obama

Reagan

``RooseveltTo Carter''

`` Reagan To Obama ''

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 8

/ 20

Page 50: discov-uga_1.pdf

Software Screenshot

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 9

/ 20

Page 51: discov-uga_1.pdf

Evaluating Performance

Goals:

Validate Claim: computer-assisted conceptualization outperformshuman conceptualizationDemonstrate: new experimental designs for cluster evaluationInject human judgement: relying on insights from survey research

We now present three evaluations

Cluster Quality ⇒ RA codersInformative discoveries ⇒ Experienced scholars analyzing textsDiscovery ⇒ You’re the judge

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 10

/ 20

Page 52: discov-uga_1.pdf

Evaluating Performance

Goals:

Validate Claim: computer-assisted conceptualization outperformshuman conceptualizationDemonstrate: new experimental designs for cluster evaluationInject human judgement: relying on insights from survey research

We now present three evaluations

Cluster Quality ⇒ RA codersInformative discoveries ⇒ Experienced scholars analyzing textsDiscovery ⇒ You’re the judge

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 10

/ 20

Page 53: discov-uga_1.pdf

Evaluating Performance

Goals:

Validate Claim: computer-assisted conceptualization outperformshuman conceptualization

Demonstrate: new experimental designs for cluster evaluationInject human judgement: relying on insights from survey research

We now present three evaluations

Cluster Quality ⇒ RA codersInformative discoveries ⇒ Experienced scholars analyzing textsDiscovery ⇒ You’re the judge

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 10

/ 20

Page 54: discov-uga_1.pdf

Evaluating Performance

Goals:

Validate Claim: computer-assisted conceptualization outperformshuman conceptualizationDemonstrate: new experimental designs for cluster evaluation

Inject human judgement: relying on insights from survey research

We now present three evaluations

Cluster Quality ⇒ RA codersInformative discoveries ⇒ Experienced scholars analyzing textsDiscovery ⇒ You’re the judge

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 10

/ 20

Page 55: discov-uga_1.pdf

Evaluating Performance

Goals:

Validate Claim: computer-assisted conceptualization outperformshuman conceptualizationDemonstrate: new experimental designs for cluster evaluationInject human judgement: relying on insights from survey research

We now present three evaluations

Cluster Quality ⇒ RA codersInformative discoveries ⇒ Experienced scholars analyzing textsDiscovery ⇒ You’re the judge

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 10

/ 20

Page 56: discov-uga_1.pdf

Evaluating Performance

Goals:

Validate Claim: computer-assisted conceptualization outperformshuman conceptualizationDemonstrate: new experimental designs for cluster evaluationInject human judgement: relying on insights from survey research

We now present three evaluations

Cluster Quality ⇒ RA codersInformative discoveries ⇒ Experienced scholars analyzing textsDiscovery ⇒ You’re the judge

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 10

/ 20

Page 57: discov-uga_1.pdf

Evaluating Performance

Goals:

Validate Claim: computer-assisted conceptualization outperformshuman conceptualizationDemonstrate: new experimental designs for cluster evaluationInject human judgement: relying on insights from survey research

We now present three evaluations

Cluster Quality ⇒ RA coders

Informative discoveries ⇒ Experienced scholars analyzing textsDiscovery ⇒ You’re the judge

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 10

/ 20

Page 58: discov-uga_1.pdf

Evaluating Performance

Goals:

Validate Claim: computer-assisted conceptualization outperformshuman conceptualizationDemonstrate: new experimental designs for cluster evaluationInject human judgement: relying on insights from survey research

We now present three evaluations

Cluster Quality ⇒ RA codersInformative discoveries ⇒ Experienced scholars analyzing texts

Discovery ⇒ You’re the judge

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 10

/ 20

Page 59: discov-uga_1.pdf

Evaluating Performance

Goals:

Validate Claim: computer-assisted conceptualization outperformshuman conceptualizationDemonstrate: new experimental designs for cluster evaluationInject human judgement: relying on insights from survey research

We now present three evaluations

Cluster Quality ⇒ RA codersInformative discoveries ⇒ Experienced scholars analyzing textsDiscovery ⇒ You’re the judge

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 10

/ 20

Page 60: discov-uga_1.pdf

Evaluation 1: Cluster Quality

What Are Humans Good For?

They can’t: keep many documents & clusters in their headThey can: compare two documents at a time=⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

automated visualization to choose one clusteringmany pairs of documentsfor coders: (1) unrelated, (2) loosely related, (3) closely relatedQuality = mean(within cluster) - mean(between clusters)Bias results against ourselves by not letting evaluators choose clustering

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 11

/ 20

Page 61: discov-uga_1.pdf

Evaluation 1: Cluster Quality

What Are Humans Good For?

They can’t: keep many documents & clusters in their headThey can: compare two documents at a time=⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

automated visualization to choose one clusteringmany pairs of documentsfor coders: (1) unrelated, (2) loosely related, (3) closely relatedQuality = mean(within cluster) - mean(between clusters)Bias results against ourselves by not letting evaluators choose clustering

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 11

/ 20

Page 62: discov-uga_1.pdf

Evaluation 1: Cluster Quality

What Are Humans Good For?

They can’t: keep many documents & clusters in their head

They can: compare two documents at a time=⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

automated visualization to choose one clusteringmany pairs of documentsfor coders: (1) unrelated, (2) loosely related, (3) closely relatedQuality = mean(within cluster) - mean(between clusters)Bias results against ourselves by not letting evaluators choose clustering

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 11

/ 20

Page 63: discov-uga_1.pdf

Evaluation 1: Cluster Quality

What Are Humans Good For?

They can’t: keep many documents & clusters in their headThey can: compare two documents at a time

=⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

automated visualization to choose one clusteringmany pairs of documentsfor coders: (1) unrelated, (2) loosely related, (3) closely relatedQuality = mean(within cluster) - mean(between clusters)Bias results against ourselves by not letting evaluators choose clustering

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 11

/ 20

Page 64: discov-uga_1.pdf

Evaluation 1: Cluster Quality

What Are Humans Good For?

They can’t: keep many documents & clusters in their headThey can: compare two documents at a time=⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

automated visualization to choose one clusteringmany pairs of documentsfor coders: (1) unrelated, (2) loosely related, (3) closely relatedQuality = mean(within cluster) - mean(between clusters)Bias results against ourselves by not letting evaluators choose clustering

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 11

/ 20

Page 65: discov-uga_1.pdf

Evaluation 1: Cluster Quality

What Are Humans Good For?

They can’t: keep many documents & clusters in their headThey can: compare two documents at a time=⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

automated visualization to choose one clusteringmany pairs of documentsfor coders: (1) unrelated, (2) loosely related, (3) closely relatedQuality = mean(within cluster) - mean(between clusters)Bias results against ourselves by not letting evaluators choose clustering

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 11

/ 20

Page 66: discov-uga_1.pdf

Evaluation 1: Cluster Quality

What Are Humans Good For?

They can’t: keep many documents & clusters in their headThey can: compare two documents at a time=⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

automated visualization to choose one clustering

many pairs of documentsfor coders: (1) unrelated, (2) loosely related, (3) closely relatedQuality = mean(within cluster) - mean(between clusters)Bias results against ourselves by not letting evaluators choose clustering

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 11

/ 20

Page 67: discov-uga_1.pdf

Evaluation 1: Cluster Quality

What Are Humans Good For?

They can’t: keep many documents & clusters in their headThey can: compare two documents at a time=⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

automated visualization to choose one clusteringmany pairs of documents

for coders: (1) unrelated, (2) loosely related, (3) closely relatedQuality = mean(within cluster) - mean(between clusters)Bias results against ourselves by not letting evaluators choose clustering

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 11

/ 20

Page 68: discov-uga_1.pdf

Evaluation 1: Cluster Quality

What Are Humans Good For?

They can’t: keep many documents & clusters in their headThey can: compare two documents at a time=⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

automated visualization to choose one clusteringmany pairs of documentsfor coders: (1) unrelated, (2) loosely related, (3) closely related

Quality = mean(within cluster) - mean(between clusters)Bias results against ourselves by not letting evaluators choose clustering

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 11

/ 20

Page 69: discov-uga_1.pdf

Evaluation 1: Cluster Quality

What Are Humans Good For?

They can’t: keep many documents & clusters in their headThey can: compare two documents at a time=⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

automated visualization to choose one clusteringmany pairs of documentsfor coders: (1) unrelated, (2) loosely related, (3) closely relatedQuality = mean(within cluster) - mean(between clusters)

Bias results against ourselves by not letting evaluators choose clustering

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 11

/ 20

Page 70: discov-uga_1.pdf

Evaluation 1: Cluster Quality

What Are Humans Good For?

They can’t: keep many documents & clusters in their headThey can: compare two documents at a time=⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

automated visualization to choose one clusteringmany pairs of documentsfor coders: (1) unrelated, (2) loosely related, (3) closely relatedQuality = mean(within cluster) - mean(between clusters)Bias results against ourselves by not letting evaluators choose clustering

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 11

/ 20

Page 71: discov-uga_1.pdf

Evaluation 1: Cluster Quality

What Are Humans Good For?

They can’t: keep many documents & clusters in their headThey can: compare two documents at a time=⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

automated visualization to choose one clusteringmany pairs of documentsfor coders: (1) unrelated, (2) loosely related, (3) closely relatedQuality = mean(within cluster) - mean(between clusters)Bias results against ourselves by not letting evaluators choose clustering

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 11

/ 20

Page 72: discov-uga_1.pdf

Evaluation 1: Cluster Quality

(Our Method) − (Human Coders)

−0.3 −0.2 −0.1 0.1 0.2 0.3

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 12

/ 20

Page 73: discov-uga_1.pdf

Evaluation 1: Cluster Quality

(Our Method) − (Human Coders)

−0.3 −0.2 −0.1 0.1 0.2 0.3

Lautenberg Press Releases

Lautenberg: 200 Senate Press Releases (appropriations, economy,education, tax, veterans, . . . )

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 12

/ 20

Page 74: discov-uga_1.pdf

Evaluation 1: Cluster Quality

(Our Method) − (Human Coders)

−0.3 −0.2 −0.1 0.1 0.2 0.3

Lautenberg Press Releases

Policy Agendas Project

Policy Agendas: 213 quasi-sentences from Bush’s State of the Union(agriculture, banking & commerce, civil rights/liberties, defense, . . . )

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 12

/ 20

Page 75: discov-uga_1.pdf

Evaluation 1: Cluster Quality

(Our Method) − (Human Coders)

−0.3 −0.2 −0.1 0.1 0.2 0.3

Lautenberg Press Releases

Policy Agendas Project

Reuter's Gold Standard

Reuter’s: financial news (trade, earnings, copper, gold, coffee, . . . ); “goldstandard” for supervised learning studies

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 12

/ 20

Page 76: discov-uga_1.pdf

Evaluation 2: More Informative Discoveries

Found 2 scholars analyzing lots of textual data for their work

Created 6 clusterings:

2 clusterings selected with our method (biased against us)2 clusterings from each of 2 other methods (varying tuning parameters)

Created info packet on each clustering (for each cluster: exemplardocument, automated content summary)

Asked for(62

)=15 pairwise comparisons

User chooses ⇒ only care about the one clustering that wins

Both cases a Condorcet winner:

“Immigration”:

Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2

“Genetic testing”:

Our Method 1 → {Our Method 2, K-Means 1, K-means 2} → Dir Proc. 1 → Dir Proc. 2

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 13

/ 20

Page 77: discov-uga_1.pdf

Evaluation 2: More Informative Discoveries

Found 2 scholars analyzing lots of textual data for their work

Created 6 clusterings:

2 clusterings selected with our method (biased against us)2 clusterings from each of 2 other methods (varying tuning parameters)

Created info packet on each clustering (for each cluster: exemplardocument, automated content summary)

Asked for(62

)=15 pairwise comparisons

User chooses ⇒ only care about the one clustering that wins

Both cases a Condorcet winner:

“Immigration”:

Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2

“Genetic testing”:

Our Method 1 → {Our Method 2, K-Means 1, K-means 2} → Dir Proc. 1 → Dir Proc. 2

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 13

/ 20

Page 78: discov-uga_1.pdf

Evaluation 2: More Informative Discoveries

Found 2 scholars analyzing lots of textual data for their work

Created 6 clusterings:

2 clusterings selected with our method (biased against us)2 clusterings from each of 2 other methods (varying tuning parameters)

Created info packet on each clustering (for each cluster: exemplardocument, automated content summary)

Asked for(62

)=15 pairwise comparisons

User chooses ⇒ only care about the one clustering that wins

Both cases a Condorcet winner:

“Immigration”:

Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2

“Genetic testing”:

Our Method 1 → {Our Method 2, K-Means 1, K-means 2} → Dir Proc. 1 → Dir Proc. 2

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 13

/ 20

Page 79: discov-uga_1.pdf

Evaluation 2: More Informative Discoveries

Found 2 scholars analyzing lots of textual data for their work

Created 6 clusterings:

2 clusterings selected with our method (biased against us)

2 clusterings from each of 2 other methods (varying tuning parameters)

Created info packet on each clustering (for each cluster: exemplardocument, automated content summary)

Asked for(62

)=15 pairwise comparisons

User chooses ⇒ only care about the one clustering that wins

Both cases a Condorcet winner:

“Immigration”:

Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2

“Genetic testing”:

Our Method 1 → {Our Method 2, K-Means 1, K-means 2} → Dir Proc. 1 → Dir Proc. 2

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 13

/ 20

Page 80: discov-uga_1.pdf

Evaluation 2: More Informative Discoveries

Found 2 scholars analyzing lots of textual data for their work

Created 6 clusterings:

2 clusterings selected with our method (biased against us)2 clusterings from each of 2 other methods (varying tuning parameters)

Created info packet on each clustering (for each cluster: exemplardocument, automated content summary)

Asked for(62

)=15 pairwise comparisons

User chooses ⇒ only care about the one clustering that wins

Both cases a Condorcet winner:

“Immigration”:

Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2

“Genetic testing”:

Our Method 1 → {Our Method 2, K-Means 1, K-means 2} → Dir Proc. 1 → Dir Proc. 2

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 13

/ 20

Page 81: discov-uga_1.pdf

Evaluation 2: More Informative Discoveries

Found 2 scholars analyzing lots of textual data for their work

Created 6 clusterings:

2 clusterings selected with our method (biased against us)2 clusterings from each of 2 other methods (varying tuning parameters)

Created info packet on each clustering (for each cluster: exemplardocument, automated content summary)

Asked for(62

)=15 pairwise comparisons

User chooses ⇒ only care about the one clustering that wins

Both cases a Condorcet winner:

“Immigration”:

Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2

“Genetic testing”:

Our Method 1 → {Our Method 2, K-Means 1, K-means 2} → Dir Proc. 1 → Dir Proc. 2

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 13

/ 20

Page 82: discov-uga_1.pdf

Evaluation 2: More Informative Discoveries

Found 2 scholars analyzing lots of textual data for their work

Created 6 clusterings:

2 clusterings selected with our method (biased against us)2 clusterings from each of 2 other methods (varying tuning parameters)

Created info packet on each clustering (for each cluster: exemplardocument, automated content summary)

Asked for(62

)=15 pairwise comparisons

User chooses ⇒ only care about the one clustering that wins

Both cases a Condorcet winner:

“Immigration”:

Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2

“Genetic testing”:

Our Method 1 → {Our Method 2, K-Means 1, K-means 2} → Dir Proc. 1 → Dir Proc. 2

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 13

/ 20

Page 83: discov-uga_1.pdf

Evaluation 2: More Informative Discoveries

Found 2 scholars analyzing lots of textual data for their work

Created 6 clusterings:

2 clusterings selected with our method (biased against us)2 clusterings from each of 2 other methods (varying tuning parameters)

Created info packet on each clustering (for each cluster: exemplardocument, automated content summary)

Asked for(62

)=15 pairwise comparisons

User chooses ⇒ only care about the one clustering that wins

Both cases a Condorcet winner:

“Immigration”:

Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2

“Genetic testing”:

Our Method 1 → {Our Method 2, K-Means 1, K-means 2} → Dir Proc. 1 → Dir Proc. 2

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 13

/ 20

Page 84: discov-uga_1.pdf

Evaluation 2: More Informative Discoveries

Found 2 scholars analyzing lots of textual data for their work

Created 6 clusterings:

2 clusterings selected with our method (biased against us)2 clusterings from each of 2 other methods (varying tuning parameters)

Created info packet on each clustering (for each cluster: exemplardocument, automated content summary)

Asked for(62

)=15 pairwise comparisons

User chooses ⇒ only care about the one clustering that wins

Both cases a Condorcet winner:

“Immigration”:

Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2

“Genetic testing”:

Our Method 1 → {Our Method 2, K-Means 1, K-means 2} → Dir Proc. 1 → Dir Proc. 2

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 13

/ 20

Page 85: discov-uga_1.pdf

Evaluation 2: More Informative Discoveries

Found 2 scholars analyzing lots of textual data for their work

Created 6 clusterings:

2 clusterings selected with our method (biased against us)2 clusterings from each of 2 other methods (varying tuning parameters)

Created info packet on each clustering (for each cluster: exemplardocument, automated content summary)

Asked for(62

)=15 pairwise comparisons

User chooses ⇒ only care about the one clustering that wins

Both cases a Condorcet winner:

“Immigration”:

Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2

“Genetic testing”:

Our Method 1 → {Our Method 2, K-Means 1, K-means 2} → Dir Proc. 1 → Dir Proc. 2

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 13

/ 20

Page 86: discov-uga_1.pdf

Evaluation 2: More Informative Discoveries

Found 2 scholars analyzing lots of textual data for their work

Created 6 clusterings:

2 clusterings selected with our method (biased against us)2 clusterings from each of 2 other methods (varying tuning parameters)

Created info packet on each clustering (for each cluster: exemplardocument, automated content summary)

Asked for(62

)=15 pairwise comparisons

User chooses ⇒ only care about the one clustering that wins

Both cases a Condorcet winner:

“Immigration”:

Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2

“Genetic testing”:

Our Method 1 → {Our Method 2, K-Means 1, K-means 2} → Dir Proc. 1 → Dir Proc. 2

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 13

/ 20

Page 87: discov-uga_1.pdf

Evaluation 3: What Do Members of Congress Do?

- David Mayhew’s (1974) famous typology

- Advertising- Credit Claiming- Position Taking

- Data: 200 press releases from Frank Lautenberg’s office (D-NJ)

- Apply our method

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 14

/ 20

Page 88: discov-uga_1.pdf

Evaluation 3: What Do Members of Congress Do?

- David Mayhew’s (1974) famous typology

- Advertising- Credit Claiming- Position Taking

- Data: 200 press releases from Frank Lautenberg’s office (D-NJ)

- Apply our method

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 14

/ 20

Page 89: discov-uga_1.pdf

Evaluation 3: What Do Members of Congress Do?

- David Mayhew’s (1974) famous typology

- Advertising

- Credit Claiming- Position Taking

- Data: 200 press releases from Frank Lautenberg’s office (D-NJ)

- Apply our method

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 14

/ 20

Page 90: discov-uga_1.pdf

Evaluation 3: What Do Members of Congress Do?

- David Mayhew’s (1974) famous typology

- Advertising- Credit Claiming

- Position Taking

- Data: 200 press releases from Frank Lautenberg’s office (D-NJ)

- Apply our method

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 14

/ 20

Page 91: discov-uga_1.pdf

Evaluation 3: What Do Members of Congress Do?

- David Mayhew’s (1974) famous typology

- Advertising- Credit Claiming- Position Taking

- Data: 200 press releases from Frank Lautenberg’s office (D-NJ)

- Apply our method

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 14

/ 20

Page 92: discov-uga_1.pdf

Evaluation 3: What Do Members of Congress Do?

- David Mayhew’s (1974) famous typology

- Advertising- Credit Claiming- Position Taking

- Data: 200 press releases from Frank Lautenberg’s office (D-NJ)

- Apply our method

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 14

/ 20

Page 93: discov-uga_1.pdf

Evaluation 3: What Do Members of Congress Do?

- David Mayhew’s (1974) famous typology

- Advertising- Credit Claiming- Position Taking

- Data: 200 press releases from Frank Lautenberg’s office (D-NJ)

- Apply our method

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 14

/ 20

Page 94: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 95: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

affprop cosine

Red point: a clustering byAffinity Propagation-Cosine(Dueck and Frey 2007)

Close to:Mixture of von Mises-Fisherdistributions (Banerjee et. al.2005)

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 96: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

affprop cosine

mixvmf

Red point: a clustering byAffinity Propagation-Cosine(Dueck and Frey 2007)Close to:Mixture of von Mises-Fisherdistributions (Banerjee et. al.2005)

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 97: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Space between methods:

local cluster ensemble

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 98: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Space between methods:

local cluster ensemble

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 99: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Space between methods:local cluster ensemble

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 100: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 101: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Found a region with particularlyinsightful clusterings

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 102: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

●●

Mixture:

0.39 Hclust-Canberra-McQuitty

0.30 Spectral clusteringRandom Walk(Metrics 1-6)

0.13 Hclust-Correlation-Ward

0.09 Hclust-Pearson-Ward

0.05 Kmediods-Cosine

0.04 Spectral clusteringSymmetric(Metrics 1-6)

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 103: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

●●

Mixture:

0.39 Hclust-Canberra-McQuitty

0.30 Spectral clusteringRandom Walk(Metrics 1-6)

0.13 Hclust-Correlation-Ward

0.09 Hclust-Pearson-Ward

0.05 Kmediods-Cosine

0.04 Spectral clusteringSymmetric(Metrics 1-6)

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 104: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

●●

Mixture:

0.39 Hclust-Canberra-McQuitty

0.30 Spectral clusteringRandom Walk(Metrics 1-6)

0.13 Hclust-Correlation-Ward

0.09 Hclust-Pearson-Ward

0.05 Kmediods-Cosine

0.04 Spectral clusteringSymmetric(Metrics 1-6)

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 105: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

●●

Mixture:

0.39 Hclust-Canberra-McQuitty

0.30 Spectral clusteringRandom Walk(Metrics 1-6)

0.13 Hclust-Correlation-Ward

0.09 Hclust-Pearson-Ward

0.05 Kmediods-Cosine

0.04 Spectral clusteringSymmetric(Metrics 1-6)

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 106: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

●●

Mixture:

0.39 Hclust-Canberra-McQuitty

0.30 Spectral clusteringRandom Walk(Metrics 1-6)

0.13 Hclust-Correlation-Ward

0.09 Hclust-Pearson-Ward

0.05 Kmediods-Cosine

0.04 Spectral clusteringSymmetric(Metrics 1-6)

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 107: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

●●

Mixture:

0.39 Hclust-Canberra-McQuitty

0.30 Spectral clusteringRandom Walk(Metrics 1-6)

0.13 Hclust-Correlation-Ward

0.09 Hclust-Pearson-Ward

0.05 Kmediods-Cosine

0.04 Spectral clusteringSymmetric(Metrics 1-6)

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 108: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

●●

Mixture:

0.39 Hclust-Canberra-McQuitty

0.30 Spectral clusteringRandom Walk(Metrics 1-6)

0.13 Hclust-Correlation-Ward

0.09 Hclust-Pearson-Ward

0.05 Kmediods-Cosine

0.04 Spectral clusteringSymmetric(Metrics 1-6)

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 109: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Clusters in this Clustering

MayhewGary King (Harvard IQSS) Quantitative Discovery

Parthemos Lecture at University of Georgia, 3/4/2011 15/ 20

Page 110: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Clusters in this Clustering

Mayhew

●● ●

●●

●●

●●

●●

Credit ClaimingPork

Credit Claiming, Pork:“Sens. Frank R. Lautenberg(D-NJ) and Robert Menendez(D-NJ) announced that the U.S.Department of Commerce hasawarded a $100,000 grant to theSouth Jersey EconomicDevelopment District”

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 111: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Clusters in this Clustering

Mayhew

●● ●

●●

●●

●●

●●

Credit ClaimingPork

● ●●

●●

●●

●●

Credit ClaimingLegislation

Credit Claiming, Legislation:“As the Senate begins its recess,Senator Frank Lautenberg todaypointed to a string of victories inCongress on his legislative agendaduring this work period”

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 112: discov-uga_1.pdf

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Clusters in this Clustering

Mayhew

●● ●

●●

●●

●●

●●

Credit ClaimingPork

● ●●

●●

●●

●●

Credit ClaimingLegislation

●●

●●

●●

AdvertisingAdvertising

Advertising:“Senate AdoptsLautenberg/Menendez ResolutionHonoring Spelling Bee Championfrom New Jersey”

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 113: discov-uga_1.pdf

Example Discovery: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Clusters in this Clustering

Mayhew

●● ●

●●

●●

●●

●●

Credit ClaimingPork

● ●●

●●

●●

●●

Credit ClaimingLegislation

●●

●●

●●

AdvertisingAdvertising

●●

●●

●●

● ●

●●

●●

●●

● ●

Partisan Taunting

Partisan Taunting:“Republicans Selling Out Nationon Chemical Plant Security”

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 114: discov-uga_1.pdf

Example Discovery: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Clusters in this Clustering

Mayhew

●● ●

●●

●●

●●

●●

Credit ClaimingPork

● ●●

●●

●●

●●

Credit ClaimingLegislation

●●

●●

●●

AdvertisingAdvertising

●●

●●

●●

● ●

●●

●●

●●

● ●

Partisan Taunting

Partisan Taunting:“Senator Lautenberg’samendment would change thename of. . . the Republicanbill. . . to ‘More Tax Breaks forthe Rich and More Debt for OurGrandchildren Deficit ExpansionReconciliation Act of 2006”’

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 115: discov-uga_1.pdf

Example Discovery: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Clusters in this Clustering

Mayhew

●● ●

●●

●●

●●

●●

Credit ClaimingPork

● ●●

●●

●●

●●

Credit ClaimingLegislation

●●

●●

●●

AdvertisingAdvertising

●●

●●

●●

● ●

●●

●●

●●

● ●

Partisan Taunting

Definition: Explicit, public, andnegative attacks on anotherpolitical party or its members

Taunting ruinsdeliberation

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 116: discov-uga_1.pdf

Example Discovery: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Clusters in this Clustering

Mayhew

●● ●

●●

●●

●●

●●

Credit ClaimingPork

● ●●

●●

●●

●●

Credit ClaimingLegislation

●●

●●

●●

AdvertisingAdvertising

●●

●●

●●

● ●

●●

●●

●●

● ●

Partisan Taunting

Definition: Explicit, public, andnegative attacks on anotherpolitical party or its members

Taunting ruinsdeliberation

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 15

/ 20

Page 117: discov-uga_1.pdf

In Sample Illustration of Partisan Taunting

Taunting ruins deliberation

Sen. Lautenbergon Senate Floor4/29/04

- “Senator Lautenberg BlastsRepublicans as ‘Chicken Hawks’ ”[Government Oversight]

- “The scopes trial took place in1925. Sadly, President Bush’s vetotoday shows that we haven’tprogressed much since then”[Healthcare]

- “Every day the House Republicansdragged this out was a day thatmade our communities lesssafe.”[Homeland Security]

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 16

/ 20

Page 118: discov-uga_1.pdf

In Sample Illustration of Partisan Taunting

Taunting ruins deliberation

Sen. Lautenbergon Senate Floor4/29/04

- “Senator Lautenberg BlastsRepublicans as ‘Chicken Hawks’ ”[Government Oversight]

- “The scopes trial took place in1925. Sadly, President Bush’s vetotoday shows that we haven’tprogressed much since then”[Healthcare]

- “Every day the House Republicansdragged this out was a day thatmade our communities lesssafe.”[Homeland Security]

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 16

/ 20

Page 119: discov-uga_1.pdf

In Sample Illustration of Partisan Taunting

Taunting ruins deliberation

Sen. Lautenbergon Senate Floor4/29/04

- “Senator Lautenberg BlastsRepublicans as ‘Chicken Hawks’ ”[Government Oversight]

- “The scopes trial took place in1925. Sadly, President Bush’s vetotoday shows that we haven’tprogressed much since then”[Healthcare]

- “Every day the House Republicansdragged this out was a day thatmade our communities lesssafe.”[Homeland Security]

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 16

/ 20

Page 120: discov-uga_1.pdf

Out of Sample Confirmation of Partisan Taunting

- Discovered using 200 press releases; 1 senator.

- Confirmed using 64,033 press releases; 301 senator-years.- Apply supervised learning method: measure proportion of press

releases a senator taunts other party

Prop. of Press Releases Taunting

Fre

quen

cy

0.1 0.2 0.3 0.4 0.5

1020

30

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 17

/ 20

Page 121: discov-uga_1.pdf

Out of Sample Confirmation of Partisan Taunting

- Discovered using 200 press releases; 1 senator.- Confirmed using 64,033 press releases; 301 senator-years.

- Apply supervised learning method: measure proportion of pressreleases a senator taunts other party

Prop. of Press Releases Taunting

Fre

quen

cy

0.1 0.2 0.3 0.4 0.5

1020

30

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 17

/ 20

Page 122: discov-uga_1.pdf

Out of Sample Confirmation of Partisan Taunting

- Discovered using 200 press releases; 1 senator.- Confirmed using 64,033 press releases; 301 senator-years.- Apply supervised learning method: measure proportion of press

releases a senator taunts other party

Prop. of Press Releases Taunting

Fre

quen

cy

0.1 0.2 0.3 0.4 0.5

1020

30

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 17

/ 20

Page 123: discov-uga_1.pdf

Out of Sample Confirmation of Partisan Taunting

- Discovered using 200 press releases; 1 senator.- Confirmed using 64,033 press releases; 301 senator-years.- Apply supervised learning method: measure proportion of press

releases a senator taunts other party

Prop. of Press Releases Taunting

Fre

quen

cy

0.1 0.2 0.3 0.4 0.5

1020

30

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 18

/ 20

Page 124: discov-uga_1.pdf

Out of Sample Confirmation of Partisan Taunting

- Discovered using 200 press releases; 1 senator.- Confirmed using 64,033 press releases; 301 senator-years.- Apply supervised learning method: measure proportion of press

releases a senator taunts other party

Prop. of Press Releases Taunting

Fre

quen

cy

0.1 0.2 0.3 0.4 0.5

1020

30

On Avg., Senators Taunt in 27 % of Press Releases

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 18

/ 20

Page 125: discov-uga_1.pdf

Quantitative Methods for Qualitative Conceptualization

1) Conceptualization

2) Measurement

3) Validation

Quantitative Methods

Qualitative Methods (reading!)

Quantitative methods for conceptualization and discovery

- Few formal methods designed explicitly for conceptualization

- Belittled: “Tom Swift and His Electric Factor Analysis Machine”(Armstrong 1967)

- Evaluation methods measure progress in discovery

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 19

/ 20

Page 126: discov-uga_1.pdf

Quantitative Methods for Qualitative Conceptualization

1) Conceptualization

2) Measurement

3) Validation

Quantitative Methods

Qualitative Methods (reading!)

Quantitative methods for conceptualization and discovery

- Few formal methods designed explicitly for conceptualization

- Belittled: “Tom Swift and His Electric Factor Analysis Machine”(Armstrong 1967)

- Evaluation methods measure progress in discovery

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 19

/ 20

Page 127: discov-uga_1.pdf

Quantitative Methods for Qualitative Conceptualization

1) Conceptualization

2) Measurement

3) Validation

Quantitative Methods

Qualitative Methods (reading!)

Quantitative methods for conceptualization and discovery

- Few formal methods designed explicitly for conceptualization

- Belittled: “Tom Swift and His Electric Factor Analysis Machine”(Armstrong 1967)

- Evaluation methods measure progress in discovery

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 19

/ 20

Page 128: discov-uga_1.pdf

Quantitative Methods for Qualitative Conceptualization

1) Conceptualization

2) Measurement

3) Validation

Quantitative Methods

Qualitative Methods (reading!)

Quantitative methods for conceptualization and discovery

- Few formal methods designed explicitly for conceptualization

- Belittled: “Tom Swift and His Electric Factor Analysis Machine”(Armstrong 1967)

- Evaluation methods measure progress in discoveryGary King (Harvard IQSS) Quantitative Discovery

Parthemos Lecture at University of Georgia, 3/4/2011 19/ 20

Page 129: discov-uga_1.pdf

For more information

http://GKing.Harvard.edu

Gary King (Harvard IQSS) Quantitative DiscoveryParthemos Lecture at University of Georgia, 3/4/2011 20

/ 20