Text Clustering (PengBo, Nov 1, 2010)


Page 1

Text Clustering

PengBo, Nov 1, 2010

Page 2

Today’s Topic

Document clustering: motivations

Clustering algorithms: partitional and hierarchical

Evaluation

Page 3

What’s Clustering?

Page 4

What is clustering?

Clustering: the process of grouping a set of objects into classes of similar objects. It is the commonest form of unsupervised learning.

Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given.

Clustering is a common and important task that finds many applications in IR and other places.

Page 5

Clustering: Internal Criteria

High intra-cluster similarity; low inter-cluster similarity.

How many clusters?

Page 6

Issues for clustering

Representation for clustering:
Document representation: vector space or language model?
Similarity / distance: cosine similarity or KL divergence?

How many clusters?
Fixed a priori, or completely data-driven?
Avoid "trivial" clusters: too large or too small.

Page 7

Clustering Algorithms

Hard clustering algorithms compute a hard assignment: each document is a member of exactly one cluster.

Soft clustering algorithms compute a soft assignment: a document's assignment is a distribution over all clusters.

Page 8

Clustering Algorithms

Flat algorithms create the cluster set without explicit structure. They usually start with a random (partial) partitioning and refine it iteratively. Examples: K-means clustering, model-based clustering.

Hierarchical algorithms are bottom-up (agglomerative) or top-down (divisive).


Page 10

Evaluation

Page 11

Think about it…

Evaluate by high internal criterion scores? The objective function rewards high intra-cluster similarity and low inter-cluster similarity.

Two perspectives: application / user judgment, and internal judgment.

Page 12

Example

[Figure: a set of points grouped into three clusters, Cluster I, Cluster II, Cluster III]

Page 13

External criteria for clustering quality

What is the test set? What is the ground truth? Assume the documents come with a set of gold-standard classes, while our clustering algorithm produces K clusters ω1, ω2, …, ωK with n_i members each.

A simple measure: purity, defined as the ratio between the number of documents of the dominant class c_i in cluster ω_k and the size of cluster ω_k.

Ω = {ω1, ω2, . . . , ωK} is the set of clusters and C = {c1, c2, . . . , cJ} the set of classes.
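
In symbols (a reconstruction of the standard purity formula the slide describes; N is the total number of documents):

    purity(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|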

Page 14

Purity example

[Figure: the three clusters from the previous example, with class counts 5/1/0, 1/4/1, and 2/0/3]

Cluster I: purity = (1/6) · max(5, 1, 0) = 5/6
Cluster II: purity = (1/6) · max(1, 4, 1) = 4/6
Cluster III: purity = (1/5) · max(2, 0, 3) = 3/5

Total: purity = (1/17) · (5 + 4 + 3) = 12/17

Page 15

Rand Index

View it as a series of decisions, one for each of the N(N − 1)/2 pairs of documents in the collection:

A true positive (TP) decision assigns two similar documents to the same cluster.
A true negative (TN) decision assigns two dissimilar documents to different clusters.
A false positive (FP) decision assigns two dissimilar documents to the same cluster.
A false negative (FN) decision assigns two similar documents to different clusters.
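
These four counts combine into the Rand index, the accuracy of the pairwise decisions (a reconstruction of the standard formula behind this slide):

    RI = \frac{TP + TN}{TP + FP + FN + TN}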

Page 16

Rand Index

Number of document pairs              Same cluster in clustering    Different clusters in clustering
Same class in ground truth            TP                            FN
Different classes in ground truth     FP                            TN

Page 17

Rand index example

[Figure: the same three-cluster example as on Page 14]
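
Working the numbers from the purity example on Page 14 (a sketch of the standard computation; the class counts 5/1/0, 1/4/1, 2/0/3 come from that figure):

    TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 15 + 15 + 10 = 40
    TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 10 + 6 + 3 + 1 = 20

so FP = 40 − 20 = 20; counting cross-cluster pairs the same way gives FN = 24 and TN = 72, hence

    RI = \frac{TP + TN}{TP + FP + FN + TN} = \frac{20 + 72}{136} \approx 0.68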

Page 18

K-Means Algorithm

Page 19

Partitioning Algorithms

Given: a set of documents D and the number K.

Find: a partition of D into K clusters that optimizes the chosen partitioning criterion.

Globally optimal: exhaustively enumerate all partitions. Effective heuristic method: the K-means algorithm.

Partitioning criterion: residual sum of squares (RSS).
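
Written out (a reconstruction of the standard criterion named above, with \mu(\omega_k) denoting the centroid of cluster \omega_k):

    RSS = \sum_{k=1}^{K} \sum_{x \in \omega_k} \lVert x - \mu(\omega_k) \rVert^2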

Page 20

K-Means

Assume documents are real-valued vectors.

Clusters are represented by the centroid (aka the center of gravity or mean) of the cluster ω.

Instances are assigned to clusters according to their distance to the cluster centroids: pick the nearest centroid.
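
In symbols (the standard definition of the centroid referred to above):

    \mu(\omega) = \frac{1}{|\omega|} \sum_{x \in \omega} x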

Page 21

K-Means Example (K = 2)

[Figure: pick seeds; reassign clusters; compute centroids (marked ×); reassign clusters; recompute centroids; reassign clusters; converged!]

Page 22

K-Means Algorithm
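
The algorithm box on this slide did not survive extraction. Below is a minimal runnable sketch of the procedure just described (seed selection, then alternating reassignment and recomputation); it is an illustration under the slide's assumptions, not the lecture's exact pseudo-code:

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        # X: (N, M) array of document vectors; k: number of clusters.
        rng = np.random.default_rng(seed)
        # Pick k random documents as the initial centroids (seeds).
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # Reassignment: each doc goes to its nearest centroid.
            dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Recomputation: each centroid becomes the mean of its members.
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
                for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break  # converged: the clusters no longer change
            centroids = new_centroids
        return labels, centroids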

Page 23

Convergence

Why does the K-means algorithm converge? Convergence = a state in which the clusters no longer change.

Reassignment: RSS decreases monotonically, since each vector is assigned to its nearest centroid.

Recomputation: each RSS_k decreases monotonically (m_k is the number of members in cluster k). Which value a = \mu(\omega_k) minimizes RSS_k? Setting the derivative to zero:

    \sum_x -2(x - a) = 0 \;\Longrightarrow\; \sum_x x = m_k a \;\Longrightarrow\; a = \frac{1}{m_k} \sum_x x

i.e., the centroid.

Page 24

Convergence = Global Minimum?

There is unfortunately no guarantee that a global minimum of the objective function will be reached.

[Figure: a seeding where an outlier leads to a suboptimal clustering]

Page 25

Seed Choice

The choice of seeds affects the result. Some seeds lead to slow convergence, or to convergence to sub-optimal clusterings.

Pick seeds with a heuristic (e.g., the doc least similar to any existing mean).
Try out multiple starting points.
Initialize with the results of another clustering method (e.g., by sampling).

In the figure, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.

Page 26

How Many Clusters?

How do we determine the right K? Balance producing more clusters (each internally more homogeneous) against producing too many clusters (e.g., a higher browsing cost). For example, as sketched below:

Define the Benefit of a doc as its cosine similarity to its cluster centroid; the sum over all docs is the Total Benefit.
Define a Cost per cluster.
Define the Value of a clustering as Total Benefit minus Total Cost.
Among all possible values of K, choose the one with the largest Value.
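
One way to write this down (a sketch; the per-cluster cost \lambda is an assumed parameter, not specified on the slide):

    Value(K) = \sum_{d} \cos(d, \mu(\omega(d))) - \lambda K

where \omega(d) is the cluster containing d; pick the K that maximizes Value(K).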

Page 27

Is K-Means Efficient?

Time complexity:

Computing the distance between two docs is O(M), where M is the dimensionality of the vectors.
Reassigning clusters: O(KN) distance computations, i.e., O(KNM).
Computing centroids: each doc gets added once to some centroid: O(NM).
Assume these two steps are each done once in each of I iterations: O(IKNM).

But how large is M? A document is a sparse vector, but a centroid is not. K-medoids algorithms instead use the element closest to the center as "the medoid".

Page 28

Efficiency: Medoid As Cluster Representative

Medoid: use a single document as the cluster representative, e.g., the document closest to the centroid (see the sketch below).

One reason this is useful: consider the representative of a large cluster (>1000 documents). The centroid of this cluster will be a dense vector; the medoid of this cluster will be a sparse vector.

Analogous to mean vs. median: centroid vs. medoid.
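
A minimal sketch of picking the medoid as the member closest to the centroid (the function name is illustrative, not from the lecture):

    import numpy as np

    def medoid(cluster_docs):
        # cluster_docs: (n, M) array of the cluster's document vectors.
        centroid = cluster_docs.mean(axis=0)            # dense vector
        dists = ((cluster_docs - centroid) ** 2).sum(axis=1)
        return cluster_docs[dists.argmin()]             # an actual (sparse) document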

Page 29

Hierarchical Clustering Algorithm

Page 30

Hierarchical Agglomerative Clustering (HAC)

Assume we have a similarity function that determines the similarity of two instances.

Greedy algorithm:
Start with every instance as its own cluster.
Pick the two most similar clusters and merge them into a new cluster.
Repeat until only one cluster remains.

The merge history forms a binary tree, or hierarchy: a dendrogram.

Page 31

Dendrogram: Document Example

As clusters agglomerate, docs are likely to fall into a hierarchy of "topics" or concepts.

[Figure: dendrogram over documents d1-d5, merging d1,d2 and d4,d5 first, then d3 with d4,d5]

Page 32

HAC Algorithm, pseudo-code
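
The pseudo-code image did not survive extraction. As a stand-in, here is a minimal (naive, O(n^3)) sketch of bottom-up HAC using single-link similarity, one of the measures discussed on the following slides; it assumes a precomputed pairwise similarity matrix:

    import numpy as np

    def hac_single_link(sim):
        # sim: (n, n) pairwise similarity matrix; returns the merge history.
        clusters = [[i] for i in range(len(sim))]
        merges = []
        while len(clusters) > 1:
            best, pair = -np.inf, None
            # Find the most similar pair of clusters; single link scores a
            # pair by the similarity of its most similar members.
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    s = max(sim[i][j] for i in clusters[a] for j in clusters[b])
                    if s > best:
                        best, pair = s, (a, b)
            a, b = pair
            merges.append((list(clusters[a]), list(clusters[b])))
            clusters[a] = clusters[a] + clusters[b]
            del clusters[b]
        return merges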

Page 33

Hierarchical Clustering algorithms

Agglomerative (bottom-up): start with each document being a single cluster; eventually all documents belong to the same cluster.

Divisive (top-down): start with all documents belonging to the same cluster; eventually each node forms a cluster of its own.

Neither requires the number of clusters k to be specified in advance.

Page 34

Key notion: cluster representative

How do we compute which two clusters are closest? And, to make this computation efficient, how do we represent each cluster (the cluster representation)?

The representative can be some "typical" or central point of the cluster:
the point inducing the smallest radius to the docs in the cluster (smallest squared distances, etc.);
the point that is the "average" of all docs in the cluster: the centroid or center of gravity.

Page 35

“Closest pair” of clusters

"Center of gravity": the clusters whose centroids (centers of gravity) are the most cosine-similar.

Average-link: the average cosine similarity over all pairs of elements.

Single-link: the closest points (the similarity of the most cosine-similar pair).

Complete-link: the furthest points (the similarity of the "furthest", least cosine-similar pair).

Page 36

Single Link Example

    sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y)

After merging c_i and c_j, the similarity to any other cluster c_k is:

    sim(c_i \cup c_j, c_k) = \max(sim(c_i, c_k), sim(c_j, c_k))

Prone to chaining.

Page 37

Complete Link Example

    sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y)

After merging c_i and c_j:

    sim(c_i \cup c_j, c_k) = \min(sim(c_i, c_k), sim(c_j, c_k))

Affected by outliers.

Page 38

Computational Complexity

In the first iteration, HAC computes the similarities of all pairs: O(n²).

In each of the subsequent n − 2 merging iterations, compute the similarity between the newly created cluster and all existing clusters; the other similarities are unchanged.

To achieve overall O(n²) performance, computing the similarity to each other cluster must take constant time; otherwise the cost is O(n² log n) or O(n³).

Page 39

Centroid Agglomerative Clustering

Example: n = 6, k = 3, repeatedly merge the closest pair of centroids.

[Figure: documents d1-d6; the centroid after the first step and the centroid after the second step]

Page 40

Group Average Agglomerative Clustering

The average similarity of all pairs in the merged cluster:

    sim(c_i, c_j) = \frac{1}{(|c_i| + |c_j|)(|c_i| + |c_j| - 1)} \sum_{x \in c_i \cup c_j} \sum_{\substack{y \in c_i \cup c_j \\ y \neq x}} sim(x, y)

Can this be computed in constant time? Yes, if the vectors are normalized to unit length and we maintain the sum of the vectors of each cluster:

    s(c_j) = \sum_{x \in c_j} x

Then:

    sim(c_i, c_j) = \frac{(s(c_i) + s(c_j)) \cdot (s(c_i) + s(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)(|c_i| + |c_j| - 1)}
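
A quick numerical check of the constant-time form against the brute-force pairwise average (a sketch; the random unit-normalized vectors are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    ci = rng.normal(size=(4, 8)); ci /= np.linalg.norm(ci, axis=1, keepdims=True)
    cj = rng.normal(size=(3, 8)); cj /= np.linalg.norm(cj, axis=1, keepdims=True)

    union = np.vstack([ci, cj])
    n = len(union)
    sims = union @ union.T
    # Brute force: average over all ordered pairs x != y
    # (subtract the n unit self-similarities from the total).
    brute = (sims.sum() - n) / (n * (n - 1))
    # Constant-time form via the sum vector s = s(ci) + s(cj).
    s = union.sum(axis=0)
    fast = (s @ s - n) / (n * (n - 1))
    assert np.isclose(brute, fast)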

Page 41

Exercise

Consider agglomerative clustering of n points on a straight line. Can you avoid all n³ distance/similarity computations? How many computations does your scheme need?

Page 42

Efficiency: “Using approximations”

In the standard algorithm, every step must find the closest pair of centroids.

Approximate algorithm: find a nearly closest pair. A simplistic example: maintain the closest pair based on distances in a projection onto a random line.

[Figure: points projected onto a random line]

Page 43

Applications in IR

Page 44

Navigating document collections

Information Retrieval is like a book index; document clusters are like a table of contents.

Index:
Aardvark, 15
Blueberry, 200
Capricorn, 1, 45-55
Dog, 79-99
Egypt, 65
Falafel, 78-90
Giraffes, 45-59

Table of Contents:
1. Science of Cognition
1.a. Motivations
1.a.i. Intellectual Curiosity
1.a.ii. Practical Applications
1.b. History of Cognitive Psychology
2. The Neural Basis of Cognition
2.a. The Nervous System
2.b. Organization of the Brain
2.c. The Visual System
3. Perception and Attention
3.a. Sensory Memory
3.b. Attention and Sensory Information Processing

Page 45

Scatter/Gather: Cutting, Karger, and Pedersen

Page 46

For better navigation of search results

Page 47

Page 48

Navigating search results (2)

Cluster documents according to the sense of a word: cluster the search results (say, for Jaguar, or NLP) into groups of related documents. This can be viewed as a form of word sense disambiguation.

Page 49

Page 50

For speeding up vector space retrieval

Retrieval in the VSM requires finding the doc vectors closest to the query vector.

Computing the similarity between the query and every doc in the collection is slow (for some applications).

One optimization: use an inverted index and compute the similarity only for docs that contain a term occurring in the query.

By clustering the docs in the corpus a priori, we can compute similarities on a subset only: the cluster the query doc falls in (see the sketch below).
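
A minimal sketch of the idea (names are illustrative; labels and centroids could come from the K-means sketch on Page 22, and all vectors are assumed unit-normalized so the dot product is the cosine similarity):

    import numpy as np

    def cluster_pruned_search(query, docs, labels, centroids, top=10):
        # Score only the docs in the cluster whose centroid is nearest the query.
        nearest = ((centroids - query) ** 2).sum(axis=1).argmin()
        members = np.flatnonzero(labels == nearest)
        scores = docs[members] @ query
        return members[np.argsort(-scores)[:top]]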

Page 51

Resources

Weka 3 - Data Mining with Open Source Machine Learning Software in Java

Page 52

Summary of this lecture: Text Clustering

Evaluation: Purity, NMI, Rand Index

Partitioning algorithm: K-Means (reassignment, recomputation)

Hierarchical algorithms: cluster representation; closeness measures for cluster pairs: single link, complete link, average link, centroid

Page 53

Thank You!

Q&A

Page 54

Readings

[1] IIR, Ch. 16.1-4 and Ch. 17.1-4.
[2] F. Beil, M. Ester, and X. Xu, "Frequent term-based text clustering," in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada: ACM, 2002.

Page 55

Cluster Labeling

Page 56

Major issue - labeling

After a clustering algorithm finds clusters, how can they be made useful to the end user?

We need a pithy label for each cluster: in search results, say "Animal" or "Car" in the jaguar example; in topic trees (Yahoo), we need navigational cues.

Often done by hand, a posteriori.

Page 57

How to Label Clusters

Show titles of typical documents. Titles are easy to scan, and authors create them for quick scanning! But you can only show a few titles, which may not fully represent the cluster.

Show words/phrases prominent in the cluster. These are more likely to fully represent the cluster. Use distinguishing words/phrases: differential labeling (think of feature selection). But such lists are harder to scan.

Page 58

Labeling

A common heuristic: list the 5-10 most frequent terms in the centroid vector; drop stop words and stem (see the sketch below).

Differential labeling by frequent terms: within a collection about "Computers", all clusters will have the word computer as a frequent term, so use discriminant analysis of the centroids.

Perhaps better: a distinctive noun phrase.
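
A minimal sketch of the frequent-terms heuristic just described (the vocabulary and stop-word list are illustrative inputs):

    import numpy as np

    def label_cluster(centroid, vocab, stopwords, n_terms=5):
        # Return the highest-weighted non-stop-word terms of the centroid.
        order = np.argsort(-centroid)
        terms = [vocab[i] for i in order if vocab[i] not in stopwords]
        return terms[:n_terms]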