Information Retrieval
Lecture 7: Introduction to Information Retrieval (Manning et al. 2007), Chapter 17
For the MSc Computer Science Programme
Dell Zhang, Birkbeck, University of London
Yahoo! Hierarchy
http://dir.yahoo.com/science
[Figure: a fragment of the Yahoo! directory tree rooted at Science, with children such as agriculture, biology, physics, CS, and space, and leaves such as dairy, crops, agronomy, forestry, AI, HCI, craft, missions, botany, cell, evolution, magnetism, relativity, and courses.]
Hierarchical Clustering
Build a tree-like hierarchical taxonomy (dendrogram) from a set of unlabeled documents.
Divisive (top-down): start with all documents in a single cluster; eventually each document forms a cluster on its own.
Achieved by recursive application of a flat partitional clustering algorithm, e.g., k-means with k=2 (bisecting k-means).
Agglomerative (bottom-up): start with each document as its own cluster; eventually all documents belong to the same cluster.
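The divisive (bisecting k-means) idea can be sketched in Python. This is not from the textbook: it is a minimal 1-D illustration, with `two_means` and `bisecting_kmeans` as hypothetical helper names, assuming a simple "always split the largest cluster" policy.

```python
import random

def two_means(points, iters=20, seed=0):
    """Split a list of 1-D points into two clusters with k-means, k=2."""
    rng = random.Random(seed)
    centroids = rng.sample(points, 2)
    clusters = ([], [])
    for _ in range(iters):
        clusters = ([], [])
        for p in points:
            side = 0 if abs(p - centroids[0]) <= abs(p - centroids[1]) else 1
            clusters[side].append(p)
        if not clusters[0] or not clusters[1]:   # degenerate split:
            pts = sorted(points)                 # fall back to a median cut
            return pts[:len(pts) // 2], pts[len(pts) // 2:]
        new = [sum(c) / len(c) for c in clusters]
        if new == centroids:                     # converged
            break
        centroids = new
    return clusters

def bisecting_kmeans(points, k):
    """Divisive (top-down): keep splitting the largest cluster until
    there are k clusters."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)                   # split the largest cluster
        left, right = two_means(clusters.pop())
        clusters += [list(left), list(right)]
    return clusters

parts = bisecting_kmeans([1.0, 1.1, 1.2, 5.0, 5.1, 9.0], 3)
print(sorted(sorted(c) for c in parts))  # → [[1.0, 1.1, 1.2], [5.0, 5.1], [9.0]]
```

Each split is local, so the result depends on which cluster is split next; splitting the largest is one common heuristic.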
Dendrogram
Clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
The number of clusters k is not required in advance.
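Cutting a dendrogram at a threshold can be illustrated with a small union-find sketch (the function and the merge links below are hypothetical, not from the slides): links at least as strong as the threshold are kept, and each connected component becomes one cluster.

```python
def cut_dendrogram(merges, threshold):
    """Cut a dendrogram: keep only merge links whose similarity is at
    least `threshold`; each connected component is then one cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for s, a, b in merges:
        find(a), find(b)                    # register both docs
        if s >= threshold:
            parent[find(a)] = find(b)       # union: link the components
    comps = {}
    for x in list(parent):
        comps.setdefault(find(x), set()).add(x)
    return list(comps.values())

# hypothetical merge links (similarity, doc, doc) for five docs
merges = [(0.9, "d1", "d2"), (0.8, "d4", "d5"),
          (0.6, "d3", "d4"), (0.3, "d2", "d3")]
clusters = cut_dendrogram(merges, threshold=0.5)
print(sorted(sorted(c) for c in clusters))  # → [['d1', 'd2'], ['d3', 'd4', 'd5']]
```

Lowering the threshold keeps more merge links and hence produces fewer, larger clusters; raising it does the opposite.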
HAC
Hierarchical Agglomerative Clustering starts with each doc in a separate cluster, then repeats until there is only one cluster:
Among the current clusters, determine the pair ci and cj that are most similar (single-link, complete-link, etc.).
Merge ci and cj into a single cluster.
The history of merging forms a binary tree (hierarchy).
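The HAC loop can be sketched as follows. The slides give no code, so this is an illustrative O(n^3) version with a hypothetical toy similarity (negative 1-D distance) and single-link cluster similarity:

```python
from itertools import combinations

def hac(docs, sim):
    """HAC: start with singleton clusters and repeatedly merge the most
    similar pair until one cluster remains.  Cluster similarity here is
    single-link: the max of sim(x, y) over cross-cluster doc pairs.
    Returns the merge history as a nested-tuple binary tree."""
    clusters = [(d,) for d in docs]        # each cluster = tuple of docs
    trees = {(d,): d for d in docs}        # cluster -> its subtree
    while len(clusters) > 1:
        ci, cj = max(combinations(clusters, 2),
                     key=lambda p: max(sim(x, y) for x in p[0] for y in p[1]))
        merged = ci + cj
        trees[merged] = (trees.pop(ci), trees.pop(cj))
        clusters = [c for c in clusters if c not in (ci, cj)] + [merged]
    return trees[clusters[0]]

# hypothetical toy docs as 1-D points; similarity = negative distance
points = {"a": 0.0, "b": 0.2, "c": 1.0}
tree = hac(list(points), lambda x, y: -abs(points[x] - points[y]))
print(tree)  # → ('c', ('a', 'b'))
```

The nested tuples record the merge history, i.e., the binary tree the slide describes.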
Single-Link
The similarity between a pair of clusters is defined by the single strongest link (i.e., the maximum cosine similarity) between their members:

sim(c_i, c_j) = max_{x ∈ c_i, y ∈ c_j} sim(x, y)

After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is:

sim(c_i ∪ c_j, c_k) = max( sim(c_i, c_k), sim(c_j, c_k) )
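The merge rule above can be shown directly: if each cluster keeps a row of its similarities to all other clusters, merging c_i and c_j just takes an element-wise maximum of the two rows. The function name and the similarity values below are hypothetical:

```python
def merge_single_link(sim_row_i, sim_row_j):
    """Similarity of the merged cluster c_i ∪ c_j to every remaining
    cluster c_k: max(sim(c_i, c_k), sim(c_j, c_k)), element-wise."""
    return {k: max(s, sim_row_j[k]) for k, s in sim_row_i.items()}

# hypothetical similarities of c_i and c_j to two other clusters a, b
row_i = {"a": 0.2, "b": 0.7}
row_j = {"a": 0.5, "b": 0.1}
print(merge_single_link(row_i, row_j))  # → {'a': 0.5, 'b': 0.7}
```

This update is what makes single-link HAC efficient: after a merge, no pairwise document similarities need to be recomputed.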
HAC – Example
d1, d2, d3, d4, d5
[Figure: dendrogram — d1 and d2 merge first, then d4 and d5, then d3 joins {d4, d5} to form {d3, d4, d5}.]
As clusters agglomerate, the docs fall into a dendrogram (a hierarchy of merges).
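The example's merge order can be reproduced with a tiny single-link loop. The pairwise similarity values below are hypothetical, chosen only so that the merges happen in the order the slide shows:

```python
# hypothetical pairwise similarities for the five example docs
sim = {
    ("d1", "d2"): 0.9, ("d1", "d3"): 0.1, ("d1", "d4"): 0.1, ("d1", "d5"): 0.1,
    ("d2", "d3"): 0.2, ("d2", "d4"): 0.1, ("d2", "d5"): 0.1,
    ("d3", "d4"): 0.6, ("d3", "d5"): 0.5,
    ("d4", "d5"): 0.8,
}
S = {frozenset(k): v for k, v in sim.items()}   # order-independent lookup

clusters = [("d1",), ("d2",), ("d3",), ("d4",), ("d5",)]
order = []                                      # merge history
while len(clusters) > 1:
    # single-link: cluster similarity = max over cross-cluster doc pairs
    best = max(
        ((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
        key=lambda p: max(S[frozenset((x, y))] for x in p[0] for y in p[1]),
    )
    order.append(best)
    clusters = [c for c in clusters if c not in best] + [best[0] + best[1]]

for ci, cj in order:
    print(ci, "+", cj)
```

With these values the merges come out as d1+d2, then d4+d5, then d3+{d4, d5}, and finally {d1, d2}+{d3, d4, d5} — the order depicted in the dendrogram.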