Learning Topic Hierarchies from Text Documents using a Scalable Hierarchical Fuzzy Clustering Method
E. Mendes Rodrigues and L. Sacks
{mmendes, lsacks}@ee.ucl.ac.uk
http://www.ee.ucl.ac.uk/~mmendes/
Department of Electronic & Electrical Engineering, University College London, UK
The 5th annual UK Workshop on Computational Intelligence, London, 5-7 September 2005


Page 1: The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005

The 5th annual UK Workshop on Computational Intelligence

London, 5-7 September 2005


Department of Electronic & Electrical Engineering

University College London, UK

Learning Topic Hierarchies from Text Documents using

a Scalable Hierarchical Fuzzy Clustering Method


E. Mendes Rodrigues and L. Sacks

{mmendes, lsacks}@ee.ucl.ac.uk

http://www.ee.ucl.ac.uk/~mmendes/


Page 2: The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005

Outline

• Document clustering process

• H-FCM: Hyper-spherical Fuzzy C-Means

• H2-FCM: Hierarchical H-FCM

• Clustering experiments

• Topic hierarchies

Page 3: The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005

Document Clustering Process

[Figure: the document clustering process. A document collection is pre-processed into a document representation and encoded into document vectors; document clustering combines a document similarity measure with a clustering method to produce document clusters, which are assessed with cluster validity measures and passed on to the application.]

Identify all unique words in the document collection

Discard common words that are included in the stop list

Apply stemming algorithm and combine identical word stems

Apply term weighting scheme to the final set of k indexing terms

Discard terms using pre-processing filters

Document vectors:

X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nk} \end{bmatrix}

Vector-Space Model of Information Retrieval

Very high-dimensional

Very sparse (more than 95% of entries are zero)
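A minimal Python sketch of this indexing pipeline; the stop list, the crude suffix-stripping "stemmer" and the min_df filter below are illustrative stand-ins, since the slide does not specify which stop list, stemming algorithm, term weighting scheme or filters were used:

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}  # illustrative stop list

def tokenize(text):
    """Identify words, discard stop-list entries, apply a crude suffix-stripping 'stem'."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    return [re.sub(r"(ing|ed|es|s)$", "", w) for w in words]   # stand-in for a real stemmer

def tfidf_vectors(documents, min_df=1):
    """Build the N x k document-term matrix X with TF-IDF term weights."""
    term_counts = [Counter(tokenize(d)) for d in documents]
    df = Counter(t for counts in term_counts for t in counts)        # document frequency
    vocab = sorted(t for t, f in df.items() if f >= min_df)          # pre-processing filter
    index = {t: j for j, t in enumerate(vocab)}
    N = len(documents)
    X = [[0.0] * len(vocab) for _ in range(N)]
    for i, counts in enumerate(term_counts):
        for t, tf in counts.items():
            if t in index:
                X[i][index[t]] = tf * math.log(N / df[t])            # TF-IDF weighting
    return X, vocab
```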

Page 4: The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005

Measures of Document Relationship

Cosine-based document (dis)similarity between document vectors x_A and x_B:

D(x_A, x_B) = 1 - S(x_A, x_B) = 1 - \frac{\sum_{j=1}^{k} x_{Aj}\, x_{Bj}}{\left( \sum_{j=1}^{k} x_{Aj}^{2} \; \sum_{j=1}^{k} x_{Bj}^{2} \right)^{1/2}}, \qquad 0 \le S(x_A, x_B) \le 1 \;\; \forall A, B

• FCM applies the Euclidean distance, which is inappropriate for high-dimensional text clustering

the non-occurrence of the same terms in both documents is handled in the same way as their co-occurrence

• Cosine (dis)similarity measure:

widely applied in Information Retrieval

represents the cosine of the angle between two document vectors

insensitive to different document lengths, since it is normalised by the length of the document vectors
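For illustration, the cosine measure in numpy (a sketch; the array names and example vector are made up):

```python
import numpy as np

def cosine_similarity(x_a, x_b):
    """S(x_A, x_B): cosine of the angle between two document vectors, in [0, 1] for non-negative weights."""
    return float(np.dot(x_a, x_b) / (np.linalg.norm(x_a) * np.linalg.norm(x_b)))

def cosine_dissimilarity(x_a, x_b):
    """D(x_A, x_B) = 1 - S(x_A, x_B)."""
    return 1.0 - cosine_similarity(x_a, x_b)

# Length-insensitive: a document and a double-length copy of it are maximally similar.
x = np.array([2.0, 0.0, 1.0])
assert abs(cosine_similarity(x, 2 * x) - 1.0) < 1e-12
```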

Page 5: The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005

H-FCM: Hyper-spherical Fuzzy C-Means

• Applies the cosine measure to assess document relationships

• Modified objective function:

  J_m(U, V) = \sum_{\lambda=1}^{c} \sum_{i=1}^{N} u_{\lambda i}^{m}\, D_{\lambda i} = \sum_{\lambda=1}^{c} \sum_{i=1}^{N} u_{\lambda i}^{m} \left( 1 - \sum_{j=1}^{k} x_{ij}\, v_{\lambda j} \right)

• Subject to an additional constraint (unit-length centroids):

  D(v_\lambda, v_\lambda) = 1 - \sum_{j=1}^{k} v_{\lambda j}\, v_{\lambda j} = 0 \;\;\Leftrightarrow\;\; \sum_{j=1}^{k} v_{\lambda j}^{2} = 1

• Fuzzy memberships (u) and cluster centroids (v):

  u_{\lambda i} = \left[ \sum_{\nu=1}^{c} \left( \frac{D_{\lambda i}}{D_{\nu i}} \right)^{1/(m-1)} \right]^{-1}, \qquad v_{\lambda j} = \frac{\sum_{i=1}^{N} u_{\lambda i}^{m}\, x_{ij}}{\left[ \sum_{j'=1}^{k} \left( \sum_{i=1}^{N} u_{\lambda i}^{m}\, x_{ij'} \right)^{2} \right]^{1/2}}
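A compact numpy sketch of the resulting update loop, assuming the rows of X have been normalised to unit length so that the cosine similarity reduces to a dot product; the initialisation, iteration count and parameter values are illustrative, not the authors' settings:

```python
import numpy as np

def h_fcm(X, c, m=1.5, iters=50, eps=1e-9, seed=0):
    """Sketch of an H-FCM iteration.
    X: N x k matrix of unit-length document vectors; c: number of clusters; m: fuzzifier."""
    rng = np.random.default_rng(seed)
    N, k = X.shape
    V = rng.random((c, k))
    V /= np.linalg.norm(V, axis=1, keepdims=True)             # unit-length centroid constraint
    for _ in range(iters):
        D = np.clip(1.0 - X @ V.T, eps, None)                 # cosine dissimilarity, N x c
        # u_{li} = [ sum_nu (D_{li} / D_{nui})^(1/(m-1)) ]^(-1)
        U = 1.0 / ((D[:, :, None] / D[:, None, :]) ** (1.0 / (m - 1.0))).sum(axis=2)
        # v_l = sum_i u_{li}^m x_i, renormalised to unit length
        V = (U ** m).T @ X
        V /= np.linalg.norm(V, axis=1, keepdims=True)
    return U, V
```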

Page 6: The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005

How many clusters?

• Usually the final number of clusters is not known a priori

Run the algorithm for a range of c values

Apply validity measures and determine which c leads to the best partition (clusters compactness, density, separation, etc.)

• How compact and dense are clusters in a sparse high-dimensional problem space?

Only a very small percentage of the documents within a cluster show high similarity to the respective centroid, so clusters are not compact

However, there is always a clear separation between intra- and inter-cluster similarity distributions
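One way to inspect this separation empirically, assuming X, U and V come from the H-FCM sketch above (rows of X unit-length):

```python
import numpy as np

def similarity_distributions(X, U, V):
    """Split document-to-centroid cosine similarities into intra-cluster values
    (each document against its highest-membership centroid) and inter-cluster values."""
    S = X @ V.T                                   # N x c cosine similarities
    own = U.argmax(axis=1)                        # hard assignment by highest membership
    mask = np.zeros(S.shape, dtype=bool)
    mask[np.arange(S.shape[0]), own] = True
    return S[mask], S[~mask]                      # intra-cluster, inter-cluster
```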

Page 7: The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005

H2-FCM: Hierarchical Hyper-spherical Fuzzy C-Means

• Key concepts

Apply partitional algorithm (H-FCM) to obtain a sufficiently large number of clusters

Exploit the granularity of the topics associated with each cluster to link cluster centroids hierarchically

Form a topic hierarchy

• Asymmetric similarity measure

Identify parent-child type relationships between cluster centroids

A child should be less similar to its parent than the parent is to the child:

  S(v_a, v_b) = \frac{\sum_{j=1}^{k} \min(v_{aj}, v_{bj})}{\sum_{j=1}^{k} v_{aj}}
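The asymmetric measure in numpy, a sketch only (the two example centroids are made up):

```python
import numpy as np

def asymmetric_similarity(v_a, v_b):
    """S(v_a, v_b) = sum_j min(v_aj, v_bj) / sum_j v_aj; not symmetric in its arguments."""
    return float(np.minimum(v_a, v_b).sum() / v_a.sum())

# The measure differs depending on the direction in which it is evaluated.
v_a = np.array([0.9, 0.3, 0.0, 0.0])
v_b = np.array([0.4, 0.3, 0.2, 0.1])
print(asymmetric_similarity(v_a, v_b))   # ~0.58
print(asymmetric_similarity(v_b, v_a))   # 0.70
```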

Page 8: The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005

The H2-FCM Algorithm

[Figure: flow chart of the H2-FCM algorithm and a worked example on twelve cluster centroids (v1-v12, clusters C1-C12), where links are accepted when the asymmetric similarity reaches tPCS (e.g. S(v1,v5) ≥ tPCS) and rejected otherwise (e.g. S(v8,v5) < tPCS, S(v8,v1) < tPCS). The recoverable steps of the flow chart are:]

Apply H-FCM(c, m); if any cluster has size < tND, set c = c - K and re-run H-FCM.
Compute the asymmetric similarity S(v, v') between every pair of cluster centroids; all centroids start in the free set VF, and the hierarchy set VH is empty.
While VF ≠ ∅: select the centroid v ∈ VF with the maximum asymmetric similarity, S(v, v') = max[S(v, v')], v, v' ∈ VF.
If VH = ∅, add v as a root; otherwise select a parent and add v as its child if S ≥ tPCS, or as a new root if not.
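One possible reading of the centroid-linking loop as code; this is a sketch under assumptions (how the next free centroid and its candidate parent are selected is inferred from the flow chart, and the cluster-size check with the re-run of H-FCM is omitted):

```python
import numpy as np

def link_centroids(V, t_pcs):
    """Link cluster centroids into a forest: returns {child index: parent index};
    centroids without an entry are hierarchy roots."""
    c = len(V)
    S = np.array([[np.minimum(V[a], V[b]).sum() / V[a].sum() for b in range(c)]
                  for a in range(c)])
    np.fill_diagonal(S, -np.inf)                      # ignore self-similarity
    free, placed, parent = set(range(c)), [], {}
    while free:
        # select the free centroid with the strongest asymmetric link to another free centroid
        if len(free) > 1:
            v = max(free, key=lambda a: max(S[a, b] for b in free if b != a))
        else:
            v = next(iter(free))
        free.remove(v)
        best = max(placed, key=lambda b: S[v, b]) if placed else None
        if best is not None and S[v, best] >= t_pcs:
            parent[v] = best                          # add as a child of its best parent
        # otherwise v becomes a new root (no parent entry)
        placed.append(v)
    return parent
```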

Page 9: The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005

Scalability of the Algorithm

• H2-FCM time complexity depends on H-FCM and centroid linking heuristic

• H-FCM computation time is O(Nc²k)

• Linking heuristic is at most O(c²k)

Computation of the asymmetric similarity between every pair of cluster centroids - O(c²k)

Generation of the cluster hierarchy - O(c²)

• Overall, H2-FCM time complexity is O(Nc²k)

• Scales well to large document sets!

Page 10: The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005

Description of Experiments

• Goal: evaluate the H2-FCM performance

• Evaluation measures: clustering Precision (P) and Recall (R)

  P = \frac{tp}{tp + fp}, \qquad R = \frac{tp}{tp + fn}

• H2-FCM algorithm run for a range of c values

• Number of hierarchy roots = number of reference classes; tPCS is set dynamically

• Are sub-clusters of the same topic assigned to the same branch?

                          In reference class       Not in reference class
Assigned to cluster       true positives (tp)      false positives (fp)
Not assigned to cluster   false negatives (fn)     true negatives (tn)
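A small helper matching these definitions (a sketch; the boolean-array representation of cluster assignments and reference classes is an assumption):

```python
import numpy as np

def precision_recall(assigned, in_class):
    """Clustering precision and recall for one cluster against one reference class.
    `assigned` and `in_class` are boolean arrays over the N documents."""
    assigned = np.asarray(assigned, dtype=bool)
    in_class = np.asarray(in_class, dtype=bool)
    tp = int(np.sum(assigned & in_class))
    fp = int(np.sum(assigned & ~in_class))
    fn = int(np.sum(~assigned & in_class))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r
```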

Page 11: The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005

Test Document Collections

Reuters-21578 test collection: http://www.daviddlewis.com/resources/testcollections/reuters21578/
Open Directory Project (ODP): http://dmoz.org/
INSPEC database: http://www.iee.org/publish/inspec/

Collection   Size (N)   Terms (k)   Classes                                                   Document length (avg / stdev)   Document sparsity (avg / stdev)
reuters1     1708       15744       3: acq, earn, trade                                       73.45 / 63.97                   99.67 % / 0.26 %
reuters2     1374       11778       5: crude, interest, money-fx, ship, trade                 102.65 / 86.37                  99.39 % / 0.47 %
odp          556        620         5: game, lego, math, safety, sport                        15.14 / 5.07                    97.69 % / 0.50 %
inspec       7473       11803       3: back-propagation, fuzzy control, pattern clustering    93.28 / 32.79                   99.59 % / 0.14 %

Page 12: The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005

Clustering Results: H2-FCM Precision and Recall

[Figure: precision and recall plots for the odp, inspec, reuters1 and reuters2 collections.]

Page 13: The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005

Topic Hierarchy

• Each centroid vector consists of a set of weighted terms

• Terms describe the topics associated with the document cluster

• Centroid hierarchy produces a topic hierarchy

Useful for efficient access to individual documents

Provides context to users in exploratory information access
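For example, cluster topics can be read off the centroids by taking their highest-weighted terms; a sketch, where V (centroid matrix) and vocab (term list) are assumed to come from the earlier sketches:

```python
import numpy as np

def topic_labels(V, vocab, top_n=5):
    """Label each cluster with the top_n highest-weighted terms of its centroid vector."""
    return [[vocab[j] for j in np.argsort(v)[::-1][:top_n]] for v in np.asarray(V)]
```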

Page 14: The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005

Topic Hierarchy Example

Page 15: The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005

Concluding Remarks

• H2-FCM clustering algorithm

Partitional clustering (H-FCM); a linking heuristic organizes the centroids hierarchically based on asymmetric similarity

• Scales linearly with the number of documents

• Exhibits good clustering performance

• Topic hierarchy can be extracted from the centroid hierarchy