
Web Document Clustering based on Document Structure

Khaled M. Hammouda∗ and Mohamed S. Kamel

Department of Systems Design Engineering

University of Waterloo

Waterloo, Ontario, Canada N2L 3G1

E-mail: {hammouda,mkamel}@pami.uwaterloo.ca

∗Corresponding author


Abstract

Document clustering techniques mostly rely on single-term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, document structure should be reflected in the underlying data model. This paper presents a framework for web document clustering based on two important concepts. The first is the web document structure, which is ignored by most current approaches even though the (semi-)structure of a web document provides significant information about its content. The second is finding the relationships between documents based on local context, using a new phrase matching technique, so that documents are indexed based on phrases rather than individual words as is common in current systems. A novel document data model, the Document Index Graph, is designed specifically to facilitate phrase matching between documents. The combination of these two concepts creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in web document clustering over traditional methods. To make the approach applicable to online clustering, an incremental clustering algorithm guided by the maximization of cluster cohesiveness is also presented.

Keywords: web mining, document clustering, document similarity, document structure, document index graph, phrase matching.


1 Introduction

In an effort to keep up with the tremendous growth of the World Wide Web, many research projects have targeted ways of organizing such information so that end users can find the information they want efficiently and accurately. Information on the web is present in the form of text documents (formatted in HTML), which is why many web document processing systems are rooted in text data mining techniques.

Text data mining shares many concepts with traditional data mining methods. Data mining in-

cludes many techniques that can unveil inherent structure in the underlying data. One of these tech-

niques is clustering. Applied to text data, clustering methods try to identify inherent groupings of the text documents, so that a set of clusters is produced in which the clusters exhibit high intra-cluster similarity and low inter-cluster similarity [5]. Generally speaking, text document clustering methods attempt to segregate the documents into groups, where each group represents some topic that is different from the topics represented by the other groups [8]. By applying text mining in the web domain,

the process becomes what is known as web mining. There are three types of web mining in general, according to Kosala and Blockeel [16]: (1) web structure mining; (2) web usage mining; and (3) web content

mining. We are mainly interested in the last type, web content mining.

Any clustering technique relies on four concepts: (1) a data representation model, (2) a similarity

measure, (3) a cluster model, and (4) a clustering algorithm that builds the clusters using the data

model and the similarity measure. Most of the document clustering methods that are in use today

are based on the Vector Space Model [1, 21, 20, 22], which is a very widely used data model for

text classification and clustering. The Vector Space Model represents documents as a feature vector

of the terms (words) that appear in the document set. Each feature vector contains term-weights


(usually term-frequencies) of the terms appearing in that document. Similarity between documents

is measured using one of several similarity measures that are based on such a feature vector. Exam-

ples include the cosine measure and the Jaccard measure. Clustering methods based on this model

make use of single-term analysis only; they do not make use of any word proximity or phrase-based analysis.¹

The motivation behind the work in this paper is that we believe that document clustering should

be based not only on single word analysis, but on phrases as well. Phrase-based analysis means that

the similarity between documents should be based on matching phrases rather than on single words

only.

The work that has been reported in literature about using phrases in document clustering is limited.

Most efforts have been targeted toward single-word analysis. The methods used for text clustering

include decision trees [7, 15, 18, 28], statistical analysis [9, 10, 15], neural nets [11], inductive logic programming [6, 14], and rule-based systems [23, 24], among others. These methods are at the crossroads of more than one research area, such as databases (DB), information retrieval (IR), and artificial

intelligence (AI) including machine learning (ML) and Natural Language Processing (NLP).

The most relevant work to what is presented here is that of Zamir et al. [31, 32, 30]. They

proposed a phrase-based document clustering approach based on Suffix Tree Clustering (STC). The

method basically involves the use of a “trie” (a compact tree) structure to represent shared suffixes

between documents. Based on these shared suffixes they identify base clusters of documents, which

are then combined into final clusters based on a connected-component graph algorithm. They claim

to achieve O(n log n) performance and produce high quality clusters. The results they showed were encouraging, but the suffix tree model could be argued to have a high number of redundancies in terms of the suffixes stored in the tree.

¹Throughout this paper the term “phrase” means a sequence of words, and not the grammatical structure of a sentence.

In this paper we propose a system for web clustering based on document structure. The system

consists of four components:

1. A web document restructuring scheme that identifies different document parts, and assigns

levels of significance to these parts according to their importance.

2. A novel document representation model, the Document Index Graph (DIG) that captures the

structure of sentences in the document set, rather than single words only. The DIG model is

based on graph theory and utilizes graph properties to match any-length phrase from a document

to any number of previously seen documents in a time nearly proportional to the number of

words of the document.

3. A phrase-based similarity measure for scoring the similarity between two documents according

to the matching phrases and their significance.

4. An incremental document clustering method based on maintaining high cluster cohesiveness

using a new cluster quality concept called “Similarity Histogram”.

The integration of these four components proved to be of superior performance to traditional document clustering methods. Although the whole system performance is quite good, each component could be used independently of the others. The overall system design is illustrated in Figure 1.

The proposed document model is used to measure the similarity between the documents using a

new similarity measure that makes use of phrase-based matching. The similarity calculation between

documents is based on a combination of single-term similarity and phrase-based similarity. Similarity based on matching phrases between documents has proven to have a more significant effect on clustering quality, due to its insensitivity to noisy terms that could otherwise lead to an incorrect similarity measure.

Figure 1: Web Document Clustering System Design. (Pipeline: web documents → Document Structure Identification → well-structured XML documents → Document Index Graph Representation → phrase matching → Document Similarity Calculation → document similarity → Incremental Clustering → document clusters.)

The proposed incremental document clustering method relies on improving the pair-wise document similarity distribution inside each cluster, so that similarities are maximized in each cluster. The quality of the clusters produced using this system was higher than that of clusters produced using traditional clustering methods. The improvement in clustering quality ranged from 20% to 70% over traditional clustering methods.

The rest of this paper is organized as follows. Section 2 presents an analysis of the important

features of the semi-structured web documents. Section 3 introduces the Document Index Graph

model. Section 4 presents the phrase-based similarity measure. Section 5 presents our proposed

incremental clustering algorithm. Section 6 presents our experimental results. Finally we conclude

and discuss future work in the last section.


2 Web document structure analysis

Web documents are known to be semi-structured. HTML tags are used to designate different parts of the document. However, since the HTML language is meant for specifying the layout of the document, it is used to present the document to the user in a friendly manner rather than to specify the structure of the data in the document; hence web documents are only semi-structured. Nevertheless, it is still possible to identify key parts of the document based on this structure. The idea is that some parts of the document are more informative than other parts, and thus have different levels of significance based on where they appear in the document and the tags that surround them. It is less informative to treat, for example, the title of the document and the body text equally.

The proposed system analyzes the HTML document and restructures it according to a predetermined structure that assigns different levels of significance to different document parts. The result is a well-structured XML document that corresponds to the original HTML document, but with significance levels assigned to the different parts of the original document.

Currently we assign one of three levels of significance to the different parts: HIGH, MEDIUM, and LOW. Examples of HIGH significance parts are the title, meta keywords, meta description, and section headings. Examples of MEDIUM significance parts are bold, italicized, colored, and hyper-linked text, image alternate text, and table captions. LOW significance parts usually comprise the document body text that was not assigned any of the other levels.
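As a concrete illustration, the following is a minimal sketch of how such a tag-to-significance mapping could be implemented. The tag sets and names are illustrative assumptions; the paper does not enumerate its exact rules.

```python
from enum import Enum

class Significance(Enum):
    HIGH = 3
    MEDIUM = 2
    LOW = 1

# Illustrative tag sets only, not the authors' actual rules.
HIGH_TAGS = {"title", "h1", "h2", "h3", "h4"}    # plus meta keywords/description
MEDIUM_TAGS = {"b", "strong", "i", "em", "a", "font", "caption"}

def significance_of(tag):
    """Map the enclosing HTML tag of a text fragment to a significance level."""
    if tag in HIGH_TAGS:
        return Significance.HIGH
    if tag in MEDIUM_TAGS:
        return Significance.MEDIUM
    return Significance.LOW    # plain body text
```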

This structuring scheme is exploited in measuring the similarity between two documents (see sec-

tion 4 for details). For example, if we have a phrase match of HIGH significance in both documents,

the similarity is rewarded more than if the match was for LOW significance phrases. This is justified

by arguing that a phrase match in titles, for example, is much more informative than a phrase match in body text.

A sentence boundary detector algorithm was developed to locate sentence boundaries in the documents. The algorithm is based on a finite state machine lexical analyzer with heuristic rules for finding the boundaries; a similar approach is used to find word boundaries. About 98% of the actual boundaries are correctly detected. Achieving 100% accuracy, however, would require natural language processing techniques and underlying knowledge of the data set domain, which is beyond the scope of this paper. Nevertheless, the resulting documents contain very accurate sentence and word separation, with negligible noise.

Finally, a document cleaning step is performed to remove stop-words that have no significance,

and to stem the words using the popular Porter Stemmer algorithm [19].
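The following is a simplified sketch of this cleaning pipeline. The regular expressions and the stop-word list are illustrative stand-ins: the paper's actual boundary detector is a finite state machine, and the off-the-shelf NLTK Porter stemmer is used here in place of the authors' implementation of [19].

```python
import re
from nltk.stem.porter import PorterStemmer   # implements the Porter algorithm [19]

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is", "are"}
stemmer = PorterStemmer()

def split_sentences(text):
    """Heuristic boundary rule: split after . ! ? followed by whitespace and an
    upper-case letter (a stand-in for the paper's FSM boundary detector)."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

def clean_sentence(sentence):
    """Lower-case, tokenize, drop stop-words, and stem the remaining words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [stemmer.stem(w) for w in words if w not in STOP_WORDS]
```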

3 Document Index Graph

To achieve better clustering results, the data model that underlies the clustering method must accu-

rately capture the salient features of the data. According to the Vector Space Model, the document

data is represented as a feature vector of terms with different weights assigned to the terms accord-

ing to their frequency of appearance in the document. It does not represent any relation between the

words, so sentences are broken down into their individual components without any representation of

the sentence structure.

The proposed Document Index Graph (DIG for short) indexes the documents while maintaining

the sentence structure in the original documents. This allows us to make use of more informative

phrase matching rather than individual word matching. Moreover, the DIG also captures the different

levels of significance of the original sentences, thus allowing us to make use of sentence significance.


3.1 DIG structure

The Document Index Graph (DIG for short) is a directed graph (digraph) G = (V, E), where:

V is a set of nodes {v_1, v_2, ..., v_n}, where each node v represents a unique word in the entire document set; and

E is a set of edges {e_1, e_2, ..., e_m}, such that each edge e is an ordered pair of nodes (v_i, v_j). Edge (v_i, v_j) is directed from v_i to v_j, and v_j is adjacent to v_i. There is an edge from v_i to v_j if, and only if, the word v_j appears immediately after the word v_i in some document.

The above definition of the graph suggests that the number of nodes in the graph is the number of

unique words in the document set; i.e. the vocabulary of the document set, since each node represents

a single word in the whole document set.

Nodes in the graph carry information about the documents they appeared in, along with the sen-

tence path information. Sentence structure is maintained by recording the edge along which each

sentence continues. This essentially creates an inverted list of the documents, but with sentence in-

formation recorded in the inverted list.

Assume a sentence of m words appearing in one document consists of the following word se-

quence: {v1, v2, . . . , vm}. The sentence is represented in the graph by a path from v1 to vm, such that

(v_1, v_2), (v_2, v_3), ..., (v_{m-1}, v_m) are edges in the graph. Path information is stored in the vertices along

the path to uniquely identify each sentence. Sentences that share sub-phrases will have shared parts

of their paths in the graph that correspond to the shared sub-phrase.

The structure maintained in each node is a table of documents. Each document entry in the doc-

ument table records the term frequency of the word in that document. Since words can appear in

different parts of a document with different level of significance, the recorded term frequency is ac-

9

Page 10: Web Document Clustering based on Document Structurepami.uwaterloo.ca/pub/hammouda/hammouda-ieee-tkde.pdf · Document clustering techniques mostly rely on single term analysis of the

tually broken into those levels of significance, with a frequency count per level per document entry.

This structure helps in achieving a more accurate similarity measure based on level of significance

later on.

Since the graph is directed, each node maintains a list of outgoing edges per document entry.

This list of edges tells us which sentence continues along which edge. The task of creating a sentence

path in the graph is thus reduced to recording the necessary information in this edge table to reflect

the structure of the sentences.
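A minimal sketch of this per-node bookkeeping follows; the field names and container choices are illustrative assumptions rather than the authors' actual data layout.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class DocumentEntry:
    # Term frequency per significance level, e.g. {"HIGH": 1, "LOW": 3}.
    term_freq: dict = field(default_factory=lambda: defaultdict(int))
    # Edge table: next_word -> (sentence_id, position) pairs marking where a
    # sentence of this document continues along the edge (word, next_word).
    edges: dict = field(default_factory=lambda: defaultdict(list))

@dataclass
class Node:
    word: str
    # Document table: doc_id -> DocumentEntry.
    docs: dict = field(default_factory=dict)
```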

Figure 2: Example of the Document Index Graph. (Document 1: “river rafting”, “mild river rafting”, “river rafting trips”. Document 2: “wild river adventures”, “river rafting vacation plan”. Document 3: “fishing trips”, “fishing vacation plan”, “booking fishing trips”, “river fishing”.)

To better illustrate the graph structure, Figure 2 presents a simple example graph that represents

three documents. Each document contains a number of sentences with some overlap between the

documents. As seen from the graph, an edge is created between two nodes only if the words repre-

sented by the two nodes appear successively in any document. Thus, sentences map into paths in the graph. Dotted lines represent sentences from document 1, dash-dotted lines represent sentences from

document 2, and dashed lines represent sentences from document 3. As mentioned earlier, matching

phrases between documents becomes a task of finding shared paths in the graph between different

documents. The example presented here is a simple one. Real web documents will contain hundreds

or thousands of words. With a very large document set, the graph could become more complex in terms of memory usage. The number of graph nodes is exactly the number of unique words in the data set, and the number of edges is typically about 4 to 6 times the number of nodes (i.e., the average degree of a node).

3.2 Constructing the graph

The DIG is built incrementally by processing one document at a time. When a new document is

introduced, it is scanned in sequential fashion, and the graph is updated with the new sentence in-

formation as necessary. New words are added to the graph as necessary and connected with other

nodes to reflect the sentence structure. The graph building process becomes less memory demanding

when no new words are introduced by a new document (or very few new words are introduced). At

this point the graph becomes more stable, and the only operation needed is to update the sentence

structure in the graph to accommodate for the new sentences introduced. It is very critical to note that

introducing a new document will only require the inspection (or addition) of those words that appear

in that document, and not every node in the graph. This is where the efficiency of the model comes

from. Along with indexing the sentence structure, the level of significance of each sentence is also

recorded in the graph. This allows us to recall such information when we match sentences from other

documents.


Figure 3: Incremental construction of the Document Index Graph. (The graph is shown after processing document 1, then document 2, then document 3.)

Continuing from the example introduced earlier, the process of constructing the graph that repre-

sents the three documents is illustrated in Figure 3. The emphasis here is on the incremental construc-

tion process, where new nodes are added and new edges are created incrementally upon introducing

a new document.

Unlike traditional phrase matching techniques that are usually used in information retrieval litera-

ture, the Document Index Graph provides complete information about full phrase matching between

every pair of documents. While traditional phrase matching methods are aimed at searching and re-

trieval of documents that have matching phrases to a specific query, the Document Index Graph is

aimed at providing information about the degree of overlap between every pair of documents. This information will help in determining the degree of similarity between documents, as will be explained in section 4.

3.3 Detecting Matching Phrases

Upon introducing a new document, finding matching phrases from previously seen documents be-

comes an easy task using the DIG. Algorithm 1 describes the process of both incremental graph

building and phrase matching.

The procedure starts with a new document to process (line 1). We expect the new document to

have well defined sentence boundaries; each sentence is processed individually. This is important

because we do not want to match a phrase that spans two sentences (which could break the local

context we are looking for). It is also important to know the original sentence length, since it is used in the similarity calculation (section 4). For each sentence (for loop at line 3) we process the

words in the sentence sequentially, adding new words as new nodes to the graph, and constructing a

path in the graph (by adding new edges if necessary) to represent the sentence we are processing.

Matching the phrases from previous documents is done by keeping a list L that holds an entry

for every previous document that shares a phrase with the current document D. As we continue

along the sentence path, we update L by adding new matching phrases and their respective document

identifiers, and extending phrase matches from the previous iteration (lines 10 and 11). If there are

no matching phrases at some point, we just update the respective nodes of the graph to reflect the

new sentence path (lines 13 and 14). After the whole document is processed, L will contain all the

matching phrases between the current document and any previous document that shared at least one

phrase with the new document. Finally, we output L as the list of documents with matching phrases and all the necessary information about the matching phrases.

Algorithm 1 Document Index Graph construction and phrase matching

1: D ← New Document
2: L ← Empty List {L is a list of matching phrases}
3: for each sentence s in D do
4:   w1 ← first word in s
5:   if w1 is not in G then
6:     Add w1 to G
7:   end if
8:   for each word wi ∈ {w2, w3, . . . , wk} in s do
9:     if (wi−1, wi) is an edge in G then
10:      Extend phrase matches in L for sentences that continue along (wi−1, wi)
11:      Add new phrase matches to L
12:    else
13:      Add edge (wi−1, wi) to G
14:      Update sentence path in nodes wi−1 and wi
15:    end if
16:  end for
17: end for
18: Output matching phrases in L

The above algorithm is capable of matching any-length phrases between a new document D and

all previously seen documents in roughly O(m) time, where m is the number of words in document


D. The step at line 10 in the algorithm, where we extend the matching phrases as we continue along

an existing path, may seem not to be a constant time step, because when the graph starts building up,

the number of matching phrases becomes larger, and consequently when moving along an existing

path we have to match more phrases. However, it turns out that the size of the list of matching phrases remains roughly constant even with very large document sets, because any given phrase is shared by only a small set of documents, which on average tends to be of constant size.
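The following is a compact, runnable sketch of the construction-and-matching procedure, demonstrated on the example documents of Figure 2. It omits significance levels and phrase frequencies for brevity, and the nested-dictionary data layout is an illustrative assumption, not the authors' implementation.

```python
from collections import defaultdict

class DIG:
    def __init__(self):
        # graph[word][doc_id][next_word] = set of (sent_id, pos): sentence
        # sent_id of doc_id continues along edge (word, next_word) at position pos.
        self.graph = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))

    def add_and_match(self, doc_id, sentences):
        """Index a document and return (other_doc_id, phrase) pairs for every
        maximal phrase of >= 2 words shared with a previously seen document."""
        matches = []
        for sent_id, words in enumerate(sentences):
            # active[(other_doc, other_sent, last_pos)] = words matched so far
            active = {}
            for pos in range(1, len(words)):
                prev, cur = words[pos - 1], words[pos]
                new_active = {}
                for other_doc, edges in self.graph[prev].items():
                    if other_doc == doc_id:
                        continue
                    for (osent, opos) in edges.get(cur, ()):
                        # extend an existing match, or start a new 2-word match
                        old = active.get((other_doc, osent, opos), [prev])
                        new_active[(other_doc, osent, opos + 1)] = old + [cur]
                # matches that did not extend are maximal: report them
                for (d, s, p), phrase in active.items():
                    if (d, s, p + 1) not in new_active and len(phrase) >= 2:
                        matches.append((d, " ".join(phrase)))
                active = new_active
                # update the graph with this document's own sentence path
                self.graph[prev][doc_id][cur].add((sent_id, pos - 1))
            for (d, s, p), phrase in active.items():
                if len(phrase) >= 2:
                    matches.append((d, " ".join(phrase)))
        return matches

dig = DIG()
doc1 = [["river", "rafting"], ["mild", "river", "rafting"],
        ["river", "rafting", "trips"]]
doc2 = [["wild", "river", "adventures"],
        ["river", "rafting", "vacation", "plan"]]
print(dig.add_and_match(1, doc1))   # [] -- no previous documents to match
print(dig.add_and_match(2, doc2))   # "river rafting" shared with document 1
```

Note that only the words of the new document are inspected, which is the source of the model's efficiency discussed above.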

4 A phrase-based similarity measure

As mentioned earlier, phrases convey local context information, which is essential in determining an

accurate similarity between documents. Towards this end we devised a similarity measure based on

matching phrases rather than individual terms. This measure exploits the information extracted from

the previous phrase matching algorithm to better judge the similarity between the documents. This is

related to the work of Isaacs and Aslam [12], who used a pair-wise probabilistic document similarity measure based on Information Theory. Although they showed it could improve on traditional similarity measures, it is still fundamentally based on the vector space model representation.

The phrase similarity between two documents is calculated based on the list of matching phrases

between the two documents. This similarity measure is a function of four factors:

• The number of matching phrases P,

• The lengths of the matching phrases (l_i : i = 1, 2, . . . , P),

• The frequencies of the matching phrases in both documents (f_{i1} and f_{i2} : i = 1, 2, . . . , P), and

• The levels of significance (weight) of the matching phrases in both documents (w_{i1} and w_{i2} : i = 1, 2, . . . , P).

Phrase frequency is an important factor in the similarity measure: the more frequently a phrase appears in both documents, the more similar the documents tend to be. Similarly, the level of significance of the matching phrase in both documents should be taken into consideration.

The phrase similarity between two documents, d1 and d2, is calculated using the following em-

pirical equation:

\[
\mathrm{sim}_p(d_1, d_2) = \frac{\sqrt{\sum_{i=1}^{P} \left[ g(l_i) \cdot (f_{i1} w_{i1} + f_{i2} w_{i2}) \right]^2}}{\sum_j |s_{j1}| \cdot w_{j1} + \sum_k |s_{k2}| \cdot w_{k2}} \tag{1}
\]

where g(li) is a function that scores the matching phrase length, giving higher score as the matching

phrase length approaches the length of the original sentence; |sj1| and |sk2| are the original sentence

lengths from documents d_1 and d_2, respectively. The equation rewards longer phrase matches with higher levels of significance and higher frequency in both documents. The function g(l_i) in the implemented system was defined as:

\[
g(l_i) = \left( |ms_i| / |s_i| \right)^{\gamma} \tag{2}
\]

where |msi| is the matching phrase length, and γ is a sentence fragmentation factor with values

greater than or equal to 1. If γ is 1, two halves of a sentence could be matched independently and

would be treated as a whole sentence match. However, by increasing γ we can avoid this situation,

and score whole sentence matches higher than fractions of sentences. A value of 1.2 for γ was found

to produce the best results.


The normalization by the lengths of the two documents in equation (1) is necessary to make the similarities comparable across different document pairs.
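A direct transcription of equations (1) and (2) might look as follows. The tuple layout for the matches and the use of a single sentence length per match are illustrative assumptions.

```python
from math import sqrt

def g(ms_len, sent_len, gamma=1.2):
    """Equation (2): score a match by the fraction of its sentence it covers."""
    return (ms_len / sent_len) ** gamma

def phrase_similarity(matches, sentences1, sentences2, gamma=1.2):
    """Equation (1). matches: (ms_len, sent_len, f1, w1, f2, w2) tuples;
    sentences1/sentences2: (length, weight) pairs for every original sentence."""
    numerator = sqrt(sum((g(ms, sl, gamma) * (f1 * w1 + f2 * w2)) ** 2
                         for ms, sl, f1, w1, f2, w2 in matches))
    norm = (sum(l * w for l, w in sentences1) +
            sum(l * w for l, w in sentences2))
    return numerator / norm if norm else 0.0
```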

4.1 Combining single-term and phrase similarities

If the similarity between documents is based solely on matching phrases, and not single-terms at the

same time, related documents could be judged as non-similar if they do not share enough phrases (a typical case that could happen in many situations). Shared phrases provide important local context matching, but sometimes similarity based on phrases alone is not sufficient. To alleviate this problem, and to produce high quality clusters, we combined the single-term similarity measure with our phrase-

based similarity measure. We used the cosine correlation similarity measure [21, 22], with TF-IDF

(Term Frequency–Inverse Document Frequency) term weights, as the single-term similarity measure.

The cosine measure was chosen due to its wide use in the document clustering literature, and since it

is described as being able to capture human categorization behavior well [26]. The TF-IDF weighting

is also a widely used term weighting scheme [29].

Recall that the cosine measure calculates the cosine of the angle between the two document vec-

tors. Accordingly our term-based similarity measure (simt) is given as:

\[
\mathrm{sim}_t(d_1, d_2) = \cos(\mathbf{d}_1, \mathbf{d}_2) = \frac{\mathbf{d}_1 \cdot \mathbf{d}_2}{\|\mathbf{d}_1\| \, \|\mathbf{d}_2\|} \tag{3}
\]

where the vectors d_1 and d_2 are represented as term weights calculated using the TF-IDF weighting scheme.

The combination of the term-based and the phrase-based similarity measures is a weighted average

of the two quantities from equations (1) and (3), and is given by equation (4).


\[
\mathrm{sim}(d_1, d_2) = \alpha \cdot \mathrm{sim}_p(d_1, d_2) + (1 - \alpha) \cdot \mathrm{sim}_t(d_1, d_2) \tag{4}
\]

where α is a value in the interval [0, 1] which determines the weight of the phrase similarity measure,

or, as we call it, the Similarity Blend Factor. According to the experimental results discussed in

section 6 we found that a value between 0.6 and 0.8 for α results in the maximum improvement in the

clustering quality.
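Equations (3) and (4) can be sketched as follows, with documents represented as term-to-weight dictionaries; the function names are illustrative.

```python
from math import sqrt

def cosine_similarity(v1, v2):
    """Equation (3): cosine of the angle between two TF-IDF weighted vectors,
    represented as dicts mapping terms to weights."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = sqrt(sum(w * w for w in v1.values()))
    n2 = sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def combined_similarity(sim_p, v1, v2, alpha=0.7):
    """Equation (4): blend phrase and term similarity; alpha between 0.6 and
    0.8 gave the maximum improvement in the experiments of section 6."""
    return alpha * sim_p + (1 - alpha) * cosine_similarity(v1, v2)
```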

5 Incremental Document Clustering

In this section we present a brief overview of incremental clustering algorithms and introduce the proposed algorithm, which is based on pair-wise document similarity and is employed as part of the whole web document clustering system.

The role of a document similarity measure is to provide judgement on the closeness of documents

to each other. However, it is up to the clustering method how to make use of such similarity calcu-

lation. The idea here is to employ an incremental clustering method that will exploit our similarity

measure to produce clusters of high quality (assessing quality of clustering is described in section 6).

Incremental clustering is an essential strategy for online applications, where time is a critical

factor for usability. Incremental clustering algorithms work by processing data objects one at a time,

incrementally assigning data objects to their respective clusters while they progress. The process is

simple enough, but faces several challenges, including:

• How to determine to which cluster the next object should be assigned?

• How to deal with the problem of insertion order?


• Once an object has been assigned to a cluster, should its assignment to the cluster be frozen or

is it allowed to be re-assigned to other clusters later on?

Usually a heuristic method is employed to deal with the above challenges. A “good” incremen-

tal clustering algorithm has to find the respective cluster for each newly introduced object without

significantly sacrificing the accuracy of clustering due to insertion order or fixed object-to-cluster

assignment. We will briefly discuss two incremental clustering methods in the light of the above

challenges, before we introduce our proposed method.

Suffix Tree Clustering (STC). Introduced by Zamir et al [31] in 1997, the idea behind the STC

algorithm is to build a tree of phrase suffixes shared between multiple documents. The documents

sharing a suffix are considered as a base cluster. Base clusters are then combined together if they have

a document overlap of 50% or more. The algorithm has two drawbacks. First, although the structure

used is a compact tree, suffixes can appear multiple times if they are part of larger shared suffixes.

The other drawback is that the second phase of the algorithm is not incremental. Combining base

clusters into final clusters has to be done in a non-incremental way. The algorithm deals properly

with the insertion order problem, though, since any insertion order will lead to the same resulting suffix tree.

DC-tree Clustering. The DC-tree incremental algorithm was introduced by Wong et al [27] in

2000. The algorithm is based on the B+-tree structure. Unlike the STC algorithm, this algorithm is

based on vector space representation of the documents. Most of the algorithm operations are borrowed

from the B+-tree operations. Each node in the tree is a representation of a cluster, where a cluster is

represented by the combined feature vectors of its individual documents. Inserting a new document

involves comparison of the document feature vector with the cluster vectors at one level of the tree,


and descending to the most similar cluster. The algorithm defines several parameters and thresholds

for the various operations. The algorithm suffers from two problems, though. First, once a document is assigned to a cluster, it is not allowed to be re-assigned later to a newly created cluster. Second, as a consequence of the first drawback, clusters are not allowed to overlap; i.e., a document can belong to only one cluster.

5.1 Similarity histogram-based incremental clustering

The clustering approach proposed here is an incremental dynamic method of building the clusters.

We adopt an overlapped cluster model. The key concept for the proposed clustering method is to

keep each cluster at a high degree of coherency at any time. We represent the coherency of a cluster

with a new concept called Cluster Similarity Histogram.

Cluster Similarity Histogram: is a concise statistical representation of the set of pair-

wise document similarities distribution in the cluster. A number of bins in the histogram

correspond to fixed similarity value intervals. Each bin contains the count of pair-wise

document similarities in the corresponding interval.

Figure 4 shows a typical cluster similarity histogram, where the distribution is almost a normal distribution. A perfect cluster would have a histogram where the similarities are all maximum, while a loose cluster would have a histogram where the similarities are all minimum.

Figure 4: Typical Cluster Similarity Histogram

5.2 Creating coherent clusters incrementally

Our objective is to keep each cluster as coherent as possible. In terms of the similarity histogram

concept this translates to maximizing the number of similarities in the high similarity intervals. To


achieve this goal in an incremental fashion, we judge the effect of adding a new document to a

certain cluster. If the document would significantly degrade the distribution of the similarities in the cluster, it is not added; otherwise it is added. A much stricter strategy would be to add only documents that enhance the similarity distribution. However, this could create a problem with perfect clusters: a document would be rejected by such a cluster even if it has high similarity to most of the documents in the cluster (because the cluster is already perfect).

We judge the quality of a similarity histogram (cluster cohesiveness) by calculating the ratio of

the count of similarities above a certain similarity threshold ST to the total count of similarities. The

higher this ratio, the more coherent the cluster is.

Let n be the number of documents in a cluster. The number of pair-wise similarities in the cluster is m = n(n + 1)/2. Let S = {s_i : i = 1, . . . , m} be the set of similarities in the cluster. The

histogram of the similarities in the cluster is represented as:

\[
H = \{h_i : i = 1, \ldots, B\} \tag{5a}
\]
\[
h_i = \mathrm{count}(s_k), \quad s_{li} < s_k < s_{ui} \tag{5b}
\]


where B is the number of histogram bins, h_i is the count of similarities in bin i, s_{li} is the lower similarity bound of bin i, and s_{ui} is the upper similarity bound of bin i.

The histogram ratio of a cluster is the measure of cohesiveness of the cluster as described above,

and is calculated as:

\[
HR(C) = \frac{\sum_{i=T}^{B} h_i}{\sum_{j=1}^{B} h_j} \tag{6a}
\]
\[
T = \lceil S_T \cdot B \rceil \tag{6b}
\]

where HR is the histogram ratio, C is the cluster under consideration, S_T is the similarity threshold, and T is the bin number corresponding to the similarity threshold.

Basically we would like to keep the histogram ratio of each cluster high. However, since we

allow documents that can degrade the histogram ratio to be added, this could result in a chain effect

of degrading the ratio to zero eventually. To prevent this, we set a minimum histogram ratio HRmin

that clusters should maintain. We also do not allow adding a document that will bring down the

histogram ratio significantly (even if it remains above HRmin). This prevents a single bad document addition from severely degrading cluster quality.
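A sketch of equations (5) and (6) follows. The bin width and the choice of a ceiling for T in (6b) are assumptions of this sketch.

```python
from itertools import combinations
from math import ceil

def histogram_ratio(similarities, bins=10, sim_threshold=0.5):
    """Equations (5) and (6): fraction of pair-wise similarities falling in or
    above the threshold bin T = ceil(S_T * B)."""
    h = [0] * bins
    for s in similarities:
        h[min(int(s * bins), bins - 1)] += 1   # bin index for s in [0, 1]
    t = max(1, ceil(sim_threshold * bins))      # equation (6b), 1-indexed bins
    total = sum(h)
    return sum(h[t - 1:]) / total if total else 1.0

# Usage: feed in the pair-wise similarities of a cluster's documents, e.g.
# histogram_ratio([sim(a, b) for a, b in combinations(cluster_docs, 2)])
```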

We now present the incremental clustering algorithm based on the above framework (Algorithm 2).

The algorithm works incrementally by receiving a new document, and for each cluster calculates the

cluster histogram before and after simulating the addition of the document (lines 4–6). The old and new histogram ratios are compared, and if the new ratio is greater than or equal to the old one, the document is added to the cluster. If the new ratio is less than the old one by no more than ε and still above HRmin, the document is also added (lines 7–9). Otherwise it is not added. If, after checking all clusters, the document was not assigned to any cluster, a new cluster is created and the document is added to it (lines 11–14).

Algorithm 2 Similarity Histogram-based Incremental Document Clustering

1: L ← Empty List {Cluster List}
2: for each document D do
3:   for each cluster C in L do
4:     HRold = HR(C)
5:     Simulate adding D to C
6:     HRnew = HR(C)
7:     if (HRnew ≥ HRold) OR ((HRnew > HRmin) AND (HRold − HRnew < ε)) then
8:       Add D to C
9:     end if
10:  end for
11:  if D was not added to any cluster then
12:    Create a new cluster C
13:    Add D to C
14:    Add C to L
15:  end if
16: end for
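A runnable sketch of Algorithm 2 follows. The histogram ratio is collapsed here to a direct threshold count (equivalent to equations (6a)/(6b) when the threshold falls on a bin boundary), `sim` is any pair-wise similarity function such as equation (4), and the parameter values are illustrative assumptions.

```python
from itertools import combinations

def hr(cluster, sim, s_t=0.5):
    """Histogram ratio as a direct threshold count: the fraction of pair-wise
    similarities at or above the threshold S_T."""
    sims = [sim(a, b) for a, b in combinations(cluster, 2)]
    if not sims:
        return 1.0
    return sum(s >= s_t for s in sims) / len(sims)

def incremental_cluster(documents, sim, hr_min=0.5, eps=0.05):
    """A sketch of Algorithm 2; hr_min and eps values are illustrative."""
    clusters = []
    for d in documents:
        assigned = False
        for c in clusters:
            hr_old = hr(c, sim)
            hr_new = hr(c + [d], sim)      # simulate adding D to C (line 5)
            if hr_new >= hr_old or (hr_new > hr_min and hr_old - hr_new < eps):
                c.append(d)                 # add D to C (line 8)
                assigned = True             # clusters may overlap: keep checking
        if not assigned:
            clusters.append([d])            # create a new cluster for D
    return clusters
```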


5.3 Dealing with insertion order problems

Our strategy for the insertion order problem is to implement a document reassignment strategy. Older

documents that were added before new clusters were created should have the chance to be reassigned

to newly created clusters. Only documents that seem to be “bad” for a certain cluster are tagged and

considered for reassignment to other clusters. The documents that are candidates to leave a cluster are those whose departure would increase the cluster similarity histogram ratio; i.e., the cluster is better off without them.

For each document we keep a record of what the cluster's histogram ratio would be if the document were not in the cluster. If this value is greater than the current histogram ratio, then the document is a candidate for leaving the cluster. Upon adding a new document to any cluster, we consult the documents that are candidates for leaving that cluster. If any such document can be added to another cluster, we move it to that cluster, thus benefiting both clusters.

This strategy creates a dynamic negotiation scheme between clusters for document assignment. It

also allows for overlapping clusters, and dynamic incremental document clustering.
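The candidate check could be sketched as follows, reusing the `hr` helper from the Algorithm 2 sketch above; the function name is illustrative.

```python
def leaving_candidates(cluster, sim):
    """Documents whose removal would raise the cluster's histogram ratio."""
    hr_with = hr(cluster, sim)
    return [d for d in cluster
            if hr([x for x in cluster if x is not d], sim) > hr_with]
```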

6 Experimental Results

In order to test the effectiveness of the web clustering system, we conducted a set of experiments using

our proposed data model, phrase matching, similarity measure, and incremental clustering method.

The experiments conducted were divided into two sets. We first tested the effectiveness of the Doc-

ument Index Graph model, presented in section 3, and the accompanying phrase matching algorithm

for calculating the similarity between documents based on phrases versus individual words only. The

second set of experiments was to evaluate the accuracy of the incremental document clustering algorithm, presented in section 5, based on the cluster cohesiveness measure using similarity histograms.

Table 1: Data Set Descriptions

Data Set   Description                                  Categories   Documents
DS1        UofW web site, Canadian web sites            10           314
DS2        Reuters news articles (from Yahoo! news)     20           2340

6.1 Experimental setup

Because the proposed system was designed for making use of the semi-structure of web documents,

regular text corpora were not used. Our experimental setup consisted of two web document sets.

The first consists of 314 web documents collected from various University of Waterloo web sites,

such as the Graduate Studies Office, Information Systems and Technology, Health Services, Career

Services, Co-operative Education, and other Canadian web sites. The documents were classified,

according to their content, into 10 different categories. In order to allow for independent testing

and the reproduction of the results presented here, this document collection can be downloaded at http://pami.uwaterloo.ca/~hammouda/webdata/. The second data set is a collection of Reuters news articles from the Yahoo! news site. The set contains 2340 documents classified into 20 different categories (with some relevance between the categories as well). This second data set was

used by Boley et al in [4, 2, 3]. Table 1 summarizes the two data sets.

6.2 Evaluation measures

In order to evaluate the quality of the clustering, we adopted two quality measures widely used in the

text mining literature for the purpose of document clustering [25]. The first is the F-measure, which


combines the Precision and Recall ideas from the Information Retrieval literature. The precision and

recall of a cluster j with respect to a class i are defined as:

\[
P = \mathrm{Precision}(i, j) = \frac{N_{ij}}{N_j} \tag{7a}
\]
\[
R = \mathrm{Recall}(i, j) = \frac{N_{ij}}{N_i} \tag{7b}
\]

where N_{ij} is the number of members of class i in cluster j, N_j is the number of members of cluster j, and N_i is the number of members of class i.

The F-measure of a class i is defined as:

\[
F(i) = \frac{2PR}{P + R} \tag{8}
\]

With respect to class i we consider the cluster with the highest F-measure to be the cluster j that maps

to class i, and that F-measure becomes the score for class i. The overall F-measure for the clustering

result C is the weighted average of the F-measure for each class i:

\[
F_C = \frac{\sum_i \left( |i| \times F(i) \right)}{\sum_i |i|} \tag{9}
\]

where |i| is the number of objects in class i. The higher the overall F-measure, the better the clustering,

due to the higher accuracy of the clusters mapping to the original classes.

The second measure is the Entropy, which provides a measure of “goodness” for un-nested clusters or for the clusters at one level of a hierarchical clustering. Entropy tells us how homogeneous

a cluster is. The higher the homogeneity of a cluster, the lower the entropy is, and vice versa. The

entropy of a cluster containing only one object (perfect homogeneity) is zero.

For every cluster j in the clustering result C we compute pij , the probability that a member of

cluster j belongs to class i. The entropy of each cluster j is calculated using the standard formula

E_j = -\sum_i p_{ij} \log(p_{ij}), where the sum is taken over all classes. The total entropy for a set of clusters

is calculated as the sum of entropies for each cluster weighted by the size of each cluster:

\[
E_C = \sum_{j=1}^{m} \left( \frac{N_j}{N} \times E_j \right) \tag{10}
\]

where Nj is the size of cluster j, and N is the total number of data objects.

Basically we would like to maximize the F-measure, and minimize the Entropy of clusters to

achieve high quality clustering.
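The two measures can be sketched directly from equations (7)–(10); representing classes and clusters as dictionaries of document sets is an illustrative assumption.

```python
from math import log

def overall_f_measure(classes, clusters):
    """Equations (7)-(9): classes and clusters map ids to sets of documents."""
    total = sum(len(ci) for ci in classes.values())
    score = 0.0
    for ci in classes.values():
        best = 0.0
        for kj in clusters.values():
            nij = len(ci & kj)
            if nij:
                p, r = nij / len(kj), nij / len(ci)     # precision, recall
                best = max(best, 2 * p * r / (p + r))   # equation (8)
        score += len(ci) * best                          # weighted by class size
    return score / total

def total_entropy(classes, clusters):
    """Equation (10): size-weighted sum of per-cluster entropies."""
    n = sum(len(kj) for kj in clusters.values())
    e_total = 0.0
    for kj in clusters.values():
        ej = -sum((len(ci & kj) / len(kj)) * log(len(ci & kj) / len(kj))
                  for ci in classes.values() if ci & kj)
        e_total += (len(kj) / n) * ej
    return e_total
```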

6.3 Effect of phrase-based similarity on clustering quality

The similarities calculated by our algorithm were used to construct a similarity matrix between the

documents. We elected to use three standard document clustering techniques for testing the effect

of phrase similarity on clustering [13]: (1) Hierarchical Agglomerative Clustering (HAC), (2) Single

Pass Clustering, and (3) k-Nearest Neighbor Clustering (k-NN).² For each of the algorithms, we

constructed the similarity matrix and let the algorithm cluster the documents based on the presented

similarity matrix.

²Although k-NN is mostly known to be used for classification, it has also been used for clustering (an example can be found in [17]).

The results listed in Table 2 show the improvement in the clustering quality on the first data set

using the combined similarity measure. The improvements shown were achieved at a similarity blend

factor between 70% and 80% (phrase similarity weight). The parameters chosen for the different

algorithms were the ones that produced the best results. The improvement ranges from a 19.5% to a 60.6% increase in F-measure quality, and from a 9.1% to a 46.2% drop in Entropy (lower is better for Entropy). It is obvious that the phrase-based similarity plays an important role in accurately

judging the relation between documents. It is known that Single Pass clustering is very sensitive to noise; that is why it has the worst performance. However, when the phrase similarity was introduced, the quality of the clusters produced was pushed close to that produced by HAC and k-NN.

Table 2: Phrase-based clustering improvement

                  Single-Term Similarity     Combined Similarity      Improvement
                  F-measure    Entropy       F-measure    Entropy
HAC (a)           0.709        0.351         0.904        0.103       +19.5% F, -24.8% E
Single Pass (b)   0.427        0.613         0.817        0.151       +39.0% F, -46.2% E
k-NN (c)          0.228        0.173         0.834        0.082       +60.6% F, -9.1% E

(a) Complete Linkage was used as the cluster distance measure for the HAC method since it tends to produce tight clusters with small diameter.
(b) A document-to-cluster similarity threshold of 0.25 was used.
(c) A K of 5 and a cluster similarity threshold of 0.25 were used.

Figure 5: Effect of phrase similarity on clustering quality. (a) Effect of phrase similarity on F-measure; (b) effect of phrase similarity on Entropy. (Both panels plot quality against the similarity blend factor α for HAC, Single Pass, and k-NN.)

In order to better understand the effect of the phrase similarity on the clustering quality, we gen-

erated a clustering quality profile against the similarity blend factor. Figure 5(a) illustrates the effect

of introducing the phrase similarity on the F-measure of the resulting clusters. It is obvious that the

phrase similarity enhances the F-measure of the clustering until a certain point (around a weight of

80%) and then its effect starts bringing down the quality. As we mentioned in section 4.1, phrases alone cannot capture all the similarity information between documents; the single-term similarity is still required, but to a smaller degree. The same can be seen from the Entropy profile in Figure 5(b),

where Entropy is minimized at around 80% contribution of phrase similarity against 20% for the

single-term similarity.

The results show that both evaluation measures are optimized in the same trend with respect to the blend factor. Since two independent evaluation measures confirm the clustering quality improvement, we are confident that the results are not biased by either evaluation measure.

6.4 Incremental clustering evaluation

Our proposed incremental document clustering method was evaluated using both data sets mentioned

earlier. We relied on the same evaluation measures discussed above, as well as another measure

called “Overall Similarity”, which is the average of the similarities inside each cluster. Higher overall

similarity means better cluster cohesiveness.

Table 3: Proposed Clustering Method Improvement

                        Data Set 1                      Data Set 2
                  F-measure  Entropy  O-S(a)      F-measure  Entropy  O-S
Proposed Method     0.931     0.119    0.504        0.682     0.156    0.497
HAC                 0.709     0.351    0.455        0.584     0.281    0.398
Single Pass         0.427     0.613    0.385        0.502     0.250    0.311
k-NN                0.228     0.173    0.367        0.522     0.161    0.452

(a) O-S: Overall Similarity.

Table 3 shows the result of the proposed clustering method against HAC, Single Pass, and k-NN

clustering. For the first data set, the improvement was very significant, reaching over 70% improve-

ment over k-NN (in terms of F-measure), 25% improvement over HAC, and 53% improvement over

Single Pass. This is attributed to the fact that the different categories of the documents do not have a

great deal of overlap, which makes the algorithm able to avoid noisy similarities from other clusters.

For the second data set an improvement of between 10% and 18% was achieved over the other methods. However, the F-measure was not as high as for the first data set. By examining the

actual documents and their classification it turns out that the documents do not have enough overlap

in each single class, which makes it difficult to have an accurate similarity calculation between the

documents. However, we were able to push the quality of clustering further by relying on accurate

and robust phrase matching similarity calculation, and achieve higher clustering quality.

Figure 6: Quality of Clustering Comparison. (a) Clustering quality by F-measure; (b) clustering quality by Entropy, on both data sets, for the proposed method (with and without re-assignment), HAC, Single Pass, and k-NN.

Figure 6 shows the above-mentioned results more clearly, showing the achieved improvement in comparison with the other methods. The figure also shows the effect of applying the re-assignment strategy discussed in section 5.3. The problem with incremental clustering is that documents usually

do not end up where they should be. The re-assignment strategy we chose to use re-assigns documents

that are seen as bad for some clusters to other clusters that can accept the document, all based on

the idea of increasing the cluster similarity histogram ratio. The re-assignment strategy showed a slight

improvement over the same method without document re-assignment as shown in the figure.

7 Conclusion

We presented a system composed of four decoupled components in an attempt to improve document clustering in the web domain. Information in web documents does not lie in the content only, but also in their inherent semi-structure. By exploiting this structure

we can achieve better clustering results. We presented a web document analysis component that is

capable of identifying the structure of web documents, and building structured documents out of the

semi-structured web documents.

The second component, and perhaps the most important one that has most of the impact on per-

formance, is the new document model introduced in this paper, the Document Index Graph. This

model is based on indexing web documents using phrases and their levels of significance. Such a

model enables us to perform phrase matching and similarity calculation between documents in a very

robust and accurate way. The quality of clustering achieved using this model significantly surpasses

the traditional vector space model based approaches.

The third component is the phrase-based similarity measure. By carefully examining the factors

affecting the degree of overlap between documents, we devised a phrase-based similarity measure

that is capable of accurate calculation of pair-wise document similarity.


The fourth component is an incremental document clustering method based on maintaining high

cluster cohesiveness by improving the pair-wise document similarity distribution inside each cluster.

The merit of such a design is that each component could be utilized independently of the others. But we have confidence that the combination of these components leads to better results, as justified by

the results presented in this paper. By adopting different standard clustering techniques to test against

our model, we are very confident that this model is well justified.

There are a number of future research directions to extend and improve this work. One direction

that this work might continue on is to improve on the accuracy of similarity calculation between

documents by employing different similarity calculation strategies. Although the current scheme

proved more accurate than traditional methods, there is still room for improvement.

Although the work presented here is aimed at web document clustering, it could be easily adapted

to any document type as well. However, it will not benefit from the semi-structure found in web

documents. Our intention is to investigate the usage of such a model on standard corpora and see its

effect on clustering compared to traditional methods.


References

[1] K. Aas and L. Eikvil. Text categorisation: A survey. Technical Report 941, Norwegian Com-

puting Center, June 1999.

[2] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and

J. Moore. Partitioning-based clustering for web document categorization. Decision Support

Systems, 27:329–341, 1999.

[3] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and

J. Moore. Document categorization and query generation on the World Wide Web using We-

bACE. AI Review, 13(5-6):365–391, 1999.

[4] D. Boley. Principal direction divisive partitioning. Data Mining and Knowledge Discovery,

2(4):325–344, 1998.

[5] K. Cios, W. Pedrycz, and R. Swiniarski. Data Mining Methods for Knowledge Discovery.

Kluwer Academic Publishers, Boston, 1998.

[6] W. W. Cohen. Learning to classify English text with ILP methods. In Proceedings of the 5th

International Workshop on Inductive Logic Programming, pages 3–24. Department of Computer

Science, Katholieke Universiteit Leuven, 1995.

[7] S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and represen-

tations for text categorization. In Proceedings of the 7th International Conference on Informa-

tion and Knowledge Management, pages 148–155, November 1998.


[8] W. B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms.

Prentice Hall, Englewood Cliffs, N.J., 1992.

[9] D. Freitag and A. McCallum. Information extraction with HMMs and shrinkage. In Proceedings

of the AAAI-99 Workshop on Machine Learning for Information Extraction, pages 31–36, 1999.

[10] T. Hofmann. The cluster-abstraction model: Unsupervised learning of topic hierarchies from

text data. In Proceedings of the 16th International Joint Conference on Artificial Intelligence

IJCAI-99, pages 682–687, 1999.

[11] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen. WEBSOM—self-organizing maps of doc-

ument collections. In Proceedings of WSOM’97, Workshop on Self-Organizing Maps, pages

310–315, Espoo, Finland, June 1997.

[12] J. D. Isaacs and J. A. Aslam. Investigating measures for pairwise document similarity. Technical

Report PCS-TR99-357, Dartmouth College, Computer Science, Hanover, NH, June 1999.

[13] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs,

N.J., 1988.

[14] M. Junker, M. Sintek, and M. Rinck. Learning for text categorization and information extraction

with ILP. In James Cussens, editor, Proceedings of the 1st Workshop on Learning Language in

Logic, pages 84–93, Bled, Slovenia, 1999.

[15] H. Kargupta, I. Hamzaoglu, and B. Stafford. Distributed data mining using an agent based

architecture. In Proceedings of Knowledge Discovery and Data Mining, pages 211–214. AAAI

Press, 1997.


[16] R. Kosala and H. Blockeel. Web mining research: a survey. ACM SIGKDD Explorations

Newsletter, 2(1):1–15, 2000.

[17] S. Y. Lu and K. S. Fu. A sentence-to-sentence clustering procedure for pattern analysis. IEEE

Transactions on Systems, Man, and Cybernetics, 8:381–389, 1978.

[18] U. Y. Nahm and R. J. Mooney. A mutually beneficial integration of data mining and information

extraction. In 17th National Conference on Artificial Intelligence (AAAI-00), pages 627–632,

2000.

[19] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, July 1980.

[20] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill

computer science series. McGraw-Hill, New York, 1983.

[21] G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing. Communica-

tions of the ACM, 18(11):613–620, November 1975.

[22] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Informa-

tion by Computer. Addison Wesley, Reading, MA, 1989.

[23] S. Scott and S. Matwin. Feature engineering for text classification. In Proceedings of the 16th

International Conference on Machine Learning ICML-99, pages 379–388, 1999.

[24] S. Soderland. Learning information extraction rules for semi-structured and free text. Machine

Learning, 34(1-3):233–272, 1999.

[25] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques.

KDD-2000 Workshop on TextMining, August 2000.


[26] A. Strehl, J. Ghosh, and R. Mooney. Impact of similarity measures on web-page clustering. In

Proceedings of the 17th National Conference on Artificial Intelligence: Workshop of Artificial

Intelligence for Web Search (AAAI 2000), pages 58–64, Austin, TX, July 2000. AAAI.

[27] W. Wong and A. Fu. Incremental document clustering for web page classification. In 2000

International Conference on Information Society in the 21st Century: Emerging Technologies

and New challenges (IS2000), Japan, 2000.

[28] Y. Yang, J. Carbonell, R. Brown, T. Pierce, B. Archibald, and X. Liu. Learning approaches for

detecting and tracking news events. IEEE Intelligent Systems, 14(4):32–43, 1999.

[29] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization.

In Proceedings of the 14th International Conference on Machine Learning (ICML’97), pages

412–420, Nashville, TN, 1997.

[30] O. Zamir and O. Etzioni. Grouper: a dynamic clustering interface to Web search results.

Computer Networks (Amsterdam, Netherlands: 1999), 31(11–16):1361–1374, 1999.

[31] O. Zamir, O. Etzioni, O. Madani, and R. M. Karp. Fast and intuitive clustering of web docu-

ments. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data

Mining, pages 287–290, Newport Beach, CA, August 1997. AAAI.

[32] O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. In Proceedings

of the 21st Annual International ACM SIGIR Conference, pages 46–54, Melbourne, Australia,

1998.