Sept. 7, 2001 ECDL2001 1
An On-line Document Clustering Method Based on Forgetting Factors
Yoshiharu Ishikawa, Yibing Chen
Hiroyuki Kitagawa
University of Tsukuba, Japan
2
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
3
Background The Internet has enabled on-line document delivery services
- newsfeed services over the network
- periodically issued on-line journals
Important technologies (and applications) for on-line documents:
- information filtering
- document summarization, information extraction
- topic detection and tracking (TDT)
Clustering works as a core technique for these applications
4
Our Objectives (1) Development of an on-line clustering method which
considers the novelty of each document and presents a snapshot of clusters in an up-to-date manner. Example: articles from a sports news feed
[Timeline figure: topic streams over time: Soccer World Cup, Formula 1 & M. Schumacher, U.S. Open Tennis, other articles]
5
Our Objectives (2) Development of a novelty-based clustering
method for on-line documents Features:
- It gives higher weight to newer documents than to older ones and forgets obsolete ones: introduction of a new document similarity measure that considers the novelty and obsolescence of documents
- Incremental clustering processing: low processing cost to generate a new clustering result
- Automatic maintenance of target documents: obsolete documents are automatically deleted from the clustering target
6
Incremental Clustering Process (1): when t = 0 (initial state)
[Diagram: new documents, the Clustering Module, the document repository at t = 0, and Cluster 1, ..., Cluster k]
1. arrival of new documents
2. store new documents in the repository
3. calculate and store statistics
4. cluster documents and present the result
7
Incremental Clustering Process (2): when t = 1
[Diagram: the repositories at t = 0 and t = 1, the Clustering Module, and Cluster 1, ..., Cluster k before and after the update]
1. arrival of new documents
2. store new documents in the repository
3. update statistics
4. cluster documents and present the result
8
Incremental Clustering Process (3): when t = τ + 1
[Diagram: the repositories at t = 0, ..., t = τ, t = τ + 1, the Clustering Module, and Cluster 1, ..., Cluster k before and after the update]
1. arrival of new documents
2. store new documents in the repository
3. update statistics
4. delete old documents
5. cluster documents and present the result
9
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
  - C2ICM Clustering Method
  - F2ICM Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
10
C2ICM Clustering Method
- Cover-Coefficient-based Incremental Clustering Methodology
- Proposed by F. Can (ACM TOIS, 1993) [3]
- Incremental clustering method with low update cost
- Seed-based clustering method
  - Based on the concept of seed powers
  - Seed powers are defined probabilistically
  - Documents with the highest seed powers are selected as cluster seeds
11
Decoupling/Coupling Coefficients Two important notions in C2ICM method
used to calculate seed powers
Decoupling coefficient of document d_i: $\delta_i \equiv \Pr(d_i \mid d_i)$
- the probability that the document d_i is obtained when the document d_i itself is given
- an index to measure the independence of d_i
Coupling coefficient of document d_i: $\psi_i = 1 - \delta_i$
- an index to measure the dependence of d_i
12
Seed Power
Seed power spi for document di measures the appropriateness (moderate dependence) of di as a cluster seed
freq(di, tj): the occurrence frequency of term tj within document di
$\delta'_j \equiv \Pr(t_j \mid t_j)$: decoupling coefficient for term t_j
$\psi'_j = 1 - \delta'_j$: coupling coefficient for term t_j
$sp_i = \delta_i \cdot \psi_i \cdot \sum_{j=1}^{n} freq(d_i, t_j) \cdot \delta'_j \cdot \psi'_j$
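As an illustration, here is a minimal Python sketch of the seed-power computation above. The dict-based document representation and the names (freq_i, term_delta, term_psi) are assumptions made for the example, not part of the original method description.

```python
def seed_power(freq_i, delta_i, psi_i, term_delta, term_psi):
    """Seed power sp_i = delta_i * psi_i * sum_j freq(d_i, t_j) * delta'_j * psi'_j.

    freq_i     : dict, term -> occurrence frequency of the term in document d_i
    delta_i    : decoupling coefficient of d_i
    psi_i      : coupling coefficient of d_i (1 - delta_i)
    term_delta : dict, term -> decoupling coefficient delta'_j of the term
    term_psi   : dict, term -> coupling coefficient psi'_j (1 - delta'_j)
    """
    total = sum(f * term_delta[t] * term_psi[t] for t, f in freq_i.items())
    return delta_i * psi_i * total
```

Documents with the highest sp_i values would then be chosen as cluster seeds.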
13
C2ICM Clustering Algorithm (1) Initial phase
1. Select new seeds based on the seed powers
2. Other documents are assigned to the cluster with the most similar seed
[Figure legend: Red: F1 & Schumacher; Green: U.S. Open Tennis]
14
C2ICM Clustering Algorithm (2) Incremental update phase
1. Select new seeds based on the seed powers
2. Other documents are assigned to the cluster with the most similar seed
[Figure legend: Red: F1 & Schumacher; Green: U.S. Open Tennis; Orange: Soccer World Cup]
15
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
  - C2ICM Clustering Method
  - F2ICM Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
16
F2ICM Clustering Method Extension of the C2ICM method. Main differences:
- Introduction of a new document similarity measure based on the notion of the forgetting factor: it gives higher weight to newer documents when generating clusters
- Incremental maintenance of statistics
- Automatic deletion of obsolete old documents
17
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
  - Document forgetting model
  - Derivation of document similarity measure
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
18
Document Similarity Based on Forgetting Factor
New document similarity measure based on the document forgetting model
- Assumption: each delivered document gradually loses its value (weight) as time passes
- Derivation of a document similarity measure based on this assumption: put high weights on new documents and low weights on old ones, so that old documents have little effect on clustering
- Using the derived document similarity measure, we can achieve novelty-based clustering
19
Document Forgetting Model (1) Ti : acquisition time of
document di
Information value (weight) of d_i is defined as
$dw_i|_\tau = \lambda^{\tau - T_i}$
(T_i: acquisition time of document d_i, τ: current time)
Document weight exponentially decreases as time passes
λ (0 < λ < 1) determines the forgetting speed
[Graph: dw_i starts at 1 at the acquisition time T_i and decays exponentially with the current time t]
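A minimal sketch of the forgetting model, assuming time is measured in discrete units (e.g. days) and that the weight is 1 at the acquisition time; the function name is only illustrative.

```python
def document_weight(acq_time, now, lam):
    """dw_i at the current time: lam ** (now - acq_time), with 0 < lam < 1."""
    assert 0.0 < lam < 1.0 and now >= acq_time
    return lam ** (now - acq_time)

# With lam = 0.5 a document loses half of its value per time unit:
# document_weight(0, 3, 0.5) -> 0.125
```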
20
Document Forgetting Model (2) Why do we use the exponential forgetting model?
- It inherits ideas from the behavioral law of human memory: the Power Law of Forgetting [1] states that human memory exponentially decreases as time passes
- Relationship with citation analysis: obsolescence (aging) of citations can be measured by measuring citation rates, and some simple obsolescence models take exponential forms
- Efficiency: based on the model, we can obtain an efficient statistics maintenance procedure
- Simplicity: we can control the forgetting speed using the parameter λ
21
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
  - Document forgetting model
  - Derivation of document similarity measure
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
22
Our Approach for Document Similarity Derivation
Probabilistic derivation based on the document forgetting model
Let Pr(di, dj) be the probability of selecting the document pair (di, dj) from the document repository
We regard the cooccurrence probability Pr(di, dj) as their similarity sim(di, dj)
[Diagram: drawing the document pair (d_i, d_j) from the repository with probability Pr(d_i, d_j)]
23
Derivation of Similarity Formula (1) tdw: total weight of all the m documents
simple summation of all document weights
Pr(di): subjective probability to select document di from the repository
$tdw|_\tau = \sum_{l=1}^{m} dw_l|_\tau$, where $dw_l|_\tau = \lambda^{\tau - T_l}$
$\Pr(d_i) = \frac{dw_i}{tdw}$
Since old documents have small document weights, their selection probabilities are small
24
Derivation of Similarity Formula (2) Pr(tk|di): selection probability of term tk from
document di
freq(di, tk): the number of occurrences of tk in di; this probability corresponds to term frequency
$\Pr(t_k \mid d_i) = \frac{freq(d_i, t_k)}{\sum_{l=1}^{n} freq(d_i, t_l)}$
$tf(d_i, t_k) \equiv \Pr(t_k \mid d_i)$
25
Derivation of Similarity Formula (3) Pr(tk): occurrence probability of term tk
this probability corresponds to document frequency of term tk
the reciprocal of df(tk) represents IDF (inverse document frequency)
$df(t_k) \equiv \Pr(t_k) = \sum_{i=1}^{m} \Pr(t_k \mid d_i)\,\Pr(d_i)$
$idf(t_k) = \frac{1}{df(t_k)}$
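The estimates on the last three slides translate directly into code. Below is a small sketch assuming each document is represented as a dict of term frequencies and that the document weights dw and their total tdw are given; all names are illustrative only.

```python
def tf(freq_i, term):
    """tf(d_i, t_k) = Pr(t_k | d_i): relative frequency of the term in the document."""
    total = sum(freq_i.values())
    return freq_i.get(term, 0) / total if total else 0.0

def doc_prob(dw_i, tdw):
    """Pr(d_i) = dw_i / tdw."""
    return dw_i / tdw

def df(term, docs, weights, tdw):
    """df(t_k) = Pr(t_k) = sum_i Pr(t_k | d_i) Pr(d_i); idf(t_k) is its reciprocal."""
    return sum(tf(freq_i, term) * doc_prob(dw_i, tdw)
               for freq_i, dw_i in zip(docs, weights))
```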
26
Derivation of Similarity Formula (4) Using Bayes’ theorem,
$\Pr(d_j \mid t_k) = \frac{\Pr(t_k \mid d_j)\,\Pr(d_j)}{\Pr(t_k)} = tf(d_j, t_k) \cdot \Pr(d_j) \cdot idf(t_k)$
Then we get
$\Pr(d_j \mid d_i) = \sum_{k=1}^{n} \Pr(d_j \mid t_k)\,\Pr(t_k \mid d_i) = \Pr(d_j)\sum_{k=1}^{n} tf(d_i, t_k)\, tf(d_j, t_k)\, idf(t_k)$
27
Derivation of Similarity Formula (5) Therefore, the cooccurrence probability of di, dj is:
$sim(d_i, d_j) \equiv \Pr(d_i, d_j) = \Pr(d_i)\,\Pr(d_j)\sum_{k=1}^{n} tf(d_i, t_k)\, tf(d_j, t_k)\, idf(t_k)$
The older a document di becomes, the smaller its similarity scores with other documents are, because old documents have low Pr(di) values
The factors Pr(di) and Pr(dj) give old documents low similarity scores; the summation is the inner product of document vectors based on TF-IDF weighting
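Putting the pieces together, a sketch of the resulting similarity measure, reusing the tf estimate above; df_table is assumed to map each term to a precomputed df(t_k) value.

```python
def similarity(freq_i, dw_i, freq_j, dw_j, df_table, tdw):
    """sim(d_i, d_j) = Pr(d_i) Pr(d_j) * sum_k tf(d_i,t_k) tf(d_j,t_k) idf(t_k)."""
    shared = set(freq_i) & set(freq_j)          # only shared terms contribute
    inner = sum(tf(freq_i, t) * tf(freq_j, t) / df_table[t] for t in shared)
    return (dw_i / tdw) * (dw_j / tdw) * inner
```

The two leading factors shrink as documents age, which is how old documents end up with low similarity scores.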
28
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
  - Document forgetting model
  - Derivation of document similarity measure
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
29
Updating Statistics and Probabilities: when t = τ + 1
[Diagram: the repositories at t = 0, ..., t = τ, t = τ + 1, the Clustering Module, and Cluster 1, ..., Cluster k before and after the update]
1. arrival of new documents
2. store new documents in the repository
3. update statistics
4. delete old documents
5. present new clustering result
30
Approach to Update Processing (1) In every incremental clustering step, we have to
calculate document similarities To compute similarities, we need to calculate
document statistics and probabilities beforehand It is inefficient to compute statistics every time from
scratch
Store the calculated statistics and probabilities and utilize them for later computation
Incremental Update Processing
31
Approach to Update Processing (2) Formulation
d1, ..., dm: document set consisting of m documents
t1, ..., tn: index terms that appear in d1, ..., dm
t = τ: the latest update time of the document set
Assumption: when t = τ + 1, new documents dm+1, ..., dm+m' are appended to the document set
The new documents dm+1, ..., dm+m' introduce additional terms tn+1, ..., tn+n'
m >> m' and n >> n' are satisfied
32
Update Processing Method (1) Update of document weight dwi
Since one unit of time has passed since the previous update time t = τ, the weight of each existing document decreases according to
$dw_i|_{\tau+1} = \lambda^{(\tau+1) - T_i} = \lambda \cdot dw_i|_\tau \quad (1 \le i \le m)$
For each new document, assign the initial value 1:
$dw_i|_{\tau+1} = 1 \quad (m+1 \le i \le m+m')$
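A one-step sketch of this weight update, assuming the weights are kept in a plain list ordered by document id; existing weights are decayed by λ and each new document starts at weight 1.

```python
def update_weights(old_weights, num_new, lam):
    """dw_i|tau+1 = lam * dw_i|tau for existing documents, 1.0 for new ones."""
    return [lam * w for w in old_weights] + [1.0] * num_new
```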
33
Update Processing Method (2) Example of incremental update processing: updating from tdw|_τ to tdw|_{τ+1}
Naive approach: compute tdw|_{τ+1} from scratch
$tdw|_{\tau+1} = \sum_{l=1}^{m+m'} dw_l|_{\tau+1} = \sum_{l=1}^{m} \lambda^{(\tau+1)-T_l} + \sum_{l=m+1}^{m+m'} \lambda^{(\tau+1)-T_l}$
time consuming!
34
Update Processing Method (3) Smart approach: compute tdw|_{τ+1} incrementally
exponential weighting enables efficient incremental computation
$tdw|_{\tau+1} = \sum_{l=1}^{m+m'} dw_l|_{\tau+1} = \lambda \sum_{l=1}^{m} \lambda^{\tau - T_l} + \sum_{l=m+1}^{m+m'} 1 = \lambda \cdot tdw|_\tau + m'$
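The identity above makes the total weight a one-line update. The sketch below also checks the incremental result against the naive recomputation on a tiny made-up example (the concrete numbers are only for illustration).

```python
def update_tdw(tdw_old, num_new, lam):
    """tdw|tau+1 = lam * tdw|tau + m' (m' = number of newly arrived documents)."""
    return lam * tdw_old + num_new

lam = 0.9
weights_tau = [1.0, lam, lam ** 2]                           # weights at time tau
weights_next = [lam * w for w in weights_tau] + [1.0, 1.0]   # two new documents
assert abs(update_tdw(sum(weights_tau), 2, lam) - sum(weights_next)) < 1e-12
```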
35
Updating Processing Method (4) Occurrence probability of each document Pr(di)
can be easily recalculated
We need to calculate term frequencies tf(di, tk) only for new documents dm + 1, ..., dm + m’
$\Pr(d_i)|_{\tau+1} = \frac{dw_i|_{\tau+1}}{tdw|_{\tau+1}}$, where $tdw|_{\tau+1} = \lambda \cdot tdw|_\tau + m'$
36
Updating Processing Method (5) Update formulas for document frequency of each
term df(tk): we expand the formula of df(tk) as follows, then store each $\widetilde{df}(t_k)$ permanently
$df(t_k)|_\tau = \frac{1}{tdw|_\tau}\,\widetilde{df}(t_k)|_\tau$, where $\widetilde{df}(t_k)|_\tau = \sum_{i=1}^{m} dw_i|_\tau \cdot tf(d_i, t_k)$
$\widetilde{df}(t_k)$ can be incrementally updated using the formula
$\widetilde{df}(t_k)|_{\tau+1} = \lambda \cdot \widetilde{df}(t_k)|_\tau + tf_{sum}(t_k)$
where $tf_{sum}(t_k)$ is the sum of tf(di, tk) over the new documents
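A sketch of this incremental update, assuming the stored statistics live in a dict keyed by term and that tf values for the new documents are available; the sum tf_sum(t_k) is accumulated on the fly.

```python
def update_df_tilde(df_tilde_old, new_docs_tf, lam):
    """d~f(t_k)|tau+1 = lam * d~f(t_k)|tau + tf_sum(t_k).

    df_tilde_old : dict, term -> d~f(t_k) at time tau
    new_docs_tf  : list of dicts, term -> tf(d_i, t_k) for each new document
    """
    df_tilde = {t: lam * v for t, v in df_tilde_old.items()}   # decay stored values
    for tf_i in new_docs_tf:                                   # add new contributions
        for t, v in tf_i.items():
            df_tilde[t] = df_tilde.get(t, 0.0) + v
    return df_tilde
```

df(t_k) at the new time point is then the stored value divided by tdw|_{τ+1}.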
37
Update Processing Method (6) Calculation of the new decoupling coefficient δ_i is easy:
$\delta_i = dw_i|_{\tau+1} \sum_{k=1}^{n+n'} \frac{tf(d_i, t_k)^2}{\widetilde{df}(t_k)|_{\tau+1}}$
Update formulas for the decoupling coefficients of terms: incremental update is also possible; details are shown in the paper
38
Summary of Update Processing The following statistics are maintained persistently (m: no. of documents, n: no. of terms):
- dw_i: weight of document d_i (1 ≤ i ≤ m)
- tdw: total weight of documents
- freq(d_i, t_k): term occurrence frequency (1 ≤ i ≤ m, 1 ≤ k ≤ n)
- doclen_i: document length (1 ≤ i ≤ m)
- $\widetilde{df}(t_k)$: statistics to compute df(t_k) (1 ≤ k ≤ n)
- $\tilde{\delta}_i$: statistics to compute δ_i (1 ≤ i ≤ m)
Incremental statistics update cost O(m + m'n) ≈ O(m + n) with storage cost O(m + n): linear cost (cf. the naive, non-incremental method costs O(mn))
39
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
40
Expiration of Old Documents (1): when t = τ + 1
[Diagram: the repositories at t = 0, ..., t = τ, t = τ + 1, the Clustering Module, and Cluster 1, ..., Cluster k before and after the update]
1. arrival of new documents
2. store new documents in the repository
3. update statistics
4. delete old documents
5. present new clustering result
41
Expiration of Old Documents (2) Two reasons to delete old documents:
- reduction of storage area
- old documents have only tiny effects on the resulting clustering structure
Our approach:
- If dw_i < ε (ε is a small parameter constant) is satisfied, delete document d_i
- When we delete d_i, the related statistics values are also deleted (e.g., freq(d_i, t_k)); details are in the proceedings and [6]
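A sketch of the expiration test, assuming the current weights are kept in a dict keyed by document id; deleting the associated per-document statistics (freq, doclen, ...) is left to the caller.

```python
def expired_documents(weights, epsilon):
    """Return the ids of documents whose weight has fallen below epsilon."""
    return [doc_id for doc_id, w in weights.items() if w < epsilon]
```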
42
Parameter Setting Methods F2ICM uses two parameters in its algorithms:
forgetting factor λ (0 < λ < 1): specifies the forgetting speed
expiration parameter ε (0 < ε < 1): threshold value for document deletion
We use the following metaphors:
β: half-life span of the value of a document; $\lambda^{\beta} = 1/2$ is satisfied, namely $\lambda = \exp(-\log 2 / \beta)$
γ: life span of a document; ε is determined by $\epsilon = \lambda^{\gamma}$
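These two metaphors translate directly into code. A small sketch, using β and γ for the half-life and life-span parameters as introduced above; the concrete values in the comments are the ones used later in the experiments.

```python
import math

def forgetting_factor(beta):
    """Choose lam so that lam ** beta = 1/2 (beta = half-life span)."""
    return math.exp(-math.log(2) / beta)

def expiration_threshold(lam, gamma):
    """epsilon = lam ** gamma (gamma = life span of a document)."""
    return lam ** gamma

lam = forgetting_factor(7)            # article value halves in one week
eps = expiration_threshold(lam, 30)   # documents expire after 30 days
```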
43
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
44
Dataset and Parameter Settings Dataset: Mainichi Daily Newspaper articles
Each article consists of the following information: issue date, subject area (e.g., economy, sports), and a keyword list (50 to 150 words in Japanese)
The articles used in the experiment: issue date: January 1994 to February 1994; subject area: international affairs
Parameter settings:
- nc (no. of clusters) = 10
- β (half-life span) = 7: the value of an article reduces to ½ in one week
- γ (life span) = 30: every document will be deleted after 30 days
45
Computational Cost for Clustering Sequences
Plot of CPU time and response time for each clustering run performed every day
Costs increase linearly until the 30th day, then become almost constant
[Graph: CPU/response time (sec), 0 to 60, versus days; two series: CPU Time and Response Time]
46
Overview of Clustering Result (1) Summarization of 10 clusters after 30 days (at
January 31, 1994)
No. Subject
1 East Europe, NATO, Russia, Ukraine
2 Clinton (White Water/politics), military issue (Korea/Myanmar/Mexico/Indonesia)
3 China (import and export/U.S.)
4 U.S. politics (economic sanctions and Vietnam/elections)
5 Clinton (Syria/South East issue/visiting Europe), Europe (France/Italy/Switzerland)
6 South Africa (ANC/human rights), East Europe (Bosnia-Herzegovina, Croatia), Russia (Zhirinovsky/ruble/Ukraine)
7 Russia (economy/Moscow/U.S.), North Korea (IAEA/nuclear)
8 China (Patriot missiles/South Korea/Russia/Taiwan/economics)
9 Mexico (indigenous peoples/riot), Israel
10 South East Asia (Indonesia/Cambodia/Thailand), China (Taiwan/France), South Korea (politics)
47
Overview of Clustering Result (2) Summarization of 10 clusters after 57 days (at March
1, 1994)
No. Subject
1 Bosnia-Herzegovina (NATO/PKO/UN/Serbia), China (diplomacy)
2 U. S. issue (Japan/economy/New Zealand/Bosnia/Washington)
3 Myanmar, Russia, Mexico
4 Bosnia-Herzegovina (Sarajevo/Serbia), U.S. (North Korea/economy/military)
5 North Korea (IAEA/U.S./nuclear)
6 East Asia (Hebron/random shooting/PLO), Myanmar, Bosnia-Herzegovina
7 U.S. (society/crime/North Korea/IAEA)
8 U.N. (PKO/Bosnia-Herzegovina/EU), China
9 Bosnia-Herzegovina (U.N./PKO/Sarajevo), Russia (Moscow/Serbia)
10 Sarajevo (Bosnia-Herzegovina), China (Taiwan/Tibet)
48
Summary of the Experiment Brief observations
F2ICM groups similar articles into a cluster as long as an appropriate seed is selected
But a cluster obtained in the experiment usually contains multiple topics, and different clusters contain similar topics: clusters are not well separated
Reasons for the observed phenomena:
- The selected seeds are not well separated in topics: a more sophisticated seed selection method is required
- The number of keywords for an article is rather small (50 to 150 words)
49
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
50
Conclusions and Future Work
Conclusions
- Development of an on-line clustering method which considers the novelty of documents
- Introduction of the document forgetting model
- F2ICM: Forgetting Factor-based Incremental Clustering Method
- Incremental statistics update method (linear update cost)
- Automatic document expiration and parameter setting methods
- Preliminary report of the experiments
Current and Future Work
- Revision of the clustering algorithms based on the Scatter/Gather approach [4]
- More detailed experiments and their evaluation
- Development of automatic parameter tuning methods