An On-line Document Clustering Method Based on Forgetting Factors

Yoshiharu Ishikawa, Yibing Chen, Hiroyuki Kitagawa (University of Tsukuba, Japan). ECDL2001, Sept. 7, 2001.


Page 1: An On-line  Document Clustering Method  Based on Forgetting Factors

Sept. 7, 2001, ECDL2001

An On-line Document Clustering Method Based on Forgetting Factors

Yoshiharu Ishikawa, Yibing Chen, Hiroyuki Kitagawa

University of Tsukuba, Japan

Page 2: An On-line  Document Clustering Method  Based on Forgetting Factors

Outline
  • Background and Objectives
  • F2ICM Incremental Document Clustering Method
  • Document Similarity Based on Forgetting Factor
  • Updating Statistics and Probabilities
  • Document Expiration and Parameter Setting
  • Experimental Results
  • Conclusions and Future Work

Page 3: An On-line  Document Clustering Method  Based on Forgetting Factors

Background
  • The Internet has enabled on-line document delivery services
      • newsfeed services over the network
      • periodically issued on-line journals
  • Important technologies (and applications) for on-line documents
      • information filtering
      • document summarization, information extraction
      • topic detection and tracking (TDT)
  • Clustering works as a core technique for these applications

Page 4: An On-line  Document Clustering Method  Based on Forgetting Factors

Our Objectives (1)
  • Development of an on-line clustering method that
      • considers the novelty of each document
      • presents an up-to-date snapshot of the clusters
  • Example: articles from a sports news feed

[Figure: sports news articles plotted along a time axis; topics shown: Soccer World Cup, Formula 1 & M. Schumacher, U.S. Open Tennis, other articles]

Page 5: An On-line  Document Clustering Method  Based on Forgetting Factors

Our Objectives (2)
  • Development of a novelty-based clustering method for on-line documents
  • Features:
      • It places more weight on newer documents than on older ones and forgets obsolete ones
          • introduction of a new document similarity measure that considers the novelty and obsolescence of documents
      • Incremental clustering processing
          • low processing cost to generate a new clustering result
      • Automatic maintenance of target documents
          • obsolete documents are automatically deleted from the clustering target

Page 6: An On-line  Document Clustering Method  Based on Forgetting Factors

Incremental Clustering Process (1): when t = 0 (initial state)
  1. arrival of new documents
  2. store new documents in the repository
  3. calculate and store statistics
  4. cluster documents and present the result

[Figure: the documents arriving at t = 0 are stored in the repository and grouped by the clustering module into Cluster 1, ..., Cluster k]

Page 7: An On-line  Document Clustering Method  Based on Forgetting Factors

Incremental Clustering Process (2): when t = 1
  1. arrival of new documents
  2. store new documents in the repository
  3. update statistics
  4. cluster documents and present the result

[Figure: the repository holds the document sets of t = 0 and t = 1; the clustering module replaces the clusters of t = 0 with a new Cluster 1, ..., Cluster k]

Page 8: An On-line  Document Clustering Method  Based on Forgetting Factors

Incremental Clustering Process (3): when t = τ + Δτ
  1. arrival of new documents
  2. store new documents in the repository
  3. update statistics
  4. delete old documents
  5. cluster documents and present the result

[Figure: the repository holds the document sets acquired up to t = τ; the set arriving at t = τ + Δτ is added, old documents are deleted, and the clustering module produces a new Cluster 1, ..., Cluster k]

Page 9: An On-line  Document Clustering Method  Based on Forgetting Factors

Outline
  • Background and Objectives
  • F2ICM Incremental Document Clustering Method
      • C2ICM Clustering Method
      • F2ICM Clustering Method
  • Document Similarity Based on Forgetting Factor
  • Updating Statistics and Probabilities
  • Document Expiration and Parameter Setting
  • Experimental Results
  • Conclusions and Future Work

Page 10: An On-line  Document Clustering Method  Based on Forgetting Factors

C2ICM Clustering Method
  • Cover-Coefficient-based Incremental Clustering Methodology
      • Proposed by F. Can (ACM TOIS, 1993) [3]
  • Incremental clustering method with low update cost
  • Seed-based clustering method
      • Based on the concept of seed powers
      • Seed powers are defined probabilistically
      • Documents with the highest seed powers are selected as cluster seeds

Page 11: An On-line  Document Clustering Method  Based on Forgetting Factors

Decoupling/Coupling Coefficients
  • Two important notions in the C2ICM method
      • used to calculate seed powers
  • Decoupling coefficient of document di: $\delta_i = \Pr(d_i \mid d_i)$
      • the probability that document di is obtained when di itself is given
      • an index to measure the independence of di
  • Coupling coefficient of document di: $\psi_i = 1 - \delta_i$
      • an index to measure the dependence of di

Page 12: An On-line  Document Clustering Method  Based on Forgetting Factors

Seed Power
  • The seed power spi for document di measures the appropriateness (moderate dependence) of di as a cluster seed

    $$ sp_i = \delta_i \times \psi_i \times \sum_{j=1}^{n} \mathrm{freq}(d_i, t_j) \times \delta'_j \times \psi'_j $$

  • freq(di, tj): the occurrence frequency of term tj within document di
  • $\delta'_j = \Pr(t_j \mid t_j)$: decoupling coefficient for term tj
  • $\psi'_j = 1 - \delta'_j$: coupling coefficient for term tj
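The seed power formula can be evaluated directly once the coefficients are known. The following is a minimal sketch, assuming the document's decoupling coefficient and the per-term decoupling coefficients have already been computed; the function name `seed_power` and its argument layout are illustrative and do not come from the slides.

```python
def seed_power(delta_i, term_freqs, term_deltas):
    """Seed power of one document (illustrative helper, not from the slides).

    delta_i     -- decoupling coefficient of the document
    term_freqs  -- freq(d_i, t_j) for j = 1..n
    term_deltas -- decoupling coefficient of each term t_j
    """
    psi_i = 1.0 - delta_i  # coupling coefficient of the document
    weighted_sum = sum(
        freq * delta_j * (1.0 - delta_j)  # (1 - delta_j) is the term's coupling coefficient
        for freq, delta_j in zip(term_freqs, term_deltas)
    )
    return delta_i * psi_i * weighted_sum
```

The product of a coefficient and its complement peaks at 0.5, which matches the "moderate dependence" reading: a document that is neither fully independent nor fully dependent scores highest.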

Page 13: An On-line  Document Clustering Method  Based on Forgetting Factors

C2ICM Clustering Algorithm (1): Initial phase
  1. Select new seeds based on the seed powers
  2. Other documents are assigned to the cluster with the most similar seed

[Figure: documents grouped around their seeds. Red: F1 & Schumacher; Green: U.S. Open Tennis]
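The two steps of the seed-based phase can be sketched as follows. This is a simplified illustration, not Can's full C2ICM algorithm: it assumes the number of clusters is given, that seed powers are already available, and that a `similarity(a, b)` callable exists. All names here are hypothetical.

```python
def cluster_by_seeds(seed_powers, similarity, num_clusters):
    """Pick the documents with the highest seed powers as seeds, then
    attach every other document to the cluster of its most similar seed."""
    by_power = sorted(range(len(seed_powers)),
                      key=lambda i: seed_powers[i], reverse=True)
    seeds = by_power[:num_clusters]
    clusters = {seed: [seed] for seed in seeds}
    for doc in range(len(seed_powers)):
        if doc in clusters:  # seeds already belong to their own cluster
            continue
        best_seed = max(seeds, key=lambda s: similarity(doc, s))
        clusters[best_seed].append(doc)
    return clusters
```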

Page 14: An On-line  Document Clustering Method  Based on Forgetting Factors

C2ICM Clustering Algorithm (2): Incremental update phase
  1. Select new seeds based on the seed powers
  2. Other documents are assigned to the cluster with the most similar seed

[Figure: clusters at the update time τ + Δτ. Red: F1 & Schumacher; Green: U.S. Open Tennis; Orange: Soccer World Cup]


Page 16: An On-line  Document Clustering Method  Based on Forgetting Factors

F2ICM Clustering Method
  • Extension of the C2ICM method
  • Main differences
      • Introduction of a new document similarity measure based on the notion of the forgetting factor: it places more weight on newer documents when generating clusters
      • Incremental maintenance of statistics
      • Automatic deletion of obsolete documents

Page 17: An On-line  Document Clustering Method  Based on Forgetting Factors

Outline
  • Background and Objectives
  • F2ICM Incremental Document Clustering Method
  • Document Similarity Based on Forgetting Factor
      • Document forgetting model
      • Derivation of document similarity measure
  • Updating Statistics and Probabilities
  • Document Expiration and Parameter Setting
  • Experimental Results
  • Conclusions and Future Work

Page 18: An On-line  Document Clustering Method  Based on Forgetting Factors

Document Similarity Based on Forgetting Factor
  • New document similarity measure based on the document forgetting model
      • Assumption: each delivered document gradually loses its value (weight) as time passes
      • Derivation of the document similarity measure based on this assumption
          • put high weights on new documents and low weights on old ones
          • old documents have little effect on clustering
  • Using the derived document similarity measure, we can achieve novelty-based clustering

Page 19: An On-line  Document Clustering Method  Based on Forgetting Factors

Document Forgetting Model (1)
  • Ti: acquisition time of document di
  • The information value (weight) of di at the current time τ is defined as

    $$ dw_i|_{\tau} = \lambda^{\tau - T_i} $$

  • The document weight decreases exponentially as time passes
  • λ (0 < λ < 1) determines the forgetting speed

[Figure: plot of dwi against time t; the weight is 1 at the acquisition time Ti and has decayed to λ^(τ − Ti) at the current time τ]
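The forgetting model amounts to a one-line function. The following is a minimal sketch, assuming times are plain numbers in whatever unit the application uses (days in the experiment reported later); the name `document_weight` is illustrative.

```python
def document_weight(acquired_at, now, forgetting_factor):
    """Weight of a document acquired at time T_i, seen at time tau: lambda ** (tau - T_i)."""
    if not 0.0 < forgetting_factor < 1.0:
        raise ValueError("forgetting factor must satisfy 0 < lambda < 1")
    if acquired_at > now:
        raise ValueError("a document cannot be acquired in the future")
    return forgetting_factor ** (now - acquired_at)
```

A document starts at weight 1 on arrival and is multiplied by the forgetting factor for every unit of time that passes.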

Page 20: An On-line  Document Clustering Method  Based on Forgetting Factors

Document Forgetting Model (2)
  • Why do we use the exponential forgetting model?
      • It inherits ideas from the behavioral law of human memory
          • The Power Law of Forgetting [1]: human memory exponentially decreases as time passes
      • Relationship with citation analysis:
          • Obsolescence (aging) of citations can be measured by measuring citation rates
          • Some simple obsolescence models take exponential forms
      • Efficiency: based on the model, we can obtain an efficient statistics maintenance procedure
      • Simplicity: we can control the forgetting speed using the parameter λ


Page 22: An On-line  Document Clustering Method  Based on Forgetting Factors

Our Approach for Document Similarity Derivation
  • Probabilistic derivation based on the document forgetting model
  • Let Pr(di, dj) be the probability of selecting the document pair (di, dj) from the document repository
  • We regard the co-occurrence probability Pr(di, dj) as their similarity sim(di, dj)

[Figure: documents di and dj are drawn from the repository with probability Pr(di, dj)]

Page 23: An On-line  Document Clustering Method  Based on Forgetting Factors

Derivation of Similarity Formula (1)
  • tdw: total weight of all the m documents
      • simple summation of all document weights

    $$ tdw = \sum_{l=1}^{m} dw_l, \qquad \text{where } dw_l|_{\tau} = \lambda^{\tau - T_l} $$

  • Pr(di): subjective probability of selecting document di from the repository

    $$ \Pr(d_i) = \frac{dw_i}{tdw} $$

      • Since old documents have small document weights, their selection probabilities are small

Page 24: An On-line  Document Clustering Method  Based on Forgetting Factors

Derivation of Similarity Formula (2)
  • Pr(tk|di): selection probability of term tk from document di

    $$ \Pr(t_k \mid d_i) = \frac{\mathrm{freq}(d_i, t_k)}{\sum_{l=1}^{n} \mathrm{freq}(d_i, t_l)} $$

      • freq(di, tk): the number of occurrences of tk in di
      • the probability corresponds to the term frequency

    $$ \mathrm{tf}(d_i, t_k) \equiv \Pr(t_k \mid d_i) $$

Page 25: An On-line  Document Clustering Method  Based on Forgetting Factors

Derivation of Similarity Formula (3)
  • Pr(tk): occurrence probability of term tk

    $$ \Pr(t_k) = \sum_{i=1}^{m} \Pr(t_k \mid d_i)\, \Pr(d_i) $$

      • this probability corresponds to the document frequency of term tk

    $$ \mathrm{df}(t_k) \equiv \Pr(t_k) $$

      • the reciprocal of df(tk) represents the IDF (inverse document frequency)

    $$ \mathrm{idf}(t_k) \equiv \frac{1}{\mathrm{df}(t_k)} $$

Page 26: An On-line  Document Clustering Method  Based on Forgetting Factors

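Derivation of Similarity Formula (4)
  • Using Bayes' theorem,

    $$ \Pr(d_j \mid t_k) = \Pr(d_j)\, \mathrm{tf}(d_j, t_k)\, \mathrm{idf}(t_k) $$

  • Then we get

    $$ \Pr(d_j \mid d_i) = \sum_{k=1}^{n} \Pr(d_j \mid t_k)\, \Pr(t_k \mid d_i) = \Pr(d_j) \sum_{k=1}^{n} \mathrm{tf}(d_i, t_k)\, \mathrm{tf}(d_j, t_k)\, \mathrm{idf}(t_k) $$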

Page 27: An On-line  Document Clustering Method  Based on Forgetting Factors

Derivation of Similarity Formula (5)
  • Therefore, the co-occurrence probability of di and dj is:

    $$ \Pr(d_i, d_j) = \Pr(d_i)\, \Pr(d_j) \sum_{k=1}^{n} \mathrm{tf}(d_i, t_k)\, \mathrm{tf}(d_j, t_k)\, \mathrm{idf}(t_k) $$

      • the factor Pr(di) Pr(dj): old documents have low similarity scores
      • the summation: inner product of document vectors based on TF-IDF weighting
  • The older a document di becomes, the smaller its similarity scores with other documents are, because old documents have low Pr(di) values
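The derivation on the preceding slides can be checked numerically. The following is a minimal sketch, assuming each document is given as a dictionary of raw term counts together with its current weight; all function names are illustrative and non-empty documents are assumed.

```python
def selection_probabilities(weights):
    """Pr(d_i) = dw_i / tdw."""
    total = sum(weights)
    return [w / total for w in weights]


def term_frequencies(counts):
    """tf(d_i, t_k) = freq(d_i, t_k) / sum_l freq(d_i, t_l), per document."""
    result = []
    for doc in counts:
        length = sum(doc.values())
        result.append({term: n / length for term, n in doc.items()})
    return result


def document_frequencies(tf, pr_d):
    """df(t_k) = Pr(t_k) = sum_i Pr(t_k | d_i) * Pr(d_i)."""
    df = {}
    for doc_tf, p in zip(tf, pr_d):
        for term, value in doc_tf.items():
            df[term] = df.get(term, 0.0) + value * p
    return df


def similarity(i, j, tf, pr_d, df):
    """Pr(d_i, d_j) = Pr(d_i) Pr(d_j) sum_k tf(d_i,t_k) tf(d_j,t_k) idf(t_k)."""
    shared = tf[i].keys() & tf[j].keys()  # other terms contribute zero
    inner = sum(tf[i][t] * tf[j][t] / df[t] for t in shared)
    return pr_d[i] * pr_d[j] * inner
```

With two documents of identical content but different ages, the older one receives a proportionally lower similarity to any third document, which is the novelty effect the measure is designed to produce.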


Page 29: An On-line  Document Clustering Method  Based on Forgetting Factors

Updating Statistics and Probabilities: when t = τ + Δτ
  1. arrival of new documents
  2. store new documents in the repository
  3. update statistics
  4. delete old documents
  5. present new clustering result

[Figure: the repository holds the document sets acquired up to t = τ; the set arriving at t = τ + Δτ is added, old documents are deleted, and the clustering module produces a new Cluster 1, ..., Cluster k]

Page 30: An On-line  Document Clustering Method  Based on Forgetting Factors

Approach to Update Processing (1)
  • In every incremental clustering step, we have to calculate document similarities
  • To compute similarities, we need to calculate document statistics and probabilities beforehand
  • It is inefficient to compute the statistics from scratch every time
  • Store the calculated statistics and probabilities and utilize them for later computation: incremental update processing

Page 31: An On-line  Document Clustering Method  Based on Forgetting Factors

Approach to Update Processing (2)
  • Formulation
      • d1, ..., dm: document set consisting of m documents
      • t1, ..., tn: index terms that appear in d1, ..., dm
      • t = τ: the latest update time of the document set
  • Assumption
      • when t = τ + Δτ, new documents dm+1, ..., dm+m′ are appended to the document set
      • the new documents dm+1, ..., dm+m′ introduce additional terms tn+1, ..., tn+n′
      • m ≫ m′ and n ≫ n′ are satisfied

Page 32: An On-line  Document Clustering Method  Based on Forgetting Factors

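Update Processing Method (1)
  • Update of document weight dwi
      • Since Δτ units of time have passed since the previous update time t = τ, the weight of each document decreases accordingly
      • For each new document, assign the initial value 1

    $$ dw_i|_{\tau+\Delta\tau} = \begin{cases} \lambda^{\tau+\Delta\tau-T_i} = \lambda^{\Delta\tau}\, dw_i|_{\tau} & (1 \le i \le m) \\ 1 & (m+1 \le i \le m+m') \end{cases} $$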

Page 33: An On-line  Document Clustering Method  Based on Forgetting Factors

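Update Processing Method (2)
  • Example of incremental update processing: updating from tdw|τ to tdw|τ+Δτ
  • Naive approach: compute tdw|τ+Δτ from scratch

    $$ tdw|_{\tau+\Delta\tau} = \sum_{l=1}^{m+m'} dw_l|_{\tau+\Delta\tau} = dw_1|_{\tau+\Delta\tau} + \cdots + dw_{m+m'}|_{\tau+\Delta\tau} = \lambda^{\tau+\Delta\tau-T_1} + \cdots + \lambda^{\tau+\Delta\tau-T_{m+m'}} $$

      • time consuming!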

Page 34: An On-line  Document Clustering Method  Based on Forgetting Factors

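Update Processing Method (3)
  • Smart approach: compute tdw|τ+Δτ incrementally

    $$ tdw|_{\tau+\Delta\tau} = \sum_{l=1}^{m+m'} \lambda^{\tau+\Delta\tau-T_l} = \sum_{l=1}^{m} \lambda^{\tau+\Delta\tau-T_l} + \sum_{l=m+1}^{m+m'} \lambda^{\tau+\Delta\tau-T_l} = \lambda^{\Delta\tau} \sum_{l=1}^{m} \lambda^{\tau-T_l} + \sum_{l=m+1}^{m+m'} 1 = \lambda^{\Delta\tau}\, tdw|_{\tau} + m' $$

  • exponential weighting enables efficient incremental computation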
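The equivalence of the naive and incremental computations is easy to verify in code. The following is a minimal sketch, assuming numeric times and that every new document is acquired exactly at the update time (so its initial weight is 1); the function names are illustrative.

```python
def naive_total_weight(acquisition_times, now, lam):
    """Recompute tdw from scratch: one exponentiation per stored document."""
    return sum(lam ** (now - t) for t in acquisition_times)


def update_weights(weights, elapsed, lam, num_new):
    """Decay every stored weight by lam ** elapsed and append weight 1 per new document."""
    decay = lam ** elapsed
    return [decay * w for w in weights] + [1.0] * num_new


def incremental_total_weight(prev_total, elapsed, lam, num_new):
    """tdw at tau + dtau from tdw at tau: lam ** dtau * tdw + m'."""
    return lam ** elapsed * prev_total + num_new
```

The incremental form needs only the previous total, the elapsed time, and the number of new documents, so its cost does not depend on how many documents are already stored.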

Page 35: An On-line  Document Clustering Method  Based on Forgetting Factors

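Update Processing Method (4)
  • The occurrence probability Pr(di) of each document can be easily recalculated

    $$ tdw|_{\tau+\Delta\tau} = \sum_{l=1}^{m+m'} \lambda^{\tau+\Delta\tau-T_l} = \lambda^{\Delta\tau}\, tdw|_{\tau} + m' $$

    $$ \Pr(d_i)|_{\tau+\Delta\tau} = \frac{dw_i|_{\tau+\Delta\tau}}{tdw|_{\tau+\Delta\tau}} $$

  • We need to calculate term frequencies tf(di, tk) only for the new documents dm+1, ..., dm+m′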

Page 36: An On-line  Document Clustering Method  Based on Forgetting Factors

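Update Processing Method (5)
  • Update formulas for the document frequency df(tk) of each term
      • we expand the formula of df(tk) as follows, then store each $\widetilde{\mathrm{tfd}}(t_k)$ permanently

    $$ \mathrm{df}(t_k)|_{\tau} = \frac{1}{tdw|_{\tau}}\, \widetilde{\mathrm{tfd}}(t_k)|_{\tau}, \qquad \widetilde{\mathrm{tfd}}(t_k)|_{\tau} = \sum_{i=1}^{m} dw_i|_{\tau}\, \mathrm{tf}(d_i, t_k) $$

      • $\widetilde{\mathrm{tfd}}(t_k)$ can be incrementally updated using the formula

    $$ \widetilde{\mathrm{tfd}}(t_k)|_{\tau+\Delta\tau} = \lambda^{\Delta\tau}\, \widetilde{\mathrm{tfd}}(t_k)|_{\tau} + \mathrm{tf}_{\mathrm{sum}}(t_k) $$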

Page 37: An On-line  Document Clustering Method  Based on Forgetting Factors

Update Processing Method (6)
  • Calculation of the new decoupling coefficient δi is easy:

    $$ \delta_i = \sum_{k=1}^{n+n'} \frac{\mathrm{tf}(d_i, t_k)^2}{\widetilde{\mathrm{tfd}}(t_k)|_{\tau+\Delta\tau}} $$

  • Update formulas for the decoupling coefficients of terms
      • incremental update is also possible
      • details are shown in the paper

Page 38: An On-line  Document Clustering Method  Based on Forgetting Factors

Summary of Update Processing
  • The following statistics are maintained persistently (m: no. of documents, n: no. of terms)
      • dwi: weight of document di (1 ≤ i ≤ m)
      • tdw: total weight of documents
      • freq(di, tk): term occurrence frequency (1 ≤ i ≤ m, 1 ≤ k ≤ n)
      • docleni: document length (1 ≤ i ≤ m)
      • $\widetilde{\mathrm{tfd}}(t_k)$: statistics to compute df(tk) (1 ≤ k ≤ n)
      • $\tilde{\delta}_i$: statistics to compute δi (1 ≤ i ≤ m)
  • Incremental statistics update cost: O(m + m′n) ≈ O(m + n), with storage cost O(m + n): linear cost
      • cf. the naive (non-incremental) method costs O(mn)


Page 40: An On-line  Document Clustering Method  Based on Forgetting Factors

Expiration of Old Documents (1): when t = τ + Δτ
  1. arrival of new documents
  2. store new documents in the repository
  3. update statistics
  4. delete old documents
  5. present new clustering result

[Figure: the repository holds the document sets acquired up to t = τ; the set arriving at t = τ + Δτ is added, old documents are deleted, and the clustering module produces a new Cluster 1, ..., Cluster k]

Page 41: An On-line  Document Clustering Method  Based on Forgetting Factors

Expiration of Old Documents (2)
  • Two reasons to delete old documents:
      • reduction of storage area
      • old documents have only tiny effects on the resulting clustering structure
  • Our approach:
      • If dwi < ε (ε is a small constant parameter) is satisfied, delete document di
      • When we delete di, the related statistics values are deleted
          • e.g., freq(di, tk)
          • details are in the proceedings and [6]

Page 42: An On-line  Document Clustering Method  Based on Forgetting Factors

Parameter Setting Methods
  • F2ICM uses two parameters in its algorithms:
      • forgetting factor λ (0 < λ < 1): specifies the forgetting speed
      • expiration parameter ε (0 < ε < 1): threshold value for document deletion
  • We use the following metaphors:
      • β: half-life span of the value of a document
          • $\lambda^{\beta} = 1/2$ is satisfied, namely $\lambda = \exp(-\log 2 / \beta)$
      • γ: life span of a document
          • ε is determined by $\varepsilon = \lambda^{\gamma}$
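Both parameters follow from the two metaphors by direct calculation. The following is a minimal sketch, assuming the half-life span and the life span are expressed in the same time unit as the document acquisition times; the function names are illustrative.

```python
import math


def forgetting_factor(half_life):
    """Forgetting factor whose half_life-th power equals 1/2."""
    return math.exp(-math.log(2) / half_life)


def expiration_threshold(lam, life_span):
    """Weight reached by a document exactly life_span time units after acquisition."""
    return lam ** life_span


def is_expired(weight, threshold):
    """A document is deleted once its weight falls below the threshold."""
    return weight < threshold
```

Setting the threshold to the weight at the end of the life span means a document is dropped as soon as it becomes older than that span, so the user can reason in days rather than in raw probabilities.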


Page 44: An On-line  Document Clustering Method  Based on Forgetting Factors

Dataset and Parameter Settings
  • Dataset: Mainichi Daily Newspaper articles
      • Each article consists of the following information:
          • issue date
          • subject area (e.g., economy, sports)
          • keyword list (50 to 150 words in Japanese)
      • The articles we used in the experiment:
          • issue date: January 1994 to February 1994
          • subject area: international affairs
  • Parameter settings
      • nc (no. of clusters) = 10
      • β (half-life span) = 7: the value of an article reduces to ½ in one week
      • γ (life span) = 30: every document will be deleted after 30 days

Page 45: An On-line  Document Clustering Method  Based on Forgetting Factors

Computational Cost for Clustering Sequences
  • Plot of CPU time and response time for each clustering, performed every day
  • Costs increase linearly until the 30th day, then become almost constant

[Figure: CPU time and response time in seconds (axis from 0 to 60) plotted against days (1 to 51); series: CPU Time, Response Time]

Page 46: An On-line  Document Clustering Method  Based on Forgetting Factors

Overview of Clustering Result (1)
  • Summarization of 10 clusters after 30 days (at January 31, 1994)

No. Subject

1 East Europe, NATO, Russia, Ukraine

2 Clinton (White Water/politics), military issue (Korea/Myanmar/Mexico/Indonesia)

3 China (import and export/U.S.)

4 U.S. politics (economic sanctions and Vietnam/elections)

5 Clinton (Syria/South East issue/visiting Europe), Europe (France/Italy/Switzerland)

6 South Africa (ANC/human rights), East Europe (Bosnia-Herzegovina, Croatia), Russia (Zhirinovsky/ruble/Ukraine)

7 Russia (economy/Moscow/U.S.), North Korea (IAEA/nuclear)

8 China (Patriot missiles/South Korea/Russia/Taiwan/economics)

9 Mexico (indigenous peoples/riot), Israel

10 South East Asia (Indonesia/Cambodia/Thailand), China (Taiwan/France), South Korea (politics)

Page 47: An On-line  Document Clustering Method  Based on Forgetting Factors

Overview of Clustering Result (2)
  • Summarization of 10 clusters after 57 days (at March 1, 1994)

No. Subject

1 Bosnia-Herzegovina (NATO/PKO/UN/Serbia), China (diplomacy)

2 U. S. issue (Japan/economy/New Zealand/Bosnia/Washington)

3 Myanmar, Russia, Mexico

4 Bosnia-Herzegovina (Sarajevo/Serbia), U.S. (North Korea/economy/military)

5 North Korea (IAEA/U.S./nuclear)

6 East Asia (Hebron/random shooting/PLO), Myanmar, Bosnia-Herzegovina

7 U.S. (society/crime/North Korea/IAEA)

8 U.N. (PKO/Bosnia-Herzegovina/EU), China

9 Bosnia-Herzegovina (U.N./PKO/Sarajevo), Russia (Moscow/Serbia)

10 Sarajevo (Bosnia-Herzegovina), China (Taiwan/Tibet)

Page 48: An On-line  Document Clustering Method  Based on Forgetting Factors

Summary of the Experiment
  • Brief observations
      • F2ICM groups similar articles into a cluster as long as an appropriate seed is selected
      • However, a cluster obtained in the experiment usually contains multiple topics, and different clusters contain similar topics: the clusters are not well separated
  • Reasons for the observed phenomena:
      • The selected seeds are not well separated in topic: a more sophisticated seed selection method is required
      • The number of keywords per article is rather small (50 to 150 words)


Page 50: An On-line  Document Clustering Method  Based on Forgetting Factors

Conclusions and Future Work
  • Conclusions
      • Development of an on-line clustering method that considers the novelty of documents
          • Introduction of the document forgetting model
          • F2ICM: Forgetting Factor-based Incremental Clustering Method
          • Incremental statistics update method (linear update cost)
          • Automatic document expiration and parameter setting methods
      • Preliminary report of the experiments
  • Current and Future Work
      • Revision of the clustering algorithms based on the Scatter/Gather approach [4]
      • More detailed experiments and their evaluation
      • Development of automatic parameter tuning methods