Sept. 7, 2001 ECDL2001 1
An On-line Document Clustering Method Based on Forgetting Factors
Yoshiharu Ishikawa, Yibing Chen
Hiroyuki Kitagawa
University of Tsukuba, Japan
2
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
3
Background The Internet has enabled on-line document delivery services
- newsfeed services over the network
- periodically issued on-line journals
Important technologies (and applications) for on-line documents:
- information filtering
- document summarization, information extraction
- topic detection and tracking (TDT)
Clustering works as a core technique for these applications
4
Our Objectives (1) Development of an on-line clustering method which
considers the novelty of each document and presents a snapshot of clusters in an up-to-date manner. Example: articles from a sports news feed
[Timeline figure: topic streams over time: Soccer World Cup, Formula 1 & M. Schumacher, U.S. Open Tennis, other articles]
5
Our Objectives (2) Development of a novelty-based clustering
method for on-line documents Features:
- It gives higher weight to newer documents than to older ones and forgets obsolete ones: introduction of a new document similarity measure that considers the novelty and obsolescence of documents
- Incremental clustering processing: low processing cost to generate a new clustering result
- Automatic maintenance of target documents: obsolete documents are automatically deleted from the clustering target
6
Incremental Clustering Process (1): when t = 0 (initial state)
[Diagram: new documents, the Clustering Module, the document repository at t = 0, and Cluster 1, ..., Cluster k]
1. arrival of new documents
2. store new documents in the repository
3. calculate and store statistics
4. cluster documents and present the result
7
Incremental Clustering Process (2): when t = 1
[Diagram: the repositories at t = 0 and t = 1, the Clustering Module, and Cluster 1, ..., Cluster k before and after the update]
1. arrival of new documents
2. store new documents in the repository
3. update statistics
4. cluster documents and present the result
8
Incremental Clustering Process (3): when t = τ + 1
[Diagram: the repositories at t = 0, ..., t = τ, t = τ + 1, the Clustering Module, and Cluster 1, ..., Cluster k before and after the update]
1. arrival of new documents
2. store new documents in the repository
3. update statistics
4. delete old documents
5. cluster documents and present the result
9
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
  - C2ICM Clustering Method
  - F2ICM Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
10
C2ICM Clustering Method
- Cover-Coefficient-based Incremental Clustering Methodology
- Proposed by F. Can (ACM TOIS, 1993) [3]
- Incremental clustering method with low update cost
- Seed-based clustering method
  - Based on the concept of seed powers
  - Seed powers are defined probabilistically
  - Documents with the highest seed powers are selected as cluster seeds
11
Decoupling/Coupling Coefficients Two important notions in C2ICM method
used to calculate seed powers
Decoupling coefficient of document d_i: $\delta_i \equiv \Pr(d_i \mid d_i)$
- the probability that the document d_i is obtained when the document d_i itself is given
- an index to measure the independence of d_i
Coupling coefficient of document d_i: $\psi_i = 1 - \delta_i$
- an index to measure the dependence of d_i
12
Seed Power
Seed power spi for document di measures the appropriateness (moderate dependence) of di as a cluster seed
freq(di, tj): the occurrence frequency of term tj within document di
$\delta'_j \equiv \Pr(t_j \mid t_j)$: decoupling coefficient for term t_j
$\psi'_j = 1 - \delta'_j$: coupling coefficient for term t_j
$sp_i = \delta_i \cdot \psi_i \cdot \sum_{j=1}^{n} freq(d_i, t_j) \cdot \delta'_j \cdot \psi'_j$
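As an illustration, here is a minimal Python sketch of the seed-power computation above. The dict-based document representation and the names (freq_i, term_delta, term_psi) are assumptions made for the example, not part of the original method description.

```python
def seed_power(freq_i, delta_i, psi_i, term_delta, term_psi):
    """Seed power sp_i = delta_i * psi_i * sum_j freq(d_i, t_j) * delta'_j * psi'_j.

    freq_i     : dict, term -> occurrence frequency of the term in document d_i
    delta_i    : decoupling coefficient of d_i
    psi_i      : coupling coefficient of d_i (1 - delta_i)
    term_delta : dict, term -> decoupling coefficient delta'_j of the term
    term_psi   : dict, term -> coupling coefficient psi'_j (1 - delta'_j)
    """
    total = sum(f * term_delta[t] * term_psi[t] for t, f in freq_i.items())
    return delta_i * psi_i * total
```

Documents with the highest sp_i values would then be chosen as cluster seeds.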
13
C2ICM Clustering Algorithm (1) Initial phase
1. Select new seeds based on the seed powers
2. Other documents are assigned to the cluster with the most similar seed
[Figure legend: Red: F1 & Schumacher; Green: U.S. Open Tennis]
14
C2ICM Clustering Algorithm (2) Incremental update phase
1. Select new seeds based on the seed powers
2. Other documents are assigned to the cluster with the most similar seed
[Figure legend: Red: F1 & Schumacher; Green: U.S. Open Tennis; Orange: Soccer World Cup]
15
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
  - C2ICM Clustering Method
  - F2ICM Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
16
F2ICM Clustering Method Extension of the C2ICM method. Main differences:
- Introduction of a new document similarity measure based on the notion of the forgetting factor: it gives higher weight to newer documents when generating clusters
- Incremental maintenance of statistics
- Automatic deletion of obsolete old documents
17
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
  - Document forgetting model
  - Derivation of document similarity measure
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
18
Document Similarity Based on Forgetting Factor
New document similarity measure based on the document forgetting model
- Assumption: each delivered document gradually loses its value (weight) as time passes
- Derivation of a document similarity measure based on this assumption: put high weights on new documents and low weights on old ones, so that old documents have little effect on clustering
- Using the derived document similarity measure, we can achieve novelty-based clustering
19
Document Forgetting Model (1) Ti : acquisition time of
document di
Information value (weight) of d_i is defined as
$dw_i|_\tau = \lambda^{\tau - T_i}$
(T_i: acquisition time of document d_i, τ: current time)
Document weight exponentially decreases as time passes
λ (0 < λ < 1) determines the forgetting speed
[Graph: dw_i starts at 1 at the acquisition time T_i and decays exponentially with the current time t]
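A minimal sketch of the forgetting model, assuming time is measured in discrete units (e.g. days) and that the weight is 1 at the acquisition time; the function name is only illustrative.

```python
def document_weight(acq_time, now, lam):
    """dw_i at the current time: lam ** (now - acq_time), with 0 < lam < 1."""
    assert 0.0 < lam < 1.0 and now >= acq_time
    return lam ** (now - acq_time)

# With lam = 0.5 a document loses half of its value per time unit:
# document_weight(0, 3, 0.5) -> 0.125
```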
20
Document Forgetting Model (2) Why do we use the exponential forgetting model?
- It inherits ideas from the behavioral law of human memory: the Power Law of Forgetting [1] states that human memory exponentially decreases as time passes
- Relationship with citation analysis: obsolescence (aging) of citations can be measured by measuring citation rates, and some simple obsolescence models take exponential forms
- Efficiency: based on the model, we can obtain an efficient statistics maintenance procedure
- Simplicity: we can control the forgetting speed using the parameter λ
21
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
  - Document forgetting model
  - Derivation of document similarity measure
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
22
Our Approach for Document Similarity Derivation
Probabilistic derivation based on the document forgetting model
Let Pr(di, dj) be the probability of selecting the document pair (di, dj) from the document repository
We regard the cooccurrence probability Pr(di, dj) as their similarity sim(di, dj)
[Diagram: drawing the document pair (d_i, d_j) from the repository with probability Pr(d_i, d_j)]
23
Derivation of Similarity Formula (1) tdw: total weight of all the m documents
simple summation of all document weights
Pr(di): subjective probability to select document di from the repository
$tdw|_\tau = \sum_{l=1}^{m} dw_l|_\tau$, where $dw_l|_\tau = \lambda^{\tau - T_l}$
$\Pr(d_i) = \frac{dw_i}{tdw}$
Since old documents have small document weights, their selection probabilities are small
24
Derivation of Similarity Formula (2) Pr(tk|di): selection probability of term tk from
document di
freq(di, tk): the number of occurrences of tk in di; this probability corresponds to term frequency
$\Pr(t_k \mid d_i) = \frac{freq(d_i, t_k)}{\sum_{l=1}^{n} freq(d_i, t_l)}$
$tf(d_i, t_k) \equiv \Pr(t_k \mid d_i)$
25
Derivation of Similarity Formula (3) Pr(tk): occurrence probability of term tk
this probability corresponds to document frequency of term tk
the reciprocal of df(tk) represents IDF (inverse document frequency)
$df(t_k) \equiv \Pr(t_k) = \sum_{i=1}^{m} \Pr(t_k \mid d_i)\,\Pr(d_i)$
$idf(t_k) = \frac{1}{df(t_k)}$
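The estimates on the last three slides translate directly into code. Below is a small sketch assuming each document is represented as a dict of term frequencies and that the document weights dw and their total tdw are given; all names are illustrative only.

```python
def tf(freq_i, term):
    """tf(d_i, t_k) = Pr(t_k | d_i): relative frequency of the term in the document."""
    total = sum(freq_i.values())
    return freq_i.get(term, 0) / total if total else 0.0

def doc_prob(dw_i, tdw):
    """Pr(d_i) = dw_i / tdw."""
    return dw_i / tdw

def df(term, docs, weights, tdw):
    """df(t_k) = Pr(t_k) = sum_i Pr(t_k | d_i) Pr(d_i); idf(t_k) is its reciprocal."""
    return sum(tf(freq_i, term) * doc_prob(dw_i, tdw)
               for freq_i, dw_i in zip(docs, weights))
```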
26
Derivation of Similarity Formula (4) Using Bayes’ theorem,
$\Pr(d_j \mid t_k) = \frac{\Pr(t_k \mid d_j)\,\Pr(d_j)}{\Pr(t_k)} = tf(d_j, t_k) \cdot \Pr(d_j) \cdot idf(t_k)$
Then we get
$\Pr(d_j \mid d_i) = \sum_{k=1}^{n} \Pr(d_j \mid t_k)\,\Pr(t_k \mid d_i) = \Pr(d_j)\sum_{k=1}^{n} tf(d_i, t_k)\, tf(d_j, t_k)\, idf(t_k)$
27
Derivation of Similarity Formula (5) Therefore, the cooccurrence probability of di, dj is:
$sim(d_i, d_j) \equiv \Pr(d_i, d_j) = \Pr(d_i)\,\Pr(d_j)\sum_{k=1}^{n} tf(d_i, t_k)\, tf(d_j, t_k)\, idf(t_k)$
The older a document di becomes, the smaller its similarity scores with other documents are, because old documents have low Pr(di) values
The factors Pr(di) and Pr(dj) give old documents low similarity scores; the summation is the inner product of document vectors based on TF-IDF weighting
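Putting the pieces together, a sketch of the resulting similarity measure, reusing the tf estimate above; df_table is assumed to map each term to a precomputed df(t_k) value.

```python
def similarity(freq_i, dw_i, freq_j, dw_j, df_table, tdw):
    """sim(d_i, d_j) = Pr(d_i) Pr(d_j) * sum_k tf(d_i,t_k) tf(d_j,t_k) idf(t_k)."""
    shared = set(freq_i) & set(freq_j)          # only shared terms contribute
    inner = sum(tf(freq_i, t) * tf(freq_j, t) / df_table[t] for t in shared)
    return (dw_i / tdw) * (dw_j / tdw) * inner
```

The two leading factors shrink as documents age, which is how old documents end up with low similarity scores.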
28
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
  - Document forgetting model
  - Derivation of document similarity measure
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
29
Updating Statistics and Probabilities: when t = τ + 1
[Diagram: the repositories at t = 0, ..., t = τ, t = τ + 1, the Clustering Module, and Cluster 1, ..., Cluster k before and after the update]
1. arrival of new documents
2. store new documents in the repository
3. update statistics
4. delete old documents
5. present new clustering result
30
Approach to Update Processing (1) In every incremental clustering step, we have to
calculate document similarities To compute similarities, we need to calculate
document statistics and probabilities beforehand It is inefficient to compute statistics every time from
scratch
Store the calculated statistics and probabilities and utilize them for later computation
Incremental Update Processing
31
Approach to Update Processing (2) Formulation
d1, ..., dm: document set consisting of m documents
t1, ..., tn: index terms that appear in d1, ..., dm
t = τ: the latest update time of the document set
Assumption: when t = τ + 1, new documents dm+1, ..., dm+m' are appended to the document set
The new documents dm+1, ..., dm+m' introduce additional terms tn+1, ..., tn+n'
m >> m' and n >> n' are satisfied
32
Update Processing Method (1) Update of document weight dwi
Since one unit of time has passed since the previous update time t = τ, the weight of each existing document decreases according to
$dw_i|_{\tau+1} = \lambda^{(\tau+1) - T_i} = \lambda \cdot dw_i|_\tau \quad (1 \le i \le m)$
For each new document, assign the initial value 1:
$dw_i|_{\tau+1} = 1 \quad (m+1 \le i \le m+m')$
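A one-step sketch of this weight update, assuming the weights are kept in a plain list ordered by document id; existing weights are decayed by λ and each new document starts at weight 1.

```python
def update_weights(old_weights, num_new, lam):
    """dw_i|tau+1 = lam * dw_i|tau for existing documents, 1.0 for new ones."""
    return [lam * w for w in old_weights] + [1.0] * num_new
```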
33
Update Processing Method (2) Example of incremental update processing: updating from tdw|_τ to tdw|_{τ+1}
Naive approach: compute tdw|_{τ+1} from scratch
$tdw|_{\tau+1} = \sum_{l=1}^{m+m'} dw_l|_{\tau+1} = \sum_{l=1}^{m} \lambda^{(\tau+1)-T_l} + \sum_{l=m+1}^{m+m'} \lambda^{(\tau+1)-T_l}$
time consuming!
34
Update Processing Method (3) Smart approach: compute tdw|_{τ+1} incrementally
exponential weighting enables efficient incremental computation
$tdw|_{\tau+1} = \sum_{l=1}^{m+m'} dw_l|_{\tau+1} = \lambda \sum_{l=1}^{m} \lambda^{\tau - T_l} + \sum_{l=m+1}^{m+m'} 1 = \lambda \cdot tdw|_\tau + m'$
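The identity above makes the total weight a one-line update. The sketch below also checks the incremental result against the naive recomputation on a tiny made-up example (the concrete numbers are only for illustration).

```python
def update_tdw(tdw_old, num_new, lam):
    """tdw|tau+1 = lam * tdw|tau + m' (m' = number of newly arrived documents)."""
    return lam * tdw_old + num_new

lam = 0.9
weights_tau = [1.0, lam, lam ** 2]                           # weights at time tau
weights_next = [lam * w for w in weights_tau] + [1.0, 1.0]   # two new documents
assert abs(update_tdw(sum(weights_tau), 2, lam) - sum(weights_next)) < 1e-12
```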
35
Updating Processing Method (4) Occurrence probability of each document Pr(di)
can be easily recalculated
We need to calculate term frequencies tf(di, tk) only for new documents dm + 1, ..., dm + m’
$\Pr(d_i)|_{\tau+1} = \frac{dw_i|_{\tau+1}}{tdw|_{\tau+1}}$, where $tdw|_{\tau+1} = \lambda \cdot tdw|_\tau + m'$
36
Updating Processing Method (5) Update formulas for document frequency of each
term df(tk): we expand the formula of df(tk) as follows, then store each $\widetilde{df}(t_k)$ permanently
$df(t_k)|_\tau = \frac{1}{tdw|_\tau}\,\widetilde{df}(t_k)|_\tau$, where $\widetilde{df}(t_k)|_\tau = \sum_{i=1}^{m} dw_i|_\tau \cdot tf(d_i, t_k)$
$\widetilde{df}(t_k)$ can be incrementally updated using the formula
$\widetilde{df}(t_k)|_{\tau+1} = \lambda \cdot \widetilde{df}(t_k)|_\tau + tf_{sum}(t_k)$
where $tf_{sum}(t_k)$ is the sum of tf(di, tk) over the new documents
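A sketch of this incremental update, assuming the stored statistics live in a dict keyed by term and that tf values for the new documents are available; the sum tf_sum(t_k) is accumulated on the fly.

```python
def update_df_tilde(df_tilde_old, new_docs_tf, lam):
    """d~f(t_k)|tau+1 = lam * d~f(t_k)|tau + tf_sum(t_k).

    df_tilde_old : dict, term -> d~f(t_k) at time tau
    new_docs_tf  : list of dicts, term -> tf(d_i, t_k) for each new document
    """
    df_tilde = {t: lam * v for t, v in df_tilde_old.items()}   # decay stored values
    for tf_i in new_docs_tf:                                   # add new contributions
        for t, v in tf_i.items():
            df_tilde[t] = df_tilde.get(t, 0.0) + v
    return df_tilde
```

df(t_k) at the new time point is then the stored value divided by tdw|_{τ+1}.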
37
Update Processing Method (6) Calculation of the new decoupling coefficient δ_i is easy:
$\delta_i = dw_i|_{\tau+1} \sum_{k=1}^{n+n'} \frac{tf(d_i, t_k)^2}{\widetilde{df}(t_k)|_{\tau+1}}$
Update formulas for the decoupling coefficients of terms: incremental update is also possible; details are shown in the paper
38
Summary of Update Processing The following statistics are maintained persistently (m: no. of documents, n: no. of terms):
- dw_i: weight of document d_i (1 ≤ i ≤ m)
- tdw: total weight of documents
- freq(d_i, t_k): term occurrence frequency (1 ≤ i ≤ m, 1 ≤ k ≤ n)
- doclen_i: document length (1 ≤ i ≤ m)
- $\widetilde{df}(t_k)$: statistics to compute df(t_k) (1 ≤ k ≤ n)
- $\tilde{\delta}_i$: statistics to compute δ_i (1 ≤ i ≤ m)
Incremental statistics update cost O(m + m'n) ≈ O(m + n) with storage cost O(m + n): linear cost (cf. the naive, non-incremental method costs O(mn))
39
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
40
Expiration of Old Documents (1): when t = τ + 1
[Diagram: the repositories at t = 0, ..., t = τ, t = τ + 1, the Clustering Module, and Cluster 1, ..., Cluster k before and after the update]
1. arrival of new documents
2. store new documents in the repository
3. update statistics
4. delete old documents
5. present new clustering result
41
Expiration of Old Documents (2) Two reasons to delete old documents:
- reduction of storage area
- old documents have only tiny effects on the resulting clustering structure
Our approach:
- If dw_i < ε (ε is a small parameter constant) is satisfied, delete document d_i
- When we delete d_i, the related statistics values are also deleted (e.g., freq(d_i, t_k)); details are in the proceedings and [6]
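A sketch of the expiration test, assuming the current weights are kept in a dict keyed by document id; deleting the associated per-document statistics (freq, doclen, ...) is left to the caller.

```python
def expired_documents(weights, epsilon):
    """Return the ids of documents whose weight has fallen below epsilon."""
    return [doc_id for doc_id, w in weights.items() if w < epsilon]
```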
42
Parameter Setting Methods F2ICM uses two parameters in its algorithms:
forgetting factor λ (0 < λ < 1): specifies the forgetting speed
expiration parameter ε (0 < ε < 1): threshold value for document deletion
We use the following metaphors:
β: half-life span of the value of a document; $\lambda^{\beta} = 1/2$ is satisfied, namely $\lambda = \exp(-\log 2 / \beta)$
γ: life span of a document; ε is determined by $\epsilon = \lambda^{\gamma}$
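These two metaphors translate directly into code. A small sketch, using β and γ for the half-life and life-span parameters as introduced above; the concrete values in the comments are the ones used later in the experiments.

```python
import math

def forgetting_factor(beta):
    """Choose lam so that lam ** beta = 1/2 (beta = half-life span)."""
    return math.exp(-math.log(2) / beta)

def expiration_threshold(lam, gamma):
    """epsilon = lam ** gamma (gamma = life span of a document)."""
    return lam ** gamma

lam = forgetting_factor(7)            # article value halves in one week
eps = expiration_threshold(lam, 30)   # documents expire after 30 days
```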
43
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
44
Dataset and Parameter Settings Dataset: Mainichi Daily Newspaper articles
Each article consists of the following information: issue date, subject area (e.g., economy, sports), and a keyword list (50 to 150 words in Japanese)
The articles used in the experiment: issue date: January 1994 to February 1994; subject area: international affairs
Parameter settings:
- nc (no. of clusters) = 10
- β (half-life span) = 7: the value of an article reduces to ½ in one week
- γ (life span) = 30: every document will be deleted after 30 days
45
Computational Cost for Clustering Sequences
Plot of CPU time and response time for each clustering run performed every day
Costs increase linearly until the 30th day, then become almost constant
[Graph: CPU/response time (sec), 0 to 60, versus days; two series: CPU Time and Response Time]
46
Overview of Clustering Result (1) Summarization of 10 clusters after 30 days (at
January 31, 1994)
No. Subject
1 East Europe, NATO, Russia, Ukraine
2 Clinton (White Water/politics), military issue (Korea/Myanmar/Mexico/Indonesia)
3 China (import and export/U.S.)
4 U.S. politics (economic sanctions and Vietnam/elections)
5 Clinton (Syria/South East issue/visiting Europe), Europe (France/Italy/Switzerland)
6 South Africa (ANC/human rights), East Europe (Bosnia-Herzegovina, Croatia), Russia (Zhirinovsky/ruble/Ukraine)
7 Russia (economy/Moscow/U.S.), North Korea (IAEA/nuclear)
8 China (Patriot missiles/South Korea/Russia/Taiwan/economics)
9 Mexico (indigenous peoples/riot), Israel
10 South East Asia (Indonesia/Cambodia/Thailand), China (Taiwan/France), South Korea (politics)
47
Overview of Clustering Result (2) Summarization of 10 clusters after 57 days (at March
1, 1994)
No. Subject
1 Bosnia-Herzegovina (NATO/PKO/UN/Serbia), China (diplomacy)
2 U. S. issue (Japan/economy/New Zealand/Bosnia/Washington)
3 Myanmar, Russia, Mexico
4 Bosnia-Herzegovina (Sarajevo/Serbia), U.S. (North Korea/economy/military)
5 North Korea (IAEA/U.S./nuclear)
6 East Asia (Hebron/random shooting/PLO), Myanmar, Bosnia-Herzegovina
7 U.S. (society/crime/North Korea/IAEA)
8 U.N. (PKO/Bosnia-Herzegovina/EU), China
9 Bosnia-Herzegovina (U.N./PKO/Sarajevo), Russia (Moscow/Serbia)
10 Sarajevo (Bosnia-Herzegovina), China (Taiwan/Tibet)
48
Summary of the Experiment Brief observations
F2ICM groups similar articles into a cluster as long as an appropriate seed is selected
But a cluster obtained in the experiment usually contains multiple topics, and different clusters contain similar topics: clusters are not well separated
Reasons for the observed phenomena:
- The selected seeds are not well separated in topics: a more sophisticated seed selection method is required
- The number of keywords for an article is rather small (50 to 150 words)
49
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
50
Conclusions and Future Work
Conclusions
- Development of an on-line clustering method which considers the novelty of documents
- Introduction of the document forgetting model
- F2ICM: Forgetting Factor-based Incremental Clustering Method
- Incremental statistics update method (linear update cost)
- Automatic document expiration and parameter setting methods
- Preliminary report of the experiments
Current and Future Work
- Revision of the clustering algorithms based on the Scatter/Gather approach [4]
- More detailed experiments and their evaluation
- Development of automatic parameter tuning methods