Self Organization of a Massive Document Collection
Advisor: Dr. Hsu
Graduate Student: Sheng-Hsuan Wang
Author : Teuvo Kohonen et al.
Outline
Motivation
Objective
Introduction
Self-Organizing Map
Statistical Models of Documents
Rapid Construction of Large Document Maps
The Document Map of All Electronic Patent Abstracts
Conclusion
Personal Opinion
Motivation
To improve the WEBSOM and to organize vast document collections according to textual similarities.
Objective
The main goal has been to scale up the SOM algorithm to be able to deal with large amounts of high-dimensional data.
Introduction
From simple searches to browsing of self-organized data collections.
Scope of this work: the WEBSOM.
Dimensionality reduction: latent semantic indexing (LSI), clustering of words into semantic categories, or a random projection method.
Self Organizing Map
The original SOM algorithm.
(1) m_i(t+1) = m_i(t) + h_{c(x),i}(t) [x(t) − m_i(t)]

(2) c(x) = arg min_i { ||x − m_i|| }

(3) h_{c(x),i}(t) = α(t) exp( −||r_i − r_{c(x)}||² / (2σ²(t)) )

where α(t) is the learning rate, σ(t) the neighborhood width, and r_i the location of unit i on the map grid.
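A minimal numerical sketch of the update rule (1)-(3); the learning-rate and neighborhood schedules below are illustrative assumptions, not the paper's:

```python
import numpy as np

def som_step(models, grid, x, t, lr=0.5, sigma0=2.0, tmax=1000):
    """One online SOM update: find the winner c(x), then pull every model
    toward x with a Gaussian neighborhood factor, as in Eqs. (1)-(3)."""
    alpha = lr * (1.0 - t / tmax)                       # decreasing learning rate
    sigma = sigma0 * (1.0 - t / tmax) + 0.01            # shrinking neighborhood
    c = np.argmin(np.linalg.norm(models - x, axis=1))   # Eq. (2): winner unit
    d2 = np.sum((grid - grid[c]) ** 2, axis=1)          # grid distances to winner
    h = alpha * np.exp(-d2 / (2.0 * sigma ** 2))        # Eq. (3): neighborhood
    return models + h[:, None] * (x - models)           # Eq. (1): update

# toy usage: a 3x3 map of 2-D model vectors
rng = np.random.default_rng(0)
grid = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)
models = rng.random((9, 2))
for t in range(100):
    models = som_step(models, grid, rng.random(2), t, tmax=100)
```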
Self Organizing Map
Batch-map SOM: used to accelerate the computation of the SOM. At convergence, every model vector must satisfy

(4) E_t { h_{c(x),i}(t) [x(t) − m_i(t)] } = 0

which gives the fixed point

(5) m_i* = Σ_t h_{c(x(t)),i} x(t) / Σ_t h_{c(x(t)),i}
Self Organizing Map
Let V_i be the set of all x(t) that have m_i as their closest model; V_i is called the Voronoi set of unit i. The number of samples x(t) falling into V_i is called n_i.
Step 1) Initialize the m_i by any proper method.
Step 2) Vector quantization: compute the mean of the samples in each Voronoi set,

(6) x̄_i = (1/n_i) Σ_{x(t) ∈ V_i} x(t)

Step 3) Smoothing:

(7) m_i* = Σ_j n_j h_{ji} x̄_j / Σ_j n_j h_{ji}
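The three steps above can be sketched as follows; the map size, data, and Gaussian neighborhood function are illustrative choices:

```python
import numpy as np

def batch_som_epoch(models, grid, X, sigma=1.0):
    """One batch-map cycle, Eqs. (6)-(7): partition the data into Voronoi
    sets, average each set, then smooth the means over the neighborhood."""
    # Step 2: vector quantization -- winners and Voronoi-set means (Eq. (6))
    d2 = ((X[:, None, :] - models[None, :, :]) ** 2).sum(-1)
    win = d2.argmin(axis=1)
    n = np.bincount(win, minlength=len(models)).astype(float)   # n_i
    xbar = np.zeros_like(models)
    np.add.at(xbar, win, X)
    xbar[n > 0] /= n[n > 0, None]
    # Step 3: smoothing (Eq. (7)): m_i* = sum_j n_j h_ji xbar_j / sum_j n_j h_ji
    g2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
    h = np.exp(-g2 / (2 * sigma ** 2))          # Gaussian h_ji on the grid
    w = n[:, None] * h                          # w[j, i] = n_j * h_ji
    return (w.T @ xbar) / w.sum(axis=0)[:, None]

rng = np.random.default_rng(0)
grid = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)
models = rng.random((9, 2))
X = rng.random((200, 2))
for _ in range(10):
    models = batch_som_epoch(models, grid, X, sigma=0.5)
```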
Statistical Models of Documents
Document histograms were formed over word clusters produced by self-organizing semantic maps; this system was called the WEBSOM.
Overview of the WEBSOM2 system.
Statistical Models of Documents
A. The Primitive Vector-Space Model
Inverse document frequency (IDF). Shannon entropy.
B. Latent Semantic Indexing (LSI)
Singular-value decomposition (SVD).
Statistical Models of Documents
C. Randomly Projected Histograms
The original document vector (the word histogram n_i) is multiplied by a rectangular random matrix R to obtain the projection

(8) x_i = R n_i

where the dimensionality of x_i is much smaller than that of n_i.
D. Histograms on the Word Category Map
The original version of the WEBSOM. The new method is random projection of the word histograms.
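Eq. (8) can be sketched numerically; the vocabulary size and projection dimensionality below are illustrative, not the paper's values:

```python
import numpy as np

# Sketch of Eq. (8): project a high-dimensional word histogram n_i
# down to x_i = R n_i with a rectangular random matrix R.
rng = np.random.default_rng(1)
vocab, dim = 1000, 100                                 # illustrative sizes
R = rng.standard_normal((dim, vocab)) / np.sqrt(dim)   # random projection matrix
n_i = rng.random(vocab)                                # a document's word histogram
x_i = R @ n_i                                          # Eq. (8)
```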
Statistical Models of Documents
E. Validation of the Random Projection Method by Small-Scale Preliminary Experiments
13,742 patents sampled from the whole corpus of 6,840,568 abstracts, with an equal number of patents from each of the 21 subsections.
Vocabulary of 1,814 words or word forms, with full 1,344-dimensional histograms as document vectors.
Statistical Models of Documents
F. Construction of Random Projections of Word Histograms by Pointers
Thresholding (+1 or −1). Sparse matrices (1 and 0).
Statistical Models of Documents
Hash table and pointers: the computing time was about 20% of that of the usual matrix-product method.
The computational complexity of the random projection with pointers is only O(Nl) + O(n); in contrast, the complexity of LSI is O(Nld).
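A sketch of the pointer construction: each word is given a handful of random pointers into the projected vector with thresholded ±1 weights, so only the words that occur in a document are touched. All sizes and names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, dim, n_ptrs = 1000, 100, 5     # illustrative sizes

# Per word: random target indexes and +-1 weights. This sparse
# "pointer table" replaces the dense random matrix R.
targets = rng.integers(0, dim, size=(vocab, n_ptrs))
signs = rng.choice([-1.0, 1.0], size=(vocab, n_ptrs))

def project(word_counts):
    """Accumulate the pointers of the words that actually occur,
    instead of computing the full matrix product R @ n."""
    x = np.zeros(dim)
    for w, count in word_counts.items():
        np.add.at(x, targets[w], count * signs[w])
    return x

x = project({3: 2.0, 17: 1.0, 999: 4.0})   # toy document: word id -> count
```

The result is identical to multiplying the full histogram by the equivalent sparse matrix, at a fraction of the cost.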
Rapid Construction of Large Document Maps
A. Fast Distance Computation
Tabulate the indexes of the nonzero components of each input vector, so that Euclidean distances between the sparse vectors and the models touch only those components.
We must use low-dimensional models.
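The tabulated-index idea can be sketched with the identity ||x − m||² = ||x||² + ||m||² − 2 xᵀm, so only the nonzero components of x are visited; names and sizes are illustrative:

```python
import numpy as np

def sparse_sq_dists(idx, val, models, model_sq_norms):
    """Squared Euclidean distances from a sparse vector (nonzero indexes
    `idx` with values `val`) to all dense model vectors, using
    ||x - m||^2 = ||x||^2 + ||m||^2 - 2 x.m over the tabulated indexes only."""
    x_sq = float(val @ val)
    dots = models[:, idx] @ val          # touches only the nonzero components
    return x_sq + model_sq_norms - 2.0 * dots

rng = np.random.default_rng(3)
models = rng.random((50, 1000))
msq = (models ** 2).sum(axis=1)          # precomputed once per epoch
idx = np.array([4, 40, 400])             # tabulated nonzero indexes
val = np.array([1.0, 2.0, 0.5])
d2 = sparse_sq_dists(idx, val, models, msq)
```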
Rapid Construction of Large Document Maps
B. Estimation of Larger Maps Based on Carefully Constructed Smaller Ones
The number of nodes of the SOM is increased during its construction. The new idea is to estimate good initial values for the model vectors of a very large map on the basis of the asymptotic values of the model vectors of a much smaller map.
Rapid Construction of Large Document Maps
Each unit d of the dense (large) map lies, on the map grid, within a triangle of units i, j, k of the sparse (small) map, and its model vector can be expressed through interpolation coefficients h:

(9) m_d^dense = h_i m_i^sparse + h_j m_j^sparse + h_k m_k^sparse

Applying the same coefficients to the asymptotic model vectors m′ of the smaller map gives the initial estimate for the large map:

(10) m̂_d^dense = h_i m′_i^sparse + h_j m′_j^sparse + h_k m′_k^sparse
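A rough sketch of the estimation idea, with plain bilinear interpolation on the map grid standing in for the paper's fitted coefficients (an assumption for illustration only):

```python
import numpy as np

def enlarge_map(sparse_models, sparse_shape, dense_shape):
    """Initialize a dense map's model vectors by interpolating the
    asymptotic model vectors of a smaller (sparse) map. Bilinear
    interpolation is used here as a stand-in; dense dims must be >= 2."""
    sr, sc = sparse_shape
    dr, dc = dense_shape
    M = sparse_models.reshape(sr, sc, -1)
    out = np.empty((dr, dc, M.shape[-1]))
    for r in range(dr):
        for c in range(dc):
            # position of dense unit (r, c) in sparse-grid coordinates
            u = r * (sr - 1) / (dr - 1)
            v = c * (sc - 1) / (dc - 1)
            r0, c0 = int(u), int(v)
            r1, c1 = min(r0 + 1, sr - 1), min(c0 + 1, sc - 1)
            a, b = u - r0, v - c0
            out[r, c] = ((1 - a) * (1 - b) * M[r0, c0]
                         + (1 - a) * b * M[r0, c1]
                         + a * (1 - b) * M[r1, c0]
                         + a * b * M[r1, c1])
    return out.reshape(dr * dc, -1)
```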
Rapid Construction of Large Document Maps
C. Rapid Fine-Tuning of the Large Maps
1) Addressing Old Winners: the winner location found on the previous, smaller map is stored for each sample and used to restrict the new search. (This is the same idea as in LAB!)
2) Initialization of the Pointers: the size of the maps is increased stepwise during learning, using formula (10).
The winner is the map unit for which the inner product with the data vector is the largest; by (10), the inner products with the estimated models reduce to inner products with the sparse-map models:

(11) x^T m̂_d^dense = h_i x^T m′_i^sparse + h_j x^T m′_j^sparse + h_k x^T m′_k^sparse
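A minimal sketch of winner search by inner products; the row normalization here is an assumption for the sketch, since with unit-norm model vectors the argmax of xᵀm coincides with the Euclidean winner:

```python
import numpy as np

def winner(models, x):
    """Winner search by inner product (cf. Eq. (11)): with unit-norm
    model vectors, argmax of x.m equals the Euclidean nearest model,
    because ||x - m||^2 = ||x||^2 + 1 - 2 x.m."""
    return int(np.argmax(models @ x))

rng = np.random.default_rng(4)
models = rng.random((20, 8))
models /= np.linalg.norm(models, axis=1, keepdims=True)  # normalize rows
x = rng.random(8)
c = winner(models, x)
```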
Rapid Construction of Large Document Maps
3) Parallelized Batch Map Algorithm: the winner search can be implemented as a parallel process.
4) Saving Memory by Reducing Representation Accuracy: sufficient accuracy can be maintained during the computation.
Rapid Construction of Large Document Maps
D. Performance Evaluation of the New Methods
1) Numerical Comparison with the Traditional SOM Algorithm: two performance indexes were used to measure the quality of the maps, the average quantization error and the classification accuracy.
Experiments: two sets of maps.
Rapid Construction of Large Document Maps
2) Comparison of the Computational Complexity:
O(dM²) stems from the computation of the small map.
O(dN) results from the VQ step (6) of the batch-map algorithm.
O(N²) refers to the estimation of the pointers.
(N: data samples; M: map units; d: dimensionality.)
The Document Map of All Electronic Patent Abstracts
A. Preprocessing
We first extracted the titles and the texts for further processing and removed nontextual information. Mathematical symbols and numbers were converted into special dummy symbols.
The corpus contained 733,179 different words. After a set of common words was removed, the remaining vocabulary consisted of 43,222 words. Finally, we omitted the 122,524 abstracts in which fewer than five words remained.
The Document Map of All Electronic Patent Abstracts
B. Formation of Statistical Models
The final dimensionality selected was 500, and five random pointers were used for each word.
The words were weighted using the Shannon entropy of their distribution of occurrence among the subsections of the patent classification system.
The Document Map of All Electronic Patent Abstracts
The weight is a measure of the unevenness of the distribution of the word in the subsections.
The weights were calculated as follows: let P_g(w) be the probability of a randomly chosen instance of the word w occurring in subsection g, and N_g the number of subsections.

Shannon entropy: H(w) = − Σ_g P_g(w) log P_g(w)
Weight: W(w) = H_max − H(w), where H_max = log N_g
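The Shannon-entropy weighting W(w) = H_max − H(w) can be sketched directly; the subsection counts below are toy values:

```python
import math

def entropy_weight(counts_by_subsection, n_subsections):
    """Shannon-entropy weight W(w) = H_max - H(w): words concentrated in
    few subsections get large weights, evenly spread words get ~0."""
    total = sum(counts_by_subsection)
    h = 0.0
    for c in counts_by_subsection:
        if c > 0:
            p = c / total                # P_g(w)
            h -= p * math.log(p)         # H(w) = -sum_g P_g log P_g
    return math.log(n_subsections) - h   # H_max = log N_g

w_even = entropy_weight([5, 5, 5, 5], 4)     # evenly spread -> weight 0
w_peaked = entropy_weight([20, 0, 0, 0], 4)  # concentrated -> weight log(4)
```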
The Document Map of All Electronic Patent Abstracts
C. Formation of the Document Map
500-dimensional document vectors were used. The map was enlarged sixteenfold twice and ninefold once. Each of the enlarged, estimated maps (cf. Section IV-B) was then fine-tuned by five batch-map iteration cycles.
The Document Map of All Electronic Patent Abstracts
D. Results
When each map node was labeled according to the majority of the subsections in the node, the resulting classification accuracy was 64%.
Conclusion
In this paper the emphasis has been on the scalability of the methods to very large text collections.
Contributions:
A document map larger than our previous ones.
A new method of forming statistical models of documents.
Several new fast computing methods.
Personal Opinion
Could the SOM be combined with domain knowledge, e.g., in IR or …?