Self Organization of a Massive Document Collection
Advisor: Dr. Hsu
Graduate Student: Sheng-Hsuan Wang
Author : Teuvo Kohonen et al.
Outline
Motivation
Objective
Introduction
Self-Organizing Map
Statistical Models of Documents
Rapid Construction of Large Document Maps
The Document Map of All Electronic Patent Abstracts
Conclusion
Personal Opinion
Motivation
To improve the WEBSOM and to organize vast document collections according to textual similarities.
Objective
The main goal has been to scale up the SOM algorithm to be able to deal with large amounts of high-dimensional data.
Introduction
From simple searches to browsing of self-organized data collections.
Scope of this work: the WEBSOM.
Dimensionality reduction: latent semantic indexing (LSI), clustering of words into semantic categories, or a random projection method.
Self Organizing Map
The original SOM algorithm.
(1) m_i(t+1) = m_i(t) + h_{c(x),i}(t) [x(t) − m_i(t)]

(2) c(x) = arg min_i { ||x − m_i|| }

(3) h_{c(x),i}(t) = α(t) exp( −||r_i − r_{c(x)}||² / (2σ²(t)) )

where α(t) is the learning rate, σ(t) the neighborhood width, and r_i the location of unit i on the map grid.
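A minimal numerical sketch of the update rule (1)-(3); the learning-rate and neighborhood schedules below are illustrative assumptions, not the paper's:

```python
import numpy as np

def som_step(models, grid, x, t, lr=0.5, sigma0=2.0, tmax=1000):
    """One online SOM update: find the winner c(x), then pull every model
    toward x with a Gaussian neighborhood factor, as in Eqs. (1)-(3)."""
    alpha = lr * (1.0 - t / tmax)                       # decreasing learning rate
    sigma = sigma0 * (1.0 - t / tmax) + 0.01            # shrinking neighborhood
    c = np.argmin(np.linalg.norm(models - x, axis=1))   # Eq. (2): winner unit
    d2 = np.sum((grid - grid[c]) ** 2, axis=1)          # grid distances to winner
    h = alpha * np.exp(-d2 / (2.0 * sigma ** 2))        # Eq. (3): neighborhood
    return models + h[:, None] * (x - models)           # Eq. (1): update

# toy usage: a 3x3 map of 2-D model vectors
rng = np.random.default_rng(0)
grid = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)
models = rng.random((9, 2))
for t in range(100):
    models = som_step(models, grid, rng.random(2), t, tmax=100)
```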
Self Organizing Map
Batch-map SOM: used to accelerate the computation of the SOM. At convergence, every model vector must satisfy

(4) E_t { h_{c(x),i}(t) [x(t) − m_i(t)] } = 0

which gives the fixed point

(5) m_i* = Σ_t h_{c(x(t)),i} x(t) / Σ_t h_{c(x(t)),i}
Self Organizing Map
Let V_i be the set of all x(t) that have m_i as their closest model; V_i is called the Voronoi set of unit i. The number of samples x(t) falling into V_i is called n_i.
Step 1) Initialize the m_i by any proper method.
Step 2) Vector quantization: compute the mean of the samples in each Voronoi set,

(6) x̄_i = (1/n_i) Σ_{x(t) ∈ V_i} x(t)

Step 3) Smoothing:

(7) m_i* = Σ_j n_j h_{ji} x̄_j / Σ_j n_j h_{ji}
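The three steps above can be sketched as follows; the map size, data, and Gaussian neighborhood function are illustrative choices:

```python
import numpy as np

def batch_som_epoch(models, grid, X, sigma=1.0):
    """One batch-map cycle, Eqs. (6)-(7): partition the data into Voronoi
    sets, average each set, then smooth the means over the neighborhood."""
    # Step 2: vector quantization -- winners and Voronoi-set means (Eq. (6))
    d2 = ((X[:, None, :] - models[None, :, :]) ** 2).sum(-1)
    win = d2.argmin(axis=1)
    n = np.bincount(win, minlength=len(models)).astype(float)   # n_i
    xbar = np.zeros_like(models)
    np.add.at(xbar, win, X)
    xbar[n > 0] /= n[n > 0, None]
    # Step 3: smoothing (Eq. (7)): m_i* = sum_j n_j h_ji xbar_j / sum_j n_j h_ji
    g2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
    h = np.exp(-g2 / (2 * sigma ** 2))          # Gaussian h_ji on the grid
    w = n[:, None] * h                          # w[j, i] = n_j * h_ji
    return (w.T @ xbar) / w.sum(axis=0)[:, None]

rng = np.random.default_rng(0)
grid = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)
models = rng.random((9, 2))
X = rng.random((200, 2))
for _ in range(10):
    models = batch_som_epoch(models, grid, X, sigma=0.5)
```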
Statistical Models of Documents
Document histograms were formed over word clusters produced by self-organizing semantic maps; this system was called the WEBSOM.
Overview of the WEBSOM2 system.
Statistical Models of Documents
A. The Primitive Vector-Space Model
Inverse document frequency (IDF). Shannon entropy.
B. Latent Semantic Indexing (LSI)
Singular-value decomposition (SVD).
Statistical Models of Documents
C. Randomly Projected Histograms
The original document vector (the word histogram n_i) is multiplied by a rectangular random matrix R to obtain the projection

(8) x_i = R n_i

where the dimensionality of x_i is much smaller than that of n_i.
D. Histograms on the Word Category Map
The original version of the WEBSOM. The new method is random projection of the word histograms.
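Eq. (8) can be sketched numerically; the vocabulary size and projection dimensionality below are illustrative, not the paper's values:

```python
import numpy as np

# Sketch of Eq. (8): project a high-dimensional word histogram n_i
# down to x_i = R n_i with a rectangular random matrix R.
rng = np.random.default_rng(1)
vocab, dim = 1000, 100                                 # illustrative sizes
R = rng.standard_normal((dim, vocab)) / np.sqrt(dim)   # random projection matrix
n_i = rng.random(vocab)                                # a document's word histogram
x_i = R @ n_i                                          # Eq. (8)
```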
Statistical Models of Documents
E. Validation of the Random Projection Method by Small-Scale Preliminary Experiments
13,742 patents sampled from the whole corpus of 6,840,568 abstracts, with an equal number of patents from each of the 21 subsections.
Vocabulary of 1,814 words or word forms, with full 1,344-dimensional histograms as document vectors.
Statistical Models of Documents
F. Construction of Random Projections of Word Histograms by Pointers
Thresholding (+1 or −1). Sparse matrices (1 and 0).
Statistical Models of Documents
Hash table and pointers: the computing time was about 20% of that of the usual matrix-product method.
The computational complexity of the random projection with pointers is only O(Nl) + O(n); in contrast, the complexity of LSI is O(Nld).
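A sketch of the pointer construction: each word is given a handful of random pointers into the projected vector with thresholded ±1 weights, so only the words that occur in a document are touched. All sizes and names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, dim, n_ptrs = 1000, 100, 5     # illustrative sizes

# Per word: random target indexes and +-1 weights. This sparse
# "pointer table" replaces the dense random matrix R.
targets = rng.integers(0, dim, size=(vocab, n_ptrs))
signs = rng.choice([-1.0, 1.0], size=(vocab, n_ptrs))

def project(word_counts):
    """Accumulate the pointers of the words that actually occur,
    instead of computing the full matrix product R @ n."""
    x = np.zeros(dim)
    for w, count in word_counts.items():
        np.add.at(x, targets[w], count * signs[w])
    return x

x = project({3: 2.0, 17: 1.0, 999: 4.0})   # toy document: word id -> count
```

The result is identical to multiplying the full histogram by the equivalent sparse matrix, at a fraction of the cost.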
Rapid Construction of Large Document Maps
A. Fast Distance Computation
Tabulate the indexes of the nonzero components of each input vector, so that Euclidean distances between the sparse vectors and the models touch only those components.
We must use low-dimensional models.
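The tabulated-index idea can be sketched with the identity ||x − m||² = ||x||² + ||m||² − 2 xᵀm, so only the nonzero components of x are visited; names and sizes are illustrative:

```python
import numpy as np

def sparse_sq_dists(idx, val, models, model_sq_norms):
    """Squared Euclidean distances from a sparse vector (nonzero indexes
    `idx` with values `val`) to all dense model vectors, using
    ||x - m||^2 = ||x||^2 + ||m||^2 - 2 x.m over the tabulated indexes only."""
    x_sq = float(val @ val)
    dots = models[:, idx] @ val          # touches only the nonzero components
    return x_sq + model_sq_norms - 2.0 * dots

rng = np.random.default_rng(3)
models = rng.random((50, 1000))
msq = (models ** 2).sum(axis=1)          # precomputed once per epoch
idx = np.array([4, 40, 400])             # tabulated nonzero indexes
val = np.array([1.0, 2.0, 0.5])
d2 = sparse_sq_dists(idx, val, models, msq)
```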
Rapid Construction of Large Document Maps
B. Estimation of Larger Maps Based on Carefully Constructed Smaller Ones
The number of nodes of the SOM is increased during its construction. The new idea is to estimate good initial values for the model vectors of a very large map on the basis of the asymptotic values of the model vectors of a much smaller map.
Rapid Construction of Large Document Maps
Each unit d of the dense (large) map lies, on the map grid, within a triangle of units i, j, k of the sparse (small) map, and its model vector can be expressed through interpolation coefficients h:

(9) m_d^dense = h_i m_i^sparse + h_j m_j^sparse + h_k m_k^sparse

Applying the same coefficients to the asymptotic model vectors m′ of the smaller map gives the initial estimate for the large map:

(10) m̂_d^dense = h_i m′_i^sparse + h_j m′_j^sparse + h_k m′_k^sparse
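A rough sketch of the estimation idea, with plain bilinear interpolation on the map grid standing in for the paper's fitted coefficients (an assumption for illustration only):

```python
import numpy as np

def enlarge_map(sparse_models, sparse_shape, dense_shape):
    """Initialize a dense map's model vectors by interpolating the
    asymptotic model vectors of a smaller (sparse) map. Bilinear
    interpolation is used here as a stand-in; dense dims must be >= 2."""
    sr, sc = sparse_shape
    dr, dc = dense_shape
    M = sparse_models.reshape(sr, sc, -1)
    out = np.empty((dr, dc, M.shape[-1]))
    for r in range(dr):
        for c in range(dc):
            # position of dense unit (r, c) in sparse-grid coordinates
            u = r * (sr - 1) / (dr - 1)
            v = c * (sc - 1) / (dc - 1)
            r0, c0 = int(u), int(v)
            r1, c1 = min(r0 + 1, sr - 1), min(c0 + 1, sc - 1)
            a, b = u - r0, v - c0
            out[r, c] = ((1 - a) * (1 - b) * M[r0, c0]
                         + (1 - a) * b * M[r0, c1]
                         + a * (1 - b) * M[r1, c0]
                         + a * b * M[r1, c1])
    return out.reshape(dr * dc, -1)
```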
Rapid Construction of Large Document Maps
C. Rapid Fine-Tuning of the Large Maps
1) Addressing Old Winners: the winner location found on the previous, smaller map is stored for each sample and used to restrict the new search. (This is the same idea as in LAB!)
2) Initialization of the Pointers: the size of the maps is increased stepwise during learning, using formula (10).
The winner is the map unit for which the inner product with the data vector is the largest; by (10), the inner products with the estimated models reduce to inner products with the sparse-map models:

(11) x^T m̂_d^dense = h_i x^T m′_i^sparse + h_j x^T m′_j^sparse + h_k x^T m′_k^sparse
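A minimal sketch of winner search by inner products; the row normalization here is an assumption for the sketch, since with unit-norm model vectors the argmax of xᵀm coincides with the Euclidean winner:

```python
import numpy as np

def winner(models, x):
    """Winner search by inner product (cf. Eq. (11)): with unit-norm
    model vectors, argmax of x.m equals the Euclidean nearest model,
    because ||x - m||^2 = ||x||^2 + 1 - 2 x.m."""
    return int(np.argmax(models @ x))

rng = np.random.default_rng(4)
models = rng.random((20, 8))
models /= np.linalg.norm(models, axis=1, keepdims=True)  # normalize rows
x = rng.random(8)
c = winner(models, x)
```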
Rapid Construction of Large Document Maps
3) Parallelized Batch Map Algorithm: the winner search can be implemented as a parallel process.
4) Saving Memory by Reducing Representation Accuracy: sufficient accuracy can be maintained during the computation.
Rapid Construction of Large Document Maps
D. Performance Evaluation of the New Methods
1) Numerical Comparison with the Traditional SOM Algorithm: two performance indexes were used to measure the quality of the maps, the average quantization error and the classification accuracy.
Experiments: two sets of maps.
Rapid Construction of Large Document Maps
2) Comparison of the Computational Complexity:
O(dM²) stems from the computation of the small map.
O(dN) results from the VQ step (6) of the batch-map algorithm.
O(N²) refers to the estimation of the pointers.
(N: data samples; M: map units; d: dimensionality.)
The Document Map of All Electronic Patent Abstracts
A. Preprocessing
We first extracted the titles and the texts for further processing and removed nontextual information. Mathematical symbols and numbers were converted into special dummy symbols.
The corpus contained 733,179 different words. After a set of common words was removed, the remaining vocabulary consisted of 43,222 words. Finally, we omitted the 122,524 abstracts in which fewer than five words remained.
The Document Map of All Electronic Patent Abstracts
B. Formation of Statistical Models
The final dimensionality selected was 500, and five random pointers were used for each word.
The words were weighted using the Shannon entropy of their distribution of occurrence among the subsections of the patent classification system.
The Document Map of All Electronic Patent Abstracts
The weight is a measure of the unevenness of the distribution of the word in the subsections.
The weights were calculated as follows: let P_g(w) be the probability of a randomly chosen instance of the word w occurring in subsection g, and N_g the number of subsections.

Shannon entropy: H(w) = − Σ_g P_g(w) log P_g(w)
Weight: W(w) = H_max − H(w), where H_max = log N_g
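The Shannon-entropy weighting W(w) = H_max − H(w) can be sketched directly; the subsection counts below are toy values:

```python
import math

def entropy_weight(counts_by_subsection, n_subsections):
    """Shannon-entropy weight W(w) = H_max - H(w): words concentrated in
    few subsections get large weights, evenly spread words get ~0."""
    total = sum(counts_by_subsection)
    h = 0.0
    for c in counts_by_subsection:
        if c > 0:
            p = c / total                # P_g(w)
            h -= p * math.log(p)         # H(w) = -sum_g P_g log P_g
    return math.log(n_subsections) - h   # H_max = log N_g

w_even = entropy_weight([5, 5, 5, 5], 4)     # evenly spread -> weight 0
w_peaked = entropy_weight([20, 0, 0, 0], 4)  # concentrated -> weight log(4)
```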
The Document Map of All Electronic Patent Abstracts
C. Formation of the Document Map
500-dimensional document vectors were used. The map was enlarged sixteenfold twice and ninefold once. Each of the enlarged, estimated maps (cf. Section IV-B) was then fine-tuned by five batch-map iteration cycles.
The Document Map of All Electronic Patent Abstracts
D. Results
When each map node was labeled according to the majority of the subsections in the node, the resulting classification accuracy was 64%.
Conclusion
In this paper the emphasis has been on the scalability of the methods to very large text collections.
Contributions:
A document map larger than our previous ones.
A new method of forming statistical models of documents.
Several new fast computing methods.
Personal Opinion
Could the SOM be combined with domain knowledge, e.g., in IR or …?