Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.

Mining massive document collections by the WEBSOM method

Presenter: Yu-hui Huang

Authors: Krista Lagus, Samuel Kaski*, Teuvo Kohonen

Information Sciences 2004

國立雲林科技大學 National Yunlin University of Science and Technology

Outline

Motivation

Objective

Methodology

Experimental

Conclusion

Personal Comments

Motivation

It would be of great help for browsing an encyclopaedia or a digital library, if the items could be preordered according to their contents.

The main problem with multidimensional scaling (MDS) methods is that all items must be known before the mapping is computed. The computation is also heavy, or even infeasible, for any sizable collection of items.

Objective

When a search is started from the map units that match the search expression best, further relevant results can be found on the basis of the pointers stored at the same or neighboring map units.

Methodology

The Batch Map version of the SOM:
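
In its standard form, the Batch Map replaces every model vector by a neighborhood-weighted mean of the inputs, m_i = sum_j h(c(j), i) x_j / sum_j h(c(j), i), where c(j) is the best-matching unit of input x_j. A minimal sketch of one such iteration follows. The Gaussian neighborhood, the name batch_map_step, and the toy data are assumptions for illustration and are not taken from the paper.

```python
import math

def batch_map_step(data, codebook, grid, sigma):
    """One Batch Map iteration (illustrative sketch, not the paper's code)."""
    dim = len(codebook[0])
    # Step 1: find the best-matching unit (BMU) of every input vector.
    winners = [
        min(range(len(codebook)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, codebook[i])))
        for x in data
    ]
    new_codebook = []
    for i, pos_i in enumerate(grid):
        num = [0.0] * dim
        den = 0.0
        for x, c in zip(data, winners):
            # Gaussian neighborhood between unit i and winner c (assumed form).
            d2 = sum((p - q) ** 2 for p, q in zip(pos_i, grid[c]))
            h = math.exp(-d2 / (2.0 * sigma ** 2))
            den += h
            for k in range(dim):
                num[k] += h * x[k]
        # Step 2: the new model vector is the weighted mean of the inputs.
        # Keep the old vector if all weights underflowed to zero.
        new_codebook.append([v / den for v in num] if den > 0 else list(codebook[i]))
    return new_codebook

# Toy usage: a 1-D map with two units and two well-separated clusters.
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
grid = [(0,), (1,)]
codebook = [[0.2, 0.2], [0.8, 0.8]]
for _ in range(5):
    codebook = batch_map_step(data, codebook, grid, sigma=0.3)
```

With a narrow neighborhood such as sigma=0.3, each unit settles close to the mean of the cluster it wins; a wider neighborhood pulls the two units toward each other.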

Methodology-Encoding

Vector space method

Methods for dimensionality reduction

Latent semantic indexing

Random projection

Word clustering

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1
D11   1   0   1
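
The document-term table above can be treated directly as a set of vectors. The sketch below stores it as binary vectors, compares documents by cosine similarity, and applies a random projection y = R x to reduce the dimensionality. The Gaussian random matrix, the target dimension of 2, and the helper names are assumptions chosen for illustration; the paper's own projection matrix may be constructed differently.

```python
import math
import random

# Binary term-document vectors from the table (columns t1, t2, t3).
docs = {
    "D1": [1, 0, 1], "D2": [1, 0, 0], "D3": [0, 1, 1], "D4": [1, 0, 0],
    "D5": [1, 1, 1], "D6": [1, 1, 0], "D7": [0, 1, 0], "D8": [0, 1, 0],
    "D9": [0, 0, 1], "D10": [0, 1, 1], "D11": [1, 0, 1],
}

def cosine(u, v):
    """Cosine similarity between two document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def random_projection(vec, matrix):
    """Project vec onto len(matrix) dimensions: y = R x."""
    return [sum(r * x for r, x in zip(row, vec)) for row in matrix]

# Hypothetical setup: a fixed seed and a 2 x 3 Gaussian random matrix R.
rng = random.Random(0)
R = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(2)]
projected = {name: random_projection(vec, R) for name, vec in docs.items()}

# Documents with no shared terms have zero similarity.
print(cosine(docs["D2"], docs["D7"]))
```

Three terms projected to two dimensions is only a toy case; the technique pays off when the vocabulary has tens of thousands of terms and the target dimension is a few hundred.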

Methodology-Encoding

Weighting of words

IDF-based weights

Entropy over topical document classes
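
Both weighting schemes can be sketched in a few lines. The IDF weight below uses the common form log(N / df). The entropy weight is taken as H_max minus the entropy of a word's normalized distribution over the topical classes, so that words concentrated in few classes get a high weight. The exact formulas, function names, and input layout are assumptions for illustration and may differ from those used in the paper.

```python
import math

def idf_weights(doc_term_rows):
    """IDF weight per term: log(N / df), with 0.0 for unseen terms."""
    n_docs = len(doc_term_rows)
    n_terms = len(doc_term_rows[0])
    weights = []
    for t in range(n_terms):
        df = sum(1 for row in doc_term_rows if row[t] > 0)
        weights.append(math.log(n_docs / df) if df else 0.0)
    return weights

def entropy_weights(class_term_counts):
    """class_term_counts[g][t] = occurrences of term t in class g."""
    n_classes = len(class_term_counts)
    n_terms = len(class_term_counts[0])
    h_max = math.log(n_classes)
    totals = [sum(row) for row in class_term_counts]
    weights = []
    for t in range(n_terms):
        # Relative frequency of the term inside each class.
        rel = [row[t] / tot if tot else 0.0
               for row, tot in zip(class_term_counts, totals)]
        s = sum(rel)
        if not s:
            weights.append(0.0)
            continue
        probs = [r / s for r in rel]
        h = -sum(p * math.log(p) for p in probs if p > 0)
        weights.append(h_max - h)
    return weights

# Toy usage: term 0 occurs in every document, term 1 in one of two.
idf = idf_weights([[1, 0], [1, 1]])
# Term 0 occurs only in class 0; term 1 is spread over both classes.
ent = entropy_weights([[10, 5], [0, 5]])
```

In the toy data, the term that appears in every document gets an IDF weight of zero, and the term confined to a single class gets the maximum entropy-based weight.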

Methodology-Fast

Rapid initialization by increasing the map size

Faster computation of the final state of the SOM

Addressing old winners

Initial best matching units
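
The idea behind addressing old winners is that, once the map is roughly ordered, the new best-matching unit of an input is likely to lie near its previous one, so the search can be restricted to a small neighborhood. A minimal sketch, assuming a rectangular grid stored row by row; the function name local_bmu and the radius parameter are hypothetical.

```python
def local_bmu(x, codebook, grid_shape, old_winner, radius=1):
    """Search the best-matching unit only near the previous winner."""
    rows, cols = grid_shape
    r0, c0 = divmod(old_winner, cols)
    best, best_d = old_winner, float("inf")
    for r in range(max(0, r0 - radius), min(rows, r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(cols, c0 + radius + 1)):
            i = r * cols + c
            d = sum((a - b) ** 2 for a, b in zip(x, codebook[i]))
            if d < best_d:
                best, best_d = i, d
    return best

# Toy 3 x 3 map whose model vectors equal their grid coordinates.
codebook = [[float(r), float(c)] for r in range(3) for c in range(3)]
x = [2.0, 2.0]
first = local_bmu(x, codebook, (3, 3), old_winner=0)       # moves to the center
second = local_bmu(x, codebook, (3, 3), old_winner=first)  # reaches the corner
```

Each call inspects at most (2 * radius + 1)^2 units instead of the whole map, at the cost of possibly needing several iterations to reach the true winner.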

Methodology-Fast

Additional computational shortcuts

Parallelized Batch Map algorithm

Saving memory by reducing representation accuracy

Utilizing the sparsity of the vectors
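
One way to exploit sparsity is to expand the squared distance as ||x - m||^2 = ||x||^2 - 2 x·m + ||m||^2, so that the dot product touches only the nonzero components of the document vector while the norms are precomputed. The sketch below illustrates this identity; the dictionary representation and the name sparse_sq_distance are assumptions, not the paper's data structures.

```python
def sparse_sq_distance(x_sparse, x_norm2, m, m_norm2):
    """Squared Euclidean distance using only the nonzero entries of x.

    x_sparse maps component index -> nonzero value; x_norm2 and m_norm2
    are the precomputed squared norms of x and of the model vector m.
    """
    dot = sum(v * m[k] for k, v in x_sparse.items())
    return x_norm2 - 2.0 * dot + m_norm2

# Toy usage: a 5-dimensional vector with two nonzero components.
x_sparse = {0: 1.0, 3: 2.0}
x_norm2 = sum(v * v for v in x_sparse.values())
m = [0.5, 1.0, 0.0, 1.0, 2.0]
m_norm2 = sum(v * v for v in m)
d = sparse_sq_distance(x_sparse, x_norm2, m, m_norm2)

# Dense computation for comparison.
x_dense = [1.0, 0.0, 0.0, 2.0, 0.0]
d_dense = sum((a - b) ** 2 for a, b in zip(x_dense, m))
```

Since ||x||^2 is the same for every map unit, it can be dropped entirely when only the winner is needed.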

Methodology-Fast

Performance evaluation of the new methods

Numerical comparison with the traditional SOM algorithm

Comparison of the computational complexity

Experimental

Largest experiment: nearly 7 million patent abstracts

Experimental

Experiment on the Britannica collection

Preprocessing and document encoding

Construction of the map

Obtaining descriptive labels for text clusters and map regions

Exploration of the map

Experimental

Exploration of the map

Conclusion

The WEBSOM method has been shown to be robust for organizing large and varied collections onto meaningfully ordered document maps.

The developed computational speedups enable the creation of very large maps.

Personal Comments

Advantage …

Drawback …

Application: search engines and retrieval over large document collections such as an encyclopaedia or a digital library.