Intelligent Database Systems Lab, National Yunlin University of Science and Technology. Presenter: Chien-Hsing Chen. Authors: Rudi L. Cilibrasi, Paul M.B. Vitanyi. The Google Similarity Distance. TKDE, 2007.

The Google Similarity Distance



Page 1: The Google Similarity Distance

Intelligent Database Systems Lab

國立雲林科技大學 (National Yunlin University of Science and Technology)

Presenter: Chien-Hsing Chen

Authors: Rudi L. Cilibrasi, Paul M.B. Vitanyi

The Google Similarity Distance

TKDE, 2007

Page 2: The Google Similarity Distance

Outline

Motivation

Objective

NGD

Experiments

Conclusions

Personal Opinion

Page 3: The Google Similarity Distance

Motivation

Designing structures capable of manipulating knowledge is very costly.

High-quality content must be entered into these structures by knowledgeable human experts.

These efforts are long-running and large-scale.

Page 4: The Google Similarity Distance

Objective

The authors develop a method that uses only the name of an object to obtain knowledge about the similarity of objects.

By contrast, regular FCA, as used in ontology work, acquires similarity from objects and their attributes.

Page 5: The Google Similarity Distance

The Google Similarity Distance

Kolmogorov complexity

Page 6: The Google Similarity Distance

The Google Similarity Distance

NGD(horse, rider) = 0.443

“horse”: 46,700,000 pages

“rider”: 12,200,000 pages

“horse, rider”: 2,630,000 pages

N = 8,058,044,651 indexed pages

NGD(pepsi, cola) = 0.797

NGD(賓拉登 "bin Laden", 攻擊 "attack") = 0.64

NGD(horse, rider) = 0.898

NGD(book, drink) = 0.694

NGD(web, network) = 0.2768
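The page counts quoted on this slide for "horse" and "rider" can be checked with a few lines of arithmetic. The sketch below assumes the NGD definition given in the Cilibrasi and Vitanyi paper, NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log N − min{log f(x), log f(y)}), where f counts the pages containing a term and N is the number of indexed pages. The function name `ngd` and its parameter names are illustrative choices, not taken from the slides.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google Distance computed from page counts.

    fx, fy: number of pages containing each term on its own
    fxy:    number of pages containing both terms
    n:      total number of indexed pages
    """
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Page counts quoted on the slide for "horse", "rider", and both together.
horse_rider = ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651)
print(round(horse_rider, 3))  # 0.443, the first value shown on the slide
```

The base of the logarithm cancels in the ratio, so natural logarithms give the same result as base 2.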

Page 7: The Google Similarity Distance

Applications and Experiments

Hierarchical Clustering

Given a set of objects in a space provided with a distance measure, the distance matrix has as entries the pairwise distances between the objects.
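As a minimal sketch of the distance matrix described here, the helper below fills entry [i][j] with the distance between object i and object j. The names `distance_matrix` and `toy_dist` are hypothetical, and `toy_dist` is only a stand-in so the example runs offline; in the setting of the slides the distance would be NGD computed from search-engine page counts.

```python
from typing import Callable, Sequence


def distance_matrix(
    objects: Sequence[str], dist: Callable[[str, str], float]
) -> list[list[float]]:
    """Return a square matrix whose entry [i][j] is dist(objects[i], objects[j])."""
    return [[dist(a, b) for b in objects] for a in objects]


def toy_dist(a: str, b: str) -> float:
    """Stand-in distance for illustration only (NOT NGD): word-length difference."""
    return float(abs(len(a) - len(b)))


matrix = distance_matrix(["horse", "rider", "web"], toy_dist)
for row in matrix:
    print(row)
```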

Page 8: The Google Similarity Distance

Applications and Experiments

Hierarchical Clustering

Dataset: 17th-century painters
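The slide does not say which clustering algorithm was used, so the following is only a rough illustration of how a pairwise distance can drive hierarchical clustering. It uses single-linkage agglomerative clustering as a stand-in technique. The function name `single_linkage`, the labels, and the toy positions are all hypothetical; with real data the distance would be NGD between names.

```python
from itertools import combinations
from typing import Callable, Sequence


def single_linkage(
    labels: Sequence[str], dist: Callable[[str, str], float], n_clusters: int
):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [frozenset([label]) for label in labels]
    merges = []
    while len(clusters) > n_clusters:
        # Single linkage: cluster distance is the smallest member-to-member distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(dist(x, y) for x in pair[0] for y in pair[1]),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((set(a), set(b)))
    return clusters, merges


# Toy positions on a line, for illustration only (not data from the slides).
positions = {"a": 0.0, "b": 0.1, "c": 1.0, "d": 1.2}
clusters, merges = single_linkage(
    list(positions), lambda x, y: abs(positions[x] - positions[y]), n_clusters=2
)
print(clusters)
```

The recorded `merges` list gives the order in which groups were joined, which is the information a dendrogram would display.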

Page 9: The Google Similarity Distance

Applications and Experiments

SVM-NGD Learning

The authors use six anchor words $a_1, \dots, a_6$ to convert each of the 40 training words $w_1, \dots, w_{40}$ into a 6-dimensional training vector $v_1, \dots, v_{40}$.

The entry $v_{j,i}$ of $v_j = (v_{j,1}, \dots, v_{j,6})$ is defined as $v_{j,i} = \mathrm{NGD}(w_j, a_i)$, for $1 \le j \le 40$ and $1 \le i \le 6$.
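The conversion from words to training vectors can be sketched as follows: each word is mapped to the list of its distances to the anchors, one entry per anchor. The function name `ngd_vectors`, the placeholder words and anchors, and the stand-in distance are assumptions made for illustration; a real run would plug in NGD based on page counts and pass the resulting vectors to an SVM library.

```python
from typing import Callable, Sequence


def ngd_vectors(
    words: Sequence[str],
    anchors: Sequence[str],
    ngd: Callable[[str, str], float],
) -> list[list[float]]:
    """Build one vector per word; entry i is the distance to anchor i."""
    return [[ngd(word, anchor) for anchor in anchors] for word in words]


def stand_in_distance(a: str, b: str) -> float:
    """Placeholder for NGD (illustration only): word-length difference."""
    return float(abs(len(a) - len(b)))


# Hypothetical placeholder inputs; the slides do not list the actual words.
training_words = ["word_one", "word_two", "w3"]
anchor_words = ["a1", "a2", "a3", "a4", "a5", "a6"]

vectors = ngd_vectors(training_words, anchor_words, stand_in_distance)
print(len(vectors), len(vectors[0]))  # one 6-dimensional vector per training word
```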

Page 10: The Google Similarity Distance

NGD Translation

Page 11: The Google Similarity Distance

Comparison to WordNet semantics

100 semantic categories are randomly selected from the WordNet database.

For each category, an SVM is trained on 50 labeled training samples.

Positive examples are taken from WordNet; the others are taken from a dictionary.

Each experiment uses six anchors in total: three from WordNet and three from the dictionary.

The test set consists of 20 new examples.

100 experiments are run in total.

The authors ignore false negatives.

Page 12: The Google Similarity Distance

Conclusion

The WordNet knowledge base was created over the course of decades by paid human experts.

Google has already indexed more than 8 billion pages and shows no signs of slowing down.

The estimate of 8 billion indexed pages dates from 2004.

Page 13: The Google Similarity Distance

Opinion

Advantage

The Google search engine has recently become well regarded as a basis for similarity measurement.

Drawback

How the anchors are determined; the accuracy measure (false negatives are ignored).

NGD is nothing novel, only a straightforward demonstration.

Application