Semantic Transforms Using Collaborative Knowledge Bases Yegin Genc, Winter Mason, Jeffrey V. Nickerson Stevens Institute of Technology

presented at WIN2012

Page 1: Semantic Transforms Using Collaborative Knowledge Bases

Semantic Transforms Using Collaborative Knowledge Bases

Yegin Genc, Winter Mason, Jeffrey V. Nickerson

Stevens Institute of Technology

Page 2: Semantic Transforms Using Collaborative Knowledge Bases

Overview

• Automatically understand online information

• Using network artifacts, such as Wikipedia, to help

Page 3: Semantic Transforms Using Collaborative Knowledge Bases

Topic Models

Algorithms to understand and organize documents by uncovering the semantic structure of a document collection

• Discover hidden themes – patterns of word use

• Connect documents that exhibit similar patterns

Page 4: Semantic Transforms Using Collaborative Knowledge Bases

Latent Dirichlet Allocation (LDA)

Algorithms – 0.28, Optimization – 0.28, Algorithm – 0.14, Computer – 0.14, Techniques – 0.14, …

Genetic – 0.18, Natural – 0.18, Evolution – 0.18, Evolutionary – 0.09, …

“In the computer science field of artificial intelligence, a genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution. This heuristic is routinely used to generate useful solutions to optimization and search problems. Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover.” 1

1 http://en.wikipedia.org/wiki/Genetic_algorithm

Page 5: Semantic Transforms Using Collaborative Knowledge Bases

Topics from LDA

Five topics from a 50-topic LDA model fit to Science from 1980 to 2002 (Blei and Lafferty, 2009)

computer   chemistry  cortex      orbit    infection
methods    synthesis  stimulus    dust     immune
number     oxidation  fig         jupiter  aids
two        reaction   vision      line     infected
principle  product    neuron      system   viral
design     organic    recordings  solar    cells

methods   k          of       the  for       the  the   operations  the
the       the        objects  of   the       o    and   the         of
a         of         to       a    linear    we   of    functional  a
of        algorithm  and      to   problem   and  to    requires    is
problems  for        the      we   problems  a    that  and         in

Ten randomly chosen topics from a 50-topic LDA model fit to abstracts from the Journal of the ACM (JACM) from the years 1987 to 2004 (Blei et al., 2010).

Page 6: Semantic Transforms Using Collaborative Knowledge Bases

The interpretation problem

1. Labeling the topics is difficult (J. Chang et al., 2009)

2. The relationships between topics are not identified

3. The information in the topics is based solely on the input corpus

4. The external validity of the topics may be limited

Page 7: Semantic Transforms Using Collaborative Knowledge Bases

Collaborative Knowledge Bases

1. Labeled topics

2. Connected to each other in a meaningful way

3. Contain rich, focused information on particular topics

4. Contain fresh, up-to-date information about practically everything

Page 8: Semantic Transforms Using Collaborative Knowledge Bases

Wikipedia Pages as Topics

LDA topic: orbit, dust, jupiter, line, system, solar, gas, atmospheric, mars, field

Wikipedia Page: Solar System

“The Solar System consists of the Sun and the astronomical objects gravitationally bound in orbit around it, all of which formed from the collapse of a giant molecular cloud approximately 4.6 billion years ago…”

(http://en.wikipedia.org/wiki/Solar_System)

Page 9: Semantic Transforms Using Collaborative Knowledge Bases

Wikipedia Pages as Topics

Topics are characterized as distributions over observed words in Wikipedia pages

βk : Per-topic word distribution

Wikipedia
Word         Freq.  βk
orbit           34  0.12
dust             7  0.02
jupiter         36  0.12
line             0  0.00
system          76  0.26
solar          110  0.38
gas             11  0.04
atmospheric      1  0.00
mars             8  0.03
field            8  0.03
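The proportions in the frequency table on this slide are consistent with dividing each word count by the total count of the ten listed words. The sketch below illustrates that normalization. The function name `word_distribution` is hypothetical, and restricting the vocabulary to these ten words is an assumption made only to match the slide.

```python
# Word counts for the "Solar System" page, copied from the slide's table.
freq = {
    "orbit": 34, "dust": 7, "jupiter": 36, "line": 0, "system": 76,
    "solar": 110, "gas": 11, "atmospheric": 1, "mars": 8, "field": 8,
}


def word_distribution(counts):
    """Normalize raw counts to relative frequencies that sum to 1.

    Assumption: plain relative frequency over the listed words only.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no word occurrences to normalize")
    return {word: n / total for word, n in counts.items()}


beta_k = word_distribution(freq)
for word, p in beta_k.items():
    print(f"{word:12s} {freq[word]:4d} {p:.2f}")
```

A word that never occurs on the page, such as "line" here, receives probability zero under this scheme. Whether the authors smooth such zeros is not stated on the slide.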

Page 10: Semantic Transforms Using Collaborative Knowledge Bases

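[Diagram: matrix view of the method. DOCUMENT – TOPIC Θ (D x K) = DOCUMENT – WORD W (D x W) * TOPIC – WORD. The topic-word matrix is β (K x W) for LDA and Wiki (W x K) for the Wikipedia-based method. Plate-notation labels: d, n, k, Z d,n, W d,n.]

D: Documents
K: Topics
W: Words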
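The matrix dimensions on this slide, Θ (D x K), W (D x W), and Wiki (W x K), suggest that document-topic scores can be obtained by multiplying the document-word matrix by the Wikipedia topic-word matrix. The following is a minimal sketch under that reading. The toy data, the row normalization, and the helper names `matmul` and `normalize_rows` are assumptions for illustration, not details taken from the slides.

```python
# Hypothetical toy data: 2 documents, 3 vocabulary words, 2 Wikipedia topics.
vocab = ["orbit", "solar", "algorithm"]
topics = ["Solar System", "Genetic algorithm"]

# W (D x W): word counts per document.
doc_word = [
    [3, 5, 0],
    [0, 1, 4],
]

# Wiki (W x K): per-topic word proportions, one column per Wikipedia page.
wiki = [
    [0.4, 0.0],  # orbit
    [0.6, 0.1],  # solar
    [0.0, 0.9],  # algorithm
]


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix given as nested lists."""
    return [
        [sum(a[i][j] * b[j][k] for j in range(len(b))) for k in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalize_rows(m):
    """Scale each row to sum to 1; all-zero rows are left unchanged."""
    out = []
    for row in m:
        s = sum(row)
        out.append([x / s for x in row] if s else list(row))
    return out


# Theta (D x K): one row of topic scores per document (normalization assumed).
theta = normalize_rows(matmul(doc_word, wiki))

# Pick the highest-scoring Wikipedia topic for each document.
best = [topics[max(range(len(row)), key=row.__getitem__)] for row in theta]
print(best)
```

Because each topic is a named Wikipedia page, the highest-scoring column comes with a label, which is the property the earlier slides contrast with unlabeled LDA topics.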

Page 11: Semantic Transforms Using Collaborative Knowledge Bases

Experiment

Data

• 617 abstracts from Journal of the ACM

• Classified into 80 categories by their authors

• 53 categories have corresponding Wikipedia Pages

Abstracts

{Article Name: On the (Im)possibility of Obfuscating Programs, Category: D.4. Operating Systems Add. Category: F.1 Computation by Abstract Devices … }

Category Mappings

Category : Wikipedia Page

D.4 Operating Systems : Operating System

F.1 Computation by Abstract Devices : Abstract Machine

Page 12: Semantic Transforms Using Collaborative Knowledge Bases

Three variations of our method

- Inbound links are Wikipedia pages that link to the topic page

- Outbound links are Wikipedia pages linked to by the topic page

- Text-based method only uses word distributions in topic pages
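The inbound and outbound definitions above can be illustrated on a small link graph. This sketch shows only how the two page sets are derived. The graph contents and the function names `inbound` and `outbound` are hypothetical, and the slides do not say how the linked pages are then used to build a topic's representation.

```python
# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "Operating system": ["Kernel (operating system)", "Computer hardware"],
    "Kernel (operating system)": ["Operating system"],
    "Device driver": ["Operating system", "Computer hardware"],
    "Computer hardware": [],
}


def outbound(page, graph):
    """Pages that the topic page links to."""
    return set(graph.get(page, []))


def inbound(page, graph):
    """Pages that contain a link to the topic page."""
    return {source for source, targets in graph.items() if page in targets}


print(sorted(inbound("Operating system", links)))
print(sorted(outbound("Operating system", links)))
```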

Page 13: Semantic Transforms Using Collaborative Knowledge Bases

Results

Method          Primary       Primary or Additional
Text            182 (29.5%)   314 (50.8%)
Inbound links   131 (21.2%)   249 (40.0%)
Outbound links   79 (12.8%)   166 (26.9%)

The number (and percentage) of abstracts for which each method successfully identified the authors’ primary ACM topic label, or any of the authors’ primary or additional ACM topic labels.

LDA cannot be compared without an additional step mapping word distributions to ACM topics.
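The two counts reported per method can be sketched as a simple tally: a prediction is a "Primary" hit if it equals the author's primary label, and a "Primary or Additional" hit if it equals any author-assigned label. The data below is hypothetical (including the category codes F.2 and G.2), and `count_hits` is an illustrative name, not the authors' evaluation code.

```python
def count_hits(predictions, labels):
    """Return (primary_hits, primary_or_additional_hits) over all documents."""
    primary_hits = 0
    any_hits = 0
    for doc_id, predicted in predictions.items():
        gold = labels[doc_id]
        if predicted == gold["primary"]:
            primary_hits += 1
        if predicted == gold["primary"] or predicted in gold["additional"]:
            any_hits += 1
    return primary_hits, any_hits


# Hypothetical author-assigned labels and predicted categories.
labels = {
    "a1": {"primary": "D.4", "additional": ["F.1"]},
    "a2": {"primary": "F.1", "additional": []},
    "a3": {"primary": "D.4", "additional": ["F.2"]},
}
predictions = {"a1": "F.1", "a2": "F.1", "a3": "G.2"}

primary_hits, any_hits = count_hits(predictions, labels)
n = len(predictions)
print(f"Primary: {primary_hits} ({primary_hits / n:.1%})")
print(f"Primary or Additional: {any_hits} ({any_hits / n:.1%})")
```

By construction the second count is never smaller than the first, which matches the pattern in the results table.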

Page 14: Semantic Transforms Using Collaborative Knowledge Bases

Results (Qualitative)

Page 15: Semantic Transforms Using Collaborative Knowledge Bases

Concluding Remarks

The Wiki categories often match the categories that were chosen by the authors. When they don’t match, they generally appear plausible.

Among the variations of our method, the text-based approach performed better than the link-based approaches.

Among the link-based approaches, inbound links performed better than outbound links.

Page 16: Semantic Transforms Using Collaborative Knowledge Bases

Next Steps

Dependent topic structures

Combine heuristics with generative models: Wikipedia as a prior for the topic distribution

Learn from the documents observed.