presented at WIN2012
Semantic Transforms Using Collaborative Knowledge Bases
Yegin Genc, Winter Mason, Jeffrey V. Nickerson
Stevens Institute of Technology
Overview
• Automatically understand online information
• Using network artifacts, such as Wikipedia, to help
Topic Models
Algorithms to understand and organize documents by uncovering semantic structure of a document collection
• Discover hidden themes – patterns of word use
• Connect documents that exhibit similar patterns
Algorithms – 0.28, Optimization – 0.28, Algorithm – 0.14, Computer – 0.14, Techniques – 0.14, …
Genetic – 0.18, Natural – 0.18, Evolution – 0.18, Evolutionary – 0.09, …
“In the computer science field of artificial intelligence, a genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution. This heuristic is routinely used to generate useful solutions to optimization and search problems. Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover.” 1
1 http://en.wikipedia.org/wiki/Genetic_algorithm
Latent Dirichlet Allocation (LDA)
Topics from LDA
Five topics from a 50-topic LDA model fit to Science from 1980 – 2002 (Blei and Lafferty, 2009), one topic per column:
computer | chemistry | cortex | orbit | infection
methods | synthesis | stimulus | dust | immune
number | oxidation | fig | jupiter | aids
two | reaction | vision | line | infected
principle | product | neuron | system | viral
design | organic | recordings | solar | cells
methods | k | of | the | for | the | the | operations | the
the | the | objects | of | the | o | and | the | of
a | of | to | a | linear | we | of | functional | a
of | algorithm | and | to | problem | and | to | requires | is
problems | for | the | we | problems | a | that | and | in
Ten randomly chosen topics from a 50-topic LDA model fit to abstracts from the Journal of the ACM (JACM) from the years 1987 to 2004 (Blei et al., 2010).
The interpretation problem
1. Labeling the topics is difficult (J. Chang et al., 2009)
2. The relationships between topics are not identified
3. The information in the topics is based solely on the input corpus
4. The external validity of the topics may be limited
Collaborative Knowledge Bases
1. Labeled topics
2. Connected to each other in a meaningful way
3. Contain rich, focused information on particular topics
4. Contain fresh, up-to-date information about practically everything
Wikipedia Pages as Topics
LDA topic
orbit, dust, jupiter, line, system, solar, gas, atmospheric, mars, field
Wikipedia Page: Solar System
“The Solar System consists of the Sun and the astronomical objects gravitationally bound in orbit around it, all of which formed from the collapse of a giant molecular cloud approximately 4.6 billion years ago…”
(http://en.wikipedia.org/wiki/Solar_System)
Wikipedia Pages as Topics
Topics are characterized as distributions over observed words in Wikipedia pages
βk : Per-topic word distribution
Wikipedia page: Solar System
Word | Freq. | Proportion
orbit | 34 | 0.12
dust | 7 | 0.02
jupiter | 36 | 0.12
line | 0 | 0.00
system | 76 | 0.26
solar | 110 | 0.38
gas | 11 | 0.04
atmospheric | 1 | 0.00
mars | 8 | 0.03
field | 8 | 0.03
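The proportions in the table are consistent with dividing each word count by the total count over the ten listed words (291). A minimal sketch of that normalization follows; the function name `word_distribution` is illustrative and not taken from the talk.

```python
# Word counts on the "Solar System" page for the ten words of the LDA topic,
# copied from the table on this slide.
counts = {
    "orbit": 34, "dust": 7, "jupiter": 36, "line": 0, "system": 76,
    "solar": 110, "gas": 11, "atmospheric": 1, "mars": 8, "field": 8,
}


def word_distribution(word_counts):
    """Normalize raw word counts into a distribution that sums to 1."""
    total = sum(word_counts.values())
    if total == 0:
        raise ValueError("cannot normalize an all-zero count vector")
    return {word: count / total for word, count in word_counts.items()}


beta_k = word_distribution(counts)

for word, proportion in beta_k.items():
    print(f"{word:12s} {counts[word]:4d} {proportion:.2f}")
```

Restricting the vocabulary to the listed words is an assumption read off the numbers; a full implementation would presumably normalize over the whole vocabulary of the page.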
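[Diagram: LDA and the Wikipedia-based method shown as matrices, with plate-notation variables Z d,n and W d,n indexed by d, n, and k]
D: Documents, K: Topics, W: Words
DOCUMENT – WORD: W (D x W)
DOCUMENT – TOPIC: Θ (D x K)
TOPIC – WORD: β (K x W) for LDA; Wiki (W x K) for WIKI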
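One way to read the matrix dimensions in the diagram, offered here as an assumption rather than the authors' stated procedure: multiplying the document-word matrix W (D x W) by the Wikipedia word-topic matrix (W x K) yields a document-topic matrix Θ (D x K). The sketch below uses a toy vocabulary and made-up numbers purely for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def normalize_rows(matrix):
    """Scale each row to sum to 1; all-zero rows are left as zeros."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([v / total for v in row] if total else [0.0] * len(row))
    return result


# Hypothetical toy data: 2 documents, 4 words, 2 Wikipedia pages as topics.
vocab = ["orbit", "solar", "algorithm", "search"]
topics = ["Solar System", "Genetic algorithm"]

# Document-word counts, shape D x W.
doc_word = [
    [3, 2, 0, 1],
    [0, 1, 4, 2],
]

# Wikipedia word-topic weights, shape W x K (each column is one page's
# word distribution over the toy vocabulary).
wiki = [
    [0.4, 0.0],
    [0.5, 0.1],
    [0.0, 0.6],
    [0.1, 0.3],
]

# Document-topic matrix, shape D x K.
theta = normalize_rows(matmul(doc_word, wiki))
best_topic = [topics[row.index(max(row))] for row in theta]
print(best_topic)
```

Unlike LDA, nothing is inferred here: the topic-word side is fixed in advance by the Wikipedia pages, so each topic arrives with a label.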
Experiment
Data
• 617 abstracts from Journal of the ACM
• Classified into 80 categories by their authors
• 53 categories have corresponding Wikipedia Pages
Abstracts
{Article Name: On the (Im)possibility of Obfuscating Programs, Category: D.4. Operating Systems, Add. Category: F.1 Computation by Abstract Devices … }
Category Mappings
Category | Wikipedia Page
D.4 Operating Systems | Operating System
F.1 Computation by Abstract Devices | Abstract Machine
Three variations of our method
- Inbound links are Wikipedia pages that link to the topic page
- Outbound links are Wikipedia pages linked to by the topic page
- Text-based method only uses word distributions in topic pages
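The two link definitions above can be illustrated on a toy link graph. This is a sketch under stated assumptions: the page names and the `links` dictionary are hypothetical, and the talk does not specify how the link sets are scored against an abstract.

```python
# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "Operating system": ["Kernel (operating system)", "Computer hardware"],
    "Kernel (operating system)": ["Operating system"],
    "Process (computing)": ["Operating system", "Kernel (operating system)"],
    "Computer hardware": [],
}


def outbound_links(page, graph):
    """Pages that the topic page links to."""
    return set(graph.get(page, []))


def inbound_links(page, graph):
    """Pages that link to the topic page."""
    return {source for source, targets in graph.items() if page in targets}


topic_page = "Operating system"
print("outbound:", sorted(outbound_links(topic_page, links)))
print("inbound: ", sorted(inbound_links(topic_page, links)))
```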
Results
Method | Primary | Primary or Additional
Text | 182 (29.5%) | 314 (50.8%)
Inbound links | 131 (21.2%) | 249 (40.0%)
Outbound links | 79 (12.8%) | 166 (26.9%)
The number (and percentage) of authors’ primary ACM topic labels, or authors’ primary + additional ACM topics successfully identified by each method.
LDA cannot be compared without an additional step mapping word distributions to ACM topics.
Results (Qualitative)
Concluding Remarks
The Wiki categories often match the categories that were chosen by the authors. When they don’t match, they generally appear plausible.
Among the variations of our method, the text-based approach performed better than the link-based approaches.
Among the link-based approaches, inbound links performed better than outbound links.
Next Steps
Dependent topic structures
Combine heuristics with generative models:
• Wikipedia as a prior for the topic distribution
• Learn from the documents observed