Semantic Transforms Using Collaborative Knowledge Bases

Yegin Genc, Winter Mason, Jeffrey V. Nickerson

Stevens Institute of Technology

Overview

• Automatically understand online information

• Using network artifacts, such as Wikipedia, to help

Topic Models

Algorithms that help understand and organize documents by uncovering the semantic structure of a document collection

• Discover hidden themes – patterns of word use

• Connect documents that exhibit similar patterns

Algorithms – 0.28, Optimization – 0.28, Algorithm – 0.14, Computer – 0.14, Techniques – 0.14, …

Genetic – 0.18, Natural – 0.18, Evolution – 0.18, Evolutionary – 0.09, …

“In the computer science field of artificial intelligence, a genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution. This heuristic is routinely used to generate useful solutions to optimization and search problems. Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover.” 1

1 http://en.wikipedia.org/wiki/Genetic_algorithm

Latent Dirichlet Allocation (LDA)

Topics from LDA

Five topics from a 50-topic LDA model fit to Science from 1980 to 2002 (Blei and Lafferty, 2009):

computer   chemistry  cortex      orbit    infection
methods    synthesis  stimulus    dust     immune
number     oxidation  fig         jupiter  aids
two        reaction   vision      line     infected
principle  product    neuron      system   viral
design     organic    recordings  solar    cells

Ten randomly chosen topics from a 50-topic LDA model fit to abstracts from the Journal of the ACM (JACM) from the years 1987 to 2004 (Blei et al., 2010):

methods   k          of       the  for       the  the   operations  the
the       the        objects  of   the       o    and   the         of
a         of         to       a    linear    we   of    functional  a
of        algorithm  and      to   problem   and  to    requires    is
problems  for        the      we   problems  a    that  and         in

The interpretation problem

1. Labeling the topics is difficult (J. Chang et al., 2009)

2. The relationships between topics are not identified

3. The information in the topics is based solely on the input corpus

4. The external validity of the topics may be limited

Collaborative Knowledge Bases

1. Labeled topics

2. Connected to each other in a meaningful way

3. Contain rich, focused information on particular topics

4. Contain fresh, up-to-date information about practically everything

Wikipedia Pages as Topics

LDA topic: orbit, dust, jupiter, line, system, solar, gas, atmospheric, mars, field

Wikipedia page: Solar System

“The Solar System consists of the Sun and the astronomical objects gravitationally bound in orbit around it, all of which formed from the collapse of a giant molecular cloud approximately 4.6 billion years ago…”

(http://en.wikipedia.org/wiki/Solar_System)

Wikipedia Pages as Topics

Topics are characterized as distributions over observed words in Wikipedia pages

βk : Per-topic word distribution

Wikipedia page: Solar System

Word         Freq.  Proportion
orbit           34  0.12
dust             7  0.02
jupiter         36  0.12
line             0  0.00
system          76  0.26
solar          110  0.38
gas             11  0.04
atmospheric      1  0.00
mars             8  0.03
field            8  0.03
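The proportions in the table above are consistent with dividing each word's count by the total count over the ten listed words (291). A minimal Python sketch of that normalization, assuming the per-topic distribution is restricted to this small vocabulary; the function name `word_distribution` is illustrative and does not come from the slides:

```python
def word_distribution(counts):
    """Normalize raw word counts into a probability distribution."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must contain at least one nonzero entry")
    return {word: n / total for word, n in counts.items()}

# Word counts for the Solar System page, copied from the table above.
solar_system_counts = {
    "orbit": 34, "dust": 7, "jupiter": 36, "line": 0, "system": 76,
    "solar": 110, "gas": 11, "atmospheric": 1, "mars": 8, "field": 8,
}

# beta_k: word distribution for the topic represented by this page.
beta_k = word_distribution(solar_system_counts)
for word, p in beta_k.items():
    print(f"{word:12s} {p:.2f}")
```

Words that never occur on the page (here, "line") get probability zero under this plain normalization; whether any smoothing was applied is not stated in the slides.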

D: Documents, K: Topics, W: Words

Document – Word: W (D x W)

Document – Topic: Θ (D x K)

Topic – Word: β (K x W)

Wiki (W x K)
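One reading of the matrix shapes listed above is that multiplying the document-word matrix (D x W) by the Wikipedia word-topic matrix (W x K) yields document-topic scores (D x K). The slides give only the shapes, so the multiplication, the row normalization, and the toy numbers in this Python sketch are assumptions for illustration:

```python
def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both lists of rows."""
    if any(len(row) != len(b) for row in a):
        raise ValueError("inner dimensions do not match")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def normalize_rows(m):
    """Scale each row to sum to 1; all-zero rows are left unchanged."""
    out = []
    for row in m:
        s = sum(row)
        out.append([v / s for v in row] if s else list(row))
    return out

# Toy data (hypothetical): D = 2 documents, W = 3 words, K = 2 topics.
doc_word = [[2, 0, 1],   # word counts in document 0
            [0, 3, 0]]   # word counts in document 1
wiki = [[0.5, 0.0],      # weight of word 0 in each topic page
        [0.0, 0.9],      # weight of word 1
        [0.5, 0.1]]      # weight of word 2

theta = normalize_rows(matmul(doc_word, wiki))          # D x K scores
best_topic = [row.index(max(row)) for row in theta]     # top topic per document
print(theta, best_topic)
```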

Experiment

Data

• 617 abstracts from Journal of the ACM

• Classified into 80 categories by their authors

• 53 categories have corresponding Wikipedia pages

Abstracts

{Article Name: On the (Im)possibility of Obfuscating Programs, Category: D.4 Operating Systems, Add. Category: F.1 Computation by Abstract Devices, … }

Category Mappings

Category                              Wikipedia Page
D.4 Operating Systems                 Operating System
F.1 Computation by Abstract Devices   Abstract Machine

Three variations of our method

- Inbound links are Wikipedia pages that link to the topic page

- Outbound links are Wikipedia pages linked to by the topic page

- Text-based method only uses word distributions in topic pages
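The inbound and outbound link sets defined above can both be derived from a single link graph. A small Python sketch, assuming the graph is available as a mapping from each page title to the set of pages it links to; the `links` data and the helper names are made up for illustration:

```python
def outbound_links(links, page):
    """Pages that `page` links to."""
    return set(links.get(page, set()))

def inbound_links(links, page):
    """Pages that link to `page`."""
    return {source for source, targets in links.items() if page in targets}

# Hypothetical link graph: page title -> titles it links to.
links = {
    "Operating system": {"Kernel (operating system)", "Computer hardware"},
    "Kernel (operating system)": {"Operating system"},
    "Abstract machine": {"Operating system", "Turing machine"},
}

topic = "Operating system"
print(sorted(outbound_links(links, topic)))
print(sorted(inbound_links(links, topic)))
```

Computing inbound links this way scans the whole graph for every topic page; for many topics it would be cheaper to invert the mapping once.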

Results

Method           Primary       Primary or Additional
Text             182 (29.5%)   314 (50.8%)
Inbound links    131 (21.2%)   249 (40.0%)
Outbound links    79 (12.8%)   166 (26.9%)

The number (and percentage) of authors’ primary ACM topic labels, or authors’ primary + additional ACM topics successfully identified by each method.

LDA cannot be compared without an additional step mapping word distributions to ACM topics.
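The two result columns count different notions of a hit: a match with the author's primary category, or a match with either the primary or any additional category. A minimal Python sketch of that counting, assuming one predicted category per abstract; the abstract ids, labels, and predictions are hypothetical and are not the experiment's data:

```python
def count_hits(predictions, labels):
    """Return (primary_hits, primary_or_additional_hits).

    predictions: abstract id -> predicted category
    labels: abstract id -> (primary category, set of additional categories)
    """
    primary = 0
    either = 0
    for doc_id, predicted in predictions.items():
        main, additional = labels[doc_id]
        if predicted == main:
            primary += 1
        if predicted == main or predicted in additional:
            either += 1
    return primary, either

# Hypothetical labels and predictions for three abstracts.
labels = {
    "a1": ("D.4", {"F.1"}),
    "a2": ("F.1", set()),
    "a3": ("D.4", set()),
}
predictions = {"a1": "F.1", "a2": "F.1", "a3": "H.2"}

primary, either = count_hits(predictions, labels)
total = len(predictions)
print(f"primary: {primary} ({primary / total:.1%})")
print(f"primary or additional: {either} ({either / total:.1%})")
```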

Results (Qualitative)

Concluding Remarks

The Wiki categories often match the categories that were chosen by the authors. When they don’t match, they generally appear plausible.

Among the variations of our method, the text-based approach performed better than the link-based approaches.

Among the link-based approaches, inbound links performed better than outbound links.

Next Steps

Dependent topic structures

Combine heuristics with generative models: Wikipedia as a prior for the topic distribution

Learn from the documents observed.
