
Big Data Tech Conclave, 26-27 April 2013, Bangalore, India

Big Data and the Semantic Web: Challenges and Opportunities

Srinath Srinivasa
Open Systems Laboratory, IIIT Bangalore
http://osl.iiitb.ac.in/


http://www.bda2013.net/



OSL Releases

Topical Anchors: Given a list of noun phrases, identify a semantic topic for these terms.

Powered by the Wikipedia co-occurrence graph hosted by Agama

Web APIs enable use of Topical Anchors in third-party applications


OSL Releases

Topic Expansion: Given a term, expand it into semantically relevant topical clusters with different senses.

Uses co-occurrence datasets from Wikipedia 2006 or 2011.

Web APIs enable use by third-party applications


OSL Releases

Agama: A graph database for storing large undirected graphs for efficient traversal (not structure-based retrieval)

Currently Agama powers a co-occurrence graph of all noun phrases from Wikipedia articles hosted in OSL, managing tens of millions of nodes and hundreds of millions of edges


More data beats better algorithms..

meets

No data is an island..


Outline

● Big Data Characteristics
● Big Data Analytics
  ● Pattern-driven and Model-driven Analytics
● Big Data and the Semantic Web
● Semantic Challenges
  ● The myth of a global ontology
  ● Convergent and divergent semantics
  ● Semantic interoperability
● Technology Challenges
  ● Storage, traversal and retrieval of large-scale semantic networks
  ● Inference on Big Data
● On the road ahead


Big Data

Data that is

● Too large to be processed by conventional databases and data management techniques (Volume)
● Too diverse in structure for any single data model to capture all elements of the data (Variety)
● Transient and/or impermanent, especially when pertaining to dynamic phenomena (Velocity)


Big Data

● Transaction records

● Network streams

● Experimental output

● Social media data

● Demographic records

● Citation data

● Clickstreams

● Log data

● Weather data

● …


Some Big Data Stats

● YouTube users upload 48 hours of video every minute
  http://gigaom.com/2011/05/25/youtube-48-hours-of-video-per-minute/

● Facebook data grows by 500 TB daily
  http://www.slashgear.com/facebook-data-grows-by-over-500-tb-daily-23243691/

● WalMart handles more than 1 million customer transactions every hour
  http://www.economist.com/node/15557443

● Akamai analyzes 75 million events per day for targeted advertising
  http://wikibon.org/blog/taming-big-data/

● 90% of data in the world today was created in the last 2 years
  http://wikibon.org/blog/big-data-infographics/


Big Data Analytics

Examine Big Data for useful (often actionable) knowledge

The long spectrum of Big Data Analytics

Pattern identification

Association rule mining

Classification/Clustering

Record Linkage

Security analytics

Complex Event Processing

Opinion mining

Predictive modeling

Pattern driven

Model driven


Pattern Driven Analytics

● Discovery and visualization of recurring patterns in datasets

● Mostly quantitative

● Paradigms in pattern discovery:

● Sampling and aggregation

● Thresholding and filtering

Image Source: Wikipedia


Pattern Driven Analytics

Sampling and Aggregation

● Query-based pattern aggregation
● Based on an initial idea of what we are looking for

[Figure: pipeline from Hypothesis and Data through Query, Patterns and Aggregation to Presentation]
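A minimal sketch of the sampling-and-aggregation idea, assuming a hypothesis such as "weekend transactions are larger". The record layout, the `is_weekend` predicate and the function name are hypothetical and only illustrate the flow from hypothesis to query to aggregate.

```python
import random
from statistics import mean

def sample_and_aggregate(records, predicate, key, sample_size, seed=0):
    """Draw a random sample, keep records matching the query, average one field."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    matched = [r[key] for r in sample if predicate(r)]
    return mean(matched) if matched else None

# Hypothetical transaction records (illustrative data only).
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
records = [
    {"day": d, "amount": 100.0 if d in ("Sat", "Sun") else 50.0}
    for d in days * 10
]

def is_weekend(record):
    # The "query" derived from the hypothesis.
    return record["day"] in ("Sat", "Sun")

estimate = sample_and_aggregate(records, is_weekend, "amount", sample_size=20)
print(estimate)
```

The query encodes what we expect to find, so the aggregate can only confirm or refute that initial idea.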


Pattern Driven Analytics

Thresholding and Filtering

● Based on sifting through the entire dataset (or a view) to look for “interesting” patterns without the context of a query

[Figure: Data and Interestingness criteria feed into Patterns, followed by Filtering and Segregation, then Presentation]
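One common instance of thresholding is frequent-pattern mining with a minimum support threshold. The sketch below assumes "interestingness" means that a pair of items appears together in at least a given fraction of transactions. The data and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (share of transactions) meets the threshold."""
    counts = Counter()
    for items in transactions:
        # Count each unordered pair once per transaction.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Illustrative transactions, not from the talk.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]
patterns = frequent_pairs(transactions, min_support=0.5)
print(patterns)
```

No query drives the scan: every pair in the dataset is counted, and only the threshold decides what is reported.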


Model Driven Analytics

Analytics as a model-discovery problem

[Figure: observable data (images) mapped to the latent concept “Wedding”. Image source: Wikipedia]


Model Driven Analytics

● Pattern discovery coupled with semantic modeling
● Non-trivial qualitative modeling challenges
● Model discovery:
  ● Descriptive model discovery: fit a model to explain the observed data
  ● Predictive model discovery: discover a model that can predict values of data elements into the future
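As a small illustration of the two kinds of model discovery, the sketch below fits a straight line to observed points by ordinary least squares (descriptive) and then uses the fitted line to extrapolate (predictive). The linear model and the data are assumptions chosen for brevity.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Illustrative observations that happen to lie on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

# Descriptive: a model that explains the observed data.
slope, intercept = fit_line(xs, ys)

# Predictive: use the same model for a value not yet observed.
def predict(x):
    return slope * x + intercept

print(slope, intercept, predict(5))
```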


Linked Data

Image source: Wikipedia

The Linked Data Cloud as of September 2011


Linked Data

● Using Semantic Web technologies to connect data elements from disparate data sources

● From Web of Documents to Web of Data
● Elements of Linked Data
  ● URIs
  ● HTTP
  ● Resource Description Framework (RDF)
  ● Serialization formats (RDFa, RDF/XML, N3, Turtle, and others)
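To make the elements above concrete, here is a minimal sketch that holds RDF-style triples as Python tuples and writes them out as N-Triples, a line-based format that is a subset of Turtle. The helper name is hypothetical, and telling URIs from literals by an "http" prefix is a simplification for illustration.

```python
def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        if o.startswith("http"):
            obj = f"<{o}>"  # treat as a URI reference (simplifying assumption)
        else:
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'  # plain literal
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

# Illustrative triples using DBpedia-style URIs.
triples = [
    ("http://dbpedia.org/resource/Bangalore",
     "http://dbpedia.org/ontology/country",
     "http://dbpedia.org/resource/India"),
    ("http://dbpedia.org/resource/Bangalore",
     "http://www.w3.org/2000/01/rdf-schema#label",
     "Bangalore"),
]
print(to_ntriples(triples))
```

Each URI names a resource that can be looked up over HTTP, which is what lets data elements from different sources link to one another.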


Big Data and the Semantic Web

Big Data → Semantic Web: Model Discovery

Semantic Web → Big Data: Catalyzation and Predictive Modeling


Big Data → Semantic Web

● One of the main elements of the Linked Data Cloud, DBpedia, is built from a Big Data resource: Wikipedia

● Open Biomedical Ontology (OBO) (http://www.oboedit.org/), created from mining PubMed publications

● Enterprise-scale Big Data Analytics helps build organizational models, operational intelligence solutions, etc. Examples: Anzo software suite by Cambridge Semantics (http://www.cambridgesemantics.com/), Loom data management suite by Revelytix (http://www.revelytix.com/)


Semantic Web → Big Data

Schema.org
● Collection of schemata on various topics that are recognized by major search providers and used to semantically interpret web content

SourceMap
● Linked data augmented with web content and crowdsourced data, used to provide details about companies such as their carbon footprint, energy use, water use, etc. http://www.sourcemap.com/

OpenStreetMap
● Linked data augmenting crowdsourced data on http://www.openstreetmap.org/ helped in detailed mapping of the disaster scenario during the Jan 2010 Haiti earthquake (http://www.scientificamerican.com/article.cfm?id=berners-lee-linked-data)


Big Data and the Semantic Web: Challenges

Semantic challenges
● The myth of a global ontology
● Convergent and divergent semantics

Technology and system challenges
● Characteristics of a semantic graph
● Managing graph structured data


The Myth of a Global Ontology

Several “core” semantic ontologies exist:
● WordNet
● YAGO
● OpenCyc
● SUMO

However, none of them (even automated ones) can capture all possible semantic associations and all possible perspectives on a given topic


The Myth of a Global Ontology

The open world problem

● We don't know what we don't know..

● Representation bias in big data sources

The neutral-but-useless perspective

● Localized, utilitarian descriptions often more useful than neutral, global descriptions. Ex: Use of “zones” as a geographical element in Indian Railways

● Difficult for disparate perspectives to coexist in a single Ontology, violating design principles like Occam's razor


Convergent and Divergent Semantics

[Figure: Palestine POV, Israeli POV, historians' POV and UN's POV feeding into the Wikipedia article on the West Bank conflict (encyclopedic semantics)]


Convergent and Divergent Semantics

[Figure: the IPL event schedule feeding into traffic planning, advertisement planning around IPL, legal structuring around IPL, TV programme scheduling and security planning]


Semantic Interoperability

● Binary predicates like those of RDF may not capture the complete semantics of an association

  But it is too difficult to work with higher-order predicates

● Semantic queries are characterized by contextual relevance and default assumptions

● Linked Data can be useful primarily within the context of a model

  Model-building from predicates is as complex a problem as identifying predicates from data
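A small sketch of why binary predicates can fall short: a single fact with several participants has to be broken into multiple triples tied together by an intermediate node. The relation, role names and helper below are hypothetical and only illustrate this n-ary pattern.

```python
from itertools import count

_node_ids = count(1)

def nary_to_triples(relation, **roles):
    """Express one n-ary fact as binary triples around a fresh blank node."""
    node = f"_:e{next(_node_ids)}"
    triples = [(node, "rdf:type", relation)]
    triples += [(node, role, value) for role, value in roles.items()]
    return triples

# "alice bought book42 from shopX" is one fact with three participants.
triples = nary_to_triples("Purchase", buyer="alice", item="book42", seller="shopX")
for t in triples:
    print(t)
```

The association survives only if all four triples are read together, which is already a small model rather than a set of independent predicates.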


Semantic Challenges: Summary

● Hard to distinguish data from noise without a model
  Especially hard when we are using data to help build a model!

● There may not be a single global model explaining the data

● Model construction is as challenging as predicate mining, if not more so

● No clarity on the underlying processes that aid in knowledge aggregation
  Knowledge aggregation happens differently depending on the kind of knowledge being aggregated (encyclopedic versus operational knowledge)


Tech Challenges

Storing Big Semantic Data

● Semantic data lacks the physical access coherence needed for efficient storage in relational tables
● Logical proximity of triples is more important than physical proximity
● Read/write storage models change logical proximity
● RDF graphs tend to be extremely dense and/or clustered
● Need efficient methods of graph storage and retrieval


Semantic store for Big Data

● Databases optimized to store and retrieve interrelated sets of triples of the form (subject, predicate, object)

● Query models based on answering graph queries (usually in SPARQL) rather than SQL queries

● Main design criteria: storage and read-ahead policies of triples based on their logical proximity rather than physical proximity in order to enable Bulk Synchronous Parallel (BSP) processing
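A toy in-memory sketch of a triple store, assuming two nested-dictionary indexes so that triples sharing a subject (or a predicate and object) sit together logically. The class and its methods are hypothetical and do not reflect the design of any of the systems discussed in the talk.

```python
from collections import defaultdict

class TripleStore:
    """Toy store with subject-first and predicate-first indexes."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))  # subject -> predicate -> objects
        self.pos = defaultdict(lambda: defaultdict(set))  # predicate -> object -> subjects

    def add(self, s, p, o):
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)

    def match(self, s=None, p=None, o=None):
        """Yield triples matching the pattern; None acts as a wildcard."""
        if s is not None:
            for p2, objs in self.spo.get(s, {}).items():
                if p is None or p2 == p:
                    for o2 in objs:
                        if o is None or o2 == o:
                            yield (s, p2, o2)
        elif p is not None:
            for o2, subjects in self.pos.get(p, {}).items():
                if o is None or o2 == o:
                    for s2 in subjects:
                        yield (s2, p, o2)
        else:
            # No subject or predicate given: fall back to a full scan.
            for s2, preds in self.spo.items():
                for p2, objs in preds.items():
                    for o2 in objs:
                        if o is None or o2 == o:
                            yield (s2, p2, o2)

store = TripleStore()
store.add("Bangalore", "locatedIn", "India")
store.add("Bangalore", "type", "City")
store.add("Mysore", "type", "City")
print(sorted(store.match(p="type", o="City")))
```

A pattern such as `match(p="type", o="City")` plays the role of a single SPARQL triple pattern. Real stores add more index orderings and persistence.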



AllegroGraph (http://www.franz.com/agraph/allegrograph/)

● NoSQL graph-based native storage for RDF triples
● ACID compliant
● Interfaces with Solr for free-text indexing
● Triple and text level indexing
● MongoDB integration
● RDFS++ reasoning with dynamic materialization
● SPARQL queries on named graphs and Prolog-based inferencing engine



Sesame http://www.openrdf.org/

● Open source Java framework for parsing, storing, querying and inferencing over RDF data
● Collections of RDF triples can be manipulated in memory using a graph data model
● Compliant with SPARQL 1.1 protocol recommendation
● Provides two levels of APIs: SAIL (Storage and Inference Layer) for low level RDF processing and Repository layer for programmatic interfacing with Sesame



Mulgara http://www.mulgara.org/

● Native storage model for RDF
● Supports multiple models (databases) per server
● ACID transactions and concurrency support
● Copy-on-write cache semantics
● Full-text search and support for data types
● Primarily useful as a repository – no evidence of support for logical inferences over RDF



Other examples:

● InfiniteGraph from Objectivity http://www.objectivity.com/
● BigData http://www.bigdata.com/bigdata/blog/
  – A high scale-out storage and computing engine
● Agama https://github.com/arrac/agama/wiki/Agama
  – Storage, search and traversal support (Ruby library) for very large graphs
● Neo4j http://www.neo4j.org/
  – Embedded, disk-based transactional graph database written in Java


Logical inference over Big Data

● Problem: Find factual answers to specific questions by reasoning over large-scale data.

● Performing extremely large-scale deductions over large semantic datasets in interactive response time

● Need to contend with potentially inconsistent predicates, incomplete or missing values and default assumptions

● Varieties of inference over datasets
  ● Deduction
  ● Induction
  ● Abduction
  ● Statistical inference


Logical inference over Big Data

Common approaches for scalable inferencing:
● Horn clause inferencing
● Variants of random walks on knowledge graphs
● Distributed MCMC (Markov Chain Monte Carlo) methods


Horn Clauses

Horn clauses are predicates of the form:

p1 ∧ p2 ∧ ... ∧ pn → u

where each pi is an atomic sentence with no negation and u is a single consequent

Horn clause knowledge bases can be resolved using “backward chaining”, starting from the consequent and building a tree of antecedents until they are grounded in facts

Horn clause resolution can be scaled over large datasets by parallelizing resolutions using MapReduce
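Backward chaining over propositional Horn clauses can be sketched in a few lines. The rule encoding (a consequent mapped to a list of antecedent lists) and the symbols used are assumptions for illustration. This is a single-machine sketch and does not show the MapReduce parallelization.

```python
def prove(goal, rules, facts, _seen=frozenset()):
    """Return True if goal follows from facts via Horn rules (backward chaining)."""
    if goal in facts:
        return True  # grounded in a known fact
    if goal in _seen:
        return False  # goal already being expanded on this branch: avoid cycles
    seen = _seen | {goal}
    # The goal holds if every antecedent of at least one rule for it can be proved.
    return any(
        all(prove(p, rules, facts, seen) for p in body)
        for body in rules.get(goal, [])
    )

# Hypothetical knowledge base: p1 ∧ p2 → u, and q → p2.
rules = {"u": [["p1", "p2"]], "p2": [["q"]]}
facts = {"p1", "q"}

print(prove("u", rules, facts))
print(prove("p3", rules, facts))
```

Each recursive call expands one node of the tree of antecedents. Independent subtrees are what a parallel implementation would distribute.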


Random Walks on Big Data

Random walks on RDF graphs as a means of:

● Belief materialization● Soft inference

[Figure: a path over nodes a, c, e, d, f, b whose edges are labelled R]

Assuming transitivity of R
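A minimal sketch of soft inference by random walks, assuming R is transitive so that reaching a target node along R edges counts as evidence for R(source, target). The graph, the branch to a dead-end node x and the scoring function are hypothetical.

```python
import random

def walk_score(graph, source, target, walks=1000, max_steps=10, seed=0):
    """Fraction of random walks from source that reach target within max_steps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(walks):
        node = source
        for _ in range(max_steps):
            successors = graph.get(node, [])
            if not successors:
                break  # dead end
            node = rng.choice(successors)
            if node == target:
                hits += 1
                break
    return hits / walks

# Hypothetical R edges forming a chain a -> c -> e -> d -> f -> b.
chain = {"a": ["c"], "c": ["e"], "e": ["d"], "d": ["f"], "f": ["b"]}
# The same chain with an extra branch from c to a dead-end node x.
branched = dict(chain, c=["e", "x"])

print(walk_score(chain, "a", "b"))
print(walk_score(branched, "a", "b"))
```

The score acts as a confidence value for the inferred relation rather than a hard true or false, which is what makes the inference soft.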


Random Walks on Big Data

Large scale graph processing solutions for scaling random walks over Big Data:

● Apache Giraph http://giraph.apache.org/
● Pregel [Malewicz et al., 2010]
● Grappa http://www.cs.washington.edu/node/4217/


MCMC

A “generic” problem solving method based on local sampling, useful for soft inferences on semantic data

Time homogeneous Markov Chain:
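In standard notation (the symbols X_n, i, j and p_ij are assumptions here, not necessarily those used in the talk), a Markov chain is time homogeneous when its one-step transition probabilities depend only on the current state and not on the step n:

```latex
P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \dots, X_0 = i_0)
  = P(X_{n+1} = j \mid X_n = i) = p_{ij} \quad \text{for all } n \ge 0
```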


MCMC

A homogeneous Markov chain can be represented as a set of “states” and “transition probabilities” across states

Given an initial “prior” probability distribution across states the “stationary distribution” or “equilibrium condition” is defined as:
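Writing P for the matrix of transition probabilities p_ij and μ_0 for the prior distribution (notation assumed here), a stationary distribution π is a probability vector that one step of the chain leaves unchanged. For an irreducible, aperiodic chain on a finite state space, the prior converges to it:

```latex
\pi = \pi P, \qquad \sum_{j} \pi_j = 1, \qquad \lim_{n \to \infty} \mu_0 P^{n} = \pi
```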


MCMC

Markov Chain Monte Carlo

Given a state space S and an “equilibrium” distribution π, choose a sample s of the state space S so that a Markov chain on s results in π as the stationary distribution

MCMC for logical inference

For a logical inference problem, the equilibrium condition would be of the form [0,1]^m defined over a set of m predicates

Example sampling algorithms for MCMC

Gibbs sampling http://en.wikipedia.org/wiki/Gibbs_sampling

Metropolis-Hastings algorithm http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm
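A minimal sketch of the Metropolis special case of Metropolis-Hastings on a small discrete state space. It assumes a symmetric proposal (a step to either neighbour on a ring of states) and positive, unnormalized target weights. The function name and the weights are hypothetical.

```python
import random
from collections import Counter

def metropolis(weights, steps, seed=0):
    """Estimate the normalized target distribution from unnormalized weights."""
    rng = random.Random(seed)
    n = len(weights)
    state = 0
    visits = Counter()
    for _ in range(steps):
        # Symmetric proposal: move to the left or right neighbour on a ring.
        proposal = (state + rng.choice((-1, 1))) % n
        # Accept with probability min(1, w[proposal] / w[state]).
        if rng.random() < min(1.0, weights[proposal] / weights[state]):
            state = proposal
        visits[state] += 1
    return [visits[i] / steps for i in range(n)]

weights = [1.0, 2.0, 3.0, 4.0]  # target is proportional to these weights
estimate = metropolis(weights, steps=200_000)
print(estimate)
```

Only ratios of weights are needed, so the normalizing constant of the target never has to be computed. The visit frequencies should approach the weights divided by their sum as the number of steps grows.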


Scaling MCMC for Big Data

Distributed MCMC

Several models are explored for distributing MCMC computations over large datasets making them amenable to diffusing computations. Some examples include: [Murray 2010; Singh et al 2011]

Distributional models for MCMC beyond the scope of this talk..


On the road ahead..

Some promising directions for Big Data and Semantics

● Diffusion models for large scale inference
● Cognitive models for semantics over large scale data
● Model-based reasoning and reasoning across models
● Soft (probabilistic) inferences, confidence measures, relevance feedback
● Continuous learning over Big Data


Thank You!


References

● Neal Madras. Introduction to Markov Chain Monte Carlo. http://www.cs.cornell.edu/selman/cs475/lectures/intro-mcmc-lukas.pdf

● Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10). ACM, New York, NY, USA, 135-146. DOI=10.1145/1807167.1807184 http://doi.acm.org/10.1145/1807167.1807184

● Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 529-539.

● Lawrence Murray. 2010. Distributed Markov Chain Monte Carlo. In Proceedings of the NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds. http://lccc.eecs.berkeley.edu/

● Stefan Schoenmackers, Oren Etzioni, and Daniel S. Weld. 2008. Scaling textual inference to the web. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 79-88.

● Stefan Schoenmackers, Oren Etzioni, Daniel S. Weld, and Jesse Davis. 2010. Learning first-order Horn clauses from web text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 1088-1098.

● Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. 2011. Large-scale cross-document coreference using distributed inference and hierarchical models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (HLT '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 793-803.