Big Data Tech Conclave, 26–27 April 2013, Bangalore, India
Big Data and the Semantic Web: Challenges and Opportunities
Srinath Srinivasa
Open Systems Laboratory, IIIT Bangalore
http://osl.iiitb.ac.in/
http://www.bda2013.net/
OSL Releases
Topical Anchors: Given a list of noun phrases, identify a semantic topic for these terms.
Powered by a Wikipedia co-occurrence graph hosted by Agama
Web APIs enable use of Topical Anchors in third-party applications
OSL Releases
Topic Expansion: Given a term, expands it into semantically relevant topical clusters with different senses.
Uses co-occurrence datasets from Wikipedia 2006 or 2011.
Web APIs enable use by third-party applications
OSL Releases
Agama: A graph database for storing large undirected graphs for efficient traversal (not structure-based retrieval)
Agama currently powers a co-occurrence graph of all noun phrases from Wikipedia articles, hosted at OSL, managing tens of millions of nodes and hundreds of millions of edges
More data beats better algorithms..
meets
No data is an island..
Outline
● Big Data Characteristics
● Big Data Analytics
  ● Pattern-driven and Model-driven Analytics
● Big Data and the Semantic Web
● Semantic Challenges
  ● The myth of a global ontology
  ● Convergent and divergent semantics
  ● Semantic interoperability
● Technology Challenges
  ● Storage, traversal and retrieval of large-scale semantic networks
  ● Inference on Big Data
● On the road ahead
Big Data
Data that is:
● Too large to be processed by conventional databases and data management techniques (Volume)
● So diverse in structure that no single data model captures all elements of the data (Variety)
● Transient and/or impermanent, especially when pertaining to dynamic phenomena (Velocity)
Big Data
● Transaction records
● Network streams
● Experimental output
● Social media data
● Demographic records
● Citation data
● Clickstreams
● Log data
● Weather data
● …
Some Big Data Stats
● YouTube users upload 48 hours of video every minute http://gigaom.com/2011/05/25/youtube48hoursofvideoperminute/
● Facebook data grows by 500TB daily http://www.slashgear.com/facebookdatagrowsbyover500tbdaily23243691/
● WalMart handles more than 1 million customer transactions every hour http://www.economist.com/node/15557443
● Akamai analyzes 75 million events per day for targeted advertising http://wikibon.org/blog/tamingbigdata/
● 90% of data in the world today was created in the last 2 years http://wikibon.org/blog/bigdatainfographics/
Big Data Analytics
Examine Big Data for useful (often actionable) knowledge
The long spectrum of Big Data Analytics
[Figure: analytics tasks placed on a spectrum between a pattern-driven end and a model-driven end. Tasks shown: pattern identification, association rule mining, classification/clustering, record linkage, security analytics, complex event processing, opinion mining, predictive modeling]
Pattern Driven Analytics
● Discovery and visualization of recurring patterns in datasets
● Mostly quantitative
● Paradigms in pattern discovery:
  ● Sampling and aggregation
  ● Thresholding and filtering
Image Source: Wikipedia
Pattern Driven Analytics
Sampling and Aggregation
● Query-based pattern aggregation
● Based on an initial idea of what we are looking for
[Figure: flow diagram with the stages Hypothesis, Data, Query, Patterns, Aggregation, Presentation]
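The sampling-and-aggregation paradigm can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields, the `sample_and_aggregate` helper and the example hypothesis are invented here and are not part of the talk.

```python
import random
from collections import Counter

def sample_and_aggregate(records, query, key, sample_size, seed=0):
    """Filter records with a query, sample the matches, and count them by key."""
    matching = [r for r in records if query(r)]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    if len(matching) > sample_size:
        matching = rng.sample(matching, sample_size)
    return Counter(key(r) for r in matching)

# Hypothetical transaction records, invented for illustration.
records = [
    {"store": "A", "item": "milk", "amount": 40},
    {"store": "A", "item": "bread", "amount": 25},
    {"store": "B", "item": "milk", "amount": 42},
    {"store": "B", "item": "jam", "amount": 90},
    {"store": "C", "item": "milk", "amount": 38},
]

# Hypothesis: milk is bought across many stores.
counts = sample_and_aggregate(
    records,
    query=lambda r: r["item"] == "milk",   # the query encodes the hypothesis
    key=lambda r: r["store"],              # aggregate matches per store
    sample_size=100,
)
print(counts)
```

The query is written first, from the hypothesis, and only the matching records are sampled and aggregated for presentation.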
Pattern Driven Analytics
Thresholding and Filtering
● Based on sifting through the entire dataset (or a view) to look for “interesting” patterns without the context of a query
[Figure: flow diagram with the stages Data, Interestingness criteria, Patterns, Filtering and Segregation, Presentation]
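As a minimal sketch of thresholding and filtering, the code below scans every transaction, counts co-occurring item pairs, and keeps only the pairs whose support meets a threshold. The choice of "support" as the interestingness criterion, the `frequent_pairs` helper and the sample baskets are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (fraction of transactions) meets the threshold."""
    counts = Counter()
    for items in transactions:
        # Count each unordered pair at most once per transaction.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}

# Hypothetical shopping baskets, invented for illustration.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]

interesting = frequent_pairs(transactions, min_support=0.5)
print(interesting)
```

No query drives the scan here: the whole dataset is examined and the threshold alone decides which patterns survive.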
Model Driven Analytics
Analytics as a model-discovery problem
[Figure: images (observable data) linked to the latent concept “Wedding”. Images source: Wikipedia]
Model Driven Analytics
● Pattern discovery coupled with semantic modeling
● Non-trivial qualitative modeling challenges
● Model discovery:
  ● Descriptive model discovery: Fit a model to explain the observed data
  ● Predictive model discovery: Discover a model that can predict values of data elements into the future
Linked Data
[Figure: The Linked Data Cloud as of September 2011. Image source: Wikipedia]
Linked Data
● Using Semantic Web technologies to connect data elements from disparate data sources
● From Web of Documents to Web of Data
● Elements of Linked Data
  ● URIs
  ● HTTP
  ● Resource Description Framework (RDF)
  ● Serialization formats (RDFa, RDF/XML, N3, Turtle, and others)
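To make the elements above concrete, here is a small sketch that models RDF statements as (subject, predicate, object) tuples and writes them out as N-Triples-style lines, which are also valid Turtle. The `URI` and `Literal` wrappers, the helper functions and the `example.org` identifiers are hypothetical; a real application would normally use an RDF library instead.

```python
from typing import NamedTuple

class URI(NamedTuple):
    value: str

class Literal(NamedTuple):
    value: str

def format_term(term):
    """Render a URI in angle brackets and a literal as a quoted string."""
    if isinstance(term, URI):
        return f"<{term.value}>"
    escaped = term.value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(
        f"{format_term(s)} {format_term(p)} {format_term(o)} ." for s, p, o in triples
    )

# example.org URIs are placeholders invented for illustration.
triples = [
    (URI("http://example.org/city/Bangalore"),
     URI("http://example.org/prop/locatedIn"),
     URI("http://example.org/country/India")),
    (URI("http://example.org/city/Bangalore"),
     URI("http://www.w3.org/2000/01/rdf-schema#label"),
     Literal("Bangalore")),
]

print(to_ntriples(triples))
```

Because subjects, predicates and linked objects are all URIs, statements from disparate sources can refer to the same resource simply by reusing its identifier.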
Big Data and the Semantic Web
[Figure: Big Data and the Semantic Web linked in both directions, with the labels “Model Discovery” and “Catalyzation and Predictive Modeling”]
Big Data → Semantic Web
● DBpedia, one of the main elements of the Linked Data Cloud, is built from a Big Data resource: Wikipedia
● Open Biomedical Ontology (OBO) (http://www.oboedit.org/) created from mining PubMed publications
● Enterprise-scale Big Data Analytics helps build organizational models, operational intelligence solutions, etc. Examples: the Anzo software suite by Cambridge Semantics (www.cambridgesemantics.com) and the Loom data management suite by Revelytix (www.revelytix.com)
Semantic Web → Big Data
Schema.org
● Collection of schemata on various topics that are recognized by major search providers and used to semantically interpret web content
SourceMap
● Linked data augmented with web content and crowdsourced data, used to provide details about companies such as their carbon footprint, energy use, water use, etc. www.sourcemap.com
OpenStreetMap
● Linked data augmenting crowdsourced data on www.openstreetmap.org helped in detailed mapping of the disaster scenario during the January 2010 Haiti earthquake (http://www.scientificamerican.com/article.cfm?id=bernersleelinkeddata)
Big Data and the Semantic Web: Challenges
Semantic challenges
● The myth of a global ontology
● Convergent and divergent semantics
Technology and system challenges
● Characteristics of a semantic graph
● Managing graph structured data
The Myth of a Global Ontology
Several “core” semantic ontologies exist:
● WordNet
● YAGO
● OpenCyc
● SUMO
However, none of them (even automated ones) can capture all possible semantic associations and all possible perspectives on a given topic
The Myth of a Global Ontology
The open world problem
● We don't know what we don't know..
● Representation bias in big data sources
The neutral-but-useless perspective
● Localized, utilitarian descriptions often more useful than neutral, global descriptions. Ex: Use of “zones” as a geographical element in Indian Railways
● Difficult for disparate perspectives to coexist in a single Ontology, violating design principles like Occam's razor
Convergent and Divergent Semantics
[Figure: the Wikipedia article on the West Bank conflict at the centre, surrounded by the Palestine POV, Israeli POV, historians' POV and UN's POV. Caption: Encyclopedic Semantics]
Convergent and Divergent Semantics
[Figure: the IPL event schedule at the centre, surrounded by traffic planning, advertisement planning around IPL, legal structuring around IPL, TV programme scheduling and security planning]
Semantic Interoperability
● Binary predicates like RDF may not capture the complete semantics of an association
  But it is too difficult to work with higher-order predicates
● Semantic queries are characterized by contextual relevance and default assumptions
● Linked Data can be useful primarily within the context of a model
  Model-building from predicates is as complex a problem as identifying predicates from data
Semantic Challenges: Summary
● Hard to distinguish data from noise without a model
  Especially hard when we are using data to help build a model!
● There may not be a single global model explaining the data
● Model construction is as challenging as predicate mining, if not more so
● No clarity on the underlying processes that aid in knowledge aggregation
  Knowledge aggregation happens differently depending on the kind of knowledge being aggregated (encyclopedic versus operational knowledge)
Tech Challenges
Storing Big Semantic Data
● Semantic data lacks the physical access coherence needed for efficient storage in relational tables
● Logical proximity of triples is more important than physical proximity
● Read/write storage models change logical proximity
● RDF graphs tend to be extremely dense and/or clustered
● Need efficient methods of graph storage and retrieval
Semantic store for Big Data
● Databases optimized to store and retrieve interrelated sets of triples of the form (subject, predicate, object)
● Query models based on answering graph queries (usually in SPARQL) rather than SQL queries
● Main design criteria: storage and read-ahead policies for triples based on their logical proximity rather than physical proximity, in order to enable Bulk Synchronous Parallel (BSP) processing
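A toy in-memory version of such a store can be sketched as follows. It is not how any of the products named in this talk are implemented; the `TripleStore` class, its three indexes and the sample facts are assumptions chosen to show how indexing keeps logically related triples together and how a graph pattern with wildcards is answered.

```python
from collections import defaultdict

class TripleStore:
    """Toy triple store with one index per position (subject, predicate, object)."""

    def __init__(self):
        self._triples = set()
        self._by_s = defaultdict(set)
        self._by_p = defaultdict(set)
        self._by_o = defaultdict(set)

    def add(self, s, p, o):
        triple = (s, p, o)
        self._triples.add(triple)
        # Each index groups triples that share a term, i.e. logically close triples.
        self._by_s[s].add(triple)
        self._by_p[p].add(triple)
        self._by_o[o].add(triple)

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        candidates = None
        for value, index in ((s, self._by_s), (p, self._by_p), (o, self._by_o)):
            if value is not None:
                bucket = index.get(value, set())
                candidates = bucket if candidates is None else candidates & bucket
        if candidates is None:
            candidates = self._triples
        return sorted(candidates)

# Sample facts invented for illustration.
store = TripleStore()
store.add("Bangalore", "locatedIn", "India")
store.add("Mysore", "locatedIn", "India")
store.add("Bangalore", "type", "City")

print(store.match(p="locatedIn", o="India"))
print(store.match(s="Bangalore"))
```

The pattern `match(p="locatedIn", o="India")` plays the role of a single SPARQL triple pattern with a variable in the subject position.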
Semantic store for Big Data
AllegroGraph (http://www.franz.com/agraph/allegrograph/)
● NoSQL graph-based native storage for RDF triples
● ACID compliant
● Interfaces with Solr for free-text indexing
● Triple- and text-level indexing
● MongoDB integration
● RDFS++ reasoning with dynamic materialization
● SPARQL queries on named graphs and a Prolog-based inferencing engine
Semantic store for Big Data
Sesame http://www.openrdf.org/
● Open source Java framework for parsing, storing, querying and inferencing over RDF data
● Collections of RDF triples can be manipulated in memory using a graph data model
● Compliant with the SPARQL 1.1 protocol recommendation
● Provides two levels of APIs: SAIL (Storage and Inference Layer) for low-level RDF processing and the Repository layer for programmatic interfacing with Sesame
Semantic store for Big Data
Mulgara http://www.mulgara.org/
● Native storage model for RDF
● Supports multiple models (databases) per server
● ACID transactions and concurrency support
● Copy-on-write cache semantics
● Full-text search and support for data types
● Primarily useful as a repository: no evidence of support for logical inferences over RDF
Semantic store for Big Data
Other examples:
● InfiniteGraph from Objectivity http://www.objectivity.com/
● BigData http://www.bigdata.com/bigdata/blog/
  – A high scale-out storage and computing engine
● Agama https://github.com/arrac/agama/wiki/Agama
  – Storage, search and traversal support (Ruby library) for very large graphs
● Neo4j http://www.neo4j.org/
  – Embedded, disk-based transactional graph database written in Java
Logical inference over Big Data
● Problem: Find factual answers to specific questions by reasoning over large-scale data.
● Performing extremely large-scale deductions over large semantic datasets in interactive response time
● Need to contend with potentially inconsistent predicates, incomplete or missing values and default assumptions
● Varieties of inference over datasets
  ● Deduction
  ● Induction
  ● Abduction
  ● Statistical inference
Logical inference over Big Data
Common approaches for scalable inferencing:
● Horn clause inferencing
● Variants of random walks on knowledge graphs
● Distributed MCMC (Markov Chain Monte Carlo) methods
Horn Clauses
Horn clauses are predicates of the form:
p1 ∧ p2 ∧ ... ∧ pn → u
where each pi and u is an atomic sentence with no negation, and u is the single consequent
Horn clause knowledge bases can be resolved using “backward chaining”, starting from the consequent and building a tree of antecedents until they are grounded in facts
Horn clause resolution can be scaled over large datasets by parallelizing resolutions using MapReduce
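Backward chaining over Horn clauses can be sketched for the propositional case as below. The sketch is single-machine and recursive, so it does not show the MapReduce parallelization mentioned above; the `backward_chain` function, the rule encoding as (antecedents, consequent) pairs and the weather example are assumptions made for illustration.

```python
def backward_chain(goal, rules, facts, _visiting=frozenset()):
    """Return True if goal follows from facts under the given Horn rules.

    rules is a list of (antecedents, consequent) pairs, read as
    p1 AND p2 AND ... AND pn -> consequent.
    """
    if goal in facts:
        return True               # grounded in a known fact
    if goal in _visiting:
        return False              # cyclic dependency: abandon this branch
    visiting = _visiting | {goal}
    for antecedents, consequent in rules:
        if consequent == goal and all(
            backward_chain(a, rules, facts, visiting) for a in antecedents
        ):
            return True           # every antecedent of this rule was proven
    return False

# Hypothetical knowledge base, invented for illustration.
rules = [
    (("rainy", "no_umbrella"), "wet"),
    (("cloudy", "humid"), "rainy"),
]
facts = {"cloudy", "humid", "no_umbrella"}

print(backward_chain("wet", rules, facts))
print(backward_chain("sunny", rules, facts))
```

Starting from the consequent `wet`, the search expands `rainy` into `cloudy` and `humid`, and stops when every leaf of the antecedent tree is a fact.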
Random Walks on Big Data
Random walks on RDF graphs as a means of:
● Belief materialization
● Soft inference
[Figure: graph over the nodes a, b, c, d, e, f with edges labelled R. Caption: Assuming transitivity of R]
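One way to read soft inference by random walk: if R is assumed transitive, the fraction of walks along R-edges that get from a to b gives a confidence score for the unstated fact R(a, b). The sketch below illustrates that reading only; the `reach_score` function, the example graph (including the extra dead-end node x) and the parameter values are assumptions, not the method of any cited system.

```python
import random

def reach_score(graph, start, target, walks=2000, max_steps=10, seed=0):
    """Estimate how often a random walk from start reaches target within max_steps."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    hits = 0
    for _ in range(walks):
        node = start
        for _ in range(max_steps):
            successors = graph.get(node, [])
            if not successors:
                break                      # dead end: this walk fails
            node = rng.choice(successors)  # follow one R-edge at random
            if node == target:
                hits += 1
                break
    return hits / walks

# Adjacency lists for edges labelled R; node "x" is a dead end added for illustration.
graph = {"a": ["c"], "c": ["e", "x"], "e": ["d"], "d": ["f"], "f": ["b"], "x": []}

print(reach_score(graph, "a", "b"))
```

Half of the walks are expected to be lost at the dead end, so the score for R(a, b) should come out near 0.5 rather than as a hard true or false.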
Random Walks on Big Data
Large-scale graph processing solutions for scaling random walks over Big Data:
● Apache Giraph http://giraph.apache.org/
● Pregel [Malewicz et al., 2010]
● Grappa http://www.cs.washington.edu/node/4217/
MCMC
A “generic” problem solving method based on local sampling, useful for soft inferences on semantic data
Time homogeneous Markov Chain:
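In standard textbook notation, a chain with states X_0, X_1, ... is time homogeneous when its one-step transition probabilities do not depend on the step index n. The symbols X_n and p_ij are the usual conventions and are assumed here:

```latex
\Pr(X_{n+1} = j \mid X_n = i) \;=\; \Pr(X_1 = j \mid X_0 = i) \;=\; p_{ij}
\qquad \text{for all } n \ge 0
```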
MCMC
A homogeneous Markov chain can be represented as a set of “states” and “transition probabilities” across states
Given an initial “prior” probability distribution across states the “stationary distribution” or “equilibrium condition” is defined as:
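In standard notation, with P = (p_ij) the transition matrix and π^(0) the prior written as a row vector (both symbols are conventions assumed here), a stationary distribution π is left unchanged by one step of the chain:

```latex
\pi = \pi P, \qquad \pi_i \ge 0, \qquad \sum_i \pi_i = 1
```

The distribution after n steps is π^(n) = π^(0) P^n. When this sequence converges, its limit satisfies the equation above:

```latex
\lim_{n \to \infty} \pi^{(0)} P^{n} = \pi
```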
MCMC
Markov Chain Monte Carlo
Given a state space S and an “equilibrium” distribution π, choose a sample s of the state space S so that a Markov chain on s results in π as the stationary distribution
MCMC for logical inference
For a logical inference problem, the equilibrium condition would be of the form [0,1]^m defined over a set of m predicates
Example sampling algorithms for MCMC
Gibbs Sampling http://en.wikipedia.org/wiki/Gibbs_sampling
Metropolis-Hastings algorithm http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm
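A minimal sketch of Metropolis-Hastings on a small discrete state space follows. It uses a uniform, and therefore symmetric, proposal, so the acceptance rule reduces to the plain Metropolis ratio. The target weights, the function name and the step count are assumptions for illustration, and the example samples a simple distribution rather than a logical inference problem.

```python
import random
from collections import Counter

def metropolis_hastings(weights, steps, seed=0):
    """Sample state indices with probability proportional to positive weights.

    The proposal picks any state uniformly at random, so it is symmetric and
    the acceptance probability is min(1, w[proposal] / w[current]).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(weights)
    state = rng.randrange(n)
    samples = []
    for _ in range(steps):
        proposal = rng.randrange(n)
        accept_prob = min(1.0, weights[proposal] / weights[state])
        if rng.random() < accept_prob:
            state = proposal      # accept the move
        samples.append(state)     # a rejected move repeats the current state
    return samples

# Unnormalized target: the equilibrium distribution is (0.1, 0.2, 0.3, 0.4).
weights = [1, 2, 3, 4]
samples = metropolis_hastings(weights, steps=50000)
frequencies = Counter(samples)
for state, weight in enumerate(weights):
    print(state, round(frequencies[state] / len(samples), 3), weight / sum(weights))
```

Only ratios of weights are used, so the target never has to be normalized; this is what makes the method practical when the normalizing constant is expensive to compute.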
Scaling MCMC for Big Data
Distributed MCMC
Several models have been explored for distributing MCMC computations over large datasets, making them amenable to diffusing computations. Some examples include: [Murray 2010; Singh et al 2011]
Distributional models for MCMC are beyond the scope of this talk..
On the road ahead..
Some promising directions for Big Data and Semantics
● Diffusion models for large-scale inference
● Cognitive models for semantics over large-scale data
● Model-based reasoning and reasoning across models
● Soft (probabilistic) inferences, confidence measures, relevance feedback
● Continuous learning over Big Data
Thank You!
References
● Neal Madras. Introduction to Markov Chain Monte Carlo. http://www.cs.cornell.edu/selman/cs475/lectures/intromcmclukas.pdf
● Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10). ACM, New York, NY, USA, 135-146. DOI=10.1145/1807167.1807184 http://doi.acm.org/10.1145/1807167.1807184
● Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 529-539.
● Lawrence Murray. 2010. Distributed Markov Chain Monte Carlo. In Proceedings of the NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds. http://lccc.eecs.berkeley.edu/
● Stefan Schoenmackers, Oren Etzioni, and Daniel S. Weld. 2008. Scaling textual inference to the web. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 79-88.
● Stefan Schoenmackers, Oren Etzioni, Daniel S. Weld, and Jesse Davis. 2010. Learning first-order Horn clauses from web text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 1088-1098.
● Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. 2011. Large-scale cross-document coreference using distributed inference and hierarchical models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (HLT '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 793-803.