Big Data Management: Storing and Querying the Semantic Web. Artem Chebotko, Department of Computer Science, University of Texas – Pan American, chebotkoa@utpa.edu, http://faculty.utpa.edu/chebotkoa. October 17, 2012.
Big Data Management: Storing and Querying the Semantic Web
Artem Chebotko
Department of Computer Science
University of Texas – Pan American
chebotkoa@utpa.edu
http://faculty.utpa.edu/chebotkoa
October 17, 2012
Background: Data Management
Database
File System
Legacy Database
Relational Database
Object-Oriented Database
XML Database
RDF Database
NoSQL Database
Background: Big Data
Big Data
Web-Scale Data
Many companies work at this level: Google, Yahoo!, LinkedIn, Facebook, Twitter, Amazon, Walmart, etc.
Many more companies will have to deal with Big Data this decade
“As of 2012, about 2.5 exabytes of data are created each day, and the number is doubling every 40 months or so” and “Walmart collects more than 2.5 petabytes of data every hour” (Harvard Business Review, October 2012)
1 EB = 1,000,000 TB
1 PB = 1,000 TB
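The quoted figures can be put on a common scale with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the decimal unit definitions given above; the variable names are illustrative and not part of the talk.

```python
# Decimal units as defined on the slide: 1 PB = 1,000 TB and 1 EB = 1,000,000 TB.
TB_PER_PB = 1_000
TB_PER_EB = 1_000_000

# Figures quoted from Harvard Business Review (October 2012).
world_eb_per_day = 2.5      # data created worldwide each day
walmart_pb_per_hour = 2.5   # data collected by Walmart each hour

# Convert both rates to terabytes per day.
world_tb_per_day = world_eb_per_day * TB_PER_EB
walmart_tb_per_day = walmart_pb_per_hour * 24 * TB_PER_PB

# "Doubling every 40 months" expressed as a growth factor per 12 months.
yearly_growth = 2 ** (12 / 40)

print(f"World:   {world_tb_per_day:,.0f} TB/day")
print(f"Walmart: {walmart_tb_per_day:,.0f} TB/day")
print(f"Growth:  x{yearly_growth:.2f} per year")
```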
Background: Big Data
What can you do with 1 PB of data?
Data Scientist: The Sexiest Job of the 21st Century (HBR, October 2012)
Data Management Skills
Programming Skills
Data Mining and Data Analysis Skills
Social Skills
Business Understanding
The Semantic Web – a neat, meaningful mate for the messy, unstructured Big Data
WWW and Semantic Web
World Wide Web – Web of Linked Documents
Enormous collection of information (Big Data) intended for people to share and use
Keyword-based search
Semantic Web – Web of Data
An emerging vision to make information collected by WWW processable by machines
Computational knowledge-based search/answering
Big Data
Motivating Example
Web Search Example:
Find a professor in UTPA who authored an article published in Data & Knowledge Engineering in 2009.
This information is available on two different pages of my website: welcome.html and publications.html
Example (cont): traditional search
Google search in Nov. 2009 finds 184 documents
One of them mentions my name
• It is not displayed on the first page of the results
• It contains my name and affiliation, but no information about the DKE article
Google search in Oct. 2010 finds ~19,800 documents
Four of them mention my name and affiliation
• They are not displayed on the first page of the results
• No information about the DKE article
2011 & 2012: no noticeable improvement
Example (cont): traditional search
What went wrong?
Keyword-based search interprets my query as a list of syntactic words: professor, UTPA, data, article, publish, knowledge, engineering, 2009
It searches for a document that contains as many matching words as possible
PageRank is “biased” towards the keyword ‘UTPA’
Moreover, my two pieces of information are viewed as lists of syntactic words. The pieces are not linked!
Example (cont): semantic search
How can we do better?
Encode the two pieces of information as machine-interpretable data
Link them
Express (automatically) the natural language query in a machine-friendly query language
Example (cont): encoding
<resource1> <type> <Professor>.
<resource1> <name> "Artem Chebotko".
<resource1> <worksIn> <resource2>.
<resource2> <type> <University>.
<resource2> <name> "UTPA".

<resource3> <type> <Journal>.
<resource3> <title> "Data & …".
<resource3> <published> <resource4>.
<resource4> <type> <Article>.
<resource4> <title> "Semantics …".
<resource4> <year> "2009".
<resource4> <author> <resource5>.

<resource1> <sameAs> <resource5>.
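As a rough illustration of what "machine-interpretable" means here, the statements above can be held as plain (subject, predicate, object) tuples. This is a minimal sketch, not the format used in the talk: the names `first_source`, `second_source`, `link`, and `describe` are hypothetical, and which web page each group of statements came from is not stated in the slides.

```python
# Each statement from the slide becomes a (subject, predicate, object) tuple.
# Literals are kept as plain strings; elided titles ("…") stay elided.
first_source = [
    ("resource1", "type", "Professor"),
    ("resource1", "name", "Artem Chebotko"),
    ("resource1", "worksIn", "resource2"),
    ("resource2", "type", "University"),
    ("resource2", "name", "UTPA"),
]
second_source = [
    ("resource3", "type", "Journal"),
    ("resource3", "title", "Data & …"),
    ("resource3", "published", "resource4"),
    ("resource4", "type", "Article"),
    ("resource4", "title", "Semantics …"),
    ("resource4", "year", "2009"),
    ("resource4", "author", "resource5"),
]
# The single statement that connects the two sources.
link = [("resource1", "sameAs", "resource5")]

# Integrating the sources is just a set union of their statements.
graph = set(first_source) | set(second_source) | set(link)


def describe(resource, triples):
    """Return the sorted (predicate, object) pairs stated about one resource."""
    return sorted((p, o) for s, p, o in triples if s == resource)


print(len(graph))
print(describe("resource1", graph))
```

Because every statement has the same three-part shape, combining data from different pages needs no schema negotiation; the sameAs statement is what ties the two descriptions together.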
Example (cont): linked data!
[Figure: RDF graph of the statements above. resource1 (type Professor, name "Artem Chebotko") worksIn resource2 (type University, name "UTPA"); resource3 (type Journal, title "Data & ...") published resource4 (type Article, title "Semantics ...", year "2009"), whose author is resource5; resource1 sameAs resource5.]
Information from two sources was integrated
Example (cont): query
Find a professor in UTPA who authored an article published in Data & Knowledge Engineering in 2009.
SELECT ?name
WHERE {
  ?p <type> <Professor>.
  ?p <name> ?name.
  ?p <worksIn> ?u.
  ?u <type> <University>.
  ?u <name> "UTPA".
  ?j <type> <Journal>.
  ?j <title> "Data & …".
  ?j <published> ?a.
  ?a <type> <Article>.
  ?a <title> "Semantics …".
  ?a <year> "2009".
  ?a <author> ?p.
}
Result: ?name = "Artem Chebotko"
This is the exact answer to our question
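To make the evaluation concrete, here is a small sketch of how such a query can be answered by matching triple patterns against the encoded data. It is a toy matcher, not a SPARQL engine: `canonicalize` and `match` are hypothetical helpers, and the sameAs link is applied by simply rewriting resource5 to resource1, which is one simple assumption about how the two identifiers get merged.

```python
TRIPLES = [
    ("resource1", "type", "Professor"),
    ("resource1", "name", "Artem Chebotko"),
    ("resource1", "worksIn", "resource2"),
    ("resource2", "type", "University"),
    ("resource2", "name", "UTPA"),
    ("resource3", "type", "Journal"),
    ("resource3", "title", "Data & …"),
    ("resource3", "published", "resource4"),
    ("resource4", "type", "Article"),
    ("resource4", "title", "Semantics …"),
    ("resource4", "year", "2009"),
    ("resource4", "author", "resource5"),
    ("resource1", "sameAs", "resource5"),
]


def canonicalize(triples):
    """Rewrite each sameAs object to its subject (single level, no chains)."""
    alias = {o: s for s, p, o in triples if p == "sameAs"}
    return [(alias.get(s, s), p, alias.get(o, o))
            for s, p, o in triples if p != "sameAs"]


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind a new variable, or check an already bound one.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(rest, triples, candidate)


QUERY = [
    ("?p", "type", "Professor"),
    ("?p", "name", "?name"),
    ("?p", "worksIn", "?u"),
    ("?u", "type", "University"),
    ("?u", "name", "UTPA"),
    ("?j", "type", "Journal"),
    ("?j", "title", "Data & …"),
    ("?j", "published", "?a"),
    ("?a", "type", "Article"),
    ("?a", "title", "Semantics …"),
    ("?a", "year", "2009"),
    ("?a", "author", "?p"),
]

names = sorted({b["?name"] for b in match(QUERY, canonicalize(TRIPLES))})
print(names)
```

In this sketch the query finds a match only after the sameAs link has been applied: on the raw triples the article's author is resource5, while the professor is resource1.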
Semantic Web Technologies
Semantic Web Current State
Semantic search/indexing: http://sindice.com/
• Over 664 million Semantic Web documents as of October 2012
• ~400 million Semantic Web documents in 2011
• ~140 million Semantic Web documents in 2010
• ~70 million in 2009
Semantic Web Current State (cont)
Semantic Web datasets:
DBpedia (~2 billion triples)
US Census Data (>1 billion triples)
UniProt (>600 million triples)
BestBuy (>27 million triples)
The Semantic Web can potentially grow to the size of the Web (> 22 billion pages)
Linking Open Data Project (March 2009)
Linking Open Data Project (Sept 2010)
Linking Open Data Project (Sept 2011)
Semantic Web Data Management: Research at UTPA
Semantic Web Data Management: Research at UTPA
Roadmap:
Research Goals and Current Projects
S2ST
ProvBase
Future Directions
Research Goal and Current Projects
Goal: efficient storage and querying of large Semantic Web data sets
Projects:
S2ST: Relational RDF Database Management System (RRDBMS) http://s2stproject.cs.panam.edu/
ProvBase: Semantic Web Database in the Cloud http://provbase.cs.panam.edu/
S2ST Overview
http://s2stproject.cs.panam.edu/
S2ST Definition
S2ST Architecture
S2ST Main Functions
Create logical schema
User specifies a template for a database schema that will store RDF data
Very flexible. Supports the following approaches to schema design:
• Generic
• Schema-aware
• Schema-oblivious
• Data-driven
• User-driven
• Hybrid
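Two of the approaches listed above can be contrasted with a small sketch. It assumes the common reading of the terms in the relational RDF store literature: "schema-oblivious" as one generic triple table, and "schema-aware" as one table per class with a column per property. The table and column names are hypothetical and this is not S2ST's actual schema template.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema-oblivious layout: one generic table holds every triple,
# regardless of which classes or properties the data uses.
conn.execute("CREATE TABLE triple (s TEXT, p TEXT, o TEXT)")

# Schema-aware layout: one table per class, one column per property.
conn.execute(
    "CREATE TABLE professor (id TEXT PRIMARY KEY, name TEXT, works_in TEXT)"
)

triples = [
    ("resource1", "type", "Professor"),
    ("resource1", "name", "Artem Chebotko"),
    ("resource1", "worksIn", "resource2"),
]

# Schema-oblivious storage: each triple becomes one row.
conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", triples)

# Schema-aware storage: triples about one subject are folded into one row.
props = {p: o for s, p, o in triples if s == "resource1"}
conn.execute(
    "INSERT INTO professor VALUES (?, ?, ?)",
    ("resource1", props["name"], props["worksIn"]),
)

oblivious_rows = conn.execute("SELECT COUNT(*) FROM triple").fetchone()[0]
aware_row = conn.execute("SELECT id, name, works_in FROM professor").fetchone()
print(oblivious_rows, aware_row)
```

The trade-off the sketch hints at: the generic table accepts any RDF data without schema changes, while the per-class table answers queries about one class without self-joins but must know the classes and properties in advance.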
S2ST Main Functions (cont)
Schema mapping
Creates physical schema and database schema in an RDBMS.
Data mapping
Maps RDF triples into relational tuples and inserts them into the database
Query mapping
Maps SPARQL queries into SQL that can be evaluated by an RDBMS
Most complex mapping
SPARQL-to-SQL Query Translation
Generic
Reusable
Semantics preserving
Correct
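A minimal sketch of the idea behind SPARQL-to-SQL translation, under strong simplifying assumptions: it handles only a list of triple patterns (no OPTIONAL, FILTER, or UNION) over a single hypothetical table triple(s, p, o), and it is not the S2ST translation algorithm. Each pattern becomes one alias of the table, constants become WHERE filters, and a variable shared between patterns becomes a join condition. The function name `bgp_to_sql` is made up for illustration.

```python
import sqlite3


def bgp_to_sql(select_var, patterns):
    """Translate triple patterns into one SQL query over triple(s, p, o).

    Terms starting with '?' are variables; everything else is a constant.
    Returns the SQL text and the list of parameter values.
    """
    first_seen = {}  # variable -> column reference where it first occurred
    joins, filters, params = [], [], []
    for i, pattern in enumerate(patterns):
        alias = f"t{i}"
        for column, term in zip("spo", pattern):
            ref = f"{alias}.{column}"
            if not term.startswith("?"):
                filters.append(f"{ref} = ?")
                params.append(term)
            elif term in first_seen:
                joins.append(f"{ref} = {first_seen[term]}")
            else:
                first_seen[term] = ref
    tables = ", ".join(f"triple AS t{i}" for i in range(len(patterns)))
    where = " AND ".join(filters + joins) or "1 = 1"
    sql = f"SELECT DISTINCT {first_seen[select_var]} FROM {tables} WHERE {where}"
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triple (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", [
    ("resource1", "type", "Professor"),
    ("resource1", "name", "Artem Chebotko"),
    ("resource1", "worksIn", "resource2"),
    ("resource2", "type", "University"),
    ("resource2", "name", "UTPA"),
])

sql, params = bgp_to_sql("?name", [
    ("?p", "type", "Professor"),
    ("?p", "name", "?name"),
    ("?p", "worksIn", "?u"),
    ("?u", "name", "UTPA"),
])
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Constants are passed as bound parameters rather than spliced into the SQL text, so literal values containing quotes do not break the generated query.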
S2ST Fact Sheet
Next-generation relational RDF store
Relational RDF Database Management System
Supports user-driven schema design as in relational databases
Supports semantics-preserving SPARQL-to-SQL query translation
Supports generic schema, data and query mapping algorithms
Supports ~20 RDBMS backends, including Oracle, DB2, PostgreSQL, MySQL, and SQL Server
S2ST Applications
VIEW
Scientific workflow provenance metadata management
GEO-SEED
Web services RDF data management
Future Directions
Inference support
Query optimization
Data mapping algorithms
Data browsing interface
Distributed data management
Testing and performance evaluation
Data and query visualization
Applications
ProvBase Overview
http://provbase.cs.panam.edu/
ProvBase: Distributed RDF Provenance Database
Based on Hadoop
Hadoop Wins Terabyte Sort Benchmark: One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual general purpose (daytona) terabyte sort benchmark. This is the first time that either a Java or an open source program has won.
http://hadoop.apache.org
ProvBase: Distributed RDF Provenance Database
Based on HBase
HBase is the Hadoop database. It's an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop.
Sample BigTable:
http://hbase.apache.org
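The Bigtable-style data model can be pictured as a sorted map from row key to column to timestamped values. The sketch below is an in-memory stand-in written for illustration only: `TinyTable` is a hypothetical class, not the HBase client API, and keying rows by RDF subject with one column per predicate is an assumption, not ProvBase's actual storage schema.

```python
from collections import defaultdict


class TinyTable:
    """In-memory stand-in for a Bigtable-style table (not the HBase API)."""

    def __init__(self):
        # row key -> "family:qualifier" -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        """Store one version of a cell."""
        self._rows[row][column][timestamp] = value

    def get(self, row, column):
        """Return the most recent value of a cell, or None if it is empty."""
        versions = self._rows.get(row, {}).get(column, {})
        return versions[max(versions)] if versions else None

    def scan(self, start, stop):
        """Yield row keys in sorted order within [start, stop)."""
        for key in sorted(self._rows):
            if start <= key < stop:
                yield key


table = TinyTable()
# One row per RDF subject; predicates live in a column family named "p".
table.put("resource1", "p:name", "Artem Chebotko", timestamp=1)
table.put("resource1", "p:worksIn", "resource2", timestamp=1)
table.put("resource2", "p:name", "UTPA", timestamp=1)
# A later write to the same cell adds a version instead of overwriting.
table.put("resource2", "p:name", "University of Texas – Pan American", timestamp=2)

print(table.get("resource2", "p:name"))
print(list(table.scan("resource1", "resource2")))
```

Keeping row keys sorted is what makes range scans cheap in this model, which is why the choice of row key matters so much when laying out triples.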
ProvBase Architecture
Future Directions
SPARQL optional graph pattern support
Inference support
Query optimization
GUI
Testing and performance evaluation
Data and query visualization
Applications
Other Projects
Relational Algebra Toolkit (RAT) http://rat.cs.panam.edu
The University of Texas Provenance Benchmark (UTPB) http://faculty.utpa.edu/chebotkoa/utpb
Student Research Organizer
k-Nearest Keyword Search in RDF Graphs
Thank You!
Questions?
Artem Chebotko
Department of Computer Science
University of Texas – Pan American
chebotkoa@utpa.edu
http://faculty.utpa.edu/chebotkoa