SW-STORE: A VERTICALLY PARTITIONED DBMS FOR SEMANTIC WEB DATA MANAGEMENT
Surabhi Mithal
Nipun Garg
Daniel J. Abadi, Adam Marcus, Samuel R. Madden, and Kate Hollenbach. 2009. The VLDB Journal.
Group 4: Surabhi Mithal (4282643), Nipun Garg (4282567)
http://www-users.cs.umn.edu/~smithal/
OUTLINE
Introduction to Semantic Web
Motivation
Problem Statement
Challenges
Major Contributions
Related Work
Key Concepts
Assumptions
Validation Methodology
Results
Improvements
INTRODUCTION TO SEMANTIC WEB : AN EXAMPLE
ISBN | Author | Title | Publisher | Year
0006511409X | id_xyz | The Glass Palace | id_qpr | 2000
ID | Name | Homepage
id_xyz | Ghosh, Amitav | http://www.amitavghosh.com
ID | Publisher’s name | City
id_qpr | Harper Collins | London
Source : http://www.w3.org/People/Ivan/CorePresentations/SWTutorial/
Simplified bookstore data (dataset “A”)
EXAMPLE CONT.: GRAPH REPRESENTATION
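[Figure: RDF graph for dataset “A”, listed edge by edge]
http://…isbn/000651409X --a:title--> The Glass Palace
http://…isbn/000651409X --a:year--> 2000
http://…isbn/000651409X --a:author--> (author node)
(author node) --a:name--> Ghosh, Amitav
(author node) --a:homepage--> http://www.amitavghosh.com
http://…isbn/000651409X --a:publisher--> (publisher node)
(publisher node) --a:p_name--> Harper Collins
(publisher node) --a:city--> London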
ANOTHER BOOKSTORE DATA (DATASET “F”)
Row | A | B | C | D
1 | ID | Titre | Traducteur | Original
2 | ISBN 2020286682 | Le Palais des Miroirs | $A12$ | ISBN 0-00-6511409-X
3-5 | (empty)
6 | ID | Auteur
7 | ISBN 0-00-6511409-X | $A11$
8-9 | (empty)
10 | Nom
11 | Ghosh, Amitav
12 | Besse, Christianne
EXAMPLE CONT.: GRAPH REPRESENTATION
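[Figure: RDF graph for dataset “F”, listed edge by edge]
http://…isbn/2020386682 --f:titre--> Le palais des miroirs
http://…isbn/2020386682 --f:original--> http://…isbn/000651409X
http://…isbn/2020386682 --f:traducteur--> (translator node)
(translator node) --f:nom--> Besse, Christianne
http://…isbn/000651409X --f:auteur--> (author node)
(author node) --f:nom--> Ghosh, Amitav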
DATA INTEGRATION ACROSS THE TWO DATASETS : SEMANTIC WEB
[Figure: the RDF graphs for dataset “F” and dataset “A” shown side by side, unchanged.]
DATA INTEGRATION ACROSS THE TWO DATASETS : SEMANTIC WEB
[Figure: the same two graphs side by side. The node http://…isbn/000651409X appears in both and is marked “SAME URI”.]
DATA INTEGRATION ACROSS THE TWO DATASETS : SEMANTIC WEB
[Figure: the merged graph. The two datasets are joined at the shared node http://…isbn/000651409X, which now carries the a:title, a:year, a:author and a:publisher edges from dataset “A” together with the f:auteur edge and the incoming f:original edge from dataset “F”.]
A user of dataset “F” can now ask queries such as: “give me the title of the original”.
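The merge on a shared URI can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the triples are transcribed by hand from the example graphs, the blank-node labels (`_:a_author`, `_:f_auteur`) and the helper names are hypothetical, and the elided URIs are kept exactly as they appear on the slides.

```python
# Toy triples transcribed from the two example graphs.
# Blank-node labels such as "_:a_author" are hypothetical, not from the slides.
BOOK_EN = "http://…isbn/000651409X"   # elided URI kept as shown on the slides
BOOK_FR = "http://…isbn/2020386682"

dataset_a = {
    (BOOK_EN, "a:title", "The Glass Palace"),
    (BOOK_EN, "a:year", "2000"),
    (BOOK_EN, "a:author", "_:a_author"),
    ("_:a_author", "a:name", "Ghosh, Amitav"),
}
dataset_f = {
    (BOOK_FR, "f:titre", "Le palais des miroirs"),
    (BOOK_FR, "f:original", BOOK_EN),
    (BOOK_EN, "f:auteur", "_:f_auteur"),
    ("_:f_auteur", "f:nom", "Ghosh, Amitav"),
}


def objects(triples, subject, prop):
    """All objects o such that (subject, prop, o) is in the triple set."""
    return sorted(o for s, p, o in triples if s == subject and p == prop)


def title_of_original(triples, book):
    """Follow f:original from `book`, then read a:title on the target node."""
    titles = []
    for original in objects(triples, book, "f:original"):
        titles.extend(objects(triples, original, "a:title"))
    return titles


# Merging is a plain set union; the shared URI makes the two graphs connect.
merged = dataset_a | dataset_f
print(title_of_original(merged, BOOK_FR))     # ['The Glass Palace']
print(title_of_original(dataset_f, BOOK_FR))  # [] : dataset "F" alone has no a:title
```

The query only succeeds on the merged set, because the title lives in dataset “A” while the f:original link lives in dataset “F”.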
MOTIVATION
Integration and sharing of data across different applications and organizations.
The Semantic Web's logical data model is the “Resource Description Framework” (RDF).
The Semantic Web has scalability and performance problems that stem from the nature of its data: current data management solutions for RDF scale poorly.
PROBLEM STATEMENT
Input: RDF data in the form of triples <subject, property, object>, e.g. <The Glass Palace, hasAuthor, Amitav Ghosh>.
Output: an efficient storage system for RDF data.
Objective: improve query performance for complex real-world queries.
CHALLENGES
Example query: find all authors of books whose title contains the word “Transaction”.
On a single triples table this requires a 5-way self-join!
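To make the self-join concrete, here is a minimal sketch using Python's built-in `sqlite3` module (not Postgres, which the evaluation uses). The table name `triples`, the property names (`type`, `title`, `hasAuthor`, `name`) and the toy rows are assumptions chosen for illustration.

```python
import sqlite3

# Toy data; subjects, properties and values are illustrative assumptions.
TOY_TRIPLES = [
    ("book1", "type", "book"),
    ("book1", "title", "Transaction Basics"),
    ("book1", "hasAuthor", "auth1"),
    ("auth1", "type", "author"),
    ("auth1", "name", "Author One"),
    ("book2", "type", "book"),
    ("book2", "title", "The Glass Palace"),
    ("book2", "hasAuthor", "auth2"),
    ("auth2", "type", "author"),
    ("auth2", "name", "Ghosh, Amitav"),
]

# The same table appears five times in FROM: one alias per triple pattern.
FIVE_WAY_SELF_JOIN = """
SELECT p5.obj
FROM triples AS p1, triples AS p2, triples AS p3, triples AS p4, triples AS p5
WHERE p1.prop = 'title'     AND p1.obj LIKE ?
  AND p2.prop = 'type'      AND p2.obj = 'book'   AND p2.subj = p1.subj
  AND p4.prop = 'hasAuthor' AND p4.subj = p1.subj
  AND p3.prop = 'type'      AND p3.obj = 'author' AND p3.subj = p4.obj
  AND p5.prop = 'name'      AND p5.subj = p4.obj
ORDER BY p5.obj
"""


def authors_of_books_with_title_word(word):
    """Load the toy triples into an in-memory table and run the self-join."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", TOY_TRIPLES)
    rows = conn.execute(FIVE_WAY_SELF_JOIN, ("%" + word + "%",)).fetchall()
    conn.close()
    return [name for (name,) in rows]


print(authors_of_books_with_title_word("Transaction"))  # ['Author One']
```

Each triple pattern in the query costs one more alias of the same large table, which is why even a simple question turns into a multi-way self-join.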
MAJOR CONTRIBUTIONS AND NOVELTY
Introduces vertical partitioning of RDF data, combined with a column-oriented database, to improve performance and simplify the design.
Evaluates the new and existing techniques on a real-world dataset.
Proposes SW-Store, a new column-oriented database built on this approach.
RELATED WORK: PROPERTY TABLES (HP LABORATORIES, JENA)
Two variants: property-clustered tables and property-class tables.
Approach 1 (property-clustered tables): a data clustering approach.
Approach 2 (property-class tables): creates clusters based on each subject’s type.
Limitations:
  Accuracy of the clustering algorithms.
  NULLs in the data.
  Multivalued attributes.
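The NULL and multivalued-attribute limitations can be seen in a small sketch. This is not Jena's implementation; it is a toy property table built in plain Python, with hypothetical subjects, properties and helper names, and with `None` standing in for NULL.

```python
# Toy triples; all identifiers are illustrative assumptions.
TOY_TRIPLES = [
    ("book1", "title", "The Glass Palace"),
    ("book1", "author", "auth1"),
    ("book1", "author", "auth2"),   # second value for the same property
    ("book2", "title", "Le palais des miroirs"),
    ("cd1", "artist", "artist1"),
]


def build_property_table(triples, columns):
    """One row per subject, one column per property.

    Values that do not fit (an unknown property, or a second value for a
    column that is already filled) are returned separately as leftovers.
    """
    table = {}
    leftover = []
    for subj, prop, obj in triples:
        row = table.setdefault(subj, {c: None for c in columns})
        if prop in row and row[prop] is None:
            row[prop] = obj
        else:
            leftover.append((subj, prop, obj))
    return table, leftover


def count_nulls(table):
    """Number of empty cells in the property table."""
    return sum(value is None for row in table.values() for value in row.values())


table, leftover = build_property_table(TOY_TRIPLES, ["title", "author", "artist"])
print(count_nulls(table))  # 5 of the 9 cells are NULL
print(leftover)            # [('book1', 'author', 'auth2')]
```

Even with three subjects, more than half of the cells are empty, and the second author cannot be stored in the wide row at all.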
SAMPLE DATABASE
Source: - SW-Store: a vertically partitioned DBMS for Semantic Web data management
Too many NULLs
KEY CONCEPTS: VERTICAL PARTITIONING AND COLUMN ORIENTED STORE
Vertically partition the data, then store the partitioned tables in a column-oriented database.
One two-column (subject, object) table per property. Advantages:
  Multivalued attributes are handled effectively (one row per value).
  NULLs are eliminated.
  Fewer unions are needed.
Column-oriented storage. Advantages:
  No bandwidth is wasted, because projection happens before data is pulled into main memory.
  Record headers are stored in separate columns, which reduces tuple width and allows a different compression technique for each column.
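The partitioning step itself is simple enough to sketch. The following Python is an illustration only, not SW-Store code: the function names and toy triples are assumptions, and sorting each table by subject is a choice made for this sketch.

```python
from collections import defaultdict

# Toy data; identifiers are illustrative assumptions.
TOY_TRIPLES = [
    ("book1", "type", "book"),
    ("book1", "title", "Transaction Basics"),
    ("book1", "hasAuthor", "auth1"),
    ("auth1", "name", "Author One"),
    ("book2", "type", "book"),
    ("book2", "title", "The Glass Palace"),
    ("book2", "hasAuthor", "auth2"),
    ("book2", "hasAuthor", "auth3"),   # multivalued: simply a second row
    ("auth2", "name", "Ghosh, Amitav"),
]


def vertically_partition(triples):
    """Split triples into one (subject, object) table per property."""
    tables = defaultdict(list)
    for subj, prop, obj in triples:
        tables[prop].append((subj, obj))
    # Sorted by subject in this sketch so rows for one subject sit together.
    return {prop: sorted(rows) for prop, rows in tables.items()}


def authors_of_books_with_title_word(tables, word):
    """Answer the example query by touching only the four tables it needs."""
    matching = {s for s, title in tables.get("title", []) if word in title}
    books = {s for s, o in tables.get("type", []) if o == "book"}
    author_ids = {o for s, o in tables.get("hasAuthor", []) if s in matching & books}
    return sorted(o for s, o in tables.get("name", []) if s in author_ids)


tables = vertically_partition(TOY_TRIPLES)
print(tables["hasAuthor"])
# [('book1', 'auth1'), ('book2', 'auth2'), ('book2', 'auth3')]
print(authors_of_books_with_title_word(tables, "Transaction"))  # ['Author One']
```

No table contains an empty cell, and the book with two authors needs no special handling: it just contributes two rows to the `hasAuthor` table.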
KEY CONCEPTS: SW-STORE
SW-Store is a column-oriented DBMS optimized for storing RDF data. Its features include:
  Single-column tables for subjects.
  A representation for sparse data.
  Overflow tables.
ASSUMPTIONS
Postgres is assumed to be the best available row-oriented RDBMS for the comparison because it handles NULLs effectively.
Queries that do not restrict on property values are assumed to be very rare in RDF applications.
The RDF store is assumed to receive only a moderate number of inserts and updates.
Critique of the limited insert/update assumption: if the overflow tables fill rapidly, the batch operation that updates the column-oriented store runs more often, degrading overall performance.
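The critique can be illustrated with a toy model of an overflow table. This is a sketch under loud assumptions, not SW-Store's mechanism: the class name, the fixed `merge_threshold` trigger and the full re-sort on merge are all invented here for illustration.

```python
class OverflowBackedStore:
    """Toy store: inserts land in an overflow list and are merged in batches."""

    def __init__(self, merge_threshold):
        self.partitions = {}          # property -> sorted list of (subject, object)
        self.overflow = []            # recent inserts, not yet merged
        self.merge_threshold = merge_threshold  # hypothetical trigger
        self.merge_count = 0          # how many batch merges have run

    def insert(self, subj, prop, obj):
        self.overflow.append((subj, prop, obj))
        if len(self.overflow) >= self.merge_threshold:
            self.merge()

    def merge(self):
        """Batch operation: move overflow rows into the per-property tables."""
        for subj, prop, obj in self.overflow:
            self.partitions.setdefault(prop, []).append((subj, obj))
        for rows in self.partitions.values():
            rows.sort()
        self.overflow.clear()
        self.merge_count += 1

    def lookup(self, prop):
        """Queries must consult both the partition and the overflow table."""
        pending = [(s, o) for s, p, o in self.overflow if p == prop]
        return sorted(self.partitions.get(prop, []) + pending)


store = OverflowBackedStore(merge_threshold=3)
for i in range(7):
    store.insert(f"s{i}", "title", f"t{i}")
print(store.merge_count)            # 2 merges after 7 inserts
print(len(store.overflow))          # 1 row still waiting in overflow
print(len(store.lookup("title")))   # 7 rows visible to queries
```

With a fixed threshold the number of merges grows linearly with the insert rate, which is the cost the critique points at.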
VALIDATION METHODOLOGY
Barton Libraries dataset provided by the Simile Project at MIT (http://simile.mit.edu/rdf-test-data/barton).
The benchmark is a set of 7 queries based on a browsing session in Longwell, a UI built by the Simile group for querying the library dataset. The queries are executed on:
  A triple store (a single subject, property, object table on Postgres, with no further improvements).
  Property tables (on Postgres).
  Vertically partitioned data in a row-oriented store (Postgres).
  Vertically partitioned data in a column-oriented store (C-Store).
VALIDATION METHODOLOGY
Strengths:
  Real-world data and query scenarios.
  Comparison of the existing techniques with the proposed technique.
Weaknesses:
  The benchmark avoids queries with unrestricted properties, a problem that is particularly pronounced for vertically partitioned schemes.
  Results for property tables depend on the accuracy of the clustering.
  Performance may differ when different underlying databases are used.
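Why unrestricted properties hurt a vertically partitioned layout can be shown with a small sketch. The per-property tables and the function name below are hypothetical; the point is only that a query which does not name a property has to visit every table.

```python
# Hypothetical per-property tables of (subject, object) rows.
TABLES = {
    "type": [("auth1", "author"), ("book1", "book")],
    "title": [("book1", "Transaction Basics")],
    "hasAuthor": [("book1", "auth1")],
    "name": [("auth1", "Author One")],
}


def describe_subject(tables, subject):
    """Return every (property, object) pair known for one subject.

    The property is not restricted, so the result is a union over all
    per-property tables; the second return value counts tables scanned.
    """
    result = []
    tables_scanned = 0
    for prop, rows in tables.items():
        tables_scanned += 1
        result.extend((prop, o) for s, o in rows if s == subject)
    return sorted(result), tables_scanned


pairs, scanned = describe_subject(TABLES, "book1")
print(pairs)    # [('hasAuthor', 'auth1'), ('title', 'Transaction Basics'), ('type', 'book')]
print(scanned)  # 4: every table was visited
```

On a single triples table the same question is one selection on the subject column, so a benchmark that contains few such queries favours the partitioned design.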
RESULTS
The results show that the proposed storage scheme outperforms the existing methods in terms of query time.
IMPROVEMENTS – SPATIAL PERSPECTIVE
Schema design: queries run against the vertically partitioned tables as well as the overflow tables. Because spatial data is heavy, a spatial index such as an R*-tree or a grid should be added to make these queries faster.
Restrictive nature: spatial queries are not restricted to specific “properties”, yet such a restriction is an important assumption of the paper. E.g. landmarks.
Tables should be partitioned in a better way than just one property per table, e.g. by grouping similar properties together based on domain knowledge.