RDF-3X: a RISC-style Engine for RDF
Presented by Thomas Neumann, Gerhard Weikum
Max-Planck-Institut für Informatik, Saarbrücken, Germany
Session 19: System Centric Optimization, VLDB, 2008
2009-02-05
Summarized by Jaeseok Myung
Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea
Copyright 2009 by CEBT
Overview
Goal
Building a new type of triple store => RDF-3X
Comparing RDF-3X with traditional triple stores
In this presentation,
The focus is on the physical storage design, which shaped the entire implementation of the system
Center for E-Business Technology
Introduction
RDF: Resource Description Framework
Conceptually a labeled graph
In RDF, all data items are represented in the form of
– (subject, predicate, object), aka (subject, property, value)
RDF data can be seen as a (potentially huge) set of triples
S    P    O
S1   P1   O1
S1   P2   O2
…    …    …
2009 IDS Lab. Winter Seminar
Introduction
SPARQL: SPARQL Protocol and RDF Query Language
The official W3C standard for querying RDF stores
Example
– Retrieve the titles of all movies with Johnny Depp
SPARQL queries are pattern matching queries on triples that are stored in the RDF storage
S    P    O
S1   P1   O1
S1   P2   O2
…    …    …
Each pattern consists of S, P, and O, and each of these is either a variable or a literal
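As a minimal sketch of what "pattern matching on triples" means, the Python below evaluates a single triple pattern against a small in-memory set of triples. The movie data, the predicate names, the `?` prefix for variables, and the function names are illustrative assumptions, not part of SPARQL or RDF-3X.

```python
# Toy triple store: a set of (subject, predicate, object) tuples.
# Titles and predicate names are made up purely for illustration.
TRIPLES = {
    ("movie1", "title", "Movie One"),
    ("movie1", "hasActor", "Johnny Depp"),
    ("movie2", "title", "Movie Two"),
    ("movie2", "hasActor", "Someone Else"),
}


def is_var(term):
    """A term starting with '?' is treated as a variable (assumed convention)."""
    return term.startswith("?")


def match_pattern(pattern, triples):
    """Return one variable binding (a dict) per triple that matches the pattern."""
    results = []
    for triple in triples:
        binding = {}
        matched = True
        for p_term, t_term in zip(pattern, triple):
            if is_var(p_term):
                # A variable used twice must bind to the same value both times.
                if binding.setdefault(p_term, t_term) != t_term:
                    matched = False
                    break
            elif p_term != t_term:
                # A literal in the pattern must equal the triple's value.
                matched = False
                break
        if matched:
            results.append(binding)
    return results


depp_movies = match_pattern(("?m", "hasActor", "Johnny Depp"), TRIPLES)
```

A full query such as the Johnny Depp example would combine several such patterns by joining their bindings on shared variables.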
Physical Designs for RDF Storage (1/4)
Giant Triples Table
SELECT ?title
WHERE {
  ?book <title> ?title .
  ?book <author> <Fox, Joe> .
  ?book <copyright> <2001>
}
– Joins: the three patterns must be combined by two self-joins on the triples table
– Entire table scan!
– Redundancy!
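The cost of the giant triples table can be sketched in a few lines of Python. The example below evaluates the query above by scanning one list of triples three times and joining the results on the book; the sample rows and the helper names `scan` and `titles_by_author_and_year` are hypothetical.

```python
# One table holds every triple; the rows below are made-up sample data.
TRIPLES = [
    ("book1", "title", "Title A"),
    ("book1", "author", "Fox, Joe"),
    ("book1", "copyright", "2001"),
    ("book2", "title", "Title B"),
    ("book2", "author", "Fox, Joe"),
    ("book2", "copyright", "1999"),
]


def scan(triples, predicate, obj=None):
    """Full scan of the triples table: yield (subject, object) of matching rows."""
    for s, p, o in triples:
        if p == predicate and (obj is None or o == obj):
            yield s, o


def titles_by_author_and_year(triples, author, year):
    """Answer the three-pattern query with three scans and two self-joins."""
    authored = {s for s, _ in scan(triples, "author", author)}
    copyrighted = {s for s, _ in scan(triples, "copyright", year)}
    books = authored & copyrighted  # first self-join on ?book
    # Second self-join on ?book: keep titles of the surviving books.
    return sorted(t for s, t in scan(triples, "title") if s in books)


result = titles_by_author_and_year(TRIPLES, "Fox, Joe", "2001")
```

Every pattern touches the whole table, and the subject and property strings are repeated in each row, which is the redundancy the slide points at.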
Physical Designs for RDF Storage (2/4)
Clustered Property Table
Contains clusters of properties that tend to be defined together
Physical Designs for RDF Storage (3/4)
Property-Class Table
Exploits the type property of subjects to cluster similar sets of subjects together in the same table
Unlike the clustered property table, a property may exist in multiple property-class tables
Values of the type property
Physical Designs for RDF Storage (4/4)
Vertically Partitioned Table
The giant table is rewritten into n two-column tables, where n is the number of unique properties in the data
We don't have to
– maintain null values
– choose a clustering algorithm
[Figure labels: subject, property, object]
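A minimal sketch of vertical partitioning, assuming in-memory Python tuples: each distinct property gets its own two-column (subject, object) table. Sorting each table by subject is an assumption made here for readability, and the sample rows are made up.

```python
from collections import defaultdict


def vertically_partition(triples):
    """Split (s, p, o) triples into one two-column table per property."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    # Sorting by subject is an assumption of this sketch, not stated on the slide.
    return {p: sorted(rows) for p, rows in tables.items()}


# Made-up sample data: book2 has no author.
TRIPLES = [
    ("book1", "title", "Title A"),
    ("book1", "author", "Fox, Joe"),
    ("book2", "title", "Title B"),
]

tables = vertically_partition(TRIPLES)
```

Because book2 has no author, it simply has no row in the `author` table, so no null value needs to be stored.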
RDF-3X
Technical Challenges
The diversity of predicate names poses a major problem for the physical database design
– Joins, redundancy, ...
RDF-3X (RDF Triple eXpress)
A novel architecture for RDF indexing and querying, eliminating the need for physical database design
Mapping Dictionary
All literals are replaced by unique IDs using a mapping dictionary
RDF-3X is based on a single “giant triples table”, but the mapping dictionary compresses the triple store
– Reduced redundancy, saving a lot of physical space
S          P          O
object214  hasColor   blue
object214  belongsTo  object352
…          …          …

S  P  O
0  1  2
0  3  4
…  …  …

ID  Value
0   object214
1   hasColor
…   …
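The encoding step can be sketched as follows. This is a minimal illustration that assigns IDs in order of first appearance, which reproduces the numbering in the tables above; how RDF-3X actually assigns IDs is not stated here, and the function name `encode` is hypothetical.

```python
def encode(triples):
    """Replace every string by an integer ID, assigned in order of first appearance."""
    ids = {}
    encoded = []
    for triple in triples:
        # setdefault returns the existing ID, or stores and returns the next free one.
        encoded.append(tuple(ids.setdefault(term, len(ids)) for term in triple))
    return encoded, ids


TRIPLES = [
    ("object214", "hasColor", "blue"),
    ("object214", "belongsTo", "object352"),
]

encoded, ids = encode(TRIPLES)

# Reverse mapping, used to turn query results back into strings.
values = {i: term for term, i in ids.items()}
decoded = [tuple(values[i] for i in triple) for triple in encoded]
```

The repeated string `object214` is stored once in the dictionary and appears in the triples only as the integer 0.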
Clustered B+-Tree
Store everything in a clustered B+-Tree
Triples are sorted in lexicographical order
– This allows SPARQL patterns to be converted into range scans
We don’t have to scan the entire table
[Figure: clustered B+-tree; inner node "002 …", leaf entries "000 001 002 003"]
S  P  O
0  1  2
0  3  4
…  …  …
Actually, we don’t need this table!
<Mapping Dictionary>
ID  Value
0   object214
1   hasColor
…   …
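To illustrate why sorted triples turn a pattern into a range scan, here is a small sketch in which a sorted Python list stands in for the leaf level of the clustered B+-tree. That substitution, the sample IDs, and the name `range_scan` are assumptions for illustration only.

```python
from bisect import bisect_left

# ID triples in lexicographic (S, P, O) order; a sorted list stands in
# for the leaves of a clustered B+-tree.
SPO = sorted([(0, 1, 2), (0, 3, 4), (0, 3, 5), (1, 1, 2), (2, 3, 0)])


def range_scan(index, prefix):
    """Return all triples in the sorted index that start with the given ID prefix."""
    # Binary search finds the first triple >= prefix without scanning the table.
    start = bisect_left(index, prefix)
    hits = []
    for triple in index[start:]:
        if triple[:len(prefix)] != prefix:
            break  # sorted order: once the prefix stops matching, we are done
        hits.append(triple)
    return hits


# Pattern <S=0> - <P=3> - ?var becomes a scan over the prefix (0, 3).
matches = range_scan(SPO, (0, 3))
```

The scan reads only the contiguous run of matching triples, so its cost depends on the result size rather than on the table size.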
Exhaustive Indexing
So far we relied on the fact that the variables form a suffix of the pattern
<S> - <P> - ?var , <S> - ?var1 - ?var2
But what about ?var - <P> - <O>, where the variable is a prefix?
– To answer every possible pattern, with variables in any position of the pattern triple, by a single index scan, we maintain all six possible permutations of S, P, and O in six separate indexes
– (SPO, SOP, OSP, OPS, PSO, POS)
– We can afford this level of redundancy
– On all experimental datasets, the total size of all indexes together is less than the size of the original data
<POS>
?var - <P> - <O>
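A sketch of how the six permutations make every pattern a prefix lookup: the code below builds all six orderings as sorted lists and picks the one that puts the bound positions first. Representing variables as `None`, using sorted lists instead of B+-trees, and all function names are assumptions of this sketch.

```python
from bisect import bisect_left
from itertools import permutations

ORDER = "SPO"


def build_indexes(triples):
    """Build all six orderings; each index is a sorted list of permuted triples."""
    indexes = {}
    for perm in permutations(ORDER):
        name = "".join(perm)
        positions = [ORDER.index(c) for c in name]
        indexes[name] = sorted(tuple(t[i] for i in positions) for t in triples)
    return indexes


def choose_index(pattern):
    """Pick the ordering with bound positions first, so variables form a suffix."""
    bound = [c for c, term in zip(ORDER, pattern) if term is not None]
    free = [c for c, term in zip(ORDER, pattern) if term is None]
    return "".join(bound + free)


def lookup(indexes, pattern):
    """Answer one pattern (None marks a variable) with a single prefix scan."""
    name = choose_index(pattern)
    prefix = tuple(pattern[ORDER.index(c)] for c in name
                   if pattern[ORDER.index(c)] is not None)
    index = indexes[name]
    hits = []
    for row in index[bisect_left(index, prefix):]:
        if row[:len(prefix)] != prefix:
            break
        # Reorder the permuted row back to (S, P, O).
        hits.append(tuple(row[name.index(c)] for c in ORDER))
    return hits


TRIPLES = [(0, 1, 2), (0, 3, 4), (5, 1, 2)]
INDEXES = build_indexes(TRIPLES)

# ?var - <P=1> - <O=2> is served by the POS ordering.
chosen = choose_index((None, 1, 2))
answers = lookup(INDEXES, (None, 1, 2))
```

Whichever positions are bound, one of the six orderings always has them as a prefix, which is why a single scan suffices.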
Moreover, …
Aggregated Indices
Sometimes we don’t need the full triple
– Is there a connection between obj4 and obj13?
– How many authors does object14 have?
Therefore, maintain aggregated indexes with (value1, value2, count)
– (value1, value2) => (SP, PS, SO, OS, PO, OP)
– We can again use clustered B+-trees
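The aggregated indexes can be sketched as grouped counts over two of the three columns. The example uses plain strings rather than dictionary IDs for readability; the sample triples, predicate names, and the function name `aggregated_index` are made up for illustration.

```python
from collections import Counter


def aggregated_index(triples, first, second):
    """Build sorted (value1, value2, count) entries, e.g. first='S', second='P'."""
    i, j = "SPO".index(first), "SPO".index(second)
    counts = Counter((t[i], t[j]) for t in triples)
    return sorted((v1, v2, n) for (v1, v2), n in counts.items())


# Made-up sample data.
TRIPLES = [
    ("object14", "hasAuthor", "authorA"),
    ("object14", "hasAuthor", "authorB"),
    ("obj4", "linksTo", "obj13"),
]

# "How many authors does object14 have?" is one entry in the SP index.
sp_index = aggregated_index(TRIPLES, "S", "P")

# "Is there a connection between obj4 and obj13?" is one entry in the SO index.
so_index = aggregated_index(TRIPLES, "S", "O")
```

Both questions are answered from a single index entry, without reading any full triple.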
Other Features
Join ordering
Selectivity estimation
…
An Experimental Setup
Setup
2GHz dual core, 2GB RAM, 30MB/s disk, Linux
Competitors
MonetDB
– column-store-based (vertically partitioned) approach
– Presented at VLDB 2007 by Abadi et al.
PostgreSQL
– Triple store with SPO, POS, PSO indexes, similar to Sesame
Other approaches performed much worse
– Jena2, Yars2 (DERI), …
Datasets
Barton: library data, 51 million triples (4.1 GB)
Yago: Wikipedia-based ontology, 40 million triples (3.1 GB)
LibraryThing (partial crawl): users tag books, 30 million triples (1.8 GB)
Benchmark queries (7 or 8 per dataset): see the appendix
DB Load Time & DB Size
DB Load Time (min.)
            Barton  Yago  LibThing
RDF-3X      13      25    20
MonetDB     11      21    4
PostgreSQL  30      25    20

DB Size (GB)
                                        Barton  Yago  LibThing
RDF-3X                                  2.8     2.7   1.6
MonetDB                                 1.6     1.1   0.7
MonetDB (after running the benchmark)   2.0     2.4   6.9
PostgreSQL                              8.7     7.5   5.7

[Slide annotations: "Good", "Bad!"]
Query Run-times
Average run-times in seconds: warm cache (cold cache)
            Barton       Yago        LibThing
RDF-3X      0.4 (5.9)    0.04 (0.7)  0.13 (0.89)
MonetDB     4.8 (26.4)   54.6 (78.2) 4.39 (8.16)
PostgreSQL  64.3 (167.8) 0.56 (10.6) 30.4 (93.9)
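To put the run-times in relation to each other, the short sketch below computes how many times slower each competitor is than RDF-3X with a warm cache. The numbers are copied from the table above; the helper name `slowdown` is hypothetical, and the ratios are simple arithmetic on the reported averages rather than results reported by the authors.

```python
# Warm-cache averages in seconds, copied from the run-time table.
WARM = {
    "RDF-3X":     {"Barton": 0.4,  "Yago": 0.04, "LibThing": 0.13},
    "MonetDB":    {"Barton": 4.8,  "Yago": 54.6, "LibThing": 4.39},
    "PostgreSQL": {"Barton": 64.3, "Yago": 0.56, "LibThing": 30.4},
}


def slowdown(times, system, reference="RDF-3X"):
    """Ratio system / reference per dataset, rounded to one decimal place."""
    return {
        dataset: round(times[system][dataset] / times[reference][dataset], 1)
        for dataset in times[reference]
    }


monetdb_vs_rdf3x = slowdown(WARM, "MonetDB")
postgres_vs_rdf3x = slowdown(WARM, "PostgreSQL")
```

For example, on Barton the MonetDB average of 4.8 s divided by the RDF-3X average of 0.4 s gives a factor of 12.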
Conclusion
RDF-3X (RDF Triple eXpress) is a fast and flexible RDF/SPARQL engine
Exhaustive but very space-efficient triple indexes
Generic storage that avoids physical design tuning
Fast runtime system; query optimization has a huge impact
RDF-3X is freely available
http://www.mpi-inf.mpg.de/~neumann/rdf3x
Paper Evaluation
Pros
Good idea
Introduces and solves optimization issues
Implementation
My Comments
Real examples about optimization issues
RISC-style?
– Most operators merely process integer-encoded IDs, consume and produce streams of ID tuples, compare IDs, etc. ??
Insert, update, and delete?
Namespace