RDF-3X: a RISC-style Engine for RDF

Presented by Thomas Neumann, Gerhard Weikum

Max-Planck-Institut für Informatik, Saarbrücken, Germany

Session 19: System Centric Optimization, VLDB, 2008

2009-02-05

Summarized by Jaeseok Myung

Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

Copyright 2009 by CEBT

Overview

Goal

Build a new type of triple store: RDF-3X

Compare RDF-3X with traditional triple stores

In this presentation

Focus on the physical storage design, which shaped the implementation of the entire system

Center for E-Business Technology

Introduction

RDF: Resource Description Framework

Conceptually a labeled graph

In RDF, all data items are represented in the form of

– (subject, predicate, object), aka (subject, property, value)

RDF data can be seen as a (potentially huge) set of triples

S   P   O
S1  P1  O1
S1  P2  O2
…   …   …

2009 IDS Lab. Winter Seminar – 3/22

Introduction

SPARQL: SPARQL Protocol and RDF Query Language

The official W3C standard for querying RDF stores

Example

– Retrieve the titles of all movies with Johnny Depp

SPARQL queries are pattern matching queries on triples that are stored in the RDF storage

S   P   O
S1  P1  O1
S1  P2  O2
…   …   …

Each pattern consists of S, P, O, and each of these is either a variable or a literal
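The pattern-matching idea above can be sketched in a few lines of Python. This is only an illustration, not RDF-3X code: the function name `match_pattern`, the `?name` convention for variables, and the sample triples (taken from the S1/P1/O1 table) are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]

# Sample data mirroring the S/P/O table on the slide.
TRIPLES = [("S1", "P1", "O1"), ("S1", "P2", "O2")]


def is_variable(term: str) -> bool:
    """Assumption for this sketch: variables are written as '?name'."""
    return term.startswith("?")


def match_pattern(pattern: Triple, triples: Iterable[Triple]) -> Iterator[Dict[str, str]]:
    """Yield one variable binding per triple that matches the pattern."""
    for triple in triples:
        binding: Dict[str, str] = {}
        matched = True
        for term, value in zip(pattern, triple):
            if is_variable(term):
                # A variable that occurs twice must bind to the same value.
                if binding.setdefault(term, value) != value:
                    matched = False
                    break
            elif term != value:
                # A constant must match the stored value exactly.
                matched = False
                break
        if matched:
            yield binding
```

For example, `list(match_pattern(("S1", "?p", "?o"), TRIPLES))` produces one binding for each of the two sample triples.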

Physical Designs for RDF Storage (1/4)

Giant Triples Table

SELECT ?title
WHERE {
  ?book <title> ?title .
  ?book <author> <Fox, Joe> .
  ?book <copyright> <2001>
}

Join! Join!

Entire Table Scan!

Redundancy!
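To see where the "Join!" and "Entire Table Scan!" callouts come from, here is a rough Python sketch of evaluating the query above against one giant triples table. The rows and the helper names `scan` and `titles_by` are invented for illustration; a relational system would express the same thing as self-joins of the triples table.

```python
# Hypothetical contents of the giant triples table (subject, property, object).
TRIPLES = [
    ("book1", "title", "XYZ"),
    ("book1", "author", "Fox, Joe"),
    ("book1", "copyright", "2001"),
    ("book2", "title", "ABC"),
    ("book2", "author", "Fox, Joe"),
    ("book2", "copyright", "1999"),
]


def scan(prop, obj=None):
    """Full table scan: every row is inspected for each triple pattern."""
    return [(s, o) for s, p, o in TRIPLES
            if p == prop and (obj is None or o == obj)]


def titles_by(author, year):
    """Three scans of the same table, combined by two joins on ?book."""
    titles = scan("title")
    by_author = {s for s, _ in scan("author", author)}
    in_year = {s for s, _ in scan("copyright", year)}
    return [title for s, title in titles if s in by_author and s in in_year]
```

With the sample rows, `titles_by("Fox, Joe", "2001")` returns only the title of `book1`. Each pattern costs a pass over the whole table, and the subject `book1` is stored once per property, which is the redundancy the slide points at.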

Physical Designs for RDF Storage (2/4)

Clustered Property Table

Contains clusters of properties that tend to be defined together


Physical Designs for RDF Storage (3/4)

Property-Class Table

Exploits the type property of subjects to cluster similar sets of subjects together in the same table

Unlike in the clustered property table approach, a property may exist in multiple property-class tables

Values of the type property

Physical Designs for RDF Storage (4/4)

Vertically Partitioned Table

The giant table is rewritten into n two-column tables, where n is the number of unique properties in the data

We don’t have to

– Maintain null values

– Run a clustering algorithm to group properties

Figure labels: subject, property, object
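The rewrite into n two-column tables can be sketched as follows. This is a minimal Python illustration with invented sample rows; `vertically_partition` is a hypothetical helper name, and a real column store would of course keep these tables on disk rather than in a dictionary.

```python
from collections import defaultdict


def vertically_partition(triples):
    """Split one (subject, property, object) table into one
    (subject, object) table per distinct property."""
    tables = defaultdict(list)
    for subject, prop, obj in triples:
        tables[prop].append((subject, obj))
    return dict(tables)


# Hypothetical sample rows.
triples = [
    ("book1", "title", "XYZ"),
    ("book1", "author", "Fox, Joe"),
    ("book2", "title", "ABC"),
]
tables = vertically_partition(triples)
```

Here `tables` has two entries, one per unique property. No nulls appear because a subject without a given property simply has no row in that property's table.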

RDF-3X

Technical Challenges

The diversity of predicate names poses a major problem for the physical database design

– Joins, redundancy, ...

RDF-3X (RDF Triple eXpress)

A novel architecture for RDF indexing and querying, eliminating the need for physical database design


Mapping Dictionary

Replace all literals with unique IDs using a mapping dictionary

RDF-3X is based on a single “giant triples table”, but

The mapping dictionary compresses the triple store

– Reduced redundancy, saving a lot of physical space

S          P          O
object214  hasColor   blue
object214  belongsTo  object352
…          …          …

S  P  O
0  1  2
0  3  4
…  …  …

ID  Value
0   object214
1   hasColor
…   …
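A minimal sketch of such a mapping dictionary in Python, using the example rows from the slide. The class name and the policy of assigning IDs in first-seen order are assumptions for illustration; the slides do not say how RDF-3X allocates IDs.

```python
class MappingDictionary:
    """Bidirectional mapping between string values and integer IDs."""

    def __init__(self):
        self._ids = {}      # value -> ID
        self._values = []   # ID -> value

    def encode(self, value):
        """Return the ID for a value, assigning the next free ID if new."""
        if value not in self._ids:
            self._ids[value] = len(self._values)
            self._values.append(value)
        return self._ids[value]

    def decode(self, id_):
        """Return the original value for an ID."""
        return self._values[id_]


dictionary = MappingDictionary()
raw = [
    ("object214", "hasColor", "blue"),
    ("object214", "belongsTo", "object352"),
]
encoded = [tuple(dictionary.encode(term) for term in triple) for triple in raw]
```

With first-seen ID assignment, `encoded` reproduces the integer table on the slide, and a repeated string such as `object214` is stored only once in the dictionary.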

Clustered B+-Tree

Store everything in a clustered B+-Tree

Triples are sorted in lexicographical order

– This allows SPARQL patterns to be converted into range scans

We don’t have to scan the entire table

[Figure: clustered B+-tree with inner node "002 …" and leaf entries 000, 001, 002, 003]

S  P  O
0  1  2
0  3  4
…  …  …

Actually, we don’t need this table!

<Mapping Dictionary>
ID  Value
0   object214
1   hasColor
…   …
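The conversion of a pattern into a range scan can be illustrated with a sorted Python list standing in for the leaf level of the clustered B+-tree. This is a sketch under that simplifying assumption; `range_scan` and the sample ID triples are hypothetical, and a real B+-tree would locate the start position through its inner nodes rather than by binary search over a list.

```python
from bisect import bisect_left


def range_scan(sorted_triples, prefix):
    """Return all triples that start with the given tuple of constants.

    The triples are sorted lexicographically, so all matches are
    contiguous: seek to the first candidate, then read until the
    prefix no longer matches.
    """
    start = bisect_left(sorted_triples, prefix)
    results = []
    for triple in sorted_triples[start:]:
        if triple[:len(prefix)] != prefix:
            break
        results.append(triple)
    return results


# ID-encoded triples in (S, P, O) order, kept sorted.
SPO = sorted([(0, 1, 2), (0, 3, 4), (5, 1, 2)])
```

The pattern `<S> - ?var1 - ?var2` with S = 0 becomes `range_scan(SPO, (0,))`, and `<S> - <P> - ?var` with S = 0, P = 3 becomes `range_scan(SPO, (0, 3))`. Only the matching range is read, not the whole table.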

Exhaustive Indexing

So far we relied on the variables forming a suffix of the pattern

<S> - <P> - ?var , <S> - ?var1 - ?var2

But what about ?var - <P> - <O>?

– To guarantee that every possible pattern, with variables in any position, can be answered by a single index scan, RDF-3X maintains all six permutations of S, P, and O in six separate indexes

– (SPO, SOP, OSP, OPS, PSO, POS)

– This level of redundancy is affordable

– On all experimental datasets, the total size of all indexes together is less than the original data

<POS>: ?var - <P> - <O>
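A small Python sketch of the six-permutation idea, again with sorted lists standing in for clustered B+-trees. The names `build_indexes`, `choose_index`, and `lookup`, and the use of `None` for a variable, are assumptions for this example; the rule "constants first, variables last" is one simple way to pick an index whose sort order puts the variables in the suffix.

```python
from bisect import bisect_left
from itertools import permutations

POSITION = {"S": 0, "P": 1, "O": 2}


def build_indexes(triples):
    """Build one sorted copy of the triples per permutation of S, P, O."""
    indexes = {}
    for order in permutations("SPO"):
        name = "".join(order)
        indexes[name] = sorted(
            tuple(t[POSITION[c]] for c in order) for t in triples
        )
    return indexes


def choose_index(pattern):
    """Pick an ordering with constants first; None marks a variable."""
    bound = [c for c, v in zip("SPO", pattern) if v is not None]
    free = [c for c, v in zip("SPO", pattern) if v is None]
    return "".join(bound + free)


def lookup(indexes, pattern):
    """Answer a pattern with a single range scan on the chosen index."""
    name = choose_index(pattern)
    prefix = tuple(pattern[POSITION[c]] for c in name
                   if pattern[POSITION[c]] is not None)
    index = indexes[name]
    results = []
    for entry in index[bisect_left(index, prefix):]:
        if entry[:len(prefix)] != prefix:
            break
        # Reorder the entry back into (S, P, O) before returning it.
        results.append(tuple(entry[name.index(c)] for c in "SPO"))
    return results


indexes = build_indexes([(0, 1, 2), (0, 3, 4), (5, 1, 2)])
```

For the pattern `?var - <P> - <O>` with P = 1 and O = 2, `choose_index((None, 1, 2))` selects the POS index, and `lookup(indexes, (None, 1, 2))` returns the two triples with that predicate and object.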

Moreover, …

Aggregated Indexes

Sometimes we don’t need the full triple

– Is there a connection between obj4 and obj13?

– How many authors does object14 have?

Therefore, RDF-3X maintains aggregated indexes of the form (value1, value2, count)

– (value1, value2) is one of (SP, PS, SO, OS, PO, OP)

– These can also be stored in clustered B+-trees
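As a rough illustration of an aggregated index, the following Python sketch collapses full triples into sorted (value1, value2, count) entries. The sample triples and the helper names `aggregated_index` and `count` are invented for the example; strings are used instead of dictionary IDs only to keep the output readable.

```python
from bisect import bisect_left
from collections import Counter

POSITION = {"S": 0, "P": 1, "O": 2}


def aggregated_index(triples, first, second):
    """Build sorted (value1, value2, count) entries for two positions,
    e.g. first="S", second="P" gives the SP aggregated index."""
    counts = Counter(
        (t[POSITION[first]], t[POSITION[second]]) for t in triples
    )
    return sorted((a, b, n) for (a, b), n in counts.items())


def count(index, a, b):
    """Look up the count for a pair, or 0 if the pair never occurs."""
    i = bisect_left(index, (a, b))
    if i < len(index) and index[i][:2] == (a, b):
        return index[i][2]
    return 0


# Hypothetical sample triples.
triples = [
    ("object14", "author", "alice"),
    ("object14", "author", "bob"),
    ("object14", "title", "T"),
    ("obj4", "cites", "obj13"),
]
sp = aggregated_index(triples, "S", "P")
so = aggregated_index(triples, "S", "O")
```

The two questions on the slide then reduce to pair lookups: `count(sp, "object14", "author")` gives the number of authors, and `count(so, "obj4", "obj13") > 0` tells whether obj4 and obj13 are connected, without reading any full triple.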

Other Features

Join ordering

Selectivity estimation


An Experimental Setup

Setup

2GHz dual core, 2GB RAM, 30MB/s disk, Linux

Competitors

MonetDB

– column-store-based (vertically partitioned) approach

– Presented at VLDB 2007 by Abadi et al.

PostgreSQL

– Triple store with SPO, POS, PSO indexes, similar to Sesame

Other approaches performed much worse

– Jena2, Yars2 (DERI), …

Datasets

Barton, library data, 51 million triples (4.1 GB)

Yago, Wikipedia-based ontology, 40 million triples (3.1 GB)

LibraryThing (partial crawl), users tag books, 30 million triples (1.8 GB)

Benchmark queries (7 or 8 per dataset): see the appendix


DB Load Time & DB Size

DB Load Time (min.)

            Barton  Yago  LibThing
RDF-3X      13      25    20
MonetDB     11      21    4
PostgreSQL  30      25    20

DB Size (GB)

            Barton  Yago  LibThing
RDF-3X      2.8     2.7   1.6
MonetDB     1.6     1.1   0.7
PostgreSQL  8.7     7.5   5.7

After running the benchmark: 2.0 2.4 6.9

Slide annotations: Good, Bad!

Query Run-times

Average run-times for warm (cold) cache (sec.)

            Barton        Yago         LibThing
RDF-3X      0.4 (5.9)     0.04 (0.7)   0.13 (0.89)
MonetDB     4.8 (26.4)    54.6 (78.2)  4.39 (8.16)
PostgreSQL  64.3 (167.8)  0.56 (10.6)  30.4 (93.9)

Conclusion

RDF-3X (RDF Triple eXpress) is a fast and flexible RDF/SPARQL engine

Exhaustive but very space-efficient triple indexes

Avoids physical design tuning, generic storage

Fast runtime system, query optimization has a huge impact

RDF-3X is freely available

http://www.mpi-inf.mpg.de/~neumann/rdf3x


Paper Evaluation

Pros

Good Idea

Introduce & Solve Optimization Issues

Implementation

My Comments

Real examples about optimization issues

RISC-style?

– Most operators merely process integer-encoded IDs, consume and produce streams of ID tuples, compare IDs, etc. ??

Insert, update, and delete?

Namespace
