Hexastore: Sextuple Indexing for Semantic Web Data Management Presented by Cathrin Weiss, Panagiotis...


Hexastore: Sextuple Indexing for Semantic Web Data Management

Presented by Cathrin Weiss, Panagiotis Karras, Abraham Bernstein

Department of Informatics, University of Zurich

Session: Indexing and Query Processing, VLDB 2008

2010-01-22

Summarized by Jaeseok Myung

Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

Copyright 2010 by CEBT

Overview

Hexastore – Sextuple Indexing

A Triple (S, P, O) can be represented in six ways (3! = 6)

– SPO, SOP, PSO, POS, OSP, OPS

Every possible indexing scheme can be materialized

– Allows quick and scalable query processing

– Requires up to five times more index space in the worst case

In this presentation,

Review conventional RDF storage structures

Introduction to Hexastore

Discussion

Center for E-Business Technology IDS Lab. Seminar – 2/20


Physical Designs for RDF Storage (1/4)

Giant Triples Table


SELECT ?title
WHERE {
  ?book <title> ?title .
  ?book <author> <Fox, Joe> .
  ?book <copyright> <2001>
}

Join! Join!

Entire Table Scan!

Redundancy!
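The cost of the query above on a single triples table can be sketched in a few lines of Python. This is only an illustration: the rows and the helper `match` are made up for the example and are not part of the original slides.

```python
# Toy triples table; the rows are made-up illustration data.
triples = [
    ("book1", "title", "Databases"),
    ("book1", "author", "Fox, Joe"),
    ("book1", "copyright", "2001"),
    ("book2", "title", "Networks"),
    ("book2", "author", "Fox, Joe"),
    ("book2", "copyright", "1999"),
]

def match(prop, obj=None):
    """Scan the whole table for rows with the given property (and object, if set)."""
    return [(s, o) for s, p, o in triples if p == prop and (obj is None or o == obj)]

# One full table scan per triple pattern ...
titles = match("title")
by_author = {s for s, _ in match("author", "Fox, Joe")}
by_year = {s for s, _ in match("copyright", "2001")}

# ... followed by two joins on the shared ?book variable.
result = [title for book, title in titles if book in by_author and book in by_year]
print(result)
```

Each of the three patterns scans every row, and the subject and property strings are repeated in every row, which is what the callouts on this slide point at.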


Physical Designs for RDF Storage (2/4)

Clustered Property Table

Contains clusters of properties that tend to be defined together


Physical Designs for RDF Storage (3/4)

Property-Class Table

Exploits the type property of subjects to cluster similar sets of subjects together in the same table

Unlike in a clustered property table, a property may exist in multiple property-class tables


Values of the type property


Physical Designs for RDF Storage (4/4)

Vertically Partitioned Table

The giant table is rewritten into n two-column tables, where n is the number of unique properties in the data

We don’t have to

– Maintain NULL values

– Choose a clustering algorithm


[Figure labels: subject, property, object]
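A minimal Python sketch of vertical partitioning, assuming an in-memory list of triples. The function names `partition` and `properties_of` and the third row are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# Toy data; the third row is made up for the example.
triples = [
    ("object214", "hasColor", "blue"),
    ("object214", "belongsTo", "object352"),
    ("object352", "hasColor", "red"),
]

def partition(triples):
    """Split the triples table into one (subject, object) table per property."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    for rows in tables.values():
        rows.sort()  # keep each two-column table sorted by subject
    return dict(tables)

tables = partition(triples)

# Property-bound lookup: only one small table is touched.
has_color = tables["hasColor"]

def properties_of(subject):
    """Lookup that is not bound by property: every table has to be visited."""
    return sorted((p, o) for p, rows in tables.items() for s, o in rows if s == subject)

print(has_color)
print(properties_of("object214"))
```

No NULLs are stored and no clustering decision is made, but a lookup that does not fix the property has to visit all n tables.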


Motivation

The problem of having non-property-bound queries


Hexastore: Sextuple Indexing


[Figure: Hexastore index structure; only the node labels S, P, and O survive]


Five-fold Increase in Index Space

Sharing The Same Terminal Lists

SPO-PSO, SOP-OSP, POS-OPS

The key of each of the three resources in a triple appears in two headers and two vectors, but only in one list
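The sharing of terminal lists can be sketched as a toy in-memory structure. This is an assumption-laden illustration, not the authors' implementation: the class layout, the nested-dictionary representation, and the value "green" are all hypothetical.

```python
from bisect import insort
from collections import defaultdict

class Hexastore:
    """Toy sketch: header -> vector -> sorted terminal list, for six orderings."""

    ORDERS = ("spo", "pso", "sop", "osp", "pos", "ops")

    def __init__(self):
        self.index = {name: defaultdict(dict) for name in self.ORDERS}

    def add(self, s, p, o):
        val = {"s": s, "p": p, "o": o}
        # Each pair (spo/pso, sop/osp, pos/ops) ends in the same element
        # and therefore points at one shared terminal list.
        for a, b, c in ("spo", "sop", "pos"):
            first, second = self.index[a + b + c], self.index[b + a + c]
            terminal = first[val[a]].get(val[b])
            if terminal is None:
                terminal = []
                first[val[a]][val[b]] = terminal
                second[val[b]][val[a]] = terminal
            if val[c] not in terminal:
                insort(terminal, val[c])  # keep the list sorted

store = Hexastore()
store.add("object214", "hasColor", "blue")
store.add("object214", "hasColor", "green")
store.add("object214", "belongsTo", "object352")

colors = store.index["spo"]["object214"]["hasColor"]
shared = colors is store.index["pso"]["hasColor"]["object214"]
print(colors, shared)
```

In this sketch each resource key appears as a header in two indices, as a vector entry in two indices, and in one shared list, which matches the 2 + 2 + 1 = 5 count behind the worst-case five-fold bound.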


Mapping Dictionary

Replacing all literals by unique IDs using a mapping dictionary

Mapping dictionary compresses the triple store

– Reduces redundancy and saves a lot of physical space

We can concentrate on a logical index structure rather than the physical storage design


Original triples:

S          P          O
object214  hasColor   blue
object214  belongsTo  object352
…          …          …

Encoded triples:

S  P  O
0  1  2
0  3  4
…  …  …

Mapping dictionary:

ID  Value
0   object214
1   hasColor
…   …
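The encoding step can be sketched as follows. The function name `encode` and the first-seen ID assignment are assumptions made for the example; the slides do not say how IDs are allocated.

```python
def encode(triples):
    """Replace every string by an integer ID, assigned in order of first appearance."""
    ids = {}      # value -> ID
    values = []   # ID -> value (the mapping dictionary)

    def to_id(value):
        if value not in ids:
            ids[value] = len(values)
            values.append(value)
        return ids[value]

    encoded = [tuple(to_id(x) for x in triple) for triple in triples]
    return encoded, values

triples = [
    ("object214", "hasColor", "blue"),
    ("object214", "belongsTo", "object352"),
]
encoded, values = encode(triples)
print(encoded)  # [(0, 1, 2), (0, 3, 4)]
print(values)
```

With this assignment the two sample rows come out as 0 1 2 and 0 3 4, as in the table above, and every index can work on small integers instead of strings.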


Clustered B+-Tree (RDF-3X, VLDB 2008)

Store everything in a clustered B+-Tree

Triples are sorted in lexicographical order

– Allowing the conversion of SPARQL patterns into range scans

We don’t have to scan the entire table
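How a sorted order turns a triple pattern into a range scan can be sketched with a sorted Python list standing in for the leaf level of a clustered B+-tree. This is a simplification, not RDF-3X itself; `range_scan` is a hypothetical helper and the rows starting with 5 are made up.

```python
from bisect import bisect_left

# ID triples in SPO order; a sorted list stands in for the B+-tree leaves.
spo = sorted([(0, 1, 2), (0, 3, 4), (5, 1, 2), (5, 3, 0)])

def range_scan(sorted_triples, prefix):
    """Return all triples starting with `prefix`, located by binary search."""
    start = bisect_left(sorted_triples, prefix)
    out = []
    for triple in sorted_triples[start:]:
        if triple[:len(prefix)] != prefix:
            break  # left the matching range, so stop reading
        out.append(triple)
    return out

print(range_scan(spo, (0,)))     # pattern: subject 0, any property, any object
print(range_scan(spo, (5, 3)))   # pattern: subject 5, property 3, any object
```

Only the matching range is read, so a pattern with a bound prefix never touches the rest of the data.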


[Figure: B+-tree with node keys 002 … and 000 001 002 003]

S  P  O
0  1  2
0  3  4
…  …  …

Actually, we don’t need this table!

<Mapping Dictionary>

ID  Value
0   object214
1   hasColor
…   …


Argumentation

Concise and Efficient Handling of Multi-valued Resources

Index can contain multiple items

cf. Multi-valued Property Table

Avoidance of NULLs

Only those RDF elements that are relevant to a particular other element need to be stored in a particular index

No ad-hoc Choices Needed

Most other RDF data storage schemes require several ad-hoc decisions about their data representation architecture

– e.g., the clustered property table (which properties should be stored together?)


Argumentation

Reduced I/O cost

Other RDF storage schemes may need to access multiple tables that are irrelevant to a query

– e.g., for queries that are not bound by property

All First-step Pairwise Joins are Fast Merge-Joins

The keys of resources in all vectors and lists used in a Hexastore are sorted

Reduction of Unions and Joins

e.g., a list of subjects related to two particular objects through any property

– Hexastore can use the osp index
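The example of subjects related to two particular objects can be sketched as a merge of two sorted subject vectors. The `osp` dictionary, the object "box7", and the subject "object900" are made-up illustration data; `merge_intersect` is a hypothetical helper name.

```python
def merge_intersect(a, b):
    """Intersect two sorted lists in one linear pass (the core of a merge join)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Toy osp index: object -> subject vector -> property list.
osp = {
    "blue": {"object214": ["hasColor"], "object352": ["hasColor"]},
    "box7": {"object214": ["belongsTo"], "object900": ["belongsTo"]},
}

# In a Hexastore the subject vectors are already sorted; sorted() only
# enforces that for the plain dictionaries used here.
subjects_blue = sorted(osp["blue"])
subjects_box = sorted(osp["box7"])
common = merge_intersect(subjects_blue, subjects_box)
print(common)
```

Because both inputs are sorted, the join needs a single pass and no union over per-property tables.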


Treating the Path Expression Problem

SELECT B.subj
FROM triples AS A, triples AS B
WHERE A.prop = 'wasBorn'
  AND A.obj = '1860'
  AND A.subj = B.obj
  AND B.prop = 'Author'

A path expression requires (n-1) subject-object self-joins where n is the length of the path

Vertical Partitioning

– Materialized Path Expressions (A.author:wasBorn = ‘1860’)

– C(n-1, 2) = O(n^2) possible additional properties

Hexastore

– (n-1) merge-joins using the pso and pos indices
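For the two-step path in the query above, the subject-object join can be sketched as one merge over sorted keys. The `pos` contents are toy data and `merge_join` is a hypothetical helper. Because the object of the first pattern is bound, this sketch takes the sorted subjects from pos; with an unbound object, the subject vector of pso would presumably play that role.

```python
def merge_join(left_keys, right_vector):
    """Merge sorted join keys with the sorted keys of an index vector."""
    right_keys = sorted(right_vector)  # already sorted in a real Hexastore
    i = j = 0
    out = []
    while i < len(left_keys) and j < len(right_keys):
        if left_keys[i] == right_keys[j]:
            out.extend(right_vector[right_keys[j]])
            i += 1
            j += 1
        elif left_keys[i] < right_keys[j]:
            i += 1
        else:
            j += 1
    return out

# Toy pos index: property -> object -> sorted subject list.
pos = {
    "wasBorn": {"1860": ["chekhov", "mahler"]},
    "Author": {
        "chekhov": ["TheSeagull", "UncleVanya"],
        "tolstoy": ["WarAndPeace"],
    },
}

born_1860 = pos["wasBorn"]["1860"]            # sorted A.subj values
works = merge_join(born_1860, pos["Author"])  # B.subj where B.obj = A.subj
print(works)
```

A longer path would repeat the same step, one merge per additional hop, without materializing any extra properties.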


Experimental Evaluation

Setup

2.8GHz dual core, 16GB RAM

Competitors

Column-oriented Vertical Partitioning Approaches

– COVP1 – PSO Index

– COVP2 – PSO Index + POS Index (second copy)

Hexastore

– SPO, SOP, PSO, POS, OSP, OPS

Datasets

Barton: MIT library data, 61 million triples, 258 properties

LUBM: a synthetic benchmark data set (10 universities), 6.8 million triples, 18 predicates


Performance (Barton Data)


Performance (LUBM, 10)


Memory Usage

In practice, Hexastore requires a four-fold increase in memory in comparison to COVP1, which is an affordable cost for the derived advantages


Conclusion

Hexastore: Sextuple-Indexing Scheme

Worst-case five-fold storage increase in comparison to a conventional triples table

Quick and scalable general-purpose query processing

– All pairwise joins in a Hexastore can be rendered as merge joins

My Question

Main-memory indexing (is it feasible?)

– 7 GB of RAM for 6 million triples

Other Options?
