SPARQL and RDF query optimization

SPARQL Query Processing Techniques using Structural Information of RDF Graphs in Relational RDF Store

Seoul National University, Internet Database Lab.

Kisung Kim, 2013. 11. 22

Ph.D Defense Presentation

OUTLINE

• Introduction
– Motivation
– Existing Approaches
– Contributions
• R3F: RDF Triple Filtering for SPARQL Query Processing
• RP-Index: RDF Path index for Triple Filtering
• RG-index: RDF Graph index for Triple Filtering
• Conclusion & Future Work

INTRODUCTION (1/8): RDF IS BIG GRAPH DATA

• RDF (Resource Description Framework)
– W3C recommendation in 1999
– General and flexible data model for sharing data via the Web
– Schema-less, graph-structured data model

• Query processing over large-scale RDF graphs is becoming more challenging

Figure: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch (http://lod-cloud.net/)
– May 2007: 12 data sources
– September 2011: 295 data sources, 31 billion triples

INTRODUCTION (2/8): DATA MODEL OF RDF AND SPARQL

• RDF
– A set of RDF triples (<Subject, Predicate, Object>)
– Edge-labeled directed graph
– Ex) <paper1, publicationType, ‘Survey Paper’>: an edge labeled publicationType from paper1 to ‘Survey Paper’

• SPARQL
– Standard query language for RDF (W3C recommendation in 2008)
– SELECT-FROM-WHERE form
– Sub-graph pattern matching

RDF Triples
<v1, p1, v2>
<v2, p2, v4>
<v3, p1, v2>
<v2, p2, v5>

RDF Graph: v1 and v3 each have a p1 edge to v2; v2 has p2 edges to v4 and v5

SPARQL Query
SELECT * WHERE { <?v1, p1, ?v2> <?v2, p2, ?v3> }

SPARQL Query Graph: ?v1 -p1-> ?v2 -p2-> ?v3

Results
?v1  ?v2  ?v3
v1   v2   v4
v1   v2   v5
v3   v2   v4
v3   v2   v5
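The sub-graph pattern matching on this slide can be sketched in a few lines of Python. This is only an illustration of the semantics, not how an RDF store evaluates queries: the helper names `match_pattern` and `evaluate` are hypothetical, and the query is written as a list of (subject, predicate, object) patterns in which terms starting with `?` are variables.

```python
from typing import Dict, Iterator, List, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# The four RDF triples from the slide.
TRIPLES: List[Triple] = [
    ("v1", "p1", "v2"),
    ("v2", "p2", "v4"),
    ("v3", "p1", "v2"),
    ("v2", "p2", "v5"),
]


def match_pattern(pattern: Triple, triples: List[Triple], binding: Binding) -> Iterator[Binding]:
    """Yield every extension of `binding` under which `pattern` matches a triple."""
    for triple in triples:
        candidate = dict(binding)
        matched = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its existing value.
                if candidate.setdefault(term, value) != value:
                    matched = False
                    break
            elif term != value:
                matched = False
                break
        if matched:
            yield candidate


def evaluate(patterns: List[Triple], triples: List[Triple]) -> List[Binding]:
    """Evaluate the triple patterns one by one, joining on shared variables."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match_pattern(pattern, triples, b)]
    return bindings


QUERY = [("?v1", "p1", "?v2"), ("?v2", "p2", "?v3")]
results = sorted((b["?v1"], b["?v2"], b["?v3"]) for b in evaluate(QUERY, TRIPLES))
for row in results:
    print(row)
```

The four printed rows correspond to the Results table on the slide.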

INTRODUCTION (3/8): TWO TYPES OF RDF STORES

• Relational RDF store
– Storage: relational table
– Query processing: relational operators (join and scan)
– Systems: Jena [WWW2004], Sesame [ISWC2002], Oracle [VLDB2005], SW-store [VLDBJ2009], RDF-3X [VLDBJ2010]
– Pros: batch processing using join operators; large-scale RDF processing [VLDB2012]
– Cons: does not use the graph structure

• Graph RDF store
– Storage: adjacency list, mainly in-memory
– Query processing: sub-graph isomorphism algorithm
– Systems: GRIN [AAAI2007], Dogma [ISWC2009], PIG [SemData2010], gStore [VLDBJ2013]
– Pros: reduces the search space of the graph traversal using the graph structure
– Cons: not scalable; inappropriate for large-scale processing

INTRODUCTION (4/8): RELATIONAL RDF STORE

• Most RDF stores use the relational model
– Store RDF triples in relational tables (triple table)
– Process SPARQL queries using scan and join operators
– Triple table with columns (S, P, O): simple and general, but incurs too many self-joins

• Challenges of relational RDF stores
– Involves many join operators
– A SPARQL query with N triple patterns requires N-1 joins

• We will focus on the relational RDF stores

Figure: the SPARQL graph <?v1, p1, ?v2>, <?v2, p2, ?v3>, …, <?vn, pn, ?vn+1> is evaluated by a plan of n scans (Scan p1 … Scan pn) combined by n-1 joins (Join 1 … Join n-1)

INTRODUCTION (5/8): EXISTING RELATIONAL RDF STORE

• Storage approaches
– Clustered property table
  – Jena [WWW2004], Sesame [ISWC2002], Oracle [VLDB2005]
  – Cluster properties which are frequently accessed together, e.g. a table with columns (ID, Name, age, gender)
  – Pros: reduces joins
  – Cons: limited flexibility, cluster decision, null values, multi-values
– Sorted triple tables (multiple indexing)
  – SW-store [VLDBJ2009], RDF-3X [VLDBJ2010]
  – Store triples as sorted in a column-oriented store or clustered B+ trees, e.g. (S, P, O) tables sorted by S, by P, and by O
  – Pros: fast retrieval of matching triples, fast merge join
  – Cons: storage overhead, update

INTRODUCTION (6/8): EXISTING RELATIONAL RDF STORE

• Approaches to handling intermediate results
– Finding the optimal plan
  – Static and traditional approach
  – Propose RDF-specific histograms
  – RDF-3X [VLDBJ2010], Characteristic sets [ICDE2011], ARQ [WWW2008]
– Dynamic filtering method
  – Build dynamic filters and use them in subsequent operators
  – U-SIP [SIGMOD2009]

• Existing methods do not exploit the graph structure of RDF graphs

Figure: the same plan (Scan p1 and Scan p2 combined by a merge join, then a hash join with Scan p3) under the two approaches: finding the optimal plan (static), and the dynamic filtering method, in which next information and domain filters are passed between operators

INTRODUCTION (7/8): OUR APPROACH: RDF TRIPLE FILTERING

• Reduce intermediate results using the structure of RDF graphs in relational RDF stores

• We propose an RDF triple filtering method
– Filter irrelevant triples in advance
– Reduce intermediate results using graph structure

Figure: an RDF graph (vertices v1 to v15, predicates p1 to p4), a SPARQL query (?v1 to ?v5, predicates p1 to p4), and its plan (Scan p1 … Scan p4 combined by Join 1 … Join 3)

Redundant Intermediate Results
?v2  ?v3  ?v4  ?v5
v2   v3   v3   v4
v7   v8   v8   v9
v13  v14  v14  v15

INTRODUCTION (8/8): CONTRIBUTIONS: SUMMARY

• RDF triple filtering framework (R3F)
– Filters out irrelevant triples in advance
– Reduces redundant intermediate results during SPARQL processing
– Incorporates the triple filtering method into the relational RDF processing framework
– Deals with all query processing steps

• We propose two indices for R3F
– RP-index (RDF Path index)
  – Path-based index designed for efficient RDF triple filtering
  – Deals with several issues: size problem, building, and maintenance
– RG-index (RDF Graph index), to overcome the limitation of RP-index
  – Uses a sub-graph pattern mining algorithm
  – Proposes efficient sub-graph pattern mining for RDF graphs

OUTLINE

• Introduction
• R3F: RDF Triple Filtering for SPARQL Query Processing
– Motivation
– Overview of R3F
– Three components of R3F
• RP-Index: RDF Path Index for R3F
• RG-index: RDF Graph index for Triple Filtering
• Conclusion & Future Work

R3F (1/6): MOTIVATION

• Goal
– Provide a general framework for RDF triple filtering
– Use structural information of RDF graphs in relational RDF stores
– Incorporate a triple filtering feature into existing relational RDF stores

• Three components of R3F
– Materialized filter data built using structural information
– Relation filtering operator
– Cardinality estimation method for the filtering operator

• We assume that the triples retrieved by scan operators are sorted by the subject or object column
– Many RDF stores keep triples sorted for efficient triple retrieval and merge joins

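R3F (2/6): SYSTEM OVERVIEW

Figure: architecture of an RDF store extended with R3F
– Loader / Updater: loads RDF data (RDF/XML, N3, …) into the triple storage (triple table, index) and the statistical information (histogram, index)
– Query Optimizer: takes a SPARQL query and the statistical information and produces a plan
– Query Execution Engine: runs the plan over triples from the triple storage and returns the results
– R3F additions: filter data (RP-index, RG-index) with its updater, the RFLT operator, and cardinality estimation of the RFLT operator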

R3F (3/6): FILTER DATA

• Answer vertices should satisfy some structural conditions
• Provide lists of vertices which satisfy a specific structural condition
• Candidate vertex (CV) set for a query vertex
– Superset of the final results
– Defined using several structures of the query
• Vertex lists (Vlists) provide CVs as sorted lists

Figure: the SPARQL query (?v1 to ?v5, predicates p1 to p4) and the RDF graph (v1 to v15)

Example: answers for ?v3 should have two incoming path patterns, <p3, p2> and <p4, p2>
– Vlist(<p3, p2>) = v3, v8, v14
– Vlist(<p4, p2>) = v3, v8
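As a rough sketch of how such filter data could be derived, the snippet below computes the Vlist of an incoming predicate path and intersects Vlists into a candidate vertex set. The edge list is an assumption: the figure's exact edges are not recoverable from the slide text, so `TRIPLES` is a hypothetical graph chosen only so that it reproduces the two Vlists quoted above. The function names are likewise illustrative.

```python
from typing import List, Optional, Sequence, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical edges (assumption), chosen to reproduce the Vlists on the slide.
TRIPLES: List[Triple] = [
    ("v1", "p3", "v2"), ("v5", "p4", "v2"), ("v2", "p2", "v3"), ("v3", "p1", "v4"),
    ("v6", "p3", "v7"), ("v10", "p4", "v7"), ("v7", "p2", "v8"), ("v8", "p1", "v9"),
    ("v12", "p3", "v13"), ("v13", "p2", "v14"), ("v14", "p1", "v15"),
]


def vertex_key(vertex: str) -> int:
    """Sort v3 before v14 by comparing the numeric suffix."""
    return int(vertex[1:])


def vlist(path: Sequence[str], triples: List[Triple]) -> List[str]:
    """Sorted vertices that have an incoming path whose edge labels spell `path`."""
    frontier: Optional[Set[str]] = None  # None means "any start vertex"
    for predicate in path:
        frontier = {
            o for s, p, o in triples
            if p == predicate and (frontier is None or s in frontier)
        }
    return sorted(frontier or (), key=vertex_key)


def candidate_vertices(paths: Sequence[Sequence[str]], triples: List[Triple]) -> List[str]:
    """Candidate vertex set: the intersection of the Vlists of all given paths."""
    sets = [set(vlist(path, triples)) for path in paths]
    return sorted(set.intersection(*sets), key=vertex_key)


print(vlist(("p3", "p2"), TRIPLES))
print(vlist(("p4", "p2"), TRIPLES))
print(candidate_vertices([("p3", "p2"), ("p4", "p2")], TRIPLES))
```

With this assumed graph, v14 appears in Vlist(<p3, p2>) but not in Vlist(<p4, p2>), so it drops out of the candidate set for ?v3.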

R3F (4/6): FILTERING OPERATOR: RFLT

• Performs triple filtering for scan operators
• Filters out triples whose filtering keys are not in the CV sets
• Filtering by an N-way merge process
– Input triples are sorted in many RDF stores
– Vlists are also stored as sorted
– Needs only sequential I/O (reading Vlists) and a merge process

Example: Scan <?v3, p1, ?v4> followed by RFLT on the filtering key ?v3, with CV for ?v3 = v3, v8
– Input triples: (v3, v4), (v8, v9), (v14, v15)
– Output triples: (v3, v4), (v8, v9)
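The N-way merge can be sketched as follows. This is a minimal illustration, assuming dictionary-encoded integer IDs (3 stands for v3, and so on) so that sort order is plain integer order; the function name `rflt` and its signature are hypothetical and not taken from RDF-3X or the R3F implementation.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

Triple = Tuple[int, str, int]  # dictionary-encoded subject, predicate, object


def rflt(triples: Iterable[Triple], vlists: Sequence[List[int]], key_index: int = 0) -> Iterator[Triple]:
    """Yield triples whose filtering key occurs in every sorted Vlist.

    `triples` must be sorted on the filtering key. Each Vlist is read once,
    front to back, which mirrors the sequential I/O of the merge process.
    """
    cursors = [0] * len(vlists)
    for triple in triples:
        key = triple[key_index]
        keep = True
        for i, vlist in enumerate(vlists):
            # Advance this Vlist's cursor to the first entry >= key.
            while cursors[i] < len(vlist) and vlist[cursors[i]] < key:
                cursors[i] += 1
            if cursors[i] == len(vlist) or vlist[cursors[i]] != key:
                keep = False
                break
        if keep:
            yield triple


scan_output = [(3, "p1", 4), (8, "p1", 9), (14, "p1", 15)]
filtered = list(rflt(scan_output, [[3, 8, 14], [3, 8]]))
print(filtered)
```

Because cursors only move forward, the cost is linear in the number of input triples plus the total Vlist length, and no Vlist has to be held in a hash table.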

R3F (5/6): QUERY OPTIMIZATION

• Output cardinality estimation is essential for the cost-based optimizer (CBO)

• Cardinality estimation of the RFLT operator
– Assume a uniform distribution of filtering key values
– Use the set intersection estimation method:

\[ |RFLT| = |Scan| \times \frac{\left| \left( \bigcap_{vlist \in V} vlist \right) \cap FK \right|}{|FK|} \]

– V: the set of Vlists for the RFLT; FK: the set of filtering key values
– The numerator is obtained by intersection estimation; the other terms come from the statistical information
– e.g. for Scan <?v3, p1, ?v4> with input triples (v3, v4), (v8, v9), (v14, v15) and Vlist for ?v3 = v3, v8:

\[ |RFLT| = 3 \times \frac{|\{v3, v8\} \cap \{v3, v8, v14\}|}{|\{v3, v8, v14\}|} = 3 \times \frac{2}{3} = 2 \]

• The CBO decides, based on the estimated cardinality,
– Whether to apply an RFLT operator to a scan operator
– Which Vlists to use
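The estimate amounts to scaling the scan cardinality by the fraction of filtering key values that survive the Vlist intersection. A small sketch with the slide's example numbers is shown below. Here the intersection size is computed exactly from toy sets, whereas an optimizer would estimate it from statistics; the function name is illustrative.

```python
def estimate_rflt_cardinality(scan_cardinality: int, fk_count: int, intersection_count: float) -> float:
    """Estimated RFLT output size under a uniform distribution of filtering keys.

    scan_cardinality   -- number of triples produced by the scan
    fk_count           -- number of distinct filtering key values in the scan output
    intersection_count -- (estimated) number of key values present in all Vlists
    """
    if fk_count == 0:
        return 0.0
    return scan_cardinality * intersection_count / fk_count


# Example from the slide: three input triples with keys v3, v8, v14; Vlist = v3, v8.
fk = {"v3", "v8", "v14"}
vlists = [{"v3", "v8"}]
common = set.intersection(fk, *vlists)
estimate = estimate_rflt_cardinality(3, len(fk), len(common))
print(estimate)
```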

R3F (6/6): SUMMARY OF R3F

Figure: RP-index and RG-index provide filter data as sorted lists. The query optimizer takes a SPARQL query and statistical information and produces an optimized plan with RFLT operators; the query executor runs the plan with the RFLT operator and returns the results.

OUTLINE

• Introduction
• R3F: RDF Triple Filtering
• RP-Index: RDF Path Index for R3F
– Design of RP-Index
– Size Problem
– Experimental Results
• RG-index: RDF Graph index for Triple Filtering
• Conclusion & Future Work

RP-INDEX (1/7): MOTIVATION AND RELATED WORK

• Motivation
– Design an index that provides lists of vertices having a specific path pattern
– Efficient and updatable index

• Related work: path-based indices
– DataGuide [VLDB1997], 1-index [ICDT1999], A(k)-index [ICDE2002], D(k)-index [SIGMOD2003], M(k)-index [ICDE2004]
– Provide a concise summary of the original data for query processing
– Handle the size problem by storing every vertex only once in the index

• Our goal is to provide filter data efficiently
– Vertices can be stored several times, and are stored as sorted
– We deal with the size problem differently

RP-INDEX (2/7): DESIGN OF RP-INDEX

• Provides CV sets using predicate path patterns

• Predicate path pattern
– A sequence of predicates: e.g. <p1, p2, p3> for the path ?v1 -p1-> ?v2 -p2-> ?v3 -p3-> ?v4

• Definition: RP-index (RDF database D, maxL)
– A set of <ppath, Vlist(ppath)>, where ppath exists in D and |ppath| ≤ maxL

• We also index reverse predicates (outgoing edges)

Figure: the RDF graph (v1 to v15) and the SPARQL query (?v1 to ?v5, predicates p1 to p4)

CV for ?v3 = Vlist(<p3, p2>) ∩ Vlist(<p4, p2>) = {v3, v8}

RP-index (D, 2) with reverse predicates
Vlist(<p1>) = v4
Vlist(<p2>) = v3
Vlist(<p3>) = v2
Vlist(<p4>) = v2
Vlist(<p1R>) = v3
Vlist(<p2R>) = v2
Vlist(<p3R>) = v1
Vlist(<p4R>) = v5
Vlist(<p2, p1>) = v4
Vlist(<p3, p2>) = v3
Vlist(<p4, p2>) = v3
Vlist(<p1R, p2R>) = v2
Vlist(<p2R, p3R>) = v1
Vlist(<p2R, p4R>) = v5
Vlist(<p2, p2R>) = v11
Vlist(<p3, p4R>) = v5

RP-INDEX (3/7): SIZE PROBLEM OF RP-INDEX

• The number of predicate paths is exponential: \( O\left( \sum_{i=1}^{maxL} |P|^{i} \right) \), where |P| is the number of predicates

• Solution
– Choose predicate paths that are effective for filtering

• Two criteria for choosing predicate paths
– Discriminative predicate path: a Vlist that is not discriminative can be replaced by the Vlist of a shorter predicate path
– Frequent predicate path: infrequent Vlists are rarely used

Example: Vlist(<p2, p3>) = r1, r2, r3, r4, r5, r6, r7 and Vlist(<p1, p2, p3>) = r2, r3, r4, r6, r7
– Discriminative predicate path: |Vlist(<p1, p2, p3>)| / |Vlist(<p2, p3>)| = 5/7 = 0.71. If the discriminative ratio is 0.7, then Vlist(<p1, p2, p3>) is not stored
– Frequent predicate path: |Vlist(<p1, p2, p3>)| = 5. If the minimum frequency is 7, then Vlist(<p1, p2, p3>) is not stored
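The two selection criteria can be written as a single predicate. The sketch below is an interpretation of the slide's example: the function name is hypothetical, and the handling of the boundary case (a ratio exactly equal to the threshold is stored) is an assumption.

```python
from typing import Sequence


def should_store(vlist: Sequence[str], shorter_vlist: Sequence[str],
                 discriminative_ratio: float, min_frequency: int) -> bool:
    """Decide whether the Vlist of a predicate path is worth storing.

    vlist         -- Vlist of the candidate path, e.g. <p1, p2, p3>
    shorter_vlist -- Vlist of the shorter path that could replace it, e.g. <p2, p3>
    """
    if not vlist or not shorter_vlist:
        return False
    ratio = len(vlist) / len(shorter_vlist)
    is_discriminative = ratio <= discriminative_ratio  # boundary handling is an assumption
    is_frequent = len(vlist) >= min_frequency
    return is_discriminative and is_frequent


short_path = ["r1", "r2", "r3", "r4", "r5", "r6", "r7"]  # Vlist(<p2, p3>)
long_path = ["r2", "r3", "r4", "r6", "r7"]               # Vlist(<p1, p2, p3>)

print(should_store(long_path, short_path, discriminative_ratio=0.7, min_frequency=0))
print(should_store(long_path, short_path, discriminative_ratio=1.0, min_frequency=7))
print(should_store(long_path, short_path, discriminative_ratio=0.8, min_frequency=5))
```

The first two calls reproduce the two "not stored" cases on the slide; the third shows a setting under which the same Vlist would be kept.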

RP-INDEX (4/7): BUILDING AND MAINTENANCE

• Build Vlist(ppath) using the Vlist of the longest proper prefix of ppath
– Reduces redundant computation
– e.g. Vlist(<p1, p2, p3>) is built by joining Vlist(<p1, p2>) with the triples of p3

• Incremental update
– Predicate paths containing predicates in the update
– We reduce the number of Vlists to update using delta information

Figure: trie of predicate paths (Root; p1, p2, p3; <p1, p1>, <p1, p2>, …, <p3, p3>) for the update predicates UP = {p1, p2}; the children of p1 are built by joining Vlist(p1) with p1, p2, and p3

RP-INDEX (5/7): EXPERIMENTAL RESULTS: SETTING

• Experimental environment
– We implemented R3F and RP-index on top of an open-source RDF store, RDF-3X (0.3.6)*
– IBM machine with 8 Intel Xeon 3.0 GHz cores and 16 GB of memory

• Datasets
– LUBM (Lehigh University Benchmark): university domain, synthetic dataset
– SP2B (SPARQL Performance Benchmark): DBLP scenario, synthetic dataset
– DBPSB (DBpedia SPARQL Benchmark): DBpedia, real-world characteristics

Dataset Statistics
Dataset   Predicates   Triples   RDF-3X Size (GB)
LUBM      18           1,335 M   77
SP2B      77           1,399 M   123
DBPSB     39,675       183 M     25

* https://code.google.com/p/rdf3x/

RP-INDEX (6/7): EXPERIMENTAL RESULTS: RP-INDEX SIZE

• We built three RP-indices (maxL=3)
• RP-index is much smaller than the database

Parameter Settings
Setting   Discriminative Ratio   Frequency Function   Reverse Predicate
1         1                      0                    not included
2         1                      0                    included
3         0.7                    (l-1/maxL)^2 × n     included

RP-index Size (GB)
Setting   LUBM    SP2B    DBPSB
1         0.307   2.05    2.85
2         19.12   87.99   N/A
3         1.39    21.97   6.52

Database Size (GB)
LUBM   SP2B   DBPSB
77     123    25

RP-INDEX (7/7): EXPERIMENTAL RESULTS

• For most queries, R3F using RP-index reduces the execution times
• Including reverse predicates is more effective for triple filtering
• Indexing only discriminative and frequent predicate paths does not degrade query performance much

Figure: results on (a) LUBM, (b) SP2B, (c) DBPSB

OUTLINE

• Introduction
• R3F: RDF Triple Filtering using RG-index
• RP-Index: RDF Path Index for R3F
• RG-index: RDF Graph index for Triple Filtering
– Motivation
– Design of RG-index
– Building RG-index
– Evaluation Results
• Conclusion & Future Work

RG-INDEX (1/11): MOTIVATION

• Limited filtering power of RP-index
– Uses only path information for graph-structured RDF data

• Need to index graph structures

Figure: the SPARQL query (?v1 to ?v5, predicates p1 to p4) and the RDF graph (v1 to v15), with one match annotated "RP-index cannot filter out this result"

RG-INDEX (2/11): RELATED WORK

• Graph index
– Graph-transactional setting (many small graphs)
  – GraphGrep [PODS2002], gIndex [SIGMOD2004], C-tree [ICDE2006], QuickSI [VLDB2008], Tale [ICDE2008]
– A single large graph
  – GraphQL [SIGMOD2008], GADDI [EDBT2009], SPath [VLDB2010]
– Designed for reducing the search space of the graph traversal
– Non-trivial to apply to relational RDF stores

• Subgraph pattern mining
– Graph-transactional setting
  – gSpan [ICDM2002], Gaston [KDD2004]
– A single large graph
  – HSIGRAM, VSIGRAM [JDMKD2005]
  – Not scalable for large RDF graphs
– We need to adapt an existing algorithm for RDF graphs

RG-INDEX (3/11): DESIGN OF RG-INDEX

• Graph pattern
– A graph in which all vertices are variables and all predicates are bound

• Definition: RG-index (D, maxL)
– A set of <gp, VS(gp)>, where gp is a graph pattern in D, |gp| ≤ maxL, and VS(gp) is the set of Vlists for the vertices in gp

Figure: RG-index maps graph patterns of size 1 up to size maxL to their Vlists
– gp1 (size 1, over ?v1 and ?v2 with edge p1): Vlist(gp1, ?v1), Vlist(gp1, ?v2)
– gp2 (size 2, over ?v1, ?v2, ?v3 with edges p1 and p2): Vlist(gp2, ?v1), Vlist(gp2, ?v2), Vlist(gp2, ?v3)

RG-INDEX (4/11): BUILDING RG-INDEX USING SUBGRAPH MINING

• Use subgraph mining because of the size problem of RG-index
– Index only frequent subgraph patterns: frequent subgraph mining

• Adapt the gSpan [Yan and Han, ICDM ’02] algorithm for RDF graphs

• gSpan
– Transactional setting
– Depth-first pattern growth approach
– Uses the anti-monotonicity property of support
– Uses DFS encoding and edge extension to prevent duplicate pattern generation

Figure: patterns grow from size-1 through size-2 up to size-maxL by edge extension, pruning infrequent or duplicate patterns

RG-INDEX (5/11): ADAPTING GSPAN FOR RDF GRAPHS

• Pattern representation
– Use the DFS code and extend it to directed edge-labeled graphs [SIGKDD2003]

• Support definition
– Should satisfy the anti-monotonicity property for efficient mining
– Most mining algorithms use the MIS (maximum independent set) approach, which is NP-hard in the single-large-graph setting
– We use the support definition of [Bringmann and Nijssen, PAKDD ‘08], the minimum number of matching vertices:

\[ \sup(G) = \min_{v \in V} |Vlist(G, v)| \]

– Very efficient to compute, and an upper bound of the MIS approaches (mining more patterns)
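The minimum-matching-vertex support is cheap to compute once the occurrences of a pattern are known. The sketch below is illustrative: the names are hypothetical, and occurrences are represented as variable-to-vertex dictionaries, which is an assumption about the data layout rather than the thesis implementation.

```python
from typing import Dict, List

Occurrence = Dict[str, str]  # pattern variable -> matched data vertex


def vlists_from_occurrences(occurrences: List[Occurrence]) -> Dict[str, List[str]]:
    """Collect, per pattern vertex, the sorted distinct data vertices it matches."""
    matched: Dict[str, set] = {}
    for occurrence in occurrences:
        for variable, vertex in occurrence.items():
            matched.setdefault(variable, set()).add(vertex)
    return {variable: sorted(vertices) for variable, vertices in matched.items()}


def support(vlists: Dict[str, List[str]]) -> int:
    """sup(G) = min over pattern vertices v of |Vlist(G, v)|."""
    return min((len(vertices) for vertices in vlists.values()), default=0)


# Pattern ?v1 -p1-> ?v2 with three occurrences that all share the same target vertex.
occurrences = [
    {"?v1": "a", "?v2": "b"},
    {"?v1": "c", "?v2": "b"},
    {"?v1": "d", "?v2": "b"},
]
pattern_vlists = vlists_from_occurrences(occurrences)
print(pattern_vlists)
print(support(pattern_vlists))
```

In this toy case there are three occurrences, but the support is 1 because all of them map ?v2 to the same vertex.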

RG-INDEX (6/11): ADAPTING GSPAN FOR RDF GRAPHS

• Redundant subgraph patterns
– Graph patterns with the same Vlists
– Graphs having non-trivial automorphisms

• Computing occurrences of a graph pattern
– Exploit depth-first style pattern generation, similarly to VSIGRAM [JDMKD2005]
– Store all occurrences of a pattern to compute its child patterns
– Store occurrences from the root to a leaf (depth-first approach)
– We propose an efficient occurrence computation method

RG-INDEX (7/11): EVALUATION RESULTS

• Data sets
– YAGO2: Yet Another Great Ontology 2

Dataset Statistics
Dataset   Predicates   Triples   RDF-3X Size (GB)
LUBM      18           1,335 M   77
YAGO2     93           37 M      9
SP2B      77           1,399 M   123

• Index build

Parameter Settings
Setting        Discriminative Ratio   Frequency Function   Reverse Predicate
RP-index       1                      0                    not included
RP-index (R)   0.7                    (l-1/maxL)^2 × n     included
RG-index       0.7                    (l-1/maxL)^2 × n     N/A

Index Size
Setting        YAGO2    LUBM     SP2B
RP-index       341 MB   1.4 GB   1.3 GB
RP-index (R)   2.3 GB   1.7 GB   3.1 GB
RG-index       880 MB   1.1 GB   1.3 GB

RG-INDEX (8/11): EVALUATION RESULTS: QUERY PERFORMANCE

• Query sets
– Extract graph patterns from each data set
– Use these patterns as test queries
– Divide the queries into four groups according to their evaluation times in RDF-3X

Test Query Groups
Group                  A      B        C          D       Total
Execution Times (ms)   0~10   10~100   100~1000   1000~
YAGO2                  824    143      41         19      1,027
LUBM                   0      7        14         45      67
SP2B                   161    210      187        7       565

SP2B (ms)
Group                A            B             C              D               Total
RDF-3X               2.76         29.02         244.62         1383.42         108.65
RP-index             2.38 (13%)   25.2 (13%)    182.72 (25%)   555.42 (59%)    76.08 (30%)
RP-index (reverse)   2.39 (13%)   25.2 (13%)    153.92 (37%)   127 (91%)       61.06 (43.8%)
RG-index             2.33 (15%)   16.39 (43%)   122.8 (49%)    106.85 (92%)    44.34 (59.19%)

LUBM (ms)
Group                A     B          C              D               Total
RDF-3X               N/A   59         444.6          2158.6          1548.8
RP-index             N/A   58 (1%)    441.6 (0.6%)   2126.9 (0.1%)   1526.8 (1%)
RP-index (reverse)   N/A   50 (15%)   420 (5%)       1274.1 (40%)    946.4 (38%)
RG-index             N/A   50 (15%)   406 (8%)       1250.2 (42%)    929.7 (40%)

YAGO2 (ms)
Group                A            B             C             D               Total
RDF-3X               3.53         34.18         240.43        16671.261       325.62
RP-index             2.75 (22%)   11.83 (65%)   94.73 (60%)   9194.21 (44%)   177.73 (45%)
RP-index (reverse)   3.00 (15%)   17.82 (47%)   79.78 (66%)   4747.26 (71%)   95.90 (70%)
RG-index             2.32 (34%)   8.65 (74%)    27.60 (88%)   581.36 (96%)    14.92 (95%)

Percentages in parentheses: reduction in execution time relative to RDF-3X

• RG-index is more effective for YAGO2 and SP2B than for LUBM

• RG-index is more effective for queries with longer evaluation times

• RG-index is more effective than RP-index and RP-index with reverse predicates
– RG-index is smaller than RP-index with reverse predicates

RG-INDEX (11/11): EVALUATION RESULTS: INDEX BUILD TIME (YAGO2)

Frequency=1000 Frequency=2000 Frequency=4000
Build Time 5776.25 secs 3290.53 secs 1381.61 secs
Query Time 171.25 msecs 169.46 msecs 187.34 msecs

Not including reverse predicates
including reverse predicates (frequency = 1000)
Build Time 93.33 secs 449.33 secs 299.79 secs 164.88 secs
Query Time 368.19 msecs 254.0 msecs 254.01 msecs 258.3 msecs

RDF-3X
Loading Time 4264 secs (includes loading triples, building triple indices, computing statistics)
Query Time 409.4 msecs

RP-index (maxL=5, discriminative ratio = 0.8)

RG-index (maxL=5, discriminative ratio = 0.8)

CONCLUSIONS

• We propose an RDF triple filtering method for handling redundant intermediate results of SPARQL query processing (Chapter 4)
– Provides a framework for filtering irrelevant triples

• We propose RP-index, which uses path information (Chapter 4)
– Deals with the size problem and maintenance issues

• We propose RG-index, which uses graph-structural information (Chapter 5)
– Improves the filtering power of RP-index
– Uses a frequent sub-graph mining algorithm for building RG-index

FUTURE WORK

• Indexing patterns considering the query workload
– More effective triple filtering for the current query workload

• More accurate estimation of cardinality
– We have assumed a uniform distribution
– Crucial for query evaluation performance

• Applying R3F to distributed environments
– Handling intermediate results is more important in MapReduce
– How to store and access the index

PAPERS

• R3F and RP-index
– Kisung Kim, Bongki Moon, Hyoung-Joo Kim, RP-Filter: A Path-based Triple Filtering Method for Efficient SPARQL Query Processing, JIST (Joint International Semantic Technology) conference, 2011
– Kisung Kim, Bongki Moon, Hyoung-Joo Kim, R3F: RDF Triple Filtering Method for Efficient SPARQL Query Processing, World Wide Web Journal (Springer), 2013 (accepted, published online first)

• RG-index
– Kisung Kim, Bongki Moon, Hyoung-Joo Kim, RG-index: An RDF Graph Index for Efficient SPARQL Query Processing, submitted to ESWA (Expert Systems with Applications, Elsevier), under review

Thank You. Any Questions?

RP-INDEX: TRIE OF PREDICATE PATHS

• Search the Vlist of a given predicate path
– Each node has a pointer to the Vlist of the corresponding predicate path

REVERSE PREDICATE

• Indexing path patterns other than incoming paths

• Redundant predicate path
– We do not index predicate path patterns such as p, pR

Figure: an RDF graph over v1 to v5 with P = {p1, p2, p3, p4}, and the same graph with reverse predicates added, P = {p1, p2, p3, p4, p1R, p2R, p3R, p4R}

RP-index (R, 2)
Vlist(<p1>) = v4
Vlist(<p2>) = v3
Vlist(<p3>) = v2
Vlist(<p4>) = v2
Vlist(<p1R>) = v3
Vlist(<p2R>) = v2
Vlist(<p3R>) = v1
Vlist(<p4R>) = v5
Vlist(<p2, p1>) = v4
Vlist(<p3, p2>) = v3
Vlist(<p4, p2>) = v3
Vlist(<p1R, p2R>) = v2
Vlist(<p2R, p3R>) = v1
Vlist(<p2R, p4R>) = v5

RP-index (D, 2)
Vlist(<p1>) = v4, v9, v15
Vlist(<p2>) = v3, v8, v14
Vlist(<p3>) = v2, v7, v13
Vlist(<p4>) = v2, v8
Vlist(<p2, p1>) = v4, v9, v15
Vlist(<p3, p2>) = v3, v8, v14
Vlist(<p4, p2>) = v3, v8

BUILDING RP-INDEX

• Build RP-index in a breadth-first search (BFS) manner
• Vlists for (i + 1)-length predicate paths are built using Vlists for i-length predicate paths

Figure: trie of predicate paths (Root; p1, p2, p3)

PARALLEL BUILDING OF RP-INDEX

• Building each Vlist is independent
• We can build multiple Vlists while reading the triples once

Build time, including reverse predicates (frequency = 1000)
Threads      1             2          4
Build Time   503.43 secs   349 secs   238.84 secs

INCREMENTAL MAINTENANCE OF RP-INDEX

• Rebuilding RP-index for every update is too inefficient
– Query processing would have to be suspended until RP-index is updated

• Which Vlists should be updated due to a database update?
– Predicate paths containing predicates in the update
– We reduce the number of Vlists to update using delta information

Figure: trie of predicate paths (Root; p1, p2, p3) for the update predicates UP = {p1, p2}, with a node marked Δ = ∅

ACCURACY OF CARDINALITY ESTIMATION

• Use q-error: max(c/c’, c’/c)
– c: real cardinality
– c’: estimated cardinality
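The q-error is symmetric: over- and under-estimating by the same factor give the same value, and a perfect estimate gives 1. A minimal sketch follows; the function name and the choice to reject non-positive cardinalities are assumptions.

```python
def q_error(actual: float, estimated: float) -> float:
    """q-error = max(c / c', c' / c) for real cardinality c and estimate c'."""
    if actual <= 0 or estimated <= 0:
        raise ValueError("q-error is defined here only for positive cardinalities")
    return max(actual / estimated, estimated / actual)


print(q_error(100, 25))   # under-estimate by a factor of 4
print(q_error(25, 100))   # over-estimate by a factor of 4
print(q_error(2, 2))      # exact estimate
```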

RP-INDEX BUILD

• Algorithm

build 1-length Vlists
for i = 1 to maxL
    for each ppath in RP-index
        for each p in P
            build Vlist(<ppath, p>) using Vlist(<ppath>)
            if Vlist is discriminative and frequent
                insert into RP-index

• Costs

\[ O\left( |D| + \sum_{i=1}^{maxL-1} |P|^{i} \, (|R| + |D|) \right) \]

– First |D|: building size-1 Vlists
– |R|: reading size n-1 Vlists
– Second |D|: building size-n Vlists
– D: a set of triples, P: a set of predicates, R: a set of resources
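The level-by-level build can be sketched in Python as below. This is an illustration under stated assumptions, not the thesis code: `build_rp_index` and the `is_kept` callback (standing in for the discriminative and frequent test) are hypothetical names, paths with empty Vlists are simply skipped, and the four example edges are inferred from the single-vertex Vlists listed for the small v1 to v5 graph on the reverse-predicate slide.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]
Path = Tuple[str, ...]


def build_rp_index(triples: List[Triple], max_len: int,
                   is_kept: Callable[[Path, List[str]], bool] = lambda path, vlist: True
                   ) -> Dict[Path, List[str]]:
    """Build Vlists for incoming predicate paths of length 1..max_len, level by level."""
    edges_by_predicate: Dict[str, List[Tuple[str, str]]] = {}
    for s, p, o in triples:
        edges_by_predicate.setdefault(p, []).append((s, o))
    predicates = sorted(edges_by_predicate)

    # Level 1: vertices with an incoming edge labeled p.
    index: Dict[Path, List[str]] = {
        (p,): sorted({o for _, o in edges_by_predicate[p]}) for p in predicates
    }
    level = list(index)
    for _ in range(1, max_len):
        next_level: List[Path] = []
        for path in level:
            sources = set(index[path])  # Vlist of the longest proper prefix
            for p in predicates:
                vlist = sorted({o for s, o in edges_by_predicate[p] if s in sources})
                extended = path + (p,)
                if vlist and is_kept(extended, vlist):
                    index[extended] = vlist
                    next_level.append(extended)
        level = next_level
    return index


# Edges inferred (assumption) from the Vlists of the small example graph.
SMALL_GRAPH = [("v1", "p3", "v2"), ("v5", "p4", "v2"), ("v2", "p2", "v3"), ("v3", "p1", "v4")]
rp_index = build_rp_index(SMALL_GRAPH, max_len=2)
for path, vlist in sorted(rp_index.items()):
    print(path, vlist)
```

Only paths that were kept at one level are extended at the next, which matches the loop over "each ppath in RP-index" in the pseudocode.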

RP-INDEX: INCREMENTAL UPDATE

• RDF database
– 3,000,000 triples and 1,000 predicates
• Incremental update times are proportional to the number of predicates in the updates
• Total rebuilding times are almost the same
• The update times for insert updates are less than the update times for delete updates

RFLT OPERATOR WITH JOIN

• Combine RFLT operators with merge join

Figure: (before) Scan <?v1, p1, ?v2> and Scan <?v1, p2, ?v3> each feed an RFLT on ?v1, followed by a merge join on ?v1; (after) both scans feed a single "RFLT with Join" operator on ?v1

FREQUENT GRAPH PATTERN MINING ALGORITHM

• Frequent graph pattern: a graph g with Sup(g) ≥ minSup
– Sup(g): support of graph g (frequency count)
– minSup: minimum support (input parameter)

• Two steps of frequent graph pattern mining
– 1st step: generate a candidate pattern, g
– 2nd step: check the frequency of g, Sup(g)

• Most studies focus on the optimization of the first step
– The second step involves a subgraph isomorphism test (NP-complete)

Figure: the graph mining algorithm takes the input and minSup and produces the results

OVERVIEW OF GSPAN

• X. Yan and J. Han, gSpan: Graph-based substructure pattern mining, ICDM, 2002

• Popular algorithm for graph pattern mining

• Graph-transaction setting
– A set of relatively small graphs

• Depth-first style pattern generation

• Use DFS code
– To represent graph patterns
– To reduce redundant pattern generation

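SUPPORT METHOD

• Anti-monotonicity: if |GP1| < |GP2|, then Support(GP1) >= Support(GP2)

• Graph-transaction setting
– Support: the number of graph transactions that the pattern occurs in
– Example (graphs G1 and G2): Support(GP1) = 2, Support(GP2) = 1

• Single-graph setting
– Support: the number of occurrences
– Example: Support(GP1) = 2, Support(GP2) = 3

Figure: patterns GP1 and GP2 over vertex labels a and b, matched against the graph transactions G1 and G2 and against a single graph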

FINDING MATCHING GRAPHS: NAÏVE APPROACH

• Generate a SPARQL query for each graph pattern
• Execute the SPARQL query
• Make Vlists for each vertex from the query results (obtain distinct values)

• Problem
– Redundant computation: store previous results and reuse them

Example: the two-edge pattern (p1, p1) and the three-edge pattern (p1, p1, p1) are evaluated separately

SELECT ?v1 ?v2 ?v3
WHERE { ?v1 <p1> ?v2 . ?v2 <p1> ?v3 . }

SELECT ?v1 ?v2 ?v3 ?v4
WHERE { ?v1 <p1> ?v2 . ?v2 <p1> ?v3 . ?v3 <p1> ?v4 . }

RG-INDEX: REUSING PREVIOUS RESULTS

Figure: the pattern-growth tree over edges labeled p1 and p2. A pattern is extended at its rightmost vertex with an edge such as (0, p1, 1, ), and the results of the parent pattern are reused to compute the results of the extended pattern.

RG-INDEX BUILD

• Algorithm

gSpanRDF (G)    /* G: a subgraph pattern */
for v in G(V) do    /* G(V): a set of vertices in G */
    for p in P do    /* P: a set of predicates */
        expand G to G’ with an edge (label p) according to gSpan
        calculate all occurrences of G’ in D
        if G’ is minimal and frequent and not redundant then
            insert discriminative Vlists of G’ in RG-index
            gSpanRDF (G’)

• Cost analysis
– Cost terms: building size-1 subgraphs, the number of possible size-(n-1) subgraphs, and the number of possible size-n subgraphs, for sizes up to maxL
– D: a set of triples

EXISTING RELATIONAL RDF STORE

• Clustered property table
– Method: reduce joins using materialized views
– Pros: reduces the number of joins
– Cons: needs the user’s clustering decision; incurs null and multi-values, which are hard to process
– Systems: Jena [Carroll et al., WWW 2004], Oracle [Chong et al., VLDB 2005]

• Sorted triple storage
– Method: store triples as sorted and use merge joins
– Pros: efficient retrieval of matching triples; fast merge join
– Cons: storage overhead; does not handle redundant intermediate results
– Systems: SW-store [Abadi et al., VLDB 2007], RDF-3X [Neumann and Weikum, VLDB 2008]

• Reducing intermediate results
– Method: build dynamic filters for join variables
– Pros: reduces redundant intermediate results
– Cons: does not exploit structural information of RDF graphs
– Systems: U-SIP [Neumann and Weikum, SIGMOD ’09]

GRAPH PATTERNS AND PATH PATTERNS

• Graph patterns can express more relationship constraints between vertices than path patterns

• A combination of path patterns cannot express relationships with vertices in another path pattern

Figure: for ?v3 in the SPARQL query (?v1 to ?v5, predicates p1 to p4), the path patterns (maxL=3) and the graph patterns (maxL=3) that contain ?v3; some graph patterns are expressible by a combination of path patterns, and others cannot be expressed by path patterns

EVALUATION RESULTS: RG-INDEX SIZE

• RG-index size and query evaluation performance (YAGO2)
– RG-index size
– Query evaluation performance

DFS CODE REPRESENTATION

• Edge representation:

RIGHTMOST EXTENSION: FORWARD

Figure: an RDF graph over r1 to r7 with predicates p1 and p2, a pattern over ?v1 to ?v4 (edges p1, p2, p2) grown by a forward extension, and its tuple representation

RIGHTMOST EXTENSION: BACKWARD

Figure: a backward extension of the pattern over ?v1, ?v2, ?v3 (edges p1, p2, p2) is computed as a join (forward extension to a new vertex ?v4) followed by the selection ?v1 = ?v4

DIFFERENCE FROM EXISTING PATH INDICES

• Summary graphs store each vertex only once (except DataGuide)
– Need to union a number of vertex lists
– If we need the Vlist for <p2, p3> and the Vlists for each path are stored separately, we must union all of the Vlists for <p1, p2, p3>, <p2, p2, p3>, <p3, p2, p3>, …, <pn, p2, p3>
