Parallel Datalog Reasoning in RDFox Presentation

PARALLEL DATALOG REASONING IN RDFOX

Boris Motik, Yavor Nenov, Robert Piro, Ian Horrocks, Dan Olteanu

University of Oxford

September 25, 2014

TABLE OF CONTENTS

1 INTRODUCTION

2 OUR APPROACH TO PARALLEL MATERIALISATION

3 EVALUATION

4 CONCLUSION

Motik et al. Parallel Datalog Reasoning in RDFox 0/9

Introduction


CURRENT TRENDS IN KNOWLEDGE/DATABASE SYSTEMS

Materialisation is computationally intensive ⇒ natural to parallelise

mid-range laptops have 4 cores, servers with 16 cores are routine

Price of RAM keeps falling

128 GB is routine, systems with 1 TB are emerging

in-memory databases: SAP’s HANA, Oracle’s TimesTen, YarcData’s Urika



RDFOX SUMMARY

RDFox: a new RDF store and reasoner

http://www.cs.ox.ac.uk/isg/tools/RDFox/

store data in RAM

effectively use modern multi-core architectures

Core focus: materialisation reasoning with (recursive) datalog rules

e.g., 〈x, rdf:type, y〉 ∧ 〈y, rdfs:subClassOf, z〉 → 〈x, rdf:type, z〉

handle OWL 2 RL and the Semantic Web Rule Language (SWRL)

explicitly store all implied triples so that queries can be evaluated directly

used in a number of systems: Oracle’s database, OWLIM, WebPIE

Used in projects for reasoning beyond OWL 2 RL

‘combined approach’ for reasoning with OWL 2 EL

PAGOdA: a reasoner for OWL 2 DL combining RDFox with HermiT



Our Approach to Parallel Materialisation




EXISTING APPROACHES TO PARALLEL MATERIALISATION

Interquery parallelism: run independent rules in parallel

degree of parallelism limited by the number of independent rules

⇒ does not distribute workload to cores evenly

Intraquery parallelism: partition rule instantiations to N threads

e.g., constrain the body of rules evaluated by thread i to (x mod N = i)

⇒ static partitioning may not distribute workload well due to data skew

⇒ dynamic partitioning may incur an overhead due to load balancing

Goal: distribute workload to threads evenly and with minimum overhead
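The data-skew problem with static partitioning can be illustrated with a minimal sketch. It assumes the rule A(x) ∧ R(x, y) → A(y) and assigns each rule instantiation to thread `ids[x] mod N`; the function name, the integer ids, and the toy data (one hub node with 1000 outgoing edges) are hypothetical and chosen only for illustration.

```python
def static_partition_load(r_facts, a_facts, ids, n_threads):
    """Count rule instantiations of A(x) ∧ R(x, y) → A(y) handled by each thread
    when thread i only evaluates bindings with ids[x] mod n_threads == i."""
    load = [0] * n_threads
    for x, y in r_facts:
        if x in a_facts:                      # the body A(x) ∧ R(x, y) is satisfied
            load[ids[x] % n_threads] += 1
    return load


# Hypothetical skewed data: "hub" has 1000 outgoing edges, ten other nodes have one each.
ids = {"hub": 0, **{f"n{i}": i for i in range(1, 11)}}
r_facts = [("hub", f"t{j}") for j in range(1000)] + [(f"n{i}", "hub") for i in range(1, 11)]
a_facts = set(ids)

print(static_partition_load(r_facts, a_facts, ids, 4))
```

With four threads this prints `[1002, 3, 3, 2]`: the thread that owns the hub does almost all of the work, which is the uneven distribution the slide warns about.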



SOLUTION PART I: ALGORITHM

Facts: R(a,b), R(a,c), R(b,d), R(b,e), A(a), R(c,f), R(c,g)

Rule: A(x) ∧ R(x, y) → A(y)

For each fact:

match the fact to all body atoms to obtain subqueries

evaluate subqueries w.r.t. all previous facts

add results to the table

Trace (⇒ marks the fact being processed):

⇒ R(a,b)   current subquery: A(a)
⇒ R(a,c)   current subquery: A(a)
⇒ R(b,d)   current subquery: A(b)
⇒ R(b,e)   current subquery: A(b)
⇒ A(a)     current subquery: R(a,y)   table gains A(b), A(c)
⇒ R(c,f)   current subquery: A(c)
⇒ R(c,g)   current subquery: A(c)
⇒ A(b)     current subquery: R(b,y)   table gains A(d), A(e)
⇒ A(c)     current subquery: R(c,y)   table gains A(f), A(g)
⇒ A(d)     current subquery: R(d,y)
⇒ A(e)     current subquery: R(e,y)
⇒ A(f)     current subquery: R(f,y)
⇒ A(g)     current subquery: R(g,y)

Final table: R(a,b), R(a,c), R(b,d), R(b,e), A(a), R(c,f), R(c,g), A(b), A(c), A(d), A(e), A(f), A(g)
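The three steps (match, evaluate the subquery over earlier facts, append results) can be sketched sequentially as follows. This is a simplified illustration, not RDFox's implementation: the names `match` and `materialise`, the tuple encoding of facts, and the `?`-prefix for variables are assumptions of the sketch. Subqueries are evaluated over facts up to and including the current one, which keeps the sketch complete when a single fact matches two body atoms. Note that when R(c,f) is processed, A(c) already sits in the table but at a later position, so the subquery finds nothing; the join is found later, when A(c) itself is processed.

```python
def match(pattern, fact, binding=None):
    """Unify a body atom such as ("R", "?x", "?y") with a ground fact.
    Returns the extended variable binding, or None if they do not match."""
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return None
    binding = dict(binding or {})
    for term, value in zip(pattern[1:], fact[1:]):
        if term.startswith("?"):
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding


def materialise(facts, body, head):
    """Apply one rule body -> head to an append-only table until no new facts appear."""
    table = list(facts)              # position in the list = insertion order
    seen = set(table)
    i = 0
    while i < len(table):
        fact = table[i]
        for k, atom in enumerate(body):
            start = match(atom, fact)
            if start is None:
                continue
            bindings = [start]
            # The remaining body atoms form the subquery; it only sees table[:i + 1].
            for other in body[:k] + body[k + 1:]:
                extended = []
                for b in bindings:
                    for earlier in table[: i + 1]:
                        b2 = match(other, earlier, b)
                        if b2 is not None:
                            extended.append(b2)
                bindings = extended
            for b in bindings:
                derived = (head[0],) + tuple(
                    b[t] if t.startswith("?") else t for t in head[1:])
                if derived not in seen:
                    seen.add(derived)
                    table.append(derived)
        i += 1
    return table


facts = [("R", "a", "b"), ("R", "a", "c"), ("R", "b", "d"), ("R", "b", "e"),
         ("A", "a"), ("R", "c", "f"), ("R", "c", "g")]
result = materialise(facts, [("A", "?x"), ("R", "?x", "?y")], ("A", "?y"))
print(result[len(facts):])
```

On the slide's data the derived facts are appended in the order A(b), A(c), A(d), A(e), A(f), A(g), matching the animation.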



PARALLELISING COMPUTATION

Each thread extracts facts and evaluates subqueries independently

The number of subqueries is determined by the number of facts

ensures in practice that threads are equally loaded

Requires no thread synchronisation

⇒ We partition rule instances dynamically and with little overhead
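As a rough sketch of threads extracting facts from a shared table, the following hard-codes the rule A(x) ∧ R(x, y) → A(y). It deliberately differs from what the slides describe: a coarse `threading.Condition` guards the shared cursor and the appends, standing in for RDFox's lock-free structures, and Python threads will not give real speedup here. The names `derive` and `materialise_parallel` are hypothetical.

```python
import threading


def derive(table, i):
    """Facts produced by matching table[i] against A(x) ∧ R(x, y) → A(y),
    joining only with facts at positions <= i."""
    fact, earlier = table[i], table[: i + 1]
    if fact[0] == "A":                        # subquery R(x, ?y)
        return [("A", f[2]) for f in earlier if f[0] == "R" and f[1] == fact[1]]
    if ("A", fact[1]) in earlier:             # fact is R(x, y); subquery A(x)
        return [("A", fact[2])]
    return []


def materialise_parallel(facts, n_threads=4):
    table, seen = list(facts), set(facts)
    cond = threading.Condition()
    state = {"next": 0, "busy": 0}            # next unprocessed position, active workers

    def worker():
        while True:
            with cond:
                while state["next"] >= len(table):
                    if state["busy"] == 0:    # no work left and nobody can add more
                        cond.notify_all()
                        return
                    cond.wait()
                i = state["next"]
                state["next"] += 1
                state["busy"] += 1
            derived = derive(table, i)        # subquery evaluated outside the lock
            with cond:
                for d in derived:
                    if d not in seen:
                        seen.add(d)
                        table.append(d)
                state["busy"] -= 1
                cond.notify_all()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table
```

Whichever thread extracts the later of two joining facts sees the earlier one in the table, so the set of derived facts does not depend on scheduling; only the order of appended facts may vary between runs.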

EXAMPLE WHEN PARALLELISATION FAILS

A(x) ∧ R(x, y) → A(y)

R(a, b), R(b, c), R(c, d), R(d, e), A(a)

Derived facts, in order: A(b), A(c), A(d), A(e)

Each derived fact depends on the one derived just before it, so derivation proceeds one fact at a time and additional threads have nothing to do.
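One way to see the difference between this chain and the earlier tree-shaped example is to count how many derived A-facts are waiting to be processed after each step. The sketch below is a simplification for the single rule A(x) ∧ R(x, y) → A(y); the function name and the "backlog" measure are introduced here for illustration only.

```python
def pending_after_each_step(r_facts, start):
    """Process A-facts in derivation order and record, after each step,
    how many derived A-facts are still waiting to be processed."""
    successors = {}
    for x, y in r_facts:
        successors.setdefault(x, []).append(y)
    queue, seen, backlog = [start], {start}, []
    i = 0
    while i < len(queue):
        for y in successors.get(queue[i], []):
            if y not in seen:
                seen.add(y)
                queue.append(y)
        i += 1
        backlog.append(len(queue) - i)
    return backlog


chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
tree = [("a", "b"), ("a", "c"), ("b", "d"), ("b", "e"), ("c", "f"), ("c", "g")]
print(pending_after_each_step(chain, "a"))   # [1, 1, 1, 1, 0]
print(pending_after_each_step(tree, "a"))    # [2, 3, 4, 3, 2, 1, 0]
```

In the chain the backlog never exceeds one fact, so extra threads stay idle, whereas the tree-shaped data keeps several facts available at once.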



SOLUTION PART II: INDEXING RDF DATA IN MAIN MEMORY

The algorithm critically depends on:

matching atoms 〈t1, t2, t3〉 with ti a constant or variable

continuous concurrent updates

Our RDF storage data structure:

hash-based indexes ⇒ naturally parallel data structure

‘mostly’ lock-free: at least one thread makes progress most of the time

compare: if a thread acquires a lock and dies, other threads are blocked

main benefit: performance is less susceptible to scheduling decisions

Main technical challenge: reduce thread interference

when A writes to a location cached by B, the cache of B is invalidated

our updates ensure that threads (typically) write to different locations
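To make the first requirement concrete (matching atoms 〈t1, t2, t3〉 where each ti is a constant or a variable), here is a toy hash-based index. It is not RDFox's data structure: it is single-threaded, takes no care to avoid cache interference, and stores every triple under all eight constant/variable patterns, which a real system would not do. The class name `TripleIndex` and the use of `None` for a variable are assumptions of the sketch.

```python
from collections import defaultdict


class TripleIndex:
    """Toy hash index over triples; a pattern position is a constant or None (variable)."""

    def __init__(self):
        self.triples = []                 # insertion-ordered triple table
        self.known = set()                # for duplicate elimination
        self.index = defaultdict(list)    # pattern -> positions in self.triples

    def add(self, s, p, o):
        """Insert a triple; returns False if it was already present."""
        if (s, p, o) in self.known:
            return False
        self.known.add((s, p, o))
        position = len(self.triples)
        self.triples.append((s, p, o))
        for key in ((a, b, c) for a in (s, None) for b in (p, None) for c in (o, None)):
            self.index[key].append(position)
        return True

    def match(self, s=None, p=None, o=None):
        """Return all triples matching the pattern, in insertion order."""
        return [self.triples[i] for i in self.index.get((s, p, o), [])]


idx = TripleIndex()
idx.add("alice", "rdf:type", "Student")
idx.add("Student", "rdfs:subClassOf", "Person")
idx.add("bob", "rdf:type", "Person")
print(idx.match(p="rdf:type"))
```

Each lookup is a single hash probe regardless of which positions are bound, which is the property the materialisation algorithm relies on when it evaluates subqueries.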


Evaluation


CONCURRENCY OVERHEAD AND PARALLELISATION SPEEDUP

RDFox: an RDF store developed at Oxford University


[Figure: speedup (vertical axis, 2 to 20) against number of threads (horizontal axis, 8 to 32). First panel: ClarosL, ClarosLE, DBpediaL, DBpediaLE, LUBML 01K, LUBMU 01K. Second panel: UOBML 01K, UOBMU 010, LUBMLE 01K, LUBML 05K, LUBMLE 05K, LUBMU 05K.]

Small concurrency overhead; parallelisation pays off already with two threads

Speedup continues to increase after we exhaust all physical cores

⇒ hyperthreading and parallelism can compensate for CPU cache misses




COMPARISON WITH THE STATE OF THE ART

[Plots: speedup against number of threads (8 to 32) for the twelve test datasets, in two panels]

(a) Scalability and Concurrency Overhead of SysX

Each cell is Time / Spd for the given number of threads; Mem marks memory exhaustion.

Dataset      seq          1            2            8            16            24            32
ClarosL      1907 / 1.3   2477 / 1.0   1177 / 2.1   333 / 7.4    179 / 13.9    139 / 17.8    127 / 19.5
ClarosLE     4128 / 1.2   4989 / 1.0   2543 / 2.0   773 / 6.5    415 / 12.0    313 / 15.9    285 / 17.5
DBpediaL     158 / 1.0    161 / 1.0    84 / 1.9     28 / 5.8     26 / 6.1      25 / 6.4      24 / 6.6
DBpediaLE    8457 / 1.1   9075 / 1.0   4699 / 1.9   1453 / 6.2   828 / 11.0    695 / 13.1    602 / 15.1
LUBML 01K    68 / 1.1     73 / 1.0     35 / 2.1     14 / 5.2     8 / 8.7       7 / 10.9      7 / 10.1
LUBMLE 01K   913 / 1.0    947 / 1.0    514 / 1.8    155 / 6.1    88 / 10.8     77 / 12.4     71 / 13.4
LUBMU 01K    122 / 1.1    135 / 1.0    61 / 2.2     20 / 6.8     13 / 10.6     11 / 12.7     10 / 13.4
LUBML 05K    422 / 1.0    442 / 1.0    221 / 2.0    65 / 6.8     42 / 10.6     39 / 11.3     Mem
LUBMLE 05K   4464 / 1.1   4859 / 1.0   2574 / 1.9   745 / 6.5    Mem           Mem           Mem
LUBMU 05K    580 / 1.1    635 / 1.0    310 / 2.1    96 / 6.6     60.7 / 10.5   54.3 / 11.7   Mem
UOBMU 010    2381 / 1.2   2738 / 1.0   1400 / 2.0   451 / 6.1    256 / 10.7    188 / 14.6    168 / 16.3
UOBML 01K    476 / 1.1    532 / 1.0    247 / 2.2    72 / 7.4     42 / 12.6     34 / 15.5     31 / 17.1

Dataset      Seq. imp.   Par. imp. (slowdown)   Triples     Memory    Active
ClarosL      61          58 (-4.9%)             95.5 M      4.2 GB    93.3%
ClarosLE     57          58 (1.7%)              555.1 M     18.0 GB   98.8%
DBpediaL     331         334 (0.9%)             118.3 M     6.1 GB    28.1%
DBpediaLE    335         367 (9.5%)             1529.7 M    51.9 GB   94.5%
LUBML 01K    421         415 (1.4%)             182.4 M     9.3 GB    66.5%
LUBMLE 01K   408         415 (1.7%)             332.6 M     13.8 GB   90.3%
LUBMU 01K    410         415 (1.2%)             219.0 M     11.1 GB   85.3%
LUBML 05K    2610        2710 (3.8%)            911.2 M     49.0 GB   66.5%
LUBMLE 05K   2587        3553 (37%)             1661.0 M    75.5 GB   90.3%
LUBMU 05K    2643        2733 (3.4%)            1094.0 M    53.3 GB   85.3%
UOBMU 010    6           6 (0.0%)               35.6 M      1.1 GB    99.7%
UOBML 01K    798         783 (-1.9%)            429.6 M     20.2 GB   99.1%

(b) Comparison of SysX with DBRDF and OWLIM-Lite

Each cell is T / B/t (time in seconds / bytes per triple). Mem: ran out of memory; Time: ran out of time; -: no separate figure in the original (one import figure is listed per dataset).

             SysX                   PG-VP                    MDB-VP                  PG-TT         MDB-TT        OWLIM-Lite
Dataset      Import     Materialise Import       Materialise Import       Materialise Import        Import        Import       Materialise
ClarosL      48 / 89    2062 / 60   1138 / 165   25026 / 97  883 / 58     Mem         1174 / 217    896 / 94      300 / 54     14293 / 28
ClarosLE     -          4218 / 44   -            Mem         -            Mem         -             -             -            Time
DBpediaL     274 / 69   143 / 67    5844 / 148   Mem         15499 / 51   354 / 49    5968 / 174    4492 / 92     1735 / 171   14855 / 164
DBpediaLE    -          9538 / 49   -            Mem         -            Mem         -             -             -            Time
LUBML 01K    332 / 74   71 / 65     7736 / 144   948 / 127   6136 / 56    27 / 41     7947 / 187    5606 / 104    2409 / 40    316 / 36
LUBMLE 01K   -          765 / 54    -            15632 / 112 -            Mem         -             -             -            Time
LUBMU 01K    -          113 / 67    -            Time        -            138 / 44    -             -             -            Time
UOBMU 010    5 / 68     2501 / 43   116 / 154    Time        96 / 50      Mem         120 / 203     96 / 85       32 / 69      11311 / 24
UOBML 01K    632 / 64   467 / 63    13864 / 119  6708 / 107  11063 / 41   358 / 33    14901 / 176   10892 / 101   4778 / 38    Time

Figure 2: Results of our Empirical Evaluation

complete, and we report averages over three runs. OWLIM-Lite combines materialisation and import, so we loaded each RDF graph once with the test datalog program and once with no program, and we subtracted the two times; Ontotext confirmed that this yields a good materialisation time estimate.

The graphs and Table (a) in Figure 2 show the speedup of SysX with the number of used threads. The middle part of Table (a) shows the sequential and parallel import times, and the percentage slowdown for the parallel version. The lower part of Table (a) shows the number of triples, the memory consumption, and the percentage of active triples (i.e., triples to which a rule was applied during materialisation).

With all 16 physical cores, materialisation is up to 13.9 times (on ClarosL) faster than with just one core. Speedup increases further up to 19.5 with 32 cores (on ClarosL), suggesting that hyperthreading and high degree of parallelism can compensate for CPU stalls due to random memory access. Memory exhaustion in some tests was caused by ‘private’ per-thread indexes described at the end of Section 4. These indexes, however, proved very effective at reducing thread interference, and the main remaining source of interference is in retrieving triples from factsI. When the percentage of active triples is low and the number of threads is high, the calls to factsI.next are more likely to overlap, as suggested by the correlation of 0.9 between the speedup for 32 threads and the percentage of active triples; this explains the comparatively low speedup on DBpediaL.

We further analyse these results using the Karp-Flatt metric: if N threads achieve a speedup of s(N), the fraction of the parallelised work is p(N) = (1/s(N) − 1) / (1/N − 1). In our case, p(32) ranges from 88% to 98%; thus, computation is parallelised exceptionally well. Solving this equation for s(N) and letting N approach infinity yields the maximum speedup as s(∞) = 1/(1 − p(∞)); assuming that p(∞) = p(32), we get s(∞) between 8.3 and 50. With 32 threads we thus already approach this maximum, which explains the flattening of the curves in Figure 2.
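The Karp-Flatt arithmetic can be checked with a few lines of Python. The function names are illustrative; the inputs are the lowest and highest 32-thread speedups in Table (a), 6.6 (DBpediaL) and 19.5 (ClarosL).

```python
def parallel_fraction(speedup, n_threads):
    """Karp-Flatt: fraction of parallelised work, p(N) = (1/s(N) - 1) / (1/N - 1)."""
    return (1 / speedup - 1) / (1 / n_threads - 1)


def max_speedup(p):
    """Limit of the speedup as the number of threads grows: 1 / (1 - p)."""
    return 1 / (1 - p)


for name, s32 in [("DBpediaL", 6.6), ("ClarosL", 19.5)]:
    p = parallel_fraction(s32, 32)
    print(f"{name}: p(32) = {p:.1%}, maximum speedup = {max_speedup(p):.1f}")
```

From the unrounded speedups this gives fractions of about 88% and 98% and limits of about 8.1 and 48.4; rounding the fractions to 88% and 98% first yields the 8.3 and 50 quoted in the text.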

Table (b) in Figure 2 compares SysX with DBRDF on PostgreSQL (PG) and MonetDB (MDB) with VP or TT scheme, and with OWLIM-Lite. Columns T show the times in seconds, and columns B/t show the number of bytes per triple. Import in DBRDF is about 20 times slower than in SysX; we found out that half of the import time is used in the slow Jena RDF parser. Furthermore, VP is more memory-efficient than TT by 33% as it does not store triples’ predicates. MDB-VP can be more memory-efficient than SysX; however, MDB-TT is not, which is surprising since SysX does not compress data. For OWLIM-Lite, we do not know how the dictionary is split between the disk and RAM, so the RAM consumption reported in Table (b) is a ‘best-case’ estimate. On materialisation tests, MDB-VP comfortably outperformed PG-VP, and it outperformed SysX on LUBML 01K and UOBML 01K, which is probably due to better query planning. However, both MDB-TT and PG-TT ran out of time in all cases, apart from MDB-TT which completed DBpediaL in 11,958 seconds: as Abadi et al. (2009) have already observed, self-joins on the triple table are notoriously difficult for RDBMSs. In contrast, although it implements TT, SysX could successfully complete all tests.

6 Conclusion & Outlook

We presented a novel and very efficient approach to parallel materialisation of datalog in centralised, multi-core, main-memory RDF systems. Our main challenge is to adapt the RDF indexing scheme to secondary storage, the main difficulty of which is to reduce random access.

DBRDF: an implementation using RDBMS and known techniques

PostgreSQL (row store) or MonetDB (column store) as underlying engine

vertical partitioning (VP) or triple table (TT) storage model

OWLIM-Lite: a commercial RDF store by Ontotext


Conclusion


RESEARCH DIRECTIONS

Under development/already finished:

Deal with owl:sameAs via rewriting

Incremental deletions/additions

Add a data/query/reasoning distribution layer

Future work:

Investigate potential for data compression

Improve join cardinality estimation

Improve query planning

