ISWC 2012 "Efficient execution of top-k SPARQL queries"

Ef#icient Execution of top-‐k SPARQL queries Sara Magliacane (VU University Amsterdam) Alessandro Bozzon (Politecnico di Milano) Emanuele Della Valle (Politecnico di Milano)

Outline •  Introduc?on

• What are top-‐k queries? • Why do we need to op?mize them?

• Our approach: •  A rank-‐aware SPARQL algebra •  A rank-‐aware execu?on model •  Three planning strategies

•  Evalua?on 1

What is a top-‐k query?

2

• A query that returns 1.  a limited number of results k

2.  ordered by a scoring func?on that combines several criteria

Rankings, rankings everywhere…

3


4


5

A very intui?ve and simplified example:

• Top 3 largest countries (by both area and popula?on)

6

Why do we need to optimize them?

The standard way: materialize-‐then-‐sort scheme

7

Countries by area

Countries by popula?on

Compute all the 242 join combina?ons

Sort all the 242 join combina?ons

Fetch 3 best results

…

…

…

…

…

242 242

…

Can we make it more ef#icient?

8

Countries by area

Countries by popula?on

Order incrementally the combina?ons using par0al orders


…

7 9

Can we exploit the available sorted access by area and by popula?on?

The split-‐and-‐interleave scheme

•  The intui?on of the previous example can be formalized with the split-‐and-‐interleave scheme from RDBMS [Li2005, Hwang2007, Ilyas2004, Ilyas2008] 1.  Split the evalua?on of the scoring func?on into single criteria 2.  Interleave them with other operators 3.  Use par?al orders to construct incrementally the final order

•  Standard assump?ons: •  Monotone scoring func?on •  Each criterion is evaluated as a [0,1] number (normaliza?on)

•  Op?mized for the case of fast sorted access for each criterion 9

No free lunch…

Users are interested in 1<= k <= 100 (search engines)

!

"#$%&'(%)*+!

"#$%&'(%)*+!

",)*-).-!!

/01(+!

234!,567!

*8697.!0:!-7;5.7-!.7;8<,;!+!

!!+=!

!

>!

?61.0@767*,! /[email protected])-!

234!,567!

C537-!*8697.!0:!-7;5.7-!.7;8<,;!D!

C537-!*8697.!0:!)<<!.7;8<,;!E!

C537-!

237A8,50*!,567!0:!,B7!;A0.5*F!!

:8*A,50*!:0.!7)AB!95*-5*F!

G9.7)+!7@7*!105*,H!,737A!

!

",)*-).-!"#$%&'!

C537-!*8697.!0:!-7;5.7-!.7;8<,;!D!!

",)*-).-!"#$%&'!

*8697.!0:!)<<!.7;8<,;!E!

234!,567!;89<5*7).!

!

"#$%&'(%)*+!

"#$%&'(%)*+!

",)*-).-!!

/01(+!

234!,567!

*8697.!0:!-7;5.7-!.7;8<,;!+!

!!+=!

!

>!

?61.0@767*,! /[email protected])-!

234!,567!

C537-!*8697.!0:!-7;5.7-!.7;8<,;!D!

C537-!*8697.!0:!)<<!.7;8<,;!E!

C537-!

237A8,50*!,567!0:!,B7!;A0.5*F!!

:8*A,50*!:0.!7)AB!95*-5*F!

G9.7)+!7@7*!105*,H!,737A!

!

",)*-).-!"#$%&'!

C537-!*8697.!0:!-7;5.7-!.7;8<,;!D!!

",)*-).-!"#$%&'!

*8697.!0:!)<<!.7;8<,;!E!

234!,567!;89<5*7).!

Orders of magnitude

Split-‐and-‐interleave

Orders of magnitude

11

Top-‐k queries in SPARQL 1.1 Example query on BSBM [Bizer2009]: •  The top 10 offers ordered by the product ra?ngs and offer price:

Tens of seconds on 5M triples (could be improved to milliseconds)

SELECT ?product ?offer (norm1(?avgRat1) + norm2(?avgRat2) + norm3(?price) AS ?score) WHERE {

?product hasAvgRat1 ?avgRat1 . ?product hasAvgRat2 ?avgRat2 . ?product hasName ?name . ?product hasOffers ?offer . ?offer hasPrice ?price

} ORDER BY DESC (?score) LIMIT 10

12

Split-‐and-‐interleave in SPARQL? Related work • A possible solu?on [Straccia2010, Bozzon2011]:

•  Rewrite SPARQL into SQL •  Use exis?ng op?mized RDBMS (e.g. RankSQL [Li2005])

• Disadvantages: • Works if data are already in a RDBMS

• What about na?ve SPARQL op?miza?ons? •  Federated queries over Linked Data [Wagner2012]: complementary to our approach 13

Challenges for native SPARQL split-‐and-‐interleave solutions

14

Differences with SQL and RDBMS Proposed solu0on

Different algebra STEP 1: New algebra (algebraic operators and algebraic equivalences)

Different cost of data access in na?ve RDF triplestores (sorted access is slow)

STEP 2: New algorithms for physical operators, possibly using less sorted access

Addi?onal op?miza?on dimensions STEP 3: New planning strategies

Algebra Planning strategies

Planner Query Algebraic tree

Physical operators

Algebra generator

Query plan

Step 1: a rank-‐aware algebra •  SPARQL-‐Rank algebra [Bozzon2011]

•  Extends the standard SPARQL algebra [Perez2009] •  Ranked set of mappings: set of mappings augmented with an order rela?on

15

Extended OPERATORS

New EQUIVALENCES

The SPARQL-‐Rank algebraic operators

New operator rank

16

(a) (b) (c)

g1(?a1)

g3(?p1)

?pr, ?of, ?score

[0,10]SLICE

seqScan

?pr hasA1 ?a1 . ?pr hasN ?n . ?pr hasO ?of . ?of hasP1 ?p1

g3(?p1)

?pr, ?of, ?score

[0,10]SLICE

orderScan_a1


?pr = ?pr

?pr, ?of, ?score

[0,10]SLICE

g1(?a1)

g3(?p1)seqScan

?pr hasN ?n

Sequence

seqScan

?pr hasA1 ?a1 . ?pr hasO ?of . ?of hasP1 ?p1

Ω

ρp1

ρp1(Ω )

?x ?y ?p1 ?p2

µ1 1 8 0.8 0.8

µ2 3 3 0.3 0.6

µ3 3 4 0.4 0.6

?x ?y ?p1 Fp1

µ1 1 8 0.8 1.8

µ3 3 4 0.4 1.4

µ2 3 3 0.3 1.3

The Rank Operator


Redefined standard operators

18

?x ?z ?p2 Fp2 µ4 1 9 0.8 1.8

µ5 3 0 0.6 1.6

Ω’p2

?x ?y ?z ?p1 ?p2 Fp1Up2

µ1 U µ4 1 8 9 0.8 0.8 1.6

µ3 U µ5 3 4 0 0.4 0.6 1.0

µ2 U µ5 3 3 0 0.3 0.6 0.9

?x ?y ?p1 Fp1

µ1 1 8 0.8 1.8

µ3 3 4 0.4 1.4

µ2 3 3 0.3 1.3

Ωp1

The Join Operator

SPARQL-‐Rank algebraic equivalences

Split

20


21

•  Allows the splimng of a monolithic scoring func?on into several rank operators


Interleave

22


•  Allows to order incrementally the results by pushing the rank operator inside the query tree.

24

Image from: hnp://de-‐?mekeeper.com/yahoo_site_admin/assets/images/benzinger20gold20gears200291.17120724_std.jpg

From algebra to execution

•  Rank operator

•  If there is a sorted access index on the ranking criterion we use it •  Otherwise: rank aggrega?on algorithms, e.g. [Hwang2007]

•  Join operator •  If the right operand does not influence the ranking: streaming index join

•  Otherwise: a rank-‐join algorithm [see next slides]

•  Other operators are straighsorward: •  E.g. the standard FILTER conserves the ordering of its input

Step 2: physical operators (top-‐k algorithms)

25

•  Different algorithms based on available access in the inputs:

•  Hash Rank-‐Join •  e.g. HRJN [Ilyas2004]

•  Random Access Rank-‐Join

•  e.g. RA-‐HRJN [Ilyas2004]

•  RankSequence (e,g, RSEQ) •  Minimum sorted access •  Leverages random access

(a)RankJoin

sortedAccesssortedAccess

(b)RankSequence

randomAccesssortedAccess

(c)

RA-RankJoin

sortedAccessrandomAccess


(a)RankJoin


(b)RankSequence


(c)

RA-RankJoin


sortedAccessrandomAccess(a)

RankJoin


(b)RankSequence


(c)

RA-RankJoin



Rank-‐Join algorithms

26






(a)RankJoin


(b)RankSequence


(c)

RA-RankJoin



(a)RankJoin


(b)RankSequence


(c)

RA-RankJoin



RankJoin


(b)RankSequence


(c)

RA-RankJoin




27

Literature






(a)RankJoin


(b)RankSequence


(c)

RA-RankJoin



(a)RankJoin


(b)RankSequence


(c)

RA-RankJoin



RankJoin


(b)RankSequence


(c)

RA-RankJoin




28

New

Step3: planning strategies •  Using the algebraic equivalences we can produce several equivalent algebraic trees

•  The planner can use them to implement several planning strategies

(a) (b) (c)

g1(?a1)

g3(?p1)

?pr, ?of, ?score

[0,10]SLICE

seqScan


g3(?p1)

?pr, ?of, ?score

[0,10]SLICE

orderScan_a1


?pr = ?pr

?pr, ?of, ?score

[0,10]SLICE

g1(?a1)

g3(?p1)seqScan

?pr hasN ?n

Sequence

seqScan


(a) (b) (c)

g1(?a1)

g3(?p1)

?pr, ?of, ?score

[0,10]SLICE

seqScan


g3(?p1)

?pr, ?of, ?score

[0,10]SLICE

orderScan_a1


?pr = ?pr

?pr, ?of, ?score

[0,10]SLICE

g1(?a1)

g3(?p1)seqScan

?pr hasN ?n

Sequence

seqScan


?pr, ?of, ?score

[0,10]SLICE

?pr hasA1 ?a1. ?pr hasA2 ?a2 . ?pr hasN ?n . ?pr hasO ?of .?of hasP ?p1.

[?score]ORDER

[?score =g1(?a1)+g2(?a2)+g3(?p1)]EXTEND

(a)

?pr = ?pr

?pr, ?of, ?score

[0,10]SLICEJoin

g3(?p1) g1(?a1)?pr hasO ?of .?of hasP ?p1 . ?pr hasA1 ?a1 .

?pr = ?prRankJoin

?pr = ?pr?pr hasN ?n .

RankJoin

g2(?a2)

?pr hasA2 ?a2 .

(b)

1. Rank of BGPs 2. Interleaved 3. Rank Join 29

?product, ?offer, ?score

[0,10]SLICE

?product hasAvgRat1 ?avgRat1. ?product hasAvgRat2 ?avgRat2 . ?product hasName ?name . ?product hasOffer ?offer .?offer hasPrice ?price.

[?score]ORDER

[?score = norm1(?avgRat1)+norm2(?avgRat2)+norm3(?price)]EXTEND

?product = ?product


[0,10]SLICE

norm3(?price) norm1(?avgRat1)?product hasOffer ?offer .?offer hasPrice ?price. ?product hasAvgRat1 ?avgRat1}

?product = ?productRankJoin

?product = ?product {?product hasName ?name} RankJoin

norm2(?avgRat2)

?product hasAvgRat2 ?avgRat2}

norm2(?avgRat2)

norm3(?price)


[0,10]SLICE


norm1(?avgRat1)

norm3(?price)


norm1(?avgRat1) norm2(?avgRat2)

?product hasAvgRat1 ?avgRat1. ?product hasAvgRat2 ?avgRat2 . ?product hasOffer ?offer .?offer hasPrice ?price.

1. Rank of BGPs (ROB) •  Split the monolithic scoring func?on into several incremental rank operators (rho)

30 Rank of BGPs Materialize-‐then-‐sort

?product = ?product


[0,10]SLICE

norm1(?avgRat1)

norm3(?price) {?product hasName ?name }

norm2(?avgRat2)

?product hasAvgRat1 ?avgRat1. ?product hasAvgRat2 ?avgRat2 . ?product hasOffer ?offer .?offer hasPrice ?price.

2. Interleaved (INTER) •  Separate the panern in two groups:

•  Triple panerns that influence the ranking •  Triple panerns that don’t influence the ranking

31 Interleaved


[0,10]SLICE


[?score]ORDER


?product = ?product


[0,10]SLICE




norm2(?avgRat2)


Materialize-‐then-‐sort


[0,10]SLICE


[?score]ORDER


?product = ?product


[0,10]SLICE

norm3(?price) norm1(?avgRat1)

?product hasOffer ?offer .?offer hasPrice ?price. ?product hasAvgRat1 ?avgRat1}



norm2(?avgRat2)


3. Rank-‐Join (RJ) •  Split into one triple panern for each ranking criterion •  Most appropriate join algorithm based on available access

32 Rank-‐Join


[0,10]SLICE


[?score]ORDER


?product = ?product


[0,10]SLICE




norm2(?avgRat2)


Materialize-‐then-‐sort

Experimental evaluation

33

Experimental evaluation •  Prototype implementa?on of our system:

•  ARQ-‐Rank (extends Jena ARQ 2.8.9) •  Extended version of Berlin SPARQL Benchmark [Bizer2009] •  Added ranking anributes •  Added top-‐k queries

•  Jena TDB 0.8.11 as storage

•  Code and experiments: sparqlrank.search-‐compu?ng.org 34

Experiment 1: compare planning strategies •  Example query, 5M triples dataset •  Worst-‐case scenario: no sorted access indexes (slow sorted access)

35

One to two orders of magnitude bener

Experiment 1: compare planning strategies •  Example query, 5M triples dataset •  Standard scenario: sorted access indexes (fast sorted access)

36

Two orders of magnitude bener

Experiment 2: Small Benchmark (8 queries)

!""#$$ %&"#$$ &""#$ !'$ &'$ !""#$$ %&"#$$ &""#$ !'$ &'$ !""#$$ %&"#$$ &""#$ !'$ &'$

!"($

!")$

!"%$

!"!$

!"($

!")$

!"%$

!"!$

!"($

!")$

!"%$

!"!$*+,-.$,

/,0+12

3$14

,$5467$

89:96,:$6;<,$ 89:96,:$6;<,$ 89:96,:$6;<,$

!"#$ !%#$ !&#$

37 !""#$$ %&"#$$ &""#$ !'$ &'$ !""#$$ %&"#$$ &""#$ !'$ &'$ !""#$$ %&"#$$ &""#$ !'$ &'$

!"($

!")$

!"%$

!"!$

!"($

!")$

!"%$

!"!$

!"($

!")$

!"%$

!"!$*+,

-.$,/,0+12

3$14

,$54

67$

89:96,:$6;<,$ 89:96,:$6;<,$ 89:96,:$6;<,$

!"#$ !%#$ !&#$

Conclusions and Future Work •  A system that speeds up the execu?on of top-‐k queries in SPARQL by orders of magnitude: •  STEP 1: A rank-‐aware SPARQL algebra (SPARQL-‐Rank algebra) •  STEP 2: A rank-‐join algorithm (RSEQ) •  STEP 3: Three planning strategies (ROB, INTER, RJ)

•  ARQ-‐Rank, a rank-‐aware extension of Jena ARQ •  A small benchmark for top-‐k queries, based on BSBM [Bizer2009]

•  All available at sparqlrank.search-‐compu?ng.org •  Future work:

•  More advanced, cost-‐based, op?miza?on techniques •  Extension to federated top-‐k query processing •  Top-‐k queries under OWL2QL entailment regime

38

Bibliography •  [Bozzon2011] A. Bozzon et al. Towards and efficient SPARQL top-‐k query execu?on in virtual RDF stores. In DBRANK workshop at VLDB ’11, 2011.

•  [Wagner2012] A. Wagner et al. Top-‐k Linked Data Query Processing. In ESWC ’12. Springer, 2012.

•  [Bizer2009] C. Bizer and A. Schultz. The Berlin SPARQL Benchmark. Int. J. Seman?c Web Inf. Syst., 5(2), 2009.

•  [Li2005] C. Li et al. RankSQL: query algebra and op?miza?on for rela?onal top-‐k queries. In SIGMOD ’05. ACM, 2005.

•  [DellaValle2012] E. Della Valle et al. Order maners! harnessing a world of orderings for reasoning over massive data. Seman?c Web Journal, 2012.

•  [Hwang2007] S.-‐w. Hwang and K. Chang. Probe minimiza?on by schedule op?miza?on: Suppor?ng top-‐k queries with expensive predicates. IEEE TKDE, 19(5), 2007.

39

Bibliography •  [Ilyas2004] I. F. Ilyas et al. Rank-‐aware Query Op?miza?on. In SIGMOD ’04. ACM, 2004.

•  [Ilyas2008] I.F.Ilyas et al. A survey of top-‐k query processing techniques in rela?onal database systems. ACM Comput. Surv., 40(4), 2008.

•  [Perez2009] J. Perez et al. Seman?cs and complexity of SPARQL. ACM Trans. Database Syst., 34(3), 2009.

•  [Schmidt2010] M. Schmidt et al. Founda?ons of SPARQL query op?miza?on. In ICDT ’10, ACM, 2010.

•  [Straccia2010] U. Straccia. SoxFacts: A top-‐k retrieval engine for ontology mediated access to rela?onal databases. In SMC ’10. IEEE, 2010.

40

41

BACK-‐UP SLIDES 42

An addi?onal less intui?ve and less simplified example:

• Top 2 couples of most populated ci?es and largest countries

43

Why do we need to optimize them?

Shanghai Moscow

The materialize-‐then-‐sort scheme

44

Countries by area

Ci?es by popula?on

Materialize all 14K combina?ons

Sort all 14K join combina?ons


…

249 14K*

0.04

2e-‐08

Shanghai Moscow

Va?can

* According to DBPedia, but probably more

1 Shanghai Istanbul Karachi Mumbai Moscow

0.567

0.563 0.497

0.05 0.185

Shanghai … Va?can

Can we make it more ef#icient?

45

Countries by area

Ci?es by popula?on

Order incrementally the combina?ons using par0al orders


9 13

Can we exploit the sorted access by area and by popula?on?

Shanghai Moscow

…

Shanghai Istanbul Karachi Mumbai Moscow …

Maximal possible score

Given a scoring function F (p1, …, pn) and a set of predicates P = {p1, …, pj} the maximal possible score for a mapping µ is defined as:

pi = pi [µ] if pi ∈ P pi = 1 otherwise

∀i FP (p1, …, pn) [µ] = F ( )

Mapping µ … an intermediate SPARQL solution, equivalent to a SQL tuple

?x ?y ?p1 ?p2 µ1 1 8 0.8 0.8

µ2 3 3 0.3 0.6 set of mappings

SPARQL-‐Rank algebra De#initions

Ranking principle

Ranked set of mappings

Given a set of predicates P, a ranked set of mappings ΩP is a set of mappings Ω augmented with the following properties:

•  Score: for each mapping µ, the maximal possible score FP [µ] •  Order: the order relation <ΩP is defined on ΩP based on the scores

of the single mappings

Given two mappings µ1 e µ2 with FP [µ1]> FP [µ2] , if we process µ2 we need to process also µ1.

SPARQL-‐Rank algebra De#initions


48


49

Allows to order incrementally the results by pushing the rank operator inside the query execution tree.


The RSEQ algorithm

51

Evaluation: additional technical information •  Experimental semng:

•  AMD 64 bit processor 2.66 GHz •  4 GB RAM •  Debian kernel 2.6.26-‐2 •  Sun Java 1.6.0

•  Maximum heap size 2GB

•  8 queries available at sparqlrank.search-‐compu?ng.org

52

More experimental results the RankJoin operators •  Example query, 5M triples dataset •  Worst-‐case scenario: no sorted access indexes (lex)

•  RSEQ is the best, especially for k < 1000 •  Standard scenario: sorted access indexes (right)

•  All three are comparable, RA-‐HRJN is best for k > 1000

53

ARQ-‐Rank architecture

54

Technology

ISWC 2012 "Efficient execution of top-k SPARQL queries"