Semantics and optimization of the SPARQL 1.1 federation extension

Carlos Buil Aranda (1), Marcelo Arenas (2), Oscar Corcho (1)

[email protected], [email protected], [email protected]

11th November, 2010, Madrid, Spain

(1) Ontology Engineering Group, Facultad de Informática, Universidad Politécnica de Madrid

(2) Departamento Ciencias de la Computación, Pontificia Universidad Católica de Chile


DESCRIPTION

Presentation given at ESWC 2011 for the paper "Semantics and optimisation of the SPARQL1.1 federation extension". Buil-Aranda C, Arenas M, Corcho O. ESWC 2011, May 2011, Hersonissos, Greece.


Page 1: Semantics and optimisation of the SPARQL1.1 federation extension


Page 2: Semantics and optimisation of the SPARQL1.1 federation extension

Introduction

• How many of you have needed to query distributed SPARQL endpoints?

Example

• Using the Pubmed references obtained from the Geneid gene dataset, retrieve information about genes and their references in the Pubmed dataset.

• From Pubmed we access the information in the National Library of Medicine’s controlled vocabulary thesaurus, stored at the MeSH endpoint, so we have more complete information about such genes.

• Finally, we also access the HHPID endpoint, which is the knowledge base for the HIV-1 protein.


Page 3: Semantics and optimisation of the SPARQL1.1 federation extension

Introduction

• HHPID: ?int <hhpid:elementGene2> ?gene1

• GeneID: ?gene1 <geneid:pubmed_xref> ?pubmed

• Pubmed: {?pubmed <pubmed:meshref> ?mesh . ?mesh <pubmed:descriptor> ?descriptor .}

• MeSH: ?meshReference <owl:sameAs> ?descriptor

Page 4: Semantics and optimisation of the SPARQL1.1 federation extension

Introduction

With SPARQL 1.0, how do you run such queries?

• Option 1: Make local copies of all those graphs in your favourite triple store, separated into different named graphs / contexts, and evaluate a single query over the whole set of graphs.

• Option 2: Send individual queries to each SPARQL endpoint, and join the results programmatically on the client side. Highly inefficient.

• Option 3: Use one of the existing distributed query processing extensions: DARQ, NetworkedGraphs, ARQ, etc.

SELECT ?pubmed ?gene1 ?mesh ?descriptor ?meshReference
WHERE {
  ?interaction <http://ontology.bio2rdf.org/hhpid:elementGene2> ?gene1 .
  ?gene1 <http://bio2rdf.org/geneid_resource:pubmed_xref> ?pubmed .
  ?pubmed <http://bio2rdf.org/pubmed_resource:meshref> ?mesh .
  ?mesh <http://bio2rdf.org/pubmed_resource:descriptor> ?descriptor .
  ?meshReference <http://www.w3.org/2002/07/owl#sameAs> ?descriptor .
}
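Option 2 means the client itself has to join the result sets returned by each endpoint. A minimal sketch of such a client-side join follows, assuming each result set is already available as a list of variable bindings. The sample rows and identifiers are made up for illustration, and no endpoint is contacted.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

Binding = Dict[str, str]  # variable name -> value


def hash_join(left: Iterable[Binding], right: Iterable[Binding]) -> List[Binding]:
    """Join two result sets on every variable they share.

    Assumes all rows in one result set bind the same variables
    (true for plain SELECT queries without OPTIONAL).
    """
    left_rows, right_rows = list(left), list(right)
    if not left_rows or not right_rows:
        return []
    shared = sorted(set(left_rows[0]) & set(right_rows[0]))
    # Index the left rows by the values of the shared variables.
    index = defaultdict(list)
    for row in left_rows:
        index[tuple(row[v] for v in shared)].append(row)
    # Probe the index with each right row and merge matching rows.
    joined = []
    for row in right_rows:
        for match in index.get(tuple(row[v] for v in shared), []):
            joined.append({**match, **row})
    return joined


# Hypothetical rows standing in for the answers of two endpoints.
geneid_rows = [
    {"gene1": "geneid:1", "pubmed": "pubmed:10"},
    {"gene1": "geneid:1", "pubmed": "pubmed:20"},
]
pubmed_rows = [
    {"pubmed": "pubmed:10", "mesh": "mesh_ref:a"},
    {"pubmed": "pubmed:30", "mesh": "mesh_ref:b"},
]
joined = hash_join(geneid_rows, pubmed_rows)
```

Every intermediate result set has to travel to the client before it can be filtered by the join, which is one reason this option is described as inefficient.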

Page 5: Semantics and optimisation of the SPARQL1.1 federation extension

Introduction: SPARQL 1.1 Federation Extension

• Allows specifying queries over distributed SPARQL endpoints

• New operator: SERVICE a P

• We may combine local and remote SPARQL endpoints, depending on the characteristics of the data that we are handling

SELECT ?pubmed ?gene1 ?mesh ?descriptor ?meshReference
WHERE {
  SERVICE <http://quebec.hhpid.bio2rdf.org/sparql> {
    ?interaction <http://ontology.bio2rdf.org/hhpid:elementGene2> ?gene1 .
  } .
  SERVICE <http://127.0.0.1:2020/sparql-geneid> {
    ?gene1 <http://bio2rdf.org/geneid_resource:pubmed_xref> ?pubmed .
  } .
  SERVICE <http://pubmed.bio2rdf.org/sparql> {
    ?pubmed <http://bio2rdf.org/pubmed_resource:meshref> ?mesh .
    ?mesh <http://bio2rdf.org/pubmed_resource:descriptor> ?descriptor .
  } .
  SERVICE <http://127.0.0.1:2021/sparql-mesh> {
    ?meshReference <http://www.w3.org/2002/07/owl#sameAs> ?descriptor .
  }
}
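The query on this slide wraps each group of triple patterns in a SERVICE block naming the endpoint that should answer it. As a rough sketch, the same query text can be assembled from a list of (endpoint, patterns) pairs. The helper name `build_federated_query` is hypothetical and only produces a string; it does not send anything.

```python
from typing import List, Tuple


def build_federated_query(variables: List[str],
                          parts: List[Tuple[str, List[str]]]) -> str:
    """Wrap each endpoint's triple patterns in a SERVICE block."""
    lines = ["SELECT " + " ".join(variables), "WHERE {"]
    for endpoint, patterns in parts:
        lines.append(f"  SERVICE <{endpoint}> {{")
        lines.extend(f"    {pattern} ." for pattern in patterns)
        lines.append("  }")
    lines.append("}")
    return "\n".join(lines)


# Endpoints and patterns taken from the slide.
query = build_federated_query(
    ["?pubmed", "?gene1", "?mesh", "?descriptor", "?meshReference"],
    [
        ("http://quebec.hhpid.bio2rdf.org/sparql",
         ["?interaction <http://ontology.bio2rdf.org/hhpid:elementGene2> ?gene1"]),
        ("http://127.0.0.1:2020/sparql-geneid",
         ["?gene1 <http://bio2rdf.org/geneid_resource:pubmed_xref> ?pubmed"]),
        ("http://pubmed.bio2rdf.org/sparql",
         ["?pubmed <http://bio2rdf.org/pubmed_resource:meshref> ?mesh",
          "?mesh <http://bio2rdf.org/pubmed_resource:descriptor> ?descriptor"]),
        ("http://127.0.0.1:2021/sparql-mesh",
         ["?meshReference <http://www.w3.org/2002/07/owl#sameAs> ?descriptor"]),
    ],
)
```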

Page 6: Semantics and optimisation of the SPARQL1.1 federation extension

Table of Contents

• Introduction

• Syntax and Semantics

• SERVICE evaluation

• Optimising Federated Queries

• Implementation

• Conclusions

Page 7: Semantics and optimisation of the SPARQL1.1 federation extension

Syntax & Semantics Preliminaries

Page 8: Semantics and optimisation of the SPARQL1.1 federation extension

Syntax & Semantics Preliminaries

Page 9: Semantics and optimisation of the SPARQL1.1 federation extension

Syntax & Semantics Preliminaries

Page 10: Semantics and optimisation of the SPARQL1.1 federation extension

SPARQL1.1 SERVICE syntax

• Queries are of the form:

SELECT * WHERE {
  P2 . P3 .
  SERVICE a {P1} ...
}

• “a” is an IRI or a variable, so it can be:

  • A predefined service endpoint, e.g., <http://quebec.hhpid.bio2rdf.org/sparql>

  • A variable: SERVICE ?X {P1}

Page 11: Semantics and optimisation of the SPARQL1.1 federation extension

SPARQL 1.1 SERVICE Semantics

• We extend [PAG09] with the semantics for SERVICE:

[PAG09] J. Pérez, M. Arenas and C. Gutiérrez. Semantics and complexity of SPARQL. TODS 34(3), 2009

!!!
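The formal definition shown on this slide is not reproduced in the text. As a rough illustration only, the sketch below assumes the following reading: SERVICE c P evaluates P over the graph behind endpoint c, and SERVICE ?X P takes the union of those evaluations over every known endpoint, binding ?X to each endpoint's IRI. The endpoint table `ep`, the toy graphs, and the treatment of an unknown IRI are all assumptions of this sketch, not statements from the slides.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Mapping = Dict[str, str]


def is_var(term: str) -> bool:
    return term.startswith("?")


def match_triple(pattern: Triple, graph: Set[Triple]) -> List[Mapping]:
    """Return one mapping per triple of the graph that matches the pattern."""
    results = []
    for triple in graph:
        mu: Mapping = {}
        ok = True
        for p_term, g_term in zip(pattern, triple):
            if is_var(p_term):
                # A repeated variable must match the same value each time.
                if mu.setdefault(p_term, g_term) != g_term:
                    ok = False
                    break
            elif p_term != g_term:
                ok = False
                break
        if ok:
            results.append(mu)
    return results


def eval_service(target: str, pattern: Triple,
                 ep: Dict[str, Set[Triple]]) -> List[Mapping]:
    """Evaluate SERVICE target {pattern} against the endpoint table ep."""
    if not is_var(target):
        if target in ep:
            return match_triple(pattern, ep[target])
        return [{}]  # assumption: an unknown IRI yields the single empty mapping
    results = []
    for iri, graph in ep.items():  # variable case: try every known endpoint
        for mu in match_triple(pattern, graph):
            if mu.get(target, iri) == iri:  # keep mappings compatible with ?X -> iri
                results.append({**mu, target: iri})
    return results


# Hypothetical endpoints and data.
ep = {
    "http://example.org/a": {("alice", "email", "mailto:alice@example.org")},
    "http://example.org/b": {("bob", "email", "mailto:bob@example.org")},
}
fixed = eval_service("http://example.org/a", ("?N", "email", "?E"), ep)
anywhere = eval_service("?X", ("?N", "email", "?E"), ep)
```

Under this reading the variable case loops over the whole endpoint table, which is why an unbound service variable is costly to evaluate.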

Page 12: Semantics and optimisation of the SPARQL1.1 federation extension

SPARQL 1.1 SERVICE Semantics

• So, if we find SERVICE ?X P1, do we have to send queries to every single endpoint in the world?

Page 13: Semantics and optimisation of the SPARQL1.1 federation extension

Table of Contents

• Introduction

• Syntax and Semantics

• SERVICE evaluation

• Optimising Federated Queries

• Implementation

• Conclusions

Page 14: Semantics and optimisation of the SPARQL1.1 federation extension

SERVICE Evaluation

• What happens when there is a variable ?X in SERVICE ?X P?

  • ?X must be bound in order to evaluate the pattern

  • That is, ?X needs to have a value when the SERVICE operator is evaluated

• Examples:

P1 = (SELECT {?X, ?N, ?E} WHERE {(?X, service_address, ?Y) AND (SERVICE ?Y {?N, email, ?E})})

P2 = (SELECT {?X, ?N, ?E} WHERE {((?X, service_description, ?Z) UNION (?X, service_address, ?Y)) AND ((SERVICE ?Z {?N, email, ?E}) UNION (SERVICE ?Y {?N, email, ?E}))})

• To evaluate P1 and P2 safely, we must enforce a specific evaluation order
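One possible evaluation order for a pattern like P1 can be sketched as follows: evaluate the part that binds ?Y first, then contact only the endpoints that appear as values of ?Y. The endpoint table, the zero-argument callables standing in for remote calls, and the choice to skip unbound or unknown values are assumptions made for this illustration.

```python
from typing import Callable, Dict, List, Tuple

Mapping = Dict[str, str]
# Hypothetical stand-in for a remote call that evaluates (?N, email, ?E).
Endpoint = Callable[[], List[Mapping]]


def eval_bound_service(left: List[Mapping], service_var: str,
                       endpoints: Dict[str, Endpoint]
                       ) -> Tuple[List[Mapping], List[str]]:
    """Evaluate (left AND (SERVICE service_var P)) with left evaluated first."""
    results: List[Mapping] = []
    contacted: List[str] = []
    for mu in left:
        iri = mu.get(service_var)
        if iri is None or iri not in endpoints:
            continue  # assumption: unbound or unknown endpoints are skipped
        contacted.append(iri)
        for nu in endpoints[iri]():
            # Keep only mappings that agree on shared variables.
            if all(mu.get(var, value) == value for var, value in nu.items()):
                results.append({**mu, **nu})
    return results, contacted


# Hypothetical data: a local registry binding ?Y, and three endpoints.
registry = [{"?X": "registry1", "?Y": "http://example.org/a"}]
endpoints = {
    "http://example.org/a": lambda: [{"?N": "alice", "?E": "mailto:alice@example.org"}],
    "http://example.org/b": lambda: [{"?N": "bob", "?E": "mailto:bob@example.org"}],
    "http://example.org/c": lambda: [],
}
results, contacted = eval_bound_service(registry, "?Y", endpoints)
```

Here only one of the three endpoints is contacted, because the left-hand side has already fixed the value of ?Y.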

Page 15: Semantics and optimisation of the SPARQL1.1 federation extension

SERVICE Evaluation – Ingredients and Informal Definitions

• Boundedness (of a variable in a query)

  • ?Y is bound in P1 = ((?X, service_address, ?Y) AND (SERVICE ?Y (?N, email, ?E)))

  • ?Y is not bound in P2 = (((?X, service_address, ?Z) OPT (?X, service_desc, ?Y)) AND (SERVICE ?Y (?N, email, ?E)))

  • However, checking this is undecidable

• Strong Boundedness (of a variable in a query)

  • We impose some syntactic conditions (details in the paper)

• Service Boundedness

  • Based on the parse tree of the query: if a node SERVICE ?X P has an ancestor in which ?X is bound, the service is bound

  • However, checking this is undecidable

• Service Safeness

  • Hence we impose syntactic conditions: we check whether variable ?X is strongly bound

Page 16: Semantics and optimisation of the SPARQL1.1 federation extension

SERVICE Evaluation - Boundedness

Page 17: Semantics and optimisation of the SPARQL1.1 federation extension

SERVICE Evaluation - Strong Boundedness

Page 18: Semantics and optimisation of the SPARQL1.1 federation extension

SERVICE Evaluation - Safeness

• Given a SPARQL query P, define T(P) as the parse tree of P. In this tree, every node corresponds to a sub-pattern of P.

Page 19: Semantics and optimisation of the SPARQL1.1 federation extension

SERVICE Evaluation - Safeness

• Definition (service-boundedness)

A SPARQL query P is service-bound if for every node u of T(P) with label (SERVICE ?X P1), it holds that:

  • There exists a node v of T(P) with label P2 such that v is an ancestor of u in T(P) and ?X is bound in P2

  • P1 is service-bound

• Theorem

The problem of verifying, given a SPARQL query P, whether P is service-bound is undecidable.

• Definition (service-safeness)

A SPARQL query P is service-safe if for every node u of T(P) with label (SERVICE ?X P1), it holds that:

  • There exists a node v of T(P) with label P2 such that v is an ancestor of u in T(P) and ?X is strongly bound in P2

  • P1 is service-safe
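The service-safeness definition above can be checked by a walk over the parse tree. The sketch below is one way to do it, not the authors' implementation. The slides leave the syntactic conditions for strong boundedness to the paper, so the rules in `strongly_bound` are an assumption of this sketch: all variables of a triple pattern, union for AND, intersection for UNION, left side only for OPT, nothing for SERVICE ?X. FILTER and SELECT are left out for brevity.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union


@dataclass(frozen=True)
class Triple:
    s: str
    p: str
    o: str


@dataclass(frozen=True)
class And:
    left: "Pattern"
    right: "Pattern"


@dataclass(frozen=True)
class Opt:
    left: "Pattern"
    right: "Pattern"


@dataclass(frozen=True)
class UnionP:
    left: "Pattern"
    right: "Pattern"


@dataclass(frozen=True)
class Service:
    target: str  # an IRI, or a variable such as "?Y"
    body: "Pattern"


Pattern = Union[Triple, And, Opt, UnionP, Service]


def is_var(term: str) -> bool:
    return term.startswith("?")


def strongly_bound(p: Pattern) -> FrozenSet[str]:
    """Variables guaranteed a value in every answer (assumed syntactic rules)."""
    if isinstance(p, Triple):
        return frozenset(t for t in (p.s, p.p, p.o) if is_var(t))
    if isinstance(p, And):
        return strongly_bound(p.left) | strongly_bound(p.right)
    if isinstance(p, UnionP):
        return strongly_bound(p.left) & strongly_bound(p.right)
    if isinstance(p, Opt):
        return strongly_bound(p.left)
    if isinstance(p, Service):
        return frozenset() if is_var(p.target) else strongly_bound(p.body)
    raise TypeError(f"unsupported pattern: {p!r}")


def service_safe(p: Pattern, ancestors: Tuple[Pattern, ...] = ()) -> bool:
    """Check the two conditions of the definition for every SERVICE ?X node."""
    if isinstance(p, Triple):
        return True
    chain = ancestors + (p,)
    if isinstance(p, Service):
        if is_var(p.target):
            # Some ancestor must strongly bind the service variable ...
            if not any(p.target in strongly_bound(a) for a in ancestors):
                return False
            # ... and the inner pattern must be service-safe on its own.
            return service_safe(p.body)
        return service_safe(p.body, chain)
    return service_safe(p.left, chain) and service_safe(p.right, chain)


email = Triple("?N", "email", "?E")
safe_query = And(Triple("?X", "service_address", "?Y"), Service("?Y", email))
unsafe_query = And(
    Opt(Triple("?X", "service_address", "?Z"), Triple("?X", "service_desc", "?Y")),
    Service("?Y", email),
)
```

With these rules the first pattern passes, because the AND node strongly binds ?Y. The second fails, because ?Y occurs only on the optional side.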

Page 20: Semantics and optimisation of the SPARQL1.1 federation extension

Table of Contents

• Introduction

• Syntax and Semantics

• SERVICE evaluation

• Optimising Federated Queries

• Implementation

• Conclusions

Page 21: Semantics and optimisation of the SPARQL1.1 federation extension

Optimising Federated Queries

• Well-designed patterns [PAG09]

Page 22: Semantics and optimisation of the SPARQL1.1 federation extension

Optimising federated queries with well-designed patterns

• We extended the notion of well-designed patterns for SPARQL1.1 Federated Query

• The following rules (from [PAG09]) also hold for SERVICE:

• Proposition

  • If P is a well-designed pattern and Q is obtained from P by applying either (1), (2) or (3), then Q is a well-designed pattern equivalent to P.
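The rewriting rules themselves are not reproduced in this text. As an illustration under stated assumptions, the sketch below checks the usual well-designedness condition for AND/OPT patterns: a variable that occurs in the optional side of an OPT and also outside that OPT must occur in its mandatory side. It then applies one AND/OPT reordering, (P1 AND (P2 OPT P3)) to ((P1 AND P2) OPT P3), known from the well-designed-patterns literature. Whether that reordering corresponds to rule (1), (2) or (3) on the slide cannot be recovered from this text, and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass(frozen=True)
class Triple:
    s: str
    p: str
    o: str


@dataclass(frozen=True)
class And:
    left: "Pattern"
    right: "Pattern"


@dataclass(frozen=True)
class Opt:
    left: "Pattern"
    right: "Pattern"


Pattern = Union[Triple, And, Opt]


def variables(p: Pattern) -> FrozenSet[str]:
    if isinstance(p, Triple):
        return frozenset(t for t in (p.s, p.p, p.o) if t.startswith("?"))
    return variables(p.left) | variables(p.right)


def well_designed(p: Pattern, outside: FrozenSet[str] = frozenset()) -> bool:
    """outside holds the variables that occur elsewhere in the whole pattern."""
    if isinstance(p, Triple):
        return True
    if isinstance(p, Opt):
        # Variables introduced only by the optional side must not leak outside.
        if (variables(p.right) - variables(p.left)) & outside:
            return False
    return (well_designed(p.left, outside | variables(p.right))
            and well_designed(p.right, outside | variables(p.left)))


def pull_opt_out_of_and(p: Pattern) -> Pattern:
    """Rewrite (P1 AND (P2 OPT P3)) into ((P1 AND P2) OPT P3) when it applies."""
    if isinstance(p, And) and isinstance(p.right, Opt):
        return Opt(And(p.left, p.right.left), p.right.right)
    return p


p1 = Triple("?gene1", "pubmed_xref", "?pubmed")
p2 = Triple("?pubmed", "meshref", "?mesh")
p3 = Triple("?mesh", "descriptor", "?descriptor")
original = And(p1, Opt(p2, p3))
rewritten = pull_opt_out_of_and(original)

# Not well designed: ?E occurs in the optional side and outside the OPT,
# but not in the mandatory side.
leaky = And(Opt(Triple("?X", "name", "?N"), Triple("?X", "email", "?E")),
            Triple("?Y", "contact", "?E"))
```

The rewritten form joins P1 and P2 before the optional part is considered, which is the kind of reordering an optimiser can exploit.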

Page 23: Semantics and optimisation of the SPARQL1.1 federation extension

Table of Contents

• Introduction

• Syntax and Semantics

• SERVICE evaluation

• Optimising Federated Queries

• Implementation

• Conclusions

Page 24: Semantics and optimisation of the SPARQL1.1 federation extension

Implementation: OGSA-DAI

• Implemented on top of OGSA-DAI/OGSA-DQP

  • Extensible framework to access, integrate, transform and deliver distributed and heterogeneous sources of data

  • Implements part of the WS-DAI specification

  • Service Oriented Data Access (direct and indirect access)

  • Distributed Query Processing

• Features

  • RDF extension (available in the official OGSA-DAI release)

  • We process

    • SERVICE IRI P

    • AND, OPTIONAL, UNION

    • Most common FILTERS (<, =, >)

    • Solution modifiers

  • Upcoming features (not yet supported)

    • SERVICE ?X P

Page 25: Semantics and optimisation of the SPARQL1.1 federation extension

Implementation: Evaluation

• Benchmark test

  • Existing benchmarks (Berlin SPARQL Benchmark and SP2Bench) were not suitable (no distributed queries), and other benchmarks were at an early stage

  • Focus on life sciences queries: bio2rdf.org project

  • Seven queries of increasing complexity

  • http://www.oeg-upm.net/files/sparql-dqp/

• Bio2rdf datasets

  • bio2rdf.org: 2.3 billion triples

  • Used Entrez Gene (13 million triples), pubmed (797 million triples), HHPID (244,091 triples) and MeSH (689,542 triples)

  • Downloaded some datasets (HHPID and pubmed) and divided them into several endpoints of 300,000 triples

• Hardware used:

  • Intel Core 2 Duo, 2.50 GHz, 3 GB RAM

Page 26: Semantics and optimisation of the SPARQL1.1 federation extension

Results

Page 27: Semantics and optimisation of the SPARQL1.1 federation extension

Table of Contents

• Introduction

• Syntax and Semantics

• SERVICE evaluation

• Optimising Federated Queries

• Implementation

• Conclusions

Page 28: Semantics and optimisation of the SPARQL1.1 federation extension

Conclusions

• Formalisation of the SPARQL 1.1 Basic Federation Extension syntax and semantics

• Safeness conditions in the evaluation of SERVICE in the presence of variables

• Simple query optimisation based on an extension of well-designed patterns

• Implementation based on a robust data-access system like OGSA-DAI

  • Focused on large-scale data sources

  • More optimisations can be easily included

  • Indirect data access mode (you send the query, it sends you a handler to where the result will be placed, and you can use that resource).

Page 29: Semantics and optimisation of the SPARQL1.1 federation extension

Semantics and optimization of the SPARQL 1.1 federation

extension

Acknowledgements

• Implementation:

  • OGSA-DAI team (especially Ally Hume)

• Query generation:

  • Bio2RDF project team (especially Marc-Alexandre Nolin)

• Heavy discussions on syntax and semantics

  • Jorge Pérez

• Funding

  • ADMIRE Project (FP7-ICT-215024)

  • FONDECYT grant 1090565
