Querying Distributed RDF Data Sources with SPARQL

Presented by Bastian Quilitz and Ulf Leser

Humboldt-Universität zu Berlin

ESWC 2008

2009-07-23

Summarized by Jaeseok Myung

Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

Copyright 2009 by CEBT

Introduction

SPARQL has to deal with thousands of RDF data sets

on a local machine

across multiple, distributed machines

Integrated access to multiple RDF data sources is a key challenge for many semantic web applications

Current implementations of SPARQL load all RDF graphs to the local machine

This usually incurs a large overhead in network traffic

Center for E-Business Technology 2009 IDS & IDB Lab. Seminar – 2/19

Introduction

DARQ, an engine for federated SPARQL queries

Provides transparent query access to multiple SPARQL services

Distributed ARQ: an extension to ARQ, the SPARQL query engine of Jena

Available under GPL License at http://darq.sf.net/

[Figure: diagram with the labels "Data Source", "Metadata for each DS" and "Building Sub-queries"; annotations: "Do not care" and "In this presentation, .."]

Preliminaries

A SPARQL query Q is defined as Q = (E, DS, R)

E : an algebra expression of the SPARQL query

DS : an RDF data source

R : Query Type (SELECT, CONSTRUCT, DESCRIBE, ASK)

The algebra expression E consists of

Graph Patterns

– Triple Pattern : (s, p, o)

– Basic Graph Pattern : a set of triple patterns

– Filtered BGP : BGP with constraints

Solution Modifiers

– Such as PROJECTION, DISTINCT, LIMIT or ORDER BY
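
As a rough illustration of the tuple Q = (E, DS, R), the sketch below models it with plain Python data classes. All class and field names (FilteredBGP, Query, is_variable, the "default" data source name) are hypothetical and chosen for this summary; they are not taken from ARQ or DARQ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A triple pattern (s, p, o); terms that start with "?" are variables.
TriplePattern = Tuple[str, str, str]


@dataclass
class FilteredBGP:
    """A basic graph pattern (set of triple patterns) plus value constraints."""
    triples: List[TriplePattern]
    filters: List[str] = field(default_factory=list)


@dataclass
class Query:
    """Q = (E, DS, R); E is split here into graph pattern and solution modifiers."""
    pattern: FilteredBGP                                 # E: graph pattern part
    modifiers: List[str] = field(default_factory=list)  # E: solution modifiers
    data_source: str = "default"                         # DS (placeholder name)
    query_type: str = "SELECT"                           # R


def is_variable(term: str) -> bool:
    """Return True for SPARQL variables such as ?x."""
    return term.startswith("?")


# Hypothetical example: two triple patterns, one constraint, two modifiers.
example = Query(
    pattern=FilteredBGP(
        triples=[("?x", "foaf:name", "?name"), ("?x", "foaf:mbox", "?mbox")],
        filters=['regex(?name, "^Tim")'],
    ),
    modifiers=["ORDER BY ?name", "LIMIT 5"],
)
```

Keeping the graph pattern separate from the solution modifiers mirrors the split on this slide: patterns decide which triples match, modifiers only reshape the result.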

An Example SPARQL Query

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?name ?mbox WHERE {

?x foaf:name ?name .

?x foaf:mbox ?mbox .

FILTER (regex(?name, "^Tim") && regex(?mbox, "w3c"))

} ORDER BY ?name LIMIT 5

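Annotations in the figure: SELECT is the query type; ?name ?mbox is the projection; each of the two patterns on ?x is a triple pattern (TP); together they form a BGP; the BGP with its FILTER is a filtered BGP (FBGP); ORDER BY and LIMIT are solution modifiers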

Query Processing

A query is processed in four stages:

Parsing : the query string is converted into a tree model of the SPARQL query. DARQ reuses the parser shipped with ARQ

Query Planning : the query engine decomposes the original query into multiple sub-queries according to the information in the service descriptions; each sub-query can be answered by one known data source

Query Optimization : the query optimizer takes the sub-queries and rewrites them into a cost-effective query execution plan

Query Execution : the query execution plan is executed. The sub-queries are sent to the data sources and the results are integrated

Service Descriptions

Information about each data source is needed

To find the relevant data sources for the different triple patterns

To decompose the query into sub-queries

Service descriptions

Describe which data is available from a data source

Allow limitations on access patterns to be declared

Include statistical information used for query optimization

Are represented in RDF

Service Descriptions

Data Description

A service description defines capabilities, which indicate what data is available from the data source

Ex) sd:capability [ sd:predicate rdf:type ];

The definition of capabilities is based on predicates

– DARQ currently supports only queries with bound predicates

Limitation on Access Pattern

DARQ supports limitations on access patterns

Ex) sd:requiredBindings [ sd:subjectBinding foaf:name ];

Ex) sd:requiredBindings [ sd:objectBinding foaf:name ];

Service Descriptions

Statistical Information

Helps the query optimizer to find a cost-effective query plan

Includes

– Ns : the total number of triples in the data source

– Optional information for each predicate p:

nD(p) : the number of triples with predicate p in data source D

sselD(p) : the selectivity of a triple pattern with predicate p when the subject is bound (default = 1 / nD(p))

oselD(p) : the selectivity of a triple pattern with predicate p when the object is bound (default = 1)

Only simple statistics are used, so every data source can provide them

– More precise statistics would be preferable but are unlikely to be available from every source

Service Descriptions

The data source defined in the example can answer queries for foaf:name, foaf:mbox and foaf:weblog.

Objects for a triple with predicate foaf:name must always start with a letter from A to R

In total it stores 112 triples

The data source has limitations on access patterns, i.e. a query must contain a triple pattern with predicate foaf:name or foaf:mbox and a bound object
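
The properties listed for this example source can be captured in a small model. The following is a minimal sketch, not DARQ's actual data structures: the class ServiceDescription, its fields and both methods are hypothetical names, and the "A to R" constraint on foaf:name objects is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

TriplePattern = Tuple[str, str, str]  # (subject, predicate, object)


def is_bound(term: str) -> bool:
    """A term is bound when it is not a variable (variables start with '?')."""
    return not term.startswith("?")


@dataclass
class ServiceDescription:
    predicates: Set[str]                      # capabilities, one per predicate
    total_triples: int                        # total number of triples
    required_object_bindings: Set[str] = field(default_factory=set)

    def can_answer(self, tp: TriplePattern) -> bool:
        """Capabilities are predicate based, so the predicate must be bound."""
        return is_bound(tp[1]) and tp[1] in self.predicates

    def access_allowed(self, patterns: List[TriplePattern]) -> bool:
        """At least one pattern must bind the object of a required predicate."""
        if not self.required_object_bindings:
            return True
        return any(
            p in self.required_object_bindings and is_bound(o)
            for (_, p, o) in patterns
        )


# The source described above: three predicates, 112 triples, and a required
# bound object for foaf:name or foaf:mbox.
example_source = ServiceDescription(
    predicates={"foaf:name", "foaf:mbox", "foaf:weblog"},
    total_triples=112,
    required_object_bindings={"foaf:name", "foaf:mbox"},
)
```

With this model, a query that only asks for ?x foaf:name ?name is rejected by access_allowed, while one that fixes the name to a literal is accepted.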

Query Planning

Query planning is based on the information provided by service descriptions

It has two stages: source selection and building sub-queries

Source Selection: determines which data sources are relevant to a given query

– The algorithm simply matches the triple patterns of the query against the capabilities of the data sources

Ex) sd:capability [ sd:predicate rdf:type ];

SELECT ?x WHERE { ?x rdf:type foaf:Person }

– As a result, every triple pattern in a BGP has a set of corresponding data sources

– The results of source selection are used to build sub-queries that can be answered by the data sources

Building Sub-Queries

– Each relevant data source gets a sub-query

– Each sub-query consists of a filtered BGP
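
The two planning stages can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not DARQ's implementation: the function names select_sources and build_subqueries and the source names "A", "B" and "C" are hypothetical, capabilities are reduced to sets of predicates, and filters and access-pattern limitations are ignored.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

TriplePattern = Tuple[str, str, str]  # (subject, predicate, object)


def select_sources(
    bgp: List[TriplePattern], capabilities: Dict[str, Set[str]]
) -> Dict[TriplePattern, List[str]]:
    """Match each triple pattern's predicate against every source's capabilities."""
    selection = {}
    for tp in bgp:
        predicate = tp[1]
        if predicate.startswith("?"):
            raise ValueError("unbound predicates are not supported")
        selection[tp] = sorted(
            name for name, preds in capabilities.items() if predicate in preds
        )
    return selection


def build_subqueries(
    selection: Dict[TriplePattern, List[str]]
) -> Dict[str, List[TriplePattern]]:
    """Group triple patterns by source: one sub-query (a BGP) per relevant source."""
    subqueries = defaultdict(list)
    for tp, sources in selection.items():
        for source in sources:
            subqueries[source].append(tp)
    return dict(subqueries)


bgp = [("?x", "foaf:name", "?name"), ("?x", "foaf:mbox", "?mbox")]
capabilities = {
    "A": {"foaf:name", "foaf:mbox"},  # hypothetical source names
    "B": {"foaf:mbox"},
    "C": {"foaf:name"},
}
selection = select_sources(bgp, capabilities)
subqueries = build_subqueries(selection)
```

Here source "A" receives both triple patterns in one sub-query, while "B" and "C" each receive the single pattern they can answer.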

Query Planning

[Figure: DARQ decomposes the example query into sub-queries according to the capabilities of the data sources]

Example data: (Person, name, "TBL") (Person, mbox, "[email protected]") (Person, name, "ABC") (Person, mbox, "[email protected]")

Capabilities of the data sources:

sd:capability sd:predicate foaf:name; sd:predicate foaf:mbox.

sd:capability sd:predicate foaf:mbox.

sd:capability sd:predicate foaf:name.

Query:

SELECT ?name ?mbox WHERE { ?x foaf:name ?name. ?x foaf:mbox ?mbox. FILTER (regex(?name, "^Tim") && regex(?mbox, "w3c")) } ORDER BY ?name LIMIT 5

Sub-queries built by DARQ:

(?x foaf:name ?name)

(?x foaf:mbox ?mbox)

(?x foaf:name ?name)(?x foaf:mbox ?mbox)

Query Optimization - Logical

Rule-based Query Rewriting

Based on [Perez, J. et al., ISWC 2006]

Reduces the number of BGPs and variables

Moves value constraints (FILTERs) into sub-queries, so that intermediate results are reduced as early as possible
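
Moving value constraints into sub-queries can be illustrated with a small sketch. It assumes that a conjunctive FILTER has already been split into separate constraints and that a constraint may be pushed into any sub-query that binds all of its variables; the function names and the source names "A" and "B" are hypothetical, and this is not the rule set of the cited paper.

```python
import re
from typing import Dict, List, Set, Tuple

TriplePattern = Tuple[str, str, str]  # (subject, predicate, object)


def variables(text: str) -> Set[str]:
    """Collect SPARQL variables such as ?name from a constraint string."""
    return set(re.findall(r"\?\w+", text))


def push_filters(
    subqueries: Dict[str, List[TriplePattern]], filters: List[str]
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Attach each filter to every sub-query that binds all of its variables.

    Filters that no single sub-query can evaluate stay with the federator.
    """
    pushed = {name: [] for name in subqueries}
    remaining = []
    for constraint in filters:
        needed = variables(constraint)
        targets = [
            name
            for name, patterns in subqueries.items()
            if needed <= {t for tp in patterns for t in tp if t.startswith("?")}
        ]
        if targets:
            for name in targets:
                pushed[name].append(constraint)
        else:
            remaining.append(constraint)
    return pushed, remaining


subqueries = {
    "A": [("?x", "foaf:name", "?name")],  # hypothetical source names
    "B": [("?x", "foaf:mbox", "?mbox")],
}
filters = ['regex(?name, "^Tim")', 'regex(?mbox, "w3c")', "?name != ?mbox"]
pushed, remaining = push_filters(subqueries, filters)
```

The first two constraints travel with the sub-queries for "A" and "B"; the third needs variables from both sources and therefore has to be evaluated after the results are joined.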

Query Optimization - Physical

Physical optimization is cost-based: it relies on estimating the size of intermediate results

The result size estimation is based on the statistics provided in the service descriptions

Estimates are needed for single triple patterns, multiple triple patterns (BGPs) and joins

An example of a single triple pattern
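
The formula shown on the original slide is not preserved in this transcript. As a stand-in, the sketch below derives an estimate for a single triple pattern directly from the statistics defined on the Service Descriptions slide (nD(p), sselD(p), oselD(p) and their defaults). The names PredicateStats and estimate_result_size are hypothetical, and the case where both subject and object are bound uses an independence assumption that is not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PredicateStats:
    triples: int                  # nD(p): triples with predicate p in source D
    ssel: Optional[float] = None  # sselD(p); default 1 / nD(p)
    osel: Optional[float] = None  # oselD(p); default 1


def estimate_result_size(
    subject_bound: bool, object_bound: bool, stats: PredicateStats
) -> float:
    """Estimate how many results a triple pattern with bound predicate p returns."""
    n = stats.triples
    if n == 0:
        return 0.0
    ssel = stats.ssel if stats.ssel is not None else 1.0 / n
    osel = stats.osel if stats.osel is not None else 1.0
    if subject_bound and object_bound:
        # Assumption for this sketch only: treat both selectivities as independent.
        return n * ssel * osel
    if subject_bound:
        return n * ssel
    if object_bound:
        return n * osel
    return float(n)


# Hypothetical statistics: 50 triples for some predicate, defaults otherwise.
stats = PredicateStats(triples=50)
unbound_estimate = estimate_result_size(False, False, stats)
subject_estimate = estimate_result_size(True, False, stats)
object_estimate = estimate_result_size(False, True, stats)
```

With the default selectivities, binding the subject reduces the estimate to about one result, whereas binding the object leaves it at nD(p); a source that publishes a smaller oselD(p) gives the optimizer a tighter estimate.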

Evaluation

Dataset : a subset of DBpedia, 31.5 million triples in total

Contains RDF data extracted from Wikipedia

http://dbpedia.org

Evaluation

2 physical machines, 5 logical SPARQL endpoints

Evaluation

Optimization yielded significant improvements

My opinion

The experiment does not count the loading time

DARQ needs to be compared with other systems

– http://esw.w3.org/topic/LargeTripleStores

Conclusion

DARQ offers a single interface for querying multiple, distributed SPARQL end-points

Uses the SPARQL standard, which makes it flexible

Using Service Descriptions

– Data sources can be added and/or removed dynamically

– A query can be federated and optimized with statistical information

Limitation

Predicates must be bound (a pattern such as subject ?p object is not allowed)

CONSTRUCT, DESCRIBE, ASK are not supported

GRAPH, UNION, OPTIONAL are not supported

Paper Evaluation

Pros

Good idea

– Distributed SPARQL processing is a relatively new research field

Defining service descriptions

Dealing with all aspects of query engine

Implementation

My Comments

Too simple, and still slow

Many limitations
