
Ontology-Based Integration of Cross-Linked Datasets

Diego Calvanese¹, Martin Giese², Dag Hovland², and Martin Rezk¹

¹ Free University of Bozen-Bolzano, Bolzano, Italy

² University of Oslo, Oslo, Norway

Abstract. In this paper we tackle the problem of answering SPARQL queries over virtually integrated databases. We assume that the entity resolution problem has already been solved and that explicit information is available about which records in the different databases refer to the same real-world entity. Surprisingly, to the best of our knowledge, there has been no attempt to extend the standard Ontology-Based Data Access (OBDA) setting to take into account these DB links for SPARQL query answering and consistency checking. This is partly because the OWL built-in owl:sameAs property, the most natural representation of links between datasets, is not included in OWL 2 QL, the de facto ontology language for OBDA. We formally treat several fundamental questions in this context: how links over database identifiers can be represented in terms of owl:sameAs statements, how to recover rewritability of SPARQL into SQL (lost because of owl:sameAs statements), and how to check consistency. Moreover, we investigate how our solution can be made to scale up to large enterprise datasets. We have implemented the approach and carried out an extensive set of experiments showing its scalability.

1 Introduction

Since the mid 2000s, Ontology-Based Data Access (OBDA) [9,14,15] has become a popular approach for virtual data integration [6]. In (virtual) OBDA, a conceptual layer is given in the form of (the intensional part of) an ontology (usually in OWL 2 QL) that defines a shared vocabulary, models the domain, hides the structure of the data sources, and can enrich incomplete data with background knowledge. The ontology is connected to the data sources through a declarative specification given in terms of mappings [4] that relate symbols in the ontology (classes and properties) to (SQL) views over the data. The ontology and mappings together expose a virtual RDF graph, which can be queried using SPARQL queries that are then translated into SQL queries over the data sources. In this setting, users no longer need an understanding of the data sources, the relations between them, or the encoding of the data.

One aspect of OBDA for data integration is less well studied, however: in many cases, complementary information about the same entity is distributed over several data sources, and this entity is represented using different identifiers. The first important issue that arises is entity resolution, which requires understanding which records actually represent the same real-world entity. We do not deal with this problem here, and assume that this information is already available.

© Springer International Publishing Switzerland 2015. M. Arenas et al. (Eds.): ISWC 2015, Part I, LNCS 9366, pp. 199–216, 2015. DOI: 10.1007/978-3-319-25007-6_12

Traditional relational data integration techniques use extract, transform, load (ETL) processes to address this problem [6]. These techniques usually choose a single representation of the entity, merge the information available in all data sources, and then answer queries on the merged data. However, this approach of physically merging the data is not possible in many real-world scenarios where one has no complete control over the data sources, so that they cannot be modified, and where the data cannot be moved due to freshness, privacy, or legal issues (see, e.g., Section 3).

An alternative that can be pursued in OBDA is to use mappings to virtually merge the data, by consistently generating only one URI per real-world entity. Unfortunately, this approach is not viable in general either: (1) it does not scale well to several datasets, since it requires a central authority for defining URI schemas, which may have to be revised along with all mappings whenever a new source is added; and (2) it is crucial for the efficiency of OBDA that URIs be generated from the primary keys of the data sources, which typically differ from source to source.

The approach we propose in this paper is based on the natural idea of representing the links between database records resulting from entity resolution in the form of linking tables, which are binary tables in dedicated data sources that simply maintain the information about pairs of records representing the same entity. This brings about several problems that need to be addressed: (1) links over database identifiers should be represented in terms of OWL owl:sameAs statements, which is the standard approach in semantic technologies for connecting entity identifiers; (2) the presence of owl:sameAs statements, which are inherently transitive, breaks rewritability of SPARQL queries into SQL queries over the sources, and one needs to understand whether rewritability can be recovered by imposing suitable restrictions on the linking mechanism; (3) a similar problem arises for checking consistency of the data sources with respect to the ontology, which is traditionally addressed through query answering; (4) since performance can be prohibitively affected by the presence of owl:sameAs, performance becomes one of the key issues to address, so as to make the proposed approach scalable over large enterprise datasets.

In this paper we tackle the above issues in the setting where we are given an OWL 2 QL ontology that is mapped to a set of data sources, which are then extended with linking tables. Specifically, we provide the following contributions:

– We propose a mapping-based framework that virtually constructs owl:sameAs statements from the linking tables, and deals with transitivity and symmetry, in such a way that performance is not compromised.

– We define a suitable set of restrictions on the linking mechanisms that ensures rewritability of SPARQL query answering, despite the presence of owl:sameAs statements.


– We develop a sound and complete SPARQL query translation technique, and show how to apply it also for consistency checking.

– We show how to optimize the translation so as to critically reduce the size of the produced SQL query.

– To empirically demonstrate the scalability of our solution, we carry out an extensive set of experiments, both over a real enterprise cross-linked dataset from the oil and gas industry and in a controlled environment; this demonstrates the feasibility of our approach.

The structure of the paper is as follows: Section 2 briefly introduces the necessary background, and Section 3 describes our enterprise scenario. Section 4 provides a sound and complete SPARQL query translation technique for cross-linked datasets. Section 5 presents the main contribution of the paper, showing how to construct an OBDA setting over cross-linked datasets, and Section 6 presents our optimization technique. Section 7 presents an extensive experimental evaluation. Section 8 surveys related work, and Section 9 concludes the paper.

2 Preliminaries

Ontology-Based Data Access. In the traditional OBDA setting $(\mathcal{T}, \mathcal{M}, \mathcal{D})$, the three main components are a set $\mathcal{T}$ of OWL 2 QL [12] axioms (called the TBox), a relational database $\mathcal{D}$, and a set $\mathcal{M}$ of mappings. The OWL 2 QL profile of OWL 2 guarantees that queries formulated over $\mathcal{T}$ can be rewritten into SQL [2]. The mappings define how classes and properties in $\mathcal{T}$ should be populated with objects constructed from the data retrieved from $\mathcal{D}$ by means of SQL queries. Each mapping has one of the forms:

Class(subject) ← $sql_{class}$        Property(subject, object) ← $sql_{prop}$,

where $sql_{class}$ and $sql_{prop}$ are, respectively, a unary and a binary SQL query over $\mathcal{D}$. For both types of mappings we also use the equivalent notation (s p o) ← sql. Subjects and objects in RDF triples are resources (individuals or values) represented by URIs or literals. They are generated using templates in the mappings. For example, the URI template for the subject can take the form <http://www.statoil.com/{id}>, where {id} is an attribute in some DB table; it generates the URI <http://www.statoil.com/25> when {id} is instantiated as "25". From $\mathcal{M}$ and $\mathcal{D}$, one can derive a (virtual) RDF graph $G_{\mathcal{M},\mathcal{D}}$, obtained by applying all mappings. Any RDF graph can be seen as a set of logical assertions. Thus, the TBox together with $G_{\mathcal{M},\mathcal{D}}$ constitutes an ontology $\mathcal{O} = (\mathcal{T}, G_{\mathcal{M},\mathcal{D}})$.
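As a rough illustration of how such mappings populate the virtual RDF graph, the following Python sketch instantiates a URI template over rows that stand in for the result of the SQL query. Only the template string and the id "25" come from the text; the rows, the class name :Wellbore, the property name :hasName, and all function names are assumptions made for this example.

```python
RDF_TYPE = "rdf:type"


def instantiate(template, row):
    """Fill a URI template such as <http://www.statoil.com/{id}> from one row."""
    return template.format(**row)


def apply_class_mapping(cls, subject_template, rows):
    """Class(subject) <- sql: one rdf:type triple per returned row."""
    return {(instantiate(subject_template, r), RDF_TYPE, cls) for r in rows}


def apply_property_mapping(prop, subject_template, object_column, rows):
    """Property(subject, object) <- sql: NULL objects produce no triple."""
    return {
        (instantiate(subject_template, r), prop, r[object_column])
        for r in rows
        if r[object_column] is not None
    }


# Hypothetical query result; a real system would obtain it from the database.
rows = [{"id": "25", "name": "A"}, {"id": "26", "name": None}]
template = "<http://www.statoil.com/{id}>"

graph = apply_class_mapping(":Wellbore", template, rows) | apply_property_mapping(
    ":hasName", template, "name", rows
)
```

In a virtual OBDA system these triples are never materialized; the sketch only makes explicit which triples the mappings describe.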

To handle ontology-based integration of cross-linked datasets, we extend the traditional OBDA setting with a fourth component $A_S$ containing a set of statements of the form owl:sameAs($o_1$, $o_2$). Thus, in this paper, an OBDA setting is a tuple $(\mathcal{T}, \mathcal{M}, \mathcal{D}, A_S)$, and its corresponding ontology is the tuple $\mathcal{O} = (\mathcal{T}, G_{\mathcal{M},\mathcal{D}} \cup A_S)$. Unless stated otherwise, in the following we work with OBDA settings of this form.


Semantics: To interpret ontologies, we use the standard notions of first-order interpretation, model, and satisfaction. That is, $\mathcal{O} \models A(v)$ iff for every model $I$ of $\mathcal{O}$, we have that $I \models A(v)$. Intuitively, adding an ontology $\mathcal{T}$ on top of an RDF graph $G$ extends $G$ with extra triples inferred by $\mathcal{T}$. Formally, the RDF graph (virtually) exposed by the OBDA setting $(\mathcal{T}, \mathcal{M}, \mathcal{D}, A_S)$ is $G_{(\mathcal{T}, \mathcal{M}, \mathcal{D}, A_S)} = \{A(v) \mid (\mathcal{T}, G_{\mathcal{M},\mathcal{D}} \cup A_S) \models A(v)\}$.

SPARQL. SPARQL is a W3C standard language designed to query RDF graphs. Its vocabulary contains four pairwise disjoint and countably infinite sets of symbols: $\mathbf{I}$ for IRIs, $\mathbf{B}$ for blank nodes, $\mathbf{L}$ for RDF literals, and $\mathbf{V}$ for variables. The elements of $\mathbf{T} = \mathbf{I} \cup \mathbf{B} \cup \mathbf{L}$ are called RDF terms. A triple pattern is an element of $(\mathbf{T} \cup \mathbf{V}) \times (\mathbf{I} \cup \mathbf{V}) \times (\mathbf{T} \cup \mathbf{V})$. A basic graph pattern (BGP) is a finite set of triple patterns. Finally, a graph pattern, Q, is an expression defined by the grammar

Q ::= BGP | Filter(P, F ) | Union(P1, P2) | Join(P1, P2) | Opt(P1, P2, F ),

where P, P1, and P2 are graph patterns and F is a filter expression. More details can be found in [3].

A SPARQL query (Q, V) is a graph pattern Q with a set of variables V that specifies the answer variables, i.e., the set of variables in Q whose values we are interested in. The values of variables are given by solution mappings, which are partial maps $s : \mathbf{V} \to \mathbf{T}$ with (possibly empty) domain dom(s). Here, following [9,15], we use the set-based semantics for SPARQL (rather than the bag-based one, as in the specification).

The SPARQL algebra operators are used to evaluate the different fragments of the SPARQL query. Given an RDF graph $G$, the answer to a graph pattern $Q$ over $G$ is the set $\llbracket Q \rrbracket_G$ of solution mappings defined by induction using the SPARQL algebra operators, starting from the base case: triple patterns. Due to space limitations, and since the entailment regime only modifies the SPARQL semantics for triple patterns, here we only show the definition for this basic case. We provide the complete definition in our technical report [3].

For a triple pattern $B$, $\llbracket B \rrbracket_G = \{ s : \mathrm{var}(B) \to \mathbf{T} \mid s(B) \subseteq G \}$, where $s(B)$ is the result of substituting each variable $u$ in $B$ by $s(u)$. This semantics is known as simple entailment. Given a set $V$ of variables, the answer to $(Q, V)$ over $G$ is the restriction $\llbracket Q \rrbracket_G|_V$ of the solution mappings in $\llbracket Q \rrbracket_G$ to the variables in $V$.
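The base case can be sketched in a few lines of Python. This is only a toy reading of simple entailment under the set-based semantics: terms are plain strings, variables are the strings starting with "?", and the sample graph and the function names are assumptions of this example.

```python
def is_variable(term):
    return term.startswith("?")


def evaluate_triple_pattern(pattern, graph):
    """Return all solution mappings s such that s(pattern) is a triple of graph."""
    solutions = []
    for triple in graph:
        s = {}
        matches = True
        for pattern_term, graph_term in zip(pattern, triple):
            if is_variable(pattern_term):
                # setdefault binds on first use and returns the earlier binding otherwise.
                if s.setdefault(pattern_term, graph_term) != graph_term:
                    matches = False
                    break
            elif pattern_term != graph_term:
                matches = False
                break
        if matches and s not in solutions:  # set semantics: no duplicates
            solutions.append(s)
    return solutions


def restrict(solutions, variables):
    """Restrict solution mappings to the answer variables, again without duplicates."""
    restricted = []
    for s in solutions:
        r = {v: s[v] for v in variables if v in s}
        if r not in restricted:
            restricted.append(r)
    return restricted


graph = {
    ("uri1(a1)", ":hasName", "A"),
    ("uri1(a2)", ":hasName", "B"),
    ("uri1(a1)", "rdf:type", ":Wellbore"),
}
answers = restrict(evaluate_triple_pattern(("?x", ":hasName", "?n"), graph), ["?x"])
```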

SPARQL Entailment Regime. We now present the standard W3C semantics for SPARQL queries over OWL 2 ontologies under different entailment regimes. We use the entailment regimes only to reason about individuals and, unlike [9], we do not allow for variables in triple patterns ranging over class and property names. We leave the problem of extending our results to handle this case as well for future work, but we do not expect it to present any major challenge.

We work with TBoxes expressed in the OWL 2 QL profile, which however may also contain owl:sameAs statements. Therefore, we consider two Direct Semantics entailment regimes for SPARQL queries, which differ in how they interpret owl:sameAs: the DL entailment regime (which defines $\models_{DL}$) interprets owl:sameAs internally, implicitly adding to the ontology $\mathcal{O}$ the axioms to handle equality, i.e., transitivity, symmetry, and reflexivity. Instead, the QL entailment regime (which defines $\models_{QL}$) interprets owl:sameAs as a standard object property, and hence does not assign to it any special semantics.

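Observe that a basic property of logical equality is that if a and b are equal, everything that holds for a should also hold for b, and vice versa. In the context of SPARQL, informally this means that given the answer $\llbracket B \rrbracket_{\mathcal{T}, G \cup A_S}$ to a triple pattern $B$, if the answer contains the solution mapping $s : v \mapsto o$ and $\mathcal{T} \models$ owl:sameAs$(o, o')$, then $\llbracket B \rrbracket_{\mathcal{T}, G \cup A_S}$ must also contain a solution mapping $s'$ that coincides with $s$ except that $s' : v \mapsto o'$. Formally, the answer $\llbracket B \rrbracket^{R}_{\mathcal{T}, G \cup A_S}$ to a BGP $B$ over an ontology $\mathcal{O}$ under entailment regime $R$ is defined as follows:

$\llbracket B \rrbracket^{R}_{\mathcal{O}} = \{ s : \mathrm{var}(B) \to \mathbf{T} \mid \mathcal{O} \models_{R} s(B) \}$.

Starting from $\llbracket B \rrbracket^{R}_{\mathcal{O}}$ and applying the SPARQL operators in $Q$, we compute the set $\llbracket Q \rrbracket^{R}_{\mathcal{O}}$ of solution mappings.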

3 Use Case and Motivating Example

In this section we briefly describe the real-world scenario we have examined at Statoil, and we illustrate the challenges it presents for OBDA with an example.

At Statoil, users access several databases on a daily basis, among them the Exploration and Production Data Store (EPDS), the Norwegian Petroleum Directorate (NPD) FactPages, and several OpenWorks databases. EPDS is a large Statoil-internal legacy SQL (Oracle 10g) database comprising over 1500 tables (some of them with up to 10 million tuples), 1600 views, and 700 GB of data. The NPD FactPages¹ is a dataset provided by the Norwegian government that contains information about the petroleum activities on the Norwegian continental shelf. The OpenWorks databases contain project data produced by geoscientists at Statoil. The information in these databases overlaps, and they often refer to the same entities (companies, wells, licenses) with different identifiers. In this use case the entity resolution problem has been solved, since the links between records are available.

The users at Statoil need to query the information about these objects, and get an answer in reasonable time, without worrying about the particular identifier used in each database. Thus, we assume that the SPARQL queries provided by the users will not contain owl:sameAs statements. The equality between identifiers should be handled internally by the OBDA system. To illustrate this we provide the following simplified example:

Example 1. Suppose we have three datasets (from now on D1, D2, D3) with wellbore² information, and a dataset D4 with information about companies and licenses, as illustrated in Figure 1. The wellbores in D1, D2, D3 are linked, but the companies in D4 are not linked with the other datasets. These four data sources are integrated virtually by topping them with an ontology. The ontology contains the concept Wellbore and the properties hasName, hasAlternativeName, and hasLicense.

D1             D2                   D3              D4
id1  Name      id2  Name  Well      id3  AName      id4  LName
a1   'A'       b1   null  1         c3   'U1'       9    'Z1'
a2   'B'       b2   'C'   2         c4   'U2'       8    'Z2'
a3   'H'       b6   'B'   3         c5   'U6'       7    'Z3'

Fig. 1. Wellbore datasets D1, D2, D3, and company dataset D4

¹ http://factpages.npd.no/

² A wellbore is a hole drilled for the purpose of exploration or extraction of natural resources.

The terms Wellbore and hasName are defined using D1 and D2. The property hasAlternativeName is defined using D3. The property hasLicense is defined over the isolated dataset D4. We assume that mappings for wellbores from $D_i$ use URI templates $uri_i$. In addition, we know that the wellbores are cross-linked between datasets as follows: wellbores a1, a2 in D1 are equal to b2, b1 in D2 and to c3, c4 in D3, respectively. In addition, a3 is equal to c5. These links are represented at the ontology level by owl:sameAs statements of the form: owl:sameAs(uri1(a1), uri2(b2)), owl:sameAs(uri2(b2), uri3(c3)), etc.

Consider now a user looking for all the wellbores and their names. According to the SPARQL entailment regime, the system should return all 12 combinations of equivalent ids and names ((uri1(a1), A), (uri2(b2), A), (uri3(c3), A), (uri1(a2), B), (uri2(b1), B), etc.), since all these tuples are entailed by the ontology and the data (cf. Section 2). Note that no wellbores from D4 are returned. □
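The count of 12 can be checked with a small Python sketch that groups the linked identifiers of Example 1 into equivalence classes and propagates every name to every member of the class. The string encoding of URIs (such as "uri1(a1)"), the union-find helper, and the function names are assumptions of this sketch, not part of the system described here.

```python
from collections import defaultdict


def equivalence_classes(links):
    """Group identifiers connected by links; returns the classes and a find function."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)
    classes = defaultdict(set)
    for x in list(parent):
        classes[find(x)].add(x)
    return classes, find


def entailed_names(name_triples, links):
    """Every name of a wellbore also holds for all identifiers equal to it."""
    classes, find = equivalence_classes(links)
    answers = set()
    for subject, name in name_triples:
        for member in classes.get(find(subject), {subject}):
            answers.add((member, name))
    return answers


# hasName is defined over D1 and D2; the null name of b1 yields no triple.
names = [("uri1(a1)", "A"), ("uri1(a2)", "B"), ("uri1(a3)", "H"),
         ("uri2(b2)", "C"), ("uri2(b6)", "B")]
links = [("uri1(a1)", "uri2(b2)"), ("uri2(b2)", "uri3(c3)"),
         ("uri1(a2)", "uri2(b1)"), ("uri2(b1)", "uri3(c4)"),
         ("uri1(a3)", "uri3(c5)")]

answers = entailed_names(names, links)
```

The class {a1, b2, c3} carries two names and contributes six answers, {a2, b1, c4} contributes three, {a3, c5} two, and the unlinked b6 one.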

The first issue in the context of OBDA is how to translate the user query into a query over the databases. Recall that owl:sameAs is not included in OWL 2 QL, and thus it is not handled by the current query translation and optimization techniques. If we solve the first issue by applying suitable constraints, we run into a second issue: how to minimize the negative impact on the query execution time when reasoning over cross-linked datasets. A third issue is how to check, for instance, whether hasName is a functional property considering the linked entities. A fourth issue is how to handle the multiplicity of equivalent answers required by the standard. For instance, in our example, it could in principle be enough to pick individuals with template uri1 as class representatives, and return only those triples. In the next sections we tackle all these issues in turn.

4 Handling owl:sameAs by SPARQL Query Rewriting

In this section we present the theoretical foundations for query answering over ontology-based integrated datasets. We also discuss how to perform consistency checking using this approach. We assume for now that the links are given in the form of owl:sameAs statements, and address later, in Section 5, the proper OBDA scenario, where links are not given between URIs, but between database records. Recall that owl:sameAs is not in the OWL 2 QL profile, and moreover, by adding the unrestricted use of owl:sameAs we lose first-order rewritability [1], since one can encode reachability in undirected graphs. This implies that, if we allow for the unrestricted use of owl:sameAs, we cannot offer a sound and complete translation of SPARQL queries into SQL.³

We present here an approach, based on partial materialization of inference, that in principle allows us to exploit a relational engine for query answering in the presence of owl:sameAs statements. This approach, however, is not feasible in practice, and we will then show in Section 5 how to develop it into a practical solution. Our approach is based on the simple observation that we can expand the set $A_S$ of owl:sameAs facts into the set $A^*_S$ obtained from $A_S$ by closing it under reflexivity, symmetry, and transitivity. Unlike other approaches based on (partial) materialization [8], we do not also expand the data triples (specifically, those in $G_{\mathcal{M},\mathcal{D}}$), but instead rewrite the input SPARQL query to guarantee completeness of query answering. We assume that user queries in general will not contain owl:sameAs statements, and therefore, for simplicity of presentation, we do not consider the case where they are present in the input. However, our approach can be easily extended to deal with owl:sameAs statements in user queries as well. Given a SPARQL query $(Q, V)$ over $(\mathcal{T}, G \cup A_S)$, we generate a new SPARQL query $(\varphi(Q), V)$ over $(\mathcal{T}, G \cup A^*_S)$ that returns the same answers as $(Q, V)$ over $(\mathcal{T}, G \cup A_S)$. This approach is very similar to the singularisation technique in [11]. The translation $\varphi(\cdot)$ is defined as follows.

Definition 1. Given a query (Q, V), the query (ϕ(Q), V) is obtained by replacing every triple pattern t in Q with ϕ(t), where:⁴

– ϕ({?v :P ?w}) = {?v owl:sameAs _:a . _:a :P _:b . _:b owl:sameAs ?w .}
– ϕ({?v rdf:type :C}) = {?v owl:sameAs _:a . _:a rdf:type :C .}
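The two cases of the translation can be read as a purely syntactic rewriting of triple patterns. The following Python sketch encodes triple patterns as 3-tuples of strings and generates fresh blank node labels with a counter; the tuple encoding, the label scheme `_:b0`, `_:b1`, ..., and the function names are assumptions made only for illustration.

```python
import itertools

SAME_AS = "owl:sameAs"
RDF_TYPE = "rdf:type"


def fresh_blank_nodes():
    """Endless supply of distinct blank node labels."""
    return (f"_:b{i}" for i in itertools.count())


def phi(pattern, fresh):
    """Rewrite one triple pattern following the two cases of Definition 1."""
    s, p, o = pattern
    if p == RDF_TYPE:
        a = next(fresh)
        return [(s, SAME_AS, a), (a, RDF_TYPE, o)]
    a, b = next(fresh), next(fresh)
    return [(s, SAME_AS, a), (a, p, b), (b, SAME_AS, o)]


def rewrite_bgp(bgp):
    """Apply phi to every triple pattern, never reusing a blank node label."""
    fresh = fresh_blank_nodes()
    return [t for pattern in bgp for t in phi(pattern, fresh)]


rewritten = rewrite_bgp([("?v", ":hasName", "?w"), ("?v", "rdf:type", ":Wellbore")])
```

Fresh labels per triple pattern matter here: reusing a label across patterns would force two unrelated existential variables to denote the same node.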

The following proposition states that answering SPARQL queries over a TBox $\mathcal{T}$ under the DL entailment regime can be reduced to answering SPARQL queries under the QL entailment regime (where owl:sameAs has no built-in semantics).

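Proposition 1. Given an OBDA setting $(\mathcal{T}, \mathcal{M}, \mathcal{D}, A_S)$ and a query $(Q, V)$, we have that

$\llbracket Q \rrbracket^{DL}_{\mathcal{T},\, G_{\mathcal{M},\mathcal{D}} \cup A_S}|_V = \llbracket \varphi(Q) \rrbracket^{QL}_{\mathcal{T},\, G_{\mathcal{M},\mathcal{D}} \cup A^*_S}|_V$.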

³ Using the linear recursion mechanism of SQL-99, a translation would be possible, but with a severe performance penalty for evaluating queries involving transitive closure.

⁴ Recall that terms of the form _:x are blank nodes that, when occurring in a query, correspond to existential variables.

Consistency Check: Ontology languages, such as OWL 2 QL, allow for the specification of constraints on the data. If the data exposed by the database through the mappings does not satisfy these constraints, then we say that the ontology is inconsistent with respect to the mappings and the data. OBDA allows one to check two types of constraints: (i) functionality of properties (although it cannot be expressed in OWL 2 QL), which imposes that an individual is connected to at most one element; (ii) disjointness of classes/properties, which cannot have (pairs of) individuals in common. In OBDA, consistency checking can be reduced to query answering [2]. This no longer holds in general when considering cross-linked datasets (where the unique name assumption, UNA, does not hold). For instance, suppose we want to check whether the property :hasName in Example 1 is functional. Clearly, without considering equality between datasets the property is functional; however, when we integrate the datasets it no longer is, since the graph contains (uri1(a1) :hasName 'A'), (uri2(b2) :hasName 'C'), and (uri1(a1) owl:sameAs uri2(b2)). This implies that the wellbore uri1(a1) has two names. Using the translation above we can extend the results in [2] for checking violations of class disjointness and of functionality of data and object properties, to account for owl:sameAs statements. For disjointness and functionality of data properties this is accomplished straightforwardly by the translation. Instead, for functionality of object properties, we need to modify the query used in [2] and explicitly incorporate the negation of owl:sameAs. For instance, to check whether functionality of the object property :isRelatedTo might be violated, we can check whether the following query returns a non-empty answer over $(\mathcal{T}, G \cup A^*_S)$:

SELECT ?x ?y1 ?y2 WHERE {
  ?x :isRelatedTo ?y1 .
  ?x :isRelatedTo ?y2 .
  FILTER(?y1 != ?y2 && NOT EXISTS { ?y1 owl:sameAs ?y2 })
}

If the answer is non-empty, the returned elements might witness the violation of functionality. Notice that, because of the open-world assumption (OWA), if two elements are not known to be equal, in general we cannot infer that they are not equal, and hence functionality might still hold in some models. We refer to [3] for more details.
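A toy version of this check can be written in Python by mapping every identifier to a representative of its owl:sameAs class and collecting, per subject, the values that are not known to be equal. The sketch assumes the equivalence classes are already available as sets of URI strings; the function names and the string encoding are hypothetical.

```python
def canonical(classes):
    """Map every member of an owl:sameAs class to one fixed representative."""
    return {member: min(cls) for cls in classes for member in cls}


def functionality_witnesses(triples, prop, classes):
    """Subjects (up to equality) with two values of prop not known to be equal."""
    rep = canonical(classes)
    values = {}
    for s, p, o in triples:
        if p == prop:
            # Literals and unlinked URIs are their own representatives.
            values.setdefault(rep.get(s, s), set()).add(rep.get(o, o))
    return {s: v for s, v in values.items() if len(v) > 1}


triples = [("uri1(a1)", ":hasName", "A"), ("uri2(b2)", ":hasName", "C")]

isolated = functionality_witnesses(triples, ":hasName", [])
linked = functionality_witnesses(triples, ":hasName", [{"uri1(a1)", "uri2(b2)"}])
```

For object properties the returned values are only candidate witnesses: two URIs that are not known to be equal may still be equal in some models.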

5 Handling Cross-Linked Datasets in Practice

We now deal with the proper case of querying cross-linked datasets, where we are given: (a) an OWL 2 QL TBox, (b) a collection of datasets, (c) a set of mappings, and (d) a set of linking tables⁵ stating equality between records in different datasets that represent the same entity. For simplicity, we can think of each dataset as corresponding to a different data source, but datasets could be decoupled from the actual physical data sources. In general, in different datasets, the same identifiers might be used to denote different objects, and the same objects might be denoted by different identifiers. Moreover, each dataset may contain data records belonging to different pairwise disjoint categories $C_1, \ldots, C_m$, for example wellbores or company names. A category corresponds to a set of records that can be mapped to individuals in the ontology belonging to the same TBox class (different from owl:Thing), and that could, in principle, be joined. For instance, cats and men belong to the same class mammal, but a cat can never be joined with a man, hence cats and men constitute two different categories. We assume that in addition to the datasets $D_1, \ldots, D_n$, for each category $C$ there is a database $D_C$ containing the linking tables for the records in $C$. Specifically, we denote a linking table for datasets $D_i$, $D_j$ and category $C$ with $L^C_{ij}(id_i, id_j)$. A tuple $(r_1, r_2)$ in $L^C_{ij}$ means that the record $r_1$ in $D_i$ represents the same object as the record $r_2$ in $D_j$. Notice that we do not assume that there is a linking table for each pair of datasets $D_i$, $D_j$ for each category $C$. The concepts above are illustrated in Figure 2. Our aim is to efficiently answer user SPARQL queries in this setting.

Fig. 2. Linking tables for the wellbores category

⁵ Note that these tables could be available virtually, and hence retrieved through queries.

The approach presented in the previous section is theoretical, and cannot be effectively applied in practice because: (1) it assumes that the links are given in the form of owl:sameAs statements, whereas in practice, in a cross-linked setting, they will be given as tables (with the results of the entity resolution process); and (2) it requires pre-computing a large number of triples (namely $A^*_S$) and materializing them into the ontology. Since these triples are not stored in the database, they cannot be efficiently retrieved using SQL. This negatively impacts the performance of query execution.

To tackle these problems, in this section we show how to: (a) expose, using mapping assertions that are optimization-friendly, the information in the tables expressing equality between DB records as a set $A_S$ of owl:sameAs statements; (b) extend the mappings so as to also encode transitivity and symmetry (but not reflexivity), and hence expose the symmetric transitive closure $A^+_S$ of $A_S$; (c) modify the query-rewriting algorithm (cf. Definition 1) so as to return sound and complete answers over the (virtual) ontology extended with $A^+_S$. We now detail the above steps.

(a) Generating $A_S$: We now present a set of constraints on the structure of the linking tables that are fully compatible with real-world requirements, and that allow us to process queries efficiently, as we will show below:

1. All the information about which objects of category $C$ are linked in datasets $D_i$ and $D_j$ is contained in $L^C_{ij}$. Formally: if there are tables $L^C_{ij}$, $L^C_{ik}$, and $L^C_{kj}$, then $L^C_{ij}$ contains all the tuples in $\pi_{id_i, id_j}(L^C_{ik} \bowtie L^C_{kj})$, when evaluated over $D_C$.

2. Linking tables cannot state equality between different elements in the same dataset.⁶ Formally: there is no join of the form $L^C_{ik} \bowtie \cdots \bowtie L^C_{ni}$ such that $(o, o')$, with $o \neq o'$, occurs in $\pi_{L^C_{ik}.id_i,\, L^C_{ni}.id_i}(L^C_{ik} \bowtie \cdots \bowtie L^C_{ni})$, when evaluated over $D_C$.

⁶ Observe that this amounts to making the Unique Name Assumption for the objects retrieved by the mappings from one dataset.

L1,2           L2,3           L1,3
id1  id2       id2  id3       id1  id3
a1   b2        b1   c4        a1   c3
a2   b1        b2   c3        a2   c4
                              a3   c5

Fig. 3. Linking Tables

Example 2 (Categories). Consider Example 1. Here we consider only wellbores, therefore we have a single category $C_{wb}$ with three linking tables $L^{C_{wb}}_{12}$, $L^{C_{wb}}_{23}$, and $L^{C_{wb}}_{13}$, as shown in Figure 3. From the constraints above we know that $\pi_{id_1, id_3}(L^{C_{wb}}_{12} \bowtie L^{C_{wb}}_{23})$ is contained in $L^{C_{wb}}_{13}$, when both are evaluated over $D_{C_{wb}}$. □
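The containment required by the first constraint can be checked directly on the three linking tables for the wellbore category. The Python sketch below represents each table as a set of pairs; the variable and function names are assumptions of the example.

```python
def project_join(left, right):
    """Join two binary tables on the shared middle column and keep the outer columns."""
    return {(a, c) for (a, b) in left for (b2, c) in right if b == b2}


def satisfies_constraint_1(l_ik, l_kj, l_ij):
    """L_ij must contain every pair derivable by joining L_ik with L_kj."""
    return project_join(l_ik, l_kj) <= l_ij


# Linking tables of the wellbore category.
L12 = {("a1", "b2"), ("a2", "b1")}
L23 = {("b1", "c4"), ("b2", "c3")}
L13 = {("a1", "c3"), ("a2", "c4"), ("a3", "c5")}

derived = project_join(L12, L23)
holds = satisfies_constraint_1(L12, L23, L13)
```

Note that the containment may be strict: the pair (a3, c5) is in the direct table but cannot be derived through D2.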

A key factor that affects the performance of the overall OBDA system is the form of the mappings, which includes the structure of the URI templates used to generate the URIs. Here, we discuss how the part of the mappings (including URI templates) that deals with linking tables should be designed so that this approach scales up. The SPARQL-to-SQL translation must add all the SQL queries defining owl:sameAs. However, as shown in Section 6, we exploit our URI design to (intuitively) remove as many owl:sameAs SQL definitions as possible before query execution.

We propose here to use a different URI template $uri_{C,D}$ for each pair constituted by a category $C$ and a dataset $D$.⁷ Observe that this design decision is quite natural, since objects belonging to different categories should not join, even if in some dataset they are identified in the same way. For example, wellbore no. 25 should not be confused with the employee whose id is 25.

Next we generate the set of equalities $A_S$ by extending the set of mappings $\mathcal{M}$, using a different URI template for each pair (category $C$, dataset $D$). More precisely, to generate $A_S$ out of the categories $C_1, \ldots, C_n$, $\mathcal{M}$ is extended with mappings as follows. For each category $C$ and each linking table $L^C_{ij}$, we extend $\mathcal{M}$ with:

$uri_{C,D_i}(\{id_i\})$ owl:sameAs $uri_{C,D_j}(\{id_j\})$ ← select * from $L^C_{ij}$    (1)

When the category $C$ is clear from the context, we write $uri_i$ to denote $uri_{C,D_i}$.

Example 3 (Mappings). To generate the owl:sameAs statements from the tables in Example 2, we extend our set of mappings $\mathcal{M}$ with the following mappings (fragment):

$uri_1(\{id_1\})$ owl:sameAs $uri_2(\{id_2\})$ ← SELECT * FROM $L^C_{1,2}$

$uri_2(\{id_2\})$ owl:sameAs $uri_3(\{id_3\})$ ← SELECT * FROM $L^C_{2,3}$

Observe that this also implies that, to populate the concept Wellbore with elements from $D_1$, the mappings in $\mathcal{M}$ will have to use the URI template $uri_1$. □

⁷ In the special case where there are several datasets that can be mapped to use common URIs, there is no need for linking tables or any of the techniques presented in this paper. We address the more general case, where this does not hold.

Considering that the same URIs in different triples of the virtual RDF graph can be generated from different mapping assertions, we observe that the form of the templates in the mappings related to linking tables will also affect those in the remaining mapping assertions in the OBDA system.

(b) Approximating $A^+_S$: To be able to rewrite SPARQL queries into SQL without adding $A^*_S$ as facts in the ontology (relying only on the databases), we embed the owl:sameAs axioms together with the axioms for symmetry and transitivity into the mappings, that is, we extend the notion of $\mathcal{T}$-mappings [14] ($\mathcal{T}$ stands for terminology). Intuitively, $\mathcal{T}$-mappings embed the consequences of an OWL 2 QL ontology into the mappings. This allows us to drop the implicit axioms for symmetry and transitivity from the TBox $\mathcal{T}$.

For each category $C$ and for each set of non-empty tables $L^C_{i_1,i_2}, L^C_{i_2,i_3}, \ldots, L^C_{i_{n-1},i_n}$, if $L^C_{i_1,i_n}$ does not exist, we include the following transitivity mappings in $\mathcal{M}$:

$t_1(\{id_1\})$ owl:sameAs $t_n(\{id_n\})$ ← select * from $L^C_{i_1,i_2} \bowtie \cdots \bowtie L^C_{i_{n-1},i_n}$    (2)

and for each of the owl:sameAs mappings described in (1) and (2) we include the following symmetry mappings in $\mathcal{M}$:

$t_j(\{id_j\})$ owl:sameAs $t_i(\{id_i\})$ ← select * from $sql_{ij}$    (3)

We call the resulting set of mappings $\mathcal{M}_S$.
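To see what the extended mappings expose, the following Python sketch evaluates mappings (1) to (3) over in-memory linking tables instead of generating SQL. It is a simplified reading under several assumptions: tables are sets of pairs keyed by dataset indices, URIs are encoded as strings such as "uri1(a1)", chains follow the stored orientation of the tables, and all function names are hypothetical.

```python
def join(left, right):
    """Join two binary tables on the shared column and keep the outer columns."""
    return {(a, c) for (a, b) in left for (b2, c) in right if b == b2}


def table_paths(tables):
    """Yield dataset paths i1, ..., in (n >= 2) that follow existing linking tables."""
    def extend(path):
        for (i, j) in tables:
            if i == path[-1] and j not in path:
                yield path + [j]
                yield from extend(path + [j])
    for start in sorted({i for (i, _) in tables}):
        yield from extend([start])


def same_as_pairs(tables):
    """owl:sameAs pairs exposed by mappings (1), (2), and (3)."""
    pairs = set()
    for path in table_paths(tables):
        first, last = path[0], path[-1]
        if len(path) > 2 and (first, last) in tables:
            continue  # a direct table exists, so no transitivity mapping is added
        rows = tables[(path[0], path[1])]
        for a, b in zip(path[1:], path[2:]):
            rows = join(rows, tables[(a, b)])
        # (1) for single tables, (2) for chains; one URI template per dataset.
        pairs |= {(f"uri{first}({x})", f"uri{last}({y})") for (x, y) in rows}
    # (3): symmetry for every pair produced above.
    return pairs | {(y, x) for (x, y) in pairs}


L12 = {("a1", "b2"), ("a2", "b1")}
L23 = {("b1", "c4"), ("b2", "c3")}
L13 = {("a1", "c3"), ("a2", "c4"), ("a3", "c5")}

all_tables = same_as_pairs({(1, 2): L12, (2, 3): L23, (1, 3): L13})
without_l13 = same_as_pairs({(1, 2): L12, (2, 3): L23})
```

With all three tables present, only mappings (1) and (3) fire; dropping the direct table between D1 and D3 makes the chain through D2 supply the missing pairs. No reflexive pair is ever produced.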

(c) Rewriting the query Q: Encoding reflexivity would be extremely detrimental for performance, not only because of the large number of extra mappings we would have to consider, but also because it would render the optimizations explained in the next sections ineffective. Intuitively, the reason for this is that while symmetry and transitivity affect only elements that are linked to other datasets, reflexivity affects all the objects in the OBDA setting. Thus, during the query transformation process we would not be able to distinguish which classes and properties actually deal with linked objects (and should be rewritten) and which do not. Therefore, we modify the query-rewriting technique to keep soundness and completeness with respect to the DL entailment regime while evaluating the query under the QL entailment regime over $(\mathcal{T}, \mathcal{M}_S, \mathcal{D})$.

We modify the query translation as follows:

Definition 2 ((ϕ(Q), V)). Given a query (Q, V), the query (ϕ(Q), V) is obtained by replacing every triple pattern t in Q with ϕ(t), where ϕ({?v :P ?w}) is shown in Fig. 4 (A) and ϕ({?v rdf:type :C}) is shown in Fig. 4 (B).


{ ?v :P ?w . }
UNION { ?v owl:sameAs _:z1 . _:z1 :P ?w . }
UNION { ?v :P _:z2 . _:z2 owl:sameAs ?w . }
UNION { ?v owl:sameAs _:a . _:b owl:sameAs ?w . _:a :P _:b . }

(A)

{ ?v rdf:type :C . }
UNION { ?v owl:sameAs [ rdf:type :C ] . }

(B)

Fig. 4. SPARQL translation to handle owl:sameAs without reflexivity
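The effect of the four union branches in Fig. 4 (A) can be tried out on the running example with a small Python sketch. It evaluates each branch as a BGP over a graph containing the hasName triples and the symmetric transitive (but not reflexive) owl:sameAs pairs. The naive matcher, the string encoding of terms, and all names below are assumptions of this example and not the SQL-based evaluation used by an OBDA system.

```python
SAME_AS = "owl:sameAs"


def phi_property(v, p, w):
    """The four union branches of Fig. 4 (A), each a list of triple patterns."""
    return [
        [(v, p, w)],
        [(v, SAME_AS, "_:z1"), ("_:z1", p, w)],
        [(v, p, "_:z2"), ("_:z2", SAME_AS, w)],
        [(v, SAME_AS, "_:a"), ("_:b", SAME_AS, w), ("_:a", p, "_:b")],
    ]


def is_var(term):
    # Blank nodes in a query behave like existential variables.
    return term.startswith("?") or term.startswith("_:")


def unify(pattern_term, graph_term, binding):
    if is_var(pattern_term):
        return binding.setdefault(pattern_term, graph_term) == graph_term
    return pattern_term == graph_term


def match_bgp(bgp, graph, binding=None):
    """Enumerate all bindings that map every pattern of the BGP into the graph."""
    binding = binding or {}
    if not bgp:
        yield binding
        return
    for triple in graph:
        new = dict(binding)
        if all(unify(p, t, new) for p, t in zip(bgp[0], triple)):
            yield from match_bgp(bgp[1:], graph, new)


names = [("uri1(a1)", ":hasName", "A"), ("uri1(a2)", ":hasName", "B"),
         ("uri1(a3)", ":hasName", "H"), ("uri2(b2)", ":hasName", "C"),
         ("uri2(b6)", ":hasName", "B")]
classes = [["uri1(a1)", "uri2(b2)", "uri3(c3)"],
           ["uri1(a2)", "uri2(b1)", "uri3(c4)"],
           ["uri1(a3)", "uri3(c5)"]]
# Symmetric transitive closure without reflexive pairs.
same_as = [(x, SAME_AS, y) for cls in classes for x in cls for y in cls if x != y]
graph = names + same_as

answers = {
    (b["?v"], b["?w"])
    for branch in phi_property("?v", ":hasName", "?w")
    for b in match_bgp(branch, graph)
}
```

On this data the first branch contributes the five stored names and the second branch the seven names inherited through equality; the last two branches match nothing because the objects are literals.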

Intuitively, following up on our running example, the first BGP in Fig. 4 (A) gets all triples such as (uri1(a1), :hasName, A) that do not need equality reasoning. The second BGP gets triples such as (uri1(a1), :hasName, C), which require owl:sameAs(uri1(a1), uri2(b2)). The last two BGPs are used only for object properties; they handle the cases where equality reasoning is needed for the object (?w).

Recall that we do not allow owl:sameAs in the user query language. Therefore the user will not be able to query ?x owl:sameAs ?x. In principle, we could also move transitivity and symmetry to the query, but this would not reduce the SQL query rewriting.

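Theorem 1. Given an OBDA setting (T, AS, M, D) and a query (Q, V), we have that

$$\llbracket Q \rrbracket^{DL}_{\mathcal{T},\, G_{\mathcal{M},D} \cup \mathcal{A}_S}\big|_V \;=\; \llbracket \varphi(Q) \rrbracket^{QL}_{\mathcal{T},\, G_{\mathcal{M}_S,D}}\big|_V .$$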

6 Optimization

The technique presented in Section 5 can cause excessive overhead on the query size and therefore on the query execution time, since it has to extend every triple pattern with owl:sameAs statements. In this section we show how to remove the owl:sameAs statements that do not contribute to the answer. For instance, in our running example the property hasLicense is defined over the companies in D4, which are not linked with the other 3 databases. Thus, the owl:sameAs statements should not contribute to “populate” this property.

To translate SPARQL to SQL, in the literature [15] and in the implementation, we encode the SPARQL algebra tree as a logic program. Intuitively, each SPARQL operator is represented by a rule in the program, as illustrated in Example 4. The translation algorithm employs a well-known process in Logic Programming called partial evaluation [10]. Intuitively, the partial evaluation of a SPARQL query Q (represented as a logic program) is another query Q′ that represents the partial execution of Q. This process iterates over the structure of the query and specializes it, going from the highly abstract query to the concrete SQL query over the database. It starts by replacing the atoms that correspond to leaves in the algebra tree (triple patterns) with the union of all their definitions in the mappings, and then it iterates over the remaining atoms, trying to replace them by their definitions. This procedure is done without executing any SQL query over the databases.


Select * WHERE {?v :hasLicense ?w .}

(A)

Select * WHERE {{?v :hasLicense ?w .} UNION {?v owl:sameAs [ :hasLicense ?w ] . } }

(B)

Fig. 5. Optimizable Queries

We detect and remove owl:sameAs statements that do not contribute to the answer using this procedure. It is critical to notice that this optimization can be performed because we intentionally added two constraints: (i) we disallow mappings modeling reflexivity; and (ii) we force unique URIs for each pair of category/database. We illustrate this optimization in the following example.

Example 4 (Companies). Consider the query asking for the list of companies and licenses shown in Figure 5 (A). This query is translated into the query (fragment) shown in Figure 5 (B). Since we know that only wellbores are linked across the different datasets, it is clear that there is no need for owl:sameAs statements (nor unions) in this query. In the following, we show how the system partially evaluates the query to remove such a pointless union. The translated query is represented as the following program encoding the SPARQL algebra tree:

(1) answer(v,w) ← union(v,w)
(2) union(v,w) ← bgp1(v,w)
(3) bgp1(v,w) ← hasLicense(v,w)
(4) union(v,w) ← bgp2(v,w)
(5) bgp2(v,w) ← owl:sameAs(v,x), hasLicense(x,w)

The next step is to replace the leaves of the SPARQL tree (the triple patterns owl:sameAs and hasLicense) with their definitions (fragment without including transitivity and symmetry):

(6) hasLicense(uri4(v),uri4(w)) ← sql(v,w)
(7) owl:sameAs(uri1(v),uri2(x)) ← T12(v,x)
(8) owl:sameAs(uri2(v),uri3(x)) ← T23(v,x)
(9) owl:sameAs(uri1(v),uri3(x)) ← T13(v,x)

Thus, the system tries to replace hasLicense(x,w) in (5) by its definition in (6), and analogously owl:sameAs in (5) by the union of (7)-(9). Using partial evaluation, the system will first try to unify the head of (6) with hasLicense in (5). The result is:

(5′) bgp2(v, uri4(w)) ← owl:sameAs(v, uri4(x)), sql(x, w)

In the next step, the algorithm will try to unify the owl:sameAs in (5′) with the head of at least one of the rules (7), (8), (9) (if all matched, it would add the union of the three). Given that the URI template (represented as a function) uri4 does not occur in any of these rules, the whole branch will be removed. The resulting program is:

(1) answer(v,w) ← union(v,w)
(2) union(v,w) ← bgp1(v,w)
(3) bgp1(v,w) ← hasLicense(v,w)
(6) hasLicense(uri4(v),uri4(w)) ← sql(v,w)

This query without owl:sameAs overhead is now ready to be translated into SQL.

This process will also take care of eliminating unnecessary SQL queries used to define owl:sameAs. For instance, if the user queries for wellbores, it will remove all the SQL queries used for linking company names. This is why we require a unique URI for each pair category/dataset.
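The pruning step of Example 4 can be illustrated with a deliberately simplified sketch. It assumes that each argument of an owl:sameAs atom is either an unconstrained variable (written `None`) or is known to be built with a single URI template identified by name; the function names are hypothetical, and real partial evaluation performs full first-order unification rather than this name comparison.

```python
def unifies(term_template, head_template):
    """A variable (None) unifies with anything; templates must coincide."""
    return term_template is None or term_template == head_template


def matching_sameas_heads(subject_template, object_template, sameas_heads):
    """Return the owl:sameAs mapping heads that can still define the atom.

    An empty result means the branch containing the atom can be removed.
    """
    return [
        (s, o) for (s, o) in sameas_heads
        if unifies(subject_template, s) and unifies(object_template, o)
    ]


# Heads of rules (7), (8), (9): owl:sameAs(uri_i(v), uri_j(x)).
heads = [("uri1", "uri2"), ("uri2", "uri3"), ("uri1", "uri3")]

# Atom owl:sameAs(v, uri4(x)) obtained after unifying with rule (6).
survivors_license = matching_sameas_heads(None, "uri4", heads)

# For comparison: an atom whose object is built with uri2.
survivors_linked = matching_sameas_heads(None, "uri2", heads)
```

In the first call no head matches, which corresponds to dropping the bgp2 branch; in the second call only the head of rule (7) survives, so only the linking table T12 would remain in the SQL.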

7 Experiments

In this section we present a set of experiments evaluating the performance of queries over cross-linked datasets. We integrated EPDS and the NPD fact pages at Statoil, extending the existing ontology and the set of mappings, and creating the linking tables. We ran 22 queries covering real information needs of end-users over this integrated OBDA setting. Since EPDS is a production server with confidential data whose load changes constantly, and since the OBDA setting is too complex to isolate different features of this approach, we also created a controlled OBDA environment on our own server to perform a careful study of our technique. In addition, we exported the triples of this controlled environment and loaded them into the commercial triple store Stardog⁸ (v3.0.1).

To perform the controlled experiments, we set up an OBDA cross-linked environment based on the Wisconsin Benchmark [5].⁹ The Wisconsin benchmark was designed for the systematic evaluation of database performance with respect to different query characteristics. It comes with a schema that is designed so one can quickly understand the structure of each table and the distribution of each attribute value. This allows easy construction of queries that isolate the features that need to be tested. The schema can be used to instantiate multiple tables. These tables, which we now call “Wisconsin tables”, contain 16 attributes, and a primary key.

Observe that Ontop does not perform SQL federation, therefore it usually relies on systems such as Teiid¹⁰ or EXAREME [17] (a.k.a. ADP) to integrate multiple databases. These systems expose to Ontop a set of tables coming from the different databases. Thus, to mimic this scenario we created a single database with 10 tables: 4 Wisconsin tables, representing different datasets, and 6 linking tables. Each Wisconsin table contains 100M rows, the 6 tables occupied ca. 100GB of disk space, exposing more than 1.8B triples.

⁸ http://stardog.com
⁹ All the material to reproduce the experiments can be found online: https://github.com/ontop/ontop-examples/tree/master/iswc-crosslinked
¹⁰ http://teiid.jboss.org



Fig. 6. Worst Execution Time including fetching time - 2 linked-DS (left) and 3 linked-DS (right)

The following experiments evaluate the overhead of equality reasoning when answering SPARQL queries. The variables we considered are: (i) Number of SPARQL joins (1-4); (ii) Number and type of properties (0-4, data/object); (iii) Number of linked datasets (2-3); (iv) Selectivity of the query (0.001%, 0.01%, 0.1%); (v) Number of equal objects between datasets (10%, 30%, 60%). In total we ran 1332 queries. The SPARQL queries have the following template:

SELECT * WHERE {
  ?x rdf:type :Class_i .                                      // i = 1..4
  ?x :DataProperty_{j−1} ?y1 . ?x :DataProperty_j ?y2 .       // j = 0..4
  ?x :ObjectProperty_{k−1} ?z1 . ?x :ObjectProperty_k ?z2 .   // k = 0..4
  Filter( ?y < k% ) }

where a 0 or negative subindex means that the property is not present in the query. When we evaluated 2 datasets we included equalities between elements of the classes A1 and A2. When we evaluated 3 datasets the equality was between A1, A2 and A4. The class A3 and the properties S3 and R3 are isolated. We group the queries in 9 groups: (G1) No properties (c), (G2) 1 d. prop. 0 obj. prop. (1d), (G3) 0 d. prop. 1 obj. prop. (1o), . . . , (G9) 2 d. prop. 2 obj. prop. (2d2o).

The average start-up time is ≈5 seconds. Observe that SPARQL engines based on materialization can take hours to start up with OWL-DL ontologies [9]. The results are summarized in Figure 6. We show the worst execution time in each group, including the time that it takes to fetch the results.

Discussion: The results confirm that reasoning over OBDA-based integrated data has a high cost, but this cost is not prohibitive. The execution times at Statoil range from 3.2 seconds to 12.8 minutes, with mean 53 secs and median 8.6 secs. An overview of the execution times is shown in Fig. 7. The most complex query had 15 triple patterns, using object and data properties coming from both data sources.

[Figure 7: bar chart of the number of queries (y-axis, 0 to 8) per execution-time range (x-axis: 0−10 secs, 10−30 secs, 30−60 secs, 1−5 mins, 5−20 mins).]

Fig. 7. Overview of query execution times for tests on EPDS at Statoil.

In the controlled environment, in the 2 linked-datasets scenario, with 120M equal objects (60%), even in the worst case most of the queries run in ≈ 5 min. The query that performs the worst in this setting (4 joins, 2 data properties, 2 object properties, 10^5 selectivity) returns 480.000 results, and takes ≈ 13 min. When we move to the 3 linked-datasets scenario, most executions (again worst time in every group) take around 15 min. In this case, the worst query in G9 takes around 1.5 hours and returns 1.620.000 results. One can see that the number of linked datasets is the variable that impacts query performance the most. The second variable is the number of object properties, since their translation is more complex than the one for data properties. The third variable is the selectivity. It is worth noticing that these results measure an almost pathological case, taking the system to its very limit. In practice, it is unlikely that 60% of all the objects of a 300M integrated dataset will be equal and belong to the same category. Recall that if they are not in the same category, the optimization presented in Section 6 removes the unnecessary SQL subqueries. For instance, in the integration of EPDS and NPD there are fewer than 10.000 equal wellbores and there are millions of objects of different categories. Moreover, even 1.5 hours is a reasonable time. Recall that Statoil users required weeks to get an answer for this sort of query.

Because of the partial evaluation-based optimizations proposed in Section 6, with 2 datasets 30 out of 48 queries (52 out of 100 with 3 datasets) get optimized and executed in a few milliseconds. These queries are the ones that join elements in A1, A2, A4 (3 datasets) with A3, S3 and R3 elements. Since there is no equality between these elements, neither through owl:sameAs nor with standard equality, the SPARQL translation produces an empty SQL query, so no SQL query gets executed and 0 answers are returned automatically.

To load the data into Stardog we used Ontop to materialize the triples. The materialization took 11 hours, and it took another 4 hours to load the triples into Stardog. The default semantics that Stardog gives to owl:sameAs is not compliant with the official OWL semantics since “Stardog designates one canonical individual for each owl:sameAs equivalence set”; however, one can force Stardog to consider all the URIs in the equivalence set. Our experiments show that Stardog does not behave according to the claimed semantics. Details can be found in [3].

8 Related Work

The treatment of owl:sameAs in reasoning and query evaluation has received considerable interest in recent years. After all, many data sources in the Linked Open Data (LOD) cloud give owl:sameAs links to equivalent URIs, so it would be desirable to use them. Surprisingly, to the best of our knowledge, there has been no attempt to extend OBDA to take into account owl:sameAs. Next we discuss several approaches that handle owl:sameAs through rewriting.


Balloon Fusion [16] is a line of work that attempts to make use of owl:sameAs information in the LOD cloud for query answering. The approach is similar to ours in that it is based on rewriting a query to take into account equality inferences before executing it. The treatment of owl:sameAs is semantically very incomplete, however, since the rewriting only applies to URIs stated explicitly in the query. No equality reasoning is applied to the variables in the query, which is a main point of our work.

The question of equality handling becomes quite different in nature in the context of a single data store that is already in triple format. Equality can then be handled essentially by rewriting equal URIs to one common representative. E.g., [13] report on doing this for an in-memory triple store, while simultaneously saturating the data with respect to a set of forward chaining inference rules. Observe that in many scenarios (such as the Statoil scenario discussed here) this approach is not possible, both because the data would have to be moved from the original source and because of the amount of data that would have to be loaded into memory. In a query-rewriting OBDA setting, this corresponds to the idea of making sure that mappings will map equivalent entities from several sources to the same URI, which is often not practical or even impossible.
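The idea of rewriting equal URIs to one common representative can be illustrated with a generic union-find sketch over a small in-memory set of triples. This is not the algorithm of [13]; the function names, the example URIs, and the choice of the lexicographically smallest URI as representative are assumptions made only for illustration.

```python
def find(parent, uri):
    """Return the representative of `uri`, registering it on first use."""
    parent.setdefault(uri, uri)
    while parent[uri] != uri:
        parent[uri] = parent[parent[uri]]  # path halving keeps chains short
        uri = parent[uri]
    return uri


def canonicalize(triples, same_as_links):
    """Replace every subject and object by the representative of its equality class."""
    parent = {}
    for a, b in same_as_links:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            # Deterministic choice: the lexicographically smallest URI wins.
            parent[max(ra, rb)] = min(ra, rb)
    return {(find(parent, s), p, find(parent, o)) for s, p, o in triples}


triples = {("uri1/a1", ":hasName", "A"), ("uri2/b2", ":hasName", "C")}
links = [("uri1/a1", "uri2/b2")]
merged = canonicalize(triples, links)
```

After the rewrite both names are attached to the single representative `uri1/a1`, so no equality reasoning is needed at query time; the price is that all data must be available in one store.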

Our approach is only valid when the links between records really mean semantic identity. When the links are uncertain, query answering then requires the use of probabilistic database methods, as discussed e.g. in [7] for a limited type of queries. Extending these methods to handle arbitrary SPARQL-style queries is not trivial.

9 Conclusions

In this paper we showed how to represent links over databases as owl:sameAs statements, and we proposed a mapping-based framework that carefully constructs owl:sameAs statements to minimize the performance impact of equality reasoning. To recover rewritability of SPARQL into SQL we imposed a suitable set of restrictions on the linking mechanisms that are fully compatible with real-world requirements and that, together with the owl:sameAs mappings, make it possible to do the SPARQL-to-SQL translation. We showed how to answer SPARQL queries over cross-linked datasets using query transformation, and how to optimize the translation to improve the performance of the produced SQL query. To empirically support this claim, we provided an extensive set of experiments over real enterprise data, and also in a controlled environment.

Acknowledgments. This paper is supported by the EU under the large-scale integrating project (IP) Optique (Scalable End-user Access to Big Data), grant agreement n. FP7-318338.


References

1. Artale, A., Calvanese, D., Kontchakov, R., Zakharyaschev, M.: The DL-Lite family and relations. J. of Artificial Intelligence Research 36, 1–69 (2009)

2. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. Autom. Reasoning 39(3), 385–429 (2007)

3. Calvanese, D., Giese, M., Hovland, D., Rezk, M.: Ontology-based integration of cross-linked datasets (2015). http://www.inf.unibz.it/~mrezk/pdf/techRep-ISWC15.pdf (accessed April 30, 2015)

4. Das, S., Sundara, S., Cyganiak, R.: R2RML: RDB to RDF mapping language. W3C Recommendation, W3C (September 2012). http://www.w3.org/TR/r2rml/

5. DeWitt, D.J.: The Wisconsin benchmark: past, present, and future. In: Gray, J. (ed.) The Benchmark Handbook. Morgan Kaufmann (1992)

6. Doan, A., Halevy, A.Y., Ives, Z.G.: Principles of Data Integration. Morgan Kaufmann (2012)

7. Ioannou, E., Nejdl, W., Niederee, C., Velegrakis, Y.: On-the-fly entity-aware query processing in the presence of linkage. PVLDB 3(1), 429–438 (2010)

8. Kontchakov, R., Lutz, C., Toman, D., Wolter, F., Zakharyaschev, M.: The combined approach to ontology-based data access. In: Proc. of IJCAI 2011, pp. 2656–2661 (2011)

9. Kontchakov, R., Rezk, M., Rodríguez-Muro, M., Xiao, G., Zakharyaschev, M.: Answering SPARQL queries over databases under OWL 2 QL entailment regime. In: Mika, P., et al. (eds.) ISWC 2014, Part I. LNCS, vol. 8796, pp. 552–567. Springer, Heidelberg (2014)

10. Lloyd, J.W.: Foundations of Logic Programming, 2nd edn. Springer-Verlag New York Inc, Secaucus (1993)

11. Marnette, B.: Generalized schema-mappings: from termination to tractability. In: PODS 2009, pp. 13–22. ACM, New York (2009)

12. Motik, B., Cuenca Grau, B., Horrocks, I., Wu, Z., Fokoue, A., Lutz, C.: OWL 2 Web Ontology Language profiles, 2nd edn. W3C Recommendation, W3C (December 2012). http://www.w3.org/TR/owl2-profiles/

13. Motik, B., Nenov, Y., Piro, R.E.F., Horrocks, I.: Handling owl:sameAs via rewriting. In: Bonet, B., Koenig, S. (eds.) Proc. 29th AAAI, pp. 231–237. AAAI Press (2015)

14. Rodríguez-Muro, M., Kontchakov, R., Zakharyaschev, M.: Ontology-based data access: Ontop of databases. In: Alani, H., et al. (eds.) ISWC 2013, Part I. LNCS, vol. 8218, pp. 558–573. Springer, Heidelberg (2013)

15. Rodríguez-Muro, M., Rezk, M.: Efficient SPARQL-to-SQL with R2RML mappings. J. of Web Semantics 33, 141–169 (2015)

16. Schlegel, K., Stegmaier, F., Bayerl, S., Granitzer, M., Kosch, H.: Balloon fusion: SPARQL rewriting based on unified co-reference information. In: Proc. of the 30th Int. Conf. on Data Engineering Workshops (ICDE 2014), pp. 254–259. IEEE (2014)

17. Tsangaris, M.M., Kakaletris, G., Kllapi, H., Papanikos, G., Pentaris, F., Polydoras, P., Sitaridi, E., Stoumpos, V., Ioannidis, Y.E.: Dataflow processing and optimization on grid and cloud infrastructures. IEEE Bull. on Data Engineering 32(1), 67–74 (2009)
