Relational and XML Data Exchange

8/8/2019 Relational and XML Data Exchange

1/112

Relational and XMLData Exchange


2/112

Copyright 2010 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or trany form or by any meanselectronic, mechanical, photocopy, recording, or any other except for briefprinted reviews, without the prior permission of the publisher.

Relational and XML Data ExchangeMarcelo Arenas, Pablo Barcel, Leonid Libkin, and Filip Murlak

www.morganclaypool.com

ISBN: 9781608454112 paperback ISBN: 9781608454129 ebook

DOI 10.2200/S00297ED1V01Y201008DTM008

A Publication in the Morgan & Claypool Publishers seriesSYNTHESIS LECTURES ON DATA MANAGEMENT

Lecture #8Series Editor: M.Tamer zsu,University of WaterlooSeries ISSNSynthesis Lectures on Data ManagementPrint 2153-5418 Electronic 2153-5426
http://www.morganclaypool.com/http://www.morganclaypool.com/


3/112

Synthesis Lectures on DataManagement

EditorM.Tamerzsu,University of WaterlooSynthesis Lectures on Data Management is edited by Tamer zsu of the University of Waterlooseries will publish 50- to 125 page publications on topics pertaining to data management.The sc will largely follow the purview of premier information and computer science conferences, such SIGMOD,VLDB, ICDE, PODS, ICDT, and ACM KDD. Potential topics include, but not arelimited to: query languages, database system architectures, transaction management, data wareho XML and databases, data stream systems, wide scale data distribution, multimedia data managemdata mining, and related subjects.

Relational and XML Data ExchangeMarcelo Arenas, Pablo Barcel, Leonid Libkin, and Filip Murlak 2010

Database ReplicationBettina Kemme, Ricardo Jimnez Peris, and Marta Patio-Martnez 2010

User-Centered Data Management Tiziana Catarci, Alan Dix, Stephen Kimani, and Giuseppe Santucci2010

Data Stream ManagementLukasz Golab and M.Tamer zsu2010

Access Control in Data Management SystemsElena Ferrari2010

An Introduction to Duplicate DetectionFelix Naumann and Melanie Herschel2010


4/112

iv

Privacy-Preserving Data Publishing: An Overview Raymond Chi-Wing Wong and Ada Wai-Chee Fu2010

Keyword Search in Databases Jeffrey Xu Yu, Lu Qin, and Lijun Chang2009


5/112

Relational and XMLData Exchange

Marcelo ArenasPonticia Universidad Catlica de Chile

Pablo BarcelUniversity of Chile

Leonid LibkinUniversity of Edinburgh

Filip Murlak University of Warsaw

SYNTHESIS LECTURES ON DATA MANAGEMENT #8

CM& cLaypoolMorgan publishers&


6/112


7/112

vii

Contents1 Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1

1.1 A data exchange example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1.1.1 XML data exchange. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.2 Overview of the main tasks in data exchange. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.3 Key denitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4 Background. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5 Bibliographic comments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2 RelationalMappings andData Exchange. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .112.1 Relational databases: key denitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.1.1 Relational schemas and constraints. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.1.2 Instances, constants, and nulls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2 Relational schema mappings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.3 Materializing target instances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2.3.1 Existence of Solutions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.3.2 Universal Solutions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.3.3 Materializing Universal Solutions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.3.4 Cores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2.4 Query Answering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2.4.1 Answering rst-order and conjunctive queries. . . . . . . . . . . . . . . . . . . . . . . 292.4.2 Query Rewriting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2.6 Bibliographical comments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

3Metadata Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .373.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Composition of schema mappings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.2.1 Extending st-tgds with second-order quantication. . . . . . . . . . . . . . . . . . . 413.3 Inverting schema mapping. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4


8/112

viii CONTENTS3.3.1 A rst denition of inverse. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3.3.2 Bringing exchanged data back: The recovery of a schema mapping. . . . . . 533.3.3 Computing the inverse operator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5 Bibliographic comments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4 XMLMappings andData Exchange. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674.1 XML databases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.1.1 XML documents and DTDs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4.1.2 Expressing properties of trees. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.2 XML schema mappings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 Static analysis of XML schema mappings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.3.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4.3.2 Absolute consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.4 Exchange with XML schema mappings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.1 Data exchange problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.2 Hardness of query answering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4.4.3 Tractable query answering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6 Bibliographic comments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

AuthorsBiographies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103


9/112

1

C H A P T E R 1Overview

Data exchange is the problem of nding an instance of a target schema, given an instancsource schema and a specication of the relationship between the source and the target. Starget instance shouldcorrectlyrepresent information from thesource instance under theconstimposed by the target schema, and it should allow one to evaluate queries on the target instaa way that is semantically consistent with the source data.

Data exchange is an old problem that re-emerged as an active research topic recently dthe increased need for exchange of data in various formats, often in e-business applications. The general setting of data exchange is this:

query Qmapping Msource S target T

We have xed source and target schemas, an instanceS of the source schema, and a mappingMthat species the relationship between the source and the target schemas.The goal is to constrinstanceT of the target schema, based on the source and the mapping, and answer queries agthe target data in a way consistent with the source data.

The goal of this introductory chapter is to make precise some of the key notions ofexchange: schema mappings, solutions, source-to-target dependencies, certain answers.We dmeans of an example we present in the next section.

1.1 A DATA EXCHANGE EXAMPLESuppose we want to create a database containing three relations:

ROUTES(fl i ght# , sourc e, de st i nat i on) This relation hasinformation about routesserved by severalairlines: it hasafl i ght# attribute(e.g., AF406 or KLM1276), as well assourc e andde st i nat i on attributes (e.g., Paris andSantiago for AF406).

INFO_FLIGHT(fl i ght# , de partur e _t ime, arr i val_t ime, a i rl i ne ) This relation provides additional information about the ight: departure and arrival tim well as the name of an airline.


10/112

2 1. OVERVIEW

dest airl city coun phone

SERVES

src dest airl dep

FLIGHT

dep airl

INFO FLIGHT

f# arr

coun

ROUTES

city pop

src

ROUTES

f#

Figure 1.1:Schema mapping: a simple graphical representation

SERVES(a i rl i ne, c i ty , country , phon e ) This relation has information about cities served by airlines: for example, it may h( Ai rFranc e, Sant i ago , Chi l e, 555 0000) , indicating that Air France serves SantiaChile, and its ofce there can be reached at 555-0000.

We do not start from scratch: there is a source database available from which we cainformation.This source database has two relations:

FLIGHT(sourc e, de st i nat i on , a i rl i ne, de partur e ) This relation contains information about ights, although not all the information nthe target.We only have source, destination,and airline (but no ight number),and dtime (but no arrival time).

GEO(ci ty , country , populat i on) This relation has some basic geographical information:cities, countries where they arand their population.

As the rst step of moving the data from the source database into our target, we have tschema mapping , a set of relationships between the two schemas.We canstartwith a simplegrrepresentation of such a mapping shown in Figure1.1. The arrows in such a graphical representatishow the relationship between attributes in different schemas.

But simple connections between attributes are not enough. For example, when wrecords inROUTES and INFO_FLIGHT based on a record inFLIGHT, we need to ensure that the values of fl i ght# attribute (abbreviated asf# in the Figure) are the same. This is indicated bcurved line connecting these attributes. Likewise, when we populate tableSERVES, we only want toinclude cities which appear in tableFLIGHT this is indicated by the line connecting attributetablesGEOandFLIGHT.


11/112

1.1. A DATA EXCHANGEEXAMPLE 3

(2)

airl city coun phone

SERVES

src dest airl dep

FLIGHT

dep airl

INFO FLIGHT

f# arr

coun

ROUTES

city pop

src

ROUTES

f# dest

(1) (1) (1)(1)

(2) (2)

Figure 1.2:Schema mapping: a proper graphical representation

Furthermore, there are severalrulesin a mapping that help us populate the target database.In this example, we can distinguish two rules. One uses tableFLIGHT to populateROUTES andINFO_FLIGHT, and the other uses bothFLIGHT andGEOto populateSERVES. So, in addition, weannotate arrows with names or numbers of rules that they are used in. Such a revised represeis shown in Figure1.2.

While it might be easy for someone understanding source and target schemas to prodgraphical representation of the mapping, we need to translate it into a formal specication. look at the rst rule which says:

whenever we have a tuple (src , de st , a i rl , de p) in relationFLIGHT, we must have a

tuple inROUTES that hassrc and de st as the values of the second and the third attributes,and a tuple inINFO_FLIGHT that hasde p and a i rl as the second and the fourth attributes.

Formally, this can be written as:

FLIGHT(src , de st , a i rl , de p) ROUTES(_ , src , de st) , INFO_FLIGHT(_ , de p , _ , a i rl)

This is not fully satisfactory: indeed, as we lose information that the ight numbers must same; we need to explicitly mention names of all the variables and produce the following ru

FLIGHT(src , de st , a i rl , de p) ROUTES(f# , src , de st) , INFO_FLIGHT(f# , de p , arr , a i rl)

What is the meaning of such a rule? In particular, what are those variables that appethe target specication without being mentioned in the source part? What the mapping says values for these variables mustexist in the target, in other words, the following must be satised:

FLIGHT(src , de st , a i rl , de p) f# arr ROUTES(f# , src , de st) INFO_FLIGHT(f# , de p, arr , a i rl)


12/112

4 1. OVERVIEW

To complete the description of the rule,we need to clarify the role of variablessrc , de st , a i rl andde p. The meaning of the rule is thatfor everytuple(src , de st , a i rl , de p) in tableFLIGHT, wehave to create tuples in relationsROUTES and INFO_FLIGHT of the target schema. Hence, nallthe meaning of the rst rule is:

src de st a i rl de p FLIGHT(src , de st , a i rl , de p)

f# arr ROUTES(f# , src , de st) INFO_FLIGHT(f# , de p, arr , a i rl)

Note that this is a query written in relational calculus, without free variables. In other wa sentence of rst-order logic, over the vocabulary including both source and target relameaning of this sentence is as follows: given a sourceS , a target instance we construct is such thtogether,S andT satisfy this sentence.

We now move to the second rule. Unlike the rst, it looks at two tuples in the(src , de st , a i rl , de p) in FLIGHT and(c i ty , country , popul) in GEO. If they satisfy the joinconditionc i ty =scr , then a tuple needs to be inserted in the target relationSERVES:

FLIGHT(src , de st , a i rl , de p) , GEO(ci ty , country , popul) , c i ty =src SERVES(a i rl , c i ty , country , phon e )

Aswith therst rule, theactualmeaning of this ruleisobtainedbyexplicitly quantifying theinvolved:

c i ty de st a i rl de p country popul

FLIGHT(c i ty , de st , a i rl , de p) GEO(ci ty , country , popul)

phon e SERVES(a i rl , c i ty , country , phon e )

We can also have a similar rule in which the destination city is moved in theSERVEStable in thetarget:

c i ty de st a i rl de p country popul

FLIGHT(src , c i ty , a i rl , de p) GEO(ci ty , country , popul)

phon e SERVES(a i rl , c i ty , country , phon e )

These rules together form what we call aschema mapping : a collection of rules that specithe relationship between the source and the target. When we write them, we actually ouniversal quantiers, as they can be reconstructed by the following rule:

every variable mentioned with one of the source relations is quantied universally

With these conventions, we arrive at the following schema mappingM :


13/112

1.1. A DATA EXCHANGEEXAMPLE 5

(1) FLIGHT

(src,

de

st,

ai

rl,

de

p) f# arr ROUTES(f# , src , de st) INFO_FLIGHT(f# , de p, arr , a i rl)

(2) FLIGHT(c i ty , de st , a i rl , de p) GEO(ci ty , country , popul) phon e SERVES(a i rl , c i ty , country , phon e )

(3) FLIGHT(src , c i ty , a i rl , de p) GEO(ci ty , country , popul) phon e SERVES(a i rl , c i ty , country , phon e )

Now, what does it mean to have a target instance, given a source instance and a mapping?mappings are logical sentences, we want target instances to satisfy these sentences, with resthe source. More precisely, note that mappings viewed as logical sentences mention both

and target schemas. So possible target instancesT for a given sourceS must satisfy the followingcondition:

For each condition of the mappingM , the pair(S ,T) satises.

We call such instancesT solutions for S under M .Look,for example,at our mappingM , and assumethat the sourceS has a tuple(Paris,Santiago,AirFrance,2320). Then every solutionT forS underM must have tuples

(x,Paris,Santiago) in ROUTES and (x,2320, y,AirFrance ) in INFO_FLIGHT

for some valuesx andy , interpreted as ight number and arrival time. The mapping says nothiabout these values: they may be real values (constants),e.g.,(406,Paris,Santiago), ornulls,indicatingthatwelackthisinformationatpresent.Weshallnormallyusethesymbol todenotenulls,soacom-monwaytopopulatethetargetwouldbewithtuples(,Paris,Santiago) and(,2320, ,AirFrance ).Note that the rst attributes of both tuples, while being unknown, are nonetheless the same.situation is referred to as havingmarked nulls, ornave nulls, as they are used in nave tables, studiedextensively in connection with incomplete information in relational databases.At the same tiknow nothing about the other null used: nothing prevents it from being different from butnothing tells us that it should be.

Note that already this simple example leads to a crucial observation that makes theexchangeprobleminteresting:solutionsarenotunique .Infact,therecouldbeinnitelymanysolutions: we can use different marked nulls, or we can instantiate them with different values.

If solutions are not unique, how can we answer queries? Consider, for example, a Bo(yes/no) query Is there a ight from Paris to Santiago that arrives before 10am? The answer to thisquery has to be no even though in some solutions we shall have tuples with arrival time10am. However, in others, in particular in the one with null values, the comparison with 10anot evaluate to true, and thus we have to return no as the answer.

On the other hand, the answer query Is there a ight from Paris to Santiago? is yes, as thetuple including Paris and Santiago will be in every solution. Intuitively, what we want to do i


14/112

6 1. OVERVIEW

answering in data exchange is to return answers that will be true in every solution.Thescertain answers; we shall dene them formally shortly.

1.1.1 XMLDATA EXCHANGEBefore outlining the key tasks in data exchange, we briey look at the XML representaabove problem. XML is a exible data format for storing and exchanging data on the Wdocuments are essentially trees, that can represent data organized in a way more comthe usual relational databases. But each relational database can be encoded as an XML a portion of our example database, representing information about the ParisSantiago information about Santiago, is shown in the picture below.

r

FLIGHT

t 1

src

Paris

dest

Santiago

airl

Air France

dep

2300

GEO

t 2

city

Santiago

country

Chile

popul

6M

The tree has a rootr with two children,corresponding to relationsFLIGHT andGEO. Each of those has several children, labeledt 1 andt 2 , respectively, corresponding to tuples in the relationsshow one tuple in each relation in the example. Eacht 1-node has four children that correspond the attributes of FLIGHT and eacht 2-node has three children, with attributes of GEO. Finally, eachof the attribute nodes has a child holding the value of the attribute.

To reformulate a rule in a schema mapping in this language, we show how portionare restructured. Consider, for example, the rule

FLIGHT(c i ty , de st , a i rl , de p) GEO(ci ty , country , popul) phon e SERVES(a i rl , c i ty , country , phon e )

We restate it in the XML context as follows:r

FLIGHT

t 1

src

x

airl

y

GEO

t 2

city

x

country

z

r

SERVES

t

airl

y

city

x

country

z

phone

u

That is, if we have tuples inFLIGHT andGEOthat agree on the values of thesourc e andc i ty attributes, we grab the values of thea i rl i ne andcountry attributes, invent a new valueuforphon e and create a tuple in relationSERVES.


15/112

1.2. OVERVIEW OFTHE MAINTASKS IN DATA EXCHANGE 7 The rulesofXML schemamappingsare thus represented viatreepatterns.Essentially, theysay

that ifa certainpattern occursina source document,someotherpattern,obtainedbyitsrestructumust occur in the target.

ThisviewofXMLschemamappingsisnotsurprisingifwenotethatinourrelationalexampthe rules are obtained by using a relational pattern i.e., a conjunction of source atoms rearranging them as a conjunction of target atoms.Conjunctions of atoms are natural analogspatterns. Indeed, the pattern on the right-hand side of the above rule, for example, can be vas the conjunction of the statements about existence of the following edge relations: betweroot and a node labeledSERVES, between that node an a node labeledt , between thet -node andnodes labeleda i rl , c i ty , country , andphon e , respectively, and between those nodes and nodescarrying attribute valuesy , x , z, andu .

Of course, we shall see when we describe XML data exchange that patterns could be s

cantly more complicated: they need not be simple translations of relational atoms. In fact, ouse more complicated forms of navigation such as the horizontal ordering of siblings in a docor the descendant relation.But for now, our goal was to introduce the idea of tree patterns byof a straightforward translation of a relational example.

1.2 OVERVIEW OFTHE MAINTASKSIN DATA EXCHANGE The key tasks in many database applications can be roughly split into two groups:

1. Static analysis. This mostly involves dealing with schemas; for example, the classical relatdatabase problems such as dependency implication and normalization fall into this cat Typically, the input one considers is (relatively) small, e.g., a schema, a set of constr Therefore, somewhat higher complexity bounds are normally tolerated: for example,problems related to reasoning about dependencies are complete for complexity classes NP , orcoNP , orPspace .

2. Dealing with data. These are the key problems such as querying or updating the data. course, given that databases are typically large, only low-complexity algorithms are to when one handles data. For example, the complexity of evaluating a xed relational aquery is very low ( AC 0 , to be precise), and even more expressive languages such as Datastay inPtime .

In data exchange, the key tasks too can be split into two groups. For static analysis tasks, wschema mappings as rst-class citizens.The questions one deals with are generally of two k

Consistency.For thesequestions, theinput isa schemamappingM ,and thequestion iswhetherit makes sense: for example, whether there exists a sourceS that has a solution underM , or whether all sources of a given schema have solutions. These analysis are important forout bad mappings that are unlikely to be useful in data exchange.


16/112

8 1. OVERVIEW

Operations on mappings. Suppose we have a mappingM from a source schemaRs to a targetschemaRt, and another mappingM that usesRt as the source schema and maps it intoschemaRu. Can we combine these mappings into one, the composition of the two,M M , which mapsRs toRu?Orcanweinvertamapping,andndamappingM 1 fromRt intoRs,that undoes the transformation performed by M and recovers as much original informatabout the source as possible? These questions arise when one considers schema evschemas evolve, so do the mappings between them. And once we understand when we can construct mappings such asM M orM 1 , we need to understand their properti with respect to the existence of solutions problem.

Tasks involving data are generally of two kinds.

Materializing target instances. Suppose we have a schema mappingM and a source instanceS . Which target instance do we materialize? As we already saw, there could be manyinnitelymany targetinstanceswhicharesolutionsforS underM .Choosingone,we shouldthink of three criteria:

1. it should faithfully represent the information from the source, under the conimposed by the mapping;

2. it should not contain (too much) redundant information;3. the computational cost of constructing the solution should be reasonable.

Query answering . Ultimately, we want to answer queries against the target schema.explained, due the existence of multiple solutions, we need to answer them in a w

consistent with the source data. So if we have a materialized targetT and a query Q , we needto nd a way of evaluating it to produce the set of certain answers. As we shall see,computingQ(T ) doesnot give us certain answers, so we may need to changeQ into anotherquery Q and then evaluateQ on a chosen solution to get the desired answer.The compof this task will also depend on a class of queries to whichQ belongs. We shall see that fosome classes, constructingQ is easy, while for others,Q must come from languages mucmore expressive (and harder to evaluate) than SQL, and for some, it may not even

1.3 KEY DEFINITIONS Wenowgetabitmoreformalandpresentthekeydenitionsrelatedtodataexchange.ForareschemaR, we letInst (R) stand for the set of all database instances of R. We start with

asource schemaRs,

atarget schemaRt, and

a set st of source-to-target dependencies.


17/112

1.3. KEYDEFINITIONS 9Source-to-target dependencies, orstds, are logical sentences overRs andRt; for example,

src de st a i rl de p FLIGHT(src , de st , a i rl , de p)

f# arr ROUTES(f# , src , de st) INFO_FLIGHT(f# , de p, arr , a i rl)

We may also have

a set of target dependenciest , i.e., logical sentences overRt. These are typical databaseintegrity constraints such as keys and foreign keys.

We dene aschema mapping as a quadruple

M = (Rs, Rt, st , t ),

or just(Rs, Rt , st ), when there are no target constraints.Now assumeS is asource instance , i.e., a database instance overRs. Then a target instanceT is called asolution for S under M if and only if S andT together satisfy all the stds inst , andT satises all the constraints int . That is,T is a solution forS underM if

(S ,T) |= st and T |= t .

For mappings without target constraints, we only require(S ,T) |= st . The set of all solutions forS underM is denoted by Sol M (S) :

Sol M (S) = { T Inst (Rt) | (S ,T) |= st andT |= t }.

Semantically, we can view the mappingM as a binary relation betweenInst (Rs) andInst (Rt),i.e.,

JM K Inst (Rs) Inst (Rt) is dened asJM K = { (S ,T) | S Inst (Rs), T Inst (Rt), andT Sol M (S) }.

If we have a query Q over the target schemaRt, we want to computecertain answers; these areanswers true in all solutions for a given source instanceS . That is, certain answers are dened as

certainM (Q,S) =T Sol M (S)

Q(T ).

We want to computecertainM (Q,S) using just onespecic materialized solution,say T 0 . In general,Q(T 0 ) need not be equalcertainM (Q,S) . Instead, one is looking for a different query Q , called arewriting of Q , so that

Q (T 0 ) = certainM (Q, S). When we have such a materialized solutionT 0 and a rewritingQ for each query Q we want topose, we have the answer to the two key questions of data exchange. First, we know which sto materialize it isT 0 . Second, we know how to answer queries compute the rewritingQ of Qand apply it toT 0 .


18/112

10 1. OVERVIEW

1.4 BACKGROUND We assume that the reader is familiar with the following.

Relational database modelWe expect the reader to have seen the relational database modbe familiar with the standard notions of schemas, instances, constraints, and query In particular, we assume the knowledge of the following:

Basic relational query languages, especially relational calculus (i.e., rst-ordelogic, often abbreviated as FO) and its fragments such as:1. Conjunctive queries, also known as select-project-join queries, formally d

the, -fragment of FO, or the,, -fragment of relational algebra; and2. Unions of conjunctive queries, that correspond to select-project-join-union

Relational integrity constraints, especially keys and foreign keys, and morefunctional and inclusion dependencies.

XMLAlthough we dene the key XML concepts we need, it would help the reader to hthem before. We assume the basic knowledge of automata on words (strings); all threlated to automata on trees will be dened here.

1.5 BIBLIOGRAPHIC COMMENTSData exchange,also known as data translation, is a very old problem that arises in many data must be transferred between independent applications [Housel et al., 1977]. Examples of dataexchange problems appeared in the literature over 30 years ago. But as the need for datincreased over the years [Bernstein, 2003], research prototypes appeared [Fagin et al., 2009] andmade their way into commercial database products.The theory was lagging behind untiby Fagin et al.[2005a] which presented the widely accepted theoretical model of data exchdeveloped the basis of the theory of data exchange, by identifying the key problems and lmaterializing target instances and answering queries.

Within a year or two of the publication of the conference version of this paper,change grew into a dominant subject in the database theory literature, with many papersin conferences such as PODS, SIGMOD, VLDB, ICDT, etc. By now, there are severapresenting surveys of various aspects of relational data exchange and schema mappingBarcel,2009;Bernstein and Melnik , 2007;Kolaitis, 2005]; extensions to XML have been described as [Amano et al., 2009;Arenas and Libkin, 2008]. Much more extensive bibliographic commentsbe provided in the subsequent chapters. The subject of data integration has also received much attention, see, for examHaas[2007] andLenzerini[2002] for a keynote and a tutorial. Relationships between data exchanintegration have also been explored [Giacomo et al., 2007].


19/112

11

C H A P T E R 2Relational Mappings and Data

Exchange2.1 RELATIONALDATABASES: KEY DEFINITIONSIn Chapter1, we presented the key denitions related to data exchange in the most general set

We now restate these denitions in the context of relational data exchange, and recall some key concepts from relational database theory.

2.1.1 RELATIONALSCHEMAS ANDCONSTRAINTSA relational schemaR is a nite sequenceU 1 , . . . , U m of relation symbols, with eachU i having axed arity n i > 0. For example, the target relational schema considered in the introduction has threlations:ROUTES of arity 3, INFO_FLIGHT of arity 4, andSERVES, also of arity 4.

We sometimes consider relational schemas withintegrity constraints, which are conditionsthat instances of such schemas must satisfy. The most commonly used constraints in databafunctional dependencies (and a special case of those:keys) and inclusion dependencies (and acase of functional and inclusion dependencies: foreign keys).

A functional dependency states that a set of attributesX uniquely determines another set of attributesY ; a key states that a set of attributes uniquely determines the tuple. For example,reasonable to assume thatfl i ght# is a key of ROUTES, which one may write as a logical sentence

f sd s d ROUTES(f,s,d) ROUTES(f, s , d ) (s = s ) (d = d ) .

An inclusion dependency states that a value of an attribute (or values of a set of attributes) ocinone relationmustoccur inanother relationaswell.Forexample,we may expecteachight nuappearing inINFO_FLIGHT to appear inROUTES as well: this is expressed as a logical sentence

f d aa INFO_FLIGHT( f ,d ,a ,a ) sd ROUTES(f,s,d ) .

Foreignkeysaresimplyacombinationofaninclusiondependencyandakeyconstraint:bycombthe above inclusion dependency with the constraint thatfl i ght# is a key forROUTES, we get aforeign key constraintINFO_FLIGHT[fl i ght#] FK ROUTES[fl i ght#] , stating that each valueof ight number inINFO_FLIGHT occurs inROUTES, and that this value is an identier for the tuplein ROUTES.


20/112

12 2. RELATIONAL MAPPINGSANDDATA EXCHANGE2.1.2 INSTANCES,CONSTANTS, ANDNULLSAninstance S of schemaR = U 1 , . . . , U m assigns to each relation symbolU i , where1 i m, aniten i -ary relationU S i .Thedomainof instanceS ,denotedby Dom (S) ,isthesetofallelementsthatoccur in any of the relationsU S i . It is often convenient to dene instances by simply listing theattached to the corresponding relation symbols. Further, sometimes we use the notationU( t ) S instead of t U S , and callU( t ) a fact of S . For instance,FLIGHT(Paris,Santiago,AirFrance ,2320) isan example of a fact. Finally,given instancesS , S of R, we denote by S S the fact thatU S i U S ifor every i {1 , . . . , m }, and we denote by S the size of S .

We have seen in earlier examples that target instances may containincomplete information. This is modeled by having two disjoint and innite set of values that populate instancthem is the set of constants, denoted by Const , and the other one is the set of nulls, or variables,denoted by Var . In general, the domain of an instance is a subset of Const Var (although weshall assume that domains of a source instances are always contained inConst ). We usually denoteconstants by lowercase lettersa , b , c , . . . , while nulls are denoted by symbols, 1 , 2 , etc.

We now need a couple of standard denitions related to database instances with nuan instanceS of schemaR, letV be the set of all nulls that occur inS . We then call a mapping : V Const a valuation. By (S) , we mean the instance obtained fromS by changing everyoccurrence of a null by ().Then

Rep(S) = { S overConst | (S) S for some valuation }

is the set of instances overConst represented by S .This notion is typically viewed as the semanof an incomplete instanceS with nulls: such an instance may represent several complete independing on the evaluation of nulls, and possibly adding new tuples.

2.2 RELATIONAL SCHEMA MAPPINGSFollowing the terminology used in Chapter1, we dene arelational mapping M as a tuple(Rs, Rt , st , t ), whereRs andRt are disjoint relational schemas,Rs is called thesource schema,Rt is called thetarget schema, st is a nite set of source-to-target dependencies (i.e., dependencieover the relations inRs andRt),and t is a nite set of target dependencies (i.e., dependencies ovRt). If the set t is empty, i.e., there are no target dependencies, we writeM = (Rs, Rt , st ).

Recall that instances of Rs are calledsource instances, while instances of Rt are calledtarget instances. Source instances are usually denotedS, S 1 , S 2 , . . . , while target instances are denoteT , T 1 , T 2 , . . .

To dene the notion of a solution we need the following terminology. Given schemR1 =U 1 , . . . , U m andR2 = W 1 , . . . , W n ,withnorelationsymbolsincommon,wedenoteby R1, R2

the schemaU 1 , . . . , U m , W 1 , . . . , W n . Further, if S 1 is an instance of R1 andS 2 is an instance of R2, then(S 1 , S 2 ) denotes an instanceT of R1, R2 such thatU T i = U

S 1i andW T j = W

S 2j , for each

i {1, . . . , m } andj {1, . . . , n }.


21/112

2.2. RELATIONAL SCHEMA MAPPINGS 13Given a source instanceS overConst , we say that a target instanceT overConst Var is a

solution for S under M , if (S ,T) satises every sentence inst andT satises every sentence int .In symbols,(S ,T) |= st andT |= t . WhenM is clear from the context, we say callT simply asolution forS . As before, the set of solutions for instanceS will be denoted by Sol M (S) .

Admitting arbitrary expressive powerforspecifying dependenciesindata exchangeeasilyto undecidability of some fundamental problems, like checking for the existence of solutionsit is customary in the dataexchange literature to restrict the study toa classofmappingsM that leadto efcient decidability of the key computational tasks associated with data exchange.The strestrictions used in data exchange are as follows: constraints used inst aretuple-generating depen-dencies(which generalize inclusion dependencies), and constraints int are either tuple-generatingdependencies orequality-generating dependencies(which generalize functional dependencies). Moreprecisely:

st consists of a set of source-to-target tuple-generating dependencies (st-tgds)of the form

xy ( s (x, y) z t (x, z)),

wheres (x, y) and t (x, z) are conjunctions of atomic formulas inRs andRt, respectively;and

t is the union of a set of tuple-generating dependencies (tgds), i.e., dependencies of the form

xy (( x, y) z ( x, z)),

where( x, y) and( x, z) are conjunctions of atomic formulas inRt, and a set of equality- generating dependencies (egds), i.e., dependencies of the form

x (( x) x i = xj ),

where( x) is a conjunction of atomic formulas inRt,andxi , x j are variables among those inx .

One can observe that the mapping we used in the Introduction follows this pattern. For theof simplicity, we usually omit universal quantication in front of st-tgds, tgds, and egdstice, in addition, that each (st-)tgd( x, y) z ( x, z) is logically equivalent to the formula(y (x, y)) (z ( x, z)) . Thus, when we use notation( x) z ( x, z) for a (st-)tgd, weassume that(x) is a formula of the formy (x, y) , where( x, y) is a conjunction of atomicformulas. Also, we denote by M the size of a mappingM .

From now on, and unless stated otherwise, we assume all relational mappings to be orestricted form specied above. The intuition behind the different components of these mapis as follows. Source-to-target dependencies inst are a tool for specifying which conditions on thesource imply a condition on the target. But from a different point of view, one can also seeas a tool for specifying how source data gets translated into target data. In addition, the tran


22/112

14 2. RELATIONAL MAPPINGSANDDATA EXCHANGEdata must satisfy usual database constraints. This is represented by means of the target cies in t . It is important to notice that the mappings described above are not restrictivdatabase point of view. Indeed, tuple-generating dependencies together with equality gdependencies precisely capture the class of embedded dependencies. And the latter class containsrelevant dependencies that appear in relational databases, e.g., it contains functional anddependencies, among others.

There is a particular class of data exchange mappings, calledLocal-As-View(LAV) mappings,that often appears in the literature.This class had its origins in the data integration commit has proved to be of interest also for data exchange.Formally, a mapping without target M = (Rs, Rt , st ) is a LAV mapping if st consists of LAV st-tgds of the formx(U( x) zt (x, z)) , whereU is a relation symbol inRs. That is, to generate tuples in the target, one nea single source fact.

2.3 MATERIALIZINGTARGET INSTANCESAs we mentioned in Chapter1, one of the key goals in data exchange is to materialize a solutireects as accurately as possible the given source instance.The next example shows twophenomena regarding solutions in data exchange that, in particular, explain why the mateproblem is far from trivial. First, solutions for a given source instance are not necessarSecond, there are source instances that have no solutions.

Example 2.1 Let us revisit the data exchange example presented in Chapter1. Thus,M =(Rs, Rt , st , t ) is a mapping such that:

The source schemaRs consists of the ternary relationGEO(ci ty , country , populat i on)

and the 4-ary relation

FLIGHT(sourc e, de st i nat i on , a i rl i ne, de partur e ) .

The target schemaRt consists of the ternary relation

ROUTES(fl i ght# , sourc e, de st i nat i on)

and the 4-ary relations

INFO_FLIGHT(fl i ght# , de partur e _t ime, arr i val_t ime, a i rl i ne )and

SERVES(a i rl i ne, c i ty , country , phon e ) .


23/112

2.3. MATERIALIZINGTARGETINSTANCES 15 st consists of the following st-tgds:

FLIGHT(src , de st , a i rl , de p) f #arr (ROUTES(f# , src , de st) INFO_FLIGHT(f# , de p, arr , a i rl) )

FLIGHT(c i ty , de st , a i rl , de p) GEO(ci ty , country , popul) phon e SERVES(a i rl , c i ty , country , phon e )

FLIGHT(src , c i ty , a i rl , de p) GEO(ci ty , country , popul) phon e SERVES(a i rl , c i ty , country , phon e ) .

There are no target dependencies, i.e.,t = .

It is clear that every source instance has a solution underM . Furthermore, since the st-tgds do notcompletely specify the target, solutions are not necessarily unique up to isomorphism, and, ithere is an innite number of solutions for each source instance.

On the other hand, assume thatM is the extension of M with the target dependency that imposes thatsrc is a key inROUTES(f# , src , de st) . Then it is no longer true that ev-ery source instance has a solution underM . Indeed, consider a source instanceS that con-tains factsFLIGHT(Paris, Santiago, AirFrance , 2320 ) and FLIGHT(Paris, Rio, TAM , 1720 ). Thenthe rst st-tgd in st imposes that there are facts of the formROUTES(x, Paris, Santiago) andROUTES(y, Paris, Rio) in every solutionT forS . But this contradicts the key condition on relationRout e s(f# , src , de st) . We conclude thatS has no solutions underM . 2

This example suggests that in order to be able to materialize a target solution for a given instance, we have to be able to do two things. First, to determine whether a solution existsand second, in case that there is more than just one solution, to compute the one that reectsaccurately the source data. Each one of these issues is naturally associated with a relevaexchange problem.

First,in order todeterminewhethersolutionsexist,it isnecessary tounderstandthe decidabof the problem of checking for the existence of solutions, as dened below:

Problem : Existence of solutions for relational mappingM .Input: A source instanceS .Question : Is there a solution forS underM ?

Second, it is necessary to identify the class of data exchange solutions that most accreect the source data and to understand the computability of the solutions in this class

This section is devoted to the study of these two problems. First, it is shown in Section2.3.1that,in the more general scenario, the existence of solutions problem is undecidable. Second, in S


24/112

16 2. RELATIONAL MAPPINGSANDDATA EXCHANGE2.3.2we present a class of solutions, calleduniversal solutions, that are widely assumed to be thpreferred solutions in data exchange. Unfortunately, as we point out below, universal sonot a general phenomenon in data exchange as there are source instances that have solutuniversal solutions.

The two previous facts raise the need for identifying a relevant class of relationalthat satises the following desiderata:

(C1) The existence of solutions implies the existence of universal solutions;

(C2) checking the existence of solutions is a decidable (ideally, tractable) problem; and

(C3) for every source instance that has a solution, at least one universal solution can be(hopefully, in polynomial time).

The denition of such a class is given in Section2.3.3. Finally, Section2.3.4is devoted to the studyof a particular universal solution, called the core, which exhibits good properties for dat

2.3.1 EXISTENCE OF SOLUTIONSAs we have mentioned on and on, one of the goals in data exchange is materializing that reects as accurately as possible the source data.Unfortunately, even the most basicchecking the existence of solutions (for a xed mapping) is undecidable.

Theorem 2.2 There exists a relational mapping M = (Rs, Rt, st , t ) , such that the problem of checking whether a given source instance has a solution under M , is undecidable.

Proof. We reduce from the embedding problem for nite semi-groups (explained below)known to be undecidable.

Recall that a nite semi-group is an algebraA = (A,f) , whereA is a nite nonempty setandf is an associative binary function onA. LetB = (B,g) be a partial nite algebra; i.e.,B is anite nonempty set andg is a partial function fromB B to B .ThenB isembeddable in the nitesemi-groupA = (A,f) if and only if B A andf is an extension of g , that is, wheneverg(a,a )is dened, we have thatg(a,a ) = f (a ,a ). Theembedding problem for nite semi-groups is follows: given a nite partial algebraB = (B,g) , isBembeddable in some nite semi-group?

We now construct a relational mappingM = (Rs, Rt , st , t ) such that the embeddingproblem for nite semi-groups is reducible to the problem of existence of solutions foM . The

mappingM is constructed as follows: Rs consists of the ternary relation symbolU , whileRt consists of the ternary relation symb

V . Intuitively, bothU andV encode the graphs of binary functions.

st = { U(x,y,z) V(x,y,z) }.


25/112

2.3. MATERIALIZINGTARGETINSTANCES 17 t consists of one egd and two tgds. First, an egd

V(x,y,z) V(x ,y,z ) z = z ,

asserts thatV encodes a function. Second, a tgd

V(x,y,u) V(y,z,v) V(u,z,w) V(x,v,w),

that asserts that the function encoded by V is associative. Finally, a tgd

V (x 1 , x 2 , x 3 ) V (y 1 , y 2 , y 3 ) 1 i,j 3

zij V (x i , y j , z ij )

that states that the function encoded by V is total. Indeed, this tgd expresses that if twoelementsa andb appear in the interpretation of V , then there must be an elementc such thatV(a,b,c) .

LetB = (B,g) bea nitepartialalgebra.Consider thesourceinstanceS B = { U(a,b,c) | g(a, b) =c}. It is clear thatB is embeddable in the class of nite semi-groups if and only if S B has a solutionunderM . This shows that the existence of solutions problem forM is undecidable. 2

2.3.2 UNIVERSAL SOLUTIONSAs we have mentioned above, we want to identify which data exchange solutions better resource data. As it is widely accepted in the data exchange literature, this corresponds to the universal solutions.Intuitively, a solution is universal when it is moregeneral than any other solution.

Next example will help us to illustrate when a solution is more general than other.Example 2.3(Example2.1continued) Consider the source instance

S = { FLIGHT(Paris, Santiago, AirFrance , 2320 )}. Then one possible solution forS underM is

T = { ROUTES(1 , Paris, Santiago), INFO_FLIGHT(1 , 2320 , 2 , AirFrance )}, where1 , 2 are values inVar (i.e., nulls). Another solution is

T = { ROUTES(1 , Paris, Santiago), INFO_FLIGHT(1 , 2320 , 1 , AirFrance )}. Yet another solution, but with no nulls, is

T = { ROUTES( AF 406 ,Paris,Santiago),INFO_FLIGHT( AF 406 , 2320 , 920 , AirFrance )}.

Notice that both solutionsT andT seem to be less general thanT .This is becauseT assumes thatthe values that witness the existentially quantied variablesf# andarr , in the rst st-tgd of st ,


26/112

18 2. RELATIONAL MAPPINGSANDDATA EXCHANGEare the same, whileT assumes that these variables are witnessed by the constantsAF 406 and 920,respectively. On the other hand, solutionT contains exactly what the specication requires. Tin this case it seems natural to say that one would like to materialize a solution likeT rather thansolutionT or T , asT is more accurate with respect toS (underM ) thanT andT . 2

How do we dene the notion of being as general as any other solution? There appear tdifferent ways of doing so, outlined below.

Solutions thatdescribe all others.A solutionT is an instance with nulls, and thus describes setRep(T ) of complete solutions. A most general solution must describe all other csolutions: that is, we must have

UnivSol1(T ) : Rep(T ) = { T Sol M (S) | T is overConst }

Solutions that areas generalas others.AseeminglyslightlyweakerconditionsaysthatifasolT is universal, it cannot describe fewer complete instances than another solution, i.

UnivSol2(T ) : Rep(T ) Rep(T ) for every T Sol M (S)

Solutions thatmaphomomorphically into others.This more technical, and yet very convenidenition, is inspired by the algebraic notion of a universal object, that has a homointo every object in a class. Ahomomorphismbetween two instancesh : T T is a mappingfrom the domain of T into the domain of T , that is the identity on constants, and such tht = (t 1 , . . . , t n ) W T impliesh( t ) = (h(t 1 ) , . . . , h ( t n )) is inW T for allW inRt. Our third

condition then is:UnivSol3(T ) : there is a homomorphismh : T T for every T Sol M (S)

So, which denition should we adopt? It turns out that we cantake either one,as they are eq

Proposition 2.4 If M is a mapping given by st-tgds, and T is a solution for some source instance, thconditionsUnivSol1(T ) , UnivSol2(T ) , and UnivSol3(T ) are equivalent.

Proof. Observe that a valuation on T is a homomorphism fromT into any instance containing(T ) . Since right-hand-sides of st-tgds are conjunctive queries, which are closed under hphisms,wehave that if T Sol M (S) andT Rep(T ) ,thenT Sol M (S) .With this observation, we now easily prove the result.

UnivSol1(T ) UnivSol2(T ) . AssumeT is an arbitrary solution, and letT Rep(T ). ThenT is a solution without nulls, and hence by UnivSol1(T ) it belongs toRep(T ) , provingRep(T ) Rep(T ) .


27/112

2.3. MATERIALIZINGTARGETINSTANCES 19UnivSol2(T ) UnivSol3(T ) . LetT Sol M (S) , and let1 , . . . , n enumerate nulls inT .Let

c1 , . . . , c n be elements of Const that do not occur inT , and letT be obtained fromT by replacing eachi withci ,fori n .ThenT Rep(T ) and henceT Rep(T ) .Take a val-uation witnessingT Rep(T ) and change it into a homomorphism by settingh() = i whenever() = ci , and otherwise lettingh coincide with . Then clearly h is a homomor-phism fromT to T .

UnivSol3(T ) UnivSol1(T ) . If T Rep(T ) , we know it is a solution. On the other hand, if T is a solution over constants, then there is a homomorphismh : T T , which must be a valuation since there are no nulls inT . HenceT Rep(T ) . This completes the proof of theproposition. 2

Proposition2.4justies the following denition.Denition 2.5(Universal solutions).Given a mappingM , a solutionT for S underM is auniversal solution if one of UnivSoli (T ) , fori = 1 , 2 , 3, is satised (and hence all are satised).2

Most commonly in the proofs one uses the condition that for every solutionT forS , there exists ahomomorphismh : T T .

Example 2.6 Neither solutionT norT in Example2.3is universal. In fact, there is no homo-morphismh : T T ; otherwise

h(( 1 , 2320 , 1 , AirFrance )) = (1 , 2320 , 2 , AirFrance ),

and, thus,h(1 ) = 1 andh(1 ) = 2 , which is a contradiction.Moreover, there is no homomor-phismh : T T ; otherwise

h(( AF 406 , 2320 , 920 , AirFrance )) = (1 , 2320 , 2 , AirFrance ),

and, thus,h(AF 406 ) = 1 andh( 920 ) = 2 , which is a contradiction (because homomorphismsare the identity on constants).On the other hand, it can be easily seen thatT is a universal solution.2

In addition to being more general than arbitrary solutions, universal solutions possess manyproperties that justify materializing them (as opposed to arbitrary solutions). We shall studyproperties in detail in this chapter. Unfortunately, universal solutions are not a general phenomIndeed, it can be proved that there is a mappingM and a source instanceS , such thatS has atleast one solution underM but has no universal solutions (see Example2.11).Thus, it is necessary to impose extra conditions on dependencies if one wants to ensure that the existence of solimplies the existence of universal solutions. We study this issue in the next section.


28/112

20 2. RELATIONAL MAPPINGSANDDATA EXCHANGE2.3.3 MATERIALIZING UNIVERSALSOLUTIONSSummingup,forarbitraryrelational mappingswehave twonegativeresults:the existenceoproblemisundecidable,andtheexistenceofsolutionsdoesnotalwaysimplytheexistenceosolutions.Thus, one would like to restrict the class of dependencies allowed in mapping way that it satises the following:

(C1) The existence of solutions implies the existence of universal solutions;

(C2) checking the existence of solutions is a decidable (ideally, tractable) problem; and

(C3) for every source instance that has a solution, at least one universal solution can be(hopefully, in polynomial time).

But before looking for such a class it is convenient to talk about the main algorithmic todata exchange community has applied in order to check for the existence of solutionsknownchase procedure, that was originally designed to reason about the implication prodata dependencies. In data exchange, the chase is used as a tool for constructing a univerfor a given source instance.The basic idea is the following.The chase starts with the sourS , and then triggers every dependency inst t , as long as this process is applicable. In dointhe chase may fail (if ring an egd forces two constants to be equal) or it may never terinstance, in some cases when the set of tgds iscyclic ).

We rst dene the notion of chasestepfor an instanceS . We distinguish between two kindof chase steps:

(tgd) Letd be a tgd of the form(x, y) z ( x, z) , such that for some tuple( a, b) of elements

of S it is the case that( a,b) holds inS , where| a | = | x | and |

b | = | y | . Then theresult of applying d to S with ( a, b) is the instanceS that extendsS with every factR( c) that belongs

to ( a, ), where is a tuple of fresh distinct values inVar such that|| = | z | .

In that case we writeS d,( a, b) S .

(egd) Letd be an egd of the form( x) x1 = x2 , and assume that for some tuplea of elementsinS it is the case that (1)( a) holds inS , and (2) if a1 is the element of a corresponding tox1anda 2 is the element of a corresponding tox2 , thena1 = a2 . We have to consider two case

If botha1 anda2 are constants, theresult of applying d to S with a is failure, which isdenoted by S d, a fa i l .

Otherwise, theresult of applying d to S with a is the instanceS such that the followingholds: If one of a 1 ora 2 is a constantc and the other one is a null, thenS is obtainedfromS by replacing every occurrence of in S by c; if, on the other hand,a 1 anda 2 arenulls, thenS is obtained fromS by replacing each occurrence of one of those nulthe other one. In both cases we writeS d, a S .


29/112

2.3. MATERIALIZINGTARGETINSTANCES 21 With the notion of chase step we can now dene what is a chase sequence.

Denition2.7(Chase). Let be a set of tgds and egds andS be an instance.

A chasesequence forS under is a sequenceS id i , a i S i + 1 of chase steps,i 0, such that

S 0 = S ,eachd i isadependencyin,andforeachdistincti, j 0,itisthecasethat(d i , a i ) =(d j , a j ) (thatis,d i = d j ora i = a j ).Thislast technicalconditionensures that chasesequencesconsist of different chase steps.

Anite chase sequence forS underM is a chase sequenceS id i , a i S i + 1 , 0 i < m , forS

underM such that either (1)S m = fa i l , or (2) no chase step can be applied toS m with thedependencies in.Technically, (2) means the following: there is no dependency d in ,tuplea inS and instanceS such thatS m

d, a S and(d, a) = (d i , a i ) for every i {0 , . . . , m 1}.

If S m = fa i l , we refer to this sequence as afailing chase sequence. Otherwise, we refer to itas asuccessful chase sequence, and we callS m itsresult .

2

In principle there could be different results of the chase,as we have to make some arbitrary cfor example, if an egd equates nulls and , we can either replace by , or by . However,it is not hard to see that if T andT are two results of the chase onS underM , thenT andT areisomorphic , i.e.,T can be obtained fromT by a renaming of nulls.

Letus now studytheapplication ofthe chase ina dataexchangescenario.LetM bearelationalmapping andS a source instance. A chase sequence forS underM is dened as a chase sequence

for(S, ) under st t . Notice that, by denition, any result of a successful chase sequence fS underM must be an instance of the form(S ,T) whereT is a solution forS . But not only that, weshow in the following theorem that if (S ,T) is the result of a successful chase sequence forS underM ,thenT is a universal solution forS underM . Moreover, we also show in the following theoremthat if there exists a failing chase sequence forS underM , then all of its chase sequences are failing, which further implies thatS has no solutions underM .

Theorem 2.8 Let M be a mapping and S a source instance. If there is a successful chase sequence foS under M with result (S ,T) , thenT is a universal solution for S . On the other hand, if there exists a failing chase sequence for S under M , thenS has no solution.

Proof. In order to prove the theorem we need the following technical, but rather intuitive lem The proof is left as an exercise.

Lemma 2.9 Let S 1d, a S 2 be a chase step, where S 2 = fa i l . Assume that S 3 is an instance that

satises the dependencyd and such that there is a homomorphism fromS 1 into S 3 . Then there exists ahomomorphism fromS 2 into S 3 .


30/112

22 2. RELATIONAL MAPPINGSANDDATA EXCHANGE We now prove the theorem. Assume rst that(S ,T) is the result of a successful chase sequeforS underM . Then, as we mentioned above,T is a solution forS . We show next that it is also universal solution. LetT be an arbitrary solution forS .Thus,(S,T ) satises every dependency i

st t .Furthermore, theidentitymapping isa homomorphismfrom(S, ) into(S,T ).ApplyingLemma2.9at each step of the chase sequence with result(S ,T) , we conclude that there exists homomorphismh : (S ,T) (S,T ). Further,h is also a homomorphism fromT intoT .

Assume,on the other hand,that there is a failing chase sequence forS underM , and that thelast chase step in that sequence is(S ,T) d, a fa i l .Thus,d is an egd of the form( x) x1 = x2 ,( a) holds in(S ,T) , and if a i is the element corresponding toxi in a , i [1, 2], thena1 anda 2are constants anda 1 = a 2 . Assume, for the sake of contradiction, thatS has a solutionT . As inthe previous case, it can be proved (with the help of Lemma2.9) that there is a homomorphismh fromT to T . Thus,(h( a)) holds inT , andh(a 1 ) = a 1 andh(a 2 ) = a2 are distinct constants. This contradicts the fact thatT satisesd . 2

The result of a successful chase sequence on a source instanceS is usually called acanonical universalsolution forS . Since for the chase as dened in this chapter this result is unique up to isomin what follows we refer to it as thecanonical universal solution forS .

Example 2.10(Example2.3continued) It is clear that the result of the chase forS underM isT . Thus, from Theorem2.8, T is a universal solution forS . 2

But the chase as a tool for data exchange has one big drawback: nothing can be concluthe existence of solutions in the case when the chase does not terminate. The next examdifferent applications of the chase procedure.

Example2.11 LetM bea mapping suchthat the sourceschemaconsistsofa binary relationE ,thetarget schema consists of binary relationsG andL , and st consists of the st-tgd = E(x,y) G(x,y) . Assume rst thatt consistsof the tgd1 = G(x,y) zL(y,z) ,andletS be the sourceinstanceE(a,b) .Thechasestartsbyring and,thus,bypopulatingthetargetwiththefactG(a, b) .In a second stage, the chase realizes that1 is being violated, and thus,1 is triggered. The targetis then extended with a factL(b, ), where is a fresh null value. At this stage, no dependenbeing violated, and thus, the chase stops with resultT = { G(a, b),L(b, )}. We conclude thatT is a universal solution forS .

Assume now thatt is extended with the tgd2 = L(x,y) zG(y,z) . Clearly,T doesnot satisfy 2 and the chase triggers this tgd.This means that a factG(, 1 ) is added to the target, where1 is a fresh null value. But1 is now being violated again, and a new factL(1 , 2 ), where2 is a fresh null value, will have to be added. It is clear that this process will continue iand thus, that the chase does not terminate. Notice that in this caseT = { G(a, b), L(b,a) } is asolution forS . It can be proved, on the other hand, thatS does not have universal solutions. (Wleave this as an exercise to the reader).


31/112

2.3. MATERIALIZINGTARGETINSTANCES 23

( G, 1) ( G, 2)

( L, 2)( L, 1)

( G, 1) ( G, 2)

( L, 1) ( L, 2)

(b)(a)

Figure 2.1:Dependency graphs of (a){1 } and (b){1 , 2 }.

Assume nally thatt consists of the egd = G(x,y) x = y .Then the chase forS fails,since after populating the target instance with the factG(a, b) , the egd forces to equate theconstantsa andb. Notice that, in this case,S has no solution. 2

As we have seen, the main problem with the application of the chase is non-terminationhappens, for instance,when the set of tgds allows for cascading of null creation during the chshow next that by restricting this cascading it is possible to obtain a meaningful class of mapfor which the chase is guaranteed to terminate; further, it does so in at most polynomiallysteps. It will follow that this class of mappings satises our desiderata expressed above as coC1, C2 and C3.

Assume that is a set of tgds overRt. We construct thedependencygraph of as follows.

The nodes (positions) of the graph are all pairs(R,i) , whereR is a relation symbol inRt of arity n andi {1 , . . . , n }. Then we add edges as follows. For every tgdxy (( x, y) z ( x, z)) in, and for every variablex mentioned inx that occurs in thei -th attribute of relationR in , do

the following: if x occurs in thej -th attribute of relationT in , then add an edge from(R,i) to (T, j ) (if

the edge does not already exist); and

for every existentially quantied variable inz ( x, z) that occurs in thej -th attribute of relationT in , add an edge labeledfrom(R,i) to (T, j ) (if the edge does not already exist).

Finally, we say thatis weakly acyclic if the dependency graph of does not have a cycle goingthrough an edge labeled.

Example 2.12(Example2.11continued) Let 1 and2 be as in Example2.11. As it is shownin Figure2.1, the set{1 } is weakly acyclic, while the set{1 , 2 } is not as there is a cycle in thedependency graph of these dependencies going through an edge labeled. 2

Interesting classes of weakly acyclic sets of tgds include the following:


32/112

24 2. RELATIONAL MAPPINGSANDDATA EXCHANGE Sets of tgds without existential quantiers, and

acyclic sets of inclusion dependencies.

The intuitionbehindthis notionisas follows.Edges labeledkeep track ofpositions(R,i) forwhichthe chase will have to create a fresh null value every time the left hand side of the cortgd is triggered. Thus, a cycle through an edge labeledimplies that a fresh null value createda position at a certain stage of the chase may determine the creation of another fresh in the same position, at a later stage. Therefore, sets of tgds that are not weakly acyclicnon-terminating chase sequences (e.g., the set{1 , 2 }). On the other hand, it can be proved ththe chase always terminates for mappings with a weakly acyclic sets of tgds. Further, in tchase forS terminates in at most polynomially many stages.

Theorem2.13 Let M = (Rs, Rt , st , t ) be a xed relational mapping, such that t is the union of a set of egds and a weakly acyclic set of tgds. Then there exists a polynomial p such that the length of everychase sequence for a source instance S under M is bounded byp( S ) .

Proof. For the sake of simplicity, we prove the theorem in the absence of egds.The additidoes not change the argument of the proof in any fundamental way.

Firstofall,with eachnode(V, i ) in thedependency graphof t ,weassociateitsrank,denotedby rank(V,i) , which is the maximum number of special edges in any path in the dependenG of st that ends in(V, i ) .Since st is weakly acyclic, the rank of each node inG is nite. Further,it is clear that the maximum rank r of a node inG is bounded by the numberm of nodes inG itself.Notice thatm (and, thus,r ) is a constant since it corresponds to the total number of attributhe schemaRt (which is assumed to be xed).

Let us partition the nodes of G into setsN 0 , N 1 , . . . , N r such that(V, i ) belongs toN j if andonly if rank(V,i) = j (0 j r ). To prove the theorem, we need the following technical r(The proof is left as an exercise).

Claim 2.14 For eachj {1 , . . . , r }, there exists a polynomial p j such that the following holds: Let S be a source instance and T be any target instance that is obtained fromS by a sequence of chase steps usinthe tgds in st t . Then the number of distinct elements (i.e., constants and nulls) that occur inT at positions that are restricted to be inN j is bounded byp j ( S ) .

Now it follows from the claim that there exists a polynomialp , that only depends onM , suchthat the maximum number of elements that occur in a solutionT for S underM , at a singleposition, is bounded by p ( S ). Thus, the total number of tuples that can exist in one relatiT is at mostp ( S )m (since each relation symbol inRt can have arity at mostm). Thus, the totalnumber of tuples inT is bounded by M p ( S )m . This is polynomial inS sinceM is xed.Furthermore,at each chase step with a tgd at least one tuple is added, which implies that


33/112

2.3. MATERIALIZINGTARGETINSTANCES 25of any chase sequence is bounded by M p ( S )m ; hence, we can takep(x) = M p (x) m . This completes the proof. 2

This good behavior also implies the good behavior of this class of mappings with respect exchange. Indeed, as an immediate corollary to Theorems2.8and2.13, we obtain the following:

Corollary 2.15 Let M = (Rs, Rt, st , t ) be a xed relational mapping, such that t is the unionof a set of egds, and a weakly acyclic set of tgds. Then the existence of solutions for M can be checked in polynomial time. Further, in case that solutions exist, a universal solution can be computed in polynomtime.

Thus, the class of mappings with a weakly acyclic set of tgds satises conditions C1, C2, anddenedabove,and therefore, it constitutes a good class fordata exchange accordingtoourdenIn this case, using the chase the canonical universal solution that can be constructed in polyntime (if a solution exists at all).

Let us nallymention that thereare interesting classesof dependenciesforwhich theprobof checking for the existence of solutions is trivial. For instance, if mappings do not have egsource instances always have (universal) solutions (under the condition thatt consists of a weakly acyclic set of tgds). In particular, for mappings without target dependencies, it is the case thasource instance has at least one universal solution.

For mappings without target dependencies, the canonical universal solution can alwa

computed inLogspace

. But, as next proposition shows, the complexity increases in the presenctarget dependencies (we leave both facts as exercises to the reader):

Proposition 2.16 There exists a relational mapping M = (Rs, Rt, st , t ) , such that t consists of a set of tgds without existentially quantied variables, and such that the problem of checking for a givsource instance S , whether a fact R( t ) belongs to the canonical universal solution for S , isPtime -complete.

Onthe other hand, the problem of checking for the existence of solutions isPtime -completealready for mappings whose set of target dependencies consists of a single egd (also left as an exe

the reader). This means that both the problem of computing the canonical universal solutiomappings with tgds but without egds, and checking for the existence of solutions for mappingegds butwithout tgds,are regarded as inherently sequential,and itsperformance cannotbe impconsiderably using parallel algorithms.


34/112

26 2. RELATIONAL MAPPINGSANDDATA EXCHANGE2.3.4 CORES We start this section with an example.

Example 2.17(Example2.1continued) Consider the source instance

S = { FLIGHT(Paris,Amsterdam,KLM, 1410 ),FLIGHT(Paris,Amsterdam,KLM, 2230 ),

GEO(Paris,France, 2M) }.

It is not hard to see that the canonical universal solutionT forS is

{ROUTES(1 ,Paris,Amsterdam), ROUTES(3 ,Paris,Amsterdam),INFO_FLIGHT(1 , 1410 , 2 ,KLM), INFO_FLIGHT(3 , 2230 , 4 ,KLM),

SERVES(KLM,Paris,France, 5 ), SERVES(KLM,Paris,France, 6 )}.

Now consider the instanceT that is obtained fromT by removing tupleSERVES(KLM,Paris,France, 6 ). Then T is also a solution forS , and, moreover, thereare homomorphismsh : T T andh : T T . It follows, therefore, thatT is also a universalsolution forS . 2

We can draw an interesting conclusion from this example: among all possible universathe canonical universal solution is not necessarily the smallest (asT is strictly contained inT ).Moreover, in the example,T is actually the smallest universal solution (up to isomorphism)

The rst natural question is whether there is always a unique smallest universal so we will see later, this question has a positive answer. Further, some authors have arguesmallest universal solution is the best universal solution since it is the most economterms of size, and that this solution should be the preferred one at the moment of matersolution.The whole issue is then how to characterize this smallest universal solution.

It can be shown that the smallest universal solution always coincides with thecore of theuniversal solutions. Thecore is a concept that originated in graph theory; here we presentarbitrary instances.

Denition 2.18(Core). Let T be a target instance with values inConst Var , and letT be asub-instance1 of T . We say thatT is acore of T if there is a homomorphism fromT to T (recallthat homomorphisms have to be the identity on constants), but there is no homomorphismT to a proper sub-instance of itself. 2

The following are well known facts about the cores:

1. every instance has a core, and1 That is, the domain of T is contained in the domain of T andV T V T , for every V inRt . If one of the inclusions is proper we refer toT as aproper sub-instance of T .


35/112

2.3. MATERIALIZINGTARGETINSTANCES 272. all cores of a given instance are isomorphic, i.e., the same up to a renaming of nulls.

Thus,wecantalkaboutthe coreofaninstance.Togiveanintuitionbehindthesecondfact,assumethT 1 andT 2 are cores of T , and thath i : T T i are homomorphisms, fori = 1 , 2.Thenh 1 restrictedto T 2 cannot map it to a substructure of T 1 , for otherwiseh 1 h 2 would be a homomorphism fromT to a substructure of the coreT 1 . Likewise,h 2 restricted toT 1 cannot map it to a substructure of T 2 . Hence, these restrictions are one-to-one homomorphisms betweenT 1 andT 2 . From here, it iseasy to derive thatT 1 andT 2 are isomorphic.

The next result summarizes some of the good properties of cores in data exchange.

Proposition 2.19 Let M = (Rs, Rt , st , t ) be a mapping, such that st consists of a set of st-tgdsand t consists of a set of tgds and egds.

1. If S is a source instance and T is a universal solution for S , then the core of T is also a universal solution for S .

2. If S is a source instance for which a universal solution exists, then every universal solution has tsame core (up to a renaming of nulls), and the core of an arbitrary universal solution is precisely smallest universal solution.

Example 2.20(Example2.17continued) The solutionT is the core of the universal solutionsforS since there is a homomorphism fromT to T , but there is no homomorphism fromT to aproper sub-instance of itself. 2

Inconclusion, thecore oftheuniversalsolutions hasgoodpropertiesfordata exchange.Thisnatraises the question about the computability of the core. As we have mentioned, the chase yuniversal solution that is not necessarily the core of the universal solutions, so different techhave to be applied in order to compute this solution.

It is well-known that computing thecore ofan arbitrary graph isa computationally intractproblem. Indeed, we know that a graphG is 3-colorable iff there is a homomorphism fromG toK 3 , the clique of size 3.Thus,G is 3-colorable iff the core of the disjoint union of G andK 3 isK 3itself.This shows that there is a polynomial time reduction from the problem of 3-colorabilithe problem of computing the core of a graph. It follows that the latter isNP -hard.In fact,checkingif axed graphG 0 is the core of an input graphG is anNP -complete problem; if bothG andG 0are inputs, then the complexity is even higher (more precisely, in the classDP , studied in complexity theory).

However, in data exchange, we are interested in computing the core of a universal soand not of an arbitrary instance. And the intractability of the general problem does not meanews in our case. In fact,we are about to see that computing the core of the universal solutionthe class of mappings with a weakly acyclic set of tgds is a tractable problem.


36/112

28 2. RELATIONAL MAPPINGSANDDATA EXCHANGELet us consider rst a simple class of relational mappings; those without tgds.The

a simple greedy algorithm that computes the core of the universal solutions in case that solution exists. It proceeds as follows. Fix a mappingM = (Rs, Rt , st , t ) without tgds in t .

Algorithm1ComputeCore (M )Require:A source instanceS Ensure:If S has a core underM , thenT is a target instance that is a core forS . Otherwise,

T = fa i l1: letT be the result of the chase of S underM2: if T = fa i l then3: T = fa i l4: else5: T := T 6: forallfactR( a) in T do7: letT , be an instance obtained fromT by removing factR( a)8: if (S,T , ) satises st then9: T := T ,

10: end if 11: end for12: end if

If the chase computing the canonical universal solution does not fail, then the algorithmthe core of the universal solutions forS . Furthermore, this algorithm runs in polynomial time insize of S .

Unfortunately, the algorithm described above cannot be easily adapted to moremappings, and more sophisticated techniques have to be developed if one wants to pcomputation of cores of universal solutions continues being tractable in the presence of techniques are based on theblocksmethod described below.

Let us assume for the time being that we deal with mappings without target depen The blocksmethod relies on the following observation. If T is the canonical universal solution a source instanceS with respect to a set of st-tgdsst , then theGaifmangraph of the nulls of T consists of a set of connected components (blocks) of size bounded by a constantc (this constantdepends only on the mapping, that we assume to be xed). By the Gaifman graph of nulT wemean the graph whose nodes are the nulls of T , such that two nulls1 , 2 are adjacent iff there isa tuple in some relation of T that mentions both1 and2 .

A crucial observation is that checking whether there is a homomorphism fromT into anarbitrary instanceT can be done in polynomial time. The justication is that this problemdown to the problem of whether each block of T has a homomorphism intoT . The latter can besolved in polynomial time since the size of each block is bounded by c. It follows that computing


37/112

2.4. QUERYANSWERING 29the core of the canonical universal solutionT can be done in polynomial time: It is sufcient tocheck whether there is a homomorphismh : T T such that the size of the image of T underhis strictly less than the size of T . Then we replaceT by h(T ) , and iteratively continue this processuntil reaching a xed-point.

The blocks method was alsoextended to the case whent consistsofa set ofegds.There isanextra difculty in this case:The property mentioned above, the bounded block-size in the Gagraph of the canonical universal solutionT of a source instanceS , is no longer true in the presenceof egds.This is because the chase, when applied to egds, can equate nulls from different blocthus, create blocks of nulls of arbitrary size. This problem is solved by a surprising rigiditystating the following: LetT be the canonical universal of a source instance with respect to a sest-tgds st . Further, letT be the target instance that is obtained fromT by chasing with the egdsin t . Then if two nulls1 and2 in different blocks of T are replaced by the same null in T ,

then is rigid . That is, if h : T T is a homomorphism thenh() = , and thus,T has thebounded block-size property if we treat those nulls as constants. The situation is much more complex in the presence of tgds. This is because the cano

universal solutionT for a source instanceS does not have the bounded block-size property, and inaddition, it is no longer true that equated nulls are rigid. A rened version of the blocks mhas been developed; it was used to show that computing cores of universal solutions for ma whose set of target dependencies consists of egds and a weakly acyclic set of tgds can be polynomial time.

Theorem2.21 Let M = (Rs, Rt , st , t ) be a xed relational mapping, such that t consists of a set of egds and a weakly acyclic set of tgds. There is a polynomial-time algorithm that for every source insS , checks whether a solution for S exists, and if that is the case, computes the core of the universal solution for S .

2.4 QUERY ANSWERING2.4.1 ANSWERINGFIRST-ORDERANDCONJUNCTIVE QUERIESRecall that in the context of data exchange we are interested in computing the certain answa query.These are formally dened as follows. LetM be a relational mapping,Q a query in somequery language over the schemaRt, andS a source instance. Then the set of certain answers of Qwith respect toS under M is

certainM (Q,S) = {Q(T ) | T Sol M (S) }.

We omitM if it is clear from the context. If Q is a query of arity 0 (aBooleanquery), thencertainM (Q, S) = tru e iff Q is true inevery solutionT forS ; otherwise,certainM (Q, S) = fals e .

Example 2.22(Example2.3continued) Consider again the source instance


38/112

30 2. RELATIONAL MAPPINGSANDDATA EXCHANGES = { FLIGHT(Paris,Santiago,AirFrance, 2320 )}.

The certain answers of the query Q = ROUTES(x,y,z) with respect toS is the empty set. On theother hand,certainM (Q , S) = { (Paris,Santiago) }, forQ = xROUTES(x,y,z) . 2

Given a mappingM and a query Q overRt, the problem of computing certain answers forQ underM is dened as follows:

Problem : Certain answers forQ andM .Input: A source instanceS and a tuplet of elements fromS .Question : Doest belong tocertainM (Q,S) ?

Evaluating the certain answers of a query involves computing the intersection of a (poten

nite numberof sets.Thisstrongly suggests that computingcertainanswers forarbitrary FOan undecidable problem.We can prove this with the help of the mappingM = (Rs, Rt , st , t ) in Theorem2.2.Let be the FOformula that isobtained bytakingthe conjunction ofall dependin t , and letM be the mapping that is obtained fromM by removing all target dependencies

t . Then for every source instanceS , we have

certainM ( ,S) = fals e T : T Sol M (S) andT |= T : T Sol M (S) andT |= T : T Sol M (S)

Thus, we obtain the following from Theorem2.2.

Proposition 2.23 There exists anFO queryQ and a mapping M = (Rs, Rt , st , t ) , such that t = and the problem of computing certain answers for Q and M is undecidable.

This does notpreclude,however, theexistenceof interesting classesof queriesforwhich thof computing certain answers is decidable, and even tractable. Indeed, next theorem showis the case for the class of unions of conjunctive queries. Recall that aconjunctive query is an FOformula of the formx (x, y) , where( x, y) is a conjunction of atoms.

Theorem 2.24 Let M = (Rs, Rt , st , t ) be a mapping, such that t consists of a set of egds anda weakly acyclic set of tgds, and let Q be a union of conjunctive queries. Then the problem of compucertain answers for Q under M can be solved in polynomial time.

This is a very positive result since unions of conjunctive queries are very common databthey correspond to theselect-project-join-unionfragment of relational algebra and to the core ofstandard query language for database systems, SQL.

Theorem2.24can be easily proved when we put together the following three facts:


39/112

2.4. QUERYANSWERING 311. Unions of conjunctive queries are preserved under homomorphisms; that is, if Q is a union of

conjunctive queries overRt, T andT are two target instances such that there is a homomor-phism fromT to T , and the tuplea of constants belongs to the evaluation of Q overT , thena also belongs to the evaluation of Q overT ;

2. every universal solution can be homomorphically mapped into any other solution, and

3. FO queries, and in particular, unions of conjunctive queries, have polynomial time dataplexity.

The actual computation of certain answers is based on the well known concept of nave evaluation. This is a way of evaluating queries over databases that contain nulls. Under nave evaluatisteps are performed:

First, a given query Q is evaluated over an instanceT with nulls as if nulls were constants. That is, if , are nulls andc is a constant, then conditions comparing nulls or nulls andconstants are evaluated as follows: = c is false, = is false, and = is true.

Second, from the result of the rst step, one eliminates all tuples containing nulls. Thresult is denoted by Q (T ) .

For example, if we have factsR( 1, ) andS(, 2),S( , ), then the nave evaluation of thequery Q = z R(x,z) S(z, y) results in one tuple(1, 2). Indeed, in the rst step one takes the join of R andS followed by the projection, which yields two tuples:(1, 2) and(1 , ); the secondstep eliminates the tuple containing .

Thus,in order tocompute the certainanswers toa union ofconjunctive queriesQ withrespectto a source instanceS , one can use the following procedure. Recall thatM is a mapping such thatt consists of a set of egds and a weakly acyclic set of tgds, and letQ be a union of conjunctive

queries.

Algorithm2ComputeCertainAnswers (Q ,M )Require:A source instanceS Ensure:If Sol M (S) = ,thenCA is the set of certain answers forQ overS underM . Otherwise,

CA = fa i l1: letT be the result of the chase of S underM2: if T = fa i l then3: CA := fa i l4: else5: CA := Q (T )6: end if


40/112

32 2. RELATIONAL MAPPINGSANDDATA EXCHANGE The previous observations imply that

certainM (Q, S) = Q (T). (2.1)

Infact,sinceQ isa unionofconjunctive queries,T canbetakentobeanyuniversalsolution,and (2.1)remains true. In particular, if in the above algorithm we compute the core instead of theuniversal solution, the output of the algorithm is still the set of certain answers.

There is a natural question that arises at this point: What happens if we extend coqueries with a restricted form of negation,e.g., inequalities? Unfortunately, eventhis slightof the language leads to intractability of the problem of computing certain answers, atheorem shows.

Theorem2.25

1. Let M be a mapping such that t consists of a set of egds and a weakly acyclic set of tgds, andQbe a conjunctive query with inequalities. The problem of computing certain answers for Q and Mis incoNP .

2. There is a Boolean conjunctive queryQ with inequalities and a LAV mapping M , such that the problem of computing certain answers for Q and M is coNP -hard.

Proof. We rst prove (1). We show that if there is a solutionT for a source instanceS such thatt Q(T ) , then there is a solutionT of polynomial size such thatt Q(T ). Suppose thatT is asolution forS such thatt Q(T ) . LetT can be the canonical universal solution forS . (Notice thatT can exists sinceS has at least one solution). Then there is a homomorphismh : T can T . We

denoteby h(T can ) thehomomorphic image of T underh (notice thath(T can ) is a solution forS ).Weclaim thatt Q(h(T can )) .Assume otherwise; thent Q(h(T can )) .Buth(T can ) is a sub-instanceof T , and clearly conjunctive queries with inequalities are preserved under sub-instances. Wthat t Q(T ) , which is a contradiction. Further, notice thatT can is of polynomial size, and, thuh(T can ) is also a polynomial size.

With this observation, it is easy to construct acoNP algorithm for the problem of certaanswers of Q andM . In fact, acoNP algorithm for checkingt certainM (Q,S) is the same as anNP algorithm for checkingt certainM (Q, S) . By the above observation, for the latter, it simsufces to guess a polynomial size instanceT and check, in polynomial time, thatT is a solution,and thatt Q(T ) .

We now prove (2). The LAV mappingM = (Rs, Rt, st ) is as follows. The source schem

Rs consists of two relations:A binary relationP

and a ternary relationR

.The target schemaRt alsoconsists of two relations: A binary relationU and a ternary relationV . Further, st is the followingset of source-to-target dependencies:

P(x,y) z(U (x, z) U(y,z))R(x,y,z) V(x,y,z)


41/112

2.4. QUERYANSWERING 33 The Boolean query Q is dened as:

x1y1x2y2x3y3 (V(x 1 , x 2 , x 3 ) U(x 1 , y 1 ) U(x 2 , y 2 ) U (x 3 , y 3 ) x1 = y1 x2 = y2 x3 = y3 ).

Next, we show that the problem of certain answers forQ andM iscoNP -hard. The coNP -hardness is established from a reduction from 3SAT to the complement of

problem of certain answers forQ andM . More precisely, for every 3CNF propositional formula, weconstru

Documents

Relational and XML Data Exchange