50

Click here to load reader

Compiling mappings to bridge applications and databases

Embed Size (px)

Citation preview

Page 1: Compiling mappings to bridge applications and databases

22

Compiling Mappings to Bridge Applicationsand Databases

SERGEY MELNIK, ATUL ADYA, and PHILIP A. BERNSTEINMicrosoft Research

Translating data and data access operations between applications and databases is a longstandingdata management problem. We present a novel approach to this problem, in which the relationshipbetween the application data and the persistent storage is specified using a declarative mapping,which is compiled into bidirectional views that drive the data transformation engine. Express-ing the application model as a view on the database is used to answer queries, while expressingthe database schema as a view on the application model allows us to leverage view maintenancealgorithms for update translation. This approach has been implemented in a commercial prod-uct. It enables developers to interact with a relational database via a conceptual schema and anobject-oriented programming surface. We outline the implemented system and focus on the chal-lenges of mapping compilation, which include rewriting queries under constraints and supportingnonrelational constructs.

Categories and Subject Descriptors: H.2 [Database Management]; D.3 [Programming Lan-guages]

General Terms: Algorithms, Performance, Design, Languages

Additional Key Words and Phrases: Mapping, updateable views, query rewriting

ACM Reference Format:Melnik, S., Adya, A., and Bernstein, P. A. 2008. Compiling mappings to bridge applicationsand databases. ACM Trans. Datab. Syst. 33, 4, Article 22 (November 2008), 50 pages. DOI =10.1145/1412331.1412334 http://doi.acm.org/10.1145/1412331.1412334

1. INTRODUCTION

Developers of data-centric solutions routinely face situations in which the datarepresentation used by an application differs substantially from the one usedby the database. The traditional reason is the impedance mismatch betweenprogramming language abstractions and persistent storage [Cook and Ibrahim

A preliminary version of this article appeared in Proceedings of the 2007 ACM SIGMOD Interna-tional Conference on Management of Data, Beijing, China, 461-472.Authors’ address: One Microsoft Way, Redmond WA 98052, USA; email: {melnik, adya, philbe}@microsoft.com.Permission to make digital or hard copies of part or all of this work for personal or classroom use isgranted without fee provided that copies are not made or distributed for profit or direct commercialadvantage and that copies show this notice on the first page or initial screen of a display alongwith the full citation. Copyrights for components of this work owned by others than ACM must behonored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers,to redistribute to lists, or to use any component of this work in other works requires prior specificpermission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 PennPlaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]© 2008 ACM 0362-5915/2008/11-ART22 $5.00 DOI 10.1145/1412331.1412334 http://doi.acm.org/10.1145/1412331.1412334

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 2: Compiling mappings to bridge applications and databases

22:2 • S. Melnik et al.

2006]: developers want to encapsulate business logic into objects, yet most en-terprise data is stored in relational database systems. A second reason is toenable data independence, even if applications and databases start with thesame data representation, they can evolve independently of each other, leadingto differing data representations that must be bridged. A third reason is inde-pendence from DBMS vendors: many enterprise applications run in the middletier and need to support backend database systems with different capabilities,which require different data representations.

The data transformations required to bridge applications and databases canbe very complex. Even relatively simple object-to-relational (O/R) mapping sce-narios, where a set of objects is partitioned across several relational tables,may require transformations that contain outer joins, nested queries, and casestatements in order to reassemble objects from tables (we will see an exampleshortly). Implementing such transformations is difficult, especially since thedata usually needs to be updatable, a common requirement for many enterpriseapplications. In an unpublished Microsoft study conducted in 2006, hundreds ofdatabase application developers reported that 40% of their development effortwas spent writing data access code and stored procedures. Similar figures werereported elsewhere [Keene 2004].

Since the mid 1990’s, client-side data mapping layers have become a popularalternative to handcoding the data access logic, fueled by the growth of Internetapplications. A core function of such a data mapping layer is to provide an updat-able view that exposes a data model closely aligned with the application’s datamodel, driven by an explicit mapping. Many commercial products (e.g., OracleTopLink) and open source projects (e.g., Hibernate) have emerged to offer thesecapabilities. Virtually every enterprise framework provides a client-side per-sistence layer (e.g., Java Persistence API in JEE [EJB 3.0 Expert Group 2006]).Most packaged business applications, such as Enterprise Resource Planning(ERP) and Customer Relationship Management (CRM) applications, incorpo-rate proprietary data access interfaces (e.g., BAPI in SAP R/3).

Today’s client-side mapping layers offer widely varying degrees of capabil-ity, robustness, and total cost of ownership. Typically, the mapping betweenthe application and database artifacts is represented as a custom structure, orschema annotations that have vague semantics and drive case-by-case reason-ing. A scenario-driven implementation limits the range of supported mappingsand often yields a fragile runtime that is difficult to extend. Few data accesssolutions leverage data transformation techniques developed by the databasecommunity. Furthermore, building such solutions using views, triggers, andstored procedures is problematic for a number of reasons. First, views contain-ing joins or unions are usually not updatable. Second, defining custom databaseviews and triggers for every application accessing mission-critical enterprisedata, is rarely acceptable due to security and manageability risks. Moreover,SQL dialects, object-relational features, and procedural extensions vary signif-icantly from one DBMS to the next.

In this article, we describe an approach to building a mapping-driven dataaccess layer that addresses some of these challenges. As a first contribution,we present a novel mapping architecture that provides a general-purpose

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 3: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:3

mechanism for supporting updatable views. It enables building client-side dataaccess layers in a principled way that could also be exploited inside a databaseengine. The architecture is based on three main steps:—Specification. Mappings are specified using a declarative language that has

well-defined semantics and puts a wide range of mapping scenarios withinreach of nonexpert users.

—Compilation. Mappings are compiled into bidirectional views, called queryand update views, that drive query and update processing in the runtimeengine.

—Execution. Update translation is done using algorithms for materialized viewmaintenance. Query translation uses view unfolding.

To the best of our knowledge, this mapping approach has not been exploitedpreviously in the research literature or in commercial products. It raises inter-esting research challenges.

Our second contribution addresses one of these challenges—formulation ofthe mapping compilation problem and algorithms that solve it. The algorithmsthat we developed are based on techniques for answering queries using views,and incorporate a number of novel aspects. One is the concept of bipartite map-pings, which are mappings that can be expressed as a composition of a view andan inverse view. Using bipartite mappings enables us to focus separately on ei-ther the left or right part of the mapping constraints when generating queryand update views. Another novel aspect is a schema partitioning approach thatallows us to rewrite the mapping constraints using union queries, and facil-itates view generation. Furthermore, we discuss support for object-orientedconstructs, case statement generation, bookkeeping of tuple provenance, andsimplification of views under schema and mapping constraints.

This article is an extended version of the conference paper Melnik et al.[2007]. As part of the new material, we present a technique for validating map-pings, that is, ensuring that they allow storing and retrieving application datain a lossless fashion (Section 6). We give an improved mapping compilationalgorithm to avoid exhaustive partitioning (Section 7.6). We discuss the map-ping language in more depth (Section 4), and provide additional theorems andexamples. The proofs are included in the appendix.

Our approach has been implemented in a commercial product, the MicrosoftADO.NET Entity Framework. Its detailed system architecture is presentedin Adya et al. [2007], including query and update pipelines, transactions, andcache management. The programming model is discussed in-depth in the devel-oper documentation.1 In Castro et al. [2007] we demonstrate how applicationsare built on top of the Entity Framework. In particular, we show how map-pings can shield applications from database refactoring: data normalizationcan often be compensated in the mapping without changing and recompilingthe application. To achieve that, mapping flexibility is critical.

This article is structured as follows. In Section 2, we give an overview ofthe Entity Framework. In Section 3, we outline our mapping approach and

1Available at msdn.microsoft.com.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 4: Compiling mappings to bridge applications and databases

22:4 • S. Melnik et al.

Fig. 1. Interacting with data in the Entity Framework.

illustrate it using a motivating example. The mapping compilation problemis stated formally in Section 5. The implemented algorithms are presented inSection 7. Experimental results are in Section 8. Related work is discussed inSection 9. Section 10 is the conclusion.

2. ARCHITECTURE OF THE ENTITY FRAMEWORK

The ADO.NET Entity Framework provides a mapping-driven data access layerfor developers of data-intensive applications. It comprises an extended entity-relationship data model, called EDM, and a set of design-time and run-timeservices. The services allow developers to describe the application data usingan entity schema, and interact with it at a high level of abstraction that isappropriate for business applications.

The system offers three major data programming facilities (see Figure 1).First, developers can manipulate the data represented in the entity schemausing Entity SQL, an extension of SQL that can deal with inheritance, asso-ciations, and other extended ER constructs. This capability enables general-purpose database development against the conceptual schema and is importantfor applications that do not need an object layer, such as business reporting.Second, the entity schema can be used to generate object-oriented interfacesin several major programming languages. In this way, persistent data can beaccessed using create/read/update/delete operations on objects. Third, queriesagainst the generated object model can be posed using a language-integratedquery mechanism (LINQ [Meijer et al. 2006]), which enables compile-timechecking of queries.

EDM is an extended entity-relationship model [Chen 1976; Batini et al.1992]. It distinguishes entity types, complex types, and primitive types. In-stances of entity types, called entities, can be organized into persistent col-lections called entity sets. An entity set of type T holds entities of type T orany type that derives from T . Each entity type has a key, which uniquelyidentifies an entity in the entity set. Entities and complex values may haveproperties holding other complex values or primitive values. Like entity types,complex types can be specialized through inheritance. However, complex val-ues can exist only as part of some entity. Entities can participate in 1:1,1:n, or m:n associations, which essentially relate the keys of the respectiveentities.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 5: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:5

Fig. 2. System architecture.

A relational schema can be viewed as a simple EDM schema that containsa distinct entity set and entity type for each relational table. We exploit thisproperty to uniformly specify and manipulate mappings between EDM schemasand relational schemas. The relational schema constraints that are supportedby the system include the primary key constraints, foreign key constraints,and non-nullability constraints. Henceforth, we use the general term extent torefer to entity sets, associations, and tables. A detailed discussion of EDM ispresented in Blakeley et al. [2006].

Entity SQL is a data manipulation language based on SQL. Entity SQLallows retrieving entities from entity sets and navigating from an entity toa collection of entities reachable via a given association. Path expressionscan be used to ‘dot’ into complex values. Type interrogation can be done us-ing the predicates 〈value〉 IS OF 〈type〉 or 〈value〉 IS OF (ONLY〈type〉). En-tity SQL allows instantiating new entities or complex values similarly to the‘new’ construct in programming languages. It supports a tuple constructorthat produces row types and uses reference types similarly to SQL99. Thesemantics of Entity SQL statements used in the article is explained wherenecessary.

The system architecture is depicted (abridged) in Figure 2. It comprises sev-eral major components: query and update pipelines, object services, mappingcompiler, metadata services (not shown), data providers, and others. The shadedcomponents are discussed in more detail in subsequent sections. All data ac-cess requests go through a data transformation runtime, which translates dataand data access operations using a mapping between the entity schema andthe relational schema. The translated data access operations are fed into aData Provider, one for each supported database system, which turns them intoexpressions in a specific SQL dialect.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 6: Compiling mappings to bridge applications and databases

22:6 • S. Melnik et al.

Fig. 3. Mapping between entities (left) and tables (right).

3. MAPPING APPROACH

In this section, we present the three main steps of our approach: specifying amapping as a set of constraints, compiling a mapping into bidirectional views,and using view maintenance algorithms to perform update translation.

Mappings. A mapping is specified using a set of mapping fragments. Eachmapping fragment is a constraint of the form QEntities = QTables, where QEntitiesis a query over the entity schema (on the client side), and QTables is a queryover the database schema (on the relational store side). A mapping fragmentdescribes how a portion of entity data corresponds to a portion of relationaldata. In contrast to a view, a mapping fragment does not need to specify acomplete transformation that assembles an entity set from tables or vice versa.A mapping can be defined using an XML file or a graphical tool.

Example 1. To illustrate, consider the sample mapping scenario inFigure 3. It depicts an entity schema with entity types Person and Employeewhose instances are accessed via the extent Persons. On the store side, thereare two tables, HR and Empl, which represent a vertical partitioning of theentity data. The mapping is given by two fragments shown in Figure 3. Thefirst fragment specifies that the set of (Id, Name) values for all entities inPersons is identical to the set of (Id, Name) values retrieved from the HR table.Similarly, the second fragment tells us that (Id, Department) values for allEmployee entities can be obtained from the Empl table.

Bidirectional views. The mapping compiler (see Figure 2) takes a mappingas input and produces bidirectional views that drive the data transformationruntime. These views are specified in Entity SQL. Query views express entitiesin terms of tables, while update views express tables in terms of entities. Updateviews may be somewhat counterintuitive because they specify persistent datain terms of virtual constructs, but as we show later, they can be leveraged tosupport updates in an elegant way. The generated views respect the mapping, ina sense that will be defined shortly, and have the following properties (simplifiedhere and elaborated in subsequent sections):

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 7: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:7

Fig. 4. Bidirectional views compiled from mapping in Figure 3.

—Entities = QueryViews(Tables)—Tables = UpdateViews(Entities)—Entities = 4 QueryViews(UpdateViews(Entities))

The last condition, called the roundtripping criterion, ensures that all entitydata can be persisted and reassembled from the database in a lossless fashion.The mapping compiler included in the Entity Framework guarantees that thegenerated views satisfy the roundtripping criterion. It raises an error if no suchviews can be produced from the input mapping.

Figure 4 shows the query and update views generated by the mapping com-piler for the mapping in Example 1. In general, the views are significantly morecomplex than the input mapping, since they explicitly specify the required datatransformations. For example, to reassemble the Persons extent from the rela-tional tables, one needs to perform a left outer join between HR and Empl tables,and instantiate either Employee or Person entities depending on whether or notthe respective tuples from Empl participate in the join. In the query pipeline(Figure 2), queries against the entity schema can now be answered by unfold-ing the query views in the queries, pulling nonrelational operators up the querytree, and sending the relational-only portion of the query to the database server.

Update translation. A key insight exploited in our mapping architectureis that view maintenance algorithms can be leveraged to propagate updatesthrough bidirectional views. This process is illustrated in Figure 5. Tables holdpersistent data. Entities represent a virtual state, only a tiny fraction of whichis materialized on the client. The goal is to translate an update, �Entities, onthe virtual state of Entities into an update, �Tables, on the persistent stateof Tables. This can be done using the following two steps (we postpone thediscussion of merge views shown in the figure):

(1) View maintenance:�Tables′ = �UpdateViews (Entities, �Entities).

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 8: Compiling mappings to bridge applications and databases

22:8 • S. Melnik et al.

Fig. 5. Update translation.

(2) View unfolding:�Tables = �UpdateViews (QueryViews(Tables), �Entities).

Recall that update views express the database as a view of the entities.Step 1 applies view maintenance algorithms to update views, and produces aset of delta expressions, �UpdateViews (not shown in Figure 5). Given a setof changes, �Entities, to the state of Entities, �UpdateViews computes incre-mental updates, �Tables′, to the database tables. We cannot use the expression�Tables′ produced in Step 1 directly because a snapshot of Entities is not fullymaterialized on the client. However, we can construct the state of Entities usingthe query views. This is done in Step 2 by using view unfolding to substituteEntities by query views in the expression �Tables′. The second step gives us anexpression �Tables that takes as input the initial persistent database state (i.e.,Tables) and the update to entities (i.e., �Entities), and computes the update tothe database.

This approach yields a clean, uniform algorithm that works for both object-at-a-time and set-based updates (i.e., those expressed using data manipulationstatements). In practice, Step 1 is often sufficient for update translation, sincemany updates do not directly depend on the current database state. In thosesituations we have �Tables = �UpdateViews(�Entities).

To illustrate, consider the update views in Figure 4. These views are verysimple, so applying view maintenance rules is straightforward. For insertions,we obtain the �UpdateViews expressions as:�HR = SELECT p.Id, p.Name FROM �Persons p�Empl = SELECT e.Id, e.Department FROM �Persons c

WHERE c IS OF Employee.

So, for inserted entity �Persons = {〈Employee(1, ’Alice’, ’R&D’)〉}, we computerow insertions �HR = {〈1, ’Alice’〉}, �Empl = {〈1, ’R&D’〉}. In general, updateviews can get much more complex (we will see an example shortly). Usingview maintenance techniques allows supporting a very large class of updates.It accounts in a uniform way for tricky update translations where multipleextents are updated, a deletion from an extent becomes an update in the store,and so on.

4. MAPPING SPECIFICATION

While query and update views are well suited to drive data transformation,they do not offer a convenient mapping specification mechanism for typical

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 9: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:9

application scenarios. In this section, we present the mapping specificationlanguage used in the Entity Framework.

4.1 Higher-level Mappings with Precise Semantics

Most of today’s object-relational mapping frameworks and products, such asJPA [EJB 3.0 Expert Group 2006], JDO [JDO Expert Group 2006], or Hibernate[Bauer and King 2006], use pattern-based mapping specification. Mappingsare defined by selecting among several fixed mapping patterns such as table-per-hierarchy, table-per-type, or association-as-embedded-foreign-key. Usually,the behavior of a mapping pattern is specified using examples, with no formalsemantics that explains what the system ought to do when several patternsare used in combination, or what pattern combinations are valid. Correctnessor completeness of mappings is typically neither defined nor enforced.

Another extreme is defining mappings directly by way of query and updateviews: the specification is precise but often prohibitively complex for all but themost expert developers and database professionals. This complexity has severalsources. First, writing the views by hand requires a deep understanding of SQLand the nonrelational extensions in Entity SQL. The views are usually an orderof magnitude more verbose than a higher-level mapping, and have some degreeof redundancy. For example, in a simple one-to-one mapping between an entityset and a table, the property correspondences would have to be wired twice, oncein the query view and once in the update view. Hence the views are hard to writeand maintain when the conceptual schema or relational schema evolves.

A second source of complexity is due to validating the roundtripping criterion.Allowing the developers to supply both query and update views is problematicbecause checking the roundtripping criterion for arbitrary Entity SQL views isundecidable. To see this, we show how to encode query containment, which isundecidable for any relationally complete language [Di Paola 1969; Abiteboulet al. 1995], as roundtripping validation. Define a query view E = R, and anupdate view R = E − (Q1 − Q2), where E is an entity set, R is a relationaltable, and Q1 and Q2 are queries on the entity schema that do not mention E.Unfolding the update view yields the roundtripping condition E = E − (Q1 −Q2). It holds if and only if Q1 ⊆ Q2. Although it is possible to restrict the queryand update views to a subset of Entity SQL for which containment is decidable,carving up such a decidable subset using simple syntactic criteria is tricky,since even relatively simple scenarios require the use of outer joins (which,when used together with NOT NULL conditions, can express set difference),case statements, disjunction, and so on.

We considered several middle-ground alternatives for mitigating the com-plexity inherent in explicit view specification (illustrated in more detail inSection 6). One such alternative is to obtain query views from update views.This requires testing the injectivity of update views and inverting them, whichis also undecidable for Entity SQL, as can be seen using the construction inthe previous paragraph: checking whether the update view R = E − (Q1 − Q2)preserves all data in E cannot be done effectively. A second alternative is toobtain update views from query views. This requires solving the view update

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 10: Compiling mappings to bridge applications and databases

22:10 • S. Melnik et al.

Fig. 6. Mapping scenario illustrating a combination of vertical and horizontal partitioning and anassociation mapping.

problem. As was shown in Dayal and Bernstein [1978], a unique update trans-lation rarely exists for even quite simple (query) views.

These difficulties led us to search for a higher-level mapping language thatis declarative, compact, and has precise formal semantics. It is discussed below.

4.2 Extended Mapping Scenario

To set the stage for the subsequent discussion, and to substantiate the argu-ments made above, consider an extension of the mapping scenario presented inSection 3.

Example 2. Consider an extended conceptual schema that contains a newentity type Customer, and an association, Supports, between Employee andCustomer entities, as depicted in Figure 6. The inheritance hierarchy is storedusing a combination of horizontal (row) and vertical (column) partitioning. Justas in Example 1, table HR holds the names of Persons and Employees, and tableEmpl stores Employees departments. Customer names, as well as all other de-clared or inherited Customer properties, are stored in a new table, Client. TableClient has a foreign key FK2 to the Empl table, which embodies the association,Supports from the conceptual schema.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 11: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:11

The mapping is specified using mapping fragments M1, . . . , M4. Frag-ment M1 has an extra condition in the WHERE clause that asserts that onlydirect instances of Person, and direct or derived instances of Employee, arepersisted in table HR. This condition effectively excludes Customer instancesby using disjunction instead of negation. Fragments M2 and M3 are self-explanatory. Fragment M4 maps association, Supports to table Client. Noticethat the righthand side of the mapping fragment has a NOT NULL conditionon the foreign key Client.Eid. In combination with FK2, fragment M4 assertsthat the association, Supports holds between the respective Customer and Em-ployee keys precisely when a foreign key of Client references an existing Empltuple.

The schema extensions introduced in Figure 6 cause a proportional increasein the size of the mapping specification. In contrast, the complexity of the queryand update views (Figures 7 and 15–20) that drive the data transformationengine, arguably grows beyond the capabilities of many application developers.For example, the query view for entity set, Persons in Figure 7, contains aleft outer join, a union, a case statement, entity constructors, and type casts,and leverages three-valued boolean logic. The schemas shown in Figure 6 aretrivial in comparison with those used in real applications. And yet, verifyingthe correctness and roundtripping of the views is already quite challenging.

Given the complexity of the query view in Figure 7 it is remarkable that themapping is actually more informative than the view—the latter is implied bythe mapping. However, the opposite is not true: the view does not imply themapping. For example, inserting an Employee entity into entity set, Persons,has multiple update translations that are consistent with the query view forPersons. In particular, we could insert some extra tuple into the Empl table thatdoes not join with HR, without modifying the result obtained by the query view.Mapping fragment M2 rules out such undesired update behavior.

We will formalize the exact relationship between the mapping and the viewsand discuss the views shown in Figures 7 and 15–20 in more detail in thecontext of mapping compilation.

4.3 Mapping Language

We argued that using mappings offers a substantial simplification. However,our approach to defining mappings is useful only if we can compile them intoquery and update views. This section presents the mapping language for whichwe developed effective compilation and validation algorithms.

The mappings taken as input by the mapping compiler are specified usingconstraints of the form QC = Q S , where QC and Q S are Entity SQL queries.In the current version of the system, these queries, which we call fragmentqueries, are essentially project-select queries (no joins) with a relational-onlyoutput and a limited form of disjunction and negation. This class of mappingscan be used to express a large number of mapping scenarios, yet is sufficientlysimple that mapping compilation can be performed effectively.

The specification of a fragment query Q is given in Figure 8, where E is anextent with alias e, A is a property of an entity type or complex type, c is a

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 12: Compiling mappings to bridge applications and databases

22:12 • S. Melnik et al.

Fig. 7. Query view for Entity Set Persons.

scalar constant, and T is an entity type or a complex type. The return type ofQ is required to be a row of scalars and needs to contain the key propertiesof E.

The design of our mapping language was guided by two principles. First,we wanted the simplest possible language that supports the relevant cus-tomer scenarios. Besides being easier to compile, a simpler mapping languageis accessible to a broader developer audience and has a more intuitive vi-sual representation. The mapping language specified previously is expressiveenough to describe most inheritance mapping scenarios proposed in the lit-erature and implemented in commercial products. In particular, it supports

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 13: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:13

Fig. 8. BNF syntax for a fragment query Q .

table-per-hierarchy, table-per-type (vertical partitioning), table-per-concrete-type (horizontal partitioning) strategies, and all their combinations. Further-more, it supports entity splitting scenarios where entities are stored in differenttables based on certain property values.

The second principle is to avoid second-guessing the developer’s intention.For example, one could argue that the NOT NULL condition in M4 could beinferred from FK2 in the schema. The mapping designer in Visual Studio doesprecisely that—it uses the schema and UI hints to help with mapping creation.However, we do not exploit such hints at runtime to avoid unexpected effectsof schema modifications on mapping semantics.

Mappings in the Entity Framework can be stored and loaded using an XMLsyntax. XML-based mapping specification is used by most object-relationalmappers. What is different in the Entity Framework is that the XML syntaxexpresses a language that has a formal semantics. Figure 14 in the appendixshows the XML syntax used in the running example.

5. MAPPING COMPILATION

Mapping compilation turns concise higher-level mapping definitions intoexplicit data transformations used by the runtime engine. Due to theaforementioned challenges in checking roundtripping, currently, the En-tity Framework only accepts the views produced by the built-in mappingcompiler.

The compilation process is organized into two major parts: mapping valida-tion (Section 6) and view generation (Section 7). Mapping validation ensuresthat the input mappings allow producing views that roundtrip data, that is,support lossless storage of any valid instance of the conceptual model. Viewgeneration produces the actual query and update views that enforce roundtrip-ping while respecting the mapping.

In this section, we lay out the formal prerequisites for mapping validationand view generation. We state the mapping compilation problem formally us-ing state-based semantics, that is, independent of any data manipulation orconstraint languages. Section 5.1 defines schemas, mappings, and views. Weexplain what it means for a mapping to roundtrip data, and formulate the ba-sic problem of obtaining views from mappings. In Section 5.2, we introducea general class of mapping languages called bipartite mappings. Finally, inSection 5.3, we discuss merge views and refine our basic problem statement.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 14: Compiling mappings to bridge applications and databases

22:14 • S. Melnik et al.

5.1 Mappings and Data Roundtripping

We start with a general problem statement that makes no assumptions aboutthe languages used for specifying schemas, mappings, and views. To emphasizethat, we refer to the entity schema as a client schema and to the relationalschema as a store schema.

A schema defines a set of possible states (also called instances). Let C be theset of valid client states, that is, all instances satisfying the client schema andall of its schema constraints. Similarly, let S be the set of valid store states—those conforming to the store schema. That is, each element of S is an entirepopulated database. Occasionally, we use the same symbol (C, S, etc.) to denotethe schema itself, when its role is clear from the context.

A mapping between the client and store schema specifies a binary relationbetween C states and S states. A set �map of mapping constraints expressed insome formal language defines the mapping map = {(c, s) | c ∈ C, s ∈ S, (c, s) |=�map} consisting of all states (c, s) that satisfy every constraint in �map. We saythat map is given by �map. A view is a total2 functional mapping that mapseach instance of a given schema to an instance of a result schema.

We use standard algebraic properties and operations defined as follows:

— Id (C) =df {(x, x) | x ∈ C}—map1 ◦ map2 =df {(x, z) | ∃ y : (x, y) ∈ map1 ∧ ( y , z) ∈ map2}—map−1 =df {( y , x) | (x, y) ∈ map}— Domain(map) =df {x | ∃ y : (x, y) ∈ map}— Range(map) =df { y | ∃x : (x, y) ∈ map}—Function f is injective if ∀x, y , z : f (x) = z ∧ f ( y) = z → x = y—Function f : D → R is total on D if Domain( f ) = D. Henceforth all functions

are assumed total unless specified explicitly as partial.

The first question that we address is under what conditions a mapping map ⊆C × S describes a valid data access scenario. The job of the mapping layer is toenable the developer to run queries and updates on the client schema as if itwere a regular database schema. That is, the mapping must ensure that eachdatabase state of C can be losslessly encoded in S. Hence each state of C mustbe mapped to a distinct database state of S, or to a set of database states ofS that is disjoint from any other such set. If this condition is satisfied, we saythat the mapping roundtrips data, denoted as

map ◦ map−1 = Id(C),

that is, the composition of the mapping with its inverse yields the identitymapping on C. Notice that map may be nonfunctional. As an example, considera mapping just like the one in Figure 6 except the Empl table has an extracolumn, Date. This column is not referenced in the mapping. Hence the clientschema provides access to a proper subset of the data in the store, that is,

2We do not consider views that fail on some database instances (e.g., due to division by zero ornon-terminating recursion). That is, a view or query produces an output (i.e., a result table) oneach input (i.e., a specific database instance). Therefore we speak of a total function.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 15: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:15

there exist multiple corresponding store states that are related by map to eachclient state. We explain how we deal with such situations using merge views inSection 5.3.

The next question we consider is what it means to obtain query and up-date views that roundtrip data and respect the mapping. Our statement of thisproblem is based on the following theorem. The theorem and the subsequentproposition state that the roundtripping property of a mapping coincides withthe existence of a query and update view that are related to the mapping in anintimate way and ‘witness’ its roundtripping:

THEOREM 1 (Data roundtripping). Let map ⊆ C × S. Then, map ◦ map−1 =Id (C) if and only if there exist two (total) views, q : S → C and u : C → S, suchthat u ⊆ map ⊆ q−1.

If the views u and q are given as sets of view definitions �u and �q , thenu ⊆ map ⊆ q−1 means that �u implies �map, which in turn implies �q for allinstances in C×S. It is easy to show that each query and update view satisfyingthis theorem ‘witness’ the roundtripping criterion from Section 3.

PROPOSITION 1. For each q and u satisfying Theorem 1, the following holds:

u ◦ q = Id (C).

Hence we formulate the following data roundtripping problem:

For a given map ⊆ C ×S, construct views q and u expressed in somelanguage L, such that u ⊆ map ⊆ q−1, or show that such views donot exist.

We refer to q and u as the query view and update view, respectively. Some-times, we use the plural form views to emphasize that q and u are specified assets of view definitions (e.g., as shown in Figures 7, 15–20).

5.2 Bipartite Mappings

The mappings in the Entity Framework are specified using a set of mappingfragments �map = {QC1 = Q S1, . . . , QCn = Q Sn}. A mapping fragment is aconstraint of the form QC = Q S , where QC is a query over the client schemaand Q S is a query over the store schema. We call such mappings bipartitemappings.

A bipartite mapping is one that can be expressed as a composition mapping ofa view with an inverse of a view with the same view schema. Thus the mappinggiven by �map above can be expressed as a composition mapping f ◦ g−1, wherethe view f : C → V is given by queries QC1, . . . , QCn, the view g : S →V is given by queries Q S1, . . . , Q Sn, and V corresponds to the view schemaV1, . . . , Vn induced by these queries (see Figure 9). We refer to Vi as a fragmentview (signature).

Example 3. For the mapping in Example 2 we obtain the following frag-ment view signatures (up to renaming of attributes):

V1 (Id, Name), V2 (Id, Dept), V3 (Cid, Name, Score, Addr), V4 (Cid, Eid).

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 16: Compiling mappings to bridge applications and databases

22:16 • S. Melnik et al.

Fig. 9. Bipartite mapping between C and S.

A key property of bipartite mappings is using equality in mapping fragmentsinstead of inclusion. Inclusion constraints of the form Qsrc ⊆ Qtgt, used insource-to-target data exchange and query answering settings, are inadequatefor data roundtripping because they specify a one-way relationship betweenthe schemas. For example, the previous query inclusion does not constrain thesource database sufficiently; it may remain empty for each target database.

As we demonstrate shortly, bipartite mappings can be compiled into queryand update views by applying answering-queries-using-views techniques to theviews f and g , one after another. This reduces the complexity of the solution ascompared with working with the entire �map, and enables developing largelysymmetric algorithms for generating query and update views. Moreover, weexploit the properties of bipartite mappings to validate mappings, that is, verifythat they roundtrip data, as described subsequently.

5.3 Merge Views

In many mapping scenarios, only part of the store data is accessible through theclient schema. Some tables or columns, such as Empl.Date mentioned previously,may not be relevant to the application and not exposed through the mapping.Usually, such unexposed information needs to remain intact as updates areperformed against the store. To address this requirement we use the conceptof merge views. A merge view m : S × S → S combines the updated store statesupd , computed by the client, and the old store state sold into the new store statesnew (see Figure 5).

To illustrate, consider the merge view for the extended table Empl(Id, Dept,Date):

Emplnew = SELECT upd.Id, upd.Dept, old.DateFROM Empl AS updLEFT OUTER JOIN Emplold AS oldON upd.Id = old.Id.

This view, together with the query and update views shown in Figures 7 and19, determine completely the update behavior for the extended Empl table.The left outer join ensures that the new employee entities added on the clientappear in the Emplnew table, and the deleted ones get removed. If an employee’sdepartment gets updated, the Date value remains unchanged.

In contrast to query and update views, whose purpose is to reshape databetween the client and the store, merge views capture the store-side state tran-sition behavior. This behavior may go beyond preserving unexposed data. Forexample, if Date denotes the last date on which the department name was mod-ified, it may be necessary to reset it to the current date upon each update.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 17: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:17

More generally, merge views may implement various update policies such asupdating timestamps, logging updates, resetting certain columns, or rejectingdeletions (using full outer join in the merge view).

Currently, we use merge views exclusively for preserving unexposed data.The formal criterion can be stated as:

∀s ∈ Range(map) : m(u(q(s)), s) = s.

It requires that a client that retrieves all store data and writes it back unmod-ified, leaves the store in the unchanged state. This property needs to hold forall mapped store states, that is, the ones that are consistent with the specifiedmapping.

The extended roundtripping criterion that considers merge views can bestated as:

∀c ∈ C, ∀s ∈ Range(map) : q(m(u(c), s)) = c.

It requires that applying the update view u to an arbitrary client state c, fol-lowed by merging the computed store state with the existing store state, mustallow reassembling c using the query view q from the merged store state.

6. MAPPING VALIDATION

Mapping compilation can succeed only if we start with a mapping thatroundtrips. To illustrate what can go wrong, consider the following example.

Example 4. The foreign key FK1 from Example 2 establishes an inclusiondependency between the keys of Empl and HR (see Figure 6). FK1, together withmapping fragments M1 and M2, implies that the Id values projected in QC2(the lefthand query of M2) are contained in those projected in QC1. Indeed, it iseasy to see that πId(QC2) ⊆ πId(QC1). However, suppose we modify the mappingfragment M1 : (QC1 = Q S1) into M ′

1 : (Q ′C1 = Q S1) by making the WHERE

clause of Q ′C1 contain just the condition p IS OF (ONLY Person). Let Employee(1,

’Alice’, ’R&D’) be the only entity in the entity set Persons. M2 requires that thetable Empl contain one tuple. However, M ′

1 now implies that HR must be empty,which would violate FK1. More generally, the containment πId(QC2) ⊆ πId(Q ′

C1)is violated whenever entity set Persons has at least one Employee entity—inthis case, no HR and Empl tables can be constructed that satisfy the databaseschema constraints and the mapping. Hence the mapping with M ′

1 does notroundtrip, and it is impossible to generate query and update views that storeand retrieve data losslessly for every instance of the conceptual schema.

This example suggests that validating mapping roundtripping involves prop-agating schema constraints over views. This relationship is formalized in thefollowing theorem:

THEOREM 2. Let map = f ◦ g−1 be a bipartite mapping defined by (total)views f and g. Then the following conditions are equivalent:

(1) map ◦ map−1 = Id (C).(2) (a) f is injective and (b) Range( f ) ⊆ Range(g ).

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 18: Compiling mappings to bridge applications and databases

22:18 • S. Melnik et al.

That is, a mapping roundtrips if and only if f is injective and the range con-straints of f (i.e., those inferred from f and the schema constraints in C) implythe range constraints of g (i.e., those inferred from g and the schema con-straints in S).

To illustrate, in Example 4 the range constraints on fragment view symbolsimplied by f and g agree exactly and are:

πId(V2) ⊆ πId(V1).

As shown by Nash et al. [2007], the problem of computing the range ofa mapping (or a view, as a special case) can be reduced to that of map-ping composition. Composing mappings is very challenging [Fagin et al. 2005;Bernstein et al. 2006]. Therefore, we chose a validation approach that lies inbetween doing mapping composition and testing roundtripping of the gener-ated query and update views. (Theorem 2 is exploited in the context of map-ping compilation in Section 7.7.) Our validation approach is based on the fol-lowing theorem and leverages in part the views produced by the mappingcompiler.

THEOREM 3. Let map = f ◦ g−1 ⊆ C × S, f : C → V, g : S → V. Fur-ther, let u : C → S be some update view constructed from map (i.e., u ⊆ map,Domain(u) = C, Range(u) ⊆ S). Then, the following roundtripping conditionsare equivalent:

(1) map ◦ map−1 = Id (C).(2) (a) f is injective and (b) map ◦ u−1 = Id (C) .

The benefit of this theorem is that it replaces the problem of composingtwo mappings by that of composing a mapping and a view. The latter can bedone easily by unfolding the view in the mapping fragments. Suppose thatour view generation algorithm produces some update view u with u ⊆ map,Domain(u) = C (we explain how this is done shortly). To satisfy the premiseof the theorem, the update view u needs to guarantee that Range(u) ⊆ S, thatis, every instance constructed via u satisfies all schema constraints in S. Thesteps required to perform mapping validation for a relational schema S aresummarized in Algorithm 1.

Algorithm .1 ValidateMapping

(1) Check that f is injective (Section 7.4, Algorithm 4).(2) Check that u respects key constraints of S (Section 7.7, Algorithm 9).(3) Check that u respects foreign key constraints of S by unfolding the update view for

each relation mentioned in each foreign key constraint and testing query contain-ment.

(4) Check that u respects NOT NULL constraints of S by adding an IS NULL conditionto u for each non-nullable column that is populated by u and then checking that theresulting query is unsatisfiable.

(5) Check that map ◦ u−1 = Id (C) by unfolding u in map and checking the resultingquery equivalences.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 19: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:19

As we explain in Section 7, Checks 1 and 2 are performed as part of our viewgeneration algorithm (Algorithm 2). Checks 3–5 are done after update viewshave been generated and are illustrated as follows.

Example 5. Consider the update views for tables HR and Empl shown inFigures 18–19. To verify the foreign key constraint FK1 (Check 3), we constructthe expression

πId(Empl) ⊆ πId(HR),

where HR and Empl are substituted by the view definitions from Figures 18–19,and check the containment. To illustrate Check 4, suppose that Name is a non-nullable column in table HR. We add an IS NULL condition on Person Namereturned by the update view for HR (Figure 18) and check that the resultingquery becomes unsatisfiable; this is true only if the property, Name of entitytype, Person is non-nullable. Check 5 passes, since unfolding the update viewsfor HR and Empl in the mapping fragments M1 and M2, respectively, yieldstrivial tautologies.

This mapping validation approach requires testing containment of EntitySQL queries. Since we start with a mapping expressed in a more restrictedlanguage, and are in control of the compilation algorithm, we know the exactshape of Entity SQL queries on which the containment checker is run, eventhough, syntactically, they fall into an undecidable fragment of Entity SQL.Specifically, for the class of queries produced by the mapping compiler, thereexists a reduction of the query containment problem to the satisfiability problemin propositional logic. To do these checks, we use a SAT solver that exploits avariant of Binary Decision Diagrams [Bryant 1986].

To emphasize the benefit of our validation approach, recall a more direct wayof validating roundtripping that we sketched in Section 4.1. Instead of Check 5,we could unfold the update views for HR, Empl, and Client inside the query viewfor Persons. This would give us an Entity SQL expression that combines thosein Figures 7, 18, 19, and 20, and has many difficult operators. This approach wasinvestigated by our colleagues in Mehra et al. [2007] using a general reductionof Entity SQL to first-order logic, and a state-of-the-art theorem prover. Theydesigned specialized techniques to ensure that the prover terminates withinreasonable time bounds, making brute force roundtripping validation sound,but not always complete.

7. VIEW GENERATION ALGORITHM

We start by explaining the intuition behind the algorithm. The key principlethat we use is reducing the mapping compilation problem to that of finding ex-act rewritings of queries using views [Halevy 2001]. Several factors make thisproblem difficult. First, we assume that the views are complete (often referredto as the closed-world assumption), which makes the problem computationallyharder, since we can also derive negative information from the views [Abitebouland Duschka 1998]. Second, our target language, Entity SQL, is very expres-sive, which makes it possible (and necessary) to produce complex rewritings.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 20: Compiling mappings to bridge applications and databases

22:20 • S. Melnik et al.

Finally, our goal is to find good rewritings that are likely to be executed effi-ciently by the database system.

7.1 Intuition

Consider the problem of obtaining query views from mapping map = f ◦g−1.The view f on C and the view g on S are specified as {V1 = QC1, . . . , Vn = QCn}and {V1 = Q S1, . . . , Vn = Q Sn}, respectively. Suppose that the view f ismaterialized. To preserve all information from C in the materialized view, fmust be lossless, that is, an injective function. In this case, we can reconstructthe state of C from the materialized view by finding an exact rewriting of theidentity query on C using the view f . Suppose f ′ : V → C is such a rewriting(depicted as a dashed arrow in Figure 9), given by a set of queries on V1, . . . , Vn.Hence we can unfold g in f ′ to express C in terms of S. That gives us thedesired query view q = f ′ ◦ g .

Update views and merge views can be obtained in a similar fashion by an-swering the identity query on S. The extra subtlety is that g need not be in-jective and may expose only a portion of the store data through the mapping.This raises two issues. First, the exact rewriting of the identity query on S maynot exist. That is, to leverage answering-queries-using-views techniques, weneed to extract an injective view from g . Second, we need to ensure that thestore information not exposed in the mapping can flow back into the updateprocessing in the form of a merge view. Before we present a general solution,consider the following example:

Example 6. Suppose that the store schema contains a single relation R(ID,A, B, C). Let map = f ◦ g−1, where g is defined as:

V1 = πID,A(R)V2 = πID,B(σC=3(R)).

The identity query on R cannot be answered using fragment views V1 and V2,since g is noninjective and loses some information in R. So we translate thestore schema into an equivalent partitioned schema containing relations P A

1 (ID,A), P B

1 (ID, B), P A2 (ID, A), P B,C

2 (ID, B, C) with the following schema constraints:for all i = j , πID(P X

i )∩πID(PYj ) = ∅ and πID(P X

i ) = πID(PYi ). The partitioning of

R is illustrated in Figurer 10. The equivalence between the partitioned schemaand the store schema {R} is witnessed by two bijective views p and r, where pis defined as

P A1 = πID,A(σC=3(R))

P B1 = πID,B(σC=3(R))

P A2 = πID,A(σC =3(R))

P B,C2 = πID,B,C(σC =3(R)),

and r is defined as3

R = πID,A,B,3(P A

1 � P B1

) ∪ (P A

2 � P B,C2

).

3We use an extended π operator, which may contain constants and computed expressions in theprojection list.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 21: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:21

Fig. 10. Partitioning relation R based on V1 and V2.

This partitioning scheme is chosen in such a way that the view g can berewritten in terms of union queries on the partitions. Thus g can be restatedin terms of P A

1 , P B1 , and P A

2 as follows:

V1 = P A1 ∪ P A

2

V2 = P B1 .

We call partitions P A1 , P B

1 , and P A2 exposed because they appear in the pre-

vious rewriting. They are depicted as a shaded region in Figure 10. PartitionP B,C

2 is unexposed (white region). Notice that the above rewriting is injective,that is, information-preserving, on the schema formed by the exposed parti-tions. Due to the constraint πID(P A

1 ) = πID(P B1 ) on the partitioned schema, we

can reconstruct the exposed partitions from V1 and V2 as follows4:

P A1 = V1 � V2

P B1 = V2

P A2 = V1 � V2.

Now we have all the building blocks to construct both g ′ in Figure 9 and themerge view. Let Rold , Rnew, and Rupd denote, respectively, the old state of R inthe store, the new state obtained by merging Rupd and Rold , and the updatedstate of R computed using g ′. Rupd is populated from the exposed partitions,ignoring the information in the unexposed partitions:

Rupd = πID,A,B,3(P A

1 � P B1

) ∪ πID,A,NULL,NULL(P A

2

)

= πID,A,B,3(V1 � V2

) ∪ πID,A,NULL,NULL(V1 � V2).

Since πID(P A1 ) = πID(P B

1 ), so πID(V2) ⊆ πID(V1) and Rupd can be simplified asfollows:

Rupd = πID,A,B,3(V1−−� V2).

The merge view for Rnew assembles the exposed partitions that carry theupdated client state (Rupd ) and the unexposed partitions holding the old storestate (Rold). The goal is to keep as much of the old store information in the un-exposed partitions as possible. To achieve that, we left-outer-join the exposedpartitions in the updated state with the unexposed partitions in the old state.(Notice that using a full-outer-join to preserve all information in the old stateis usually undesirable. For example, preserving all of P B,C

2 data from the old

4� is the left anti-semijoin, −−� is the left outer join.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 22: Compiling mappings to bridge applications and databases

22:22 • S. Melnik et al.

store state would prevent the client from deleting any tuples from R with C = 3,which are partially visible through V1.) We obtain Rnew by taking the expres-sion that defines R in terms of its partitions, and unfolding the definitions ofpartitions, while replacing R by Rupd, for all exposed partitions, and replacingR by Rold, for all unexposed partitions:

Rnew = σC=3(Rupd )

∪(πID,A(σC =3(Rupd )) −−� πID,B,C(σC =3(Rold )))

= πID,A,B,CASE...AS C(Rupd−−� Rold ),

where the abbreviated case statement is:CASE WHEN Rold.ID IS NOT NULL THEN Rold.C ELSE 3 END.

Composing the merge view with g ′ produces:

Rnew = πID,A,B,CASE...AS C(V1−−� V2

−−� Rold).

Unfolding f (from map = f ◦ g−1) in this expression, states V1 and V2 in termsof the client schema, and produces the final transformation that drives theupdate pipeline.

The example motivates several issues. First, we need to justify that thisapproach produces query, update, and merge views that satisfy the conditionsfrom Section 5, that is, solve the data roundtripping problem. Second, we needto develop a partitioning scheme to express the client and store schema using anequivalent schema that allows rewriting the mapping fragments using unionqueries. This rewriting simplifies the algorithm for reconstructing partitionsand testing injectivity of f when constructing query views. Notice that thechoice of the partitioning scheme is sensitive to the mapping language. Third,as we will show, the generated case statements can become quite complex andrequire careful reasoning. Finally, unfolding f in the expression V1

−−� V2 mayrequire further simplification to avoid gratuitous self-joins and self-unions if,for example both terms are queries over the same entity set in the client schema.We discuss each of these issues in the subsequent sections.

7.2 General Solution

This section formalizes the intuition presented previously, and can be skippedwithout loss of continuity. We describe our view generation approach in generalterms and show that it solves the data roundtripping problem.

Let P be a schema containing a set of relations (partitions) and a set ofschema constraints �P . P is a partitioned schema for S, relative to a querylanguage L, if there exists a procedure that (i) allows rewriting each queryg ∈ L on S using a unique set of partitions in P, (ii) each such rewriting isinjective on the subschema Pexp formed by the partitions used in the rewritingand the respective schema constraints from �P , and (iii) the rewriting of theidentity query on S uses all partitions in P. (In Example 6, Pexp contains P A

1 ,P B

1 , P A2 and constraints πID(P A

1 ) = πID(P B1 ), πID(P A

1 ) ∩ πID(P A2 ) = ∅.)

By (ii) and (iii), there exist bijective views r : P → S, r ∈ L and p : S → Pwitnessing the equivalence of P and S.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 23: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:23

Fig. 11. Obtaining u and m using partitioning scheme.

As shown in Figure 11, query g partitions S into P ⊆ Pexp ×Punexp such thatpexp : S → Pexp and punexp : S → Punexp are view complements [Bancilhon andSpyratos 1981] that together yield p. By condition (ii), g can be rewritten ash ◦ pexp, where h is injective. Let h′ be an exact rewriting of the identity queryon Pexp using h, that is, h′ reconstructs Pexp from V. Then, the update view uand merge view m can be constructed as follows:

u := f ◦ h′ ◦ r[., ∅]m(s1, s2) := r(pexp(s1), punexp(s2)),

where ∅ is the Punexp-state where all relations are empty, the view r[., ∅] is suchthat r[., ∅](x) = y iff r(x, ∅) = y , and s1, s2 ∈ S.

Assuming that f is injective and Range( f ) ⊆ Range(g ), it is easy to showthat u and m satisfy the information-preservation condition of Section 5. Theextended roundtripping criterion holds if we choose the view r in such a waythat

∀x ∈ Pexp, ∀s ∈ S : pexp(r(x, punexp(s))) = x.

To obtain such r, in Example 6, we left-outer-joined exposed partitions withunexposed partitions that agree on keys.

As we explain next, for our mapping and view language, it is always possibleto obtain the views h, h′, pexp, punexp, and r that satisfy the conditions juststated. Therefore, our solution is complete in that it allows constructing query,update, and merge views for each given valid mapping.

Due to conditions (i)–(iii), g is injective if and only if Punexp is empty, that, haszero partitions. We exploit this property to check the injectivity of f requiredby Theorem 3. To do that, we swap the sides of the mapping such that f in theprevious construction takes the place of g , and apply the partitioning schemeto C in a symmetric fashion.

7.3 Compilation Steps

In summary, mapping compilation comprises the steps described in thealgorithm CompileMapping (see Algorithm 2). It refers to Checks 1–5 fromSection 6.

Step 1 uses a divide-and-conquer method to scope the mapping compila-tion problem. Two mapping fragments are dependent if their fragment queriesshare a common extent symbol or some integrity constraint spans their extentsymbols; in this case, they are placed into the same subset of fragments. Allthe remaining steps process one such subset of mapping fragments at a time.Steps 7–10 produce query views, while Steps 2–5 produce update and merge

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 24: Compiling mappings to bridge applications and databases

22:24 • S. Melnik et al.

Algorithm .2 CompileMapping

(1) Subdivide the mapping into independent sets of fragments. Execute Steps 2-10 foreach independent set of fragments.

(2) Apply the partitioning scheme to S and g (Punexp produced here may be non-empty).Rewrite g as h ◦ pexp (Section 7.4).

(3) Produce h′ as an exact rewriting of identity query on Pexp (Section 7.5).(4) Obtain u and m (Section 7.2).(5) Simplify u and m (Section 7.9).(6) Perform partial mapping validation by doing Checks 3–5 (Section 6).(7) Apply the partitioning scheme to C and f . Enforce Check 1: if the Punexp obtained

above is non-empty, abort since f is not injective (Section 7.4).(8) Produce f ′ as an exact rewriting of the identity query on C. While computing the

rewriting enforce Check 2 and abort if it fails (Section 7.5).(9) Construct query view as q = f ′ ◦ g .

(10) Simplify q (Section 7.9).

views. Next, we discuss the partitioning scheme, which is applied in Step 2 andStep 7 of mapping compilation.

7.4 Partitioning Scheme

A partitioning scheme allows rewriting the mapping fragments using querieson partitions. Thus its choice depends directly on the mapping language. Sincethe fragment queries used in our mapping language are join-free, the parti-tioning scheme can be applied to one extent at a time. Imagine that the datathat belongs to a certain extent is laid out on a two-dimensional grid along ahorizontal and a vertical axis, where each point on the vertical axis correspondsto a combination of (a) entity type, (b) complex types appearing in entities, (c)conditions on scalar properties, and (d) is null/is not null conditions on nullableproperties; and each point on the horizontal axis corresponds to a single director inherited attribute of an entity or complex type.

The partitioning along the vertical axis can be computed using a recursivealgorithm (see Algorithm 3). The algorithm is driven by the fragment queriesappearing in a given mapping map, and constructs each partition as a con-junction of conditions. Parameter p of the algorithm holds a path expressionmatching the production P ::= e | P.A from Section 4.3. The algorithm is in-voked on each extent e by passing the inputs e as p, type of e as Tp, and themapping map:

Dom(p.A, map) denotes the domain of p.A relative to the mapping map.That is, if p.A is an enumerated or boolean domain, Dom(p.A, map) containsconditions of the form p.A = c for each value c from that domain, and the ISNULL condition if p.A is nullable. Otherwise, Dom(p.A, map) contains equalityconditions on all constants appearing in conditions on p.A in any mappingfragment in map, treating NOT NULL conditions as illustrated in the followingexample.

Example 7. Suppose the mapping fragments contain conditions (p = 1)and (p IS NOT NULL) on path p of type integer. Then, Dom(p, map) :=ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 25: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:25

Algorithm .3 PartitionVertically

procedure PartitionVertically(p, Tp, map)Part := ∅ // start with an empty set of partitionsfor each type T that is derived from or equal to Tp do

P := {σp IS OF (ONLY T )}for each direct or inherited member A of T do

if map contains a condition on p.A thenif p.A is of primitive type then

P := P × Dom(p.A, map)else if p.A is of complex type TA then

P := P × PartitionVertically(p.A, TA, map)end if

end forPart := Part ∪ P

end forreturn Part

{σcond1 , σcond2 , σcond3} where

cond1 := (p = 1)cond2 := (p IS NULL)cond3 := NOT(p = 1 OR p IS NULL).

Every pair of conditions in Dom(p, map) is mutually unsatisfiable.∨Dom(p, map) is a tautology. Notice that (p IS NOT NULL) is equiva-

lent to (cond1 OR cond3). That is, selection σp IS NOT NULL(R) can be rewrittenas a union query σcond1 (R) ∪ σcond3 (R).

The following example illustrates how the partitioning algorithm works inpresence of complex types:

Example 8. Consider an entity schema shown on the left in Figure 6, whereBillingAddr is a nullable property with complex type Address, and Address hasa subtype USAddress. Then, the vertical partitioning algorithm produces thefollowing partitions, which list all possible shapes of entities that can appearin the extent (extent Persons is omitted in selections for readability):

P1 : σe IS OF (ONLY Person)

P2 : σe IS OF (ONLY Customer) AND e.BillingAddr IS NULL

P3 : σe IS OF (ONLY Customer) AND e.BillingAddr IS OF (ONLY Address)

P4 : σe IS OF (ONLY Customer) AND e.BillingAddr IS OF (ONLY USAddress)

P5 : σe IS OF (ONLY Employee).

Horizontal partitioning is done by splitting each vertical partition in Partaccording to the properties projected in mapping fragments in map. It is easyto show that this partitioning scheme allows expressing each fragment queryappearing in map as a union query over the produced partitions. The partition-ing scheme can be easily generalized to support mappings with comparisons

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 26: Compiling mappings to bridge applications and databases

22:26 • S. Melnik et al.

or inequalities between an attribute and a constant (e.g., A > 3, A = 5) bytreating partitions as intervals rather than points. However, partitioning is notwell suited for conditions that involve multiple attributes (e.g., A > B).

7.5 Reconstructing Partitions from Views

We move on to Step 3 and Step 8 of mapping compilation. Let P be the set ofpartitions constructed in the previous section for a given C or S extent. Let h bea view defined as V = (V1, . . . , Vn) where each fragment view Vi is expressedas a union query over partitions in P. To simplify the notation, each view canbe thought of as a set of partitions. Let Pexp = ⋃

V be all partitions that areexposed in views in V. Let the set of nonkey attributes of each partition P andview V be denoted as Attrs(P ) and Attrs(V ), respectively.

If Pexp = P, then h is noninjective (as explained in Section 7.2). However,even if all partitions are used in V, h may still be non injective. As an example,consider V = (V1), P = Pexp = {P1, P2}, and V1 = P1 ∪ P2. It is impossible todistinguish P1 tuples from P2 tuples in V1, so h is noninjective. Injectivity holdsonly if we can reconstruct each partition P ∈ P from the views V. Algorithm 4does that.

The algorithm takes as input the exposed partitions Pexp and the fragmentviews V and constructs an associative table Recovered (at the bottom of thealgorithm) that maps a partition P to a set of positive views, Pos, and negativeviews, Neg. These views, if they exist, can be used to reconstruct P by joiningall views in Pos and subtracting (using anti-semijoin) the views in Neg asP = (�Pos) � (∪ Neg).

The algorithm starts by ordering the views. This step is a heuristic for pro-ducing more compact expressions, which may use fewer views. Every set ofrewritings produced by the algorithm is equivalent under each view ordering.The set of needed attributes Att represents a horizontal region, while the re-covered partitions PT represents the vertical region of the view space. Theseregions need to be narrowed down to exactly P , for each P ∈ Pexp. In Phase 1,we narrow the vertical region by intersecting the views that contain P , whilekeeping track of the attributes they provide. If joining the views is insufficient todisambiguate P tuples, then in Phase 2, we further narrow the vertical regionusing anti-semijoins with views that do not contain P . P is fully recovered ifPT does not contain any tuples beyond those in P (i.e., |PT| = 1) and covers allthe required attributes (that is, |Att| = 0). Due to this condition, the algorithmis sound, that is, each found rewriting is correct.

Example 9. Consider the partitioning scheme in Example 8. Let the frag-ment queries used in the mapping be expressed as follows (extent Persons isomitted in selections for readability):

V1 : σe IS OF Person

V2 : σe IS OF (ONLY Customer)

V3 : σe IS OF (ONLY Customer) AND e.BillingAddr IS OF (ONLY USAddress)

V4 : σ(e IS OF (ONLY Customer) AND e.BillingAddr IS NULL) OR (e IS OF (ONLY Employee)).

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 27: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:27

Algorithm .4 RecoverPartitions

procedure RecoverPartitions(Pexp, P, V)Sort V by increasing number |V | of partitions per view and

by decreasing number |Attrs(V )| of attributes per viewfor each partition P ∈ Pexp do

Pos := ∅; Neg := ∅; // keeps intersected & subtracted viewsAtt := Attrs(P ); // attributes still missingPT := P; // keeps partitions disambiguated so farn := |V|; // the number of views in V// Phase 1: intersectfor (i = 1; i ≤ n and |PT |> 1 and |Att|> 0; i++) do

if P ∈ Vi thenPos := Pos ∪ Vi ; PT := PT ∩ Vi

Att := Att − Attrs(Vi)end if

end for// Phase 2: subtractfor (i = n; i ≥ 1 and |PT|> 1; i--) do

if P ∈ Vi thenNeg := Neg ∪ Vi ; PT := PT ∩ Vi

end ifend forif |PT |= 1 and |Att| = 0 then

Recovered[P ] := (Pos, Neg)end if

end forreturn

Then, V1, . . . , V4 can be expressed in terms of the partitions as shown inFigure 12 (for brevity, we finesse horizontal partitioning, represented by un-qualified π ). Sorting yields V = (V3, V4, V2, V1). Suppose that V1 contains anattribute (e.g., Name) that is required by every partition. Then, the algorithmproduces (up to projection – π is omitted for brevity):

P1 = (V1) � (V2 ∪ V4)P2 = (V4 � V2 � V1)P3 = (V2 � V1) � (V4 ∪ V3)P4 = (V3 � V1)P5 = (V4 � V1) � (V2).

Expanding the view definitions shows that the above rewritings are impliedby the expressions in Figure 12. To illustrate the effect of sorting, notice thatview V3 need not be used in the negative part of P1, and V2 does not appear inthe positive part of P4, in contrast to using the default sorting V1, V2, V3, V4.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 28: Compiling mappings to bridge applications and databases

22:28 • S. Melnik et al.

Fig. 12. Partitions and views (based on Example 8).

To establish the completeness of the algorithm, we need to show that it failsto recover P only if P cannot be reconstructed from V. This result is basedon the following theorem. It states that we can obtain a specific partition asa function of the given views if and only if intersecting the views that containthe partition, and subtracting the views in which the partition does not occur,yields a singleton set.

THEOREM 4. Let p be a element from P, and V = {V1, . . . , Vn} be a set ofsubsets of P. Then, there exists a relational expression Expr over V1, . . . , Vnsuch that Expr = {p} if and only if p appears in some V ∈ V and

⋂{V | p ∈ V , V ∈ V} −

⋃{V | p ∈ V , V ∈ V} = {p}.

This theorem proves the completeness of the algorithm, but does not guar-antee that each found expression is minimal, that is, uses the smallest possiblenumber of views and/or operators. Finding a minimal solution is equivalent tosolving the set cover problem, which is NP-complete. So, the algorithm imple-ments a greedy approach, which often produces near-optimal results. Due tosorting, the overall complexity of the algorithm is O(n log n), while Phases 1and 2 are O(n) in the number of partitions.

7.6 Avoiding Exhaustive Partitioning

The algorithm PartitionVertically produces an exhaustive enumeration of par-titions (where each partition is a project-select query). One of the benefits ofusing exhaustive partitioning is that operations on views, and query contain-ment checks between views, can be done using simple set operations. For ex-ample, to verify that V3 from Figure 12 is query-contained in V2 − V4, we cantest whether {P4} is contained as a set in {P2, P3, P4} − {P2, P5}. However, insome mapping scenarios, exhaustive partitioning may become prohibitive, asthe number of partitions may grow exponentially with the size of mappingfragments.

Example 10. Consider a star schema used in a data warehousing appli-cation. Suppose T(id, cFK1 , . . . , cFKn) is a fact table with n nullable foreignkeys referencing some dimension tables. These foreign keys are mapped tobinary associations in the conceptual model. The mapping fragment for eachassociation is specified using a NOT NULL condition, analogously to frag-ment M4 in Example 2. Hence the domain Dom(pi, map) of each path pi

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 29: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:29

corresponding to the foreign key column cFKi contains two elements: NULLand NOT NULL. Exhaustive partitioning yields 2n partitions that enumerateall possible NULL/NOT NULL combinations of foreign key values.

To address such scenarios, we developed a more general algorithmRewriteQuery that takes as input a project-select query Q and a set of project-select views V over some extent E, and produces a rewriting Expr(V) = Q.Just as the baseline partitioning algorithm, RewriteQuery is invoked duringmapping compilation. Query Q originates from a mapping fragment and typi-cally contains one or more IS OF conditions and at most one scalar condition.In general, query Q and the views in V may have arbitrary Boolean condi-tions. The algorithm RewriteQuery (see Algorithm 5) consists of three parts.First, RewriteKeyQuery is called to produce a rewriting for the vertical regionspanned by the query, πKey(Q). The output rewriting is simplified in the secondstep. Finally, CoverAttributes is called to ensure that all attributes requiredby Q are covered by the rewriting.

RewriteKeyQuery (see Algorithm 6) is a recursive procedure that consistsof four phases, and builds on the technique used in the RecoverPartitions al-gorithm: find some rewriting Expr that overlaps with the query (Phase 0) andkeep pruning it, by intersecting and subtracting some views in Phases 1 and2, until it is fully contained in the input query. Then, if the rewriting stillcontains missing tuples, call RewriteKeyQuery recursively to find a rewrit-ing for the missing region and union it with the rewriting produced so far(Phase 3). Thus algorithm RewriteKeyQuery attempts to perform on-demandpartitioning guided by the input query. Hence, the algorithm usually consid-ers only a small subset of all possible partitions. In the worst case, all par-titions need to be reached, that is, RewriteKeyQuery has the same exponen-tial worst-case complexity as the combined execution of PartitionVertically andRecoverPartitions.

Notice that algorithm RewriteKeyQuery is sensitive to the order of the viewsin V. To find a good initial ordering that is likely to produce a compact rewriting,we sort the views similarly to how it is done in RecoverPartitions. Phase 0 helpsreduce the impact of choosing a bad view order, which may result in excessivepartitioning: it looks ahead to see if the exact rewriting can be produced byintersecting all views that fully contain the query and subtracting all viewsthat are disjoint from the query. Intersection and subtraction operations onthe views correspond to conjunction and AND NOT operations on the WHEREclauses of the queries. Containment and satisfiability checking is performedusing a SAT solver.

Algorithm .5 RewriteQuery

procedure RewriteQuery(Q , V)Expr := RewriteKeyQuery(Q , V)Expr := PruneRedundantViews(Expr, V)Expr := CoverAttributes(Attrs(Q), Expr, V)

return Expr

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 30: Compiling mappings to bridge applications and databases

22:30 • S. Melnik et al.

Algorithm .6 RewriteKeyQuery

procedure RewriteKeyQuery(Q , V)// Phase 0: Look ahead for a union-free rewritingExpr := ⋂{V : Q ⊆ V , V ∈ V} − ⋃{V : Q ∩ V = ∅, V ∈ V}if Expr = Q then return Expr end if

Pos := ∅; Neg := ∅ // keeps intersected & subtracted viewsif there exists V0 ∈ V, Q ∩ V0 = ∅ then

Expr := V0; Pos := Pos ∪ V0

else abortend if

// Phase 1: Eliminate extra tuples by intersectionfor each view V ∈ (V − Pos − Neg ) do while Expr − Q = ∅

if Expr ∩ V = ∅ thenPos := Pos ∪ V ; Expr := (Expr�V )

end ifend for

// Phase 2: Eliminate extra tuples by subtractionfor each view V ∈ (V − Pos − Neg ) do while Expr − Q = ∅

if Expr − V = ∅ thenNeg := Neg ∪ V ; Expr := (Expr � V )

end ifend forif Expr − Q = ∅ then abort end if

// Phase 3: Find rewriting for missing tuples, if anyQmissing := Q − Exprif Qmissing = ∅ then

Expr := (Expr ∪ RewriteKeyQuery(Qmissing , V))end if

return Expr

THEOREM 5. Algorithm RewriteKeyQuery is complete, that is, it aborts onlywhen no rewriting exists.

To illustrate the benefit of avoiding exhaustive partitioning, recallExample 10. Suppose we need to find a rewriting for the queryπId(σcFKi IS NOT NULL( T)). It is easy to see that RewriteKeyQuery finds therewriting Expr = Vi in linear time without enumerating all partitions.

In the second part of RewriteQuery algorithm, we simplify the rewritingproduced by RewriteKeyQuery (see Algorithm 7). We use a dynamic algo-rithm that tries to eliminate one view symbol at a time from the resultingexpression and checks that the reduced expression is still equivalent to theoriginal one. The algorithm avoids recomputing equivalent parts of expres-sions by fixing a subexpression (called head) and working on the remaining

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 31: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:31

Algorithm .7 PruneRedundantViews

procedure PruneRedundantViews(Exprorig, V)Exprpruned := Exprorigfor each view V ∈ V do

Exprpruned := PruneRedundantRecursively(V , V − {V }, Exprorig, V)end for

return Exprpruned

// For brevity only pruning union is shownprocedure PruneRedundantRecursively(Exprhead , Vtail, Exprorig, V)

Exprpruned := Exprorigif Vtail contains a single set Vlast then

// For union queries, testing containment is sufficient to ensure equivalenceif Exprorig ⊆ Exprhead then

V := V − {Vlast}Exprpruned := Exprhead

end ifelse if Vtail = ∅ then

// Compute the new ‘left’ head expressionVleft := first half of Vtail; Vright := Vtail − VleftExprleft := Exprhead ∪ ⋃

VleftExprpruned := PruneRedundantRecursively(Exprleft, Vright, Exprpruned, V)// Compute the new ‘right’ head expression; V may have shrunkVright := Vright ∩ VExprright := Exprhead ∪ ⋃

VrightExprpruned := PruneRedundantRecursively(Exprright, Vleft, Exprpruned, V)

end ifreturn Exprpruned

subexpression (tail). It is invoked recursively, splitting the tail into two halvesin each subsequent invocation, while the precomputed head is kept on stack. Forcompactness, only pruning unions is shown. The difference operator is treatedanalogously.

The last part of the rewriting algorithm is the CoverAttributes procedure(Algorithm 8). The dictionary ToCover associates with each projected attributeof Q a condition (vertical region) for which the attribute value needs to befilled in. ToCover is initialized to Expr (for which πKey(Expr) = πKey(Q) by con-struction done in RewriteKeyQuery). Next, ToCover is pruned by removing thevertical regions that are already covered by Expr. Here, V iews(Expr) denotesthe set of all view symbols from V that appear in Expr. Lastly, the algorithmiterates over all views V that do not appear in Expr, and removes the vertical re-gion corresponding to V from ToCover[A] whenever attribute A is projected inview V . The extra views that supply the attributes required by Q are appendedto Expr in the return statement.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 32: Compiling mappings to bridge applications and databases

22:32 • S. Melnik et al.

Algorithm .8 CoverAttributes

procedure CoverAttributes(A, Expr, V)// Initialize vertical region that needs to be covered for each attributefor each attribute A ∈ A do

ToCover[A] := Exprend for// Remove regions that are already covered by key rewritingfor each view V ∈ Views(Expr) do

for each attribute A ∈ Attrs(V ) doToCover[A] := ToCover[A] − V

end forend for

// Determine extra views needed to cover the attributesAttrV iews := ∅for each attribute A ∈ A do

for each view V ∈ V − V iews(Expr) doif V ∩ ToCover[A] = ∅ then

ToCover[A] := ToCover[A] − V ; AttrViews := AttrViews ∪ {V }end if

end forend forreturn Expr�

⋃AttrViews

7.7 Exploiting Outer Joins

The algorithm RecoverPartitions tells us how to reconstruct each partition P ofa given extent. Ultimately, we need to produce a rewriting for the entire extent,which combines multiple partitions. Consider Examples 8 and 9. To obtain thequery view for the Persons extent, we could union partitions P1, . . . , P5 fromExample 9. The resulting expression would use fourteen relational operatorsand be clearly suboptimal. Instead, we exploit the following observation. Sinceeach fragment query produces a new view symbol Vi and is join-free, every Vicontains only partitions from a single extent. Therefore, each extent can bereconstructed by doing a full outer join of all the views that contain partitionsfrom that extent. The full-outer-join expression can be further simplified usinginclusion and disjointness constraints on the views.

Example 11. Let Persons = τ (E), where τ is an expression that performsentity construction, fills in required constants, and so on (we talk about itshortly). Then, E can be obtained using the following expressions, all of whichare equivalent under the view definitions of Figure 12:

E = P1 ∪ P2 ∪ P3 ∪ P4 ∪ P5

= V1−−�−− V2

−−�−− V3−−�−− V4

= (V1−−� (V2

−−� V3)) −−� V4 (1)= ((V1

−−� V2) −−� (V3 ∪a V4)). (2)

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 33: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:33

∪a denotes union without duplicate elimination (UNION ALL). The equivalencecan be shown by noticing that the following constraints hold on the views:V3 ∩ V4 = ∅, V3 ⊆ V2 ⊆ V1, V4 ⊆ V1.

Using full outer joins allows us to construct an expression that is minimalwith respect to the number of relational operators: n views are connected us-ing the optimal number of (n − 1) binary operators. Still, as illustrated in Ex-ample 11, multiple minimal rewritings may exist. In fact, their number growsexponentially with n. As we illustrate in the experimental section, these rewrit-ings may have quite different performance characteristics when they are usedto execute queries on the database. One reason is the inability of the optimizerto push down selection predicates through outer joins.

To find a good rewriting that is likely to result in efficient execution plans,we use a heuristic that increases optimization opportunities by replacing outerjoins with inner joins and unions. As a side effect, this helps avoid unnecessaryjoining of data that is known to be disjoint, and minimize the size of interme-diate query results.

The algorithm we use performs successive grouping of the initial full-outer-join expression using the inclusion and disjointness constraints between theviews (see Algorithm 9). The views are arranged into groups, which are rela-tional expressions in the prefix notation. For clarity of presentation, let thefragment views over the extent to be reconstructed be called target views, whiletheir counterparts in the mapping fragments shall be called source views. Thatis, given a mapping fragment QC = Q S , if we generate query views, then QCis a target view and Q S is a source view. Let ρtgt(l ) denote the full-outer-joinquery of all groups in l , where l is a sequence of groups. Let ρsrc(l ) be the queryobtained from ρtgt(l ) by substituting all target views in ρtgt(l ) by their respectivesource views.

The grouping is performed by replacing −−�−− by ∪a, �, and −−�, in that order,where possible. Using ∪a (UNION ALL) is critical to avoid the sorting overheadupon query execution. It is straightforward to see how the disjointness or con-tainment relationships between the target views (ρtgt(l )) are exploited to dooperator substitution. To illustrate where ρsrc(l ) conditions become important,consider the following example.

Example 12. Suppose that the Supports association shown in Example 2has a [1..1]–[0..n] multiplicity instead of [0..1]–[0..n]. Then, every Customer hasan associated Employee. In this case, we could use an inner join instead of aleft outer join in the update view for the Client table shown in Figure 20. In thisexample, the relevant mapping fragments are:

QC3 = V3 = Q S3

QC4 = V4 = Q S4.

Containment check using ρtgt, Q S4 ⊆ Q S3, allows us to conclude that πCid(V4) ⊆πCid(V3), that is, justifies using a left outer join in the update view. However, ifthe Employee end multiplicity is [1..1], so the equivalence check on ρsrc yieldsπId(QC3) = πId(QC4), and we can use an inner join in place of the left outer join.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 34: Compiling mappings to bridge applications and databases

22:34 • S. Melnik et al.

Algorithm .9 GroupViews

procedure GroupViews(V)Arrange Vinto pairwise disjoint −−�−− − groupsG1, . . . , Gm

such that each group contains views from a single source extentGV := −−�−−(G1, . . . , Gm)for each group G in GV dorecursively depth-first

if G = (−−�−− : , l1, , l2, ) and |l1|> 0 and |l2| > 0 thenG1 = (−−�−− : l1); G2 = (−−�−− : l2)if ρtgt(l1) ∩ ρtgt(l2) = ∅ or ρsrc(l1) ∩ ρsrc(l2) then

G = −−�−−(∪a(G1, G2), )if generating update views and ρsrc(l1) ∩ ρsrc(l2) = ∅ then abort (∗ ∗ ∗)end if

else if ρtgt(l1) = ρtgt(l2) or ρsrc(l1) = ρsrc(l2) then G = −−�−−(�(G1, G2), )else if ρtgt(l1) ⊇ ρtgt(l2) or ρsrc(l1) ⊇ ρsrc(l2) then G = −−�−−(−−�(G1, G2), )else if ρtgt(l1) ⊆ ρtgt(l2) or ρsrc(l1) ⊆ ρsrc(l2) then G = −−�−−(−−�(G2, G1), )end if

end ifend forFlatten singleton groups in GV

return GV

A crucial property that we leverage in Check 2 of Section 6 is that the group-ing algorithm preserves key constraints. We full-outer-join keyed relations sothe output view has a key. Every operator replacement performed by the al-gorithm is an equivalence transformation—assuming that the mapping andschema constraints in C and S are satisfied for every instance of C. However,this property is not established at the time when the algorithm runs to produceupdate views:

Example 13. Suppose that fragment queries Q S1 and Q S2 in Example 2are defined as

Q S1 = πId,Name(σDisc=1(T))Q S2 = πId,Dept(σDisc=2(T)),

for some table T(Id, Name, Dept, Disc). Clearly, πId(Q S1) ∩ πId(Q S2) = ∅. So,Q S1

−−�−−Q S2 = Q S1 ∪a Q S2 (up to correct projection of attributes). However,since πId(QC2) ⊆ πId(QC2), if we replace −−�−− by ∪a when constructing the updateview for T we may violate the key constraint on T. Indeed, if the entity setPersons contains an instance of Employee, there exists no instance of S suchthat all the mapping and schema constraints are satisfied.

The only step of the algorithm where G may lose its key is where we replace−−�−− by ∪a. The disjointness check at line (∗∗∗) prevents that from happening. Ifthe check does not pass, the algorithm aborts. The formal justification for thatis provided by Theorem 2: if we are able to derive any constraint in Range(g )that is not backed by some constraint in Range( f ), then the mapping does notroundtrip.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 35: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:35

Table I. Rewriting of CASE Statements for Example 8

Query Q RewriteKeyQuery(Q , V) Partitionsσe IS OF (ONLY Person) V1 − (V2 ∪ V4) P1σe IS OF (ONLY Customer) V2 P2 ∪ P3 ∪ P4σe IS OF (ONLY Employee) V4 − V2 P5σe.BillingAddr IS NULL V2 ∩ V4 P2σe.BillingAddr IS OF (ONLY Address) V2 − (V3 ∪ V4) P3σe.BillingAddr IS OF (ONLY USAddress) V3 P4

Algorithm GroupViews does not prescribe the choice of l1 and l2 uniquely. Inour implementation, a greedy approach is used to find large l1 and l2 in lineartime, yielding the overall O(n) complexity. We defer the explanation of the veryfirst step of the algorithm, the initial grouping by source extents, to Section 7.9.

7.8 Producing CASE Statements

In Example 11, we used the expression τ to encapsulate constant creation andnonrelational operators that reside on top of a relational expression. In thissection, we explain how τ is obtained. Consider the views shown in Example 8.Upon assembling the Persons extent from V = {V1, . . . , V4}, we need to instan-tiate Persons, Customers, or Employees, depending on the views from which thetuples originate.

We use procedure RewriteKeyQuery from Section 7.6 to construct the CASEstatements. The inputs and outputs to the procedure are shown in Table 1.For convenience, the last column shows the partitions corresponding to eachquery Q submitted to the rewriting procedure. That is, all tuples coming fromP1 yield Person instances, while all tuples coming from P2, P3, or P4 produceCustomer instances. Furthermore, for all tuples in P2, the BillingAddr propertyneeds to be set to NULL, and so on.

Let the boolean variables bV 1, . . . , bV 4 keep track of the tuple provenance.That is, bV 1 is true if and only if the tuple comes from V1, and so on Now we canconstruct τ by replacing the view symbols in Table 1 by the respective Booleanvariables. Continuing with Example 8, we obtain Persons as

CASE WHEN (bV1 AND NOT bV2 AND NOT bV4 ) THEN Person(. . . )WHEN bV2 THEN Customer(. . . ,

CASE WHEN (bV2 AND bV4 ) THEN NULLWHEN (bV2 AND NOT bV3 AND NOT bV4) THEN Address(. . . )WHEN bV3 THEN USAddress(. . . )END AS BillingAddr, . . . )

WHEN (bV4 AND NOT bV2 ) THEN Employee(. . . )(E

),

where E is the relational expression whose construction we discussed in theprevious section. A similar approach is used to reconstruct scalar values thatappear in selection conditions in the mapping fragments, such as discriminatorsused in table-per-hierarchy scenarios.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 36: Compiling mappings to bridge applications and databases

22:36 • S. Melnik et al.

7.9 Eliminating Self-joins and Self-unions

We discuss the final simplification phase of mapping compilation, performed inStep 5 and Step 10.

Example 14. Consider constructing query views in Example 11. In Step 9,we expand the view symbols Vi by fragment queries on relational tables. Sup-pose the fragment queries for V2 and V3 are on the same table, for example,V2 = R and V3 = σA=3(R). If we choose the rewriting (∗), then V2

−−� V3 becomesa redundant self-join and can be replaced by R. We obtain:

E = (V1−−� R) −−� V4.

In contrast, choosing the rewriting (∗∗) gives us:

E = ((V1−−� R) −−� (σA=3(R) ∪ V4)).

To bias the algorithm GroupViews toward the top expression, in its firststep, we group the fragment views according to their source extents. For queryviews, source extents are tables; for update views, source extents are thosein the entity schema. The main loop of GroupViews preserves this initialgrouping.

Eliminating the redundant operations poses an extra complication. As ex-plained in the previous section, we need to keep track of tuple provenance. So,when self-joins and self-unions get collapsed, the Boolean variables bV need tobe adjusted accordingly. These Boolean variables are initialized in the leavesof the query tree by replacing each Vi by Vi × {True AS bVi }. In Entity SQL,Boolean expressions are first-class citizens and can be returned by queries (seeFigure 4). To preserve their value upon collapsing of redundant operators, weuse several rewriting rules, one for each operator. Let b1 and b2 be computedboolean terms, A and B be disjoint lists of attributes, c1 and c2 be Booleanconditions, and let

E1 = πb1,A,B(σc1 (E))E2 = πb2,A,C(σc2 (E)).

Then, the following equivalences hold:

E1 �A E2 = πb1,b2,A,B,C(σc1∧c2 (E))

E1−−�A E2 = πb1,(b2∧c2),A,B,C(σc1 (E))

E1−−�−−A E2 = π(b1∧c1),(b2∧c2),A,B,C(σc1∨c2 (E)).

Observe that if πA(E1)∩πA(E2) = ∅, then E1 ∪a E2 = E1−−�−−A E2. The disjoint-

ness holds for ∪a introduced by the GroupViews algorithm because it requiresthe unioned expressions to have nonoverlapping keys. Therefore, for ∪a, we canuse the same rule as for −−�−−.

The presented rules assume two-valued Boolean logic, where outer joins pro-duce False values instead of NULLs for boolean terms. However, like standardSQL, Entity SQL uses three-valued boolean logic (True, False, NULL). To com-pensate for this, we translate nullable Boolean expressions into non-nullable

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 37: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:37

ones that replace NULLs by False values. This can be done by wrapping a nul-lable boolean expression b as (b AND (b IS NOT NULL)), or as (CASE WHEN bTHEN True ELSE False END).

7.10 Producing Merge Views

Merge views can be computed similarly to query and update views using thetechniques from the preceding sections. The input to the computation is apseudo-mapping constructed from the store-side (Q S) fragment views and onespecial mapping fragment that represents the old store state. To illustrate,consider Example 6. The constructed pseudomapping is:

πID,A(Rupd ) = πID,A(R ′)πID,B(σC=3(Rupd )) = πID,B(σC=3(R ′))

πID,A,B,C(Rold ) = πID,A,B,C(σIsOld=true(R ′)).

R ′ extends the relation signature of R by adding a new Boolean columnIsOld. Intuitively, the last mapping fragment specifies that the old portion of R ′

is obtained from Rold . Then, we run the algorithm as if computing the updateview for R ′, with one modification: the procedure RewriteQuery from Section 7.6is instrumented not to abort when the exact rewriting does not exist but, returna partial (maximal) rewriting instead. This is done to compensate for the addedcolumn IsOld. We obtain the rewriting of R ′ in terms of Rupd and Rold . As a laststep, we eliminate the column IsOld from the rewriting.

8. EVALUATION

The goal of a client-side mapping layer is to boost the developer’s productiv-ity while offering performance comparable to a lower-level solution that useshandcoded SQL. In this section, we discuss the experimental evaluation of theEntity Framework, focusing on the mapping compiler.

Correctness. The top priority for the mapping compiler is correctness. Itneeds to produce bidirectional views that roundtrip data, otherwise data lossor corruption is possible. In practice, proving correctness of a complex system onpaper is insufficient, since errors unavoidably creep into the implementation.We followed two paths. First, our product test team developed an automatedsuite that generates thousands of mappings by varying some core scenarios. Thecompiled views are verified by deploying the entire data access stack to queryand update pregenerated sample databases. Although this does not guaranteecomplete coverage, it exercises many other parts of the system (e.g., client-sidequery optimization) in addition to the mapping compiler. The second evalu-ation technique was developed by our colleagues in the software verificationgroup [Mehra et al. 2007]: a tool that translates Entity SQL queries into first-order logic formulas and feeds them into a theorem prover. As we explained inSection 5, testing roundtripping of views expressed in Entity SQL is undecid-able, so the prover may not terminate. For the positive cases, it has been ableto verify all views tested so far.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 38: Compiling mappings to bridge applications and databases

22:38 • S. Melnik et al.

Efficiency. Another aspect that we studied is the efficiency of mapping com-pilation. The validation and rewriting steps of the algorithm may require worst-case exponential time, which is mainly due to allowing complex Boolean con-ditions in mapping fragments. Subdividing the mapping fragments in Step 1usually yields a manageable number of fragments n on which the main algo-rithm runs. We have seen rare cases where n exceeds a few dozen. Althoughwe have encountered synthetic scenarios where compilation running time be-comes prohibitive, it is yet to be seen whether these scenarios are of practicalimportance and require further optimization.

Compiling mappings prior to query execution was a fundamental designchoice, which alleviates performance issues; it eliminates the runtime penaltyof teasing out the data transformation logic defined in a mapping. In our ini-tial design, the compiler was invoked upon the first query or update issuedby the application, yielding a one-time performance hit per lifetime of an ap-plication. This few-second delay was deemed unacceptable. One reason is thatapplications are rerun frequently during development. Another is that mid-tier applications often shut down and restart for load balancing. Therefore, wedecided to factor out the compiler execution from the runtime and integrateit with the development environment (IDE). This approach may enable usingexhaustive techniques in place of some of the currently deployed heuristics,where view generation runs as a compile-time background job and recompilesonly those parts of the mapping that have been modified by the developer.

Performance. Mapping compilation plays a key role both in the client-sidequery rewriting and server-side execution. In the algorithms we presented,we made heavy use of implied constraints to simplify the generated views. Toillustrate these benefits qualitatively, consider the following experiment run onthe AdventureWorks sample database shipped with Microsoft SQL Server. Wepartitioned the Sales table horizontally into H1 and H2, and vertically into V1and V2. This produces two mapping scenarios, where the sales data needs to bereassembled from the respective tables. Leveraging the mapping allows us torewrite −−�−− as ∪a (exploiting the disjointness of H1 and H2) and as � (knowingthat V1 and V2 agree on keys). Suppose that the client issues selection queries asshown in Figure 13. If the query views contain outer joins, the SQL optimizeris not able to push the selections down, resulting in suboptimal query plansthat take many times longer to execute. This effect multiplies in more complexscenarios.

In addition to the server load, another important performance metric is thetotal overhead of the client-side layer. The major factors contributing to thisoverhead are object instantiation, caching, query manipulation, and delta com-putation for updates. For small data sets and OLTP-style queries, these factorsmay dominate the overall execution time. Therefore, judicious optimization ofevery code path is critical. One such optimization is done during query pruningand directly involves the mapping compiler. After unfolding query views in auser query, the query pipeline tries to eliminate subexpressions from the querythat do not contribute to the query result. As an example, consider the querySELECT VALUE e FROM Persons WHERE e IS OF Employee. When the query

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 39: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:39

Fig. 13. Exploiting mappings to speed up queries.

view for Persons (see Figure 7) is unfolded, the resulting expression can besimplified by pruning away the subquery on the Client table and replacing theleft outer join with an inner join. Recognizing such optimization opportunitiesat runtime is expensive and not always possible without considering the map-ping. To help the query pipeline, the mapping compiler produces OfType(Extent,Type) views such as the ones shown in Figures 15 and 16 which are used inplace of the query view for the entire extent (Figure 7) whenever possible.

Performance analysts in our product group conducted an extensive studycomparing the overall system performance with custom implementations andcompeting products. The study found that the views produced by the mappingcompiler are close to those written by hand by experienced database developers.Details of this study are beyond the scope of this article. The generated viewsenable the query and update pipelines to produce query and update statementswhose server-side execution performance approaches that of a custom imple-mentation. Some mapping scenarios supported by the compiler, in particular,arbitrary combinations or vertical and horizontal partitioning in inheritancemappings, go beyond those supported in other commercial systems.

Reference applications for the Data Access Framework are currently beingdeveloped in collaboration with third parties, and will provide further insightinto performance characteristics. If developers encounter a scenario that cannotbe expressed using our mapping language they have the option of providingcustom E-SQL queries for read-only access, or using native SQL and storedprocedures of the DBMS. We are tracking customer feedback to prioritize futureextensions of mapping capabilities.

9. RELATED WORK

Transforming data between database and application representations remainsa hard problem. Researchers and practitioners have attacked it in a number of

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 40: Compiling mappings to bridge applications and databases

22:40 • S. Melnik et al.

ways. A checkpoint of these efforts was presented by Carey and DeWitt [1996].They outlined why many such attempts, including object-oriented databasesand persistent programming languages [Atkinson and Buneman 1987], did notpan out. They speculated that object-relational databases would dominate in2006. Indeed, many of today’s database systems include a built-in object layerthat uses a hardwired O/R mapping on top of a conventional relational engine[Carey et al. 1999; Krishnamurthy et al. 1999]. However, the O/R features of-fered by these systems appear to be rarely used for storing enterprise data, withthe exception of multimedia and spatial data types [Grimes 1998]. Among thereasons are data and DBMS-vendor independence, the cost of migrating legacydatabases, scale-out difficulties when business logic runs inside the databaseinstead of the middle tier, and insufficient integration with programming lan-guages [Zhang and Ritter 2001].

Database research has contributed many powerful techniques that can beleveraged for supporting mapping-driven data access. And yet, there are sig-nificant gaps. Among the most critical ones is supporting updates through map-pings. Compared with queries, updates are far more difficult to deal with, sincethey need to preserve data consistency across mappings, may trigger businessrules, and so on. Updates through database views have a long history. Dayaland Bernstein [1978] put forth the view update problem. They observed thatfinding a unique update translation, even for very simple views, is rarely pos-sible due to the intrinsic underspecification of the update behavior by a view.As a consequence, commercial database systems offer very limited support forupdatable views.

Subsequent research followed two major directions. One line of work focusedon determining under what circumstances view updates can be translated un-ambiguously, more recently for XML views [Braganholo et al. 2004]. Unfortu-nately, there are few such cases; moreover, every update usually needs to havea well-defined translation—rejecting a valid update is unacceptable. Further-more, in mapping-driven scenarios, the updatability requirement goes beyond asingle view. For example, an application that manipulates Customer and Orderentities effectively performs operations against two views. Sometimes a consis-tent application state can only be achieved by updating several views simulta-neously. Another line of research focused on closing the underspecification gap.Several mechanisms were suggested in the literature, such as constant com-plement [Bancilhon and Spyratos 1981] or dynamic views [Gottlob et al. 1988].So far, this work has turned out to be of mostly theoretical value, althoughtechniques of Barsalou et al. [1991] and Keller et al. [1993] appear to have hadcommercial impact. Our merge views are based on view complements.

Recently, researchers have shown rekindled interest in the view update prob-lem. Bohannon et al. [2006] and Foster et al. [2007] developed a bidirectionalmechanism called ‘lenses’ that uses get and putback functions. Our query viewscorrespond to get, while update and merge views combined define putback.One of our key distinguishing contributions is a technique for propagating up-dates incrementally using view maintenance [Blakeley et al. 1986; Gupta andMumick 1995], which is critical for practical deployment. Another contributionis a fundamentally different mechanism for obtaining the views, by compiling

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 41: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:41

them from mappings, which allows describing complex mapping scenarios inan elegant way. Also, in our approach, data reshaping (specified by query andupdate views) is separated from the update policy (merge views). Further re-cent work on view updates is Kotidis et al. [2006], where delta tuples are storedin auxiliary tables if the updates cannot be applied to the underlying database.Bidirectional transformations for XML are examined in Bohannon et al. [2005]and Barbosa et al. [2005].

Mapping compilation was explored in IBM’s Clio project, which introducedthe problem of generating data transformations from correspondences betweenschema elements [Miller et al. 2000]. In Clio, schema element correspondencesare translated into GLAV mapping constraints, which are then compiled intotransformations expressed in XQuery or SQL. In contrast to our work, Clio fo-cuses on data exchange scenarios and uses uni-directional mappings expressedusing inclusion, rather than equality, in mapping constraints. More recent workon Clio considered nested mappings [Fuxman et al. 2006], which simplify map-ping specification for nested collections.

Our mapping compilation procedure draws on answering queries using viewsfor exact rewritings (surveyed in Halevy [2001] and examined recently in Gouet al. [2006], Segoufin and Vianu [2005], and Larson and Zhou [2007]), andexploits several novel aspects such as the partitioning scheme, bipartite map-pings, and rewriting queries under bidirectional mappings. We are not awareof published work that has exploited outer joins and case statements, as ourshas, to optimize the generated views.

Other related mapping manipulation problems are surveyed in Bernsteinand Melnik [2007]. These problems include creation, compilation, reuse, evolu-tion, and execution of mappings, and constitute the subject of research on modelmanagement. Mapping composition, computing an inverse of a mapping, andschema translation, are among the issues discussed in the survey.

10. CONCLUSIONS

Bridging applications and databases is a fundamental data management prob-lem. We presented a novel mapping approach that supports queries and updatesthrough mappings in a principled way. We formulated the mapping compilationproblem and described our solution. To the best of our knowledge, the ADO.NETEntity Framework is the first commercial data access product that is driven bydeclarative bidirectional mappings with well-defined semantics.

Our solution, just like most data access and management systems, is sensi-tive to the choice of schema and mapping languages. The technical challengeswe had to address were primarily due to the expressive power of the supportedlanguages, rather than their actual embodiment (EDM and Entity SQL). Wethink our approach is applicable to any data access layer that uses inheritance,relationships, and complex types in its conceptual model, PK/FK constraints onrelational schemas, and a query language that extends SQL with nonrelationalconstructs used in that conceptual model. Further research is needed to supportricher modeling capabilities, such as multiple inheritance or nested collections,and more expressive mapping languages.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 42: Compiling mappings to bridge applications and databases

22:42 • S. Melnik et al.

The new mapping architecture exposed many interesting research chal-lenges. In addition to mapping compilation, these challenges include enforcingdata consistency using a combination of client-side and server-side integrityconstraints, exploiting efficient view maintenance techniques for object-at-a-time and set-based updates, translating errors through mappings, optimisticconcurrency, notifications, and many others [Bernstein and Melnik 2007]. Weplan to report on these in the future.

APPENDIX

A. PROOFS

THEOREM 1. Let map ⊆ C × S. Then, map ◦ map−1 = Id (C) if and only ifthere exist two (total) views, q : S → C and u : C → S, such that u ⊆ map ⊆ q−1.

PROOF. Assume map ◦ map−1 = Id (C). Hence, the following injectivityproperty holds: ∀x1, x2, y : map(x1, y), map(x2, y) → x1 = x2. Otherwise,map ◦ map−1 may contain some (x1, x2), x1 = x2. That is, map−1 is a (possi-bly partial) function from S to C. We construct q by extending map−1 to becometotal. To do that, we assign each y = Domain(map−1) to some arbitrary c ∈ C.We have: map−1 ⊆ q, i.e., map ⊆ q−1. To construct u, restrict map to become afunction by picking an arbitrary s ∈ S, (c, s) ∈ map, for each c ∈ C.

Now let q and u be (total) functions, u ⊆ map ⊆ q−1. That is, for eachx, y the following implications hold: u(x) = y → map(x, y) → q( y) =x. Suppose (x, y) ∈ map ◦ map−1. Then, ∃z : map(x, z), map−1(z, y), somap(x, z), map( y , z), and consequently q(z) = x, q(z) = y . Since q is a func-tion, x = y , and since Range(q) ⊆ C, so x ∈ C. Conversely, suppose x ∈ C. Sinceu is total, so ∃ y : u(x) = y . Hence, map(x, y), map−1( y , x), and consequently(x, x) ∈ map ◦ map−1.

PROPOSITION 1. For each q and u satisfying Theorem 1 the following holds:

u ◦ q = Id (C)

PROOF. Let q and u be total views such that u ⊆ map ⊆ q−1. Clearly, u ⊆ q−1.Hence for each x and y , u(x) = y implies q( y) = x. Suppose (x, y) ∈ u ◦ q. Then,∃z : u(x) = z, q(z) = y . So, q(z) = x, and hence x = y ∈ C. Conversely, supposex ∈ C. Since u is total, so ∃ y : u(x) = y . Hence q( y) = x, that is, (x, x) ∈ u ◦ q.

THEOREM 2. Let map = f ◦g−1 be a bipartite mapping defined by (total)views f and g. Then the following conditions are equivalent:

(1) map ◦ map−1 = Id (C)(2) (a) f is injective and (b) Range( f ) ⊆ Range(g )

PROOF.

(1) → (2): Assume map ◦ map−1 = Id (C). To obtain a contradiction, supposeRange( f ) ⊆ Range(g ), that is, there exists y ∈ Range( f )−Range(g ).Then, for some (x, y) ∈ f we have x ∈ Domain( f ◦ g−1), and so mapis not total, and map ◦ map−1 = Id (C). We obtained a contradiction,that is, (2b) holds.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 43: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:43

Moreover, f is injective. Suppose the opposite. Then,{(x1, v), (x2, v)} ⊆ f , x1 = x2. By (2b), v is in Range(g ). Hence,there exists z such that {(x1, z), (x2, z)} ⊆ map. Thus (x1, x2) ∈ Id (C).Hence, by contradiction, (2a) holds too.

(2) → (1): Suppose f is injective and Range( f ) ⊆ Range(g ). Let (x, y) ∈map ◦ map−1. Then, by definition of composition ∃z1, w, z2 : f (x, z1),g (w, z1), g (w, z2), f ( y , z2). Since g is a function, so z1 = z2. Since fis injective, so x = y . That is, map ◦ map−1 ⊆ Id(C). Conversely,suppose x ∈ C. Since f is total, there exists y with f (x) = y .Since Range( f ) ⊆ Range(g ), there exists z with g (z) = y . Sincemap = f ◦ g−1, so (x, y) ∈ map and ( y , x) ∈ map−1. Hence (x, x) ∈map ◦ map−1, as desired. Thus, Id (C) ⊆ map ◦ map−1.

THEOREM 3. Let map = f ◦ g−1 ⊆ C × S, f : C → V, g : S → V. Fur-ther let u : C → S be some update view constructed from map (i.e., u ⊆ map,Domain(u) = C, Range(u) ⊆ S). Then, the following roundtripping conditionsare equivalent:

(1) map ◦ map−1 = Id(C)(2) (a) f is injective and (b) map ◦ u−1 = Id(C)

PROOF.

(1) → (2): Let (1) be satisfied. (2a) follows from Theorem 2. We prove that(2b) holds, too. We first show that map ◦ u−1 ⊆ Id (C). Let (x, z) ∈map ◦ u−1. So, x ∈ C. Then, there exists y ∈ S such that (x, y) ∈ map,(z, y) ∈ u. Since u ⊆ map, we have (z, y) ∈ map. Therefore,(x, z) ∈ map ◦ map−1. By (1), (x, z) ∈ Id (C). Thus map ◦ u−1 ⊆ Id (C).

We need to show that map ◦ u−1 ⊇ Id (C). Let x ∈ C. Since u is totalon C, there exists some y ∈ S such that (x, y) ∈ u. But then, (x, y) ∈map and hence (x, x) ∈ map ◦ u−1. That is, map ◦ u−1 ⊇ Id (C). Thisshows that (2b) holds.

(2) → (1): Now, assume that (2a) and (2b) hold. Since u ⊆ map, we havemap ◦ map−1 ⊇ map ◦ u−1. By (2b), we can substitute Id (C) formap ◦ u−1, obtaining map ◦ map−1 ⊇ Id (C). So we need to show thatmap ◦ map−1 ⊆ Id (C). Let (x1, x2) ∈ map ◦ map−1. Hence there ex-ists y ∈ S such that (x1, y) ∈ map and (x2, y) ∈ map. Further, sincemap = f ◦ g−1, there exist some v1, v2 ∈ V such that (x1, v1) ∈ f ,( y , v1) ∈ g , (x2, v2) ∈ f , ( y , v2) ∈ g . Since g is functional, we havev1 = v2. Therefore, (x1, v1) ∈ f and (x2, v1) ∈ f . By (2a), f is injective,so x1 = x2 and (x1, x2) ∈ Id (C).

Theorem 4 is restated in terms of sets containing constants. It is not directlyrelated to partitions or views. We switch the terminology to emphasize thatthe theorem can be understood independently from the context of the article,while reusing the symbols for partitions and views for clarity of exposition. Weuse the notation expression variable := expression as a syntactic assignment,while predicates =, =, and ⊆ evaluate the equivalence, non-equivalence, andcontainment of two expressions, respectively.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 44: Compiling mappings to bridge applications and databases

22:44 • S. Melnik et al.

THEOREM 4. Let p be a constant from P, and V = {V1, . . . , Vn} be a set ofsubsets of P. Then, there exists a relational expression Expr over V1, . . . , Vn suchthat Expr = {p} if and only if p appears in some V ∈ V and

⋂{V | p ∈ V , V ∈ V} −

⋃{V | p ∈ V , V ∈ V} = {p}

PROOF. If p does not appear in any V , then clearly no Expr = {p} exists.Assume the opposite. Let E := ⋂{V | p ∈ V , V ∈ V} − ⋃{V | p ∈ V , V ∈V}. To prove the if-part, if E = {p} then set Expr := E. To prove the only-if part, suppose there is an Expr over V1, . . . , Vn such that Expr = {p}. SinceV1, . . . , Vn are sets, then Expr can be represented in the disjunctive normal formas

Expr := E1 ∪ E2 ∪ · · · ∪ Em,

where each E j is a conjunctive term for 1 ≤ j ≤ m, that is, E j is an intersectionof V1, . . . , Vn or their complements (or, equivalently, intersection of sets followedby subtracted sets).

By our assumption, Expr = {p}, so each term E j yields a subset of {p}.At least one of them, say Ek must be exactly {p}. All intersected sets in Ekmust contain p, while none of the subtracted sets in Ek may contain p. LetE ′

k be a term obtained by intersecting Ek with all sets in V that contain p,and subtracting all sets that do not contain p. Clearly, E ′

k = Ek = {p}. Byconstruction, E ′

k = ⋂{V | p ∈ V , V ∈ V} − ⋃{s | p ∈ V , V ∈ V}. Thus we have{p} = E ′

k as desired.

THEOREM 5. Algorithm RewriteKeyQuery is complete, that is, it aborts onlywhen no rewriting exists.

PROOF. Let P be the exhaustive set of all 2n − 1 partitions formed by V ={V1, . . . , Vn}:

P = { ⋂Vsubset −

⋃(V − Vsubset) | ∅ = Vsubset ⊆ V

}.

For example, for V = {V1, V2, V3}, we obtain seven partitions P = {(V1 ∩ V2 ∩V3), (V1 ∩V2)−V3, (V1 ∩V3)−V2, (V2 ∩V3)−V1, V1 −(V2 ∪V3), V2 −(V1 ∪V3), V3 −(V1 ∪ V2)}.

We need to show that the algorithm finds a rewriting for each query Q formedas a union of some arbitrary subset Psubset of partitions, Q = ⋃

Psubset. Theproof is done by induction. If Q is equivalent to a single partition from P,then the proof follows from Theorem 4. If Q comprises multiple partitions,then two cases are possible: either Phases 1 and 2 eliminate one partition fromQ , or they fail to do so. In the former case, induction applies. In case of fail-ure, Q is either disjoint from every partition in P or properly contained inone of the partitions—a contradiction to our assumption that Q is a union ofpartitions.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 45: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:45

B. FIGURES

Fig. 14. XML syntax for mapping in Figure 6.

Fig. 15. Query view for entities of type Employee in Entity Set Persons.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 46: Compiling mappings to bridge applications and databases

22:46 • S. Melnik et al.

Fig. 16. Query view for entities of type Customer in Entity Set Persons.

Fig. 17. Query view for Association Set Supports.

Fig. 18. Update view for table HR.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 47: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:47

Fig. 19. Update view for table Empl.

Fig. 20. Update view for table Client.

ACKNOWLEDGMENTS

We are grateful to the members of the Microsoft SQL Server product teamfor their support and engagement in developing the ideas presented in thisarticle. Special thanks are due to Jose Blakeley, S. Muralidhar, Jason Wilcox,Sam Druker, Colin Meek, Srikanth Mandadi, Sahil Thaker, Shyam Pather, TimMallalieu, Pablo Castro, Kawarjit Bedi, Mike Pizzo, Nick Kline, Ed Triou, AnilNori, and Dave Campbell. We thank Paul Larson and Christos Papadimitrioufor helpful discussions, and Yannis Katsis and Renee Miller for comments onan earlier version of this article.

REFERENCES

ABITEBOUL, S. AND DUSCHKA, O. M. 1998. Complexity of answering queries using materializedviews. In Proceedings of the ACM SIGACT SIGMOD SIGART Symposium on Principles ofDatabase Systems (PODS). ACM, New York, NY, 254–263.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 48: Compiling mappings to bridge applications and databases

22:48 • S. Melnik et al.

ABITEBOUL, S., HULL, R., AND VIANU, V. 1995. Foundations of Databases. Addison Wesley.ADYA, A., BLAKELEY, J. A., MELNIK, S., MURALIDHAR, S., AND THE ADO.NET TEAM. 2007. Anatomy of

the ADO.NET Entity Framework. In Proceedings of the ACM SIGMOD International Conferenceon Management of Data (SIGMOD). ACM, New York, NY, 877–888.

ATKINSON, M. P. AND BUNEMAN, P. 1987. Types and persistence in database programming lan-guages. ACM Comput. Surv. 19, 2, 105–190.

BANCILHON, F. AND SPYRATOS, N. 1981. Update semantics of relational views. ACM Trans. DatabaseSyst. ACM, New York, NY, 6, 4, 557–575.

BARBOSA, D., FREIRE, J., AND MENDELZON, A. O. 2005. Designing information-preserving mappingschemes for XML. In Proceedings of the International Conference on Very Large Data Bases(VLDB). VLDB Endowment, 109–120.

BARSALOU, T., KELLER, A. M., SIAMBELA, N., AND WIEDERHOLD, G. 1991. Updating rela-tional databases through object-based views. In Proceedings of the ACM SIGMOD In-ternational Conference on Management of Data (SIGMOD). ACM, New York, NY, 248–257.

BATINI, C., CERI, S., AND NAVATHE, S. B. 1992. Conceptual Database Design: an Entity-RelationshipApproach. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA.

BAUER, C. AND KING, G. 2006. Java Persistence with Hibernate. Manning Publications.BERNSTEIN, P. A., GREEN, T. J., MELNIK, S., AND NASH, A. 2006. Implementing mapping composi-

tion. In Proceedings of the International Conference on Very Large Data Bases (VLDB). VLDBEndowment, 55–66.

BERNSTEIN, P. A. AND MELNIK, S. 2007. Model management 2.0: manipulating richer mappings. InProceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD).ACM, New York, NY, 1–12.

BLAKELEY, J. A., LARSON, P.-A., AND TOMPA, F. W. 1986. Efficiently updating materialized views. InProceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD).ACM, New York, NY, 61–71.

BLAKELEY, J. A., MURALIDHAR, S., AND NORI, A. 2006. The ADO.NET Entity Framework: mak-ing the conceptual level real. In International Conference on Conceptual Modeling (ER). Lec-ture Notes in Computer Science, vol. 4312, Springer, Berlin, Heidelberg, New York, 552–565.

BOHANNON, A., PIERCE, B. C., AND VAUGHAN, J. A. 2006. Relational lenses: a language for updat-able views. In Proceedings of the ACM SIGACT SIGMOD SIGART Symposium on Principles ofDatabase Systems (PODS). ACM, New York, NY, 338–347.

BOHANNON, P., FAN, W., FLASTER, M., AND NARAYAN, P. P. S. 2005. Information preserving XMLschema embedding. In Proceedings of the International Conference on Very Large Data Bases(VLDB). 85–96.

BRAGANHOLO, V. P., DAVIDSON, S. B., AND HEUSER, C. A. 2004. From XML view updates to relationalview updates: old solutions to a new problem. In Proceedings of the International Conference onVery Large Data Bases (VLDB). VLDB Endowment, 276–287.

BRYANT, R. E. 1986. Graph-based algorithms for boolean function manipulation. IEEE Trans.Comput. 35, 8, 677–691.

CAREY, M. J., CHAMBERLIN, D. D., NARAYANAN, S., VANCE, B., DOOLE, D., RIELAU, S., SWAGER-MAN, R., AND MATTOS, N. M. 1999. O-o, what have they done to DB2? In Proceedings ofthe International Conference on Very Large Data Bases (VLDB). VLDB Endowment, 542–553.

CAREY, M. J. AND DEWITT, D. J. 1996. Of objects and databases: a decade of turmoil. In Proceedingsof the International Conference on Very Large Data Bases (VLDB). VLDB Endowment, MorganKaufmann, 3–14.

CASTRO, P., MELNIK, S., AND ADYA, A. 2007. ADO.NET Entity Framework: raising thelevel of abstraction in data programming. In Proceedings of the ACM SIGMOD Inter-national Conference on Management of Data (SIGMOD). ACM, New York, NY, 1070–1072.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 49: Compiling mappings to bridge applications and databases

Compiling Mappings to Bridge Applications and Databases • 22:49

CHEN, P. P. 1976. The entity-relationship model—toward a unified view of data. ACM Trans.Database Syst. 1, 1, 9–36.

COOK, W. R. AND IBRAHIM, A. H. 2006. Integrating programming languages and databases: whatis the problem? ODBMS.ORG, Expert Article.

DAYAL, U. AND BERNSTEIN, P. A. 1978. On the updatability of relational views. In Proceedingsof the International Conference on Very Large Data Bases (VLDB). VLDB Endowment, 368–377.

DI PAOLA, R. A. 1969. The recursive unsolvability of the decision problem for the class of definiteformulas. J. ACM 16, 2, 324–327.

EJB 3.0 EXPERT GROUP. 2006. JSR 220: Enterprise JavaBeans, Version 3.0, Final Release.FAGIN, R., KOLAITIS, P. G., POPA, L., AND TAN, W. C. 2005. Composing schema mappings: second-

order dependencies to the rescue. ACM Trans. Database Syst. 30, 4, 994–1055.FOSTER, J. N., GREENWALD, M. B., MOORE, J. T., PIERCE, B. C., AND SCHMITT, A. 2007. Combinators

for bidirectional tree transformations: a linguistic approach to the view-update problem. ACMTrans. Prog. Lang. Syst. 29, 3 (May), 17.

FUXMAN, A., HERNANDEZ, M. A., HO, H., MILLER, R. J., PAPOTTI, P., AND POPA, L. 2006. Nested map-pings: schema mapping reloaded. In Proceedings of the International Conference on Very LargeData Bases (VLDB). VLDB Endowment, 67–78.

GOTTLOB, G., PAOLINI, P., AND ZICARI, R. 1988. Properties and update semantics of consistent views.ACM Trans. on Database Syst. 13, 4, 486–524.

GOU, G., KORMILITSIN, M., AND CHIRKOVA, R. 2006. Query evaluation using overlapping views:completeness and efficiency. In Proceedings of the ACM SIGMOD International Conference onManagement of Data (SIGMOD). ACM Press, New York, NY, 37–48.

GRIMES, S. 1998. Object/relational reality check. Database Program. Des. 11, 7.GUPTA, A. AND MUMICK, I. S. 1995. Maintenance of materialized views: problems, techniques, and

applications. IEEE Data Eng. Bull. 18, 2, 3–18.HALEVY, A. Y. 2001. Answering queries using views: A Survey. VLDB J. 10, 4, 270–294.JDO EXPERT GROUP. 2006. JSR 243: Java Data Objects, Version 2.0, Final Release.KEENE, C. 2004. Data services for next-generation SOAs. SOA WebServices J. 4, 12.KELLER, A. M., JENSEN, R., AND AGRAWAL, S. 1993. Persistence software: bridging object-

oriented programming and relational databases. In Proceedings of the ACM SIGMOD In-ternational Conference on Management of Data (SIGMOD). ACM, New York, NY, 523–528.

KOTIDIS, Y., SRIVASTAVA, D., AND VELEGRAKIS, Y. 2006. Updates through views: a new hope. In Pro-ceedings of the International Conference on Data Engineering (ICDE). 2.

KRISHNAMURTHY, V., BANERJEE, S., AND NORI, A. 1999. Bringing object-relational technology tomainstream. In Proceedings of the ACM SIGMOD International Conference on Management ofData (SIGMOD). ACM, New York, NY, 513–514.

LARSON, P.-A. AND ZHOU, J. 2007. View matching for outer-join views. VLDB J. 16, 1, 29–53.MEHRA, K. K., RAJAMANI, S. K., SISTLA, A. P., AND JHA, S. K. 2007. Verification of object relational

maps. In Proceedings of the 5th IEEE Software Engineering and Formal Methods (SEFM). IEEEComputer Society, Washington, DC, 283–292.

MEIJER, E., BECKMAN, B., AND BIERMAN, G. M. 2006. LINQ: reconciling object, relations and XMLin the .NET framework. In Proceedings of the ACM SIGMOD International Conference on Man-agement of Data (SIGMOD). ACM, New York, NY, 706.

MELNIK, S., ADYA, A., AND BERNSTEIN, P. A. 2007. Compiling mappings to bridge applications anddatabases. In Proceedings of the ACM SIGMOD International Conference on Management of Data(SIGMOD). ACM, New York, NY, 461–472.

MILLER, R. J., HAAS, L. M., AND HERNANDEZ, M. A. 2000. Schema mapping as query discovery. InProceedings of the International Conference on Very Large Data Bases (VLDB). VLDB Endow-ment, 77–88.

NASH, A., BERNSTEIN, P. A., AND MELNIK, S. 2007. Composition of mappings given by embeddeddependencies. ACM Trans. Database Syst. 32, 1, 4.

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.

Page 50: Compiling mappings to bridge applications and databases

22:50 • S. Melnik et al.

SEGOUFIN, L. AND VIANU, V. 2005. Views and queries: determinacy and rewriting. In Proceedingsof the ACM SIGACT SIGMOD SIGART Symposium on Principles of Database Systems (PODS).ACM, New York, NY, 49–60.

ZHANG, W. AND RITTER, N. 2001. The real benefits of object-relational db-technology for object-oriented software development. In Proceedings of the British National Conference on Databases(BNCOD). Lecture Notes in Computer Science, Vol. 2097, Springer, Berlin, Heidelberg, New York,89–104.

Received October 2007; revised March 2008; accepted June 2008

ACM Transactions on Database Systems, Vol. 33, No. 4, Article 22, Publication date: November 2008.