19
Data integration through database federation by L. M. Haas E. T. Lin M. A. Roth In a large modern enterprise, it is almost inevitable that different parts of the organization will use different systems to produce, store, and search their critical data. Yet, it is only by combining the information from these various systems that the enterprise can realize the full value of the data they contain. Database federation is one approach to data integration in which middleware, consisting of a relational database management system, provides uniform access to a number of heterogeneous data sources. In this paper, we describe the basics of database federation, introduce several styles of database federation, and outline the conditions under which each style of federation should be used. We discuss the benefits of an information integration solution based on database technology, and we demonstrate the utility of the database federation approach through a number of usage scenarios involving IBM’s DB2 product. In a large modern enterprise, it is inevitable that dif- ferent parts of the organization will use different sys- tems to produce, store, and search their critical data. This diversity of data sources is caused by many fac- tors including lack of coordination among company units, different rates of adopting new technology, mergers and acquisitions, and geographic separation of collaborating groups. Yet, it is only by combining the information from these various systems that the enterprise can realize the full value of the data they contain. In the finance industry, mergers are an almost com- monplace occurrence. The company resulting from a merger inherits the data stores of the original in- stitutions. Many of those stores will be relational da- tabase management systems, but often from differ- ent manufacturers. One company may have used primarily Sybase, Inc. products, whereas another may have used products from Informix Software, Inc. They may both have had one or more document management systems, such as Documentum 1 or IBM Content Manager, 2 for storing text documents. Each may have had applications that compute important information (e.g., the risk of granting a loan to a given customer) or that mine for information about cus- tomers’ buying patterns. After the merger, the new company needs to be able to access the customer information from both sets of data stores, to analyze its new portfolio using ex- isting and new applications, and, in general, to use the combined resources of both institutions through a common interface. The company needs to be able to identify common customers and consolidate their accounts, even though the customer data may be stored in different databases and in different formats. In addition, the company must be able to combine the legacy data with new data available on the Web or from its business partners. These are all aspects of data integration, 3 and all pose hefty challenges. Copyright 2002 by International Business Machines Corpora- tion. Copying in printed form for private use is permitted with- out payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copy- right notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor. HAAS, LIN, AND ROTH 0018-8670/02/$5.00 © 2002 IBM IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 578

Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

Data integrationthrough databasefederation

by L. M. HaasE. T. LinM. A. Roth

In a large modern enterprise, it is almostinevitable that different parts of theorganization will use different systems toproduce, store, and search their critical data.Yet, it is only by combining the informationfrom these various systems that theenterprise can realize the full value of the datathey contain. Database federation is oneapproach to data integration in whichmiddleware, consisting of a relationaldatabase management system, providesuniform access to a number of heterogeneousdata sources. In this paper, we describe thebasics of database federation, introduceseveral styles of database federation, andoutline the conditions under which each styleof federation should be used. We discuss thebenefits of an information integration solutionbased on database technology, and wedemonstrate the utility of the databasefederation approach through a number ofusage scenarios involving IBM’s DB2 product.

In a large modern enterprise, it is inevitable that dif-ferent parts of the organization will use different sys-tems to produce, store, and search their critical data.This diversity of data sources is caused by many fac-tors including lack of coordination among companyunits, different rates of adopting new technology,mergers and acquisitions, and geographic separationof collaborating groups. Yet, it is only by combiningthe information from these various systems that theenterprise can realize the full value of the data theycontain.

In the finance industry, mergers are an almost com-monplace occurrence. The company resulting froma merger inherits the data stores of the original in-stitutions. Many of those stores will be relational da-tabase management systems, but often from differ-ent manufacturers. One company may have usedprimarily Sybase, Inc. products, whereas another mayhave used products from Informix Software, Inc.They may both have had one or more documentmanagement systems, such as Documentum1 or IBMContent Manager,2 for storing text documents. Eachmay have had applications that compute importantinformation (e.g., the risk of granting a loan to a givencustomer) or that mine for information about cus-tomers’ buying patterns.

After the merger, the new company needs to be ableto access the customer information from both setsof data stores, to analyze its new portfolio using ex-isting and new applications, and, in general, to usethe combined resources of both institutions througha common interface. The company needs to be ableto identify common customers and consolidate theiraccounts, even though the customer data may bestored in different databases and in different formats.In addition, the company must be able to combinethe legacy data with new data available on the Webor from its business partners. These are all aspectsof data integration,3 and all pose hefty challenges.

�Copyright 2002 by International Business Machines Corpora-tion. Copying in printed form for private use is permitted with-out payment of royalty provided that (1) each reproduction is donewithout alteration and (2) the Journal reference and IBM copy-right notice are included on the first page. The title and abstract,but no other portions, of this paper may be copied or distributedroyalty free without further permission by computer-based andother information-service systems. Permission to republish anyother portion of this paper must be obtained from the Editor.

HAAS, LIN, AND ROTH 0018-8670/02/$5.00 © 2002 IBM IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002578

Page 2: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

Sometimes the need for data integration may be assimple as getting a reading from a sensor (say, thetemperature at some stage in a manufacturing pro-cess), and comparing it to a baseline value derivedfrom historical data stored in a database. Or per-haps data from a spreadsheet are to be comparedwith data from several databases containing exper-imental results, or perhaps one has to find custom-ers with large credit card balances whose income isless than some threshold. The need for data inte-gration is ubiquitous in today’s business world.

There are many mechanisms for integrating data.These include application-specific solutions, applica-tion-integration frameworks, workflow (or businessprocess integration) frameworks, digital libraries withportal-style or meta-search-engine integration, datawarehousing, and database federation. We discusseach briefly.

Perhaps the most common means of data integra-tion is via special-purpose applications that accesssources of interest directly and combine the data re-trieved from those sources with the application it-self. This approach always works, but it is expensive(in terms of both time and skills), fragile (changesto the underlying sources may all too easily breakthe application), and hard to extend (a new datasource requires new code to be written).

Application-integration frameworks4,5 and compo-nent-based frameworks6 provide a more promisingapproach. Such systems typically employ a standarddata or programming model, such as CORBA**(Common Object Request Broker Architecture),J2EE** (Java 2 Platform, Enterprise Edition), andso on. They provide well-defined interfaces to theapplication for accessing data and other applicationsand for adding new data sources, typically by writ-ing an adaptor for the data source that meets theframework’s adaptor interface. These frameworksprotect the application somewhat from changes inthe data sources (if the source changes, the adaptormay have to change, but the application may neversee it). Adding a new source may be easier (althougha new adaptor may need to be written, the changeis more isolated, and the adaptor may already existand be available for purchase). The application pro-grammer is not required to have detailed systemsknowledge, so applications will typically be easier towrite. However, such systems do not necessarily ad-dress data integration issues; if combination, anal-ysis, or comparison of the data received from the var-

ious sources is needed, the application developermust provide that code.

Similarly, workflow systems7 provide application de-velopers with a higher-level abstraction against whichto program, but using a more process-orientedmodel. Again, these systems provide some protec-tion to the application against changes in the envi-ronment, and they additionally provide support forroutine results from one source to other sources.However, they still provide only limited help for com-paring and manipulating data.

Digital libraries, which collect results from multipledifferent data sources in response to a user’s request,represent another style of data integration. For ex-ample, Stanford’s InfoBus architecture8 provides abibliographic search service that looks through sev-eral reference repositories and returns a single listof results. Usually the sources are similar in func-tion (e.g., all store text documents, bibliographies,URLs [uniform resource locators], images, etc.). Formost of these “meta search engines,” no combina-tion of results is done, except perhaps for normal-ization of relevance scores or formats—all are re-turned as part of one list, possibly sorted by estimatedrelevance. Similarly, portals provide a means of col-lecting information—perhaps from quite differentdata sources—and putting the results together forthe user to see in conjunction. Results from the dif-ferent searches are typically displayed in differentareas of the interface, although portal frameworksmay allow application code to be written to combinethe results in some other way. Portals are often builton top of some sort of application-integration frame-work, providing a simple and amazingly powerful wayto quickly get access to a variety of information. IBM’sEnterprise Information Portal, for example, can re-trieve various types of data from a range of repos-itories, including text search engines, document man-agement systems, image stores, and relationaldatabases.9 Yet again, however, the application de-veloper must write code to do any more sophisticatedanalysis, comparison, or integration that is required.

Data warehouses and database federation, by con-trast, provide users with a powerful, high-level querylanguage that can be used to combine, contrast, an-alyze, and otherwise manipulate their data. Tech-nology for optimizing queries ensures that they areanswered efficiently, even though the queries areposed nonprocedurally, greatly easing applicationdevelopment. A data warehouse is built by loadingdata from one or more data sources into a newly de-

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 579

Page 3: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

fined schema in a relational database. The data areoften cleansed and transformed in the load process.Changes in the underlying sources may causechanges to the load process, but the part of the ap-plication that deals with data analysis is protected.New data sources may introduce changes to theschema, requiring that a new load process for thenew data be defined. SQL (Structured Query Lan-guage) views can further protect the application fromsuch evolutions. However, any functions of the datasource that are not a standard part of a relationaldatabase management system must be re-imple-mented in the warehouse or as part of the applica-tion.

A solution based on warehousing alone may not bepossible or cost effective for various reasons. For ex-ample, it is not always feasible to move data fromtheir original location to a data warehouse, and asdescribed above, warehousing comes with its ownmaintenance and implementation costs. Databasefederation (or mediation as it is sometimes referredto in the literature) provides users with a virtual datawarehouse, without necessarily moving any of thedata. Thus, in addition to the benefits of a warehouse,it can provide access to “live” data and functions. Asingle arbitrarily complex query can efficiently com-bine data from multiple sources of different types,even if those sources themselves do not possess allthe functionality needed to answer such a query. Inother words, a federated database system can op-timize queries and compensate for SQL function thatmay be lacking in a data source. Additionally, que-ries can exploit the specialized functions of a datasource, so no functionality is lost in accessing thesource through the federated system.

Of course, database federation adds another com-ponent between the client application and the data,and this extra layer introduces performance trade-offs. Indeed, one should not expect that introducinga new layer would reduce access time for any singlefederated data source. The key performance advan-tage offered by database federation is the ability toefficiently combine data from multiple sources in asingle SQL statement. Two components of the fed-erated server contribute to this: query rewrite andcost-based optimization.10–12 The query rewrite com-ponent of a federated server can aggressively rewritea user’s query into a semantically equivalent formthat can be more efficiently executed. A cost-basedoptimizer can search a large space of access strat-egies and choose a global execution plan that is op-

timal across both local tables and the federated datasources involved in the query.

Note that in some cases, a query issued to a feder-ated server with a sophisticated query processing en-gine may actually outperform the same query issueddirectly to the data source itself. During query re-write phase, a user’s query can be rewritten in sucha way that the federated data source can choose amore efficient execution plan than it would if it weregiven the user’s original query.

The goal of this paper is to demonstrate that data-base federation is a fundamental tool for data in-tegration. The rest of the paper is organized as fol-lows. In the next section, we discuss databasefederation in greater detail and describe the archi-tecture for DB2* (Database 2*) database federation.There are several styles of database federation, andthese are introduced in the following section, inwhich we also discuss when each style of federationshould be used. In the next section, we discuss theadvantages of building an integration solution on da-tabase technology, and in the section that follows wedemonstrate the utility of this approach through anumber of usage scenarios. We conclude with a sum-mary and some thoughts on future work.

The basics of database federation

The term “database federation” refers to an archi-tecture in which middleware, consisting of a rela-tional database management system, provides uni-form access to a number of heterogeneous datasources. The data sources are federated, that is, theyare linked together into a unified system by the da-tabase management system. The system shields itsusers from the need to know what the sources are,what hardware and software they run on, how theyare accessed (via what programming interface or lan-guage), and even how the data stored in these sourcesare modeled and managed. Instead, a database fed-eration looks to the application developer like a sin-gle database management system. The user cansearch for information and manipulate data usingthe full power of the SQL language. A single querymay access data from multiple sources, joining andrestricting, aggregating and analyzing the data at will.Yet the sources may not be database systems at all,but in fact could be anything from sensors to flat filesto application programs to XML (Extensible MarkupLanguage), and so on.

HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002580

Page 4: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

Many research projects and a few commercial sys-tems have implemented and evolved the concept ofdatabase federation. Pioneering research projects in-cluded TSIMMIS13 and HERMES,14 which used data-base concepts to implement “mediators,” special-purpose query engines that use nonproceduralspecifications to integrate specific data sources.DISCO15 and Pegasus16 were closer in feel to true da-tabase federation. DISCO focused on a design scal-ing up to many heterogeneous data sources. Pega-sus had its own data model and query language, andit had a functional object model that allowed greatflexibility for modeling data sources. Each createdits own database management system from scratch.Garlic17 was the first research project to exploit thefull power of a standard relational database (DB218),although it extended the language and data modelto support some object-oriented features. The wrap-per architecture19 and cross-source query optimiza-tion20 of Garlic are now fundamental componentsof IBM’s federated database offerings.

Meanwhile, in the commercial world, the first stepstoward federation involved “gateways.” A gatewayallows a database management system to route aquery to another database system. Initially, gatewayproducts only allowed access to homogeneoussources, often from the same manufacturer, and que-ries could reference data in only one data source.Over time, more elaborate gateway products ap-peared, such as iWay Software’s EDA/SQL Gate-ways.21 These products allowed access to heteroge-neous relational systems and to nonrelational systemsthat provided an Open Database Connectivity(ODBC) driver.22 DB2 DataJoiner*23 was the first com-mercial system to really use database federation. Itprovided a full database engine with the ability tocombine data from multiple heterogeneous rela-tional sources in a single query, and to optimize thatquery for good performance. Microsoft’s Access24

allowed access to a larger set of data sources, butwas geared toward smaller applications, as opposedto the mission-critical large enterprise applicationsthat DataJoiner supported.

DataJoiner and Garlic “grew up” together, withDataJoiner focused on robust and efficient queriesover a limited range of mostly relational sources,whereas Garlic focused on extensibility to a muchmore heterogeneous set of sources. Today, the bestideas from both projects have been incorporated intoDB2,25 which also enabled the use of user-definedfunctions to “federate” simple data sources.26 DB2thus supports a very rich notion of database feder-

ation that can be exploited for data integration. Dis-coveryLink*,27 for example, is an IBM solution thatapplies DB2 technology to integrate data in the lifesciences.

IBM’s database federation offerings have pursued sev-eral goals. The most basic goal is transparency, whichrequires masking from the user the differences, id-iosyncrasies, and implementations of the underlyingdata sources. This allows applications to be writtenas if all the data were in a single database, although,in fact, the data may be stored in a heterogeneouscollection of data sources. A second key goal is tosupport heterogeneity, or the ability to accommodatea broad range of data sources, without restriction ofhardware, software, data model, interface, or pro-tocols. A high degree of function provides users withthe best of both worlds: not only rich, standard-com-pliant DB2 SQL capabilities against all the data in thefederation, but also the ability to exploit functionsof the data sources that the federated engine maylack.

The fourth goal is extensibility, the ability to add newdata sources dynamically in order to meet the chang-ing needs of the business. Openness is an importantcorollary; DB2 products support the appropriate stan-dards28,29 in order to ensure the federation can beextended with standard-compliant components. Fur-ther, joining a federation should not compromise theautonomy of individual data sources. Existing appli-cations run unchanged; data are neither moved normodified; interfaces remain the same. Last but notleast, DB2 database federation technology optimizesqueries for performance. Relational queries are non-procedural. When a relational query spans multipledata sources, making the wrong decisions about howto execute it can be costly in terms of resources.Hence, the DB2 federated engine extends the capa-bilities of a traditional optimizer12 to handle que-ries involving federated data sources.

DB2 architecture for database federation. In the re-mainder of the paper, we focus on the database fed-eration capabilities of DB2. Figure 1 depicts the over-all architecture of a DB2 database federation. Usersaccess the federation via any DB2 interface (CLI [CallLevel Interface], ODBC, JDBC** [Java** DatabaseConnectivity], etc.). The key component in this ar-chitecture is the DB2 federated engine itself. This en-gine consists of a Starburst-style query processor30

that compiles and optimizes SQL queries, a run-timeinterpreter for driving the query execution, a datamanager that controls a local store, and several ex-

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 581

Page 5: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

tension mechanisms that allow access to externaldata, including user-defined functions and wrappers.

A user-defined function (UDF) is a routine writtenby an application developer and registered with thedatabase. User-defined functions may take input pa-rameters and return either a scalar result, or a tableof data. They can be used to combine functions al-ready supported by the database, or to provide ac-cess to a specific function provided by an externalsource.

A wrapper mediates between the DB2 engine and oneor more data sources, mapping the data model ofthe sources to the DB2 data model and transformingDB2 data operations into requests that the sourcescan handle. DB2 provides wrappers for many pop-ular data sources, such as Informix, Oracle, andMS SQL server,25 as well as wrappers for specializeddata sources such as Documentum and BLAST.31 Inaddition, there is a toolkit for customers and third-party vendors to build wrappers for their ownsources.

Clearly, there are different styles of federation avail-able with this architecture. In the next section, weexplore each of these styles in greater detail and giveguidelines for when each style is appropriate.

DB2 styles of federation

As shown in Figure 2, the federated architecture ofDB2 offers a range of alternatives for federation, fromthe simplicity of a scalar user-defined function to theflexibility of the DB2 wrapper architecture. The fig-ure illustrates that a natural trade-off exists betweenthe level of federation and the effort involved toachieve federation. In this section we describe pointsalong this range and provide a decision tree that helpsto illustrate which type of federation is most appro-priate for a given integration scenario.

Scalar UDFs: Federating function. The simplestform of a UDF is a scalar function, which can takedata from the surrounding SQL statement as inputand produce a single scalar result. Scalar UDFs pro-vide a way to federate function with data in DB2, com-bining data from one source with a function providedby another in a single statement. UDFs can providetwo major benefits to client applications: (1) a sim-pler programming model and (2) greater perfor-mance, achieved by reducing network traffic and pathlength between the calling application, data, andfunction.

For example, the user-defined function db2mq.mqsend( )ships with DB2 and allows a user to send a messageto an MQSeries32 queue. By invoking this function

Figure 1 DB2 architecture for database federation

QUERY COMPILER

RUN TIME

DATA MANAGERUDFs WRAPPERS

LOCAL TABLES

EXTERNAL DATAAND FUNCTIONS

DB2

EXTERNAL DATA SOURCES

HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002582

Page 6: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

within a SELECT statement, data can be moved froma database table to an MQ queue without the clientapplication accessing the queue directly:

SELECT db2mq.mqsend(a.headline)FROM Articles aWHERE a.article_timestamp �� CURRENT TIMESTAMP

Scalar functions can also be used as a simple way tobring data into the engine. For example, the func-tion db2mq.mqreceive( ) can be used to pull the nextmessage from a queue. Because it is implementedas a UDF, this function can be used in a complex queryto combine the message with other data from thefederation.

Table UDFs: Federating data. A table function is amore sophisticated UDF that produces a table as out-put and can appear wherever a table can be refer-enced in an SQL statement. A row function is a spe-cial case of a table function that returns 1 row. Atable function can retrieve a set of data and refor-mat the data into rows and columns, providing a sim-ple way to federate external data. Because a tablefunction can appear anywhere a table can, the fullpower of SQL can be applied to the resulting data.Furthermore, views can be created on top of the UDFin order to make the external data appear even morelike a local table to client applications.

For example, a table function called addressbook( ) canbe used to access an address book stored in a LotusNotes* database that contains sales contact infor-mation. This function can be invoked within a querythat joins the result of the function call with a localtable that contains company profile information toreturn sales contacts at financial companies whosegross revenues are greater than $50 million:

SELECT a.first, a.last, a.phone, a.emailFROM TABLE (addressbook( )) AS a, Company_Profiles cWHERE c.industry � ‘FINANCIAL’ AND c.revenue �

50,000,000 AND c.name � a.company_name

Although table functions can be used just like ta-bles, they look more like functions. Table functionscan take arguments that can be used to restrict thedata to be returned. For example, a function to getinformation about files on the local disk might takeone argument that determines the directory tosearch, and a second that determines what file typesare of interest. But the table function can only filterusing those predicates that it was designed to han-dle. For example, in the query below, the dir( ) func-

tion could not be passed the predicate on the col-umn last_modified_date:

SELECT f.filename, f.author, f.last_modified_dateFROM TABLE (dir(‘�laura�papers’, ‘.pdf’)) AS fWHERE f.last_modified_date � ‘07/04/2002’

Fortunately, the DB2 engine can apply any otherpredicates to the result returned by the table func-tion. In this example, the engine would filter the re-sults of the table function and return only those mod-ified more recently than July 2002.

Wrappers: Federating function and data. The wrap-per architecture provides the most powerful and flex-ible infrastructure for federation.33 It provides themeans to integrate both function and data by rais-ing the level of federation from a single function orset of data to that of an entire external data source.Client applications can transparently access and usethe full power of the query language on data man-aged by these sources as though DB2 itself managesthe data.

For example, consider a set of scientists at a univer-sity working in a drug discovery laboratory. The sci-entists store chemical compound data and experi-mental results in an Oracle database. The universityalso has access to Lotus Extended Search34 (LES),a Web search engine that can perform searchesacross multiple Web search sites. The scientists usethis search engine to retrieve research articles fromother scientists. Both the Oracle database and the

Figure 2 Different styles of federation

WRAPPER

TABLE UDF

SCALAR UDF

LEV

EL

OF

FE

DE

RAT

ION

IMPLEMENTATION EFFORT

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 583

Page 7: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

search engine can be transparently accessed from DB2via wrappers, allowing the scientists to ask a singlequery that combines data and functions from bothsources.

The Oracle wrapper maps the compound data andexperimental results tables stored in Oracle to nick-names, which can appear wherever a table can ap-pear in an SQL query (for example, the FROM clause).The LES wrapper provides a nickname for a list ofarticles that can be searched. Each article has a ti-tle, subject, and URL. The search functionality itselfis mapped as an SQL function that takes two argu-ments, the section of the article to search (title, sub-ject, or both), and a set of key words, and returnsa score that measures the relevance of the article.For example, a common task the scientists might un-dertake would be to find other research reports onchemical compounds that achieved a certain resultin their experiments:

SELECT c.name, a.URLFROM Compounds c, Experiments e, Articles aWHERE e.result � 1.1e-9 and e.id � c.id andsearch(a.subject, c.name) � 0

Because Oracle is itself a relational database, the DB2optimizer can choose a plan in which both the pred-icate on the Experiments table (e.result � 1.1e-9) andthe join between the Compounds and Experiments ta-bles (e.id � c.id) are pushed down to the Oracle da-tabase to execute, and this results in only those com-pounds with the right test results being returned toDB2. DB2 then routes the names of the matching com-pounds to LES, which will retrieve relevant articlesand return their URLs back to DB2.

DB2 supplies wrappers for a variety of relational andnonrelational sources,18 and also provides a toolkitfor third-party vendors and customers to developwrappers for their own data sources.17,19 The wrap-per toolkit provides an external interface to the wrap-per architecture, allowing a developer to federate anew type of data source. The wrapper architectureenables the following federated features:

● Multiserver integration. Unlike a UDF, a wrapper caneasily provide access to multiple external sourcesof a particular type. The DB2 engine acts as a traf-fic cop, maintaining the state information for eachserver, managing connections, decomposing que-ries into fragments that each server can handle,and managing transactions across servers.

● Multidata-set integration and multioperation integra-tion. Table UDFs are well suited to integrate a sin-gle external operation and set of data into the fed-eration. The wrapper architecture expands on thiscapability by providing the infrastructure to modelmultiple data sets and multiple operations. Datasets are modeled as nicknames, and nicknames canappear wherever a table can appear in an SQL state-ment. Operations include query, insert, update,and delete. The wrapper participates in the plan-ning and execution of these operations, and trans-lates them into the corresponding operations sup-ported by the server.

The multidata-set and multioperation features in-troduce the notion of common query framework forexternal sources. The common query frameworkis the way in which the wrapper architecture re-alizes the goal of high function. That is, the frame-work enables (1) the full power of operations ex-pressible in SQL over the external source’s data,and (2) specialized functions supported by the ex-ternal source to be invoked in a query even thoughDB2 does not natively support the function.

Part 1 of the goal is addressed through functioncompensation. If the external source does not sup-port the full power of DB2’s SQL, client applicationswill not suffer any loss of query power over thosedata, because DB2 will automatically compensatefor any differences between DB2’s capabilities andthose of the external source. So, for example, if adata source does not support ORDER BY, DB2 willretrieve the data from the source and perform thesort.

Part 2 of the goal is addressed through functionmappings. Function mappings can be used to de-claratively expose functions supported by the ex-ternal source through the DB2 query language. Forexample, a chemical structure store may have theability to find structurally similar chemical com-pounds. Although that function is not present inDB2, it can still be exploited in DB2 queries, as longas there is a mapping that tells DB2 which sourceimplements the function.

If the external source implements new functions,the wrapper itself may not require modification,depending on the data source and what the wrap-per needs to do to map the function. For sourcesthat invoke new functions in a predictable way, thedatabase administrator (DBA) need only register

HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002584

Page 8: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

the new function mappings. They can then be im-mediately used in DB2 queries.

● Optimization. Operations to be performed outsideof the DB2 engine may have significant cost impli-cations for the queries that contain them, and itis critical for DB2 to be aware of these implicationswhen choosing a strategy for executing the query.The wrapper architecture includes a flexible frame-work for wrapper writers to provide input to theDB2 query optimizer about the cost of their oper-ations and the size of the data that will be returned.This framework includes a dialog between the op-timizer and wrapper at query compilation time, al-lowing the wrapper writer to provide informationbased on the immediate context of the query.

This information is particularly crucial when theoptimizer must consider alternatives for cross-source operations, as these types of operations ex-plode the set of possible execution strategies. Ref-erence 11 provides an example of an image serverthat can support two kinds of image search, oneof which is substantially more expensive than theother. Because the wrapper can provide this in-formation to the DB2 optimizer, the optimizer isable to choose a plan that applies the more expen-sive operation after another predicate has filteredout much of the data.

● Transactional integration. DB2 acts as a transactionmanager for operations performed on externalsources, and the wrapper architecture provides theinfrastructure for wrappers to participate as re-source managers in a DB2 transaction. DB2 main-tains a list of wrappers that have participated inthe transaction, and it ensures that they are no-tified at transaction commit or rollback time, giv-ing a wrapper an opportunity to invoke the appro-priate routine on the external source.

Determining the style of database federation to use.To determine which level of federation is appropri-ate for a particular integration problem, it is impor-tant to characterize the purpose and desired prop-erties of the integration effort. UDFs are easy toimplement, but have limited support for data mod-eling and no support for transactional integration.Wrappers are powerful, but rely on more advancedcapabilities of the external source and require a moreadvanced skill set to implement.

Every application is different, and each can be solvedvia a “small matter of programming” using the dif-

ferent federation alternatives. However, it is usuallytrue that one alternative is more suitable than theothers for a given problem. Figure 3 presents a de-cision tree that helps characterize the style of fed-eration most appropriate for a particular integrationproblem.

The decision tree is based on a series of YES/NO ques-tions. Consistently answering NO to the questions in-dicates that the integration problem is sufficientlycontained that the best solution is likely to be a UDF.A YES answer indicates that the integration prob-lem exhibits a characteristic that is best handled bythe wrapper architecture. The decision tree pointsto the feature of the wrapper architecture that ad-dresses that characteristic. Next we consider thenodes in the decision tree in more detail.

1. Does the integration problem reach out to multipledata sources of the same type? If so, is there a ben-efit to modeling such servers?

If the server concept is key to the integration prob-lem, then the wrapper architecture is likely to be an

TRANSACTIONALCONSISTENCY?

MULTIPLE, DISTINCTDATA SETS?

MULTISERVERINTEGRATION

TRANSACTIONINTEGRATION

MULTITABLEINTEGRATION

QUERYPLANNING

QUERYOPTIMIZATION

MULTIPLE, DISTINCTOPERATIONS?

SOPHISTICATED COST MODEL?

SCALAR VALUE OR1/MORE ROWS?

SCALAR UDF

ROW/TABLE UDF

NO

NO

NO

NO

NO

YES

YES

YES

YES

YES

SCALAR ROWS

ACCESS TOMULTIPLE SERVERS

Figure 3 Determining the style of federation to use

WRAPPER

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 585

Page 9: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

appropriate solution. For example, a function thatretrieves the temperature from an on-line thermom-eter does not require the notion of a server, and thewrapper solution may be overkill. On the other hand,suppose the goal is to integrate a set of Lotus Notesdatabases. It is possible to write a UDF to access mul-tiple databases; however, the burden is on the UDFdeveloper to take the appropriate information toidentify the databases as arguments, manage con-nections to the databases, and use the scratchpad tostore any state information. Furthermore, the life-time of the scratchpad is within a single statement,so UDF invocations from separate statements will re-quire their own connections. For this integrationproblem, the multiserver support offered by thewrapper architecture, including connection manage-ment, will help to nicely manage access to those da-tabases.

2. Is transactional consistency important for the in-tegration problem?

The UDF architecture does not support transactionalconsistency, whereas the wrapper architecture offersboth one-phase and (in a future release) two-phasecommit support. If transactional consistency is cru-cial for the integration problem, and the source thathas the data or functions needed is able to partic-ipate as a resource in a transaction, the wrapper ar-chitecture is the only choice that can provide thatlevel of support.

3. Are there multiple, distinct data sets to be accessed?Are there multiple, distinct operations to be feder-ated?

An operation is a specific external action that canbe performed on the data to be integrated. Exam-ples of operations include data retrieval, search, in-sert, update, and delete. Table UDFs are well suitedto integrate a single external operation and set ofdata into DB2. When there are multiple data setsand/or multiple operations involved, the wrapper ar-chitecture may more easily provide the infrastruc-ture for modeling those data sets and operations.This is particularly true when there are restrictionson how multiple operations can be combined, or costimplications that depend on how they are combined.When using the wrapper architecture, data sets be-come separate nicknames, and flexible query plan-ning and query execution components allow thewrapper writer to control the set of operations thewrapper will support, and how those operations willbe executed. In addition, the wrapper architecture

includes APIs (application programming interfaces)for the wrapper writer to provide optimization in-formation on an operation-by-operation basis. Thewrapper writer is allowed to examine the operationto be performed to determine which portions of theoperation the data source can execute. The wrap-per reports this information and provides cost andcardinality estimates based on the operations, pa-rameters, persistent cost information stored in thesystem catalogs, and state information stored in thewrapper. With this cost information the DB2 opti-mizer can determine an optimal execution plan.

4. Is there a sophisticated cost model associated withthe federated operations?

The UDF architecture provides some support for pro-viding static cost information. If the operation is typ-ically executed as part of a simple, single-table query,a table function may be the appropriate vehicle forfederation. However, if the operation is likely to beinvoked as part of a more complex query, and its po-sition in the query plan may greatly impact the per-formance of the query, it may be worthwhile to usethe wrapper architecture to federate this operation,just to exploit the flexible costing infrastructure.

If the answer to all of the above questions is no, thena UDF is likely to be an appropriate choice for fed-eration. In this case, the final question to ask iswhether the integrated operation should return a sca-lar value, or a multirow set. If the operation returnsa scalar value, the right choice is a scalar UDF, andif the operation can return a result set, a table UDFis the appropriate choice.

If the answer to any of the questions is yes, the wrap-per architecture provides at least one valuable fea-ture that addresses some aspect of the integrationproblem, and it is worthwhile to consider writing awrapper to exploit that feature. Note that becausethe architecture is designed to be flexible, it is pos-sible to implement important pieces of a wrapperwithout writing a complete wrapper. For example,if the function being integrated is really a simple sca-lar UDF but transactional integration is key, it is pos-sible to implement that function within a simplewrapper that exploits the transactional architecture,and only minimally implements the data modelingand query optimization aspects. On the other hand,if the operation on the data source is a highly so-phisticated read-only function with significant costimplications, it is possible to write a wrapper that

HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002586

Page 10: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

focuses on the optimization component and does notaddress transaction issues at all.

Why use a database engine for dataintegration?

Database federation centers on a relational databaseengine. That engine is an essential ingredient for in-tegrating data, because it significantly raises the levelof abstraction for the application developer. Ratherthan deal with the details of how to access individualsources, and in what order, the application devel-oper gets the benefits of a powerful nonprocedurallanguage, support for data integrity, a local store touse for persisting data and meta-data as needed, andthe use (or reuse) of the many tools and applicationsalready written for “normal” relational systems. Inthis section, we look at each of these benefits in moredetail.

The virtues of SQL and the relational data model haveoften been extolled since their invention in the late1960s and early 1970s.35–37 Foremost among theseare the simplicity of the model and the nonproc-edural nature of the language, which together allowthe application developer to specify what data areneeded, and not how to find the data. The systemautomatically finds a path to the individual relations,and decides how to combine them in order to returnthe desired result. With the invention of query op-timization in the late 1970s,12 systems were able tochoose wisely among alternative execution strategies,greatly enhancing the performance and expandingthe range of queries the systems could effectively han-dle. Over the years, the language has been contin-uously enriched with new constructs. It is now pos-sible to write recursive queries, perform statisticalanalyses, define virtual tables on the fly, set defaults,take different actions based on the data retrieved,and so on.38 User-defined functions and stored pro-cedures make it possible to execute much of the ap-plication logic in the database, as close as possibleto the data for greatest efficiency. Database feder-ation extends these advantages to data that are notstored in the local store. Now applications requiringdistributed data can be written as easily as singlesource applications, using all the power of SQL.

Another fundamental aspect of relational systemsis that they protect data integrity. Transaction sup-port39 guarantees an all-or-nothing semantics for up-dates. The extension of commit protocols to distrib-uted environments40 is well understood and manydatabase systems implement one or two-phase com-

mit protocols. These add great value, of course, forintegrating data from several sources, allowing in-tegrity guarantees to be offered and building users’trust in the system. Sophisticated integrity con-straints41 and active mechanisms for enforcing con-straints of various kinds42 may also be extended toprotect relationships across federated data. Were weto build a data integration engine on some other plat-form, all of this would need to be built from scratch.

The local store is a further important aid to appli-cations over federated data. Any application deal-ing with data, and especially distributed data froma variety of sources, needs some place to keep in-formation about the data that it uses. The local storeis a convenient place for these meta-data. Withoutthat store, and the automatic cataloging of basic factsabout the data, the application developer would beforced to manually track and ensure the persistenceof such information. A second use for the local storeis to materialize data. Data may need to be mate-rialized temporarily, as part of query processing, forexample, or for longer—from a short-term cache toa persistent copy, or even a long-term store of dataunique to the federated applications.

Materialized views43 are an advanced feature of SQLthat exploits the local store. Originally conceived ofas a way to amortize the cost of aggregating largevolumes of data over multiple queries, a material-ized view (called an automatic summary table, or AST,in DB2) creates a stored version of the query results.When new queries arrive that match the AST’s def-inition, they are automatically “routed” to use thestored results rather than recomputing them.44 If theview definition can reference foreign data, then astored summary of data from federated sources canbe created. When summary tables are consideredduring query planning, DB2 will compare the cost offetching data directly from the local summary tableto the cost of re-evaluating the remote query andpick the cheaper strategy. This is easy and naturalin a database federation, but would take an enor-mous effort to replicate with other integration ap-proaches, as it depends so heavily on multiple fea-tures of the database engine approach.

Last but not least, using the database engine to in-tegrate data means that many if not all of the toolsand applications that have been developed over theyears for relational databases can be used withoutmodification on distributed data. These includequery builders such as Cognos45 and Brio,46 appli-cation development environments such as IBM Web-

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 587

Page 11: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

Sphere* Studio,47 IBM VisualAge*,48 and MicrosoftVisual C��**49 or WebGain VisualCafe**,50 reportgenerators, visualization tools, and so on. Likewise,the customer may already have applications devel-oped for a single database that can be reused easilyin the federated environment by substituting namesof remote tables for local ones, or modifying viewdefinitions to make use of remote data.

In summary, many of the advantages offered by stan-dard relational database systems are equally appli-cable to database federation. Thus, the database fed-eration approach provides a richly supportiveenvironment for the task of data integration. In thefollowing section, we illustrate some of these ben-efits through some typical usage scenarios.

DB2 federation usage scenarios

In this section, we describe a set of scenarios thatillustrate how database federation can be used to in-tegrate local and remote data from a variety ofsources, including relational data, Microsoft Excelspreadsheets, XML documents, Web services, andmessage queues. Furthermore, these scenarios dem-onstrate that by using the database as the integra-tion engine, business applications are able to exploitthe full power of SQL over federated data, includingcomplex queries and views, automatic summary ta-

bles, OLAP (on-line analytical processing) functions,DB2 Spatial Extender, replication, and caching.

Federation of distributed data. A nationwide depart-ment store chain has stores in several regions of thecountry. Each of the stores relies on a relational da-tabase to maintain inventory records and customertransactions. However, because the department storehas gone through several technology migrations, notall of the stores use the same database products. Boththe San Francisco store and the New York store useOracle databases to record business transactions,while the corporate headquarters has recently mi-grated to DB2.

Each store maintains a Transactions table, which con-tains an entry for each item scanned during a cus-tomer transaction. It is easy for individual stores togenerate storewide sales reports using the informa-tion stored in this table. Figure 4 shows how the cor-porate office can use DB2 technology to generate asales report across all stores. Because the San Fran-cisco and New York offices both use Oracle, the cor-porate office can use the Oracle wrapper providedwith DB2 to access both stores’ databases. Likewise,the corporate office can access other stores’ databasesusing the wrappers appropriate for those databases.Note that the schemas for the individual databasesneed not be the same, as long as queries can be for-

Past_SalesAST

CORPORATE DB2 DATABASE

ORACLEWRAPPER

sf.TransactionsNICKNAME

ny.TransactionsNICKNAME

National_TransactionsFEDERATED VIEW

Figure 4 A DB2 database federation that includes relational databases

SF ORACLE DATABASE

store_id tran_date tran_id item_id

NY ORACLE DATABASE

store_id tran_date tran_id item_id

HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002588

Page 12: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

mulated to extract the same information from eachstore.

Each store’s database is registered as a server in thecorporate headquarters database, and the tables thatthe corporate office needs to access are registeredas nicknames. For example, each store’s Transactionstable is registered as nickname. Once the nicknamesare in place, a federated view that shows company-wide transactions can be defined as follows:

CREATE VIEW National_Transactions (store_id, tran_date,tran_id, item_id) AS

SELECT store_id, tran_date, tran_id, item_idFROM sf.TransactionsUNION ALLSELECT store_id, tran_date, tran_id, item_idFROM ny.Transactions

Note that if more stores are added to the depart-ment chain, the corporate office does not need tomodify its business application. Rather, theNational_Transactions view definition can be updatedto include the information for the new stores. Giventhis view, the corporate office can run a single queryto generate a national sales report that shows thetotal of the number of items sold per month by allstores in the company:

SELECT MONTH(tran_date), item_id, COUNT(*)FROM National_TransactionsWHERE YEAR(tran_date)�2001GROUP BY MONTH(tran_date), item_id

In addition, the corporate office can create an au-tomatic summary table over the federated view tocache the transaction information for previous yearslocally, since it is not likely to change. The followingstatement can be used to create this materializedview:

CREATE TABLE Past_Sales AS (SELECT YEAR(tran_date) AS year, MONTH(tran_date) AS

month, item_id, COUNT(*) AS salesFROM National_TransactionsWHERE YEAR(tran_date) �� 2001GROUP BY YEAR(tran_date), MONTH(tran_date), item_id)

DATA INITIALLY DEFERRED REFRESH DEFERRED

The Past_Sales summary table is locally stored, andas a result, indexes can be created over the table andstatistics can be collected to improve query perfor-mance. In addition, a special register can be set toindicate that DB2 should automatically check to see

whether a given query could be executed over the(local) summary table rather than over the (remote)nicknames. If so, in addition to plans that send re-mote requests to the regional stores to get the in-formation and compute the result, DB2 will considerplans to extract the information from the Past_Salessummary table directly. This analysis is done trans-parently and does not require changes to the orig-inal query.

Federation of nonrelational structured data. Figure5 shows how the corporate office of the departmentstore chain can use a DB2 database federation to ac-cess nonrelational data sources as well. For exam-ple, the procurement office might like to know themanufacturer and the supplier of the best-sellingtelevision in 2001. Item and supplier information arestored in Excel spreadsheets and can be accessedfrom DB2 using the Excel wrapper. The items spread-sheet is mapped to an Items nickname, and the sup-pliers spreadsheet is mapped to a Suppliers nickname.Data contained in the spreadsheets can be retrievedusing SQL as shown by the following query:

SELECT i.mfg, s.idFROM Items i, Suppliers sWHERE i.id � s.id AND i.id � (SELECT g.id

FROM (SELECT g.id, COUNT(*), ROWNUMBER( )OVER (ORDER BY COUNT(*) DESC) AS rownumFROM National_Transactions g, Items itWHERE it.cat�‘television’ AND g.id � it.id AND

YEAR(tran_date)�2001GROUP BY g.id) AS tv_total_2001

WHERE rownum � 1)

In the above query, the OLAP function ROWNUMBERis used to order the COUNT(*) in descending order inthe nationwide total of sales for every model of tele-vision in the year 2001. The first row is then selectedto find the item identifier (ID) of the most frequentlysold television. This example shows that by exploit-ing DB2 database federation technology, the corpo-rate office can use a single (complex) SQL statementto correlate information among the stores and thecorporate office, although the data may be stored andrepresented differently at each location.

Federation of semi-structured data. The departmentstore chain also provides on-line shopping for cus-tomers. Customer data are stored in a Customers ta-ble in the corporate office database, and orders aregenerated as XML documents by a Web application.These XML documents can also be accessed from the

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 589

Page 13: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

corporate database using the XML wrapper to beshipped with DB2 Life Sciences Data Connect.31

XML documents map to nicknames, and elements ofthe XML document map to columns of the nickname.A single XML document can be mapped to multiplenicknames. An option associated with the nicknameallows the user to specify the location of the XMLdocument in the query, and an XPATH option on thenickname column definition maps the column to itslocation in the XML document. For example, the or-der document contains both order and item infor-mation, and as a result, the document is mapped toan Orders nickname and an Order_Items nickname, andqueries involving items from a particular order canbe expressed as joins of these two nicknames.

As part of order processing, the Web applicationmust determine and dispatch the order to the dis-tribution center closest to the customer. The corpo-rate office can store the geo-coded location of eachdistribution center in a Distribution_Centers table, geo-code the customer’s location, and use the DB2 Spa-tial Extender function db2gse.st_distance( ) to identifythe distribution center closest to the customer:

SELECT s.store_id, db2gse.st_distance(GEOCODE(o.street,o.city,o.state,o.zip), s.location, ‘mile’) AS distance

FROM Distribution_Centers s, Orders o

WHERE o.order � ‘/home/customers/order5795.xml’ ANDo.transaction_id�‘12AV56BG90’

ORDER BY distance

Federation of Web services. A small furniture com-pany supplies several nationwide retail stores withits products. The retail stores submit orders for fur-niture through a Web application, the furniture com-pany fulfills the order and ships the furniture to thestores. Since freight shipment represents a signifi-cant portion of the production cost, the furniturecompany contracts with several trucking companiesand puts each freight shipment up for bid. Figure 6shows a system configuration using DB2 for the or-der processing system. New orders are placed on anMQ Series queue. A back-end order processing sys-tem removes orders from the queue and solicits thebids from various trucking companies for shipment.

The order processing system uses a Pending_Orders ta-ble to maintain a list of orders currently being pro-cessed. Each order has an auto-generated unique or-der number, as well as the XML document thatcontains the order information. The furniture com-pany maintains a private UDDI registry for truckingcompanies that support a common Web services in-terface to request freight shipment bids. This reg-istry is available to the order processing system viaa wrapper, and a Freight_Shippers nickname supported

SF ORACLE DATABASE

DB2

ORACLEWRAPPER

NY ORACLE DATABASE

XML DOCUMENT

Figure 5 A DB2 database that includes relational and nonrelational data sources

XMLWRAPPER

EXCELWRAPPER

id cat mfg

id name phone

EXCEL

<order> <trans_id id=“12AV56BG90”/> <date date=“04/26/2002”/> <customer id=“8899”> <street street=“2041 Russell”/> <city name=“Paper City”/> <state name=“WI”/> <zip code=“54496”/> </customer> <item id=“5795”> <quantity amount=“1” /> </item> <item id=“51766”/> <quantity amount=“2” /> </item></order>

sf.Transactions ny.Transactions

HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002590

Page 14: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

by the wrapper supplies the names and URLs of truck-ing companies that support the bid Web service. AUDF called bid( ) takes the URL of a company and anXML description of an order, sends a Web servicesrequest to retrieve a bid for the order from the com-pany specified by the URL, and returns the compa-ny’s bid. A Bids table contains a list of bids obtainedfor a given order, including the order number, thename of the trucking company that supplied the bid,and the bid itself.

The furniture company can exploit DB2’s federationcapabilities to automate a significant portion of theorder process. For example, orders can be removedfrom the MQ Series queue (via the db2mq.mqreceive( )UDF), assigned a unique order number, and insertedinto the Pending_Orders table with the followingstatement:

INSERT INTO Pending_OrdersVALUES(GENERATE_UNIQUE( ), db2mq.mqreceive( ))

Furthermore, a trigger defined on the Pending_Orderstable can automatically kick off the bid process asnew orders are inserted into the table:

CREATE TRIGGER Get_BidsAFTER INSERT ON Pending_OrdersREFERENCING NEW AS orderFOR EACH ROW MODE DB2SQL

INSERT INTO BidsSELECT Order.ordernum, s.name, bid(s.url, order.orderxml)FROM Freight_Shippers S

This SQL statement illustrates the power of a data-base federation solution for integrating data. A sim-ple insert statement causes a sophisticated trigger toexecute over federated data that are transparently com-bined via user defined functions (db2mq.mqreceive( ) andbid( )) and wrapper-based federation (the Freight_Shippersnickname).

Heterogeneous replication using database feder-ation. Many businesses choose to keep multiple cop-ies of their data for various uses, including data ware-housing, fault tolerance, and fail-over scenarios. Amajor retailer with outlets all over the United Statesbacks up data from its various locations to regionaldata centers. Due to independent purchasing deci-sions, the retail outlets use one relational databasemanagement system, while the data center might useanother. The replication process is relatively straight-forward, and involves extracting data from the out-lets’ databases, optionally reshaping the data and/oraggregating the data, and inserting the data into thedata center database.

Figure 7 shows two approaches that use DB2 tech-nology as the extract/transform vehicle to transfer

Figure 6 A DB2 database federation that includes Web services and other sources

DB2

WRAPPER

UDDI REGISTRY

Freight_Shippers

WEB SERVICES

UDF

UDF

MQSERIES QUEUE

New_OrdersTRUCKING COMPANY 1

TRUCKING COMPANY 2

TRUCKING COMPANY N

LOCAL TABLES

Pending_OrdersBids

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 591

Page 15: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

data from the outlets to the data center. Approach1 in Figure 7 shows that if the data center uses a DB2database, the data transfer can take place with sim-ple statements of the form:

INSERT INTO �data center local_table�

SELECT . . . FROM �outlet nickname�. . . .

Approach 2 in Figure 7 shows that even if the datacenter uses another relational database product, thesame INSERT statement can be used to populate thedatabase, with the only difference being that the fed-erated database will insert into a nickname instead:

INSERT INTO �data center nickname�

SELECT . . . FROM �outlet nickname�. . . .

Note that in either approach the SELECT statementcan be arbitrarily complex and can be used to se-lectively retrieve, re-shape, and aggregate data ac-cording to the semantics of the application.

DB2 DataPropagator*51 is an IBM product that usesDB2 federation technology to replicate data. Data-Propagator automates the copying of data betweenremote systems, providing automated change prop-agation, flexible scheduling, and other customizationfeatures. DataPropagator exploits the remoteinsert/update/delete capability illustrated above totransparently apply changes to all relational sourcesincluding non-DB2 sources.

Dynamic data caching. As shown in Figure 8, a typ-ical e-commerce Web application consists of a three-

tiered architecture: the Web server tier, the appli-cation server tier, and the data tier. User requestsare routed to one of multiple Web servers, whichforward the user requests to one of the applicationservers for processing. The application servers in turnretrieve product data from, and insert order infor-mation into, a single back-end database.

It is easy to see from the figure that as traffic in-creases, the back-end database can quickly becomethe bottleneck. DBCACHE52,53 is a research prototypebuilt with federation technology that provides scal-ability of the data tier. Each application server nodemay include a front-end database server, the “cache”of DBCACHE. DBCACHE allows database administra-tors to replicate portions of the back-end databaseacross multiple front-end databases, allowing clientrequests to be routed to the front-end databases. Thistopology is often less expensive than a single largeparallel system, and also provides a layer of fault tol-erance; a single database crash does not cause theentire database to become inoperative.

Figure 9 shows an e-commerce application usingDBCACHE technology. Application tables are dividedinto two categories: cached and noncached. Cachedtables are mostly read-only, and accessed by mul-tiple users. For these kinds of tables, maintainingstrict consistency is not necessary, and reading staledata is acceptable. From an on-line shopper’s per-spective, product catalog data are read-only (queryQ1 in Figure 9, for example). Updates occur infre-quently, usually through a scheduled deployment sce-nario (never by an on-line shopper), and it is not a

Figure 7 Heterogeneous replication using database federation

DB2

OUTLET TABLES

APPROACH 1

LOCAL TABLES

DB2

DATA CENTER DATABASEOUTLET TABLES

APPROACH 2

DATA CENTER DATABASE

HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002592

Page 16: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

critical loss if a shopper sees product informationthat is a few seconds out of date. Furthermore, sinceall shoppers browse the product catalog before mak-ing a purchase, the product catalog is heavily ac-cessed.

Noncached tables are typically read/write. For thesetables, maintaining strict consistency is important,and accessing stale data is intolerable. Order infor-mation is an example of data stored in a noncachedtable. From the perspective of the shopper, orderinformation is mostly write-only (query Q2 in Fig-ure 9, for example). It is captured as part of on-lineorder processing and read only by back-end orderfulfillment applications and by the shopper to checkorder status. Because orders result in transfer ofmoney, it is critical that they present an accurate viewof the data.

As shown in Figure 9, the Products back-end table isautomatically replicated to a cached table across mul-tiple front ends using DataPropagator (Data Propin Figure 9), according to a schedule defined by theDBA. Reads to the Products table are transparentlyrouted to one of the front-end databases, whereaswrites (if any) are routed directly to the back-enddatabase. The Orders back-end table is representedas a nickname in the front-end databases, and bothreads and writes are passed to the back-end data-base using DB2 federated technology.

The client application sees a single view of the data.Queries for product information (query Q1 in Fig-ure 9) are dynamically routed to one of the cachedtables on the front-end database, providing a levelof load balancing and fault tolerance for these heav-ily accessed data. Insert statements to the Orders ta-ble (query Q2 in Figure 9) are routed directly to theback-end table. Statements can span both cached andnoncached tables. For example, a query to check thestatus of an order (query Q3 in Figure 9) can joininformation from the Orders nickname with theProducts table.

Conclusions

In this paper, we have shown that database feder-ation is a powerful tool for integrating data. Data-base federation employs a database engine to cre-ate a virtual database from several, possibly many,heterogeneous and distributed data stores. We iden-tified three styles of database federation. In all ofthem, the database engine is the key driver, but themethod by which data or functions are included inthe federation differs. We presented guidelines onwhen each style of federation should be used: user-defined functions are most appropriate for fairly sim-ple integration tasks, whereas the wrapper architec-ture supports a much broader and more complex setof tasks. We discussed why database technology isso crucial to data integration, and how many of the

Figure 8 Three-tier architecture for an e-commerce application

APPLICATIONSERVER

APPLICATIONSERVER

DATABASE

CLIENT APPLICATION

CLIENT APPLICATION

WEB SERVER

CLIENT APPLICATION

CLIENT APPLICATION

WEB SERVER

CLIENT APPLICATION

CLIENT APPLICATION

WEB SERVER

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 593

Page 17: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

features of relational databases can be applied in adistributed environment to ease the development ofnew applications that span multiple data sources. Fi-nally, we demonstrated through a number of usecases how various database features—includingviews, ASTs, OLAP functions, joins, unions, and ag-gregations—can be used in conjunction with mul-tiple federation styles to integrate data from sourcesas diverse as MQ queues, XML documents, Excelspreadsheets, Web services, and relational databasemanagement systems. We showed how federationcould be used as the basis for report gathering, forwarehouse loading and replication, and even forcaching. The diversity of applications for this tech-nology is reflective of the many ways in which datamust be integrated, and the applicability of databasefederation to the broad range of challenges demon-strates its importance for data integration.

Of course, database federation is not a panacea.There may be some integration scenarios for which

it is overkill, or for which another of the many ap-proaches to data integration might be easier. How-ever, we strongly believe that database federationmust be a fundamental part of any integration so-lution. Our future efforts will be directed toward im-proving the ease of use for this technology. Toolsare needed to develop robust and efficient wrappersquickly and with minimal effort. We continue to en-hance the performance of the system, improving theoptimization of queries and the set of execution strat-egies available to the optimizer. Further challengesin this modern age include being able to handle asyn-chrony everywhere, better integration of XML capa-bilities, and including native support for XML query.As customers exercise the technology, we are find-ing that richer support for semantic meta-data wouldbe helpful, as well as more automation in the ad-ministration of the federated system. With enhance-ments such as these, we expect that database fed-eration will come to be widely perceived as thecornerstone of data integration.

Figure 9 E-commerce application using DBCASHE

QI Q2 Q3

FRONTEND 1

Products Orders Orders Products

FRONTEND 2

BACKEND DATABASE

SELECT p*FROM Products pWHERE p.category = ‘SHOES’

INSERT INTO OrdersVALUES(GENERATE_UNIQUE( ),8080,50785,2)

SELECT o.status, p.nameFROM Orders o, Products pWHERE o.customerid = 8080 ANDo.item_id = p.item_id

DATAPROP

Products

WRAPPER

Orders

DATAPROP

WRAPPER

HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002594

Page 18: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

*Trademark or registered trademark of International BusinessMachines Corporation.

**Trademark or registered trademark of the Object ManagementGroup, Sun Microsystems, Inc., Microsoft Corporation, orWebGain, Inc.

Cited references

1. Documentum, Inc., Content Management: DocumentumProducts, http://www.documentum.com/content-management_Products.html.

2. IBM Corporation, Content Manager, http://www.ibm.com/software/data/cm/.

3. M. A. Roth, D. C. Wolfson, J. C. Kleewein, and C. J. Nelin,“Information Integration: A New Generation of InformationTechnology,” IBM Systems Journal 41, No. 4, 563–577 (2002,this issue).

4. IBM Corporation, WebSphere Application Server, http://www.3.ibm.com/software/webservers/appserv/enterprise.html.

5. LION Bioscience AG, LION DiscoveryCenter, http://www.lionbioscience.com/solutions/discoverycenter/.

6. middleAware.com, Component Focus, http://www.middleware.net/components/index.html.

7. F. Leymann and D. Roller, “Using Flows in Information In-tegration,” IBM Systems Journal 41, No. 4, 732–742 (2002,this issue).

8. M. Roscheisen, M. Baldonado, C. Chang, L. Gravano,S. Ketchpel, and A. Paepcke, “The Stanford InfoBus and ItsService Layers: Augmenting the Internet with Higher-LevelInformation Management Protocols,” Digital Libraries inComputer Science: The MeDoc Approach, Lecture Notes inComputer Science, No. 1392, Springer, New York (1998), pp.213–230.

9. IBM Corporation, Enterprise Information Portal, http://www-3.ibm.com/software/data/eip.

10. H. Pirahesh, J. Hellerstein, and W. Hasan, “Extensible Rule-Based Query Rewrite Optimization in Starburst,” Proceed-ings of the 1992 ACM SIGMOD International Conference onManagement of Data, San Diego, CA, June 2–5, 1992, ACM,New York (1992), pp. 39–48.

11. M. Tork Roth, F. Ozcan, and L. Haas, “Cost Models DO Mat-ter: Providing Cost Information for Diverse Data Sources ina Federated System,” Proceedings of the Conference on VeryLarge Data Bases (VLDB), Edinburgh, Scotland, September1999, Morgan Kaufmann Publishers, San Mateo, CA (1999),pp. 599–610.

12. P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie, andT. Price, “Access Path Selection in a Relational DatabaseManagement System,” Proceedings of the 1979 ACM SIGMODInternational Conference on Management of Data, Boston, MA,1979, ACM, New York (1979), pp. 23–34.

13. H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakon-stantinou, J. Ullman, and J. Widom, “Integrating and Access-ing Heterogeneous Information Sources in TSIMMIS,” Pro-ceedings of the AAAI Symposium on Information Gathering,Stanford, CA, March 1995, AAAI Press (1995), pp. 61–64.

14. S. Adali, K. Candan, Y. Papakonstantinou, and V. S. Sub-rahmanian, “Query Caching and Optimization in DistributedMediator Systems,” Proceedings of the ACM SIGMOD Con-ference on Management of Data, Montreal, Canada, June 1996,ACM, New York (1996), pp. 137–148.

15. A. Tomasic, L. Raschid, and P. Valduriez, “Scaling Heter-ogeneous Databases and the Design of DISCO,” Proceed-ings of the 16th International Conference on Distributed Com-

puter Systems, May 1996, Hong Kong, IEEE, New York(1996), pp. 449–457.

16. M.-C. Shan, “Pegasus Architecture and Design Principles,”Proceedings of the 1993 ACM SIGMOD International Confer-ence on Management of Data, Washington, D.C., May 1993,ACM, New York (1993), pp. 422–425.

17. M. Tork Roth, P. Schwarz, and L. Haas, “An Architecturefor Transparent Access to Diverse Data Sources,” Compo-nent Database Systems, K. R. Dittrich, A. Geppert, Editors,Morgan-Kaufmann Publishers, San Mateo, CA (2001), pp.175–206.

18. IBM Corporation, DB2 Product Family, http://www-3.ibm.com/software/data/db2/.

19. M. Tork Roth and P. Schwarz, “Don’t Scrap It, Wrap It! AWrapper Architecture for Legacy Data Sources,” Proceed-ings of the Conference on Very Large Data Bases (VLDB), Ath-ens, Greece, August 1997, Morgan Kaufmann Publishers, SanMateo, CA (1997), pp. 266–275.

20. L. Haas, D. Kossmann, E. Wimmers, and J. Yang, “Optimiz-ing Queries Across Diverse Data Sources,” Proceedings of theConference on Very Large Data Bases (VLDB), Athens,Greece, August 1997, Morgan Kaufmann Publishers, San Ma-teo, CA (1997), pp. 276–285.

21. iWay Software, iWay Adapters and Connectors, http://www.iwaysoftware.com/products/e2e_integration_products.html.

22. Microsoft Corporation, Microsoft ODBC, http://www.microsoft.com/data/odbc/.

23. IBM Corporation, DataJoiner, http://www.software.ibm.com/data/datajoiner/.

24. Microsoft Corporation, Access 2002 Product Guide, http://www.microsoft.com/office/access/default.asp.

25. IBM Corporation, DB2 Relational Connect, http://www-3.ibm.com/software/data/db2/relconnect/.

26. B. Reinwald, H. Pirahesh, G. Krishnamoorthy, G. Lapis,B. Tran, and S. Vora, “Heterogeneous Query ProcessingThrough SQL Table Functions,” Proceedings of the 15th In-ternational Conference on Data Engineering, March 1999, Sid-ney, Australia, IEEE, New York (1999), pp. 366–373.

27. L. M. Haas, P. M. Schwarz, P. Kodali, E. Kotlar, J. E. Rice,and W. C. Swope, “DiscoveryLink: A System for IntegratedAccess to Life Sciences Data Sources,” IBM Systems Journal40, No. 2, 489–511 (2001), http://www.research.ibm.com/journal/sj/402/haas.html.

28. ISO/IEC 9075-2:2000, Information technology—Databaselanguages—SQL—Part 2: Foundation (SQL/Foundation), In-ternational Organization for Standardization, Geneva, Swit-zerland (2000).

29. ISO/IEC 9075-9:2000, Information technology—Databaselanguages—SQL—Part 9: Management of External Data(SQL/MED), International Organization for Standardization,Geneva, Switzerland (2000).

30. L. Haas, J. Freytag, G. Lohman, and H. Pirahesh, “Exten-sible Query Processing in Starburst,” Proceedings of the 1989ACM SIGMOD International Conference on Management ofData, Portland, OR, May 31–June 2, 1989, ACM, New York(1989), pp. 377–388.

31. IBM Corporation, DB2 Life Sciences Data Connect, http://www-3.ibm.com/software/data/db2/lifesciencesdataconnect/.

32. IBM Corporation, IBM MQSeries Integrator V2.0: The NextGeneration Message Broker, http://www-3.ibm.com/software/ts/mqseries/library/whitepapers/mqintegrator/msgbrokers.html.

33. V. Josifovski, P. Schwarz, L. Haas, and E. Lin, “Garlic: ANew Flavor of Federated Query Processing for DB2,” Pro-

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 595

Page 19: Data integration through database federationpdfs.semanticscholar.org/c929/4a8f4501be95118da78bccc8549a42… · they contain. Database federation is one approach to data integration

ceedings of the ACM SIGMOD Conference on Managementof Data, Madison, WI, June 2002, ACM, New York (2002).

34. IBM Corporation, Lotus Extended Search, http://www.lotus.com/products/des.nsf/wdocs/home.

35. E. F. Codd, “A Relational Model of Data for Large SharedData Banks,” Communications of the ACM 13, No. 6, 377–387 (June 1970).

36. E. F. Codd, The Relational Model for Database Management,Version 2, Addison-Wesley Publishing Co., Reading, MA(1990).

37. C. J. Date and H. Darwen, A Guide to SQL Standard, 4thEdition, Addison-Wesley Publishing Co., Reading, MA(1997).

38. D. Chamberlin, A Complete Guide to DB2 Universal Data-base, Morgan Kaufmann Publishers, San Mateo, CA (1998).

39. J. Gray and A. Reuter, Transaction Processing: Concepts andTechniques, Morgan Kaufmann Publishers, San Mateo, CA(1993).

40. I. Traiger, J. Gray, C. Galtieli, and B. Lindsay, Transactionsand Consistency in Distributed Database Systems, Research Re-port RJ2555, IBM Almaden Research Center, San Jose, CA95120 (1979).

41. M. Stonebraker, “Implementation of Integrity Constraintsand Views by Query Modification,” Proceedings of the 1975ACM SIGMOD Conference on Management of Data, ACM,New York (1975), pp. 65–78.

42. Rules in Database Systems, T. K. Sellis, Editor, Proceedingsof the Second International Workshop (RIDS ’95), Glyfada,Athens, Greece, September 25–27, 1995, Lecture Notes inComputer Science, No. 985, Springer, New York (1995).

43. D. Srivastava, S. Dar, H. V. Jagadish, and A. Levy, “Answer-ing Queries with Aggregation Using Views,” Proceedings ofthe 22nd International Conference on Very Large Data Bases(VLDB ’96), Morgan Kaufmann Publishers, San Mateo, CA(1996), pp. 318–329.

44. M. Zaharioudakis, R. Cochrane, G. Lapis, H. Pirahesh, andM. Urata, “Answering Complex SQL Queries Using Auto-matic Summary Tables,” Proceedings of the 2000 ACM SIG-MOD International Conference on Management of Data, ACM,New York (2000), pp. 105–116.

45. Cognos Incorporated, Business Intelligence Products, http://www.cognos.com/products/index.html.

46. Brio Software, Inc., Business Intelligence solution for dataquery and analysis, http://www.brio.com/products/overview.html.

47. IBM Corporation, WebSphere Studio Application Developer,http://www-3.ibm.com/software/ad/studioappdev/.

48. IBM Corporation, VisualAge C��, http://www-3.ibm.com/software/ad/vacpp/.

49. Microsoft Corporation, Visual C�� .NET, http://msdn.microsoft.com/visualc/.

50. WebGain, Inc., Visual Cafe, http://www.webgain.com/products/visual_cafe/.

51. IBM Corporation, DB2 Universal Database, DB2 Data Prop-agator, http://www-3.ibm.com/software/data/dpropr/.

52. C. Mohan, “Caching Technologies for Web Applications,”Proceedings of the 27th International Conference on Very LargeData Bases, September 11–14, 2001, Roma, Italy, MorganKaufmann Publishers, San Mateo, CA (2001).

53. Q. Luo, S. Krishnamurthy, C. Mohan, H. Pirahesh, H. Woo,B. Lindsay, and J. Naughton, “Middle-Tier Database Cach-ing for e-Business,” Proceedings of the ACM SIGMOD Con-ference on Management of Data, Madison, WI, June 2002,ACM, New York (2002).

Accepted for publication July 22, 2002.

Laura M. Haas IBM Software Group, Silicon Valley Laboratory,555 Bailey Avenue, San Jose, California 95141 (electronic mail:[email protected]). Dr. Haas is a Distinguished Engineer andsenior manager in the IBM Software Group, where she is respon-sible for the DB2 Query Compiler development, including keytechnologies for DiscoveryLinkTM and Xperanto. Dr. Haas, whojoined the IBM Almaden Research Center in 1981, has made sig-nificant contributions to database research and has led severalresearch projects, including R*, Starburst, and Garlic. She hasreceived an IBM Outstanding Technical Achievement Award forher work on R* and DiscoveryLink, an IBM Outstanding Con-tribution Award for Starburst, a YWCA Tribute to Women inIndustry (TWIN) Award, and an ACM SIGMOD OutstandingContribution Award.

Eileen T. Lin IBM Software Group, Silicon Valley Laboratory, 555Bailey Avenue, San Jose, California 95141 (electronic mail:[email protected]). Dr. Lin is a Senior Technical Staff Memberand lead architect for DB2 database federation. She joined IBMin 1990 after receiving her Ph.D. degree from the Georgia In-stitute of Technology in Atlanta. She was the lead architect forDataJoiner query processing and led the team that merged Data-Joiner and Garlic technology into DB2 for UNIX� and Win-dows�. She is currently a lead architect responsible for the de-livery of database federation technology in DB2 and Xperanto.

Mary A. Roth IBM Software Group, Silicon Valley Laboratory,555 Bailey Avenue, San Jose, California 95141 (electronic mail:[email protected]). Ms. Roth is a senior engineer and man-ager in the Database Technology Institute for e-Business at IBM’sSilicon Valley Lab. She has over 12 years of experience in da-tabase research and development. As a researcher and memberof the Garlic project at IBM’s Almaden Research Center, shecontributed key advances in heterogeneous data integration tech-niques and federated query optimization and led efforts to trans-fer Garlic support to DB2. Ms. Roth is leading a team of devel-opers to deliver a key set of components for Xperanto, IBM’sinformation integration initiative for distributed data access andintegration.

HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002596