Information Integration in the Enterprise

    Large enterprises spend a great deal of time and money on "information integration": combining information from different sources into a unified format. Frequently cited as the biggest and most expensive challenge that information-technology shops face, information integration is thought to consume about 40% of their budget [4, 14, 16, 33, 36]. Market-intelligence firm IDC estimates that the market for data integration and access software (which includes the key enabling technology for information integration) was about $2.5 billion in 2007 and is expected to grow to $3.8 billion in 2012, for an average annual growth rate of 8.7% [20].

    Software purchases are only one part of the total expense. Integration activities cover any form of information reuse, such as moving data from one application's database to another's, translating messages for business-to-business e-commerce, and providing access to structured data and documents via a Web portal.

    DOI: 10.1145/1378727.1378745

    A guide to the tools and core technologies formerging information from disparate sources.

    By Philip A. Bernstein and Laura M. Haas

    review articles

    Illustration by Leander Herzog

    Communications of the ACM | September 2008 | Vol. 51 | No. 9

    Beyond classical information-technology applications, information integration is also a large and growing part of science, engineering, and biomedical computing, as independent labs often need to use and combine each other's data.

    Software vendors offer numerous tools to reduce the effort, and hence the cost, of integration and to improve the quality. Moreover, because information integration is a complex and multifaceted task, many of these tools are highly specialized. The resulting profusion of tools can be confusing. In this article, we try to clear up any confusion by:

    • Exploring an example of a typical integration problem
    • Describing types of information-integration tools used in practice
    • Reviewing core technologies that lie at the heart of integration tools
    • Identifying future trends.

    An Example

    Consider a large auto manufacturer's support center that receives a flood of emails and service-call transcriptions every day. From any given text, company analysts can extract the type of car, the dealership, and whether the customer is pleased or annoyed with the service. But to truly understand the reasons for the customer's sentiment, the company also needs to know something about the dealership and the transaction: information that is kept in a relational database.

    Solving such an integration problem is an iterative process. The data must first be understood and then prepared for integration by means of cleansing and standardization. Next, specifications are needed regarding what data should be integrated and how they are related. Finally, an integration program is generated and executed by some type of integration engine. The results are examined, and any anomalies must be resolved, which often requires returning to step one and studying the data.

    Many technologies are needed to support this process. We introduce a few here and then describe them in greater depth, along with others, in subsequent sections.

    The first step toward integrating the text and the relational data is to understand what transactions and other information they contain and how to relate that information to each dealership. The manufacturer next needs to decide how to represent the integrated information. A simple schema (auto model, customer, dealership, date sold, price, size of dealership, date of problem, problem type) might suffice (see Figure 1). But how should each field be represented, and where will the data come from? Let's assume that the auto model, customer, and dealership information will be extracted from the textual complaint, as will the date of problem and problem type. The relational database has tables about dealerships and transactions that can provide the rest of the information: size of dealership, date sold, and price.

    [Figure 1. Source data and schemas, and the target schema. Sources: a TRANSACTIONS table (TransID, Customer, DealerID, Auto Model, Date Sold, Price), a DEALERS table (Dealership, DealerID, Size of Dealership, Owner, Annual Revenue), and a customer email. Target schema: Auto Model, Customer, Dealership, Date Sold, Price, Size of Dealership, Problem Date, Problem Type.]

    Next, programs are needed to extract structured information from the email or transcription text. These programs' outputs provide a schema for the text data: the "fields" that can be queried. Matching and mapping tools can be used to relate this derived schema to the target schema. Similarly, matching and mapping must be done for the relational schema. The dealership name extracted from the text can be connected to an entry in the dealership table, and the customer, auto model, and dealership to the transactions table, thus joining the two tables with the textual data.

    Programs are needed to align data instances: because it is unlikely that the data formats of the extracted text are identical to those in the relational database, some data cleansing will be required. For example, dealership names may not exactly match. These data-integration programs must then be executed, often using a commercial integration product.

    Types of Information-Integration Tools

    A variety of architectural approaches can be used to solve problems like the one above. We summarize these approaches in this section, along with the general types of products used. For information on specific products, we refer the interested reader to IT research companies that publish comprehensive comparisons and to Web search engines (using the product categories we define here as keyword queries).
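Before turning to the tool categories, the final join step of the auto-manufacturer example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the field names, the dealership name, and the `normalize` and `integrate` helpers are all hypothetical, and the extraction step is assumed to have already produced a record from the email.

```python
# Hypothetical rows standing in for the relational TRANSACTIONS and DEALERS tables.
transactions = [
    {"trans_id": 1234, "customer": "Jutt, John J.", "dealer_id": 32,
     "auto_model": "Galaxy", "date_sold": "4/21/07", "price": 5000},
]
dealers = [
    {"dealership": "Lakeside Motors", "dealer_id": 32, "size": 300},
]

# Hypothetical output of an information-extraction step run over one email.
extracted = {"customer": "John J. Jutt", "auto_model": "galaxy",
             "dealership": "lakeside motors", "problem_date": "Oct. 28, 2007",
             "problem_type": "squealing noise"}


def normalize(name):
    """Crude cleansing: lowercase, drop punctuation, sort the name tokens."""
    tokens = name.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(sorted(tokens))


def integrate(record, transactions, dealers):
    """Join one extracted record with both tables; return None if no match."""
    dealer = next((d for d in dealers
                   if normalize(d["dealership"]) == normalize(record["dealership"])),
                  None)
    if dealer is None:
        return None
    txn = next((t for t in transactions
                if t["dealer_id"] == dealer["dealer_id"]
                and normalize(t["customer"]) == normalize(record["customer"])
                and t["auto_model"].lower() == record["auto_model"].lower()),
               None)
    if txn is None:
        return None
    # Assemble a row in the target schema from all three sources.
    return {"auto_model": txn["auto_model"], "customer": record["customer"],
            "dealership": dealer["dealership"], "date_sold": txn["date_sold"],
            "price": txn["price"], "size_of_dealership": dealer["size"],
            "problem_date": record["problem_date"],
            "problem_type": record["problem_type"]}


row = integrate(extracted, transactions, dealers)
```

The `normalize` helper stands in for the data-cleansing step: without it, "Jutt, John J." in the table and "John J. Jutt" in the email would not join.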

    Data Warehouse Loading. A data warehouse is a database that consolidates data from multiple sources [7]. For example, it may combine sales information from subsidiaries to give a sales picture for the whole company. Because subsidiaries have overlapping sets of customers and may have inconsistent information about a particular customer, data must be cleansed to reconcile such differences. Moreover, each subsidiary may have a database schema (that is, data representation) that differs from the warehouse schema. So each subsidiary's data has to be reshaped into the common warehouse schema.

    Extract-Transform-Load (ETL) tools address this problem [21] by simplifying the programming of scripts. An ETL tool typically includes a repertoire of cleansing operations (such as detection of approximate duplicates) and reshaping operations (such as Structured Query Language [SQL]-style operations to select, join, and sort data). The tool may also include scheduling functions to control periodic loading or refreshing of the data warehouse.
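The cleanse, reshape, and load sequence can be illustrated with a small hand-written script of the kind an ETL tool aims to simplify. This is a minimal sketch using Python's standard `sqlite3` module. The two subsidiary extracts, their field names, and the `sales` warehouse table are invented for illustration.

```python
import sqlite3

# Hypothetical extracts from two subsidiaries with different schemas.
subsidiary_a = [{"cust_name": " Acme Corp ", "amount_usd": "120.50"},
                {"cust_name": "acme corp", "amount_usd": "79.50"}]
subsidiary_b = [{"client": "Bolt Ltd", "total_cents": 30000}]


def transform_a(row):
    # Cleanse: trim whitespace and normalize case. Reshape: rename and cast.
    return (row["cust_name"].strip().title(), float(row["amount_usd"]))


def transform_b(row):
    # Reshape: convert cents to dollars to match the warehouse schema.
    return (row["client"].strip().title(), row["total_cents"] / 100.0)


def load(conn, rows):
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, [transform_a(r) for r in subsidiary_a])
load(conn, [transform_b(r) for r in subsidiary_b])

# SQL-style reshaping inside the warehouse: total sales per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
```

After cleansing, the two spellings of the same customer in subsidiary A collapse into one warehouse customer, which is what lets the final query report a single total for it.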

    Some ETL tools are customized for master data management, that is, to produce a data warehouse that holds the master copy of critical enterprise reference data, such as information about customers or products. Master data is first integrated from multiple sources and then itself becomes the definitive source of that data for the enterprise. Master data-management tools sometimes include domain-specific functionality. For example, for customer or vendor information, they may have formats for name and address standardization and cleansing functions to validate and correct postal codes.

    Virtual Data Integration. While warehouses materialize the integrated data, virtual data integration gives the illusion that data sources have been integrated without materializing the integrated view. Instead, it offers a mediated schema against which users can pose queries. The implementation, often called a query mediator [35] or enterprise-information integration (EII) system [16, 27], translates the user's query into queries on the data sources and integrates the result of those queries so that it appears to have come from a single integrated database. EII is still an emerging technology, currently less popular than data warehousing.

    Although the databases cover related subject matter, they are heterogeneous in that they may use different database systems and structure the data using different schemas. An EII system might be used, for example, by a financial firm to prepare for each customer a statement of portfolio positions that combines information about his or her holdings from the local customer database with stock prices retrieved from an external source.

    To cope with this heterogeneity in EII, a designer creates a mediated schema that covers the desired subject matter in the data sources and maps the data source schemas to the new mediated schema. Data cleansing and reshaping problems appear in the EII context, too. But the solutions are somewhat different in EII because data must be transformed as part of query processing rather than via the periodic batch process associated with loading a data warehouse.
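The portfolio-statement scenario gives a compact way to see what transforming data as part of query processing means. The sketch below is a toy mediator, not a description of any EII product. The holdings rows, the stubbed `fetch_price` function, and the mediated schema (customer, symbol, shares, value) are assumptions made for illustration.

```python
# Source 1: a local holdings database (hypothetical rows).
holdings_db = [
    {"cust": "C1", "ticker": "AAA", "shares": 10},
    {"cust": "C1", "ticker": "BBB", "shares": 5},
    {"cust": "C2", "ticker": "AAA", "shares": 2},
]


# Source 2: an external price feed, stubbed here as a function.
def fetch_price(symbol):
    prices = {"AAA": 20.0, "BBB": 8.0}
    return prices[symbol]


def portfolio_positions(customer_id):
    """Answer a query over the mediated schema (customer, symbol, shares, value).

    Nothing is materialized: each call queries the sources and maps their
    fields (cust -> customer, ticker -> symbol) into the mediated schema.
    """
    result = []
    for row in holdings_db:                      # sub-query on source 1
        if row["cust"] != customer_id:
            continue
        price = fetch_price(row["ticker"])       # sub-query on source 2
        result.append({"customer": row["cust"], "symbol": row["ticker"],
                       "shares": row["shares"], "value": row["shares"] * price})
    return result


statement = portfolio_positions("C1")
total_value = sum(position["value"] for position in statement)
```

Because prices are fetched when the query runs, the statement reflects the current feed, whereas a warehouse would reflect the prices as of its last load.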

    EII products vary, depending on the types of data sources to be integrated. For example, some products focus on integrating SQL databases, some on integrating Web services, and some on integrating bioinformatics databases.

    Message Mapping. Message-oriented middleware helps integrate independently developed applications by moving messages between them. If messages pass through a broker, the product is usually called an enterprise-application integration (EAI) system [1]. If a broker is avoided through all applications' use of the same protocol (for example, Web services), then the product is called an enterprise service bus. If the focus is on defining and controlling the order in which each application is invoked (as part of a multistep service), then the product is called a workflow system.

    In addition to the protocol-translation and flow-control services provided by these products, message-translation services, which constitute another form of information integration, are also needed.

    A typical message-translation scenario in e-commerce enables a small vendor (say, Pico) to offer its products through a large retail Web site (Goliath). When a customer buys one of Pico's products from Goliath, Goliath sends an order message to Pico, which then has to translate that message into the format required by its order-processing system. A message-mapping tool can help Pico meet this challenge. Such a tool offers a graphical interface to define translation functions, which are then compiled into a program to perform the message translation. Similar mapping tools are used to help relate the schemas of the source databases to the target schema for ETL and EII and to generate the programs needed for data translation.
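The idea of defining translation functions declaratively and compiling them into a translator can be sketched as follows. Both message formats are invented for illustration. The field names, `ORDER_MAPPING`, and `compile_mapping` are hypothetical and do not correspond to any real mapping tool.

```python
# Declarative mapping: target field -> (source field, conversion function).
# Both message formats are invented for illustration.
ORDER_MAPPING = {
    "sku":        ("itemCode", str),
    "quantity":   ("qty", int),
    "ship_to":    ("deliveryAddress", str.strip),
    "unit_cents": ("unitPrice", lambda dollars: round(float(dollars) * 100)),
}


def compile_mapping(mapping):
    """Turn the declarative mapping into a function that translates messages."""
    def translate(message):
        return {target: convert(message[source])
                for target, (source, convert) in mapping.items()}
    return translate


translate_order = compile_mapping(ORDER_MAPPING)

# An order as the retailer might send it, and its translation for the vendor.
goliath_order = {"itemCode": "P-100", "qty": "3",
                 "deliveryAddress": " 1 Main St ", "unitPrice": "19.99"}
pico_order = translate_order(goliath_order)
```

Keeping the mapping as data rather than code is the design choice that lets a graphical tool edit it: the lines a designer draws between fields become entries in a table like `ORDER_MAPPING`.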

    Object-to-Relational Mappers. Application programs today are typically written in an object-oriented language, but the data they access is usually stored in a relational database. While mapping applications to databases requires integration of the relational and application schemas, differences in schema constructs can make the mapping rather complicated. For example, there are many ways to map classes that are related by inheritance into relational tables. To simplify the problem, an object-to-relational mapper offers a high-level language in which to define mappings [23]. The resulting mappings are then compiled into programs that translate queries and updates over the object-oriented interface into queries and updates on the relational database.
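To make the inheritance problem concrete, the sketch below hand-codes one of the possible mappings: a single table for the whole class hierarchy, with a discriminator column. The `Person` and `Employee` classes, the `person` table, and the `save` and `load_all` helpers are hypothetical. A real object-to-relational mapper would generate this kind of code from a mapping definition rather than have it written by hand.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Person:
    id: int
    name: str


@dataclass
class Employee(Person):
    salary: float


# One of several possible mappings: a single table for the whole hierarchy,
# with a 'kind' column recording which class each row represents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, kind TEXT, "
             "name TEXT, salary REAL)")


def save(obj):
    """Translate an object update into a relational insert."""
    salary = obj.salary if isinstance(obj, Employee) else None
    conn.execute("INSERT INTO person VALUES (?, ?, ?, ?)",
                 (obj.id, type(obj).__name__, obj.name, salary))


def load_all():
    """Translate a query over objects into SQL and rebuild typed objects."""
    objects = []
    for id_, kind, name, salary in conn.execute(
            "SELECT id, kind, name, salary FROM person ORDER BY id"):
        if kind == "Employee":
            objects.append(Employee(id_, name, salary))
        else:
            objects.append(Person(id_, name))
    return objects


save(Person(1, "Ann"))
save(Employee(2, "Bob", 50000.0))
people = load_all()
```

An alternative mapping would store each class in its own table and join them on `id`. The choice trades null-filled columns against extra joins, which is part of why the mapping can become complicated.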

    Document Management. Much of the information in an enterprise is contained in documents, such as text files, spreadsheets, and slide shows that contain interrelated information relevant to critical business functions: product designs, marketing plans, pricing, and development schedules, for example. To promote collaboration and avoid duplicated work in a large organization, this information needs to be integrated and published. Integration may simply involve making the documents available on a single Web page (such as a portal) or in a content-management system, possibly augmented with per-document annotations (on author and status, for example). Or integration may mean combining information from these documents into a new document, such as a financial analysis.

    Whether or not the documents are collected in one store, they can be indexed to enable keyword search across the enterprise. In some applications, it is useful to extract structured information from documents, such as customer name and address from email messages received by the customer-support team. The ability to extract structured information of this kind may also allow businesses to integrate unstructured documents with preexisting structured data. In the example above, the auto manufacturer wanted to link transactional information about purchases with emails about these purchases in order to enable better analysis of problem reports.
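As a rough illustration of extracting structured fields from an email, the following sketch applies a handful of regular-expression rules using Python's standard `re` module. The email text and the three rules are invented for this example. Production extraction systems use far richer annotators than these patterns.

```python
import re

# An invented email in the spirit of the auto-manufacturer example.
EMAIL_TEXT = """Sent: Oct. 28, 2007
Subject: Noisy brakes

My Galaxy started squealing after only six months.

Sincerely,
John J. Jutt
"""

# Hypothetical extraction rules; each is a regular expression with one group.
RULES = {
    "sent_date": re.compile(r"^Sent:\s*(.+)$", re.MULTILINE),
    "subject":   re.compile(r"^Subject:\s*(.+)$", re.MULTILINE),
    "customer":  re.compile(r"^Sincerely,\s*\n(.+)$", re.MULTILINE),
}


def extract(text):
    """Apply every rule; include only the fields whose pattern matched."""
    record = {}
    for field, pattern in RULES.items():
        match = pattern.search(text)
        if match:
            record[field] = match.group(1).strip()
    return record


record = extract(EMAIL_TEXT)
```

The keys of the resulting record act as the derived schema for the text data, which is what allows it to be matched against the columns of existing relational tables.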

    Portal Management. One way to integrate related information is simply to present it all, side-by-side, on the same screen. A portal is an entire Web site built with this type of integration in mind. For example, the home page of a financial services Web site typically presents market prices, business news, and analyses of recent trends. The person viewing it does the actual integration of the information.

    Portal design requires a mixture of content management (to deal with documents and databases) and user-interaction technology (to present the information in useful and attractive ways). Sometimes these technologies are packaged together into a product for portal design [11]. But often they are selected piecemeal, based on the required functionality of the portal and the taste and experience of the developers who assemble it.

    Core Technologies

    Extensible Markup Language (XML). In any of the scenarios noted here, an integrated view of data from multiple sources must be created. Often any one of the sources will be incomplete with respect to that view, with each source missing some information that the others provide. In our example, the emails are unlikely to provide detailed information about the dealerships, while the relational data might not have the problem reports. In XML, a semi-structured format, each data element is tagged so that only elements whose values are known need to be included. This ability to handle variations in information content is driving EII systems to experiment with XML [22].

    This flexibility makes XML an interesting format for integrating information across systems with differing representations of data. In some integration scenarios, it may not be necessary to define a common schema (data from both sources can be merged into a single self-describing XML document), though in scenarios such as warehousing applications the transformation and fusing of the original data into a well-defined format is required. Still, its flexibility and the ubiquity of free parsers make XML attractive in scenarios with looser requirements, and it is increasingly being used for transferring data between systems and sometimes as a format for storing data as well [3].

    [Figure 2. A schema-mapping tool.]
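Merging two sources into one self-describing XML document, with no common schema defined, can be sketched with Python's standard `xml.etree.ElementTree` module. The element names and values below are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments from two sources with unrelated structures.
email_xml = ("<complaint><customer>John J. Jutt</customer>"
             "<problem>squealing noise</problem></complaint>")
dealer_xml = ("<dealer><name>Lakeside Motors</name>"
              "<size>300</size></dealer>")


def merge(*fragments):
    """Place fragments from different sources under one root element."""
    root = ET.Element("integrated")
    for fragment in fragments:
        root.append(ET.fromstring(fragment))
    return root


doc = merge(email_xml, dealer_xml)

# Tags make the document self-describing: a consumer asks for what it needs
# and gets None for elements that a source did not provide.
problem = doc.findtext("complaint/problem")
revenue = doc.findtext("dealer/annual_revenue")
```

The missing `annual_revenue` element is simply absent rather than stored as an empty column, which is the tolerance for variation in information content described above.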

    Schema Standards. It is easier to integrate data from different sources if they use the same schema. This consistency avoids the need to reformat the data before integrating it, and it also ensures that data from all of the sources have mutual meaning.

    Even if sources do not conform to a common schema, each source may be able to relate its data to a common standard, either industry-wide or enterprise-specific. Thus two sources can be related by composing the two mappings that relate each of them to the standard. This approach only enables integration of information that appears in the standard, and because a standard is often a least common denominator, some information is lost in the composition.

    There are many industry-wide schema standards [18, 28, 29]. Some are oriented toward generic kinds of data, such as geographic information or software-engineering information. Others pertain to particular application domains such as computer-aided design, news stories, and medical billing.
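Relating two sources by composing their mappings to a shared standard, and the loss this entails, can be shown with a toy example. Mappings are reduced here to simple field renamings, which is a strong simplification. All field names and the `compose` helper are hypothetical.

```python
# Each source maps its own field names to a shared standard vocabulary.
# None marks a field for which the standard has no term (names are hypothetical).
source_a_to_std = {"cust_nm": "customerName", "zip": "postalCode", "loyalty": None}
source_b_to_std = {"client": "customerName", "postcode": "postalCode", "tier": None}


def compose(a_to_std, b_to_std):
    """Derive an A -> B field mapping by going through the standard."""
    std_to_b = {std: b for b, std in b_to_std.items() if std is not None}
    return {a: std_to_b[std] for a, std in a_to_std.items() if std in std_to_b}


a_to_b = compose(source_a_to_std, source_b_to_std)

# Fields of source A that cannot be related to source B through the standard.
lost = sorted(set(source_a_to_std) - set(a_to_b))
```

Even if `loyalty` and `tier` describe the same concept, the composition cannot relate them, because the standard offers no term through which to connect them.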

    When the schema standard is abstract and focuses on creating a taxonomy of terms, it is usually called an ontology. Ontologies are often used as controlled vocabularies (for example, in the biomedical domain) rather than as data formats [12, 13].

    Data Cleansing. When the same or related information is described in multiple places (possibly within a single source), often some of the occurrences are inconsistent or just plain wrong, that is, "dirty." They may be dirty because the data, such as inventory and purchase-order information about the same equipment, were independently obtained. Or they may simply have errors such as misspellings, be missing recent changes, or be in a form that is inappropriate for a new use that will be made of it.

    A typical initial step in information integration is to inspect each of the data sources, perhaps with the aid of data-profiling tools, for the purpose of identifying problematic data. Then a data-cleansing tool may be used to transform the data into a common standardized representation. A typical data-cleansing step, for example, might correct misspellings of street names or put all addresses in a common format [10]. Often, data-profiling and -cleansing tools are packaged together as part of an ETL tool set.
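A very small profiling pass gives a feel for what such an inspection looks for. The `profile` function and the sample rows are assumptions made for illustration. Commercial data-profiling tools compute many more statistics than this.

```python
from collections import Counter


def profile(rows, column):
    """Summarize one column: missing values, distinct values, most common values."""
    values = [row.get(column) for row in rows]
    present = [value for value in values if value not in (None, "")]
    counts = Counter(present)
    return {"rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(counts),
            "top": counts.most_common(3)}


# Hypothetical address rows with an empty value, a missing field,
# and two spellings of the same street.
rows = [{"street": "Main St"}, {"street": "Main Street"},
        {"street": ""}, {"street": "Main St"}, {}]
report = profile(rows, "street")
```

A report like this flags both the missing values and the competing spellings ("Main St" and "Main Street") that a later cleansing step would standardize.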

    One important type of data cleansing is entity resolution, or deduplication, which identifies and merges information from multiple sources that refer to the same entity. Mailing lists are a well-known application; we have all received duplicate mail solicitations with different spellings of our names or addresses. On the other hand, sometimes seeming duplicates are perfectly valid, because there really are two different persons with very similar names (John T. Jutt and his son John J. Jutt) living at the same address.

    Many data-cleansing tools exist, based on different approaches or applied at different levels or scales. For individual fields, a common technique is edit-distance; two values are duplicates if changing a small number of characters transforms one value into the other. For records, the values of all fields have to be considered; more sophisticated systems look at groups of records and accumulate evidence over time as new data appears in the system.
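The field-level edit-distance test can be sketched with the classic Levenshtein dynamic program. The threshold of two edits and the `likely_duplicates` helper are arbitrary choices for illustration, not values taken from any particular tool.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def likely_duplicates(x, y, threshold=2):
    """Treat two field values as duplicates when few edits separate them."""
    return edit_distance(x.lower(), y.lower()) <= threshold
```

Note that `likely_duplicates("John T. Jutt", "John J. Jutt")` is true because the names differ by a single character. That is exactly the father-and-son case in which a field-level test reports a duplicate that is not one, and it is why record-level evidence is also needed.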

    Schema Mapping. A fundamental operation for all information-integration systems is identifying how a source database schema relates to the target integrated schema. Schema-mapping tools, which tackle this challenge, typically display three vertical panes (see Figure 2). The left and right panes show the two schemas to be mapped; the center pane is where the designer defines the mapping, usually by drawing lines between the appropriate parts of the schemas and annotating the lines with the required transformations. Some tools offer design assistance in generating these transformations, which are often complex [26, 30].

    Once the mapping is defined, most tools can generate a program to transform data conforming to the source schema into data conforming to the target schema [15]. For an ETL engine, the tool might generate a script in the engine's scripting language. For an EII system, it might generate a query in the query language, such as SQL. For an EAI system, it might transform XML documents from a source-message format to that of the target. For an object-to-relational mapping system, it might generate a view that transforms rows into objects [23].

    Schema Matching. Large schemas have several thousand elements, presenting a major problem for a schema-mapping tool. To map an element of Schema 1 into a plausible match in Schema 2, the designer may have to scroll through dozens of screens. To avoid this tedious process, the tool may offer a schema-matching algorithm [31], which uses heuristic or machine-learning techniques to find plausible matches based on whatever information it has available, for example, name similarity, data-type similarity, structure similarity, an externally supplied thesaurus, or a library of previously matched schemas. The human user must then validate the match.

    Schema-matching algorithms do well at matching individual elements with somewhat similar names, such as Salary_of_Employee in Schema 1 and EmpSal in Schema 2, or when matching predefined synonyms, such as Salary and Wages. Some techniques leverage data values. For example, the algorithm might suggest a match between the element Salary of the source database and Stpnd of the target if they both have values of the same type within a certain numerical range.

    But matching algorithms are ineffective when there are no hints to exploit. They cannot map an element called PW (that is, person's wages) to EmpSal when no data values are available; nor can they readily map combinations of elements, such as Total_Price in Schema 1 and Quantity × Unit_Cost in Schema 2. That is, these algorithms are helpful for avoiding tedious activities but not for solving subtle matching problems.

    Keyword Search. Keyword search is second nature to us all as a way to find information. A search engine accepts a user's keywords as input and returns a rank-ordered list of documents that is generated using a pre-built index and other information, such as anchor text and click-throughs [5]. A less familiar view of search is as a form of integration, for example, when a Web search on a keyword yields an integrated list of pages from multiple Web sites. In more sophisticated scenarios, the documents to be searched reside in multiple repositories such as digital libraries or content stores, where it is not possible to build a single index. In such cases, federated search can be used to explore each store individually and merge the results [24].

    While keyword search does integrate information, it does so loosely. The results are often imprecise, incomplete, or even irrelevant. By contrast, integration of structured data via an ETL tool or a query mediator can create new types of records by correlating and merging information from different sources. The integration request has a precise semantics and the answer normally includes all possible relevant matches from these data sets, assuming that the source data and entity resolution are correct (both of these are big assumptions). Both precise and loose integration techniques have merit for different scenarios. Keyword search may even be used against structured data to get a quick feel for what is available and set the stage for more precise integration.

    Information Extraction. Information extraction [25] is the broad term for a set of techniques that produce structured information from free-form text. Concepts of interest are extracted from document collections by employing a set of annotators, which may either be custom code or specially constructed extraction rules that are interpreted and executed by an information-extraction system. In some scenarios, when sufficient labeled training data is available, machine-learning techniques may also be employed.

    Important tasks include named-entity recognition (to identify people, places, and companies, for example) and relationship extraction (such as customer's phone number or customer's address). When a text fragment is recognized as a concept, that fact can be recorded by surrounding it with XML tags that identify the concept, by adding an entry in an index, or by copying the values into a relational table. The result is better-structured information that can more easily be combined with other information, thus aiding integration.

    Dynamic Web Technologies. When a portal is used to integrate data, it usually needs to be dynamically generated from files and databases that reside on backend servers. The evolution of Web technologies has made such data access easier. Particularly helpful has been the advent of Web services and Really Simple Syndication (RSS) feeds, along with many sites offering their data in XML [6]. Development technology has been evolving too, with rapid improvement of languages, runtime libraries, and graphical development frameworks for dynamic generation of Web pages.

    One popular way to integrate dynamic content is a "mashup," which is a Web page that combines information and Web services. For example, because a service for displaying maps may offer two functions (one to display a map and another to add a glyph that marks a labeled position on the map), it could be used to create a mashup that displays a list of stores and their locations on the map. To reduce the programming effort of creating mashups, frameworks are now emerging that provide a layer of information integration analogous to EII systems, but which are tailored to the new "Web 2.0" environment [2].

    Future Trends

    Today, every step of the information-integration process requires a good deal of manual intervention, which constitutes the main cost. Because integration steps are often complex, some human involvement seems unavoidable. Yet more automation is surely possible, for example, to explain the behavior of mappings, identify anomalous input data, and trace the source of query results [8]. Researchers and product developers continue to explore ways to reduce human effort not only by improving the core technologies mentioned in this article and the integration tools that embody them but also by trying to simplify the process through better integration of the tools themselves.

    Information integration is currently a brittle process; changing the structure of just one data source can force an integration redesign. This problem of schema evolution [32] has received much attention from researchers, but surprisingly few commercial tools that might reduce the cost of integration are available to address the problem. Another cause of brittleness, and another topic of research [9], arises from the complex rules for handling the inconsistencies and incompleteness of different sources. One possible approach, for example, is to offer a tool that suggests minimal changes to source data, thereby eliminating many of the unanticipated inconsistencies.

    Most past work has focused on the problems of information-technology shops, where the goal of integration is usually known at the outset of a project. But some recent work addresses problems in other domains, notably science, engineering, and personal-information management. In these domains, information integration is often an exploratory activity in which a user integrates some information, evaluates the result, and consequently identifies additional information to integrate. In this scenario, called dataspaces [17], finding the right data sources is important, as is automated tracking of how the integrated data was derived, called its provenance [34]. Semantic technologies such as ontologies and logic-based reasoning engines may also help with the integration task [19].

    Information integration is a vibrant field powered not only by engineering innovation but also by evolution of the problem itself. Initially, information integration was stimulated by the needs of enterprises; for the last decade, it has also been driven by the desire to integrate the vast collection of data available on the Web. Recent trends (the continual improvement of Web-based search, the proliferation of hosted applications, cloud storage, Web-based integration services, and open interfaces to Web applications such as social networks, among others) present even more challenges to the field. Information integration will keep large numbers of software engineers and computer-science researchers busy for a long time to come.

    Acknowledgments

    We are grateful to Denise Draper, Alon Halevy, Mauricio Hernández, David Maier, Sergey Melnik, Sriram Raghavan, and the anonymous referees for many suggested improvements.

    References

    1. Alonso, G., Casati, F., Kuno, H.A., and Machiraju, V. Web Services: Concepts, Architectures and Applications. Springer, 2004.
    2. Altinel, M., Brown, P., Cline, S., Kartha, R., Louie, E., Markl, V., Mau, L., Ng, Y-H, Simmen, D.E., and Singh, A. DAMIA: A data mashup fabric for intranet applications. VLDB Conference (2007), 1370–1373.
    3. Babcock, C. XML plays big integration role. InformationWeek (May 24, 2004); www.informationweek.com/story/showArticle.jhtml?articleID=20900153.
    4. Bernstein, P.A. and Melnik, S. Model management 2.0: Manipulating richer mappings. In Proceedings of the ACM SIGMOD Conference, 2007, 1–12.
    5. Brin, S. and Page, L. The anatomy of a large-scale hypertextual Web search engine. Computer Networks 30, 1–7 (1998), 107–117.
    6. Carey, M.J. Data delivery in a service-oriented world: The BEA AquaLogic data services platform. In Proceedings of the ACM SIGMOD Conference (2006), 695–705.
    7. Chaudhuri, S. and Dayal, U. An overview of data warehousing and OLAP technology. ACM SIGMOD Record 26, 1 (1997), 65–74.
    8. Chiticariu, L. and Tan, W.C. Debugging schema mappings with routes. VLDB Conference (2006), 79–90.
    9. Chomicki, J. Consistent query answering: Five easy pieces. In Proceedings of the International Conference on Database Theory (2007), 1–17.
    10. Dasu, T., and Johnson, T. Exploratory Data Mining and Data Cleaning. John Wiley, 2003.
    11. Firestone, J.M. Enterprise Information Portals and Knowledge Management. Butterworth-Heinemann (Elsevier Science, KMCI Press), 2003.
    12. Foundational Model of Anatomy, Structural Informatics Group, University of Washington; http://sig.biostr.washington.edu/projects/fm/
    13. Gene Ontology; http://www.geneontology.org/.
    14. Haas, L.M. Beauty and the beast: The theory and practice of information integration. International Conference on Database Theory (2007), 28–43.
    15. Haas, L.M., Hernández, M.A., Ho, H., Popa, L., and Roth, M. Clio grows up: From research prototype to industrial tool. In Proceedings of the ACM SIGMOD Conference (2005), 805–810.
    16. Halevy, A.Y., Ashish, N., Bitton, D., Carey, M.J., Draper, D., Pollock, J., Rosenthal, A., and Sikka, V. Enterprise information integration: Successes, challenges, and controversies. In Proceedings of the ACM SIGMOD Conference (2005), 778–787.
    17. Halevy, A.Y., Franklin, M.J., and Maier, D. Principles of dataspace systems. ACM Symposium on Principles of Database Systems (2006), 1–9.
    18. Health Level Seven; http://www.hl7.org/.
    19. Hepp, M., De Leenheer, P., de Moor, A., and Sure, Y. (Eds.). Ontology Management: Semantic Web, Semantic Web Services, and Business Applications. Vol. 7 of series Semantic Web and Beyond. Springer, 2008.
    20. IDC. Worldwide Data Integration and Access Software 2008–2012 Forecast. Doc No. 211636 (Apr. 2008).
    21. Kimball, R. and Caserta, J. The Data Warehouse ETL Toolkit. Wiley and Sons, 2004.
    22. Ludäscher, B., Papakonstantinou, Y., and Velikhov, P. Navigation-driven evaluation of virtual mediated views. Extending Database Technology (2000), 150–165.
    23. Melnik, S., Adya, A., and Bernstein, P.A. Compiling mappings to bridge applications and databases. In Proceedings of the ACM SIGMOD Conference (2007), 461–472.
    24. Meng, W., Yu, C., and Liu, K. Building efficient and effective metasearch engines. ACM Computing Surveys 34, 1 (2002), 48–89.
    25. McCallum, A. Information extraction: Distilling structured data from unstructured text. ACM Queue 3, 9 (Nov. 2005).
    26. Miller, R.J., Haas, L.M., and Hernández, M.A. Schema mapping as query discovery. VLDB Conference (2000), 77–88.
    27. Morgenthal, J.P. Enterprise Information Integration: A Pragmatic Approach. Lulu.com, 2005.
    28. OASIS standards; www.oasis-open.org/specs/.
    29. OMG Specifications; www.omg.org/technology/documents/modeling_spec_catalog.htm.
    30. Popa, L., Velegrakis, Y., Miller, R.J., Hernández, M.A., and Fagin, R. Translating Web data. VLDB Conference (2002), 598–609.
    31. Rahm, E. and Bernstein, P.A. A survey of approaches to automatic schema matching. VLDB Journal 10, 4 (2001), 334–350.
    32. Roddick, J.F. and de Vries, D. Reduce, reuse, recycle: Practical approaches to schema integration, evolution, and versioning. Advances in Conceptual Modeling: Theory and Practice, Lecture Notes in Computer Science, 4231. Springer, 2006.
    33. Smith, M. Toward enterprise information integration. Softwaremag.com (Mar. 2007); www.softwaremag.com/L.cfm?Doc=1022-3/2007.
    34. Tan, W-C. Provenance in databases: Past, current, and future. IEEE Data Eng. Bulletin 30, 4 (2007), 3–12.
    35. Wiederhold, G. Mediators in the architecture of future information systems. IEEE Computer 25, 3 (1992), 38–49.
    36. Workshop on Information Integration, October 2006; http://db.cis.upenn.edu/iiworkshop/postworkshop/index.htm.

    Philip A. Bernstein is a principal researcher in the database group of Microsoft Research in Redmond, WA.

    Laura M. Haas is an IBM distinguished engineer and director of computer science at the IBM Almaden Research Center in San Jose, CA.

    © 2008 ACM 0001-0782/08/0900 $5.00