TRENDS & CONTROVERSIES

Information integration

By Marti A. Hearst, University of California, Berkeley
[email protected]

Despite the Web's current disorganized and anarchic state, many AI researchers believe that it will become the world's largest knowledge base. In this installment of Trends and Controversies, we examine a line of research whose final goal is to make disparate data sources work together to better serve users' information needs. This work is known as information integration. In the following essays, our authors talk about its application to datasets made available over the Web.

Alon Levy leads off by discussing the relationship between information integration and traditional database systems. He then enumerates important issues in the field and demonstrates how the Information Manifold project has addressed some of these, including a language for describing the contents of diverse sources and optimizing queries across sources.

Craig Knoblock and Steve Minton describe the Ariadne system. Two of its distinguishing features are its use of wrapper algorithms to extract structured information from semistructured data sources and its use of planning algorithms to determine how to integrate information efficiently and effectively across sources. This system also features a mechanism that determines when to prefetch data, depending on how often the target sources are updated and how fast the databases are.

William Cohen describes an interesting variation on the theme, focusing on "informal" information integration. The idea is that, as in related fields that deal with uncertain and incomplete information, an information-integration system should be allowed to take chances and make mistakes. His Whirl system uses information-retrieval algorithms to find approximate matches between different databases, and as a consequence knits together data from quite diverse sources.

A controversy emerges in the midst of this trend, centering on whether information extraction from HTML-based Web pages will remain an important problem. Proponents of XML (Extensible Markup Language; see www.w3.org/TR/REC-xml.html) argue that in the future, information of any importance will be exchanged between programs using a well-defined protocol, rather than being displayed solely for reading using ad hoc formats in HTML. In his essay, Levy argues that the problem of extracting information from HTML markup will, as a consequence of such protocols, become less important. He notes, however, that the problem of integrating data that differs semantically will still remain. Knoblock and Minton counter that the need for HTML wrappers will remain strong, arguing that there will always be exceptions and legacy pages.

Cohen takes a different stance, suggesting that many information providers want to help inform people, but might not see a direct benefit from the investment required to form a highly structured data source. He suggests that cheap, approximate information integration, such as that enabled by his system, can render these simpler sites more powerful, providing a larger benefit than any individual site developer alone could attain, and getting around the chicken-and-egg problem of who pays to make useful information available free.

On a different note, Haym Hirsh of Rutgers has signed on to help edit Trends & Controversies. To continue providing sharp, cogent debates on topics that span a wide range of intelligent-systems research and applications development, he and I will be alternating installments. For his first outing next issue, Haym has lined up Barbara Hayes-Roth, Janet Murray, and Andrew Stern, who will address interactive fiction.

—Marti Hearst


The Information Manifold approach to data integration

Alon Y. Levy, University of Washington

A data-integration system provides a uniform interface to a multitude of data sources. Consider a data-integration system providing information about movies from data sources on the World Wide Web. There are numerous sources on the Web concerning movies, such as the Internet Movie Database (which provides comprehensive listings of movies, their casts, directors, genres, and so forth), MovieLink (listing playing times of movies in US cities), and several sites that provide textual reviews for selected movies. Suppose we want to find which Woody Allen movies are playing tonight in Seattle and see their respective reviews. None of these data sources in isolation can answer this query. However, by combining data from multiple sources, we can answer queries like this one, and even more complex ones.

To answer our query, we would first query the Internet Movie Database to obtain the list of movies directed by Woody Allen, and then feed the result into the MovieLink database to check which ones are playing in Seattle. Finally, we would find reviews for the relevant movies using any of the movie review sites.

Most importantly, a data-integration system lets users focus on specifying what they want, rather than thinking about how to obtain the answers. As a result, it frees them from the tedious tasks of finding the relevant data sources, interacting with each source in isolation using a particular interface, and combining data from multiple sources.

Traditional database systems

To understand the challenges involved in building data-integration systems, I will briefly compare the problems that arise in this context with those encountered in traditional database systems. In this discussion, I focus mainly on comparisons with relational database systems, but the differences also hold for systems based on other models, such as object-oriented and object-relational ones. Figure 1 illustrates the different stages in processing a query in a data-integration system.

Data modeling. Traditional database systems and data-integration systems differ mainly in the process they use to organize data into an application. In a traditional system, the application designer examines the application's requirements, designs a database schema (such as a set of relation names and the attributes of each relation), and then implements the application, part of which involves actually populating the database (inserting tuples into the tables).

In contrast, a data-integration application begins from a set of pre-existing data sources. These sources might be database systems, but more often are unconventional data sources, such as structured files, legacy systems, or Web sites.


Here the application builder must design a mediated schema on which users will pose queries. The mediated schema is a set of virtual relations, in that they are not actually stored anywhere. The mediated schema is designed manually for a particular data-integration application. For example, in the movie domain, the mediated schema might contain the relation MovieInfo(id, title, genre, country, year, director) describing the different properties of a movie, the relation MovieActor(id, name) representing a movie's cast, and MovieReview(id, review) representing reviews of movies.

Along with the mediated schema, the application designer needs to supply descriptions of the data sources. The descriptions specify the relationship between the relations in the mediated schema and those in the local schemas at the sources. (Even though not all the sources are databases, we model them as having schemas at the conceptual level.) An information-source description specifies the following (a small sketch in code follows the list):

• the source's contents (for example, contains movies),

• the attributes found in the source (genre, cast),

• constraints on the source's contents (contains only American movies),

• the source's completeness and reliability, and finally,

• its query-processing capabilities (can perform selections or can answer arbitrary SQL queries).
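
To make this concrete, the sketch below records such descriptions as data over the mediated schema. This is my own illustration in Python-style structures, not the Information Manifold's actual syntax; the source names and capability flags are invented, and sources 1 and 2 anticipate Figure 2.

    # A sketch of source descriptions as queries over the mediated schema.
    # All names and structures here are illustrative, not the system's own.

    MEDIATED_SCHEMA = {
        "MovieInfo": ["id", "title", "genre", "country", "year", "director"],
        "MovieActor": ["id", "name"],
        "MovieReview": ["id", "review"],
    }

    SOURCE_DESCRIPTIONS = {
        # Source 1: American comedies produced in or after 1965.
        "source1": {
            "relations": ["MovieInfo"],
            "attributes": ["title", "year", "director"],
            "constraints": {"genre": "COMEDY", "country": "USA",
                            "year": (">=", 1965)},
            "capabilities": {"selections": True, "arbitrary_sql": False},
        },
        # Source 2: stores the result of joining MovieInfo and MovieReview.
        "source2": {
            "relations": ["MovieInfo", "MovieReview"],  # joined on id
            "attributes": ["title", "genre", "review"],
            "constraints": {},
            "capabilities": {"selections": True, "arbitrary_sql": False},
        },
    }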

Because the data sources are preexisting, data in the sources might be overlapping and even contradictory. Furthermore, we might face the following problems:

• Semantic mismatches between sources. Because each data source has been designed by a different organization for different purposes, the data is modeled in different ways. For example, one source might be a relational database that stores all of a particular movie's attributes in one table, while another source might spread the attributes across several relations. Furthermore, the names of the attributes and tables will differ from one source to another, as will the choice of what should be a table and what should be an attribute.

• Different naming conventions. Sources use different names to refer to the same object. For example, the same person might be called "John Smith" in one source and "J.M. Smith" in another.

Query optimization and execution. A traditional relational-database system accepts a declarative query in SQL. The system first parses the query before passing it to the query optimizer. The optimizer produces an efficient query-execution plan for the query, which is an imperative program that specifies exactly how to evaluate the query. In particular, the plan specifies the order for performing the query's operations (join, selection, and projection), the method for implementing each operation (such as sort-merge join or hash join), and the scheduling of the different operators (where parallelism is possible). Typically, the optimizer selects a query-execution plan by searching a space of possible plans and comparing their estimated costs. To evaluate a query-execution plan's cost, the optimizer relies on extensive statistics about the underlying data, such as the sizes of relations, sizes of domains, and selectivity of predicates. Finally, the query-execution plan passes to the query-execution engine, which evaluates the query.

The traditional database and the data-integration contexts differ primarily in that the optimizer has little information about the data, because the data resides in remote autonomous sources rather than locally. Furthermore, because the data sources are not necessarily database systems, the sources appear to have different processing capabilities. For example, one data source might be a Web interface to a legacy information system, while another might be a program that scans data stored in a structured file (such as bibliography entries). Hence, the query optimizer must consider the possibility of exploiting a data source's query-processing capabilities. Query optimizers in distributed database systems also consider where parts of the query are executed, but in that context the different processors have identical capabilities. Finally, because data must be transferred over a network, the query optimizer and the execution engine must be able to adapt to data-transfer delays.

Alon Y. Levy is a faculty member at the University of Washington's Computer Science and Engineering Department. His research interests are Web-site management, data integration, query optimization, management of semistructured data, description logics and their relationship to database query languages, abstractions and approximations of computational theories, and relevance reasoning. He received his PhD in computer science from Stanford University and his undergraduate degree at Hebrew University. Contact him at the Dept. of Computer Science and Engineering, Sieg Hall, Room 310, Univ. of Washington, Seattle, WA 98195; [email protected]; http://www.cs.washington.edu/homes/alon/.

Craig Knoblock is a project leader and senior research scientist at the Information Sciences Institute, a research assistant professor in the Computer Science Department, and a key investigator in the Integrated Media Systems Center at the University of Southern California. His research interests include information gathering and integration, automated planning, machine learning, knowledge discovery, and knowledge representation. He received his BS from Syracuse University and his MS and PhD from Carnegie Mellon University, all in computer science. Contact him at USC/ISI, 4676 Admiralty Way, Marina del Rey, CA 90292; [email protected]; http://www.isi.edu/~knoblock.

Steve Minton is a senior computer scientist at the Information Sciences Institute and a research associate professor in the Computer Science Department at the University of Southern California. His research interests are in machine learning, planning, scheduling, constraint-based reasoning, and program synthesis. He received his BA in psychology from Yale University and his PhD in computer science from Carnegie Mellon University. He founded the Journal of Artificial Intelligence Research and served as its first executive editor. He was recently elected a fellow of the AAAI. Contact him at USC/ISI, 4676 Admiralty Way, Marina del Rey, CA 90292; [email protected]; http://www.isi.edu/sims/minton/homepage.html.

William Cohen is a principal research staff member in the Machine Learning and Information Retrieval Research department at AT&T Labs-Research. In addition to information integration, his research interests include machine learning, text categorization, learning from large datasets, computational learning theory, and inductive logic programming. He received a bachelor's degree from Duke and a PhD from Rutgers, both in computer science. Contact him at AT&T Labs-Research, 180 Park Avenue, Florham Park, NJ 07932-0971; [email protected]; http://www.research.att.com/~wcohen/.

Query reformulation. A data-integration system user poses queries in terms of the mediated schema, rather than directly in the schema where the data resides. As a consequence, a data-integration system must first reformulate a user query into a query that refers directly to the schemas in the sources. Such a reformulation step does not exist in traditional database systems. To perform the reformulation step, the data-integration system uses the source descriptions.
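
Continuing the sketch above, a deliberately naive reformulation step might select the sources whose descriptions cover a query's relation without contradicting its constraints. This is my illustration only; the Information Manifold's algorithms reason about far richer descriptions and give formal guarantees that this toy version does not.

    def relevant_sources(query, descriptions):
        """query: {"relation": ..., "constraints": {attr: value}}."""
        matches = []
        for name, desc in descriptions.items():
            if query["relation"] not in desc["relations"]:
                continue
            conflict = False
            for attr, value in query["constraints"].items():
                declared = desc["constraints"].get(attr)
                # Only equality constraints are checked; range constraints
                # such as (">=", 1965) would need real constraint reasoning.
                if declared is not None and not isinstance(declared, tuple) \
                        and declared != value:
                    conflict = True
            if not conflict:
                matches.append(name)
        return matches

    # A query about dramas is never sent to source1, which is described
    # as containing only comedies.
    print(relevant_sources(
        {"relation": "MovieInfo", "constraints": {"genre": "DRAMA"}},
        SOURCE_DESCRIPTIONS))  # ['source2']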

Wrappers. Unlike a traditional query-execution engine that communicates with the storage manager to fetch the data, a data-integration system's query-execution plan must obtain data from remote sources. To do so, the execution engine communicates with a set of wrappers. A wrapper is a program that is specific to each data source and that translates the source's data to a form that the system's query processor can further process. For example, the wrapper might extract a set of tuples from an HTML file and perform translations in the data's format.
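
As a toy illustration of the extraction half of a wrapper's job (the HTML layout, pattern, and function names here are invented), consider pulling (title, year) tuples out of a page that lists each movie in a table row:

    import re

    # A toy wrapper: extract (title, year) tuples from rows of a
    # hypothetical HTML listing. Real wrappers cope with far messier markup.
    ROW = re.compile(r"<tr><td>(?P<title>[^<]+)</td><td>(?P<year>\d{4})</td></tr>")

    def extract_movies(html):
        return [(m["title"], int(m["year"])) for m in ROW.finditer(html)]

    page = "<table><tr><td>Annie Hall</td><td>1977</td></tr></table>"
    print(extract_movies(page))  # [('Annie Hall', 1977)]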

Semistructured data. The term semistructured data has been used with various meanings to refer to characteristics of data present in a data-integration system. To understand the importance of semistructured data, we distinguish between a lack of structure at the physical level and one at the logical level. With lack of structure at the physical level, structured data (for example, tuples) is embedded in a file containing additional markup information, such as an HTML file. Extracting the actual values from the HTML file can be a very complex task, and is one that the source's wrapper performs.

Most work on semistructured data concerns lack of structure at the logical level. In this context, semistructured data refers to cases in which the data does not necessarily fit into a rigidly predefined schema, as required in traditional database systems. This might arise because the data is very irregular and hence can be described only by a relatively large schema. In other cases, the schema might be rapidly evolving, or not even declared at all; it might be implicit in the data. The database community has developed several methods to model and query semistructured data and is currently considering the issues of query optimization and storage for such data.1,2 Building data-integration systems based on a semistructured data model has two main advantages:

• In many cases, the data in the sources is indeed semistructured at the logical level.

• The models developed for semistructured data can cleanly integrate data coming from multiple data models, such as relational, object-oriented, and Web-data models.

Classes of data-integration applications. The two main classes of data-integration applications are integration of data sources on the Web and integration within a single company or enterprise. In the latter case, the sources are not as autonomous as they are on the Web, but the requirements on a data-integration system might be more stringent.

The Information Manifold project

In this project, we wanted to develop a system that would flexibly integrate many data sources with closely related content and that would answer queries efficiently by accessing only the sources relevant to the query. The remainder of this essay describes its main contributions.3–6

The AI and DB approach. We based our approach in designing the Information Manifold on the observation that the data-integration problem lies at the intersection of database systems and artificial intelligence. Hence, we searched for solutions that combine and extend techniques from both fields. For example, we developed a representation language and a language for describing data sources that was simple from the knowledge-representation perspective, but that added the necessary flexibility relative to previous techniques developed in the database community.

Source description language. The Information Manifold is most importantly a flexible mechanism for describing data sources. This mechanism lets users describe complex constraints on a data source's contents, thereby letting them distinguish between sources with closely related data. Also, this mechanism makes it easy to add or delete data sources from the system without changing the descriptions of other sources. Informally, the contents of a data source are described by a query over the mediated schema. For example, we might describe a data source as containing American movies that are all comedies and were produced after 1965 (source 1 in Figure 2). As another example, we can describe sources whose schemas differ significantly from the mediated schema. For instance, we can describe a source in which a movie's year, genre, actor, and review attributes already appear in one table (source 2 in Figure 2). This source is modeled as containing the result of a join over a set of relations in the mediated schema. For some queries, extracting data from this source might be cheaper than from others if the join computed in the source is indeed required for the query.

The Information Manifold employed an expressive language, Carin, for formulating queries and for representing background knowledge about the relations in the mediated schema. Carin7 combined the expressive power of the datalog database-query language (needed to model relational sources) and description logics, which are knowledge-representation languages designed especially to model the complex hierarchies that frequently arise in data-integration applications.

[Figure 1. Prototypical architecture of a data-integration system. A query posed in the mediated schema is reformulated into a query over the union of exported source schemas, optimized into a distributed query-execution plan, and executed by a query-execution engine that obtains data through wrappers; the mediated schema belongs to the global data model, and the exported source schemas to the local data models.]

Query-answering algorithms. We developed algorithms for answering queries using the information sources. Recall that user queries are posed in terms of the mediated schema. Hence, the main challenge in designing the query-answering algorithms is to reformulate the query such that it refers to the relations in the data sources. Our algorithms were the first to guarantee that only the relevant set of data sources is accessed when answering a query, even in the presence of sources described by complex constraints.

It is interesting to note the difference between our approach to query answering and that employed in the SIMS and Ariadne projects described in Craig Knoblock and Steve Minton's companion essay. In their approach, even though they used a knowledge-representation system for specifying the source descriptions, they used a general-purpose planner to reformulate a user query into a query on the data sources. In contrast, our approach uses the reasoning mechanisms associated with the underlying knowledge-representation system to perform the reformulation. Aside from the natural advantages obtained by treating the representation and the query reformulation within the same framework, our approach can provide better formal guarantees on the results and can benefit immediately from extensions to the underlying knowledge-representation system.

Handling completeness information. In general, sources on the Web are not necessarily complete for the domain they are covering. For example, a computer science bibliography source is unlikely to contain all the references in the field. However, in some cases, we can assert local completeness statements about sources.6 For example, the DB&LP Database (http://www.informatik.uni-trier.de/ley/db/) contains the complete set of papers published in some of the major database conferences. Knowledge of a Web source's completeness can help a data-integration system in several ways. Most importantly, because a negative answer from a complete source is meaningful, the data-integration system can prune access to other sources. In the Information Manifold, we developed a method for representing local-source completeness and an algorithm for exploiting such information in query answering.5
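
The pruning idea fits in a few lines. In the sketch below (my own, with hypothetical is_complete_for and answer methods; the published algorithm5 is more general), a source declared complete for the query's part of the domain settles the query by itself:

    def answer_with_completeness(query, sources):
        """sources: objects with is_complete_for(query) and answer(query);
        both methods are hypothetical."""
        for src in sources:
            if src.is_complete_for(query):
                # Even an empty answer from a complete source is final,
                # so no other source needs to be contacted.
                return set(src.answer(query))
        # No complete source: union the partial answers of all sources.
        results = set()
        for src in sources:
            results |= set(src.answer(query))
        return results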

Using probabilistic information. The Information Manifold pioneered the use of probabilistic reasoning for data integration (representing another example of the combined AI and DB approach to the data-integration problem).6 When numerous data sources are relevant to a given query (such as bibliographic databases available for a topic search), a data-integration system needs to order the access to the data sources. Such an ordering depends on the overlap between the sources and the query and on the coverage of the sources. We developed a probabilistic formalism for specifying and deducing such information and algorithms for ordering the access to data sources given such information.6
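
One simple way to instantiate such an ordering (my illustration; the probabilistic formalism in the Information Manifold6 is richer) is to rank sources by expected new answers per unit of access cost:

    # Sketch: order sources by expected payoff. coverage[s] estimates the
    # probability that a random answer to the query appears in source s;
    # cost[s] is the expected access cost. Both would be deduced from
    # probabilistic source descriptions in a real system.

    def access_order(sources, coverage, cost):
        return sorted(sources, key=lambda s: coverage[s] / cost[s], reverse=True)

    sources = ["bib1", "bib2", "bib3"]
    coverage = {"bib1": 0.7, "bib2": 0.4, "bib3": 0.6}
    cost = {"bib1": 2.0, "bib2": 1.0, "bib3": 3.0}
    print(access_order(sources, coverage, cost))  # ['bib2', 'bib1', 'bib3']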

Exploiting source capabilities. Sources often have different query-processing capabilities. For example, one source might be a full-fledged relational database, while another might be a Web site with a very specific form interface that supports only a limited set of queries and that requires certain inputs be provided to it. The Information Manifold developed several novel algorithms for adapting to differing source capabilities. When possible, to reduce the amount of processing done locally, the system would fully exploit the query-processing capabilities of its data sources.4,9 In addition, we developed a mechanism for describing source capabilities that is a natural extension of our method for describing source contents.9

What we did as a community

In the past few years, each of the groups working on data integration has made significant individual progress (see this magazine's Web page for a list of projects: http://computer.org/intelligent). However, we have also made progress as a community. In particular, we have developed a common set of terms and dimensions along which we can now compare our work more rigorously.12 For example, we can now compare the relative expressive power of our source-description languages. We have a set of properties along which we can compare our query-answering algorithms (such as, do they guarantee accessing only relevant sources or a minimal number of sources?). We can also compare features of our systems (do they assume sources are complete, can they handle local completeness, and can they compare directly between sources?).

We need to take this progress into account as we address the challenges that lie ahead. Our common terminology will enable us (and should force us) to compare systems more rigorously, either theoretically or experimentally. To proceed, we must also develop a set of data-integration benchmarks with which we can experimentally compare our systems.

The immediate future

The data-integration problem is by no means solved. We have made significant progress on the problems relating to modeling data sources and developing methods for combining data from them via a single, integrated view. Many problems remain in that area, the most significant being the problem of name matching across sources. This problem is finally starting to be addressed in a principled manner in the Whirl system.13

In the near future, I believe that the bulk of the work in the field should shift to other, less attended problems, some of which I describe here.

Information presentation. Users are rarely interested in data that is simply and concisely presented. More commonly, the results of users' queries are best seen as entry points into entire webs of data. This observation begs the question of how to build systems that enable us to flexibly design a web of information. In fact, this is exactly the problem we face in designing a richly structured Web site. The key to designing such systems is a declarative representation of a web of information's structure. Based on such a representation, we can easily specify how to restructure the information integrated from multiple sources into a structure that users can navigate. Recently, we developed the Strudel system,14 which is the first to apply these principles in creating Web sites.

Figure 2. Data-source descriptions.

Source 1:
select title, year, director
from MOVIEINFO
where genre = COMEDY
  and year ≥ 1965
  and country = USA

Source 2:
select title, genre, review
from MOVIEINFO M, MOVIEREVIEW R
where M.id = R.id


Optimization and execution. Modern database systems succeed largely based on the careful design of their query-optimization and query-execution engines. Recall that the query optimizer is the module in charge of transforming a declarative query (given, for example, in SQL) into a query-execution plan, thereby making decisions on the order of joins and the specific methods for implementing each operation in the plan. The query-execution engine actually evaluates the plan. Given that data integration is a more general form of the problem addressed in database systems, data-integration systems will succeed only if we carefully consider the design of these components.

Two factors complicate the problems of query optimization and execution in the context of data integration: lack of extensive statistics on the data we are accessing (unlike with relational databases) and unpredictable arrival rates of data from the sources at runtime. Here, too, a combination of techniques from AI and database systems is likely to provide interesting solutions. In particular, in this context, the need for interleaving query optimization and query execution is much more significant. The idea of interleaving planning and execution has been considered in the AI planning literature in recent years.15 In contrast, current database systems perform complete query optimization before beginning the execution. The issues of query optimization and execution are the focus of the Tukwila project underway at the University of Washington.

Obtaining source descriptions. Current systems are very good at using descriptions of the sources for answering queries. However, source descriptions must still be given manually. Specifically, the problem is to obtain the semantic mapping between the content of the source and the relations in the mediated schema. If data-integration systems are really going to scale up to large numbers of sources, we must develop automatic methods for obtaining source descriptions, possibly by employing techniques from machine learning.

A nonproblem. Many data-integration efforts have focused on the problem of extracting data from HTML pages, that is, extracting tuples from documents in which data is semistructured at the physical level. This problem will become significantly less important, given the emergence of standards such as XML and languages that will facilitate querying XML documents.15 Web sites that serve significant amounts of data are usually developed using some tool for serving database contents. Using such tools will make it easier to serve the data in XML form, rather than directly in HTML. Hence, Web sites will be able to export data in XML with no added burden to the information providers. Of course, some Web sites that do not want their data to be used for integration purposes might still serve only HTML pages, but trying to integrate data from such sources is probably a futile effort at best.

However, while the availability of data in XML format will reduce the emphasis on wrappers converting human-readable data to machine-readable data, the challenges of semantic integration I've mentioned and the need to manage data that is semistructured at the logical level remain. Furthermore, the machine-learning algorithms developed for extracting data from HTML pages might prove useful for the problem of obtaining semantic mappings.

Farther down the road

Once we can build stand-alone, robust data-integration systems, we will face the challenge of embedding such systems in more general environments. I illustrate this challenge with two examples. The first concerns extending the interaction with a data-integration system beyond simple query answering. In particular, we should be able to use the data to automate some of the tasks we routinely perform with it. For example, the system should be able to use the data for everyday information-management tasks, such as managing our personal information (our schedules, for example) and document workflow in organizations, or for alerting different users to important events.

The second example concerns the expectations we have of the data-integration system. It is unlikely that we will be able to answer all user queries fully automatically, because there will always remain sources for which we have only partial models or sources whose structure (such as natural-language text) does not enable us to reliably extract the data. Hence, we must develop an environment in which the system cooperates with the user to obtain the answer to the query. Where possible, the system will answer the query completely; in other cases, the system will guide the user to the desirable answers.

References

1. P. Buneman, "Semistructured Data," Proc. ACM Sigact-Sigmod-Sigart Symp. Principles of Database Systems (PODS), ACM Press, New York, 1997, pp. 117–121.

2. S. Abiteboul, "Querying Semi-Structured Data," Proc. Int'l Conf. Database Theory (ICDT), 1997.

3. A.Y. Levy, A. Rajaraman, and J.J. Ordille, "Query-Answering Algorithms for Information Agents," Proc. 13th Nat'l Conf. AI, AAAI Press, Menlo Park, Calif., 1996, pp. 40–47.

4. A.Y. Levy, A. Rajaraman, and J.J. Ordille, "Querying Heterogeneous Information Sources Using Source Descriptions," Proc. 22nd Int'l Conf. Very Large Databases (VLDB-96), Morgan Kaufmann, San Francisco, 1996, pp. 251–262.

5. A.Y. Levy, "Obtaining Complete Answers from Incomplete Databases," Proc. 22nd Int'l Conf. Very Large Databases (VLDB-96), Morgan Kaufmann, 1996.

6. D. Florescu, D. Koller, and A.Y. Levy, "Using Probabilistic Information in Data Integration," Proc. 23rd Int'l Conf. Very Large Databases, Morgan Kaufmann, 1997, pp. 216–225.

7. A.Y. Levy and M.-C. Rousset, "CARIN: A Representation Language Integrating Rules and Description Logics," Proc. Int'l Description Logics Workshop, 1995.

8. O. Etzioni and D. Weld, "A Softbot-Based Interface to the Internet," Comm. ACM, Vol. 37, No. 7, 1994, pp. 72–76.

9. A.Y. Levy, A. Rajaraman, and J.D. Ullman, "Answering Queries Using Limited External Processors," Proc. ACM Sigact-Sigmod-Sigart Symp. Principles of Database Systems (PODS), ACM Press, 1996.

10. H. Garcia-Molina et al., "The TSIMMIS Approach to Mediation: Data Models and Languages (Extended Abstract)," Next Generation Information Technologies and Systems (NGITS-95), 1995.

11. O. Etzioni and D. Weld, "A Softbot-Based Interface to the Internet," Comm. ACM, Vol. 37, No. 7, 1994, pp. 72–76.

12. D. Florescu, A. Levy, and A. Mendelzon, "Database Techniques for the World-Wide Web: A Survey," Proc. ACM Sigmod-98, ACM Press, 1998.

13. W.W. Cohen, "Integration of Heterogeneous Databases without Common Domains Using Queries Based on Textual Similarity," Proc. ACM Sigmod-98, ACM Press, 1998, pp. 201–212.

14. M. Fernandez et al., "Catching the Boat with Strudel: Experiences with a Web-Site Management System," Proc. ACM Sigmod-98, ACM Press, 1998.

15. J. Ambros-Ingerson and S. Steel, "Integrating Planning, Execution, and Monitoring," Proc. Seventh Nat'l Conf. AI (AAAI-88), AAAI Press, Menlo Park, Calif., 1988, pp. 83–88.


The Ariadne approach to Web-based information integration

Craig A. Knoblock and Steven Minton, University of Southern California

The rise of hyperlinked networks has made a wealth of data readily available. However, the Web's browsing paradigm does not strongly support retrieving and integrating data from multiple sites. Today, the only way to integrate the huge amount of available data is to build specialized applications, which are time-consuming and costly to build and difficult to maintain. Mediator technology offers a solution to this dilemma. Information mediators,1–4 such as the SIMS system,5 provide an intermediate layer between information sources and users. Queries to a mediator are in a uniform language, independent of such factors as the distribution of information over sources, the source query languages, and the location of sources. The mediator determines which data sources to use, how to obtain the desired information, how and where to temporarily store and manipulate data, and how to efficiently retrieve information from the sources.

One of the most important ideas underlying information mediation in many systems, including SIMS, is that for each application there is a unifying domain model that provides a single ontology for the application. The domain model ties together the individual source models, which each describe the contents of a single information source. Given a query in terms of the domain model, the system dynamically selects an appropriate set of sources and then generates a plan to efficiently produce the requested data.

Information mediators were originally developed for integrating information in databases. Applying the mediator framework to the Web environment solves the difficult problem of gaining access to real-world data sources. The Web provides the underlying communication layer that makes it easy to set up a mediator system, because it is typically much easier to get access to Web data sources than to the underlying database systems. In addition, the Web environment means that users who want to build their own mediator application need no expertise in installing, maintaining, and accessing databases.

We have developed a Web-based version of the SIMS mediator architecture, called Ariadne.6 In Greek mythology, Ariadne was the daughter of Minos and Pasiphae who gave Theseus the thread with which to find his way out of the Minotaur's labyrinth. The Ariadne project's goal is to make it simple for users to create their own specialized Web-based mediators. We are developing the technology for rapidly constructing mediators to extract, query, and integrate data from Web sources. The system includes tools for constructing wrappers that make it possible to query Web sources as if they were databases, as well as the mediator technology required to dynamically and efficiently answer queries using these sources.

A simple example illustrates how Ariadne can be used to provide access to Web-based sources (also see the "Ariadne" sidebar). Numerous sites provide reviews of restaurants, such as Zagats, Fodors, and CuisineNet, but none are comprehensive, and checking each site can be time consuming. In addition, information from other Web sources can be useful in selecting a restaurant. For example, the LA County Health Department publishes the health rating of all restaurants in the county, and many sources provide maps showing the location of restaurants. Using Ariadne, we can integrate these sources relatively easily to create an application where people could search these sources to create a map showing the restaurants that meet their requirements.

With such an application, a user could pose requests that would generate a map listing all the seafood restaurants in Santa Monica that have an "A" health rating and whose typical meal costs less than $30. The resulting map would let the user click on the individual restaurants to see the restaurant critics' reviews. (In practice, we do not support natural language, so queries are either expressed in a structured query language or are entered through a Web-based graphical user interface.) The integration process that Ariadne facilitates can be complex. For example, actually placing a restaurant on a map requires the restaurant's latitude and longitude, which are not usually listed in a review site but can be determined by running an online geocoder, such as Etak, which takes a street address and returns the coordinates.

Figure 3 outlines our general framework. We assume that a user building an application has identified a set of semistructured Web sources he or she wants to integrate. These might be publicly available sources as well as a user's personal sources. For each source, the developer uses Ariadne to generate a wrapper for extracting information from that source. The source is then linked into a global, unified domain model. Once the mediator is constructed, users can query the mediator as if the sources were all in a single database. Ariadne will efficiently retrieve the requested information, hiding the planning and retrieval process details from the user.

Research challenges in Web-based integration

Web sources differ from databases in many significant ways, so we could not simply apply the existing SIMS system to integrate Web-based sources. Here we'll describe the problems that arise in the Web environment and how we addressed these problems in Ariadne.

Converting semistructured data into structured data. Web sources are not databases, but to integrate sources we must be able to query the sources as if they were. This is done using a wrapper, which is a piece of software that interprets a request (expressed in SQL or some other structured language) against a Web source and returns a structured reply (such as a set of tuples). Wrappers let the mediator both locate the Web pages that contain the desired information and extract the specific data off a page. The huge number of evolving Web sources makes manual construction of wrappers expensive, so we need tools for rapidly building and maintaining wrappers.

For this, we have developed the Stalker inductive-learning system,7 which learns a set of extraction rules for pulling information off a page. The user trains the system by marking up example pages to show the system what information it should extract from each page. Stalker can learn rules from a relatively small number of examples by exploiting the fact that there are typically "landmarks" on a page that help users visually locate information.
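
To give a feel for landmark-based extraction (this is a hand-written simplification, not Stalker's learned rule language; the page content is invented), a rule can skip past a sequence of landmark strings and then capture the text up to a terminating landmark:

    def extract(page, skip_to, end):
        """Skip past each landmark in skip_to, then capture up to end."""
        pos = 0
        for landmark in skip_to:
            pos = page.index(landmark, pos) + len(landmark)
        return page[pos:page.index(end, pos)].strip()

    page = "<b>Name:</b> Art's Deli <b>Phone:</b> (818) 555-1212 <br>"
    print(extract(page, ["<b>Phone:</b>"], "<br>"))  # (818) 555-1212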

Consider our restaurant mediator example. To extract data from the Zagats restaurant review site, a user would need to build two wrappers. The first lets the system extract the information from an index page, which lists all of the restaurants and contains the URLs to the restaurant review pages. The second wrapper extracts the detailed data about each restaurant, including the address, phone number, review, rating, and price. With these wrappers, the mediator can answer queries to Zagats, such as "find the price and review of Spago" or "give me the list of all restaurants that are reviewed in Zagats."

In his companion essay on the Information Manifold, Alon Levy claims that the problem of wrapping semistructured sources will soon be irrelevant because XML will eliminate the need for wrapper-construction tools. We believe that he is being overly optimistic about the degree to which XML will solve the wrapping problem. XML clearly is coming; it will significantly simplify the problem and might even eliminate the need for building wrappers for many Web sources. However, the problem of querying semistructured data will not disappear, for several reasons:

• There will always be applications where the providers of the data do not want to actively share their data with anyone who can access their Web page.

• Just as there are legacy Cobol programs, there will be legacy Web applications for many years to come.

• Within individual domains, XML will greatly simplify access to sources; however, across domains, people are unlikely to agree on the granularity at which information should be modeled. For example, for many applications, the mailing address is the right level of granularity at which to model an address, but if you want to geocode an address, it needs to be divided into street address, city, state, and zip code.

Planning to integrate data in the Web environment. Another problem that arises in the Web environment is that generating efficient plans for processing data is difficult. For one, the number of sources to be integrated could be much larger than in the database environment. Also, Web sources do not provide the same processing capabilities found in a typical database system, such as the ability to perform joins. Finally, unlike relational databases, there might be restrictions on how a source can be accessed; for example, a geocoder might take a street address and return the geographic coordinates, but not take the geographic coordinates and return the street address.

Ariadne breaks down query processing into a preprocessing phase and a query-planning phase. In the first phase, the system determines the possible ways of combining the available sources to answer a query. Because sources might be overlapping (an attribute may be available from several sources) or replicated, the system must determine an appropriate combination of sources that can answer the query. The Ariadne source-selection algorithm8 preprocesses the domain model so that the system can efficiently and dynamically select sources based on the classes and attributes mentioned in the query.

In the second phase, Ariadne generates a plan using a method called Planning-by-Rewriting.9,10 This approach takes an initial, suboptimal plan and attempts to improve it by applying rewriting rules. With query planning, producing an initial, suboptimal plan is straightforward; the difficult part is finding an efficient plan. The rewriting process iteratively improves the initial query plan using a local-search process that can change both the sources used to answer a query and the order of the operations on the data.

In our restaurant-selection example, to answer queries that cover all restaurants, the system would need to integrate data from multiple sources (wrappers) for each restaurant review site and filter the resulting restaurant data based on the search parameters. The mediator would then geocode the addresses to place the data on a map. The plans for performing these operations might involve many steps, with many possible orderings and opportunities to exploit parallelism in minimizing the overall time to obtain the data. Our planning approach provides a tractable way of producing large, high-quality information-integration plans.

Providing fast access to slow Web sources. In exploiting and integrating Web-based information sources, accessing and extracting data from distributed Web sources is also much slower than retrieving information from local databases. Because the amount of data might be huge and the remote sources are frequently being updated, simply warehousing all of the data is not usually a practical option. Instead, we are working on an approach that selectively materializes (stores locally) critical pieces of data that let the mediator efficiently perform the integration task. The materialized data might be portions of the data from an individual source or the result of integrating data from multiple sources.

To decide what information to store locally, we take several factors into account. First, we consider the queries that have been run against a mediator application. This lets the system focus on the portions of the data that will have the greatest impact on the most queries. Next, we consider both the frequency of updates to the sources and the application's requirements for getting the most recent information. For example, in the restaurant application, even though reviews might change daily, providing information that is current within a week is probably satisfactory. But in a finance application, providing the latest stock price would likely be critical. Finally, we consider the sources' organization and structure. For example, the system can only get the latitude and longitude from the geocoder by providing the street address. If the application lets a user request the restaurants located within a region of a map, it could be very expensive to figure out which restaurants are in that region, because the system would need to geocode each restaurant to determine whether it falls within the region. Materializing the restaurant addresses and their corresponding geocodes avoids a costly lookup.

[Figure 3. Architecture for information integration on the Web. Constructing a mediator: an application developer performs source modeling and wrapper construction over Web pages, producing models and wrappers. Using a mediator: an application user poses queries and receives answers through query planning over those models and wrappers.]

Once the system decides to materialize a set of information, the materialized data becomes another information source for the mediator. This meshes well with our mediator framework because the planner dynamically selects the sources and the plans that can most efficiently produce the requested data. In the restaurant example, if the system decides to materialize address and geocode data, it can use the locally stored data to determine which restaurants could possibly fall within a region for a map-based query.
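
The factors just described can be folded into a simple score. The function and weights below are my invention, not Ariadne's actual policy; they only illustrate why restaurant geocodes are good materialization candidates and live stock prices are not:

    def materialization_score(query_hits, updates_per_week,
                              freshness_weeks, fetch_cost):
        # Penalize data that changes often relative to how fresh the
        # application needs it to be; reward frequently queried,
        # expensive-to-fetch data.
        staleness = max(1.0, updates_per_week * freshness_weeks)
        return query_hits * fetch_cost / staleness

    # Restaurant geocodes: queried often, never change, costly to compute.
    print(materialization_score(500, 0, 1, 5.0))     # 2500.0 -> materialize
    # Stock prices: must be fresh and change constantly.
    print(materialization_score(500, 5000, 1, 0.1))  # 0.01 -> fetch live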

Resolving naming inconsistencies across sources. Within a single site, entities (such as people, places, countries, or companies) are usually named consistently. However, across sites, the same entities might be referred to with different names. For example, one restaurant review site might refer to a restaurant as Art's Deli and another site might call it Art's Delicatessen. Or, one site might use California Pizza Kitchen and another site could use the abbreviation CPK. To make sense of data that spans multiple sites, our system must be able to recognize and resolve these differences.

In our approach, we select a primary source for an entity's name and then provide a mapping from that source to each of the other sources that use a different naming scheme. The Ariadne architecture lets us represent the mapping itself as simply another wrapped information source. Specifically, we can create a mapping table, which specifies for each entry in one data source what the equivalent entity is called in another data source. Alternatively, if the mapping is computable, Ariadne can represent the mapping by a mapping function, which is a program that converts one form into another.

We are developing a semiautomated method for building mapping tables and functions by analyzing the underlying data in advance. The basic idea is to use information-retrieval techniques, such as those described in William Cohen's companion essay, to provide an initial mapping,11 and then use additional data in the sources to resolve any remaining ambiguities via statistical learning methods.12 For example, restaurants are best matched up by considering name, street address, and phone number, but not by using a field such as city, because a restaurant in Hollywood could be listed as being in either Hollywood or Los Angeles, and different sites list them differently.
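
A minimal version of that initial similarity-based matching step might score candidate pairs by word overlap across several fields. This sketch is mine (Whirl itself uses TF-IDF-weighted cosine similarity, and Ariadne adds statistical learning on top), and the record values are invented:

    # Sketch: approximate record matching across two sources by Jaccard
    # similarity of word sets over name, street address, and phone fields.

    def jaccard(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    def match_score(rec1, rec2, fields=("name", "street", "phone")):
        return sum(jaccard(rec1[f], rec2[f]) for f in fields) / len(fields)

    r1 = {"name": "Art's Deli", "street": "123 Ventura Blvd",
          "phone": "818 555 1212"}
    r2 = {"name": "Art's Delicatessen", "street": "123 Ventura Blvd",
          "phone": "818 555 1212"}
    print(round(match_score(r1, r2), 2))  # 0.78: high despite name variation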

The future of Web-basedintegration

As more and more data becomes avail-able, users will become increasingly lesssatisfied using existing search engines thatreturn massive quantities of mostly irrele-vant information. Instead, the Web willmove toward more specialized content-

based applications that do more than simplyreturn documents. Information-integrationsystems such as Ariadne will help usersrapidly construct and extend their ownWeb-based applications out of the hugequantity of data available online.

While information integration has made tremendous progress over the last few years,13 many hard problems still must be solved. In particular, two mostly overlooked problems deserve more attention:

• Coming up with the models or source descriptions of the information sources, a time-consuming and difficult task that today is largely performed by hand.

• Automatically locating and integrating new sources of data, which would be enabled by solutions to the first problem. (This problem has been addressed in limited domains, such as Internet shopping,14 but it remains largely unexplored.)

For more information on the Ariadne project and example applications built using Ariadne, see the Ariadne homepage at http://www.isi.edu/ariadne.

[Sidebar] Ariadne's Restaurant Location application. This application of Ariadne, shown in the first image, integrates data from a variety of sources, including restaurant review sites, health ratings, geocoders, and maps. In response to a query for all highly rated restaurants in Santa Monica with an 'A' health rating, the mediator finds the restaurants that satisfy the query by extracting the data directly from the relevant Web sites. The mediator also produces a map of the restaurants (second image) by converting the street addresses into latitude and longitude coordinates using an online geocoder. Each point on the map in the second image is clickable. Selecting the point for Chinois on Main returns the detailed restaurant review directly from the appropriate restaurant review site (third image).



References
1. G. Wiederhold, "Mediators in the Architecture of Future Information Systems," Computer, Vol. 25, No. 3, Mar. 1992, pp. 38–49.
2. H. Garcia-Molina et al., "The Tsimmis Approach to Mediation: Data Models and Languages," J. Intelligent Information Systems, 1997.
3. A.Y. Levy, A. Rajaraman, and J.J. Ordille, "Querying Heterogeneous Information Sources Using Source Descriptions," Proc. 22nd Very Large Databases Conf., Morgan Kaufmann, San Francisco, 1996, pp. 251–262.
4. M.R. Genesereth, A.M. Keller, and O.M. Duschka, "Infomaster: An Information Integration System," Proc. ACM Sigmod Int'l Conf. Management of Data, ACM Press, New York, 1997, pp. 539–542.
5. Y. Arens, C.A. Knoblock, and W.-M. Shen, "Query Reformulation for Dynamic Information Integration," J. Intelligent Information Systems, Special Issue on Intelligent Information Integration, Vol. 6, Nos. 1 and 3, 1996, pp. 99–130.
6. C.A. Knoblock et al., "Modeling Web Sources for Information Integration," Proc. 15th Nat'l Conf. Artificial Intelligence, AAAI Press, Menlo Park, Calif., 1998, pp. 211–218.
7. I. Muslea, S. Minton, and C.A. Knoblock, "Stalker: Learning Extraction Rules for Semistructured Web-Based Information Sources," Proc. 1998 Workshop on AI and Information Integration, AAAI Press, 1998, pp. 74–81.
8. J.L. Ambite et al., Compiling Source Descriptions for Efficient and Flexible Information Integration, tech. report, Information Sciences Institute, Univ. of Southern California, Marina del Rey, Calif., 1998.
9. J.L. Ambite and C.A. Knoblock, "Planning by Rewriting: Efficiently Generating High-Quality Plans," Proc. 14th Nat'l Conf. Artificial Intelligence, AAAI Press, 1997, pp. 706–713.
10. J.L. Ambite and C.A. Knoblock, "Flexible and Scalable Query Planning in Distributed and Heterogeneous Environments," Proc. Fourth Int'l Conf. Artificial Intelligence Planning Systems, AAAI Press, 1998, pp. 3–10.
11. W.W. Cohen, "Integration of Heterogeneous Databases without Common Domains Using Queries Based on Textual Similarity," Proc. ACM Sigmod-98, ACM Press, New York, 1998, pp. 201–212.
12. T. Huang and S. Russell, "Object Identification in a Bayesian Context," Proc. 15th Int'l Joint Conf. AI, Morgan Kaufmann, 1997, pp. 1276–1283.
13. Proc. 1998 Workshop on AI and Information Integration, AAAI Press, 1998.
14. R.B. Doorenbos, O. Etzioni, and D.S. Weld, "A Scalable Comparison-Shopping Agent for the World-Wide Web," Proc. First Int'l Conf. Autonomous Agents, AAAI Press, 1997, pp. 39–48.

The Whirl approach to information integration
William W. Cohen, AT&T Labs-Research

Search engines such as AltaVista and portal sites such as Yahoo! help us find useful online information sources. What we need now are systems to help us use this information effectively. Ideally, we would like programs that answer a user's questions based on information obtained from many different online sources. We call such a program an information-integration system, because to answer questions it must integrate the information from the various sources into a single, coherent whole.

For example, consider consumer information about computer games. Many Web sites contain information of this sort. As this essay will show, in addition to the obvious benefit of reducing the number of sites a user must visit, integrating this information has several important and nonobvious advantages.

One advantage is that often, more questions can be answered using the integrated information than using any single source. Consider two sources containing slightly different information: one source categorizes games into children's games and adult games, and another categorizes games into arcade games, puzzle games, and adventure games. In this case, the sources must be integrated to find, say, a list of children's adventure games. Conversely, integration can help exploit overlap among sources; for instance, one might be interested in finding games that three or more sources have rated highly, or in reading several independent reviews of a particular game.

A second and more important advantage of integration is that making it possible to combine information sources also makes it possible to decompose information so as to represent it in a clean, modular way. For example, suppose we wished to create a Web site providing some new sort of information about computer games, say, information about which games work well on older, slower machines. The simplest way of representing this information is extensionally, as a list of games having this property. By itself, however, such a list is not very valuable to end users, who are probably interested in games that not only work on their PC, but also satisfy other properties, such as being inexpensive or well-designed. To make the list more useful, we are tempted to add additional structure; for instance, we might organize the games in the list into categories and provide, for each game, links to online resources, such as pricing information and reviews.

From the standpoint of computer science, augmenting the list of games in this way is clearly a bad idea, because it leads to a structure that lacks modularity. The original structure was a static, easily maintained list of computer games. In the augmented hypertext, this information is intermixed with orthogonal information about game categories, possibly ephemeral information concerning the organization of external Web sites, and possibly incorrect assumptions about the readers' goals. The resulting structure is hard to maintain and hard to modify in certain natural ways, such as by changing the set of categories used to organize the list of games.

To summarize, the simple, modular encoding of this information will be difficult for users to exploit, and the easy-to-use encoding will be difficult to create, modify, and maintain. By contrast, it is trivial to encode this information in a relational database in a manner that is both modular and useful: we simply create a relation listing all old-PC-friendly games, and standard query languages let users find, say, reviews of inexpensive old-PC-friendly arcade games. (This example assumes that information about game prices and reviews is also available in the database.) Relational databases thus provide a more modular encoding of the information.
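As a concrete illustration, here is a minimal sketch using Python's built-in sqlite3 module; the table layout, prices, and review URL are invented for the example.

import sqlite3

# A minimal sketch of the modular relational encoding described above;
# all table contents are invented for this example.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE old_pc_friendly(game TEXT);
    CREATE TABLE price(game TEXT, dollars REAL);
    CREATE TABLE review(game TEXT, url TEXT);
    CREATE TABLE category(game TEXT, category TEXT);
""")
# The new information is just one flat, easily maintained list...
db.executemany("INSERT INTO old_pc_friendly VALUES (?)",
               [("Lemmings Paintball",), ("Web Workshop",)])
db.execute("INSERT INTO price VALUES ('Lemmings Paintball', 19.95)")
db.execute("INSERT INTO review VALUES "
           "('Lemmings Paintball', 'http://example.com/review1')")
db.execute("INSERT INTO category VALUES ('Lemmings Paintball', 'arcade')")
# ...yet a standard query combines it with the orthogonal information:
rows = db.execute("""
    SELECT r.game, r.url FROM old_pc_friendly o
    JOIN price p ON p.game = o.game AND p.dollars < 25
    JOIN category c ON c.game = o.game AND c.category = 'arcade'
    JOIN review r ON r.game = o.game
""").fetchall()
print(rows)  # reviews of inexpensive old-PC-friendly arcade games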

Unfortunately, conventional databases assume information is stored locally, in a consistent format, not externally and in diverse formats, as is the case with information on the Web. Hence they do not solve the problem of organizing information on the Web. To use modular, maintainable representations for information, while still exploiting the power of the Web (its distributed nature, large size, and broad scope), we need practical ways of integrating information from diverse sources.

Why integrating information is hard
Unfortunately, integrating information from multiple sources is very hard. One difficulty is programming a computer to understand the various information sources well enough to answer questions about them. Surprisingly, this is often difficult even when information is presented in


simple, easy-to-parse regular structures such as lists and tables.

As an example, Figure 4 shows a tabular representation of the information in two hypothetical Web sites. Consider the knowledge an integration system would need to answer the following question using these information sources:

Who publishes "Escape from Dimension Q" and where is their home page?

(In this essay, we assume that questions are given to the information-integration system in a formal language; for readability, however, we'll paraphrase questions in English whenever possible.)

To answer this question, the system must have knowledge of several kinds (a declarative sketch follows the list):

• It must know where to find these tables on the Web, and how they are formatted (access knowledge).

• It must know that each tuple ⟨x,y⟩ in the table Website-1 should be interpreted as the statement "the company y publishes the game x," and that each tuple ⟨t,u⟩ in the table Website-2 should be interpreted as the statement "the home page for the company t is found at the URL u" (semantic knowledge).

• Finally, it must know that the string "Headbone" in Website-2 refers to the same company as the string "Headbone Interactive" in Website-1 (object-identity knowledge).
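As a rough illustration, the three kinds of knowledge can be written down declaratively; this Python sketch uses invented URLs and paraphrased semantics, and is a notation for the discussion rather than any system's actual format.

# Access knowledge: where each table lives and how it is formatted.
ACCESS = {
    "Website-1": {"url": "http://example.com/games.html",
                  "format": "two-column HTML table"},
    "Website-2": {"url": "http://example.com/publishers.html",
                  "format": "two-column HTML table"},
}
# Semantic knowledge: what a tuple in each table asserts.
SEMANTIC = {
    "Website-1": "tuple <x, y> means: company y publishes game x",
    "Website-2": "tuple <t, u> means: home page of company t is URL u",
}
# Object-identity knowledge: which strings name the same entity.
OBJECT_IDENTITY = {("Headbone", "Headbone Interactive")}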

Even given all this knowledge, many interesting technical problems remain; however, the technical difficulties involved in using these types of knowledge pale beside the practical difficulties of acquiring the knowledge. Currently, all this knowledge must be manually provided to the integration system and updated whenever the original information sources change. Performing information integration is thus extremely knowledge-intensive and hence expensive in terms of human time.

Of course, many of these problems can be "assumed away": integrating information sources is not nearly so difficult if they use common object identifiers, adopt a common data format, and use a known ontology. Unfortunately, few existing information sources satisfy these assumptions. The vast majority of existing online sources are designed to communicate only with human readers, not with other programs. We believe that this will continue to hold true, simply because presenting information to a human audience is less demanding for the information provider: information intended for a human audience need not conform to some externally set formal standard; it only has to be comprehensible to a reader.

The Whirl approach to information integration

We have written a system for information integration called Whirl. The approach to integration embodied in Whirl is based on two premises:

• It is unreasonable to assume that all the knowledge needed for information integration will be present, and in any case it is impractical to encode this information explicitly. Consequently, inferences made by an integration system are inherently incomplete and uncertain. As in machine learning, speech recognition, and information retrieval, the integration system will have to take some chances and make some mistakes. An integration system thus must have ways of reasoning with uncertain information, and of communicating to the user its confidence in an answer.

• Information integration should exploit the existing human-oriented interface to information sources as much as possible. It should, whenever possible, understand information using general techniques, analogous to the ones people use, rather than relying on externally provided, problem-specific knowledge. For instance, people have no difficulty recognizing the structures in Figure 4 as two-column tables; thus a good information-integration system should also be able to recognize such structures. (Although general table-recognition methods exist,2 to our knowledge no existing Web-based integration system uses them.) Similarly, most people would judge it likely that "Headbone" and "Headbone Interactive" denote the same company (or closely related ones), but would consider it unlikely that "Disney Interactive" and "Microsoft" do; an integration system should be able to make a similar judgment, even without knowledge of the domain.

Of the many mechanisms required by such an integration system, we have chosen to concentrate (initially) on general methods for integrating information without object-identity knowledge. In most integration tasks, far more object-identity knowledge is needed than any other type of knowledge; while semantic knowledge and access knowledge might be needed for each source, a system potentially needs object-identity knowledge about each pair of constants in the integrated database.

Our approach for dealing with uncertain object identities relies on the observation that information sources tend to use textually similar names for the same real-world objects. This is particularly true when sources are presenting information to people of similar background in similar contexts. To exploit this, the Whirl query language allows users to formulate SQL-like queries about the similarity of names. Consider again the tables of Figure 4. Assuming that the table Website-1 is encoded as a relation with schema game(name, pubName), and Website-2 is encoded as a relation with schema publisher(name, homepage), the question

Figure 4. Two typical information sources: (a) Web site 1 and (b) Web site 2.

(a) Web site 1:

GAME TITLE                      PUBLISHER
Aladdin Activity Center         Disney Interactive
Arthur's Computer Adventure     Living Books/Broderbund
Escape from Dimension Q         Headbone Interactive
How the Leopard Got His Spots   Microsoft Kids
...

(b) Web site 2:

GAME PUBLISHER   HOME PAGE
Disney           http://www.disneyinteractive.com
Headbone         http://www.headbone.com
Humongous        http://www.humongous.com
Broderbund       http://www.broderbund.com
Microsoft        http://www.microsoft.com
...


Who publishes "Escape from Dimension Q" and where is their home page?

might be encoded as the Whirl query

SELECT publisher.name, publisher.homepage
FROM game, publisher
WHERE (game.pubName ~ publisher.name
       AND game.name ~ "Escape from Dimension Q")

Here ~ is a similarity operator, and thus the query asks Whirl to find a tuple ⟨u,v⟩ from publisher such that for some tuple ⟨x,y⟩ from game, y is textually similar to u, and x is textually similar to the string "Escape from Dimension Q." Such a pair is a plausible answer to the query, although not necessarily a correct one.

This query language is central to our approach, so we will describe it in some detail. The query language has a "soft" semantics; the answer to such a query is not the set of all tuples that satisfy the query, but a list of tuples, each of which is considered a plausible answer to the query and each of which is associated with a numeric score indicating its perceived plausibility.

The universe of possible answers is determined by the FROM part of the query; in the example above, possible answers come from the Cartesian product of the game and publisher relations. Whirl scores answers according to how well they satisfy the conditions in the WHERE part of the query: each similarity condition gets a score between zero and one, each Boolean condition receives a score of either zero or one, and Whirl combines these primitive scores as if they were independent probabilities to obtain a score for the entire WHERE clause. For the query above, the score of a tuple ⟨u,v,x,y⟩ is the product of the similarity of y to u and the similarity of x to "Escape from Dimension Q."

In typical use, Whirl returns only the K highest-scoring answers, where K is a parameter set by the user. From the user's perspective, interacting with the system is thus much like interacting with a search engine: the user requests the first K answers, examines them, and then requests more if necessary.

Semantically, then, the Whirl query language is quite simple, but as in many enterprises, "the devil is in the details." To make this idea work well, we needed to adopt ideas from several research communities. Our system computes the similarity of two names using cosine distance in the vector space model, a metric widely used in statistical information retrieval.1 Roughly speaking, two names are similar according to this metric if they share terms, where a term is a word stem, and names are considered more similar if they share more terms, or if the shared terms are rare. As an example, "Disney Interactive" and "Disney" would be more similar than "Disney Interactive" and "Headbone Interactive," because "Interactive" is a more common term than "Disney." These similarity metrics are not well understood formally, but are well supported experimentally.

Whirl also builds on ideas from artificial intelligence. To find the best K answers to a query, we use a variant of A* search,3,4 coupled with inverted-index techniques developed in the information-retrieval community.5 In combination, these techniques allow Whirl to find the best K answers to a query fairly quickly, even when the universe of possible answers is extremely large.

What Whirl has accomplished
Using a search-engine-like interface (in which possible answers come in a ranked list) lets us evaluate Whirl in the same way that information-retrieval researchers evaluate search engines. In particular, given information about which of Whirl's proposed answers are correct, we can evaluate Whirl using metrics such as recall and precision. We evaluated Whirl on a number of benchmark problems from several different domains, using the measure of noninterpolated average precision. (Roughly speaking, this averages the best level of precision obtained at each distinct recall level. The highest possible value for this measure is 100%.) We discovered that the off-the-shelf similarity metric we adopted is surprisingly accurate. On 14 of 18 benchmark problems, average precision is 90% or higher; on seven of the 18 problems, average precision is 99% or higher.

Intriguingly, good performance can often come even when the names from one or both sources are embedded in extraneous text. Table 1 presents the first few answers for the query

SELECT demo.name, game.name
FROM demo, game
WHERE demo.name ~ game.name

for a problem in which the names in demo are embedded in arbitrary paragraph-long passages of free text. As the table's last column shows, most top-ranked pairings are correct, and the complete ranking of answers Whirl proposed has a respectable average precision of 67%. Whirl's robustness to extraneous noise words means that we can afford to use approximate methods of extracting data from information sources.


Table 1. Output of a Whirl query pairing paragraphs of free text and names of computer games. The score is the similarity of the last two columns, normalized to a range of 0–100, and the checkmark indicates if the pairing is correct.

SCORE | DEMO.NAME | GAME.NAME
80.26 | Ubi Software has a demo of Amazing Learning Games with Rayman. | Amazing Learning Games with Rayman
78.25 | Interplay has a demo of Mario Teaches Typing. (PC) | Mario Teaches Typing
75.91 | Warner Active has a small interactive demo for Where's Waldo at the Circus and Where's Waldo Exploring Geography. (Mac and Win) | Where's Waldo? Exploring Geography
74.94 | MacPlay has demos of Marios Game Gallery and Mario Teaches Typing. (Mac) | Mario Teaches Typing
71.56 | Interplay has a demo of Mario Teaches Typing. (PC) | Mario Teaches Typing 2
68.54 | MacPlay has demos of Marios Game Gallery and Mario Teaches Typing. (Mac) | Mario Teaches Typing 2
68.45 | Psygnosis has an interactive demo for Lemmings Paintball. (Win95) | Lemmings Paintball
65.70 | ICONOS has a demo of What's The Secret? Volume 1. (Mac and Win) | What's the Secret?
64.33 | Fox Interactive has a fully working demo version of the Simpsons Cartoon Studio. (Win and Mac) | Simpsons Cartoon Studio
62.90 | Gryphon Software has demos of Gryphon Bricks, Colorforms Computer Fun Set (Power Rangers and Sailor Moon), and a FREE Gryphon Bricks Screen Saver. (Mac and Win) | Gryphon Bricks
60.30 | Vividus Software has a free 30 day demo of Web Workshop (Web-authoring package for kids!). (Win 95 and Mac) | Web Workshop
59.96 | Conexus has two shockwave demos: Bubbleoids (from Super Radio Addition with Mike and Spike) and Hopper (from Phonics Adventure with Sing Along Sam). | Super Radio Addition with Mike & Spike




We have also built several nontrivial integrated-information systems using Whirl. The domain of the first is children's computer games. This application integrates information from 16 Web sites. Using the HTML form interface shown in Figure 5, users can construct questions like the following:

Help me find reviews of games that are in the category "art," are recommended by two or more sites, and are designed for children six years old.

The application knows how to find reviews, demos, and vendors of games, and also understands several properties of games, such as which games are popular and who publishes which games.

We have built a similar system that integrates information about North American birds. Collectively, the integrated databases contain about 100,000 tuples, about 10,000 of which point to external Web pages. Both systems are available on the Web (at http://whirl.research.att.com/cdroms/ and http://whirl.research.att.com/birds/). The response time for complex queries is typically less than 10 seconds. (These time measurements are on a lightly loaded Sun Ultra 2170 with 167-MHz processors and 512 Mbytes of main memory. The current server is not multithreaded, so response times vary greatly with load.)

In building these applications, we deliberately sidestepped many of the problems that have historically been research issues in information integration. Whirl data is not semistructured,6 but instead is stored in simple relations. The problem of query planning7 is simplified by collecting all data with spiders offline. We map individual information sources into a global schema using manually constructed views, rather than using more powerful methods.8,9 Access knowledge is represented in hand-coded extraction programs,10 rather than learned by example as proposed by Nicholas Kushmerick and others.11–13 We made these decisions to highlight the advantages of our approach relative to earlier integration methods: by adopting an uncertain approach to integration, significant applications can be built, even without the aid of other advanced techniques.

The two implemented integration applications illustrate several other important points. First, in the games application, Whirl extracts information about the age range for which games are appropriate from a commercial site; this information can then be used to access a collection of reviews taken from several consumer-oriented sites. The age-range information has, in some sense, been made portable; it has been disassociated from the site that provided it and used for a goal different from its intended purpose (of improving access to a large online catalog). Attaining this sort of modularity and portability of information was one of Whirl's primary goals.

Second, integration need not require complex query interfaces to be useful. As well as providing a query-based interface, the bird application allows data about birds to be browsed, either geographically or by scientific order and family. This browsing interface extends the capability of the original sources, which are seldom organized along both of these dimensions. The bird application also includes a quick-search feature, in which the user types in the name of a bird and gets a list of URLs in response. As an example, in response to a quick search for "great blue heron," Whirl's answer includes a picture indexed only as ardea herodias, the scientific name of the great blue heron. This sort of intelligent behavior is enabled by translating the user's quick search into a structured query that exploits an information source giving the scientific nomenclature for birds.

The future of integration
Whirl's current implementation could be extended in many ways. Challenging technical issues include scaling up to larger data sets (the current implementation is memory-based), finding more flexible and more automatic extraction methods, learning to improve scores based on feedback of various kinds, and collecting data efficiently at query time. We will conclude, however, with some more general remarks.


Figure 5. The interface to an information-integration system based on Whirl.


One possible goal for computer science research is the construction of an information system with size and scope comparable to the Web, but with abilities comparable to a knowledge base. In particular, we would like a system that can reason about and understand information that, like information found on the Web, is constructed and maintained in a decentralized fashion. This is a very hard problem and perhaps a very distant goal; however, Whirl represents an important step toward that goal.

Previous systems that access information from multiple sources fall into two main classes. Search engines provide weak and relatively unstructured access to a large number of sites. Previous Web-based information-integration systems provide better access to a small number of highly structured sites. Whirl's emphasis on inexact, uncertain integration provides an intermediate step between these two extremes.

The intermediate step of cheap, approximate information integration is a critical one. It is unreasonable to expect an information provider whose primary audience is people to spend much time and energy in making his or her information programmatically available unless there is a clear and immediate benefit. Unfortunately, while integration does provide a benefit, this benefit does not materialize until a number of sources are integrated. In economic terms, the value of having a well-structured, easily integrated information source is largely external, leading to a classic chicken-and-egg problem. The availability of cheap, approximate integration methods could help to overcome this problem.

Let us close with an analogy. Information integration can be viewed as the problem of getting information sources to talk to each other. Our approach can be viewed as getting information sources to talk to each other in an informal way. We hope that this kind of informal communication will retain much of the utility of formal communication, but be far easier to attain, just as informal essays like this one can, without tedious technical detail, communicate the essence of a new technical result.

References
1. G. Salton, ed., Automatic Text Processing, Addison-Wesley, Reading, Mass., 1989.
2. D. Rus and D. Subramanian, "Customizing Capture and Access," ACM Trans. Information Systems, Vol. 15, No. 1, 1997, pp. 67–101.
3. N. Nilsson, Principles of Artificial Intelligence, Morgan Kaufmann, San Francisco, 1987.
4. J. Pearl, Heuristics: Intelligent Search Strategies for Computer Problem Solving, Addison-Wesley, 1984.
5. W.W. Cohen, "Integration of Heterogeneous Databases without Common Domains Using Queries Based on Textual Similarity," Proc. ACM Sigmod-98, ACM Press, New York, 1998, pp. 201–212.
6. D. Suciu, ed., Proc. Workshop on Management of Semistructured Data; http://www.research.att.com/~suciu/workshop-papers.html.
7. J.L. Ambite and C.A. Knoblock, "Planning by Rewriting: Efficiently Generating High-Quality Plans," Proc. 14th Nat'l Conf. AI, AAAI Press, Menlo Park, Calif., 1997, pp. 706–713.
8. A.Y. Levy, A. Rajaraman, and J.J. Ordille, "Querying Heterogeneous Information Sources Using Source Descriptions," Proc. 22nd Int'l Conf. Very Large Databases (VLDB-96), Morgan Kaufmann, 1996, pp. 251–262.
9. O.M. Duschka and M.R. Genesereth, "Answering Recursive Queries Using Views," Proc. 16th ACM Sigact-Sigmod-Sigart Symp. Principles of Database Systems (PODS-97), ACM Press, 1997, pp. 109–116.
10. W.W. Cohen, "A Web-Based Information System that Reasons with Structured Collections of Text," Proc. Second Int'l Conf. Autonomous Agents, ACM Press, New York, 1998, pp. 400–407.
11. N. Kushmerick, D.S. Weld, and R. Doorenbos, "Wrapper Induction for Information Extraction," Proc. 15th Int'l Joint Conf. AI, Morgan Kaufmann, 1997, pp. 729–735.
12. C.A. Knoblock et al., "Modeling Web Sources for Information Integration," Proc. 15th Nat'l Conf. AI (AAAI-98), AAAI Press, 1998, pp. 211–218.
13. C.-N. Hsu, "Initial Results on Wrapping Semistructured Web Pages with Finite-State Transducers and Contextual Rules," Papers from the 1998 Workshop on AI and Information Integration, AAAI Press, 1998, pp. 66–73.
