
Journal of Intelligent Information Systems, 17, 47–70, 2001. © 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.

Logical Approach to Capability-Based Rewriting in a Mediator for WebSources∗

JOHN GRANT
Towson University, Towson, MD 21252, USA

VLADIMIR ZADOROZHNY
University of Maryland, College Park, MD 20742, USA

Received August 29, 2000; Revised June 6, 2001

Abstract. A logic-based specification for Capability-Based Rewriting (CBR) in web query processing is presented. Within this approach it is shown how Semantic Query Optimization (SQO) can be used to improve the performance of a web query optimizer. Our approach allows for the characterization of different CBR tools and their properties in a uniform and generic manner. It also reveals important optimization opportunities (based on SQO) which are commonly ignored in existing CBR tools.

Keywords: capability-based rewriting, semantic query optimization, web query processing

1. Introduction

The rapid growth of the Internet and Intranets and the emergence of WWW interchange formats for data, e.g., XML (Layman et al.), have increased the opportunity for the exchange of data from WebSources that are accessible over a wide area network via scripts or forms-based APIs. Wrapper/mediator architectures that have been developed for heterogeneous sources have to be tailored to this new environment. A WebWrapper provides access to the WebSource and models each limited query capability of a WebSource as a wrapper call. A wrapper call is characterized by capability and cost. An example of limited query capability is the requirement of having bindings for particular (input) query attributes to provide certain output attributes. Consider the ACM Digital Library WebSource (ACM Digital Library). To search for articles, this WebSource supports a form-based interface that requires the user to enter either terms or authors as input search parameters. If neither is provided, the search results in an "Invalid Search Parameters" message. So the capability restrictions of the ACM Digital Library do not allow us to submit queries like "Display all the articles currently stored in the Digital Library." But it is quite common in relational databases to request all the tuples of a given relation. Therefore, the mediator system should include an extended query processor component to deal with the limited capabilities of WebSources.

∗This research has been partially supported by the Defense Advanced Research Project Agency under grant 01-5-28838 and the National Science Foundation under grant IRI963010.


Capability and cost are of great importance in a WebSource environment, where there may be a choice of WebSources with different wrapper calls. Researchers have previously studied query rewriting using limited capability of sources (Florescu et al., 1999; Levy et al., 1996; Papakonstantinou and Vassalos, 1999; Tomasic et al., 1996; Vassalos and Papakonstantinou, 1998), and estimating the costs for accessing heterogeneous sources (Adali et al., 1996; Du et al., 1992; Haas et al., 1999; Naacke et al., 1998). There has also been some research on considering both capability and costs in query optimization (Haas et al., 1999), where it was shown that both factors can impact the choice of a good plan.

In this paper we address two main issues:

• how to define a generic and system-independent specification for Capability-Based Rewriting (CBR) in web query processing. So far Capability-Based Rewriting has been defined in the context of specific systems and not in a uniform way. We introduce a logical framework to specify CBR and use it to improve the search ability of CBR tools.

• how to use the above logical specification to apply Semantic Query Optimization (SQO) to improve the performance of the capability search in a Web Query Optimizer.

We suggest an approach to the specification of CBR which allows us to characterize different CBR tools and their properties in a uniform and generic manner. This approach also reveals important optimization opportunities, based on SQO, which are commonly ignored in existing CBR tools.

The paper is organized as follows. Section 2 contains the background material on Capability-Based Rewriting and Semantic Query Optimization. In Section 3 we briefly discuss previous research related to our work. Section 4 contains the predicate specification for CBR. In Section 5 we develop a hierarchy of CBR definitions starting from a core set of features. Then we consider a set of extensions introducing features ignored in the core specification. Examples of those features are multiple WebSources, attribute domain restrictions, and ways to map WebSource attributes to mediator relation attributes. In Section 6 we consider the soundness and completeness of an existing CBR tool, developed at the University of Maryland, with respect to our definitions. Section 7 considers how this logical framework allows us to apply SQO to improve the search ability of CBR tools and the performance of the web query optimizer. The paper is summarized in Section 8 where some open research issues are also mentioned.

2. Background

We use the concepts of Capability-Based Rewriting (CBR) and Semantic Query Optimization (SQO) throughout the paper. In this section we introduce the basic concepts of Mediator/Wrapper technology and give some background on logic-based semantic query optimization.

2.1. Capability-Based Rewriting (CBR)

Proper handling of the capability restrictions of WebSources is one of the central problems in the efficient evaluation of queries in a mediator for WebSources (web queries). We consider only sources with limited query capabilities requiring bindings for particular input attributes and providing projected attributes as output. We characterize such a limited query capability as an input-output relationship (IOR). The IORs establish specific access patterns to Web data. Syntactically we represent them in the following form:

ior_i : {InputAttr_1, ..., InputAttr_k} → {OutputAttr_1, ..., OutputAttr_m},

where ior_i is the identifier of the IOR, InputAttr_1, ..., InputAttr_k are the names of the input attributes, and OutputAttr_1, ..., OutputAttr_m are the names of the output attributes.

Any real WebSource may have more than one such access pattern, but we use the terminology WebSource for each such relationship, or access pattern, to emphasize that it is an aspect of an actual WebSource. The context will always determine the meaning of the term WebSource. We also do not deal with the identification of IORs with their actual WebSources. Mediator relations are typically calculated from such WebSources in order to reflect their capabilities. We also assume that mediators are able to process the data returned by the WebSources, perform join operations locally, and pass bindings from one join operand to another using a dependent join operator (Zadorozhny et al., 2000) (illustrated using the following example).

Example 1. Consider the Bureau of Labor Statistics WebSource (http://stats.bls.gov). The mediator schema can be represented in the relational model as given below:

OEW(OESCode,StateName,MeanWage,MedianWage,Employment,MeanAnnualWage)
OES(OESCode,OccupationTitle)

Here OEW and OES stand for Occupational Employment Wages and Occupational Employment Statistics, respectively. OESCode is a unique identifier for each OES occupation. The actual queries that can be processed at this WebSource are limited. We represent the query capabilities as four IORs (WebSources in our terminology) as follows:

OEW(OESCode,MeanWage,StateName,MedianWage,Employment)
    ior1: { } → {OESCode, StateName, MeanWage, MedianWage, Employment}
    ior2: {OESCode} → {StateName, MeanWage, MedianWage, Employment}
    ior3: {OESCode, StateName} → {MeanWage, MedianWage, Employment}

OES(OESCode,OccupationTitle)
    ior4: { } → {OESCode, OccupationTitle}

We note that ior1 subsumes ior2 and ior2 subsumes ior3, i.e., any data obtained by the use of ior2 would also be obtained by ior1. The web query optimizer decides which of the above capabilities should be used under various conditions. For example, if the OESCode is known, sometimes it is better to use ior2 because it will not yield extraneous data and will be faster to process. Details of such optimization issues can be found in Zadorozhny et al. (2000).


Figure 1. Dependent and regular join over relations implemented by WebSources.

Consider the following join query written in SQL:

Select OEW.MeanWage, OEW.MedianWage
From OES, OEW
Where OEW.OESCode = OES.OESCode and
      OEW.StateName = "NC"

In order to find a good execution plan for this query, the orderings of its scan operations implemented by particular IORs have to be identified.

For example, the above query can be rewritten as a query with two scan operations, S1 and S2, on OES and OEW respectively (figure 1).

The ordering S1 → S2 is imposed by ior4 and ior3, and a dependent join must be used to relate these two scan operations (figure 1(a)). The dependent join assumes that the values of the join attributes of the left operand (outer relation) are used to restrict the tuples of the right operand (inner relation). After this, the restricted inner relation is joined with the outer relation. The dependent join operator is required in our example because the attribute OESCode is an output from S1 and must bind the corresponding input attribute, also OESCode, of OEW in S2. We say that, for the above query, scan S2 can be implemented by the access pattern ior3 in the context of S1. The concept of context will be considered in Sections 4.1 and 5.5.

However, if ior4 and ior1 are considered, an ordering between the two scan operations is not required and a regular join, which does not use the outer relation to restrict fetching the inner relation, can be used to associate these scans (figure 1(b)).
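The difference between the two plans can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are made up, and the functions scan_ior4, scan_ior1, scan_ior3, dependent_join and regular_join are hypothetical stand-ins for wrapper calls, not part of any actual mediator.

```python
# Made-up rows standing in for the remote data; real values are not implied.
OES_ROWS = [
    {"OESCode": "c1", "OccupationTitle": "title one"},
    {"OESCode": "c2", "OccupationTitle": "title two"},
]
OEW_ROWS = [
    {"OESCode": "c1", "StateName": "NC", "MeanWage": 10.0, "MedianWage": 9.0, "Employment": 100},
    {"OESCode": "c1", "StateName": "MD", "MeanWage": 12.0, "MedianWage": 11.0, "Employment": 80},
    {"OESCode": "c2", "StateName": "NC", "MeanWage": 20.0, "MedianWage": 18.0, "Employment": 50},
]


def scan_ior4():
    """ior4: {} -> {OESCode, OccupationTitle}."""
    return [dict(row) for row in OES_ROWS]


def scan_ior1():
    """ior1: {} -> every OEW attribute; the whole relation is fetched."""
    return [dict(row) for row in OEW_ROWS]


def scan_ior3(oes_code, state_name):
    """ior3: {OESCode, StateName} -> {MeanWage, MedianWage, Employment}."""
    return [
        {k: row[k] for k in ("MeanWage", "MedianWage", "Employment")}
        for row in OEW_ROWS
        if row["OESCode"] == oes_code and row["StateName"] == state_name
    ]


def dependent_join(state_name):
    # S1 -> S2: each OESCode produced by the outer scan binds the inner scan.
    result = []
    for outer in scan_ior4():
        for inner in scan_ior3(outer["OESCode"], state_name):
            result.append((inner["MeanWage"], inner["MedianWage"]))
    return result


def regular_join(state_name):
    # No ordering: both scans run unbound and the mediator joins locally.
    outer_codes = {row["OESCode"] for row in scan_ior4()}
    return [
        (row["MeanWage"], row["MedianWage"])
        for row in scan_ior1()
        if row["OESCode"] in outer_codes and row["StateName"] == state_name
    ]


print(dependent_join("NC"))
print(regular_join("NC"))
```

Both functions return the same answers on this toy data; they differ in how much data each simulated source has to ship to the mediator.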

The problem of determining if there exists an executable ordering for a plan is NP-complete when several IORs are considered for every scan (Levy et al., 1996). Algorithms presented in Levy et al. (1996) and Morris (1988) find a single ordering of the subgoals when only one binding pattern is considered per scan. In Vidal and Raschid (1998), each scan can be associated with several IORs. However, only the attributes occurring in the query and the most general IORs are considered to perform this task. A most general IOR imposes the minimal set of input attributes and the maximal set of output attributes. Any IOR can be implemented using its corresponding most general IOR, applying post-processing to the data returned by the sources. Following this idea only the IORs ior1 and ior4 are considered at the mediator interface, and so only the second ordering is identified.
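The selection of most general IORs can be illustrated with a short Python sketch. It assumes that subsumption between two IORs of the same relation means "needs no more inputs and returns at least the same outputs"; the class IOR and the functions subsumes and most_general are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IOR:
    """An input-output relationship over one mediator relation."""
    name: str
    relation: str
    inputs: frozenset
    outputs: frozenset


def subsumes(general: IOR, specific: IOR) -> bool:
    # Assumed reading: same relation, no extra bindings needed, no output lost.
    return (
        general.relation == specific.relation
        and general.inputs <= specific.inputs
        and specific.outputs <= general.outputs
    )


def most_general(iors):
    # Keep the IORs that no other IOR in the list subsumes.
    return [c for c in iors if not any(o != c and subsumes(o, c) for o in iors)]


ior1 = IOR("ior1", "OEW", frozenset(),
           frozenset({"OESCode", "StateName", "MeanWage", "MedianWage", "Employment"}))
ior2 = IOR("ior2", "OEW", frozenset({"OESCode"}),
           frozenset({"StateName", "MeanWage", "MedianWage", "Employment"}))
ior3 = IOR("ior3", "OEW", frozenset({"OESCode", "StateName"}),
           frozenset({"MeanWage", "MedianWage", "Employment"}))
ior4 = IOR("ior4", "OES", frozenset(), frozenset({"OESCode", "OccupationTitle"}))

print([i.name for i in most_general([ior1, ior2, ior3, ior4])])
```

On the IORs of Example 1 the sketch keeps ior1 and ior4, which matches the pair considered at the mediator interface above.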


2.2. Semantic query optimization (SQO)

We provide here some background on semantic query optimization. We sketch the basic ideas and refer the reader to Chakravarty et al. (1990) for the details. The fundamental idea is to use a theorem proving technique, called partial subsumption, to attach residues to predicates in a process called semantic compilation. The residues are fragments of integrity constraints involving the predicates. Semantic compilation takes place before query processing. Then, during query processing the residues may be used to transform the query to another query which gives the same answers but whose execution is faster. In some cases we may find that the residue indicates that the original query has no answers, allowing the query processor to avoid any further work.

We illustrate semantic query optimization by some examples. First, we introduce some basic notation. Unless stated otherwise, we assume that a Prolog-like syntax is used. A predicate looks as follows: p(t1, t2, ..., tn), where p is a predicate name of arity n, each ti is a term, and (t1, t2, ..., tn) is a tuple. A term is a constant or a variable, though complex terms constructed using function symbols could also be brought into consideration. A predicate without variables is called ground or instantiated. A name starting with a capital letter signifies a variable. A rule is a statement of the form

p :- q1, q2, ..., qn,

where p and qi are predicates, p is the rule head, and the conjunction q1, q2, ..., qn is the rule body. We assume that each variable that appears in the head of a rule also appears in the body. A rule may be used to define the predicate p, so that p holds whenever q1, ..., qn all hold. Also, a rule may be considered to be a query asking for all ground tuples of p. Additionally, a rule may be considered to be an integrity constraint, that is, a statement that must always be true. Often, an integrity constraint has an empty head and is interpreted as meaning that the body (the conjunction of some predicates) is impossible. Each predicate qi is called a goal. A rule with an empty body and without variables is called a fact. We also use the built-in predicates, like "<", in the standard way.

Suppose the database has the predicate employee(Id,Name,Age,Salary). Suppose also that there is an integrity constraint on the database of the form

Age > 21:- employee(Id,Name,Age,Salary),

stating that all employees must have age greater than 21. In this case we obtain the residue

{Age > 21 :-},

for the predicate employee. This states that (for every tuple of employee) the age value is greater than 21. Now consider the query

q(Salary):- employee(Id,Name,Age,Salary), Age < 20

asking for the salaries of employees whose age is less than 20. Semantic query optimization uses the residue to transform the query to

q(Salary) :- employee(Id,Name,Age,Salary), Age < 20, Age > 21

which is contradictory, as it requires that Age be both less than 20 and greater than 21, and so cannot have any solutions. Hence it is not necessary to search the database to answer this query.
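The contradiction check in this first example can be sketched as follows. This is a simplified illustration, not the partial subsumption technique itself: it assumes that every condition compares a single numeric attribute with a constant using only "<" or ">", and the function name satisfiable is hypothetical.

```python
def satisfiable(constraints):
    """Check a conjunction of (op, constant) comparisons on one numeric attribute."""
    lower = float("-inf")   # tightest strict lower bound seen so far
    upper = float("inf")    # tightest strict upper bound seen so far
    for op, value in constraints:
        if op == ">":
            lower = max(lower, value)
        elif op == "<":
            upper = min(upper, value)
        else:
            raise ValueError(f"unsupported operator: {op}")
    # Over the reals, some value fits iff the open interval is non-empty.
    return lower < upper


residue = [(">", 21)]            # from the integrity constraint on employee
query_conditions = [("<", 20)]   # from the query body

print(satisfiable(query_conditions + residue))
```

Here the combined conditions are unsatisfiable, so the query can be answered with the empty set without touching the database.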

The second example uses the same relation but now the integrity constraint is

Salary = 30000 :- employee(Id,Name,Age,Salary), Age < 25,

stating that all employees less than 25 years of age must have a salary of 30000. In this case the residue for the predicate employee is

{Salary = 30000 :- Age < 25},

which states that whenever an (employee's) age is less than 25, the (employee's) salary is 30000.

Now consider the query

q(Salary) :- employee(Id,Name,Age,Salary), Age = 24

asking for the salaries of employees whose age is 24. Using the residue we obtain

{Salary = 30000 :-}

indicating that the salary must be 30000 (assuming there is such an employee).

The examples considered in this subsection illustrate traditional semantic query optimization for the relations of a database. In this paper we will be at a higher level so that our relations will be objects such as queries and WebSources instead of database relations like employee. Semantic query optimization will also be used as a higher level concept, checking to see, for example, if a web source can be used to answer a query.
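The second example, where a residue with a non-empty body yields an answer directly, can be sketched in the same spirit. The representation below (a list of condition/conclusion pairs named RESIDUES and a function infer) is an assumption made for illustration only.

```python
# Each residue is (condition on known bindings, (attribute, value) it implies).
# The single entry encodes {Salary = 30000 :- Age < 25} from the text.
RESIDUES = [
    (lambda b: "Age" in b and b["Age"] < 25, ("Salary", 30000)),
]


def infer(bindings):
    """Return the bindings extended with every conclusion whose condition holds."""
    derived = dict(bindings)
    for condition, (attribute, value) in RESIDUES:
        if condition(bindings):
            derived[attribute] = value
    return derived


print(infer({"Age": 24}))   # the query binds Age = 24
print(infer({"Age": 40}))   # the residue does not apply
```

With Age bound to 24 the salary follows from the residue alone, provided such an employee exists, as the text notes.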

3. Related work

In several research projects, Haas et al. (1997), Vidal and Raschid (1998) and Florescu et al. (1999) have considered the task of capability-based rewriting for sources with limited capabilities. The two main approaches to CBR differ on where to maintain and how to use the source capability information. In Haas et al. (1997) this information is encapsulated in wrappers for WebSources. Wrappers participate in query planning, and check if a given query can be answered by a given WebSource. The mediator interacts with wrappers to decompose a query into a set of subqueries in such a way that each subquery can be evaluated in a relevant WebSource with appropriate capability. A disadvantage of this approach is that the mediator may consider decompositions (rewritings) that do not produce a safe plan, that is, a plan that gives all the answers.

In another approach, considered in Vidal and Raschid (1998), the mediator stores information about the capability of sources. This way the mediator can use the set of capability-based restrictions to force the query optimizer to consider only safe plans. Using CBR to drive the query optimizer is another option in plan generation. In Zadorozhny et al. (2000), an efficient approach to extend a randomized query optimizer is suggested, so that it does not consider unsafe rewritings.

Both approaches apply different heuristics to avoid the exhaustive enumeration of source capabilities in relevant source identification and plan generation (which, in general, is NP-complete (Vidal, 1999)). Typically such heuristics are based on some general assumptions (i.e., a utility function (Florescu et al., 1999)) and do not consider the semantics of the WebSources. But semantic information may introduce a valuable optimization opportunity, as has been shown by work in semantic query optimization (SQO) (Chakravarty et al., 1990). SQO has been discussed in connection with relational databases, deductive databases, and object databases (Pang et al., 1991; Lakshmanan and Missaoui, 1992; Levy and Sagiv, 1995). Experiments showing the usefulness of SQO in DB2 are described in Cheng et al. (1999).

In this paper we develop a framework to apply SQO for queries over sources with restricted capabilities. This does not depend on a specific approach to CBR and so it provides a uniform approach to increase the efficiency of CBR-based query processing.

4. Notation and definitions

In this section we introduce the predicates used in the logical specification of capability-based rewriting (CBR). We also give an example that we use for illustration in the rest of the paper.

4.1. Basic definitions

We start by introducing the predicates we use to characterize basic concepts of a mediator for WebSources. We assume that there is a set of pre-defined mediator relations, whose actual instantiation is provided by one or more WebSources. WebSource capabilities typically require bindings for certain mediator relation attributes, a binding pattern. Web data access (web query) by a particular binding pattern commonly does not provide output values for all the mediator relation attributes. The output corresponding to a particular binding pattern should be specified explicitly in the WebSource description. The binding pattern together with the corresponding output specify the input-output relationship (IOR) that we call a WebSource.

We will use the following basic predicates (we place the elements of a set in brackets):

• relation(Relname, Attributes). This predicate specifies a mediator relation, Relname, with a set of attribute names, Attributes.

• webSource(WSName, InputAttributes, OutputAttributes). This predicate specifies an access pattern for a WebSource WSName with binding InputAttributes and output OutputAttributes.

• query(Qid, BoundAttrs, OutAttrs). This predicate specifies a query Qid with binding BoundAttrs and output OutAttrs. We will later explain how Qids are specified using relation names; hence relation names are really included in this predicate.

• refines(InpAttr1, OutAttr1, InpAttr2, OutAttr2). This predicate is true if InpAttr2 is a subset of InpAttr1, and OutAttr1 is a subset of OutAttr2. The idea is that the WebSource (IOR) <InpAttr2, OutAttr2> refines the WebSource (IOR) <InpAttr1, OutAttr1> since it needs no more bindings and yields at least as much output.

• implemented_by(Qid, WS, Context). This predicate is true if the query Qid is implemented by WebSource WS in the context of the queries Context. The context refers to a set of queries whose output is available as bound attributes for processing the query Qid as a result of a previous subquery execution. In other words, Context represents the set of bindings propagated from the left dependent join operand to the right dependent join operand by Sideways Information Passing (SIP) (Seshadri et al., 1996). Until we get to join queries in Section 5.5, the context will be the empty set and we will ignore it by writing implemented_by(Qid, WS) instead of implemented_by(Qid, WS, []).

• query_output(Qids, Attrs). This predicate is true if the set of attributes Attrs is the union of the sets of output attributes of the queries Qids.

We will also use some utility predicates, such as:

• is_subset(Set1, Set2). This predicate is true if the set Set1 is a subset of the set Set2.

• member(Element, Set). This predicate is true if the object Element is a member of the set Set.

• union(Set1, Set2, Set3). This predicate is true if the set Set3 is the union of the sets Set1 and Set2.

In Section 5 we will show how to write a definition of the implemented_by predicate as an integrity constraint using the core specification first and then various extensions. The general format of this integrity constraint will be

implemented_by(Qid,WS,Context) :- query(Qid,BAttrs,OutAttrs), p1, ..., pk

where p1, ..., pk stand for additional predicates that include WS and Context and other variables. The integrity constraint states that if a query is given and various other things hold, the query (in a particular context) is implemented by a particular WebSource. Consider now a query written in a general form as

query(Qid,BAttrs,OutAttrs).

As in the examples of Section 2.2 the residue will be

implemented_by(Qid,WS,Context) :- p1, ..., pk

If we can instantiate the body of the residue, p1, ..., pk, then we find that the instantiation of the WebSource WS in the context of Context will implement the query. If we cannot instantiate the body of the residue, the query cannot be implemented by our WebSources, so no web search is needed.


4.2. Aircraft example

We will use an example involving aircraft throughout the paper to illustrate our technique. The example is based on the Landings Aviation Search Engines and FAA Aviation Safety Data WebSources. Here we specify the basic predicates for this example. The attribute names are self-describing: for instance regNo is registration number and repNo is report number.

We have four WebSources:

webSource(ws1, [regNo], [model, owner, engModel, engManufacturer]).
webSource(ws2, [model, owner], [regNo, engModel, engManufacturer]).
webSource(ws3, [regNo], [repNo]).
webSource(ws4, [repNo], [date, time, eventType]).

We specify two relations associated with these WebSources:

relation(aircraft, [regNo, model, owner, engModel, engManufacturer]).
relation(airincident, [regNo, repNo, date, time, eventType]).

5. Specification of CBR for WebSources

In this section we define the implemented_by predicate in several different ways depending on various semantic constructs. Initially we assume that a query involves a single table scan. We then use the definition of implemented_by as an integrity constraint to attach a residue to a query. The evaluation of this residue will determine if a query is implemented by a WebSource.

In the first subsection we deal with the core specification, which is the simplest case. Then we show how to extend the core specification by allowing what we call Generalized WebSources obtained by combining several related WebSources through the transitivity of attributes. The second extension adds domain constraints. The third extension involves generalized attribute mappings rather than using simply names of attributes. (These extensions can be combined, but we do not give the details here.) Finally, we show how to handle queries involving joins of relations. As stated in Section 4.1 the context is empty and we ignore it until we consider join queries in Section 5.5.

5.1. Core specification

This is the basic version where a single WebSource is used, attributes are strictly used by name only, and attribute domains are ignored. Each query (on a single relation) has a unique query identifier. The definition of implemented_by in this case is:

implemented_by(Qid,Ws) :-
    query(Qid, BAttrs, OutAttrs),
    webSource(Ws, Input, Output),
    refines(BAttrs, OutAttrs, Input, Output).

This definition says that the query with identifier Qid is implemented by the WebSource Ws if the input of the WebSource (the attributes that must be bound) is a subset of the bound attributes of the query, and the output of the WebSource contains the requested output attributes of the query.

Consider the following query (written in SQL first) requesting all the owner names and engine models of the aircraft relation:

Select owner, engModel
From aircraft;

We write it in our notation, using queryid q1, as

query(q1,[ ], [owner,engModel]).

The residue for this query is

implemented_by(q1,WS) :-
    webSource(WS, Input, Output),
    refines([], [owner,engModel], Input, Output)

whose evaluation fails. Therefore this query is not implementable by any WebSource.

Consider the next query, again written in SQL first, which requests the owner name and engine model of the aircraft with registration number 25116:

Select owner, engModel
From aircraft
Where regNo = 25116;

which we write in our notation, using queryid q2 as

query(q2,[regNo],[owner,engModel]).

The residue for this query is

implemented_by(q2,WS) :-
    webSource(WS, Input, Output),
    refines([regNo], [owner,engModel], Input, Output).

In this case the substitutions WS=ws1, Input=[regNo], Output=[model,owner,engModel,engManufacturer] yield implemented_by(q2,ws1). Hence this second query is implemented by WebSource ws1.
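The evaluation of the two residues above can be reproduced with a small Python sketch. It is only one possible reading of the core specification: the dictionary WEB_SOURCES transcribes the four webSource facts of Section 4.2, while the Python functions refines and implemented_by are hypothetical helpers that enumerate every WebSource satisfying the residue.

```python
# webSource facts of the aircraft example: name -> (input attrs, output attrs).
WEB_SOURCES = {
    "ws1": (["regNo"], ["model", "owner", "engModel", "engManufacturer"]),
    "ws2": (["model", "owner"], ["regNo", "engModel", "engManufacturer"]),
    "ws3": (["regNo"], ["repNo"]),
    "ws4": (["repNo"], ["date", "time", "eventType"]),
}


def refines(inp_attr1, out_attr1, inp_attr2, out_attr2):
    """inp_attr2 is a subset of inp_attr1 and out_attr1 is a subset of out_attr2."""
    return set(inp_attr2) <= set(inp_attr1) and set(out_attr1) <= set(out_attr2)


def implemented_by(bound_attrs, out_attrs, web_sources=WEB_SOURCES):
    """Names of all WebSources that instantiate the residue of the query."""
    return [
        name
        for name, (inputs, outputs) in web_sources.items()
        if refines(bound_attrs, out_attrs, inputs, outputs)
    ]


print(implemented_by([], ["owner", "engModel"]))          # query q1
print(implemented_by(["regNo"], ["owner", "engModel"]))   # query q2
```

For q1 the list is empty, since every WebSource requires at least one binding; for q2 only ws1 qualifies.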

5.2. Extension 1: Core specification + generalized WebSources

Suppose that we would like to know the dates and times of air incidents involving the aircraft with registration number 25116. No single WebSource can answer this query. However, the combination of WebSources ws3 and ws4 suffices, because ws3 provides all report numbers for the aircraft and, using these report numbers, ws4 gives the dates and times. Our concept of a generalized WebSource is used for this type of situation. The definition is recursive.

generalizedWebSource(WS,Input,Output) :-
    webSource(WS,Input,Output).

generalizedWebSource(WS,Input,Output) :-
    webSource(WS1,Input,Output1),
    generalizedWebSource(WS2,Output2,Output),
    is_subset(Output2,Output1),
    WS = idfunc(WS1,WS2).

We use idfunc here to generate a WebSource identifier by a combination of WS1 and WS2, where WS2 may have been obtained previously in a similar way.

We define implemented_by by substituting generalizedWebSource for webSource:

implemented_by(Qid,Ws) :-
    query(Qid, BAttrs, OutAttrs),
    generalizedWebSource(Ws, Input, Output),
    refines(BAttrs, OutAttrs, Input, Output).

Now we apply this concept to the query considered earlier informally and written in SQL as

Select date, time
From airincident
Where regNo = 25116;

which we write in our notation, using queryid q3, as

query(q3,[regNo],[date,time]).

As in the previous section it gives the residue

implemented_by(q3,WS) :-
    webSource(WS, Input, Output),
    refines([regNo], [date,time], Input, Output)

whose evaluation fails.

However, consider the generalizedWebSource obtained by combining ws3 and ws4. In the second definition of generalizedWebSource, let WS1 = ws3, Input = [regNo], Output1 = [repNo], WS2 = ws4, Output2 = [repNo], Output = [date, time, eventType]. Then, using the first definition of generalizedWebSource for ws4 and assuming that ws3.4 = idfunc(ws3, ws4) holds, we obtain generalizedWebSource(ws3.4,[regNo],[date,time,eventType]). Finally, using generalizedWebSource, the residue for the query is

implemented_by(q3,WS) :-
    generalizedWebSource(WS, Input, Output),
    refines([regNo], [date,time], Input, Output).

By the substitution WS = ws3.4, Input = [regNo], Output = [date, time, eventType] we find that q3 is implemented by the generalizedWebSource ws3.4. In this case the generalizedWebSource was made up of two WebSources; in general, the recursive definition allows for a combination of any number of WebSources.
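The recursive definition can be explored with the following Python sketch. Two choices here are assumptions made for the illustration and are not part of the specification: a chain never reuses a WebSource (this guarantees termination, since ws1 and ws2 can otherwise feed each other indefinitely), and a generalized WebSource is identified by the tuple of the names it chains, which plays the role of idfunc.

```python
# webSource facts of the aircraft example: name -> (input attrs, output attrs).
WEB_SOURCES = {
    "ws1": (["regNo"], ["model", "owner", "engModel", "engManufacturer"]),
    "ws2": (["model", "owner"], ["regNo", "engModel", "engManufacturer"]),
    "ws3": (["regNo"], ["repNo"]),
    "ws4": (["repNo"], ["date", "time", "eventType"]),
}


def generalized(web_sources, excluded=frozenset()):
    """List (chain, inputs, outputs) for chains that avoid the excluded names."""
    results = []
    for name, (inputs, outputs) in web_sources.items():
        if name in excluded:
            continue
        inputs, outputs = frozenset(inputs), frozenset(outputs)
        # First clause: every WebSource is a generalized WebSource.
        results.append(((name,), inputs, outputs))
        # Second clause: prepend `name` to a generalized WebSource whose
        # inputs are all produced by `name`; the output is that of the tail.
        for chain, tail_in, tail_out in generalized(web_sources, excluded | {name}):
            if tail_in <= outputs:
                results.append(((name,) + chain, inputs, tail_out))
    return results


def implemented_by(bound_attrs, out_attrs, web_sources=WEB_SOURCES):
    """Chains whose inputs are bound by the query and whose output covers it."""
    bound, wanted = set(bound_attrs), set(out_attrs)
    return [
        chain
        for chain, g_in, g_out in generalized(web_sources)
        if g_in <= bound and wanted <= g_out
    ]


print(implemented_by(["regNo"], ["date", "time"]))   # query q3
```

For q3 the sketch reports the chain (ws3, ws4), corresponding to ws3.4 above. Under the no-reuse assumption it also reports the longer chain (ws1, ws2, ws3, ws4), which satisfies the same definition but would presumably be discarded on cost grounds.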

5.3. Extension 2: Core + domain constraints

In this subsection we deal with the case where each attribute in a relation or a WebSource has a corresponding domain. We still assume that the attribute names match in the queries (based on the relations) and the WebSources. Consider an input attribute and suppose that the domain of the attribute of the query relation is not a subset of the domain of the corresponding attribute of a WebSource. Then the specific binding may not be an element in the WebSource; hence the WebSource may not be able to answer the query. Next, consider the case where the domain of an output attribute of the query relation is not a subset of the domain of the corresponding attribute of a WebSource. In this case the WebSource may give only partial answers. For this paper we will consider both of these cases unacceptable. If the domain of the WebSource for an output attribute is a proper superset of the domain of the query relation, unexpected values outside of the domain may appear in the answer. We assume that such values can be filtered out before being sent to the user and do not consider this possible problem separately.

We introduce three new predicates where we assume that each domain refers to a set of values or is represented by a domain name whose elements are known.

• relationDomains(Relname, Attribute, Domain). This predicate specifies the domain for each attribute of each relation.

• webSourceDomains(WSName, Attribute, Domain). This predicate specifies the domain of each attribute of each WebSource.

• domainRefines(InpAttr1, OutAttr1, InpAttr2, OutAttr2). This predicate is true if (1) InpAttr2 is a subset of InpAttr1 and for every attribute A in InpAttr2, the domain of A in InpAttr1 is a subset of the domain of A in InpAttr2 and (2) OutAttr1 is a subset of OutAttr2 and for every attribute B in OutAttr1, the domain of B in OutAttr1 is a subset of the domain of B in OutAttr2. This is the extension of the refines predicate with the added proviso that the domain of each attribute of the first IOR is a subset of the domain of the corresponding attribute of the second IOR.

Now the only thing that needs to be changed in the definition of implemented_by is to substitute domainRefines for refines:

implemented_by(Qid, Ws) :-
    query(Qid, BAttrs, OutAttrs),
    webSource(Ws, Input, Output),
    domainRefines(BAttrs, OutAttrs, Input, Output).


As an example, consider the case where there is a mediator relation aircraft as

relation(aircraft, [regNo, model, owner, engModel, engManufacturer])

and relationDomains(aircraft, regNo, 10001..99999), that is, the registration numbers range from 10001 to 99999. Suppose we also have

webSourceDomains(ws1, regNo, 10001..20000)

and the query is

query(q2,[regNo],[owner,engModel]).

In this case, implemented_by(q2, ws1) fails because

domainRefines([regNo], [owner, engModel], [regNo], [model, owner, engModel, engManufacturer])

fails, since 10001..99999 is not a subset of 10001..20000. In particular, ws1 cannot answer the SQL query

Select owner, engModel
From aircraft
Where regNo = 75123;

because 75123 is not in the domain of regNo in ws1.

Assume now that the domain of the mediator relation aircraft is changed so that we have

relationDomains(aircraft, regNo, 10001..20000)

The registration number is no longer a problem. Suppose that the following also hold:

relationDomains(aircraft, engModel, A*..C*)
webSourceDomains(ws1, engModel, A*)

for the same query. In this example implemented_by(q2, ws1) also fails, because the domain of engModel in ws1 does not contain the domain of engModel in aircraft. This WebSource cannot contain any engine model starting with the letter B or C.
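The domain check in the regNo example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: domains are encoded as integer intervals (Python ranges), an attribute with no declared domain is treated as unconstrained, and the names is_subdomain, domain_ok and domain_refines are introduced here for the sketch and are not part of the specification.

```python
from typing import Dict, List, Optional

# Assumption for this sketch: a domain is an integer interval, encoded as a
# step-1 Python range (10001..99999 becomes range(10001, 100000)).
Domains = Dict[str, range]


def is_subdomain(d1: range, d2: range) -> bool:
    """True if every value of d1 also belongs to d2."""
    return len(d1) == 0 or (d1.start >= d2.start and d1.stop <= d2.stop)


def domain_ok(attr: str, dom1: Domains, dom2: Domains) -> bool:
    # Assumption: an attribute without a declared domain is unconstrained.
    d1: Optional[range] = dom1.get(attr)
    d2: Optional[range] = dom2.get(attr)
    return d1 is None or d2 is None or is_subdomain(d1, d2)


def domain_refines(inp1: List[str], out1: List[str], dom1: Domains,
                   inp2: List[str], out2: List[str], dom2: Domains) -> bool:
    # (1) every input required by the source is bound in the query, and the
    #     query-side domain fits inside the source-side domain.
    inputs_ok = all(a in inp1 and domain_ok(a, dom1, dom2) for a in inp2)
    # (2) every requested output is produced by the source, with the
    #     query-side domain contained in the source-side domain.
    outputs_ok = all(b in out2 and domain_ok(b, dom1, dom2) for b in out1)
    return inputs_ok and outputs_ok


# Query q2 and WebSource ws1 from the example above.
q2_bound, q2_out = ["regNo"], ["owner", "engModel"]
ws1_in = ["regNo"]
ws1_out = ["model", "owner", "engModel", "engManufacturer"]
ws1_domains = {"regNo": range(10001, 20001)}

wide = {"regNo": range(10001, 100000)}    # relationDomains 10001..99999
narrow = {"regNo": range(10001, 20001)}   # relationDomains 10001..20000

fails_with_wide = domain_refines(q2_bound, q2_out, wide, ws1_in, ws1_out, ws1_domains)
holds_with_narrow = domain_refines(q2_bound, q2_out, narrow, ws1_in, ws1_out, ws1_domains)
print(fails_with_wide, holds_with_narrow)
```

With the wide relation domain the check fails, as in the text; once the relation domain is narrowed to 10001..20000 the regNo condition is satisfied.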

5.4. Extension 3: Core specification + generalized attribute mappings

Up to now we based the refines predicate on subset relations between WebSource attribute sets and mediator relation attribute sets. In this section we consider how it is possible to extend our specification to the more general case of user-defined mappings between WebSource and mediator attribute sets. In this case the refines predicate should check the following mapping_exists predicate:

mapping_exists(Set1, Set2) :- is_valid_mapping(MapId, Set1, Set2),

where is_valid_mapping is used to define specific mappings and MapId specifies a mapping identifier. The mapping identifier is a constant denoting a specific kind of mapping. For example, the subset-based mapping used in the core specification can be represented in the following way (note that subset is a constant mapping identifier):

is_valid_mapping(subset, Set1, Set2) :- is_subset(Set1, Set2).

Now we define a generalized version of the refines predicate as

mappingRefines(InpAttr1, OutAttr1, InpAttr2, OutAttr2) :-
    is_valid_mapping(Mapid1, InpAttr2, InpAttr1),
    is_valid_mapping(Mapid2, OutAttr1, OutAttr2).

The substitution of subset for both Mapid1 and Mapid2 gives the refines predicate of the Core Specification. Now we consider an example of a user-defined mapping for the mediator relation account and WebSource account_ws

relation(account, [Name, Balance]).
webSource(account_ws, [PersonName], [PersonName, Saving, Checking])

where Saving stands for the amount in a Savings account and Checking stands for the amount in a Checking account. Suppose that the query is

query(q4, [Name], [Balance]),

and we have the following mappings:

is_valid_mapping(names, [PersonName], [Name]) and
is_valid_mapping(accounts, [Balance], [Saving, Checking]) :-
    Balance = Saving + Checking

then we obtain

mappingRefines([Name], [Balance], [PersonName], [Saving, Checking])

and the implemented_by relation is satisfied by account_ws for q4.
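One possible way to read the mapping-based check operationally is as a registry of named mappings that is consulted in both directions. The Python sketch below is an illustration only: the MAPPINGS dictionary, the way each mapping is encoded as a test on two attribute lists, and the balance helper are assumptions made here, not part of the specification.

```python
from typing import Callable, Dict, List

AttrSet = List[str]

# Hypothetical registry standing in for the is_valid_mapping facts and rules:
# mapping identifier -> test deciding whether Set1 maps to Set2.
MAPPINGS: Dict[str, Callable[[AttrSet, AttrSet], bool]] = {
    "subset": lambda s1, s2: set(s1) <= set(s2),
    "names": lambda s1, s2: s1 == ["PersonName"] and s2 == ["Name"],
    "accounts": lambda s1, s2: s1 == ["Balance"] and set(s2) == {"Saving", "Checking"},
}


def mapping_exists(s1: AttrSet, s2: AttrSet) -> bool:
    """True if some registered mapping relates s1 to s2."""
    return any(check(s1, s2) for check in MAPPINGS.values())


def mapping_refines(inp1: AttrSet, out1: AttrSet, inp2: AttrSet, out2: AttrSet) -> bool:
    # Same argument order as the mappingRefines rule: source inputs are mapped
    # to query bindings, query outputs are mapped to source outputs.
    return mapping_exists(inp2, inp1) and mapping_exists(out1, out2)


def balance(row: Dict[str, float]) -> float:
    """The value-level part of the accounts mapping: Balance = Saving + Checking."""
    return row["Saving"] + row["Checking"]


q4_holds = mapping_refines(["Name"], ["Balance"], ["PersonName"], ["Saving", "Checking"])
core_case = mapping_refines(["regNo"], ["owner"], ["regNo"], ["model", "owner"])
print(q4_holds, core_case, balance({"Saving": 100.0, "Checking": 50.0}))
```

The second call uses only the subset mapping and therefore behaves like the refines predicate of the core specification.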


5.5. Join queries

To specify a join query over a set of WebSources we elaborate on the structure of the query identifier in the predicate query. So far we considered it as an atomic unique identifier.

We define the query identifier recursively as follows:

• a pair (RelName, UniqueNumber) is a query identifier of a scan query over relation RelName. UniqueNumber is an integer which uniquely identifies a query on a relation.

• if Qid1 and Qid2 are query identifiers, then <Qid1, Qid2> is a query identifier (the order is important).

Now, we extend the implemented_by predicate by adding the context. We start with a scan query and follow with a join query. We use [Qid1] to indicate the set of output attributes of Qid1. We apply the join with the core specification, but we can add any of the extensions presented in the previous three subsections. Consider first the definition of implemented_by for a scan query with nonempty context. We express the idea that the output attributes of the previous (leftmost) queries in the Context are added to the query as bound attributes in determining if a WebSource implements a query. In the case of a join of two queries, the output attributes of the first query are added to the context of the first query to obtain the context for the second query.

implemented_by((RelName, UniqueNumber), Ws, Context) :-
    query((RelName, UniqueNumber), BAttrs, OutAttrs),
    webSource(Ws, Input, Output),
    query_output(Context, ContextOutput),
    union(BAttrs, ContextOutput, ResBAttrs),
    refines(ResBAttrs, OutAttrs, Input, Output).

implemented_by(<Qid1, Qid2>, [Ws1, Ws2], LeftContext) :-
    implemented_by(Qid1, Ws1, LeftContext),
    union(LeftContext, [Qid1], RightContext),
    implemented_by(Qid2, Ws2, RightContext).

In the above definition we assume a right-linear tree order for join execution (i.e., Qid1 always corresponds to a scan). Next we give an abstract and a concrete example.

Example 2. Consider the following simple scan queries:

query((r1,1), B1, O1).
query((r2,1), B2, O2).
query((r3,1), B3, O3).

A join of the relations r1, r2 and r3 can be represented using the following query identifier:

<(r1,1), <(r2,1), (r3,1)>>


To check if this query is implementable we should evaluate the following predicate:

implemented_by(<(r1,1), <(r2,1), (r3,1)>>, WS_for_r1r2r3, [])

which will result in checking the following decomposition:

p1: implemented_by((r1,1), WS_for_r1, [])
    implemented_by(<(r2,1), (r3,1)>, WS_for_r2r3, [(r1,1)])
p2: implemented_by((r2,1), WS_for_r2, [(r1,1)])
p3: implemented_by((r3,1), WS_for_r3, [(r1,1), (r2,1)])

Predicates p1, p2, and p3 are evaluated for the simple scan queries (r1,1), (r2,1) and (r3,1) in the way described in the previous sections. If the evaluation is successful, the variable WS_for_r1r2r3 will be instantiated by a sequence of WebSources that can be used for the join implementation.

Example 3. Consider the following join query (written in SQL first) asking for all report numbers of aircraft model B727 owned by TWA involved in air incidents:

Select repNo
From aircraft, airincident
Where aircraft.regNo = airincident.regNo
  and model = “B727”
  and owner = “TWA”;

written in our notation as two queries:

query((aircraft,1), [model, owner], [regNo])
query((airincident,1), [ ], [repNo]).

Trying to check if this query is implementable, we obtain (applying union)

implemented_by(<(aircraft,1), (airincident,1)>, [WS1, WS2], [ ]) :-
    implemented_by((aircraft,1), WS1, [ ]),
    implemented_by((airincident,1), WS2, [(aircraft,1)]).

The solution for WS1 is ws2, so that in the second implemented_by [(aircraft,1)] becomes [regNo]. Thus the second query becomes implementable using regNo as the input by assigning ws3 to WS2.
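The context-passing behaviour of Example 3 can be traced with a small Python sketch. It follows the two implemented_by rules for scans and joins under the core (subset-based) refines check. The input/output lists given for ws2 and ws3 below are assumptions chosen to be consistent with the example, and the data layout (dictionaries keyed by query identifier and source name, joins as nested pairs) is introduced here for illustration only.

```python
from typing import Dict, List, Set, Tuple

# Scan queries: (RelName, UniqueNumber) -> (bound attributes, output attributes).
QUERIES: Dict[Tuple[str, int], Tuple[List[str], List[str]]] = {
    ("aircraft", 1): (["model", "owner"], ["regNo"]),
    ("airincident", 1): ([], ["repNo"]),
}

# Assumed WebSource descriptions: name -> (input attributes, output attributes).
SOURCES: Dict[str, Tuple[List[str], List[str]]] = {
    "ws2": (["model", "owner"], ["regNo"]),
    "ws3": (["regNo"], ["repNo"]),
}


def refines(bound, out, inp, outp) -> bool:
    """Core check: source inputs are bound, requested outputs are produced."""
    return set(inp) <= set(bound) and set(out) <= set(outp)


def is_scan(qid) -> bool:
    # A scan identifier is (RelName, UniqueNumber); a join is a pair of identifiers.
    return isinstance(qid[0], str)


def query_output(context) -> Set[str]:
    """Union of the output attributes of the scan queries in the context."""
    attrs: Set[str] = set()
    for q in context:
        attrs |= set(QUERIES[q][1])
    return attrs


def implemented_by(qid, context=()) -> List[List[str]]:
    """Return every assignment of WebSources (in join order) implementing qid."""
    if is_scan(qid):
        bound, out = QUERIES[qid]
        res_bound = set(bound) | query_output(context)
        return [[ws] for ws, (inp, outp) in SOURCES.items()
                if refines(res_bound, out, inp, outp)]
    left, right = qid                      # left is assumed to be a scan
    plans = []
    for left_plan in implemented_by(left, context):
        right_context = tuple(context) + (left,)
        for right_plan in implemented_by(right, right_context):
            plans.append(left_plan + right_plan)
    return plans


join_plans = implemented_by((("aircraft", 1), ("airincident", 1)))
alone = implemented_by(("airincident", 1))
print(join_plans, alone)
```

Evaluated on its own with an empty context, the airincident scan has no implementation; inside the join, the regNo produced by the aircraft scan supplies the binding that ws3 requires.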

6. Characterization of the CBR tool in MedWebSrc

We illustrate how the generic logical specification introduced in the previous sections can be used to represent and analyse specific CBR tools. In particular, we consider one of the existing CBR tools, the CBR tool from MedWebSrc, a Web mediator system developed at the University of Maryland (Zadorozhny et al., 2000; Vidal, 1999). We define a search algorithm for relevant WebSources used in MedWebSrc in terms of our framework and discuss its soundness and completeness. We start with a brief survey of MedWebSrc.

MedWebSrc supports SQL-like declarative queries to Internet accessible WebSources. A WebSource is accessible via the http protocol; a forms-based interface provides a limited query capability, and returns answers in XML or HTML. The mediator is an extension of the Predator ORDBMS (Ramakrishnan et al., 1997); the mediator uses the relational data model. Some of the mediator relations can be remote with their actual content located on the Web. The mediator has the task of decomposing a mediator query into subqueries, identifying relevant WebSources that can answer a subquery, and providing query optimization and evaluation functionalities. WebSource wrappers reflect the limited query capability of WebSources and handle mismatch between the mediator and the WebSource. For each remote relation in the mediator query, the CBR tool identifies a relevant WebSource and its query processing capabilities.

6.1. Search algorithm for relevant WebSources in MedWebSrc

The algorithm is based on the concepts which are defined below. In the definitions we will use the following two sets: QUER, a set of scan query identifiers, and WS, a set of WebSource identifiers.

Definition 1 (implies relation). Let ws1, ws2 ∈ WS such that webSource(ws1, Input1, Output1) and webSource(ws2, Input2, Output2) hold. Then implies(ws1, ws2) holds iff is_subset(Input1, Input2), is_subset(Output2, Output1), and is_subset(Input2 ∪ Output2, Input1 ∪ Output1). ws1 directly implies ws2 iff implies(ws1, ws2) holds, and there is no wsj ∈ WS, wsj ≠ ws1 and wsj ≠ ws2, such that implies(ws1, wsj) and implies(wsj, ws2) hold.

For example, in Section 2.1 Example 1 implies(ior1, ior2) holds because ior1 is more general than ior2. For each WebSource ws′ ∈ WS we define a bucket in the following way:

Definition 2 (bucket). The bucket of WebSource ws′ ∈ WS is the pair (SS = {ws1, . . . , wsn}, ws′) such that ws1, . . . , wsn, ws′ ∈ WS and SS is the largest subset of WS with the property that for all wsi ∈ SS, wsi directly implies ws′.

Thus, ws′ represents the most specific description with respect to the WebSources SS in its bucket. This will help to decide if the WebSources in SS can be used to implement a query.

Definition 3 (applicable bucket). A bucket B = (SS, ws) is applicable for query q ∈ QUER in the context Context if implemented_by(q, ws, Context) holds.

We also specify an implies relation on buckets to optimize the query implementation search:

Definition 4 (implication on buckets). For two buckets B1 = (SS1, ws1) and B2 = (SS2, ws2), implies(B1, B2) holds iff implies(ws1, ws2) holds.


It follows from Definitions 1 and 2 that the implies relation defines a partial order on the set of buckets. In picturing the set of buckets as a po-set, whenever implies(B1, B2) holds, B2 is placed above (at a higher level than) B1. We define the po-set AlternatePartition by placing each bucket into its lowest possible level. This defines the layers of the po-set, where each layer of AlternatePartition contains all the buckets from the same level. We call each layer of AlternatePartition a superbucket SB = {B1, B2, . . . , Bn}. In particular, for any Bi, Bj ∈ {B1, B2, . . . , Bn}, ¬implies(Bi, Bj) holds. A superbucket SBi precedes a superbucket SBj in the AlternatePartition if for all B ∈ SBi there exists a bucket B′ ∈ SBj such that implies(B′, B) holds. That is, if SBi precedes SBj then SBi is above SBj.

Definition 5 (relevant superbucket). A superbucket SB is relevant for query q ∈ QUER if there is a bucket B ∈ SB such that B is applicable for q, and there is no superbucket SB1 with a bucket B1 ∈ SB1, where B1 is applicable for q, such that SB1 precedes SB.

This means that a superbucket relevant for query q is the most specific superbucket with a bucket applicable for q. Hence there can be at most one relevant superbucket for a query. An applicable bucket B for q ∈ QUER is called relevant for q iff B ∈ SB, where SB is the relevant superbucket for q.

Now we are ready to specify the algorithm RelSrc for identifying relevant sources for a query q ∈ QUER:

Algorithm RelSrc

Input: AlternatePartition and a query q.
Output: a set of WebSources that implement q.

1. Find the relevant superbucket SB for q in AlternatePartition.
2. If step 1 succeeds, then find all relevant buckets Bi = (SSi, wsi) for q in SB and return the union of all SSi as the result.
3. If step 1 fails, then return ∅ as the result.

6.2. Soundness and completeness of the algorithm RelSrc

Here we consider the soundness and completeness of the algorithm RelSrc with respect to our CBR specification. The following lemma is used to prove soundness.

Lemma 1. For any bucket B = (SS, ws′), if implemented_by(q, ws′, Context) holds, then for every ws ∈ SS, implemented_by(q, ws, Context) also holds.

Proof: Suppose that we are given query(q, BAttrs, OutAttrs), and webSource(ws′, Inp′, Out′). By the definition of implemented_by(q, ws′, Context) we obtain, using the definition of refines,

Inp′ ⊆ BAttrs (1)


and

OutAttrs ⊆ Out′ (2)

Now suppose that ws ∈ SS where webSource(ws, Inp, Out) holds. By the definition of bucket

Inp ⊆ Inp′ (3)

and

Out′ ⊆ Out (4)

From (1) and (3) we obtain

Inp ⊆ BAttrs (5)

and from (2) and (4) we obtain

OutAttrs ⊆ Out (6)

Hence, implemented_by(q, ws, Context) holds. ✷

Theorem 1 (Soundness of Algorithm RelSrc). For every WebSource ws returned by the algorithm RelSrc for query q ∈ QUER, implemented_by(q, ws, Context) holds.

Proof: If ws is returned by the algorithm RelSrc for query q, then ws ∈ SSi for a relevant bucket Bi = (SSi, wsi) for q. So Bi is applicable for q, hence implemented_by(q, wsi, Context) holds. Then, by the Lemma, implemented_by(q, ws, Context) also holds. ✷

The algorithm RelSrc is not complete. Indeed, it is not always true that every WebSource ws for which implemented_by(q, ws, Context) holds, for a query q ∈ QUER, belongs to the set returned by the algorithm RelSrc for q. The following is a counterexample to completeness. Consider WebSources ws1, ws2 and ws3 such that:

webSource(ws1, [a, b], [c, f, g, d, h]).
webSource(ws2, [a, d, f], [c, b, g, h]).
webSource(ws3, [a, b, c], [f, g, d, h]).

The buckets are: B1 = (SS1, ws1), B2 = (SS2, ws2), and B3 = (SS3, ws3). For the above set of WebSources SS1 = {ws1}, SS2 = {ws2}, and SS3 = {ws1, ws3}. Note that implies(B1, B3) holds. The buckets are placed into two superbuckets SB1 = {B3}, and SB2 = {B1, B2}, where SB1 precedes SB2. Indeed, for any bucket bi in SB1 (bi = B3) there is a bucket bj in SB2 (namely, bj = B1) such that implies(bj, bi) holds.


Figure 2. AlternatePartition used by RelSrc (a) and another approach can find SS2 (b).

Consider now the query q where query(q, [a,b,c,d,f], [h]) holds. The first superbucket applicable for q is SB1. Thus algorithm RelSrc will return SS3 as the set of possible implementations for q. Meanwhile, a WebSource in SS2, namely ws2, could also implement q.
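The counterexample can be replayed with a short Python sketch of the definitions in Section 6.1. Two simplifications are assumptions of this sketch rather than part of RelSrc as specified: the level of a bucket is computed as the length of the longest implies chain below it, and the relevant superbucket is taken to be the highest layer containing an applicable bucket. The context is empty, and no two WebSources have identical descriptions, so the recursion in level terminates.

```python
from typing import Dict, Set, Tuple

# WebSources of the counterexample: name -> (input attributes, output attributes).
SOURCES: Dict[str, Tuple[Set[str], Set[str]]] = {
    "ws1": ({"a", "b"}, {"c", "f", "g", "d", "h"}),
    "ws2": ({"a", "d", "f"}, {"c", "b", "g", "h"}),
    "ws3": ({"a", "b", "c"}, {"f", "g", "d", "h"}),
}


def implies(w1: str, w2: str) -> bool:
    """Definition 1."""
    i1, o1 = SOURCES[w1]
    i2, o2 = SOURCES[w2]
    return i1 <= i2 and o2 <= o1 and (i2 | o2) <= (i1 | o1)


def directly_implies(w1: str, w2: str) -> bool:
    if not implies(w1, w2):
        return False
    return not any(w not in (w1, w2) and implies(w1, w) and implies(w, w2)
                   for w in SOURCES)


def bucket(w: str) -> Set[str]:
    """Definition 2: the set SS of sources that directly imply w."""
    return {x for x in SOURCES if directly_implies(x, w)}


def level(w: str) -> int:
    """Lowest possible layer of the bucket of w (0 is the bottom layer)."""
    below = [x for x in SOURCES if x != w and implies(x, w)]
    return 0 if not below else 1 + max(level(x) for x in below)


def implements(query: Tuple[Set[str], Set[str]], w: str) -> bool:
    bound, out = query
    inp, outp = SOURCES[w]
    return inp <= bound and out <= outp


def rel_src(query: Tuple[Set[str], Set[str]]) -> Set[str]:
    applicable = [w for w in SOURCES if implements(query, w)]
    if not applicable:
        return set()
    top = max(level(w) for w in applicable)   # assumed reading of "relevant"
    result: Set[str] = set()
    for w in applicable:
        if level(w) == top:
            result |= bucket(w)
    return result


q = ({"a", "b", "c", "d", "f"}, {"h"})
returned = rel_src(q)
all_implementations = {w for w in SOURCES if implements(q, w)}
missed = all_implementations - returned
print(sorted(returned), sorted(missed))
```

Under these assumptions the sketch returns SS3 = {ws1, ws3}, every member of which implements q (soundness), while ws2 also implements q but is not returned (incompleteness).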

To understand the reason for the incompleteness of algorithm RelSrc consider figure 2(a). The AlternatePartition considered by the algorithm RelSrc is built starting from the bottom layer, which is why both B1 and B2 are put in the same superbucket SB2. As SB1 is applicable for q and precedes SB2, B2 will not be considered by RelSrc. Meanwhile, if we built the superbuckets in a different way, namely, as specified in figure 2(b), the answer would be complete. Indeed, in this case neither superbucket would precede the other and the relevant superbucket SB1 would include all the implementations for q. However, because buckets are placed in the lowest possible position in AlternatePartition, the completeness of RelSrc does not hold.

Another approach, considered in Vidal (1999), completes AlternatePartition with extra buckets so that RelSrc can find all possible WebSource implementations. It is based on a complex procedure and we do not consider it in this paper.

7. Semantic query optimization in a mediator for WebSources

In what we considered so far we focused mainly on how to apply our generic definitions to characterize capability-based rewriting (CBR) in mediator systems. Now we are going to elaborate on some aspects of the practical applicability of our approach. We consider how it can be used to improve the efficiency of CBR. Semantic query optimization allows us to consider significant optimization opportunities in query processing with restricted capabilities. The general idea is as follows. First we specify some (meta)statements about properties of WebSources in a particular environment. Then we turn these statements into integrity constraints which can be used to improve the efficiency of the web query execution.

Possible examples of such statements are as follows:

• There is no WebSource providing a binding for a particular attribute.

• There is no WebSource providing the output of a particular domain.

• A (sub)set of WebSources does not provide the output of a particular domain.

• There is no WebSource implementing a given query.

Now we specify each of the above statements using the notation introduced in the first part of the paper:


Example 4. There is no WebSource supporting a binding for a particular attribute of a given relation. Let us assume that the relation name is freshdbbooks, and the attribute name is author. Then our statement can be specified in the following way:

← query((freshdbbooks, _), BAttrs, _), member(author, BAttrs), implemented_by((freshdbbooks, _), Ws, _).

Example 5. There is no WebSource providing the output of a particular domain. Assuming the domain name is aDom, the specification of this statement is as follows:

← query((Rel, _), _, OutAttrs), member(Attr1, OutAttrs), relationDomains(Rel, Attr1, aDom), implemented_by((Rel, _), Ws, _).

Example 6. A (sub)set of WebSources does not provide the output of a particular domain. In the case where the WebSources not providing the required output are [ws1, ws2, ws3], we specify this statement as follows:

← query((Rel, _), _, OutAttrs), member(Attr1, OutAttrs), relationDomains(Rel, Attr1, aDom), member(Ws, [ws1, ws2, ws3]), implemented_by((Rel, _), Ws, _).

Example 7. There is no WebSource implementing a given relation. Below is a specification of this statement for relation Rel:

← query((Rel, _), _, _), implemented_by((Rel, _), Ws, _).
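One way such denials could be used in practice is as a cheap filter that runs before any capability search. The Python sketch below illustrates this for statements in the style of Examples 4 and 7. The function names, the encoding of a scan query as a (relation, bound attributes, output attributes) triple, and the relation name discontinued are hypothetical and introduced only for this illustration.

```python
from typing import Callable, List, Tuple

# A scan query, reduced to what the constraints inspect:
# (relation name, bound attributes, output attributes).
Query = Tuple[str, List[str], List[str]]
Constraint = Callable[[Query], bool]   # True means the query violates the denial


def no_binding_for(relation: str, attribute: str) -> Constraint:
    """Denial in the style of Example 4: no WebSource accepts this binding."""
    def violated(q: Query) -> bool:
        rel, bound, _ = q
        return rel == relation and attribute in bound
    return violated


def no_source_for(relation: str) -> Constraint:
    """Denial in the style of Example 7: no WebSource implements the relation."""
    def violated(q: Query) -> bool:
        return q[0] == relation
    return violated


# Hypothetical constraint base for one environment.
CONSTRAINTS: List[Constraint] = [
    no_binding_for("freshdbbooks", "author"),
    no_source_for("discontinued"),
]


def worth_searching(q: Query) -> bool:
    """False if some integrity constraint already rules the query out."""
    return not any(violated(q) for violated in CONSTRAINTS)


by_author = ("freshdbbooks", ["author"], ["title"])
by_title = ("freshdbbooks", ["title"], ["author"])
print(worth_searching(by_author), worth_searching(by_title))
```

A query rejected by the filter never reaches the search for relevant WebSources; a query that passes is still subject to the ordinary implemented_by check.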

Thus, semantic query optimization may prevent query execution for unsafe (unimplementable) queries, and save a lot of useless capability search effort. The benefit of using the above statements as integrity constraints can also be quite large in terms of reducing the number of query execution plans which should be explored by a query optimizer for safe queries. To illustrate this, below we consider the number of query execution plans identified by the query optimizer when different binding patterns are used in the query. Following Florescu et al. (1999), for simplicity, we study only linear (chain) queries with binary relations of the following form:

$q \;\text{:-}\; R_1(x_0, x_1), R_2(x_1, x_2), \ldots, R_n(x_{n-1}, x_n), C.$

The predicate $R_i(x_j, x_k)$ denotes a database relation with attribute variables $x_j$, $x_k$. $C$ is a set of comparison predicates of the form $x_j = c_j$, where $x_j$ is an attribute variable that appears in the chain query, and $c_j$ is a constant. Each database relation is implemented by some WebSources. The database relations are associated with a set of WebSource access patterns, or binding patterns. Following Florescu et al. (1999) we depict binding patterns by superscripts on the attribute names of a relation. An attribute can be superscribed either with b (bound), or f (free). For example, the binding pattern $R_i(x^b, y^f)$ specifies that values of x must be bound to obtain tuples of $R_i$.

Table 1. Number of safe query plans for different WebSource access patterns.

Bindings (col 1)                                        Total plans (col 2)             LL plans (col 3)
                                                        (Florescu et al., 1999)         (Florescu et al., 1999)
------------------------------------------------------------------------------------------------------------
$R_i(u^f, v^f)$, $i = 1, \ldots, n$                     $\binom{2(n-1)}{n-1}(n-1)!$     $n!$
$R_i(u^b, v^f)$, $R_i(u^f, v^f)$, $i = 1, \ldots, n$    $\binom{2(n-1)}{n-1}\,n^n$      $n^{n+1}$

In Table 1 we consider some of the binding patterns presented in Florescu et al. (1999). The values in column 2 represent all safe total query execution plans (bushy, with cartesian products). Column 3 represents safe left-linear plans commonly considered by a dynamic programming optimization algorithm. The values in columns 2 and 3 are taken from Florescu et al. (1999).

In general, using the integrity constraint: there is no WebSource accepting a particular binding for a given attribute, we can reduce the number of safe plans considered by the optimizer from $\binom{2(n-1)}{n-1}\,n^n$ to $\binom{2(n-1)}{n-1}(n-1)!$.

If we consider only left-linear plans, as in a dynamic programming algorithm, then instead of considering $n^{n+1}$ plans, the optimizer will focus only on $n!$ plans.
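To get a feel for the size of this reduction, the four expressions of Table 1 can be evaluated for small n. The Python sketch below simply computes the formulas quoted from Florescu et al. (1999); the function names are introduced here for illustration.

```python
from math import comb, factorial


def total_plans_mixed(n: int) -> int:
    """Total safe plans when both bound-free and free-free patterns exist."""
    return comb(2 * (n - 1), n - 1) * n ** n


def total_plans_free(n: int) -> int:
    """Total safe plans when only the free-free pattern remains."""
    return comb(2 * (n - 1), n - 1) * factorial(n - 1)


def ll_plans_mixed(n: int) -> int:
    """Left-linear safe plans with both patterns."""
    return n ** (n + 1)


def ll_plans_free(n: int) -> int:
    """Left-linear safe plans with the free-free pattern only."""
    return factorial(n)


for n in (3, 5):
    print(n, total_plans_mixed(n), total_plans_free(n),
          ll_plans_mixed(n), ll_plans_free(n))
```

For a chain of n = 5 relations this gives 218750 versus 1680 total plans, and 15625 versus 120 left-linear plans.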

8. Summary

We defined a generic logic-based specification for Capability-Based Rewriting in web query processing. This specification is based on a core specification that was extended in several ways. We showed how our definitions can be used to characterise existing CBR tools. As an example we considered a CBR tool MedWebSrc, a Web mediator system developed at the University of Maryland. We showed how to apply Semantic Query Optimization techniques in this framework to improve the search ability of CBR tools and the performance of the web query optimizer. In particular, for the case of linear chain queries we showed how this method can reduce the number of possible safe execution plans.

There are open research issues which will be important to consider in future work. These include introducing a generalized taxonomy of meta-constraints about the Web environment and defining which of them might be useful, typical, or realistic. For example, is it possible to contribute to cost-based optimization using cost-based constraints like “there is no WebSource with a cost less than A”? Other kinds of properties may characterize the availability and quality of WebSources, as well as different “semantics of content”-based issues for WebSources. In this paper we considered only the commonly used conjunctive queries. An interesting research issue is the exploration of more general classes of queries, such as disjunctive queries, queries with negation and with aggregate functions.

Acknowledgments

We wish to thank Louiqa Raschid, Maria Esther Vidal, and the referees for many helpful comments.

Note

1. In this paper we consider only such input/output restrictions as limitations for query capabilities.

References

ACM Digital Library. http://www.acm.org/dl/Search.html.

Adali, S., Candan, K.S., Papakonstantinou, Y., and Subrahmanian, V.S. (1996). Query Caching and Optimization in Distributed Mediator Systems. In Proc. of the ACM SIGMOD Conference.

Chakravarthy, U.S., Grant, J., and Minker, J. (1990). Logic-Based Approach to Semantic Query Optimization, ACM TODS, 15(2), 162–207.

Cheng, Q., Gryz, J., Koo, F., Leung, C., Liu, L., Qian, X., and Schiefer, B. (1999). Implementation of Two Semantic Query Optimization Techniques in DB2 Universal Database. In Proc. of VLDB (pp. 687–698).

Du, W., Krishnamurthy, R., and Shan, M.C. (1992). Query Optimization in a Heterogeneous DBMS. In Proc. of the Very Large Data Bases Conference (VLDB).

FAA Aviation Safety Data. http://nasdac.faa.gov/internet/.

Florescu, D., Levy, A.Y., Manolescu, I., and Suciu, D. (1999). Query Optimization in the Presence of Limited Access Patterns. In Proc. of the ACM SIGMOD Conference.

Haas, L., Kossmann, D., Wimmers, E., and Yang, J. (1997). Optimizing Queries Across Diverse Data Sources. In Proc. of VLDB.

Lakshmanan, L.V.S. and Missaoui, R. (1992). On Semantic Query Optimization in Deductive Databases. In Proc. of IEEE ICDE (pp. 368–375).

Landings Aviation Search Engines. http://www.landings.com/ landings/pages/search.html.

Layman A. et al. The XML-data home page. http://www.microsoft.com/standards/xml/xmldata-f.htm.

Levy, A.Y., Rajaraman, A., and Ordille, J.J. (1996). Querying Heterogeneous Information Sources Using Source Descriptions. In Proc. of VLDB.

Levy, A.Y. and Sagiv, Y. (1995). Semantic Query Optimization in Datalog Programs. In Proc. of PODS (pp. 163–173).

Haas L., Tork Roth, M., and Ozcan, F. (1999). Cost Models Do Matter: Providing Cost Information for Diverse Data Sources in a Federated System. In Proc. of VLDB.

Morris, K. (1988). An Algorithm for Ordering Subgoals in NAIL! In Proc. of PODS.

Naacke, H., Gardarin, G., and Tomasic, A. (1998). Leveraging Mediator Cost Models with Heterogeneous Data Sources. In Proc. of ICDE.

Bureau of Labor Statistics. http://stats.bls.gov.

Pang, H.H., Lu, H.J., and Ooi, B.C. (1991). An Efficient Semantic Query Optimization Algorithm. In Proc. of IEEE ICDE (pp. 326–335).

Papakonstantinou, Y. and Vassalos, V. (1999). Query Rewriting for Semistructured Data. In Proc. of the ACM SIGMOD Conference.

Ramakrishnan, R., Seshadri, P., and Livny, M. (1997). The Case for Enhanced Abstract Data Types. In Proc. of VLDB.

Seshadri, P., et al. (1996). Cost-Based Optimization for Magic: Algebra and Implementation. In Proc. of the ACM SIGMOD Conference.

Tomasic, A., Raschid, L., and Valduriez, P. (1996). Scaling Heterogeneous Databases and the Design of Disco. In Proceedings of the Intl. Conf. on Distributed Computing Systems.

Vassalos, V. and Papakonstantinou, Y. (1998). Using Knowledge of Redundancy for Query Optimization in Mediators. In Proc. of the AAAI Symposium on AI and Data Integration.

Vidal, M.E. (1999). A Mediator for Scaling up to Multiple WebSources. Ph.D. Thesis, University Simon Bolivar.

Vidal, M.E. and Raschid, L. (1998). Websrcmed: A mediator for Scaling up to Multiple Web Accessible Sources (Websources). ftp://www.umiacs.umd.edu/pub/mvidal/websrcmed.ps. (under review).

Zadorozhny, V., Vidal, M.E., Raschid, L., and Urhan, T. (2001). Efficient Evaluation of Queries in a Mediator for Websources. Under review.