SECTIONS 21.4 – 21.5 Sanuja Dabade & Eilbroun Benjamin CS 257 – Dr. TY Lin INFORMATION INTEGRATION

Presentation Outline

  • 21.4 Capability-Based Optimization
    • 21.4.1 The Problem of Limited Source Capabilities
    • 21.4.2 A Notation for Describing Source Capabilities
    • 21.4.3 Capability-Based Query-Plan Selection
    • 21.4.4 Adding Cost-Based Optimization
  • 21.5 Optimizing Mediator Queries
    • 21.5.1 Simplified Adornment Notation
    • 21.5.2 Obtaining Answers for Subgoals
    • 21.5.3 The Chain Algorithm
    • 21.5.4 Incorporating Union Views at the Mediator

21.4 Capability-Based Optimization: Introduction

  • A typical DBMS estimates the cost of each query plan and picks the one it believes to be the best.
  • A mediator often has little knowledge of how long its sources will take to answer a query.
  • Optimization of mediator queries therefore cannot rely on cost measures alone to select a query plan.
  • Instead, the mediator follows capability-based optimization: it first determines which plans the sources are able to execute at all.

21.4.1 The Problem of Limited Source Capabilities

  • Many sources have only Web-based interfaces.
  • Web sources usually allow querying only through a query form.
    • E.g. the Amazon.com interface allows us to query about books in many different ways.
    • But we cannot ask questions that are too general, e.g. SELECT * FROM books;

21.4.1 The Problem of Limited Source Capabilities (cont'd)

Reasons why a source may limit the ways in which queries can be asked:

  • The earliest databases did not use a relational DBMS that supports SQL queries.
  • Indexes on a large database may make certain queries feasible, while others are too expensive to execute.
  • Security reasons.
    • E.g. a medical database may answer queries about averages, but will not disclose the details of a particular patient's information.

21.4.2 A Notation for Describing Source Capabilities

  • For relational data, the legal forms of queries are described by adornments.
  • Adornments are sequences of codes that represent the requirements for the attributes of the relation, in their standard order:
    • f (free): the attribute can be specified or not.
    • b (bound): a value must be specified for the attribute, but any value is allowed.
    • u (unspecified): it is not permitted to specify a value for the attribute.

21.4.2 A Notation for Describing Source Capabilities (cont'd)

    • c[S] (choice from set S): a value must be specified, and that value must be from the finite set S.
    • o[S] (optional from set S): either no value is specified, or the specified value is from the finite set S.
    • A prime (e.g. f') indicates that the attribute is not part of the output of the query.
  • A capabilities specification is a set of adornments.
  • A query must match one of the adornments in the source's capabilities specification.

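21.4.2 A Notation for Describing Source Capabilities (cont'd)

  • E.g. Dealer 1 is a source of data in the form:
      Cars(serialNo, model, color, autoTrans, navi)
  • One of its query forms has the adornment b'uuuu: the user must supply a serial number, which is not repeated in the output, and may not specify any of the other four attributes.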
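
As a rough illustration of how a capabilities specification could be checked mechanically, the sketch below parses an adornment string and tests whether a source query supplies legal bindings. The function names and the textual encoding (a prime written directly after the code letter, None for an attribute left unspecified) are assumptions made for this example, not part of the notation itself.

```python
import re


def parse_adornment(text):
    """Split an adornment such as "b'uuuu" or "uc[autoTrans, navi]" into
    (code, allowed_set, primed) triples, one per attribute."""
    parts = []
    for code, prime, values in re.findall(r"([fbuco])(')?(?:\[([^\]]*)\])?", text):
        allowed = {v.strip() for v in values.split(",")} if code in "co" else None
        parts.append((code, allowed, bool(prime)))
    return parts


def query_is_legal(adornment, bindings):
    """bindings holds one entry per attribute; None means 'not specified'."""
    parts = parse_adornment(adornment)
    if len(parts) != len(bindings):
        return False
    for (code, allowed, _primed), value in zip(parts, bindings):
        if code == "b" and value is None:
            return False                      # bound: a value is required
        if code == "u" and value is not None:
            return False                      # unspecified: no value permitted
        if code == "c" and (value is None or value not in allowed):
            return False                      # choice: required, and from S
        if code == "o" and value is not None and value not in allowed:
            return False                      # optional: if given, must be from S
    return True                               # f places no restriction


def legal_for_source(specification, bindings):
    """A query is legal if it matches at least one adornment of the source."""
    return any(query_is_legal(a, bindings) for a in specification)
```

For instance, `query_is_legal("b'uuuu", ["12345", None, None, None, None])` accepts a query that binds only the serial number, while binding only the model is rejected. The serial number "12345" is a made-up value.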

21.4.3 Capability-Based Query-Plan Selection

  • Given a query at the mediator, a capability-based query optimizer first considers what queries it can ask at the sources to help answer the query.
  • The answers to those queries supply bindings for more variables, which may make further source queries possible.
  • The process is repeated until either:
    • Enough queries have been asked at the sources to resolve all the conditions of the mediator query, so the query is answered. Such a plan is called feasible.
    • No more valid forms of source queries can be constructed, yet the mediator query still cannot be answered. In that case the query is impossible.

21.4.3 Capability-Based Query-Plan Selection (cont'd)

  • The simplest form of mediator query for which we need to apply the above strategy is a join of relations.
  • E.g. we have two sources for dealer 2:
      Autos(serial, model, color)
      Options(serial, option)
  • Suppose that ubf is the sole adornment for Autos, and that Options has two adornments, bu and uc[autoTrans, navi].
  • The query is: find the serial numbers and colors of Gobi models with a navigation system.
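
Two plans that respect these adornments can be sketched as follows. This is only an illustration: the source contents, the second model name and all function names are invented for the example, and the sources are simulated by in-memory lists rather than real query forms.

```python
# Hypothetical contents of the two dealer-2 sources (not taken from the slides).
AUTOS = [("s1", "Gobi", "red"), ("s2", "Gobi", "blue"), ("s3", "Dune", "red")]
OPTIONS = [("s1", "navi"), ("s1", "autoTrans"), ("s2", "autoTrans"), ("s3", "navi")]


def autos_by_model(model):
    """Source query matching adornment ubf: model bound, serial not specified."""
    return [(serial, color) for serial, m, color in AUTOS if m == model]


def options_by_serial(serial):
    """Source query matching adornment bu: serial bound, option not specified."""
    return [(s, opt) for s, opt in OPTIONS if s == serial]


def options_by_option(option):
    """Source query matching adornment uc[autoTrans, navi]."""
    if option not in {"autoTrans", "navi"}:
        raise ValueError("option must be chosen from the source's set")
    return [(s, opt) for s, opt in OPTIONS if opt == option]


def plan_autos_then_options():
    """Ask Autos for Gobis, then ask Options once per serial number (bu)."""
    result = []
    for serial, color in autos_by_model("Gobi"):
        if any(opt == "navi" for _, opt in options_by_serial(serial)):
            result.append((serial, color))
    return result


def plan_join_at_mediator():
    """Ask each source one query and join the answers at the mediator."""
    navi_serials = {s for s, _ in options_by_option("navi")}
    return [(s, c) for s, c in autos_by_model("Gobi") if s in navi_serials]
```

Both plans return the same answer on the sample data. They differ in how many source queries they issue, which is the kind of difference a cost-based step would have to weigh.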

21.4.4 Adding Cost-Based Optimization

  • The mediator's query optimizer is not done once the capabilities of the sources have been examined.
  • Having found the feasible plans, it must choose among them.
  • Making an intelligent, cost-based choice requires that the mediator know a great deal about the costs of the queries involved.
  • Because the sources are independent of the mediator, those costs are difficult to estimate.

21.5 Optimizing Mediator Queries

  • Chain algorithm: a greedy algorithm that finds a way to answer the query by sending a sequence of requests to its sources.
    • It will always find a solution, assuming at least one solution exists.
    • The solution may not be optimal.

21.5.1 Simplified Adornment Notation

  • A query at the mediator is limited to b (bound) and f (free) adornments.
  • We use the following convention for describing adornments:
      name^adornments(attributes)
    where:
    • name is the name of the relation, and the adornments are written as a superscript on it (shown here with ^);
    • the number of adornments equals the number of attributes.

21.5.2 Obtaining Answers for Subgoals

Rules for subgoals and sources:

  • Suppose we have the following subgoal:
      R^x1x2…xn(a1, a2, …, an),
    and a source adornment for R is y1y2…yn.
  • The adornment on the subgoal matches the adornment at the source if, for every position i:
    • If yi is b or c[S], then xi = b.
    • If xi = f, then yi is not output restricted (not primed).
    • If yi is f, u, or o[S], then xi may be either b or f.
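
The matching rules can be expressed as a short check. In this sketch the source adornment is given as a list of code strings such as ["c'", "f"]; that encoding and the function name are assumptions for illustration. The sets S are left out because only the code letter and the prime affect the match.

```python
def subgoal_matches_source(subgoal, source):
    """subgoal is a string over 'b'/'f'; source is a list such as ["c'", "f"].

    Returns True when the subgoal adornment matches the source adornment.
    """
    if len(subgoal) != len(source):
        return False
    for x, y in zip(subgoal, source):
        code, primed = y[0], y.endswith("'")
        if code in "bc" and x != "b":
            return False      # the source demands a binding the mediator lacks
        if x == "f" and primed:
            return False      # the mediator needs this value, but it is not output
        # f, u and o[S] at the source accept either b or f in the subgoal
    return True
```

With S having source adornment c'[2,3,5]f, the call `subgoal_matches_source("ff", ["c'", "f"])` fails, while `subgoal_matches_source("bf", ["c'", "f"])` succeeds.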

21.5.3 The Chain Algorithm

  • The algorithm maintains two types of information:
    • An adornment for each subgoal.
    • A relation X that is the join of the relations for all the subgoals that have been resolved.
  • Initially, the adornment for a subgoal has b in a position if and only if the mediator query provides a constant binding for the corresponding argument of that subgoal.
  • Initially, X is a relation over no attributes, containing just an empty tuple.

21.5.3 The Chain Algorithm (cont'd)

  • First, initialize the adornments of the subgoals and X.
  • Then, repeatedly select a subgoal that can be resolved. Let R^α(a1, a2, …, an) be the subgoal:

  1. Wherever α has a b, the corresponding argument of R is either a constant or a variable in the schema of X. Project X onto those of its variables that appear in R.

21.5.3 The Chain Algorithm (cont'd)

  2. For each tuple t in the projection of X, issue a query to the source as follows (β is a source adornment matched by α):
     a. If a component of β is b, then the corresponding component of α is b, and we can use the corresponding component of t (or the constant in the subgoal) for the source query.
     b. If a component of β is c[S], and the corresponding component of t is in S, then the corresponding component of α is b, and we can use the corresponding component of t for the source query.
     c. If a component of β is f, and the corresponding component of α is b, provide the known value as a constant in the source query.

21.5.3 The Chain Algorithm (cont'd)

     d. If a component of β is u, then provide no binding for this component in the source query.
     e. If a component of β is o[S], and the corresponding component of α is f, then treat it as if it were an f.
     f. If a component of β is o[S], and the corresponding component of α is b, then treat it as if it were c[S].

  3. Every variable among a1, a2, …, an is now bound. For each remaining unresolved subgoal, change its adornment so that any position holding one of these variables is b.

21.5.3 The Chain Algorithm (cont'd)

  4. Replace X with X ⋈ π_S(R), where S is the set of all variables among a1, a2, …, an.

  5. Project out of X all components that correspond to variables that do not appear in the head or in any unresolved subgoal.

  • If every subgoal is resolved, then X is the answer.
  • If some subgoal remains unresolved and no further subgoal can be resolved, then the algorithm fails.

21.5.3 The Chain Algorithm Example

  • Mediator query:
      Q: Answer(c) ← R^bf(1,a) AND S^ff(a,b) AND T^ff(b,c)
  • Sources:

      Relation     R         S             T
      Adornment    bf        c'[2,3,5]f    bu
      Attributes   (w, x)    (x, y)        (y, z)
      Data         (1, 2)    (2, 4)        (4, 6)
                   (1, 3)    (3, 5)        (5, 7)
                   (1, 4)                  (5, 8)

21.5.3 The Chain Algorithm Example (cont'd)

  • Initially, the adornments on the subgoals are the same as in Q, and X contains an empty tuple.
  • S and T cannot be resolved yet because they each have the adornment ff, but their sources require the first argument to be bound (c for S, b for T).
  • R(1,a) can be resolved because its adornment bf is matched by the source's adornment.
  • Send the query R(w,x) with w = 1 to the source; it returns the three R tuples listed on the previous slide.

21.5.3 The Chain Algorithm Example (cont'd)

  • Project the subgoal's relation onto its second component, since only the second component of R(1,a) is a variable.
  • This is joined with X, so X becomes equal to this relation:

      a
      2
      3
      4

  • Change the adornment on S from ff to bf.

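21.5.3 The Chain Algorithm Example (cont'd)

  • Now we resolve S^bf(a,b):
    • Project X onto a, which leaves X unchanged.
    • For each value of a in X that belongs to the choice set {2, 3, 5}, ask the source for S for the tuples with that first component. The value 4 is not in the set, so no query is sent for it.
    • The relation obtained for the subgoal is:

        a  b
        2  4
        3  5

    • Join this relation with X, and remove a because it appears neither in the head nor in any unresolved subgoal:

        b
        4
        5

    • Change the adornment on T from ff to bf.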

21.5.3 The Chain Algorithm Example (cont'd)

  • Now we resolve T^bf(b,c). The source returns:

      b  c
      4  6
      5  7
      5  8

  • Join this relation with X and project onto the c attribute to get the relation for the head.
  • The solution is {(6), (7), (8)}.
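
The worked example can be reproduced with a compact sketch of the chain algorithm. Several simplifications are assumptions of this sketch rather than part of the algorithm as presented: sources are simulated by in-memory tuple lists with a single adornment each, variables are written as strings and constants as non-strings, no variable repeats within one subgoal, and one source query is issued per tuple of X instead of per tuple of its projection (the resulting join is the same).

```python
def matches(alpha, beta):
    """True if subgoal adornment alpha (list of 'b'/'f') fits source adornment beta."""
    for x, (code, _allowed, primed) in zip(alpha, beta):
        if code in "bc" and x != "b":
            return False          # source requires a binding the mediator lacks
        if x == "f" and primed:
            return False          # mediator needs the value but it is not output
    return True


def ask_source(args, beta, data, t):
    """Simulate one source query for tuple t of X; return variable bindings."""
    sent, known = [], []          # (position, value) pairs
    for i, (arg, (code, allowed, _primed)) in enumerate(zip(args, beta)):
        value = t.get(arg) if isinstance(arg, str) else arg
        if value is None:
            continue              # nothing known for this position
        known.append((i, value))
        if code in "co" and value not in allowed:
            return []             # value outside S: this query cannot be asked
        if code != "u":
            sent.append((i, value))   # b, c, f, o: pass the binding to the source
    answers = [row for row in data if all(row[i] == v for i, v in sent)]
    # Values that could not be sent (code u) are filtered at the mediator.
    answers = [row for row in answers if all(row[i] == v for i, v in known)]
    return [{a: row[i] for i, a in enumerate(args) if isinstance(a, str)}
            for row in answers]


def chain(head, subgoals, sources):
    X = [{}]                      # one empty tuple over no attributes
    unresolved, bound = list(subgoals), set()
    while unresolved:
        for goal in unresolved:   # greedily pick the first resolvable subgoal
            name, args = goal
            beta, data = sources[name]
            alpha = ["f" if isinstance(a, str) and a not in bound else "b"
                     for a in args]
            if matches(alpha, beta):
                break
        else:
            return None           # a subgoal remains and none can be resolved
        unresolved.remove(goal)
        joined = [{**t, **r} for t in X for r in ask_source(args, beta, data, t)]
        bound.update(a for a in args if isinstance(a, str))
        keep = set(head) | {a for _, rest in unresolved for a in rest
                            if isinstance(a, str)}
        projected = {tuple(sorted((v, row[v]) for v in row if v in keep))
                     for row in joined}
        X = [dict(items) for items in projected]
    return sorted(tuple(row[v] for v in head) for row in X)


# Source adornments as (code, S, primed) triples, with the example's data.
SOURCES = {
    "R": ([("b", None, False), ("f", None, False)], [(1, 2), (1, 3), (1, 4)]),
    "S": ([("c", {2, 3, 5}, True), ("f", None, False)], [(2, 4), (3, 5)]),
    "T": ([("b", None, False), ("u", None, False)], [(4, 6), (5, 7), (5, 8)]),
}
SUBGOALS = [("R", (1, "a")), ("S", ("a", "b")), ("T", ("b", "c"))]
ANSWER = chain(["c"], SUBGOALS, SOURCES)
```

On the example data, `ANSWER` holds the tuples (6,), (7,) and (8,). If the subgoal R(1,a) is removed, `chain` returns None because neither S nor T ever obtains the binding it needs.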

21.5.4 Incorporating Union Views at the Mediator

  • The Chain Algorithm as described so far does not consider that several sources can contribute tuples to the same relation.
  • When some sources hold tuples that other sources do not have, resolving a subgoal becomes more complex.
  • There are two ways to handle this: consult all sources, or make a best effort to return as many answers as possible.

21.5.4 Incorporating Union Views at the Mediator (cont'd)

  • Consulting All Sources
    • We can resolve a subgoal only when each source for its relation has an adornment matched by the current adornment of the subgoal.
    • Less practical, because it makes queries harder to answer, and impossible to answer if any source is down.
  • Best Efforts
    • We need only one source with a matching adornment to resolve a subgoal.
    • The chain algorithm must be modified to revisit each subgoal whenever more of that subgoal's arguments become bound.
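
The difference between the two policies comes down to whether all sources or just one must match. A minimal sketch follows, assuming each source contributes one adornment, encoded as a list of code strings with the sets S omitted. The function names and the policy labels are hypothetical.

```python
def matches(alpha, beta):
    """alpha: string over 'b'/'f'; beta: list of codes such as ["c'", "f"]."""
    return all((y[0] not in "bc" or x == "b") and not (x == "f" and y.endswith("'"))
               for x, y in zip(alpha, beta))


def can_resolve(alpha, source_adornments, policy):
    """Decide whether a subgoal over a union view can be resolved now."""
    hits = [matches(alpha, beta) for beta in source_adornments]
    if policy == "all":
        return all(hits)          # every source must accept the current bindings
    if policy == "best_effort":
        return any(hits)          # one matching source is enough
    raise ValueError("policy must be 'all' or 'best_effort'")
```

With two sources whose adornments are bf and fb, a subgoal adorned bf can be resolved under best efforts but not when all sources must be consulted. Once both arguments are bound (bb), both policies succeed, which is why the best-efforts variant revisits subgoals as bindings accumulate.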

Questions