Recursive query plans for Data Integration Oliver Michael By Rajesh Kanisetti


Recursive query plans for Data Integration

Oliver Michael

By Rajesh Kanisetti


Introduction

• Information integration systems
– Interaction with a uniform interface.
– A set of virtual relation names to formulate queries.
– The actual data is stored in external sources.
– A mapping between the virtual and the source relations.

• e.g.) db1(P,A) :- paper(P), author(P,A), ai(A)
– paper, author and ai are virtual relations.
– db1 is a source relation.

– Query rewriting
• Translating a query over the virtual relations into a query that mentions only the source relations.


Query Rewriting Problem

• An equivalent rewriting of the query
– Available sources may not contain all the information needed to answer a query.
– Maximally-contained rewritings
• e.g.) Ask for all papers by Computer Science researchers. q(P,Y,A) :- db(P,Y,A)

• Limitations on the binding patterns
– A name server of an institution will provide the address for a given name.

• Functional dependencies
– The year of a conference functionally determines its location.


Datalog

• Main expressive advantage: recursive queries.

• More convenient for analysis: papers look better.

• Without recursion but with negation it is equivalent in power to relational algebra

• Has affected real practice: (e.g., recursion in SQL3, magic sets transformations).


Datalog Concepts

• Atoms
• Datalog rules, datalog programs
• EDB predicates, IDB predicates
• Conjunctive queries
• Recursion
• Built-in predicates
• Negated atoms, stratified programs.
• Semantics: least fixpoint.


Predicates and Atoms

- Relations are represented by predicates.
- Tuples are represented by atoms.

Purchase( “joe”, “bob”, “Nike Town”, “Nike Air”, 2/2/98)

- arithmetic, built-in, atoms:

X < 100, X+Y+5 > Z/2

- negated atoms:

NOT Product(“Brooklyn Bridge”, $100, “Microsoft”)


Datalog Rules and Queries

A pure datalog rule has the following form:

head :- atom1, atom2, …, atomn

where all the atoms are non-negated and relational.

BritishProduct(X) :- Product(X,Y,P) & Company(P, “UK”, SP)

A datalog program is a set of datalog rules.
A program with a single rule is a conjunctive query.

We distinguish EDB predicates and IDB predicates:
• EDB’s are stored in the database, appear only in the bodies.
• IDB’s are intensionally defined, appear in both bodies and heads.


The Meaning of Datalog Rules

Repeat the following until you cannot derive any new facts:

Consider every assignment from the variables in the body to the constants in the database. If each of the atoms in the body is made true by the assignment, then add the tuple for the head into the relation of the head.

Start with the facts in the EDB and iteratively derive facts for IDBs.


Transitive Closure

Suppose we are representing a graph by a relation Edge(X,Y):

Edge(a,b), Edge(a,c), Edge(b,d), Edge(c,d), Edge(d,e)

[Figure: directed graph over the nodes a, b, c, d, e with the edges listed above]

I want to express the query:

Find all nodes reachable from a.


Recursion in Datalog

Path( X, Y ) :- Edge( X, Y ).
Path( X, Y ) :- Path( X, Z ), Path( Z, Y ).

Semantics: evaluate the rules until a fixed point:

Iteration #0: Edge: {(a,b), (a,c), (b,d), (c,d), (d,e)} Path: {}
Iteration #1: Path: {(a,b), (a,c), (b,d), (c,d), (d,e)}
Iteration #2: Path gets the new tuples: (a,d), (b,e), (c,e)
Iteration #3: Path gets the new tuple: (a,e)
Iteration #4: Nothing changes -> We stop.

Note: the number of iterations depends on the data. It cannot be anticipated by only looking at the query!
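The iteration trace above can be reproduced with a naive bottom-up evaluation. The following Python sketch is not part of the slides; the function name `path_fixpoint` and the encoding of relations as sets of pairs are assumptions made for illustration.

```python
def path_fixpoint(edges):
    """Naively evaluate
         Path(X, Y) :- Edge(X, Y).
         Path(X, Y) :- Path(X, Z), Path(Z, Y).
    Returns the Path relation and the tuples added in each iteration."""
    path = set()
    added_per_iteration = []
    while True:
        # Apply both rules to the Path relation of the previous iteration.
        derived = set(edges) | {
            (x, y) for (x, z) in path for (z2, y) in path if z == z2
        }
        new = derived - path
        if not new:  # fixed point reached: nothing changes
            return path, added_per_iteration
        added_per_iteration.append(new)
        path |= new


edges = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")}
path, added = path_fixpoint(edges)

# "Find all nodes reachable from a."
reachable_from_a = {y for (x, y) in path if x == "a"}
```

Here `added` holds one set of new tuples per productive iteration, so its length is the data-dependent iteration count mentioned in the note.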


Built-in Predicates

Rules may include atoms with built-in predicates:

ExpensiveProduct(X) :- Product(X,Y,P) & P > $100

But: we need to restrict the use of built-in atoms in rules.

P(X) :- R(X) & X<Y

What does this mean?

We could use active domain semantics, but that’s problematic.

Hence, we require that every variable that appears in a built-in atom also appears in a relational atom.
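The restriction can be checked mechanically. The sketch below is an illustration only: the rule encoding (a body as a list of `(predicate, arguments)` pairs), the set `BUILTINS`, and the convention that variables are capitalized strings are all assumptions, not something the slides prescribe.

```python
BUILTINS = {"<", "<=", ">", ">=", "=", "!="}  # assumed set of built-in predicates


def is_var(term):
    # Assumed convention: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()


def unrestricted_vars(body):
    """Return the variables of built-in atoms that occur in no relational atom."""
    relational = {t for pred, args in body if pred not in BUILTINS
                  for t in args if is_var(t)}
    builtin = {t for pred, args in body if pred in BUILTINS
               for t in args if is_var(t)}
    return builtin - relational


# ExpensiveProduct(X) :- Product(X,Y,P) & P > 100
ok_rule = [("Product", ("X", "Y", "P")), (">", ("P", 100))]
# P(X) :- R(X) & X < Y
bad_rule = [("R", ("X",)), ("<", ("X", "Y"))]
```

A rule body is acceptable under the requirement exactly when `unrestricted_vars` returns the empty set; for `bad_rule` it reports the variable `Y`.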


Negated Subgoals

Rules may include negated subgoals, but in restricted forms:

P(X,Y) :- Between(X,Y,Z) & NOT Direct(X,Z)

Bad:

P(X, Y) :- R(X) & NOT S(Y)

Bad but ok:

P(X) :- R(X) & NOT S(X,Y)

We’ll rewrite as:

S’(X) :- S(X,Y)
P(X) :- R(X) & NOT S’(X)


Relations and Queries

• A function-free Horn rule

p(X) :- p1(X1) & p2(X2) & … & pn(Xn),

– p and p1, …, pn are relation names, and X, X1, …, Xn are tuples of variables and constants.
– Every variable appearing in X also appears in X1, …, Xn.
– p(X): the head of the rule; p1(X1), …, pn(Xn): the body of the rule.
– The base relations appear only in the bodies, not in the heads, of the rules.
– A dependency graph
• Nodes are the relations appearing in the rules.
• There is an arc from the node of relation pi to the node of relation p if pi appears in the body of a rule whose head is p.
– A set of rules is recursive if there is a cycle in the dependency graph.
– A query is a set of function-free Horn rules.
– A conjunctive query is a single non-recursive Horn rule.
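The dependency-graph test for recursion can be sketched in a few lines of Python. This is an illustrative sketch under assumed conventions: a rule is reduced to `(head_relation, [body_relations])`, and the helper names are hypothetical.

```python
def dependency_graph(rules):
    """rules: list of (head_relation, [body_relations]).
    Returns a mapping node -> set of successor nodes."""
    graph = {}
    for head, body in rules:
        graph.setdefault(head, set())
        for rel in body:
            graph.setdefault(rel, set()).add(head)  # arc from pi to p
    return graph


def is_recursive(rules):
    """True if the dependency graph of the rules contains a cycle."""
    graph = dependency_graph(rules)
    state = {}  # node -> "open" (on the DFS stack) or "done"

    def visit(node):
        state[node] = "open"
        for succ in graph[node]:
            if state.get(succ) == "open":
                return True  # back edge: a cycle exists
            if succ not in state and visit(succ):
                return True
        state[node] = "done"
        return False

    return any(node not in state and visit(node) for node in graph)


# Path(X,Y) :- Edge(X,Y).   Path(X,Y) :- Path(X,Z), Path(Z,Y).
path_rules = [("Path", ["Edge"]), ("Path", ["Path", "Path"])]
# answer(X) :- parent(X,Z), parent(Z,ann).  parent :- v1.  parent :- v2.
grandparent_plan = [("answer", ["parent", "parent"]),
                    ("parent", ["v1"]), ("parent", ["v2"])]
```

The Path program has a self-loop on `Path` and is therefore recursive, while the grandparent plan is not.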


Query Containment

• Given two queries q1 and q2, q1 is contained in q2 if for every database D, q1(D) ⊆ q2(D), where q(D) is the result of evaluating query q on D.
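For conjunctive queries without built-in or negated atoms, containment can be decided with the standard canonical-database test: freeze the variables of q1 into fresh constants, evaluate q2 on the frozen body of q1, and check whether the frozen head of q1 is among the answers. The slides do not describe this test; the Python sketch below is an illustration with assumed encodings (a query is `(head_args, body)`, variables are capitalized strings) and hypothetical example queries.

```python
def is_var(term):
    # Assumed convention: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()


def matches(body, facts, env):
    """Yield every variable assignment that maps all body atoms onto facts."""
    if not body:
        yield env
        return
    (pred, args), rest = body[0], body[1:]
    for fpred, fargs in facts:
        if fpred != pred or len(fargs) != len(args):
            continue
        new_env = dict(env)
        if all((new_env.setdefault(a, v) if is_var(a) else a) == v
               for a, v in zip(args, fargs)):
            yield from matches(rest, facts, new_env)


def contained_in(q1, q2):
    """True if conjunctive query q1 is contained in conjunctive query q2."""
    (head1, body1), (head2, body2) = q1, q2

    def freeze(term):
        # Turn a variable of q1 into a fresh constant.
        return ("frozen", term) if is_var(term) else term

    canonical_db = {(p, tuple(freeze(a) for a in args)) for p, args in body1}
    target = tuple(freeze(a) for a in head1)
    return any(tuple(env[a] if is_var(a) else a for a in head2) == target
               for env in matches(body2, canonical_db, {}))


# q(X) :- parent(X,Z), parent(Z,ann), male(X)
grandfathers = (("X",), [("parent", ("X", "Z")), ("parent", ("Z", "ann")),
                         ("male", ("X",))])
# q(X) :- parent(X,Z), parent(Z,ann)
grandparents = (("X",), [("parent", ("X", "Z")), ("parent", ("Z", "ann"))])
```

With these two queries, `grandfathers` is contained in `grandparents` but not the other way round.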


Functional Dependencies

• A relation p satisfies the functional dependency A1, …, An -> B if for every two tuples t and u in p with t.Ai = u.Ai for i = 1, …, n, also t.B = u.B.

• Relative containment
– Query q1 is contained in query q2 relative to Σ, denoted q1 ⊆_Σ q2, if for each database D satisfying the functional dependencies in Σ, q1(D) ⊆ q2(D).
– Σ: a set of functional dependencies
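The definition of a functional dependency translates directly into a check over a stored relation. The sketch below is illustrative only: attributes are addressed by position, the function name is hypothetical, and the sample tuples are made up for the example.

```python
def satisfies_fd(tuples, lhs, rhs):
    """Check the functional dependency lhs -> rhs on a relation.
    lhs is a sequence of attribute positions, rhs a single position."""
    seen = {}
    for t in tuples:
        key = tuple(t[i] for i in lhs)
        if seen.setdefault(key, t[rhs]) != t[rhs]:
            return False  # two tuples agree on lhs but differ on rhs
    return True


# Hypothetical tuples of a relation location(Conference, Year, Location).
location = [("ijcai", 1991, "sydney"), ("ijcai", 1993, "chambery")]
conflicting = location + [("ijcai", 1991, "paris")]
```

`satisfies_fd(location, (0, 1), 2)` tests the dependency Conference, Year -> Location; the `conflicting` relation violates it because two tuples agree on conference and year but not on location.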


Modeling Information Sources and Query Plans

• Domain model: a set of virtual relations
• Source relations
– The contents of the external information sources
• Source descriptions
– A set of conjunctive queries
– Their bodies contain only virtual relations and their heads are source relations.
• Query plan
– Given a query q from the user, the agent formulates a query plan from the source relations.
– A set of Horn rules whose only base relations are the source relations.


Example

• Consider a domain model where parent, male and female are virtual relations, and v1 and v2 are the source relations.

v1(X,Y) :- parent(X,Y), male(X)

v2(X,Y) :- parent(X,Y), female(X)

• Query plan: all grandparents of ann from the available sources.

answer(X) :- parent(X,Z), parent(Z,ann)

parent(X,Y) :- v1(X,Y)

parent(X,Y) :- v2(X,Y)


Functional Dependencies

• Suppose we have the virtual relations:

conference(Paper, Conference),

year(Paper, Year),

location(Conference, Year, Location)

• Functional dependencies

conference: Paper -> Conference

year: Paper -> Year

location: Conference, Year -> Location

• Information sources

v1(P,C,Y) :- conference(P,C), year(P,Y)

v2(P,L) :- conference(P,C), year(P,Y), location(C,Y,L)

• Query: q(L):- location(ijcai, 1991, L)

• Answer: answer(L) :- v1(P, ijcai, 1991), v2(P, L)


Inverse Rule

• Definition (inverse rule): Let v be a source description

v(X̄) :- p1(X̄1), …, pn(X̄n).

Then for j = 1, …, n,

pj(X̄j’) :- v(X̄)

is an inverse rule of v.

– X̄j is modified to obtain X̄j’ as follows:
• If X is a constant or is a variable in X̄, then X is unchanged in X̄j’.
• Otherwise, X is one of the variables Xi appearing in the body of v but not in X̄, and X is replaced by f_{v,i}(X̄) in X̄j’.

– The purpose is to recover tuples of the virtual relations from the source relations.


• Source relations:

v1(P,C,Y) :- conference(P,C), year(P,Y)

v2(P,L) :- conference(P,C), year(P,Y), location(C,Y,L)

• The inverse rules:

r1: conference(P,C) :- v1(P,C,Y)
r2: year(P,Y) :- v1(P,C,Y)
r3: conference(P, f1(P,L)) :- v2(P,L)
r4: year(P, f2(P,L)) :- v2(P,L)
r5: location(f1(P,L), f2(P,L), L) :- v2(P,L)

• Information sources: v1(“Fuzzy”, “IJCAI”, 1991), v2(“Fuzzy”, “Sydney”)

• Derived facts:

conference: <fuzzy, ijcai> (with r1), <fuzzy, f1(fuzzy, sydney)> (r3)

year: <fuzzy, 1991> (r2), <fuzzy, f2(fuzzy, sydney)> (r4)

location: <f1(fuzzy, sydney), f2(fuzzy, sydney), sydney> (r5)
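The construction of inverse rules from a source description is mechanical, as the following Python sketch suggests. It is an illustration under assumptions: atoms are `(name, args)` pairs, variables are capitalized strings, function terms are tuples `(symbol, arg, ...)`, and the symbols are named `f_<view>_<i>` (the slides abbreviate those of v2 to f1 and f2). The numbering of the function symbols by first occurrence is also an assumption.

```python
def is_var(term):
    # Assumed convention: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()


def inverse_rules(view, head_args, body):
    """Build the inverse rules of the source description
         view(head_args) :- body
    Each rule is returned as (head_atom, body_atom)."""
    skolem = {}
    for _, args in body:
        for a in args:
            if is_var(a) and a not in head_args and a not in skolem:
                # Body-only variable: replace it by a function term over the head.
                symbol = f"f_{view}_{len(skolem) + 1}"
                skolem[a] = (symbol,) + tuple(head_args)
    source_atom = (view, tuple(head_args))
    return [((pred, tuple(skolem.get(a, a) for a in args)), source_atom)
            for pred, args in body]


v1_rules = inverse_rules("v1", ("P", "C", "Y"),
                         [("conference", ("P", "C")), ("year", ("P", "Y"))])
v2_rules = inverse_rules("v2", ("P", "L"),
                         [("conference", ("P", "C")), ("year", ("P", "Y")),
                          ("location", ("C", "Y", "L"))])
```

`v1_rules` corresponds to r1 and r2, and `v2_rules` to r3, r4 and r5, with `f_v2_1` and `f_v2_2` playing the roles of f1 and f2.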


• Definition (chase rules): Let A -> B be a functional dependency satisfied by a virtual relation p. Let C be the attributes of p that are not in A, B. The chase rule corresponding to A -> B, denoted chase(A -> B), is the following rule:

e(B,B’) :- p(A,B,C), p(A’,B’,C’), e(A,A’).

– Functional dependencies

conference: Paper -> Conference

year: Paper -> Year

location: Conference, Year -> Location

– In our example, the chase rules are:

e(C,C’) :- conference(P,C), conference(P’,C’), e(P,P’)
e(Y,Y’) :- year(P,Y), year(P’,Y’), e(P,P’)
e(L,L’) :- location(C,Y,L), location(C’,Y’,L’), e(C,C’), e(Y,Y’)

– Derived facts:

e: <f1(fuzzy, sydney), ijcai>, <f2(fuzzy, sydney), 1991>


Query Rewriting

• Define q’ by modifying q iteratively as follows:
– If c is a constant in one of the subgoals of q, replace it by a new variable Z, and add the subgoal e(Z,c).
– If X is a variable in the head of q, replace X in the body of q by a new variable X’, and add the subgoal e(X,X’).
– If Y is a variable that is not in the head of q and appears in two subgoals of q, replace one of its occurrences by Y’, and add the subgoal e(Y,Y’).

• In our example:

q(L) :- location(ijcai, 1991, L)
q’(L) :- location(C,Y,L’), e(C,ijcai), e(Y,1991), e(L,L’)

• Evaluating query q’ on the reconstructed virtual relations and the derived equivalence relation e yields: IJCAI ’91 was held in Sydney.
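The whole example can be run end to end with a small naive Datalog evaluator. The Python sketch below is an illustration, not the authors' implementation. Several things are assumptions: the evaluator itself, the encoding of function terms as tuples, primed variables written as `C2`, `L2` and so on, and the way e is made an equivalence relation (seeding e(t,t) for every term that occurs, plus explicit symmetry and transitivity rules).

```python
def is_var(term):
    # Assumed convention: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()


def subst(term, env):
    if is_var(term):
        return env[term]
    if isinstance(term, tuple):  # function term: (symbol, arg, ...)
        return (term[0],) + tuple(subst(a, env) for a in term[1:])
    return term


def matches(body, facts, env):
    """Yield every variable assignment that maps all body atoms onto facts."""
    if not body:
        yield env
        return
    (pred, args), rest = body[0], body[1:]
    for fpred, fargs in facts:
        if fpred != pred or len(fargs) != len(args):
            continue
        new_env = dict(env)
        if all((new_env.setdefault(a, v) if is_var(a) else a) == v
               for a, v in zip(args, fargs)):
            yield from matches(rest, facts, new_env)


def evaluate(rules, facts):
    """Naive bottom-up evaluation until no new facts are derived."""
    facts = set(facts)
    while True:
        derived = {(head[0], tuple(subst(a, env) for a in head[1]))
                   for head, body in rules
                   for env in matches(body, facts, {})}
        if derived <= facts:
            return facts
        facts |= derived


f1, f2 = ("f1", "P", "L"), ("f2", "P", "L")
inverse = [  # r1 .. r5
    (("conference", ("P", "C")), [("v1", ("P", "C", "Y"))]),
    (("year", ("P", "Y")), [("v1", ("P", "C", "Y"))]),
    (("conference", ("P", f1)), [("v2", ("P", "L"))]),
    (("year", ("P", f2)), [("v2", ("P", "L"))]),
    (("location", (f1, f2, "L")), [("v2", ("P", "L"))]),
]
chase = [
    (("e", ("C", "C2")), [("conference", ("P", "C")),
                          ("conference", ("P2", "C2")), ("e", ("P", "P2"))]),
    (("e", ("Y", "Y2")), [("year", ("P", "Y")),
                          ("year", ("P2", "Y2")), ("e", ("P", "P2"))]),
    (("e", ("L", "L2")), [("location", ("C", "Y", "L")),
                          ("location", ("C2", "Y2", "L2")),
                          ("e", ("C", "C2")), ("e", ("Y", "Y2"))]),
]
equivalence = [  # assumed: symmetry and transitivity of e
    (("e", ("X", "Y")), [("e", ("Y", "X"))]),
    (("e", ("X", "Z")), [("e", ("X", "Y")), ("e", ("Y", "Z"))]),
]
rewritten_query = (("q", ("L",)), [("location", ("C", "Y", "L2")),
                                   ("e", ("C", "ijcai")), ("e", ("Y", 1991)),
                                   ("e", ("L", "L2"))])

sources = {("v1", ("fuzzy", "ijcai", 1991)), ("v2", ("fuzzy", "sydney"))}
virtual = evaluate(inverse, sources)
terms = {t for _, args in virtual for t in args} | {"ijcai", 1991}
reflexive = {("e", (t, t)) for t in terms}  # assumed: reflexivity of e
db = evaluate(chase + equivalence + [rewritten_query], virtual | reflexive)
answers = {args for pred, args in db if pred == "q"}
```

On the two source facts of the example, `answers` contains the single tuple for sydney, and `db` contains the two e facts listed as derived facts on the chase slide.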


Limitations on Binding Patterns

• To model source capabilities, attach to each source relation an adornment.
– An adornment of a source relation v is a string of b’s and f’s of length n, where n is the arity of v.
– v^bf: the first argument of v must be bound.

• Definition (executable Horn rule)
– Let V be a set of relations with binding adornments, and let r be the following Horn rule whose body relations are in V:

q(X̄) :- v1(X̄1), …, vn(X̄n).

– The rule r is executable if the following holds for i = 1, …, n: let j be a bound argument position of vi and let a be the j’th element in X̄i. Then either a is a constant, or a appears in X̄1, …, X̄i-1.
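The executability condition can be checked with one left-to-right pass over the rule body. The sketch below is illustrative; the encoding of a body as a list of `(source_name, args)` pairs, the `adornments` mapping, and the three sources in the usage lines are assumptions.

```python
def is_var(term):
    # Assumed convention: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()


def is_executable(body, adornments):
    """body: list of (source_name, args) in evaluation order.
    adornments: source_name -> string of 'b'/'f', one letter per argument."""
    bound_so_far = set()
    for name, args in body:
        for arg, mode in zip(args, adornments[name]):
            if mode == "b" and is_var(arg) and arg not in bound_so_far:
                return False  # needs a binding that no earlier subgoal supplies
        bound_so_far.update(a for a in args if is_var(a))
    return True


adornments = {"v1": "f", "v2": "bf", "v3": "b"}
# q(Z1) :- v1(Z0), v2(Z0, Z1), v3(Z1)
plan = [("v1", ("Z0",)), ("v2", ("Z0", "Z1")), ("v3", ("Z1",))]
```

`plan` is executable because every bound position is filled by a variable that an earlier subgoal produced, whereas a body consisting only of `v3(X)` is not.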


• Example
– Sources:

v1^f(X) :- ijcaiPapers(X)
v2^bf(X,Y) :- cites(X,Y)
v3^b(X) :- awardPaper(X)

– Query: q(X) :- awardPaper(X)
– Executable conjunctive query plan:

qn(Zn) :- v1(Z0), v2(Z0, Z1), …, v2(Zn-1, Zn), v3(Zn).

• By allowing recursive plans, we can produce a maximally-contained plan:

papers(X) :- v1^f(X)
papers(X) :- papers(Y), v2^bf(Y,X)
q(X) :- papers(X), v3^b(X)
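The recursive plan amounts to a reachability search that only ever calls the sources in ways their adornments allow. The Python sketch below simulates this; the source contents and the function names are hypothetical, chosen only to show that an award paper is found exactly when it is reachable from an IJCAI paper through citations.

```python
# Hypothetical source contents, used only to simulate the three sources.
IJCAI_PAPERS = {"p1"}
CITES = {"p1": {"p2"}, "p2": {"p3"}, "p4": {"p5"}}
AWARD_PAPERS = {"p3", "p5"}


def v1():
    # Adornment f: no input needed, returns all IJCAI papers.
    return set(IJCAI_PAPERS)


def v2(x):
    # Adornment bf: x must be given, returns the papers that x cites.
    return set(CITES.get(x, ()))


def v3(x):
    # Adornment b: x must be given, tests whether x is an award paper.
    return x in AWARD_PAPERS


def run_recursive_plan():
    papers = v1()                        # papers(X) :- v1(X)
    frontier = list(papers)
    while frontier:                      # papers(X) :- papers(Y), v2(Y,X)
        y = frontier.pop()
        for x in v2(y) - papers:
            papers.add(x)
            frontier.append(x)
    return {x for x in papers if v3(x)}  # q(X) :- papers(X), v3(X)


answers = run_recursive_plan()
```

In this made-up data, p3 is returned because it is reached from p1 in two citation steps, while the award paper p5 cannot be obtained: no call permitted by the binding patterns ever produces p4 or p5.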


Domain Rules

• Definition (domain rules): Let v ∈ V be a source relation of arity n. Suppose the adornment of v says that the arguments in positions 1, …, l need to be bound, and the arguments in positions l+1, …, n are free. Then for i = l+1, …, n, the following rule is a domain rule:

dom(Xi) :- dom(X1), …, dom(Xl), v(X1, …, Xn).

– All domain rules are executable, and the relation dom has adornment f.
– Every query plan P can be transformed to an executable query plan by inserting dom(X) before subgoals g in P that require the variable X to be bound.
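Domain rules can be generated directly from an adornment string. The following Python sketch is an illustration under assumptions: rules are produced as plain strings, arguments are named X1, …, Xn, and, unlike the definition, the bound positions are not required to come first.

```python
def domain_rules(name, adornment):
    """Return the domain rules of source `name` as Datalog strings.
    adornment: string of 'b'/'f', one letter per argument position."""
    args = [f"X{i}" for i in range(1, len(adornment) + 1)]
    bound = [a for a, mode in zip(args, adornment) if mode == "b"]
    body = [f"dom({a})" for a in bound] + [f"{name}({', '.join(args)})"]
    # One rule per free argument position.
    return [f"dom({a}) :- {', '.join(body)}"
            for a, mode in zip(args, adornment) if mode == "f"]


rules = domain_rules("v1", "f") + domain_rules("v2", "bf") + domain_rules("v3", "b")
```

A source whose arguments are all bound, such as one with adornment b, contributes no domain rule because it never produces new values.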


Summary

• Given a query
• Construct “inverse rules”
• Construct “chase rules”
• Construct “domain rules”
• Above rules comprise the “query plan”