NTUA, April 17, 2003
XML Query Reformulation
Val Tannen
University of Pennsylvania
Joint work with Alin Deutsch, UC San Diego
and in part with Lucian Popa, IBM Almaden
Data Exchange Between Businesses Using XML
[Diagram: a hospital, an insurance company and a pharmaceutical company each hold proprietary data and expose published data, which they exchange as XML.]
XML?
<drug>
  <name>aspirin</name>
  <price>$4</price>
  <notes>
    <side-effects>upset stomach</side-effects>
    <maker>Bayer</maker>
  </notes>
</drug>
(callouts on the slide: opening tag, matching closing tag, text)
The same document as a tree:
drug
  name: “aspirin”
  price: “$4”
  notes
    side-effects: “upset stomach”
    maker: “Bayer”
A Simple Publishing Scenario
proprietary data (relational):
prescription
usage   drug        name
2/day   aspirin     John
3/day   cortisone   Jane
patient
name   diagnosis
John   migraine
Jane   allergy
published data (virtual XML; patient name is hidden):
<study>
  <case> <diag>migraine</diag> <drug>aspirin</drug> <usage>2/day</usage> </case>
  <case> <diag>allergy</diag> <drug>cortisone</drug> <usage>3/day</usage> </case>
</study>
correspondence expressed by a publishing query (view)
View = query which, if executed, would produce the virtual data
client → client query (XQuery) → reformulation (SQL)
XQuery: XML query language standard (draft)
How to express the view?
How to “compose” the client query with the view, obtaining the reformulation?
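The scenario above can be sketched in a few lines of Python with the standard `sqlite3` module. This is only an illustration under assumptions: the client query ("which drugs are prescribed for a given diagnosis"), the function names and the SQL text are hypothetical, and the reformulation is written by hand rather than produced by any algorithm from the talk.

```python
import sqlite3

# Proprietary relational storage, with the rows shown on the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prescription ("usage" TEXT, drug TEXT, name TEXT);
    CREATE TABLE patient (name TEXT, diagnosis TEXT);
    INSERT INTO prescription VALUES ('2/day', 'aspirin', 'John');
    INSERT INTO prescription VALUES ('3/day', 'cortisone', 'Jane');
    INSERT INTO patient VALUES ('John', 'migraine');
    INSERT INTO patient VALUES ('Jane', 'allergy');
""")

# Publishing view (assumed SQL form): one (diag, drug, usage) case per
# prescription; the patient name is joined on but never exposed.
VIEW_SQL = """
    SELECT pa.diagnosis, pr.drug, pr."usage"
    FROM prescription pr JOIN patient pa ON pr.name = pa.name
"""

def client_query_on_virtual_data(conn, diag):
    """Naive evaluation: materialize the published cases, then filter them."""
    return sorted(drug for (d, drug, _usage) in conn.execute(VIEW_SQL) if d == diag)

def reformulated_query(conn, diag):
    """Hand-written reformulation: the same answer, computed directly on storage."""
    sql = """
        SELECT pr.drug
        FROM prescription pr JOIN patient pa ON pr.name = pa.name
        WHERE pa.diagnosis = ?
        ORDER BY pr.drug
    """
    return [row[0] for row in conn.execute(sql, (diag,))]
```

Both functions return the same answer on every database, which is the defining property of a reformulation; the second one never builds the virtual published data.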
completeness
soundness
The General Problem of Query Reformulation
schema P schema S
schema correspondence
client
query Q(P) ? reformulated query X(S)
Given query Q(P), find query(ies) X(S) returning same answer,
whenever such X(S) exists
Applications of Query Reformulation
• data publishing: public schema (P) / storage schema (S); we just saw it
• data integration: global schema (P) / local schema (S)
• schema evolution: old schema (P) / new schema (S)
• data security: illustrated next
An Application: Data Security
public schema P
proprietary schema S
schema correspondence
client
query E(S) (exposes secret data correlation)
Only possible if Completeness Property holds!
intrusive query I(P)
Want to be sure that there is no I(P) returning same answer as E(S)
(patient,ailment)
(patient, physician)+
(physician, ailment)
More Complicated Data Publishing: Mixed And Redundant Storage (MARS)
initial configuration
view of proprietary data
may hide information
published XML(virtual)
proprietary XML data
proprietary relational data
storage schema
public schema
schema correspondence
after tuning
redundant data
materialized views, indexes
cached queries
partial relational storage of XML
An Example With Tuning
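[Diagram labels. Storage: relational data (drug,price; drug,usage,name; name,diagnosis), XML (drug,price,notes), and an XML cached query (diagnosis,drug). Published XML: (drug,price,notes) and (drug,usage,diagnosis). Correspondences: identity view, simple publishing view, relational view, cached query.]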
Redundancy Enables Multiple Reformulations
[Same storage and publishing diagram as on the previous slide, with three candidate reformulations R1, R2 and R3 marked.]
client query: “find how much each treatment costs”
Some reformulations are potentially cheaper to execute than others. Want to find an “optimal” one!
Schema Correspondence Expressible in XQuery
[Diagram: relational DBs are encoded as XML, so that XQuery can relate XML sources and XML-encoded relational sources.]
The DB administrator must be able to specify the correspondence.
Can use XQuery, fixing any of the common encodings of relational tables in XML.
XQuery?
for $d in document/drug, $m in $d//maker          (binding part)
return <producedBy>$m/text()</producedBy>          (tagging template)
Evaluated on the tree
drug
  name: “aspirin”
  price: “$4”
  notes
    side-effects: “upset stomach”
    maker: “Bayer”
the result should contain
<producedBy>Bayer</producedBy>
// (descendant) is the transitive closure of / (child)
Approach: XQuery Reformulation Reduced to Relational Reformulation
[Architecture diagram]
client XQuery, schema correspondence (mappings as XQueries) and XML integrity constraints
→ compilation →
relational queries and relational constraints, plus GReX (built-in relational constraints that capture the XML data model)
→ C&B →
reformulated relational queries → reformulated queries (multiple solutions)
GReX: Generic Relational encoding of XML
XQuery Semantics
XML data model is a tagged tree
<drug>
  <name>aspirin</name>
  <price>$4</price>
  <notes>
    <side-effects>upset stomach</side-effects>
    <maker>Bayer</maker>
  </notes>
</drug>
drug
  name: “aspirin”
  price: “$4”
  notes
    side-effects: “upset stomach”
    maker: “Bayer”
XQueries compute in two stages:
1. navigation in XML tree, binds variables to nodes, text, tags, etc. (variable binding stage)
2. output of new XML, by filling in variable bindings into a tagging template (tagging stage)
for $d in document/drug, $m in $d//maker          (variable binding stage: binds “$d” and “$m”)
return <producedBy>$m/text()</producedBy>          (tagging stage)
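A minimal sketch of the two stages, using Python's standard `xml.etree.ElementTree` as a stand-in for an XQuery engine. The function names `binding_stage` and `tagging_stage` are hypothetical, and treating the parsed element as the document's top-level `drug` is an assumption about what `document/drug` denotes.

```python
import xml.etree.ElementTree as ET

DRUG_XML = (
    "<drug><name>aspirin</name><price>$4</price>"
    "<notes><side-effects>upset stomach</side-effects>"
    "<maker>Bayer</maker></notes></drug>"
)

def binding_stage(top):
    """Stage 1: navigate the tree and return tuples of variable bindings ($d, $m)."""
    drugs = [top] if top.tag == "drug" else []          # $d in document/drug
    return [(d, m) for d in drugs
            for m in d.iter("maker") if m is not d]     # $m in $d//maker

def tagging_stage(bindings):
    """Stage 2: fill each binding into the <producedBy> tagging template."""
    out = []
    for _d, m in bindings:
        el = ET.Element("producedBy")
        el.text = m.text                                # $m/text()
        out.append(ET.tostring(el, encoding="unicode"))
    return out

result = tagging_stage(binding_stage(ET.fromstring(DRUG_XML)))
```

Only the first stage touches the input tree, which is why the rest of the talk concentrates on compiling the binding part.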
Compiling the Binding Part of XQueries to Relational Queries
XBind query = binding part of XQuery (returns a relation: tuples of variable bindings)
compiles to a relational query over child(x,y), tag(x,t), desc(x,y), Root(r), etc.
Example:
for $d in document(“drugs.xml”)/drug, $m in $d//maker return “$d” “$m”
compiles to
P($d,$m) :- Root(r), child(r,$d), tag($d,“drug”), desc($d,x), child(x,$m), tag($m,“maker”)
(a relational “conjunctive” query)
But not all models of this schema correspond to the intended model; need GReX!
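To make the compilation concrete, here is a hedged Python sketch that encodes a small XML document into the relations `child`, `tag` and `desc` and evaluates the conjunctive query P over them. The node numbering, the helper names `encode` and `query_P`, and the use of id 0 for the document root are assumptions made for illustration; `desc` is computed as the reflexive-transitive closure of `child`, that is, the intended model.

```python
import xml.etree.ElementTree as ET
from itertools import count

DRUG_XML = (
    "<drug><name>aspirin</name><price>$4</price>"
    "<notes><side-effects>upset stomach</side-effects>"
    "<maker>Bayer</maker></notes></drug>"
)

def encode(xml_text):
    """Return (root, child, tag, desc) for the intended model of the document."""
    ids = count(1)
    root = 0                      # id of the document root, i.e. Root(0)
    child, tag = set(), set()

    def visit(el, parent):
        n = next(ids)
        child.add((parent, n))
        tag.add((n, el.tag))
        for c in el:
            visit(c, n)

    visit(ET.fromstring(xml_text), root)
    nodes = {root} | {n for (n, _t) in tag}
    desc = {(n, n) for n in nodes} | set(child)
    while True:                   # close desc under composition with child
        new = {(x, z) for (x, y) in desc for (y2, z) in child if y == y2} - desc
        if not new:
            return root, child, tag, desc
        desc |= new

def query_P(root, child, tag, desc):
    """P(d,m) :- Root(r), child(r,d), tag(d,'drug'), desc(d,x), child(x,m), tag(m,'maker')."""
    return sorted({(d, m)
                   for (r, d) in child if r == root and (d, "drug") in tag
                   for (d2, x) in desc if d2 == d
                   for (x2, m) in child if x2 == x and (m, "maker") in tag})

bindings = query_P(*encode(DRUG_XML))
```

With nodes numbered in document order, `drug` gets id 1 and `maker` gets id 6, so the single binding is the pair (1, 6).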
Sample Constraints from GReX
• Relationship between child and descendant navigation:
∀x ∀y [ child(x,y) → desc(x,y) ]   desc contains child
∀x [ el(x) → desc(x,x) ]   desc is reflexive
∀x ∀y ∀z [ desc(x,y) ∧ desc(y,z) → desc(x,z) ]   desc is transitive
• Tagged tree structure of XML:
∀r ∀x [ root(r) ∧ desc(x,r) → x = r ]   root has no ancestors
∀x ∀y ∀z [ child(x,z) ∧ child(y,z) → x = y ]   at most one parent
These do not capture transitive closure completely, nor is it possible to do it in first-order logic; STILL...
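The sample constraints can be checked mechanically on a finite instance. The sketch below is an illustration only: the function name `violated` and the tiny instances are hypothetical. It also shows why the constraints are only an approximation: the last instance satisfies all of them although node 2 is a "descendant" of node 0 without any child path.

```python
def violated(child, desc, root, elements):
    """Return the names of the sample GReX constraints that the instance violates."""
    bad = []
    if not child <= desc:
        bad.append("desc contains child")
    if any((x, x) not in desc for x in elements):
        bad.append("desc is reflexive")
    if any((x, z) not in desc for (x, y) in desc for (y2, z) in desc if y == y2):
        bad.append("desc is transitive")
    if any(r == root and x != r for (x, r) in desc):
        bad.append("root has no ancestors")
    if any(z == z2 and x != y for (x, z) in child for (y, z2) in child):
        bad.append("at most one parent")
    return bad

# Intended model of the chain 0 -> 1 -> 2.
chain_child = {(0, 1), (1, 2)}
chain_desc = {(0, 0), (1, 1), (2, 2), (0, 1), (1, 2), (0, 2)}

# Unintended model: 2 is a desc of 0, but no child edge leads to 2.
odd_child = {(0, 1)}
odd_desc = {(0, 0), (1, 1), (2, 2), (0, 1), (0, 2)}
```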
More Constraints from GReX
(someTag) ∀x [ el(x) → ∃t tag(x,t) ]   every element has a tag
(oneTag) ∀x ∀t1 ∀t2 [ tag(x,t1) ∧ tag(x,t2) → t1 = t2 ]   one tag per element
(noLoop) ∀x ∀y [ desc(x,y) ∧ desc(y,x) → x = y ]   no non-trivial cycles
(noShare) ∀x ∀y ∀u ∀v [ child(x,u) ∧ child(x,v) ∧ desc(u,y) ∧ desc(v,y) → u = v ]   unique path between elements
(inLine) ∀x ∀y ∀u [ desc(x,u) ∧ desc(y,u) → x = y ∨ desc(x,y) ∨ desc(y,x) ]   ancestors of an element are collinear
Which Reformulations Do We Find This Way?
[Architecture diagram, as on the “Approach” slide: client XQuery, schema correspondence (mappings as XQueries) and XML integrity constraints are compiled to relational queries and relational constraints; GReX adds built-in constraints that capture the XML data model; C&B returns reformulated queries (multiple solutions).]
all of them?
Restrictions on XQuery
Main restriction: no aggregates (to be investigated)
Leaving out aggregates, most common queries can be processed.
Minor restrictions:
no user-defined functions (of course!)
limited use of negation (or else the problem becomes undecidable)
limited use of document order (to be investigated)
no navigation to parent or wildcard child (of unspecified tag) (unintuitive, but we can show that this needs another algorithm, unless NP = Π₂ᵖ)
The Reduction is Sound and Complete
For the restricted XQuery fragment,
Given:
- XBind query B compiled to a relational query c(B)
- schema correspondence C given by XQueries compiled to set of constraints c(C)
Relative Completeness Theorem:
R is a minimal reformulation of B under C
iff c(R) is a minimal reformulation of c(B) under c(C) and GReX
All of them are found by C&B.
R can be computed from c(R)
A Glimpse at the Chase: Transforming Queries Using Constraints
A query: ‘find data satisfying condition “A”’
Q: A
A constraint: ‘whenever the data satisfies condition “A”, it also satisfies “B”’
A → B
A chase step:
Q: A   ⟹   Q1: A ∧ B
The chase: repeatedly applying chase steps until no new conditions can be added.
In general, Q and Q1 are not equivalent, but in all DBs satisfying the constraint, they are!
Theory of the chase: 20 years old, deep and rich, due to Beeri, Maier, Mendelzon, Sagiv, Vardi, Yannakakis and others!
How Do We Use the Chase? Capturing Relational Views With Constraints
Let the schema correspondence be the view:
‘retrieve the data satisfying conditions “A” and “B”’
V: A ∧ B
Capture the definition with constraints (first-order logic statements), where V stands for the condition “data appears in result of V”:
A ∧ B → V   (all data satisfying “A” and “B” “appears in result of V”)
V → A ∧ B   (all data “appearing in V” satisfies “A” and “B”)
Chase & Backchase
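Setting: query Q: A, constraint A → B, and view V: A ∧ B captured by A ∧ B → V and V → A ∧ B.
First chase:
Q: A   ⟹ (with A → B)   Q1: A ∧ B   ⟹ (with A ∧ B → V)   Q2: A ∧ B ∧ V
Next inspect all subqueries (“syntactic pieces”) of the chase result Q2, for example
SQ: V
The equivalence is checked again using the chase (backwards):
SQ: V   ⟹ (with V → A ∧ B)   Q2: A ∧ B ∧ V
It turns out that SQ is equivalent to Q.
Presence of constraint A → B allows reformulation.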
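The A/B/V example can be replayed with a toy Python model in which a query is just a set of condition names and a constraint is a pair (premise, conclusion). This is a deliberately simplified sketch of the idea on this slide, not the relational C&B algorithm; the names `chase`, `backchase` and `CONSTRAINTS` are hypothetical.

```python
from itertools import combinations

# A -> B, A and B -> V, V -> A and B
CONSTRAINTS = [({"A"}, {"B"}), ({"A", "B"}, {"V"}), ({"V"}, {"A", "B"})]

def chase(conditions, constraints):
    """Add implied conditions until no chase step applies."""
    result = set(conditions)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in constraints:
            if premise <= result and not conclusion <= result:
                result |= conclusion
                changed = True
    return frozenset(result)

def backchase(query, constraints, storage):
    """Minimal subqueries of the chase result that mention only storage
    conditions and chase back to something that implies the query."""
    universal = chase(query, constraints)
    candidates = sorted(universal & set(storage))
    solutions = []
    for k in range(1, len(candidates) + 1):
        for sub in combinations(candidates, k):
            if any(sol <= set(sub) for sol in solutions):
                continue                      # a smaller solution already covers it
            if set(query) <= chase(sub, constraints):
                solutions.append(frozenset(sub))
    return solutions
```

With all three constraints, the query {A} is reformulated as {V}; dropping A → B leaves the chase result at {A}, so no reformulation over V is found.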
General C&B Algorithm (joint work with Lucian Popa, IBM Almaden)
(public) schema P , (proprietary) schema S
Let C be a set of constraints (e.g., on P and/or P & S).
Assume some terminating chasing sequence.
Q(P)   → chase with C →   U(P + S)   (universal plan)
U(P + S)   → backchase over the SUBQUERIES of U →   solutions X(S) = subqueries of U, posed against S, equivalent to Q
Completeness Theorem [Deutsch&T.]: Any scan-minimal reformulation of Q under C is a subquery of U
Two Sets of Experiments
• Synthetic queries reformulation time as function of query “complexity”
XML analog of relational “star” queries, increasing number of joins
can very complex queries still be reformulated in a practical amount of time ?
• “Realistic” queries from the XML Benchmark Project [http://monetdb.cwi.nl/xml]
The Queries: 20 queries designed to exercise interesting features of XQuery
The Schema correspondence: views in both directions compiles to about 200 constraints!
Much more than in typical relational schemas!
Experiments with Synthetic Queries
Number of joins (number of corners in the star)
Experiments with Benchmark Queries
Reformulation times must be understood in conjunction with execution times (e.g., tens of seconds for Q10)
Summary of Contributions
MARS, a system for XQuery reformulation,
- with mixed and redundant storage, under integrity constraints.
- complex schema correspondence (views in both directions)
Showed practical relevance of C&B method (feasible and worthwhile)
A completeness result for a significant fragment of XQuery and a large
class of schema correspondences. The method remains sound for the full language.
A reduction between minimal reformulation and query equivalence, and
we gave matching lower bounds showing our chase-based decision procedure is
asymptotically optimal for the fragment considered.
Why XML?
The relational data model is still the dominant concept in databases.
All data can be coded into tables. (For that matter, into (Gödel) numbers too!)
Artificial coding makes life harder for query programmers. Result: less productivity, more bugs.
XML is much more flexible. It is also “self-describing”, i.e., no need a priori for types/schemas (but this is sometimes a bad idea).
It came from the document community (tagged text) and was cheered by industry gurus. So we have to live with it. (Although one can imagine better data models…)
Making It Work
Chase: each chase step is similar to evaluation of a recursive Datalog rule on a symbolic database built from the query
we borrowed classical query processing techniques:
• compiling constraints to join tree
• joins implemented as hash-joins
• pushing selections into joins
Backchase: size of search space is O(2^u), u = size of universal plan. We found criteria for pruning this space.
typical size reduction: 2^100 → 300
1. Cost-independent: prune subqueries that
- do not correspond to legal XML queries (legal queries perform contiguous navigation steps starting from the root)
- contain redundant descendant navigation steps (x child-of y, y child-of z, x descendant-of z)
2. A cost-based pruning strategy parameterized by costing model
bottom-up exploration of subqueries: first all performing 1 navigation step, next all performing 2 navigation steps, etc.
- finds optimal reformulation for any monotonic cost model
- cost models for XML are still under research
- heuristic cost model: cost is number of table scans/XML navigation steps performed
- amenable to experimenting with other cost models
Benefit of Reformulation For Execution Time
Benefit increases with increasing complexity of query and increasing database size
[Chart: saved time (s) = original query execution - time to reformulate - execution of reformulation, plotted against the number of major joins per query (3 to 7), one series per no. of elements in document (60, 80, 90, 100, 150, 200).]
More Results for Benchmark Queries
For redundancy: materialized the XBind query for each query (particular case of Access Support Relation)
[Chart: reformulation times (with redundancy and optimization), time (s) for queries Q1 to Q20, split into time to first reformulation, delta to best reformulation, and delta to finish search.]
Time to find first reformulation is essentially the same as in the absence of redundancy.
Additional time is spent only for finding the optimal one.
Related Work: Data Integration As Particular Case of MARS Applications
Diagram labels: P (global schema), S (local schema), CR between them, client query Q, reformulation X
Global As View (GAV): reformulation by composition-with-views, X = Q o CR
TSIMMIS, SilkRoute, XPeranto
Local As View (LAV): rewriting-with-views, Q = X o CR
Information Manifold, STORED, Agora
MARS: combined effect of rewriting+composition (diagram: CR in both directions between P and S; label “Q CR X = Q”)
[with Fernandez and Suciu in SIGMOD’99]
Future Work Directions
• Short-Term:
- tuning of C&B implementation for further speedup
- XML-specific strategies for pruning the backchase stage
- in particular, finding a good cost model to perform cost-based pruning
• Medium-Term:
- Applying C&B to Data Security
- Applications to Adaptive Distributed Query Optimization
• Long Term:
- a unified framework for integrating data from various heterogeneous sources, going beyond classical databases (XML/relational/LDAP + web forms + web services)
Application 3: Schema Evolution (e.g. Caching)
old schema O
new schema N (could be O extended with cached results)
schema correspondence
client
old query Q(O)
reformulated query X(N)
Find X(N) returning same answer as Q(O)
Goal: support existing client applications even after changing the schema
A Source of Redundancy: Relational Storage of XML
public data (XML, highly unstructured):
catalog
  drug
    name: “aspirin”
    price: “$4”
    notes
  drug
    name: “cortisone”
    price: “$50”
    notes
relational view (lossy), kept as redundant storage:
Drugs
name        price
aspirin     $4
cortisone   $50
Containment Under Integrity Constraints
Decision procedure for containment is based on chasing with constraints from GReX.
Natural extension to XML integrity constraints.
Some results:
• Containment of well-behaved XPath/XBind queries under bounded simple XML integrity constraints (SXICs) is decidable (used in relative completeness theorem).
• Even modest use of unboundedness makes the problem undecidable.
• Corollary: containment under bounded SXICs and DTDs is undecidable.
• Containment under DTDs only is an open problem, but we have a PSPACE lower bound.
See proposal for details.
The Architecture of Our Solution
[Architecture diagram: the client XQuery is split into XBind queries and a tagging template; the XBind queries, the schema correspondence (mappings as XQueries, with rel/XML encodings) and the XML integrity constraints are compiled to relational queries and relational constraints; GReX adds built-in XML data model constraints; C&B returns reformulated queries (multiple solutions). Callouts on the slide: “defined next”, “not shown here”.]
GReX: Generic Relational encoding of XML, used internally to partially capture the intended model
Problem:
• XML/MARS XQuery Reformulation
• schema correspondence given by views in both directions
• multiple solutions
Tool: Algorithm for reformulation
of relational queries under relational constraints
Chase & Backchase (C&B)
introduced in [VLDB’99 with L. Popa and V. Tannen]
evaluated in [SIGMOD’00 with L. Popa, A. Sahuguet and V. Tannen]
Capturing Relational Views With Constraints
Let the schema correspondence be a view defined as the relational conjunctive query
V(x,z) :- A(x,y), B(y,z)
Capture the definition with constraints:
(cV) ∀x ∀y ∀z [ A(x,y) ∧ B(y,z) → V(x,z) ]   result of query defining the view is included in V
(bV) ∀x ∀z [ V(x,z) → ∃y A(x,y) ∧ B(y,z) ]   V is included in result of query defining view
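As a quick sanity check, the two constraints can be tested on small instances in Python. The function names and the sample relations are hypothetical; the point of the sketch is that an instance satisfies both (cV) and (bV) exactly when V holds the result of the join defining the view.

```python
def satisfies_cV(A, B, V):
    """(cV): every pair produced by joining A and B appears in V."""
    return all((x, z) in V for (x, y) in A for (y2, z) in B if y == y2)

def satisfies_bV(A, B, V):
    """(bV): every pair in V is witnessed by some y with A(x,y) and B(y,z)."""
    return all(any((y, z) in B for (x2, y) in A if x2 == x) for (x, z) in V)

A = {(1, 2), (3, 4)}
B = {(2, 5)}
V_exact = {(1, 5)}             # exactly the join of A and B
V_too_small = set()            # misses (1, 5): violates (cV)
V_too_big = {(1, 5), (9, 9)}   # (9, 9) has no witness: violates (bV)
```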
Partially capturing the XML model
Partially, because some features cannot fully be captured with constraints:
• descendant is the transitive closure of child, but this is not FO-definable
• neither is the “treeness” property
our solution:
add a set of constraints GReX to approximate intended models
it turns out that capturing descendant helps in capturing treeness
then, we define a significant XQuery fragment (we call it well-behaved)
that cannot distinguish between intended and approximate models
Constraints in GReX (2): the tagged tree structure of XML
(topRoot) ∀r ∀x [ root(r) ∧ desc(x,r) → x = r ]   root has no ancestors
(oneTag) ∀x ∀t1 ∀t2 [ tag(x,t1) ∧ tag(x,t2) → t1 = t2 ]   one tag per element
(noLoop) ∀x ∀y [ desc(x,y) ∧ desc(y,x) → x = y ]   no non-trivial cycles
(oneParent) ∀x ∀y ∀z [ child(x,z) ∧ child(y,z) → x = y ]   at most one parent
(noShare) ∀x ∀y ∀u ∀v [ child(x,u) ∧ child(x,v) ∧ desc(u,y) ∧ desc(v,y) → u = v ]   unique path between elements
(inLine) ∀x ∀y ∀u [ desc(x,u) ∧ desc(y,u) → x = y ∨ desc(x,y) ∨ desc(y,x) ]   ancestors of an element are collinear
XQuery Restrictions
What it allows:
composition of navigation steps,
navigation axes: self, (named)child, descendant, ancestor, idrefs
qualifiers: path, string path, “and”, “or”, path equality/inequality
where clause: disjunction, path equality/inequality,
existential quantification
What it rules out:
user-defined functions,
range, before predicates,
aggregates, arbitrary negation, universal quantification,
concatenation (,)
navigation to parent (..) or to child of unspecified name (*)
C&B Completeness
Let C be a set of constraints (relates public schema P and proprietary schema S)
• C-minimal query:
removing any of its relational atoms produces non-equivalent query under C
• Q1 is a subquery of Q2:
Q1 is isomorphic to a “piece” of Q2
Completeness Theorem: Any C-minimal reformulation of Q is a subquery of U
Q(P)   → chase →   U(P + S)   (universal plan)
U(P + S)   → backchase over the SUBQUERIES of U →   solutions X(S) = subqueries of U, posed against S, equivalent to Q
A Completeness Result for Our Solution
Given:
- well-behaved XBind query B
compiled to a relational query c(B)
- schema correspondence M given by well-behaved XQueries (in both directions),
compiled to set of relational constraints c(M)
- bounded XML integrity constraints XIC (a class of XML integrity constraints, see [KRDB’01]),
compiled to set of relational constraints c(XIC)
Relative Completeness Theorem: for any R
R is a (M+XIC)-minimal reformulation of B
iff
c(R) is a (GReX ∪ c(M) ∪ c(XIC))-minimal reformulation of c(B)
All of them are found by C&B. R can be computed from c(R).
Corollary: completeness of reformulation algorithm for XBind queries
Capturing XML Semantics
[Architecture diagram repeated, highlighting GReX: built-in constraints that capture the XML data model, used alongside the relational constraints in C&B.]
Summary of Constraints Used in C&B Phase
• Built-in constraints in GReX
• Relational views compile to inclusion constraints
• XQuery views
– their XBind queries compile to inclusion constraints as for relational views
– their return clause compiles to several decorrelated queries, each captured with constraints
– the XML template in the return clause compiles to several Skolem and copy functions, each compiled to constraints
• Integrity constraints
– XML constraints compile to relational constraints
– relational schema constraints
Are the Restrictions Justified?
Our completeness result holds for well-behaved XQueries, under bounded
XML integrity constraints.
What about reformulating
• XQueries with parent and wildcard child navigation?
• Under other XML integrity constraints?
• Even under full-fledged DTDs?
For such extensions, we make a deeper study of equivalence, which is a simpler problem that arises within reformulation.
The equivalence checker is invoked as black-box algorithm during C&B.
XBind (includes XPath) Fragments Equivalence
simple fragment: PTIME
  path concatenation, attribute values
  navigation axes: self, (named) child, descendant
  qualifiers: path, string path, “and”
+ join on attribute variables: NP-complete
+ any or all (!) of the following: Π₂ᵖ-complete
  . disjunction
  . ancestor navigation
  . path equality
  . wildcard child (*) navigation
+ parent, preceding(following)-sibling: in Π₂ᵖ
(side labels in the original figure: “simple”, “well-behaved”)
Containment for the “well-behaved” fragment of XBind/XPath
Theorem
B1, B2: XBind/XPath queries from our “well-behaved” fragment
c(B1), c(B2): their relational compilation
B1 is equivalent to B2 iff c(B1) is equivalent to c(B2) under GReX (decidable in Π₂ᵖ using chase)
This result about containment is used in the relative completeness theorem
Extensions of the “NP” fragment: Π₂ᵖ fragments
any or all (!) of the following make equivalence Π₂ᵖ-complete:
• disjunction
unsurprising: conjunctive queries+union already Π₂ᵖ-complete [SY’80]
• ancestor navigation
translate ancestor away introducing union: /a/b/ancestor = /[a/b] ∪ /a[b]
• path equality qualifier
can simulate ancestor: //.[.//.==/p]/s = /p/ancestor/s
• wildcard child navigation
union introduced by interaction of // and *: //a = /a ∪ /*//a
Not well-behaved, but we have a different decision procedure
Experimental Setup: Started From the XML Benchmark
Used the official XML Benchmark Project [http://monetdb.cwi.nl/xml]
The application domain: an online auctioning application.
The published schema: a DTD given by the XML Benchmark Project
Data is partially nicely structured.
The Queries: 20 queries designed to exercise interesting features of XQuery
What We Added to the XML Benchmark Setup
The mixed storage schema:
relationally: person, item, open auction, closed auction, etc.
unstructured part: annotations on items
The redundancy:
materialized the XBind query for each query
(particular case of Access Support Relation)
The mappings:
in both directions, between relations and XML and between XML and XML
It all compiles to about 200 constraints! Much more than in typical relational schemas!
Had to change original implementation [SIGMOD’00] to scale.
Related Work
Publishing systems
Schema mapping proprietary relational → published XML: SilkRoute, XPeranto
reformulation by composition-with-views
Schema mapping published XML → proprietary relational: STORED, Agora
reformulation by rewriting-with-views
Information Integration
TSIMMIS (composition-w-views), Information Manifold (rewriting-w-views)
Containment
Miklau and Suciu: smaller fragment of XPath (they too find that * is “naughty”)
[FLS, CGLV]: conjunctive regular path queries
Amer-Yahia and Srivastava: minimization of tree pattern queries
Containment under integrity constraints: XML keys [BDFHT]; description logics [CGL]
Query Reformulation in Data Publishing
public schema P (virtual data)
proprietary storage schema S (materialized data)
publishing query (may hide some proprietary data)
client query Q(P) (not directly executable)
partner/client
? reformulated query X(S)
Find X(S) returning same answer as Q(P)
schema = interface against which queries are formulated
Compiling the Binding Part of XQueries to Relational Queries
XBind query = binding stage of XQuery (returns a relation: tuples of variable bindings)
Navigation in XQueries → relational join of tables child, tag, etc.
Relational query over child(x,y), tag(x,t), desc(x,y), Root(r), etc.
But, over arbitrary DBs with this schema, the relational translation of
Root ⋈ desc ⋈ desc is not equivalent to that of Root ⋈ desc
must communicate to the C&B that desc table is transitive
The Challenge for “Reformulation on MARS”
To find the reformulations efficiently, we need to
• reason with schema correspondence
• efficiently construct the search space for reformulations
- must contain all reformulations (for completeness)
• explore search space
- exhaustively (for security applications)
- maybe trading optimality of reformulation for search speed
(for optimization purposes)
Contributions
• A novel algorithm for reformulation of relational queries under relational constraints
– Chase & Backchase [VLDB’99 with Popa and Tannen] [SIGMOD’00 with Popa, Sahuguet and Tannen]
• A declarative semantics for most of XQuery
• A reformulation algorithm for XQuery
– uses this semantics and exploits C&B
– practical (feasible and worthwhile)
– complete for “most” of XQuery
– optimal (we show lower bounds for various XQuery fragments: KRDB’01, DBPL’01)
• MARS: a system for XQuery reformulation over Mixed And Redundant Storage
– constructs and represents search space efficiently
– cost-based exploration strategy parameterized by traditional costing module
– finds first reformulation fast
• Experimental evaluation: time to first reformulation, simple cost
Compiling Client XQueries
[Architecture diagram repeated, highlighting the compilation of the client XQuery into relational queries.]
Capturing the Schema Correspondence
[Architecture diagram repeated, highlighting the compilation of the schema correspondence (mappings as XQueries) into relational constraints.]
Major Obstacles in Compiling Schema Mappings to Constraints
Schema correspondence given by XQueries. As opposed to relational queries,
• XQueries have nested, correlated subqueries in return clause
• XQueries create new elements
• XQueries return deep, recursive copies of input XML trees
(solution not shown)
Compiling Nested Subqueries: Decorrelation
the query
for $p in doc("foo.xml")//person
return <res>$p/phone/text()</res>
is short for the nested query
for $p in doc("foo.xml")//person
return <res>for $t in $p/phone/text()
            return $t
       </res>
compile XBind parts to two decorrelated relational queries (shown here in Datalog syntax):
Bouter(p) :- Root(r), desc(r,x), child(x,p), tag(p,"person")
Binner(p,t) :- Bouter(p), child(p,n), tag(n,"phone"), text(n,t)
capture each with two inclusion constraints, as done in original C&B method
Capturing Creation of New Elements
for $p in doc(“foo.xml”)//person
return <res>$p/phone/text()</res>
For each binding of $p, a distinct <res>-element is constructed.
F : set of bindings for $p (Bouter) → <res>-elements in result
F is an injective function
Capture F by the relation G representing its graph, and the constraints:
∀p ∀r1 ∀r2 [ G(p,r1) ∧ G(p,r2) → r1 = r2 ]   ( r = F(p) )
∀p1 ∀p2 ∀r [ G(p1,r) ∧ G(p2,r) → p1 = p2 ]   ( F is injective )
∀p ∀r [ G(p,r) → Bouter(p) ]   (F’s domain is included in Bouter)
∀p [ Bouter(p) → ∃r G(p,r) ]   (Bouter is included in F’s domain)
F is the Skolem function that validates this constraint
Stratified-Witness Constraints (with L.P.)
Full dependencies: no existential quantifier. The chase always terminates.
Beyond this? Given a set C of dependencies, define the chase flow graph:
Nodes correspond to relation components: an R of arity 3 produces 3 nodes.
Edges are drawn between the i’th component of R and the j’th component of S iff R appears on the left side and S appears on the right side of the implication of some dependency.
The edge is labeled ∃ if the corresponding variable in S is existentially quantified. C is stratified-witness if there is no cycle with an ∃-labeled edge.
Proposition
The chase with stratified-witness constraints always terminates.
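The definition can be tried out with a small Python sketch. It follows the slide's wording literally, which is an assumption: an edge is drawn from every component of every left-side relation to every component of every right-side relation of the same dependency. The dependency representation (premise atoms, conclusion atoms, set of existential variables) and the function names are hypothetical.

```python
def flow_graph(deps):
    """Edges ((R, i), (S, j), is_existential) of the chase flow graph."""
    edges = set()
    for premise, conclusion, exvars in deps:
        for r, rargs in premise:
            for s, sargs in conclusion:
                for i in range(len(rargs)):
                    for j, var in enumerate(sargs):
                        edges.add(((r, i), (s, j), var in exvars))
    return edges

def is_stratified_witness(deps):
    """True iff no cycle of the flow graph goes through an existential edge."""
    edges = flow_graph(deps)
    succ = {}
    for u, v, _ex in edges:
        succ.setdefault(u, set()).add(v)

    def reachable(src, dst):
        seen, stack = set(), [src]
        while stack:
            n = stack.pop()
            if n == dst:
                return True
            if n not in seen:
                seen.add(n)
                stack.extend(succ.get(n, ()))
        return False

    # An existential edge u -> v lies on a cycle iff u is reachable from v.
    return not any(ex and reachable(v, u) for u, v, ex in edges)

# desc(x,y), desc(y,z) -> desc(x,z): a full dependency
TRANSITIVITY = ([("desc", ("x", "y")), ("desc", ("y", "z"))], [("desc", ("x", "z"))], set())
# el(x) -> exists t. tag(x,t)
SOME_TAG = ([("el", ("x",))], [("tag", ("x", "t"))], {"t"})
# R(x,y) -> exists z. R(y,z): each chase step keeps inventing a new z
SUCCESSOR = ([("R", ("x", "y"))], [("R", ("y", "z"))], {"z"})
```

The last dependency is the standard example of a non-terminating chase, and the check rejects it because its existential edge sits on a cycle.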
(Relational) Conjunctive Queries
Q(x,z) :- R(x,y,z), R(y,x,u), S(z,u)
select r1.A, s.A
from R r1, R r2, S s
where r1.A = r2.B and r1.B = r2.A and
      r1.C = s.A and r2.C = s.B
notation: r stands for r1, …, rn
queries: select O(r) from R r where C(r)
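The correspondence between the Datalog form and the select-from-where form can be checked on sample data with Python and the standard `sqlite3` module. The sample rows and function names are made up for illustration, and `DISTINCT` is added on the assumption that the query is read under set semantics.

```python
import sqlite3

R_ROWS = [(1, 2, 3), (2, 1, 4), (5, 5, 3)]   # R(A, B, C)
S_ROWS = [(3, 4), (3, 9)]                    # S(A, B)

def q_datalog(R, S):
    """Q(x,z) :- R(x,y,z), R(y,x,u), S(z,u), evaluated by nested loops."""
    return sorted({(x, z)
                   for (x, y, z) in R
                   for (y2, x2, u) in R if y2 == y and x2 == x
                   for (z2, u2) in S if z2 == z and u2 == u})

def q_sql(R, S):
    """The same query in the select-from-where form of the slide."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE R (A, B, C)")
    conn.execute("CREATE TABLE S (A, B)")
    conn.executemany("INSERT INTO R VALUES (?, ?, ?)", R)
    conn.executemany("INSERT INTO S VALUES (?, ?)", S)
    rows = conn.execute("""
        SELECT DISTINCT r1.A, s.A
        FROM R r1, R r2, S s
        WHERE r1.A = r2.B AND r1.B = r2.A
          AND r1.C = s.A AND r2.C = s.B
    """).fetchall()
    conn.close()
    return sorted(rows)
```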
(Relational) Dependencies a.k.a. Integrity Constraints
(∀r∈R) [ B(r) → (∃s∈S) C(r,s) ]
B and C are conjunctions of equalities, as in where clause
example:
(∀r1∈R)(∀r2∈R) [ r1.E = r2.E → (∃s∈R) s.D = r1.D ∧ s.E = r1.E ∧ s.F = r2.F ]
Query Containment and Dependencies
Q1 ≡ select O1(r1) from R1 r1 where C1(r1)
Q2 ≡ select O2(r2) from R2 r2 where C2(r2)
define cont(Q1,Q2) as
(∀r1∈R1) [ C1(r1) → (∃r2∈R2) C2(r2) ∧ O1(r1) = O2(r2) ]
we have, in each instance
Q1 ⊆ Q2 iff cont(Q1,Q2)
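In the absence of dependencies, containment of conjunctive queries over all instances can be tested with the classical canonical-instance (homomorphism) technique of Chandra and Merlin, which is not spelled out on the slide and is used here as a stand-in. The sketch below assumes a hypothetical representation: a query is a pair (head variables, list of atoms), and an atom is a pair (relation name, tuple of variables).

```python
def contained_in(q1, q2):
    """True iff Q1 is contained in Q2 on every instance (no dependencies)."""
    head1, body1 = q1
    head2, body2 = q2
    facts = set(body1)            # canonical instance: Q1's variables act as constants

    def extend(atoms, hom):
        if not atoms:
            # The homomorphism must map Q2's head onto Q1's head.
            return tuple(hom.get(v) for v in head2) == tuple(head1)
        (rel, args), rest = atoms[0], atoms[1:]
        for frel, fargs in facts:
            if frel != rel or len(fargs) != len(args):
                continue
            candidate = dict(hom)
            if all(candidate.setdefault(a, f) == f for a, f in zip(args, fargs)):
                if extend(rest, candidate):
                    return True
        return False

    return extend(list(body2), {})

# Q1(x) :- R(x,y), R(y,x)      Q2(x) :- R(x,y)
Q1 = (("x",), [("R", ("x", "y")), ("R", ("y", "x"))])
Q2 = (("x",), [("R", ("x", "y"))])
```

Q1 asks for more than Q2, so Q1 is contained in Q2 but not the other way round.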
And Vice Versa
d ≡ (∀r∈R) [ B(r) → (∃s∈S) C(r,s) ]
front(d) = select r from R r where B(r)
back(d) = select r from R r, S s where B(r) ∧ C(r,s)
we have, in each instance
d iff front(d) ⊆ back(d)
Chase Step
d ≡ (∀r∈R) [ B(r) → (∃s∈S) C(r,s) ]
Q:  select O(r) from R r where B(r)
chases in one step with d to
Q’: select O(r) from R r, S s where B(r) ∧ C(r,s)
basic fact: if Q chases with d to Q’, then Q =d Q’ (Q and Q’ are equivalent in all instances satisfying d)
the chase step is applicable if Q’ is not trivially equivalent to Q
(for example, we cannot chase Q’ with d!)
Using the Chase
basic fact: if chase step of Q with d is not applicable
then Inst(Q) ⊨ d
( canonical instance Inst(Q) built from query Q )
Basic Theorem
D set of dependencies
Q1 → . . . → chaseD(Q1) terminating chase sequence
(no more applicable steps) Then:
Q1 ⊆D Q2 iff chaseD(Q1) ⊆ Q2
Reformulation with Views
a view is just a query:
V ≡ select O(r) from R r where C(r)
Reformulation of query Q(R) with view V:
finding X(R,V) such that Q(R) =V X(R,V)
One View = Two Dependencies
V ≡ select O(r) from R r where C(r)
the “chase-in” dependency:
cV ≡ (∀r∈R) [ C(r) → (∃x∈V) x = O(r) ]
the “backchase” dependency:
bV ≡ (∀x∈V) [ (∃r∈R) C(r) ∧ x = O(r) ]
It turns out that if rewritings of Q with V exist, then such a rewriting can be obtained by chasing Q with cV
The Chase and Backchase (C&B) Algorithm (joint work with Lucian Popa, IBM Almaden)
The chase with cV always terminates.
The search space for rewritings of Q with V consists of the subqueries of chase_cV(Q).
(S is a subquery: injective homomorphism from S to chase_cV(Q))
Keep only subqueries S that are equivalent to chase_cV(Q) under V
This can be checked by (back!)chasing with cV, bV (also terminating)
Preliminary Completeness Result for C&B (with L.P.)
Theorem: Any scan-minimal reformulation of Q with V is a subquery of chase_cV(Q).
scan-minimal: no scan (from item) can be removed
without compromising equivalence with Q.
Fewer scans means faster execution under most cost models.
Additional Integrity Constraints
In general the storage schema contains integrity constraints
that restrict its class of instances (models). This may extend
the set of reformulation solutions!
Let C be a set of dependencies
Reformulating query Q(R) with view V under C :
finding X(R,V) such that Q(R) =V,C X(R,V).
That’s the same as reformulating Q under C + cV + bV
Can we still use the chase?