Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
1
Query Processing
Be Adaptive
Traditional Query Processing
Purchase Person
Buyer=name
City=‘seattle’ phone>’5430000’
buyer
(Simple Nested Loops)
SELECT S.snameFROM Purchase P, Person QWHERE P.buyer=Q.name AND Q.city=‘seattle’ AND Q.phone > ‘5430000’
σ
(Table scan) (Index scan)
1. Query optimization
2. Query execution
Meta-Data for Optimization
• Sizes of tables in the database
• Selectivity information:– How many rows per value in a column
– How many rows per range
– Cardinality of certain views
• Indexes on the data:– B-trees, hash indices, sorting order
Query Optimization Steps
• Algebraic transformations– Can be very complex
• Select ordering of joins– Many possible orders. Can’t explore them
all
• Select algorithm for each node in thequery plan– Many possible implementations of
relational operators
Algebraic TransformationPredicate Pushdown
Reserves Sailors
sid=sid
bid=100 rating > 5
sname
Reserves Sailors
sid=sid
bid=100
sname
rating > 5(Scan;write to temp T1)
(Scan;write totemp T2)
The earlier we process selections, less tuples we need to manipulatehigher up in the tree.Disadvantages?
Determining Join Ordering
• R1 R2 …. Rn
• Join tree:
• A join tree represents a plan.
R3 R1 R2 R4
2
Left Deep Trees
• Good for pipelining
R3 R1
R5
R2
R4
Hash-Join• Partition both relations
using hash fn h: R tuplesin partition i will onlymatch S tuples in partitioni.
Read in a partitionof R, hash it usingh2 (<> h!). Scanmatching partitionof S, search formatches.
Partitionsof R & S
Input bufferfor Si
Hash table for partitionRi (k < B-1 pages)
B main memory buffersDisk
Output buffer
Disk
Join Result
hashfnh2
h2
B main memory buffers DiskDisk
Original Relation OUTPUT
2INPUT
1
hashfunction
h B-1
Partitions
1
2
B-1
. . .
Cost of Hash-Join
• In partitioning phase, read+write bothrelations; 2(M+N).
• In matching phase, read both relations;M+N I/Os.
Sort-Merge Join
• Sort R and S on the join column, then scanthem to do a ``merge’’ on the join column.– Advance scan of R until current R-tuple >= current
S tuple, then advance scan of S until current S-tuple >= current R tuple; do this until current Rtuple = current S tuple.
– At this point, all R tuples with same value and all Stuples with same value match; output <r, s> for allpairs of such tuples.
– Then resume scanning R and S.
Cost of Sort-Merge Join
• R is scanned once; each S group is scannedonce per matching R tuple.
• Cost: M log M + N log N + (M+N)– The cost of scanning, M+N, could be M*N
(unlikely!)• Sort-Merge Join vs. Hash Join:
– Given a minimum amount of memory both have acost of 3(M+N) I/Os. Hash Join superior on thiscount if relation sizes differ greatly. Also, HashJoin shown to be highly parallelizable.
But in Data Integration…
• Few statistics, if any– autonomous sources
• Unanticipated delays and failures– network-bound sources
– A query plan can look good but not work well
• Desiderata:– optimize time to first tuples
– Reduce load on sources
3
Adaptive Query Processing
• Reconsider query plan duringexecution!– But when? How?
• Operators better suited to dynamicenvironments– Double pipelined join.
Double Pipelined Join
Hash Join Partially pipelined: no output
until inner read Asymmetric (inner vs. outer)
— optimization requiressource behavior knowledge
Double Pipelined HashJoin
Outputs data immediately
Symmetric — requires lesssource knowledge tooptimize
Re-optimization (1)
R3 R1
R5
R2
R4
Re-optimize atmaterialization points
Re-optimization (2)
R3 R1
R5
R2
R4
Insert conditionals intoplan:
“if C then join with R5otherwise, R2” …
Other Radical Options
• Eddies: a separate plan for every tuple– Depends on speed and selectivity
– But decisions are very local
• Parallel execution:– Try out 2(+) plans and see which goes
better
– Terminate the other one.
Multiple Challenges
• Don’t increase the cost of planning
• Minimize wasted work
• Tradeoff local vs. global decisions
• Don’t get stuck because of one baddecision
• Partition plan vs. data?
4
Adaptive Data PartitioningR ⋈ S ⋈ T∪
R S T
R0 S0T0
R0 S0
S0R0
T 0 S1T1
R1 S1T1
R1
S 1 T1Exclude R0S0
Exclude R0S0T0,R1S1T1
∪
R0S0
R1
T1
S 1
T0
R0 S 0
…
Query Processing Summary
• We don’t have the usual statistics– Hence, need to be adaptive
• Adaptivity needs to trade off severalfactors
• From a theoretical viewpoint, this is stillan open problem.– And what if results are assumed to be
approximate?
XML
The Data Integration Appetizer
XML and Data Integration
• XML: whetted the integration appetites– We have the syntax
– Now just solve the silly semantics problems
– Don’t bother: we’ll all standardize on DTDs.
• XML had a significant role on the dataintegration industry and research.
XML<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
XML describes the content vs. presentation
XML Terminology• tags: book, title, author, …
• start tag: <book>, end tag: </book>
• elements: <book>…<book>,<author>…</author>
• elements are nested
• empty element: <red></red> abbrv. <red/>
• an XML document: single root element
well formed XML document: if it has matching tags
5
More XML: Attributes
<book price = “55” currency = “USD”>
<title> Foundations of Databases</title>
<author> Abiteboul </author>
…
<year> 1995 </year>
</book>
attributes are alternative ways to represent data
More XML: Oids and References
<person id=“o555”> <name> Jane </name> </person>
<person id=“o456”> <name> Mary </name>
<children idref=“o123 o555”/>
</person>
<person id=“o123” mother=“o456”><name>John</name>
</person>
oids and references in XML are just syntax
XML Semantics: a Tree !
<data>
<person id=“o555” >
<name> Mary </name>
<address>
<street> Maple </street>
<no> 345 </no>
<city> Seattle </city>
</address>
</person>
<person>
<name> John </name>
<address> Thailand </address>
<phone> 23456 </phone>
</person>
</data>
data
Mary
person
person
name address
name address
street no city
Maple 345 Seattle
JohnThai
phone
23456
id
o555
Elementnode
Textnode
Attributenode
Order matters !!!
XML Data
• XML is self-describing• Schema elements become part of the
data– Relational schema: persons(name,phone)– In XML <persons>, <name>, <phone> are part of
the data, and are repeated many times
• Consequence: XML is much moreflexible
• XML = semistructured data
Relational Data as XML
<person><row> <name>John</name> <phone> 3634</phone></row> <row> <name>Sue</name> <phone> 6343</phone> <row> <name>Dick</name> <phone>
6363</phone></row></person>
n a m e p h o n e
J o h n 3 6 3 4
S u e 6 3 4 3
D i c k 6 3 6 3
row row row
name name namephone phone phone
“John” 3634 “Sue” “Dick”6343 6363
person XML: person
XML is Semi-structured Data
• Missing attributes:
• Could represent ina table with nulls
<person> <name> John</name> <phone>1234</phone> </person>
<person> <name>Joe</name></person> no phone !
-Joe
1234John
phonename
6
XML is Semi-structured Data
• Repeated attributes
• Impossible in tables:
<person> <name> Mary</name> <phone>2345</phone> <phone>3456</phone></person>
two phones !
34562345Mary
phonename
???
XML is Semi-structured Data
• Attributes with different types in different objects
• Nested collections (no 1NF)
• Heterogeneous collections:– <db> contains both <book>s and <publisher>s
<person> <name> <first> John </first> <last> Smith </last> </name> <phone>1234</phone></person>
structured name !
Document Type DefinitionsDTD
• part of the original XML specification
• an XML document may have a DTD
• XML document:well-formed = if tags are correctly closed
Valid = if it has a DTD and conforms to it
• validation is useful in data exchange
Very Simple DTD
<!DOCTYPE company [ <!ELEMENT company ((person|product)*)> <!ELEMENT person (ssn, name, office, phone?)> <!ELEMENT ssn (#PCDATA)> <!ELEMENT name (#PCDATA)> <!ELEMENT office (#PCDATA)> <!ELEMENT phone (#PCDATA)> <!ELEMENT product (pid, name, description?)> <!ELEMENT pid (#PCDATA)> <!ELEMENT description (#PCDATA)>]>
Very Simple DTD
<company> <person> <ssn> 123456789 </ssn> <name> John </name> <office> B432 </office> <phone> 1234 </phone> </person> <person> <ssn> 987654321 </ssn> <name> Jim </name> <office> B123 </office> </person> <product> ... </product> ...</company>
Example of valid XML document:
Querying XML Data
• XPath = simple navigation through thetree
• XQuery = the SQL of XML
7
Sample Data for Queries<bib>
<book> <publisher> Addison-Wesley </publisher> <author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <title> Foundations of Databases </title> <year> 1995 </year></book><book price=“55”> <publisher> Freeman </publisher> <author> Jeffrey D. Ullman </author> <title> Principles of Database and Knowledge Base Systems </title> <year> 1998 </year></book>
</bib>
Data Model for XPath
bib
book book
publisher author . . . .
Addison-Wesley Serge Abiteboul
The root
The root element
XPath: Simple Expressions
Result: <year> 1995 </year>
<year> 1998 </year>
Result: empty (there were no papers)
/bib/book/year
/bib/paper/year
XPath: Restricted KleeneClosure
Result:<author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <author> Jeffrey D. Ullman </author>
Result: <first-name> Rick </first-name>
//author
/bib//first-name
Xpath: Text Nodes
Result: Serge Abiteboul
Jeffrey D. Ullman
Rick Hull doesn’t appear because he has firstname, lastname
Functions in XPath:– text() = matches the text value– node() = matches any node (= * or @* or text())– name() = returns the name of the current tag
/bib/book/author/text()
Xpath: Wildcard
Result: <first-name> Rick </first-name>
<last-name> Hull </last-name>
* Matches any element
//author/*
8
Xpath: Attribute Nodes
Result: “55”
@price means that price is has to be anattribute
/bib/book/@price
XPath: Predicates
Result: <author> <first-name> Rick </first-name>
<last-name> Hull </last-name>
</author>
/bib/book/author[firstname]
XPath: More Predicates
Result: <lastname> … </lastname>
<lastname> … </lastname>
/bib/book/author[firstname][address[//zip][city]]/lastname
Xpath: More Predicates
/bib/book[@price < “60”]
/bib/book[author/@age < “25”]
/bib/book[author/text()]
Xpath: Summarybib matches a bib element
* matches any element
/ matches the root element
/bib matches a bib element under root
bib/paper matches a paper in bib
bib//paper matches a paper in bib, at any depth
//paper matches a paper at any depth
paper|book matches a paper or a book
@price matches a price attribute
bib/book/@price matches price attribute in book, in bib
bib/book/[@price<“55”]/author/lastname matches…
XML: Query Language History
• XPath was there– Nobody thought to join XML data
• DB community (circa 95-96)– Semi-structured data: Lorel, UnQL, StruQL– XML-QL: applied StruQL to XML– XML paper-mill starts…
• XQuery working group formed (1998)– But not in place when some of us needed
it.
9
XQuery
• Based on Quilt, which is based on XML-QL
• Uses XPath to express more complexqueries
FLWR (“Flower”) Expressions
FOR ...
LET...
WHERE...
RETURN...
XQuery
Find all book titles published after 1995:
FOR $x IN document("bib.xml")/bib/book
WHERE $x/year > 1995
RETURN { $x/title }
Result: <title> abc </title> <title> def </title> <title> ghi </title>
XQuery
Find book titles by the coauthors of“Database Theory”:
FOR $x IN bib/book[title/text() = “Database Theory”]/author $y IN bib/book[author/text() = $x/text()]/title
RETURN <answer> { $y/text() } </answer>
Result: <answer> abc </ answer > < answer > def </ answer > < answer > ghi </ answer >
The answer willcontain duplicates !
XQuery
Same as before, but eliminate duplicates:
FOR $x IN bib/book[title/text() = “Database Theory”]/author $y IN distinct(bib/book[author/text() = $x/text()]/title)
RETURN <answer> { $y/text() } </answer>
Result: <answer> abc </ answer > < answer > def </ answer > < answer > ghi </ answer >
distinct = a function that eliminates duplicates
XQuery: NestingFor each author of a book by Morgan
Kaufmann, list all books she published:
FOR $a IN distinct(document("bib.xml") /bib/book[publisher=“Morgan Kaufmann”]/author)RETURN <result> { $a, FOR $t IN /bib/book[author=$a]/title RETURN $t } </result>
10
XQuery
<result> <author>Jones</author> <title> abc </title> <title> def </title> </result> <result> <author> Smith </author> <title> ghi </title> </result>
Result:XQuery
• FOR $x in expr -- binds $x to eachvalue in the list expr
• LET $x = expr -- binds $x to the entirelist expr– Useful for common subexpressions and for
aggregations
XQuery
count = a (aggregate) function that returns the number of elms
<big_publishers>
FOR $p IN distinct(document("bib.xml")//publisher)
LET $b := document("bib.xml")/book[publisher = $p]
WHERE count($b) > 100
RETURN { $p }
</big_publishers>
XQuery
Find books whose price is larger thanaverage:
LET $a=avg(document ( "b i b . xml ") /bib/book/price)
FOR $b in document ( "b i b . xml ") /bib/book
WHERE $b/price > $a
RETURN { $b }
Let’s try to write this in SQL…
XQuery
Summary:
• FOR-LET-WHERE-RETURN = FLWR
FOR/LET Clauses
WHERE Clause
RETURN Clause
List of tuples
List of tuples
Instance of Xquery data model
XML & Data Integration
• Need to extend every aspect of thesystem to handle XML
• Great if you’re looking for papers to write!
– Schema mapping languages
– AQUV, containment
– Query processing
– Storage
– Update language
11
An XML Mapping Language
projects
project
name
position
member
name
Source schema (UPenn)
Target schema(UW):students
student
name projectadvisor
<students> {for $p in /projects/project, $pMember in $p/member, $studName in $pMember/name/text(), $pos in $pMember/position/text()where $pos = ’student’return <student> <name> { $studName } </name> {for $sp in /projects/project, $pName in $sp/name/text(), $mName in $sp/member/name/text() where $mName = $studName return <project> { $pName } </project> } </student> }</students>
What is Query Containment?
Q ⊆ Q’
-- for every inputD, Q(D) ⊆ Q’(D)
Querycontainment
Q(D) ⊆ Q’(D)
-- SubsetInstance
containment
A set of tuplesOutput Q(D)
Sets of tuplesInput D
RelationalQuery
What is Query Containment?
Q Q’
--for every input D,Q(D) Q’(D)
Q ⊆ Q’
-- for every inputD, Q(D) ⊆ Q’(D)
Querycontainment
Q(D) Q’(D)
-- Treehomomorphism
Q(D) ⊆ Q’(D)
-- SubsetInstance
containment
An XML instance treeA set of tuplesOutput Q(D)
An XML instance treeSets of tuplesInput D
XML QueryRelational
Query
Example – An XML Instance
D:
<project>
<member>Alice</member>
</project>
<project>
<member>Bob</member>
</project>
project project
member
^ ^
member
Alice Bob
^: leaf nodes
Example – An XML Query
Q1:FOR $x IN /project RETURN<group>{
FOR $y IN $x/memberRETURN<name>{
FOR $z IN $y[text()=“Alice”]RETURN <Alice/>
FOR $z IN $y[text()=“Bob”]RETURN <Bob/>
}</name>}</group>
project project
member
^ ^
member
Alice Bob
D:
Q1(D):
^ ^
group group
name name
Alice Bob
Example – Another XML Query
group
name name
^ ^
Alice Bob
D:
Q2(D):
Q2:FOR $x IN /project RETURN<group>{
FOR $y IN /project/memberRETURN<name>{
FOR $z IN $y[text()=“Alice”]RETURN <Alice/>
FOR $z IN $y[text()=“Bob”]RETURN <Bob/>
}</name>}</group>
project project
member
^ ^
member
Alice Bob
12
Q2(D):Q1(D):
X
group group
name name
Alice Bob
group
name name
^ ^
Alice Bob
^ ^
Q2(D):Q1(D):group group
name name
Alice Bob
group
name name
^ ^
Alice Bob
^ ^
Q1(D) Q2(D)
Q2(D) Q1(D)
Example -- Tree Homomorphism andXML Instance Containment
Complexity for XPaths
[Florescu et al., 1998]
[Deutsch and Tannen,
2001]
[Miklao and Suciu,2002]
[Milo and Suciu, 1999]
[Amer-Yahia et al.,2001]
Citation
NPCWith //, [], Equi-join on tags
PSPACE-hardRegular Expression
∏2P
-complete
With //, *, [], Equi-join ontags, disjunction
CO-NPCWith //, *, []
PTIMEWith //, []
PTIMEWith *, [] (branch)
PTIMEWith //, *
ComplexityRestriction
Complexity for Nested XPaths
CO-NPC
∏2P
-complete
PSPACE-hard? (not
sure)
∏2P
-complete
CO-NPC
Nested-Query
Complexity (withfixed depth)
NPCWith //, [], Equi-join on
tags
PSPACE-hardRegular Expression
∏2P
-complete
With //, *, [], Equi-joinon tags, disjunction
CO-NPCWith //, *, []
PTIMEWith //, []
PTIMEWith *, [] (branch)
PTIMEWith //, *
XPATH-ComplexityRestriction
Containment Algorithm(1):Query Head Embedding
<student>
<name> {$student}
<advisor> {$name}
<result>
position
<people>
dbpeople
faculty
name
$pos=‘professor’
student
advisee$advisee = $student
Q1:
<result>
<name> {$student}
<advisor> {$name}
<people>
db
people
faculty
name<faculty> {$name}
<student>student
$advisee = $student
facultyname advisee
:Q2Query head embedding:
Query Head
Containment Algorithm(2):Query Body Embedding
<student>
<name> {$student}
<advisor> {$name}
<result>
position
<people>
dbpeople
faculty
name
$pos=‘professor’
student
<result>
<name> {$student}
<advisor> {$name}
<people>
db
people
advisee$advisee = $student
<faculty>
<student>student
$advisee = $student
facultyname advisee
Q1: :Q2Query bodyembedding:
XML Summary
• Great for data integration:– People could envision sharing data
• Great for research:– XML most popular topic at conferences for
a few years
• Great for industry:– Many services built on XML
• Let’s declare victory and move on..
13
Model Management[Bernstein et al.]
• Generic infrastructure for managingschemas and mappings:– Manipulate models and mappings as bulk
objects– Operators to create & compose mappings,
merge & diff models– Short operator scripts can solve schema
integration, schema evolution, reverseengineering, etc.
• First challenge: semantics of operators.