Meta-Data for Optimization Query Optimization Stepspeople.cs.aau.dk/~csj/DataInt/queryProcessing.pdfFROM Purchase P, Person Q WHERE P.buyer=Q.name AND Q.city=‘seattle’ AND Q.phone>‘543

1

Query Processing

Be Adaptive

Traditional Query Processing

Purchase Person

Buyer=name

City=‘seattle’ phone>’5430000’

buyer

(Simple Nested Loops)

SELECT S.snameFROM Purchase P, Person QWHERE P.buyer=Q.name AND Q.city=‘seattle’ AND Q.phone > ‘5430000’

σ

(Table scan) (Index scan)

1. Query optimization

2. Query execution

Meta-Data for Optimization

• Sizes of tables in the database

• Selectivity information:– How many rows per value in a column

– How many rows per range

– Cardinality of certain views

• Indexes on the data:– B-trees, hash indices, sorting order

Query Optimization Steps

• Algebraic transformations– Can be very complex

• Select ordering of joins– Many possible orders. Can’t explore them

all

• Select algorithm for each node in thequery plan– Many possible implementations of

relational operators

Algebraic TransformationPredicate Pushdown

Reserves Sailors

sid=sid

bid=100 rating > 5

sname

Reserves Sailors

sid=sid

bid=100

sname

rating > 5(Scan;write to temp T1)

(Scan;write totemp T2)

The earlier we process selections, less tuples we need to manipulatehigher up in the tree.Disadvantages?

Determining Join Ordering

• R1 R2 …. Rn

• Join tree:

• A join tree represents a plan.

R3 R1 R2 R4

2

Left Deep Trees

• Good for pipelining

R3 R1

R5

R2

R4

Hash-Join• Partition both relations

using hash fn h: R tuplesin partition i will onlymatch S tuples in partitioni.

Read in a partitionof R, hash it usingh2 (<> h!). Scanmatching partitionof S, search formatches.

Partitionsof R & S

Input bufferfor Si

Hash table for partitionRi (k < B-1 pages)

B main memory buffersDisk

Output buffer

Disk

Join Result

hashfnh2

h2

B main memory buffers DiskDisk

Original Relation OUTPUT

2INPUT

1

hashfunction

h B-1

Partitions

1

2

B-1

. . .

Cost of Hash-Join

• In partitioning phase, read+write bothrelations; 2(M+N).

• In matching phase, read both relations;M+N I/Os.

Sort-Merge Join

• Sort R and S on the join column, then scanthem to do a ``merge’’ on the join column.– Advance scan of R until current R-tuple >= current

S tuple, then advance scan of S until current S-tuple >= current R tuple; do this until current Rtuple = current S tuple.

– At this point, all R tuples with same value and all Stuples with same value match; output <r, s> for allpairs of such tuples.

– Then resume scanning R and S.

Cost of Sort-Merge Join

• R is scanned once; each S group is scannedonce per matching R tuple.

• Cost: M log M + N log N + (M+N)– The cost of scanning, M+N, could be M*N

(unlikely!)• Sort-Merge Join vs. Hash Join:

– Given a minimum amount of memory both have acost of 3(M+N) I/Os. Hash Join superior on thiscount if relation sizes differ greatly. Also, HashJoin shown to be highly parallelizable.

But in Data Integration…

• Few statistics, if any– autonomous sources

• Unanticipated delays and failures– network-bound sources

– A query plan can look good but not work well

• Desiderata:– optimize time to first tuples

– Reduce load on sources

3

Adaptive Query Processing

• Reconsider query plan duringexecution!– But when? How?

• Operators better suited to dynamicenvironments– Double pipelined join.

Double Pipelined Join

Hash Join Partially pipelined: no output

until inner read Asymmetric (inner vs. outer)

— optimization requiressource behavior knowledge

Double Pipelined HashJoin

Outputs data immediately

Symmetric — requires lesssource knowledge tooptimize

Re-optimization (1)

R3 R1

R5

R2

R4

Re-optimize atmaterialization points

Re-optimization (2)

R3 R1

R5

R2

R4

Insert conditionals intoplan:

“if C then join with R5otherwise, R2” …

Other Radical Options

• Eddies: a separate plan for every tuple– Depends on speed and selectivity

– But decisions are very local

• Parallel execution:– Try out 2(+) plans and see which goes

better

– Terminate the other one.

Multiple Challenges

• Don’t increase the cost of planning

• Minimize wasted work

• Tradeoff local vs. global decisions

• Don’t get stuck because of one baddecision

• Partition plan vs. data?

4

Adaptive Data PartitioningR ⋈ S ⋈ T∪

R S T

R0 S0T0

R0 S0

S0R0

T 0 S1T1

R1 S1T1

R1

S 1 T1Exclude R0S0

Exclude R0S0T0,R1S1T1

∪

R0S0

R1

T1

S 1

T0

R0 S 0

…

Query Processing Summary

• We don’t have the usual statistics– Hence, need to be adaptive

• Adaptivity needs to trade off severalfactors

• From a theoretical viewpoint, this is stillan open problem.– And what if results are assumed to be

approximate?

XML

The Data Integration Appetizer

XML and Data Integration

• XML: whetted the integration appetites– We have the syntax

– Now just solve the silly semantics problems

– Don’t bother: we’ll all standardize on DTDs.

• XML had a significant role on the dataintegration industry and research.

XML<bibliography>

<book> <title> Foundations… </title>

<author> Abiteboul </author>

<author> Hull </author>

<author> Vianu </author>

<publisher> Addison Wesley </publisher>

<year> 1995 </year>

</book>

…

</bibliography>

XML describes the content vs. presentation

XML Terminology• tags: book, title, author, …

• start tag: <book>, end tag: </book>

• elements: <book>…<book>,<author>…</author>

• elements are nested

• empty element: <red></red> abbrv. <red/>

• an XML document: single root element

well formed XML document: if it has matching tags

5

More XML: Attributes

<book price = “55” currency = “USD”>

<title> Foundations of Databases</title>

<author> Abiteboul </author>

…

<year> 1995 </year>

</book>

attributes are alternative ways to represent data

More XML: Oids and References

<person id=“o555”> <name> Jane </name> </person>

<person id=“o456”> <name> Mary </name>

<children idref=“o123 o555”/>

</person>

<person id=“o123” mother=“o456”><name>John</name>

</person>

oids and references in XML are just syntax

XML Semantics: a Tree !

<data>

<person id=“o555” >

<name> Mary </name>

<address>

<street> Maple </street>

<no> 345 </no>

<city> Seattle </city>

</address>

</person>

<person>

<name> John </name>

<address> Thailand </address>

<phone> 23456 </phone>

</person>

</data>

data

Mary

person

person

name address

name address

street no city

Maple 345 Seattle

JohnThai

phone

23456

id

o555

Elementnode

Textnode

Attributenode

Order matters !!!

XML Data

• XML is self-describing• Schema elements become part of the

data– Relational schema: persons(name,phone)– In XML <persons>, <name>, <phone> are part of

the data, and are repeated many times

• Consequence: XML is much moreflexible

• XML = semistructured data

Relational Data as XML

<person><row> <name>John</name> <phone> 3634</phone></row> <row> <name>Sue</name> <phone> 6343</phone> <row> <name>Dick</name> <phone>

6363</phone></row></person>

n a m e p h o n e

J o h n 3 6 3 4

S u e 6 3 4 3

D i c k 6 3 6 3

row row row

name name namephone phone phone

“John” 3634 “Sue” “Dick”6343 6363

person XML: person

XML is Semi-structured Data

• Missing attributes:

• Could represent ina table with nulls

<person> <name> John</name> <phone>1234</phone> </person>

<person> <name>Joe</name></person> no phone !

-Joe

1234John

phonename

6


• Repeated attributes

• Impossible in tables:

<person> <name> Mary</name> <phone>2345</phone> <phone>3456</phone></person>

two phones !

34562345Mary

phonename

???


• Attributes with different types in different objects

• Nested collections (no 1NF)

• Heterogeneous collections:– <db> contains both <book>s and <publisher>s

<person> <name> <first> John </first> <last> Smith </last> </name> <phone>1234</phone></person>

structured name !

Document Type DefinitionsDTD

• part of the original XML specification

• an XML document may have a DTD

• XML document:well-formed = if tags are correctly closed

Valid = if it has a DTD and conforms to it

• validation is useful in data exchange

Very Simple DTD

<!DOCTYPE company [ <!ELEMENT company ((person|product)*)> <!ELEMENT person (ssn, name, office, phone?)> <!ELEMENT ssn (#PCDATA)> <!ELEMENT name (#PCDATA)> <!ELEMENT office (#PCDATA)> <!ELEMENT phone (#PCDATA)> <!ELEMENT product (pid, name, description?)> <!ELEMENT pid (#PCDATA)> <!ELEMENT description (#PCDATA)>]>

Very Simple DTD

<company> <person> <ssn> 123456789 </ssn> <name> John </name> <office> B432 </office> <phone> 1234 </phone> </person> <person> <ssn> 987654321 </ssn> <name> Jim </name> <office> B123 </office> </person> <product> ... </product> ...</company>

Example of valid XML document:

Querying XML Data

• XPath = simple navigation through thetree

• XQuery = the SQL of XML

7

Sample Data for Queries<bib>

<book> <publisher> Addison-Wesley </publisher> <author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <title> Foundations of Databases </title> <year> 1995 </year></book><book price=“55”> <publisher> Freeman </publisher> <author> Jeffrey D. Ullman </author> <title> Principles of Database and Knowledge Base Systems </title> <year> 1998 </year></book>

</bib>

Data Model for XPath

bib

book book

publisher author . . . .

Addison-Wesley Serge Abiteboul

The root

The root element

XPath: Simple Expressions

Result: <year> 1995 </year>

<year> 1998 </year>

Result: empty (there were no papers)

/bib/book/year

/bib/paper/year

XPath: Restricted KleeneClosure

Result:<author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <author> Jeffrey D. Ullman </author>

Result: <first-name> Rick </first-name>

//author

/bib//first-name

Xpath: Text Nodes

Result: Serge Abiteboul

Jeffrey D. Ullman

Rick Hull doesn’t appear because he has firstname, lastname

Functions in XPath:– text() = matches the text value– node() = matches any node (= * or @* or text())– name() = returns the name of the current tag

/bib/book/author/text()

Xpath: Wildcard

Result: <first-name> Rick </first-name>

<last-name> Hull </last-name>

* Matches any element

//author/*

8

Xpath: Attribute Nodes

Result: “55”

@price means that price is has to be anattribute

/bib/book/@price

XPath: Predicates

Result: <author> <first-name> Rick </first-name>

<last-name> Hull </last-name>

</author>

/bib/book/author[firstname]

XPath: More Predicates

Result: <lastname> … </lastname>

<lastname> … </lastname>

/bib/book/author[firstname][address[//zip][city]]/lastname

Xpath: More Predicates

/bib/book[@price < “60”]

/bib/book[author/@age < “25”]

/bib/book[author/text()]

Xpath: Summarybib matches a bib element

* matches any element

/ matches the root element

/bib matches a bib element under root

bib/paper matches a paper in bib

bib//paper matches a paper in bib, at any depth

//paper matches a paper at any depth

paper|book matches a paper or a book

@price matches a price attribute

bib/book/@price matches price attribute in book, in bib

bib/book/[@price<“55”]/author/lastname matches…

XML: Query Language History

• XPath was there– Nobody thought to join XML data

• DB community (circa 95-96)– Semi-structured data: Lorel, UnQL, StruQL– XML-QL: applied StruQL to XML– XML paper-mill starts…

• XQuery working group formed (1998)– But not in place when some of us needed

it.

9

XQuery

• Based on Quilt, which is based on XML-QL

• Uses XPath to express more complexqueries

FLWR (“Flower”) Expressions

FOR ...

LET...

WHERE...

RETURN...

XQuery

Find all book titles published after 1995:

FOR $x IN document("bib.xml")/bib/book

WHERE $x/year > 1995

RETURN { $x/title }

Result: <title> abc </title> <title> def </title> <title> ghi </title>

XQuery

Find book titles by the coauthors of“Database Theory”:

FOR $x IN bib/book[title/text() = “Database Theory”]/author $y IN bib/book[author/text() = $x/text()]/title

RETURN <answer> { $y/text() } </answer>

Result: <answer> abc </ answer > < answer > def </ answer > < answer > ghi </ answer >

The answer willcontain duplicates !

XQuery

Same as before, but eliminate duplicates:

FOR $x IN bib/book[title/text() = “Database Theory”]/author $y IN distinct(bib/book[author/text() = $x/text()]/title)

RETURN <answer> { $y/text() } </answer>

Result: <answer> abc </ answer > < answer > def </ answer > < answer > ghi </ answer >

distinct = a function that eliminates duplicates

XQuery: NestingFor each author of a book by Morgan

Kaufmann, list all books she published:

FOR $a IN distinct(document("bib.xml") /bib/book[publisher=“Morgan Kaufmann”]/author)RETURN <result> { $a, FOR $t IN /bib/book[author=$a]/title RETURN $t } </result>

10

XQuery

<result> <author>Jones</author> <title> abc </title> <title> def </title> </result> <result> <author> Smith </author> <title> ghi </title> </result>

Result:XQuery

• FOR $x in expr -- binds $x to eachvalue in the list expr

• LET $x = expr -- binds $x to the entirelist expr– Useful for common subexpressions and for

aggregations

XQuery

count = a (aggregate) function that returns the number of elms

<big_publishers>

FOR $p IN distinct(document("bib.xml")//publisher)

LET $b := document("bib.xml")/book[publisher = $p]

WHERE count($b) > 100

RETURN { $p }

</big_publishers>

XQuery

Find books whose price is larger thanaverage:

LET $a=avg(document ( "b i b . xml ") /bib/book/price)

FOR $b in document ( "b i b . xml ") /bib/book

WHERE $b/price > $a

RETURN { $b }

Let’s try to write this in SQL…

XQuery

Summary:

• FOR-LET-WHERE-RETURN = FLWR

FOR/LET Clauses

WHERE Clause

RETURN Clause

List of tuples

List of tuples

Instance of Xquery data model

XML & Data Integration

• Need to extend every aspect of thesystem to handle XML

• Great if you’re looking for papers to write!

– Schema mapping languages

– AQUV, containment

– Query processing

– Storage

– Update language

11

An XML Mapping Language

projects

project

name

position

member

name

Source schema (UPenn)

Target schema(UW):students

student

name projectadvisor

<students> {for $p in /projects/project, $pMember in $p/member, $studName in $pMember/name/text(), $pos in $pMember/position/text()where $pos = ’student’return <student> <name> { $studName } </name> {for $sp in /projects/project, $pName in $sp/name/text(), $mName in $sp/member/name/text() where $mName = $studName return <project> { $pName } </project> } </student> }</students>

What is Query Containment?

Q ⊆ Q’

-- for every inputD, Q(D) ⊆ Q’(D)

Querycontainment

Q(D) ⊆ Q’(D)

-- SubsetInstance

containment

A set of tuplesOutput Q(D)

Sets of tuplesInput D

RelationalQuery

What is Query Containment?

Q Q’

--for every input D,Q(D) Q’(D)

Q ⊆ Q’

-- for every inputD, Q(D) ⊆ Q’(D)

Querycontainment

Q(D) Q’(D)

-- Treehomomorphism

Q(D) ⊆ Q’(D)

-- SubsetInstance

containment

An XML instance treeA set of tuplesOutput Q(D)

An XML instance treeSets of tuplesInput D

XML QueryRelational

Query

Example – An XML Instance

D:

<project>

<member>Alice</member>

</project>

<project>

<member>Bob</member>

</project>

project project

member

^ ^

member

Alice Bob

^: leaf nodes

Example – An XML Query

Q1:FOR $x IN /project RETURN<group>{

FOR $y IN $x/memberRETURN<name>{

FOR $z IN $y[text()=“Alice”]RETURN <Alice/>

FOR $z IN $y[text()=“Bob”]RETURN <Bob/>

}</name>}</group>

project project

member

^ ^

member

Alice Bob

D:

Q1(D):

^ ^

group group

name name

Alice Bob

Example – Another XML Query

group

name name

^ ^

Alice Bob

D:

Q2(D):

Q2:FOR $x IN /project RETURN<group>{

FOR $y IN /project/memberRETURN<name>{

FOR $z IN $y[text()=“Alice”]RETURN <Alice/>

FOR $z IN $y[text()=“Bob”]RETURN <Bob/>

}</name>}</group>

project project

member

^ ^

member

Alice Bob

12

Q2(D):Q1(D):

X

group group

name name

Alice Bob

group

name name

^ ^

Alice Bob

^ ^

Q2(D):Q1(D):group group

name name

Alice Bob

group

name name

^ ^

Alice Bob

^ ^

Q1(D) Q2(D)

Q2(D) Q1(D)

Example -- Tree Homomorphism andXML Instance Containment

Complexity for XPaths

[Florescu et al., 1998]

[Deutsch and Tannen,

2001]

[Miklao and Suciu,2002]

[Milo and Suciu, 1999]

[Amer-Yahia et al.,2001]

Citation

NPCWith //, [], Equi-join on tags

PSPACE-hardRegular Expression

∏2P

-complete

With //, *, [], Equi-join ontags, disjunction

CO-NPCWith //, *, []

PTIMEWith //, []

PTIMEWith *, [] (branch)

PTIMEWith //, *

ComplexityRestriction

Complexity for Nested XPaths

CO-NPC

∏2P

-complete

PSPACE-hard? (not

sure)

∏2P

-complete

CO-NPC

Nested-Query

Complexity (withfixed depth)

NPCWith //, [], Equi-join on

tags

PSPACE-hardRegular Expression

∏2P

-complete

With //, *, [], Equi-joinon tags, disjunction

CO-NPCWith //, *, []

PTIMEWith //, []

PTIMEWith *, [] (branch)

PTIMEWith //, *

XPATH-ComplexityRestriction

Containment Algorithm(1):Query Head Embedding

<student>

<name> {$student}

<advisor> {$name}

<result>

position

<people>

dbpeople

faculty

name

$pos=‘professor’

student

advisee$advisee = $student

Q1:

<result>

<name> {$student}

<advisor> {$name}

<people>

db

people

faculty

name<faculty> {$name}

<student>student

$advisee = $student

facultyname advisee

:Q2Query head embedding:

Query Head

Containment Algorithm(2):Query Body Embedding

<student>

<name> {$student}

<advisor> {$name}

<result>

position

<people>

dbpeople

faculty

name

$pos=‘professor’

student

<result>

<name> {$student}

<advisor> {$name}

<people>

db

people

advisee$advisee = $student

<faculty>

<student>student

$advisee = $student

facultyname advisee

Q1: :Q2Query bodyembedding:

XML Summary

• Great for data integration:– People could envision sharing data

• Great for research:– XML most popular topic at conferences for

a few years

• Great for industry:– Many services built on XML

• Let’s declare victory and move on..

13

Model Management[Bernstein et al.]

• Generic infrastructure for managingschemas and mappings:– Manipulate models and mappings as bulk

objects– Operators to create & compose mappings,

merge & diff models– Short operator scripts can solve schema

integration, schema evolution, reverseengineering, etc.

• First challenge: semantics of operators.

Documents

Meta-Data for Optimization Query Optimization Stepspeople.cs.aau.dk/~csj/DataInt/queryProcessing.pdfFROM Purchase P, Person Q WHERE P.buyer=Q.name AND Q.city=‘seattle’ AND Q.phone>‘543