Buffering in Query Evaluation over XML Streams
Ziv Bar-Yossef (Technion)
Marcus Fontoura and Vanja Josifovski (IBM Almaden Research Center)
2
XML Document

     1: <paper>
     2:   <section id="1">
     3:     <title>
     4:       Intro
     5:     </title>
     6:     <content>
     7:       bla bla bla
     8:     </content>
     9:   </section>
    10:   <section id="2">
    11:     <title>
    12:       Results
    13:     </title>
    14:     <content>
    15:       yada yada yada
    16:     </content>
    17:   </section>
    18:   <section id="3">
    19:     <title>
    20:       Conclusions
    21:     </title>
    22:     <content>
    23:       etc etc etc
    24:     </content>
    25:   </section>
    26:   <title>
    27:     On the Complexity of Database Queries
    28:   </title>
    29:   <author>
    30:     Papadimitriou
    31:   </author>
    32:   <author>
    33:     Yannakakis
    34:   </author>
    35: </paper>
3
XML Document Tree

    root
      paper
        section
          id: 1
          title: Intro
          content: bla bla bla
        section
          id: 2
          title: Results
          content: yada yada yada
        section
          id: 3
          title: Conclusions
          content: etc etc etc
        title: On the Complexity of Database Queries
        author: Papadimitriou
        author: Yannakakis
4
XPath Queries
[Figure: the XML document tree, shown with the query below]
/paper[author="Papadimitriou"]/section[@id = "2" or title = "Intro"]/content
5
XPath Queries
[Figure: the XML document tree, shown with the query below]
/paper[title != section/title]/author
6
- XPath query = path pattern + predicates
- XPath 2.0, forward axes only
- Eval(Q,D): the nodes in D that match Q
- Two modes of XPath evaluation:
  - Full-fledged evaluation: given Q and D, output Eval(Q,D)
  - Filtering: given Q and D, determine whether Eval(Q,D) is nonempty
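The difference between the two modes can be illustrated with a minimal Python sketch. It hand-codes a single query, /paper[author="Papadimitriou"]/section[@id = "2" or title = "Intro"]/content, over an in-memory tree rather than a stream. The names `eval_query`, `filter_query`, and `DOC` are assumptions made for this illustration, and text is whitespace-stripped before comparison, which simplifies XPath string comparison.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, modeled on the running example.
DOC = """<paper>
  <section id="1"><title>Intro</title><content>bla bla bla</content></section>
  <section id="2"><title>Results</title><content>yada yada yada</content></section>
  <section id="3"><title>Conclusions</title><content>etc etc etc</content></section>
  <title>On the Complexity of Database Queries</title>
  <author>Papadimitriou</author>
  <author>Yannakakis</author>
</paper>"""


def _text(node):
    """Whitespace-stripped text of a node (a simplification)."""
    return (node.text or "").strip()


def eval_query(doc):
    """Full-fledged evaluation: return the matching <content> nodes."""
    paper = ET.fromstring(doc)
    if paper.tag != "paper":
        return []
    # Predicate on paper: some author child equals "Papadimitriou".
    if not any(_text(a) == "Papadimitriou" for a in paper.findall("author")):
        return []
    result = []
    for section in paper.findall("section"):
        # Predicate on section: @id = "2" or some title child equals "Intro".
        id_ok = section.get("id") == "2"
        title_ok = any(_text(t) == "Intro" for t in section.findall("title"))
        if id_ok or title_ok:
            result.extend(section.findall("content"))
    return result


def filter_query(doc):
    """Filtering: only report whether Eval(Q, D) is nonempty."""
    return bool(eval_query(doc))
```

Here filtering is derived from full evaluation for brevity; a dedicated filter would stop at the first match and never needs to retain the matching nodes.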
7
XML Streams
- XML stream: a sequence of SAX events
  - startDocument(), endDocument(), startElement(name), endElement(name), text(str)
- Why XML streams?
  - For transferring XML between systems
  - For efficient access to large XML documents
- Critical resources
  - Memory
  - Processing time
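To make the event model concrete, the following sketch uses Python's standard `xml.sax` module to turn a document into a list of events named as on the slide. The `EventRecorder` class and `sax_events` function are hypothetical names; Python's SAX API calls the text event `characters`, and the sketch joins text chunks and drops whitespace-only text so that one text node yields one `text(...)` event.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Records SAX events as strings, using the slide's event names."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._chunks = []  # text may arrive in several pieces

    def _flush_text(self):
        text = "".join(self._chunks).strip()
        self._chunks = []
        if text:
            self.events.append(f"text({text})")

    def startDocument(self):
        self.events.append("startDocument()")

    def endDocument(self):
        self.events.append("endDocument()")

    def startElement(self, name, attrs):
        self._flush_text()
        self.events.append(f"startElement({name})")

    def endElement(self, name):
        self._flush_text()
        self.events.append(f"endElement({name})")

    def characters(self, content):
        self._chunks.append(content)


def sax_events(xml_text):
    """Parse xml_text and return the recorded event sequence."""
    recorder = EventRecorder()
    xml.sax.parseString(xml_text.encode("utf-8"), recorder)
    return recorder.events
```

For example, `sax_events("<title>Intro</title>")` yields startDocument(), startElement(title), text(Intro), endElement(title), endDocument(), in that order.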
8
Streaming XML Algorithms
- XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
- X-scan [Ives, Levy, and Weld 00]
- XMLTK [Avila-Campillo et al 02]
- XTrie [Chan et al 02]
- SPEX [Olteanu, Kiesling, and Bry 03]
- Lazy DFAs [Green et al 03]
- The XPush Machine [Gupta and Suciu 03]
- XSQ [Peng and Chawathe 03]
- FluX [Koch et al 04]
- TurboXPath [Josifovski, Fontoura, and Barta 05]
- …
All of them use large amounts of memory on certain queries and documents
9
Memory Bottleneck I: Storage of Large Transition Tables
- Framework of most algorithms: Q → NFA → simulate the NFA by a DFA
- Caveat: exponential blowup
- However: exponential blowup is not necessary
  - [Bar-Yossef, Fontoura, Josifovski 04]: an algorithm for filtering XML streams whose space is linear in the query size
10
Memory Bottleneck II: Buffering of Document Fragments
- Scenario 1: buffering nodes, which may or may not be part of the output.
[Figure: the XML document tree, shown with the query below]
/paper[author="Papadimitriou"]/section[@id = "2" or title = "Intro"]/content
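A minimal sketch of this scenario, hand-coded for the query above and not taken from any of the cited systems: the evaluator has to hold `content` nodes until it learns whether a matching `author` arrives. The names `BufferingEvaluator`, `evaluate`, and `DOC` are assumptions for illustration. The sketch also reports the peak number of buffered nodes.

```python
import xml.sax

# Hypothetical sample document, modeled on the running example.
DOC = """<paper>
  <section id="1"><title>Intro</title><content>bla bla bla</content></section>
  <section id="2"><title>Results</title><content>yada yada yada</content></section>
  <section id="3"><title>Conclusions</title><content>etc etc etc</content></section>
  <title>On the Complexity of Database Queries</title>
  <author>Papadimitriou</author>
  <author>Yannakakis</author>
</paper>"""


class BufferingEvaluator(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.path = []          # names of the currently open elements
        self.text = []          # text chunks of the current element
        self.section_ok = False # section predicate known to be true
        self.section_buf = []   # content of the section that is still open
        self.pending = []       # content of qualifying sections, awaiting author
        self.author_ok = False  # paper predicate known to be true
        self.output = []
        self.peak = 0           # largest number of nodes buffered at once

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []
        if self.path == ["paper", "section"]:
            self.section_ok = attrs.get("id") == "2"
            self.section_buf = []

    def characters(self, data):
        self.text.append(data)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if self.path == ["paper", "section", "title"] and value == "Intro":
            self.section_ok = True
        elif self.path == ["paper", "section", "content"]:
            self.section_buf.append(value)
        elif self.path == ["paper", "author"] and value == "Papadimitriou":
            self.author_ok = True
        elif self.path == ["paper", "section"]:
            # Section predicate is now final: keep or discard its content.
            if self.section_ok:
                self.pending.extend(self.section_buf)
            self.section_buf = []
        if self.author_ok:
            # Paper predicate resolved: pending nodes become output.
            self.output.extend(self.pending)
            self.pending = []
        self.peak = max(self.peak, len(self.pending) + len(self.section_buf))
        self.path.pop()
        self.text = []


def evaluate(doc):
    """Return (output texts, peak number of buffered content nodes)."""
    handler = BufferingEvaluator()
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return handler.output, handler.peak
```

On `DOC`, `evaluate` returns the two qualifying content texts and a peak of 3: the first two `content` nodes wait for the `author` element at the end of the stream, and the third is held until its section closes, because a later `title` child could still satisfy the section predicate.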
11
Memory Bottleneck II: Buffering of Document Fragments
- Scenario 2: buffering nodes needed for evaluating pending predicates.
[Figure: the XML document tree, shown with the query below]
/paper[title != section/title]/author
12
Memory Bottleneck II: Buffering of Document Fragments
- Scenario 3: buffering multiple candidate matches that are nested within each other.
[Figure: a recursive document tree in which a elements are nested inside other a elements, with b and c children]
//a[b and c]
- Relevant only when the document is “recursive”
- Space required: $\Omega(\text{doc-recursion-depth})$ [Bar-Yossef, Fontoura, Josifovski 04]
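The dependence on recursion depth can be seen in a small sketch for //a[b and c]. Every open `a` element is a candidate whose predicate is still being collected, so the evaluator keeps one stack frame per open element. The names `NestedCandidates` and `match_nested` and the test document are assumptions for illustration; this is a plain stack-based matcher, not the algorithm of the cited paper.

```python
import xml.sax


class NestedCandidates(xml.sax.ContentHandler):
    """Counts matches of //a[b and c] and tracks how many a's are open at once."""

    def __init__(self):
        super().__init__()
        self.stack = []      # one (name, seen b/c children) frame per open element
        self.matches = 0
        self.max_open_a = 0  # deepest nesting of a elements seen so far

    def startElement(self, name, attrs):
        # Record b and c children on the parent's frame.
        if self.stack and name in ("b", "c"):
            self.stack[-1][1].add(name)
        self.stack.append((name, set()))
        open_a = sum(1 for tag, _ in self.stack if tag == "a")
        self.max_open_a = max(self.max_open_a, open_a)

    def endElement(self, name):
        tag, seen = self.stack.pop()
        if tag == "a" and seen == {"b", "c"}:
            self.matches += 1


def match_nested(doc):
    """Return (number of matching a nodes, maximum number of simultaneously open a's)."""
    handler = NestedCandidates()
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return handler.matches, handler.max_open_a
```

For the assumed document `<a><c/><a><b/><a><c/><b/></a></a></a>`, three `a` candidates are open at the same time and only the innermost one has both a `b` and a `c` child.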
13
Our Results
- Quantitative space lower bounds for:
  - Full-fledged evaluation of queries with predicates (Scenario 1)
  - Filtering/full-fledged evaluation of queries with “multi-variate” predicates (Scenario 2)
- Matching upper bound
  - Eager evaluation of predicates
- In all other scenarios: no buffering required
  - Filtering of queries with “univariate” predicates over non-recursive documents is possible without buffering [Bar-Yossef, Fontoura, Josifovski 04]
14
Related Work
- Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]
- Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]
- Space complexity of select-project-join queries over relational data streams [Arasu et al 02]
15
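Document Concurrency
- Q: query
- $D = \sigma_1, \ldots, \sigma_n$: document
  - Each $\sigma_i$ is a SAX event
  - $\alpha_t = (\sigma_1, \ldots, \sigma_t)$: the prefix of D up to step t
- Definition: a node $x \in D$ is alive at step t if $x \in \alpha_t$ and there exist suffixes $\beta, \beta'$ s.t.
  - $x \in \mathrm{Eval}(Q, \alpha_t \beta)$
  - $x \notin \mathrm{Eval}(Q, \alpha_t \beta')$
- t-concurrency(D,Q): number of nodes that are alive at step t
- concurrency(D,Q): $\max_t$ t-concurrency(D,Q)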
16
Concurrency: Example

     1: <paper>
     2:   <section id="1">
     3:     <title>
     4:       Intro
     5:     </title>
     6:     <content>
     7:       bla bla bla
     8:     </content>
     9:   </section>
    10:   <section id="2">
    11:     <title>
    12:       Results
    13:     </title>
    14:     <content>
    15:       yada yada yada
    16:     </content>
    17:   </section>
    18:   <section id="3">
    19:     <title>
    20:       Conclusions
    21:     </title>
    22:     <content>
    23:       etc etc etc
    24:     </content>
    25:   </section>
    26:   <title>
    27:     On the Complexity of Database Queries
    28:   </title>
    29:   <author>
    30:     Papadimitriou
    31:   </author>
    32:   <author>
    33:     Yannakakis
    34:   </author>
    35: </paper>

Annotations on the three content nodes, in document order: alive, alive, dead
/paper[author="Papadimitriou"]/section[@id = "2" or title = "Intro"]/content
17
Lower Bound Notions
- A “normal” lower bound: for every algorithm A, there exist Q and D s.t. A uses $\Omega(\text{concurrency}(D,Q))$ bits of space on Q and D.
  - Q and D may be “pathological”
  - Doesn’t say much about real-world queries/documents
- An “ideal” lower bound: for every A, every Q, and every D, A uses $\Omega(\text{concurrency}(D,Q))$ bits of space on Q and D.
  - Too good to be true
  - A can have D and Q “hard-coded”, and then know the result a priori
  - Space of A on D and Q = minimum description length of Q and D
18
Our Lower Bound
- Theorem: for every A, every Q, and every D, there exists an almost isomorphic document D’ s.t. A uses $\Omega(\text{concurrency}(D,Q))$ bits of space on Q and D’.
  - D’ is the same as D, except for a few extra empty nodes with auxiliary names.
- The theorem holds only if:
  - Q is “star-free”
  - D is non-recursive
19
Why isn’t this Obvious?
- Reason 1: we want the theorem to work for every Q and D, not only ones with high MDL (minimum description length).
- Reason 2:
  - Obvious: if x is alive at step t, then A has to remember x
    - Because: A may or may not need to output x
  - Not obvious: if x and y are alive at step t, then A has to remember both
    - If x and y are not “independent”, maybe it’s enough to remember just x (or just y)
20
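Proof of Lower Bound
- $C$ = t-concurrency(D,Q)
- $x_1, \ldots, x_C$ = nodes that are alive at step t
- Recall: for every $x_i$ there exist suffixes $\beta_i$ and $\beta'_i$ s.t. (with $\alpha_t$ the prefix of D up to step t)
  - $x_i \in \mathrm{Eval}(Q, \alpha_t \beta_i)$
  - $x_i \notin \mathrm{Eval}(Q, \alpha_t \beta'_i)$
- Lemma: there exist a single $\beta$ and a single $\beta'$ s.t. for all $i$,
  - $x_i \in \mathrm{Eval}(Q, \alpha_t \beta)$
  - $x_i \notin \mathrm{Eval}(Q, \alpha_t \beta')$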
21
Proof of Lower Bound (cont.)
- For every $S \subseteq \{1, \ldots, C\}$ define a document $D_S$:
  - $D_S$ is the same as D, except that for every $i \in S$, we “mark” $x_i$
  - Marking: an extra empty child with an auxiliary name
  - Note: $D_S$ is almost-isomorphic to D
- A = any algorithm
- Note: from the output of A on $D_S$, one can “reconstruct” the set S.
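The marking construction can be sketched in a few lines of Python. The auxiliary element name `aux`, the helper names, and the trimmed sample document are all assumptions for illustration. The candidate nodes are the two `content` nodes, both of which the query /paper[author="Papadimitriou"]/section[@id = "2" or title = "Intro"]/content outputs on this document, so a direct lookup stands in for running an actual algorithm A.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical document in which both content nodes are in the query output.
DOC = (
    "<paper>"
    "<section id='1'><title>Intro</title><content>bla bla bla</content></section>"
    "<section id='2'><title>Results</title><content>yada yada yada</content></section>"
    "<author>Papadimitriou</author>"
    "</paper>"
)


def build_marked(doc, marked):
    """Build D_S: the i-th candidate, for i in marked, gets an empty <aux/> child."""
    root = ET.fromstring(doc)
    candidates = root.findall("section/content")
    for i in marked:
        ET.SubElement(candidates[i], "aux")  # "aux" is an assumed auxiliary name
    return root


def reconstruct(output_nodes):
    """Recover S from the output: the positions of nodes that carry the mark."""
    return {i for i, node in enumerate(output_nodes) if node.find("aux") is not None}


def round_trip(doc, marked):
    root = build_marked(doc, marked)
    output = root.findall("section/content")  # stands in for the output of A on D_S
    return reconstruct(output)


def all_subsets_recoverable(doc, count):
    """Check that every subset S of {0, ..., count-1} survives the round trip."""
    for size in range(count + 1):
        for subset in itertools.combinations(range(count), size):
            if round_trip(doc, set(subset)) != set(subset):
                return False
    return True
```

Because every one of the $2^C$ subsets is recoverable from the output, the output (and hence whatever the algorithm remembers before the suffix arrives) must distinguish all of them.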
22
Proof of Lower Bound (cont.)
- Consider the state of A at step t when running on $D_S$
- If suffix = $\beta'$, none of the $x_i$’s should be output
  - So A could not have output any $x_i$ by step t
- If suffix = $\beta$, there is no information in the suffix about S, but S can be reconstructed from the output
  - So the state of A at step t must have all the information about S
- Conclusion: space ≥ $\Omega(C)$
- Actual proof: by one-way communication complexity
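An informal counting version of this conclusion, assuming a deterministic algorithm A and assuming that two different sets S must leave A in different states at step t (since the same suffix follows and the outputs differ), can be written as below. This is only a pigeonhole sketch; the actual proof, as noted, goes through one-way communication complexity.

```latex
\[
  \#\{\text{states of } A \text{ at step } t\}
  \;\ge\; \#\{\, S : S \subseteq \{1,\ldots,C\} \,\} \;=\; 2^{C}
  \quad\Longrightarrow\quad
  \text{space} \;\ge\; \log_2 2^{C} \;=\; C \text{ bits}.
\]
```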
23
Conclusions
- Our contributions:
  - Quantitative space lower bounds
    - Full-fledged evaluation of queries with predicates
    - Filtering/full-fledged evaluation of queries with “multi-variate” predicates
  - Matching upper bound
- Open problems:
  - Quantitative lower bounds for XQuery evaluation over streams
  - Address larger fragments of XPath