XML Stream Processing
Dan Suciu
www.cs.washington.edu/homes/suciu
Joint work with faculty, visitors and students at UW
Introduction
• This is a research project at UW
• Partially supported by MS
• Two parts:
– A free toolkit of command-line tools: xsort, xagg, ... (www.cs.washington.edu/homes/suciu/XMLTK)
– Research on XML stream processing (this talk)
The Problem
• Given:
– Large number of XPath expressions
– Incoming stream of XML documents
• Decide for each document which expressions it matches
Example XPath expressions:
/datasets/dataset
/datasets/dataset[history/text()="recent"]/title
/datasets/dataset//tableHead//*
/datasets/dataset//tableHead//*/text()="Galaxy"
/datasets/dataset/history
/datasets/dataset/tableHead
/datasets/dataset/tableHead/field
[Figure: the XPath expressions and the XML data stream (<datasets><dataset> ... </datasets>) are the inputs; the output is a stream of decisions.]
The Application(s)
• Selective Dissemination of Information [Berkeley]
• XML content routing [MIT]
• SOAP message routing in application servers
• Typical scale:
– 10,000 to 1,000,000 XPath expressions
– XML stream: 1KB/s ? 1MB/s ?
The Approaches
• Basic techniques
– NFA plus optimizations: XFilter/YFilter, XTrie
– DFA: we are doing this here
• Beyond the obvious
– SIX
– views
Background on NFA and DFA
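//a/b/a/a/b
[Figure: NFA and DFA for this expression. The NFA has states 0 to 5, with a * self-loop on state 0 and transitions a, b, a, a, b between consecutive states. The DFA has states 0, 01, 02, 013, 014, 025, with transitions on a, b and [other]. The match is labeled $X.]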
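The step from the NFA to the DFA on this slide can be reproduced with the textbook subset construction. The following is a minimal sketch, not the project's code: the function names and the three-letter alphabet (a, b, and a catch-all "other") are assumptions made for illustration.

```python
from collections import deque


def nfa_move(steps, states, label):
    """Advance a set of NFA states of //steps[0]/steps[1]/... on one label."""
    nxt = set()
    for s in states:
        if s == 0:
            nxt.add(0)  # state 0 has a * self-loop (the leading //)
        if s < len(steps) and steps[s] in (label, "*"):
            nxt.add(s + 1)  # state s --steps[s]--> state s+1
    return frozenset(nxt)


def subset_construction(steps, alphabet):
    """Return all reachable DFA states and the DFA transition table."""
    start = frozenset({0})
    seen, queue, delta = {start}, deque([start]), {}
    while queue:
        cur = queue.popleft()
        for label in alphabet:
            nxt = nfa_move(steps, cur, label)
            delta[(cur, label)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen, delta


# //a/b/a/a/b, with "other" standing for any label that is neither a nor b
states, delta = subset_construction(["a", "b", "a", "a", "b"], ["a", "b", "other"])
names = sorted("".join(str(s) for s in sorted(state)) for state in states)
print(names)
```

Each DFA state is named by the NFA states it contains, so the printed names can be compared directly with the state labels in the figure.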
Background on NFA and DFA
//a/*/*/*/b
[Figure: NFA and DFA (without back edges) for this expression. The NFA has states 0 to 5, with a * self-loop on state 0 and transitions a, *, *, *, b. The DFA fans out level by level: 0; 01; 012, 02; 0123, 023, 013, 03; 01234, 0234, 0134, 034, ...; on b these lead to states such as 02345, 0345, 0245, 045, ..., each labeled $X.]
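The fan-out in this figure can be checked by counting reachable DFA states as the number of wildcards grows. This is a rough sketch under the same assumptions as before (hypothetical function name, alphabet of a, b and a catch-all "other"), not the authors' implementation.

```python
def count_dfa_states(steps, alphabet):
    """Count DFA states reachable by subset construction for //steps[0]/steps[1]/..."""

    def move(states, label):
        nxt = {0}  # state 0 is always active because of its * self-loop
        for s in states:
            if s < len(steps) and steps[s] in (label, "*"):
                nxt.add(s + 1)
        return frozenset(nxt)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        cur = todo.pop()
        for label in alphabet:
            nxt = move(cur, label)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)


# //a/b, //a/*/b, //a/*/*/b, //a/*/*/*/b, //a/*/*/*/*/b
counts = [
    count_dfa_states(["a"] + ["*"] * k + ["b"], ["a", "b", "other"])
    for k in range(5)
]
print(counts)
```

In this model every additional * doubles the number of states, because the DFA has to remember for each of the last few positions whether an a occurred there.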
Background on NFA and DFA
• Issue: need to linearize XPath expressions
• 1 XPath expression with filters:
/catalog/product[@category="tools"][sales/@price > 200]/quantity
• 4 linear XPath expressions:
– /catalog/product (bound to $Y)
– $Y/@category = "tools"
– $Y/sales/@price
– $Y/quantity
• Extra processing: OK in trivial cases. Complex cases require more work (future)
• For now: assume all XPath expressions are linear
Basic NFA Evaluation
[Figure: the XPath expressions are compiled into NFAs. SAX events from the XML stream (<datasets><dataset> ... </datasets>) drive the NFAs. Each STACK entry is a set of NFA states (1,55,99,...; 2,3,543,43,254; 3,66,102,4534,...), one of which is marked as the current state.]
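A minimal sketch of the evaluation in this figure, assuming linear XPath expressions made only of /, // and * steps. The helper names and the hand-written event list are illustrative assumptions, and /datasets/dataset/title is a made-up expression added to show a non-match. On every start-element event the current set of NFA states is pushed and every active state is advanced; on every end-element event the previous set is popped.

```python
def parse_xpath(expr):
    """Split a linear XPath such as '/a//b/*' into (axis, label) steps."""
    steps, i = [], 0
    while i < len(expr):
        axis = "//" if expr.startswith("//", i) else "/"
        i += len(axis)
        j = expr.find("/", i)
        j = len(expr) if j == -1 else j
        steps.append((axis, expr[i:j]))
        i = j
    return steps


def nfa_step(all_steps, active, label):
    """Advance every active NFA state (expression index, position) on one label."""
    nxt = set()
    for e, p in active:
        steps = all_steps[e]
        if p == len(steps):
            continue  # accepting state: no outgoing transitions
        axis, want = steps[p]
        if axis == "//":
            nxt.add((e, p))  # descendant axis: stay active at deeper levels
        if want in (label, "*"):
            nxt.add((e, p + 1))
    return nxt


def run_nfa(exprs, events):
    """Return the expressions matched by a stream of (kind, label) events."""
    all_steps = [parse_xpath(e) for e in exprs]
    current = {(e, 0) for e in range(len(exprs))}
    stack, matched = [], set()
    for kind, label in events:
        if kind == "start":
            stack.append(current)  # one set of NFA states per open element
            current = nfa_step(all_steps, current, label)
            matched |= {exprs[e] for e, p in current if p == len(all_steps[e])}
        else:
            current = stack.pop()
    return matched


exprs = [
    "/datasets/dataset",
    "/datasets/dataset//tableHead//*",
    "/datasets/dataset/history",
    "/datasets/dataset/tableHead/field",
    "/datasets/dataset/title",  # hypothetical expression that does not match
]
events = [
    ("start", "datasets"), ("start", "dataset"),
    ("start", "history"), ("end", "history"),
    ("start", "tableHead"), ("start", "field"), ("end", "field"),
    ("end", "tableHead"), ("end", "dataset"), ("end", "datasets"),
]
matched = run_nfa(exprs, events)
print(sorted(matched))
```

The cost per event grows with the number of active NFA states, which is what the DFA approach on the next slide avoids.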
Basic DFA Evaluation
[Figure: the XPath expressions are compiled into DFAs. SAX events from the XML stream (<datasets><dataset> ... </datasets>) drive the DFA. Each STACK entry is a single DFA state (1; 552; 399), one of which is marked as the current state.]
Comparison: Throughput in MB/s
[Figure: throughput for 1k, 10k, 100k, 1000k XPEs, with prob(*)=10% and prob(//)=10%. x-axis: total input size, 5MB to 25MB. y-axis: throughput, log scale from 0.0001 to 100. Series: parser; lazyDFA (1k, 10k, 100k, 1000k); xfilter (1k, 10k, 100k, 1000k).]
Number of States in DFA
Compute the DFA for 1,000,000 XPath expressions ???!!?
• 1 linear XPath: small DFA
• 1,000,000 linear XPaths: HUGE DFA
Number of States in DFA
//section//footnote
//figure//footnote
//table//footnote
. . . .
//abstract//footnote
n XPath expressions: 2^n states
Solution: lazy DFA !
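One way to read "lazy" is that DFA states and transitions are built only when the XML stream actually reaches them, and are then cached. The sketch below illustrates that idea with Python's xml.sax; the class name, the parser helper and the tiny document are assumptions for illustration, not the project's implementation.

```python
import xml.sax


def parse_xpath(expr):
    """Split a linear XPath such as '//a//b' into (axis, label) steps."""
    steps, i = [], 0
    while i < len(expr):
        axis = "//" if expr.startswith("//", i) else "/"
        i += len(axis)
        j = expr.find("/", i)
        j = len(expr) if j == -1 else j
        steps.append((axis, expr[i:j]))
        i = j
    return steps


class LazyDFA(xml.sax.ContentHandler):
    """DFA over sets of NFA states, materialized on demand while parsing."""

    def __init__(self, exprs):
        super().__init__()
        self.exprs = [parse_xpath(e) for e in exprs]
        self.start = frozenset((e, 0) for e in range(len(self.exprs)))
        self.delta = {}  # (DFA state, label) -> DFA state, filled lazily
        self.states = {self.start}
        self.stack, self.current = [], self.start
        self.matched = set()

    def _move(self, state, label):
        key = (state, label)
        if key not in self.delta:  # first visit: compute and cache
            nxt = set()
            for e, p in state:
                steps = self.exprs[e]
                if p == len(steps):
                    continue
                axis, want = steps[p]
                if axis == "//":
                    nxt.add((e, p))
                if want in (label, "*"):
                    nxt.add((e, p + 1))
            self.delta[key] = frozenset(nxt)
            self.states.add(self.delta[key])
        return self.delta[key]

    def startElement(self, name, attrs):
        self.stack.append(self.current)
        self.current = self._move(self.current, name)
        self.matched |= {e for e, p in self.current if p == len(self.exprs[e])}

    def endElement(self, name):
        self.current = self.stack.pop()


exprs = ["//%s//footnote" % t for t in ["section", "figure", "table", "abstract"]]
doc = b"<doc><section><footnote/></section><table><p/></table></doc>"
h = LazyDFA(exprs)
xml.sax.parseString(doc, h)
print(sorted(exprs[e] for e in h.matched), len(h.states))
```

For this document only the states that its paths actually visit are created, however many states the full DFA for these expressions would have.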
Number of States in the lazy DFA
• Non-recursive or data-style recursive DTDs:
– Synthetic XML data: Theorem: DFA is small
– Real XML data: Theorem: DFA is small
• Document-style recursive DTD:
– Synthetic XML data: DFA is HUGE
– Real XML data: Theorem: DFA is small
[Figure: Number of DFA States - SYNTHETIC Data. Datasets: simple, prov, ebBPSS, protein, nasa, treebank. y-axis: log scale from 1 to 100000. Series: 1k XPEs, 10k XPEs, 100k XPEs.]
[Figure: Number of DFA States - REAL Data. Datasets: protein, nasa, treebank. y-axis: log scale from 1 to 100000. Series: 1k XPEs, 10k XPEs, 100k XPEs.]
Beyond the Obvious I: Stream IndeX (SIX)
Main observation:
• Parsing is the major bottleneck
• Skip portions of the XML document to avoid parsing and processing them
Stream IndeX (SIX)
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author>
      <first-name> Rick </first-name>
      <last-name> Hull </last-name>
    </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book price="55">
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>
[Figure: the XML document side by side with its SIX, a list of (beginOffset, endOffset) pairs for the elements bib, book, publisher, author, author, ...]
Stream IndeX (SIX)
• API for SIX:
– skip(k), where k >= 0
– skips to the end of the k’th surrounding element
– Uses beginOffset to sync with the XML doc
– Uses endOffset to skip
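A toy model of how such an API could behave, to make the roles of the two offsets concrete. Everything here is an assumption for illustration: the class and method names, the made-up offset values, and the reading that k = 0 means the element currently being processed.

```python
class SixReader:
    """Toy stream index: (begin_offset, end_offset) pairs, one per element."""

    def __init__(self, entries):
        self.entries = sorted(entries)
        self.next = 0  # position of the next unread index entry
        self.open_ends = []  # end offsets of the open elements, innermost last

    def enter(self, begin_offset):
        """Sync with the XML stream when the parser opens an element."""
        # Entries of elements that were skipped over are passed here.
        while self.entries[self.next][0] < begin_offset:
            self.next += 1
        begin, end = self.entries[self.next]
        if begin != begin_offset:
            raise ValueError("index and XML stream are out of sync")
        self.open_ends.append(end)
        self.next += 1

    def leave(self):
        """The parser reached the end tag of the innermost open element."""
        self.open_ends.pop()

    def skip(self, k):
        """Return the offset at the end of the k'th surrounding element."""
        for _ in range(k):
            self.open_ends.pop()  # close the k inner elements
        return self.open_ends.pop()


# Hypothetical offsets: bib, first book, author, title, second book
six = SixReader([(0, 100), (5, 60), (11, 30), (30, 55), (60, 95)])
six.enter(0)   # <bib>
six.enter(5)   # first <book>
six.enter(11)  # <author>
target = six.skip(1)  # give up on the whole first book
print(target)
```

The caller would then move the parser to the returned offset, so the bytes in between are never parsed.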
Stream IndeX (SIX)
[Figure: the XML stream is a sequence of XML documents (<datasets><dataset> ... </datasets>), each accompanied by its own SIX, a list of offsets.]
The SIX stream is about 6% of the data stream, and can be made MUCH smaller
Throughput improvements from SIX (stable)
[Figure: x-axis: XML stream (MB), 55 to 105. y-axis: MB/s, 0 to 35. Series: Theta=3%, Theta=8%, Theta=14%, each with and without SIX.]
Beyond the Obvious II: View Selections
• On-going work: View selections
[Figure: the XML stream is a sequence of XML documents (<datasets><dataset> ... </datasets>), each preceded by a header, a short list of numbers.]
100x speedup on a hit
Conclusions
• Two ideas:
– Computing the DFA is possible!
– Use extra info to further speed up: SIX, headers
• Issues:
– Extend DFAs to filters: process events
– How to represent SIX or headers in XML
• Msdn.microsoft.com/webservices
• Contact: [email protected]