Streaming XPath / XQuery Evaluation and Course Wrap-Up Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems December 2, 2008


Streaming XPath / XQuery Evaluation and Course Wrap-Up

Zachary G. Ives
University of Pennsylvania

CIS 650 – Implementing Data Management Systems

December 2, 2008


Administrivia

Recall that the final project is due – with a write-up and a 10-minute demo presentation – on Tuesday 12/16, 9-11AM

Also: course evaluations (at end)


XML: Its Roles

Perhaps used as a superset of HTML for documents, but…

Most successful as a transport format for sending data between systems
  * SOAP, WSDL, etc.
  * Data interchange formats like ebXML, MAGE-ML, …

So why would we want to store it in a database to query it, when we could query over XML as it streams across the network?
  * (Note: not infinite streams, as in DSMSs, and it’s hierarchical)


Streaming XPaths and XQueries

Suppose I give an XPath expression (which is essentially a restricted form of regular path expression)
  * Can I match it against the parse tree of the data?

An XQuery takes multiple XPaths in the FOR clause, and iterates over the elements of each XPath (binding the variable to each)
  * FOR $i in doc("abc")/xyz, $j in $i/def

We can think of an XQuery as doing tree matching, which returns tuples ($i, $j) for each tree matching $i and $j
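
As a rough illustration of this tuple view, the nested FOR bindings can be mimicked with Python's standard `xml.etree.ElementTree`. The sample document and the helper name `bind_tuples` are assumptions made up for this sketch, and it materializes the whole tree rather than streaming it:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document standing in for doc("abc").
SAMPLE = "<xyz><def>1</def><def>2</def></xyz>"

def bind_tuples(xml_text):
    """Yield ($i, $j) pairs for: FOR $i in doc("abc")/xyz, $j in $i/def."""
    root = ET.fromstring(xml_text)
    # doc("abc")/xyz matches only a root element named xyz.
    xyz_nodes = [root] if root.tag == "xyz" else []
    for i in xyz_nodes:
        # $i/def: direct children of $i named def.
        for j in i.findall("def"):
            yield (i, j)

tuples = list(bind_tuples(SAMPLE))
for i, j in tuples:
    print(i.tag, j.text)
```

Each yielded pair plays the role of one ($i, $j) tuple that a downstream relational-style operator could consume.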


Where This Leads

An XQuery can be broken into two operations:

A parsing / tree matching stage (FOR and also LET)
  * Finds matches to the variables
  * Returns a tuple of trees

A (mostly) pipelined SPJ / union / group by / order by engine – (WHERE, ORDER BY, nesting in RETURN)
  * Like a regular relational engine extended with XML tree datatype!

The first engine to put these things together: Tukwila (Ives+ 2000, 2002)

IBM DB2 was built upon a nearly identical model – TurboXPath (Josifovski 2004)


The Key: SAX (Simple API for XML)

If we are to match XPaths in streaming fashion, we need a stream of data items

The original parser model: DOM (Document Object Model)
  * Builds an entire object hierarchy in memory, which is traversable
  * Not incremental! (Until later versions)

SAX: a series of event notifications
  * open-tag, close-tag, character data
  * Idea: build a state machine (or similar mechanism) to match on the events!
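
The state-machine idea can be sketched with Python's standard `xml.sax` module. This is a minimal illustration for child-axis-only paths, not the TurboXPath or x-scan algorithm; the class name `PathMatcher`, the helper `match_path`, and the sample document are all assumptions made up for the example:

```python
import xml.sax

class PathMatcher(xml.sax.ContentHandler):
    """Match a child-axis-only path such as ['xyz', 'def'] over SAX events."""

    def __init__(self, steps):
        super().__init__()
        self.steps = steps
        self.stack = []      # per open element: number of steps matched, or -1 if dead
        self.buffer = None   # collects character data while inside a match
        self.matches = []

    def startElement(self, name, attrs):
        parent = self.stack[-1] if self.stack else 0
        # Advance the automaton only if the parent is live and the next step agrees.
        if 0 <= parent < len(self.steps) and name == self.steps[parent]:
            state = parent + 1
        else:
            state = -1
        self.stack.append(state)
        if state == len(self.steps):
            self.buffer = []  # accepting state: start collecting text

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        state = self.stack.pop()
        if state == len(self.steps):
            self.matches.append("".join(self.buffer))
            self.buffer = None  # release the partial-match state right away

def match_path(xml_bytes, steps):
    handler = PathMatcher(steps)
    xml.sax.parseString(xml_bytes, handler)
    return handler.matches

SAMPLE = b"<xyz><def>a</def><other><def>b</def></other><def>c</def></xyz>"
result = match_path(SAMPLE, ["xyz", "def"])
print(result)
```

The stack mirrors the open elements, so the only state held at any time is proportional to the document depth plus the text of the match currently being collected.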


Different Options

Many different “streaming XPath” matching algorithms were developed, with some differences:
  * What to match with (DFA, NFA, lazy DFA, PDA, proprietary format)
  * Complexity of the path language (regular path expressions, XPath), axes (downwards, upwards, sideways), internal references (IDREFs, foreign keys), recursive patterns
  * Which operations can be pushed into the operator (selection predicates, joins, position indices)

We’ll consider TurboXPath, highlighted in red above (Tukwila’s x-scan is highlighted in green)


From XPath Patterns to Tuples and A Normal Query Plan


  for $c in doc("d1")//customer
  for $p in doc("d2")//profiles[cid/text() = $c/cid/text()]
  for $o in $c/order[date = '12/12/01']
  return <result>
           {$c/name} {$p/status} {$o/amount}
         </result>

Query plan (listed from output down to inputs):

  XML tagger (add “result”)
    input tuples: ($c/name, $p/status, $o/amount)
  ⋈ Pipelined join
    left input: TurboXPath over “d1”, producing ($c/cid/text(), $c/name, $o/amount)
    right input: TurboXPath over “d2”, producing ($p/status, $p/cid/text())
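
One common way to build a pipelined join is a symmetric hash join, which emits results as soon as matching tuples have arrived from both inputs. The slides do not say which join algorithm the plan uses, so the choice here is an assumption, as are the function name `symmetric_hash_join` and the sample tuples:

```python
from collections import defaultdict

def symmetric_hash_join(events):
    """Join two interleaved tuple streams on a key.

    events: iterable of (side, key, payload) with side in {"L", "R"}.
    Yields (left_payload, right_payload) as soon as both sides are known.
    """
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    for side, key, payload in events:
        # Insert into this side's hash table, then probe the other side's.
        tables[side][key].append(payload)
        other = "R" if side == "L" else "L"
        for match in tables[other].get(key, []):
            yield (payload, match) if side == "L" else (match, payload)

# Hypothetical arrival order: "L" tuples come from the matcher over "d1"
# (name, amount keyed by cid), "R" tuples from the matcher over "d2" (status keyed by cid).
events = [
    ("L", "c1", ("Ann", 30)),
    ("R", "c2", "gold"),
    ("R", "c1", "silver"),
    ("L", "c2", ("Bob", 12)),
]
joined = list(symmetric_hash_join(events))
print(joined)
```

Because neither input blocks the other, results flow to the XML tagger while both documents are still being parsed.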


XPath Processing in TurboXPath


Performance Issues

Predicate pushdown
  * Similar to “sargable predicates” – reduces the internal state that must be run through a cross-product to produce tuples

“Smart” memory management
  * Want to deallocate space from partial pattern matches as early as possible

Parser efficiency
  * We found that Xerces-C (validating C++ parser used by TurboXPath) was 10x slower than expat (non-validating C parser)
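
The effect of predicate pushdown on the cross-product can be illustrated with a toy sketch. The function name `candidate_tuples` and the sample data are assumptions for illustration and do not reflect TurboXPath's internal structures:

```python
from itertools import product

def candidate_tuples(names, orders, pushed_predicate=None):
    """Cross product of the buffered name and order matches for one customer."""
    if pushed_predicate is not None:
        # Pushdown: discard non-qualifying matches before they are buffered.
        orders = [o for o in orders if pushed_predicate(o)]
    return list(product(names, orders))

def on_date(order):
    return order["date"] == "12/12/01"

# Hypothetical matches buffered for a single customer.
names = ["Ann"]
orders = [
    {"date": "12/12/01", "amount": 30},
    {"date": "01/05/02", "amount": 12},
    {"date": "12/12/01", "amount": 7},
]

unfiltered = candidate_tuples(names, orders)
late = [t for t in unfiltered if on_date(t[1])]    # filter after the cross product
early = candidate_tuples(names, orders, on_date)   # filter pushed into the matcher

print(len(unfiltered), len(late), len(early))
```

Both strategies return the same tuples, but the pushed-down version never forms or stores the tuples that the predicate would later reject.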


Wrapping up…

This semester has been a whirlwind tour of many different aspects of the “data ecosystem”
  * Storage
  * Concurrency control
  * Query processing
  * Data distribution and streams
  * Heterogeneity, mappings, and reformulation (and the limitations thereof)
  * Many styles of data integration
  * XML processing

I hope I’ve been able to convey some of what makes this field both relevant and, I think, cool…


Where There Is Room for More Work (Among Many Topics)

Storage: rows versus columns
Concurrency control
Query processing
  * Is there a theory of adaptivity, and an optimal scheme?
Data distribution, networks, and streams
  * How do we distribute to 10,000 nodes?
  * What is the relationship between network communication and query processing?
Data integration, better support for collaboration
  * How can we make it less human-intensive?
“Lightweight databases”
Probabilistic databases
Visualization and interfaces
Databases meets machine learning and info retrieval


A Sampler of Some of the Systems Work by (Some) Major DB Groups

Washington: Mystiq – probabilistic databases; distrib. streams

Stanford: Trio – probabilities and “lineage” meets databases
Cornell: databases meets games; probabilistic databases
Wisconsin: Cimple; database support for monitoring clusters
MIT: Sensor query processing; signal processing; column stores
Berkeley: Data management for sensors and networks
Maryland: Querying data models; learning and probabilities meets databases
Penn: Orchestra; data and workflow provenance; keyword querying with learned ranks over databases; lightweight data integration; networking meets databases; sensor integration


Thanks!!!

I had a great time this semester – I hope you learned a lot and found it to be enjoyable. I’m looking forward to seeing your projects!