
Efficient Evaluation of Regular Path Expressions on Streaming XML Data


Page 1: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Efficient Evaluation of Regular Path Expressions on Streaming XML Data

By - Zachary G. Ives, Alon Y. Levy and Daniel S. Weld

Page 2: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Table of Contents

– A bit about XML (yes, again)
– Our goal, problem and solution
– Our XML data model
– How to ask questions?

Page 3: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Table of Contents

– X-scan operation and structure
– Digging deep into x-scan
– How good is it? Performance evaluation
– Conclusion

Page 4: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

A Bit About XML (yes, again)

– XML: the eXtensible Markup Language
– Has become a standard
– Useful for the dissemination and exchange of information

Page 5: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

A Bit About XML (yes, again)

Advantages
– Simple
– Self-describing nature
– Flexible
– Represents both structured and semi-structured data

Page 6: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

XML Structure

An XML document consists of:
– Elements: pairs of matching open and close tags.
– Elements may enclose additional elements or data values.
– Attributes: included in element tags.
– Attributes are single-valued and describe the element.

Page 7: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

XML Structure (Cont.)

– ID is a special attribute that uniquely identifies the element.
– IDREF attributes form links to other elements in the document.
– Combining ID and IDREF gives a graph structure rather than just a tree structure.

Page 8: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

XML Example

We will use this example throughout the rest of the lecture

<db>
  <lab ID="baselab" manager="smith1">
    <name>Seattle Bio Lab</name>
    <location>
      <city>Seattle</city>
      <country>USA</country>
    </location>
  </lab>
  <lab ID="lab2">
    <name>PMBL</name>
    <city>Philadelphia</city>
    <country>USA</country>
  </lab>
  <paper ID="Smith991231" source="baselab" biologist="smith1">
    <title>Autocatalysis of Spect…</title>
    …
  </paper>
  <biologist ID="smith1">
    <lastname>Smith</lastname>
    …
  </biologist>
</db>

Page 9: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Our Goal

Our goal is to perform queries and search operations on the XML document.

Several query languages have been proposed.

They represent the XML document as a graph.

Page 10: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Our Goal (Cont.)

They represent the query as a regular path expression that should be matched against the XML source.

These regular path expressions describe traversals along edges in the XML graph.

The variables in the query are mapped to XML elements along these paths.

Page 11: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Our Problem

Most XML query processors work in three stages:
– Loading the data into a local repository
– Building indexes on the repository
– Processing the query

The repository is one of:
– A relational database
– An object-oriented database
– A repository of semi-structured data

Page 12: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Our Problem (Cont.)

– Storing and indexing the data locally is expensive.
– This is especially so when the query is made over streams of incoming XML.
– The streams can come from many sources, some fast and some slow.
– Sometimes we want a partial answer, but as soon as possible.

Page 13: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Our Solution

The query can be evaluated while the data streams in.

The XML-Scan (x-scan) operator does exactly that.

It is used at the lowest level of the query plan and supplies data to the other operators.

Page 14: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The X-Scan Operator

Input:
– An XML data stream.
– A set of regular path expressions.

Output:
– A stream of bindings for the variables occurring in the expressions.

The bindings are produced incrementally, as the XML data streams in.
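To make "incrementally" concrete, here is a minimal Python sketch. It is not the X-Scan implementation, only an illustration of the idea. It uses the standard library's `xml.etree.ElementTree.iterparse` and yields a binding as soon as the matching element's closing tag has been read. The function name `stream_bindings` and the fixed tag path are assumptions made for this example, and the path is a plain tag sequence rather than a full regular path expression.

```python
import io
import xml.etree.ElementTree as ET

def stream_bindings(source, path):
    """Yield the text of every element whose tag path equals `path`,
    as soon as that element's closing tag has been parsed."""
    stack = []  # tags of the elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            if stack == path:
                yield elem.text  # a binding, produced before the input ends
            stack.pop()

# A shortened stand-in for the lecture's example document.
doc = io.BytesIO(
    b"<db><lab><name>Seattle Bio Lab</name></lab>"
    b"<lab><name>PMBL</name></lab></db>"
)
names = list(stream_bindings(doc, ["db", "lab", "name"]))
print(names)
```

Because `stream_bindings` is a generator, a consumer can start working on the first binding while the rest of the document is still arriving.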

Page 15: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The X-Scan Operator (Cont.)

The entire graph can be constructed in a single pass.

X-Scan simultaneously:
– Parses the XML data.
– Indexes nodes by their IDs.
– Resolves IDREFs.
– Returns the nodes that match the path expressions of the query.

Page 16: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The X-Scan Operator (Cont.)

Some issues X-Scan has to handle:
– Dealing with possibly cyclic data
– Preserving the order of elements
– Removing duplicate bindings that are generated due to multiple paths to the same elements

Page 17: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Data Model for XML

– Naturally, the XML data model is a graph.
– Each XML tag is an edge labeled with the tag name. It is directed to a node whose label is the tag's ID (if it has no ID it gets a number).
– A given element node will have labeled edges directed to its attribute values, sub-elements, and any other elements referenced via IDREF.
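As a rough illustration of this data model, the following Python sketch turns the lecture's example document into an edge-labeled graph. It is a sketch under stated assumptions, not the authors' code. The names `build_graph` and `IDREF_ATTRS` are made up here, and since no DTD is available the set of IDREF attributes is simply hard-coded. The elided parts of the example (the "…" children) are left out.

```python
import xml.etree.ElementTree as ET

# Assumption: without a DTD we must be told which attributes are IDREFs.
IDREF_ATTRS = {"manager", "source", "biologist"}

SAMPLE = """
<db>
  <lab ID="baselab" manager="smith1">
    <name>Seattle Bio Lab</name>
    <location><city>Seattle</city><country>USA</country></location>
  </lab>
  <lab ID="lab2">
    <name>PMBL</name><city>Philadelphia</city><country>USA</country>
  </lab>
  <paper ID="Smith991231" source="baselab" biologist="smith1">
    <title>Autocatalysis of Spect…</title>
  </paper>
  <biologist ID="smith1"><lastname>Smith</lastname></biologist>
</db>
"""

def build_graph(xml_text):
    edges = {"root": []}  # node -> list of (edge label, target node)
    values = {}           # leaf node -> text value
    counter = 0

    def fresh():
        # Elements without an ID get a number, as on the slide.
        nonlocal counter
        counter += 1
        return f"#{counter}"

    def visit(parent, elem):
        node = elem.attrib.get("ID") or fresh()
        edges[parent].append((elem.tag, node))
        edges.setdefault(node, [])
        for name, value in elem.attrib.items():
            if name == "ID":
                continue
            if name in IDREF_ATTRS:
                edges[node].append((name, value))  # reference edge
            else:
                leaf = fresh()                     # plain attribute value
                edges[node].append((name, leaf))
                edges.setdefault(leaf, [])
                values[leaf] = value
        text = (elem.text or "").strip()
        if text:
            values[node] = text
        for child in elem:
            visit(node, child)

    visit("root", ET.fromstring(xml_text))
    return edges, values

edges, values = build_graph(SAMPLE)
print(edges["baselab"])
```

With this pre-order numbering the unnamed nodes come out as #1 for db, #2 for the first name, #3 for location, #4 for the first city, and #6 and #7 for the name and city of lab2, which are the node labels the step-by-step example later in the lecture refers to.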

Page 18: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Data Model for XML (Cont.)

An example is always the best way to explain.

Page 19: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

How to ask questions?

– A variety of query languages have been proposed.
– The key feature in all of these languages is the use of regular path expressions over the data.
– Most of them also return the answer to the query as an XML document.
– X-Scan uses XML-QL.

Page 20: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The XML-QL Syntax

The syntax of XML-QL is

WHERE pattern1 IN source1,
      pattern2 IN source2, …
CONSTRUCT result

Each template pattern_i is matched against the XML data graph from source_i, and the resulting tuples are formatted as described in result.

Page 21: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The XML-QL Syntax (Cont.)

An XML-QL pattern is a set of nested tags with embedded variable names (prefixed by $) that specify bindings of graph nodes to variables.

The CONSTRUCT clause specifies a tree-structured set of edges and nodes to add to the output graph for each tuple of variable bindings.

Page 22: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The XML-QL Syntax (Cont.)

Again, an example is the best way. Let's look at

WHERE <db>
        <lab>
          <name>$n</>
          <_*><city>$c</></>
        </> ELEMENT_AS $l
      </> IN "fig1.xml"
CONSTRUCT <result>
            <center>
              <name>$n</>
              <location>$c</>
            </>
          </>

Page 23: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The XML-QL Syntax (Cont.)

As we can see, the result will be

<result>
  <center>
    <name>Seattle Bio Lab</name>
    <location>Seattle</location>
  </center>
  <center>
    <name>PMBL</name>
    <location>Philadelphia</location>
  </center>
</result>

Page 24: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The XML-QL Syntax (Cont.)

If a variable is bound to a node with sub-elements, the whole sub-graph will be inserted into the result graph.

We will use dot notation to describe the X-Scan operation.

The previous example will be rewritten as:

El = root."db"."lab"

En = El."name"

Ec = El._*."city"
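To see what these three expressions bind, here is a small Python sketch that evaluates them over a fully parsed copy of the example. This is the opposite of X-Scan's streaming approach and it ignores IDREF edges, so treat it only as a reference for the expected bindings. The function name `bindings` and the shortened document are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<db>
  <lab ID="baselab" manager="smith1">
    <name>Seattle Bio Lab</name>
    <location><city>Seattle</city><country>USA</country></location>
  </lab>
  <lab ID="lab2">
    <name>PMBL</name><city>Philadelphia</city><country>USA</country>
  </lab>
</db>
"""

def bindings(xml_text):
    root = ET.fromstring(xml_text)            # the element reached by root."db"
    rows = []
    if root.tag != "db":
        return rows
    for lab in root.findall("lab"):           # El = root."db"."lab"
        for name in lab.findall("name"):      # En = El."name"
            for city in lab.iter("city"):     # Ec = El._*."city" (any depth)
                rows.append((lab.get("ID"), name.text, city.text))
    return rows

rows = bindings(SAMPLE)
for row in rows:
    print(row)
```

Note how `_*` lets the same expression find the city of baselab (nested inside location) and the city of lab2 (a direct child).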

Page 25: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The X-Scan Place

The goal of the X-Scan operator is therefore to produce a set of bindings for each pattern in the WHERE clause.

Page 26: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

So, What Does X-Scan Do?

Given an XML stream and a set of regular path expressions, X-Scan outputs a stream of tuples assigning binding values to each variable in the set of regular path expressions.

The central mechanism is a set of state machines that traverse the XML graph, trying to satisfy the path expressions.

Page 27: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

What is it made of?

The data components of X-Scan are

Page 28: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Where does the data flow?

As the data streams into the system, several structures are created:
– The data is parsed and stored locally
– A structural index of the XML graph is created
– An ID index records the IDs of all elements and their location in the structural index
– A list of references to not-yet-seen element IDs is maintained

Page 29: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Where does the data flow? (Cont.)

In parallel with the creation of those structures, a set of finite state machines performs a DFS over the partial structural index.

When a machine reaches an accepting state, a new value is added to the binding-value table of that machine.

Those values are later combined to form complete binding tuples.

Page 30: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Example problems

It sounds easy, but there are still some problems to deal with, for example:
– The handling of cycles
– How to prune duplicate bindings as they are created? Remember, X-Scan is an online operator.

Page 31: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The State Machines

As described earlier, we create one regular path expression for every variable in the query, in dot notation.

So, we build a finite-state machine for each expression.

State transitions correspond to edge traversals in the XML data graph.

Page 32: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The State Machines (Cont.)

The end of the path expression yields an accepting state, which outputs instances of the corresponding variable.

When one variable depends on another, the accepting state of the other variable's machine points to the state machine of the dependent variable.
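A path expression can be turned into a machine fairly mechanically. The Python sketch below is one possible way to do it, offered as an assumption rather than as the paper's construction. It keeps a set of positions in the expression and advances that set on each edge label, which is the usual on-the-fly equivalent of a deterministic machine. The names `compile_path` and `accepts` are hypothetical.

```python
WILDCARD_STAR = "_*"

def compile_path(steps):
    """Compile e.g. ["_*", "city"] into (start_states, step, accept).
    State i means 'the first i steps of the expression are matched'."""
    n = len(steps)

    def closure(states):
        # A "_*" step may match zero edges, so state i also implies i + 1.
        result = set(states)
        changed = True
        while changed:
            changed = False
            for s in list(result):
                if s < n and steps[s] == WILDCARD_STAR and s + 1 not in result:
                    result.add(s + 1)
                    changed = True
        return frozenset(result)

    def step(states, label):
        nxt = set()
        for s in states:
            if s == n:
                continue                 # already past the last step
            if steps[s] == WILDCARD_STAR:
                nxt.add(s)               # the loop consumes any edge
            elif steps[s] == label:
                nxt.add(s + 1)
        return closure(nxt)

    return closure({0}), step, n

def accepts(steps, labels):
    """True if following the edge labels in order ends in the accepting state."""
    states, step, accept = compile_path(steps)
    for label in labels:
        states = step(states, label)
    return accept in states

print(accepts(["db", "lab"], ["db", "lab"]))           # El, from the root
print(accepts(["_*", "city"], ["location", "city"]))   # Ec, from a lab node
print(accepts(["name"], ["location"]))                 # En, from a lab node
```

In this sketch a dependent machine such as the one for En or Ec would simply be started at the node where the machine for El accepts.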

Page 33: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The State Machines (Cont.)

And back to our example

Page 34: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Indexing the XML Graph

The structural index should allow x-scan to quickly traverse the XML data graph.

Each node in the index contains:
– The ID of the element and its offset in the document
– Pointers to all the sub-elements, attributes and IDREFs of the element.

Essentially it looks like the graph except for the leaves.

Page 35: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The Algorithm – Step by Step

X-Scan proceeds by building the structural index and running a set of active state machines in parallel.

The core of the algorithm is the way those state machines run, so let's focus on that by running our example.

Page 36: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The Algorithm – Step by Step

Initially, only the top level machine is active.

When a machine M reaches an accepting state, it produces a binding b for its variable, writes it and the parent value to its table and activates all of its dependent state machines.

Page 37: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The Algorithm – Step by Step

Those machines remain active while x-scan is scanning b or any element accessible by a path from b.

The final output of x-scan is the equi-join of all the appropriate tables.
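The final join can be pictured with a few lines of Python. This is only a sketch of an equi-join on the parent binding, using the table contents that the worked example in this lecture produces for Ml, Mn and Mc. The function name `join_tables` and the table layout (lists of `(binding, parent binding)` pairs) are assumptions.

```python
def join_tables(parent_var, parent_rows, child_tables):
    """parent_rows: bindings of the parent variable.
    child_tables: variable name -> list of (binding, parent binding)."""
    out = []
    for parent in parent_rows:
        partial = [{parent_var: parent}]
        for var, rows in child_tables.items():
            # Keep only the child rows whose parent value matches.
            partial = [dict(t, **{var: b})
                       for t in partial
                       for (b, p) in rows if p == parent]
        out.extend(partial)
    return out

ml_table = ["baselab", "lab2"]
child_tables = {
    "n": [("#2", "baselab"), ("#6", "lab2")],
    "c": [("#4", "baselab"), ("#7", "lab2")],
}
result = join_tables("l", ml_table, child_tables)
for row in result:
    print(row)
```

A real operator would emit each joined tuple as soon as its parts exist instead of waiting for complete tables. The batch form here is just easier to read.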

Page 38: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The Algorithm – By Example

Ml is initialized in state 1 as the only active machine.

Page 39: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The Algorithm – By Example

The root has a "db" edge, so the machine's current state is pushed onto its stack and the machine moves to state 2 with value node #1.

Page 40: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The Algorithm – By Example

Next, x-scan follows the first outgoing edge, pushes the old state and value, and sets Ml to state 3 with value baselab.

Page 41: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The Algorithm – By Example

Since Ml is now in an accepting state:
– The baselab value is written to the Ml table
– Ml is suspended
– Mn and Mc are activated

Page 42: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The Algorithm – By Example

The next edge takes Mn from state 4 to state 5.

Mc follows its loop back to state 6.

Both machines have #2 as their binding value.

Page 43: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The Algorithm – By Example

– Since Mn is now in an accepting state, x-scan writes <#2,baselab> into Mn's table.
– Since no edges remain for exploration, x-scan pops the stack and backs up the state machines, resetting Mn to state 4 and Mc to state 6.

Page 44: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The Algorithm – By Example

The next edge is labeled location, so:
– Mn stays in state 4
– Mc also stays in state 6 but advances to node #3

Then Mc advances to state 7 on the city edge to node #4.

Page 45: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The Algorithm – By Example

At this point x-scan writes <#4,baselab> into Mc's table.

It can also produce the first tuple of bindings, <l/baselab,n/#2,c/#4>.

Page 46: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The Algorithm – By Example

– X-Scan keeps running Mc, but no more cities are found.
– It pops back up to baselab.
– Running Mc along the IDREF to smith1 gives no more cities.

Page 47: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

The Algorithm – By Example

– Now Mn and Mc are deactivated and control returns to Ml.
– X-scan pops up to node #1, in state 2.
– The other lab edge yields another tuple, <l/lab2,n/#6,c/#7>.

Page 48: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Where should we go?

On occasion x-scan will encounter an IDREF to a node that has not yet been parsed.

An unknown node simply will not be in the ID index.

Page 49: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Where should we go?

When X-Scan hits such an unseen reference, it:
– Pauses all the relevant state machines
– Adds an entry <desired ID value, referrer's address> to the list of unresolved IDREFs
– Continues to parse and build the structural index

Page 50: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Where should we go?

Once the target element is parsed, x-scan:
– Fills its address into each referring IDREF in the structural index
– Removes the entry from the list of unresolved IDREFs
– Awakens the state machines and proceeds
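The bookkeeping for forward references can be sketched as follows. This is a simplified Python illustration, not the paper's data structures. The class name `RefResolver`, its method names and the integer "addresses" are all assumptions, and pausing or waking state machines is reduced to returning the list of waiting referrers.

```python
class RefResolver:
    def __init__(self):
        self.id_index = {}    # element ID -> address in the structural index
        self.unresolved = {}  # desired ID -> addresses of waiting referrers
        self.links = {}       # referrer address -> resolved target address

    def reference(self, referrer, wanted_id):
        """Called when an IDREF is parsed. Returns True if it resolved now."""
        if wanted_id in self.id_index:
            self.links[referrer] = self.id_index[wanted_id]
            return True
        # Target not seen yet: remember <desired ID, referrer's address>.
        self.unresolved.setdefault(wanted_id, []).append(referrer)
        return False

    def define(self, elem_id, address):
        """Called when an element with an ID is parsed. Returns the referrers
        that were waiting, so their state machines can be resumed."""
        self.id_index[elem_id] = address
        waiting = self.unresolved.pop(elem_id, [])
        for referrer in waiting:
            self.links[referrer] = address
        return waiting

resolver = RefResolver()
# baselab's manager="smith1" is read before the biologist element itself.
resolved_now = resolver.reference(referrer=10, wanted_id="smith1")
woken = resolver.define("smith1", address=42)
print(resolved_now, woken, resolver.links)
```

In the lecture's example document this situation actually occurs, because the biologist element with ID smith1 comes after the lab and paper elements that refer to it.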

Page 51: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Are We Going in Circles?

Sometimes the input XML graph contains cycles.

X-Scan must not get trapped in an infinite loop.

Page 52: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Are We Going in Circles?

Consider the following XML graph

[Figure: a cyclic XML graph with nodes #1, #2, #3 and #4, four edges labeled a and one edge labeled b]

Page 53: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Are We Going in Circles?

If we refuse to move in circles, we will miss the answer to the query

Ex = root._*."b"."a"

But if we allow moving in circles, we are going to get into trouble with this one

Ey = root._*."z"

Page 54: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Are We Going in Circles?

What can we do?

Now we are going to use the stack.

The stack contains pairs of the form (binding, state), describing which bindings have been associated with which states of the machine along the current path.

Page 55: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Are We Going in Circles?

Since x-scan uses deterministic finite state machines, returning to a previous state with the same binding will not add any new possible actions.

So, when a machine enters a state, it should check that this state has not already been bound to the same binding along the current path.
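Here is a small Python sketch of that check. The cyclic graph below is hypothetical (the figure from the lecture is not reproduced here), and the two transition functions are hand-written deterministic machines for the expressions Ex = root._*."b"."a" and Ey = root._*."z". All names are assumptions. The point to notice is the `on_path` set of (node, state) pairs.

```python
# Hypothetical cyclic graph: #1 -a-> #2, #2 -a-> #3, #2 -a-> #4, #3 -b-> #2
GRAPH = {
    "#1": [("a", "#2")],
    "#2": [("a", "#3"), ("a", "#4")],
    "#3": [("b", "#2")],
    "#4": [],
}

def delta_x(state, label):
    """DFA for _*."b"."a": 0 = nothing useful, 1 = just saw b, 2 = saw b then a."""
    if label == "b":
        return 1
    if label == "a" and state == 1:
        return 2
    return 0

def delta_z(state, label):
    """DFA for _*."z": state 1 means the last edge was z."""
    return 1 if label == "z" else 0

def run_machine(graph, root, delta, accepting):
    results = []
    on_path = set()  # (node, state) pairs on the current DFS path

    def dfs(node, state):
        if (node, state) in on_path:
            return  # same binding in the same state: nothing new to learn
        on_path.add((node, state))
        if state in accepting:
            results.append(node)
        for label, target in graph.get(node, []):
            dfs(target, delta(state, label))
        on_path.remove((node, state))

    dfs(root, 0)
    return results

matches_x = run_machine(GRAPH, "#1", delta_x, {2})
matches_z = run_machine(GRAPH, "#1", delta_z, {1})
print(matches_x, matches_z)
```

For Ex the search is allowed to come back to #2 through the b edge, because it arrives there in a different state. For Ey the machine returns to #2 in the same state, so the search stops instead of looping forever. This sketch makes no attempt to remove duplicate bindings.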

Page 56: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Are We Going in Circles?

Does it work for our example?

Look at those state machines.

Page 57: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Some Enhancements

It is important to prevent the operator from spending time evaluating paths that are not useful.

There are two enhancements:
– Selection push-down.
– Duplicate elimination.

Page 58: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Selection Push-Down

The query optimizer creates selection operators and pushes them down into the x-scan operation.

This works only on attributes, since they are single-valued.

So X-Scan evaluates all of a node's attribute edges before its sub-element edges.

Page 59: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Selection Push-Down

Here also, the best way to explain is by example.

WHERE <db>
        <lab manager="smith1">
          <name>$n</>
          <_*><city>$c</></>
        </> ELEMENT_AS $l
      </> IN "fig1.xml"
CONSTRUCT <result>
            <center>
              <name>$n</>
              <location>$c</>
            </>
          </>

Page 60: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Selection Push-Down

The query plan generator must create an additional temporary variable temp1 and a regular path expression

Etemp1 = El.@"manager"

It also adds a selection predicate

Etemp1 = "smith1"

Page 61: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Selection Push-Down

Now, for the second lab: since it has only an ID attribute, X-Scan iterates through all of lab2's attributes and finds no manager attribute.

So it can "short-circuit" on this sub-graph, discarding the value of l and ignoring its children.
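The short-circuit can be illustrated with a few lines of Python over a parsed copy of the example. This is a sketch of the idea only: a real streaming operator would skip work while parsing, whereas here the whole document is parsed first. The function name `labs_managed_by` and the `skipped` list are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<db>
  <lab ID="baselab" manager="smith1">
    <name>Seattle Bio Lab</name>
    <location><city>Seattle</city><country>USA</country></location>
  </lab>
  <lab ID="lab2">
    <name>PMBL</name><city>Philadelphia</city><country>USA</country>
  </lab>
</db>
"""

def labs_managed_by(xml_text, manager):
    root = ET.fromstring(xml_text)
    rows, skipped = [], []
    for lab in root.findall("lab"):
        # Attributes are examined before any sub-element is touched.
        if lab.get("manager") != manager:
            skipped.append(lab.get("ID"))  # short-circuit: ignore the children
            continue
        for name in lab.findall("name"):
            for city in lab.iter("city"):
                rows.append((lab.get("ID"), name.text, city.text))
    return rows, skipped

rows, skipped = labs_managed_by(SAMPLE, "smith1")
print(rows, skipped)
```

Since lab2 has no manager attribute, its name and city are never looked at.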

Page 62: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Duplicate Elimination

– Sometimes we can visit an element multiple times through different paths.
– This can produce duplicate binding tuples.
– The naive way is to add a post-processing stage.
– It can be done more cleverly, so that there is no need to save the entire history.

Page 63: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

There is a big one on the way

Main memory may not be large enough to hold all of the index structures.

The way to handle this is by:
– Paging the XML source document
– Paging the structural index

A conventional buffer manager using LRU or a similar policy is sufficient.

Page 64: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

There is a big one on the way

But what about the ID lookup index, the list of unresolved IDREFs and the state machine stack?

These can use either a B+-tree or a multilevel hash table.

The size of the stack is bounded by the product of the number of variables and the length of the longest non-repeating path.

The stacks of inactive state machines can naturally be paged out.

Page 65: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

How good is it?

They used the IBM XML4C parser, version 3.0.1, with the SAX parser API to implement X-Scan.

The SAX API provides callbacks to the code as elements are read, which allows X-Scan to evaluate streaming XML data.
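For readers who have not used SAX, here is what the callback style looks like. The authors used the C++ XML4C parser, so this Python version with the standard `xml.sax` module is only an analogous sketch. The handler class `PathRecorder` is a made-up name. It records the tag path of every element at the moment its opening tag is read.

```python
import xml.sax

class PathRecorder(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.path = []  # tags of the currently open elements
        self.seen = []  # tag paths, in the order the parser reported them

    def startElement(self, name, attrs):
        # Called by the parser as soon as an opening tag has been read.
        self.path.append(name)
        self.seen.append("/".join(self.path))

    def endElement(self, name):
        # Called when the matching closing tag has been read.
        self.path.pop()

handler = PathRecorder()
xml.sax.parseString(b"<db><lab><name>PMBL</name></lab></db>", handler)
print(handler.seen)
```

The parser drives the code, not the other way around, which is what lets an operator react to each element while the rest of the stream is still on its way.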

Page 66: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

How good is it?

They compared X-Scan against:
– Stanford's Lore semi-structured/XML database system.
– A commercial OO-based XML repository.

The experiments were performed with locally stored XML files, so X-Scan loses some of its advantages.

Page 67: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

How good is it?

Also, this X-Scan implementation didn't include selection predicates.

All the queries were run on a single-processor 450 MHz Pentium II with 256 MB of memory.

X-Scan and the OO-based system ran on Windows NT, and Lore ran on Linux.

Page 68: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

How good is it?

The queries were run on the following documents.

Mondial and VLDB contain many references, whereas the rest are mostly tree-structured.


Page 70: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

How good is it?

Conclusions:
– Neither Lore nor the commercial system scales up well to queries across multi-megabyte data files.
– They failed particularly on files that contain graph structure.
– X-Scan scales better in all cases.

Page 71: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

How good is it?

Further experiments were conducted on synthetic XML data files.
– These XML files were randomly generated.

Their purpose was to check the scalability of X-Scan.

They averaged three different runs across each of three different random graphs generated with the same parameters.

Page 72: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

How good is it?

The first experiments were conducted on tree-structured data.

Therefore they didn't have to build the structural index or the ID and IDREF tables.

This was to check how well the state machines perform.

Page 73: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

How good is it?

The results were

Page 74: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

How good is it?

Next, they wanted to measure the cost of graph indexing and resolving references.
– Without traversing the IDREFs.

They took the same graphs from the previous tests and changed the DTD back so that the data would be treated as a graph.

The results were.


Page 77: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

How good is it?

Next, they wanted to check the effectiveness of the structural index when called on to evaluate such reference edges.

The results were.


Page 80: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Conclusion

X-Scan differs in three key ways from previous work:

1. The structural index allows more efficient traversal without splitting the data into tables or objects that must be recombined later.

2. X-Scan's state machines are based on the query rather than on the data source. They are not reusable, but the data is always reread anyway.

3. X-Scan is pipelined and produces bindings as data is being streamed into the system.

Page 81: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Conclusion (Cont.)

Other points regarding X-Scan:
– It handles cycles well
– It preserves the document order and structure
– It eliminates duplicate tuples
– The state machines are independent and so can run in parallel
– X-Scan is very efficient, typically imposing 8% overhead on top of the time required to parse the XML document

Page 82: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

THE END