CS 561 Presentation: Indexing and Querying XML Data for Regular Path Expressions

1


A Paper by Quanzhong Li and Bongki Moon

Presented by Ming Li

2

Our Objective

• Developing a system that will enable us to perform XML data queries efficiently.

3

XML Query Languages

• Used for retrieving data from XML files.

• Use a regular path expression syntax.

• e.g. XPath, XQuery.

4

Queries Today - Inefficient

• Usually XML tree traversals – inefficient.
  – Top-down approach
  – Bottom-up approach
  – An example: the query

      /chapter/_*/figure

    (finding all figures in all chapters).
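To make the cost of a traversal-based evaluation concrete, here is a minimal Python sketch of a top-down evaluation of /chapter/_*/figure. It assumes that `_*` matches any chain of zero or more elements; the function name `top_down_figures` and the sample document are hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET


def top_down_figures(root):
    """Evaluate /chapter/_*/figure by walking every node below the root.

    Returns (figures, visited), where `visited` counts the nodes inspected.
    """
    figures, visited = [], 0
    if root.tag != "chapter":
        return figures, visited
    stack = list(root)              # children of the root chapter
    while stack:
        node = stack.pop()
        visited += 1
        if node.tag == "figure":
            figures.append(node)
        stack.extend(node)          # descend regardless of the tag
    return figures, visited


# Hypothetical sample document.
doc = ET.fromstring(
    "<chapter><section><figure id='f1'/><para/></section>"
    "<figure id='f2'/></chapter>"
)
figures, visited = top_down_figures(doc)
figure_ids = sorted(f.get("id") for f in figures)
```

Every node under the chapter is inspected, even those that cannot lead to a figure, which is the inefficiency this slide refers to.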

5

Our Objective - Refined

• Developing a system that will enable us to perform XML data queries efficiently

• Developing such a system consists of:
  – Developing a way to efficiently store XML data.
  – Developing efficient algorithms for processing regular path expressions (e.g. XQuery expressions).

6

Storing XML Documents - XISS

• XISS - XML Indexing and Storage System.

• Provides us with ways to:
  – efficiently find all elements or attributes with the same name string, grouped by the document to which they belong.
  – quickly determine the ancestor-descendant relationship between elements and/or attributes in the hierarchy of XML data.

7

Determining Ancestor-Descendant Relationship

• According to Dietz's numbering scheme: for two given nodes x and y of a tree T, x is an ancestor of y iff x occurs before y in the preorder traversal and after y in the postorder traversal.

• Example:
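Dietz's test can be illustrated with a minimal Python sketch. The `Node` class, the helper names, and the small book/chapter tree are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    pre: int = 0    # rank in the preorder traversal
    post: int = 0   # rank in the postorder traversal


def number(root):
    """Assign preorder and postorder ranks (starting at 1) in one DFS."""
    counters = {"pre": 0, "post": 0}

    def visit(node):
        counters["pre"] += 1
        node.pre = counters["pre"]
        for child in node.children:
            visit(child)
        counters["post"] += 1
        node.post = counters["post"]

    visit(root)


def is_ancestor(x, y):
    """x is an ancestor of y iff x is earlier in preorder and later in postorder."""
    return x.pre < y.pre and x.post > y.post


# Hypothetical tree: book -> (chapter -> (title, figure), appendix)
title, figure = Node("title"), Node("figure")
chapter = Node("chapter", [title, figure])
appendix = Node("appendix")
book = Node("book", [chapter, appendix])
number(book)
```

After numbering, each ancestor test is two integer comparisons, independent of the size of the tree.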

8

Determining Ancestor-Descendant Relationship – cont.

• Advantage: the ancestor-descendant relationship can be determined in constant time.

• Disadvantage: a lack of flexibility.
  – e.g. inserting a new node requires recomputing the preorder and postorder ranks of many tree nodes.

9

• A new numbering scheme:
  – Each node is associated with an <order, size> pair:
    • For a tree node y and its parent x:

        [order(y), order(y) + size(y)] ⊆ (order(x), order(x) + size(x)]

      (the left end of the parent's interval is exclusive).
    • For two sibling nodes x and y, if x is the predecessor of y in preorder traversal, the following holds:

        order(x) + size(x) < order(y).

10


• Fact: for two given nodes x and y of a tree T, x is an ancestor of y iff:

    order(x) < order(y) ≤ order(x) + size(x)
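As a rough illustration, the sketch below assigns <order, size> pairs to a small tree and applies the ancestor test. The `XNode` class, the `assign` helper, and the `slack` parameter (unused numbers reserved inside each node's interval) are assumptions made for this example; the slides do not say how the values are chosen.

```python
from dataclasses import dataclass, field


@dataclass
class XNode:
    name: str
    children: list = field(default_factory=list)
    order: int = 0
    size: int = 0


def assign(node, order, slack=2):
    """Number the subtree rooted at `node`, starting at `order`.

    Each node's interval covers all of its descendants plus `slack`
    spare numbers (a hypothetical policy for this sketch).
    """
    node.order = order
    next_order = order + 1
    for child in node.children:
        assign(child, next_order, slack)
        # The next sibling starts after the previous sibling's interval.
        next_order = child.order + child.size + 1
    node.size = (next_order - 1 - order) + slack


def is_ancestor(x, y):
    """order(x) < order(y) <= order(x) + size(x)"""
    return x.order < y.order <= x.order + x.size


# Hypothetical tree: book -> (chapter -> (title, figure), appendix)
title, figure = XNode("title"), XNode("figure")
chapter = XNode("chapter", [title, figure])
appendix = XNode("appendix")
book = XNode("book", [chapter, appendix])
assign(book, 1)
```

With this policy, every child's interval lies inside its parent's interval and sibling intervals never overlap, so the test above needs only the two pairs being compared.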

11


• Properties:
  – the ancestor-descendant relationship can be determined in constant time.
  – flexibility – node insertion usually doesn't require renumbering existing tree nodes.
  – an element can be uniquely identified in a document by its order value.
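The flexibility property can be sketched as follows: as long as the parent's interval still contains unused numbers, a new child takes one of them and no existing node is renumbered. The `Numbered` class, the `append_leaf` helper, and the concrete <order, size> values are hypothetical; this is one possible insertion policy, not the paper's.

```python
from dataclasses import dataclass, field


@dataclass
class Numbered:
    name: str
    order: int
    size: int
    children: list = field(default_factory=list)


def append_leaf(parent, name):
    """Append a new last child using an unused number of the parent's interval.

    Returns the new node, or None when no number is free, in which case
    part of the tree would have to be renumbered.
    """
    if parent.children:
        last = parent.children[-1]
        first_free = last.order + last.size + 1
    else:
        first_free = parent.order + 1
    if first_free > parent.order + parent.size:
        return None
    leaf = Numbered(name, first_free, 0)
    parent.children.append(leaf)
    return leaf


# Hypothetical numbering with spare room: chapter covers (2, 10].
chapter = Numbered("chapter", order=2, size=8)
chapter.children.append(Numbered("title", 3, 2))    # interval [3, 5]
chapter.children.append(Numbered("figure", 6, 2))   # interval [6, 8]

first = append_leaf(chapter, "table")    # fits in the spare room
second = append_leaf(chapter, "note")    # still fits
third = append_leaf(chapter, "index")    # no number left
```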

12

XISS System Overview

13

Name Index and Value Table

• Objective: minimizing the storage and computation overhead by eliminating replicated strings and string comparisons.

• Name Index - mapping distinct name strings into unique name identifiers (nid).

• Value Table - mapping distinct value strings (i.e. attribute value and text value) into unique value identifiers (vid).

• Both are implemented as B+-trees.
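The idea of replacing strings by identifiers can be sketched in a few lines of Python. Note that this sketch swaps in an in-memory dict and list where XISS uses a B+-tree; the class `StringIndex` and its method names are hypothetical.

```python
class StringIndex:
    """Maps distinct strings to small integer identifiers and back.

    A stand-in for the Name Index (nid) or the Value Table (vid);
    a dict replaces the B+-tree used by XISS.
    """

    def __init__(self):
        self._ids = {}
        self._strings = []

    def intern(self, text):
        """Return the identifier of `text`, allocating one on first use."""
        ident = self._ids.get(text)
        if ident is None:
            ident = len(self._strings)
            self._ids[text] = ident
            self._strings.append(text)
        return ident

    def lookup(self, ident):
        """Return the string stored under `ident`."""
        return self._strings[ident]


names = StringIndex()    # name string -> nid
values = StringIndex()   # value string -> vid

nid_chapter = names.intern("chapter")
nid_figure = names.intern("figure")
vid_a1 = values.intern("A1")
```

Once names are interned, comparing two element names reduces to comparing two integers, and each distinct string is stored only once.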

14

The Element Index

• Objective: quickly finding all elements with the same name string.

• Structure:

15

The Attribute Index

• Objective: quickly finding all attributes with the same name string.

• Structure:
  – Same structure as the Element Index, except that each record in the Attribute Index has a value identifier (vid), which is a key used to obtain the attribute value from the Value Table.

16

The Structure Index

• Objectives:
  – Finding the parent element and child elements (or attributes) for a given element.
  – Finding the parent element for a given attribute.

• Structure:

17

The Structure Index – cont.

• Structure:
  – B+-tree using document identifier (did) as a key.
  – Leaf nodes: linear arrays with records for all elements and attributes from an XML document.
  – Each record: {nid, <order,size>, Parent order, Child order, Sibling order, Attribute order}.
  – Records are ordered by order value.
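The record layout above might be modeled as in the following sketch. Several points are assumptions made for illustration: "Child order" is read as the order of the first child element, "Sibling order" as the order of the next sibling, and "Attribute order" as the order of the first attribute; a dict keyed by did stands in for the B+-tree; and the names `Record` and `DocumentEntry` are hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    nid: int
    order: int
    size: int
    parent: Optional[int] = None      # order of the parent element
    child: Optional[int] = None       # assumed: order of the first child
    sibling: Optional[int] = None     # assumed: order of the next sibling
    attribute: Optional[int] = None   # assumed: order of the first attribute


class DocumentEntry:
    """Linear array of records for one document, sorted by order value."""

    def __init__(self, records):
        self.records = sorted(records, key=lambda r: r.order)
        self.orders = [r.order for r in self.records]

    def find(self, order):
        """Binary search for the record with the given order value."""
        i = bisect_left(self.orders, order)
        if i < len(self.orders) and self.orders[i] == order:
            return self.records[i]
        return None

    def parent_of(self, order):
        rec = self.find(order)
        if rec is None or rec.parent is None:
            return None
        return self.find(rec.parent)

    def children_of(self, order):
        """Follow the first-child link, then the chain of sibling links."""
        rec = self.find(order)
        result = []
        nxt = rec.child if rec else None
        while nxt is not None:
            child = self.find(nxt)
            result.append(child)
            nxt = child.sibling
        return result


# Hypothetical document 7: an element with one attribute and one child
# element, which in turn has one attribute.
structure_index = {
    7: DocumentEntry([
        Record(nid=0, order=1, size=3, child=3, attribute=2),
        Record(nid=1, order=2, size=0, parent=1),
        Record(nid=0, order=3, size=1, parent=1, attribute=4),
        Record(nid=1, order=4, size=0, parent=3),
    ])
}
```

Because the records are kept sorted by order value, a lookup by order is a binary search over the array.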

18

Querying Method

• Decomposing path expressions into simple path expressions.

• Applying algorithms on simple path expressions and their intermediate results.

19

Decomposition of Path Expressions

• The main idea:
  – A complex path expression is decomposed into several simple path expressions.
  – Each simple path expression produces an intermediate result that can be used in the subsequent stage of processing.
  – The results of the simple path expressions are then combined or joined together to obtain the final result of the given query.

20

Basic Subexpressions - Example

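Decomposition of

(E1/E2)*/E3/((E4[@a=V]) | (E5/_*/E6))

into basic subexpressions of five types:

(1) Single Element/Attribute
(2) Element-Attribute
(3) Element-Element
(4) Kleene Closure
(5) Union

[Figure: decomposition tree of the expression above; each internal node is labeled with its subexpression type (2)–(5), and the leaves are single elements/attributes of type (1).]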

21

Example: EA-Join: Element and Attribute Join

22

EA-Join: Element and Attribute Join

Input:

{E1,…,Em}: each Ei is a set of elements having a common document identifier (did);

{A1,…,An}: each Aj is a set of attributes having a common document identifier (did);

Output:

A set of (e,a) pairs such that the element e is the parent of the attribute a.

23

EA-Join: Element and Attribute Join

The Algorithm:

// Sort-merge {Ei} and {Aj} by did.
(1) foreach Ei and Aj with the same did do
      // Sort-merge Ei and Aj by PARENT-CHILD relationship.
(2)   foreach e ∈ Ei and a ∈ Aj do
(3)     if (e is a parent of a) then output (e,a)
      end
    end
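The pseudocode above can be sketched in Python as a two-level merge. This is an illustration under stated assumptions, not the paper's implementation: both inputs are assumed to be sorted by did and, within a document, by order; each attribute is assumed to carry the order of its parent element (the hypothetical `parent` field); and attributes are assumed to be numbered before the child elements of the same parent. All names here are hypothetical.

```python
from collections import namedtuple

Element = namedtuple("Element", "order size")
# `parent` (order of the owning element) is an assumed field.
Attribute = namedtuple("Attribute", "order parent")


def merge_parent_child(elems, attrs):
    """Single scan over two order-sorted lists, yielding (e, a) pairs."""
    i = j = 0
    while i < len(elems) and j < len(attrs):
        e, a = elems[i], attrs[j]
        if a.order < e.order:
            j += 1            # a precedes e, so it cannot belong to e or later elements
        elif a.parent == e.order:
            yield e, a
            j += 1
        else:
            i += 1            # a lies beyond e's attributes; move to the next element


def ea_join(element_groups, attribute_groups):
    """Merge two did-sorted lists of (did, items); yield (did, e, a)."""
    i = j = 0
    while i < len(element_groups) and j < len(attribute_groups):
        did_e, elems = element_groups[i]
        did_a, attrs = attribute_groups[j]
        if did_e < did_a:
            i += 1
        elif did_a < did_e:
            j += 1
        else:
            for e, a in merge_parent_child(elems, attrs):
                yield did_e, e, a
            i += 1
            j += 1


# Hypothetical input: document 7 has matching elements and attributes,
# documents 5 and 9 appear on one side only.
element_groups = [(5, [Element(1, 0)]), (7, [Element(1, 3), Element(3, 1)])]
attribute_groups = [(7, [Attribute(2, 1), Attribute(4, 3)]), (9, [Attribute(2, 1)])]
pairs = [(did, e.order, a.order) for did, e, a in ea_join(element_groups, attribute_groups)]
```

Neither loop ever moves a pointer backwards, so each input list is read once.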

24

EA-Join – Example

• Consider the XML document:

<Ele Att="A1">
  <Ele Att="A2"> </Ele>
</Ele>

• And the query: /Ele[@Att="A1"]

• Node numbering, as <order,size> pairs:

Ele <1,3>
  Att <2,0>
  Ele <3,1>
    Att <4,0>

25

EA-Join – Querying /Ele[@Att="A1"]

<Ele Att="A1">
  <Ele Att="A2"> </Ele>
</Ele>

Ele <1,3>
  Att <2,0>
  Ele <3,1>
    Att <4,0>

• Sort-merging the "Ele"s and "Att"s by parent-child relationship gives us the list: <1,3>, <2,0>, <3,1>, <4,0>

• Finding the "Ele" elements that have a child attribute "Att" with the value "A1" in the resulting list is then easy using the information in the element record.

26

EA-Join – Comments

• Only a two-stage sort-merge operation without additional cost of sorting:
  – First merge: by did.
  – Second merge: by examining parent-child relationship.

• This merge is based on the order values of the element and attribute as defined by the numbering scheme.

• Attributes should be placed before their sibling elements in the order of the numbering scheme.
  – This guarantees that elements and attributes with the same did can be merged in a single scan.

27

Conclusions

• XISS can efficiently process regular path expression queries.

• Performance improvement over the conventional methods by up to an order of magnitude.

• Future work: optimal page size, or the break-even point between the two criteria.

28

Thank you so much!