Efficient Processing of XPath Queries Using Indexes

1


Yan Chen1, Sanjay Madria1, Kalpdrum Passi2, Sourav Bhowmick3

1 Department of Computer Science, University of Missouri-Rolla, Rolla, MO 65409, USA

2 Dept. of Math. & Computer Science, Laurentian University, Sudbury ON P3E 2C6 Canada

3 School of Computer Engineering, Nanyang Technological University, Singapore



2

Querying Semistructured Data

• Query languages for semistructured data: XQuery, XML-QL, XML-GL, Lorel, and Quilt

• Semistructured data is represented as a graph

• Queries on such data are expressed in the form of regular path expressions

• XPath is a language that describes the syntax for addressing path expressions over XML data

• Indexes on XML data improve the performance of queries on large XML files

• Indexing techniques used in relational and object-oriented databases do not suffice for semistructured data due to the nature of the data

3

Indexing Semistructured Data

• Dataguides
– record information on the existing paths in a database
– do not provide any information on parent-child relationships between nodes in the database
– as a result, they cannot be used for navigation from an arbitrary node

• T-indexes
– specialized path indexes, which summarize only a limited class of paths
– 1-index and 2-index are special cases of T-indexes

4


• LORE
– Uses four different types of index structures: value, text, link, and path indexes
– The value index and text index are used to search for objects that have specific values
– The link index and path index provide fast access to the parents of an object and to all objects reachable via a given labeled path
– Lore uses OEM (Object Exchange Model) to store data and OQL (Object Query Language) as its query language

5


• ToXin
– has two different types of index structure: the value index and the path index
– The path index has two parts, the index tree and the instance functions; these functions can be used to trace the parent-child relationship
– Its path index contains only parent and children information, whereas in our model we store the complete path from the root to each node
– ToXin uses an index for a single level, while we use multiple indexes for different levels

6

A Sample XML File

<BOOKSTORE name="Benny-bookstore">
  <BOOK title="Brave the new world">
    <ISBN>1-1-1</ISBN>
    <AUTHOR> David </AUTHOR>
  </BOOK>
  <BOOK title="Glory days">
    <ISBN>1-1-2</ISBN>
    <AUTHOR> Chris </AUTHOR>
  </BOOK>
  <BOOK title="I love the game">
    <ISBN>1-1-3</ISBN>
    <AUTHOR> Chris</AUTHOR>
  </BOOK>
  <BOOK title="What lies beneath">
    <ISBN>1-1-4</ISBN>
    <AUTHOR> Michael</AUTHOR>
  </BOOK>
  <BOOK title="Matrix II">
    <ISBN>1-1-5</ISBN>
    <AUTHOR> Jason </AUTHOR>
  </BOOK>
  <BOOK title="The Root">
    <ISBN>1-1-6</ISBN>
    <AUTHOR> Tomas </AUTHOR>
  </BOOK>
</BOOKSTORE>

7

XML as DOM Tree

[Figure: DOM tree of the sample XML file. The root node &1 [BOOKSTORE: Benny-bookstore] has six BOOK children (&2, &3, &4, &13, &16, &19). Each BOOK node has an ISBN child and an AUTHOR child (&7 to &12, &14, &15, &17, &18, &20, &21).]

8

Indexing XML Data - Motivation

• Retrieve all the books with the author’s name “Chris” from the Benny-bookstore
– We need to find all BOOK nodes in the DOM tree that are children of BOOKSTORE.
– Then, for each BOOK, we need to test the author’s name.
– After about 100,000 comparisons, we get a couple of books with author “Chris” as the output.
– By using an index on AUTHOR, we do not need to test the author of each BOOK node.
– With the index key “Chris”, we can find all matching author nodes faster.
– The nodes obtained can then be checked against the rest of the query condition.
– This is a “bottom-up” query plan.
– Such a plan is useful when we have a relatively “small” result set at the bottom, which can be pre-selected.

9

Indexing XML Data - Motivation

• Find all the books with the name beginning with “glory” and the author “Chris”
– The query plan could be to get all the books with the name “glory”, disregarding their authors.
– If only a small number of books satisfy the constraint (e.g., four “glory” books), it might be useful to introduce another type of index, which is built on the values of some nodes.
– Here, we need an index on strings.
– On the basis of the nodes obtained in the first step, we can then test the other conditions of the query.
– Hence, we can build a set of nodes as the “entry set”, which will depend on the specific query and on the type of XML data.

10

Types of Indexes

• Name-index (Nindex)
– A name-index locates nodes by tag name
– The Nindex for the tag <BOOK> over the XML fragment in figure 2 is {&2, &3, &4, &13, &16, &19}

• Value-index (Vindex)
– A value-index locates nodes with a given value
– The value-index for the word “Chris” is {&10, &12}; for the word “the” it is {&2, &4}

• Path-index (Pindex)
– A path-index locates nodes by their path from the root node
– The path index is the information we attach to each node to record its ancestors’ path
– In the DOM tree, the path information of node &11 is {&1, &4}; that of node &7 is {&1, &2}

• Descent Number (DN)
– The Descent Number is the information we attach to every node to record the number of its descendants
– In the DOM tree, the DN of node &11 is 0; the DN of node &3 is 2
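The four structures above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `build_indexes`, the dictionary layout, and the preorder node ids (1, 2, 3, ...) are assumptions made here, so the ids do not match the &-numbers used on the slides. Words are lowercased before they go into the value index, which is also an assumption.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# A shortened version of the sample bookstore file.
XML = """<BOOKSTORE name="Benny-bookstore">
  <BOOK title="Glory days"><ISBN>1-1-2</ISBN><AUTHOR> Chris </AUTHOR></BOOK>
  <BOOK title="I love the game"><ISBN>1-1-3</ISBN><AUTHOR> Chris</AUTHOR></BOOK>
  <BOOK title="The Root"><ISBN>1-1-6</ISBN><AUTHOR> Tomas </AUTHOR></BOOK>
</BOOKSTORE>"""


def build_indexes(root):
    """Build name, value, and path indexes plus descent numbers in one pass."""
    nindex = defaultdict(list)  # tag name -> node ids
    vindex = defaultdict(list)  # lowercased word -> node ids
    pindex = {}                 # node id -> ancestor ids, root first
    dn = {}                     # node id -> number of descendants
    counter = 0                 # hypothetical preorder node id

    def visit(elem, ancestors):
        nonlocal counter
        counter += 1
        nid = counter
        nindex[elem.tag].append(nid)
        pindex[nid] = list(ancestors)
        # Index the words of the element text and of its attribute values.
        text = " ".join([elem.text or ""] + list(elem.attrib.values()))
        for word in set(text.lower().split()):
            vindex[word].append(nid)
        # DN = number of children plus the DN of every child.
        total = 0
        for child in elem:
            total += 1 + visit(child, ancestors + [nid])
        dn[nid] = total
        return total

    visit(root, [])
    return nindex, vindex, pindex, dn


nindex, vindex, pindex, dn = build_indexes(ET.fromstring(XML))
print(nindex["BOOK"], vindex["chris"], pindex[6], dn[2])
```

With these indexes, a lookup such as "all nodes containing the word Chris" becomes a single dictionary access, and the ancestors of any hit are available without walking the tree.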

11

Example for XPath Queries

<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book price="55">
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>

12

Data Model for XPath

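[Figure: tree data model for XPath, much like the XQuery data model. The root has the root element bib as its child; bib has two book children, each with publisher, author, ... children carrying text values such as Addison-Wesley and Serge Abiteboul. The figure also labels processing-instruction and comment nodes.]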

13

XPath: Simple Expressions

/bib/book/year

Result: <year> 1995 </year>

<year> 1998 </year>

/bib/paper/year

Result: empty (there were no papers)
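The two expressions can be tried with Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. This is a minimal sketch on a shortened copy of the bib document; because `ET.fromstring` returns the `<bib>` element itself, the absolute path /bib/book/year is written here as the relative path `./book/year`.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the bib example (authors omitted for brevity).
BIB = """<bib>
  <book>
    <publisher>Addison-Wesley</publisher>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
  <book price="55">
    <publisher>Freeman</publisher>
    <title>Principles of Database and Knowledge Base Systems</title>
    <year>1998</year>
  </book>
</bib>"""

root = ET.fromstring(BIB)  # root is the <bib> element

# /bib/book/year: one <year> element per book.
years = [y.text for y in root.findall("./book/year")]

# /bib/paper/year: there are no <paper> elements, so the result is empty.
papers = root.findall("./paper/year")

print(years, papers)
```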

14

Entry-point Technique

• We find an entry-point node among a set of middle level nodes in the XPath expression.

• Then we split the XPath expression at the entry-point and test for the path condition for the first part and eliminate nodes from DOM tree that do not satisfy the path condition.

• Then we test the remaining part of the XPath expression recursively eliminating nodes that do not satisfy the path condition.

• The algorithm can be implemented using either a top-down or a bottom-up approach

15

Entry-point Technique – An Example

Select BOOKSTORE/BOOK
where BOOK.name = “Glory days” and /AUTHOR.title = “Chris” and BOOKSTORE.name = “Benny-bookstore”

• The above query is transformed to the following XPath expression

/BOOKSTORE[name = “Benny-bookstore”]/child::BOOK[title = “Glory Days”]/child::AUTHOR/child::FIRSTNAME[name = “Chris”]

• Use the Nindex to get all BOOK nodes or AUTHOR nodes

16

Entry-point Technique – An Example

• In the first strategy, get all books named “Glory Days” and then test each one of them to see if the author is “Chris”

/BOOKSTORE[name = “Benny-bookstore”]/child::BOOK[title = “Glory Days”]

• Then, we test each author child node, which is the latter part of the XPath expression

/child::AUTHOR/child::FIRSTNAME[name = “Chris”]

• In the second strategy, first get all authors named “Chris”, and then test the parent nodes to see if the book name is “Glory Days”
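The two strategies can be compared with a small Python sketch. It is only illustrative: the function names `top_down` and `bottom_up` are made up here, the author name is read directly from the AUTHOR element as in the sample file (the XPath above uses a FIRSTNAME child instead), and a plain scan over AUTHOR elements stands in for an index lookup.

```python
import xml.etree.ElementTree as ET

XML = """<BOOKSTORE name="Benny-bookstore">
  <BOOK title="Glory days"><ISBN>1-1-2</ISBN><AUTHOR> Chris </AUTHOR></BOOK>
  <BOOK title="I love the game"><ISBN>1-1-3</ISBN><AUTHOR> Chris</AUTHOR></BOOK>
  <BOOK title="The Root"><ISBN>1-1-6</ISBN><AUTHOR> Tomas </AUTHOR></BOOK>
</BOOKSTORE>"""

root = ET.fromstring(XML)
# ElementTree keeps no parent pointers, so build a child -> parent map.
parent = {child: p for p in root.iter() for child in p}


def top_down(root, title, author):
    """First strategy: select books by title, then test their authors."""
    hits = []
    for book in root.findall("BOOK"):
        if book.get("title") != title:
            continue
        if any((a.text or "").strip() == author for a in book.findall("AUTHOR")):
            hits.append(book)
    return hits


def bottom_up(root, parent, title, author):
    """Second strategy: select authors by name, then test their parent book."""
    hits = []
    for a in root.iter("AUTHOR"):  # stands in for a value-index lookup
        if (a.text or "").strip() != author:
            continue
        book = parent[a]
        if book.tag == "BOOK" and book.get("title") == title and parent.get(book) is root:
            hits.append(book)
    return hits


first = [b.get("title") for b in top_down(root, "Glory days", "Chris")]
second = [b.get("title") for b in bottom_up(root, parent, "Glory days", "Chris")]
print(first, second)
```

Both plans return the same book; which one is cheaper depends on whether fewer nodes match the title condition or the author condition.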

17

Entry-point Root-first Algorithm

INPUT: XPath expression root/X1/X2/…/Xi/…/Xm

STEP 1: FOR each Xi
BEGIN
    IF Xi is indexed THEN
    BEGIN
        get every node xi of type Xi
        get the DN ni of each xi
        Sumi = sum of ni over all xi
    END
END

STEP 2: Get the entry point Xn with minimum Sum; add all xn to a node set S.
    Consider the tree obtained after deleting all branches that do not have a node xn in their path.
    Split the XPath into root/X1/X2/…/Xn-1 and /Xn+1/…/Xm at the entry point Xn.

STEP 3: FOR each node xn in S
BEGIN
    IF the path starting from root to node xn is not included in the path root/X1/X2/…/Xn-1/Xn THEN
        delete the subtree that does not satisfy the path condition
END

STEP 4: FOR each node xn in S, consider all subtrees starting with xn
BEGIN
    IF Xn+1/…/Xm is the same as /Xm THEN
        return nodes Xm
    ELSE
        INPUT = Xn/Xn+1/…/Xm; GO TO STEP 1
END
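The steps above can be sketched in Python under some simplifying assumptions. This is not the authors' code: the `Node` class, the helper names, and the small test tree are invented for illustration; only child steps are supported (no `//`); a tree scan stands in for the Nindex lookup; and the remaining suffix is evaluated by direct descent instead of re-applying the algorithm recursively.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    tag: str
    children: list = field(default_factory=list)
    parent: "Node | None" = None
    dn: int = 0  # descent number: count of all descendants


def make(tag, *children):
    """Build a node, wire parent links, and compute its DN bottom-up."""
    node = Node(tag, list(children))
    for c in children:
        c.parent = node
    node.dn = sum(1 + c.dn for c in children)
    return node


def nodes_with_tag(root, tag):
    """Stand-in for a name-index lookup: all nodes with the given tag."""
    stack, found = [root], []
    while stack:
        n = stack.pop()
        if n.tag == tag:
            found.append(n)
        stack.extend(reversed(n.children))
    return found


def path_from_root(node):
    tags = []
    while node is not None:
        tags.append(node.tag)
        node = node.parent
    return tags[::-1]


def entry_point_query(root, steps, indexed):
    # Step 1: sum the DNs of the nodes of every indexed step.
    sums = {i: sum(n.dn for n in nodes_with_tag(root, t))
            for i, t in enumerate(steps) if t in indexed}
    # Step 2: the entry point is the indexed step with the minimum sum.
    entry = min(sums, key=sums.get)
    candidates = nodes_with_tag(root, steps[entry])
    # Step 3: keep entry nodes whose root path matches the prefix.
    prefix = steps[:entry + 1]
    survivors = [n for n in candidates if path_from_root(n) == prefix]
    # Step 4 (simplified): descend along the remaining child steps.
    for tag in steps[entry + 1:]:
        survivors = [c for n in survivors for c in n.children if c.tag == tag]
    return survivors


tree = make("A",
            make("B", make("C", make("E", make("H"), make("G"), make("H")))),
            make("B", make("D", make("E", make("H")))))
result = entry_point_query(tree, ["A", "B", "C", "E", "H"], indexed={"B", "E"})
print([n.tag for n in result])
```

In this small tree the DN sums are 8 for B and 4 for E, so E becomes the entry point; the E node under D fails the prefix test A/B/C/E and only the two H children of the first E are returned.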

18

Example – Entry-point Root-first Algorithm

[Figure: example DOM tree rooted at A, with two B nodes (DN 17 and 14) and three E nodes (DN 8, 4, and 6), along with C, D, I, F, G, and H nodes.]

XPath: A/B/C/E//H

19


• Step 1: Calculate the descent numbers (DN) of the nodes that have indexes

• DN of node B = 17 + 14 = 31

• DN of node E = 8 + 4 + 6 = 18

• Entry-point = node E (minimum DN)

20


• Step 2: Delete the branches that do not have E

[Figure: the DOM tree after pruning; only the branches that pass through the three E nodes remain.]

XPath – A/B/C/E and E//H

21


• Step 3: Test A/B/C/E on each E node and discard the rightmost subtree with node E

• Step 4: Evaluate E//H on each E; finally, we get the three H nodes

• Cost – O(N), where N is the number of nodes

[Figure: the two E subtrees that remain after Step 3.]

22

Rest-tree Conception

• Performance deterioration in the Entry-point algorithm
– Find books written by “David” where the title of the book contains the word “book”
– The XML file might have hundreds of books with the word “book” in the title
– Further, there might be a large number of books by author “David”, but only one of them has the word “book” in its title
– The Entry-point algorithm first eliminates all the nodes that do not have the word “book” in their title
– Then it eliminates the nodes that do not have “David” as the author
– Due to the relatively large number of instances at the two levels, a large number of eliminations is required

23


• A Rest-tree is the tree formed by the nodes that meet a certain condition at their level, along with their descendant and ancestor nodes

• In the example, the Rest-tree of the node that satisfies the condition that the <BOOK> node has the word “glory” in its title is as shown

&1 [BOOKSTORE: Benny-bookstore]
    &3 [BOOK: Glory days]
        &9    &10


24


• First employ the Entry-point algorithm to find all nodes that meet the condition statements at each level

• The final result will then be the intersection of the Rest-trees of these nodes

• In practice, we do not need to find the Rest-tree of every node satisfying the condition

• Only a small set of nodes is left after applying the Entry-point algorithm

• So we need to find the Rest-trees of a relatively small set of nodes within a small subtree

• To get the intersection of the Rest-trees, note that the nodes that satisfy the query condition and have the minimum number of descendants are available from the Entry-point algorithm

25


• The minimum level is the anchor level of the rest-tree algorithm.

• We just need to intersect the Rest-trees at this minimum level.

• For example, after the first step of Entry-point algorithm, we know there are 2000 nodes at Level A that meet say condition A, 1000 nodes at Level B that meet condition B, 200 nodes at Level C, 3000 at Level D, 400 at Level E.

• The minimum level is C and the order of the levels is CEBAD
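The ordering in this example is simply a sort of the levels by their number of matching nodes. A minimal Python sketch, using only the counts quoted above (the variable names are illustrative):

```python
# Number of nodes per level that meet the condition at that level.
counts = {"A": 2000, "B": 1000, "C": 200, "D": 3000, "E": 400}

# Sort the levels by ascending candidate count.
order = sorted(counts, key=counts.get)

# The level with the fewest candidates is the anchor (minimum) level.
anchor = order[0]

print(anchor, "".join(order))
```

The remaining levels are then checked against the anchor level in that order, so the smallest candidate sets prune the anchor set first.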

26

Rest-tree Conception

• Ancestor node information is available as the path-index

• Filter some nodes at Level C by checking the grandparent node information of the 400 nodes at Level E

• Similarly, we can filter some other nodes at Level C by checking the parent node information of the nodes at Level B.

• The intersection at Level C will be complete by checking ancestor information at Level D nodes.

• The final step is to get all the nodes that satisfy the query requirement

27

Rest-tree Algorithm

INPUT: XPath expression root/X1/X2/…/Xi/…/Xm

STEP 1: FOR each Xi
BEGIN
    IF Xi is indexed THEN
    BEGIN
        get every node xi of type Xi
        get the DN ni of each xi
        Sumi = sum of ni over all xi
    END
END

STEP 2: Get the entry point Xj with minimum Sum; add all xj to a node set Sj.
    Get the comparison point Xk with the second minimum Sum; add all xk to a node set Sk.

STEP 3: IF level j is higher (closer to the root) than level k THEN
        FOR each node xk in Sk
            IF its ancestor is not in Sj THEN
                delete xk from Sk
    ELSE
        FOR each node xj in Sj
            IF its ancestor is not in Sk THEN
                delete xj from Sj

STEP 4: FOR each node xj in Sj
BEGIN
    IF the path starting from root to node xj is not included in the path root/X1/X2/…/Xj THEN
        delete the subtree that does not satisfy the path condition
END

STEP 5: FOR each node xj in Sj, consider all subtrees starting with xj
BEGIN
    IF Xj+1/…/Xm is the same as /Xm THEN
        return nodes Xm
    ELSE
        INPUT = Xj/Xj+1/…/Xm; GO TO STEP 1
END
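Step 3 is the part that distinguishes this algorithm from the Entry-point algorithm, and it can be sketched as a set filter over the path index. The function name `filter_by_ancestor`, the string node ids, and the tiny path index below are assumptions for illustration only; the path index maps each node to the list of its ancestors, as described earlier.

```python
def filter_by_ancestor(lower, upper, pindex):
    """Keep the nodes in `lower` that have at least one ancestor in `upper`.

    `pindex` maps a node id to the ids of its ancestors (root first).
    """
    upper = set(upper)
    return [n for n in lower if upper.intersection(pindex[n])]


# Hypothetical path index for three E nodes: two below a C node, one below a D node.
pindex = {
    "e1": ["a", "b1", "c1"],
    "e2": ["a", "b2", "c2"],
    "e3": ["a", "b2", "d1"],
}
s_e = ["e1", "e2", "e3"]  # entry-point set (deeper level)
s_c = ["c1", "c2", "c3"]  # comparison-point set (higher level)

# E is deeper than C, so E nodes without a C ancestor are dropped.
s_e = filter_by_ancestor(s_e, s_c, pindex)
print(s_e)
```

Because the check only reads the stored ancestor lists, no tree traversal is needed to discard the E node that hangs below D.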

28

Rest-tree Algorithm - Example

XPath: A/B/C/E//H

Step 1: Calculate DNs

[Figure: DOM tree rooted at A, with three B nodes (DN 17, 14, and 1), three C nodes (DN 9, 5, and 6), and three E nodes (DN 8, 4, and 6), along with D, I, M, F, G, and H nodes.]

29

Rest-tree Algorithm - Example

Step 2: Minimum DN

DN of node B = 17 + 14 + 1 = 32

DN of node C = 9 + 5 + 6 = 20

DN of node E = 8 + 4 + 6 = 18

[Figure: the same DOM tree as in Step 1, annotated with the DN of each B, C, and E node.]

30


Step 3: Delete the “E” nodes that do not have a “C” ancestor

[Figure: the DOM tree after Step 3; the E (6) node, which has no C ancestor, has been removed.]

31


Step 4: Delete the subtrees that do not satisfy the path A/B/C/E

Step 5: Get all the H nodes from E//H

[Figure: the remaining tree, containing A, B (17), B (14), C (9), C (5), E (8), and E (4) with their descendants.]

32

Test Cases and Comparisons

• Size of DOM Tree
– The Entry-point algorithm performs much better than the traditional algorithm, taking less than one third of its processing time

[Figure: DOM Tree Size. Time (Milli-Sec), 0 to 700, versus Number of Nodes (10,000), 1 to 8. Increasing Number of Nodes for XPath: //A20//C30//A80]

33

Test Cases and Comparisons

• Result Nodes Set
– The processing time of the Entry-point algorithm increases slightly with an increasing number of result nodes.
– Part of the reason is the recursive function call in the Entry-point algorithm code

[Figure: Result Nodes Set. Time (Milli-Sec), 0 to 300, versus Number of Result Nodes (10), 1 to 3. Increasing Number of Result Nodes]

34

Test Cases and Comparisons

• Tree Height
– The processing time of the three methods follows the same trend as the height of the tree increases

[Figure: Tree Height Increasing. Time (Milli-Sec), 0 to 140, versus Tree Height (0, 10, 16, 23)]


35

Test Cases and Comparisons

• Without Index on result nodes
– The traditional method performs very poorly, falling into the no-index method category.
– However, the Entry-point algorithm still performs well

[Figure: Without index on result nodes. Time (Milli-Sec), 0 to 300, versus Nodes number (10,000), 1 to 3]


36

Conclusions

• We proposed three types of indexes on XML data to execute XPath queries efficiently.

• We proposed two algorithms that use these indexes to process and optimize XPath queries.

• We have also simulated both bottom-up and top-down approaches

• Processing XPath queries using the Entry-point indexing technique performs much better than traditional algorithms with or without indexes
