Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach. Jiaheng Lu, Ting Chen, Tok Wang Ling. National University of Singapore. Nov. 11, 2004. CIKM 2004, Washington D.C., U.S.A.


Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach

Jiaheng Lu, Ting Chen, Tok Wang Ling

National University of Singapore

Nov. 11, 2004

CIKM 2004, Washington D.C., U.S.A.


Outline

☞ XML Twig Pattern Matching
    Problem definition
    State of the Art: TwigStack
    Sub-optimality of TwigStack
Our algorithm: TwigStackList
Performance
Conclusion


XML Twig Pattern Matching

An XML document is commonly modeled as a rooted, ordered and labeled tree.

[Figure: example XML document tree. The root book has children preface, chapter and chapter; below them are nested section, title, paragraph and figure elements and the text values “Intro”, “Data” and “XML”.]


Regional Coding

Node Label [1]: (startPos: endPos, LevelNum)

E.g.

book (0:32, 1)
    preface (1:3, 2)
        “Intro” (2:2, 3)
    chapter (4:29, 2)
        section (5:28, 3)
            title (6:8, 4)
                “Data” (7:7, 3)
            section (9:17, 4)
                title (10:12, 5)
                    “XML” (11:11, 3)
                paragraph (13:16, 5)
                    figure (14:15, 6)
            section (18:23, 4)
                paragraph (19:22, 5)
                    figure (20:21, 6)
            paragraph (24:27, 4)
                figure (25:26, 5)
    chapter (30:31, 2)

[1] M. P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD, 1994.
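The region codes above are what make structural tests cheap: an ancestor's interval strictly contains its descendant's interval, and a parent is additionally exactly one level higher. A minimal sketch of both tests follows; the names `Region`, `is_ancestor` and `is_parent` are illustrative and do not come from the slides.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Region code (startPos:endPos, LevelNum) of one node."""
    start: int
    end: int
    level: int


def is_ancestor(a: Region, d: Region) -> bool:
    """True if a's interval strictly contains d's, i.e. a is an ancestor of d."""
    return a.start < d.start and d.end < a.end


def is_parent(p: Region, c: Region) -> bool:
    """True if p is an ancestor of c and sits exactly one level above it."""
    return is_ancestor(p, c) and c.level == p.level + 1


# Codes copied from the example document above.
chapter = Region(4, 29, 2)
section = Region(9, 17, 4)
paragraph = Region(13, 16, 5)
figure = Region(14, 15, 6)

print(is_ancestor(chapter, figure))  # True
print(is_parent(paragraph, figure))  # True
print(is_parent(section, figure))    # False: ancestor, but two levels apart
```

Both tests need only the two codes being compared, so no tree traversal is required.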


What is a Twig Pattern?

A twig pattern is a small tree whose nodes are tags, attributes or text values, and whose edges are either Parent-Child (P-C) edges or Ancestor-Descendant (A-D) edges.

E.g. select Figure elements that are descendants of Paragraph elements, which in turn are children of Section elements that have a child element Title.

Twig pattern: [Figure: Section at the root with children Title and Paragraph; Figure below Paragraph.]


XML Twig Pattern Matching

Problem Statement: Given a query twig pattern Q and an XML database D, we need to compute ALL the answers to Q in D.

E.g. consider Q1 and Doc1:

Doc1: [Figure: XML tree with elements s1, s2, t1, t2, p1, f1.]

Q1: [Figure: twig pattern with root Section and children title and figure.]

Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)
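To make the problem statement concrete, the sketch below enumerates the answers to Q1 by brute force. The slide gives only element names, so the (start, end) codes are hypothetical, chosen so that the result agrees with the three solutions listed above; both edges of Q1 are assumed to be ancestor-descendant edges. This is the naive baseline, not a holistic join.

```python
from itertools import product

# Hypothetical region codes (start, end) for Doc1; not taken from the slide.
doc1 = {
    "section": {"s1": (1, 12), "s2": (4, 11)},
    "title": {"t1": (2, 3), "t2": (5, 6)},
    "figure": {"f1": (8, 9)},
}


def contains(a, d):
    """True if region a strictly contains region d (a is an ancestor of d)."""
    return a[0] < d[0] and d[1] < a[1]


def match_q1(doc):
    """Enumerate every (section, title, figure) binding that satisfies Q1."""
    answers = []
    for (s, s_reg), (t, t_reg), (f, f_reg) in product(
        doc["section"].items(), doc["title"].items(), doc["figure"].items()
    ):
        # Assumed edges: section//title and section//figure.
        if contains(s_reg, t_reg) and contains(s_reg, f_reg):
            answers.append((s, t, f))
    return answers


print(match_q1(doc1))
```

The triple loop inspects every combination of elements, which is exactly the cost that stack-based algorithms such as TwigStack are designed to avoid.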


Previous work: TwigStack

TwigStack [2]: a holistic approach

Two-phase algorithm:
    Phase 1, TwigJoin: intermediate root-to-leaf path solutions are output.
    Phase 2, Merge: the intermediate path lists are merged to produce the final result.

[2] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.
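The merge phase can be pictured as a join of path solutions on the query nodes they share. The sketch below handles the simplest case, two root-to-leaf paths that share only the root, and uses a hash join for brevity rather than a sort-merge join. The function name `merge_paths` and the sample path lists are hypothetical.

```python
from collections import defaultdict


def merge_paths(left_paths, right_paths):
    """Join two lists of root-to-leaf path solutions on their shared root."""
    by_root = defaultdict(list)
    for path in right_paths:
        by_root[path[0]].append(path)
    merged = []
    for left in left_paths:
        for right in by_root.get(left[0], []):
            # Keep the root once, then append the rest of the second path.
            merged.append(left + right[1:])
    return merged


# Hypothetical phase-1 output for a twig with paths
# section-title and section-paragraph-figure.
title_paths = [("s1", "t1"), ("s2", "t3")]
figure_paths = [("s2", "p3", "f2")]

print(merge_paths(title_paths, figure_paths))
# [('s2', 't3', 'p3', 'f2')]
```

The path ("s1", "t1") finds no partner and is dropped in the merge, so the work spent producing it in phase 1 is wasted.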


Previous work: TwigStack

Each node q in a twig pattern Q is associated with a stack S_q.

Insertion into and deletion from a stack S_q:

Insertion: an element e_q from stream T_q is pushed onto its stack S_q if and only if
    (i) e_q has a descendant e_qi in each stream T_qi, where qi is a child of q, and
    (ii) each such e_qi recursively satisfies property (i).

Deletion: an element e_q is popped from its stack once all matches involving it have been output.
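The insertion condition can be written as a short recursive test. The sketch below checks it over complete in-memory lists instead of advancing stream cursors, so it shows the condition itself rather than how TwigStack evaluates it. The name `has_extension` and the small document are hypothetical.

```python
def contains(a, d):
    """True if region a strictly contains region d (a is an ancestor of d)."""
    return a[0] < d[0] and d[1] < a[1]


def has_extension(q, elem, children, streams):
    """True if elem (tag q) has, for every child qi of q, a descendant in
    streams[qi] that recursively satisfies the same property."""
    return all(
        any(
            contains(elem, e) and has_extension(c, e, children, streams)
            for e in streams[c]
        )
        for c in children.get(q, [])
    )


# Twig: section with children title and paragraph; paragraph with child figure.
children = {"section": ["title", "paragraph"], "paragraph": ["figure"]}

# Hypothetical (start, end) codes; the figure lies two levels below the paragraph.
streams = {
    "section": [(1, 10)],
    "title": [(2, 3)],
    "paragraph": [(4, 9)],
    "figure": [(6, 7)],
}

print(has_extension("section", (1, 10), children, streams))  # True
```

The test looks only at descendants, so it accepts the section even when the query requires the figure to be a child of the paragraph.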


Sub-optimality of TwigStack

TwigStack is I/O optimal only for queries whose edges are all ancestor-descendant edges.

Unfortunately, TwigStack is sub-optimal for queries with any parent-child edge.

For queries with parent-child relationships, TwigStack may output a large number of intermediate results that are not merge-joinable to any final solution.


Sub-optimality of TwigStack: an example

[Figure: a twig pattern with nodes Section, title, paragraph and figure, and a simple XML tree with elements s1, t1, t2, p1 and f1.]

Since s1 has descendants t1 and p1, and p1 in turn has descendant f1, TwigStack outputs the intermediate path solution <s1,t1>.

But this path is useless, because the example has no solution at all.


Main problem and our experiment

TwigStack may output intermediate results that are useless for the query answers.

To understand the problem better, we ran TwigStack on a real dataset.

Data set: TreeBank [from the University of Washington XML datasets]

Queries:
    Q1: VP[/DT]//PRP_DOLLAR_
    Q2: S//NP[//PP/TO][/VP/_NONE_]/JJ
    Q3: S[/JJ]/NP

All queries contain parent-child relationships.


Our experimental results

Query   Intermediate paths by TwigStack   Merge-joinable paths   Percentage of useless intermediate paths
Q1      10,663                            5                      99.9%
Q2      24,493                            49                     99.5%
Q3      70,967                            10                     99.9%

Most intermediate paths do not contribute to final answers because of the parent-child edges!

Improving TwigStack so that it handles queries with parent-child edges efficiently is a big challenge.


Intuition for improvement

[Figure: the same twig pattern (Section, title, paragraph, figure) and simple XML tree (s1, t1, t2, p1, f1) as in the sub-optimality example.]

Our intuitive observation: why not read more paragraph elements and cache them in main memory?

For example, after scanning p1 we do not stop but continue to read the next paragraph element. We then find that p1 is the only paragraph element and that f1 is not a child of p1, so we should not output any intermediate solution.


Outline

XML Twig Pattern Matching
    Problem definition
    State of the Art: TwigStack
    Sub-optimality of TwigStack
☞ Our algorithm: TwigStackList
Experimental results
Conclusion


Our main idea

Main idea: we read more elements from the input streams and cache some of them in main memory, so that we can decide more accurately whether an element can contribute to a final answer.

But we cannot cache too many elements in main memory. For each node q in the twig query, the number of elements with tag q cached in main memory never exceeds the length of the longest path in the XML dataset.


Our caching method

Which elements should be cached in main memory? Only those that might contribute to final answers.

[Figure: a simple XML tree with elements s1, t1, p1, p2, p3 and f1, and the twig pattern Section with children title and paragraph, with figure below paragraph.]

We only need to cache p1 and p3 in main memory. Why not p2? If p2 contributed to a final answer, there would have to be an element before f1 that is a child of p2. But f1 is the first element, so p2 is guaranteed not to contribute to any final answer.
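The caching rule in this example amounts to keeping only the paragraph elements that are ancestors of the first figure element. A minimal sketch follows, with hypothetical (start, end) codes for the tree in the figure and an illustrative function name `elements_to_cache`; it assumes the parent stream is sorted by start position.

```python
def elements_to_cache(parent_stream, child_head):
    """Return the names of parent-stream elements that are ancestors of
    child_head, the first unread element of the child stream."""
    cached = []
    for name, (start, end) in parent_stream:
        if start > child_head[0]:
            break  # the stream is sorted by start: no later element contains it
        if end > child_head[1]:
            cached.append(name)
    return cached


# Hypothetical codes: p1 contains p2 and p3, and p3 contains f1.
paragraphs = [("p1", (4, 11)), ("p2", (5, 6)), ("p3", (7, 10))]
f1 = (8, 9)

print(elements_to_cache(paragraphs, f1))  # ['p1', 'p3']
```

Every cached element is an ancestor of the same node, so they all lie on one root-to-leaf path, which is consistent with the memory bound stated earlier in the deck.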


Our criteria for pushing an element onto a stack

The criteria for pushing an element onto a stack are very important for controlling intermediate results. Why?

Because once an element is pushed onto a stack, it is ready to be output. So the fewer elements are pushed onto the stacks, the fewer intermediate results are output.

Our criteria: given an element e_q from stream T_q, before e_q is pushed onto stack S_q we ensure that
    (i) e_q has a descendant e_q’ for each child q’ of q;
    (ii) if (q, q’) is a parent-child relationship, e_q’ has a parent with tag q on the path from e_q to e_qmax, where e_qmax is the descendant of e_q with the maximal start value, qmax being a child of q; and
    (iii) each q’ recursively satisfies the first two conditions.
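For a single query edge, the effect of conditions (i) and (ii) can be sketched as follows: with an ancestor-descendant edge any ancestor of the child element qualifies, while with a parent-child edge only its parent does. This is a simplified, non-recursive illustration over in-memory lists, not the streaming procedure of TwigStackList; the function name and the (name, start, end, level) codes are hypothetical.

```python
def push_candidates(edge, parent_stream, child_head):
    """Return the names of parent-stream elements that may be pushed.

    edge is "/" (parent-child) or "//" (ancestor-descendant); every element
    is a (name, start, end, level) tuple.
    """
    _, c_start, c_end, c_level = child_head
    result = []
    for name, start, end, level in parent_stream:
        if not (start < c_start and c_end < end):
            continue  # not an ancestor of the child element
        if edge == "//" or level + 1 == c_level:
            result.append(name)
    return result


# Hypothetical codes: p1 contains p2 and p3, and f1 is a child of p3.
paragraphs = [("p1", 4, 11, 2), ("p2", 5, 6, 3), ("p3", 7, 10, 3)]
f1 = ("f1", 8, 9, 4)

print(push_candidates("/", paragraphs, f1))   # ['p3']
print(push_candidates("//", paragraphs, f1))  # ['p1', 'p3']
```

With the parent-child edge only p3 qualifies, because p1 is an ancestor of f1 but not its parent, and p2 does not contain f1 at all.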


Examples

[Figure: a simple XML tree with elements s1, t1, p1, p2, p3 and f1, and the twig pattern Section with children title and paragraph, with figure below paragraph.]

Element p3 can be pushed onto the stack, but p1 and p2 cannot, because p3 has a child f1. Although p1 has a descendant f1, f1 is not a child of p1.


Our algorithm: TwigStackList

We propose a novel holistic twig algorithm, TwigStackList, to evaluate a twig query.

Unique features of TwigStackList:
    It considers the parent-child edges in the query.
    There is a list for each query node to cache elements that are likely to participate in final solutions.
    It identifies a broader class of optimal queries: TwigStackList guarantees I/O optimality for queries with only ancestor-descendant edges connecting branching nodes and their children.


TwigStackList: an example

[Figure: the twig pattern (Section, title, paragraph, figure); an XML tree with elements Root, s1, s2, t1, t2, t3, p1, p2, p3, f1 and f2; and the current contents of the Stack and the List.]

Scan s1, t1, p1, f1.


TwigStackList: an example

[Figure: same twig pattern and XML tree as in the previous step, with updated Stack and List.]

Since p1 is not the parent of f1 (only an ancestor), we continue to scan p2 and put p1 in the list.


TwigStackList: an example

[Figure: same twig pattern and XML tree as in the previous step, with updated Stack and List.]

Put p2 and p3 in the list; the cursor points to p3, because it is the parent of f2.


TwigStackList: an example

[Figure: same twig pattern and XML tree as in the previous step, with updated Stack and List.]

Output intermediate solutions: <s2,t3>, <s2,p3,f2>

Merge. Final: <s2,t3,p3,f2>


TwigStackList vs. TwigStack

TwigStackList is I/O optimal for the above query. In contrast, TwigStack is sub-optimal, because it outputs the “useless” path solution <s1,t1>.

[Figure: the twig pattern (Section, title, paragraph, figure) and the XML tree with elements Root, s1, s2, t1, t2, t3, p1, p2, p3, f1 and f2.]


Sub-optimality of TwigStackList

Although TwigStackList broadens the class of optimal queries compared to TwigStack, it is still sub-optimal for queries with parent-child edges connecting branching nodes.

[Figure: a twig pattern with root Section and children title and paragraph, and a simple XML tree with elements s1, s2, p1 and t1.]

Observe that there is no matching solution for this dataset. But TwigStackList caches s1 and s2 in the list and pushes s1 onto the stack, so (s1,t1) is output as a useless solution.


[Figure: the same twig pattern and XML tree, now with an additional element p2.]

Here the behavior of TwigStackList is still reasonable, since before advancing p1 we cannot know whether s1 has a child p2 following p1.


Outline

XML Twig Pattern Matching
    Problem definition
    State of the Art: TwigStack
    Sub-optimality of TwigStack
Our algorithm: TwigStackList
☞ Experimental results
Conclusion


Experimental Setting

Platform: Pentium 4 CPU, 768 MB RAM, 2 GB disk

TreeBank
    Downloaded from the University of Washington XML datasets
    Maximal depth 36, 2.4 million nodes

Random
    Seven tags: a, b, c, d, e, f, g, uniformly distributed
    Fan-out of elements varied from 2 to 100; depth varied from 10 to 100


Performance against TreeBank

Queries with XPath expression:

Q1  S[//MD]//ADJ              Q4  VP[/DT]//PRP_DOLLAR_
Q2  S/VP/PP[/NP/VBN]/IN       Q5  S[//VP/IN]//NP
Q3  S/VP//PP[//NP/VBN]//IN    Q6  S[/JJ]/NP

Number of intermediate path solutions for TwigStackList vs. TwigStack:

Query   TwigStack   TwigStackList   Reduction percentage   Useful paths
Q1      35          35              0%                     35
Q2      2957        143             95%                    92
Q3      25892       4612            82%                    4612
Q4      10663       11              99.9%                  5
Q5      702391      22565           96.8%                  22565
Q6      70988       30              99.9%                  10
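The reduction percentage appears to be the share of TwigStack's intermediate paths that TwigStackList avoids, (TwigStack - TwigStackList) / TwigStack. The sketch below recomputes it from the counts above under that assumption; the formula is inferred from the numbers and is not stated on the slide.

```python
# (TwigStack, TwigStackList) intermediate path counts from the TreeBank table.
counts = {
    "Q1": (35, 35),
    "Q2": (2957, 143),
    "Q3": (25892, 4612),
    "Q4": (10663, 11),
    "Q5": (702391, 22565),
    "Q6": (70988, 30),
}


def reduction_percentage(twigstack, twigstacklist):
    """Share of TwigStack's intermediate paths avoided by TwigStackList."""
    return 100.0 * (twigstack - twigstacklist) / twigstack


for query, (ts, tsl) in counts.items():
    print(f"{query}: {reduction_percentage(ts, tsl):.2f}%")
```

Rounded or truncated to the precision used on the slide, the computed values agree with the reported column.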


Performance analysis

We have three observations:

(1) When queries contain only ancestor-descendant edges, the two algorithms have similar performance. See Q1.

(2) When the edges connecting branching nodes contain only ancestor-descendant relationships, TwigStackList is optimal, but TwigStack is sub-optimal. See Q3, Q5.

(3) When the edges connecting branching nodes contain parent-child relationships, both TwigStack and TwigStackList are sub-optimal. But TwigStackList typically outputs far fewer “useless” intermediate solutions (<5%) than TwigStack. See Q2, Q4, Q6.


Performance against random dataset

[Figure: the five twig patterns used as queries, (a) Q1, (b) Q2, (c) Q3, (d) Q4 and (e) Q5, built over the tags a to g.]

From the following table, we see that for all queries TwigStackList is again more efficient than TwigStack in terms of the size of intermediate results.

Query   TwigStack   TwigStackList   Reduction   Useful paths
Q1      9048        4354            52%         2077
Q2      1098        467             57%         100
Q3      25901       14476           44%         14476
Q4      32875       16775           49%         16775
Q5      3896        1320            66%         566


Outline

XML Twig Pattern Matching
    Problem definition
    State of the Art: TwigStack
    Sub-optimality of TwigStack
Our algorithm: TwigStackList
Experimental results
☞ Conclusion


Conclusion

The previous algorithm TwigStack is sub-optimal for queries with parent-child edges.

We propose a new algorithm, TwigStackList, to address this problem.

TwigStackList broadens the class of queries with I/O optimality.

Experiments show that TwigStackList typically outputs far fewer useless intermediate results when the query contains parent-child edges.

We recommend TwigStackList as a new holistic join algorithm for evaluating queries with parent-child edges.


Thank You! Q & A