
Estimating the Selectivity of XML Path Expressions for Internet Scale Applications

Ashraf Aboulnaga, Alaa R. Alameldeen, Jeffrey F. Naughton

Computer Sciences Department, University of Wisconsin - Madison

Motivation

XML enables Internet scale applications that query data from many sources
  Niagara, Xyleme, …
Queries over XML data use path expressions
Optimizing these queries requires estimating the selectivity of the path expressions
Focus of this talk: Building statistics for XML data and using them for estimating the selectivity of simple path expressions

What is XML?

<readings>
  <play>
    <title>Pygmalion</title>
    <author>Bernard Shaw</author>
  </play>
  <novel>
    <title>David Copperfield</title>
    <author>Charles Dickens</author>
  </novel>
</readings>

Querying XML

FOR $n_auth IN document("*")//novel/author
    $p_auth IN document("*")//play/author
WHERE $n_auth/text() = $p_auth/text()
RETURN $n_auth

Optimizing this query requires estimating the selectivity of the path expressions

This requires information about the structure of the XML data

Goal of this Work

Build database statistics that capture the structure of XML data
Ensure that the statistics fit in a small amount of memory
  For efficient query optimization
  Important for Internet scale applications
Use the statistics to estimate the selectivity of simple XML path expressions //t1/t2/…/tn

Outline of Presentation

Introduction
Path Trees
Markov Tables
Performance Evaluation
Conclusions

Path Trees

<A>
  <B> </B>
  <B>
    <D> </D>
  </B>
  <C>
    <D> </D>
    <E> </E>
    <E> </E>
    <E> </E>
  </C>
</A>

Path tree (tag, frequency):

A 1
  B 2
    D 1
  C 1
    D 1
    E 3
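The path tree for the small document on this slide can be reproduced with a short sketch. This is only an illustration, not the authors' implementation: the names `build_path_tree` and `selectivity` are hypothetical, and the tree is stored as a flat map from each root-to-node tag path to its frequency rather than as linked nodes.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The example document from the slide, with whitespace removed.
XML_DOC = "<A><B></B><B><D></D></B><C><D></D><E></E><E></E><E></E></C></A>"


def build_path_tree(xml_text):
    """Map each distinct root-to-node tag path to the number of elements it reaches."""
    counts = Counter()

    def visit(elem, prefix):
        path = prefix + (elem.tag,)
        counts[path] += 1
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), ())
    return counts


def selectivity(path_tree, tags):
    """Exact count for //t1/t2/.../tn: sum the frequencies of all
    path-tree nodes whose root path ends with the given tag sequence."""
    tags = tuple(tags)
    return sum(f for path, f in path_tree.items() if path[-len(tags):] == tags)


tree = build_path_tree(XML_DOC)
for path, freq in sorted(tree.items()):
    print("/".join(path), freq)
print("//D   ->", selectivity(tree, ["D"]))
print("//C/E ->", selectivity(tree, ["C", "E"]))
```

Because the unsummarized path tree holds every distinct root path, a query such as //D is answered exactly by adding up the D nodes under B and under C.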

Summarizing Path Trees

Path trees contain all the information needed for selectivity estimation
Problem: May not fit in available memory
  Small available memory
  Internet scale
Remove low frequency nodes
Removed nodes replaced with *-nodes
  Tag name: * meaning "any tag"
  Frequency: Average frequency of replaced nodes
Sibling-*, Level-*, Global-*, No-*

Sibling-* Summarization

[Figure: original path tree. Nodes (tag frequency): A 1; B 13, C 9; D 7, E 5; F 15, G 10, H 6; I 2, J 4, K 11; K 12]

Steps shown on the tree:
  I 2 and J 4 (siblings of K 11) are replaced by a *-node with f=6, n=2
  D 7 and E 5 are replaced by a *-node with f=12, n=2
  G 10 and H 6 (siblings of F 15) are replaced by a *-node with f=16, n=2
  K 12 and K 11 are merged into a single K node with f=23, n=2
  The three *-nodes are labeled with frequencies 6, 8, and 3

[Figure: summarized path tree. Nodes: A 1; B 13, C 9; F 15; K (f=23, n=2); *-nodes with frequencies 6, 8, and 3]

*-nodes represent deleted sibling nodes
Memory saved by coalescing nodes
Try to retain as much information as possible about the deleted nodes
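The frequencies on the summarized tree are consistent with each *-node storing the average frequency f/n of the nodes it replaced, as stated on the "Summarizing Path Trees" slide. A worked check, using the (f, n) pairs shown during the animation:

```latex
\begin{align*}
\text{*-node replacing D 7 and E 5:}\quad  & \frac{f}{n} = \frac{7 + 5}{2} = \frac{12}{2} = 6 \\
\text{*-node replacing G 10 and H 6:}\quad & \frac{f}{n} = \frac{10 + 6}{2} = \frac{16}{2} = 8 \\
\text{*-node replacing I 2 and J 4:}\quad  & \frac{f}{n} = \frac{2 + 4}{2} = \frac{6}{2} = 3
\end{align*}
```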

Level-* Summarization

[Figure: original path tree. Nodes (tag frequency): A 1; B 13, C 9; D 7, E 5; F 15, G 10, H 6; I 2, J 4, K 11; K 12]

[Figure: path tree after level-* summarization. Nodes no longer shown: D 7, E 5, H 6, I 2, J 4. Two *-nodes are added, with frequencies 6 and 3. Remaining nodes: A 1; B 13, C 9; F 15, G 10; K 12, K 11]

Less information about deleted nodes than sibling-*
Deletes fewer nodes than sibling-*

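Global-* Summarization

[Figure: original path tree. Nodes (tag frequency): A 1; B 13, C 9; D 7, E 5; F 15, G 10, H 6; I 2, J 4, K 11; K 12]

[Figure: path tree after global-* summarization. Nodes no longer shown: A 1, E 5, I 2, J 4. A single *-node with frequency 3 is added. Remaining nodes: B 13, C 9; D 7; F 15, G 10, H 6; K 12; K 11]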
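The global-* result on this slide is consistent with deleting the four lowest-frequency nodes and giving the single *-node their average frequency. The sketch below checks that arithmetic under simplifying assumptions: it treats the tree as a flat list of (tag, frequency) pairs and ignores parent-child edges, and `global_star` and `num_to_delete` are hypothetical names, with `num_to_delete` standing in for whatever the memory budget would dictate.

```python
def global_star(nodes, num_to_delete):
    """Delete the lowest-frequency nodes and return the surviving nodes
    together with the frequency of the single global *-node.

    nodes: list of (tag, frequency) pairs (tree edges are ignored here).
    """
    ordered = sorted(nodes, key=lambda node: node[1])
    deleted = ordered[:num_to_delete]
    kept = ordered[num_to_delete:]
    # The *-node carries the average frequency of the nodes it replaces.
    star_freq = sum(f for _, f in deleted) / len(deleted) if deleted else None
    return kept, star_freq


# Node frequencies from the example path tree on the slides.
NODES = [("A", 1), ("B", 13), ("C", 9), ("D", 7), ("E", 5), ("F", 15),
         ("G", 10), ("H", 6), ("I", 2), ("J", 4), ("K", 11), ("K", 12)]

kept, star_freq = global_star(NODES, 4)
print("kept:", kept)
print("* frequency:", star_freq)
```

With four deletions the removed nodes are A 1, I 2, J 4, and E 5, whose average is (1 + 2 + 4 + 5) / 4 = 3, the value shown on the *-node.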

No-* Summarization

[Figure: original path tree. Nodes (tag frequency): A 1; B 13, C 9; D 7, E 5; F 15, G 10, H 6; I 2, J 4, K 11; K 12]

[Figure: path tree after no-* summarization. Nodes no longer shown: A 1, I 2, J 4. No *-node is added. Remaining nodes: B 13, C 9; D 7, E 5; F 15, G 10, H 6; K 12; K 11]

Memory savings similar to global-*
Conservative assumption about deleted nodes

Markov Tables

A table of all distinct paths of length up to m and their frequencies
For paths of length greater than m, combine paths from the Markov table
Example:

$$f(A/B/C/D) = f(A/B/C) \times \frac{f(B/C/D)}{f(B/C)}$$

Uses "short memory" or "Markov" property
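The combination rule can be sketched as a small function. This is an illustrative generalization of the slide's example to any m >= 2, not the authors' code: the name `estimate` is hypothetical, and the frequencies in `TABLE` are copied from the example Markov table (paths of length 1 and 2) used elsewhere in this deck.

```python
def estimate(table, path, m):
    """Estimate f(t1/.../tn) from exact frequencies of paths of length <= m.

    Start from the first m tags, then multiply by
    f(window of m tags) / f(first m-1 tags of that window)
    for each later window, which is the "short memory" assumption.
    """
    if m < 2:
        raise ValueError("this sketch assumes m >= 2")
    path = tuple(path)
    if len(path) <= m:
        return table.get(path, 0)
    est = table.get(path[:m], 0)
    for i in range(1, len(path) - m + 1):
        window = path[i:i + m]
        denom = table.get(window[:-1], 0)
        if denom == 0:
            return 0.0  # the conditioning path never occurs in the table
        est *= table.get(window, 0) / denom
    return est


# Example Markov table with m = 2 (frequencies from the deck's example).
TABLE = {
    ("A",): 1, ("B",): 11, ("C",): 15, ("D",): 19,
    ("A", "B"): 11, ("A", "C"): 6, ("A", "D"): 4,
    ("B", "C"): 9, ("B", "D"): 7, ("C", "D"): 8,
}

print(estimate(TABLE, ["B", "C"], 2))            # exact lookup
print(estimate(TABLE, ["A", "B", "C"], 2))       # f(AB) * f(BC) / f(B)
print(estimate(TABLE, ["A", "B", "C", "D"], 2))  # ... * f(CD) / f(C)
```

With m = 3 and a four-tag path, the loop runs once and reduces to the formula on the slide: f(A/B/C) multiplied by f(B/C/D) / f(B/C).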

Markov Tables

Path tree (tag, frequency):

A 1
  B 11
    C 9
      D 8
    D 7
  C 6
  D 4

Markov table (paths of length 1 and 2):

Path  Freq    Path  Freq
A     1       AC    6
B     11      AD    4
C     15      BC    9
D     19      BD    7
AB    11      CD    8
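The table on this slide can be derived from the path tree by adding each node's frequency to every path of length up to m that ends at that node. A minimal sketch, assuming the tree is given as nested (tag, frequency, children) tuples; `markov_table` is a hypothetical helper name:

```python
from collections import Counter

# The example path tree: (tag, frequency, children).
PATH_TREE = ("A", 1, [
    ("B", 11, [
        ("C", 9, [("D", 8, [])]),
        ("D", 7, []),
    ]),
    ("C", 6, []),
    ("D", 4, []),
])


def markov_table(node, m, ancestors=(), table=None):
    """Accumulate frequencies of all tag paths of length 1..m ending at each node."""
    table = Counter() if table is None else table
    tag, freq, children = node
    path = ancestors + (tag,)
    for k in range(1, min(m, len(path)) + 1):
        table[path[-k:]] += freq
    for child in children:
        markov_table(child, m, path, table)
    return table


table = markov_table(PATH_TREE, 2)
for path, freq in sorted(table.items(), key=lambda item: (len(item[0]), item[0])):
    print("".join(path), freq)
```

For example, C occurs with frequency 9 under B and 6 under A, giving C 15, and the three D nodes add up to D 19, matching the table.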

Summarizing Markov Tables

Exact selectivities for paths of length up to m
Approximate selectivities for paths longer than m
Problem: May not fit in available memory
Remove low frequency paths
  Discard removed paths of length > 2
  Replace removed paths of length 1 or 2 with *-paths
Suffix-*, Global-*, No-*

Suffix-* Summarization

Initial Markov table:

Path  Freq    Path  Freq
A     1       AC    6
B     11      AD    4
C     15      BC    9
D     19      BD    7
AB    11      CD    8

Steps shown on the table (SD = set of deleted paths of length 2):
  The *-paths * and ** are added, each with frequency 0
  A 1 is deleted: * becomes f=1, n=1
  SD = { }
  AD 4 is deleted: SD = { (AD,4) }
  AC 6 is deleted: AC and AD are replaced by A* with f=10, n=2; SD = { }
  BD 7 is deleted: SD = { (BD,7) }
  CD 8 is deleted: SD = { (BD,7), (CD,8) }
  BC 9 is deleted: BC and BD are replaced by B* with f=16, n=2; SD = { (CD,8) }
  A* is deleted: ** becomes f=10, n=2; SD = { (CD,8) }
  Final table: B* 8, * 1, ** 6; SD = { }

Summarized Markov table:

Path  Freq    Path  Freq
B     11      B*    8
C     15      **    6
D     19
AB    11
*     1
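The final *-path frequencies are consistent with each *-path storing the average f/n of the paths folded into it. The check below assumes that the leftover SD entry (CD,8) is added to ** at the end; the slides do not say this explicitly, but it matches the numbers shown.

```latex
\begin{align*}
f(\text{B*}) &= \frac{f(\text{BC}) + f(\text{BD})}{2} = \frac{9 + 7}{2} = 8 \\
f(\text{**}) &= \frac{\bigl(f(\text{AC}) + f(\text{AD})\bigr) + f(\text{CD})}{2 + 1} = \frac{10 + 8}{3} = 6 \\
f(\text{*})  &= \frac{f(\text{A})}{1} = 1
\end{align*}
```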

Global-*, No-* Summarization

Global-*
  Two *-paths, * and **
  Deletes fewer paths than suffix-* to summarize the Markov table
No-*
  No *-paths
  Conservatively assumes that paths not in the Markov table do not exist in the data

Data Sets for Experiments

Synthetic data set
  100,000 XML elements
  Path tree: 3197 nodes, 6 levels, 38 KB
  Element frequencies: Zipfian (z=1)
DBLP data set
  1,399,765 XML elements
  Path tree: 5883 nodes, 6 levels, 69 KB

Query Workloads

1,000 paths of length between 1 and 4
Random paths
  All query paths exist in the data
Random tags
  Most query paths of length 2 or more do not exist in the data
Available memory between 5 and 50 KB

Best Summarization Methods

Path trees
  Query paths in data: Global-*
  Query paths not in data: No-*
Markov tables
  m = 2 is best
  Query paths in data: Suffix-*
  Query paths not in data: No-*

Path Trees vs. Markov Tables

When to use path trees and when to use Markov tables?
Also compared against Pruned Suffix Trees (PSTs) [Chen et al, ICDE 2001]
  Can handle branching path expressions
  Can handle conditions on element values

Synthetic Data – Random Paths

[Figure: absolute error (y-axis, 0 to 16) vs. available memory in KB (x-axis, 0 to 50). Series: Tree Global-*, Markov Suffix-*, PST]

Synthetic Data – Random Tags

[Figure: absolute error (y-axis, 0 to 8) vs. available memory in KB (x-axis, 0 to 50). Series: Tree No-*, Markov No-*, PST]

DBLP Data – Random Paths

[Figure: absolute error (y-axis, 0 to 100) vs. available memory in KB (x-axis, 0 to 50). Series: Tree Global-*, Markov Suffix-*, PST]

DBLP Data – Random Tags

[Figure: absolute error (y-axis, 0 to 4) vs. available memory in KB (x-axis, 0 to 50). Series: Tree No-*, Markov No-*, PST]

When are Markov Tables Better?

DBLP: Repeated sub-structures effectively captured by Markov tables

<sigmod>
  <inproceedings>
    <author>…</author>
    …
  </inproceedings>
  …
</sigmod>

<vldb>
  <inproceedings>
    <author>…</author>
    …
  </inproceedings>
  …
</vldb>

Conclusions

Novel statistics for estimating the selectivity of XML path expressions
Scale to "all the XML data on the Internet"
More accurate than best previously known alternative
Repeated sub-structures: Markov tables
No repeated sub-structures: Path trees
Query paths exist in the data: Global-*, Suffix-*
Query paths do not exist in the data: No-*
To appear in VLDB 2001
