Estimating the Selectivity of XML Path Expressions for Internet Scale Applications

Ashraf Aboulnaga, Alaa R. Alameldeen, Jeffrey F. Naughton

Computer Sciences Department, University of Wisconsin - Madison

Motivation

- XML enables Internet scale applications that query data from many sources
  - Niagara, Xyleme, …
- Queries over XML data use path expressions
- Optimizing these queries requires estimating the selectivity of the path expressions
- Focus of this talk: Building statistics for XML data and using them for estimating the selectivity of simple path expressions

What is XML?

    <readings>
      <play>
        <title>Pygmalion</title>
        <author>Bernard Shaw</author>
      </play>
      <novel>
        <title>David Copperfield</title>
        <author>Charles Dickens</author>
      </novel>
    </readings>

Querying XML

    FOR $n_auth IN document("*")//novel/author,
        $p_auth IN document("*")//play/author
    WHERE $n_auth/text() = $p_auth/text()
    RETURN $n_auth

- Optimizing this query requires estimating the selectivity of the path expressions
- This requires information about the structure of the XML data

Goal of this Work

- Build database statistics that capture the structure of XML data
- Ensure that the statistics fit in a small amount of memory
  - For efficient query optimization
  - Important for Internet scale applications
- Use the statistics to estimate the selectivity of simple XML path expressions: //t1/t2/…/tn

Outline of Presentation

- Introduction
- Path Trees
- Markov Tables
- Performance Evaluation
- Conclusions

Path Trees

XML document:

    <A>
      <B> </B>
      <B> <D> </D> </B>
      <C>
        <D> </D>
        <E> </E>
        <E> </E>
        <E> </E>
      </C>
    </A>

Path tree (tag and frequency, one node per distinct root-to-element path):

    A 1
      B 2
        D 1
      C 1
        D 1
        E 3
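The construction on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `PathTreeNode`, `build_path_tree`, and `estimate_path_tree` are hypothetical, and the sketch assumes that the selectivity of `//t1/…/tn` is the total frequency of the path tree nodes that the expression reaches.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class PathTreeNode:
    tag: str
    freq: int = 0
    children: dict = field(default_factory=dict)  # tag -> PathTreeNode


def build_path_tree(xml_text: str) -> PathTreeNode:
    """Merge all elements that share the same root-to-element tag path."""
    root_elem = ET.fromstring(xml_text)
    root = PathTreeNode(root_elem.tag)

    def visit(elem, node):
        node.freq += 1  # one more element with this exact tag path
        for child in elem:
            child_node = node.children.setdefault(child.tag, PathTreeNode(child.tag))
            visit(child, child_node)

    visit(root_elem, root)
    return root


def estimate_path_tree(root: PathTreeNode, tags: list) -> int:
    """Sum the frequencies of all nodes reached by //t1/t2/.../tn."""

    def match_from(node, i):
        # `node` has already matched tags[i]; follow the remaining tags downward.
        if i == len(tags) - 1:
            return node.freq
        nxt = node.children.get(tags[i + 1])
        return match_from(nxt, i + 1) if nxt is not None else 0

    def walk(node):
        # The leading // lets the match start at any node of the tree.
        total = match_from(node, 0) if node.tag == tags[0] else 0
        return total + sum(walk(c) for c in node.children.values())

    return walk(root)


DOC = ("<A> <B> </B> <B> <D> </D> </B> "
       "<C> <D> </D> <E> </E> <E> </E> <E> </E> </C></A>")
tree = build_path_tree(DOC)
print(estimate_path_tree(tree, ["D"]))       # 2  (A/B/D and A/C/D)
print(estimate_path_tree(tree, ["C", "E"]))  # 3
```

An unsummarized path tree answers such queries exactly; the estimates only become approximate once nodes are removed to save memory.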

Summarizing Path Trees

- Path trees contain all the information needed for selectivity estimation
- Problem: May not fit in available memory
  - Small available memory
  - Internet scale
- Remove low frequency nodes
- Removed nodes replaced with *-nodes
  - Tag name: * meaning "any tag"
  - Frequency: Average frequency of replaced nodes
- Sibling-*, Level-*, Global-*, No-*

Sibling-* Summarization

[Figure (animation): step-by-step sibling-* summarization of an example path tree. The original path tree shows the node labels B 13, C 9, F 15, G 10, H 6, D 7, E 5, I 2, J 4, K 11. As low-frequency nodes are deleted, the frames introduce *-nodes labeled f=6 n=2, f=12 n=2, f=16 n=2, and f=23 n=2. The final frame is shown next to the original path tree for comparison.]

- *-nodes represent deleted sibling nodes
- Memory saved by coalescing nodes
- Try to retain as much information as possible about the deleted nodes
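The coalescing step can be illustrated with a deliberately simplified sketch. It assumes that the label "* f=6 n=2" in the frames comes from deleting the sibling nodes I (2) and J (4), with f the total frequency and n the number of deleted nodes, so that the *-node stands for an average frequency of f/n. The function name `coalesce_siblings` is hypothetical, and the sketch ignores what happens to the children of deleted nodes.

```python
def coalesce_siblings(children: dict, to_delete: set):
    """Split one parent's children into kept nodes and a single *-node.

    children maps tag -> frequency; the *-node is returned as (f, n),
    or None when nothing was deleted.
    """
    kept = {tag: f for tag, f in children.items() if tag not in to_delete}
    removed = [f for tag, f in children.items() if tag in to_delete]
    star = (sum(removed), len(removed)) if removed else None
    return kept, star


# Assumed sibling group taken from the frame labels: I 2 and J 4.
kept, star = coalesce_siblings({"I": 2, "J": 4}, {"I", "J"})
f, n = star
print(star)   # (6, 2)
print(f / n)  # 3.0
```

The same arithmetic reproduces the other labels in the frames under the same assumption: D 7 and E 5 give f=12, n=2, and G 10 and H 6 give f=16, n=2.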

Level-* Summarization

[Figure (animation): level-* summarization of the same example path tree. The final frame retains the node labels B 13, C 9, F 15, G 10, K 11, and K 12.]

- Less information about deleted nodes than sibling-*
- Deletes fewer nodes than sibling-*

Global-* Summarization

[Figure (animation): global-* summarization of the same example path tree. The final frame retains the node labels B 13, C 9, F 15, G 10, and H 6.]

No-* Summarization

[Figure (animation): no-* summarization of the same example path tree. The final frame retains the node labels B 13, C 9, F 15, G 10, H 6, D 7, and E 5.]

- Memory savings similar to global-*
- Conservative assumption about deleted nodes

Outline

- Introduction
- Path Trees
- Markov Tables
- Performance Evaluation
- Conclusions

Markov Tables

- A table of all distinct paths of length up to m and their frequencies
- For paths of length greater than m, combine paths from the Markov table
- Example:

$$ f(A/B/C/D) = f(A/B/C) \cdot \frac{f(B/C/D)}{f(B/C)} $$

- Uses "short memory" or "Markov" property
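The example above appears to be the m = 3 case of a more general rule. As a sketch, assuming a query path $t_1/t_2/\cdots/t_n$ with $n > m$ and a Markov table that holds every path of length up to $m$, the estimate chains overlapping windows of length $m$:

$$ f(t_1/t_2/\cdots/t_n) = f(t_1/\cdots/t_m) \times \prod_{i=1}^{n-m} \frac{f(t_{1+i}/\cdots/t_{m+i})}{f(t_{1+i}/\cdots/t_{m+i-1})} $$

Setting $m = 3$ and $n = 4$ leaves a single factor, $f(B/C/D) / f(B/C)$, which recovers the example: each factor is the fraction of occurrences of the shorter path that continue with the next tag.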

Markov Tables

    Path  Freq      Path  Freq
    A       1       AC      6
    B      11       AD      4
    C      15       BC      9
    D      19       BD      7
    AB     11       CD      8

[Figure: the path tree from which the table is derived; visible node labels: B 11, C 6, D 4, C 9, D 7.]
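The table on this slide holds paths of length up to m = 2, so longer paths have to be estimated by chaining its entries. The following is a minimal sketch of that estimate, not the authors' code: `MARKOV_TABLE` transcribes the slide, `estimate_markov` is a hypothetical name, and paths missing from the table are assumed to have frequency 0.

```python
from fractions import Fraction

# Frequencies transcribed from the example table (m = 2).
MARKOV_TABLE = {
    ("A",): 1, ("B",): 11, ("C",): 15, ("D",): 19,
    ("A", "B"): 11, ("A", "C"): 6, ("A", "D"): 4,
    ("B", "C"): 9, ("B", "D"): 7, ("C", "D"): 8,
}


def estimate_markov(table: dict, path, m: int = 2) -> Fraction:
    """Estimate f(t1/.../tn) from a table of paths of length up to m (m >= 2)."""
    if m < 2:
        raise ValueError("this sketch only covers m >= 2")
    path = tuple(path)
    if len(path) <= m:
        return Fraction(table.get(path, 0))  # exact lookup
    est = Fraction(table.get(path[:m], 0))
    for i in range(1, len(path) - m + 1):
        num = table.get(path[i:i + m], 0)      # window of length m
        den = table.get(path[i:i + m - 1], 0)  # its prefix of length m - 1
        if num == 0 or den == 0:
            return Fraction(0)                 # assumption: missing path means absent
        est *= Fraction(num, den)
    return est


print(estimate_markov(MARKOV_TABLE, "ABC"))          # 9
print(float(estimate_markov(MARKOV_TABLE, "ABCD")))  # 4.8
```

Here f(A/B/C) = f(AB) · f(BC) / f(B) = 11 · 9 / 11 = 9, and f(A/B/C/D) multiplies that by f(CD) / f(C) = 8 / 15. `Fraction` is used only to keep the arithmetic exact in the example.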

Summarizing Markov Tables

- Exact selectivities for paths of length up to m
- Approximate selectivities for paths longer than m
- Problem: May not fit in available memory
- Remove low frequency paths
  - Discard removed paths of length > 2
  - Replace removed paths of length 1 or 2 with *-paths
- Suffix-*, Global-*, No-*

Suffix-* Summarization

[Figure (animation): step-by-step suffix-* summarization of the example Markov table below. SD is the set of deleted paths of length 2.]

    Path  Freq      Path  Freq
    A       1       AC      6
    B      11       AD      4
    C      15       BC      9
    D      19       BD      7
    AB     11       CD      8

Steps shown in the frames:

- The *-paths * and ** are added with frequency 0; SD = { }
- A is deleted: * becomes f=1, n=1
- AD is deleted: SD = { (AD,4) }
- A* with f=10, n=2 enters the table; SD = { }
- BD is deleted: SD = { (BD,7) }
- CD is deleted: SD = { (BD,7), (CD,8) }
- B* with f=16, n=2 replaces BC; SD = { (CD,8) }
- A* is deleted: ** becomes f=10, n=2; SD = { (CD,8) }
- Final frame: B* 8, * 1, ** 6; SD = { }
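The suffix-* frames can be replayed with a short sketch. The rules below are inferred from the frames rather than taken from a stated algorithm, so treat them as assumptions: a deleted path of length 1 is added to `*`; a deleted path of length 2 either joins an existing `X*` path, pairs up with a path in SD that starts with the same tag to form `X*`, or waits in SD; a deleted `X*` path is folded into `**`; paths left in SD at the end are also folded into `**`; and each *-path finally reports the average f/n. The deletion order is passed in as input because the slides only say that low frequency paths are removed. `suffix_star` is a hypothetical name, and tags are single characters as on the slides.

```python
from fractions import Fraction


def suffix_star(table: dict, deletion_order: list):
    table = dict(table)                 # remaining exact paths, e.g. "BC" -> 9
    star = {"*": [0, 0], "**": [0, 0]}  # *-path -> [total f, count n]
    sd = {}                             # deleted length-2 paths awaiting a partner

    def add(star_path, f, n):
        entry = star.setdefault(star_path, [0, 0])
        entry[0] += f
        entry[1] += n

    for path in deletion_order:
        if path.endswith("*"):          # deleting e.g. "A*": fold it into "**"
            f, n = star.pop(path)
            add("**", f, n)
            continue
        freq = table.pop(path)
        if len(path) == 1:
            add("*", freq, 1)
            continue
        prefix = path[0]
        if prefix + "*" in star:        # an X* path already exists: extend it
            add(prefix + "*", freq, 1)
            continue
        partner = next((p for p in sd if p[0] == prefix), None)
        if partner is None:
            sd[path] = freq             # no partner yet: remember it in SD
        else:
            add(prefix + "*", freq + sd.pop(partner), 2)

    for freq in sd.values():            # leftovers in SD end up in "**"
        add("**", freq, 1)
    averages = {p: Fraction(f, n) for p, (f, n) in star.items() if n}
    return table, averages


TABLE = {"A": 1, "B": 11, "C": 15, "D": 19, "AB": 11,
         "AC": 6, "AD": 4, "BC": 9, "BD": 7, "CD": 8}
remaining, averages = suffix_star(TABLE, ["A", "AD", "AC", "BD", "CD", "BC", "A*"])
print({p: int(v) for p, v in averages.items()})  # {'*': 1, '**': 6, 'B*': 8}
```

With this deletion order the sketch reproduces the values in the final frame: B* = 16 / 2 = 8, * = 1 / 1 = 1, and ** = (10 + 8) / 3 = 6.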

Global-*, No-* Summarization

- Global-*
  - Two *-paths, * and **
  - Deletes fewer paths than suffix-* to summarize the Markov table
- No-*
  - No *-paths
  - Conservatively assumes that paths not in the Markov table do not exist in the data

Outline

- Introduction
- Path Trees
- Markov Tables
- Performance Evaluation
- Conclusions

Data Sets for Experiments

- Synthetic data set
  - 100,000 XML elements
  - Path tree: 3197 nodes, 6 levels, 38 KB
  - Element frequencies: Zipfian (z=1)
- DBLP data set
  - 1,399,765 XML elements
  - Path tree: 5883 nodes, 6 levels, 69 KB

Query Workloads

- 1,000 paths of length between 1 and 4
- Random paths
  - All query paths exist in the data
- Random tags
  - Most query paths of length 2 or more do not exist in the data
- Available memory between 5 and 50 KB

Best Summarization Methods

- Path trees
  - Query paths in data: Global-*
  - Query paths not in data: No-*
- Markov tables
  - m = 2 is best
  - Query paths in data: Suffix-*
  - Query paths not in data: No-*

Path Trees vs. Markov Tables

- When to use path trees and when to use Markov tables?
- Also compared against Pruned Suffix Trees (PSTs) [Chen et al., ICDE 2001]
  - Can handle branching path expressions
  - Can handle conditions on element values

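[Figure: Synthetic Data, Random Paths. x-axis: Available Memory (KB), 0 to 50. Series: Tree Global-*, Markov Suffix-*, PST.]

[Figure: Synthetic Data, Random Tags. x-axis: Available Memory (KB), 0 to 50. Series: Tree No-*, Markov No-*, PST.]

[Figure: DBLP Data, Random Paths. x-axis: Available Memory (KB), 0 to 50. Series: Tree Global-*, Markov Suffix-*, PST.]

[Figure: DBLP Data, Random Tags. x-axis: Available Memory (KB), 0 to 50. Series: Tree No-*, Markov No-*, PST.]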

When are Markov Tables Better?

- DBLP: repeated sub-structures effectively captured by Markov tables

Conclusions

- Novel statistics for estimating the selectivity of XML path expressions
  - Scale to "all the XML data on the Internet"
  - More accurate than best previously known alternative
- Repeated sub-structures: Markov tables
- No repeated sub-structures: Path trees
- Query paths exist in the data: Global-*, Suffix-*
- Query paths do not exist in the data: No-*
- To appear in VLDB 2001
