ViST: a dynamic index method for querying XML data by tree structures Authors: Haixun Wang, Sanghyun...

Preview:

Citation preview

ViST: a dynamic index method for querying XML data by tree structures

Authors: Haixun Wang, Sanghyun Park, Wei Fan, Philip Yu

Presenter: Elena Zheleva, November 2003

Overview

Modeling XML Queries Structure-encoded sequences Indexing ViST Experimental Results

Modeling XML Queries

DTD of purchase records:(!ELEMENT purchases (purchase*))(!ELEMENT purchase (seller, buyer))(!ATTRIST seller ID ID location CDATA name CDATA)(!ELEMENT seller (item*))(!ATTRIST buyer ID ID location CDATA name CDATA)(!ELEMENT item (item*))(!ATTRIST item name CDATA manufacturer CDATA)

Modeling XML Queries

Focus in XML query language design: ability to express complex structural or graphical queries

Modeling XML Queries

Querying XML data = finding sub structures of the data graph that match the sequence

Structure-encoded sequences: a sequential representation of both XML data and XML queries

Structure-Encoded Sequences

Structure-Encoded Sequences

Maps the data and the queries Matches the subsequence Purpose: to avoid as many join

operations as possible Def. Sequence of (symbol, prefix) pairs

Mapping Data

Represent XML document/tree in preorder

Represent in structure-encoded seq

Mapping Queries

Benefit of sequence matching: query gets processed as whole

Path Expression

Structure-Encoded Sequences

Query

Data

Querying XML

through Structure-Encoded Sequence Matching

Indexing

Role of Indexing

To provide an algorithm to perform this sequence matching

Desired features for algorithm:– Efficient support for subsequence matching– Use well-supported DB indexing

techniques such as B+ trees– Allow dynamic index insertion

What is indexing useful for

Auxiliary access structures– Used to speed up the retrieval of records– In response to certain search conditions

Provide efficient support for arbitrary structured queries– Using wild-cards // and *

Indexing

State-of the-art approaches– Indexes on paths– Indexes on nodes– Indexes on both (structures) – ViST

ViST

Algorithms

Naïve Algorithm based on Suffix Trees RIST: Relationships Indexed Suffix Tree ViST: Virtual Suffix Tree

Algorithm Using Suffix Trees

Suffix Tree: a compact index to all distinct, contiguous substrings of a string

D-Ancestorship – in XML doc tree Through structure-encoded sequence S-Ancestorship – in suffix tree

Example Using Suffix Trees

Algorithm Using Suffix Trees

Searches – first by S-Ancestorship: searching under

suffix tree– then by D-Ancestorship: matching nodes

and prefixes Disadvantages:

– Costly – traverse large portion of subtree– Most commercial DBMSs do not support

RIST: Indexing by Ancestor-Descendant Relationships Jumps directly to the nodes Y to which

X is both a D-Ancestor and S-Ancestor Index Construction: uses B+ trees

RIST: Indexing by Ancestor-Descendant Relationships Subsequence Matching Determine D-Ancestorship by prefixes Determine S-Ancestorship by label

<nx,sizex> x – suffix tree node (root of S-tree) nx – prefix traversal order sizex – number of descendants

ViST: the Virtual Suffix Tree

Same sequence algorithm as RIST BUT supports dynamic insertions Uses dynamic method to assign labels Once assigned, the labels are fixed and are

not affected by subsequent data insertion or deletion

Labeling the suffix tree w/o building it Relies on statistical information about the

XML data

ViST: the Virtual Suffix Tree

Index structure contains the sequence:

Sequence to be inserted:

Dynamic scope of x = <nx, sizex,kx>

ViST: the Virtual Suffix Tree

Experimental Results

Datasets used– DBLP: CS bibliography DB

• 289,627 records/publications• Each publication – tree of max depth 6• Avg length of structure-encoded seq = 31

– XMARK • 1 record• Complicated tree structure

– Synthetic

Experimental Results

Comparison Methods – Index Fabric Algorithm – XML paths– XISS – uses nodes as basic query unit– ViST – appx. 1/10 of time to perform

queries due to (multiple) join operations

Experimental Results - remove

Index Structure and Size (1/3 less from suffix tree)– DocId B+ Tree – N elements – Combined D-ancestor and S-ancestor B+

tree - N x L elements Index Construction

Conclusion

XML Queries = Subsequence Matching Advantages of ViST – algorithm for

subsequence matching– Avoids expensive join operations– Index on both content and structure of XML

documents– B+ trees – supported by disk-based data– Dynamic data insertion and deletion

Recommended