Upload
blake-bishop
View
216
Download
2
Embed Size (px)
Citation preview
1
Flexible Querying of XML Documents
Krishnaprasad Thirunarayan and Trivikram Immaneni
Department of Computer Science and EngineeringWright State UniversityDayton, OH-45435, USA
2
Talk Outline
Goal (What?)
Background and Motivation (Why?)
Query Language and Examples (What?)
Implementation Details (How?)
Evaluation and Applications (Why?)
Conclusions
4
Develop a keyword-based XML Query Language and its Semantics that is flexible and sufficiently expressive easy to use (for query formulation)
Implement, reusing mature software components, for efficient indexing and search
6
XML vs Text Documents
DATA: Exploit metadata/markup and aggregation structure implicit in XML documents For expressiveness and precision
QUERY: Obtain progressively improved extractions using convenient keyword-based queries in contrast with accurate extractions using complex XML-based queries
7
Relationship to Other Work
Extends XSEarch (Cohen et al) Expressive power: Incorporates
attributes and their values Equivalence (E.g., RDF) :
<T A="s"/> vs <T> <A> s </A> </T>
8
Invariance under Refinement <T> <A> word_1 and word_2 </A> </T>vs <T> <A> <B> word_1 </B> and <C> word_2 </C> </A> </T>
9
Coherence
Interconnectedness : Infer related pieces of information using aggregation implicit in XML Cohen et al : Name equivalence
XSEarch
Li et al: Structural equivalence Scheme-free XML
Guo et al : Completeness XRANK
10
Information Retrieval Explore robust relevance ranking
strategy to deal with high recall Variation on TFIDF Naïve implementation computationally
prohibitive Extension beyond “type-delimited-
document” unclear
12
Query Syntax
Entity-Attribute-KeywordSearch Terms e:a:k e:a:, :a:k, e::k e::, a::, ::k
Signed/optional Search Terms + e:a:k vs e:a:k
13
Single Search Term Satisfaction
e:a:k
The search term e:a:k is satisfied by a tree containing a subtree with the top element e that is associated with the attribute a with value containing k, or a subelement a with descendant text node containing k.
14
Example (Mondial)
<country id="f0_149" name="Austria" capital="f0_1467" population="8023244"
datacode="AU" total_area="83850" population_growth="0.41"
infant_mortality="6.2" ... government="federal republic" ...> ...
</country>
:name:Vienna is satisfied by
<province id="f0_17447" name="Vienna" ...>
<city id="f0_1467" country="f0_149" province="f0_17447" ...>
<name>Vienna</name> <population year="94">1583000</population> </city>
</province> ...
name::Vienna is satisfied by a part of it
<name>Vienna</name>.
15
Example (Heterogeneity)
<author><name>Adam Dingle</name></author>
<author name="A. Dingle" ></author> <article id="3"> @inproceedings{IMN97, author="Adam Dingle and Ed
MacNair and Thao Nguyen", … </article>
author:name:Dingle misses the last one.
16
Query Answer Candidate
Query Answer Candidate for the query Q(t_1,t_2,...,t_m), is a Most preferred satisfying collection of
trees (P_1,P_2,...,P_m) Precise : smallest enclosing
Adequate : optional search terms satisfied as much as possible
17
Query Answer
Query Answer for the query Q(t_1,t_2,...,t_m), is a Query Answer Candidate
(P_1,P_2,...,P_m) in which Trees P_i’s are Interconnected
Specifies trees related to the same “real-world” entity
18
Interconnectedness (Cohen et al)
Two subtrees T_a and T_b are said to be interconnected if the path from their roots to the lowest common ancestor does not contain two distinct nodes with the same element, or the only distinct nodes with the same element are these roots.
19
Interconnectedness (Li et al)
Two subtrees T_a and T_b are said to be interconnected, if the path from T_a's root to their lowest common ancestor in the tree does not contain another node that is the lowest common ancestor of T_a and a distinct subtree T_b‘ , where T_b' has the same root element label as T_b.
22
Tools Used
Apache Lucene 2.0 APIs in Java A high-performance, text search
engine library with smart indexing strategies.
Further tuned for memory-centric operation in contrast with disk-centric defaults
SAXParser APIs
23
Mapping to Lucene
XML documents to Lucene documents for indexingXML keyword-based queries to Lucene queries for searchingENCODING XML fragment of an XML document is
referred to internally using the filename and the XPath (of the XML fragment's root from the XML document root)
25
ExperimentsDATASETs: Sigmod, Mondial, and DBLP.
PLATFORMS: For Sigmod and Mondial datasets: HP xw9300
Workstation with 2 GHz AMD Opteron dual-core processor (270), 4 GB of main memory, and 250 GB 7200 rpm hard drive, running 32-bit Windows XP.
(java -Xms750M -Xmx1500M). For DBLP dataset: SUN Ultra-40 Workstation with 2.4
GHz dual AMD Opteron dual-core processor (280), 8GB of main memory, and 250GB 7500 rpm hard drive, running 64-bit Solaris 10.
(java -Xms1000M -Xmx3600M).
27
Dataset Indexing via Lucene
DATASET INDEXING TIME
INDEX SIZE
Sigmod 32 sec 6 MB
Mondial 180 sec 16 MB
DBLP 36 hrs 4 GB
28
Query Answer: Computation Time vs Display Time
DATASET SIMPLE QUERY
COMPLEXQUERY
Sigmod 35 ms / 1 sec
400 ms / 3 min
Mondial 25 ms / 350 ms
1 sec / 2 min
DBLP 335 ms / 1 sec
---
30
Pubs.xml
<publications>- <book> <title>Modern Information Retrieval</title> <author>Ricardo Baeza-Yates</author> <author>Berthier Ribeiro-Neto</author> - <chapter> <title>Digital Libraries</title> <author>Edward A. Fox</author> <author>Ohm Sornil</author> </chapter> </book>- <article> <title>The Anatomy of a Large-Scale Hypertextual Web Search Engine</title> <author>Sergey Brin</author> <author>Lawrence Page</author> </article>- <article> <title>An Algorithm for Suffix Stripping</title> <author>M.F.Porter</author> </article>- <article> <title>Indexing by Latent Semantic Analysis</title> </article> </publications>
31
Characteristics of Pubs.xml
Total number of authors = 7Total number of titles = 5
Title Distribution = 1 book (with 1 chapter) + [3 articles] Author Distribution =
2 ( 2 ) + [ 2 + 1 + 0 ]
32
Queries to Pubs.xml (Answer counts)
Arbitrary mix and match of authors and titles = 7 * 5 = 35
author::, title:: (8 hits)+author::, +title:: (7 hits)+author::, +title::, +author:: (4 hits)
35
Developed declarative semantics for keyword-based XML Query language with an effective query answering algorithmDeveloped a notion of interconnectedness that provides coherent answersImplemented using Lucene 2.0 APIs Indexing: Time and Space Intensive
But Query Answering: Quick