Flexible and Efficient XML Search with Complex Full-Text Predicates
Sihem Amer-Yahia - AT&T Labs Research → Yahoo! Research
Emiran Curtmola - University of California San Diego
Alin Deutsch - University of California San Diego
SIGMOD, June 2006
Introduction
Need for complex full-text predicates beyond simple keyword search
Library of Congress (LoC)
Biomedical data
ACM, IEEE publications
INEX data collection
Wikipedia XML data set
Real XML fragment from LoC: http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml
Congress on education and workforce, comments to appropriate services.
109th Mr Column and co-sponsors Mrs Miller and Mrs Jones. Others include Jefferson
on May 2, 2004 Joe Jefferson
introduced the following bill. The bill was reintroduced later and was referred to the committee
on education and workforce sponsored by Joe Jefferson
House of Representatives Current chamber on workforce and services. Committees on education are headed by Jefferson
Jefferson and services …
HR2739
Element names in the tree figure: committee-name; action-desc; bill; congress-info; nbr sponsors; action; legis-session; legis; legis-body; legis-desc
Query with complex FT predicates
Document fragments (nodes) that
  contain the keywords “Jefferson” and “education”
  and satisfy the predicates: within a window of 10 words, with “Jefferson” ordered before “education”
Example: LoC document
Return document fragments
Naive solution: test the query at each node → redundant
Need for efficient evaluation of full-text predicates
  use structural relationships between nodes
  avoid redundant computation
Existing languages
Many XML full-text search languages
  expressive power, semantics, scores [BAS-06]
  XQFT-class: W3C’s XQuery Full-Text (XQFT), NEXI, XIRQL, JuruXML, XSearch, XRank, XKSearch, Schema Free XQuery
Efficient query evaluation limited to
  Conjunctive keyword search (no predicates)
  Full-text predicates in isolation
Need for a universal optimization framework
  Guarantee the universality of the solution
Contributions
Formal semantics for XQFT-class
  Unified framework
  Capture family of tf*idf scoring methods
Structure-aware algorithms to efficiently evaluate XQFT-class languages
  XFT full-text algebra
  Enable new optimizations inspired by relational rewritings
Talk Outline
Motivation & Contributions
Formalization of XML full-text search
Efficient evaluation
Experiments
Conclusion
Formalization: design goals
Capture existing full-text languages
Language semantics in terms of
  keyword patterns
  pattern matches
  predicates evaluated through matches
Manipulate tuples
  enable relational query evaluation and rewritings
Formalization: patterns
Pattern = tuple of simultaneously matching keywords
Query expression: “Jefferson” and “education” within a window of 10 words, with “Jefferson” ordered before “education”
Pattern: (“Jefferson”, “education”)
Formalization: patterns
Formalization specifies
  patterns ← conjunction of keywords
  set of patterns ← disjunction of keywords
  exclusion patterns ← negation of keywords
No matches in the document
Formalization: matches
Pattern: (“Jefferson”, “education”)
Matches: (22, 3), (22, 45), (22, 67), (51, 3), …
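The match list above can be reproduced with a small sketch: each match picks one word position per keyword of the pattern. The position lists are the ones shown for the root node in the matching tables of this talk; the function name `pattern_matches` is hypothetical and only serves the illustration.

```python
from itertools import product

def pattern_matches(positions_per_keyword):
    """Return every tuple that picks one position per keyword, in pattern order."""
    return list(product(*positions_per_keyword))

# Word positions of each keyword in the whole LoC fragment (root node).
jefferson = [22, 28, 51, 54, 72]
education = [3, 45, 67]

matches = pattern_matches([jefferson, education])
print(len(matches))   # 5 * 3 = 15 matches at the root
print(matches[:3])    # [(22, 3), (22, 45), (22, 67)]
```

Predicates such as ordered or window are then evaluated on these tuples, one tuple at a time.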
Formalization: matching tables
Matching table
  Nested relation
  For each node in the document and each pattern in the query: set of matches
Formalization: matching tables
Node     Pattern                     Matches
action   “Jefferson”, “education”    (28, 45), (51, 45)
…        …                           …
XFT Algebra
Similar to relational algebra
  Manipulate matching tables
  Leverage relational query evaluation + optimization techniques
XFT operators
  construct matching table Rk for each keyword k
    get(k)
  manipulate matching tables
    R1 or R2
    R1 and R2
    R1 minus R2
    σtimes(R), σordered(R), σwindow(R), σdistance(R)
XFT Algebra
Query: Nodes that contain the keywords “Jefferson” and “education” within a window of 10 words, with “Jefferson” ordered before “education”
Operator tree (figure flattened in extraction):
  σwindow 10(“Jefferson”, “education”)
    σordered(“Jefferson”, “education”)
      ×
        get(“Jefferson”)
        get(“education”)
Benefit: equivalent query rewritings
Query evaluation: AllNodes
Straightforward implementation of the XFT algebra
  Each node is considered separately
  Each tuple is self-contained
Relational-style evaluation
  Joins → equi-joins
  Predicates → selections on set of matches
Example: LoC document
Node identifiers shown on the document tree (figure flattened in extraction):
1
1.1, 1.2, 1.3
1.1.1, 1.1.2, 1.1.3
1.2.2
1.2.2.2
1.3.1, 1.3.2
1.3.1.2
Plan operators: get(“Jefferson”), get(“education”), ×, σordered(“Jefferson”, “education”), σwindow 10(“Jefferson”, “education”)

get(“Jefferson”)
Node      Pattern       Matches
1         “Jefferson”   22, 28, 51, 54, 72
1.1       “Jefferson”   22
1.1.3     “Jefferson”   22
1.2       “Jefferson”   28, 51
1.2.2     “Jefferson”   51
1.2.2.2   “Jefferson”   51
1.3       “Jefferson”   54, 72
1.3.1     “Jefferson”   54
1.3.1.2   “Jefferson”   54
1.3.2     “Jefferson”   72

get(“education”)
Node      Pattern       Matches
1         “education”   3, 45, 67
1.1       “education”   3
1.1.1     “education”   3
1.2       “education”   45
1.2.2     “education”   45
1.2.2.2   “education”   45
1.3       “education”   67
1.3.2     “education”   67

get(“Jefferson”) × get(“education”)
Node      Pattern                     Matches
1         “Jefferson”, “education”    (22, 45), (72, 67), …
1.1       “Jefferson”, “education”    (22, 3)
1.2       “Jefferson”, “education”    (28, 45), (51, 45)
1.2.2     “Jefferson”, “education”    (51, 45)
1.2.2.2   “Jefferson”, “education”    (51, 45)
1.3       “Jefferson”, “education”    (54, 67), (72, 67)
1.3.2     “Jefferson”, “education”    (72, 67)

Predicate operates one tuple at a time
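The AllNodes evaluation on these tables can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names (`and_join`, `select`, `ordered`, `window`) are hypothetical, and the window semantics (all positions fit in a span of at most n words) is an assumption, since the slides do not define it.

```python
from itertools import product

# Matching tables from the slides: node id -> positions of the keyword in that node.
JEFFERSON = {
    "1": [22, 28, 51, 54, 72], "1.1": [22], "1.1.3": [22],
    "1.2": [28, 51], "1.2.2": [51], "1.2.2.2": [51],
    "1.3": [54, 72], "1.3.1": [54], "1.3.1.2": [54], "1.3.2": [72],
}
EDUCATION = {
    "1": [3, 45, 67], "1.1": [3], "1.1.1": [3],
    "1.2": [45], "1.2.2": [45], "1.2.2.2": [45],
    "1.3": [67], "1.3.2": [67],
}

def and_join(r1, r2):
    """Equi-join on node id; the matches of a node are combined by cross product."""
    return {n: list(product(r1[n], r2[n])) for n in r1 if n in r2}

def select(table, predicate):
    """Apply a predicate to each match; drop nodes left without any match."""
    out = {}
    for node, matches in table.items():
        kept = [m for m in matches if predicate(m)]
        if kept:
            out[node] = kept
    return out

def ordered(match):
    # "Jefferson" (first component) occurs before "education" (second component).
    return match[0] < match[1]

def window(n):
    # Assumed semantics: all positions of the match fit in a span of at most n words.
    return lambda match: max(match) - min(match) + 1 <= n

joined = and_join(JEFFERSON, EDUCATION)
print(joined["1.2"])                    # [(28, 45), (51, 45)]
print(select(joined, ordered)["1.3"])   # [(54, 67)]
```

Because every node carries its full set of matches, each selection only needs to look at one tuple at a time, at the price of storing the same match at every ancestor.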
Query evaluation: SCU
AllNodes = straightforward algorithm
Reduce size of intermediate results
  use structural relationships between nodes
  avoid redundant match representation
SCU = Smallest Containing Unit
Matching tables → SCU tables

Matching table for get(“Jefferson”)
Node      Pattern       Matches
1         “Jefferson”   22, 28, 51, 54, 72
1.1       “Jefferson”   22
1.1.3     “Jefferson”   22
1.2       “Jefferson”   28, 51
1.2.2     “Jefferson”   51
1.2.2.2   “Jefferson”   51
1.3       “Jefferson”   54, 72
1.3.1     “Jefferson”   54
1.3.1.2   “Jefferson”   54
1.3.2     “Jefferson”   72

→ SCU table for get(“Jefferson”) (captures same information)
Node      Pattern       Matches
1.1.3     “Jefferson”   22
1.2.2.2   “Jefferson”   51
1.2       “Jefferson”   28
1.3.1.2   “Jefferson”   54
1.3.2     “Jefferson”   72
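The step from a matching table to an SCU table can be sketched by keeping each match only at the deepest node that contains it. The function name `to_scu_table` is hypothetical, and the sketch assumes dotted node identifiers whose depth equals the number of dots, as in the example tree.

```python
def to_scu_table(matching_table):
    """Keep each position only at the deepest node (smallest unit) containing it."""
    deepest = {}  # position -> (depth, node id)
    for node, positions in matching_table.items():
        depth = node.count(".")
        for pos in positions:
            if pos not in deepest or depth > deepest[pos][0]:
                deepest[pos] = (depth, node)
    scu = {}
    for pos, (_, node) in sorted(deepest.items()):
        scu.setdefault(node, []).append(pos)
    return scu

# Matching table for get("Jefferson") from the slides.
JEFFERSON = {
    "1": [22, 28, 51, 54, 72], "1.1": [22], "1.1.3": [22],
    "1.2": [28, 51], "1.2.2": [51], "1.2.2.2": [51],
    "1.3": [54, 72], "1.3.1": [54], "1.3.1.2": [54], "1.3.2": [72],
}

print(to_scu_table(JEFFERSON))
# {'1.1.3': [22], '1.2': [28], '1.2.2.2': [51], '1.3.1.2': [54], '1.3.2': [72]}
```

The 10 rows of the matching table shrink to 5, one per occurrence of the keyword.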
SCU tables as input to the plan
Plan operators: get(“Jefferson”), get(“education”), ×, σordered(“Jefferson”, “education”), σwindow 10(“Jefferson”, “education”)

get(“Jefferson”)
Node      Pattern       Matches
1.1.3     “Jefferson”   22
1.2.2.2   “Jefferson”   51
1.2       “Jefferson”   28
1.3.1.2   “Jefferson”   54
1.3.2     “Jefferson”   72

get(“education”)
Node      Pattern       Matches
1.1.1     “education”   3
1.2.2.2   “education”   45
1.3.2     “education”   67
Result of × as an equi-join on SCU tables
Node      Pattern                     Matches
1.2.2.2   “Jefferson”, “education”    (51, 45)
1.3.2     “Jefferson”, “education”    (72, 67)
Equi-join does not work
• Need to compute the LCA (lowest common ancestor)
Result of × with LCA computation on SCU tables
Node      Pattern                     Matches
1.1       “Jefferson”, “education”    (22, 3)
1.2.2.2   “Jefferson”, “education”    (51, 45)
1.2       “Jefferson”, “education”    (28, 45)
1.3.2     “Jefferson”, “education”    (72, 67)
1.3       “Jefferson”, “education”    (54, 67)
1         “Jefferson”, “education”    (22, 45), …
1.1 is the LCA of 1.1.3 and 1.1.1
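The LCA-based join can be sketched with dotted node identifiers, where the LCA of two nodes is the longest common prefix of their identifier components. This is a nested-loop illustration with hypothetical names (`lca`, `lca_join`), not the stack-based single-scan algorithm of the talk, and it assumes all identifiers share the root component.

```python
from itertools import product

# SCU tables from the slides: node id -> positions of the keyword.
JEFFERSON_SCU = {"1.1.3": [22], "1.2.2.2": [51], "1.2": [28], "1.3.1.2": [54], "1.3.2": [72]}
EDUCATION_SCU = {"1.1.1": [3], "1.2.2.2": [45], "1.3.2": [67]}

def lca(a, b):
    """Lowest common ancestor of two dotted ids = longest common component prefix."""
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)

def lca_join(r1, r2):
    """Pair every tuple of r1 with every tuple of r2 and store the match at the LCA."""
    out = {}
    for (n1, p1), (n2, p2) in product(r1.items(), r2.items()):
        out.setdefault(lca(n1, n2), []).extend(product(p1, p2))
    return out

joined = lca_join(JEFFERSON_SCU, EDUCATION_SCU)
print(joined["1.1"])   # [(22, 3)]
print(joined["1.3"])   # [(54, 67)]
```

Each match is stored once, at the smallest node that contains both keyword occurrences, which is why node 1.3 holds only (54, 67) here while (72, 67) stays at 1.3.2.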
After σordered(“Jefferson”, “education”)
Node      Pattern                     Matches
1.2       “Jefferson”, “education”    (28, 45)
1.3       “Jefferson”, “education”    (54, 67)
1         “Jefferson”, “education”    (22, 45), …

After σwindow 10(“Jefferson”, “education”)
Node      Pattern                     Matches
EMPTY !!!
Node      Pattern                     Matches
1.3       “Jefferson”, “education”    (54, 67), (72, 67)
1         “Jefferson”, “education”    (22, 45), …
• Postorder
• Stack supports single scan
SCU summary
Equivalent to AllNodes
Structure-awareness reduces size of intermediate results
Increased computation cost
  Compute LCAs of nodes
  Match propagation
Stack-based techniques
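The equivalence with AllNodes can be illustrated by propagating each match of an SCU table to all ancestors of its node. This is a simplified sketch with hypothetical names (`ancestors_and_self`, `propagate`); it assumes that every prefix of a dotted identifier is a node of the document, and it lists for the root only the one match the slides show explicitly.

```python
def ancestors_and_self(node):
    """'1.2.2.2' -> ['1.2.2.2', '1.2.2', '1.2', '1']"""
    parts = node.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]

def propagate(scu_table):
    """Copy every match to its node and to all ancestors of that node."""
    out = {}
    for node, matches in scu_table.items():
        for anc in ancestors_and_self(node):
            out.setdefault(anc, set()).update(matches)
    return out

# Joined SCU table from the slides (root matches beyond (22, 45) are elided there).
SCU_JOIN = {
    "1.1": [(22, 3)], "1.2.2.2": [(51, 45)], "1.2": [(28, 45)],
    "1.3.2": [(72, 67)], "1.3": [(54, 67)], "1": [(22, 45)],
}

all_nodes = propagate(SCU_JOIN)
print(sorted(all_nodes["1.2"]))   # [(28, 45), (51, 45)]
print(sorted(all_nodes["1.3"]))   # [(54, 67), (72, 67)]
```

The rows for nodes 1.2 and 1.3 agree with the AllNodes join result, which is the sense in which the smaller SCU representation captures the same information.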
Related work on LCA for XML
LCA for conjunctive keyword search
  XRank [GSBS-03]
  Schema-free XQuery [LYJ-04]
  XKSearch [XP-05]
Shortcomings
  No postprocessing, not compositional
    Input in document order
    Output postorder traversal
  Support for complex predicates is not straightforward
Experimental goals
AllNodes vs. SCU
  AllNodes: redundant representation
  SCU: smaller sizes, more computation
SCU Overhead
  Stack
  Match propagation
Benefit of Rewritings
  Relational-style rewritings
Experimental setup
Centrino 1.8GHz with 1GB of RAM
XMark generated datasets
  Sizes range from 50 MB to 300 MB
Experiments: AllNodes vs. SCU
Varying document size (q1 - query without predicates)
q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)
Experiments: SCU Overhead
Queries
  q4 = σwindow>1(“See”, “internationally”, “description”, “charges”, “ship”) (q1)
  q5 = σwindow>90000000(“See”, “internationally”, “description”, “charges”, “ship”) (q1)
  Recall that q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)
Varying query predicates (not pushed)
  q4 always true → no match propagation, just the stack overhead
  q5 always false → propagate all matches
Experiments: Benefit of Rewritings
Queries
  q2 = σorderedE(“See”, “internationally”, “description”, “charges”, “ship”) (q1)
  q3 = push selections in q2
  Recall that q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)
Varying document size (query with predicates)
40% improvement for relational-like query rewritings
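The slides do not spell out how q3 pushes selections. One plausible reading, sketched here for a single node and three keywords, is that the ordered predicate is applied to a partial join before the remaining keyword is joined in. The keyword positions and variable names are made up for the illustration.

```python
from itertools import product

def ordered(match):
    """True when the positions appear in strictly increasing order."""
    return all(a < b for a, b in zip(match, match[1:]))

# Hypothetical positions of three keywords inside one node.
k1, k2, k3 = [5, 40], [10, 30], [20, 50]

# q2-style plan: join everything first, then select.
late = [m for m in product(k1, k2, k3) if ordered(m)]

# q3-style plan: select on the partial join, then join the rest and select again.
partial = [m for m in product(k1, k2) if ordered(m)]
early = [p + (c,) for p, c in product(partial, k3) if ordered(p + (c,))]

print(late == early)                              # True
print(len(list(product(k1, k2))), len(partial))   # 4 2
```

Both plans return the same matches, but the pushed selection halves the partial join before the last keyword is combined, which is the kind of saving relational-style rewritings aim for.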
Conclusion
A unified logical framework for XML full-text search languages
Algebra admits
  Efficient algorithms for operator evaluation
  Rewritings of queries into more efficient forms
Facilitates joint optimization of XML queries on both structure and text search
Future work
  Score-aware logical framework
Thank you!