An Efficient and Versatile Query Engine for TopX Search

Martin Theobald, Ralf Schenkel, Gerhard Weikum
Max-Planck Institute for Informatics, Saarbrücken, Germany

VLDB ‘05
An XML-IR Scenario…

//article[//sec[about(.//, “XML retrieval”)]
          //par[about(.//, “native XML database”)]
]//bib[about(.//item, “W3C”)]

(Figure: two example article trees. The first contains a sec with title “Current Approaches to XML Data Manage­ment.”, pars such as “Data management systems control data acquisition, storage, and retrieval. Systems evolved from flat files …”, “XML queries with an expressive power similar to that of Datalog …”, and “Native XML database systems can store schemaless data ...”, and bib items with titles “XML-QL: A Query Language for XML.” (inproc “Proc. Query Languages Workshop, W3C, 1998.”) and “Native XML databases.”. The second contains sec titles “The XML Files”, “The Ontology Game”, and “The Dirty Little Secret”, pars such as “Sophisticated technologies developed by smart people.”, “What does XML add for retrieval? It adds formal ways …”, and “There, I've said it - the "O" word. If anyone is thinking along ontology lines, I would like to break some old news …”, and a bib item with url “w3c.org/xml” and title “XML”.)
TopX: Efficient XML-IR

- Extend top-k query processing algorithms for sorted lists [Buckley ’85; Güntzer, Balke & Kießling ’00; Fagin ‘01] to XML data
- Non-schematic, heterogeneous data sources
- Combined inverted index for content & structure
- Avoid full index scans; postpone expensive random accesses to large disk-resident data structures
- Exploit cheap disk space for redundant indexing

Goal: Efficiently retrieve the best results of a similarity query
XML-IR: History and Related Work

Timeline: 1995 – 2000 – 2005

IR on structured data (SGML): OED etc. (U Waterloo), HySpirit (U Dortmund), HyperStorM (GMD Darmstadt), WHIRL (CMU)

Web query languages: Lorel (Stanford U), Araneus (U Roma), W3QS (Technion Haifa), WebSQL (U Toronto)

XML query languages: XML-QL (AT&T Labs), XPath 1.0 (W3C), XPath 2.0 (W3C), XQuery (W3C), TeXQuery (AT&T Labs), XPath & XQuery Full-Text (W3C), NEXI (INEX Benchmark)

IR on XML: XIRQL (U Dortmund / Essen), XXL & TopX (U Saarland / MPII), ApproXQL (U Berlin / U Munich), ELIXIR (U Dublin), JuruXML (IBM Haifa), XSearch (Hebrew U), Timber (U Michigan), XRank & Quark (Cornell U), FleXPath (AT&T Labs), XKeyword (UCSD)

Commercial software: MarkLogic, Verity?, IBM?, Oracle?, ...
Outline
Data & Scoring model
Database schema & indexing
Top-k query processing for XML
Scheduling & probabilistic candidate pruning
Experiments & Conclusions
Data Model

- Simplified XML model, disregarding IDRef & XLink/XPointer
- Redundant full-content strings per element
- Per-element term frequencies ftf(ti, e) for full contents
- Pre/postorder labels for each tag-term pair

<article>
  <title>XML-IR</title>
  <abs>IR techniques for XML</abs>
  <sec>
    <title>Clustering on XML</title>
    <par>Evaluation</par>
  </sec>
</article>

(Figure: the element tree of this document, annotated with pre/postorder labels and stemmed full-content strings, e.g., “xml ir ir technique xml clustering xml evaluation” for the article element and “clustering xml evaluation” for the sec element; hence ftf(“xml”, article1) = 3.)
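The pre/postorder labels make ancestor/descendant tests a constant-time comparison. A minimal sketch (the `Elem` structure and the concrete labels are illustrative, not the actual TopX data structures):

```python
# Each element carries (pre, post) labels from a pre- and postorder traversal.
# e1 is an ancestor of e2 iff e1.pre < e2.pre and e1.post > e2.post.
from collections import namedtuple

Elem = namedtuple("Elem", ["tag", "pre", "post"])

def is_ancestor(e1, e2):
    """True iff e1 lies on the path from the document root to e2."""
    return e1.pre < e2.pre and e1.post > e2.post

article = Elem("article", 1, 6)   # root: smallest pre, largest post
title = Elem("title", 2, 1)
par = Elem("par", 6, 4)

print(is_ancestor(article, par))  # the article subtree contains the par
print(is_ancestor(title, par))    # siblings' subtrees never nest
```

This test is what lets the engine evaluate descendant constraints incrementally without touching the document tree itself.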
Full-Content Scoring Model

- Full-content scores cast into an Okapi BM25 probabilistic model with element-specific model parameterization
- Basic scoring idea within the IR-style family of TF·IDF ranking functions
- Additional static score mass c for relaxable structural conditions

Element statistics:

tag      N          avg. length  k1    b
article  16,850     2,903        10.5  0.75
sec      96,709     413          10.5  0.75
par      1,024,907  32           10.5  0.75
fig      109,230    13           10.5  0.75
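As an illustration of element-specific BM25 parameterization (a textbook BM25 variant, not necessarily the paper's exact formula; `df` and the element length below are made-up values):

```python
import math

def bm25_score(ftf, N, df, elem_len, avg_len, k1=10.5, b=0.75):
    """BM25-style score of one term for one element.

    ftf      -- full-content term frequency ftf(t, e)
    N        -- number of elements with this tag (per-tag statistic)
    df       -- number of such elements whose full content contains the term
    elem_len -- full-content length of the element
    avg_len  -- average full-content length for the tag
    """
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    norm = ftf + k1 * ((1 - b) + b * elem_len / avg_len)
    return idf * (ftf * (k1 + 1)) / norm

# Score "xml" for an article element with ftf = 3, using the slide's
# article statistics (N = 16,850, avg. length = 2,903).
print(bm25_score(ftf=3, N=16_850, df=5_000, elem_len=2_400, avg_len=2_903))
```

The per-tag statistics table above is exactly what such a scorer needs: each tag gets its own N, average length, and (here constant) k1 and b.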
Inverted Block-Index for Content & Structure

Inverted index over tag-term pairs (full contents):
- Benefits from the increased selectivity of combined tag-term pairs
- Accelerates the child-or-descendant axis, e.g., sec//”clustering”

Index list for sec[clustering]:

eid  docid  score  pre  post  max-score
46   2      0.9    2    15    0.9
9    2      0.5    10   8     0.9
171  5      0.85   1    20    0.85
84   3      0.1    1    12    0.1

Index list for title[xml]:

eid  docid  score  pre  post  max-score
216  17     0.9    2    15    0.9
72   3      0.8    10   8     0.8
51   2      0.5    4    12    0.5
671  31     0.4    12   23    0.4

Index list for par[evaluation]:

eid  docid  score  pre  post  max-score
3    1      1.0    1    21    1.0
28   2      0.8    8    14    0.8
182  5      0.75   3    7     0.75
96   4      0.75   6    4     0.75

Sequential block scans:
- Elements are reordered in descending order of (max-score, docid, score) per list
- All tag-term pairs per doc are fetched in one sequential block access
- docid limits the range of in-memory structural joins

Stored as inverted files or database tables (B+-tree indexes)
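A minimal in-memory sketch of such a block list (hypothetical `Entry` structure): entries are sorted by descending (max-score, docid, score), so all entries of one document become contiguous and one "sorted access" can return a whole per-document block, mimicking one sequential disk access.

```python
from dataclasses import dataclass
from itertools import groupby

@dataclass
class Entry:
    eid: int; docid: int; score: float; pre: int; post: int; max_score: float

def block_scan(entries):
    """Yield per-document blocks in descending (max_score, score) order.

    After sorting, all entries of one document are contiguous, so each
    yielded block corresponds to one sequential block access on disk."""
    ordered = sorted(entries, key=lambda e: (-e.max_score, e.docid, -e.score))
    for docid, block in groupby(ordered, key=lambda e: e.docid):
        yield docid, list(block)

# The sec[clustering] list from the slide:
sec_clustering = [
    Entry(46, 2, 0.9, 2, 15, 0.9), Entry(9, 2, 0.5, 10, 8, 0.9),
    Entry(171, 5, 0.85, 1, 20, 0.85), Entry(84, 3, 0.1, 1, 12, 0.1),
]
for docid, block in block_scan(sec_clustering):
    print(docid, [e.eid for e in block])   # doc 2 first (max-score 0.9)
```

Note that `groupby` only groups consecutive equal keys, which is exactly why the (max-score, docid, score) sort order matters.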
Navigational Index

- Additional non-redundant element directory
- Supports element paths and branching path queries
- Random accesses using (docid, tag) as key
- Schema-oblivious indexing & querying

Index list for sec:

eid  docid  pre  post
46   2      2    15
9    2      10   8
171  5      1    20
84   3      1    12

Index list for title:

eid  docid  pre  post
216  17     2    15
72   3      10   8
51   2      4    12
671  31     12   23

Index list for par:

eid  docid  pre  post
3    1      1    21
28   2      8    14
182  5      3    7
96   4      6    4
TopX Query Processing

Adapt the Threshold Algorithm (TA) paradigm [Buckley & Lewit, SIGIR ‘85; Güntzer et al., VLDB ’00; Fagin et al., PODS ’01]:
- Focus on inexpensive sequential/sorted accesses
- Postpone expensive random accesses

Candidate d = connected sub-pattern with element ids and scores
- Incrementally evaluate path constraints using pre/postorder labels
- In-memory structural joins (nested-loops, staircase, or holistic twig joins)

Upper/lower score guarantees per candidate; remember the set of evaluated dimensions E(d):
    worstscore(d) = Σ_{i ∈ E(d)} score(ti, e)
    bestscore(d) = worstscore(d) + Σ_{i ∉ E(d)} high_i

Early threshold termination with candidate queuing: stop once bestscore(d) ≤ min-k for every queued candidate d, where min-k is the worstscore of the current rank-k result

Extensions [Theobald, Schenkel & Weikum, VLDB ’04]:
- Batching of sorted accesses & efficient queue management
- Cost model for random access scheduling
- Probabilistic candidate pruning for approximate top-k results
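A toy sketch of such a TA-style loop (not the TopX implementation: scores are aggregated per document only, and structural joins, block batching, and random accesses are omitted). It runs over per-document scores taken from the slide's three index lists:

```python
import heapq

def topx_ta(lists, k):
    """Round-robin sorted accesses over descending-score lists; stop when
    no candidate's bestscore can still beat min-k (worstscore of rank k).
    Each list is [(docid, score), ...] sorted by descending score."""
    seen = {}                          # docid -> {list index: score}
    high = [l[0][1] for l in lists]    # current upper bound per list
    pos = [0] * len(lists)
    while True:
        progressed = False
        for i, lst in enumerate(lists):        # one sorted access per list
            if pos[i] < len(lst):
                docid, s = lst[pos[i]]
                pos[i] += 1
                seen.setdefault(docid, {})[i] = s
                high[i] = lst[pos[i]][1] if pos[i] < len(lst) else 0.0
                progressed = True
        worst = {d: sum(sc.values()) for d, sc in seen.items()}
        top = heapq.nlargest(k, worst.items(), key=lambda x: x[1])
        min_k = top[-1][1] if len(top) == k else 0.0
        best = {d: worst[d] + sum(h for i, h in enumerate(high)
                                  if i not in seen[d]) for d in seen}
        if not progressed or all(best[d] <= min_k or (d, worst[d]) in top
                                 for d in seen):
            return top

lists = [[(2, 0.9), (5, 0.85), (3, 0.1)],   # sec[clustering], per doc
         [(17, 0.9), (3, 0.8), (2, 0.5)],   # title[xml]
         [(1, 1.0), (2, 0.8), (5, 0.75)]]   # par[evaluation]
print(topx_ta(lists, k=2))
```

On this input the loop terminates with doc 2 (worstscore 2.2) and doc 5 (worstscore 1.6) as the top-2, matching the example on the next slide.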
TopX Query Processing By Example

Query twig: sec[clustering] branching into title[xml] and par[evaluation]

(Figure: animation of the threshold algorithm over the three index lists for sec[clustering], title[xml], and par[evaluation]. Sorted accesses proceed block by block; candidates such as doc1, doc2, doc3, doc5, and doc17 are kept in a candidate queue with [worstscore, bestscore] intervals that tighten as more dimensions are seen, while pseudo-elements with worstscore 0.0 bound the bestscores of not-yet-seen candidates. The threshold min-2 rises from 0.0 over 0.5 and 0.9 to 1.6; candidates whose bestscore drops below min-2 are pruned, and the top-2 results are returned: doc2 (elements 46, 28, 51; worstscore 2.2) and doc5 (elements 171, 182; worstscore 1.6).)
Incremental Path Validations

Query: //article[//sec//par//“xml java”]//bib//item//title//“security”

- Complex query DAGs; transitive closure of descendant (child-or-descendant) constraints
- Aggregate the additional static score mass c for a structural condition i, if all edges rooted at i are satisfiable
- Incrementally test structural constraints:
  - Quickly decrease bestscores for early pruning
  - Schedule random accesses in ascending order of structural selectivities

(Figure: query DAG over article, sec, par=“xml”, par=“java”, bib, item, and title=“security”, with static score mass c = 1.0 per structural condition. A “promising candidate” with content scores 0.8 and 0.7 starts at worstscore(d) = 1.5, bestscore(d) = 6.5; random accesses resolve the unknown structural intervals [0.0, high_i] one by one, dropping bestscore(d) to 5.5 and then 4.5, below min-k = 4.8, so the candidate can be pruned.)
Random Access Scheduling – Minimal Probes (MinProbe)

Structural conditions as “soft filters” (expensive predicates & minimal probes [Chang & Hwang, SIGMOD ‘02]):
- Schedule random accesses only for the most promising candidates
- Schedule a batch of RAs on d, if
    worstscore(d) + o_d · c > min-k
  where worstscore(d) is the evaluated content- & structure-related score and o_d · c is the unevaluated structural score mass (constant c per open condition)

(Figure: the query DAG for //article[//sec//par//“xml java”]//bib//item//title//“security” with c = 1.0 per structural condition.)
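The probe condition itself is a one-liner; a sketch with the slide's numbers (c = 1.0, min-k = 4.8; the function name and candidate fields are illustrative):

```python
def should_probe(worstscore, open_conditions, min_k, c=1.0):
    """MinProbe test: probe d only if resolving its o_d open structural
    conditions (each worth the constant score mass c) could lift its
    worstscore past the current min-k threshold."""
    return worstscore + open_conditions * c > min_k

# Candidate with worstscore 1.5:
print(should_probe(1.5, 4, min_k=4.8))   # 5.5 > 4.8  -> worth probing
print(should_probe(1.5, 3, min_k=4.8))   # 4.5 <= 4.8 -> skip the probes
```

Because the unevaluated structural score mass is constant, this test needs no index access at all, which is what makes the filter cheap.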
Cost-Based Scheduling – BenProbe

Analytic cost model; basic idea: compare expected random access costs to an optimal schedule
- Access costs on d are wasted, if d does not make it into the final top-k (considering both content & structure)
- Compare different Expected Wasted Costs (EWC):
  - EWC-RAs(d) of looking up d in the structure
  - EWC-RAc(d) of looking up d in the content
  - EWC-SA(d) of not seeing d in the next batch of b sorted accesses
- Schedule a batch of RAs on d, if the RA-side cost is below the SA-side cost:
    EWC-RAs|c(d) · cRA < EWC-SA · cSA
Structural Selectivity Estimator

- Split the query into a set of characteristic patterns, e.g., twigs, descendants & tag-term pairs
- Consider structural selectivities (assuming independence between patterns):
    P[d satisfies all structural conditions Y] = ∏_{j ∈ Y} p_j
    P[d satisfies a subset Y' of the structural conditions Y] = ∏_{j ∈ Y'} p_j · ∏_{j ∈ Y\Y'} (1 − p_j)
- Consider binary correlations between structural patterns and/or tag-term pairs (estimated from data sampling, query logs, etc.)

Example: //sec[//figure=“java”][//par=“xml”][//bib=“vldb”]

Characteristic patterns and estimated selectivities:

p1 = 0.682  //sec[//figure]//par
p2 = 0.001  //sec[//figure]//bib
p3 = 0.002  //sec[//par]//bib
p4 = 0.688  //sec//figure
p5 = 0.968  //sec//par
p6 = 0.002  //sec//bib
p7 = 0.023  //bib=“vldb”
p8 = 0.067  //par=“xml”
p9 = 0.011  //figure=“java”

(Figure: the pattern DAG rooted at sec; these selectivities feed EWC-RAs(d) for candidates d.)
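Under the independence assumption (before the correlation refinement), the structural probability for a candidate is a plain product over pattern selectivities; a sketch using three of the slide's values:

```python
# Estimated selectivities of characteristic patterns (from the slide).
selectivity = {
    '//sec[//figure]//par': 0.682,
    '//sec[//figure]//bib': 0.001,
    '//sec[//par]//bib':    0.002,
}

def p_satisfies_all(patterns):
    """P[d satisfies all structural conditions Y], assuming independence."""
    p = 1.0
    for pat in patterns:
        p *= selectivity[pat]
    return p

print(p_satisfies_all(['//sec[//figure]//par', '//sec[//figure]//bib']))
```

Low products like this one are exactly what lets BenProbe declare a candidate's structural lookups as likely wasted cost.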
Full-Content Score Predictor

- For each inverted list Li (i.e., each tag-term pair), approximate the local score distribution Si by an equi-width histogram
- Periodically test all d in the candidate queue; consider the aggregated score predictor obtained by convolving the histograms of the unevaluated dimensions, e.g., convolution (S1, S2) for title[xml] and par[evaluation], evaluated at the score deficit δ(d); this feeds EWC-RAc(d)

Probabilistic candidate pruning: drop d from the candidate queue, if
    P[d gets into the final top-k] < ε

(Figure: the index lists for title[xml] and par[evaluation], their score histograms S1 on [0, high1] and S2 on [0, high2], and the convolution (S1, S2) evaluated at δ(d).)
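A sketch of this convolution-based predictor under stated assumptions: equi-width histograms are treated as probability vectors over score bins, δ(d) is the score deficit a candidate must still close, and the two distributions below are made up.

```python
def convolve_histograms(h1, h2):
    """Convolution of two equi-width score histograms with the same bin
    width: bin k of the result holds P[S1 + S2 falls into sum-bin k]."""
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, a in enumerate(h1):
        for j, b in enumerate(h2):
            out[i + j] += a * b
    return out

def p_above(hist, bin_width, delta):
    """Conservative estimate of P[score sum > delta]: sum the mass of the
    bins lying entirely above delta (the bin containing delta is skipped)."""
    first_bin = int(delta / bin_width) + 1
    return sum(hist[first_bin:])

# Toy histograms over [0, 1) with bin width 0.25:
s1 = [0.1, 0.2, 0.3, 0.4]   # score distribution of title[xml]
s2 = [0.4, 0.3, 0.2, 0.1]   # score distribution of par[evaluation]
agg = convolve_histograms(s1, s2)   # distribution of S1 + S2 over [0, 2)

# Keep the candidate only if its unseen dimensions are likely to close
# the deficit delta(d):
epsilon = 0.1
keep = p_above(agg, 0.25, delta=1.0) >= epsilon
print(p_above(agg, 0.25, delta=1.0), keep)
```

The same aggregated histogram can be reused for all candidates sharing the set of unevaluated dimensions; only δ(d) differs per candidate.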
Data Collections & Competitors

INEX ‘04 benchmark setting:
- 12,223 docs; 12M elements; 119M index entries; 534 MB
- 46 queries with official relevance judgments, e.g., //article[.//bib=“QBIC” and .//par=“image retrieval”]

IMDB (Internet Movie Database):
- 386,529 docs; 34M elements; 130M index entries; 1,117 MB
- 20 queries, e.g., //movie[.//casting[.//actor=“John Wayne”] and .//role=“Sheriff”]//[.//year=“1959” and .//genre=“Western”]

Competitors:
- DBMS-style Join&Sort, using full index scans on the TopX schema
- StructIndex [Kaushik et al., SIGMOD ’04]: top-k with separate inverted indexes for content & structure; DataGuide-like structural index; full evaluations, hence no uncertainty about final document scores; no candidate queuing, eager random accesses
- StructIndex+: extent chaining technique for DataGuide-based extent identifiers (skip scans)
INEX Results

(Chart: #SA + #RA as a function of k for Join&Sort, StructIndex+, StructIndex, BenProbe, and MinProbe; y-axis up to 12,000,000.)

Method           k      ε    #SA          #RA        CPU sec  MAP@k  relPrec
TopX – MinProbe  10     0.0  635,507      64,807     0.03
TopX – BenProbe  10     0.0  723,169      84,424     0.07
TopX – BenProbe  1,000  0.0  882,929      1,902,427  0.35     0.03   1.00
StructIndex      10     n/a  761,970      325,068    0.37
StructIndex+     10     n/a  77,482       5,074,384  1.87     0.34   1.00
Join&Sort        10     n/a  109,122,318  –          0.26
IMDB Results

(Chart: #SA + #RA as a function of k for StructIndex, StructIndex+, MinProbe, and BenProbe; y-axis up to 3,000,000.)

Method           k   ε    #SA         #RA      CPU sec  relPrec
TopX – MinProbe  10  0.0  317,380     72,196   0.08
TopX – BenProbe  10  0.0  241,471     50,016   0.06
StructIndex      10  n/a  346,697     291,655  0.16
StructIndex+     10  n/a  22,445      301,647  0.17     1.00
Join&Sort        10  n/a  14,510,077  –        37.70
INEX with Probabilistic Pruning

(Chart: relPrec, P@10, and MAP, together with #SA + #RA, for TopX – MinProbe as ε grows from 0.0 to 1.0.)

TopX – MinProbe, k = 10:

ε     #SA      #RA     CPU sec  P@k   MAP@k  relPrec
0.00  635,507  64,807  0.03     0.08  0.34   1.00
0.25  392,395  56,952  0.05     0.07  0.34   0.77
0.50  231,109  48,963  0.02     0.08  0.31   0.65
0.75  102,118  42,174  0.01     0.08  0.33   0.51
1.00  36,936   35,327  0.01     0.09  0.30   0.38
Conclusions & Ongoing Work

Efficient and versatile TopX query processor:
- Extensible framework for text, semi-structured & structured data
- Probabilistic cost model for random access scheduling
- Very good precision/runtime ratio for probabilistic candidate pruning

Scalability:
- Optimized for runtime; exploits cheap disk space (factor 4-5 for INEX)
- Experiments on the TREC Terabyte text collection (see paper)

Support for typical IR extensions:
- Phrase matching, mandatory terms “+”, negation “-”
- Query weights (e.g., relevance feedback, ontological similarities)

- Dynamic and self-tuning query expansions [SIGIR ’05]: incrementally merges inverted lists on demand; dynamically opens scans on additional expansion terms
- Vague Content & Structure (VCAS) queries
Thank you!

Demo available!