An Efficient and Versatile Query Engine for TopX Search

Martin Theobald, Ralf Schenkel, Gerhard Weikum
Max-Planck Institute for Informatics, Saarbrücken, Germany

VLDB ‘05
An XML-IR Scenario…

//article[//sec[about(.//, “XML retrieval”)]
          //par[about(.//, “native XML database”)]
]//bib[about(.//item, “W3C”)]

(Figure: two example article trees. The first contains a sec with title “Current Approaches to XML Data Manage­ment.”, pars such as “Data management systems control data acquisition, storage, and retrieval. Systems evolved from flat files …”, “XML queries with an expressive power similar to that of Datalog …”, and “Native XML database systems can store schemaless data ...”, and bib items with titles “XML-QL: A Query Language for XML.” (inproc “Proc. Query Languages Workshop, W3C, 1998.”) and “Native XML databases.”. The second contains sec titles “The XML Files”, “The Ontology Game”, and “The Dirty Little Secret”, pars such as “Sophisticated technologies developed by smart people.”, “What does XML add for retrieval? It adds formal ways …”, and “There, I've said it - the "O" word. If anyone is thinking along ontology lines, I would like to break some old news …”, and a bib item with url “w3c.org/xml” and title “XML”.)
TopX: Efficient XML-IR

- Extend top-k query processing algorithms for sorted lists [Buckley ’85; Güntzer, Balke & Kießling ’00; Fagin ‘01] to XML data
- Non-schematic, heterogeneous data sources
- Combined inverted index for content & structure
- Avoid full index scans; postpone expensive random accesses to large disk-resident data structures
- Exploit cheap disk space for redundant indexing

Goal: Efficiently retrieve the best results of a similarity query
XML-IR: History and Related Work

Timeline: 1995 – 2000 – 2005

IR on structured data (SGML): OED etc. (U Waterloo), HySpirit (U Dortmund), HyperStorM (GMD Darmstadt), WHIRL (CMU)

Web query languages: Lorel (Stanford U), Araneus (U Roma), W3QS (Technion Haifa), WebSQL (U Toronto)

XML query languages: XML-QL (AT&T Labs), XPath 1.0 (W3C), XPath 2.0 (W3C), XQuery (W3C), TeXQuery (AT&T Labs), XPath & XQuery Full-Text (W3C), NEXI (INEX Benchmark)

IR on XML: XIRQL (U Dortmund / Essen), XXL & TopX (U Saarland / MPII), ApproXQL (U Berlin / U Munich), ELIXIR (U Dublin), JuruXML (IBM Haifa), XSearch (Hebrew U), Timber (U Michigan), XRank & Quark (Cornell U), FleXPath (AT&T Labs), XKeyword (UCSD)

Commercial software: MarkLogic, Verity?, IBM?, Oracle?, ...
Outline
Data & Scoring model
Database schema & indexing
Top-k query processing for XML
Scheduling & probabilistic candidate pruning
Experiments & Conclusions
Data Model

- Simplified XML model, disregarding IDRef & XLink/XPointer
- Redundant full-content strings per element
- Per-element term frequencies ftf(ti, e) for full contents
- Pre/postorder labels for each tag-term pair

<article>
  <title>XML-IR</title>
  <abs>IR techniques for XML</abs>
  <sec>
    <title>Clustering on XML</title>
    <par>Evaluation</par>
  </sec>
</article>

(Figure: the element tree of this document, annotated with pre/postorder labels and stemmed full-content strings, e.g., “xml ir ir technique xml clustering xml evaluation” for the article element and “clustering xml evaluation” for the sec element; hence ftf(“xml”, article1) = 3.)
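The pre/postorder labels make ancestor/descendant tests a constant-time comparison. A minimal sketch (the `Elem` structure and the concrete labels are illustrative, not the actual TopX data structures):

```python
# Each element carries (pre, post) labels from a pre- and postorder traversal.
# e1 is an ancestor of e2 iff e1.pre < e2.pre and e1.post > e2.post.
from collections import namedtuple

Elem = namedtuple("Elem", ["tag", "pre", "post"])

def is_ancestor(e1, e2):
    """True iff e1 lies on the path from the document root to e2."""
    return e1.pre < e2.pre and e1.post > e2.post

article = Elem("article", 1, 6)   # root: smallest pre, largest post
title = Elem("title", 2, 1)
par = Elem("par", 6, 4)

print(is_ancestor(article, par))  # the article subtree contains the par
print(is_ancestor(title, par))    # siblings' subtrees never nest
```

This test is what lets the engine evaluate descendant constraints incrementally without touching the document tree itself.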
Full-Content Scoring Model

- Full-content scores cast into an Okapi BM25 probabilistic model with element-specific model parameterization
- Basic scoring idea within the IR-style family of TF·IDF ranking functions
- Additional static score mass c for relaxable structural conditions

Element statistics:

tag      N          avg. length  k1    b
article  16,850     2,903        10.5  0.75
sec      96,709     413          10.5  0.75
par      1,024,907  32           10.5  0.75
fig      109,230    13           10.5  0.75
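As an illustration of element-specific BM25 parameterization (a textbook BM25 variant, not necessarily the paper's exact formula; `df` and the element length below are made-up values):

```python
import math

def bm25_score(ftf, N, df, elem_len, avg_len, k1=10.5, b=0.75):
    """BM25-style score of one term for one element.

    ftf      -- full-content term frequency ftf(t, e)
    N        -- number of elements with this tag (per-tag statistic)
    df       -- number of such elements whose full content contains the term
    elem_len -- full-content length of the element
    avg_len  -- average full-content length for the tag
    """
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    norm = ftf + k1 * ((1 - b) + b * elem_len / avg_len)
    return idf * (ftf * (k1 + 1)) / norm

# Score "xml" for an article element with ftf = 3, using the slide's
# article statistics (N = 16,850, avg. length = 2,903).
print(bm25_score(ftf=3, N=16_850, df=5_000, elem_len=2_400, avg_len=2_903))
```

The per-tag statistics table above is exactly what such a scorer needs: each tag gets its own N, average length, and (here constant) k1 and b.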
Inverted Block-Index for Content & Structure

Inverted index over tag-term pairs (full contents):
- Benefits from the increased selectivity of combined tag-term pairs
- Accelerates the child-or-descendant axis, e.g., sec//”clustering”

Index list for sec[clustering]:

eid  docid  score  pre  post  max-score
46   2      0.9    2    15    0.9
9    2      0.5    10   8     0.9
171  5      0.85   1    20    0.85
84   3      0.1    1    12    0.1

Index list for title[xml]:

eid  docid  score  pre  post  max-score
216  17     0.9    2    15    0.9
72   3      0.8    10   8     0.8
51   2      0.5    4    12    0.5
671  31     0.4    12   23    0.4

Index list for par[evaluation]:

eid  docid  score  pre  post  max-score
3    1      1.0    1    21    1.0
28   2      0.8    8    14    0.8
182  5      0.75   3    7     0.75
96   4      0.75   6    4     0.75

Sequential block scans:
- Elements are reordered in descending order of (max-score, docid, score) per list
- All tag-term pairs per doc are fetched in one sequential block access
- docid limits the range of in-memory structural joins

Stored as inverted files or database tables (B+-tree indexes)
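A minimal in-memory sketch of such a block list (hypothetical `Entry` structure): entries are sorted by descending (max-score, docid, score), so all entries of one document become contiguous and one "sorted access" can return a whole per-document block, mimicking one sequential disk access.

```python
from dataclasses import dataclass
from itertools import groupby

@dataclass
class Entry:
    eid: int; docid: int; score: float; pre: int; post: int; max_score: float

def block_scan(entries):
    """Yield per-document blocks in descending (max_score, score) order.

    After sorting, all entries of one document are contiguous, so each
    yielded block corresponds to one sequential block access on disk."""
    ordered = sorted(entries, key=lambda e: (-e.max_score, e.docid, -e.score))
    for docid, block in groupby(ordered, key=lambda e: e.docid):
        yield docid, list(block)

# The sec[clustering] list from the slide:
sec_clustering = [
    Entry(46, 2, 0.9, 2, 15, 0.9), Entry(9, 2, 0.5, 10, 8, 0.9),
    Entry(171, 5, 0.85, 1, 20, 0.85), Entry(84, 3, 0.1, 1, 12, 0.1),
]
for docid, block in block_scan(sec_clustering):
    print(docid, [e.eid for e in block])   # doc 2 first (max-score 0.9)
```

Note that `groupby` only groups consecutive equal keys, which is exactly why the (max-score, docid, score) sort order matters.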
Navigational Index

- Additional non-redundant element directory
- Supports element paths and branching path queries
- Random accesses using (docid, tag) as key
- Schema-oblivious indexing & querying

Index list for sec:

eid  docid  pre  post
46   2      2    15
9    2      10   8
171  5      1    20
84   3      1    12

Index list for title:

eid  docid  pre  post
216  17     2    15
72   3      10   8
51   2      4    12
671  31     12   23

Index list for par:

eid  docid  pre  post
3    1      1    21
28   2      8    14
182  5      3    7
96   4      6    4
TopX Query Processing

Adapt the Threshold Algorithm (TA) paradigm [Buckley & Lewit, SIGIR ‘85; Güntzer et al., VLDB ’00; Fagin et al., PODS ’01]:
- Focus on inexpensive sequential/sorted accesses
- Postpone expensive random accesses

Candidate d = connected sub-pattern with element ids and scores
- Incrementally evaluate path constraints using pre/postorder labels
- In-memory structural joins (nested-loops, staircase, or holistic twig joins)

Upper/lower score guarantees per candidate; remember the set of evaluated dimensions E(d):
    worstscore(d) = Σ_{i ∈ E(d)} score(ti, e)
    bestscore(d) = worstscore(d) + Σ_{i ∉ E(d)} high_i

Early threshold termination with candidate queuing: stop once bestscore(d) ≤ min-k for every queued candidate d, where min-k is the worstscore of the current rank-k result

Extensions [Theobald, Schenkel & Weikum, VLDB ’04]:
- Batching of sorted accesses & efficient queue management
- Cost model for random access scheduling
- Probabilistic candidate pruning for approximate top-k results
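A toy sketch of such a TA-style loop (not the TopX implementation: scores are aggregated per document only, and structural joins, block batching, and random accesses are omitted). It runs over per-document scores taken from the slide's three index lists:

```python
import heapq

def topx_ta(lists, k):
    """Round-robin sorted accesses over descending-score lists; stop when
    no candidate's bestscore can still beat min-k (worstscore of rank k).
    Each list is [(docid, score), ...] sorted by descending score."""
    seen = {}                          # docid -> {list index: score}
    high = [l[0][1] for l in lists]    # current upper bound per list
    pos = [0] * len(lists)
    while True:
        progressed = False
        for i, lst in enumerate(lists):        # one sorted access per list
            if pos[i] < len(lst):
                docid, s = lst[pos[i]]
                pos[i] += 1
                seen.setdefault(docid, {})[i] = s
                high[i] = lst[pos[i]][1] if pos[i] < len(lst) else 0.0
                progressed = True
        worst = {d: sum(sc.values()) for d, sc in seen.items()}
        top = heapq.nlargest(k, worst.items(), key=lambda x: x[1])
        min_k = top[-1][1] if len(top) == k else 0.0
        best = {d: worst[d] + sum(h for i, h in enumerate(high)
                                  if i not in seen[d]) for d in seen}
        if not progressed or all(best[d] <= min_k or (d, worst[d]) in top
                                 for d in seen):
            return top

lists = [[(2, 0.9), (5, 0.85), (3, 0.1)],   # sec[clustering], per doc
         [(17, 0.9), (3, 0.8), (2, 0.5)],   # title[xml]
         [(1, 1.0), (2, 0.8), (5, 0.75)]]   # par[evaluation]
print(topx_ta(lists, k=2))
```

On this input the loop terminates with doc 2 (worstscore 2.2) and doc 5 (worstscore 1.6) as the top-2, matching the example on the next slide.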
TopX Query Processing By Example

Query twig: sec[clustering] branching into title[xml] and par[evaluation]

(Figure: animation of the threshold algorithm over the three index lists for sec[clustering], title[xml], and par[evaluation]. Sorted accesses proceed block by block; candidates such as doc1, doc2, doc3, doc5, and doc17 are kept in a candidate queue with [worstscore, bestscore] intervals that tighten as more dimensions are seen, while pseudo-elements with worstscore 0.0 bound the bestscores of not-yet-seen candidates. The threshold min-2 rises from 0.0 over 0.5 and 0.9 to 1.6; candidates whose bestscore drops below min-2 are pruned, and the top-2 results are returned: doc2 (elements 46, 28, 51; worstscore 2.2) and doc5 (elements 171, 182; worstscore 1.6).)
Incremental Path Validations

Query: //article[//sec//par//“xml java”]//bib//item//title//“security”

- Complex query DAGs; transitive closure of descendant (child-or-descendant) constraints
- Aggregate the additional static score mass c for a structural condition i, if all edges rooted at i are satisfiable
- Incrementally test structural constraints:
  - Quickly decrease bestscores for early pruning
  - Schedule random accesses in ascending order of structural selectivities

(Figure: query DAG over article, sec, par=“xml”, par=“java”, bib, item, and title=“security”, with static score mass c = 1.0 per structural condition. A “promising candidate” with content scores 0.8 and 0.7 starts at worstscore(d) = 1.5, bestscore(d) = 6.5; random accesses resolve the unknown structural intervals [0.0, high_i] one by one, dropping bestscore(d) to 5.5 and then 4.5, below min-k = 4.8, so the candidate can be pruned.)
Random Access Scheduling – Minimal Probes (MinProbe)

Structural conditions as “soft filters” (expensive predicates & minimal probes [Chang & Hwang, SIGMOD ‘02]):
- Schedule random accesses only for the most promising candidates
- Schedule a batch of RAs on d, if
    worstscore(d) + o_d · c > min-k
  where worstscore(d) is the evaluated content- & structure-related score and o_d · c is the unevaluated structural score mass (constant c per open condition)

(Figure: the query DAG for //article[//sec//par//“xml java”]//bib//item//title//“security” with c = 1.0 per structural condition.)
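The probe condition itself is a one-liner; a sketch with the slide's numbers (c = 1.0, min-k = 4.8; the function name and candidate fields are illustrative):

```python
def should_probe(worstscore, open_conditions, min_k, c=1.0):
    """MinProbe test: probe d only if resolving its o_d open structural
    conditions (each worth the constant score mass c) could lift its
    worstscore past the current min-k threshold."""
    return worstscore + open_conditions * c > min_k

# Candidate with worstscore 1.5:
print(should_probe(1.5, 4, min_k=4.8))   # 5.5 > 4.8  -> worth probing
print(should_probe(1.5, 3, min_k=4.8))   # 4.5 <= 4.8 -> skip the probes
```

Because the unevaluated structural score mass is constant, this test needs no index access at all, which is what makes the filter cheap.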
Cost-Based Scheduling – BenProbe

Analytic cost model; basic idea: compare expected random access costs to an optimal schedule
- Access costs on d are wasted, if d does not make it into the final top-k (considering both content & structure)
- Compare different Expected Wasted Costs (EWC):
  - EWC-RAs(d) of looking up d in the structure
  - EWC-RAc(d) of looking up d in the content
  - EWC-SA(d) of not seeing d in the next batch of b sorted accesses
- Schedule a batch of RAs on d, if the RA-side cost is below the SA-side cost:
    EWC-RAs|c(d) · cRA < EWC-SA · cSA
Structural Selectivity Estimator

- Split the query into a set of characteristic patterns, e.g., twigs, descendants & tag-term pairs
- Consider structural selectivities (assuming independence between patterns):
    P[d satisfies all structural conditions Y] = ∏_{j ∈ Y} p_j
    P[d satisfies a subset Y' of the structural conditions Y] = ∏_{j ∈ Y'} p_j · ∏_{j ∈ Y\Y'} (1 − p_j)
- Consider binary correlations between structural patterns and/or tag-term pairs (estimated from data sampling, query logs, etc.)

Example: //sec[//figure=“java”][//par=“xml”][//bib=“vldb”]

Characteristic patterns and estimated selectivities:

p1 = 0.682  //sec[//figure]//par
p2 = 0.001  //sec[//figure]//bib
p3 = 0.002  //sec[//par]//bib
p4 = 0.688  //sec//figure
p5 = 0.968  //sec//par
p6 = 0.002  //sec//bib
p7 = 0.023  //bib=“vldb”
p8 = 0.067  //par=“xml”
p9 = 0.011  //figure=“java”

(Figure: the pattern DAG rooted at sec; these selectivities feed EWC-RAs(d) for candidates d.)
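Under the independence assumption (before the correlation refinement), the structural probability for a candidate is a plain product over pattern selectivities; a sketch using three of the slide's values:

```python
# Estimated selectivities of characteristic patterns (from the slide).
selectivity = {
    '//sec[//figure]//par': 0.682,
    '//sec[//figure]//bib': 0.001,
    '//sec[//par]//bib':    0.002,
}

def p_satisfies_all(patterns):
    """P[d satisfies all structural conditions Y], assuming independence."""
    p = 1.0
    for pat in patterns:
        p *= selectivity[pat]
    return p

print(p_satisfies_all(['//sec[//figure]//par', '//sec[//figure]//bib']))
```

Low products like this one are exactly what lets BenProbe declare a candidate's structural lookups as likely wasted cost.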
Full-Content Score Predictor

- For each inverted list Li (i.e., each tag-term pair), approximate the local score distribution Si by an equi-width histogram
- Periodically test all d in the candidate queue; consider the aggregated score predictor obtained by convolving the histograms of the unevaluated dimensions, e.g., convolution (S1, S2) for title[xml] and par[evaluation], evaluated at the score deficit δ(d); this feeds EWC-RAc(d)

Probabilistic candidate pruning: drop d from the candidate queue, if
    P[d gets into the final top-k] < ε

(Figure: the index lists for title[xml] and par[evaluation], their score histograms S1 on [0, high1] and S2 on [0, high2], and the convolution (S1, S2) evaluated at δ(d).)
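A sketch of this convolution-based predictor under stated assumptions: equi-width histograms are treated as probability vectors over score bins, δ(d) is the score deficit a candidate must still close, and the two distributions below are made up.

```python
def convolve_histograms(h1, h2):
    """Convolution of two equi-width score histograms with the same bin
    width: bin k of the result holds P[S1 + S2 falls into sum-bin k]."""
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, a in enumerate(h1):
        for j, b in enumerate(h2):
            out[i + j] += a * b
    return out

def p_above(hist, bin_width, delta):
    """Conservative estimate of P[score sum > delta]: sum the mass of the
    bins lying entirely above delta (the bin containing delta is skipped)."""
    first_bin = int(delta / bin_width) + 1
    return sum(hist[first_bin:])

# Toy histograms over [0, 1) with bin width 0.25:
s1 = [0.1, 0.2, 0.3, 0.4]   # score distribution of title[xml]
s2 = [0.4, 0.3, 0.2, 0.1]   # score distribution of par[evaluation]
agg = convolve_histograms(s1, s2)   # distribution of S1 + S2 over [0, 2)

# Keep the candidate only if its unseen dimensions are likely to close
# the deficit delta(d):
epsilon = 0.1
keep = p_above(agg, 0.25, delta=1.0) >= epsilon
print(p_above(agg, 0.25, delta=1.0), keep)
```

The same aggregated histogram can be reused for all candidates sharing the set of unevaluated dimensions; only δ(d) differs per candidate.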
Data Collections & Competitors

INEX ‘04 benchmark setting:
- 12,223 docs; 12M elements; 119M index entries; 534 MB
- 46 queries with official relevance judgments, e.g., //article[.//bib=“QBIC” and .//par=“image retrieval”]

IMDB (Internet Movie Database):
- 386,529 docs; 34M elements; 130M index entries; 1,117 MB
- 20 queries, e.g., //movie[.//casting[.//actor=“John Wayne”] and .//role=“Sheriff”]//[.//year=“1959” and .//genre=“Western”]

Competitors:
- DBMS-style Join&Sort, using full index scans on the TopX schema
- StructIndex [Kaushik et al., SIGMOD ’04]: top-k with separate inverted indexes for content & structure; DataGuide-like structural index; full evaluations, hence no uncertainty about final document scores; no candidate queuing, eager random accesses
- StructIndex+: extent chaining technique for DataGuide-based extent identifiers (skip scans)
INEX Results

(Chart: #SA + #RA as a function of k for Join&Sort, StructIndex+, StructIndex, BenProbe, and MinProbe; y-axis up to 12,000,000.)

Method           k      ε    #SA          #RA        CPU sec  MAP@k  relPrec
TopX – MinProbe  10     0.0  635,507      64,807     0.03
TopX – BenProbe  10     0.0  723,169      84,424     0.07
TopX – BenProbe  1,000  0.0  882,929      1,902,427  0.35     0.03   1.00
StructIndex      10     n/a  761,970      325,068    0.37
StructIndex+     10     n/a  77,482       5,074,384  1.87     0.34   1.00
Join&Sort        10     n/a  109,122,318  –          0.26
IMDB Results

(Chart: #SA + #RA as a function of k for StructIndex, StructIndex+, MinProbe, and BenProbe; y-axis up to 3,000,000.)

Method           k   ε    #SA         #RA      CPU sec  relPrec
TopX – MinProbe  10  0.0  317,380     72,196   0.08
TopX – BenProbe  10  0.0  241,471     50,016   0.06
StructIndex      10  n/a  346,697     291,655  0.16
StructIndex+     10  n/a  22,445      301,647  0.17     1.00
Join&Sort        10  n/a  14,510,077  –        37.70
INEX with Probabilistic Pruning

(Chart: relPrec, P@10, and MAP, together with #SA + #RA, for TopX – MinProbe as ε grows from 0.0 to 1.0.)

TopX – MinProbe, k = 10:

ε     #SA      #RA     CPU sec  P@k   MAP@k  relPrec
0.00  635,507  64,807  0.03     0.08  0.34   1.00
0.25  392,395  56,952  0.05     0.07  0.34   0.77
0.50  231,109  48,963  0.02     0.08  0.31   0.65
0.75  102,118  42,174  0.01     0.08  0.33   0.51
1.00  36,936   35,327  0.01     0.09  0.30   0.38
Conclusions & Ongoing Work

Efficient and versatile TopX query processor:
- Extensible framework for text, semi-structured & structured data
- Probabilistic cost model for random access scheduling
- Very good precision/runtime ratio for probabilistic candidate pruning

Scalability:
- Optimized for runtime; exploits cheap disk space (factor 4-5 for INEX)
- Experiments on the TREC Terabyte text collection (see paper)

Support for typical IR extensions:
- Phrase matching, mandatory terms “+”, negation “-”
- Query weights (e.g., relevance feedback, ontological similarities)

- Dynamic and self-tuning query expansions [SIGIR ’05]: incrementally merges inverted lists on demand; dynamically opens scans on additional expansion terms
- Vague Content & Structure (VCAS) queries
Thank you!

Demo available!