
XML Retrieval and Evaluation: Where are we?


Page 1: XML Retrieval and Evaluation:  Where are we?

XML Retrieval and Evaluation: Where are we?

Mounia Lalmas

Queen Mary University of London

Page 2: XML Retrieval and Evaluation:  Where are we?

Outline

XML retrieval paradigm
– Focussed retrieval
– Overlapping elements

XML retrieval evaluation
– INEX
– Topics
– Relevance
– Assessments
– Metrics

Page 3: XML Retrieval and Evaluation:  Where are we?

Acknowledgements

Norbert Fuhr, Shlomo Geva, Norbert Gövert, Gabriella Kazai, Saadia Malik, Benjamin Piwowarski, Thomas Rölleke, Börkur Sigurbjörnsson, Zoltan Szlavik, Vu Huyen Trang, Andrew Trotman, Arjen de Vries

FERMI model (Chiaramella, Mulhem & Fourel, 1996)

DELOS EU Network of Excellence (FP5 & FP6)

ARC exchange programme (DAAD & British Council)

EPSRC (UK funding body)

Page 4: XML Retrieval and Evaluation:  Where are we?

XML: eXtensible Mark-up Language

Meta-language adopted by the W3C as a document format language

Describes content and logical structure

Use of XPath notation to refer to the XML structure

<book>
  <title> Structured Document Retrieval </title>
  <author>
    <fnm> Smith </fnm> <snm> John </snm>
  </author>
  <chapter>
    <title> Introduction into XML retrieval </title>
    <paragraph> …. </paragraph>
    …
  </chapter>
  …
</book>

chapter/title: title is a direct sub-component of chapter
//title: any title
chapter//title: title is a direct or indirect sub-component of chapter
chapter/paragraph[2]: any direct second paragraph of any chapter
chapter/*: all direct sub-components of a chapter
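For illustration (not part of the original slides), a minimal Python sketch that evaluates these XPath expressions over the sample book document; it assumes the third-party lxml library, since the standard library's xml.etree supports only a restricted XPath subset.

from lxml import etree

book_xml = """
<book>
  <title>Structured Document Retrieval</title>
  <author><fnm>Smith</fnm><snm>John</snm></author>
  <chapter>
    <title>Introduction into XML retrieval</title>
    <paragraph>First paragraph.</paragraph>
    <paragraph>Second paragraph.</paragraph>
  </chapter>
</book>
"""

root = etree.fromstring(book_xml.strip())

for xpath in ["//title",               # any title
              "chapter/title",         # title directly under chapter
              "chapter//title",        # title anywhere under chapter
              "chapter/paragraph[2]",  # second direct paragraph of a chapter
              "chapter/*"]:            # all direct sub-components of a chapter
    hits = root.xpath(xpath)
    print(xpath, "->", [e.tag for e in hits])
    # e.g. chapter/* -> ['title', 'paragraph', 'paragraph']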

Page 5: XML Retrieval and Evaluation:  Where are we?

XML Retrieval vs Information Retrieval

Traditional information retrieval is about finding relevant documents, e.g. an entire book.

XML retrieval allows users to retrieve document components of varying granularity (XML elements) that are more focussed, e.g. a subsection of a book instead of an entire book.

[Figure: a book decomposed into chapters, sections and subsections, contrasted with an unstructured document on the World Wide Web]

SEARCHING = QUERYING + BROWSING

Page 6: XML Retrieval and Evaluation:  Where are we?

Querying XML documents

With respect to content only (CO)
– Standard queries, but retrieving XML elements
– E.g. “London tube strikes”

With respect to content and structure (CAS)
– Constraints on the types of XML elements to be retrieved
– E.g. “Sections of an article in the Times about congestion charges”
– E.g. “Articles that contain sections about congestion charges in London, and that contain a picture of Ken Livingstone”
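For illustration only (not on the original slide), the last CAS example could be written roughly in NEXI, the query language adopted at INEX 2004 and shown on the content-and-structure topic later; the tag names article, sec and fig are assumed names for article, section and figure elements in the IEEE collection:

//article[about(.//sec, 'congestion charges London') and about(.//fig, 'Ken Livingstone')]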

Page 7: XML Retrieval and Evaluation:  Where are we?

Focussed Retrieval

Retrieve the best XML elements according to content and structure criteria:

Shakespeare study: best entry points, which are components from which many relevant components can be reached through browsing (Kazai, Lalmas & Reid, ECIR 2003)

INEX: most specific component that satisfies the query, while being exhaustive to the query

Should not return “related” elements, i.e. elements on the same path, i.e. overlapping elements

Page 8: XML Retrieval and Evaluation:  Where are we?

INEX: INitiative for the Evaluation of XML Retrieval

Started in 2002 and runs each year April - December, ending with a workshop at Schloss Dagstuhl, Germany

Funded by DELOS, EU network of excellence in digital libraries

Documents (~500MB): 12,107 articles in XML format from the IEEE Computer Society; 8 million elements!

INEX 2002: 60 topics; inex_eval metric

INEX 2003: 66 topics; enhanced subset of XPath; inex_eval and inex_eval_ng metrics

INEX 2004: 75 topics; NEXI, a subset of the 2003 XPath subset (Sigurbjörnsson & Trotman, INEX 2003); inex_eval, inex_eval_ng, XCG, t2i and ERR metrics

Page 9: XML Retrieval and Evaluation:  Where are we?

INEX Topics: Content-only

<title>Open standards for digital video in distance learning</title>

<description>Open technologies behind media streaming in distance learning projects</description>

<narrative>I am looking for articles/components discussing methodologies of digital video production and distribution that respect free access to media content through internet or via CD-ROMs or DVDs in connection to the learning process. Discussions of open versus proprietary standards of storing and sending digital video will be appreciated.</narrative>

<keywords>media streaming, video streaming, audio streaming, digital video, distance learning, open standards, free access</keywords>

Page 10: XML Retrieval and Evaluation:  Where are we?

INEX Topics: Content-and-structure

<title>//article[about(.,'formal methods verify correctness aviation systems')]//sec//*[about(.,'case study application model checking theorem proving')]</title>

<description>Find documents discussing formal methods to verify correctness of aviation systems. From those articles extract parts discussing a case study of using model checking or theorem proving for the verification.</description>

<narrative>To be considered relevant a document must be about using formal methods to verify correctness of aviation systems, such as flight traffic control systems, or airplane or helicopter parts. From those documents a section-part must be returned (I do not want the whole section, I want something smaller). That part should be about a case study of applying a model checker or a theorem prover to the verification.</narrative>

<keywords>SPIN, SMV, PVS, SPARK, CWB</keywords>

Page 11: XML Retrieval and Evaluation:  Where are we?

Content-and-structure topics: Restrictions

Returning “attribute” type elements (e.g. author, date) not allowed
– E.g. “return authors of articles containing sections on XML retrieval approaches”

Aboutness criterion must be specified - at least - in the target elements
– E.g. “return all paragraphs contained in sections that discuss XML retrieval approaches”

Branches not allowed
– E.g. “return sections about XML retrieval that are contained in articles that contain paragraphs about INEX experiments”

Are we imposing too many restrictions?

Page 12: XML Retrieval and Evaluation:  Where are we?

Ad hoc retrieval: Tasks

Content-only (CO): aim is to decrease user effort by pointing the user to the most relevant elements (2002, 2003, 2004)

Strict content-and-structure (SCAS): retrieve relevant elements that exactly match the structure specified in the query (2002, 2003)

Vague content-and-structure (VCAS):
– retrieve relevant elements that may not be the same as the target elements, but are structurally similar (2003)
– retrieve relevant elements even if they do not exactly meet the structural conditions; treat the structure specification as hints as to where to look (2004)


Page 13: XML Retrieval and Evaluation:  Where are we?

Relevance in XML retrieval

A document is relevant if it “has significant and demonstrable bearing on the matter at hand”.

Common assumptions in laboratory experimentation:
– Objectivity
– Topicality
– Binary nature
– Independence

[Figure: example article tree with sections s1, s2, s3 and subsections ss1, ss2, whose elements discuss “XML retrieval evaluation”, “XML retrieval” and “XML evaluation”]

Page 14: XML Retrieval and Evaluation:  Where are we?

Relevance in XML retrieval: INEX 2002

Relevance = (0,N) (1,S) (1,L) (1,E) (2,S) (2,L) (2,E) (3,S) (3,L) (3,E)

topical relevance = how much the element discusses the query: 0, 1, 2, 3

component coverage = how focused the element is on the query: N, S, L, E

If an element is relevant so must be its parent element, ...

Topicality not enough
Binary nature not enough
Independence is wrong

(Kazai, Lalmas, Fuhr & Gövert, JASIST 2004)


Page 15: XML Retrieval and Evaluation:  Where are we?

Relevance in XML retrieval: INEX 2003-4

Relevance = (0,0) (1,1) (1,2) (1,3) (2,1) (2,2) (2,3) (3,1) (3,2) (3,3)

exhaustivity = how much the element discusses the query: 0, 1, 2, 3

specificity = how focused the element is on the query: 0, 1, 2, 3

If an element is relevant so must be its parent element, ...

Topicality not enough
Binary nature not enough
Independence is wrong

(based on Chiaramella, Mulhem & Fourel, FERMI fetch and browse model 1996)


Page 16: XML Retrieval and Evaluation:  Where are we?

Relevance assessment task

Topics are assessed by the INEX participants

Pooling technique

Completeness

– Rules forcing assessors to assess related elements
– E.g. if an element is assessed relevant, its parent element and children elements must also be assessed
– …

Consistency

– Rules to enforce consistent assessments
– E.g. the parent of a relevant element must also be relevant - although to a different extent
– E.g. exhaustivity increases going up; specificity decreases going down
– …

(Piwowarski & Lalmas, CIKM 2004)
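As an aside (not part of the slides), a minimal sketch of how two of these rules could be checked automatically over a set of assessments: a relevant element's parent must itself be assessed relevant, and a parent's exhaustivity must be at least that of its children. The element paths and assessment values are invented for the example.

assessments = {
    "/article[1]":             (3, 1),   # hypothetical (exhaustivity, specificity) judgements
    "/article[1]/sec[1]":      (2, 3),
    "/article[1]/sec[1]/p[2]": (1, 3),
}

def parent(path):
    # e.g. "/article[1]/sec[1]" -> "/article[1]"; the root element has no parent
    head, _, _ = path.rpartition("/")
    return head or None

def check_consistency(assessments):
    problems = []
    for path, (exh, spec) in assessments.items():
        par = parent(path)
        if par is None or (exh, spec) == (0, 0):
            continue                       # root element, or not relevant: nothing to check
        if assessments.get(par, (0, 0)) == (0, 0):
            problems.append(f"{path} is relevant but its parent {par} is not")
        elif assessments[par][0] < exh:
            problems.append(f"exhaustivity of {par} is lower than that of its child {path}")
    return problems

print(check_consistency(assessments))      # [] -> no violations in this toy example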

Page 17: XML Retrieval and Evaluation:  Where are we?

Interface

[Screenshot of the online assessment interface, with panels for current assessments, navigation and groups]

Assessing a topic takes a week!

Average 2 topics per participant

Duplicate assessments (12 topics) in INEX 2004

(Piwowarski, INEX 2003)

Page 18: XML Retrieval and Evaluation:  Where are we?

Assessments: some results

Elements assessed in INEX 2003

26% of assessments on elements in the pool (66% in INEX 2002).

68% highly specific elements not in the pool

INEX 2002

23 inconsistent assessments per topic for one rule

Agreements in INEX 2004

12.19% non-zero agreements - 22% at article level

3.42% exact agreements - 7% at article level

Higher agreements for CAS topics

(Piwowarski & Lalmas, CORIA 2004; Kazai, Masood & Lalmas, ECIR 2004)

Page 19: XML Retrieval and Evaluation:  Where are we?

Retrieve the best XML elements according to content and structure criteria:

Most exhaustive and the most specific = (3,3)

Near misses = (3,3) (2,3) (1,3)

Near misses = (3,3) (3,2) (3,1)

Near misses = (3,3) (2,3) (1,3) (3,2) (3,1) (1,2) …

Focussed retrieval = no overlapping elements

should not retrieve an element and its child element, …
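A minimal sketch (not from the slides) of one way to produce an overlap-free result list: walk down the ranking and drop any element whose path is an ancestor or a descendant of an element already kept. The XPath-like element paths are invented for the example.

def remove_overlap(ranked_paths):
    """Keep the highest-ranked element on each path; drop its ancestors and descendants."""
    kept = []
    for path in ranked_paths:
        overlaps = any(path == k or path.startswith(k + "/") or k.startswith(path + "/")
                       for k in kept)
        if not overlaps:
            kept.append(path)
    return kept

run = ["/article[1]/sec[2]",
       "/article[1]/sec[2]/p[3]",    # descendant of the rank-1 element -> dropped
       "/article[1]",                # ancestor of the rank-1 element   -> dropped
       "/article[1]/sec[4]/p[1]"]
print(remove_overlap(run))
# ['/article[1]/sec[2]', '/article[1]/sec[4]/p[1]']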

Page 20: XML Retrieval and Evaluation:  Where are we?

Measuring effectiveness

Need to consider:

Two four-graded dimensions of relevance (exh, spec)

Overlapping elements in retrieval results

Overlapping elements in recall-base

Metrics (Kazai, INEX 2003)

inex_eval (also known as inex2002): official INEX 2004 metric

inex_eval_ng (also known as inex2003) (Gövert, Fuhr, Kazai & Lalmas, INEX 2003)

ERR (expected ratio of relevant units) (Piwowarski & Gallinari, INEX 2003)

XCG (XML cumulative gain) (Kazai, Lalmas & de Vries, SIGIR 2004)

t2i (tolerance to irrelevance) (de Vries, Kazai & Lalmas, RIAO 2004)

Page 21: XML Retrieval and Evaluation:  Where are we?

Two four-graded dimensions of relevance: Near misses

How to distinguish between (2,3) and (3,3), …, when evaluating retrieval results?

Several “user models”

– Impatient: only reward retrieval of highly exhaustive and specific elements, (3,3)
– Happy as long as specific: only reward retrieval of highly specific elements, (3,3), (2,3), (1,3)
– …
– Very patient: reward - to a different extent - the retrieval of any relevant element, i.e. everything apart from (0,0)

Use a quantisation function for each “user model”

Page 22: XML Retrieval and Evaluation:  Where are we?

Examples of quantisation functions

Impatient:

  f_strict(exh, spec) = 1 if (exh, spec) = (3,3)
                        0 otherwise

Very patient:

  f_generalised(exh, spec) = 1.00 if (exh, spec) = (3,3)
                             0.75 if (exh, spec) ∈ {(2,3), (3,2), (3,1)}
                             0.50 if (exh, spec) ∈ {(1,3), (2,2), (2,1)}
                             0.25 if (exh, spec) ∈ {(1,1), (1,2)}
                             0.00 if (exh, spec) = (0,0)
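A minimal Python sketch (not part of the slides) of these two quantisation functions; element assessments are (exh, spec) pairs on the 0-3 scales defined earlier.

def f_strict(exh, spec):
    # "Impatient" user model: only the most exhaustive and most specific elements count
    return 1.0 if (exh, spec) == (3, 3) else 0.0

def f_generalised(exh, spec):
    # "Very patient" user model: every relevant element earns some credit
    if (exh, spec) == (3, 3):
        return 1.00
    if (exh, spec) in {(2, 3), (3, 2), (3, 1)}:
        return 0.75
    if (exh, spec) in {(1, 3), (2, 2), (2, 1)}:
        return 0.50
    if (exh, spec) in {(1, 1), (1, 2)}:
        return 0.25
    return 0.00

print(f_strict(2, 3), f_generalised(2, 3))   # 0.0 0.75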

Page 23: XML Retrieval and Evaluation:  Where are we?

inex_eval

• Based on precall (Raghavan, Bollmann & Jung, TOIS 1989), itself based on expected search length (Cooper, JASIS 1968)

• Use several quantisation functions

• Overall performance as simple average across quantisation functions (INEX 2004)

• Report an overlap indicator (INEX 2004)

• Does not consider overlap in retrieval results

• Does not consider overlap in recall-base

Page 24: XML Retrieval and Evaluation:  Where are we?

Overlap in retrieval results

Not very focussed retrieval!

Page 25: XML Retrieval and Evaluation:  Where are we?

Overlap in retrieval results

Official INEX 2004 effectiveness results for CO topics:

Rank  System (run)                                           Avg Prec  % Overlap
1.    IBM Haifa Research Lab (CO-0.5-LAREFIENMENT)           0.1437    80.89
2.    IBM Haifa Research Lab (CO-0.5)                        0.1340    81.46
3.    University of Waterloo (Waterloo-Baseline)             0.1267    76.32
4.    University of Amsterdam (UAms-CO-T-FBack)              0.1174    81.85
5.    University of Waterloo (Waterloo-Expanded)             0.1173    75.62
6.    Queensland University of Technology (CO_PS_Stop50K)    0.1073    75.89
7.    Queensland University of Technology (CO_PS_099_049)    0.1072    76.81
8.    IBM Haifa Research Lab (CO-0.5-Clustering)             0.1043    81.10
9.    University of Amsterdam (UAms-CO-T)                    0.1030    71.96
10.   LIP6 (simple)                                          0.0921    64.29

Not very focussed retrieval!

Page 26: XML Retrieval and Evaluation:  Where are we?

Overlap in recall-base

100% recall only if all relevant elements are returned, meaning returning overlapping elements

Contradicts aim of focussed retrieval!

Page 27: XML Retrieval and Evaluation:  Where are we?

Quantisations

                         Linear correlation coefficient   Spearman rank coefficient
strict - generalised     0.89                             0.92
strict - spec            0.84                             0.89
strict - e3s321          0.97                             0.98
strict - e321s3          0.80                             0.87
generalised - spec       0.99                             0.99
generalised - e3s321     0.87                             0.91
generalised - e321s3     0.97                             0.98

Do we need all these quantisation functions? Do we need such a complex relevance definition and assessments? …

Page 28: XML Retrieval and Evaluation:  Where are we?

inex_eval_ng

Exhaustivity and specificity based on the notion of an ideal concept space (Wong & Yao, TOIS 1995), upon which precision and recall are defined.

Considers overlap in retrieval results

Does not consider overlap in recall-base

Page 29: XML Retrieval and Evaluation:  Where are we?

Quantisations

strict vs generalised (Pearson’s correlation coefficient):

metric         2003 runs   2002 runs
inex_eval      0.92        0.94
inex_eval_ng   0.87        0.95

Do we need all these quantisation functions? Do we need such a complex relevance definition and assessments? …

Page 30: XML Retrieval and Evaluation:  Where are we?

inex_eval vs inex_eval_ng

Pearson’s correlation coefficient:

quantisation   2003 runs   2002 runs
strict         0.79        0.95
generalised    0.69        0.93

Which metric to use?

Page 31: XML Retrieval and Evaluation:  Where are we?

Overlap in recall-base

As for inex_eval, 100% recall only if all relevant elements are returned, meaning returning overlapping elements

Contradicts aim of focussed retrieval!

Page 32: XML Retrieval and Evaluation:  Where are we?

XCG: XML cumulated gain

• Based on the cumulated gain measure for IR (Järvelin & Kekäläinen, TOIS 2002)

• Accumulates gain obtained by retrieving elements at fixed ranks; not based on precision and recall

• Requires the construction of an ideal recall-base and associated ideal run, with which retrieval runs are compared

• Considers overlap in both retrieval results and recall-base
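For illustration (not part of the slides), a minimal sketch of a cumulated-gain style computation in the spirit of XCG rather than its exact INEX definition: each retrieved element contributes a gain obtained by quantising its (exh, spec) assessment, and the run's cumulated gain at each rank is normalised by that of an ideal ranking. Paths, assessments and quantisation values are invented for the example.

# Generalised quantisation values; any (exh, spec) pair not listed contributes 0.0
QUANT = {(3, 3): 1.00, (2, 3): 0.75, (3, 2): 0.75, (3, 1): 0.75,
         (1, 3): 0.50, (2, 2): 0.50, (2, 1): 0.50,
         (1, 1): 0.25, (1, 2): 0.25}

# Hypothetical (exhaustivity, specificity) judgements for three elements
assessments = {"/article[1]/sec[2]": (3, 3),
               "/article[1]/sec[4]": (2, 3),
               "/article[1]":        (3, 1)}

def gain(element):
    return QUANT.get(assessments.get(element, (0, 0)), 0.0)

def normalised_cumulated_gain(run, ideal_run):
    """nCG at each rank: cumulated gain of the run divided by that of the ideal run."""
    cg = icg = 0.0
    ncg = []
    for ranked, best in zip(run, ideal_run):
        cg += gain(ranked)
        icg += gain(best)
        ncg.append(cg / icg if icg else 0.0)
    return ncg

ideal_run = ["/article[1]/sec[2]", "/article[1]/sec[4]", "/article[1]"]   # sorted by gain
system_run = ["/article[1]", "/article[1]/sec[2]", "/article[1]/sec[4]"]
print(normalised_cumulated_gain(system_run, ideal_run))   # [0.75, 1.0, 1.0]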

Page 33: XML Retrieval and Evaluation:  Where are we?

XCG: Top 10 INEX 2004 runs

Rank  System (run)                                             [inex_eval rank]  MAnCG   % Overlap
1.    U. of Tampere (UTampere CO average)                      [40]              0.3725  0
2.    U. of Tampere (UTampere CO fuzzy)                        [42]              0.3699  0
3.    U. of Amsterdam (UAms-CO-T-FBack-NoOverl)                [41]              0.3521  0
4.    Oslo U. College (4-par-co)                               [24]              0.3418  0.07
5.    U. of Tampere (UTampere CO overlap)                      [25]              0.3328  39.58
6.    LIP6 (bn-m1-eqt-porder-eul-o.df.t-parameters-00700)      [27]              0.3247  74.31
7.    U. California, Berkeley (Berkeley CO FUS T CMBZ FDBK)    [23]              0.3182  50.22
8.    U. of Waterloo (Waterloo-Filtered)                       [43]              0.3181  13.35
9.    LIP6 (bn-m2-eqt-porder-o.df.t-parameters-00195)          [22]              0.3098  64.2
10.   LTI, CMU (Lemur CO KStem Mix02 Shrink01)                 [4]               0.2953  73.02

[?] = rank by inex_eval

Which metric to use?

Page 34: XML Retrieval and Evaluation:  Where are we?

Conclusion and future work

XML retrieval is not just about the effective retrieval of XML documents, but also about what and how to evaluate

Is the overlap problem a real problem? Currently: analysis of the evaluation results (QMUL, U Duisburg-Essen, LIP6/U Chile and CMU)

INEX 2004
– Interactive track
– Heterogeneous collection track
– Relevance feedback track
– Natural language track

INEX 2005
– Sort out how to measure performance!!!!
– Understanding what users want and how they behave?
– New collection with more realistic information needs - topics
– …
– Possible tracks: document mining, question-answering and multimedia

Page 35: XML Retrieval and Evaluation:  Where are we?

Dank jullie wel! (Thank you!)