XML Retrieval: from modelling to evaluation
Mounia Lalmas
Queen Mary University of London
qmir.dcs.qmul.ac.uk
Outline
Structured document retrieval
XML
Content-oriented XML retrieval
Evaluation
Outline
Structured document retrieval
XML
Content-oriented XML retrieval
Evaluation
Structured Document Retrieval
Traditional IR is about finding relevant documents to a user’s information need, e.g. entire book.
SDR allows users to retrieve document components that are more focussed to their information needs, e.g a chapter of a book instead of an entire book.
The structure of documents is exploited to identify which document components to retrieve.
Structured Documents
Linear order of words, sentences, paragraphs …
Hierarchy or logical structure of a book’s chapters, sections …
Links (hyperlink), cross-references, citations …
Temporal and spatial relationships in multimedia documents
Book
Chapters
Sections
Paragraphs
World Wide Web
This is only only another to look one le to show the need an la a out structure of and more a document and so ass to it doe not necessary text a structured document have retrieval on the web is an it important topic of today’s research it issues to make se last sentence..
Structured Documents
Explicit structure formalised through document representation standards
(Mark-up Languages)
Layout LaTeX (publishing), HTML (Web publishing)
Structure SGML, XML (Web publishing, engineering),
MPEG-7 (broadcasting)
Content/Semantic RDF, DAML + OIL, OWL (semantic web)
World Wide Web
This is only only another to look one le to show the need an la a out structure of and more a document and so ass to it doe not necessary text a structured document have retrieval on the web is an it important topic of today’s research it issues to make se last sentence..
<b><font size=+2>SDR</font></b><img src="qmir.jpg" border=0>
<section> <subsection> <paragraph>… </paragraph> <paragraph>… </paragraph> </subsection></section>
<Book rdf:about=“book”> <rdf:author=“..”/> <rdf:title=“…”/></Book>
Outline
Structured document retrieval
XML
Content-oriented XML retrieval
Evaluation
XML: eXtensible Mark-up Language• Meta-language (user-defined tags) being adopted as the
document format language by W3C
• Used to describe content and structure (and not layout)
• Grammar described in DTD ( used for validation)
<lecture> <title> Structured Document Retrieval </title> <author> <fnm> Smith </fnm> <snm> John </snm> </author> <chapter> <title> Introduction into XML retrieval </title> <paragraph> …. </paragraph> … </chapter> …</lecture> <!ELEMENT lecture (title,
author+,chapter+)><!ELEMENT author (fnm*,snm)><!ELEMENT fnm #PCDATA>…
XML: eXtensible Mark-up Language
Use of XPath notation to refer to the XML structure
chapter/title: title is a direct sub-component of chapter//title: any titlechapter//title: title is a direct or indirect sub-component of chapterchapter/paragraph[2]: any direct second paragraph of any chapterchapter/*: all direct sub-components of a chapter
<lecture> <title> Structured Document Retrieval </title> <author> <fnm> Smith </fnm> <snm> John </snm> </author> <chapter> <title> Introduction into SDR </title> <paragraph> …. </paragraph> … </chapter> …</lecture>
Querying XML documents Content-only (CO) queries
'open standards for digital video in distance learning'
Content-and-structure (CAS) queries
//article [about(., 'formal methods verify correctness aviation systems')] /body//section [about(.,'case study application model checking theorem proving')]
Structure-only (SA) queries
/article//*section/paragraph[2]
Passage retrieval
Fixed-length (e.g. 300-word windows, overlapping) Discourse (e.g. sentence, paragraph) according to logical
structure but fixed Semantic (e.g. TextTiling)
Retrieval: e.g. rank document based on highest ranking passage or sum of
ranking scores for all passages deal principally with CO queries
p1 p2 p3 p4 p5 p6
doc
Database approaches to XML retrieval
Relational OO Native
Flexibility, expressiveness, complexity
Efficiency
Data-oriented retrieval– containment and not aboutness– no relevance-based ranking
Aims/challenges tend to focus on efficiency performance
XQuery
OutlineStructured document retrieval
XML
Content-oriented XML retrieval A definition Challenges Approaches
Evaluation
Content-oriented XML retrieval
Return document components of varying granularity (e.g. a book, a chapter, a section, a paragraph, a table, a figure, etc), relevant to the user’s information need both with regards to content and structure.
Content-oriented XML retrieval
Retrieve the best components according to content and structure criteria:
INEX: most specific component that satisfies the query, while being exhaustive to the query
Shakespeare study: best entry points, which are components from which many relevant components can be reached through browsing
???
Article ?XML,?retrieval
?authoring
0.9 XML 0.5 XML 0.2 XML
0.4 retrieval 0.7 authoring
Challenge 1: term weights
Title Section 1 Section 2
No fixed retrieval unit + nested document components: how to obtain document and collection statistics (e.g. tf idf) which aggregation formalism to use?
Article ?XML,?retrieval
?authoring
0.9 XML 0.5 XML 0.2 XML
0.4 retrieval 0.7 authoring
Challenge 2: augmentation weights
Title Section 1 Section 2
Nested document components: which components contribute best to content of Article? how to estimate augmentation weights (e.g. size, number of children)? how to aggregate term and augmentation weights?
0.40.5
0.2
Article ?XML,?retrieval
?authoring
0.9 XML 0.5 XML 0.2 XML
0.4 retrieval 0.7 authoring
Challenge 3: component weights
Title Section 1 Section 2
Different types of document components: which component is a good retrieval unit? how to estimate component weights (frequency, user studies)? how to aggregate term, augmentation and component weights?
0.40.5
0.2
0.6 0.40.4
0.2
Approaches …vector space model
probabilistic model
bayesian network
language model
extending DB model
boolean model
natural language processing
cognitive model
ontology
parameter estimation
tuning
smoothing
fusion
phrase
term statistics
collection statistics
component statistics
proximity search
logistic regression
belief modelrelevance feedback
Vector space model
article index
abstract index
section index
sub-section index
paragraph index
RSV normalised RSV
RSV normalised RSV
RSV normalised RSV
RSV normalised RSV
RSV normalised RSV
merge
tf and idf as for fixed and non-nested retrieval units
(IBM Haifa, INEX 2003)
Language model
element language modelcollection language modelsmoothing parameter
element score
element sizeelement scorearticle score
query expansion with blind feedbackignore elements with 20 terms
high value of leads to increase in size of retrieved elements
results with = 0.9, 0.5 and 0.2 similar
rank element
(University of Amsterdam, INEX 2003)
Outline
Structured document retrieval
XML
Content-oriented XML retrieval
Evaluation
Evaluation of XML retrieval: INEX Evaluating the effectiveness of content-oriented XML retrieval
approaches
Collaborative effort participants contribute to the development of the collection queries relevance assessments
Similar methodology as for TREC, but adapted to XML retrieval
40+ participants worldwide
Workshop in Schloss Dagstuhl in December (20+ institutions)
INEX Test Collection Documents (~500MB), which consist of 12,107 articles in XML format
from the IEEE Computer Society; 8 millions elements
INEX 2002 30 CO and 30 CAS queries CO and CAS ad hoc retrieval tasks
inex_eval metric
INEX 200336 CO and 30 CAS queries CO, SCAS and VCAS ad hoc retrieval tasks
CAS queries are defined according to enhanced subset of XPath
inex_eval and inex_eval_ng metrics
INEX 2004 is just starting
Relevance in INEX
Exhaustivityhow exhaustively a document component discusses the query: 0, 1, 2, 3
Specificityhow focused the component is on the query: 0, 1, 2, 3
Relevance (3,3), (2,3), (1,1), (0,0), …
Use of an online assessment tool to ensure exhaustive and consistent assessments (assessing a query takes a week!)
section
article all sections relevant article very relevantall sections relevant article better than sectionsone section relevant article less relevantone section relevant section better than article…
Metrics
Recall / precision - based
quantisation functions to obtain one relevance value
expected search length
penalise overlap consider size
Othersexpected ratio of relevantcumulated gain-based metricstolerance to irrelevance
section
article
Lessons learntGood definition of relevance
Expressing CAS queries was not easy
Relevance assessment process must be “improved”
Further development on metrics needed
User studies required
INEX 2004
http://inex.is.informatik.uni-duisburg.de:2004/
TracksRelevance feedbackInteractiveHeterogeneous collectionNatural language query
Merci