Slide 1

Principles of Information Retrieval
Lecture 21: XML Retrieval

IS 240 – Spring 2007
Prof. Ray Larson, University of California, Berkeley, School of Information
Tuesday and Thursday, 10:30 am - 12:00 pm
http://courses.ischool.berkeley.edu/i240/s07
Slide 2: Mini-TREC

• Proposed Schedule
  – February 15 – Database and previous queries
  – February 27 – Report on system acquisition and setup
  – March 8 – New queries for testing…
  – April 19 – Results due (next Thursday)
  – April 24 or 26 – Results and system rankings
  – May 8 – Group reports and discussion
Slide 3: Announcement

• No class on Tuesday (April 17th)
Slide 4: Today

• Review
  – Geographic Information Retrieval
  – GIR algorithms and evaluation, based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K.
• XML and Structured Element Retrieval
  – INEX
  – Approaches to XML retrieval

Credit for some of the slides in this lecture goes to Marti Hearst.
Slide 6: Introduction

• What is Geographic Information Retrieval?
  – GIR is concerned with providing access to georeferenced information sources. It includes all of the areas of traditional IR research, with the addition of spatially and geographically oriented indexing and retrieval.
  – It combines aspects of DBMS research, user interface research, GIS research, and information retrieval research.
Slide 7: Example: Results display from CheshireGeo
http://calsip.regis.berkeley.edu/pattyf/mapserver/cheshire2/cheshire_init.html
Slide 8: Other convex, conservative approximations

1) Minimum bounding circle (3)
2) MBR: minimum aligned bounding rectangle (4)
3) Minimum bounding ellipse (5)
4) Rotated minimum bounding rectangle (5)
5) 4-corner convex polygon (8)
6) Convex hull (varies)

Presented in order of increasing quality. The number in parentheses denotes the number of parameters needed to store the representation.
After Brinkhoff et al., 1993b
Slide 9: Our Research Questions

• Spatial ranking
  – How effectively can the spatial similarity between a query region and a document region be evaluated and ranked based on the overlap of the geometric approximations for these regions?
• Geometric approximations and spatial ranking
  – How do different geometric approximations affect the rankings?
    • MBRs: the most popular approximation
    • Convex hulls: the highest-quality convex approximation
Slide 10: Spatial Ranking: Methods for computing spatial similarity
Slide 11: Probabilistic Models: Logistic Regression attributes

• X1 = area of overlap(query region, candidate GIO) / area of query region
• X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
• X3 = 1 – abs(fraction of overlap region that is onshore – fraction of candidate GIO that is onshore)
• Where: the range for all variables is 0 (not similar) to 1 (same)
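To make the attributes concrete, here is a minimal sketch of the feature computation, assuming the areas and onshore fractions have already been measured; the function and argument names are illustrative, not part of Cheshire:

```python
def spatial_similarity_features(overlap_area, query_area, gio_area,
                                overlap_onshore, gio_onshore):
    """The three LR attributes from the slide; all inputs are floats.

    Areas share one (arbitrary) unit; *_onshore are fractions in [0, 1].
    Each returned value lies in [0, 1]: 0 = not similar, 1 = same.
    """
    x1 = overlap_area / query_area      # overlap relative to the query region
    x2 = overlap_area / gio_area        # overlap relative to the candidate GIO
    x3 = 1.0 - abs(overlap_onshore - gio_onshore)  # onshore agreement
    return x1, x2, x3

# Example: a candidate covering 60% of the query region, where the overlap
# is 30% of the candidate, and both are mostly onshore.
print(spatial_similarity_features(6.0, 10.0, 20.0, 0.9, 0.8))
```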
Slide 12: CA Named Places in the Test Collection – complex polygons

• Counties
• Cities
• National Parks
• National Forests
• Water QCB Regions
• Bioregions
Slide 13: CA Counties – Geometric Approximations

[Maps of county MBRs and convex hulls]
Average false area of approximation: MBRs 94.61%; convex hulls 26.73%
Slide 14: CA User Defined Areas (UDAs) in the Test Collection
Slide 15: Test Collection Query Regions: CA Counties

• 42 of 58 counties are referenced in the test collection metadata
• 10 counties randomly selected as query regions to train the LR model
• 32 counties used as query regions to test the model
Slide 16: LR model

• X1 = area of overlap(query region, candidate GIO) / area of query region
• X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
• Where: the range for all variables is 0 (not similar) to 1 (same)
Slide 17: Some of our Results

Mean average query precision: the average of the precision values observed after each new relevant document in a ranked list.

[Results tables: one for metadata indexed by CA named place regions, one for all metadata in the test collection]

These results suggest:
• Convex hulls perform better than MBRs
  – An expected result, given that the convex hull is a higher-quality approximation
• A probabilistic ranking based on MBRs can perform as well as, if not better than, a non-probabilistic ranking method based on convex hulls
  – This is interesting: since any approximation other than the MBR comes at considerable expense, it suggests that exploring new ranking methods based on the MBR is a promising direction
Slide 18: Some of our Results (continued)

[Results tables: one for metadata indexed by CA named place regions, one for all metadata in the test collection]

BUT: the inclusion of UDA-indexed metadata reduces precision. This is because coarse approximations of onshore or coastal geographic regions will necessarily include much irrelevant offshore area, and vice versa.
Slide 19: Shorefactor Model

• X1 = area of overlap(query region, candidate GIO) / area of query region
• X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
• X3 = 1 – abs(fraction of query region approximation that is onshore – fraction of candidate GIO approximation that is onshore)
• Where: the range for all variables is 0 (not similar) to 1 (same)
Slide 20: Some of our Results, with Shorefactor

[Results table of mean average query precision for all metadata in the test collection]

These results suggest:
• The addition of the shorefactor variable improves the model (LR 2), especially for MBRs
• The improvement is not so dramatic for the convex hull approximations, because the problem that shorefactor addresses is not significant when areas are represented by convex hulls
Slide 21: Results for All Data - MBRs

[Precision-recall graph comparing the Hill, Walker, Beard, LR 1, and LR 2 methods; precision ranges from about 0.7 to 1.0 across recall 0 to 1]
Slide 22: Results for All Data - Convex Hull

[Precision-recall graph comparing the Hill, Walker, Beard, LR 1, and LR 2 methods; precision ranges from about 0.7 to 1.0 across recall 0 to 1]
Slide 23: XML Retrieval

• The following slides are adapted from presentations at INEX 2003-2005 and at the INEX Element Retrieval Workshop in Glasgow 2005, with some new additions for general context, etc.
Slide 24: INEX Organization

• Organized by:
  – University of Duisburg-Essen, Germany
    • Norbert Fuhr, Saadia Malik, and others
  – Queen Mary University of London, UK
    • Mounia Lalmas, Gabriella Kazai, and others
• Supported by:
  – DELOS Network of Excellence in Digital Libraries (EU)
  – IEEE Computer Society
  – University of Duisburg-Essen
Slide 25: XML Retrieval Issues

• Using structure?
• Specification of queries
• How to evaluate?
Slide 26: Cheshire SGML/XML Support

• The underlying native format for all data is SGML or XML
• The DTD defines the database contents
• Full SGML/XML parsing
• SGML/XML format configuration files define the database location and indexes
• Various format conversions and utilities are available for Z39.50 support (MARC, GRS-1)
Slide 27: SGML/XML Support

• Configuration files for the server are SGML/XML:
  – They include elements describing all of the data files and indexes for the database.
  – They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.
Slide 28: Indexing

• Any SGML/XML-tagged field or attribute can be indexed:
  – B-Tree and Hash access via Berkeley DB (Sleepycat)
  – Stemming, keyword, exact keys, and “special keys”
  – Mapping from any Z39.50 attribute combination to a specific index
  – The underlying postings information includes term frequency for probabilistic searching
• Component extraction with separate component indexes
Slide 29: XML Element Extraction

• A new search “ElementSetName” is XML_ELEMENT_
• Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request
• The matching elements are extracted from the records matching the search and delivered in a simple format.
Slide 30: XML Extraction

  % zselect sherlock
  {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
  % zfind topic mathematics
  {OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
  % zset recsyntax XML
  % zset elementset XML_ELEMENT_Fld245
  % zdisplay
  {OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}}
  {<RESULT_DATA DOCID="1"><ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]"><Fld245 AddEnty="No" NFChars="0"><a>Singularités à Cargèse</a></Fld245></ITEM></RESULT_DATA>} … etc. …
Slide 31: TREC3 Logistic Regression

Probability of relevance is based on logistic regression from a sample set of documents to determine the values of the coefficients. At retrieval, the probability estimate is obtained by:

$$P(R \mid Q, D) = \frac{e^{\log O(R \mid Q, C)}}{1 + e^{\log O(R \mid Q, C)}}, \quad \log O(R \mid Q, C) = b_0 + \sum_{i=1}^{6} b_i X_i$$

for the six X attribute measures shown on the next slide.
Slide 32: TREC3 Logistic Regression

$$\log O(R \mid C, Q) = b_0 + b_1 \left( \frac{1}{|Q_c|} \sum_{j=1}^{|Q_c|} \log qtf_j \right) + b_2 \sqrt{|Q|} + b_3 \left( \frac{1}{|Q_c|} \sum_{j=1}^{|Q_c|} \log tf_j \right) + b_4 \sqrt{cl} + b_5 \left( \frac{1}{|Q_c|} \sum_{j=1}^{|Q_c|} \log \frac{N - n_{t_j}}{n_{t_j}} \right) + b_6 \log |Q_c|$$

The six attributes, in order: Average Absolute Query Frequency, Query Length, Average Absolute Component Frequency, Document (component) Length, Average Inverse Component Frequency, and the Number of Terms in both query and Component. The sums run over the |Qc| terms shared by the query and the component, using each term's frequency in the query (qtf) and in the document/component (tf); N is the collection size and n_t the number of components containing term t.
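Read as code, the ranking computation looks roughly like this; a sketch of the formula above, not the actual Cheshire implementation, with illustrative argument names (coefficients b0..b6 for each index appear in the tables later in the deck):

```python
import math

def trec3_log_odds(b, qtf, tf, n_t, N, ql, cl):
    """Log odds of relevance under the TREC3-style regression above.

    b    -- coefficients b0..b6
    qtf  -- query frequency of each matching term
    tf   -- component frequency of each matching term
    n_t  -- number of components containing each matching term
    N    -- number of components in the collection
    ql   -- query length; cl -- component length
    """
    Qc = len(qtf)                                       # terms in common
    x1 = sum(math.log(f) for f in qtf) / Qc             # avg abs query freq
    x2 = math.sqrt(ql)                                  # query length
    x3 = sum(math.log(f) for f in tf) / Qc              # avg abs component freq
    x4 = math.sqrt(cl)                                  # component length
    x5 = sum(math.log((N - n) / n) for n in n_t) / Qc   # avg inverse comp freq
    x6 = math.log(Qc)                                   # matching terms, logged
    xs = (x1, x2, x3, x4, x5, x6)
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], xs))

def relevance_probability(log_odds):
    """P(R | Q, D) via the logistic transform from the previous slide."""
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))
```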
Slide 33: Okapi BM25

$$\sum_{T \in Q} w^{(1)} \cdot \frac{(k_1 + 1)\,tf}{K + tf} \cdot \frac{(k_3 + 1)\,qtf}{k_3 + qtf}$$

$$w^{(1)} = \log \frac{(r + 0.5)/(R - r + 0.5)}{(n - r + 0.5)/(N - n - R + r + 0.5)}$$

• Where:
  – Q is a query containing terms T
  – K is k1((1-b) + b·dl/avdl)
  – k1, b and k3 are parameters, usually 1.2, 0.75 and 7-1000 respectively
  – tf is the frequency of the term in a specific document
  – qtf is the frequency of the term in a topic from which Q was derived
  – dl and avdl are the document length and the average document length, measured in some convenient unit
  – w(1) is the Robertson-Sparck Jones weight
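A minimal sketch of the scoring function, assuming per-term statistics are available (illustrative names; with no relevance information, r = R = 0 and w(1) reduces to a smoothed IDF):

```python
import math

def bm25_term_weight(tf, qtf, n, N, dl, avdl, r=0, R=0,
                     k1=1.2, b=0.75, k3=7):
    """Okapi BM25 contribution of one query term, per the formula above."""
    K = k1 * ((1 - b) + b * dl / avdl)
    w1 = math.log(((r + 0.5) / (R - r + 0.5)) /
                  ((n - r + 0.5) / (N - n - R + r + 0.5)))
    return w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))

def bm25_score(query_terms, doc_tf, qtf, df, N, dl, avdl):
    """Sum term weights over the terms shared by query and document;
    doc_tf, qtf, and df map a term to its document frequency, query
    frequency, and collection document frequency respectively."""
    return sum(bm25_term_weight(doc_tf[t], qtf[t], df[t], N, dl, avdl)
               for t in query_terms if t in doc_tf)
```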
Slide 34: Combining Boolean and Probabilistic Search Elements

• Two original approaches:
  – Boolean approach
  – Non-probabilistic “fusion search”: a set-merger approach that performs a weighted merger of document scores from separate Boolean and probabilistic queries

$$P(R \mid Q, D) = P(R \mid Q_{bool}, D) \cdot P(R \mid Q_{prob}, D)$$

$$P(R \mid Q_{bool}, D) = \begin{cases} 1 & \text{if the Boolean evaluation succeeds for } D \\ 0 & \text{otherwise} \end{cases}$$
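In code, the combination amounts to using the Boolean result as a 0/1 gate on the probabilistic score; a direct transcription of the formula above:

```python
def combined_score(prob_score, passes_boolean):
    """P(R|Q,D) = P(R|Q_bool,D) * P(R|Q_prob,D): the Boolean part is a
    0/1 factor, so documents failing the Boolean query drop to zero."""
    return prob_score if passes_boolean else 0.0
```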
Slide 35: INEX ‘04 Fusion Search

• Merge multiple ranked and Boolean index searches within each query, along with multiple component search result sets
  – The major components merged are articles, body, sections, subsections, and paragraphs

[Diagram: several subqueries each produce component query results, which are fused/merged into a final ranked list]
Slide 36: Merging and Ranking Operators

• Extends the merging capabilities to include merger operations in queries, like Boolean operators
• Fuzzy logic operators (not used for INEX)
  – !FUZZY_AND
  – !FUZZY_OR
  – !FUZZY_NOT
• Containment operators: restrict components to or within a particular parent
  – !RESTRICT_FROM
  – !RESTRICT_TO
• Merge operators
  – !MERGE_SUM
  – !MERGE_MEAN
  – !MERGE_NORM
  – !MERGE_CMBZ
Slide 37: New LR Coefficients

Index       b0      b1     b2      b3     b4      b5     b6
Base        -3.700  1.269  -0.310  0.679  -0.021  0.223  4.010
topic       -7.758  5.670  -3.427  1.787  -0.030  1.952  5.880
topicshort  -6.364  2.739  -1.443  1.228  -0.020  1.280  3.837
abstract    -5.892  2.318  -1.364  0.860  -0.013  1.052  3.600
alltitles   -5.243  2.319  -1.361  1.415  -0.037  1.180  3.696
sec words   -6.392  2.125  -1.648  1.106  -0.075  1.174  3.632
para words  -8.632  1.258  -1.654  1.485  -0.084  1.143  4.004

Estimates using INEX ‘03 relevance assessments for:
b1 = Average Absolute Query Frequency
b2 = Query Length
b3 = Average Absolute Component Frequency
b4 = Document Length
b5 = Average Inverse Component Frequency
b6 = Number of Terms in common between query and Component
Slide 38: INEX CO Runs

• Three official runs and one later run - all title-only
  – Fusion - combines Okapi and LR using the MERGE_CMBZ operator
  – NewParms (LR) - uses only LR with the new parameters
  – Feedback - an attempt at blind relevance feedback
  – PostFusion - fusion of the new LR coefficients and Okapi
Slide 39: Query Generation - CO

• #162 TITLE = Text and Index Compression Algorithms
• QUERY: (topicshort @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (topicshort @ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @ {Text and Index Compression Algorithms})
• @+ is Okapi, @ is LR
• !MERGE_CMBZ is a normalized score summation and enhancement
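The slides describe !MERGE_CMBZ only as "a normalized score summation and enhancement", so the following is an illustrative reconstruction in the spirit of CombMNZ-style fusion, not Cheshire's actual operator:

```python
def merge_cmbz(results_a, results_b):
    """Fuse two result sets: min-max normalize each run's scores, sum
    them, and boost documents retrieved by both runs. Inputs are
    non-empty dicts mapping document ids to raw scores."""
    def normalize(results):
        lo, hi = min(results.values()), max(results.values())
        span = (hi - lo) or 1.0          # avoid division by zero
        return {d: (s - lo) / span for d, s in results.items()}
    a, b = normalize(results_a), normalize(results_b)
    merged = {}
    for doc in set(a) | set(b):
        scores = [run[doc] for run in (a, b) if doc in run]
        merged[doc] = sum(scores) * len(scores)   # "enhancement" for overlap
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```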
Slide 40: INEX CO Runs

Average precision:

Run        Generalized  Strict
FUSION     0.0642       0.0923
NEWPARMS   0.0582       0.0853
FDBK       0.0415       0.0390
POSTFUS    0.0690       0.0952
Slide 41: INEX VCAS Runs

• Two official runs
  – FUSVCAS - element fusion using LR and various operators for path restriction
  – NEWVCAS - uses the new LR coefficients for each appropriate index and various operators for path restriction
Slide 42: Query Generation - VCAS

• #66 TITLE = //article[about(., intelligent transport systems)]//sec[about(., on-board route planning navigation system for automobiles)]
• Submitted query = ((topic @ {intelligent transport systems})) !RESTRICT_FROM ((sec_words @ {on-board route planning navigation system for automobiles}))
• Target elements: sec|ss1|ss2|ss3
Slide 43: VCAS Results

Average precision:

Run      Generalized  Strict
FUSVCAS  0.0321       0.0601
NEWVCAS  0.0270       0.0569
Slide 44: Heterogeneous Track

• Approach using Cheshire’s Virtual Database options
  – Primarily a version of distributed IR
  – Each collection indexed separately
  – Search via Z39.50 distributed queries
  – Z39.50 attribute mapping is used to map query indexes to the appropriate elements in a given collection
  – Only LR is used, and collection results are merged using the probability of relevance for each collection result
Slide 45: INEX 2005 Approach

• Used only logistic regression methods:
  – “TREC3” with pivot
  – “TREC2” with pivot
  – “TREC2” with blind feedback
• Used post-processing for specific tasks
Slide 46: Logistic Regression

Probability of relevance is based on logistic regression from a sample set of documents to determine the values of the coefficients. At retrieval, the probability estimate is obtained by:

$$P(R \mid Q, D) = \frac{e^{\log O(R \mid Q, C)}}{1 + e^{\log O(R \mid Q, C)}}, \quad \log O(R \mid Q, C) = b_0 + \sum_{i=1}^{m} b_i X_i$$

for some set of m statistical measures, X_i, derived from the collection and query.
Slide 47: TREC2 Algorithm

$$\log O(R \mid C, Q) = c_0 + c_1 \frac{1}{\sqrt{|Q_c|+1}} \sum_{i=1}^{|Q_c|} \frac{qtf_i}{ql + 35} + c_2 \frac{1}{\sqrt{|Q_c|+1}} \sum_{i=1}^{|Q_c|} \log\frac{tf_i}{cl + 80} + c_3 \frac{1}{\sqrt{|Q_c|+1}} \sum_{i=1}^{|Q_c|} \log\frac{ctf_i}{N_t} + c_4 |Q_c|$$

The three sums are over the |Qc| matching terms, using each term's frequency in the query (qtf), in the document (tf), and in the collection (ctf); ql and cl are the query and component lengths, and N_t is the number of terms in the collection.
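A sketch of the TREC2 computation, assuming per-term statistics for the matching terms are at hand (illustrative names, not the Cheshire implementation):

```python
import math

def trec2_log_odds(c, qtf, tf, ctf, ql, cl, N_t):
    """TREC2-style log odds, following the formula above.

    c              -- coefficients c0..c4
    qtf, tf, ctf   -- parallel lists over the |Qc| matching terms
                      (frequency in query, component, and collection)
    ql, cl         -- query and component lengths
    N_t            -- number of terms in the collection
    """
    Qc = len(qtf)
    norm = 1.0 / math.sqrt(Qc + 1)
    x1 = norm * sum(f / (ql + 35) for f in qtf)
    x2 = norm * sum(math.log(f / (cl + 80)) for f in tf)
    x3 = norm * sum(math.log(f / N_t) for f in ctf)
    return c[0] + c[1] * x1 + c[2] * x2 + c[3] * x3 + c[4] * Qc
```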
Slide 48: Blind Feedback

• Term selection from top-ranked documents is based on the classic Robertson/Sparck Jones probabilistic model. For each term t:

                        Relevant   Not relevant      Total
  Document contains t   Rt         Nt - Rt           Nt
  Document lacks t      R - Rt     N - Nt - R + Rt   N - Nt
  Total                 R          N - R             N
Slide 49: Blind Feedback

• The top x new terms are taken from the top y documents
  – For each term in the top-y assumed-relevant set, compute:

$$termwt = \log \frac{R_t / (R - R_t)}{(N_t - R_t) / (N - N_t - R + R_t)}$$

  – Terms are ranked by termwt and the top x are selected for inclusion in the query
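A sketch of the term-selection step, taking the contingency-table counts as given (illustrative names; x and y are the slide's "top x terms from top y documents"):

```python
import math

def rsj_termwt(Rt, R, Nt, N):
    """Robertson/Sparck Jones term weight from the contingency table on
    the previous slide. Real implementations usually add 0.5 to each
    cell to avoid zero divisions; the raw slide formula is shown here."""
    return math.log((Rt / (R - Rt)) / ((Nt - Rt) / (N - Nt - R + Rt)))

def select_expansion_terms(term_stats, N, R, x=10):
    """Rank candidate terms from the top-y assumed-relevant documents by
    termwt and keep the top x. term_stats maps term -> (Rt, Nt)."""
    ranked = sorted(term_stats,
                    key=lambda t: rsj_termwt(term_stats[t][0], R,
                                             term_stats[t][1], N),
                    reverse=True)
    return ranked[:x]
```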
Slide 50: Pivot method

• Based on the pivot weighting used by IBM Haifa in INEX 2004 (Mass & Mandelbrod)
• Used 0.50 as the pivot for all cases
• For the TREC3 and TREC2 runs, all component results are weighted by the article-level results for the matching article:

$$P(R \mid Q, C_{new}) = X \cdot P(R \mid Q, C_{comp}) + (1 - X) \cdot P(R \mid Q, C_{article})$$

where X is the “pivot value,” with 0 ≤ X ≤ 1.
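Transcribed directly, the pivot blend is a one-liner:

```python
def pivot_score(p_component, p_article, x=0.5):
    """Blend a component's probability of relevance with that of its
    containing article; x = 0.5 was used for all runs here."""
    return x * p_component + (1.0 - x) * p_article

# e.g. a strong paragraph in a weak article: 0.5*0.8 + 0.5*0.2 = 0.5
```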
Slide 51: Adhoc Component Fusion Search

• Merge multiple ranked component types
  – The major components merged are article, body, sections, paragraphs, and figures

[Diagram: several subqueries each produce component query results, which are fused/merged into a raw ranked list]
Slide 52: TREC3 Logistic Regression

Probability of relevance is based on logistic regression from a sample set of documents to determine the values of the coefficients. At retrieval, the probability estimate is obtained from the log odds:

$$\log O(R \mid Q, D) = b_0 + \sum_{i=1}^{n} b_i X_i$$
Slide 53: TREC3 Logistic Regression attributes

$X_1 = \frac{1}{M}\sum_{t=1}^{M}\log QAF_t$  (Average Absolute Query Frequency)
$X_2 = \sqrt{QL}$  (Query Length)
$X_3 = \frac{1}{M}\sum_{t=1}^{M}\log DAF_t$  (Average Absolute Component Frequency)
$X_4 = \sqrt{DL}$  (Document Length)
$X_5 = \frac{1}{M}\sum_{t=1}^{M} IDF_t$  (Average Inverse Component Frequency)
$IDF_t = \log\frac{N}{n_t}$  (Inverse Component Frequency)
$X_6 = \log M$  (Number of Terms in common between query and Component, logged)

Here M is the number of terms shared by the query and the component, QAF and DAF are a term's frequency in the query and in the component, QL and DL are the query and component lengths, N is the number of components, and n_t is the number of components containing term t.
Slide 54: TREC3 LR Coefficients

Index       b0      b1     b2      b3     b4      b5     b6
Base        -3.700  1.269  -0.310  0.679  -0.021  0.223  4.010
topic       -7.758  5.670  -3.427  1.787  -0.030  1.952  5.880
topicshort  -6.364  2.739  -1.443  1.228  -0.020  1.280  3.837
abstract    -5.892  2.318  -1.364  0.860  -0.013  1.052  3.600
alltitles   -5.243  2.319  -1.361  1.415  -0.037  1.180  3.696
sec words   -6.392  2.125  -1.648  1.106  -0.075  1.174  3.632
para words  -8.632  1.258  -1.654  1.485  -0.084  1.143  4.004

Estimates using INEX ‘03 relevance assessments for:
b1 = Average Absolute Query Frequency
b2 = Query Length
b3 = Average Absolute Component Frequency
b4 = Document Length
b5 = Average Inverse Component Frequency
b6 = Number of Terms in common between query and Component
Slide 55: CO.Focused
• Generalized & Strict [results graphs]

Slide 56: COS.Focused
• Generalized & Strict [results graphs]

Slide 57: CO.Thorough
• Generalized & Strict [results graphs]

Slide 58: COS.Thorough
• Generalized & Strict [results graphs]

Slide 59: CAS
• Generalized & Strict [results graphs]
Slide 60: Heterogeneous Element Retrieval Overview

• The problem
• Issues with element retrieval and heterogeneous retrieval
• Possible approaches
  – XPointer
  – Generic metadata systems
    • E.g., Dublin Core
  – Other metadata systems
Slide 61: The Problem

• The Adhoc track in INEX has dealt with a single DTD for one type of data (computer science journal articles)
• In “real-world” environments, XML retrieval must deal with different DTDs, different genres of data, and widely varying topical content
Slide 62: The Heterogeneous Track

• Research questions (2004):
  – For content-oriented queries, what methods are possible for determining which elements contain reasonable answers? Are pure statistical methods appropriate, or are ontology-based approaches also helpful?
  – What methods can be used to map structural criteria onto other DTDs?
  – Should mappings focus on element names only, or also deal with element content or semantics?
  – What are appropriate evaluation criteria for heterogeneous collections?
Slide 63: INEX 2004 Het Collection Tags

Collection    Author tag         Title tag         Abstract tag
INEX (IEEE)   fm/au              fm/tig/atl        fm/abs
Berkeley      Fld100, Fld700     Fld245            Fld500 (rarely)
compuscience  author             title             abstract
bibdbpub      author, altauthor  title             abstract
dblp          author, editor     title, booktitle  None
hcibib        author             title             abstract
qmulcspub     AUTHOR, EDITOR     TITLE             ABSTRACT
Slide 64: Issues with Element Retrieval for Heterogeneous Retrieval

• Conceptual issues (the user’s view)
  – Actually specifying structural elements for retrieval requires that the user know the structure of the items to be retrieved
  – As the number of DTDs or schemas increases, this task becomes more complex, both for specification and for understanding
  – For “real-world” XML retrieval, specifying structure effectively requires omniscience on the part of the user
  – The collection itself must be specified in some way (can the user know all of the collections?)
  – Users of INEX can’t write correct specifications for even one DTD…
Slide 65: Issues with Element Retrieval for Heterogeneous Retrieval

• Practical issues (the programmer’s view)
  – Most of the same problems as the user view
  – As seen in earlier papers today, the system must provide an interface that the user can understand, but that maps to the complexities of the DTD(s)
  – Once again, as the number of DTDs or schemas increases, this task becomes increasingly complex for the specification of the mappings
  – For “real-world” XML retrieval, specifying structure effectively requires omniscience on the part of the programmer, who must provide exhaustive mappings of the document elements to be retrieved
• As Roelof noted earlier today, this can rapidly become a system that has too many options for a user to understand or use
Slide 66: Postulate of Impotence

• In summation, we might suggest another “Postulate of Impotence,” like those suggested by Swanson:
  – You can have either heterogeneous retrieval or precise element specifications in queries, but you cannot have both simultaneously
Slide 67: Possible Approaches

• Generalized structure
  – Parent/child, as in XPath/XPointer
  – What about flat structures (like most collections in the Het track)?
• Abstract query elements
  – Use semantic representations in queries rather than structural representations
    • E.g., “Title” instead of //fm/tig/atl
  – What semantic representations can/should be used?
Slide 68: XPointer

• Can specify collection-level identification
  – Basically a URN attached to an XPath
• Can also specify various string-matching constraints on the XPath (e.g., an xpointer() string-range on a title element)
• Might be useful in the INEX Het track for specifying relevance judgements
• But it doesn’t address (or even worsens) the larger problem of dealing with large numbers of heterogeneous structures
Slide 69: Abstract Data Elements

• The idea is to remove the requirement of precise and explicit specification of structural elements, replacing it with abstract and implied specifications
• Used in other heterogeneous retrieval systems
  – Z39.50/SRW (attribute sets and element sets)
  – Dublin Core (a limited set of elements for search or retrieval)
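As a sketch of what abstract access points can look like in practice, the mapping table from slide 63 can be turned into a lookup that hides per-DTD structure from the user. The tag strings below come from that table; the Python names and structure are illustrative, not how any of these systems are actually configured:

```python
# Abstract (Dublin Core-style) access points mapped to concrete tags
# for a few of the Het-track collections.
ACCESS_POINTS = {
    "inex":         {"creator": ["fm/au"],            "title": ["fm/tig/atl"]},
    "berkeley":     {"creator": ["Fld100", "Fld700"], "title": ["Fld245"]},
    "compuscience": {"creator": ["author"],           "title": ["title"]},
    "dblp":         {"creator": ["author", "editor"], "title": ["title", "booktitle"]},
}

def paths_for(collection, abstract_element):
    """Translate an abstract search element into the concrete tags to
    query in a given collection, so the user never sees DTD structure."""
    return ACCESS_POINTS.get(collection, {}).get(abstract_element, [])

print(paths_for("dblp", "creator"))  # ['author', 'editor']
```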
Slide 70: Dublin Core

• Simple metadata for describing internet resources
• For “document-like objects”
• 15 elements (in base DC)
Slide 71: Dublin Core Elements

• Title
• Creator
• Subject
• Description
• Publisher
• Other Contributors
• Date
• Resource Type
• Format
• Resource Identifier
• Source
• Language
• Relation
• Coverage
• Rights Management
Slide 72: Issues in Dublin Core

• Lack of guidance on what to put into each element
• How to structure or organize at the element level?
• How to ensure consistency across descriptions for the same persons, places, things, etc.