PROXIMITY WITHIN PARAGRAPH: A MEASURE TO ENHANCE DOCUMENT
RETRIEVAL PERFORMANCESrisupa Palakvangsa-Na-Ayudhya John A. Keane
School of Informatics, The University of Manchester, UK
QUERY = Informatics and Manchester and Course
How to calculate a score?
Informatics Manchester
Course
Block 1
= 3
Informatics
Block 2
= 1/3
Informatics Course
Block 3
= 2/3
44.0333
3/23/13Score
List Returned
1.Doc R: 0.85
2.Doc M: 0.71
3.Doc L: 0.69
4.Doc A: 0.56
1) Assign weights to each queried word
2) Split a document into blocks
3) Calculate a similarity score
4) Order documents by their scores
This work was supported by a PhD studentship from the Royal Thai Government
Experimental Setup Compared with Minimum Distance Between Queried Pair (MQP) Employed three test sets based on WT10G data set and 50 queries (Topics 501-550) Used two criteria: the average interpolated precision-recall (AvgPrec) and the average Discounted Cumulative Gain (AvgDCG).
Results
Test Set 1
0
2
4
6
8
10
12
Rank
Av
gD
CG
.
PWP
MQP
Test Set 2
0
2
4
6
8
10
12
Rank
Av
gD
CG
.
PWP
MQP
Test Set 3
0123456789
Rank
Av
gD
CG
.
PWP
MQP
Test Set 1
0
10
20
30
40
50
60
70
0 10 20 30 40 50 60 70 80 90 100
Recall
Av
gP
rec
PWP
MQP
Test Set 2
0
10
20
30
40
50
60
0 10 20 30 40 50 60 70 80 90 100
Recall
Av
gP
rec
PWP
MQP
Test Set 3
0
10
20
30
40
50
60
70
0 10 20 30 40 50 60 70 80 90 100
Recall
Av
gP
rec
PWP
MQP
Discussion
AdvantagesPWP MQP
tends to obtain higher AvgDCG values considers all logical blocks more fine-grained
Disadvantages
PWP MQP
considers only the number of
non-queried words between unique
queried words pairs difficult to rank documents as the same
score is given to many documents more coarse-grained
multiple-topics issue likely to obtain lower DCG values than
MQP, due to the use of multiple logical
blocks