Download ppt - PROXIMITY WITHIN PARAGRAPH: A MEASURE TO ENHANCE DOCUMENT RETRIEVAL PERFORMANCE

PROXIMITY WITHIN PARAGRAPH: A MEASURE TO ENHANCE DOCUMENT

RETRIEVAL PERFORMANCESrisupa Palakvangsa-Na-Ayudhya John A. Keane

School of Informatics, The University of Manchester, UK

QUERY = Informatics and Manchester and Course

How to calculate a score?

Informatics Manchester

Course

Block 1

= 3

Informatics

Block 2

= 1/3

Informatics Course

Block 3

= 2/3

44.0333

3/23/13Score

List Returned

1.Doc R: 0.85

2.Doc M: 0.71

3.Doc L: 0.69

4.Doc A: 0.56

1) Assign weights to each queried word

2) Split a document into blocks

3) Calculate a similarity score

4) Order documents by their scores

This work was supported by a PhD studentship from the Royal Thai Government

Experimental Setup Compared with Minimum Distance Between Queried Pair (MQP) Employed three test sets based on WT10G data set and 50 queries (Topics 501-550) Used two criteria: the average interpolated precision-recall (AvgPrec) and the average Discounted Cumulative Gain (AvgDCG).

Results

Test Set 1

0

2

4

6

8

10

12

Rank

Av

gD

CG

.

PWP

MQP

Test Set 2

0

2

4

6

8

10

12

Rank

Av

gD

CG

.

PWP

MQP

Test Set 3

0123456789

Rank

Av

gD

CG

.

PWP

MQP

Test Set 1

0

10

20

30

40

50

60

70

0 10 20 30 40 50 60 70 80 90 100

Recall

Av

gP

rec

PWP

MQP

Test Set 2

0

10

20

30

40

50

60

0 10 20 30 40 50 60 70 80 90 100

Recall

Av

gP

rec

PWP

MQP

Test Set 3

0

10

20

30

40

50

60

70

0 10 20 30 40 50 60 70 80 90 100

Recall

Av

gP

rec

PWP

MQP

Discussion

AdvantagesPWP MQP

tends to obtain higher AvgDCG values considers all logical blocks more fine-grained

Disadvantages

PWP MQP

considers only the number of

non-queried words between unique

queried words pairs difficult to rank documents as the same

score is given to many documents more coarse-grained

multiple-topics issue likely to obtain lower DCG values than

MQP, due to the use of multiple logical

blocks