Querylog-based Assessment of Retrievability Bias in Delpher

Myriam C. Traub, Thaer Samar, Jacco van Ossenbruggen, Jiyin He, Arjen de Vries, Lynda Hardman



Motivation

• Users want to be able

• to access all (relevant) documents in Delpher

• to get a fair overview of Delpher’s content

• However,

• data collections are implicitly biased,

• users are biased,

• and technology induces even more bias(es)

… which I can deal with if the bias is made explicit. #toolcrit

Note: Bias is not necessarily a bad thing!

Research Questions

RQ1: Is the access to the digitized newspaper collection in Delpher influenced by a retrievability bias?

RQ2: Can we correlate the features of a document (such as document length, time of publishing, type of document, etc.) with its retrievability scores?

RQ3: To what extent are retrievability experiments using simulated queries representative for the search behavior of real users?


Retrievability

• Introduced by Azzopardi et al. [1] in 2008 in a study based on born-digital documents and simulated queries

• Measures the accessibility of all documents in a collection for a given set of queries

• Retrievability score r(d) measures how often a document d is retrieved by a given set of queries

• Gini coefficient and Lorenz curves can visualize and quantify inequality in the distribution of r(d) scores
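As a minimal sketch of the cumulative form of r(d), the score can be computed by counting, for each document, the queries that return it within the top c results. The function name and the `rankings` input format are illustrative assumptions, and every query is weighted equally here; this is not the code used in the study.

```python
from collections import Counter

def retrievability(rankings, c):
    """Cumulative r(d): number of queries that rank document d within the top c.

    `rankings` maps each query to its ranked list of document ids,
    best match first (hypothetical input format).
    """
    scores = Counter()
    for ranked_docs in rankings.values():
        # Only the top-c results of each query count towards r(d).
        for doc_id in ranked_docs[:c]:
            scores[doc_id] += 1
    return scores

# Toy example with three queries over documents d1..d4.
rankings = {
    "amsterdam": ["d1", "d2", "d3"],
    "rotterdam haven": ["d2", "d1"],
    "koningin": ["d2", "d4", "d1"],
}
r = retrievability(rankings, c=2)
print(r["d1"], r["d2"], r["d3"], r["d4"])
```

With c=2, d3 is never counted because it only appears at rank 3, which illustrates how a small cutoff leaves documents with r(d)=0.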


[1]  L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.

Lorenz Curve & Gini Coefficient

• Introduced by economists to visualize inequality in wealth distribution

• Ranges between 0 and 1:

• perfect tyranny (G=0.8), e.g. 0, 0, 0, 0, 1

• perfect communist (G=0), e.g. 1, 1, 1, 1, 1

• in-between (G=0.5), e.g. 0, 0, 1, 1, 2

• There is no good or bad G.

[Figure: Lorenz curves for the three example distributions. x-axis: % of population (here: % of documents); y-axis: % of income (here: % of accumulated r(d)); both axes run from 0.0 to 1.0.]
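The G values quoted above can be reproduced with a short sketch. It uses the standard discrete Gini formula on sorted values, applied to three five-element distributions (0, 0, 0, 0, 1; 1, 1, 1, 1, 1; 0, 0, 1, 1, 2). The function names are illustrative assumptions, not part of the study's tooling.

```python
def gini(values):
    """Gini coefficient of non-negative numbers (0 means perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values (standard discrete formula).
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n

def lorenz_curve(values):
    """Points (share of population, share of cumulative total); assumes total > 0."""
    xs = sorted(values)
    total = sum(xs)
    points = [(0.0, 0.0)]
    running = 0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((i / len(xs), running / total))
    return points

for name, dist in [("tyranny", [0, 0, 0, 0, 1]),
                   ("communist", [1, 1, 1, 1, 1]),
                   ("in-between", [0, 0, 1, 1, 2])]:
    print(name, round(gini(dist), 2))
```

For a population of five, a single owner of everything yields G = (n-1)/n = 0.8 rather than 1, which is why the "perfect tyranny" case does not reach the upper bound.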

Document Collection: KB Newspaper Archive

June 1618 - December 1995

Total Size         102,718,528
Vocabulary Size    353,086,358

Articles           67%    69,237,655
Advertisements     29%    29,591,599
Notifications*      2%     1,918,375
Captions            2%     1,970,899

* Familiebericht (family announcement)

Simulated Queries

• Followed similar strategy as previous studies

• Top 2 million single terms from the preprocessed corpus + top 2 million bigram terms

• No filtering for OCR errors
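The query simulation strategy above can be sketched as follows: count single terms and bigrams over the corpus and keep the most frequent of each. The whitespace tokenization is a stand-in for the real preprocessing, and `simulate_queries` is a hypothetical name; the study used the top 2 million of each, while the toy example keeps only the top 2.

```python
from collections import Counter

def simulate_queries(documents, top_k):
    """Return the top_k most frequent single terms plus the top_k most frequent bigrams."""
    unigrams = Counter()
    bigrams = Counter()
    for text in documents:
        tokens = text.lower().split()  # stand-in for the real preprocessing
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    single = [term for term, _ in unigrams.most_common(top_k)]
    double = [" ".join(pair) for pair, _ in bigrams.most_common(top_k)]
    return single + double

documents = [
    "de krant van vandaag",
    "de krant van gisteren",
    "de nieuwe krant",
]
queries = simulate_queries(documents, top_k=2)
print(queries)
```

Because nothing is filtered, any frequent OCR misreading in the corpus would end up in the query set as well.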


Real Queries

• User logs collected between March and July 2015 on Delpher

• Extracted queries and view data related to newspaper archive

• Total of 957,239 unique queries


Experiment / Parameters

• Real queries, simulated queries

• Standard Information Retrieval models: TFIDF, LM1000, BM25

• Pre-processing: Stemming, stopword removal, operator removal

• Cutoff values: c=10, c=100, c=1000
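To make the role of the retrieval model concrete, here is a minimal sketch of a textbook Okapi BM25 scoring function. The parameter values k1=1.2 and b=0.75 are common defaults and the IDF is a common non-negative variant; both are assumptions, not the settings of the retrieval engine used in the study.

```python
import math

def bm25_score(query_terms, doc_tokens, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Textbook BM25 score of one document for one query (illustrative only)."""
    score = 0.0
    doc_len = len(doc_tokens)
    for term in query_terms:
        tf = doc_tokens.count(term)   # term frequency in this document
        df = doc_freq.get(term, 0)    # number of documents containing the term
        if tf == 0 or df == 0:
            continue
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Length normalization: longer-than-average documents are penalized.
        norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / norm
    return score

doc_freq = {"krant": 2}
short_doc = ["de", "krant"]
long_doc = ["de", "krant"] + ["en"] * 8
short_score = bm25_score(["krant"], short_doc, doc_freq, num_docs=3, avg_doc_len=6)
long_score = bm25_score(["krant"], long_doc, doc_freq, num_docs=3, avg_doc_len=6)
print(short_score > long_score)
```

With equal term frequency the shorter document scores higher, so the choice of model and its length normalization can affect which documents appear above a cutoff c.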

Results of Quantifying Retrievability Bias

Inequality

Real queries, c=1000: G_BM25 = 0.76

Simulated queries, c=100: G_BM25 = 0.52

Inequality c=10

Real queries: G_BM25 = 0.97

Simulated queries: G_BM25 = 0.85

A large fraction of documents scores r(d)=0

Limitations

• The Lorenz curves and Gini values

• are strongly influenced by 0 values,

• can indicate the degree of bias, but they tell us nothing about the type of bias.

Does it arise from the users’ interest / search behavior? Or from a technological bias towards a particular document feature?

Frequencies of r(d) values

• Real queries (top):

• maximum r(d)=4319

• tend to retrieve a few documents more often

• Simulated queries (bottom):

• maximum r(d)=807

• tend to retrieve a larger number of documents

[Figure: frequency distributions of r(d) values, counts on a logarithmic y-axis against r(d) on the x-axis. Top: real queries (r(d) from 0 to 4000). Bottom: simulated queries (r(d) from 0 to 800).]

Are r(d) Values Meaningful?

• Created 4 subsets of documents according to their r(d) score and selected a set of target documents from each subset

• Generated queries from selected documents, tailored to retrieve these specific documents

• Performed the search tasks and measured ranks of the target documents

• Showed that documents with lower r(d) score are actually harder to find
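One way to summarize such a known-item experiment is the mean reciprocal rank of the target documents per subset. The metric and the subset labels below are illustrative assumptions; the slides only state that the ranks of the target documents were measured.

```python
from collections import defaultdict

def mean_reciprocal_rank_by_subset(results):
    """results: iterable of (subset_label, rank); rank is 1-based, or None if not found."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for subset, rank in results:
        counts[subset] += 1
        if rank is not None:
            # A target that is never found contributes 0 to the sum.
            totals[subset] += 1.0 / rank
    return {subset: totals[subset] / counts[subset] for subset in counts}

# Hypothetical outcomes of four search tasks.
results = [
    ("hardly retrieved", None),
    ("hardly retrieved", 50),
    ("often retrieved", 1),
    ("often retrieved", 2),
]
mrr = mean_reciprocal_rank_by_subset(results)
print(mrr)
```

A lower value for the low-r(d) subset would match the finding that those documents are harder to find.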

[Figure: results for the four document subsets: hardly retrieved, retrieved a few times, often retrieved, very often retrieved.]

Document Features

Time of Publishing

• Documents ordered by publishing date

• Split into 20 bins of equal size

• Mean r(d) per bin

[Figure: mean r(d) per bin (y-axis, 0 to 3) for the 20 publication-date bins (x-axis): 1618-1862, 1862-1891, 1891-1904, 1904-1913, 1913-1920, 1920-1926, 1926-1929, 1929-1932, 1932-1935, 1935-1939, 1939-1941, 1941-1948, 1948-1956, 1956-1963, 1963-1969, 1969-1974, 1974-1979, 1979-1984, 1984-1989, 1989-2011.]
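The binning procedure described above can be sketched as follows: sort the documents by a feature, cut the sorted list into bins holding (almost) the same number of documents, and average r(d) within each bin. The function name and the record format are illustrative assumptions.

```python
def mean_r_per_bin(records, num_bins):
    """records: list of (feature_value, r_d) pairs; needs at least num_bins records.

    Returns the mean r(d) of each equal-size bin, ordered by the feature.
    """
    ordered = sorted(records, key=lambda rec: rec[0])
    n = len(ordered)
    means = []
    for b in range(num_bins):
        # Integer arithmetic spreads any remainder evenly over the bins.
        start = b * n // num_bins
        end = (b + 1) * n // num_bins
        chunk = ordered[start:end]
        means.append(sum(r for _, r in chunk) / len(chunk))
    return means

# Toy records: (year of publication, r(d)).
records = [(1900, 0), (1650, 1), (1950, 6), (1800, 3)]
print(mean_r_per_bin(records, num_bins=2))
```

Equal-size bins explain why the date ranges on the x-axis differ so much in width: periods with few digitized documents span many more years per bin.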

OCR Confidence Scores

• Documents ordered by page confidence (PC)

• Split into bins according to PC value

• Mean r(d) per bin

[Figure: mean r(d) per bin (y-axis, 0.5 to 2.0) against bins based on page confidence (PC) (x-axis, 0 to 5000).]

Document Length

• Varies from 33 to 381,563 words (mean = 362)

• Documents ordered by length and split into bins of 20,000 documents

• LM1000 (left): upward trend, longer documents are more retrievable

• BM25 and TFIDF (right): seem to be better at retrieving documents of medium length

[Figure: two panels showing mean r(d) per bin against bins based on document length (x-axis, 0 to 5000); y-axis ranges 0 to 6 and 0.0 to 1.5.]

Newspaper Titles

• Number of articles per title ranges from one to 16,348,557 (mean 82,638, median 127)

• Subset of the 10 most prevalent newspaper titles

• Mean r(d)

• Top 3 titles are regional ones

Newspaper Title                                        Mean r(d)
Leeuwarder courant: hoofdblad van Friesland            0.15
Nieuwsblad van het Noorden                             0.14
Limburgsch dagblad                                     0.12
Het vrije volk: democratisch-socialistisch dagblad     0.10
De Tijd: godsdienstig-staatkundig dagblad              0.08
Het Vaderland: staat- en letterkundig nieuwsblad       0.07
Leeuwarder courant                                     0.07
Algemeen Handelsblad                                   0.06
De Telegraaf                                           0.06
Rotterdamsch nieuwsblad                                0.05

Document Types

• Hardly any differences for simulated queries

• In real queries, the official notifications stand out with a much higher score

Mean r(d) for BM25, c=100

                 Real    Simulated
Article          0.90    3.89
Advertisement    0.51    3.32
Notification*    4.80    3.22
Caption          0.84    3.06

* Familiebericht (family announcement)

Representativeness of our Study


Top retrieved article for real queries


Top retrieved article(s) for simulated queries


Differences between query sets

• Real queries:

• Mean length: 2.32 terms

• Unique terms: 253,637

• 56 references to persons or locations in top 100 terms

• Simulated queries:

• Mean length: 1.5 terms

• Unique terms: 2,028,617

• 5 references to persons or locations in top 100 terms
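The two basic statistics compared above, mean query length and number of unique terms, can be computed with a few lines. Whitespace tokenization and the function name are assumptions made for this sketch.

```python
def query_set_stats(queries):
    """Return (mean query length in terms, number of unique terms)."""
    lengths = [len(q.split()) for q in queries]
    unique_terms = {term for q in queries for term in q.split()}
    return sum(lengths) / len(lengths), len(unique_terms)

# Hypothetical mini query log.
queries = ["jan de vries", "amsterdam", "de telegraaf 1940"]
mean_length, num_unique = query_set_stats(queries)
print(round(mean_length, 2), num_unique)
```

Counting named entities among the top terms, as on the slide, would additionally require manual annotation or an entity recognizer and is not covered here.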

Document views

• 2.7M out of 102M documents were viewed by users

• Shape of the frequency distribution plot is very similar to the r(d) frequency plots

• Most documents only viewed once, very few are viewed more often

[Figure: frequency distribution of document views; x-axis: number of views (5 to 700); y-axis: counts (logarithmic, 1 to 1,000,000).]

Overlap with views

• How many documents were viewed by Delpher users, but not retrieved in our study?

• Many non-retrieved documents

• were found using facets, operators

• scored a rank just below the cutoff

• Use a smoother cost function based on the ranking

• Better representation of the real search engine, taking faceted search / operators into account
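Both ideas above can be sketched briefly: splitting the viewed documents into retrieved and non-retrieved ones, and replacing the hard cutoff with a weight that decays smoothly with rank. The weight 1/rank^beta is an illustrative choice; the slides do not specify which function is intended, and all names below are hypothetical.

```python
def split_viewed(viewed, retrieved):
    """Partition the viewed documents into (also retrieved, never retrieved)."""
    viewed, retrieved = set(viewed), set(retrieved)
    return viewed & retrieved, viewed - retrieved

def discounted_retrievability(rankings, beta=1.0):
    """Rank-discounted r(d): each retrieval at rank k contributes 1 / k**beta."""
    scores = {}
    for ranked_docs in rankings.values():
        for rank, doc_id in enumerate(ranked_docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / rank ** beta
    return scores

rankings = {"q1": ["d1", "d2", "d3"], "q2": ["d3", "d1"]}
top1 = {docs[0] for docs in rankings.values()}  # documents retrieved at c=1
found, missed = split_viewed({"d2", "d3", "d9"}, top1)
smooth = discounted_retrievability(rankings)
print(sorted(found), sorted(missed), smooth["d2"])
```

In the toy data d2 sits just below the cutoff and counts as non-retrieved, yet it still receives a non-zero score under the smooth weighting.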

[Figure: bar chart of retrieved vs. non-retrieved viewed documents for c=10, c=100 and c=1000 (y-axis, 0 to 3).]

Document Types - revisited

                         R(d) Real    R(d) Simulated    Viewed
Article                  0.90         3.89              2.61%
Advertisement            0.51         3.32              2.07%
Official Notification    4.80         3.22              40.10%
Caption                  0.84         3.06              4.01%

Parameter Sets for Preprocessing


Parameters              Stemming    Stopwords    Operators
PS1 (as used by [1])    yes         removed      removed
PS2                     no          kept         removed
PS3 (only LM1000)       yes         removed      kept

Overlap Retrieved Documents and Viewed

[Figure: bar chart of the overlap between retrieved and viewed documents (y-axis, 0 to 2,000,000) for BM25, TFIDF and LM1000 under parameter sets PS1, PS2 and PS3, at c=10 and c=100.]

Conclusions

• Real and simulated queries differ with regard to

• composition of query sets

• number of (unique) terms used

• use of named entities

• Apart from document length and page confidence, we did not find strong evidence for technical bias

• Using real queries is important for realistic results

• Simulation strategies for queries need to be improved
