Querylog-based Assessment of Retrievability Bias in a Large Newspaper Corpus

Myriam C. Traub, Thaer Samar, Jacco van Ossenbruggen, Jiyin He, Arjen de Vries, Lynda Hardman



Motivation

• Users want to be able

• to get a fair overview of the archive’s content

• to access all (relevant) documents in the archive

• However,

• data collections are implicitly and explicitly biased,

• users are biased,

• and technology induces even more bias(es)

… which I can deal with if the bias is made explicit.


Retrievability Bias

• Bias in search results

• Potential sources are:

• User interest

• Search skills of users

• Users’ willingness to explore results

• Collection bias (indexed documents)

• OCR errors

• Side-effects of ranking algorithm

• Side-effects of result presentation


Research Questions

RQ1: Detecting and quantifying retrievability bias

RQ2: Influence of document features on retrievability bias

RQ3: Representativeness of simulated queries and experimental setup


Retrievability

• Introduced by Azzopardi et al. [1] in 2008 in a study based on born-digital documents and simulated queries

• Retrievability score counts how often a document is retrieved as one of the top K documents by a given set of queries

• Gini coefficient and Lorenz curves can visualize and quantify inequality in the distribution of the scores
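A minimal sketch of the counting described above, assuming the result list of every query is already available as a ranked list of document IDs. The function name `retrievability_scores`, the `rankings` dictionary and the toy IDs are illustrative assumptions, not part of the study; every query is weighted equally and a document is assumed to occur at most once per ranking.

```python
from collections import Counter


def retrievability_scores(rankings, cutoff):
    """Count how often each document appears in the top `cutoff` results.

    rankings: dict mapping a query string to its ranked list of document IDs.
    Documents that never make the cutoff get no entry, i.e. a score of 0.
    """
    scores = Counter()
    for ranked_docs in rankings.values():
        for doc_id in ranked_docs[:cutoff]:
            scores[doc_id] += 1
    return scores


# Hypothetical toy rankings for three queries.
rankings = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d2", "d1", "d4"],
    "q3": ["d2", "d5", "d1"],
}
scores = retrievability_scores(rankings, cutoff=2)
print(dict(scores))
```

With a cutoff of 2, `d3` and `d4` are never counted although both appear in a ranking, which illustrates how strongly the cutoff shapes the score distribution.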

[1] L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.


Lorenz Curve & Gini Coefficient

• Introduced by economists to express and visualize inequality in wealth distribution

• Gini coefficient (G):

• perfect communist (G=0), e.g. incomes 1, 1, 1, 1, 1

• in-between (G=0.5), e.g. incomes 0, 0, 1, 1, 2

• perfect tyranny (G=0.8), e.g. incomes 0, 0, 0, 0, 1

• There is no good or bad G.

[Figure: Lorenz curves for n=5 for the three example distributions. x-axis: % of population (for retrievability: % of documents); y-axis: % of income (for retrievability: % of accumulated r(d)).]
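The three example distributions can be checked with a short sketch. It uses the common rank-based form of the Gini coefficient, G = Σ (2i − n − 1) x_i / (n Σ x_i) over the values sorted in ascending order; the function names `gini` and `lorenz_points` are illustrative assumptions, and the code is not taken from the study.

```python
def lorenz_points(values):
    """Return the (population share, cumulative value share) points of the Lorenz curve."""
    xs = sorted(values)
    total = sum(xs)
    if not xs or total == 0:
        raise ValueError("need at least one value and a positive total")
    points = [(0.0, 0.0)]
    running = 0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((i / len(xs), running / total))
    return points


def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a positive total")
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


for incomes in ([1, 1, 1, 1, 1], [0, 0, 1, 1, 2], [0, 0, 0, 0, 1]):
    print(incomes, round(gini(incomes), 2))
```

For a finite population the largest value this formula can reach is (n − 1)/n, which is why the most unequal distribution for n=5 yields G=0.8 rather than 1.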


Experimental setup / Parameters

• Digitized collection of Dutch historic newspapers

• View data extracted from user logs

• Real queries, simulated queries

• Standard Information Retrieval models: TFIDF, LM1000, BM25 (using Lemur framework)

• Pre-processing (corpus & queries): Stemming, stopword removal, operator removal

• Cutoff values: c=10, c=100, c=1000


Document Collection: Dutch Newspaper Archive

Period: June 1618 - December 1995

Document type      Share   Count
Articles           67%     69,237,655
Advertisements     29%     29,591,599
Notifications*     2%      1,918,375
Captions           2%      1,970,899
Total size                 102,718,528
Vocabulary size            353,086,358

* Familiebericht


Simulated Queries

• Followed a strategy similar to that of previous studies

• Top 2 million single terms from the preprocessed corpus + top 2 million bigram terms

• No filtering for OCR errors
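A minimal sketch of this kind of query generation, assuming that "top" means most frequent and that the corpus is already tokenized and preprocessed. The function name `simulated_queries`, the `top_n` parameter and the toy documents are illustrative assumptions; the study used 2 million terms of each kind.

```python
from collections import Counter


def simulated_queries(documents, top_n):
    """Build single-term and bigram queries from the most frequent terms.

    documents: iterable of token lists (already stemmed, stopwords removed).
    Returns top_n unigram queries followed by top_n bigram queries.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in documents:
        unigrams.update(tokens)
        # Adjacent token pairs within one document.
        bigrams.update(zip(tokens, tokens[1:]))
    single = [term for term, _ in unigrams.most_common(top_n)]
    double = [" ".join(pair) for pair, _ in bigrams.most_common(top_n)]
    return single + double


# Hypothetical toy corpus.
docs = [["oude", "krant", "amsterdam"], ["oude", "krant"], ["krant"]]
queries = simulated_queries(docs, top_n=2)
print(queries)
```

Because no filtering is applied, frequent OCR misrecognitions in the corpus would end up as queries just like correctly recognized terms.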


Real Queries

• User logs collected between March and July 2015 on Delpher, the online web service of the National Library of the Netherlands

• Extracted queries and viewed items related to newspaper archive

• Total of 957,239 unique queries


RQ1: Detecting and Quantifying Retrievability Bias


Inequality, c=10

Real queries: G_BM25 = 0.97

Simulated queries: G_BM25 = 0.85

A very large fraction of documents is never retrieved.


Inequality

Real queries, c=1000: G_BM25 = 0.76

Simulated queries, c=100: G_BM25 = 0.52


Limitations

• The Lorenz curves and Gini values

• are strongly influenced by non-retrieved documents,

• can indicate the degree of bias, but they tell us nothing about the type of bias.

Does the inequality arise from the users’ interest / search behavior? Or from a technological bias towards a particular document feature?


Are Retrievability Scores Meaningful?

• Created 4 subsets of documents according to their score and selected a set of target documents from each subset

• Generated queries from selected documents, tailored to retrieve these specific documents

• Performed search tasks and measured ranks of target documents

• Showed that documents with lower score are actually harder to find
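The rank measurement in the steps above could be sketched as follows. This is an illustration under assumptions, not the procedure used in the study: the function names, the `tasks` structure and the toy rankings are hypothetical, and target documents that are not found at all are simply left out of the mean.

```python
def target_rank(ranking, target):
    """Return the 1-based rank of `target` in `ranking`, or None if it is absent."""
    try:
        return ranking.index(target) + 1
    except ValueError:
        return None


def mean_rank_per_subset(tasks):
    """Average the rank of the target documents per subset.

    tasks: dict mapping a subset label to a list of (ranking, target) pairs,
    one pair per search task.
    """
    result = {}
    for label, items in tasks.items():
        ranks = [target_rank(ranking, target) for ranking, target in items]
        found = [rank for rank in ranks if rank is not None]
        result[label] = sum(found) / len(found) if found else None
    return result


# Hypothetical search tasks for two of the four subsets.
tasks = {
    "Rarely": [(["a", "b", "x"], "x"), (["y", "c"], "z")],
    "Very often": [(["v", "a"], "v"), (["b", "w"], "w")],
}
mean_ranks = mean_rank_per_subset(tasks)
print(mean_ranks)
```

A higher mean rank, or more targets that are not found, for the low-score subsets would point in the same direction as the finding that documents with lower scores are harder to find.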

Subset labels shown on the slide: Rarely, Sometimes, Often, Very often

RQ2: Influence of Document Features

OCR Confidence Scores

• Generated by OCR engine during digitization

• Documents ordered by page confidence (PC) and split into bins

• Mean score per bin

[Figure: scatter plot of mean r(d) per bin (y-axis, 0.5 to 2.0) against bins based on page confidence (PC) (x-axis, 0 to 5000).]
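The binning described above can be sketched as follows, assuming each document is given as a pair of feature value and retrievability score. The function name `mean_score_per_bin`, the `bin_size` parameter and the toy values are illustrative assumptions; the same sketch applies to any document feature that can be sorted, such as document length.

```python
def mean_score_per_bin(docs, bin_size):
    """Order documents by a feature and average their scores per bin.

    docs: list of (feature_value, score) pairs, e.g. (page confidence, r(d)).
    Returns one mean score per consecutive bin of `bin_size` documents;
    the last bin may hold fewer documents.
    """
    ordered = sorted(docs, key=lambda pair: pair[0])
    means = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        means.append(sum(score for _, score in chunk) / len(chunk))
    return means


# Hypothetical (page confidence, r(d)) pairs.
docs = [(0.9, 4), (0.2, 0), (0.5, 1), (0.7, 3), (0.3, 1), (0.8, 2)]
means = mean_score_per_bin(docs, bin_size=2)
print(means)
```

Plotting the returned means against the bin index gives the kind of scatter plot shown on the slide.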


Document Length

• Documents ordered by length and split into bins of 20,000

• LM1000 (left): upward trend, longer documents more retrievable

• BM25 and TFIDF (right): seem to be better at retrieving documents of medium length

[Figure: two scatter plots of mean r(d) per bin (y-axis) against bins based on document length (x-axis, 0 to 5000); left panel: LM1000, right panel: BM25 and TFIDF.]

RQ3: Representativeness of Simulated Queries and Experimental Setup


Top retrieved article for real queries


Top retrieved article(s) for simulated queries


Differences between query sets

• Real queries:

• Mean length: 2.32 terms

• Unique terms: 253,637

• 56 references to persons or locations in top 100 terms

• Simulated queries:

• Mean length: 1.5 terms

• Unique terms: 2,028,617

• 5 references to persons or locations in top 100 terms

Actual views

• Only 2.7M out of 102M documents were viewed by users (G = 0.98)

• most documents have not been viewed at all

• many documents only viewed once

• very few are viewed multiple times

[Figure: document counts (y-axis, logarithmic, 1 to 1,000,000) by number of views (x-axis, 5 to 700).]


Overlap with views

• How many documents were viewed by the users, but not retrieved in our study?

• Many non-retrieved documents

• were found using facets or operators

• scored a rank just below the cutoff

• A better representation of the real search engine is needed, one that takes faceted search and operators into account
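The overlap question above comes down to a set comparison. A minimal sketch, assuming the viewed documents and the documents retrieved at a given cutoff are available as collections of IDs; the function name `split_viewed` and the toy IDs are illustrative assumptions.

```python
def split_viewed(viewed_ids, retrieved_ids):
    """Split the viewed documents into those retrieved in the experiment and those not.

    Returns a pair of sets: (viewed and retrieved, viewed but not retrieved).
    """
    viewed = set(viewed_ids)
    retrieved = viewed & set(retrieved_ids)
    return retrieved, viewed - retrieved


# Hypothetical IDs: documents viewed by users and documents retrieved at some cutoff c.
viewed = ["d1", "d2", "d3", "d4"]
retrieved_at_c = ["d2", "d4", "d9"]
hit, missed = split_viewed(viewed, retrieved_at_c)
print(len(hit), len(missed))
```

Running the comparison once per cutoff (c=10, c=100, c=1000) gives the retrieved and non-retrieved counts for each cutoff.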

[Figure: bar chart of viewed documents split into Retrieved and Non-Retrieved for c=10, c=100 and c=1000 (y-axis: 0 to 3).]

Document Types Viewed

Document type    Simulated   Real    Viewed
Article          3.89        0.90    2.61%
Advertisement    3.32        0.51    2.07%
Notification     3.22        4.80    40.10%
Caption          3.06        0.84    4.01%


Conclusions

• Real and simulated queries differ in regard to

• composition of query sets

• number of (unique) terms used

• use of named entities

• Apart from document length and page confidence, we did not find strong evidence for technical bias

• Using real queries is important for realistic results

• Simulation strategies for queries need to be improved

• Retrievability studies should take faceted search and operators into account


We would like to thank the [logo not preserved] for making the newspaper corpus and the (sensitive) user data available to us for research.

Supported by [logo not preserved] travel grant
