
Evaluation of IR systems

Jane Reid

jane@dcs.qmul.ac.uk


Lecture plan

• Background

• System-centred evaluation

• User-centred evaluation


The changing face of evaluation

• Originally...
  – Batch IR systems
  – Small, textual collections
  – Queries formulated by searchers

• Today...
  – Interactive IR systems
  – Large collections of different or mixed media
  – Queries formulated by end-users


Elements of evaluation

• When we evaluate, we need to establish:
  – Methodology
  – Criterion
  – Measure
  – Tool
  – Method of data analysis


System-centred evaluation

• (Comparative) evaluation of technical performance of IR system(s)

• Methodology = non-interactive experiment

• Criterion = relevance

• Measure = effectiveness

• Tool = test collection

• Method of data analysis = recall / precision


Relevance

• Relevant = “having significant and demonstrable bearing on the matter at hand”

• Underlying assumptions:
  – Objectivity
  – Topicality
  – Binary nature
  – Independence


Effectiveness

• Effectiveness = the ability of the IR system to retrieve relevant documents and suppress non-relevant documents


Test collection

• Components:
  – Document collection
  – Queries / requests
  – Relevance judgements


Test collection creation

• Manual method:
  – Every document judged against every query by one of several judges

• Pooling method (see the sketch below):
  – Queries run against several IR systems first
  – Results pooled, and top proportion chosen for judging
  – Only top documents are judged
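A minimal sketch of the pooling idea in Python (illustrative only; the run data, document ids and pool depth below are invented, not taken from the lecture):

    # Build a judging pool from several systems' ranked results for one query.
    def build_pool(runs, depth=100):
        """Union of the top-`depth` documents from every run."""
        pool = set()
        for ranking in runs:
            pool.update(ranking[:depth])
        return pool

    # Two hypothetical runs for the same query.
    runs = [
        ["d12", "d7", "d3", "d40"],   # system A's ranking
        ["d7", "d99", "d12", "d5"],   # system B's ranking
    ]

    # Only the pooled documents are judged; documents outside the pool
    # are usually treated as non-relevant.
    print(build_pool(runs, depth=3))   # {'d3', 'd7', 'd12', 'd99'} (order may vary)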


Recall / precision [1]

[Figure: Venn diagram of the document collection, showing the set of retrieved documents, the set of relevant documents, and their overlap (documents that are both retrieved and relevant)]


Recall / precision [2]

• Recall = proportion of relevant documents that are retrieved, i.e.
  number of relevant documents retrieved / total number of relevant documents

• Precision = proportion of retrieved documents that are relevant, i.e.
  number of relevant documents retrieved / number of documents retrieved
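As a toy illustration of the two measures (the document ids and judgements below are made up):

    relevant  = {"d1", "d4", "d7", "d9"}          # documents judged relevant
    retrieved = ["d4", "d2", "d7", "d5", "d1"]    # documents the system returned

    hits = len(relevant.intersection(retrieved))  # relevant documents retrieved = 3
    recall    = hits / len(relevant)              # 3 / 4 = 0.75
    precision = hits / len(retrieved)             # 3 / 5 = 0.6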


How to use a test collection

• For each system / system version
  – For each query in the test collection
    • Run query against system to obtain ranking
    • Use ranking and relevance judgements to calculate recall/precision (r/p) pairs at each recall point (see the sketch below)
    • Interpolate to standard recall points if necessary
  – Average r/p values across all queries in table / graph form

• Produce r/p graph for all systems
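The per-query step of this procedure might look roughly as follows in Python (a sketch using invented data, not the lecture's own code):

    def rp_pairs(ranking, relevant):
        """(recall, precision) pair at each rank where a relevant document appears."""
        pairs, hits = [], 0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                pairs.append((hits / len(relevant), hits / rank))
        return pairs

    # Three relevant documents, retrieved at ranks 1, 3 and 6.
    print(rp_pairs(["d1", "d8", "d4", "d2", "d9", "d7"], {"d1", "d4", "d7"}))
    # approximately [(0.33, 1.0), (0.67, 0.67), (1.0, 0.5)]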


[Figure: Interpolation. Precision (0-1) plotted against recall (0-1), showing an observed recall/precision value and the corresponding interpolated values at the standard recall points]
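A common interpolation rule takes, at each standard recall level, the maximum precision observed at any equal or higher recall. A small sketch, reusing the r/p pairs from the previous example (the rule and data are illustrative assumptions, not taken from the slide):

    def interpolate(pairs, levels=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5,
                                   0.6, 0.7, 0.8, 0.9, 1.0)):
        """Interpolated precision at each standard recall level:
        the highest precision observed at recall >= that level."""
        return [(r, max((p for rec, p in pairs if rec >= r), default=0.0))
                for r in levels]

    pairs = [(1/3, 1.0), (2/3, 2/3), (1.0, 0.5)]
    print(interpolate(pairs))
    # precision 1.0 at recall 0.0-0.3, about 0.67 at 0.4-0.6, 0.5 at 0.7-1.0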


Averaging [1]

                Precision
  Recall        Query 1     Query 2     Average
  0.1           0.8         0.6         0.7
  0.2           0.8         0.5         0.65
  0.3           0.6         0.4         0.5
  0.4           0.6         0.3         0.45
  0.5           0.4         0.25        0.325
  0.6           0.4         0.2         0.3
  0.7           0.3         0.15        0.225
  0.8           0.3         0.1         0.2
  0.9           0.2         0.05        0.125
  1.0           0.2         0.05        0.125
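For instance, the averages in the first two rows can be reproduced with a few lines of Python (macro-averaging across queries; the data structures are just an illustration):

    # Per-query precision at two of the standard recall levels (from the table).
    query1 = {0.1: 0.8, 0.2: 0.8}
    query2 = {0.1: 0.6, 0.2: 0.5}

    average = {r: (query1[r] + query2[r]) / 2 for r in query1}
    print(average)   # {0.1: 0.7, 0.2: 0.65}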


Averaging [2]

[Figure: precision plotted against recall (both 0-1) for query 1, query 2 and their average]


Comparison of systems

[Figure: precision plotted against recall (both 0-1) for system 1 and system 2, allowing the two systems to be compared]
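One possible way to produce such a comparison graph (matplotlib is an assumed choice here, and the precision values are invented):

    import matplotlib.pyplot as plt

    recall  = [i / 10 for i in range(11)]
    system1 = [1.0, 0.9, 0.8, 0.7, 0.65, 0.6, 0.5, 0.45, 0.4, 0.3, 0.25]
    system2 = [0.9, 0.8, 0.7, 0.6, 0.5, 0.45, 0.4, 0.3, 0.25, 0.2, 0.15]

    plt.plot(recall, system1, marker="o", label="system 1")
    plt.plot(recall, system2, marker="s", label="system 2")
    plt.xlabel("recall")
    plt.ylabel("precision")
    plt.legend()
    plt.show()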


Examples of test collections [1]

• TREC (Text REtrieval Conference)
  – Started in 1992, run by the National Institute of Standards and Technology (NIST)
  – Components
    • Huge document collection (several GB), taken from the Wall Street Journal, Financial Times, etc.
    • New documents, topics (i.e. requests, including description and narrative fields) and relevance judgements (performed by retired civil servants) each year


Examples of test collections [2]

– Participants
  • Industrial, commercial and academic
  • Must submit results of retrieval tasks to the TREC conference each November

– “Tracks”
  • Ad-hoc + routing (filtering)
  • Also: interactive, cross-lingual, Web, spoken document, short queries, …


Examples of test collections [3]

• CIS
  – 1239 documents about cystic fibrosis from NLM’s MEDLINE collection
  – Fields: author, title, source, major and minor subjects, abstracts, references and citations
  – 100 queries, developed by relevance judges


Examples of test collections [4]

– Unusual features:
  • 4 judges per document per query (3 experts, 1 medical bibliographer)
  • 3 levels of relevance (0-2)
  • Combined relevances on a scale of 0-8


Examples of test collections [5]

• CACM
  – 3024 articles on computer science from CACM, 1958-1979
  – Fields: author, date, word stems for titles and abstracts, categories, direct referencing, bibliographic coupling, number of co-citations for each pair of articles
  – 52 queries, each with 2 Boolean formulations


Examples of test collections [6]

– Unusual features:
  • Citation links to other documents, so often used for hypertext-type experiments


User-centred evaluation

• Evaluation of interface / interaction

• Methodology = interactive experiment, ethnographic study, ...

• Many different criteria, measures, tools and methods of data analysis
  – No standard user-centred methodology
  – Elements often borrowed from other areas, e.g. HCI, experimental psychology


User-centred issues: layers model

[Diagram: layers model, with traditional test collection evaluation at the core, surrounded by successive layers for different document types, interaction, strategy, tasks and learning]


Test collection

• Advantages
  – Cheap and easy for evaluator
  – Cross-system comparison possible

• Limitations
  – Static requests / queries
  – Objective, topical relevance judgements made by domain experts
  – Does not evaluate interaction


Different document types

• Multi-media documents
  – Images
    • Topical relevance
    • Non-topical relevance
  – Speech
    • Recognition
    • Retrieval

• Structured collections


Interaction [1]

• Data characteristics
  – Size of documents
  – Size of collection

• System characteristics
  – Retrieval effectiveness
  – Functionality
  – Interface features


Interaction [2]

• User
  – Domain expertise
  – System expertise
  – Task
  – Subjects vs real users

• Contextual
  – Social and environmental factors


Strategy

• System characteristics
  – Type of access (query-based, browsing, mixed)
  – Functional visibility

• Search characteristics
  – Topic focus
  – Tactics and search strategy

• User characteristics
  – Mental/cognitive models


Tasks

• Real

• Simulated
  – Past real
  – Fictitious


Learning

• System
  – Dynamic weighting of terms/documents
  – Case-based retrieval
  – User modelling

• User
  – Evolving information needs
  – Learning about domain/collection/system
  – Sociological view


Measures [1]

• From IR
  – Evaluation of results
    • Aspectual recall/precision
    • Pertinence
    • Utility


Measures [2]

• From information science/HCI
  – Evaluation of results
    • Task performance
  – Evaluation of process
    • Quantitative: time, number of errors
    • Qualitative: usability
  – Evaluation of overall quality of experience
    • User satisfaction


Tools [1]

• From information science/HCI
  – Before the session
    • Cognitive walkthroughs
    • Interviews/questionnaires
  – During the session
    • Observation
    • Think-aloud protocols


Tools [2]

– After the session
  • Interviews/questionnaires
  • Focus groups


Large-scale experiments

• Interactive TREC

• OKAPI


User-centred evaluation [1]

• What is to be evaluated?
  – e.g. IR system using new underlying model

• Why do we want to evaluate?
  – e.g. functionality, usability

• How will we evaluate?
  – e.g. effectiveness, efficiency, satisfaction


User-centred evaluation [2]

• Example evaluation measures:

                   Functionality        Usability
  Effectiveness    recall/precision     quality of solution
  Efficiency       retrieval time       task completion time
  Satisfaction     preference           confidence


Experimental design process

• Formulate research hypothesis

• Formulate experimental hypotheses

• Design experiment(s)

• Conduct pilot test and experiment(s)

• Analyse data

• Evaluate experimental hypotheses


Simple experimental design [1]

• Controlled experiment in laboratory setting

• One group of participants

• Each participant performs one or more tasks
  – Pre-defined tasks vs “real” tasks


Simple experimental design [2]

• Example data gathered at task stages:
  – Stage 1: Formulate information need
  – Stage 2: Gather information
    • Task completion time
    • Information-seeking behaviour
      – Use of observation, recording, think-aloud protocols


Simple experimental design [3]

• Example data (continued):
  – Stage 3: Use information
    • Confidence
      – Use of questionnaires, interviews using Likert scales / semantic differentials
  – Stage 4: Assess information
    • Quality of solution
      – Independent assessment of task output


Simple experimental design [4]

• Analysis:
  – Mostly qualitative, with summary statistics
  – Common-sense interpretation of results
  – Use of pre-defined benchmarks


Complex experimental design [1]

• Other controlled experiments:
  – Within-subject, e.g. longitudinal study
  – Between-subject
    • Comparative study looking at effect of:
      – System type, e.g. variations in algorithm used
      – Task type
      – User characteristics, e.g. domain knowledge, general computer literacy, system knowledge
    • Comparison with control group


Complex experimental design [2]

• Other controlled experiments (continued):
  – Mixed within-subject / between-subject
    • Examine effect of interaction of variables

• Analysis:
  – Quantitative:
    • Summary statistics
    • Significance testing (see the sketch below)
  – Qualitative
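As an illustration of the significance-testing step, a paired test over per-query scores from two systems might look like this (the use of scipy and the score values are assumptions for the sketch, not part of the lecture):

    from scipy.stats import wilcoxon

    # Hypothetical per-query effectiveness scores for two systems.
    system_a = [0.42, 0.35, 0.58, 0.29, 0.61, 0.47, 0.38, 0.55]
    system_b = [0.39, 0.30, 0.52, 0.31, 0.57, 0.40, 0.36, 0.49]

    statistic, p_value = wilcoxon(system_a, system_b)
    print(p_value)   # a small p-value suggests the difference is unlikely to be chance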


Complex experimental design [3]

• Operational / ethno-methodological experiments
  – Evaluation in a “semi-real” or “real” setting of the “acceptability” of the system

• Analysis
  – Mostly qualitative


Complex experimental design [4]

• Case studies
  – Detailed evaluation using a single or small number of participant(s)
  – Possible to examine cognitive and affective issues

• Analysis
  – Mostly qualitative


Summary

• System-centred evaluation
  – Uses test collection methodology, with recall and precision
  – Good for evaluating technical performance

• User-centred evaluation
  – No standard methodology
  – Good for evaluating interface / interaction

• Usually necessary to use a combination