Evaluation


  • Evaluation
    LBSC 796/INFM 718R, Session 5, October 7, 2007
    Douglas W. Oard

  • Agenda
    Questions

    Evaluation fundamentals

    System-centered strategies

    User-centered strategies

  • IR as an Empirical Discipline
    Formulate a research question: the hypothesis
    Design an experiment to answer the question
    Perform the experiment
    Compare with a baseline control
    Does the experiment answer the question?
    Are the results significant? Or is it just luck?
    Report the results!

  • Evaluation Criteria
    Effectiveness: system-only, human+system
    Efficiency: retrieval time, indexing time, index size
    Usability: learnability, novice use, expert use

  • IR Effectiveness Evaluation
    User-centered strategy: given several users and at least two retrieval systems, have each user try the same task on both systems and measure which system works best
    System-centered strategy: given documents, queries, and relevance judgments, try several variations on the retrieval system and measure which ranks more good documents near the top

  • Good Measures of Effectiveness
    Capture some aspect of what the user wants
    Have predictive value for other situations (different queries, different document collections)
    Easily replicated by other researchers
    Easily compared (optimally, expressed as a single number)

  • Comparing Alternative Approaches
    Achieve a meaningful improvement (an application-specific judgment call)
    Achieve reliable improvement in unseen cases (can be verified using statistical tests)

  • Evolution of Evaluation
    Evaluation by inspection of examples
    Evaluation by demonstration
    Evaluation by improvised demonstration
    Evaluation on data using a figure of merit
    Evaluation on test data
    Evaluation on common test data
    Evaluation on common, unseen test data

  • Which is the Best Rank Order?
    (Figure: six candidate ranked lists, A-F, with relevant documents at different positions.)

  • Which is the Best Rank Order?
    (Figure: eight ranked lists, a-h, showing where the relevant documents, marked R, fall in each ranking.)

  • IR Test Collection Design
    Representative document collection: size, sources, genre, topics, ...
    Random sample of representative queries: built somehow from formalized topic statements
    Known binary relevance: for each topic-document pair (topic, not query!); assessed by humans, used only for evaluation
    Measure of effectiveness: used to compare alternate systems

  • What is relevance?
    "Relevance is the (measure | degree | dimension | estimate | appraisal | relation) of a (correspondence | utility | connection | satisfaction | fit | bearing | matching) existing between a (document | article | textual form | reference | information provided | fact) and a (query | request | information used | point of view | information need statement) as determined by a (person | judge | user | requester | information specialist)."
    Tefko Saracevic. (1975) Relevance: A Review of and a Framework for Thinking on the Notion in Information Science. Journal of the American Society for Information Science, 26(6), 321-343.
    Does this help?

  • Defining Relevance
    Relevance relates a topic and a document: duplicates are equally relevant by definition; constant over time and across users
    Pertinence relates a task and a document: accounts for quality, complexity, language, ...
    Utility relates a user and a document: accounts for prior knowledge

  • Another View
    (Venn diagram over the space of all documents: the Relevant and Retrieved sets overlap in the Relevant + Retrieved region; everything outside both is Not Relevant + Not Retrieved.)

  • Set-Based Effectiveness Measures
    Precision: how much of what was found is relevant? Often of interest, particularly for interactive searching
    Recall: how much of what is relevant was found? Particularly important for law, patents, and medicine
    Fallout: how much of what was irrelevant was rejected? Useful when different-size collections are compared

  • Effectiveness Measures
    Contingency table of documents and actions (rows are user-oriented, columns are system-oriented):

                      Retrieved             Not Retrieved
    Relevant          Relevant Retrieved    Miss
    Not relevant      False Alarm           Irrelevant Rejected

  • Single-Figure Set-Based Measures
    Balanced F-measure: harmonic mean of recall and precision
      Weakness: what if no relevant documents exist?
    Cost function: reward relevant retrieved, penalize non-relevant retrieved (for example, 3R+ - 2N+)
      Weakness: hard to normalize, so hard to average
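A minimal sketch of these set-based measures in Python (function and variable names are illustrative, not from the slides): it computes precision, recall, fallout, the balanced F-measure as the harmonic mean of precision and recall, and the example cost function 3R+ - 2N+.

```python
def set_measures(retrieved, relevant, collection_size):
    """Set-based effectiveness measures for a single query (illustrative sketch)."""
    retrieved, relevant = set(retrieved), set(relevant)
    rel_ret = len(retrieved & relevant)               # relevant and retrieved
    nonrel_ret = len(retrieved) - rel_ret             # false alarms
    nonrel_total = collection_size - len(relevant)    # all irrelevant documents

    precision = rel_ret / len(retrieved) if retrieved else 0.0
    recall = rel_ret / len(relevant) if relevant else 0.0   # undefined if no relevant docs exist
    fallout = nonrel_ret / nonrel_total if nonrel_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)         # balanced F: harmonic mean of P and R
    cost = 3 * rel_ret - 2 * nonrel_ret               # example cost function: 3R+ - 2N+
    return precision, recall, fallout, f1, cost
```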

  • Single-Valued Ranked Measures
    Expected search length: average rank of the first relevant document
    Mean precision at a fixed number of documents: precision at 10 docs is often used for Web search
    Mean precision at a fixed recall level: adjusts for the total number of relevant docs
    Mean breakeven point: value at which precision = recall
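A hedged sketch of two of these ranked measures: precision at a fixed cutoff, and the rank of the first relevant document (the quantity averaged in expected search length). Names are illustrative.

```python
def precision_at_k(ranking, relevant, k=10):
    """Precision over the top k documents of a ranked list."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def rank_of_first_relevant(ranking, relevant):
    """1-based rank of the first relevant document; None if no relevant document is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return rank
    return None
```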

  • Automatic Evaluation Model
    (Diagram: documents and a query go into the IR "black box", which produces a ranked list; an evaluation module combines the ranked list with relevance judgments to produce a measure of effectiveness.)
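As a rough sketch of the inputs to such an evaluation module, the following assumes standard TREC-style file layouts (qrels lines of the form "topic iteration docno relevance" and run lines of the form "topic Q0 docno rank score tag"); the helper names and paths are hypothetical.

```python
from collections import defaultdict

def read_qrels(path):
    """Read TREC-style relevance judgments: 'topic iteration docno relevance' per line."""
    qrels = defaultdict(set)
    with open(path) as f:
        for line in f:
            topic, _iteration, docno, rel = line.split()
            if int(rel) > 0:
                qrels[topic].add(docno)
    return qrels

def read_run(path):
    """Read a TREC-style run: 'topic Q0 docno rank score tag' per line, ranked by score."""
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            topic, _q0, docno, _rank, score, _tag = line.split()
            run[topic].append((float(score), docno))
    # Highest-scoring documents first for each topic
    return {t: [d for _, d in sorted(docs, reverse=True)] for t, docs in run.items()}
```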

  • Ad Hoc Topics
    In TREC, a statement of information need is called a topic.
    Title: Health and Computer Terminals
    Description: Is it hazardous to the health of individuals to work with computer terminals on a daily basis?
    Narrative: Relevant documents would contain any information that expands on any physical disorder/problems that may be associated with the daily working with computer terminals. Such things as carpal tunnel, cataracts, and fatigue have been said to be associated, but how widespread are these or other problems and what is being done to alleviate any health problems.

  • Questions About the Black Box
    Example questions:
      Does morphological analysis improve retrieval performance?
      Does expanding the query with synonyms improve retrieval performance?
    Corresponding experiments:
      Build a stemmed index and compare against an unstemmed baseline
      Expand queries with synonyms and compare against baseline unexpanded queries

  • Measuring Precision and Recall
    Assume there are a total of 14 relevant documents; relevant documents are retrieved at ranks 1, 5, 6, 8, 11, and 16.

    Rank:       1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17   18   19   20
    Precision:  1/1  1/2  1/3  1/4  2/5  3/6  3/7  4/8  4/9  4/10 5/11 5/12 5/13 5/14 5/15 6/16 6/17 6/18 6/19 6/20
    Recall:     1/14 1/14 1/14 1/14 2/14 3/14 3/14 4/14 4/14 4/14 5/14 5/14 5/14 5/14 5/14 6/14 6/14 6/14 6/14 6/14

  • Graphing Precision and Recall
    Plot each (recall, precision) point on a graph
    Visually represent the precision/recall tradeoff
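A small illustrative sketch (with hypothetical document ids) that computes the (recall, precision) point after each rank and plots the curve, using the example from the previous slide: 14 relevant documents in total, with the retrieved relevant ones at ranks 1, 5, 6, 8, 11, and 16.

```python
import matplotlib.pyplot as plt

def pr_points(ranking, relevant, total_relevant):
    """(recall, precision) after each rank, given the total number of relevant documents."""
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        hits += doc in relevant
        points.append((hits / total_relevant, hits / rank))
    return points

# Hypothetical ids: relevant documents appear at ranks 1, 5, 6, 8, 11, 16 of a 20-document ranking
ranking = [f"d{i}" for i in range(1, 21)]
relevant = {"d1", "d5", "d6", "d8", "d11", "d16"}
recall, precision = zip(*pr_points(ranking, relevant, total_relevant=14))

plt.plot(recall, precision, marker="o")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
```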

  • Uninterpolated Average Precision
    Average of precision at each retrieved relevant document; relevant documents not retrieved contribute zero to the score.
    For the example above (relevant documents at ranks 1, 5, 6, 8, 11, and 16, with 14 relevant in total), the 8 relevant documents not retrieved contribute eight zeros:
    MAP = (1/1 + 2/5 + 3/6 + 4/8 + 5/11 + 6/16 + eight zeros) / 14 = .2307

  • Uninterpolated MAP
    Mean of uninterpolated average precision over all topics
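A minimal sketch of uninterpolated average precision and its mean over topics; it reproduces the 0.2307 figure from the worked example above. Names are illustrative.

```python
def average_precision(ranking, relevant, total_relevant):
    """Uninterpolated AP: average of precision at each retrieved relevant document;
    relevant documents that are never retrieved contribute zero."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_relevant

def mean_average_precision(run, qrels, totals):
    """MAP: AP averaged over all topics in the run."""
    return sum(average_precision(run[t], qrels[t], totals[t]) for t in run) / len(run)

# Worked example: relevant documents at ranks 1, 5, 6, 8, 11, 16; 14 relevant in total
print(average_precision(list(range(1, 21)), {1, 5, 6, 8, 11, 16}, 14))  # ~0.2307
```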

  • Visualizing Mean Average Precision
    (Bar chart: average precision, on a 0.0 to 1.0 scale, for each topic.)

  • What MAP Hides
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

  • Why Mean Average Precision?
    It is easy to trade between recall and precision:
      Adding related query terms improves recall, but naive query expansion techniques kill precision
      Limiting matches by part-of-speech helps precision, but it almost always hurts recall
    Comparisons should give some weight to both; average precision is a principled way to do this, and is more central than other available measures

  • Systems Ranked by Different Measures
    Ranked by measure averaged over 50 topics
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

    Rank  P(10)      P(30)      R-Prec     Ave Prec   R at .5 Prec  Recall(1000)  Total Rel  Rank 1st Rel
     1    INQ502     INQ502     ok7ax      ok7ax      att98atdc     ok7ax         ok7ax      tno7tw4
     2    ok7ax      ok7ax      INQ502     att98atdc  ok7ax         tno7exp1      tno7exp1   bbn1
     3    att98atdc  INQ501     ok7am      att98atde  mds98td       att98atdc     att98atdc  INQ502
     4    att98atde  att98atdc  att98atdc  ok7am      ok7am         att98atde     bbn1       nectchall
     5    INQ501     nectchall  att98atde  INQ502     INQ502        Cor7A3rrf     att98atde  tnocbm25
     6    nectchall  att98atde  INQ501     mds98td    att98atde     ok7am         INQ502     MerAbtnd
     7    nectchdes  ok7am      bbn1       bbn1       INQ501        bbn1          INQ501     att98atdc
     8    ok7am      nectchdes  mds98td    tno7exp1   ok7as         pirc8Aa2      ok7am      acsys7al
     9    mds98td    INQ503     nectchdes  INQ501     bbn1          INQ502        Cor7A3rrf  mds98td
    10    INQ503     bbn1       nectchall  pirc8Aa2   nectchall     pirc8Ad       pirc8Aa2   ibms98a
    11    Cor7A3rrf  tno7exp1   ok7as      Cor7A3rrf  tno7exp1      INQ501        nectchdes  Cor7A3rrf
    12    tno7tw4    mds98td    tno7exp1   acsys7al   Cor7A3rrf     nectchdes     mds98td    ok7ax
    13    MerAbtnd   pirc8Aa2   acsys7al   ok7as      acsys7al      nectchall     acsys7al   att98atde
    14    acsys7al   Cor7A3rrf  pirc8Aa2   nectchdes  Cor7A2rrd     acsys7al      nectchall  Brkly25
    15    iowacuhk1  ok7as      Cor7A3rrf  nectchall  INQ503        mds98td       pirc8Ad    nectchdes

  • Correlations Between Rankings
    Kendall's tau computed between pairs of rankings
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

                   P(30)   R Prec  Ave Prec  R at .5 P  Recall(1000)  Total Rels  Rank 1st Rel
    P(10)          .8851   .8151   .7899     .7855      .7817         .7718       .6378
    P(30)                  .8676   .8446     .8238      .7959         .7915       .6213
    R Prec                         .9245     .8654      .8342         .8320       .5896
    Ave Prec                                 .8840      .8473         .8495       .5612
    R at .5 P                                           .7707         .7762       .5349
    Recall(1000)                                                      .9212       .5891
    Total Rels                                                                    .5880
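A small sketch of computing Kendall's tau between two system rankings with SciPy; the per-system scores below are made up for illustration (kendalltau works directly on the scores, since it is rank-based).

```python
from scipy.stats import kendalltau

# Hypothetical scores for the same systems under two different measures
map_scores = {"sysA": 0.31, "sysB": 0.28, "sysC": 0.22, "sysD": 0.19, "sysE": 0.15}
p10_scores = {"sysA": 0.52, "sysB": 0.55, "sysC": 0.41, "sysD": 0.36, "sysE": 0.40}

systems = sorted(map_scores)
tau, p_value = kendalltau([map_scores[s] for s in systems],
                          [p10_scores[s] for s in systems])
print(f"Kendall's tau between the two rankings: {tau:.3f}")
```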

  • Relevant Document Density
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

  • Alternative Ways to Get Judgments
    Exhaustive assessment is usually impractical: topics x documents = a large number!
    Pooled assessment leverages cooperative evaluation: requires a diverse set of IR systems
    Search-guided assessment is sometimes viable: iterate between topic research, search, and assessment; augment with review, adjudication, reassessment
    Known-item judgments have the lowest cost: tailor queries to retrieve a single known document; useful as a first cut to see if a new technique is viable

  • Obtaining Relevance Judgments
    Exhaustive assessment can be too expensive: TREC has 50 queries for >1 million docs each year
    Random sampling won't work: if relevant docs are rare, none may be found!
    IR systems can help focus the sample: each system finds some relevant documents; different systems find different relevant documents; together, enough systems will find most of them

  • Pooled Assessment Methodology
    Systems submit top 1000 documents per topic
    Top 100 documents for each are judged: single pool, without duplicates, arbitrary order; judged by the person that wrote the query
    Treat unevaluated documents as not relevant
    Compute MAP down to 1000 documents: treat precision for complete misses as 0.0
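A rough sketch of the pooling and scoring steps described above, under the assumption that each run is simply a ranked list of document ids for one topic; function names are illustrative.

```python
def build_pool(runs, depth=100):
    """Merge the top `depth` documents from each submitted run into a single,
    duplicate-free pool (order is irrelevant; the pool is what the topic author judges)."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:depth])
    return pool

def relevant_retrieved(ranking, judgments, cutoff=1000):
    """Relevant documents found in the top `cutoff`, treating unjudged documents
    (absent from `judgments`) as not relevant."""
    return [doc for doc in ranking[:cutoff] if judgments.get(doc, 0) > 0]
```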

  • Does pooling work?
    Judgments can't possibly be exhaustive!  (It doesn't matter: relative rankings remain the same.)
    This is only one person's opinion about relevance.  (It doesn't matter: relative rankings remain the same.)
    What about hits 101 to 1000?  (It doesn't matter: relative rankings remain the same.)
    We can't possibly use judgments to evaluate a system that didn't participate in the evaluation!  (Actually, we can!)
    Chris Buckley and Ellen M. Voorhees. (2004) Retrieval Evaluation with Incomplete Information. SIGIR 2004.
    Ellen Voorhees. (1998) Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. SIGIR 1998.
    Justin Zobel. (1998) How Reliable Are the Results of Large-Scale Information Retrieval Experiments? SIGIR 1998.

  • Lessons From TREC
    Incomplete judgments are useful, if the sample is unbiased with respect to the systems tested
    Different relevance judgments change absolute scores, but rarely change comparative advantages when averaged
    Evaluation technology is predictive: results transfer to operational settings
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

  • Effects of Incomplete Judgments
    Additional relevant documents are roughly uniform across systems and highly skewed across topics
    Systems that don't contribute to the pool get comparable results
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

  • Inter-Judge Agreement
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

    Assessor Group   Overlap
    Primary & A      .421
    Primary & B      .494
    A & B            .426
    All 3            .301

  • Effect of Different Judgments
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

  • Net Effect of Different Judges
    Mean Kendall tau between system rankings produced from different qrels sets: .938
    Similar results held for different query sets, different evaluation measures, different assessor types, and single-opinion vs. group-opinion judgments
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

  • Statistical Significance Tests
    How sure can you be that an observed difference doesn't simply result from the particular queries you chose?

    Experiment 1   Query:    1     2     3     4     5     6     7     Average
      System A               0.20  0.21  0.22  0.19  0.17  0.20  0.21  0.20
      System B               0.40  0.41  0.42  0.39  0.37  0.40  0.41  0.40

    Experiment 2   Query:    1     2     3     4     5     6     7     Average
      System A               0.02  0.39  0.16  0.58  0.04  0.09  0.12  0.20
      System B               0.76  0.07  0.37  0.21  0.02  0.91  0.46  0.40

  • Statistical Significance Testing
    Try some out at: http://www.fon.hum.uva.nl/Service/Statistics.html

  • How Much is Enough?
    Measuring improvement:
      Achieve a meaningful improvement (guideline: 0.05 is noticeable, 0.1 makes a difference)
      Achieve reliable improvement on typical queries (Wilcoxon signed rank test for paired samples)
    Know when to stop!
      Inter-assessor agreement limits maximum precision
      Using one judge to assess the other yields about 0.8
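The Wilcoxon signed rank test mentioned above is available in SciPy; here is a hedged sketch applying it to the per-query scores from the two hypothetical experiments on the "Statistical Significance Tests" slide.

```python
from scipy.stats import wilcoxon

# Per-query scores for Systems A and B in the two hypothetical experiments
exp1_a = [0.20, 0.21, 0.22, 0.19, 0.17, 0.20, 0.21]
exp1_b = [0.40, 0.41, 0.42, 0.39, 0.37, 0.40, 0.41]
exp2_a = [0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12]
exp2_b = [0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46]

for name, a, b in [("Experiment 1", exp1_a, exp1_b), ("Experiment 2", exp2_a, exp2_b)]:
    statistic, p = wilcoxon(a, b)  # paired samples, two-sided by default
    print(f"{name}: W = {statistic:.1f}, p = {p:.4f}")
```

Both experiments show the same 0.20 difference in mean score, but only Experiment 1, where System B wins on every query, yields a small p-value.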

  • Recap: Automatic Evaluation
    Evaluation measures focus on relevance; users also want utility and understandability
    Goal is to compare systems: values may vary, but relative differences are stable
    Mean values obscure important phenomena: augment with failure analysis and significance tests

  • Automatic Evaluation
    (Diagram: the automatic evaluation model, with the ranked list as the output being evaluated.)

  • User Studies
    (Diagram: the evaluation model extended with the user's query formulation and document examination.)

  • User Studies
    Goal is to account for interface issues, by studying the interface component or by studying the complete system
    Formative evaluation: provides a basis for system development
    Summative evaluation: designed to assess performance

  • Questions That Involve Users
    Example questions:
      Does keyword highlighting help users evaluate document relevance?
      Is letting users weight search terms a good idea?
    Corresponding experiments:
      Build two different interfaces, one with keyword highlighting and one without; run a user study
      Build two different interfaces, one with term weighting functionality and one without; run a user study

  • Blair and Maron (1985)
    A classic study of retrieval effectiveness; earlier studies used unrealistically small collections
    Studied an archive of documents for a lawsuit: 40,000 documents, ~350,000 pages of text; 40 different queries; used IBM's STAIRS full-text system
    Approach: lawyers wanted at least 75% of all relevant documents; precision and recall were evaluated only after the lawyers were satisfied with the results
    David C. Blair and M. E. Maron. (1985) An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System. Communications of the ACM, 28(3), 289-299.

  • Blair and Maron's Results
    Mean precision: 79%; mean recall: 20% (!!)
    Why was recall so low? Users can't anticipate the terms used in relevant documents: differing technical terminology, slang, misspellings ("accident" might be referred to as "event", "incident", "situation", "problem", ...)
    Other findings: searches by both lawyers had similar performance; the lawyers' recall was not much different from the paralegals'

  • Quantitative User Studies
    Select independent variable(s), e.g., what info to display in the selection interface
    Select dependent variable(s), e.g., time to find a known relevant document
    Run subjects in different orders to average out learning and fatigue effects
    Compute statistical significance: the null hypothesis is that the independent variable has no effect; rejected if p is below a chosen threshold (typically 0.05)
  • Variation in Automatic Measures
    System: what we seek to measure
    Topic: sample the topic space, compute the expected value
    Topic + System: pair by topic and compute statistical significance
    Collection: repeat the experiment using several collections

  • Additional Effects in User Studies
    Learning: vary topic presentation order
    Fatigue: vary system presentation order
    Topic + User (expertise): ask about prior knowledge of each topic

  • Presentation Order
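The "Presentation Order" slide presumably showed a counterbalanced ordering design. As a minimal sketch of the idea (names illustrative), the following builds a cyclic Latin square so that each system or topic appears once in every presentation position across participants, spreading learning and fatigue effects.

```python
def presentation_orders(conditions):
    """Cyclic Latin square: row i is the presentation order for participant i,
    and each condition occupies each position exactly once across rows."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]

for order in presentation_orders(["System A", "System B", "System C", "System D"]):
    print(order)
```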

  • Batch vs. User Evaluations
    Do batch (black box) and user evaluations give the same results? If not, why?
    Two different tasks:
      Instance recall (6 topics), e.g.: What countries import Cuban sugar? What tropical storms, hurricanes, and typhoons have caused property damage or loss of life?
      Question answering (8 topics), e.g.: Which painting did Edvard Munch complete first, Vampire or Puberty? Is Denmark larger or smaller in population than Norway?
    Andrew Turpin and William Hersh. (2001) Why Batch and User Evaluations Do Not Give the Same Results. Proceedings of SIGIR 2001.

  • Results
    Compared two systems: a baseline system, and an improved system that was provably better in batch evaluations
    (Results shown as a chart.)

  • Example User Study Results
    F = 0.8; English queries, German documents; 4 searchers, 20 minutes per topic

    (Embedded spreadsheet and chart residue, summarized. The charts plotted F-measure by topic, comparing the AUTO and MANU, i.e. user-assisted, conditions under strict and loose relevance judgments, for a small collection, a large collection, and the official results. Per-topic figures for the large collection, Automatic vs. User-Assisted:

    Topic     Automatic   User-Assisted
    1         0.3366      0.5890
    2         0.1655      0.0909
    3         0.4052      0.5301
    4         0.4411      0.7878
    Average   0.3371      0.4995)

  • Qualitative User Studies
    Observe user behavior: instrumented software, eye trackers, etc.; face and keyboard cameras; think-aloud protocols; interviews and focus groups
    Organize the data: for example, group it into overlapping categories
    Look for patterns and themes
    Develop a grounded theory

  • Example: Mentions of Relevance Criteria by Searchers
    Thesaurus-based search, recorded interviews

    Relevance Criteria        All (N=703)   Think-Aloud: Rel. Judgment (N=300)   Think-Aloud: Query Form. (N=248)
    Topicality                535 (76%)     219                                  234
    Richness                   39 (5.5%)     14                                    0
    Emotion                    24 (3.4%)      7                                    0
    Audio/Visual Expression    16 (2.3%)      5                                    0
    Comprehensibility          14 (2%)        1                                   10
    Duration                   11 (1.6%)      9                                    0
    Novelty                    10 (1.4%)      4                                    2

  • Topicality
    Total mentions by 8 searchers (thesaurus-based search, recorded interviews):

    Object                 26
    Time Frame             32
    Organization/Group     44
    Subject                73
    Event/Experience      116
    Place                 120
    Person                124

  • Questionnaires
    Demographic data (for example, computer experience): basis for interpreting results
    Subjective self-assessment: which did they think was more effective? Often at variance with objective results!
    Preference: which interface did they prefer? Why?

  • Summary
    Qualitative user studies suggest what to build
    Design decomposes the task into components
    Automated evaluation helps to refine components
    Quantitative user studies show how well it works

  • One Minute Paper
    If I demonstrated a new retrieval technique that achieved a statistically significant improvement in average precision on the TREC collection, what would be the most serious limitation to consider when interpreting that result?

    IR isn't very different from other fields we think of as experimental, e.g., a chemist in the lab mixing bubbling test tubes.
    Relevance judgments did differ for some topics.
    Quantify differences using overlap: overlap scores were generally higher than in most previous studies.