Evaluation
LBSC 796/INFM 718R, Session 5, October 7, 2007
Douglas W. Oard
Agenda
- Questions
- Evaluation fundamentals
- System-centered strategies
- User-centered strategies
IR as an Empirical Discipline
- Formulate a research question: the hypothesis
- Design an experiment to answer the question
- Perform the experiment
  - Compare with a baseline control
- Does the experiment answer the question?
  - Are the results significant? Or is it just luck?
- Report the results!
Evaluation Criteria
- Effectiveness: system-only, human+system
- Efficiency: retrieval time, indexing time, index size
- Usability: learnability, novice use, expert use
IR Effectiveness Evaluation
- User-centered strategy
  - Given several users and at least 2 retrieval systems
  - Have each user try the same task on both systems
  - Measure which system works best
- System-centered strategy
  - Given documents, queries, and relevance judgments
  - Try several variations on the retrieval system
  - Measure which ranks more good docs near the top
Good Measures of Effectiveness
- Capture some aspect of what the user wants
- Have predictive value for other situations
  - Different queries, different document collections
- Easily replicated by other researchers
- Easily compared
  - Optimally, expressed as a single number
Comparing Alternative Approaches
- Achieve a meaningful improvement
  - An application-specific judgment call
- Achieve reliable improvement in unseen cases
  - Can be verified using statistical tests
Evolution of Evaluation
- Evaluation by inspection of examples
- Evaluation by demonstration
- Evaluation by improvised demonstration
- Evaluation on data using a figure of merit
- Evaluation on test data
- Evaluation on common test data
- Evaluation on common, unseen test data
Which is the Best Rank Order?
[Slide figure: several example ranked lists (labeled a-h), with relevant documents marked "R", used to motivate measures for comparing rank orders]
IR Test Collection Design
- Representative document collection
  - Size, sources, genre, topics, ...
- Random sample of representative queries
  - Built somehow from formalized topic statements
- Known binary relevance
  - For each topic-document pair (topic, not query!)
  - Assessed by humans, used only for evaluation
- Measure of effectiveness
  - Used to compare alternate systems
What is relevance?
Relevance is the (measure | degree | dimension | estimate | appraisal | relation) of a (correspondence | utility | connection | satisfaction | fit | bearing | matching) existing between a (document | article | textual form | reference | information provided | fact) and a (query | request | information used | point of view | information need statement) as determined by a (person | judge | user | requester | information specialist).
Does this help?
Tefko Saracevic. (1975) Relevance: A Review of and a Framework for Thinking on the Notion in Information Science. Journal of the American Society for Information Science, 26(6), 321-343.
Defining Relevance
- Relevance relates a topic and a document
  - Duplicates are equally relevant by definition
  - Constant over time and across users
- Pertinence relates a task and a document
  - Accounts for quality, complexity, language, ...
- Utility relates a user and a document
  - Accounts for prior knowledge
Another View
[Slide figure: Venn diagram over the space of all documents, showing the Relevant set, the Retrieved set, their intersection (Relevant + Retrieved), and the region outside both (Not Relevant + Not Retrieved)]
Set-Based Effectiveness Measures
- Precision: How much of what was found is relevant?
  - Often of interest, particularly for interactive searching
- Recall: How much of what is relevant was found?
  - Particularly important for law, patents, and medicine
- Fallout: How much of what was irrelevant was rejected?
  - Useful when different-size collections are compared
Effectiveness Measures

Action          Doc: Relevant        Doc: Not Relevant
Retrieved       Relevant Retrieved   False Alarm
Not Retrieved   Miss                 Irrelevant Rejected

(The slide distinguishes user-oriented and system-oriented measures built from these four cells.)
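Written as formulas over the contingency table above (Rel = the relevant documents, Ret = the retrieved documents):

```latex
\text{Precision} = \frac{|\mathit{Rel} \cap \mathit{Ret}|}{|\mathit{Ret}|},\qquad
\text{Recall} = \frac{|\mathit{Rel} \cap \mathit{Ret}|}{|\mathit{Rel}|},\qquad
\text{Fallout} = \frac{|\overline{\mathit{Rel}} \cap \mathit{Ret}|}{|\overline{\mathit{Rel}}|}
```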
Single-Figure Set-Based Measures
- Balanced F-measure
  - Harmonic mean of recall and precision
  - Weakness: what if no relevant documents exist?
- Cost function
  - Reward relevant retrieved, penalize non-relevant retrieved
  - For example: 3R⁺ − 2N⁺ (R⁺ = relevant retrieved, N⁺ = non-relevant retrieved)
  - Weakness: hard to normalize, so hard to average
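Since the balanced F-measure is defined above as the harmonic mean of recall and precision, it can be written as:

```latex
F = \frac{2 \cdot P \cdot R}{P + R}
```

which is undefined (or conventionally taken as zero) when both P and R are zero, reflecting the weakness noted above.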
Single-Valued Ranked Measures
- Expected search length
  - Average rank of the first relevant document
- Mean precision at a fixed number of documents
  - Precision at 10 docs is often used for Web search
- Mean precision at a fixed recall level
  - Adjusts for the total number of relevant docs
- Mean breakeven point
  - Value at which precision = recall
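As an illustrative sketch (not from the original slides; the function name and input convention are assumptions), precision at a fixed cutoff can be computed from the ranks at which relevant documents appear:

```python
def precision_at_k(relevant_ranks, k=10):
    """Precision after the top k documents, e.g., P(10) for Web search.

    relevant_ranks: the 1-based ranks at which relevant documents
                    were retrieved.
    """
    return sum(1 for r in relevant_ranks if r <= k) / k
```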
Automatic Evaluation Model
[Slide figure: documents and a query enter an IR "black box," which outputs a ranked list; an evaluation module combines the ranked list with relevance judgments to produce a measure of effectiveness]
Ad Hoc Topics
In TREC, a statement of information need is called a topic.

Title: Health and Computer Terminals
Description: Is it hazardous to the health of individuals to work with computer terminals on a daily basis?
Narrative: Relevant documents would contain any information that expands on any physical disorder/problems that may be associated with the daily working with computer terminals. Such things as carpel tunnel, cataracts, and fatigue have been said to be associated, but how widespread are these or other problems and what is being done to alleviate any health problems.
Questions About the Black Box
- Example questions:
  - Does morphological analysis improve retrieval performance?
  - Does expanding the query with synonyms improve retrieval performance?
- Corresponding experiments:
  - Build a stemmed index and compare against an unstemmed baseline
  - Expand queries with synonyms and compare against baseline unexpanded queries
Measuring Precision and Recall
Assume there are a total of 14 relevant documents.

Hits 1-10:
Rank       1     2     3     4     5     6     7     8     9     10
Precision  1/1   1/2   1/3   1/4   2/5   3/6   3/7   4/8   4/9   4/10
Recall     1/14  1/14  1/14  1/14  2/14  3/14  3/14  4/14  4/14  4/14

Hits 11-20:
Rank       11    12    13    14    15    16    17    18    19    20
Precision  5/11  5/12  5/13  5/14  5/15  6/16  6/17  6/18  6/19  6/20
Recall     5/14  5/14  5/14  5/14  5/14  6/14  6/14  6/14  6/14  6/14
Graphing Precision and Recall
- Plot each (recall, precision) point on a graph
- Visually represents the precision/recall tradeoff
Uninterpolated Average Precision
- Average of the precision at each retrieved relevant document
- Relevant documents not retrieved contribute zero to the score
- For the ranked list above, the six retrieved relevant documents sit at ranks 1, 5, 6, 8, 11, and 16, with precisions 1/1, 2/5, 3/6, 4/8, 5/11, and 6/16
- With a total of 14 relevant documents, the 8 relevant documents not retrieved contribute eight zeros
- Average precision = (1/1 + 2/5 + 3/6 + 4/8 + 5/11 + 6/16) / 14 = 0.2307
- Uninterpolated MAP is this average precision, averaged across all topics (see the sketch below)
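A minimal sketch (not from the original slides; the function name and input convention are illustrative assumptions) that reproduces the worked example above:

```python
def average_precision(relevant_ranks, num_relevant):
    """Uninterpolated average precision for one topic.

    relevant_ranks: 1-based ranks at which relevant documents were
                    retrieved; relevant documents never retrieved
                    simply contribute zero to the average.
    num_relevant:   total number of relevant documents for the topic.
    """
    total = 0.0
    for i, rank in enumerate(sorted(relevant_ranks), start=1):
        total += i / rank  # precision at the i-th relevant document
    return total / num_relevant

# Worked example: relevant documents at ranks 1, 5, 6, 8, 11, 16 of 14 total
print(round(average_precision([1, 5, 6, 8, 11, 16], 14), 4))  # 0.2307
```

MAP is then just the mean of this value over the topic set.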
Visualizing Mean Average Precision
[Slide figure: bar chart of average precision (0.0-1.0) per topic]
What MAP Hides
[Slide figure adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999]
Why Mean Average Precision?
- It is easy to trade between recall and precision
  - Adding related query terms improves recall
    - But naive query expansion techniques kill precision
  - Limiting matches by part of speech helps precision
    - But it almost always hurts recall
- Comparisons should give some weight to both
  - Average precision is a principled way to do this
  - More central than other available measures
Systems Ranked by Different Measures
Systems ranked by each measure, averaged over 50 topics. Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999.

P(10)      | P(30)      | R-Prec     | Ave Prec   | Recall at .5 Prec | Recall (1000) | Total Rel  | Rank of 1st Rel
INQ502     | INQ502     | ok7ax      | ok7ax      | att98atdc         | ok7ax         | ok7ax      | tno7tw4
ok7ax      | ok7ax      | INQ502     | att98atdc  | ok7ax             | tno7exp1      | tno7exp1   | bbn1
att98atdc  | INQ501     | ok7am      | att98atde  | mds98td           | att98atdc     | att98atdc  | INQ502
att98atde  | att98atdc  | att98atdc  | ok7am      | ok7am             | att98atde     | bbn1       | nectchall
INQ501     | nectchall  | att98atde  | INQ502     | INQ502            | Cor7A3rrf     | att98atde  | tnocbm25
nectchall  | att98atde  | INQ501     | mds98td    | att98atde         | ok7am         | INQ502     | MerAbtnd
nectchdes  | ok7am      | bbn1       | bbn1       | INQ501            | bbn1          | INQ501     | att98atdc
ok7am      | nectchdes  | mds98td    | tno7exp1   | ok7as             | pirc8Aa2      | ok7am      | acsys7al
mds98td    | INQ503     | nectchdes  | INQ501     | bbn1              | INQ502        | Cor7A3rrf  | mds98td
INQ503     | bbn1       | nectchall  | pirc8Aa2   | nectchall         | pirc8Ad       | pirc8Aa2   | ibms98a
Cor7A3rrf  | tno7exp1   | ok7as      | Cor7A3rrf  | tno7exp1          | INQ501        | nectchdes  | Cor7A3rrf
tno7tw4    | mds98td    | tno7exp1   | acsys7al   | Cor7A3rrf         | nectchdes     | mds98td    | ok7ax
MerAbtnd   | pirc8Aa2   | acsys7al   | ok7as      | acsys7al          | nectchall     | acsys7al   | att98atde
acsys7al   | Cor7A3rrf  | pirc8Aa2   | nectchdes  | Cor7A2rrd         | acsys7al      | nectchall  | Brkly25
iowacuhk1  | ok7as      | Cor7A3rrf  | nectchall  | INQ503            | mds98td       | pirc8Ad    | nectchdes
Correlations Between Rankings
Kendall's τ computed between pairs of rankings. Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999.

              P(30)   R-Prec  Ave Prec  R at .5 P  Recall(1000)  Total Rels  Rank 1st Rel
P(10)         .8851   .8151   .7899     .7855      .7817         .7718       .6378
P(30)                 .8676   .8446     .8238      .7959         .7915       .6213
R-Prec                        .9245     .8654      .8342         .8320       .5896
Ave Prec                                .8840      .8473         .8495       .5612
R at .5 P                                          .7707         .7762       .5349
Recall(1000)                                                     .9212       .5891
Total Rels                                                                   .5880
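As an illustrative sketch (not part of the original slides), Kendall's τ between two orderings of the same systems can be computed with SciPy; the orderings below are hypothetical placeholders:

```python
from scipy.stats import kendalltau

# Two orderings of the same four systems, e.g., by Ave Prec and by P(10)
by_ave_prec = ["ok7ax", "att98atdc", "INQ502", "ok7am"]
by_p10      = ["INQ502", "ok7ax", "att98atdc", "ok7am"]

# Convert each ordering to a rank per system, then correlate the ranks
systems = sorted(by_ave_prec)
tau, p_value = kendalltau([by_ave_prec.index(s) for s in systems],
                          [by_p10.index(s) for s in systems])
print(tau)  # 1.0 means identical rankings; -1.0 means exactly reversed
```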
Relevant Document Density
[Slide figure adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999]
Alternative Ways to Get Judgments
- Exhaustive assessment is usually impractical
  - Topics × documents = a large number!
- Pooled assessment leverages cooperative evaluation
  - Requires a diverse set of IR systems
- Search-guided assessment is sometimes viable
  - Iterate between topic research/search/assessment
  - Augment with review, adjudication, reassessment
- Known-item judgments have the lowest cost
  - Tailor queries to retrieve a single known document
  - Useful as a first cut to see if a new technique is viable
Obtaining Relevance Judgments
- Exhaustive assessment can be too expensive
  - TREC has 50 queries for >1 million docs each year
- Random sampling won't work
  - If relevant docs are rare, none may be found!
- IR systems can help focus the sample
  - Each system finds some relevant documents
  - Different systems find different relevant documents
  - Together, enough systems will find most of them
Pooled Assessment Methodology
- Systems submit top 1000 documents per topic
- Top 100 documents for each are judged (see the sketch below)
  - Single pool, without duplicates, arbitrary order
  - Judged by the person who wrote the query
- Treat unevaluated documents as not relevant
- Compute MAP down to 1000 documents
  - Treat precision for complete misses as 0.0
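A minimal sketch of the pooling step above, assuming each run is simply a ranked list of document ids (function and variable names are illustrative):

```python
def build_pool(runs, depth=100):
    """Form a judgment pool: the union of the top `depth` documents
    from every submitted run, without duplicates.

    runs: dict mapping a run id to its ranked list of document ids.
    Returns a set, so the pool carries no ordering information and
    assessors cannot tell which system retrieved a document, or where.
    """
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool
```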
Does pooling work?
- Judgments can't possibly be exhaustive!
  - It doesn't matter: relative rankings remain the same!
- This is only one person's opinion about relevance
  - It doesn't matter: relative rankings remain the same!
- What about hits 101 to 1000?
  - It doesn't matter: relative rankings remain the same!
- We can't possibly use judgments to evaluate a system that didn't participate in the evaluation!
  - Actually, we can!

Chris Buckley and Ellen M. Voorhees. (2004) Retrieval Evaluation with Incomplete Information. SIGIR 2004.
Ellen Voorhees. (1998) Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. SIGIR 1998.
Justin Zobel. (1998) How Reliable Are the Results of Large-Scale Information Retrieval Experiments? SIGIR 1998.
Lessons From TREC
- Incomplete judgments are useful
  - If the sample is unbiased with respect to the systems tested
- Different relevance judgments change absolute scores
  - But rarely change comparative advantages when averaged
- Evaluation technology is predictive
  - Results transfer to operational settings
Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999.
Effects of Incomplete Judgments
- Additional relevant documents are:
  - roughly uniform across systems
  - highly skewed across topics
- Systems that don't contribute to the pool get comparable results
Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999.
Inter-Judge Agreement
Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999.

Assessor Group   Overlap
Primary & A      .421
Primary & B      .494
A & B            .426
All 3            .301
Effect of Different Judgments
[Slide figure adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999]
Net Effect of Different Judges
- Mean Kendall's τ between system rankings produced from different qrel sets: .938
- Similar results held for:
  - Different query sets
  - Different evaluation measures
  - Different assessor types
  - Single-opinion vs. group-opinion judgments
Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999.
Statistical Significance Tests
How sure can you be that an observed difference doesn't simply result from the particular queries you chose?

Experiment 1:
Query     1     2     3     4     5     6     7     Average
System A  0.20  0.21  0.22  0.19  0.17  0.20  0.21  0.20
System B  0.40  0.41  0.42  0.39  0.37  0.40  0.41  0.40

Experiment 2:
Query     1     2     3     4     5     6     7     Average
System A  0.02  0.39  0.16  0.58  0.04  0.09  0.12  0.20
System B  0.76  0.07  0.37  0.21  0.02  0.91  0.46  0.40
Statistical Significance Testing
Try some out at: http://www.fon.hum.uva.nl/Service/Statistics.html
How Much is Enough?
- Measuring improvement
  - Achieve a meaningful improvement
    - Guideline: 0.05 is noticeable, 0.1 makes a difference
  - Achieve reliable improvement on typical queries
    - Wilcoxon signed rank test for paired samples (see the sketch below)
- Know when to stop!
  - Inter-assessor agreement limits the maximum achievable precision
  - Using one judge to assess the other yields about 0.8
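As a sketch, the Wilcoxon signed rank test can be applied to the per-query scores from the two experiments shown earlier, using SciPy:

```python
from scipy.stats import wilcoxon

# Per-query scores for Systems A and B (from the significance-test slide)
exp1_a = [0.20, 0.21, 0.22, 0.19, 0.17, 0.20, 0.21]
exp1_b = [0.40, 0.41, 0.42, 0.39, 0.37, 0.40, 0.41]
exp2_a = [0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12]
exp2_b = [0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46]

# Experiment 1: B beats A on every query, so the p-value is small
print(wilcoxon(exp1_a, exp1_b))
# Experiment 2: the averages are identical to Experiment 1, but the
# per-query differences are mixed, so the difference is not reliable
print(wilcoxon(exp2_a, exp2_b))
```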
Recap: Automatic Evaluation
- Evaluation measures focus on relevance
  - Users also want utility and understandability
- Goal is to compare systems
  - Values may vary, but relative differences are stable
- Mean values obscure important phenomena
  - Augment with failure analysis/significance tests
Automatic Evaluation
[Slide figure: the automatic evaluation model from earlier, with the ranked list highlighted]
User Studies
[Slide figure: the evaluation model extended to include the user's query formulation and document examination]
User Studies
- Goal is to account for interface issues
  - By studying the interface component
  - By studying the complete system
- Formative evaluation
  - Provides a basis for system development
- Summative evaluation
  - Designed to assess performance
Questions That Involve Users
- Example questions:
  - Does keyword highlighting help users evaluate document relevance?
  - Is letting users weight search terms a good idea?
- Corresponding experiments:
  - Build two different interfaces, one with keyword highlighting and one without; run a user study
  - Build two different interfaces, one with term weighting functionality and one without; run a user study
Blair and Maron (1985)
- A classic study of retrieval effectiveness
  - Earlier studies used unrealistically small collections
- Studied an archive of documents for a lawsuit
  - 40,000 documents, ~350,000 pages of text
  - 40 different queries
  - Used IBM's STAIRS full-text system
- Approach:
  - Lawyers wanted at least 75% of all relevant documents
  - Precision and recall evaluated only after the lawyers were satisfied with the results

David C. Blair and M. E. Maron. (1985) An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System. Communications of the ACM, 28(3), 289-299.
Blair and Maron's Results
- Mean precision: 79%
- Mean recall: 20% (!!)
- Why was recall so low?
  - Users can't anticipate the terms used in relevant documents
    - For example, an accident might be referred to as an event, incident, situation, problem, ...
  - Differing technical terminology
  - Slang, misspellings
- Other findings:
  - Searches by both lawyers had similar performance
  - The lawyers' recall was not much different from the paralegals'
Variation in Automatic Measures
- System
  - What we seek to measure
- Topic
  - Sample the topic space, compute an expected value
- Topic + System
  - Pair by topic and compute statistical significance
- Collection
  - Repeat the experiment using several collections
Additional Effects in User Studies
- Learning
  - Vary topic presentation order
- Fatigue
  - Vary system presentation order
- Topic + User (expertise)
  - Ask about prior knowledge of each topic

Presentation Order
[Slide figure: an example presentation-order design]
Batch vs. User Evaluations
- Do batch (black box) and user evaluations give the same results? If not, why?
- Two different tasks:
  - Instance recall (6 topics)
    - e.g., What countries import Cuban sugar? What tropical storms, hurricanes, and typhoons have caused property damage or loss of life?
  - Question answering (8 topics)
    - e.g., Which painting did Edvard Munch complete first, Vampire or Puberty? Is Denmark larger or smaller in population than Norway?

Andrew Turpin and William Hersh. (2001) Why Batch and User Evaluations Do Not Give the Same Results. Proceedings of SIGIR 2001.
Results
- Compared two systems:
  - a baseline system
  - an improved system that was provably better in batch evaluations
- [The results charts from the original slide are not preserved]
Example User Study Results
- English queries, German documents
- 4 searchers, 20 minutes per topic
- F-measure (α = 0.8)

F-measure by topic, large collection (strict relevance judgments):

Topic    Automatic  User-Assisted
1        0.3366     0.5890
2        0.1655     0.0909
3        0.4052     0.5301
4        0.4411     0.7878
Average  0.3371     0.4995

[The embedded spreadsheet behind this chart also contained the per-searcher runs and the corresponding breakdowns under loose relevance judgments and for a smaller collection.]
Qualitative User Studies
- Observe user behavior
  - Instrumented software, eye trackers, etc.
  - Face and keyboard cameras
  - Think-aloud protocols
- Interviews and focus groups
- Organize the data
  - For example, group it into overlapping categories
- Look for patterns and themes
- Develop a grounded theory

Example: Mentions of relevance criteria by searchers
(Thesaurus-based search, recorded interviews)
Relevance Criteria         All (N=703)   Relevance Judgment (N=300)   Query Form. (N=248)
Topicality                 535 (76%)     219                          234
Richness                   39 (5.5%)     14                           0
Emotion                    24 (3.4%)     7                            0
Audio/Visual Expression    16 (2.3%)     5                            0
Comprehensibility          14 (2%)       1                            10
Duration                   11 (1.6%)     9                            0
Novelty                    10 (1.4%)     4                            2

(Cell values are numbers of mentions; the two right columns are think-aloud mentions during relevance judgment and query formulation.)
Topicality
Total mentions by 8 searchers (thesaurus-based search, recorded interviews):

Facet                 Mentions
Person                124
Place                 120
Event/Experience      116
Subject               73
Organization/Group    44
Time Frame            32
Object                26
Questionnaires
- Demographic data
  - For example, computer experience
  - Basis for interpreting results
- Subjective self-assessment
  - Which did they think was more effective?
  - Often at variance with objective results!
- Preference
  - Which interface did they prefer? Why?
Summary
- Qualitative user studies suggest what to build
- Design decomposes the task into components
- Automated evaluation helps to refine components
- Quantitative user studies show how well it works
One Minute Paper
If I demonstrated a new retrieval technique that achieved a statistically significant improvement in average precision on the TREC collection, what would be the most serious limitation to consider when interpreting that result?
IR isn't very different from other fields we think of as experimental, e.g., a chemist in the lab mixing bubbling test tubes.
Relevance judgments did differ for some topics. Quantify the differences using overlap: the overlap scores were generally higher than in most previous studies.