Evaluation


  • Evaluation
    LBSC 796/INFM 718R, Session 5, October 7, 2007
    Douglas W. Oard

  • Agenda
    Questions

    Evaluation fundamentals

    System-centered strategies

    User-centered strategies

  • IR as an Empirical Discipline
    Formulate a research question: the hypothesis
    Design an experiment to answer the question
    Perform the experiment
    Compare with a baseline control
    Does the experiment answer the question?
    Are the results significant? Or is it just luck?
    Report the results!

  • Evaluation Criteria
    Effectiveness: system-only, human+system
    Efficiency: retrieval time, indexing time, index size
    Usability: learnability, novice use, expert use

  • IR Effectiveness Evaluation
    User-centered strategy: given several users and at least two retrieval systems, have each user try the same task on both systems and measure which system works best
    System-centered strategy: given documents, queries, and relevance judgments, try several variations on the retrieval system and measure which ranks more good documents near the top

  • Good Measures of Effectiveness
    Capture some aspect of what the user wants
    Have predictive value for other situations (different queries, different document collections)
    Easily replicated by other researchers
    Easily compared (optimally, expressed as a single number)

  • Comparing Alternative Approaches
    Achieve a meaningful improvement (an application-specific judgment call)
    Achieve reliable improvement in unseen cases (can be verified using statistical tests)

  • Evolution of Evaluation
    Evaluation by inspection of examples
    Evaluation by demonstration
    Evaluation by improvised demonstration
    Evaluation on data using a figure of merit
    Evaluation on test data
    Evaluation on common test data
    Evaluation on common, unseen test data

  • Which is the Best Rank Order?
    (Figure: six candidate ranked lists, A-F, with relevant documents at different positions.)

  • Which is the Best Rank Order?
    (Figure: eight ranked lists, a-h, showing where the relevant documents, marked R, fall in each ranking.)

  • IR Test Collection Design
    Representative document collection: size, sources, genre, topics, ...
    Random sample of representative queries: built somehow from formalized topic statements
    Known binary relevance: for each topic-document pair (topic, not query!); assessed by humans, used only for evaluation
    Measure of effectiveness: used to compare alternate systems

  • What is relevance?
    "Relevance is the (measure | degree | dimension | estimate | appraisal | relation) of a (correspondence | utility | connection | satisfaction | fit | bearing | matching) existing between a (document | article | textual form | reference | information provided | fact) and a (query | request | information used | point of view | information need statement) as determined by a (person | judge | user | requester | information specialist)."
    Tefko Saracevic. (1975) Relevance: A Review of and a Framework for Thinking on the Notion in Information Science. Journal of the American Society for Information Science, 26(6), 321-343.
    Does this help?

  • Defining Relevance
    Relevance relates a topic and a document: duplicates are equally relevant by definition; constant over time and across users
    Pertinence relates a task and a document: accounts for quality, complexity, language, ...
    Utility relates a user and a document: accounts for prior knowledge

  • Another View
    (Venn diagram over the space of all documents: the Relevant and Retrieved sets overlap in the Relevant + Retrieved region; everything outside both is Not Relevant + Not Retrieved.)

  • Set-Based Effectiveness Measures
    Precision: how much of what was found is relevant? Often of interest, particularly for interactive searching
    Recall: how much of what is relevant was found? Particularly important for law, patents, and medicine
    Fallout: how much of what was irrelevant was rejected? Useful when different-size collections are compared

  • Effectiveness Measures
    Contingency table of documents and actions (rows are user-oriented, columns are system-oriented):

                      Retrieved             Not Retrieved
    Relevant          Relevant Retrieved    Miss
    Not relevant      False Alarm           Irrelevant Rejected

  • Single-Figure Set-Based Measures
    Balanced F-measure: harmonic mean of recall and precision
      Weakness: what if no relevant documents exist?
    Cost function: reward relevant retrieved, penalize non-relevant retrieved (for example, 3R+ - 2N+)
      Weakness: hard to normalize, so hard to average
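A minimal sketch of these set-based measures in Python (function and variable names are illustrative, not from the slides): it computes precision, recall, fallout, the balanced F-measure as the harmonic mean of precision and recall, and the example cost function 3R+ - 2N+.

```python
def set_measures(retrieved, relevant, collection_size):
    """Set-based effectiveness measures for a single query (illustrative sketch)."""
    retrieved, relevant = set(retrieved), set(relevant)
    rel_ret = len(retrieved & relevant)               # relevant and retrieved
    nonrel_ret = len(retrieved) - rel_ret             # false alarms
    nonrel_total = collection_size - len(relevant)    # all irrelevant documents

    precision = rel_ret / len(retrieved) if retrieved else 0.0
    recall = rel_ret / len(relevant) if relevant else 0.0   # undefined if no relevant docs exist
    fallout = nonrel_ret / nonrel_total if nonrel_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)         # balanced F: harmonic mean of P and R
    cost = 3 * rel_ret - 2 * nonrel_ret               # example cost function: 3R+ - 2N+
    return precision, recall, fallout, f1, cost
```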

  • Single-Valued Ranked Measures
    Expected search length: average rank of the first relevant document
    Mean precision at a fixed number of documents: precision at 10 docs is often used for Web search
    Mean precision at a fixed recall level: adjusts for the total number of relevant docs
    Mean breakeven point: value at which precision = recall
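A hedged sketch of two of these ranked measures: precision at a fixed cutoff, and the rank of the first relevant document (the quantity averaged in expected search length). Names are illustrative.

```python
def precision_at_k(ranking, relevant, k=10):
    """Precision over the top k documents of a ranked list."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def rank_of_first_relevant(ranking, relevant):
    """1-based rank of the first relevant document; None if no relevant document is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return rank
    return None
```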

  • Automatic Evaluation Model
    (Diagram: documents and a query go into the IR "black box", which produces a ranked list; an evaluation module combines the ranked list with relevance judgments to produce a measure of effectiveness.)
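As a rough sketch of the inputs to such an evaluation module, the following assumes standard TREC-style file layouts (qrels lines of the form "topic iteration docno relevance" and run lines of the form "topic Q0 docno rank score tag"); the helper names and paths are hypothetical.

```python
from collections import defaultdict

def read_qrels(path):
    """Read TREC-style relevance judgments: 'topic iteration docno relevance' per line."""
    qrels = defaultdict(set)
    with open(path) as f:
        for line in f:
            topic, _iteration, docno, rel = line.split()
            if int(rel) > 0:
                qrels[topic].add(docno)
    return qrels

def read_run(path):
    """Read a TREC-style run: 'topic Q0 docno rank score tag' per line, ranked by score."""
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            topic, _q0, docno, _rank, score, _tag = line.split()
            run[topic].append((float(score), docno))
    # Highest-scoring documents first for each topic
    return {t: [d for _, d in sorted(docs, reverse=True)] for t, docs in run.items()}
```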

  • Ad Hoc Topics
    In TREC, a statement of information need is called a topic.
    Title: Health and Computer Terminals
    Description: Is it hazardous to the health of individuals to work with computer terminals on a daily basis?
    Narrative: Relevant documents would contain any information that expands on any physical disorder/problems that may be associated with the daily working with computer terminals. Such things as carpal tunnel, cataracts, and fatigue have been said to be associated, but how widespread are these or other problems and what is being done to alleviate any health problems.

  • Questions About the Black Box
    Example questions:
      Does morphological analysis improve retrieval performance?
      Does expanding the query with synonyms improve retrieval performance?
    Corresponding experiments:
      Build a stemmed index and compare against an unstemmed baseline
      Expand queries with synonyms and compare against baseline unexpanded queries

  • Measuring Precision and Recall
    Assume there are a total of 14 relevant documents; relevant documents are retrieved at ranks 1, 5, 6, 8, 11, and 16.

    Rank:       1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17   18   19   20
    Precision:  1/1  1/2  1/3  1/4  2/5  3/6  3/7  4/8  4/9  4/10 5/11 5/12 5/13 5/14 5/15 6/16 6/17 6/18 6/19 6/20
    Recall:     1/14 1/14 1/14 1/14 2/14 3/14 3/14 4/14 4/14 4/14 5/14 5/14 5/14 5/14 5/14 6/14 6/14 6/14 6/14 6/14

  • Graphing Precision and Recall
    Plot each (recall, precision) point on a graph
    Visually represent the precision/recall tradeoff
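A small illustrative sketch (with hypothetical document ids) that computes the (recall, precision) point after each rank and plots the curve, using the example from the previous slide: 14 relevant documents in total, with the retrieved relevant ones at ranks 1, 5, 6, 8, 11, and 16.

```python
import matplotlib.pyplot as plt

def pr_points(ranking, relevant, total_relevant):
    """(recall, precision) after each rank, given the total number of relevant documents."""
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        hits += doc in relevant
        points.append((hits / total_relevant, hits / rank))
    return points

# Hypothetical ids: relevant documents appear at ranks 1, 5, 6, 8, 11, 16 of a 20-document ranking
ranking = [f"d{i}" for i in range(1, 21)]
relevant = {"d1", "d5", "d6", "d8", "d11", "d16"}
recall, precision = zip(*pr_points(ranking, relevant, total_relevant=14))

plt.plot(recall, precision, marker="o")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
```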

  • Uninterpolated Average Precision
    Average of precision at each retrieved relevant document; relevant documents not retrieved contribute zero to the score.
    For the example above (relevant documents at ranks 1, 5, 6, 8, 11, and 16, with 14 relevant in total), the 8 relevant documents not retrieved contribute eight zeros:
    MAP = (1/1 + 2/5 + 3/6 + 4/8 + 5/11 + 6/16 + eight zeros) / 14 = .2307

  • Uninterpolated MAP
    Mean of uninterpolated average precision over all topics
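A minimal sketch of uninterpolated average precision and its mean over topics; it reproduces the 0.2307 figure from the worked example above. Names are illustrative.

```python
def average_precision(ranking, relevant, total_relevant):
    """Uninterpolated AP: average of precision at each retrieved relevant document;
    relevant documents that are never retrieved contribute zero."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_relevant

def mean_average_precision(run, qrels, totals):
    """MAP: AP averaged over all topics in the run."""
    return sum(average_precision(run[t], qrels[t], totals[t]) for t in run) / len(run)

# Worked example: relevant documents at ranks 1, 5, 6, 8, 11, 16; 14 relevant in total
print(average_precision(list(range(1, 21)), {1, 5, 6, 8, 11, 16}, 14))  # ~0.2307
```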

  • Visualizing Mean Average Precision
    (Bar chart: average precision, on a 0.0 to 1.0 scale, for each topic.)

  • What MAP Hides
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

  • Why Mean Average Precision?
    It is easy to trade between recall and precision:
      Adding related query terms improves recall, but naive query expansion techniques kill precision
      Limiting matches by part-of-speech helps precision, but it almost always hurts recall
    Comparisons should give some weight to both; average precision is a principled way to do this, and is more central than other available measures

  • Systems Ranked by Different Measures
    Ranked by measure averaged over 50 topics
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

    Rank  P(10)      P(30)      R-Prec     Ave Prec   R at .5 Prec  Recall(1000)  Total Rel  Rank 1st Rel
     1    INQ502     INQ502     ok7ax      ok7ax      att98atdc     ok7ax         ok7ax      tno7tw4
     2    ok7ax      ok7ax      INQ502     att98atdc  ok7ax         tno7exp1      tno7exp1   bbn1
     3    att98atdc  INQ501     ok7am      att98atde  mds98td       att98atdc     att98atdc  INQ502
     4    att98atde  att98atdc  att98atdc  ok7am      ok7am         att98atde     bbn1       nectchall
     5    INQ501     nectchall  att98atde  INQ502     INQ502        Cor7A3rrf     att98atde  tnocbm25
     6    nectchall  att98atde  INQ501     mds98td    att98atde     ok7am         INQ502     MerAbtnd
     7    nectchdes  ok7am      bbn1       bbn1       INQ501        bbn1          INQ501     att98atdc
     8    ok7am      nectchdes  mds98td    tno7exp1   ok7as         pirc8Aa2      ok7am      acsys7al
     9    mds98td    INQ503     nectchdes  INQ501     bbn1          INQ502        Cor7A3rrf  mds98td
    10    INQ503     bbn1       nectchall  pirc8Aa2   nectchall     pirc8Ad       pirc8Aa2   ibms98a
    11    Cor7A3rrf  tno7exp1   ok7as      Cor7A3rrf  tno7exp1      INQ501        nectchdes  Cor7A3rrf
    12    tno7tw4    mds98td    tno7exp1   acsys7al   Cor7A3rrf     nectchdes     mds98td    ok7ax
    13    MerAbtnd   pirc8Aa2   acsys7al   ok7as      acsys7al      nectchall     acsys7al   att98atde
    14    acsys7al   Cor7A3rrf  pirc8Aa2   nectchdes  Cor7A2rrd     acsys7al      nectchall  Brkly25
    15    iowacuhk1  ok7as      Cor7A3rrf  nectchall  INQ503        mds98td       pirc8Ad    nectchdes

  • Correlations Between Rankings
    Kendall's tau computed between pairs of rankings
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

                   P(30)   R Prec  Ave Prec  R at .5 P  Recall(1000)  Total Rels  Rank 1st Rel
    P(10)          .8851   .8151   .7899     .7855      .7817         .7718       .6378
    P(30)                  .8676   .8446     .8238      .7959         .7915       .6213
    R Prec                         .9245     .8654      .8342         .8320       .5896
    Ave Prec                                 .8840      .8473         .8495       .5612
    R at .5 P                                           .7707         .7762       .5349
    Recall(1000)                                                      .9212       .5891
    Total Rels                                                                    .5880
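A small sketch of computing Kendall's tau between two system rankings with SciPy; the per-system scores below are made up for illustration (kendalltau works directly on the scores, since it is rank-based).

```python
from scipy.stats import kendalltau

# Hypothetical scores for the same systems under two different measures
map_scores = {"sysA": 0.31, "sysB": 0.28, "sysC": 0.22, "sysD": 0.19, "sysE": 0.15}
p10_scores = {"sysA": 0.52, "sysB": 0.55, "sysC": 0.41, "sysD": 0.36, "sysE": 0.40}

systems = sorted(map_scores)
tau, p_value = kendalltau([map_scores[s] for s in systems],
                          [p10_scores[s] for s in systems])
print(f"Kendall's tau between the two rankings: {tau:.3f}")
```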

  • Relevant Document Density
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

  • Alternative Ways to Get Judgments
    Exhaustive assessment is usually impractical: topics x documents = a large number!
    Pooled assessment leverages cooperative evaluation: requires a diverse set of IR systems
    Search-guided assessment is sometimes viable: iterate between topic research, search, and assessment; augment with review, adjudication, reassessment
    Known-item judgments have the lowest cost: tailor queries to retrieve a single known document; useful as a first cut to see if a new technique is viable

  • Obtaining Relevance Judgments
    Exhaustive assessment can be too expensive: TREC has 50 queries for >1 million docs each year
    Random sampling won't work: if relevant docs are rare, none may be found!
    IR systems can help focus the sample: each system finds some relevant documents; different systems find different relevant documents; together, enough systems will find most of them

  • Pooled Assessment Methodology
    Systems submit top 1000 documents per topic
    Top 100 documents for each are judged: single pool, without duplicates, arbitrary order; judged by the person that wrote the query
    Treat unevaluated documents as not relevant
    Compute MAP down to 1000 documents: treat precision for complete misses as 0.0
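A rough sketch of the pooling and scoring steps described above, under the assumption that each run is simply a ranked list of document ids for one topic; function names are illustrative.

```python
def build_pool(runs, depth=100):
    """Merge the top `depth` documents from each submitted run into a single,
    duplicate-free pool (order is irrelevant; the pool is what the topic author judges)."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:depth])
    return pool

def relevant_retrieved(ranking, judgments, cutoff=1000):
    """Relevant documents found in the top `cutoff`, treating unjudged documents
    (absent from `judgments`) as not relevant."""
    return [doc for doc in ranking[:cutoff] if judgments.get(doc, 0) > 0]
```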

  • Does pooling work?
    Judgments can't possibly be exhaustive!  (It doesn't matter: relative rankings remain the same.)
    This is only one person's opinion about relevance.  (It doesn't matter: relative rankings remain the same.)
    What about hits 101 to 1000?  (It doesn't matter: relative rankings remain the same.)
    We can't possibly use judgments to evaluate a system that didn't participate in the evaluation!  (Actually, we can!)
    Chris Buckley and Ellen M. Voorhees. (2004) Retrieval Evaluation with Incomplete Information. SIGIR 2004.
    Ellen Voorhees. (1998) Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. SIGIR 1998.
    Justin Zobel. (1998) How Reliable Are the Results of Large-Scale Information Retrieval Experiments? SIGIR 1998.

  • Lessons From TREC
    Incomplete judgments are useful, if the sample is unbiased with respect to the systems tested
    Different relevance judgments change absolute scores, but rarely change comparative advantages when averaged
    Evaluation technology is predictive: results transfer to operational settings
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

  • Effects of Incomplete Judgments
    Additional relevant documents are roughly uniform across systems and highly skewed across topics
    Systems that don't contribute to the pool get comparable results
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

  • Inter-Judge Agreement
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

    Assessor Group   Overlap
    Primary & A      .421
    Primary & B      .494
    A & B            .426
    All 3            .301

  • Effect of Different Judgments
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

  • Net Effect of Different Judges
    Mean Kendall tau between system rankings produced from different qrels sets: .938
    Similar results held for different query sets, different evaluation measures, different assessor types, and single-opinion vs. group-opinion judgments
    Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

  • Statistical Significance Tests
    How sure can you be that an observed difference doesn't simply result from the particular queries you chose?

    Experiment 1   Query:    1     2     3     4     5     6     7     Average
      System A               0.20  0.21  0.22  0.19  0.17  0.20  0.21  0.20
      System B               0.40  0.41  0.42  0.39  0.37  0.40  0.41  0.40

    Experiment 2   Query:    1     2     3     4     5     6     7     Average
      System A               0.02  0.39  0.16  0.58  0.04  0.09  0.12  0.20
      System B               0.76  0.07  0.37  0.21  0.02  0.91  0.46  0.40

  • Statistical Significance Testing
    Try some out at: http://www.fon.hum.uva.nl/Service/Statistics.html

  • How Much is Enough?
    Measuring improvement:
      Achieve a meaningful improvement (guideline: 0.05 is noticeable, 0.1 makes a difference)
      Achieve reliable improvement on typical queries (Wilcoxon signed rank test for paired samples)
    Know when to stop!
      Inter-assessor agreement limits maximum precision
      Using one judge to assess the other yields about 0.8
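The Wilcoxon signed rank test mentioned above is available in SciPy; here is a hedged sketch applying it to the per-query scores from the two hypothetical experiments on the "Statistical Significance Tests" slide.

```python
from scipy.stats import wilcoxon

# Per-query scores for Systems A and B in the two hypothetical experiments
exp1_a = [0.20, 0.21, 0.22, 0.19, 0.17, 0.20, 0.21]
exp1_b = [0.40, 0.41, 0.42, 0.39, 0.37, 0.40, 0.41]
exp2_a = [0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12]
exp2_b = [0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46]

for name, a, b in [("Experiment 1", exp1_a, exp1_b), ("Experiment 2", exp2_a, exp2_b)]:
    statistic, p = wilcoxon(a, b)  # paired samples, two-sided by default
    print(f"{name}: W = {statistic:.1f}, p = {p:.4f}")
```

Both experiments show the same 0.20 difference in mean score, but only Experiment 1, where System B wins on every query, yields a small p-value.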

  • Recap: Automatic Evaluation
    Evaluation measures focus on relevance; users also want utility and understandability
    Goal is to compare systems: values may vary, but relative differences are stable
    Mean values obscure important phenomena: augment with failure analysis and significance tests

  • Automatic Evaluation
    (Diagram: the automatic evaluation model, with the ranked list as the output being evaluated.)

  • User Studies
    (Diagram: the evaluation model extended with the user's query formulation and document examination.)

  • User Studies
    Goal is to account for interface issues, by studying the interface component or by studying the complete system
    Formative evaluation: provides a basis for system development
    Summative evaluation: designed to assess performance

  • Questions That Involve Users
    Example questions:
      Does keyword highlighting help users evaluate document relevance?
      Is letting users weight search terms a good idea?
    Corresponding experiments:
      Build two different interfaces, one with keyword highlighting and one without; run a user study
      Build two different interfaces, one with term weighting functionality and one without; run a user study

  • Blair and Maron (1985)
    A classic study of retrieval effectiveness; earlier studies used unrealistically small collections
    Studied an archive of documents for a lawsuit: 40,000 documents, ~350,000 pages of text; 40 different queries; used IBM's STAIRS full-text system
    Approach: lawyers wanted at least 75% of all relevant documents; precision and recall were evaluated only after the lawyers were satisfied with the results
    David C. Blair and M. E. Maron. (1985) An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System. Communications of the ACM, 28(3), 289-299.

  • Blair and Maron's Results
    Mean precision: 79%; mean recall: 20% (!!)
    Why was recall so low? Users can't anticipate the terms used in relevant documents: differing technical terminology, slang, misspellings ("accident" might be referred to as "event", "incident", "situation", "problem", ...)
    Other findings: searches by both lawyers had similar performance; the lawyers' recall was not much different from the paralegals'

  • Quantitative User Studies
    Select independent variable(s), e.g., what info to display in the selection interface
    Select dependent variable(s), e.g., time to find a known relevant document
    Run subjects in different orders to average out learning and fatigue effects
    Compute statistical significance: the null hypothesis is that the independent variable has no effect; rejected if p is below a chosen threshold (typically 0.05)
  • Variation in Automatic Measures
    System: what we seek to measure
    Topic: sample the topic space, compute the expected value
    Topic + System: pair by topic and compute statistical significance
    Collection: repeat the experiment using several collections

  • Additional Effects in User Studies
    Learning: vary topic presentation order
    Fatigue: vary system presentation order
    Topic + User (expertise): ask about prior knowledge of each topic

  • Presentation Order
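The "Presentation Order" slide presumably showed a counterbalanced ordering design. As a minimal sketch of the idea (names illustrative), the following builds a cyclic Latin square so that each system or topic appears once in every presentation position across participants, spreading learning and fatigue effects.

```python
def presentation_orders(conditions):
    """Cyclic Latin square: row i is the presentation order for participant i,
    and each condition occupies each position exactly once across rows."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]

for order in presentation_orders(["System A", "System B", "System C", "System D"]):
    print(order)
```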

  • Batch vs. User Evaluations
    Do batch (black box) and user evaluations give the same results? If not, why?
    Two different tasks:
      Instance recall (6 topics), e.g.: What countries import Cuban sugar? What tropical storms, hurricanes, and typhoons have caused property damage or loss of life?
      Question answering (8 topics), e.g.: Which painting did Edvard Munch complete first, Vampire or Puberty? Is Denmark larger or smaller in population than Norway?
    Andrew Turpin and William Hersh. (2001) Why Batch and User Evaluations Do Not Give the Same Results. Proceedings of SIGIR 2001.

  • Results
    Compared two systems: a baseline system, and an improved system that was provably better in batch evaluations
    (Results shown as a chart.)

  • Example User Study Results
    F = 0.8; English queries, German documents; 4 searchers, 20 minutes per topic

    (Embedded spreadsheet and chart residue, summarized. The charts plotted F-measure by topic, comparing the AUTO and MANU, i.e. user-assisted, conditions under strict and loose relevance judgments, for a small collection, a large collection, and the official results. Per-topic figures for the large collection, Automatic vs. User-Assisted:

    Topic     Automatic   User-Assisted
    1         0.3366      0.5890
    2         0.1655      0.0909
    3         0.4052      0.5301
    4         0.4411      0.7878
    Average   0.3371      0.4995)

  • Qualitative User Studies
    Observe user behavior: instrumented software, eye trackers, etc.; face and keyboard cameras; think-aloud protocols; interviews and focus groups
    Organize the data: for example, group it into overlapping categories
    Look for patterns and themes
    Develop a grounded theory

  • Example: Mentions of Relevance Criteria by Searchers
    Thesaurus-based search, recorded interviews

    Relevance Criteria        All (N=703)   Think-Aloud: Rel. Judgment (N=300)   Think-Aloud: Query Form. (N=248)
    Topicality                535 (76%)     219                                  234
    Richness                   39 (5.5%)     14                                    0
    Emotion                    24 (3.4%)      7                                    0
    Audio/Visual Expression    16 (2.3%)      5                                    0
    Comprehensibility          14 (2%)        1                                   10
    Duration                   11 (1.6%)      9                                    0
    Novelty                    10 (1.4%)      4                                    2

  • Topicality
    Total mentions by 8 searchers (thesaurus-based search, recorded interviews):

    Object                 26
    Time Frame             32
    Organization/Group     44
    Subject                73
    Event/Experience      116
    Place                 120
    Person                124

  • Questionnaires
    Demographic data (for example, computer experience): basis for interpreting results
    Subjective self-assessment: which did they think was more effective? Often at variance with objective results!
    Preference: which interface did they prefer? Why?

  • Summary
    Qualitative user studies suggest what to build
    Design decomposes the task into components
    Automated evaluation helps to refine components
    Quantitative user studies show how well it works

  • One Minute Paper
    If I demonstrated a new retrieval technique that achieved a statistically significant improvement in average precision on the TREC collection, what would be the most serious limitation to consider when interpreting that result?

    IR isn't very different from other fields we think of as experimental, e.g., a chemist in the lab mixing bubbling test tubes.
    Relevance judgments did differ for some topics.
    Quantify differences using overlap: overlap scores were generally higher than in most previous studies.