Evaluation of Multimedia Retrieval Systems


    Multimedia Indexing & Retrieval

    Evaluation of Multimedia Retrieval Systems


    Outline

• Introduction
• Test Collections and Evaluation Workshops
• Evaluation Measures
• Significance Tests
• A Case Study: TRECVID


Introduction

• We are going to discuss the tools and methodology for comparing the effectiveness of two or more multimedia retrieval systems in a meaningful way
• Multimedia retrieval systems can be evaluated using two strategies:
  1. Without consulting the potential users of the system, e.g. query processing time, query throughput, etc.
  2. With consulting the potential users of the system, e.g. the effectiveness of the retrieved results


    Test Collections

• Doing an evaluation involving real people is not only a costly job, it is also difficult to control and hard to replicate
• Methods have been developed to design so-called test collections
• Often, these test collections are created by consulting potential users
• A test collection can be used to evaluate multimedia retrieval systems without the need to consult the users during further evaluations
• A new retrieval method or system can be evaluated by comparing it to some well-established methods in a controlled experiment


Evaluation Methods in a Controlled Experiment (Hull, 1993)

• A test collection consisting of:
  1. Multimedia data
  2. A task and the data needed for that task, for instance image search using example images as queries
  3. Ground truth data (correct answers)
• One or more suitable evaluation measures that assign values to the effectiveness of the search
• A statistical methodology that determines whether the observed differences in performance between the methods investigated are statistically significant


Glass Box vs Black Box

• Two approaches to test the effectiveness of a multimedia search system:
  1. Glass box evaluation, i.e., the systematic assessment of every component of a system; for example, the word error rate of the speech recognition or the shot detector subsystem of a video retrieval system
  2. Black box evaluation, i.e., testing the system as a whole


Test Collections and Evaluation Workshops

• The scientific evaluation of multimedia search systems is a costly and laborious job
• Big companies, for instance Web search companies such as Yahoo and Google, might perform such evaluations themselves
• For smaller companies and research institutions, a good option is often to take a test collection that is prepared by others
• Test collections are often created by the collaborative effort of many research groups in so-called evaluation workshops of conferences, e.g. the Text REtrieval Conference (TREC)


    TREC-style Evaluation Cycle


    Ground Truth Data Preparation

• For black-box evaluations, people are involved in the process of deciding whether a multimedia object is relevant for a certain query
• For glass-box evaluations, for instance an evaluation of a shot boundary detection algorithm, the process of deciding whether a sequence of frames contains a shot boundary involves people as well


Corel Set

• The Corel test collection is a set of stock photographs, which is divided into subsets of images (e.g., Arabian horses, sunsets, or English pub signs)
• The collection is used to evaluate the effectiveness of content-based image retrieval systems in a large number of publications and has become a de-facto standard in the field


    Corel Set (cont.)

• Usually, Corel is used as follows:
  1. From the test collection, take out one image and use it as a query-by-example;
  2. Relevant images for the query are those that come from the same theme; and
  3. Compute precision and recall for a number of queries
• Problems:
  - A single Corel set does not exist
  - Different publications use different CDs and different selections of themes
  - Too simple, since the images come from only one professional photographer (it does not represent realistic settings)


MIREX Workshops

• The Music Information Retrieval Evaluation eXchange (MIREX) test collections consist of data from record labels that allow tracks from their artists' releases to be published
• MIREX uses a glass box approach to the evaluation of music information retrieval by identifying various subtasks, e.g. identification, drum detection, genre classification, melody extraction, onset detection, tempo extraction, key finding, symbolic genre classification, and symbolic melodic similarity


    TRECVID: The TREC Video Retrieval Workshops

• The workshop emerged as a special task of the Text REtrieval Conference (TREC) in 2001
• It later continued as an independent workshop collocated with TREC
• The workshop has focused mostly on news videos from US, Chinese and Arabic sources, e.g. CNN Headline News and ABC World News Tonight


    TRECVID tasks (example)

• TRECVID provides several white-box evaluation tasks, such as:
  - Shot boundary detection
  - Story segmentation
  - High-level feature extraction
  - Etc.
• The workshop series also provides a black-box evaluation framework: a search task that may include the complete interactive session of a user with the system


    CLEF Workshops

• CLEF (Cross-Language Evaluation Forum) is a workshop series mainly focusing on multilingual retrieval, i.e., retrieving data from a collection of documents in many languages, and cross-language retrieval
• CLEF started as a cross-language retrieval task in TREC in 1997 and became independent from TREC in 2000


    ImageCLEF

• Goal: to investigate the effectiveness of combining text and image for retrieval
• ImageCLEF also provides an evaluation task for a medical image collection consisting of medical photographs, X-rays, CT scans, MRIs, etc., provided by the Geneva University Hospital in the Casimage project


INEX Workshops

• The Initiative for the Evaluation of XML Retrieval (INEX) aims at evaluating the retrieval performance of XML retrieval systems
• It focuses on using the structure of the document to extract, relate and combine the relevance scores of different multimedia fragments


    INEX Workshop task (example)

• INEX 2006 had a modest multimedia task that involved data from the Wikipedia
• A search query might, for instance, search for images by combining information from an image's caption text, its links, and the article that contains the image


Evaluation Measures

• The effectiveness of a system or component is often measured by the combination of precision and recall
• Precision is defined as the fraction of the retrieved or detected objects that is actually relevant. Recall is defined as the fraction of the relevant objects that is actually retrieved:


precision = (number of relevant documents retrieved) / (total number of documents retrieved)

recall = (number of relevant documents retrieved) / (total number of relevant documents)

[Figure: Venn diagram of the retrieved documents and the relevant documents within the entire document collection]
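A minimal sketch of these two definitions in Python (the function and variable names are illustrative, not from the slides):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a set of retrieved and a set of relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                      # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 of the 6 relevant documents appear in a result set of 10 documents
print(precision_recall(retrieved=range(1, 11), relevant=[2, 4, 6, 8, 20, 30]))  # (0.4, 0.666...)
```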


    Determining Recall is Difficult

• The total number of relevant items is sometimes not available:
  - Sample across the database and perform relevance judgments on these items
  - Apply different retrieval algorithms to the same database for the same query; the aggregate of relevant items is taken as the total relevant set


Trade-off between Recall and Precision


[Plot: precision (y-axis) against recall (x-axis), both from 0 to 1. The ideal system sits at high recall and high precision; a system that returns relevant documents but misses many useful ones has high precision and low recall; a system that returns most relevant documents but includes lots of junk has high recall and low precision.]


    Computing Recall/Precision Points

• For a given query, produce the ranked list of retrievals
• Adjusting a threshold on this ranked list produces different sets of retrieved documents, and therefore different recall/precision measures
• Mark each document in the ranked list that is relevant according to the gold standard
• Compute a recall/precision pair for each position in the ranked list that contains a relevant document


    Computing Recall/Precision Points (example)


Let the total number of relevant documents be 6 (x marks a relevant document).
Check each new recall point:

 n   doc #   relevant
 1   588     x          R = 1/6 = 0.167;  P = 1/1 = 1.0
 2   589     x          R = 2/6 = 0.333;  P = 2/2 = 1.0
 3   576
 4   590     x          R = 3/6 = 0.5;    P = 3/4 = 0.75
 5   986
 6   592     x          R = 4/6 = 0.667;  P = 4/6 = 0.667
 7   984
 8   988
 9   578
10   985
11   103
12   591
13   772     x          R = 5/6 = 0.833;  P = 5/13 = 0.38
14   990

One relevant document is missing from the ranked list, so 100% recall is never reached.
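A small sketch of this procedure, reproducing the recall/precision points of the example above (names are illustrative):

```python
def recall_precision_points(ranked_is_relevant, total_relevant):
    """One (recall, precision) pair for every rank that holds a relevant document."""
    points, hits = [], 0
    for rank, is_relevant in enumerate(ranked_is_relevant, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# Relevance flags of the 14-document ranking above (relevant at ranks 1, 2, 4, 6, 13)
ranking = [True, True, False, True, False, True, False,
           False, False, False, False, False, True, False]
for recall, precision in recall_precision_points(ranking, total_relevant=6):
    print(f"R={recall:.3f}  P={precision:.3f}")
# R=0.167 P=1.000, R=0.333 P=1.000, R=0.500 P=0.750, R=0.667 P=0.667, R=0.833 P=0.385
```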


    Precision at Fixed Recall Levels

• A number of fixed recall levels are chosen, for instance 10 levels: {0.1, 0.2, ..., 1.0}
• The levels correspond to users that are satisfied if they find respectively 10%, 20%, ..., 100% of the relevant documents
• For each of these levels the corresponding precision is determined by averaging the precision on that level over the tests that are performed
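One common way to obtain a single precision value per level is interpolation: take the best precision reached at any recall greater than or equal to that level, and average these values over the queries. The interpolation rule is an assumption here (the slide only says that precision is averaged per level); a brief sketch:

```python
RECALL_LEVELS = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)

def precision_at_recall_levels(points, levels=RECALL_LEVELS):
    """Interpolated precision for a single query: the precision at level r is the
    highest precision observed at any recall >= r (0 if that recall is never reached)."""
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]

# (recall, precision) pairs of the earlier example query; averaging the resulting
# lists element-wise over all queries gives the final precision/recall curve.
points = [(0.167, 1.0), (0.333, 1.0), (0.5, 0.75), (0.667, 0.667), (0.833, 0.385)]
print(precision_at_recall_levels(points))
```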


    Precision at Fixed Points in the Ranked List

• Recall is not necessarily a good measure of user equivalence
• For instance, if one query has 20 relevant documents while another has 200
• A recall of 50% would be a reasonable goal in the first case, but unmanageable for most users in the second case
• Alternatively, choose a number of fixed points in the ranked list, for instance nine points at 5, 10, 15, 20, 30, 100, 200, 500 and 1000 documents retrieved
• These points correspond to users that are willing to read 5, 10, 15, etc. documents of a search
• For each of these points in the ranked list, the precision is determined by averaging the precision at that point over the queries
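A short sketch of this cutoff-based measure (often written P@k), using the relevance flags of the earlier example ranking:

```python
def precision_at_k(ranked_is_relevant, cutoffs=(5, 10, 15, 20, 30, 100, 200, 500, 1000)):
    """Precision after the user has inspected the top k results, for each cutoff k;
    positions beyond the end of the ranking count as non-relevant."""
    return {k: sum(ranked_is_relevant[:k]) / k for k in cutoffs}

# Relevance flags of the 14-document example ranking (relevant at ranks 1, 2, 4, 6, 13)
ranking = [True, True, False, True, False, True, False,
           False, False, False, False, False, True, False]
print(precision_at_k(ranking, cutoffs=(5, 10)))  # {5: 0.6, 10: 0.4}
```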


    Mean Average Precision

• Average Precision: the average of the precision values at the points at which each relevant document is retrieved
• Example: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
• Mean Average Precision: the average of the average precision value for a set of queries
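A minimal sketch of both measures; the example call reproduces the value above for the earlier 14-document ranking (relevant documents that are never retrieved contribute a precision of 0):

```python
def average_precision(ranked_is_relevant, total_relevant):
    """Sum of the precision values at the ranks of the retrieved relevant documents,
    divided by the total number of relevant documents (missed ones contribute 0)."""
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(ranked_is_relevant, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant

def mean_average_precision(runs):
    """runs: one (ranked_is_relevant, total_relevant) pair per query."""
    return sum(average_precision(flags, n) for flags, n in runs) / len(runs)

# The 14-document example ranking with 6 relevant documents in the collection
ranking = [True, True, False, True, False, True, False,
           False, False, False, False, False, True, False]
print(average_precision(ranking, total_relevant=6))  # ~0.63, as in the example above
```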


    Combining Precision/Recall: F-Measure

• One measure of performance that takes into account both recall and precision
• Harmonic mean of recall and precision (see the formula below)
• Compared to the arithmetic mean, both need to be high for the harmonic mean to be high


F = 2PR / (P + R) = 2 / (1/R + 1/P)
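A tiny sketch of the harmonic mean, illustrating why both values must be high:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The arithmetic mean of 0.9 and 0.1 is 0.5, but the harmonic mean stays low:
print(f_measure(0.9, 0.1))  # 0.18
```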


    Significance Test

• Simply citing percentage improvements of one method over another is helpful, but it does not tell whether the improvements were in fact due to differences between the two methods
• A difference between two methods might simply be due to random variation in performance
• To make significance testing of the differences applicable, a reasonable number of queries is needed
• A technique called cross-validation is used to further prevent biased evaluation results (see the sketch below)
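A minimal sketch of cross-validation over queries, assuming the queries are split into k folds and each fold is held out once for evaluation while the remaining folds are used for tuning (the fold scheme is illustrative, not prescribed by the slides):

```python
def k_fold_splits(query_ids, k=5):
    """Yield (tuning_queries, held_out_queries) pairs; each fold is held out once."""
    folds = [query_ids[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        tuning = [q for j, fold in enumerate(folds) if j != i for q in fold]
        yield tuning, held_out

for tuning_queries, test_queries in k_fold_splits(list(range(1, 51)), k=5):
    # tune system parameters on tuning_queries, then report effectiveness on test_queries
    print(len(tuning_queries), len(test_queries))   # 40 10 (five times)
```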


    Cross Validation (example)


    Significance Tests Assumptions

• Significance tests are designed to disprove the null hypothesis H0
• The null hypothesis will be that there is no difference between method A and method B
• Rejecting H0 implies accepting the alternative hypothesis H1
• The alternative hypothesis for the retrieval experiments will be that:
  1. Method A consistently outperforms method B, or
  2. Method B consistently outperforms method A


    Common Significance Tests

1. The paired t-test: assumes that the errors are normally distributed
2. The paired Wilcoxon signed-ranks test: a non-parametric test that assumes that the errors come from a continuous distribution that is symmetric around 0
3. The paired sign test: a non-parametric test that only uses the sign of the difference between method A and B for each query; the test statistic is the number of times that the least frequent sign occurs
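A sketch of the three tests applied to per-query average precision scores, using SciPy; the scores below are synthetic placeholders, and stats.binomtest needs SciPy 1.7 or newer:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ap_a = rng.uniform(0.0, 0.5, size=50)                          # per-query AP of system A
ap_b = np.clip(ap_a + rng.normal(0.03, 0.05, size=50), 0, 1)   # per-query AP of system B

# 1. Paired t-test: assumes normally distributed differences
print(stats.ttest_rel(ap_a, ap_b))

# 2. Paired Wilcoxon signed-ranks test: non-parametric, based on ranked differences
print(stats.wilcoxon(ap_a, ap_b))

# 3. Paired sign test: only the sign of each difference, via a binomial test
diff = ap_b - ap_a
wins, ties = int(np.sum(diff > 0)), int(np.sum(diff == 0))
print(stats.binomtest(wins, n=len(diff) - ties, p=0.5))
```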


    Significance Test (example)

• We are going to compare the effectiveness of two content-based image retrieval systems, A and B
• Given an example query image, each returns a ranked list of images from a standard benchmark test collection
• For each system, we run 50 queries from the benchmark
• Based on the benchmark's relevance judgments, the mean average precision is 0.208 for system A and 0.241 for system B
• Is this difference significant?


    A Case Study: TRECVID

• In 2004, there were four main tasks in TRECVID:
  Glass box tasks:
  1. Shot boundary detection
  2. Story segmentation
  3. High-level feature extraction
  Black box task: Search


    Shot Boundary Detection

• A shot marks the complete sequence of frames starting from the moment that the camera was switched on and stopping when it was switched off again, or a subsequence of that selected by the movie editor
• Shot boundaries: the positions in the video stream that contain the transitions from one shot to the next shot
• There are two kinds of boundaries:
  1. Hard cuts: a relatively large difference between the two frames; easy to detect
  2. Soft cuts: there is a sequence of frames that belongs to both the first shot and the second shot; harder to detect


    TRECVID 2004 Shot Boundary Detection Task

• Goal: to identify the shot boundaries with their location and type (hard or soft) in the given video clips
• The performance of a system is measured by precision and recall of the detected shot boundaries
• The detection criteria require only a single frame overlap between the submitted transitions and the reference transition
• The TRECVID 2004 test data:
  - 618,409 frames
  - 4,806 shot transitions (58% hard cuts, 42% soft cuts, mostly dissolves)
• The best performing systems: 95% precision and recall on hard cuts, 80% precision and recall on soft cuts
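A rough sketch of this kind of scoring, with transitions given as frame intervals and a greedy one-to-one matching on the single-frame-overlap criterion; this is a simplified reading of the task description, not the official TRECVID scoring code:

```python
def count_matches(detected, reference):
    """Greedily match detected transitions (start_frame, end_frame) to reference
    transitions; a pair matches if the two frame intervals share at least one frame."""
    used, matches = set(), 0
    for d_start, d_end in detected:
        for i, (r_start, r_end) in enumerate(reference):
            if i not in used and d_start <= r_end and r_start <= d_end:
                used.add(i)
                matches += 1
                break
    return matches

detected = [(100, 101), (250, 260), (400, 401)]
reference = [(100, 101), (255, 270), (500, 501)]
m = count_matches(detected, reference)
print(m / len(detected), m / len(reference))  # precision and recall: 0.67 0.67
```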


    Hard cuts vs Soft cuts


[Figure: example frames of a hard cut and of a soft cut with a dissolve effect]


    Story Segmentation

• A digital video retrieval system with shots as the basic retrieval unit might not be desirable for news shows and news magazines
• The segmentation of a news show into its constituent news items has been studied under the names of story boundary detection and story segmentation
• Story segmentation of multimedia data can potentially exploit visual, audio and textual cues present in the data
• Story boundaries often, but not always, occur at shot boundaries


    TRECVID 2004 Story Segmentation Task

• Goal: given the story boundary test collection, identify the story boundaries with their location (time) in the given video clip(s)
• The definition of the story segmentation task was based on manual story boundary annotations made by the Linguistic Data Consortium (LDC) for the Topic Detection and Tracking (TDT) project
• Experimental setups:
  - Video + Audio (no ASR/CC)
  - Video + Audio + ASR/CC
  - ASR/CC (no Video + Audio)


    Story Segmentation (Zhai, 2005)

• The video consists of two news stories and one commercial. The blue cycle and the red cycle are the news stories, and the green cycle represents a miscellaneous story


    Story Boundary Evaluation

• A story boundary was expressed as a time offset with respect to the start of the video file in seconds, accurate to the nearest hundredth of a second
• Each reference boundary was expanded with a fuzziness factor of five seconds in each direction
• If a computed boundary did not fall in the evaluation interval of a reference boundary, it was considered a false alarm
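A small sketch of this matching rule, with boundary times in seconds and each reference boundary accepting computed boundaries within ±5 s (names and example values are illustrative):

```python
def evaluate_story_boundaries(computed, reference, fuzz=5.0):
    """Match computed boundary times against reference boundaries expanded by `fuzz`
    seconds in both directions; unmatched computed boundaries are false alarms."""
    matched, false_alarms = set(), 0
    for t in computed:
        hit = next((i for i, r in enumerate(reference)
                    if i not in matched and abs(t - r) <= fuzz), None)
        if hit is None:
            false_alarms += 1
        else:
            matched.add(hit)
    misses = len(reference) - len(matched)
    return len(matched), false_alarms, misses

print(evaluate_story_boundaries([12.30, 61.05, 200.00], [10.00, 60.00, 120.00]))
# (2, 1, 1): two correct boundaries, one false alarm, one missed reference boundary
```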


    High-level Feature Extraction

• One way to bridge the semantic gap between video content and textual representations is to annotate video footage with high-level features
• The assumption is that these features describe generic concepts, e.g.:
  - Genres: commercial, politics, sports, weather, financial
  - Objects: face, fire, explosion, car, boat, aircraft, map, building
  - Scene: waterscape, mountain, sky, outdoor, indoor, vegetation
  - Action: people walking, people in a crowd


    Video Search Task

• Goal: to evaluate the system as a whole
• Given the search test collection and a multimedia statement of a user's information need (called a topic in TRECVID)
• Return a ranked list of at most 1000 shots from the test collection that best satisfy the need
• A topic consists of a textual description of the user's need, for instance "Find shots of one or more buildings with flood waters around it/them", and possibly one or more example frames or images


    Manual vs Interactive Search Task
