Evaluation of Multimedia Retrieval Systems


    Multimedia Indexing & Retrieval

    Evaluation of Multimedia Retrieval Systems


    Outline

• Introduction
• Test Collections and Evaluation Workshops
• Evaluation Measures
• Significance Tests
• A Case Study: TRECVID


Introduction

• We are going to discuss the tools and methodology for comparing the effectiveness of two or more multimedia retrieval systems in a meaningful way
• Multimedia retrieval systems can be evaluated using two strategies:
  1. Without consulting the potential users of the system, e.g. query processing time, query throughput, etc.
  2. With consulting the potential users of the system, e.g. the effectiveness of the retrieved results


    Test Collections

• Doing an evaluation involving real people is not only a costly job, it is also difficult to control and hard to replicate
• Methods have been developed to design so-called test collections
• Often, these test collections are created by consulting potential users
• A test collection can be used to evaluate multimedia retrieval systems without the need to consult the users during further evaluations
• A new retrieval method or system can be evaluated by comparing it to some well-established methods in a controlled experiment


Evaluation Methods in a Controlled Experiment (Hull, 1993)

• A test collection consisting of:
  1. Multimedia data
  2. A task and the data needed for that task, for instance image search using example images as queries
  3. Ground truth data (correct answers)
• One or more suitable evaluation measures that assign values to the effectiveness of the search
• A statistical methodology that determines whether the observed differences in performance between the methods investigated are statistically significant


Glass Box vs Black Box

• Two approaches to test the effectiveness of a multimedia search system:
  1. Glass box evaluation, i.e., the systematic assessment of every component of a system; for example, the word error rate of the speech recognition or the shot detector subsystem of a video retrieval system
  2. Black box evaluation, i.e., testing the system as a whole


Test Collections and Evaluation Workshops

• The scientific evaluation of multimedia search systems is a costly and laborious job
• Big companies, for instance Web search companies such as Yahoo and Google, might perform such evaluations themselves
• For smaller companies and research institutions, a good option is often to take a test collection that is prepared by others
• Test collections are often created by the collaborative effort of many research groups in so-called evaluation workshops of conferences, e.g. the Text REtrieval Conference (TREC)


    TREC-style Evaluation Cycle


    Ground Truth Data Preparation

• For black-box evaluations, people are involved in the process of deciding whether a multimedia object is relevant for a certain query
• For glass-box evaluations, for instance an evaluation of a shot boundary detection algorithm, the process of deciding whether a sequence of frames contains a shot boundary involves people as well


Corel Set

• The Corel test collection is a set of stock photographs, which is divided into subsets of images (e.g., Arabian horses, sunsets, or English pub signs)
• The collection is used to evaluate the effectiveness of content-based image retrieval systems in a large number of publications and has become a de-facto standard in the field


    Corel Set (cont.)

• Usually, Corel is used as follows:
  1. From the test collection, take out one image and use it as a query-by-example;
  2. Relevant images for the query are those that come from the same theme; and
  3. Compute precision and recall for a number of queries
• Problems:
  - A single Corel set does not exist
  - Different publications use different CDs and different selections of themes
  - Too simple, since the images come from only one professional photographer (it does not represent realistic settings)


MIREX Workshops

• The Music Information Retrieval Evaluation eXchange (MIREX) test collections consist of data from record labels that allow tracks from their artists' releases to be published
• MIREX uses a glass box approach to the evaluation of music information retrieval by identifying various subtasks, e.g. identification, drum detection, genre classification, melody extraction, onset detection, tempo extraction, key finding, symbolic genre classification, and symbolic melodic similarity


    TRECVID: The TREC Video Retrieval Workshops

• The workshop emerged as a special task of the Text REtrieval Conference (TREC) in 2001
• It later continued as an independent workshop collocated with TREC
• The workshop has focused mostly on news videos from US, Chinese and Arabic sources, e.g. CNN Headline News and ABC World News Tonight


    TRECVID tasks (example)

• TRECVID provides several white-box evaluation tasks, such as:
  - Shot boundary detection
  - Story segmentation
  - High-level feature extraction
  - Etc.
• The workshop series also provides a black-box evaluation framework: a search task that may include the complete interactive session of a user with the system


    CLEF Workshops

• CLEF (Cross-Language Evaluation Forum) is a workshop series mainly focusing on multilingual retrieval, i.e., retrieving data from a collection of documents in many languages, and cross-language retrieval
• CLEF started as a cross-language retrieval task in TREC in 1997 and became independent from TREC in 2000


    ImageCLEF

• Goal: to investigate the effectiveness of combining text and image for retrieval
• ImageCLEF also provides an evaluation task for a medical image collection consisting of medical photographs, X-rays, CT scans, MRIs, etc., provided by the Geneva University Hospital in the Casimage project


INEX Workshops

• The Initiative for the Evaluation of XML Retrieval (INEX) aims at evaluating the retrieval performance of XML retrieval systems
• It focuses on using the structure of the document to extract, relate and combine the relevance scores of different multimedia fragments


    INEX Workshop task (example)

• INEX 2006 had a modest multimedia task that involved data from the Wikipedia
• A search query might, for instance, search for images by combining information from an image's caption text, its links, and the article that contains the image


Evaluation Measures

• The effectiveness of a system or component is often measured by the combination of precision and recall
• Precision is defined as the fraction of the retrieved or detected objects that is actually relevant. Recall is defined as the fraction of the relevant objects that is actually retrieved:


precision = (number of relevant documents retrieved) / (total number of documents retrieved)

recall = (number of relevant documents retrieved) / (total number of relevant documents)

[Figure: Venn diagram of the retrieved documents and the relevant documents within the entire document collection]
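A minimal sketch of these two definitions in Python (the function and variable names are illustrative, not from the slides):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a set of retrieved and a set of relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                      # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 of the 6 relevant documents appear in a result set of 10 documents
print(precision_recall(retrieved=range(1, 11), relevant=[2, 4, 6, 8, 20, 30]))  # (0.4, 0.666...)
```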


    Determining Recall is Difficult

• The total number of relevant items is sometimes not available:
  - Sample across the database and perform relevance judgments on these items
  - Apply different retrieval algorithms to the same database for the same query; the aggregate of relevant items is taken as the total relevant set


Trade-off between Recall and Precision


[Plot: precision (y-axis) against recall (x-axis), both from 0 to 1. The ideal system sits at high recall and high precision; a system that returns relevant documents but misses many useful ones has high precision and low recall; a system that returns most relevant documents but includes lots of junk has high recall and low precision.]


    Computing Recall/Precision Points

• For a given query, produce the ranked list of retrievals
• Adjusting a threshold on this ranked list produces different sets of retrieved documents, and therefore different recall/precision measures
• Mark each document in the ranked list that is relevant according to the gold standard
• Compute a recall/precision pair for each position in the ranked list that contains a relevant document


    Computing Recall/Precision Points (example)


Let the total number of relevant documents be 6 (x marks a relevant document).
Check each new recall point:

 n   doc #   relevant
 1   588     x          R = 1/6 = 0.167;  P = 1/1 = 1.0
 2   589     x          R = 2/6 = 0.333;  P = 2/2 = 1.0
 3   576
 4   590     x          R = 3/6 = 0.5;    P = 3/4 = 0.75
 5   986
 6   592     x          R = 4/6 = 0.667;  P = 4/6 = 0.667
 7   984
 8   988
 9   578
10   985
11   103
12   591
13   772     x          R = 5/6 = 0.833;  P = 5/13 = 0.38
14   990

One relevant document is missing from the ranked list, so 100% recall is never reached.
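A small sketch of this procedure, reproducing the recall/precision points of the example above (names are illustrative):

```python
def recall_precision_points(ranked_is_relevant, total_relevant):
    """One (recall, precision) pair for every rank that holds a relevant document."""
    points, hits = [], 0
    for rank, is_relevant in enumerate(ranked_is_relevant, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# Relevance flags of the 14-document ranking above (relevant at ranks 1, 2, 4, 6, 13)
ranking = [True, True, False, True, False, True, False,
           False, False, False, False, False, True, False]
for recall, precision in recall_precision_points(ranking, total_relevant=6):
    print(f"R={recall:.3f}  P={precision:.3f}")
# R=0.167 P=1.000, R=0.333 P=1.000, R=0.500 P=0.750, R=0.667 P=0.667, R=0.833 P=0.385
```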


    Precision at Fixed Recall Levels

• A number of fixed recall levels are chosen, for instance 10 levels: {0.1, 0.2, ..., 1.0}
• The levels correspond to users that are satisfied if they find respectively 10%, 20%, ..., 100% of the relevant documents
• For each of these levels the corresponding precision is determined by averaging the precision on that level over the tests that are performed
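One common way to obtain a single precision value per level is interpolation: take the best precision reached at any recall greater than or equal to that level, and average these values over the queries. The interpolation rule is an assumption here (the slide only says that precision is averaged per level); a brief sketch:

```python
RECALL_LEVELS = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)

def precision_at_recall_levels(points, levels=RECALL_LEVELS):
    """Interpolated precision for a single query: the precision at level r is the
    highest precision observed at any recall >= r (0 if that recall is never reached)."""
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]

# (recall, precision) pairs of the earlier example query; averaging the resulting
# lists element-wise over all queries gives the final precision/recall curve.
points = [(0.167, 1.0), (0.333, 1.0), (0.5, 0.75), (0.667, 0.667), (0.833, 0.385)]
print(precision_at_recall_levels(points))
```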


    Precision at Fixed Points in the Ranked List

• Recall is not necessarily a good measure of user equivalence
• For instance, if one query has 20 relevant documents while another has 200
• A recall of 50% would be a reasonable goal in the first case, but unmanageable for most users in the second case
• Alternatively, choose a number of fixed points in the ranked list, for instance nine points at 5, 10, 15, 20, 30, 100, 200, 500 and 1000 documents retrieved
• These points correspond to users that are willing to read 5, 10, 15, etc. documents of a search
• For each of these points in the ranked list, the precision is determined by averaging the precision at that point over the queries
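A short sketch of this cutoff-based measure (often written P@k), using the relevance flags of the earlier example ranking:

```python
def precision_at_k(ranked_is_relevant, cutoffs=(5, 10, 15, 20, 30, 100, 200, 500, 1000)):
    """Precision after the user has inspected the top k results, for each cutoff k;
    positions beyond the end of the ranking count as non-relevant."""
    return {k: sum(ranked_is_relevant[:k]) / k for k in cutoffs}

# Relevance flags of the 14-document example ranking (relevant at ranks 1, 2, 4, 6, 13)
ranking = [True, True, False, True, False, True, False,
           False, False, False, False, False, True, False]
print(precision_at_k(ranking, cutoffs=(5, 10)))  # {5: 0.6, 10: 0.4}
```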


    Mean Average Precision

• Average Precision: the average of the precision values at the points at which each relevant document is retrieved
• Example: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
• Mean Average Precision: the average of the average precision value for a set of queries
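A minimal sketch of both measures; the example call reproduces the value above for the earlier 14-document ranking (relevant documents that are never retrieved contribute a precision of 0):

```python
def average_precision(ranked_is_relevant, total_relevant):
    """Sum of the precision values at the ranks of the retrieved relevant documents,
    divided by the total number of relevant documents (missed ones contribute 0)."""
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(ranked_is_relevant, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant

def mean_average_precision(runs):
    """runs: one (ranked_is_relevant, total_relevant) pair per query."""
    return sum(average_precision(flags, n) for flags, n in runs) / len(runs)

# The 14-document example ranking with 6 relevant documents in the collection
ranking = [True, True, False, True, False, True, False,
           False, False, False, False, False, True, False]
print(average_precision(ranking, total_relevant=6))  # ~0.63, as in the example above
```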


    Combining Precision/Recall: F-Measure

• One measure of performance that takes into account both recall and precision
• Harmonic mean of recall and precision (see the formula below)
• Compared to the arithmetic mean, both need to be high for the harmonic mean to be high


F = 2PR / (P + R) = 2 / (1/R + 1/P)
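A tiny sketch of the harmonic mean, illustrating why both values must be high:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The arithmetic mean of 0.9 and 0.1 is 0.5, but the harmonic mean stays low:
print(f_measure(0.9, 0.1))  # 0.18
```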


    Significance Test

• Simply citing percentage improvements of one method over another is helpful, but it does not tell whether the improvements were in fact due to differences between the two methods
• A difference between two methods might simply be due to random variation in performance
• To make significance testing of the differences applicable, a reasonable number of queries is needed
• A technique called cross-validation is used to further prevent biased evaluation results (see the sketch below)
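A minimal sketch of cross-validation over queries, assuming the queries are split into k folds and each fold is held out once for evaluation while the remaining folds are used for tuning (the fold scheme is illustrative, not prescribed by the slides):

```python
def k_fold_splits(query_ids, k=5):
    """Yield (tuning_queries, held_out_queries) pairs; each fold is held out once."""
    folds = [query_ids[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        tuning = [q for j, fold in enumerate(folds) if j != i for q in fold]
        yield tuning, held_out

for tuning_queries, test_queries in k_fold_splits(list(range(1, 51)), k=5):
    # tune system parameters on tuning_queries, then report effectiveness on test_queries
    print(len(tuning_queries), len(test_queries))   # 40 10 (five times)
```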


    Cross Validation (example)


    Significance Tests Assumptions

• Significance tests are designed to disprove the null hypothesis H0
• The null hypothesis will be that there is no difference between method A and method B
• Rejecting H0 implies accepting the alternative hypothesis H1
• The alternative hypothesis for the retrieval experiments will be that:
  1. Method A consistently outperforms method B, or
  2. Method B consistently outperforms method A


    Common Significance Tests

1. The paired t-test: assumes that the errors are normally distributed
2. The paired Wilcoxon signed-ranks test: a non-parametric test that assumes that the errors come from a continuous distribution that is symmetric around 0
3. The paired sign test: a non-parametric test that only uses the sign of the difference between method A and B for each query; the test statistic is the number of times that the least frequent sign occurs
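A sketch of the three tests applied to per-query average precision scores, using SciPy; the scores below are synthetic placeholders, and stats.binomtest needs SciPy 1.7 or newer:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ap_a = rng.uniform(0.0, 0.5, size=50)                          # per-query AP of system A
ap_b = np.clip(ap_a + rng.normal(0.03, 0.05, size=50), 0, 1)   # per-query AP of system B

# 1. Paired t-test: assumes normally distributed differences
print(stats.ttest_rel(ap_a, ap_b))

# 2. Paired Wilcoxon signed-ranks test: non-parametric, based on ranked differences
print(stats.wilcoxon(ap_a, ap_b))

# 3. Paired sign test: only the sign of each difference, via a binomial test
diff = ap_b - ap_a
wins, ties = int(np.sum(diff > 0)), int(np.sum(diff == 0))
print(stats.binomtest(wins, n=len(diff) - ties, p=0.5))
```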


    Significance Test (example)

• We are going to compare the effectiveness of two content-based image retrieval systems, A and B
• Given an example query image, each returns a ranked list of images from a standard benchmark test collection
• For each system, we run 50 queries from the benchmark
• Based on the benchmark's relevance judgments, the mean average precision is 0.208 for system A and 0.241 for system B
• Is this difference significant?


    A Case Study: TRECVID

• In 2004, there were four main tasks in TRECVID:
  Glass box tasks:
  1. Shot boundary detection
  2. Story segmentation
  3. High-level feature extraction
  Black box task: Search


    Shot Boundary Detection

• A shot marks the complete sequence of frames starting from the moment that the camera was switched on and stopping when it was switched off again, or a subsequence of that selected by the movie editor
• Shot boundaries: the positions in the video stream that contain the transitions from one shot to the next shot
• There are two kinds of boundaries:
  1. Hard cuts: a relatively large difference between the two frames; easy to detect
  2. Soft cuts: there is a sequence of frames that belongs to both the first shot and the second shot; harder to detect


    TRECVID 2004 Shot Boundary Detection Task

• Goal: to identify the shot boundaries with their location and type (hard or soft) in the given video clips
• The performance of a system is measured by precision and recall of the detected shot boundaries
• The detection criteria require only a single frame overlap between the submitted transitions and the reference transition
• The TRECVID 2004 test data:
  - 618,409 frames
  - 4,806 shot transitions (58% hard cuts, 42% soft cuts, mostly dissolves)
• The best performing systems: 95% precision and recall on hard cuts, 80% precision and recall on soft cuts
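A rough sketch of this kind of scoring, with transitions given as frame intervals and a greedy one-to-one matching on the single-frame-overlap criterion; this is a simplified reading of the task description, not the official TRECVID scoring code:

```python
def count_matches(detected, reference):
    """Greedily match detected transitions (start_frame, end_frame) to reference
    transitions; a pair matches if the two frame intervals share at least one frame."""
    used, matches = set(), 0
    for d_start, d_end in detected:
        for i, (r_start, r_end) in enumerate(reference):
            if i not in used and d_start <= r_end and r_start <= d_end:
                used.add(i)
                matches += 1
                break
    return matches

detected = [(100, 101), (250, 260), (400, 401)]
reference = [(100, 101), (255, 270), (500, 501)]
m = count_matches(detected, reference)
print(m / len(detected), m / len(reference))  # precision and recall: 0.67 0.67
```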


    Hard cuts vs Soft cuts


[Figure: example frames of a hard cut and of a soft cut with a dissolve effect]


    Story Segmentation

• A digital video retrieval system with shots as the basic retrieval unit might not be desirable for news shows and news magazines
• The segmentation of a news show into its constituent news items has been studied under the names of story boundary detection and story segmentation
• Story segmentation of multimedia data can potentially exploit visual, audio and textual cues present in the data
• Story boundaries often, but not always, occur at shot boundaries


    TRECVID 2004 Story Segmentation Task

• Goal: given the story boundary test collection, identify the story boundaries with their location (time) in the given video clip(s)
• The definition of the story segmentation task was based on manual story boundary annotations made by the Linguistic Data Consortium (LDC) for the Topic Detection and Tracking (TDT) project
• Experimental setups:
  - Video + Audio (no ASR/CC)
  - Video + Audio + ASR/CC
  - ASR/CC (no Video + Audio)


    Story Segmentation (Zhai, 2005)

• The video consists of two news stories and one commercial. The blue cycle and the red cycle are the news stories, and the green cycle represents a miscellaneous story


    Story Boundary Evaluation

• A story boundary was expressed as a time offset with respect to the start of the video file in seconds, accurate to the nearest hundredth of a second
• Each reference boundary was expanded with a fuzziness factor of five seconds in each direction
• If a computed boundary did not fall in the evaluation interval of a reference boundary, it was considered a false alarm
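A small sketch of this matching rule, with boundary times in seconds and each reference boundary accepting computed boundaries within ±5 s (names and example values are illustrative):

```python
def evaluate_story_boundaries(computed, reference, fuzz=5.0):
    """Match computed boundary times against reference boundaries expanded by `fuzz`
    seconds in both directions; unmatched computed boundaries are false alarms."""
    matched, false_alarms = set(), 0
    for t in computed:
        hit = next((i for i, r in enumerate(reference)
                    if i not in matched and abs(t - r) <= fuzz), None)
        if hit is None:
            false_alarms += 1
        else:
            matched.add(hit)
    misses = len(reference) - len(matched)
    return len(matched), false_alarms, misses

print(evaluate_story_boundaries([12.30, 61.05, 200.00], [10.00, 60.00, 120.00]))
# (2, 1, 1): two correct boundaries, one false alarm, one missed reference boundary
```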


    High-level Feature Extraction

• One way to bridge the semantic gap between video content and textual representations is to annotate video footage with high-level features
• The assumption is that these features describe generic concepts, e.g.:
  - Genres: commercial, politics, sports, weather, financial
  - Objects: face, fire, explosion, car, boat, aircraft, map, building
  - Scene: waterscape, mountain, sky, outdoor, indoor, vegetation
  - Action: people walking, people in a crowd


    Video Search Task

• Goal: to evaluate the system as a whole
• Given the search test collection and a multimedia statement of a user's information need (called a topic in TRECVID)
• Return a ranked list of at most 1000 shots from the test collection that best satisfy the need
• A topic consists of a textual description of the user's need, for instance "Find shots of one or more buildings with flood waters around it/them", and possibly one or more example frames or images


    Manual vs Interactive Search Task
