Modern features-part-4-evaluation

Page 1: Modern features-part-4-evaluation

Andrea Vedaldi, University of Oxford

Feature evaluation: old and new benchmarks, and new software

HARVEST Programme

Page 2: Modern features-part-4-evaluation

Benchmarking: why and how

• Dozens of feature detectors and descriptors have been proposed

• Benchmarks
  - compare methods empirically
  - select the best method for a task

• Public benchmarks
  - reproducible research
  - simplify your life!

• Ingredients of a benchmark:

37

Theory Data Software

Page 3: Modern features-part-4-evaluation

38

Indirect evaluation: repeatability and matching score (data: affine covariant testbed)

Direct evaluation: image retrieval (data: Oxford 5K)

Software: VLBenchmarks

Page 4: Modern features-part-4-evaluation

39

Indirect evaluation: repeatability and matching score (data: affine covariant testbed)

Direct evaluation: image retrieval (data: Oxford 5K)

Software: VLBenchmarks

Page 5: Modern features-part-4-evaluation

Indirect feature evaluation 40

• Intuition: test how well features persist and can be matched across image transformations.

• Data
  - must be representative of the transformations (viewpoint, illumination, noise, etc.)

• Performance measures
  - repeatability: persistence of features
  - matching score: matchability of features

K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 65(1/2):43–72, 2005.

Page 6: Modern features-part-4-evaluation

frame

Pages 7–14: Modern features-part-4-evaluation (image-only slides)

Viewpoint, scale, rotation

Affine Testbed 49

Page 15: Modern features-part-4-evaluation

Lighting, compression, blur

Affine Testbed 50

Page 16: Modern features-part-4-evaluation

Intuition

Detector repeatability 51

• two pairs of features correspond

• another pair of features does not

• repeatability = 2/3

Page 17: Modern features-part-4-evaluation

Formal definition

Region overlap 52

Given the homography H relating the two images, a region Ra in the first image, and a region Rb in the second image (HRb is Rb mapped into the first image):

overlap(a, b) = |Ra ∩ HRb| / |Ra ∪ HRb|

(area of intersection over area of union)
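The intersection-over-union can be approximated numerically by rasterising the two regions on a pixel grid. A minimal Python sketch (illustrative only; `ellipse_mask`, `overlap`, and the grid size are made up for this example, and this is not how VLBenchmarks computes ellipse overlap):

```python
import numpy as np

def ellipse_mask(cx, cy, rx, ry, shape):
    """Boolean mask of an axis-aligned ellipse on a pixel grid."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    return ((xs - cx) / rx) ** 2 + ((ys - cy) / ry) ** 2 <= 1.0

def overlap(mask_a, mask_b):
    """Area of intersection over area of union of two rasterised regions."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union

a = ellipse_mask(50, 50, 20, 10, (100, 100))
b = ellipse_mask(55, 50, 20, 10, (100, 100))  # same ellipse, shifted by 5 px
```

With this, `overlap(a, a)` is exactly 1, while the shifted copy overlaps only partially.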

Page 18: Modern features-part-4-evaluation

Intuition

Region overlap 53

Excerpt from "A Comparison of Affine Region Detectors":

Figure 12. Overlap error ε_O. Examples of ellipses projected on the corresponding ellipse with the ground-truth transformation. (bottom) Overlap error for the ellipses displayed above. Note that the overlap error comes from differences in the size, orientation and position of the ellipses.


• Examples of ellipses overlapping by different amounts

• Overlap error = 1 − overlap. Features are usually tested at 40% overlap error (i.e. at least 60% overlap).

Pages 19–22: Modern features-part-4-evaluation (image-only slides 54–57)
Page 23: Modern features-part-4-evaluation

Intuition

Normalised region overlap 58

• larger scale = better overlap: raw overlap favours big regions

• fix: rescale so that the reference region has an area of 30² pixels

Page 24: Modern features-part-4-evaluation

Formal definition

Normalised region overlap 59

Given the homography H:

1. Detect regions Ra and Rb
2. Warp: Rb ↦ HRb
3. Normalise: rescale both regions by a factor s chosen so that the rescaled reference region sRa has area 30² pixels
4. Intersection over union:

   overlap(a, b) = |sRa ∩ sHRb| / |sRa ∪ sHRb|

Page 25: Modern features-part-4-evaluation

Formal definition

Detector repeatability 60

Given the homography H:

1. Find the features in the common area: {Ra : a ∈ A} and {Rb : b ∈ B}

2. Threshold the overlap score:

   s_ab = overlap(a, b) if overlap(a, b) ≥ 1 − ε_o, and −∞ otherwise

3. Find the geometric matches (greedy approximation):

   M* = argmax over bipartite matchings M of Σ_{(a,b)∈M} s_ab

repeatability(A, B) = |M*| / min{|A|, |B|}
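The thresholding and greedy matching above can be sketched in Python. For brevity this illustration uses 1-D intervals in place of image regions; `repeatability` and `interval_iou` are hypothetical helper names, not the VLBenchmarks API:

```python
def interval_iou(a, b):
    """Intersection over union of two intervals (lo, hi)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union

def repeatability(regions_a, regions_b, overlap_fn, eps=0.4):
    """Greedy one-to-one matching of regions whose overlap error
    (1 - overlap) is at most eps; repeatability is the fraction of
    matches relative to the smaller region set."""
    scores = []
    for i, ra in enumerate(regions_a):
        for j, rb in enumerate(regions_b):
            s = overlap_fn(ra, rb)
            if s >= 1.0 - eps:          # threshold the overlap score
                scores.append((s, i, j))
    scores.sort(reverse=True)            # best overlaps first
    used_a, used_b, n_matches = set(), set(), 0
    for s, i, j in scores:               # greedy approximation to the
        if i not in used_a and j not in used_b:  # max-weight matching
            used_a.add(i); used_b.add(j)
            n_matches += 1
    return n_matches / min(len(regions_a), len(regions_b))

A = [(0, 10), (20, 30), (50, 60)]
B = [(1, 11), (21, 31), (100, 110)]
# two of the three regions find a counterpart -> repeatability = 2/3,
# mirroring the 2/3 example on the intuition slide
```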

Page 26: Modern features-part-4-evaluation

Intuition

Descriptor matching score 61

• In addition to being stable, features must be visually distinctive

• Descriptor matching score
  - similar to repeatability
  - but matches are constructed by comparing descriptors

Page 27: Modern features-part-4-evaluation

Formal definition

Descriptor matching score 62

Given the homography H:

1. Find the features in the common area: {Ra : a ∈ A} and {Rb : b ∈ B}

2. Compute descriptor distances: given descriptors {da : a ∈ A} and {db : b ∈ B},

   d_ab = ‖da − db‖₂

3. Descriptor matches:

   M*_d = argmin over bipartite matchings M of Σ_{(a,b)∈M} d_ab

4. Geometric matches (as before):

   M* = argmax over bipartite matchings M of Σ_{(a,b)∈M} s_ab

match-score(A, B) = |M* ∩ M*_d| / min{|A|, |B|}
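The same greedy scheme applied to descriptor distances can be sketched as follows; `matching_score` is a hypothetical helper for illustration (it takes the geometric matches as given), not the VLBenchmarks implementation:

```python
import numpy as np

def matching_score(desc_a, desc_b, geom_matches):
    """Greedy one-to-one matching by smallest L2 descriptor distance;
    the score counts descriptor matches that agree with the geometric
    (overlap-based) matches, relative to the smaller feature set."""
    pairs = sorted((float(np.linalg.norm(np.subtract(da, db))), i, j)
                   for i, da in enumerate(desc_a)
                   for j, db in enumerate(desc_b))
    used_a, used_b, desc_matches = set(), set(), set()
    for _, i, j in pairs:                # smallest distances first
        if i not in used_a and j not in used_b:
            used_a.add(i); used_b.add(j)
            desc_matches.add((i, j))
    return len(desc_matches & set(geom_matches)) / min(len(desc_a), len(desc_b))

da = [[0.0, 0.0], [1.0, 0.0]]
db = [[0.0, 0.1], [1.0, 0.1]]   # slightly perturbed copies of da
```

If the geometric matches are (0,0) and (1,1), both descriptor matches agree and the score is 1; with the opposite geometric assignment no match agrees and the score is 0.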

Page 28: Modern features-part-4-evaluation

Example of a repeatability graph 63

[Plot: repeatability (%) on the graf sequence as a function of viewpoint angle (20–60 degrees), comparing: VL DoG, VL Hessian, VL HessianLaplace, VL HarrisLaplace, the same four at double resolution, VLFeat SIFT, CMP Hessian, VGG hes, VGG har]

Page 29: Modern features-part-4-evaluation

64

Indirect evaluation: repeatability and matching score (data: affine covariant testbed)

Direct evaluation: image retrieval (data: Oxford 5K)

Software: VLBenchmarks

Page 30: Modern features-part-4-evaluation

Indirect evaluation

• Indirect evaluation
  - a "synthetic" performance measure in a "synthetic" setting

• The good
  - independent of a specific application / implementation
  - allows evaluating single components, e.g.
    ▪ repeatability of a detector
    ▪ matching score of a descriptor

• The bad
  - difficult to design well
  - unclear correlation with performance in applications

65

Page 31: Modern features-part-4-evaluation

Direct evaluation

• Direct evaluation
  - performance of a real system using the feature

• The good
  - tied to the "real" performance of the feature

• The bad
  - tied to one application
  - worse, tied to one implementation
  - difficult to evaluate single aspects of a feature

• In what follows we focus on object instance retrieval

66

Candidate applications: object instance retrieval, object category recognition, object detection, text recognition, semantic segmentation, ...

Page 32: Modern features-part-4-evaluation

Used to evaluate features

Image retrieval 67

...

Page 33: Modern features-part-4-evaluation

Represent images as bags of features

Image retrieval pipeline 68

input image → detector → {f1, ..., fn} → descriptor → {d1, ..., dn}

Detectors: Harris (Laplace), Hessian (Laplace), DoG, MSER, Harris-Affine, Hessian-Affine, ...

Descriptors: SIFT, LIOP, BRIEF, Jets, ...

Page 34: Modern features-part-4-evaluation

Step 1: find the neighbours of each query descriptor

Image retrieval pipeline 69

[Figure: descriptors of the query image linked to database descriptors, ordered by increasing descriptor distance]

H. Jégou, M. Douze, and C. Schmid. Exploiting descriptor distances for precise image search. Technical Report 7656, INRIA, 2011.

Page 35: Modern features-part-4-evaluation

Step 2: each query descriptor casts a vote for each DB image

Image retrieval pipeline 70

Given a query descriptor d, retrieve its k nearest database descriptors d1, d2, ..., dk by increasing distance. Letting d_i denote the distance at rank i, the neighbour at rank i casts a vote of strength max{d_k − d_i, 0} for the database image that contains it.
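The voting rule can be sketched as a toy Python illustration with brute-force nearest-neighbour search; `score_images` is a made-up name for this example, not part of any retrieval library:

```python
from collections import defaultdict
import numpy as np

def score_images(query_descs, db_descs, db_image_ids, k=3):
    """For each query descriptor, find its k nearest database
    descriptors; the neighbour at distance d_i casts a vote
    max(d_k - d_i, 0) for the image containing it (closer
    neighbours vote more, the k-th neighbour votes zero)."""
    scores = defaultdict(float)
    db = np.asarray(db_descs, dtype=float)
    for q in query_descs:
        dists = np.linalg.norm(db - np.asarray(q, dtype=float), axis=1)
        order = np.argsort(dists)[:k]
        d_k = dists[order[-1]]               # distance of the k-th neighbour
        for i in order:
            scores[db_image_ids[i]] += max(d_k - dists[i], 0.0)
    # rank database images by decreasing total votes
    return sorted(scores.items(), key=lambda kv: -kv[1])

q = [[0.0, 0.0]]
db = [[0.0, 0.1], [0.0, 0.2], [5.0, 5.0], [6.0, 6.0]]
ids = [0, 0, 1, 1]                           # descriptors 0-1 belong to image 0
ranking = score_images(q, db, ids, k=4)
```

Image 0, whose descriptors lie close to the query, accumulates the strongest votes and ranks first.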

Page 36: Modern features-part-4-evaluation

Step 3: sort DB images by decreasing total votes

Image retrieval pipeline 71

[Figure: the query image and the database images sorted by decreasing total votes (✔ correct retrievals, ✗ incorrect); the ranking's precision-recall curve gives an Average Precision (AP) of 35%]

Page 37: Modern features-part-4-evaluation

Step 4: Overall performance score

Image retrieval pipeline 72

[Figure: several queries with their retrieval results (✔/✗) and per-query APs, e.g. 35%, 100%, 75%, ...]

Averaging the per-query APs gives the Mean Average Precision (mAP), here 53%.
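AP and mAP can be computed from binary relevance lists; a minimal sketch, assuming every relevant image appears somewhere in the ranked list (real evaluations divide by the total number of relevant images in the database instead):

```python
def average_precision(relevance):
    """AP of one ranked list: mean of precision@rank over the
    ranks at which a relevant image appears."""
    hits, precisions = 0, []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(rankings):
    """mAP: average of the per-query APs."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

For example, a perfect ranking [1, 1, 1] gives AP = 1.0, while [0, 1] (the relevant image at rank 2) gives AP = 0.5; the mAP of those two queries is 0.75.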

Page 38: Modern features-part-4-evaluation

A retrieval benchmark dataset

Oxford 5K data 73

• ~5K images of Oxford
  - for each of 58 queries:
    ▪ about XX matching images
    ▪ about XX confounder images

• Larger datasets are possible, but are slow for extensive evaluation

• The relative ranking of features seems to be representative

[Figure: a query and its retrieved images (✔ correct, ✗ incorrect)]

Page 39: Modern features-part-4-evaluation

74

Indirect evaluation: repeatability and matching score (data: affine covariant testbed)

Direct evaluation: image retrieval (data: Oxford 5K)

Software: VLBenchmarks

Page 40: Modern features-part-4-evaluation

A new easy-to-use benchmarking suite

VLBenchmarks

• A novel MATLAB framework for feature evaluation
  - repeatability and matching scores (VGG affine testbed)
  - image retrieval (Oxford 5K)

• Goodies
  - simple-to-use MATLAB code
  - automatically downloads datasets & runs evaluations
  - backward compatible with published results

75

http://www.vlfeat.org/benchmarks/index.html

Page 41: Modern features-part-4-evaluation

Obtaining and installing the code

VLBenchmarks

• Installation
  - download the latest version
  - unpack the archive
  - launch MATLAB and type

    >> install

• Requirements
  - MATLAB R2008a (7.6)
  - a C compiler (e.g. Visual Studio, GCC, or Xcode)
  - do not forget to set up MATLAB to use your C compiler:

    mex -setup

76

Page 42: Modern features-part-4-evaluation

Example usage 77

import datasets.*;
import localFeatures.*;
import benchmarks.*;

% choose a detector (Harris Affine, VGG version)
detector = VggAffine('Detector', 'haraff');

% choose a dataset (graffiti sequence)
dataset = VggAffineDataset('category', 'graf');

% choose a test (detector repeatability)
benchmark = RepeatabilityBenchmark();

% run the evaluation (repeatability = 0.66)
repeatability = benchmark.testFeatureExtractor(detector, ...
                    dataset.getTransformation(2), ...
                    dataset.getImagePath(1), ...
                    dataset.getImagePath(2));

Page 43: Modern features-part-4-evaluation

Testing on a sequence of images 78

import datasets.*;
import localFeatures.*;
import benchmarks.*;

detector = VggAffine('Detector', 'haraff');
dataset = VggAffineDataset('category', 'graf');
benchmark = RepeatabilityBenchmark();

for j = 1:5
  repeatability(j) = ...
      benchmark.testFeatureExtractor(detector, ...
                                     dataset.getTransformation(j), ...
                                     dataset.getImagePath(1), ...
                                     dataset.getImagePath(j));
end

Use parfor on a cluster!

Page 44: Modern features-part-4-evaluation

79

[Plot: repeatability (0–1) vs. image number (1–5) for the sequence]

Page 45: Modern features-part-4-evaluation

Comparing two features 80

import datasets.*;
import localFeatures.*;
import benchmarks.*;

detectors{1} = VlFeatSift();
detectors{2} = VlFeatCovdet('EstimateAffineShape', true);
dataset = VggAffineDataset('category', 'graf');
benchmark = RepeatabilityBenchmark();

for d = 1:2
  for j = 1:5
    repeatability(j,d) = ...
        benchmark.testFeatureExtractor(detectors{d}, ...
                                       dataset.getTransformation(j), ...
                                       dataset.getImagePath(1), ...
                                       dataset.getImagePath(j));
  end
end

clf; plot(repeatability, 'linewidth', 4);
xlabel('image number');
ylabel('repeatability');
grid on;

Page 46: Modern features-part-4-evaluation

81

[Plot: repeatability (0–1) vs. image number (1–5), comparing the two detectors; legend: Affine Adaptation]

Page 47: Modern features-part-4-evaluation

Example

• Compare the following features
  - SIFT, MSER, and features on a grid
  - on the Graffiti sequence
  - for repeatability and number of correspondences

82

[Plots: repeatability and number of correspondences vs. viewpoint angle (20–60 degrees) for SIFT, MSER, and features on a grid]

Page 48: Modern features-part-4-evaluation

Backward compatible

• Previously published results can easily be reproduced

  - if interested, try the script reproduceIjcv05.m

83

[Shown: the first page of K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool, "A Comparison of Affine Region Detectors", International Journal of Computer Vision, Springer, 2006. DOI: 10.1007/s11263-005-3848-x. Received August 20, 2004; revised May 3, 2005; accepted May 11, 2005.]

Page 49: Modern features-part-4-evaluation

Other useful tricks

• Compare different parameter settings

  detectors{1} = VggAffine('Detector', 'haraff', 'Threshold', 500);
  detectors{2} = VggAffine('Detector', 'haraff', 'Threshold', 1000);

• Visualising matches

  [~, ~, matches, reprojFrames] = benchmark.testFeatureExtractor(...);
  benchmarks.helpers.plotFrameMatches(matches, reprojFrames);

84

[Figures: matched and unmatched reference/test image frames on image 4 of the VggAffineDataset graf sequence, for SIFT and for a mean-variance-median descriptor]

Page 50: Modern features-part-4-evaluation

Other benchmarks

• Detector matching score

  benchmark = RepeatabilityBenchmark('mode', 'MatchingScore');

• Image retrieval (example: Oxford 5K lite, mAP evaluation)

  dataset = VggRetrievalDataset('Category', 'oxbuild', ...
                                'BadImagesNum', 100);
  benchmark = RetrievalBenchmark();
  mAP = benchmark.testFeatureExtractor(detectors{d}, dataset);

85

Page 51: Modern features-part-4-evaluation

Summary 86

• Benchmarks
  - indirect: repeatability and matching score
  - direct: image retrieval

• VLBenchmarks
  - a simple-to-use, convenient MATLAB framework

• The future
  - existing measures have many shortcomings
  - hopefully better benchmarks will be available soon
  - and they will be added to VLBenchmarks for your convenience

http://www.vlfeat.org/benchmarks/index.html

Page 52: Modern features-part-4-evaluation

Credits 87

HARVEST Programme

Karel Lenc

Andrew Zisserman

Cordelia Schmid, Krystian Mikolajczyk, Jiri Matas, Tinne Tuytelaars

Varun Gulshan

Page 53: Modern features-part-4-evaluation

88

VLFeat: http://www.vlfeat.org/

VLBenchmarks: http://www.vlfeat.org/benchmarks/

Thank you for coming!