
Learning Non-Metric Visual Similarity for Image Retrieval

Noa Garcia*, George Vogiatzis

Aston University, Birmingham B4 7ET, United Kingdom

Abstract

Measuring visual similarity between two or more instances within a data distribution is a fundamental task in image retrieval. Theoretically, non-metric distances are able to generate a more complex and accurate similarity model than metric distances, provided that the non-linear data distribution is precisely captured by the system. In this work, we explore neural network models for learning a non-metric similarity function for instance search. We argue that non-metric similarity functions based on neural networks can build a better model of human visual perception than standard metric distances. As our proposed similarity function is differentiable, we explore a real end-to-end trainable approach for image retrieval, i.e. we learn the weights from the input image pixels to the final similarity score. Experimental evaluation shows that non-metric similarity networks are able to learn visual similarities between images and improve performance on top of state-of-the-art image representations, boosting results on standard image retrieval datasets with respect to standard metric distances.

Keywords: Image Retrieval, Visual Similarity, Non-Metric Learning

1. Introduction

For humans, deciding whether two images are visually similar or not is, to some extent, a natural task. However, in computer vision, this is a challenging problem and algorithms do not always succeed in matching pictures that contain similar-looking elements. This is mainly because of the well-known semantic gap problem, which refers to the difference or gap between low-level image pixels and high-level semantic concepts. Estimating visual similarity is a fundamental task that seeks to break this semantic gap by accurately evaluating how alike two or more pictures are. Visual similarity is crucial for many computer vision areas including image retrieval, image classification and object recognition, among others.

*Corresponding author. Email addresses: [email protected] (Noa Garcia), [email protected] (George Vogiatzis)

Given a query image, content-based image retrieval systems rank pictures in a dataset according to how similar they are with respect to the input. This can be broken into two fundamental tasks: 1) computing meaningful image representations that capture the most salient visual information from pixels and 2) measuring accurate visual similarity between these image representations to rank images according to a similarity score.


Figure 1: Retrieval system based on metric distances versus our proposed model. (a) Standard systems use a metric distance function to estimate the visual similarity between a pair of feature vectors obtained from a feature extraction model. (b) Our proposed model estimates the similarity score, si,j, from a pair of visual vectors by using a non-metric similarity network.

In recent years, several methods to represent visual information from raw pixels in images have been proposed, first by designing handcrafted features such as SIFT [1], then by compacting these local features into a single global image descriptor using different techniques such as Fisher Vectors [2], and more recently by extracting deep image representations from neural networks [3]. However, once two images are described by feature vectors, visual similarity is commonly measured by computing a standard metric between them. Although regular distance metrics, such as Euclidean distance or cosine similarity, are fast and easy to implement, they do not take into account the possible interdependency within the dataset, which means that even if a strong nonlinear data dependency is occurring in the visual collection, they might not be able to capture it. This suggests that learning a similarity estimation directly from visual data can improve the performance on image retrieval tasks, provided that the likely nonlinear dependencies within the dataset are precisely learned by the similarity function.

In this work, we propose a model to learn a non-metric visual similarity function on top of image representations to push accuracy in image retrieval tasks. This idea is shown in Figure 1. As in standard image retrieval systems, we extract K-dimensional visual vectors from images by using a convolutional neural network (CNN) architecture. Then, a visual similarity neural network is used to estimate the similarity score between a pair of images. Note that in standard systems this score is usually computed with a metric distance.

We design a supervised regression learning protocol so that different similarity degrees between images are precisely captured. Then, we directly apply the output of the model as a similarity estimation to rank images accordingly. In this way, the similarity network can be seen as a replacement of the standard metric distance computation, being able to mathematically fit human visual perception better than standard metrics and to improve results on top of them. The proposed similarity network is end-to-end differentiable, which allows us to build an architecture for real end-to-end training: from the input image pixels to the final similarity score.


Experimental evaluation shows that performance on standard image retrieval datasets is boosted when the similarity function is directly learnt from the visual data instead of using a rigid metric distance.

Many techniques have been envisaged to boost image retrieval performance in the past, such as query expansion and re-ranking [4], network fine-tuning [3, 5] or feature fusion [6, 7]. However, these techniques are not competitors of our method but optional add-ons, as we argue that the methodology proposed in this work can be applied along with all of them in the same way as they are applied in systems based on metric distances. Moreover, training a similarity network as we propose is computationally simpler than fine-tuning the whole feature representation network (i.e. network fine-tuning), extracting multiple features using different networks (i.e. feature fusion) or computing multiple queries per image (i.e. query expansion and re-ranking).

The main contributions of this work are summarised as follows:

1. We present a neural network architecture to model visual similarities, which introduces a new and simple method to boost performance in image retrieval by only training the last stage of the system.

2. We propose a novel regression loss function specifically designed for improving similarity scores on top of standard metrics in image retrieval tasks.

3. We design a real end-to-end system for content-based image retrieval that can be trained from the input image pixels to the final similarity score.

4. We empirically show the efficacy of our method in standard image retrieval datasets. Via our ablation study, we show that the proposed system can successfully compute visual similarities on top of different standard retrieval features, outperforming cosine similarity and metric learning in most of the datasets.

This paper is structured as follows: a review of relevant work can be found in Section 2; our method is detailed in Section 3; experimental evaluation is described in Section 4; and conclusions are stated in Section 5.

2. Related Work

In this section, relevant work in image retrieval and similarity learning is reviewed.

2.1. Content-Based Image Retrieval

Content-based image retrieval searches for images by considering their visual content. Given a query image, pictures in a collection are ranked according to their visual similarity with respect to the query. Early methods represent the visual content of images by a set of hand-crafted features, such as SIFT [1]. As a single image may contain hundreds of these features, aggregation techniques like bag-of-words (BOW) [8], Fisher Vectors [2] or VLAD [9] encode local descriptors into a compact vector, thereby improving computational efficiency and scalability. More recently, because of the latest advancements in deep learning, features obtained from convolutional neural networks (CNN) have rapidly become the new state-of-the-art in image retrieval.

2.2. Deep Learning for Image Retrieval

Deep image retrieval extracts activations from CNNs as image representations.


At first, some methods [3, 10, 11, 12] proposed to use representations from one of the last fully connected layers of networks pre-trained on the ImageNet classification dataset [13]. When deeper networks such as GoogLeNet [14] and VGG [15] appeared, some authors [16, 17, 10, 18] showed that mid-layer representations obtained from the convolutional layers performed better in the retrieval task. Since then, there have been several attempts to aggregate these high-dimensional convolutional representations into a compact vector. For example, [19, 17] compacted deep features by using VLAD, [20] encoded the neural codes into a histogram of words, [16, 21] applied sum-pooling to obtain a compact representation and [22, 23] aggregated deep features by max-pooling them into a new vector. A different approach is to train the network to directly learn compact binary codes end-to-end [24, 25]. Some authors have shown that fine-tuning the networks with data similar to the target task increases the performance significantly [3, 26, 27, 28, 5]. Finally, recent work has shown that adding attention models to select meaningful features can also be beneficial for image retrieval [29, 30].

All of these methods focus on finding high-quality features to represent visual content efficiently, and visual similarity is computed by simply applying a standard metric distance. General metrics, such as Euclidean distance or cosine similarity, however, might fail to consider the inner data structure of these visual representations. Learning a similarity function directly from data may help to capture the human perception of visual similarity in a better way.

2.3. Similarity Learning

Some of the most popular similarity learning works, such as OASIS [31] and MLR [32], are based on linear metric learning by optimizing the weights of a linear transformation matrix. For example, Yang et al. [33] proposed a framework for ranking elements in retrieval tasks by solving a linear optimization problem. Although linear methods are easier to optimize and less prone to overfitting, nonlinear algorithms are expected to achieve higher accuracy by modeling the possible nonlinearities of the data.

Nonlinear similarity learning based on deep learning has recently been applied to many different visual contexts. In low-level image matching, CNNs have been trained to match pairs of patches for stereo matching [34, 35] and optical flow [36, 37]. In high-level image matching, deep learning techniques have been proposed to learn low-dimensional embedding spaces in face verification [38], retrieval [39, 40], classification [41, 42, 43] and product search [44], either by using siamese [38] or triplet [40] architectures. More recently, deep similarity learning has also been applied to fabric image retrieval [45] by using triplets of samples to ensure that similar features are mapped closer than non-similar features.

In general, these methods rely on learning a mapping from image pixels to a low-dimensional target space and compute the final similarity decision by using a standard metric. They are designed to find the best projection in which a linear distance can be successfully applied. Instead of projecting the visual data into some linear space, which may or may not exist, our approach seeks to learn the non-metric visual similarity score itself. Similarly, [46] and [47] used a CNN to decide whether or not two input images are a match, applied to pedestrian re-identification and patch matching, respectively.


In these methods, the networks are trained to solve a binary classification problem (i.e. same or different pedestrian/patch), whereas in an image retrieval ranking problem, a regression score is required. Inspired by the results of [11], which showed that combining deep features with similarity learning techniques can be very beneficial for the performance of image retrieval systems, we propose to train a deep learning algorithm to learn non-metric similarities for image retrieval and improve results on top of high-quality image representation methods.

Neural networks have previously been proposed to model relationships between objects in different domains, such as classification in few-shot learning [48, 49] or visual question answering [50]. The main difference between these methods and this work lies in the optimization problem to be solved: whereas [48, 49, 50] aim to learn whether a certain relation between a pair of images is occurring by optimizing a classification loss function (e.g. whether a query image is similar to a sample from a known class or not), we introduce a novel loss function specifically designed to solve ranking problems, which improves similarity scores on top of a standard metric by returning a regression value (i.e. how similar the two images are).

3. Methodology

In this section we present our proposed method to learn a non-metric visual similarity function from the visual data distribution for image retrieval.

3.1. Visual Similarity

Visual similarity measures how alike two images are. Formally, given a pair of images Ii and Ij in a collection of images ξ, we define si,j as their similarity score. The higher si,j is, the more similar Ii and Ij are. To compute si,j, images are represented by K-dimensional image representations, which are obtained by mapping image pixels into the feature space R^K as xk = f(Ik, wf), with Ik ∈ ξ, where f(·) is a non-linear image representation function and wf its parameters. We propose to learn a visual similarity function, g(·), that maps a pair of image representations xi and xj into a visual score as:

$$s_{i,j} = g(f(I_i, w_f), f(I_j, w_f), w_g) \quad \text{s.t.} \quad s_{i,j} > s_{i,k} \;\rightarrow\; I_i, I_j \text{ more similar than } I_i, I_k \qquad (1)$$

with Ii, Ij, Ik ∈ ξ and wg being the trainable parameters of the similarity function.

Visual similarity functions are commonly based on metric distance functions, such as cosine similarity, g(xi, xj) = (xi · xj) / (‖xi‖ ‖xj‖), or Euclidean distance, g(xi, xj) = ‖xi − xj‖. Metric distance functions, d(·), perform mathematical comparisons between pairs of objects in a collection Π by satisfying the following axioms:

1. d(a, b) ≥ 0 (non-negativity)

2. d(a, b) = 0 ↔ a = b (identity)

3. d(a, b) = d(b, a) (symmetry)

4. d(a, b) ≤ d(a, c) + d(c, b) (triangle inequality)

for all a, b, c ∈ Π.

However, metric axioms are not always the best way to represent human visual perception [51, 52, 53]. For example, the non-negativity and identity axioms are not required in visual perception as long as relative similarity distances are maintained. The symmetry axiom is not always true, as human similarity judgements may be influenced by the order of the objects being compared. Finally, the triangle inequality does not correspond to human visual perception either.


Figure 2: Triangle inequality in image retrieval. The triangle inequality does not fit human visual perception, where distAB is expected to be bigger than distAC plus distCB.

This can easily be seen when considering the images of a person, a horse and a centaur: although a centaur might be visually similar to both a person and a horse, the person and the horse are not similar to each other. A visual example applied to image retrieval can be seen in Figure 2.

For a better mathematical representation of human visual perception, we propose to learn a non-metric visual similarity function that is not required to satisfy the rigid metric axioms.

3.2. Similarity Network

To fit human visual perception better than metric distance functions, we propose to learn a similarity function from the visual data using neural networks. This similarity network is composed of a set of fully connected layers, each of them, except for the last one, followed by a ReLU [54] non-linearity.

The input of the network is a concatenated pair of image representation vectors, xi and xj, which can be obtained using any standard technique, such as [3, 23]. The output is a similarity score, si,j. In this way, the similarity network learns the similarity function, g(·), from the image representation vectors and replaces the metric distance function used in standard systems.
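For concreteness, the following is a minimal sketch of such a similarity network, assuming PyTorch; the layer sizes follow configuration B of Table 1 (FC-4096, FC-4096, FC-1), and names such as SimilarityNet are illustrative rather than taken from the original implementation.

```python
# Minimal sketch of the similarity network (assumes PyTorch).
# Layer sizes follow configuration B in Table 1; names are illustrative.
import torch
import torch.nn as nn

class SimilarityNet(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # Input: concatenation of two K-dimensional image descriptors.
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, 4096),
            nn.ReLU(),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Linear(4096, 1),  # the last layer has no non-linearity
        )

    def forward(self, x_i, x_j):
        # x_i, x_j: (batch, feat_dim) image representations, e.g. RMAC vectors
        pair = torch.cat([x_i, x_j], dim=1)
        return self.net(pair).squeeze(1)  # similarity score s_ij per pair
```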

At this point, we would like to note that the proposed similarity network is conceptually different from the siamese architecture in [38], as shown in Figure 3. Siamese networks use pairs of images to learn the feature extraction function, f(·), which maps image pixels into vector representations. Then, similarity is still computed with a metric distance function, such as cosine similarity or Euclidean distance. In contrast, our approach learns the function g(·) on top of the image representations, replacing the standard metric distance computation.

3.3. Similarity Training

We design the training of the similarity network as a supervised regression task. However, as providing similarity labels for every possible pair of training images is infeasible, we propose a training procedure in which the visual similarity is learned progressively using standard image classification annotations.


Figure 3: Similarity versus Siamese networks. Siamese networks (left) learn to map pixels into vector representations, whereas similarity networks (right) learn a similarity function on top of the vector representations.

The model is trained to discriminate whether two images, Ii and Ij, are similar or dissimilar. Then, a similarity score, si,j, is assigned accordingly by improving a standard similarity function, sim(·). To optimize the weights, wg, of the similarity function g(·) from Equation 1, the following regression loss function is computed between each training pair of image representations, xi, xj:

$$\mathcal{L}(I_i, I_j) = \left| s_{i,j} - \ell_{i,j}\,(\mathrm{sim}(x_i, x_j) + \Delta) - (1 - \ell_{i,j})\,(\mathrm{sim}(x_i, x_j) - \Delta) \right| \qquad (2)$$

where Δ is a margin parameter and ℓi,j is defined as:

$$\ell_{i,j} = \begin{cases} 1 & \text{if } I_i \text{ and } I_j \text{ are similar} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

In other words, the similarity network learns to increase the similarity score when two matching images are given and to decrease it when a pair of images is not a match. Similarity between pairs might be decided using different techniques, such as image classes, scores based on local features or manual labeling, among others [55]. Without loss of generality, we consider two images as similar when they belong to the same annotated class and as dissimilar when they belong to different classes.
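As an illustration, the loss of Equation 2 can be written in a few lines. The sketch below assumes PyTorch, uses cosine similarity as the base function sim(·) (as in our experiments), and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def similarity_regression_loss(s_ij, x_i, x_j, label, delta=0.2):
    """Sketch of the loss in Equation 2 (assumes PyTorch; not the authors' code).

    s_ij     : network output for each pair, shape (batch,)
    x_i, x_j : image representations used to compute the base similarity
    label    : 1 if the two images are similar, 0 otherwise (Equation 3)
    delta    : margin added to (subtracted from) the base similarity for matches (non-matches)
    """
    base = F.cosine_similarity(x_i, x_j, dim=1)                       # sim(x_i, x_j)
    target = label * (base + delta) + (1 - label) * (base - delta)    # shifted target score
    return torch.abs(s_ij - target).mean()                            # L1 regression to the target
```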

Choosing appropriate examples when using pairs or triplets of samples in the training process is crucial for successful training [26, 27, 56]. This is because if the network is only trained on easy pairs (e.g. a car and a dog), it will not be able to discriminate between difficult pairs (e.g. a car and a van). We design our training protocol by emphasizing the training of difficult examples. First, we randomly select an even number of similar and dissimilar pairs of training samples and train the similarity network until convergence. We then choose a new random set of images and compute the similarity score between all possible pairs using the converged network. Pairs for which the network output is worse than the metric distance measure are selected as difficult pairs for retraining, where a worse score means a score that is lower in the case of a match and higher in the case of a non-match. Finally, the difficult pairs are added to the training set and the network is trained until convergence one more time. Examples of difficult image pairs are shown in Figure 4.
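The difficult-pair selection described above can be sketched as follows, again assuming PyTorch; the helper name select_difficult_pairs is illustrative and the snippet is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def select_difficult_pairs(model, x_i, x_j, labels):
    """Keep pairs for which the converged network scores worse than the base metric.

    'Worse' means lower than the cosine similarity for a match (label 1)
    and higher than the cosine similarity for a non-match (label 0).
    Sketch only; not the authors' implementation.
    """
    with torch.no_grad():
        scores = model(x_i, x_j)                        # similarity network output
        base = F.cosine_similarity(x_i, x_j, dim=1)     # metric baseline
    worse_match = (labels == 1) & (scores < base)
    worse_nonmatch = (labels == 0) & (scores > base)
    keep = worse_match | worse_nonmatch
    return x_i[keep], x_j[keep], labels[keep]
```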

4. Experimental Evaluation

In this section, we present the experimental evaluation we perform to validate the proposed non-metric similarity network.

4.1. Image Retrieval Datasets

Our approach is evaluated on three standard image retrieval datasets: Oxford5k [57], Paris6k [58] and Land5k, a validation subset of the Landmarks [3] dataset. Oxford5k consists of 5,062 images of 11 different Oxford landmarks and 55 query images.


Figure 4: Example of difficult pairs. Dissimilar images with lower score than the metric distance (left) and similar images with higher score than the metric distance (right).

Paris6k contains 6,412 images of 11 different Paris landmarks and 55 queries. Land5k consists of 4,915 images from 529 classes, with a random selection of 45 images used as queries. For experiments on larger datasets, we also use the standard large-scale versions Oxford105k and Paris106k, obtained by adding 100,000 distractor images [57]. In both the Oxford5k and Paris6k collections, query images are cropped according to the region of interest and evaluation is performed by computing the mean Average Precision (mAP). For Land5k, results are also reported as mAP, considering an image to be relevant to a query when both belong to the same class.
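For reference, average precision over a ranked list with binary relevance labels can be computed with a short NumPy function such as the one below. This is a generic sketch; the official Oxford/Paris evaluation protocol additionally handles junk images, which is omitted here.

```python
import numpy as np

def average_precision(relevance):
    """AP for one query, given 1/0 relevance labels sorted by decreasing similarity score."""
    relevance = np.asarray(relevance, dtype=float)
    if relevance.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(relevance) / (np.arange(len(relevance)) + 1)
    return float((precision_at_k * relevance).sum() / relevance.sum())

def mean_average_precision(relevance_per_query):
    """mAP over all queries (list of per-query relevance lists)."""
    return float(np.mean([average_precision(r) for r in relevance_per_query]))
```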

For training, we use the cleaned version of the Landmarks [3] dataset from [26]. Due to broken URLs, we could only download 33,119 images for training and 4,915 for validation. To ensure that visual similarity is learnt from relevant data, we create two more training sets, named Landmarks-extra500 and Landmarks-extra, by randomly adding about 500 and 2,000 images from Oxford5k and Paris6k classes to Landmarks, respectively. Query images are not added in any case and remain unseen by the system.

4.2. Experimental Details

Image Representation. Unless otherwise stated, we use RMAC [23] as the image representation method. The VGG16 network is used off-the-shelf without any retraining or fine-tuning. Images are re-scaled up to 1024 pixels, keeping their original aspect ratio. RMAC features are very sensitive to the PCA matrices used for normalization. For consistency, we use the PCA whitening matrices trained on Paris6k on all the datasets, instead of using different matrices for each testing collection. This leads to slightly worse results than the ones reported in the original paper.

Similarity Training. We use cosine similarity, sim(xi, xj) = (xi · xj) / (‖xi‖ ‖xj‖), as the similarity function in Equation 2. For faster convergence, we warm up the weights of the similarity network by training it with randomly generated pairs of vectors and Δ = 0. In this way, the network first learns to imitate the cosine similarity. Visual similarity is then trained using almost a million image pairs. We experiment with several values of the margin parameter Δ, ranging from 0.2 to 0.8. The network is optimized using backpropagation and stochastic gradient descent with a learning rate of 0.001, a batch size of 100, a weight decay of 0.0005 and a momentum of 0.9.
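A sketch of this optimisation set-up is given below, assuming PyTorch and reusing the SimilarityNet and similarity_regression_loss sketches above. The randomly generated pairs correspond to the warm-up stage, and the loop structure is illustrative rather than the authors' code.

```python
import torch

# Optimisation set-up reported in the text: SGD with lr 0.001, batch size 100,
# weight decay 0.0005 and momentum 0.9. With delta = 0, the target of
# Equation 2 reduces to the cosine similarity itself, so this loop implements
# the warm-up; real training then continues on image-pair descriptors with
# delta between 0.2 and 0.8. Sketch only; not the authors' code.
model = SimilarityNet(feat_dim=512)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)

for step in range(1000):                              # toy number of warm-up steps
    x_i = torch.randn(100, 512)                       # randomly generated descriptor pairs
    x_j = torch.randn(100, 512)
    labels = torch.randint(0, 2, (100,)).float()      # irrelevant when delta = 0
    loss = similarity_regression_loss(model(x_i, x_j), x_i, x_j, labels, delta=0.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```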


Config  Architecture                     Params  MSE      ρ
A       FC-1024, FC-1024, FC-1           2.1M    0.00035  0.909
B       FC-4096, FC-4096, FC-1           21M     0.00019  0.965
C       FC-8192, FC-8192, FC-1           76M     0.00012  0.974
D       FC-4096, FC-4096, FC-4096, FC-1  38M     0.00019  0.964

Table 1: Architecture Discussion. Fully connected layers are denoted as FC-{filters}.

Computational cost. Standard metric functions are relatively fast and computationally cheap. Our visual similarity network involves millions of parameters, which inevitably increases the computational cost. However, it is still feasible to compute the similarity score in a reasonable amount of time. In our experiments, training time is about 5 hours on a GeForce GTX 1080 GPU without weight warm-up, and testing time for a pair of images is 1.25 ms on average (0.35 ms when using cosine similarity on CPU).

4.3. Architecture Discussion

We first experiment with four different architectures (Table 1) and compare the performance of each configuration during the network warm-up (Δ = 0), using 22.5 million and 7.5 million pairs of randomly generated vectors for training and validation, respectively. During the warm-up, the network is intended to imitate the cosine similarity. We evaluate each architecture by computing the mean squared error (MSE) and the correlation coefficient, ρ, between the network output and the cosine similarity. Configuration C, which is the network with the largest number of parameters, achieves the best MSE and ρ results. Considering the trade-off between performance and number of parameters of each architecture, we keep configuration B as our default architecture for the rest of the experiments.
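The warm-up evaluation reported in Table 1 compares the network output with the cosine similarity it is asked to imitate; a NumPy sketch of the two metrics (illustrative only) is:

```python
import numpy as np

def warmup_metrics(predicted_scores, cosine_scores):
    """MSE and Pearson correlation between the network output and the cosine similarity."""
    predicted = np.asarray(predicted_scores, dtype=float)
    target = np.asarray(cosine_scores, dtype=float)
    mse = float(np.mean((predicted - target) ** 2))
    rho = float(np.corrcoef(predicted, target)[0, 1])
    return mse, rho
```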

4.4. Similarity Evaluation

We then study the benefits of using a non-metric similarity network for image retrieval by comparing it to several similarity methods. RMAC [23] is used as the image representation method in all the experiments. The similarity functions under evaluation are:

• Cosine: the similarity between a pair of vectors is computed with the cosine similarity, cosine(xi, xj) = (xi · xj) / (‖xi‖ ‖xj‖). No training is required.

• OASIS: the well-established OASIS algorithm [31] is used to learn a linear function that maps a pair of vectors into a similarity score. The training of the transformation matrix is performed in a supervised way by providing the class of each image.

• Linear: we learn an affine transformation matrix to map a pair of vectors into a similarity score by optimizing Equation 2 in a supervised training. Classes of images are provided during training. The margin Δ is set to 0.2.

• SimNet, SimNet*: the similarity function is learnt with our proposed similarity network by optimizing Equation 2 with (SimNet*) or without (SimNet) difficult-pair refinement. Classes of images are provided during training and different margins Δ are tested, ranging from 0.2 to 0.8.


               Landmarks              Landmarks-extra500     Landmarks-extra
               Ox5k   Pa6k   La5k     Ox5k   Pa6k   La5k     Ox5k   Pa6k   La5k
Cosine         0.665  0.638  0.564    0.665  0.638  0.564    0.665  0.638  0.564
OASIS          0.514  0.385  0.578    0.570  0.651  0.589    0.619  0.853  0.579
Linear (0.2)   0.598  0.660  0.508    0.611  0.632  0.514    0.602  0.581  0.502
SimNet (0.2)   0.658  0.460  0.669    0.717  0.654  0.671    0.718  0.757  0.668
SimNet* (0.2)  0.655  0.503  0.697    0.719  0.677  0.693    0.786  0.860  0.662
SimNet* (0.4)  0.637  0.504  0.737    0.703  0.701  0.745    0.794  0.878  0.706
SimNet* (0.6)  0.613  0.514  0.776    0.703  0.716  0.776    0.789  0.885  0.735
SimNet* (0.8)  0.600  0.511  0.783    0.685  0.710  0.803    0.808  0.891  0.758

Table 2: Similarity Evaluation. mAP when using different similarity functions. The Δ value is given in brackets.

[Figure 5: bar chart on Paris6k comparing Cosine, SimNet and SimNet* on top of VGG16, RES50, MAC, RMAC and [26] features.]

Figure 5: Image Representation Discussion. mAP for different visual similarity techniques on top of different feature extraction methods.

Results are summarized in Table 2. Trained similarity networks (SimNet, SimNet*) outperform trained linear methods (OASIS, Linear) in all but one of the testing datasets. As Linear, SimNet and SimNet* are all trained using the same supervised learning protocol and images, the results suggest that the improvement obtained by our method is not due to the supervision but to the non-metric nature of the model, which is supposed to fit human visual perception more accurately. Moreover, when using Landmarks-extra as the training dataset, results are boosted with respect to the standard metric, achieving improvements ranging from 20% (Oxford5k) to 40% (Paris6k). When using the Landmarks-extra500 dataset, our similarity networks also improve mAP with respect to the cosine similarity in the three testing datasets. This indicates that visual similarity can be learnt even when using a reduced subset of the target image domain. However, visual similarity does not transfer well across domains when no images of the target domain are used during training, which is a well-known problem in metric learning systems [59]. In that case, cosine similarity is the best option over all the methods.

4.5. Image Representation Discussion

Next, we study the generalisation of our similarity networks when used on top of different feature extraction methods: the output of a VGG16 network [15], the output of a ResNet50 network [60], MAC [23], RMAC [23] and the model from [27]. We compare the results of our networks, SimNet and SimNet*, against cosine similarity. Results are provided in Figure 5.


[Figure 6: mAP as a function of the number of target-domain training samples, for SimNet and Cosine, on Oxford5k (left) and Paris6k (right).]

Figure 6: Domain Adaptation. mAP when using different numbers of target samples in the training set.

Our similarity networks outperform cosine similarity in all the experiments, improving retrieval results when used on top of any standard feature extraction method. Moreover, performance is boosted further when SimNet* is applied, especially for features with poor retrieval performance, such as ResNets.

4.6. Domain Adaptation

We further investigate the influence of the training dataset on the similarity score when it is transferred between different domains or collections of images. As already noted in Table 2, visual similarity does not transfer well across domains, and a subset of samples from the target dataset is required during training to learn a meaningful similarity function. This is mainly because similarity estimation is a problem-dependent task, as the similarity between a pair of elements depends on the data collection. Thus, in Figure 6, we explore the effect on performance of using different subsets of samples from the target collection in addition to the Landmarks dataset. To add relevant samples progressively to the training set, we assign a class label to each image in the Oxford5k and Paris6k collections. These datasets do not provide class labels per se, so to overcome this issue, we use the file name of each image as its class label.

There is a clear correlation between the similarity network performance and the number of samples from the target dataset used during training. Indeed, in agreement with previous work in metric learning [59], we observe that not considering samples from the target dataset to train a similarity function might be harmful. The similarity network, however, outperforms standard metric results even when a small number of samples from the target collection is used during training: only 100 images from Oxford5k and 250 images from Paris6k are required to outperform cosine similarity in the Oxford5k and Paris6k datasets, respectively. This suggests that the similarity network is able to generalise from a small subset of target samples, instead of memorising distances in the training collection.

4.7. End-to-End Training

So far, we have isolated the similarity computation part to verify that the improvement in the testing datasets, compared to other similarity methods, is in fact due to the visual similarity network. However, with all the modules of the retrieval system being differentiable, an end-to-end training model is also possible.


[Figure 7: each image is passed through VGG16, max-pooling and l2 normalisation to obtain a compact vector; the concatenated pair (1x512x2) is fed to SimNet to produce the similarity score.]

Figure 7: End-to-End model. MAC as feature extraction and SimNet as visual similarity.

Features  Sim.    Ox5k   Pa6k   La5k
MAC       Cosine  0.481  0.539  0.494
MAC       SimNet  0.509  0.683  0.589
MAC       SimNet  0.555  0.710  0.685

Table 3: End-to-End Training. mAP when fine-tuning different parts of the pipeline. In the original table, the fine-tuned modules are shaded in gray in each experiment: the second row fine-tunes only SimNet, while the third row fine-tunes both the feature extraction and SimNet end-to-end.

End-to-end methods have been shown to achieve outstanding results in many different problems, including the aggregation of video features [61], stereo matching [47], person re-identification [46] and self-driving cars [62]. In image retrieval, however, only end-to-end methods for learning the feature representations have been proposed [5, 27], leaving the final similarity score to be computed with a cosine similarity. In this section we explore a real end-to-end trainable architecture for image retrieval, which is presented in Figure 7.

For the feature extraction part, we adopt MAC [23], although any differentiable image representation method can be used. To obtain MAC vectors, images are fed into a VGG16 network [15]. The output of the last convolutional layer is max-pooled and l2-normalized. For the visual similarity part, we use SimNet with Δ = 0.2. As the whole architecture is end-to-end differentiable, the weights are fine-tuned through backpropagation. We first train the similarity network by freezing the VGG16 weights. Then, we unfreeze all the layers and fine-tune the model one last time. As all the layers have already been pre-trained, the final end-to-end fine-tuning is performed on only about 200,000 pairs of images from the Landmarks-extra dataset for just 5,000 iterations.
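A sketch of the end-to-end pipeline of Figure 7 is given below, assuming PyTorch and torchvision; MAC is the max-pooled, l2-normalised output of the last convolutional block of VGG16, SimNet is the similarity network sketched in Section 3.2, and class names such as MACExtractor are illustrative, not the authors' code.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class MACExtractor(nn.Module):
    """MAC descriptor: spatial max-pooling of the last VGG16 convolutional map, then l2 normalisation."""
    def __init__(self):
        super().__init__()
        self.features = vgg16(pretrained=True).features   # convolutional layers only

    def forward(self, images):
        fmap = self.features(images)                                   # (batch, 512, H, W)
        mac = F.adaptive_max_pool2d(fmap, output_size=1).flatten(1)    # (batch, 512)
        return F.normalize(mac, p=2, dim=1)                            # l2-normalised MAC vector

class EndToEndRetrieval(nn.Module):
    """Figure 7 pipeline: shared MAC extractor followed by the similarity network."""
    def __init__(self, simnet):
        super().__init__()
        self.extractor = MACExtractor()
        self.simnet = simnet                               # e.g. the SimilarityNet sketch above

    def forward(self, img_i, img_j):
        return self.simnet(self.extractor(img_i), self.extractor(img_j))

# Training schedule described in the text: first freeze the extractor and train
# the similarity network, then unfreeze everything and fine-tune end-to-end.
```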

Results are provided in Table 3. There is a significant improvement when using the similarity network instead of the cosine similarity, as already seen in the previous section. When the architecture is trained end-to-end, results are improved even more, since fine-tuning the entire architecture allows a better fit to a particular dataset.

4.8. State of the Art Comparison

Finally, we compare our method against several state-of-the-art techniques. As is standard practice, works are split into two groups: off-the-shelf and fine-tuned. Off-the-shelf techniques extract image representations by using pre-trained CNNs, whereas fine-tuned methods retrain the network parameters with a relevant dataset to compute more accurate visual representations.


Method                  Dim   Similarity   Ox5k   Ox105k  Pa6k   Pa106k
Babenko et al. [3]      512   L2           0.435  0.392   -      -
Razavian et al. [10]    4096  Averaged L2  0.322  -       0.495  -
Wan et al. [11]         4096  OASIS        0.466  -       0.867  -
Babenko et al. [16]     256   Cosine       0.657  0.642   -      -
Yue et al. [17]         128   L2           0.593  -       0.59   -
Kalantidis et al. [21]  512   L2           0.708  0.653   0.797  0.722
Mohedano et al. [20]    25k   Cosine       0.739  0.593   0.82   0.648
Salvador et al. [28]    512   Cosine       0.588  -       0.656  -
Tolias et al. [23]      512   Cosine       0.669  0.616   0.83   0.757
Jimenez et al. [29]     512   Cosine       0.712  0.672   0.805  0.733
Ours (Δ = 0.8)          512   SimNet*      0.808  0.772   0.891  0.818

Table 4: State of the Art Comparison (Off-the-shelf). Dim corresponds to the dimensionality of the feature representation and Similarity is the similarity function.

Method                  Dim   Similarity   Ox5k   Ox105k  Pa6k   Pa106k
Babenko et al. [3]      512   L2           0.557  0.522   -      -
Gordo et al. [26]       512   Cosine       0.831  0.786   0.871  0.797
Wan et al. [11]         4096  OASIS        0.783  -       0.947  -
Radenovic et al. [27]   512   Cosine       0.77   0.692   0.838  0.764
Salvador et al. [28]    512   Cosine       0.71   -       0.798  -
Gordo et al. [5]        2048  Cosine       0.861  0.828   0.945  0.906
Ours (Δ = 0.8)          512   SimNet*      0.882  0.821   0.882  0.829

Table 5: State of the Art Comparison (Fine-tuned). Dim corresponds to the dimensionality of the feature representation and Similarity is the similarity function.

For a fair comparison, we only consider methods that represent each image with a single visual vector, without query expansion or image re-ranking.

Off-the-shelf results are shown in Table 4 and fine-tuned results are presented in Table 5. When using off-the-shelf RMAC features, our SimNet* approach outperforms previous methods on every dataset. To compare against fine-tuned methods, we compute RMAC vectors using the fine-tuned version of VGG16 proposed in [27]. Accuracy is boosted when our similarity network is used instead of the analogous cosine similarity method [27]. SimNet* achieves the best mAP on the Oxford5k dataset and comes second on Oxford105k and Paris106k after [5], which uses the more complex and higher-dimensional ResNet network [60] for image representation.

5. Conclusions

We have presented a method for learning visual similarity directly from visual data. Instead of using a metric distance function, we propose to train a neural network model to learn a similarity score between a pair of visual representations. Our method is able to capture visual similarity better than other techniques, mostly because of its non-metric nature.


As all the layers in the similarity network are differentiable, we also propose an end-to-end trainable architecture for image retrieval. Experiments on standard collections show that results are considerably improved when a similarity network is used. Finally, our work can push performance in image retrieval systems on top of high-quality image features, while it can still be applied together with query expansion or image re-ranking methods.

References

[1] D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.

[2] F. Perronnin, Y. Liu, J. Sanchez, H. Poirier, Large-scale image retrieval with compressed Fisher vectors, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 3384–3391.

[3] A. Babenko, A. Slesarev, A. Chigorin, V. Lempitsky, Neural codes for image retrieval, in: European Conference on Computer Vision, Springer, 2014, pp. 584–599.

[4] R. Arandjelovic, A. Zisserman, Three things everyone should know to improve object retrieval, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 2911–2918.

[5] A. Gordo, J. Almazan, J. Revaud, D. Larlus, End-to-end learning of deep visual representations for image retrieval, International Journal of Computer Vision 124 (2) (2017) 237–254.

[6] X. Dong, Y. Yan, M. Tan, Y. Yang, I. W. Tsang, Late fusion via subspace search with consistency preservation, IEEE Transactions on Image Processing 28 (1) (2018) 518–528.

[7] X. Dong, L. Zheng, F. Ma, Y. Yang, D. Meng, Few-example object detection with model communication, IEEE Transactions on Pattern Analysis and Machine Intelligence (2018) 1–1.

[8] J. Sivic, A. Zisserman, Video Google: A text retrieval approach to object matching in videos, in: International Conference on Computer Vision, IEEE, 2003, p. 1470.

[9] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, C. Schmid, Aggregating local image descriptors into compact codes, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (9) (2012) 1704–1716.

[10] A. Sharif Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 806–813.

[11] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, J. Li, Deep learning for content-based image retrieval: A comprehensive study, in: ACM International Conference on Multimedia, ACM, 2014, pp. 157–166.

[12] Y. Liu, Y. Guo, S. Wu, M. S. Lew, DeepIndex for accurate and efficient image retrieval, in: ACM International Conference on Multimedia Retrieval, ACM, 2015, pp. 43–50.

[13] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., ImageNet large scale visual recognition challenge, International Journal of Computer Vision 115 (3) (2015) 211–252.

[14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.

[15] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015.

[16] A. Babenko, V. Lempitsky, Aggregating local deep features for image retrieval, in: IEEE International Conference on Computer Vision, 2015, pp. 1269–1277.

[17] J. Yue-Hei Ng, F. Yang, L. S. Davis, Exploiting local features from deep networks for image retrieval, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 53–61.

[18] L. Xie, R. Hong, B. Zhang, Q. Tian, Image classification and retrieval are one, in: ACM International Conference on Multimedia Retrieval, ACM, 2015, pp. 3–10.

[19] Y. Gong, L. Wang, R. Guo, S. Lazebnik, Multi-scale orderless pooling of deep convolutional activation features, in: European Conference on Computer Vision, Springer, 2014, pp. 392–407.

[20] E. Mohedano, K. McGuinness, N. E. O'Connor, A. Salvador, F. Marques, X. Giro-i-Nieto, Bags of local convolutional features for scalable instance search, in: ACM International Conference on Multimedia Retrieval, ACM, 2016, pp. 327–331.

[21] Y. Kalantidis, C. Mellina, S. Osindero, Cross-dimensional weighting for aggregated deep convolutional features, in: European Conference on Computer Vision, Springer, 2016, pp. 685–701.

[22] A. S. Razavian, J. Sullivan, S. Carlsson, A. Maki, Visual instance retrieval with deep convolutional networks, ITE Transactions on Media Technology and Applications 4 (3) (2016) 251–258.

[23] G. Tolias, R. Sicre, H. Jegou, Particular object retrieval with integral max-pooling of CNN activations, in: International Conference on Learning Representations, 2016.

[24] V. Erin Liong, J. Lu, G. Wang, P. Moulin, J. Zhou, Deep hashing for compact binary codes learning, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2475–2483.

[25] K. Lin, H.-F. Yang, J.-H. Hsiao, C.-S. Chen, Deep learning of binary hash codes for fast image retrieval, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 27–35.

[26] A. Gordo, J. Almazan, J. Revaud, D. Larlus, Deep image retrieval: Learning global representations for image search, in: European Conference on Computer Vision, Springer, 2016, pp. 241–257.

[27] F. Radenovic, G. Tolias, O. Chum, CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples, in: European Conference on Computer Vision, Springer, 2016, pp. 3–20.

[28] A. Salvador, X. Giro-i-Nieto, F. Marques, S. Satoh, Faster R-CNN features for instance search, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 9–16.

[29] A. Jimenez, J. M. Alvarez, X. Giro-i-Nieto, Class-weighted convolutional features for visual instance search, in: British Machine Vision Conference, 2017, pp. 1–12.

[30] H. Noh, A. Araujo, J. Sim, T. Weyand, B. Han, Large-scale image retrieval with attentive deep local features, in: IEEE International Conference on Computer Vision, 2017, pp. 3456–3465.

[31] G. Chechik, V. Sharma, U. Shalit, S. Bengio, Large scale online learning of image similarity through ranking, Journal of Machine Learning Research 11 (2010) 1109–1135.

[32] B. McFee, G. R. Lanckriet, Metric learning to rank, in: International Conference on Machine Learning, 2010, pp. 775–782.

[33] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, Y. Pan, A multimedia retrieval framework based on semi-supervised ranking and relevance feedback, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (4) (2012) 723–742.

[34] S. Zagoruyko, N. Komodakis, Learning to compare image patches via convolutional neural networks, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4353–4361.

[35] W. Luo, A. G. Schwing, R. Urtasun, Efficient deep learning for stereo matching, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5695–5703.

[36] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, T. Brox, FlowNet: Learning optical flow with convolutional networks, in: IEEE International Conference on Computer Vision, 2015, pp. 2758–2766.

[37] J. Thewlis, S. Zheng, P. Torr, A. Vedaldi, Fully-trainable deep matching, in: British Machine Vision Conference, 2016.

[38] S. Chopra, R. Hadsell, Y. LeCun, Learning a similarity metric discriminatively, with application to face verification, in: IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, IEEE, 2005, pp. 539–546.

[39] P. Wu, S. C. Hoi, H. Xia, P. Zhao, D. Wang, C. Miao, Online multimodal deep similarity learning with application to image retrieval, in: ACM International Conference on Multimedia, ACM, 2013, pp. 153–162.

[40] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, Y. Wu, Learning fine-grained image similarity with deep ranking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1386–1393.

[41] E. Hoffer, N. Ailon, Deep metric learning using triplet network, in: International Workshop on Similarity-Based Pattern Recognition, Springer, 2015, pp. 84–92.

[42] Q. Qian, R. Jin, S. Zhu, Y. Lin, Fine-grained visual categorization via multi-stage metric learning, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3716–3724.

[43] H. Oh Song, Y. Xiang, S. Jegelka, S. Savarese, Deep metric learning via lifted structured feature embedding, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4004–4012.

[44] S. Bell, K. Bala, Learning visual similarity for product design with convolutional neural networks, ACM Transactions on Graphics 34 (4) (2015) 98.

[45] D. Deng, R. Wang, H. Wu, H. He, Q. Li, X. Luo, Learning deep similarity models with focus ranking for fabric image retrieval, Image and Vision Computing 70 (2018) 11–20.

[46] W. Li, R. Zhao, T. Xiao, X. Wang, DeepReID: Deep filter pairing neural network for person re-identification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 152–159.

[47] X. Han, T. Leung, Y. Jia, R. Sukthankar, A. C. Berg, MatchNet: Unifying feature and metric learning for patch-based matching, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3279–3286.

[48] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al., Matching networks for one shot learning, in: Advances in Neural Information Processing Systems, 2016, pp. 3630–3638.

[49] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, T. Lillicrap, A simple neural network module for relational reasoning, in: Advances in Neural Information Processing Systems, 2017, pp. 4967–4976.

[50] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, T. M. Hospedales, Learning to compare: Relation network for few-shot learning, in: IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[51] Y. Gavet, M. Fernandes, J. Debayle, J.-C. Pinoli, Dissimilarity criteria and their comparison for quantitative evaluation of image segmentation: application to human retina vessels, Machine Vision and Applications 25 (8) (2014) 1953–1966.

[52] X. Tan, S. Chen, J. Li, Z.-H. Zhou, Learning non-metric partial similarity based on maximal margin criterion, in: IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, 2006, pp. 138–145.

[53] A. Tversky, I. Gati, Similarity, separability, and the triangle inequality, Psychological Review 89 (2) (1982) 123.

[54] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[55] P. Bhagat, P. Choudhary, Image annotation: Then and now, Image and Vision Computing 80 (2018) 1–23.

[56] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, S. Singh, No fuss distance metric learning using proxies, in: IEEE International Conference on Computer Vision, IEEE, 2017, pp. 360–368.

[57] J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Object retrieval with large vocabularies and fast spatial matching, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2007, pp. 1–8.

[58] J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Lost in quantization: Improving particular object retrieval in large scale image databases, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2008, pp. 1–8.

[59] B. Kulis, et al., Metric learning: A survey, Foundations and Trends in Machine Learning 5 (4) (2013) 287–364.

[60] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[61] Y. Xu, Y. Han, R. Hong, Q. Tian, Sequential video VLAD: training the aggregation locally and temporally, IEEE Transactions on Image Processing 27 (10) (2018) 4933–4944.

[62] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al., End to end learning for self-driving cars, arXiv preprint arXiv:1604.07316.