
Final degree project

Object detection and recognition: from saliency prediction to one-shot trained detectors

Author: Adrià Recasens

Supervisor: Antonio Torralba

September 3, 2014


"You see, but you do not observe. The distinction is clear." — SHERLOCK HOLMES


Contents

1 Abstract
2 Acknowledgements
3 Introduction
  3.1 Motivation
  3.2 Notation and definitions
    3.2.1 Object detection definition
    3.2.2 Saliency definitions
4 One Shot Training
  4.1 Related work: general object detection techniques
    4.1.1 Classification algorithm: Support Vector Machines
    4.1.2 Feature functions
  4.2 One-Shot Training
    4.2.1 Motivation
    4.2.2 Single view example
    4.2.3 Object centric database
    4.2.4 One shot training with CNN
    4.2.5 Conclusions and future work
  4.3 DetectMe: The open object detector
    4.3.1 Introduction to DetectMe
    4.3.2 DetectMe: the application
    4.3.3 DetectMe competition
    4.3.4 DetectMe as a framework
    4.3.5 Future Work
5 Saliency
  5.1 Introduction to saliency
  5.2 Related work
  5.3 Deep features for saliency prediction
  5.4 Experimental results
    5.4.1 Low-level features
    5.4.2 Medium-level features
    5.4.3 High-level features
    5.4.4 Global model
  5.5 Future Work
6 Conclusions


List of Figures

4.1 (a) Linearly separable space (b) Non-linearly separable space (c) Non-separable spaces
4.2 Composition of the HOG feature representation.
4.3 Graphical explanation of the pyramid construction process.
4.4 Network structure of the CNN presented in [10].
4.5 Results of the mug detector competition.
4.6 Best 10 examples to train a car detector.
4.7 Top-scoring detections for the best car detector.
4.8 Average precision of the 50 categories in the ImageNet dataset, for both the pre-trained network and the newly trained one.
4.9 General statistics for the 50-category evaluation. The ratio is computed as (pre-trained AP) / (newly trained AP).
4.10 Histogram of the ratio between the performance of the detectors using the pre-trained network and the newly trained network.
4.11 Results of the experiments for the CNN. The AP of the best detector trained with one sample is presented together with the AP of the full training set. (a) Pre-trained CNN (b) Newly trained CNN
4.12 Histogram of the ratio of performance between the detector trained with all the data and the best detector of each category.
4.13 Best 10 training samples with their respective AP, trained with the newly trained network.
4.14 Best training examples for different categories: (a) Computer keyboard (b) Popsicle (c) Refrigerator (d) Remote controllers
4.15 Three main steps of DetectMe.
4.16 Retrain screen of the DetectMe application.
4.17 Submission screen of the DetectMe competition.
4.18 Real-time streaming of DetectMe detections.
5.1 Boolean Map pipeline. Figure credit: [27].
5.2 Pipeline for CNN feature modelling. Figure credit: [23].
5.3 Saliency prediction for the different feature sets. (a) Original image (b) Ground truth (c) Global model (d) Low-level features
5.4 Spatial models for the medium-level features. (a) Person (b) Car
5.5 Medium-level saliency prediction. (a) Person (b) Car
5.6 High-level saliency estimation. (a) Original picture (b) Saliency heat map


Abstract

Computer vision capabilities have started to become available in smart devices in recent years. The rapid growth of the smartphone world, together with the big advances of the computer vision field over the last years, now makes it possible to bring computer vision to everyday mobile devices. DetectMe is one of the first systems to bring object detectors to everyone's mobile device. This paradigm shift generates new challenges and questions: this project aims to answer some of these questions as well as to give some future lines of work to overcome these challenges.

On the one hand, the aim of this project is to answer a short question: can we train good detectors with only one example? Chapter 4 analyses this issue and points out side questions that appear when trying to answer the main one. A positive answer to this question, as well as some hints on what makes an image a good example, would improve the user experience for those who use computer vision systems on mobile devices.

On the other hand, we also attack a classical problem in computer vision: where do people look when they observe a picture? The recent development of Convolutional Neural Networks and their outstanding capability to explain visual information help improve the performance of saliency models. In chapter 5, a new saliency model is presented and discussed. Results show that our saliency model outperforms the state-of-the-art saliency models on the MIT 1001 dataset. Some future research lines are also drawn to improve the model as well as to generate more saliency data to work with.

To sum up, this project does not want to be a closed project: it aims to answer some questions while pointing out potential future lines of research towards a more complete answer.


Acknowledgements

First and foremost, I would like to thank Antonio Torralba for all his support on this project and for the opportunity of being part of his laboratory. His deep understanding of the computer vision world, his particular way of attacking problems and his passionate way of understanding research have been an invaluable guide for me.

Furthermore, I also want to thank all my labmates, who contributed bright ideas and support to this project. Special thanks to the ones directly involved with the projects presented in this work. Thanks to Agata Lapedriza for her support, her new ideas and her warm welcome to Boston and MIT. Thanks to Aditya Khosla for his advice and his brilliant contributions to the saliency project. And thanks to Zoya Bylinskii for sharing her deep understanding of the saliency problem and introducing me to the eye-tracking world.

Moreover, I would also like to thank the Massachusetts Institute of Technology, the Centre de Formacio Interdisciplinaria Superior and Banco Santander for their financial support of this project. This would not have been possible without their help.

And last but not least, to those who know who they are, both here and back home, many thanks for everything.


Introduction

3.1 Motivation

The computer vision world has dramatically improved during the last decade. The availability of huge amounts of computational power at a small cost, along with the advances of the field, opens a large space for research and innovation.

The advance of mobile technologies, such as smartphones with an integrated camera, is starting to be used by the computer vision community as a platform to push the limits of the field and bring computer vision into daily use. LabelMe [19] and DetectMe [13] (which will be detailed in 4.3) are two examples of mobile phone applications that bring computer vision to the end user. This paradigm shift introduces for the first time the possibility of having non-expert users exploring the capabilities of computer vision systems.

On the one hand, these new applications serve as a tool to explain computer vision to the end user: they play a pedagogical role. On the other hand, from the researcher's point of view, these new tools open a wide spectrum of opportunities for using the information produced by the end user for research purposes. For the first time, we are close to having a full network of users with the technological capability to run vision systems on their devices. This could help researchers to better understand and approach the classical challenges of the field from a different point of view.

One of the main motivations of this thesis is to understand and contribute to the challenges that will be faced by these new users. Given that the computational power of their devices is not comparable with the regular computers used nowadays for computer vision development, a new branch of low-computational-power computer vision is being developed. In this work, we concentrate on the object detection problem, where the user is given the choice of taking some images to train a detector of a given object. Opposed to the usual situation, where the detector is trained with a large amount of pictures, here the chosen pictures can be decisive for the detector's performance.

One of the main goals of this project is to understand the best strategy to train a detector with few examples; in particular we analyse the case where the detector is trained with only one sample. As we will see in chapter 4, the performance of these one-shot trained detectors can even surpass the performance of a detector trained with a large amount of images. We will also present in chapter 4 alternative ways of training a detector with only one sample: extending the notion of a one-shot trained detector to multiple points of view leads to a better understanding of the shape of the object by the detector and, therefore, to better performance.

Furthermore, the advance of the field also allows us to improve the solutions to classical problems. In the second part of this work we present a classical challenge in computer vision: predicting where people are going to look when they look at a given picture. This notion, commonly known as saliency, has been used in a wide variety of fields, from robotics to marketing.

In chapter 5, we will present a new algorithm to predict saliency, using a multi-layered model that builds on the outstanding performance of Convolutional Neural Networks at explaining visual information. Moreover, we are going to point out new lines of research that could improve the performance of saliency models. Fighting the lack of big saliency datasets is the main challenge to overcome at this point in order to keep improving the performance of saliency models.

Finally, in the conclusions we will wrap up the content of the thesis and extract some conclusions and future lines of work. It is important to make clear at this point that this thesis does not want to be a closed work: it does not want to present a full project with closed conclusions and no future work to do. It wants to be the seed for future research; it presents a significant body of research while pointing out the directions that this research could take in the next steps.


3.2 Notation and definitions

After having presented the scope of this work and before entering into its content, we believe it would be useful to familiarise the reader with some technical definitions that will appear throughout the document.

3.2.1 Object detection definition

First of all, let us start with the main definition of this work:

Definition 1. We define an object detector of the class C as a system that, given an image I, returns a set of boxes within the image I where an object belonging to the class C is present. For instance, we define a car detector as a system that, given a picture, returns all the positions within the picture where there is a car.

As can be deduced from the definition, the notion of an object detector is theoretical and independent of which method is used to accomplish the goal of the system. Independently of which method is used, object detectors are trained with a set of images where the ground truth is previously annotated. This leads to the following definition:

Definition 2. We define the training set of an object detector D as the set of images containing positive instances of the class, together with the set of images used as negative instances (that is, a set of images that do not contain any instance of the desired class), used to train the object detector.

Once the detector is defined and trained, we will need to evaluate its performance. First of all we should introduce the evaluation space:

Definition 3. We define the test set as the set of images where the ground truth is previously known and which is used to evaluate the performance of a given detector.

And, finally, we can define the metric commonly used to evaluate object detectors:

Definition 4. We define the precision of a detector as the quotient between the number of correct detections in an image and the total number of detections in the image. That is:

\[
\text{precision} = \frac{|\text{correct detections}|}{|\text{total detections}|} \tag{3.1}
\]

Definition 5. We define the recall of a detector as the quotient between the number of correct detections in an image and the total number of possible correct detections in the image (the number of objects of the detector's class):

\[
\text{recall} = \frac{|\text{correct detections}|}{|\text{possible correct detections}|} \tag{3.2}
\]

Definition 6. Given a detector D and a test set TS, let us compute a ranking of the detections of the detector D over the test set TS. By computing the precision and the recall at each position of the ranking, we can build the precision-recall curve p(r).

Definition 7. We define the average precision as:

\[
\mathrm{AP} = \int_0^1 p(r)\,dr \tag{3.3}
\]

The average precision metric is the most common in object detection. It allows us to rank detectors according to their performance, as we will do in chapter 4. However, it is not the only metric used in computer vision. As one can easily imagine, many different metrics are used to measure performance in different areas.
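The computations behind Definitions 4-7 can be sketched in a few lines. The following is a minimal illustration (not the evaluation code used in the thesis), assuming detections have already been matched to the ground truth and sorted by decreasing score:

```python
import numpy as np

def precision_recall_ap(is_correct, n_positives):
    """Precision-recall curve and average precision for a ranked list of detections.

    is_correct:  boolean flags, one per detection, sorted by decreasing score,
                 True if the detection matches a ground-truth object (Definitions 4-6).
    n_positives: total number of ground-truth objects in the test set.
    """
    is_correct = np.asarray(is_correct, dtype=float)
    tp = np.cumsum(is_correct)                       # correct detections up to each rank
    precision = tp / np.arange(1, len(is_correct) + 1)
    recall = tp / n_positives
    # Definition 7: AP is the area under the precision-recall curve p(r).
    ap = np.trapz(precision, recall)
    return precision, recall, ap

# Toy usage: 5 ranked detections, 3 ground-truth objects in the test set.
p, r, ap = precision_recall_ap([True, False, True, True, False], n_positives=3)
print(ap)
```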


3.2.2 Saliency definitions

In the second part of this thesis, the saliency problem will be treated and a new algorithm for saliency prediction will be presented. First of all, it is necessary to formally define the saliency notion:

Definition 8. The saliency of an object within an image is the quality by which it stands out relative to the other elements in the image. This can be technically defined as the amount of attention that this object receives from an image's viewer compared with the amount of attention the other parts of the image receive.

To quantify the saliency value, a classical experiment is performed: the picture is shown to several subjects whose eyes are tracked with an eye tracker. Using this information, we can infer where the attention of the subject is centred by using their fixations while the image is shown. Since the subject's fixations are centred on a single pixel, some post-processing is needed to visually present and compute the information.

Definition 9. We define the saliency value of a pixel as the normalised attention that this pixel receives in the overall picture. Using this value we can build a saliency heat map where each pixel is associated with its particular saliency value.

Definition 10. We define a saliency prediction as a predicted heat map of a given image.

Given a saliency prediction, we aim to determine the performance of the predictor. In order to evaluate the performance numerically, we will use a concept equivalent to the precision-recall curve for binary classifiers.

Definition 11. Given a binary classifier, we can define the receiver operating characteristic (ROC) curve as the relation between the true positive rate (TPR = True Positives / Total Positives) and the false positive rate (FPR = False Positives / Total Negatives) when the threshold of the classifier is varied.

Using the preceding definition, we can then define:

Definition 12. Given a binary classifier, the Area Under the Curve (AUC) is defined as the area under the ROC curve:

\[
\mathrm{AUC} = \int_0^1 \mathrm{TPR}\, d(\mathrm{FPR}) \tag{3.4}
\]

The AUC is used to compare the performance of different saliency models, since it is representative of whether the model is accurately predicting the saliency of a picture or not. It will be used in chapter 5 to benchmark the different saliency models and determine the contribution of each of the feature sets used to build our new model.
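As an illustration of Definitions 11 and 12 applied to saliency, the sketch below treats every pixel as one sample of a binary classifier whose score is the predicted saliency, and computes the AUC with scikit-learn. This is one reasonable evaluation setup assumed for illustration, not necessarily the exact protocol used in chapter 5:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def saliency_auc(saliency_map, fixation_map):
    """AUC of a saliency prediction (Definition 12): every pixel is a sample,
    its score is the predicted saliency and its label is whether it was fixated."""
    y_true = np.asarray(fixation_map, dtype=bool).ravel()
    y_score = np.asarray(saliency_map, dtype=float).ravel()
    return roc_auc_score(y_true, y_score)

# Toy usage: a 4x4 prediction evaluated against two fixated pixels.
prediction = np.random.default_rng(0).random((4, 4))
fixations = np.zeros((4, 4), dtype=bool)
fixations[1, 2] = fixations[2, 1] = True
print(saliency_auc(prediction, fixations))
```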


One Shot Training

4.1 Related work: general object detection techniques

The ability to detect objects in their surroundings is developed in humans and animals at an early age. A one-year-old baby is able to detect and recognise most of the objects that surround him without much effort. However, how humans process their surroundings and detect objects is still a mystery. A lot of work has been done in object detection and the results have kept improving but, at the same time, remain far from human performance. Last year, Facebook achieved almost human performance with the DeepFace project [21], pushing the barrier closer to the limit in face verification.

The object detection process can be divided into different parts. The basic pipeline in object detection consists of a set of feature functions, which are mappings from the images to an n-dimensional space, $\mathbb{R}^n$. Using these features, we train a classification algorithm, giving it positive and negative examples of the desired object. Finally, we evaluate our detector using a test set and the metric presented in subsection 3.2.1.

Classical object detection works have centered their efforts on improving one of the two building blocks of the detection process: the classification algorithm or the set of feature functions. In the following sections we present the state-of-the-art classification algorithm, the Support Vector Machine.

Furthermore, we also go through the two main sets of feature functions used nowadays in state-of-the-art systems. First, we will present the Histogram of Oriented Gradients, introduced by Dalal and Triggs [4]. HOG features have been present in most of the state-of-the-art systems in recent years, and have started to be substituted by features generated with Convolutional Neural Networks (CNNs). CNNs have opened a new world of possibilities in computer vision, given their ability to understand and properly map the visual world. The possibility of training the feature function to better describe the desired properties of the visual environment makes them the perfect tool for many of the classical computer vision problems where the feature set has historically been created manually.

4.1.1 Classification algorithm: Support Vector Machines

Many different classification algorithms have been used in object detection. The Support Vector Machine, first proposed by Vapnik and reformulated by Cortes and Vapnik [3], is one of the most used algorithms in this field thanks to its combination of performance and simplicity. The SVM tries to solve a classical problem: dividing two classes in $\mathbb{R}^n$ with a separating hyperplane.

As can be seen in figure 4.1.a, in some situations this linear hyperplane can be easily found. When the classes are linearly separable, that is, when there exists a hyperplane that completely separates both classes, the solution is intuitive and easily found. But in some scenarios, such as the ones presented in figures 4.1.b and 4.1.c, the data cannot be separated with a linear hyperplane. Then the power of the SVM is fully used to find the solution that divides the data as well as possible.


The formal formulation of the problem that the linear Support Vector Machine tries to solve is presented as follows.

Figure 4.1: (a) Linearly separable space (b) Non-linearly separable space (c) Non-separable spaces

Given some pairs of data $S = \{(x_i, y_i) \in \mathbb{R}^n \times \{-1, 1\}\}_{i=1}^{n}$, we want $w$ and $b$ such that:

\[
y_i(w \cdot x_i - b) \ge 1. \tag{4.1}
\]

The formulation of the optimisation problem proposed to find w and b is:

\[
\arg\min_{w,b} \ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w \cdot x_i - b) \ge 1. \tag{4.2}
\]

As shown in figure 4.1.c, problem 4.2 may have no solution. To soften the constraints and allow the algorithm to find a good approximate solution, a soft-margin version of the algorithm was presented by Cortes and Vapnik. The optimisation problem is then formulated as follows:

\[
\arg\min_{w,b,\xi} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(w \cdot x_i - b) \ge 1 - \xi_i, \ \ \xi_i \ge 0. \tag{4.3}
\]

Although the soft-margin version of the Support Vector Machine allows a wider family of solutions, some distributions are really hard to divide using the linear version. As an extension of the SVM presented before, a non-linear version of the algorithm was introduced. By changing the dot product of the data space, new degrees of freedom are introduced to the solver. This leads to a slower but usually more accurate solution. Intuitively, the linear hyperplane that results from the linear SVM is now extended to a non-linear (n − 1)-dimensional manifold restricted by the properties of the dot product used.

However, since this project is framed in the scope of mobile devices, we will always use a linear SVM when computing a detector: the soft-margin version lets us obtain good solutions for non-separable classes while keeping the speed of linear operations.
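As a minimal sketch of this setup, the snippet below trains a linear soft-margin SVM on toy feature vectors with scikit-learn's LinearSVC (used here purely for illustration; its C parameter plays the role of the penalty term in equation 4.3):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy feature vectors: rows are examples, columns are feature dimensions.
# In a detector, each row would be the feature vector (HOG or CNN) of a window,
# with label +1 for the object class and -1 for negative windows.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

# C is the soft-margin penalty from equation (4.3): a large C penalises slack
# variables heavily, a small C tolerates more margin violations.
svm = LinearSVC(C=1.0)
svm.fit(X, y)

# The learned hyperplane (w, b); new windows are scored with w.x - b and the
# highest-scoring windows are reported as detections.
w, b = svm.coef_[0], -svm.intercept_[0]
print(w, b, svm.decision_function(X))
```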

4.1.2 Feature functions

Feature functions play a key role in the computer vision world. We expect them to map the visual input to an n-dimensional space while keeping some meaning in the mapping distribution. Many proposals have historically been made to attack different problems in the field. In this section we present two of the most used sets of feature functions: HOG and CNN. HOG was present in the state-of-the-art algorithms until 2012, when Krizhevsky et al. [10] used deep convolutional neural networks, showing a large increase in performance over the Deformable Parts Model (DPM) [5], the state-of-the-art method at that time.


Histogram of Oriented Gradients

Histograms of Oriented Gradients were first introduced in 2005 by Dalal and Triggs [4] in the scope of person detection. HOG maps the image to the real space by dividing the input image into contiguous cells of the same size. In each cell it computes the gradients of the pixels and, from this data, a histogram of gradients. Furthermore, the histogram of each cell is normalised within a given area of the image in order to remove the effect of illumination.

Figure 4.2: Composition of HOG feature representation.

Some parameters need to be chosen when extracting the HOG representation of an image. The main parameter when computing the HOG transformation is the cell size used to build the HOG cells. The cell size is defined using the aspect ratio of the overall image. It is important to note that the cell size matters for object detection: the ability of the features to explain an object can be dramatically reduced by a wrong choice of cell size. Typical cell sizes are around 6 × 8 or 8 × 6 pixels.

The process to compute the HOG transformation starts by dividing the image into cells. Then, in each cell, the pixel gradients are computed. It is important to note that, of the three color channels of the classical RGB representation, only the strongest gradient is taken into account. Once the gradients are computed, it is time to compute the histogram. Several proposals have been made to improve the original HOG method of [4]. In the foundational paper, two types of histograms are computed: a histogram considering the oriented (signed) angle in 20° bins and the unoriented (unsigned) version using the same division. This leads to 18 values from the oriented gradients and 9 values from the unoriented ones. However, [4] only takes one of them, while Felzenszwalb et al. [5] concatenate them, as shown in figure 4.2. Finally, some extra bins are used to reflect the energy of the cells. [5] also proposed some modifications to the final basis used to express the features: a PCA decomposition of the features to obtain a more compact representation.

Finally, a resizing strategy is used to detect objects at different sizes. Several resized versions of the image (an image pyramid) are used to allow the detector to match the previously learned HOG structure with objects of different sizes. It is common to resize the image to smaller versions, but in some scenarios it can be worth generating a bigger image.
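As a hedged illustration of the cell/histogram/pyramid pipeline just described, the snippet below uses scikit-image's hog and pyramid_gaussian functions (one possible off-the-shelf implementation; the exact parameters of the thesis' detectors are not reproduced here):

```python
from skimage import data, transform
from skimage.feature import hog

image = data.astronaut()[..., 0]  # any greyscale image works as a stand-in

# HOG descriptor: 9 unsigned orientation bins per cell, block normalisation to
# reduce the effect of illumination (parameters are illustrative).
features = hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)
print(features.shape)

# Image pyramid: resized copies of the image so that a fixed-size HOG template
# can match objects appearing at different scales.
for level, scaled in enumerate(transform.pyramid_gaussian(image, downscale=1.5, max_layer=3)):
    print(level, scaled.shape)
```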


Figure 4.3: Graphical explanation of the pyramid construction process.

Convolutional Neural Networks

Deep Convolutional Neural Networks were first introduced by Fukushima [6]. In the scope of object recognition, many improved versions have been developed over the years, some of them for digit recognition [12]. With the advance of technology and the availability of large amounts of computational power, a huge boost in performance has been achieved in recent years. Recently, Krizhevsky et al. [10] set the best performance in the ILSVRC 2012 challenge with what is nowadays considered the state-of-the-art algorithm in object detection.

Figure 4.4: Network structure of the CNN presented in [10].

A Convolutional Neural Network consists of a number of interconnected layers. Several different types of layers are present in the overall network: convolutional layers, fully connected layers, max-pooling layers and response-normalisation layers.

• Convolutional layer: The convolutional layer divides the input into small patches and convolves them with a given filter. Usually a set of filters is applied to each patch, producing a multidimensional output. At the end of the layer, a non-linearity is applied. Traditionally $f(x) = (1 + e^{-x})^{-1}$ was used, but [10] proposed $f(x) = \max(0, x)$, which presents some advantages when learning the model.


• Fully connected layer: The fully connected layer applies a non-linear function to the input and connects this output with all the neurons of the next fully connected layer. The non-linearity used is the same as in the convolutional layer and is what conceptually defines the neuron.

• Max-pooling layer: The max-pooling layer downsamples the input to build a lower-dimensional representation of it. It divides the input into patches and takes the maximum value of each of these patches. Therefore, the dimension of the output is equal to the number of patches used in the max-pooling process. In the classical definition of the max-pooling layers [12], the patches do not overlap. However, [10] found that overlapping patches can lead to less overfitting.

• Response-normalisation layer: A response-normalisation layer takes into account the activity of a given neuron as well as the adjacent ones and normalises the output value. The details of the normalisation are given in section 3.3 of [10].

Given these basic pieces for building Convolutional Neural Networks, the explanation of the architecture used in [10] is straightforward. This network is composed of 8 layers: 5 convolutional layers and 3 fully connected layers, as shown in figure 4.4. The first two convolutional layers are each followed by a response-normalisation layer. After these two response-normalisation layers, as well as after the fifth convolutional layer, we find a max-pooling layer. Finally, the last three layers are fully connected.
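To make the layer ordering concrete, here is a minimal PyTorch sketch of an AlexNet-style network. The filter sizes and channel counts follow the usual description of [10], but this is an illustrative reconstruction, not the original implementation:

```python
import torch
import torch.nn as nn

class AlexNetLike(nn.Module):
    """AlexNet-style CNN: 5 convolutional + 3 fully connected layers, with
    response normalisation after conv1/conv2 and max-pooling after conv1,
    conv2 and conv5, as described in the text (dropout omitted for brevity)."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
            nn.LocalResponseNorm(5), nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.LocalResponseNorm(5), nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),   # fc6: the layer whose features feed the linear SVMs in the text
            nn.Linear(4096, 4096), nn.ReLU(),          # fc7
            nn.Linear(4096, num_classes),              # fc8: class scores
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# A 227x227 RGB crop produces a 1000-dimensional score vector.
print(AlexNetLike()(torch.zeros(1, 3, 227, 227)).shape)
```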

Finally, it is important to note the two main strategies for training CNNs that are being used nowadays. The decision on which strategy to use depends on the amount of training data and on the changes required in the network structure. The two main strategies are:

• Training from scratch: The network weights are initialised without previous values. The learning rate is high, given that the only data the network has seen is the training data. It is used when a large amount of data is available and the problem requires significant changes in the network behaviour.

• Fine-tuning: The network weights are initialised to the weights of a previous network, usually trained with large amounts of data. The learning rate is reduced and the weights are only slightly modified to adapt the network to the new task (a minimal sketch of this setup is shown after this list).
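A minimal sketch of the fine-tuning strategy, using torchvision's AlexNet as a stand-in for the network of [10]; the 50-category head and the learning rate are illustrative assumptions, not values taken from the thesis:

```python
import torch
import torch.nn as nn
from torchvision import models

# Fine-tuning starts from ImageNet-pretrained weights (downloaded by torchvision).
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Swap the last fully connected layer for the new task (e.g. 50 categories).
model.classifier[6] = nn.Linear(4096, 50)

# A small learning rate keeps the pretrained weights close to their initial
# values; training from scratch would instead start from random weights with a
# comparatively high learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One illustrative update step on a dummy batch.
images, labels = torch.randn(2, 3, 224, 224), torch.tensor([0, 1])
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```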


4.2 One-Shot Training

4.2.1 Motivation

As introduced in chapter 3, computer vision has started to enter the mobile device platform in recent years. This paradigm change generates a new space of opportunities and, at the same time, new problems to solve. DetectMe [13] was one of the first applications to train and evaluate object detectors entirely on the mobile device. It allows the end user to train their own detectors, recording the training set for the detector. More details about the DetectMe application will be given in section 4.3.

Figure 4.5: Results of the mug detector competition.

During the development of the DetectMe platform, some questions were raised. The most important one: which are the best images to train a detector? The first intuition is to evaluate the quality of the detector based on the quantity of data used in the training process. But, following the direction pointed out by Lapedriza et al. [11], we found that a good detector can be trained with a small amount of high-quality data. In figure 4.5, we can see the results of an internal competition to train the best mug detector. The rules were simple: all the detectors should be trained using DetectMe and should not use more than 30 training examples. As can easily be seen in the table, the results are surprising: the best detector was trained using only one image.

It is important to note that the worst detector was trained with 11 training images. This result motivated us to give a practical answer to the question: how should we train an object detector? It is known that on mobile devices we cannot handle large amounts of data, therefore our approach follows just the opposite direction: we want to find The Example, one image that captures the object structure as well as possible. Some practical issues challenge this conception: many objects have more than one viewpoint, and one image does not allow us to capture all the different views of the object. Furthermore, intraclass variation also affects our approach, given that with one example it is impossible to capture the intraclass differences. Finally, detectors trained with one example are not robust to small changes in the training example: a small variation of the viewpoint or the bounding box can become a large change in performance.

To overcome all these issues, in this section we study the detectors trained with only one example and pose some potential research lines to follow for further development of the topic. First of all, we discuss the viewpoint issue, and how a multiple-viewpoint object cannot be trained with one single image. Then, we present an alternative: a single-instance detector where multiple SVMs are trained, each of them with only one example. Furthermore, we analyse the generalisation capacity of the Convolutional Neural Network features when training linear SVMs with only one example.

4.2.2 Single view example

Much work has been done in psychology on determining what the canonical representation of an object is. In most experiments of this kind, different examples of an object are presented to the subjects, who evaluate the canonicity of each example [16, 17]. Different explanations have been proposed for why a certain view of an object is more canonical than another.

The most important ones are:

• Frequency hypothesis: According to the frequency hypothesis, the canonical view of an object is the most common view from which the object is observed.

• Maximal information hypothesis: According to the maximal information hypothesis, the canonical view of an object is the one that offers the most information about the 3D structure of the object.

Although some examples have been presented to reinforce both theories, some counterexamples have also been found to reject them. Therefore, we do not have a definitive characterisation of canonical views.

In this section we try to attack the problem from the computer vision point of view. Following, to some extent, the maximal information hypothesis, we define the canonical view as the one that, with a single example, is able to train the best object detector of the given class. The main questions we want to answer in this section are:

1. Is the quality of the object detector directly related to the view from which the training sample was taken?

2. If so, do the preferred views for different classes share some common attributes?

3. And finally, are the best detectors trained with only one image good enough? How far is their performance from that of detectors trained with all the information?

To answer these questions, we propose a simple experiment: we train one detector per training image in the ImageNet dataset for 50 categories and compare it with the overall performance of the full training set for each category. We performed this experiment with a detector using HOG features and a linear SVM classifier and, additionally, using CNN features with a linear SVM classifier on top. It is important to note at this point that, for the sake of comparison, all the detectors of the same category are trained with the same set of negative examples, and hard negative mining¹ is used in the training process.

¹The hard negative mining technique iteratively chooses the negative instances used to train the detector by selecting the negatives that were scored highest as positives in the previous iterations.
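A minimal sketch of the per-image training loop with hard negative mining, assuming feature vectors have already been extracted for the single positive and for a pool of negative windows (all names and sizes are illustrative, not the thesis' actual pipeline):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_shot(pos_feature, neg_features, rounds=3, per_round=500):
    """Train a one-shot detector with hard negative mining.

    pos_feature:  feature vector of the single positive example (1-D array).
    neg_features: feature vectors of candidate negative windows (2-D array).
    At each round, the negatives that score highest as positives are added to
    the working set, as described in the footnote above.
    """
    hard = neg_features[:per_round]                  # initial negative pool
    for _ in range(rounds):
        X = np.vstack([pos_feature[None, :], hard])
        y = np.array([1] + [-1] * len(hard))
        svm = LinearSVC(C=1.0).fit(X, y)
        scores = svm.decision_function(neg_features)
        hardest = np.argsort(-scores)[:per_round]    # highest-scoring negatives
        hard = np.unique(np.vstack([hard, neg_features[hardest]]), axis=0)
    return svm

# Toy usage with random features; in the experiment one such detector is trained
# per ImageNet training image and evaluated on the test set with average precision.
rng = np.random.default_rng(0)
detector = train_one_shot(rng.random(128), rng.random((2000, 128)))
```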


Car results with HOG features: towards an object centric database

In figure 4.6 the results for the car category are presented. In the figure we can see the 10 best training examples and their respective AP performance. The overall performance when training with all the images in the training set is 0.142, which means the best-scoring example reaches 78% of the global performance. Therefore, in the car category, question three has a positive answer: yes, detectors trained with one example can be competitive with fully trained detectors. It is important to note that, for the sake of comparison, the full detector is trained without any additional help such as latent positive corrections.

Figure 4.6: Best 10 examples to train a car detector.

Regarding the first question, we can sense some common attributes in the samples presented. All the images are taken from a similar point of view, independently of the direction of the car: some of the examples face right and some left, but all of the images are captured from an off-axis position. Following the related work in psychology, we could attribute this to the maximal information hypothesis: the off-axis position allows the viewer to capture the maximum amount of information about the car. However, figure 4.7 presents the top detections of the top-scoring car detector. As can easily be seen, all the detections are from the preferred viewpoint, which points out another potential reason for this preference: it could be the most common viewpoint in the test set and, therefore, be reflected in the canonical viewpoint.

Figure 4.7: Top scoring detections for the best car detector.

This raises another concern: is the test set biasing the experiment? Is the frequency hypothesis explaining the results, but with the frequency in the test set? To answer these questions we need more detailed data: we need statistics on the viewpoints in the test set and, at the same time, to remove the viewpoint bias, we need samples of each object from different viewpoints.

4.2.3 Object centric database

Some previous work has been done on collecting data with the associated viewpoint. [26] used Amazon Mechanical Turk to annotate the viewpoint of some of the images present in the Pascal dataset. Furthermore, other work focuses on having objects presented from multiple viewpoints.

[15] presents a dataset where the instances are taken from different points of view but captured in a controlled environment, and [26] published a small dataset with annotations of scale, angle and height. None of these datasets is object-centric and, at the same time, captured in real-life conditions. We want to propose a dataset with real-life images, like the Pascal dataset, but containing each object in the dataset from multiple points of view. The main properties of the database are:

• Object oriented: The atomic unit of the database is the physical object. Each instance is directly related to the object and contains all the images of this object taken from different points of view.

• Annotated information: The images will contain the bounding box for the object and, additionally, the pose information (angle, height).

• Classes: The database will contain various categories that are still to be defined.

Mixture training

The natural evolution of single-picture training is mixture training. In our scenario, mixture training extends single-image training by defining the detector as a mixture of detectors, each of them trained with one single image. This extension allows the detector to capture multiple viewpoints of the object while always restricting the viewpoint representation to only one sample.
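A minimal sketch of such a mixture, where each component is a one-shot linear SVM and a window is scored by the maximum over components; this is an illustrative scheme, not the exact implementation used in DetectMe:

```python
import numpy as np
from sklearn.svm import LinearSVC

class MixtureDetector:
    """Mixture of one-shot detectors: one linear SVM per positive example,
    a window is scored with the maximum response over the components."""
    def __init__(self, positive_features, negative_features, C=1.0):
        self.components = []
        for pos in positive_features:            # one component per single image
            X = np.vstack([pos[None, :], negative_features])
            y = np.array([1] + [-1] * len(negative_features))
            self.components.append(LinearSVC(C=C).fit(X, y))

    def score(self, window_features):
        scores = np.stack([c.decision_function(window_features) for c in self.components])
        return scores.max(axis=0)                # best-matching viewpoint wins

# Toy usage: three single-view positives (e.g. three viewpoints of the same object).
rng = np.random.default_rng(0)
det = MixtureDetector(rng.random((3, 64)), rng.random((200, 64)))
print(det.score(rng.random((5, 64))))
```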

The database presented in subsection 4.2.3 would allow us to reproduce the experiment presented in subsection 4.2.2 for the mixture detectors. As the ground truth information would contain the pose data, the effect of the test set bias could be corrected and the other issues could be explored in more detail.

For instance, we would have more meaningful results on the detector's performance because of the lack of bias in the test set: the comparison between the detector trained with one single instance and the detector trained with multiple instances would be fairer. Finally, the main question would slightly change: we would not want to find the best single image to train a detector, we would want to find the best object to train it, the one that has the most canonical views from all its viewpoints. This also leads to another interesting issue: is the most canonical object present in the most canonical single image?

4.2.4 One shot training with CNN

Although CNNs are not present in mobile devices nowadays, their success in explaining the visual input, joined with the advance of mobile technology, makes us very optimistic about an early implementation of CNNs in the mobile device world. The huge boost in performance makes them the perfect tool to handle hard vision problems. In the following years the challenge will consist in speeding up the feature extraction process. In this section we perform the same experiments presented in section 4.2.2 with one change: the HOG features are now substituted with CNN features.


The classical CNN is the one presented in [10], trained with the 1000 categories of ImageNet. However, since this network has been trained with all 1000 categories, the concept of training with only one sample does not fully apply: even though the SVM detector can be trained with only one example, the full network has seen numerous samples of that class. As a matter of fact, this is one of the main criticisms made of CNNs: do they have such a good capacity to explain the visual world, or have they been trained on so much data that they are able to remember and explain all these classes afterwards? To answer this question and have a fair comparison with the HOG scenario, we retrained a Convolutional Neural Network from scratch with all the categories used in ImageNet, deleting all the categories intersecting with the 50 categories used in the experiments.

New CNN trained with fewer categories

CNNs have recently been positioned at the center of the computer vision world. Their outstanding performance compared with the previously used feature functions has drawn the eyes of the whole community, searching for an explanation for this boost. Some criticism has also been raised during these years: the large amount of data on which CNNs are trained may be one of the reasons why the model shows such good performance. With this experiment we would like to determine whether the CNN's explaining capacity is mainly related to the training data, or whether the model itself is able to learn meaningful information and explain the visual input properly.

For the experiments, we trained a new CNN, deleting from the original dataset all the categories that are present in our experiments. Since the categories are not aligned, we needed to delete more than 50 categories², resulting in a final training set of about 700,000 training examples out of the 1.2 million examples of the original network. All the other parameters were left unmodified in order to have a fair comparison. Furthermore, the features extracted from the first fully connected layer were used in the classification process.

In figure 4.8 we present the evaluation of the detectors trained with the full training set on the 50 categories we are working with. The categories that perform better with the pre-trained features are those with a ratio greater than one, while the categories that perform better with the new network have a ratio below one. Although many categories have better performance with the old network, a significant number of them achieve a better score with the newly trained one.

These results help to demonstrate that, even though there is a certain effect of the data overlap between the categories used for CNN training and the categories used in the experiment, a large part of the CNN's performance comes from its internal structure. Figure 4.9 presents the overall statistics for the 50 categories. It is clear that the overall performance is better with the pre-trained network, but the numbers are not fully conclusive. The difference in mAP is only about 2 points of AP, which cannot be considered a conclusive sign.

Figure 4.10 shows the histogram of the ratio between the performance with the pre-trained network and the newly trained one. The ratio is computed as (pre-trained AP) / (newly trained AP), so all ratios greater than one correspond to categories that perform better with the pre-trained network than with the newly trained one. The asymmetry present in the histogram can be understood as follows: the categories that perform better when trained with the new network show a smaller improvement over the pre-trained network.

To sum up, we cannot draw conclusive results from this experiment, but we can point out that not all the power of the CNN comes from the fact that it has been trained with a large amount of data. As we have seen, the structure is robust enough to properly represent categories that it has not seen before, in some cases surpassing the pre-trained network.

²As an example, of the 1000 categories of the original ImageNet configuration, more than a hundred are dog breeds, all of which were deleted from the training set of the CNN.


Class | AP Full IN | AP Partial IN | Ratio
---|---|---|---
Spatula | 7.90 | 3.79 | 2.08
Refrigerator | 20.62 | 11.45 | 1.80
Pitcher | 33.11 | 18.62 | 1.77
Vacuum | 12.29 | 7.92 | 1.55
Rugby Ball | 27.10 | 18.12 | 1.50
Hair Dryer | 25.14 | 16.94 | 1.48
Computer Keyboard | 42.29 | 29.84 | 1.42
Basketball | 50.18 | 35.50 | 1.41
Laptop | 28.56 | 20.55 | 1.38
Beaker | 20.77 | 15.17 | 1.37
Flute | 13.75 | 10.00 | 1.37
Golfcart | 48.67 | 36.52 | 1.33
Backpack | 9.59 | 7.23 | 1.32
Microphone | 5.93 | 4.57 | 1.29
Pencil Sharpener | 23.39 | 18.72 | 1.24
Lemon | 34.78 | 28.70 | 1.21
Rubber Eraser | 7.27 | 6.02 | 1.20
Traffic Light | 20.57 | 17.38 | 1.18
Bagel | 36.64 | 31.15 | 1.17
Purse | 16.60 | 14.30 | 1.16
Bird | 33.00 | 28.44 | 1.16
Bow Tie | 30.21 | 26.05 | 1.15
Punching Bag | 28.40 | 24.86 | 1.14
Dumbbell | 18.30 | 16.46 | 1.11
Popsicle | 22.60 | 20.42 | 1.10
Banana | 9.76 | 8.92 | 1.09
Washer | 38.48 | 35.16 | 1.09
Pizza | 40.40 | 37.56 | 1.07
Volleyball | 61.06 | 57.00 | 1.07
Corkscrew | 22.44 | 21.53 | 1.04
Dog | 39.85 | 38.52 | 1.03
Dishwasher | 18.32 | 17.81 | 1.03
Hotdog | 30.00 | 29.31 | 1.02
Car | 31.70 | 31.20 | 1.02
Snail | 28.94 | 28.96 | 1.00
Soccer Ball | 55.06 | 55.03 | 1.00
Nail | 11.96 | 12.30 | 0.97
Sunglasses | 26.67 | 27.48 | 0.97
Frying pan | 22.18 | 22.74 | 0.97
Orange | 28.72 | 30.27 | 0.95
Neck Brace | 40.49 | 44.06 | 0.92
Stove | 11.00 | 12.05 | 0.91
Head Cabbage | 13.87 | 16.51 | 0.84
Pretzel | 18.10 | 21.36 | 0.84
Tennis Ball | 39.33 | 47.26 | 0.83
Remote Control | 32.16 | 41.66 | 0.77
Artichoke | 17.95 | 24.68 | 0.73
iPod | 40.79 | 56.06 | 0.72
Bell Pepper | 20.12 | 28.58 | 0.70

Figure 4.8: Average precision of the 50 categories in the ImageNet dataset, for both the pre-trained network (AP Full IN) and the newly trained one (AP Partial IN).

mAP Full IN | mAP Partial IN | Ratio | σ | Categories | Ratio > 1 | Ratio < 1
---|---|---|---|---|---|---
26.84 | 24.79 | 1.08 | 0.28 | 49 | 34 | 13

Figure 4.9: General statistics for the 50-category evaluation. The ratio is computed as (pre-trained AP) / (newly trained AP).

Results with CNN features and comparison

Once the behaviour of the two CNNs has been compared, we are ready to explore the results of our experiment. Reproducing the same experiment as the one performed with the HOG features and the linear SVM classifier, we trained one detector per training image for a list of given categories of the ImageNet database. We reproduced this experiment with both the pre-trained and the newly trained network, and we present the results in figure 4.11.

The first conclusion to draw is that the performance of the CNN-based detector is better than that of the HOG-based detector. For instance, the car class performance for the HOG-based detector trained with all the data is 14.2 AP, while the CNN is able to achieve a score of 31.2. This huge improvement can be explained by the high representational power of the CNN features, a conclusion that has been confirmed many times in the literature [8, 10].

Performance of one-shot trained detectors

Figure 4.10: Histogram of the ratio between the performance of the detectors using the pre-trained network and the newly trained network.

Table 4.11 shows the performance of the best detector compared with the performance of the detector trained with the full dataset. Following the quantitative-data paradigm, where the quality of a detector can be estimated by the amount of data with which it is trained, one would expect the fully trained detectors to outperform the one-example detector by a large margin. However, the results obtained do not follow this line: the best example shows a higher performance than the fully trained detector in many categories. This behaviour is common to both the pre-trained CNN and the newly trained network, which confirms that it is not the result of a particular network; it has to do with its architectural properties.

Although the results of the fully trained detector can be improved by applying numerous techniques such as latent positives or mixture modelling, while these techniques cannot be applied to a single-example detector, the results of table 4.11 show that data quality should be taken into account when training the detector. We cannot say that we are not making use of a large amount of data: to find the best example we have tested hundreds of detectors per category. However, we show that once we find the best example, the performance is close to or even better than that of the detector trained with all the data. Furthermore, we could derive some common characteristics of these top examples: this would allow us to use the best example without having to test all of them.

It is important to note that both CNNs in the experiment have been trained with a large amount of data. In the newly trained scenario, the CNN has not seen any instance of the 50 categories, but it has seen more than 700,000 examples, which provide a lot of information about objectness and visual mapping. This can be one potential reason to explain the result of this experiment: while we are training with only one example, the network has previously seen lots of objects from which it was able to extract good descriptors.

Finally, figure 4.12 presents the histogram of the ratio between the detector trained with the full dataset and the best detector of each category. It is another visual proof that the number of categories where the best detector outperforms the fully trained detector is large, which confirms the previous conclusions. Furthermore, it is also important to note that there is no significant difference between the behaviour of the pre-trained features and the newly trained features. One could infer that the key point is to train the CNN with a large amount of objects and that, from them, it infers an architectural structure fairly independently of the kind of objects it has been shown.


(a) Pre-trained CNN

Category | AP Full Training | AP Best Detector | Ratio
---|---|---|---
Basketball | 34.15% | 43.56% | 127.55%
Car | 31.2% | 21.72% | 86%
Rugby Ball | 25.13% | 21.72% | 86.43%
Computer Keyboard | 35.82% | 34.66% | 96.76%
Artichoke | 19.86% | 25.39% | 127.84%
Pizza | 35.54% | 49.85% | 140.26%
Beaker | 16.65% | 16.64% | 99.94%
Head Cabbage | 11.65% | 20.40% | 175.11%
Neck Brace | 40.12% | 30.92% | 77.07%
Dumbbell | 16.01% | 12.96% | 80.95%
Popsicle | 14.94% | 16.42% | 109.91%
Orange | 28.70% | 24.80% | 86.41%
Backpack | 8.01% | 6.33% | 79.03%
Snail | 27.45% | 30.12% | 109.73%
Remote Control | 26.19% | 34.91% | 133.30%
Refrigerator | 13.74% | 26.72% | 194.47%
Sunglasses | 30.04% | 28.29% | 94.17%
iPod | 43.36% | 34.37% | 79.27%
Volleyball | 27.81% | 37.50% | 134.84%
Frying pan | 20.80% | 22.04% | 105.96%
Bow Tie | 30.01% | 32.59% | 108.60%
Punching Bag | 28.30% | 22.72% | 80.28%
Pretzel | 13.16% | 23.53% | 178.80%
Pitcher | 31.13% | 31.78% | 102.09%
Bell Pepper | 21.78% | 29.74% | 136.55%
Nail | 9.90% | 1.44% | 14.55%
Lemon | 35.92% | 24.49% | 68.18%
Dishwasher | 19.47% | 12.86% | 66.05%
Tennis Ball | 42.57% | 41.40% | 97.25%
Banana | 8.59% | 13.82% | 160.88%
Hair Dryer | 24.43% | 2.66% | 10.89%
Bagel | 30.63% | 30.14% | 98.40%
Corkscrew | 21.54% | 29.41% | 136.54%
Microphone | 5.36% | 4.87% | 90.86%
Traffic Light | 18.93% | 21.66% | 114.42%
Spatula | 5.33% | 4.28% | 80.30%
Soccer Ball | 58.85% | 49.23% | 83.65%
Golfcart | 48.24% | 47.09% | 97.62%
Rubber Eraser | 2.96% | 13.95% | 471.28%
Flute | 11.82% | 9.31% | 78.76%
Hotdog | 32.14% | 29.35% | 91.32%
Pencil Sharpener | 19.64% | 0.65% | 3.31%
Vacuum | 13.08% | 9.80% | 74.92%
Laptop | 31.19% | 30.12% | 96.57%
Washer | 30.72% | 39.27% | 127.83%
Purse | 14.90% | 14.09% | 94.56%

(b) Newly trained CNN

Category | AP Full Training | AP Best Detector | Ratio
---|---|---|---
Basketball | 28.99% | 32.87% | 113.38%
Rugby Ball | 13.64% | 23.35% | 171.19%
Computer Keyboard | 28.06% | 31.32% | 111.62%
Artichoke | 19.80% | 20.39% | 102.98%
Pizza | 28.06% | 39.25% | 139.88%
Beaker | 15.47% | 10.83% | 70.01%
Head Cabbage | 12.46% | 16.28% | 130.66%
Neck Brace | 39.69% | 28.09% | 70.77%
Dumbbell | 14.78% | 9.30% | 62.92%
Popsicle | 22.48% | 20.11% | 89.46%
Orange | 29.63% | 27.45% | 92.64%
Backpack | 5.97% | 4.99% | 83.58%
Snail | 29.55% | 20.93% | 70.83%
Remote Control | 36.04% | 36.10% | 100.17%
Refrigerator | 7.97% | 28.12% | 352.82%
Sunglasses | 26.84% | 23.85% | 88.86%
iPod | 54.74% | 42.04% | 76.80%
Volleyball | 31.49% | 34.68% | 110.13%
Frying pan | 22.38% | 20.04% | 89.54%
Bow Tie | 28.93% | 26.71% | 92.33%
Punching Bag | 24.40% | 20.04% | 82.13%
Pretzel | 17.53% | 24.49% | 139.70%
Pitcher | 22.47% | 16.24% | 72.27%
Bell Pepper | 27.26% | 28.65% | 105.10%
Nail | 11.58% | 0% | 0.00%
Lemon | 31.74% | 21.59% | 68.02%
Dishwasher | 18.75% | 9.02% | 48.11%
Tennis Ball | 46.03% | 39.39% | 85.57%
Banana | 9.19% | 8.99% | 97.82%
Hair Dryer | 19.39% | 0.62% | 3.20%
Bagel | 28.80% | 33.57% | 116.56%
Corkscrew | 18.44% | 23.06% | 125.05%
Microphone | 4.38% | 4.82% | 110.05%
Traffic Light | 18.32% | 20.16% | 110.04%
Spatula | 1.54% | 4.05% | 262.99%
Soccer Ball | 57.62% | 34.99% | 60.73%
Golfcart | 32.77% | 42.46% | 129.57%
Rubber Eraser | 8.11% | 15.01% | 185.08%
Flute | 8.80% | 7.17% | 81.48%
Hotdog | 24.00% | 24.43% | 101.79%
Pencil Sharpener | 17.87% | 0.64% | 3.58%
Vacuum | 7.38% | 8.04% | 108.94%
Laptop | 25.04% | 26.11% | 104.27%
Washer | 35.92% | 36.64% | 102.00%
Purse | 14.97% | 11.13% | 74.35%

Figure 4.11: Results of the experiments for the CNN. The AP of the best detector trained with one sample is presented together with the AP obtained with the full training set. (a) Pre-trained CNN (b) Newly trained CNN

Figure 4.12: Histogram of the ratio of performance between the detector trained with all the data andthe best detector of each category.



Viewpoint bias with CNN features

In section 4.2.2 we analysed the preferred viewpoint of the detector for the car example, pointing out the need for a new dataset that removes the test set bias from the experiment. Two questions are to be answered in this section. First of all, following the car example, does the feature map affect this preferred viewpoint? That is, is the effect analysed in section 4.2.2 particular to the HOG feature-based detectors, or are we going to experience the same with other types of features? In figure 4.13 we present the ten best examples for training a car detector with CNN features. As can easily be seen, there is the same preference for the off-axis viewpoint that we observed in figure 4.6. Given the similarity in viewpoint preference, we can confirm that the choice of feature set does not have an effect on the viewpoint selection of the examples.

Figure 4.13: Best 10 training samples with their respective AP, trained with the newly trained network.

Furthermore, we also wanted to confirm that the existence of a preferred viewpoint is common to a significant number of categories. Since the CNN results have been more consistent in terms of performance, we use them to compare the best examples of each category and try to generalise the existence of a common viewpoint across categories. In figure 4.14 we present the best examples for different categories: within each category the examples share a common viewpoint. For instance, both computer keyboards and refrigerators are usually viewed from a single viewpoint, which is the one presented in figures 4.14.a and 4.14.c. Furthermore, although in everyday life we see remote controls from multiple points of view, the most common view is the frontal one. This is also reflected in the results, where all the remote controls with high AP are shown from the front. Finally, popsicles are found in many different positions in real life; however, all the best examples share a common position, the vertical pose.

To sum up, the results presented in figure 4.14 show that the best examples in each category share some properties, in particular a common pose or point of view from which the image was taken. This confirms the hypothesis presented in section 4.2.2 and therefore the need for a new database on which to experiment with one-shot trained detectors while removing the viewpoint bias.


Figure 4.14: Best training examples for different categories. (a) Computer keyboard (b) Popsicle (c) Refrigerator (d) Remote controllers

4.2.5 Conclusions and future work

Three questions were raised at the start of this section, and we think we gave a partial answer and, more importantly, pointed out some interesting lines of research to fully answer them. To sum up, the questions proposed were:

1. Is the quality of the object detector directly related to the view from which the training sample was taken?

2. If so, do the preferred views for different classes share some common attributes?

3. And finally, are the best detectors trained with only one image good enough? How far is their performance from that of detectors trained with all the information?

The first question has a partial answer: it depends on the category but, generally, yes, there is some trend towards one particular point of view. The reasons why we observe this effect are not fully explained by our experiments, but one hypothesis was raised: the bias in the test set can largely explain the preference for certain viewpoints. Therefore, after removing this bias, will there still be some preference for a viewpoint? Some future work has been proposed to analyse this issue: an object-centric database with annotated information about the viewpoint from which each picture was taken and, at the same time, multiple images of the same object from different points of view. Furthermore, this database would allow mixture detectors to be trained, with each component of the mixture trained with only one example.

The second question does not have a conclusive answer at this point. From all the data gathered, we can confirm that for objects with a predominant viewpoint, this is the preferred one to train the detector. Some examples of this kind of object are presented in figure 4.14, where it is clear that refrigerators, remote controls and computer keyboards are better trained with images showing their natural view. However, some objects, like cars, have more than one natural point of view. In the car example, for instance, we usually look at cars from a large variety of points of view: there is the front view, the lateral view and some off-axis views. In these scenarios, a deeper analysis needs to be done with an unbiased procedure that accounts for the test-set bias.

Finally, the third question has a conclusive answer: yes, detectors trained with one sample can be as good as those trained with a large amount of data. A large number of categories show better performance for the best single-example detector than for the detector trained with the full dataset, which is even more than what was expected when the question was posed. Even so, finding this perfect example was done with a brute-force search: all the training images were used one by one to train detectors and the best detector was chosen. Thus, the problem is how to capture such examples, how to characterise the properties of the picture that produces the best detector when training with only one sample. As we pointed out, the viewpoint plays a key role in determining whether an example is good or bad, and it can be a useful descriptor in the search for the best viewpoint to train the detector.
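As an illustration of this brute-force procedure, the following minimal sketch enumerates the candidate images, trains a one-example detector for each and keeps the best one; the training and evaluation callables are supplied by the caller and are not functions from the thesis code.

```python
def best_one_shot_detector(train_samples, test_set, train_fn, eval_fn):
    """Brute-force search: train one detector per candidate image and keep the best.
    `train_fn(samples)` returns a detector; `eval_fn(detector, test_set)` returns its AP."""
    best_ap, best = -1.0, None
    for sample in train_samples:
        detector = train_fn([sample])        # detector trained with a single positive example
        ap = eval_fn(detector, test_set)     # average precision on the held-out test set
        if ap > best_ap:
            best_ap, best = ap, (sample, detector)
    return best, best_ap
```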


4.3 DetectMe: The open object detector

4.3.1 Introduction to DetectMe

Some projects have been developed in recent years to bring computer vision to mobile devices. DetectMe [13], an object detector for mobile platforms, is one of the first systems to bring the full object detection pipeline to a mobile device: DetectMe allows the user to both train and execute detectors. By moving the full process to mobile devices, we enable a standard user with almost no previous knowledge of the field to work with it. For the first time, we allow non-expert users to play with advanced computer vision algorithms and, furthermore, we put them in contact with each other. DetectMe also aims to become a social warehouse of detectors, an archive where people store their detectors while trying out and playing with other users' detectors. Finally, we allow people to retrain other detectors, in the spirit of continuously improving detector performance through a social, iterative process.

4.3.2 DetectMe: the application

In this section we will go through the most important characteristics of DetectMe. The DetectMe project was started by Dolores Blanco in 2012, continued and mainly developed by Josep Marc Mingot in 2013, and is being finished by Adria Recasens in 2014. First of all, we detail the training process. Then, we detail the execution process and how to execute detectors from other users. Finally, we cover the retraining process, where users can improve other users' detectors.

Training a detector

The training process in the DetectMe application has been built to be easy to understand for a non-expert user. In figure 4.15.a the annotation process is presented. While taking the picture of the desired object, the user can adjust the bounding box in place in an easy process. In the training phase we can add as many pictures as needed. It is important to note that the rest of the image, outside the bounding box, will be used as negative examples, which means that the user should avoid placing other objects of the same category in the rest of the picture. Once all the positive instances are added to the detector, the training process starts.

Figure 4.15: Three main steps of DetectMe.


DetectMe uses a HOG + SVM algorithm, where the size of the HOG cell is determined based on the size of the training bounding boxes and their aspect ratio. To handle the negative samples, it uses hard negative mining, only considering the negatives that were classified as potential positives by the algorithm in the previous iteration. The experiments presented in section 4.2.2 were conducted with the same set of parameters, which makes their results applicable to DetectMe users.
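As a rough sketch of the hard negative mining step (not the actual DetectMe implementation: scikit-learn's LinearSVC stands in for the SVM solver, and HOG extraction plus window sampling are assumed to be done elsewhere):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_with_hard_negatives(pos_feats, neg_pool, n_rounds=3, seed=0):
    """pos_feats: HOG features of the annotated boxes; neg_pool: HOG features of
    windows sampled outside the boxes. Only negatives that score as potential
    positives are added back into the training set at each round."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(neg_pool), size=min(500, len(neg_pool)), replace=False)
    neg_cache = neg_pool[idx]
    svm = None
    for _ in range(n_rounds):
        X = np.vstack([pos_feats, neg_cache])
        y = np.hstack([np.ones(len(pos_feats)), -np.ones(len(neg_cache))])
        svm = LinearSVC(C=1.0).fit(X, y)
        scores = svm.decision_function(neg_pool)
        hard = neg_pool[scores > -1.0]           # negatives inside or beyond the margin
        if len(hard):
            neg_cache = np.vstack([neg_cache, hard])
    return svm
```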

Executing a detector

Once the detector is trained, it is time to use it! The execution process is very straightforward in the DetectMe application; the only parameter to change is the threshold of the detector. The threshold is the value used to decide whether a bounding box with an associated score is considered a positive detection or not. This value can be changed with a sliding bar in the lower part of the screen. Figure 4.15.b shows the execution screen, where a bounding box outlines the detected object.

Furthermore, DetectMe also allows sharing detectors between different users, as well as using featured detectors, detectors specially trained to show high performance. These detectors are listed in the Community detectors and Featured detectors sections of the DetectMe application, as shown in figure 4.15.c. The decision to share a detector is taken just before starting the training process: if the user wants to share the detector with other users, it should be set as a public detector; otherwise, it has to be set as a private detector.

Retraining a detector

Finally, DetectMe also allows retraining detectors, that is, adding and removing training instances of an existing detector. For the sake of privacy and performance, DetectMe does not download all the training samples of the detector to be retrained to the mobile device; it only downloads the support vectors captured at the end of the original detector's training process. Only when retraining one's own detector, since all the data is already on the device, are the original training images used.

Figure 4.16: Retrain screen of the DetectMe application

In the retraining process, the old support vectors are introduced in every iteration, while the newly generated support vectors change from iteration to iteration. By doing that, we want to keep some information from the original detector during the full training process; if we let the old support vectors change freely, that information could be lost in the process.
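A schematic view of this retraining strategy (illustrative only; the variable names and the use of scikit-learn's LinearSVC are assumptions, not the DetectMe code): the stored support vectors of the original detector enter every iteration unchanged, while the hard negatives mined from the new images are refreshed.

```python
import numpy as np
from sklearn.svm import LinearSVC

def retrain(old_sv_X, old_sv_y, new_pos, new_neg_pool, n_iters=3):
    mined = new_neg_pool[:200]                       # initial negatives from the new images
    svm = None
    for _ in range(n_iters):
        # the original support vectors are always part of the training set
        X = np.vstack([old_sv_X, new_pos, mined])
        y = np.hstack([old_sv_y, np.ones(len(new_pos)), -np.ones(len(mined))])
        svm = LinearSVC(C=1.0).fit(X, y)
        scores = svm.decision_function(new_neg_pool)
        mined = new_neg_pool[scores > -1.0]          # refreshed hard negatives replace the previous mined set
    return svm
```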

4.3.3 DetectMe competition

The deep impact of smartphone technology in the last decade has made DetectMe accessible to a large number of users. On the other hand, many examples of computer games related to research have started to emerge in recent years. For example, in 2011 the players of FoldIt solved the crystal structure of the Mason-Pfizer monkey virus protease in only 10 days, solving a problem in protein folding that had been open for 15 years [9]. Following this gamification spirit, we believe that DetectMe users have great potential and intuition on how to train better object detectors. While experts in the field are always biased by their own experience and research, new DetectMe users have no previous background on what could make an object detector good.

With the goal of taking advantage of all this distributed knowledge, we propose a simple game, an object detector competition, to find the best object detector. The goal of the game is, given a list of categories, to find the best object detector for each category trained with fewer than 30 images. The results will help to complement the research presented in section 4.2, giving more information about what makes a good detector. While in the previous section we used a systematic approach to the problem, based on a large dataset, in this scenario we approach the problem from a different point of view: we use people's ability to find the best training set for each category.

The game structure is simple: players train detectors for one of the given categories. Each detector is uploaded to the server, where it is evaluated against a test set from the LabelMe database. The resulting AP is added to a real-time leaderboard accessible on the competition website. A beta version of this competition was launched among some researchers last January, resulting in the leaderboard presented in figure 4.5. The result of this competition, held between fewer than 10 people, was enough to motivate part of the research presented in this thesis, and we hope that the competition will also motivate other research lines.
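For reference, the leaderboard metric is average precision; a generic, uninterpolated AP computation looks like the sketch below. The exact evaluation code and matching criterion used by the competition server may differ, and matching detections to the LabelMe annotations (e.g. by an overlap test) is assumed to have been done already.

```python
import numpy as np

def average_precision(scores, is_true_positive, n_ground_truth):
    """scores: confidence of each detection; is_true_positive: 1 if the detection
    matched an unused ground-truth box, else 0; n_ground_truth: number of annotated objects."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    recall = cum_tp / n_ground_truth
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_recall)   # area under the precision-recall curve
        prev_recall = r
    return ap
```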

Figure 4.17: Submission screen of the DetectMe Competition

Although the competition presented above can lead to a really good detector, a lot of effort and information is lost in the process: given the nature of the competition, many well-trained detectors are not used. To use all the contributions made by the different players, we would like to propose a competitive-cooperative game: a game where the players compete against each other but, at the same time, contribute to a common goal. In the object detection scenario, and following the research lines proposed by [1], this could be accomplished by having every player work on improving the same object detector (in a retraining process similar to the one in section 4.3.2) and scoring them by the relative improvement over the original detector that was retrained. With these rules, the players would still be competing against each other while all contributing to a common goal, the attainment of the best object detector for a given class.

4.3.4 DetectMe as a framework

In parallel to its contribution to the field by bringing computer vision to everyone, as intended in section 4.3.3, DetectMe also wants to help researchers and developers add computer vision capabilities to their systems in an easy and simple way. Although computer vision capabilities can be extremely helpful when developing complex systems and mobile phone applications, the challenges presented by the implementation of these systems usually outweigh the benefit to the developer. This is why we want to provide developers with an easy-to-use framework to add computer vision capabilities to their systems.

Figure 4.18: Real time streaming of DetectMe detections

To do so, we developed an API that allows developers to retrieve detectors, upload detectors and, foremost, run detectors in real time on a mobile device while reading all the data coming from the detections. As can be seen in figure 4.18, a real-time streaming interface is available on the DetectMe website. Furthermore, this information can be retrieved directly with a simple protocol connecting the user directly to the detector's device and providing all the information related to the detection. The protocol structure makes it scalable for use in multiple systems without overloading the central server.
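A hypothetical client sketch of how such an API could be used is shown below; the endpoint paths, field names and base URL are illustrative placeholders only and are not taken from the DetectMe documentation.

```python
import requests

BASE = "https://detectme.example.org/api"   # placeholder URL, not the real server

def list_detectors():
    # retrieve the catalogue of shared detectors
    return requests.get(f"{BASE}/detectors").json()

def upload_detector(name, weights_path):
    # publish a locally trained detector
    with open(weights_path, "rb") as f:
        return requests.post(f"{BASE}/detectors", data={"name": name},
                             files={"weights": f}).json()

def stream_detections(device_id, handle):
    # read the detections produced in real time by a device running a detector
    with requests.get(f"{BASE}/devices/{device_id}/detections", stream=True) as r:
        for line in r.iter_lines():
            if line:
                handle(line)   # e.g. parse the bounding box and score
```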

Our vision is to have DetectMe used in multiple systems involving vision capabilities to some extent. The small size of a smart device makes it easy to integrate in numerous robotics systems, where the presented framework can interface between the device and the central unit of the robotic system. Furthermore, iOS developers could also use our framework to develop their own applications. We hope that simplifying the process of including vision capabilities in iOS devices will lead to better applications that use all the potential a modern smartphone offers.

4.3.5 Future Work

Some future work has been pointed out in the previous sections on the DetectMe application. Regarding the application itself, a potential line of work would be to expand the set of available feature functions to Convolutional Neural Networks. CNNs are expected to be used in the state-of-the-art methods of the following years, and their implementation presents some interesting challenges in terms of performance and speed.

Furthermore, some other applications are also being considered for implementation in future versions of DetectMe. The difference between a particular detector, able to detect a particular object among other objects of its class, and a general detector, one that detects any object of a given class, is not well represented in the current version of DetectMe, where all the user's control is centred on the choice of the training set. Some changes can be made to the feature functions to better handle particular objects; this is a line of work that could appear in future versions of DetectMe.

Following the line of research presented in previous sections, a feedback system could also be used to help the user when training a detector. Automating the detection of good examples without the need to train a full new detector could be a first step towards implementing a "good example detector" as a helper for a potential DetectMe user.

Finally, as detailed in section 4.3.3, some future work can be done on the gamification of the training process. The potential results of a competition between DetectMe users to train an object detector, or the creation of a cooperative-competitive game, are the two main proposals in this research line.


Saliency

5.1 Introduction to saliency

Where do people look when they are looking at a scene? This is the key question in saliency, the study of how people's attention behaves when they face a scene. Many fields use saliency as a tool to improve their methods: robotics research tries to imitate human behaviour in robots, so the information about where a real human would look in a given scene is valuable and useful to better reproduce human behaviour. Furthermore, numerous applications of saliency have arisen in the computer graphics world: from video compression techniques where the centre of attention is compressed less than the background [7,25] to non-photorealistic rendering.

Several approaches have been taken to understand where people look. Using an eye-tracker to collect people's fixations and afterwards building a saliency map is the most reliable method: humans are strongly consistent about where they look when viewing a scene, which is good news for the saliency modelling problem. However, eye-trackers are expensive and nowadays only found in research institutions, where they are used to conduct experiments and collect data. They cannot easily be used to estimate saliency maps for a large number of images and, furthermore, it is a time-consuming process.

Saliency models are the alternative to eye-tracker data collection. They aim to model the fixation heat maps of a given image: they allow saliency maps to be built in almost real time and for an arbitrary image. This ease of use makes them the perfect tool for all the fields that want to apply saliency in their work without spending that much effort on building the heat maps.

In the following section some saliency-related work is presented. Numerous saliency models have been proposed; we go through the state-of-the-art models as well as those related to our work. In section 5.3 we present our new saliency model, where a mixture of top-down and bottom-up features is used to capture the different details of the images. Furthermore, CNN features are used to compute the low-level features, combined in a fashion that leads to a new state-of-the-art method. The results of this model are presented in section 5.4, where it is shown that the model beats all the state-of-the-art models on the MIT 1003 database. Finally, some future work is pointed out in section 5.5.


5.2 Related work

Many proposals have been made for saliency estimation. In this section we present three of the current state-of-the-art models, for the sake of comparison and to lay the basis for our work presented in section 5.3.

Judd et al. [22] proposed a full pipeline of low-level, mid-level and high-level features to capture both the low-level details that are relevant for saliency and the top-level information such as the persons and faces present in the picture. The types of features present in the model are:

• Low-level features: The low-level features try to capture local properties of the image relevant to the global saliency heat map. [22] uses the local energy of the steerable pyramid features [20]. Furthermore, some other features are added to the low-level model: features from the Torralba [14] and Rosenholtz [18] models, and intensity, orientation and color contrast.

• Medium-level features: The medium-level features consist of a horizon detector, trained with the goal of detecting the horizon, under the hypothesis that humans naturally look for salient objects at the horizon line.

• High-level features: The high-level features aim to capture some high-level relations in the overall picture. To do so, [22] uses a face detector [24] and a person detector [5].

• Center prior: The center prior is used to capture the natural bias of humans to look at the center of an image. The center-prior feature for a pixel is related to its distance to the center of the image.

All these features are combined with a linear SVM to finally provide a value for the saliency prediction. Furthermore, Judd et al. [22] published a database of 1003 images on which to train and evaluate the algorithms. This created a benchmark that has been widely used to compare different saliency prediction algorithms.

Zhang and Sclaroff [27] presented a binary-map-based algorithm for saliency detection. Using binary maps randomly sampled from the input image, the algorithm estimates one attention map per binary map and afterwards integrates all of them into a final saliency map. The output map can then be adapted depending on the task to be performed.

Figure 5.1: Boolean map pipeline. Credit for the figure to [27].

The construction process of the binary maps is as follows: a feature function that maps each pixel to the interval [0, 255] is randomly selected, together with a random threshold θ. The image is then thresholded, setting pixels above the threshold to active and pixels below the threshold to zero. From this binary map an attention map is computed, using the gestalt principle for figure-ground segmentation, which states that surrounded regions are more likely to be salient. For this reason, surrounded regions in a given binary map are active regions in the attention map while the other regions are not active. Finally, all the maps are gathered together, weighting each by the probability of its binary map given the input image.
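The following sketch gives a simplified reading of that pipeline (random channel and threshold, surroundedness test, averaging); the final probability weighting and other details of [27] are deliberately omitted, and the helper names are our own.

```python
import numpy as np
from scipy.ndimage import label

def surrounded_regions(bmap):
    """Connected components of the boolean map that do not touch the image border."""
    labels, _ = label(bmap)
    border = np.unique(np.concatenate([labels[0], labels[-1], labels[:, 0], labels[:, -1]]))
    return bmap & ~np.isin(labels, border)

def bms_saliency(image, n_maps=100, seed=0):
    """Average the attention maps obtained from randomly thresholded channels."""
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    acc = np.zeros((h, w))
    for _ in range(n_maps):
        channel = image[..., rng.integers(c)].astype(float)
        theta = rng.uniform(channel.min(), channel.max())
        acc += surrounded_regions(channel > theta)
    return acc / n_maps
```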

Finally, Vig et al. [23] used Convolutional Neural Networks to model the visual saliency of images. Their model is selected using a multi-parameter optimisation over a multidimensional family of models built from the four main elements of Convolutional Neural Networks presented in section 4.1.2. The parameter search is performed using [2] and the resulting set of features is combined with a linear SVM.

Figure 5.2: Pipeline for CNN feature modelling. Credit for the figure to [23].


5.3 Deep features for saliency prediction

Following the model scheme proposed by [22], our saliency model uses different types of features, covering the different visual levels at which saliency can be modelled. Like [22], on top of the features we use a Support Vector Machine as a regressor to combine the values of the different features and estimate the saliency values.

The features used as an input for the SVM are:

• Low-level features: As an evolution of [23], we use the features from the max-pooling layer at the top of the fifth layer of the CNN proposed by [10]. The representational power of the CNN features can explain different visual phenomena, and saliency is one of them. Furthermore, some other experiments have been done to fine-tune a particular network for saliency estimation, gathering all the outputs into a single final output with values in the interval [0, 1].

• Medium-level features: The medium-level features consist of the saliency prediction for the 20 classes of the Pascal dataset. For each class, we train an intraclass model of the salient parts of the objects. At test time, the object detector proposed by [8] is executed and the model is applied to all the detected objects. Thus, the intraclass prediction is zero over the background and equal to the intraclass model over the objects present in the picture. This model is trained by extracting a spatial component (computed as the mean over all the training examples of a particular category), which is combined with a CNN saliency model for the particular category. For the first time, a saliency model takes into account the intraclass saliency structure and, furthermore, does so for a large number of different objects. This can be seen as an evolution of Judd et al.'s model [22], where a person and a face detector are used. In our model, the face detector is not needed, since the person intraclass model learns that faces are the most salient part of the human shape.

• High-level features: The high-level features try to capture the high-level relations between the different objects in the image and to score each object accordingly. To do so, we run the object detector for the 20 Pascal classes [8] and afterwards score the pixels depending on the score given to the bounding boxes containing them. To sum up, each area of the picture will have a score proportional to the score of the detections containing it. These high-level features can also be understood as an evolution of the [22] model, where only a person detector is used; in our case, 20 different detectors are used.

• Center prior: Finally, to capture the natural bias of humans to look at the centre of the picture, we add a centre prior: a Gaussian function that has its maximum at the centre of the image and decreases towards the sides. It is important to note that the centre prior used as the only saliency estimator is already able to reach high scores, because of the predominance of the centre when humans look at a scene. A minimal sketch of such a prior follows this list.
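The sketch below assumes an isotropic Gaussian whose width is a free parameter; the exact parametrisation used in the thesis is not specified here.

```python
import numpy as np

def center_prior(h, w, sigma_frac=0.25):
    """Gaussian with its maximum at the image centre, decreasing towards the sides.
    sigma_frac is an illustrative choice, not a value taken from the thesis."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    sigma2 = (sigma_frac * max(h, w)) ** 2
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma2))
```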

Each of these features generates a separate saliency model, which is then used to build a global saliency model. The partial modelling is done with the same strategy as the global modelling: using a linear regressor of the form f(x) = w^T x + b. Therefore, for each feature set i we have f_i(x) = w_i^T φ_i(x) + b_i.


Furthermore, the global model is obtained by combining all these partial models as:

f(x) = w_1 f_1(x) + w_2 f_2(x) + w_3 f_3(x) + w_4 f_4(x) + b    (5.1)

In this way, each regressor selects the relevant features within its layer, and the partial predictions are then combined to produce the final saliency prediction. This allows us to select the relevant features in each layer and finally merge them into a single model. The use of a unique regressor over all the features was considered and tested, but the results were poor because some important features were hidden by features from other layers.
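The two-stage scheme of equation (5.1) can be sketched as follows; a plain least-squares regressor stands in for the linear SVM regressor used in the thesis, and the per-pixel feature matrices are assumed to be precomputed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_layered_model(feature_sets, target):
    """feature_sets: list of per-pixel feature matrices (one per level);
    target: ground-truth saliency values for the same pixels."""
    partial = [LinearRegression().fit(F, target) for F in feature_sets]   # one f_i per level
    stacked = np.column_stack([m.predict(F) for m, F in zip(partial, feature_sets)])
    combiner = LinearRegression().fit(stacked, target)                    # learns the weights w_i and bias b
    return partial, combiner

def predict_saliency(partial, combiner, feature_sets):
    stacked = np.column_stack([m.predict(F) for m, F in zip(partial, feature_sets)])
    return combiner.predict(stacked)
```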


5.4 Experimental results

The model presented in section 5.3 outperforms the current state-of-the-art models for saliency estimation. With this mixture of heat maps coming from different feature sets, we cover a wide range of details, creating a model able to detect saliency in different scenarios. We evaluate the results on the MIT 1003 dataset, using 903 images for training and 100 images for testing. Furthermore, ground truth for the test set has been annotated in order to correctly test the intraclass model. We first present the results of the different partial models separately and then the results for the joint model.

Figure 5.3: Saliency prediction for the different feature sets. (a) Original image (b) Ground truth (c) Global model (d) Low-level features

Model   Global   Low-level features   Center prior   Judd   Cox
AUC     0.876    0.785                0.81           0.841  0.854

Table 5.1: Detailed results for the saliency prediction in the MIT 1003 dataset.
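The AUC reported in Table 5.1 treats fixated pixels as positives and the saliency value as the score; a generic rank-based computation is sketched below (ties and the shuffled-AUC variant are ignored, and the benchmark's exact implementation may differ).

```python
import numpy as np

def saliency_auc(sal_map, fixation_mask):
    """Probability that a randomly chosen fixated pixel gets a higher saliency
    value than a randomly chosen non-fixated pixel (Mann-Whitney formulation)."""
    scores = sal_map.ravel()
    labels = fixation_mask.ravel().astype(bool)
    pos, neg = scores[labels], scores[~labels]
    ranks = np.empty(len(scores))
    order = np.argsort(np.concatenate([pos, neg]))
    ranks[order] = np.arange(1, len(scores) + 1)
    u = ranks[:len(pos)].sum() - len(pos) * (len(pos) + 1) / 2.0
    return u / (len(pos) * len(neg))
```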

5.4.1 Low-level features

The low-level features were extracted from the max-pooling layer at the top of the fifth layer of the CNN and combined with a linear SVM regressor. The results of this estimation can be seen in figure 5.3.d, where it is clear that the low-level features are not able to capture the global picture of the saliency estimation. However, at short distances, they are able to distinguish salient parts of the image.


Figure 5.4: Spatial models for the medium-level features. (a) Person (b) Car

5.4.2 Medium-level features

In figure 5.4 we can observe the result of the intraclass saliency modelling for cars and persons. As can be seen in the person model, the head takes on a lot of importance within the person saliency map. This corresponds to a natural bias toward the head of a person when looking at it, which [22] covered with the face detector. Figure 5.5 presents the estimated saliency heat maps for a person example and a car example. The spatial model has some effect but, at the same time, it is combined with the category saliency model, which adds more variance to the final result. It is finally important to note that the medium-level features can only be understood in the scope of a layered model: their power lies in complementing both the low-level and high-level features to better describe the saliency distribution. As figure 5.5.a shows, having a good model over the object does not imply good overall performance: other objects in the picture, like the bird close to the kid, can completely change the saliency map.
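For reference, the spatial component shown in figure 5.4 is the pixel-wise mean of the training heat maps of the category (section 5.3). A minimal sketch follows, assuming the maps have already been cropped to the object bounding box and resized to a common shape, which is a preprocessing choice of this sketch rather than a detail taken from the thesis.

```python
import numpy as np

def category_spatial_component(train_maps):
    """train_maps: list of fixation heat maps for one category, cropped to the
    object box and resized to a common shape. Returns their pixel-wise mean."""
    stack = np.stack([m / (m.sum() + 1e-8) for m in train_maps])   # normalise each map
    return stack.mean(axis=0)
```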

Figure 5.5: Medium-level saliency prediction. (a) Person (b) Car

5.4.3 High-level features

The high-level features aim to capture the presence of different objects in the picture. As can be seen in figure 5.6, the high-level saliency map consists of the detection map of the different objects present in the picture. In the example shown, the persons in the image are detected and weighted depending on the importance given by the detector. The high-level features do not contain the medium-level model that captures the intraclass saliency. This is why the different parts of an object are scored with the same value, which is afterwards modified by the medium-level models for intraclass saliency estimation.

Figure 5.6: High level saliency estimation. (a) Original picture (b) Saliency heat map

5.4.4 Global model

In figure 5.3.c the results for the global model are presented. One can see at a glance that the low-level model makes a significant contribution to the global model, highlighting the salient parts of the picture when analysed at a short distance. However, the other levels of the layered model add some global interpretation of the picture, such as the notion of the car in the example presented. In this way, the final model takes into account a general notion of the different objects present in the picture and reflects it in the saliency prediction.

To sum up, the results show that our model beats the state-of-the-art models by an important margin. Having such variety when modelling saliency, with both top-down and bottom-up components, makes the model complex enough to capture the different details needed for a good saliency predictor.


5.5 Future Work

Although our model achieves state-of-the-art performance, some additional changes could further improve it. First of all, for the low-level features, additional training and changes can be proposed to improve the representational power of the CNN. The main problem with this proposal is the amount of data needed to fine-tune the models, which at this point is restricted to the MIT 1003 dataset.

The data issue motivates the proposal of a new database and, with it, the construction of a new method to collect saliency data. The proposed method consists of using the webcam of a typical laptop to capture the saliency information by reading the position of the user's pupils. By reducing the restrictions of the capture process, we can provide the database with more annotated images, which will lead to better results from the data-driven algorithms. Finally, we propose to annotate an existing database in order to be able to use additional information about the images, such as annotated objects or object segmentations.

Furthermore, the high-level features can also be improved by providing more information about the relations between classes. For instance, providing a hierarchy of saliency scores between the different objects present in the image can help to understand the relationships between the image's objects.


Conclusions

In this work we have presented two main lines of research, with contributions on each side. Although the two research lines are conceptually far apart, they share common conclusions. First and foremost, the representational power of CNN features can make the difference in new computer vision systems, being applicable in multiple research areas, as shown in this work. Furthermore, data collection is a key point for further improvement of the state of the art: the need for new databases with more relevant information is present in different areas of the computer vision world. Finally, the collection of this new data is also moving from typical collection strategies to more innovative techniques. DetectMe is the best example of this new way of collecting data: by moving the ability to train detectors to mobile devices, we get many more data collectors. On the other side, by having the saliency collection run on everyday computers with just a webcam, we will also have many more subjects to help with the saliency data collection.

In chapter 4, following the current trend of bringing computer vision to mobile devices, we posed a question: can we train a good detector with only one picture? To answer it, we performed a large test: using 50 categories from the ImageNet database, we trained one detector per training sample and then evaluated it on the test set. In this way, we obtained results for these 50 categories, which are indicative of what could be done with a good choice of training example. The results show that, by using CNN features, we are able to obtain detectors trained with one example whose performance is close to, and in many categories above, that of detectors trained with all the data. On one side, this shows the high representational power of CNN features; on the other, it points out another research line, moving CNNs to mobile devices.

Furthermore, some other issues related to training detectors with only one example have been analysed. The test-set viewpoint bias affected the results of the experiments by biasing the top-scored training images. In order to remove this bias, a mixture strategy was proposed: the use of mixture detectors where each component detector is trained with only one instance. To perform these experiments properly, we would need a new database; this is why we propose the creation of an object-centric database where each object is annotated from all the different points of view. With it, many experiments could be performed around a new question: which is the best physical example to train a mixture detector?

Finally, other strategies to find good detectors trained with only one instance have been proposed. The gamification of the training process is proposed in section 4.3, where DetectMe plays a key role in expanding the game to a large number of players. Multiple game alternatives are also discussed: the typical competitive approach was tested with a beta version of the game, but some problems were detected. By being completely competitive, a lot of information is ultimately not taken into account as a result of the experiment: the efforts of the players who do not score well are simply forgotten. To overcome this problem, another game is proposed: a cooperative-competitive game where everybody works together to build the best possible detector while playing against each other to be the one who adds more value to the detector. The details of this second game are still open, but an interesting research line has been started.


In chapter 5, the state of the art in saliency estimation is presented to motivate the new method proposed in this work. Our new model uses the power provided by CNN features to explain low-level attention, while a multilayered saliency model is proposed to capture the different levels of attention involved in looking at an image. The model outperforms the state-of-the-art models by a significant margin when evaluated on the largest saliency dataset available at this time, the MIT 1003.

Some future work is also proposed to improve the saliency results. The small amount of accessible saliency information creates a practical barrier when training the model, given the large amount of data needed to train Convolutional Neural Networks. For this reason, a new idea arises: the annotation of an existing large dataset with a new saliency annotation technique. In order to reduce the cost and time of the annotation process, we propose a saliency experiment conducted with a regular computer webcam, where subjects can run, at home, a saliency experiment that is nowadays performed in laboratories with expensive eye-trackers. This would allow a large increase in the saliency information available for research and, therefore, better performance of the algorithms.

To sum up, this work proposes and evaluates some new techniques but, foremost, points out new potential research lines that in the coming years can become innovative ways to push the boundaries of the field.


Bibliography

[1] Jacob Abernethy and Rafael M. Frongillo. A Collaborative Mechanism for Crowdsourcing Prediction Problems. In Proceedings of Advances in Neural Information Processing Systems 24, pages 1–9, 2011.

[2] J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search. arXiv preprint arXiv:1209.5111, pages 1–11, 2012.

[3] C. Cortes and V. Vapnik. Support-Vector Networks. Machine Learning, 20:273–297, 1995.

[4] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, volume I, pages 886–893, 2005.

[5] Pedro Felzenszwalb, David McAllester, and Deva Ramanan. A discriminatively trained, multiscale, deformable part model. In 26th IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2008.

[6] K. Fukushima. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36:193–202, 1980.

[7] Wilson S. Geisler and Jeffrey S. Perry. A real-time foveated multiresolution system for low-bandwidth video communication. SPIE Proceedings, 3299:294–305, 1998.

[8] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR'14, pages 2–9, 2014.

[9] Firas Khatib, Frank DiMaio, Seth Cooper, Maciej Kazmierczyk, Miroslaw Gilski, Szymon Krzywda, Helena Zabranska, Iva Pichova, James Thompson, Zoran Popovic, Mariusz Jaskolski, and David Baker. Crystal structure of a monomeric retroviral protease solved by protein folding game players, 2012.

[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, pages 1–9, 2012.

[11] Agata Lapedriza, Zoya Bylinskii, Hamed Pirsiavash, and Antonio Torralba. Are all training examples equally valuable? arXiv preprint arXiv: . . . , 2013.

[12] O. Matan, J. C. Burges, Y. LeCun, and J. S. Denker. Multi-digit recognition using a space displacement neural network. In Proc. NIPS, pages 488–495, 1992.

[13] Josep Marc Mingot, Adria Recasens, Dolores Blanco, and Antonio Torralba. DetectMe: the open object detector, 2014.

[14] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145–175, 2001.

[15] Mustafa Ozuysal, Vincent Lepetit, and Pascal Fua. Pose estimation for category specific multiview object localization. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2009, pages 778–785, 2009.

[16] Stephen E. Palmer, Jonathan S. Gardner, and Thomas D. Wickens. Aesthetic issues in spatial composition: effects of position and direction on framing single objects. Spatial Vision, 21:421–449, 2008.

[17] Eleanor Rosch and Barbara B. Lloyd. Cognition and Categorization. 1978.

[18] Ruth Rosenholtz. A simple saliency model predicts a number of motion popout phenomena. Vision Research, 39:3157–3163, 1999.

[19] Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77:157–173, 2008.

[20] E. P. Simoncelli and W. T. Freeman. The steerable pyramid: a flexible architecture for multi-scale derivative computation. Proceedings, International Conference on Image Processing, 3, 1995.

[21] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In CVPR, 2014.

[22] Tilke Judd, Krista Ehinger, Frédo Durand, and Antonio Torralba. Learning to predict where humans look. In Proceedings of the IEEE International Conference on Computer Vision, pages 2106–2113, 2009.

[23] Eleonora Vig, Michael Dorr, and David Cox. Large-Scale Optimization of Hierarchical Features for Saliency Prediction in Natural Images. In CVPR, 2014.

[24] Paul Viola and Michael Jones. Robust real-time object detection. International Journal of Computer Vision, 57:137–154, 2001.

[25] Zhou Wang, Ligang Lu, and Alan Conrad Bovik. Foveation scalable video coding with automatic fixation selection. IEEE Transactions on Image Processing, 12:243–254, 2003.

[26] Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2014.

[27] Jianming Zhang and Stan Sclaroff. Saliency Detection: A Boolean Map Approach. In 2013 IEEE International Conference on Computer Vision, pages 153–160, 2013.