Self-supervised Robust Object Detectors from Partially Labelled Datasets
Mahdieh Abbasi and Denis Laurendeau
Department of Electrical and Computer Engineering
Universite Laval, Quebec, Canada
[email protected]

Christian Gagne
Mila, Canada CIFAR AI Chair, Quebec, Canada
Abstract—In the object detection task, merging various datasets from similar contexts but with different sets of Objects of Interest (OoI) is an inexpensive way (in terms of labor cost) of crafting a large-scale dataset covering a wide range of objects. Moreover, merging datasets allows us to train one integrated object detector instead of several, which in turn reduces computational and time costs. However, merging datasets from similar contexts yields samples with partial labeling, since each constituent dataset is originally annotated only for its own set of OoI and ignores objects that become of interest after the merge. With the goal of training one integrated, robust object detector with high generalization performance, we propose a training framework that overcomes the missing-label challenge of merged datasets. More specifically, we propose a computationally efficient self-supervised framework that creates on-the-fly pseudo-labels for the Unlabeled Positive Instances (UPIs) in the merged dataset, so that the object detector can be trained jointly on both ground truths and pseudo-labels. We evaluate our proposed framework by training YOLO on a simulated merged dataset with a missing-label rate of ≈48%, built from VOC2012 and VOC2007. We empirically show that the generalization performance of YOLO trained on both the ground truths and the pseudo-labels created by our method is 4% (on average) higher than that of YOLO trained only on the ground-truth labels of the merged dataset.
1. Introduction
Modern CNN-based object detectors such as Faster R-CNN [1] and YOLO [2] achieve remarkable performance when trained on fully labeled large-scale datasets, which include both instance-level annotations (i.e., bounding boxes around each object of interest) and image-level labels (i.e., the category of the object enclosed in a bounding box). On the one hand, collecting a dataset with full annotations, especially bounding boxes, can be a tedious and costly process. On the other hand, object detectors such as R-CNN and YOLO depend on access to such fully labeled datasets. In other words, they suffer a drop in generalization performance when trained on partially labeled datasets (i.e., datasets con-
[Figure 1 diagram: a partially labeled input with UPIs → YOLO → candidate RoIs → pre-processing step → proxy network → pseudo-label generation.]
Figure 1. Schematic explanation of our proposal for generating pseudo-labels in a merged dataset. For a given input I with some UPIs (Unlabeled Positive Instances), the bounding boxes (RoIs) estimated by YOLO at training epoch e (i.e., f_e(I)) are extracted for a pre-processing step, i.e., to prepare them for the proxy network. Using the proxy network's estimations for the given RoIs, we create pseudo-labels for the UPIs, allowing YOLO to be trained jointly on the pseudo-labels and the ground truths of the given input.
taining instances with missing labels) [3], [4]. Datasets with missing-label instances can arise in several situations, including unintentional errors in the annotation process, a partial-labeling policy (which we explain later), and merged datasets. By merged datasets, we mean combining several datasets from similar (or the same) contexts but with disjoint (or partially disjoint) sets of Objects of Interest (OoIs), e.g. [5], in order to construct a larger dataset covering a wider range of objects, possibly with more variation in their capture and nature (e.g., objects of different poses, illuminations, styles, and physical properties). For instance, Kitti [6] and German Traffic Signs [7] are datasets with two disjoint sets of OoIs that could be merged to cover a wider spectrum of the objects appearing on roads.
Such merged datasets can facilitate the training of an integrated object detector, which in turn can potentially lead to a significant reduction of time and computational cost. Training and inferring with a unified object detector on a merged dataset is more effective in terms of memory and computational resources than training several object detectors, one for each constituent dataset. This is
arXiv:2005.11549v2 [cs.CV] 27 Jun 2020
especially appealing for embedded devices with limited computational resources (e.g., self-driving cars), as they need to make inference decisions in real time. In addition, training a unified model circumvents the need to combine decisions made by various models, which can be tricky and lead to sub-optimal solutions. Finally, merging datasets and training a unified object detector on the result can pave the path toward the development of a universal object detector (e.g., [8]). Despite the great potential of dataset merging for reducing computational cost and annotation burden, it unfortunately results in missing-label instances, since some OoIs in one dataset may not be labeled in the other datasets.
Many modern object detectors trained on a partially labeled dataset, e.g., a merged dataset, show inferior generalization performance compared to those trained on fully labeled ones [3], [9]. Regardless of the type of object detector, the small number of labeled instances in a partially labeled dataset is one reason for such performance degradation. The other reason is rooted in the false negative training signals arising from the Unlabeled Positive Instances (UPIs). Inspired by [3], we elaborate in Sec. 3 on how these UPIs can mislead the training of an object detector, particularly YOLO.
To augment the effective training size of such partially labeled datasets, Weakly Supervised Learning (WSL) methods [4], [9], [10], [11] have been proposed to generate pseudo-labels for some UPIs by leveraging the image-level labels, which are only available in datasets annotated under a "partial annotation policy". To reduce annotation cost, this policy annotates only one instance of each object present in a given image, leaving the remaining RoIs of the same object category unlabeled. Although this policy creates a dataset with some missing instance-level labels (i.e., bounding-box annotations), it ensures that all images have their true image-level labels (i.e., object categories). Unfortunately, such WSL methods cannot simply be employed for merged datasets to mitigate missing-label instances, since in such datasets both the image-level and the instance-level annotations are missing.
To mitigate the performance degradation of Faster R-CNN trained on partially labeled datasets (e.g., OpenImages v3 [12], which is labeled under the "partial annotation policy"), Wu et al. [3] propose to ignore the false training signals arising from UPIs (i.e., false negatives). To this end, they discard the gradients created by the RoIs that have small or no overlap with any ground truth. Although this simple approach can remove the false negative training signals caused by UPIs, correcting these signals, instead of ignoring them, can further improve generalization performance, particularly for merged datasets. In other words, to benefit from the differences in the appearance of objects in the merged dataset, and to obtain a well-generalized unified object detector, it should preferably be trained on all of the positive instances, both the labeled ones and the unlabeled ones (UPIs). In [5], the authors proposed to generate a set of pseudo-labels for the UPIs in the merged dataset by using several different object detectors, each trained separately on an individual
dataset within the merged one. Finally, another unified object detector is trained on the offline set of pseudo-labels and the ground truths. However, generating such an offline set of pseudo-labels is computationally expensive (in terms of time, memory, and GPU), as both training and label inference with several distinct object detectors create a computational burden.
In this paper, we aim at enhancing the generalization performance of an object detector trained on a merged dataset by augmenting the dataset with on-the-fly (online) pseudo-labels for some UPIs. For this purpose, we propose a computationally inexpensive and general framework for training a single detector (e.g., YOLO) while simultaneously creating pseudo-labels for some UPIs. Fig. 1 illustrates the pipeline of our proposed method. We deploy a pre-trained proxy CNN for flagging which of YOLO's predicted bounding boxes contain UPIs, then generate "object" and "class" pseudo-labels for them (Alg. 1). In other words, if the proxy network classifies such a box as one of the pre-defined object classes (OoIs), a pseudo-label is created and included in the training of the object detector; otherwise, the box is discarded from contributing to the training. Inspired by [13], [14], we use a CNN with an explicit rejection option as the proxy network, in order to either classify a given RoI into one of the pre-defined classes or reject it as a not-of-interest object.
2. Background
YOLO divides a given image I into g × g grids; then, for each grid G_ij, it estimates A different bounding boxes, each of which is a (5 + K)-dimensional vector encompassing the estimated coordinate information of the box, i.e. r^a_{G_ij} = [x^a_{G_ij}, y^a_{G_ij}, w^a_{G_ij}, h^a_{G_ij}], the objectness probability p(O|r^a_{G_ij}), and a K-dimensional vector of probabilities over the K object categories, i.e. p(c|r^a_{G_ij}) ∈ [0, 1]^K, with a ∈ {1, ..., A}. Therefore, the output of YOLO is a tensor of size [g, g, A, 5 + K] (Fig. 2). Moreover, for each grid G_ij, a set of pre-defined bounding boxes (called anchors) with different aspect ratios and scales is considered; YOLO learns to estimate the bounding boxes w.r.t. these pre-defined anchors.
Figure 2. The output of YOLO is a tensor of size [g, g, A, 5 + K].
3. The Impact of Missing-label Instances on Performance
As stated earlier, missing-label instances (UPIs) can cause false negative signals during training. In the following, we demonstrate how missing-label instances can negatively impact the performance (i.e., mean average precision) of object detectors, particularly YOLO and Faster R-CNN.
YOLO computes the "object" loss for all anchors of all grids, whether or not they contain any ground truth. In other words, if an anchor has a large IoU overlap with a ground truth (e.g. IoU(anchor, ground truth) > θ), its true "object" label is one; otherwise, it is zero. This can cause false negative signals from UPIs. More specifically, during training, YOLO may correctly localize a UPI, so that the objectness probability of its corresponding anchor is p(O) ≈ 1; but since the UPI has no associated ground-truth label, the anchor is given a true "object" label of zero, i.e. t = 0. This forces the network to learn it as a negative (not-of-interest) object, even though the network can correctly localize and recognize such a UPI (Unlabeled Positive Instance). Ultimately, such false negative signals from UPIs can confuse the network: the labeled positive instances (LPIs) of an object category push the network to learn it as an object of interest, while the UPIs of the same category push the network to learn it as a negative object. Therefore, such false negative signals can cause a drop in the performance of YOLO. Note that the "class" and "coordinate" losses are ignored for anchors that have small IoU overlap with every ground truth, i.e. IoU ≤ θ; thus, UPIs cannot contribute to training through their "class" and "coordinate" losses. The loss functions of YOLO are defined in Appendix A.
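The labeling rule above can be sketched in numpy; the threshold and box coordinates below are made-up values for illustration. The second anchor sits exactly on an unlabeled object, yet still receives an "object" target of zero:

```python
import numpy as np

def iou(box, others):
    """IoU between one box [x1, y1, x2, y2] and an array of boxes."""
    x1 = np.maximum(box[0], others[:, 0]); y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2]); y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(others) - inter)

theta = 0.5
# Ground truths of the partially labeled image: only one object is annotated.
gts = np.array([[0., 0., 10., 10.]])
# Two anchors: one overlaps the labeled object, one covers a UPI
# (an object present in the image but missing from the labels).
anchors = np.array([[1., 1., 10., 10.],     # overlaps the ground truth
                    [40., 40., 60., 60.]])  # covers the UPI
# YOLO's "object" target: 1 iff the anchor overlaps some ground truth.
t = np.array([1.0 if iou(a, gts).max() > theta else 0.0 for a in anchors])
```

Even if the network correctly predicts p(O) ≈ 1 for the second anchor, its target t = 0 yields a false negative gradient.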
Similarly, in Faster R-CNN, a UPI can penalize the network incorrectly if its anchor has an IoU (with every ground truth) smaller than a given threshold, e.g. θ1 = 0.3. More precisely, the true objectness label t of an anchor covering a UPI is set to zero when its IoU overlap is small (< θ1). Then, although the RPN may correctly localize the UPI as a positive instance (i.e. p(O) ≈ 1), the "object" loss incorrectly penalizes the RPN by forcing it to learn the UPI as a negative instance. Thus, such false negative signals from UPIs can interfere with the true positive signals from LPIs, leading to a drop in the performance of Faster R-CNN. Interestingly, however, in Faster R-CNN as described in [1], if an anchor covering a UPI has neither a high IoU overlap (e.g. > θ2 = 0.7) nor a small one (e.g. < θ1 = 0.3), then the anchor, and its probable corresponding UPI, is excluded from training. Compared to YOLO, this simple condition in Faster R-CNN may reduce the probability of false negative signals from UPIs (those with θ1 ≤ IoU ≤ θ2).
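The three-zone anchor labeling of the RPN can be sketched as follows; the IoU values are made up, and -1 denotes the ignored zone described above:

```python
import numpy as np

def rpn_objectness_labels(max_ious, lo=0.3, hi=0.7):
    """Three-zone anchor labels in the spirit of Faster R-CNN's RPN:
    1 = positive, 0 = negative, -1 = ignored (excluded from the loss)."""
    labels = np.full_like(max_ious, -1.0)   # default: ignore zone
    labels[max_ious < lo] = 0.0             # confident background -> negative
    labels[max_ious > hi] = 1.0             # confident match -> positive
    return labels

# Max IoU of four anchors with any ground truth. Anchors at 0.5 and 0.35
# fall into the ignore zone, so a UPI covered by one of them produces no
# false negative signal.
labels = rpn_objectness_labels(np.array([0.9, 0.5, 0.35, 0.1]))
```

Only anchors below θ1 can still turn a UPI into a false negative, which is why the ignore zone merely reduces, rather than eliminates, the problem.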
Consequently, to mitigate the performance degradation of these object detectors, we need to reduce the false negative signals created by their "object" loss. One possible way to alleviate this issue is to discard these false negative signals, as in [3]. However, one can further improve performance if these false negative signals are corrected by generating pseudo-labels for them. This not only reduces the number of false negative signals but also increases the number of labeled instances, which together can enhance the performance of the object detector.
4. Proposed Method
We introduce our framework for handling missing-label instances when the underlying object detector is YOLO; however, it can also be adapted for Faster R-CNN. During the training of YOLO, it is likely that some existing UPIs are localized correctly. However, due to the lack of ground-truth labels, they may contribute adversely to the training of YOLO, inducing a drop in performance. We propose to generate pseudo-labels for them. The RoIs estimated by YOLO at training epoch e are evaluated to check whether they actually contain an unlabeled positive object. To achieve this, our framework incorporates a pre-trained proxy network [13], denoted by h(·), into the training process of YOLO. The proxy network maps each currently estimated RoI r ∈ R_e of a given image I (denoted by I_r) into a (K+1)-dimensional vector of probabilities over K+1 classes, i.e. h(I_r) ∈ [0, 1]^{K+1}, where {1, ..., K} are the classes of the K positive objects (OoIs) and the (K+1)-th (extra) class covers any not-of-interest (negative) objects. Note that, to enable h to process RoIs with different aspect ratios, we add a Spatial Pyramid Pooling (SPP) layer [15] after the proxy's last convolution layer.
To obtain this trained proxy, we can leverage readily accessible datasets that contain samples of not-of-interest objects (we call them Out-of-Distribution, OoD, samples) along with labeled samples containing OoIs (a.k.a. in-distribution samples). Recently, promising results of OoD training have been reported for developing robust object recognition classifiers [13], [14], [16] and semantic segmentation models [17], as well as for overcoming catastrophic forgetting [18].
Using the coordinate information provided by YOLO, each estimated RoI r is extracted from the image I. To avoid re-labeling RoIs containing a ground truth, only those that have small or no overlap with any of the ground-truth annotations are processed (lines 6–8 of Algorithm 1). Before being fed to the proxy network, these extracted RoIs are pre-processed by the following procedure.
4.1. Pre-processing Step
To allow h(·) to process the RoIs in mini-batches, we perform the following pre-processing step. Training h with mini-batch SGD on input samples of different aspect-ratio sizes is challenging, since libraries such as PyTorch do not allow samples of various sizes to be stacked in one batch. To address this, one could pad all inputs to the largest aspect-ratio size in the batch, but this can destroy the information of the smallest inputs (since these images can be dominated by an
Algorithm 1 Pseudo-label Generation Algorithm
Input: f_e(·), the object detector at training epoch e; h(·), the pre-trained proxy network; I, a given input image with its associated ground-truth bounding boxes R* (i.e., their coordinate information); θ1, θ2, and β as hyper-parameters.
Output: S_e, the pseudo-labels of I at epoch e

 1: S_e = ∅
 2: [[r_1^e, p(O|r_1^e), p(c|r_1^e)], ..., [r_{Ag²}^e, p(O|r_{Ag²}^e), p(c|r_{Ag²}^e)]] = f_e(I)
 3: R_e = {r_1^e, ..., r_{Ag²}^e}
 4: P_e = {p(c|r_1^e), ..., p(c|r_{Ag²}^e)}
 5: B = ∅
 6: for r ∈ R_e
 7:     if IoU(r, R*) ≤ θ1        ▷ skip pseudo-label generation for estimated RoIs with a large IoU overlap with a ground truth from R*
 8:         B ← B ∪ {r}
 9: B ← pre-processing step(B)
10: for r ∈ B
11:     {I_{r_1}, ..., I_{r_m}} = patch-drop(I_r)        ▷ create m copies of the RoI extracted by r, i.e. I_r
12:     h̄(I_r) = (1/(m+1)) (h(I_r) + Σ_{i=1}^{m} h(I_{r_i}))
13:     if arg max h̄(I_r) ≠ K+1  and  max_{1,...,K} h̄(I_r) ≥ θ2
14:         p̂(c|r) = β · p(c|r) + (1 − β) · h̄(I_r)      ▷ p(c|r) ∈ P_e is YOLO's class estimate for the given r
15:         p̂(O|r) = max_{1,...,K} h̄(I_r)
16:         S_e ← S_e ∪ {[r, p̂(c|r), p̂(O|r)]}
extremely large pad of zeros). To tackle this, in each training epoch of h, we load samples with similar (close) aspect ratios in one batch and pad them with zeros, if needed, so as to obtain a batch of samples with equal aspect-ratio size. To implement this, all training samples are clustered by their widths and heights using the k-means method. The centers of these clusters then serve as the pre-defined aspect ratios according to which the batches are loaded. Accordingly, at test time of h, the pre-processing step pads each input instance of h (i.e., each RoI) with zeros, if needed, so that its size matches its nearest cluster center (line 9 of Algorithm 1).
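The clustering-and-padding step can be sketched as follows; the tiny k-means routine and the size statistics are illustrative assumptions, not the exact implementation used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(points, k, iters=20):
    """Tiny k-means over (width, height) pairs; the centers become the
    pre-defined aspect-ratio sizes used to group RoIs into batches."""
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = points[assign == j].mean(0)
    return centers

def pad_to_nearest_center(roi, centers):
    """Zero-pad an RoI (H x W x 3) up to the nearest center that can contain it."""
    h, w = roi.shape[:2]
    ok = centers[(centers[:, 0] >= w) & (centers[:, 1] >= h)]
    cw, ch = ok[np.argmin(((ok - [w, h]) ** 2).sum(-1))]
    out = np.zeros((int(ch), int(cw), 3), dtype=roi.dtype)
    out[:h, :w] = roi          # original content in the top-left corner
    return out

# Hypothetical (width, height) statistics: two natural size groups.
sizes = np.vstack([rng.normal([32, 32], 3, (50, 2)),
                   rng.normal([96, 48], 3, (50, 2))])
centers = np.ceil(kmeans(sizes, k=2))
padded = pad_to_nearest_center(np.ones((30, 30, 3)), centers)
```

A 30x30 RoI is padded only up to the nearby ~32x32 center rather than to the largest size in the dataset, so small inputs are not drowned in zeros.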
4.2. Pseudo-label Generation
Inspired by [19], we apply patch-drop at the test time of h in order to estimate the true class of a given RoI more accurately. In patch-drop, the given RoI is divided into s × s patches, and one of them is dropped at random to create a new version of the RoI. In our experiments, we apply patch-drop with s = 3, m = 2 times on a given RoI to create m versions of I_r, i.e. {I_{r_1}, ..., I_{r_m}} (line 11 in Alg. 1). We then feed them, as well as the original RoI I_r, to the proxy network and estimate the probability over the K+1 classes as follows:
h̄(I_r) = (1/(m+1)) ( h(I_r) + Σ_{i=1}^{m} h(I_{r_i}) ).    (1)
This trick leads to better-calibrated confidence predictions, especially for hard-to-classify RoIs, as the proxy network h may predict each version I_{r_i} differently (assigning different classes). This allows us to reduce the number of false positive instances and thus to create more accurate pseudo-labels. Using a threshold on the predictive confidence h̄(·) (i.e., θ2 in the algorithm), the RoIs with low-confidence predictions are dropped from the pseudo-label generation procedure. If the proxy network confidently classifies the given RoI into one of the K classes, its pseudo class probability p̂(c|r) is computed as follows:
p̂(c|r) = β · p(c|r) + (1 − β) · h̄(I_r),    (2)

where p(c|r), h̄(I_r) ∈ [0, 1]^K are, respectively, the class probabilities estimated by YOLO at training epoch e and by the proxy network h for the given RoI I_r. Finally, we set the object probability for the given RoI r to p̂(O|r) = max_{k=1,...,K} h̄(I_r).
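The per-RoI filtering and pseudo-label computation (lines 6–16 of Algorithm 1) can be sketched as follows; the stub proxy `h_stub`, the IoU values, and the hyper-parameter values are illustrative assumptions, with Eq. 2 applied to the first K entries of the averaged proxy output:

```python
import numpy as np

K, theta1, theta2, beta = 3, 0.5, 0.8, 0.0   # hyper-parameters, as in Sec. 5

def h_stub(roi):
    """Stand-in for the averaged proxy prediction h̄(I_r): a fixed
    (K+1)-way probability vector that is confident about class 0."""
    return np.array([0.9, 0.05, 0.03, 0.02])

def generate_pseudo_labels(rois, yolo_cls, ious_with_gt, extract_roi):
    """Sketch of lines 6-16 of Algorithm 1 for one image."""
    pseudo = []
    for r, p_c, iou in zip(rois, yolo_cls, ious_with_gt):
        if iou > theta1:                 # RoI already explained by a ground truth
            continue
        h_bar = h_stub(extract_roi(r))   # averaged proxy prediction (Eq. 1)
        if np.argmax(h_bar) != K and h_bar[:K].max() >= theta2:
            p_hat = beta * p_c + (1 - beta) * h_bar[:K]   # class pseudo-label (Eq. 2)
            pseudo.append((r, p_hat, h_bar[:K].max()))    # [r, p̂(c|r), p̂(O|r)]
    return pseudo

rois = [np.zeros(4), np.zeros(4)]
yolo_cls = [np.full(K, 1 / K), np.full(K, 1 / K)]
# First RoI overlaps a ground truth (IoU 0.9), the second is a candidate UPI.
pseudo = generate_pseudo_labels(rois, yolo_cls, [0.9, 0.1], lambda r: r)
```

Only the candidate UPI survives the filter and receives a pseudo-label; an RoI the proxy would assign to class K+1, or classify below θ2, would be discarded.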
To compute the loss for the pseudo class label, we use the KL divergence between the "class" pseudo-label p̂(c|r) and its estimate p(c|r) by YOLO. Similarly, the "object" loss for the pseudo "object" label p̂(O|r) is computed with a binary cross-entropy. Finally, these two new losses for the pseudo-labels, i.e. KL(p̂(c|r) || p(c|r)) and BCE(p̂(O|r), p(O|r)), are added to the conventional loss functions of YOLO, which are defined in Appendix A.
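The two pseudo-label losses can be sketched as plain numpy functions; the probability vectors below are made-up examples:

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """KL(p || q) between the class pseudo-label p and YOLO's estimate q."""
    p, q = np.clip(p, eps, 1), np.clip(q, eps, 1)
    return float((p * np.log(p / q)).sum())

def bce(t, p, eps=1e-12):
    """Binary cross-entropy between the "object" pseudo-label t and p(O|r)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(t * np.log(p) + (1 - t) * np.log(1 - p)))

p_pseudo = np.array([0.8, 0.1, 0.1])   # pseudo class label for an RoI
p_yolo = np.array([0.8, 0.1, 0.1])     # YOLO's current class estimate
```

When YOLO's prediction already matches the pseudo-label, the KL term vanishes, so confident, correct UPI detections are no longer penalized.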
Figure 3. Violet bounding boxes are the pseudo-labels generated by our method during the training of YOLO, while the green bounding boxes are the ground-truth labels in dataset D′_S (i.e., the dataset merged from VOC2007 and VOC2012 with disjoint sets of classes).
5. Experiments
To simulate a merged dataset, we create two datasets with two disjoint sets of classes: from VOC2007 with S_A = {cat, cow, dog, horse, train, sheep}, and from VOC2012 with S_B = {car, motorcycle, bicycle, aeroplane, bus, person}. One dataset, called D_{S_A}, gathers the samples from VOC2007 that contain one of the objects of interest in S_A (dropping the annotations from the other set of classes S_B, if any appear in D_{S_A}). Similarly, another dataset D_{S_B} is made of the images from VOC2012 containing one of the objects in S_B. These two datasets are then merged to produce a merged dataset D′_S = D_{S_A} ∪ D_{S_B} with total classes S = S_A ∪ S_B. In addition, a fully labeled dataset D_S is formed from the union of VOC2007 and VOC2012, where all the instances belonging to S are fully annotated. The missing-label rate of D′_S (the merged dataset) with respect to D_S is 48%.
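The dataset construction above can be sketched with a toy example; the class names, images, and counts here are made up, and serve only to show how dropping each split's out-of-set annotations produces a missing-label rate:

```python
# Two hypothetical disjoint OoI sets, one per split.
S_A = {"cat", "dog"}
S_B = {"car", "person"}

# Each image: list of (class, box) annotations in the fully labeled dataset D_S.
full = {
    "img1": [("cat", None), ("car", None)],     # from the VOC2007-like split
    "img2": [("person", None), ("dog", None)],  # from the VOC2012-like split
}
keep = {"img1": S_A, "img2": S_B}   # each split keeps only its own OoI set

# The merged dataset D'_S: annotations outside each split's set are dropped,
# turning those objects into UPIs.
merged = {im: [a for a in anns if a[0] in keep[im]] for im, anns in full.items()}

n_full = sum(len(v) for v in full.values())
n_kept = sum(len(v) for v in merged.values())
missing_rate = 1 - n_kept / n_full
```

In this toy case half the instances lose their labels; the paper's simulated merge reaches a comparable rate of 48%.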
Object [email protected] Ours Upper-bound
Cat 74.79 77.2 82.04Cow 48.27 55.6 69.70Dog 52.71 62.0 78.70
Horse 18.68 23.7 82.51Train 58.36 57.7 79.18Sheep 57.77 65.1 72.45
Car 77.67 78.3 83.87Motorbike 68.23 72.4 79.82
Bicycle 69.98 72.1 79.00Aeroplane 59.96 62.6 71.29
Bus 65.26 71.2 78.83Person 71.32 72.0 78.30
Avg 60.25 64.2 77.97TABLE 1. PERFORMANCE (I.E. MAP) OF DIFFERENT YOLOS ON THE
TEST SET OF VOC2007 WITH FULLY LABELED INSTANCES FROMCLASSES S = SA ∪ SB . BASELINE IS THE TRAINED YOLO ON THE
MERGED DATASET (VOC2007+VOC2012) WITH MISSING-LABELINSTANCES (D′
S ), OURS IS YOLO TRAINED ON THE AUGMENTEDDATASET D′
S WITH OUR GENERATED PSEUDO-LABELS, AND THEUPPER-BOUND IS THE YOLO TRAINED ON VOC2007+VOC2012 WITH
FULLY ANNOTATED INSTANCES (DS ).
As the proxy network, we adopt ResNet20 [20], placing an SPP (Spatial Pyramid Pooling) layer after its last convolution layer to enable it to process inputs of various aspect-ratio sizes. To train this network, we use the MSCOCO [21] training set, extracting all the ground-truth bounding boxes belonging to one of the classes in S = S_A ∪ S_B; all other ground-truth bounding boxes, not belonging to S, are used as OoD samples (labeled as class K+1). The hyper-parameters of our algorithm are set to β = 0 (in Eq. 2), θ1 = 0.5 (to remove RoIs having a large overlap with a ground truth, lines 6–8 of the algorithm), and θ2 = 0.8 (the threshold on the prediction confidence of the proxy network for the given RoIs).
In Fig. 3, we show the pseudo-labels generated by our proposed method for some UPIs in D′_S. In Table 1, we compare the [email protected] of three YOLOs, trained respectively on D′_S (Baseline), on D′_S augmented with our pseudo-labels (Ours), and on the fully labeled dataset D_S (Upper-bound). As can be seen, training YOLO on D′_S (with a 48% missing-label rate) leads to a drop of ≈17% in [email protected] compared to the same YOLO trained on the fully labeled dataset D_S. Ours enhances the mAP of YOLO trained on the merged dataset D′_S by 4% (on average) by augmenting D′_S with pseudo-labels for some of the UPIs, thereby eliminating their false negative signals.
6. Conclusion
With the goal of training an integrated object detector able to detect a wide range of OoIs, one can merge several datasets from similar contexts but with different sets of OoIs. While merging multiple datasets to train an integrated object detector has promising potential, from reducing computational and labeling costs to enjoying a wider spectrum of variations (suitable for domain shift), the many missing-label instances (Unlabeled Positive Instances) in the merged dataset cause a performance degradation. To address this issue, we propose a general training framework for training an object detector (e.g., YOLO) on the merged dataset while simultaneously generating on-the-fly pseudo-labels for UPIs. Using a pre-trained proxy neural network, we generate a pseudo-label for each estimated RoI that the proxy network confidently classifies as one of its pre-defined classes of interest; otherwise, we exclude the RoI from contributing to the training of the object detector. On a merged dataset simulated from VOC2007 and VOC2012, we empirically show that YOLO trained with our framework achieves higher generalization performance than YOLO trained on the original merged dataset (with missing labels). This improvement is the result of augmenting the merged dataset with our generated pseudo-labels for UPIs.
References
[1] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[3] Z. Wu, N. Bodla, B. Singh, M. Najibi, R. Chellappa, and L. S. Davis, "Soft sampling for robust object detection," arXiv preprint arXiv:1806.06986, 2018.
[4] Y. Zhang, Y. Bai, M. Ding, Y. Li, and B. Ghanem, "Weakly-supervised object detection via mining pseudo ground truth bounding-boxes," Pattern Recognition, vol. 84, pp. 68–81, 2018.
[5] A. Rame, E. Garreau, H. Ben-Younes, and C. Ollion, "OMNIA Faster R-CNN: Detection in the wild through dataset merging and soft distillation," arXiv preprint arXiv:1812.02611, 2018.
[6] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[7] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel, "Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark," in International Joint Conference on Neural Networks, no. 1288, 2013.
[8] X. Wang, Z. Cai, D. Gao, and N. Vasconcelos, "Towards universal object detection by domain attention," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7289–7298.
[9] M. Xu, Y. Bai, B. Ghanem, B. Liu, Y. Gao, N. Guo, X. Ye, F. Wan, H. You, D. Fan et al., "Missing labels in object detection," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019.
[10] H. Bilen and A. Vedaldi, "Weakly supervised deep detection networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2846–2854.
[11] A. Diba, V. Sharma, A. Pazandeh, H. Pirsiavash, and L. Van Gool, "Weakly supervised cascaded convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 914–922.
[12] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy, "OpenImages: A public dataset for large-scale multi-label and multi-class image classification," 2017.
[13] M. Abbasi, C. Shui, A. Rajabi, C. Gagne, and R. Bobba, "Toward metrics for differentiating out-of-distribution sets," in NeurIPS Workshop on Safety and Robustness in Decision Making, 2019.
[14] D. Hendrycks, M. Mazeika, and T. G. Dietterich, "Deep anomaly detection with outlier exposure," in International Conference on Learning Representations (ICLR), 2019.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[16] A. Meinke and M. Hein, "Towards neural networks that provably know when they don't know," in International Conference on Learning Representations (ICLR), 2020.
[17] P. Bevandic, I. Kreso, M. Orsic, and S. Segvic, "Discriminative out-of-distribution detection for semantic segmentation," arXiv preprint arXiv:1808.07703, 2018.
[18] K. Lee, K. Lee, J. Shin, and H. Lee, "Overcoming catastrophic forgetting with unlabeled data in the wild," in ICCV, 2019.
[19] K. K. Singh and Y. J. Lee, "Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization," in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3544–3553.
[20] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[22] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
Appendix A
Loss function of YOLO
Each training sample is (I_i, {t*_{1i}, ..., t*_{Ji}}), where I_i is an input image and t*_{ji} is the j-th ground-truth bounding box associated with the i-th image (i ∈ {1, ..., N}). Each ground-truth bounding box is t*_{ji} = [r*_{ji}, k*_{ji}], with r*_{ji} = [x*_{ji}, y*_{ji}, w*_{ji}, h*_{ji}] and k*_{ji} ∈ {1, ..., K} the object category. The coordinates of the center of the j-th ground truth and its width and height w.r.t. the image are x*_{ji}, y*_{ji}, w*_{ji}, h*_{ji} ∈ [0, 1], respectively. From now on, we drop these indices from the ground truths and their estimates for simplicity.
Contrary to the coordinate information of the ground truth, i.e. r* = [x*, y*, w*, h*], that of a bounding box estimated by YOLO, i.e. r = [x, y, w, h], is relative to its corresponding grid (grid orientation). To bring the ground truths and the estimates into the same coordinate system, the predicted bounding box is transferred to the image's coordinate system as follows:

b_x = x_{G_ij} + x    (3)
b_y = y_{G_ij} + y    (4)
b^a_w = w^a_{G_ij} exp(w)    (5)
b^a_h = h^a_{G_ij} exp(h),   (6)

where x_{G_ij}, y_{G_ij} are the coordinates of the top-left corner of grid G_ij w.r.t. the image, and w^a_{G_ij}, h^a_{G_ij} are the width and height of the a-th anchor of the given grid.
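Eqs. 3–6 can be written as a short decode function; the grid-corner and anchor values below are made-up normalized coordinates:

```python
import numpy as np

def decode_box(x, y, w, h, grid_xy, anchor_wh):
    """Transfer a grid-relative YOLO prediction [x, y, w, h] into the
    image coordinate system (Eqs. 3-6)."""
    bx = grid_xy[0] + x                # Eq. 3: offset from the grid corner
    by = grid_xy[1] + y                # Eq. 4
    bw = anchor_wh[0] * np.exp(w)      # Eq. 5: scale the anchor width
    bh = anchor_wh[1] * np.exp(h)      # Eq. 6: scale the anchor height
    return bx, by, bw, bh

# Hypothetical grid corner (0.2, 0.4) and anchor size (0.1, 0.1).
bx, by, bw, bh = decode_box(0.5, 0.25, 0.0, np.log(2.0),
                            grid_xy=(0.2, 0.4), anchor_wh=(0.1, 0.1))
```

A raw prediction of w = 0 leaves the anchor width unchanged, while h = log 2 doubles the anchor height, matching the exponential parameterization of Eqs. 5–6.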
For each grid G_ij with i, j ∈ {1, ..., g} (Fig. 2), we compute the "class" and "coordinate" losses only if the IoU between a ground truth r* and at least one of the grid's anchors A^a_{G_ij} is larger than a pre-defined threshold τ; otherwise, its "class" and "coordinate" losses are zero (ignored). Note that if a grid has several anchors with a large IoU (> τ) with a ground truth, only the anchor with the largest IoU contributes to these losses.
For a given G_ij, let a' = argmax_a IoU(r*, A^a_{G_ij}). The "class" loss, i.e. the multi-class cross-entropy, measures the difference between the estimated class probabilities p(c|r^{a'}_{G_ij}) and the true class k*, encoded by p*(c), a one-hot K-dimensional vector whose k*-th element equals one (p*(c = k*) = 1):^1

L_cls(p*(c), p(c|r^{a'}_{G_ij})) = { −log p(c = k*|r^{a'}_{G_ij})   if max_a IoU(r*, A^a_{G_ij}) > τ
                                     0                               otherwise.    (7)
L_coor([x*, y*, w*, h*], [b_x, b_y, b^{a'}_w, b^{a'}_h] | G_ij) =
  { (x* − b_x)² + (y* − b_y)² + (w* − b^{a'}_w)² + (h* − b^{a'}_h)²   if max_a IoU(r*, A^a_{G_ij}) > τ
    0                                                                  otherwise.    (8)

In addition, for each grid G_ij, we compute the "object" loss, a binary cross-entropy, which measures the loss on the estimated objectness probability of the anchor that has the maximum IoU overlap with a ground truth, provided that IoU(r*, A^a_{G_ij}) > τ. The true "object" label of this anchor is set to t^a_{G_ij} = 1; for the remaining anchors of the grid, t^a_{G_ij} = 0. If none of the anchors of G_ij has a large IoU overlap (> τ), then the true "object" label of all of them is zero, i.e. t^a_{G_ij} = 0 ∀a ∈ {1, ..., A}:

L_obj(t^a_{G_ij}, p(O|r^a_{G_ij})) = −[ t^a_{G_ij} log p(O|r^a_{G_ij}) + (1 − t^a_{G_ij}) log(1 − p(O|r^a_{G_ij})) ]    (9)

It should be emphasized that the "object" loss is
computed for all the grids (i.e., all the anchors of all the grids), whether they contain a ground truth or not, while the "coordinate" and "class" losses are not computed for the grids that have no ground truth (since these losses are, by definition, always zero for such grids). Finally, all the above loss functions are summed with weights to define the total loss of YOLO; the weights are set so that the contributions of the losses are balanced:

L = λ_cls Σ_{G_ij} L_cls(p*(c), p(c|r^{a'}_{G_ij}))
  + λ_coor Σ_{G_ij} L_coor([x*, y*, w*, h*], [b_x, b_y, b^{a'}_w, b^{a'}_h] | G_ij)
  + λ_obj Σ_{G_ij} Σ_{a=1}^{A} L_obj(t^a_{G_ij}, p(O|r^a_{G_ij}))    (10)
1. Instead of the cross-entropy over the class predictions, the authors of [22] use a binary cross-entropy loss for each of the K classes.