Research Article
An Evaluation of Deep Learning Methods for Small Object Detection

Nhat-Duy Nguyen, Tien Do, Thanh Duc Ngo, and Duy-Dinh Le

University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam

Correspondence should be addressed to Tien Do; [email protected]

Received 20 January 2020; Accepted 11 March 2020; Published 27 April 2020

Academic Editor: Cesare F. Valenti

Copyright © 2020 Nhat-Duy Nguyen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Hindawi Journal of Electrical and Computer Engineering, Volume 2020, Article ID 3189691, 18 pages. https://doi.org/10.1155/2020/3189691

Small object detection is an interesting topic in computer vision. With the rapid development of deep learning, it has drawn the attention of several researchers, with innovations in approach joining the race. These proposed innovations comprise region proposals, divided grid cells, multiscale feature maps, and new loss functions. As a result, the performance of object detection has recently seen significant improvements. However, most of the state-of-the-art detectors, in both the one-stage and two-stage approaches, have struggled with detecting small objects. In this study, we evaluate current state-of-the-art models based on deep learning from both approaches, such as Fast RCNN, Faster RCNN, RetinaNet, and YOLOv3. We provide a profound assessment of the advantages and limitations of the models. Specifically, we run the models with different backbones on different datasets with multiscale objects to find out which types of objects are suitable for each model and backbone. Extensive empirical evaluation was conducted on two standard datasets, namely, the small object dataset and a dataset filtered from PASCAL VOC 2007. Finally, comparative results and analyses are presented.

1. Introduction

Object detection is known as the task of locating all objects of interest in an input by bounding boxes and labeling them with the categories they belong to. For this task, several ideas have been proposed, from traditional approaches to deep learning-based approaches. The approaches to object detection are mainly separated into two types, namely, approaches based on region proposal algorithms, known as two-stage approaches [1–3], and approaches based on regression or classification, recognized as real-time and unified networks or one-stage approaches [4–7]. Applications based on real-time object detection now draw much attention because of the demand for meeting modern life and helping people live better. For example, self-driving cars help people travel the streets safely, reducing car accidents caused by distracted drivers. Another example, from manufacturing industries, is the need to detect assembly parts that are defective, under uncertainty in the angle of view, the size of the detected object, and deformable

shapes that change significantly during the assembly process [8]. This illustrates that real-time object detection, applied to the most popular vision-based applications in the real world, is really indispensable. However, such applications require early object detection so that the detections can subsequently be used as inputs for other tasks [9, 10]. Due to early detection, the representation of objects is usually small or even tiny. Generally, given an image of interest, the purpose of small object detection is to immediately detect the common objects in the image, especially those of small size, implying that the objects of interest either have a physically big appearance but occupy only a small patch of the image (train, car, bicycle, etc.) [11, 12] or really have a small appearance (mouse, plate, jar, bottle, etc.) [13], as shown in Figure 1.

Small object detection, therefore, is a challenging task in computer vision because, apart from the small representations of the objects, the diversity of the input images also makes the task more difficult. For instance, an image can be at different


resolutions; if the resolution is low, it can hinder the detector from detecting small objects. In this case, the visual information that highlights the locations of small objects will be significantly limited. In addition, small objects can be deformable or can be overlapped by other objects. A wide variety of detection methods have been proposed in recent years out of the development of deep learning. Various ideas have been presented, with attached evaluations, to deal with the challenges of object detection, but the proposed detectors currently spend their ability on the detection of normal sizes, not small objects. Hence, an evaluation of small object detection approaches is indispensable and important in the study of object detection. Lately, object detection has significantly attracted attention from state-of-the-art approaches, and these have made efforts to tackle object detection and yield good performance on challenging multiclass datasets such as PASCAL VOC and COCO. These cutting-edge methods are first trained on ImageNet and transferred to detection; for example, in

[2], the authors use a proposed network which applies a spatial pyramid pooling layer to extract features and compute them over an entire image regardless of image size, instead of employing part-based models [14]. R-CNN [1] is a pioneer of breakthrough object detection and has several innovations over previous approaches: an image is resized to a fixed size to feed into the network, and an external algorithm is then applied to generate object proposals. Improved from [1], Fast R-CNN [3] applies regions of interest (RoIs) to extract a fixed-length feature from the feature maps for each proposal. Faster R-CNN [15] uses its own network to generate object proposals instead of applying an external algorithm.

So far, almost all detection models perform well on challenging datasets such as COCO and PASCAL VOC. These datasets commonly contain objects taking up medium or big parts of an image and only a few small objects, which causes a data imbalance between objects of different sizes, resulting in a bias of the models toward the objects greater in number.

[Figure 1: Illustration of (a) objects such as a bus, planes, or cars that have a big appearance but occupy small parts of an image, taken from [11], and (b) objects that really have a small appearance, such as mice or plates, taken from [13].]


In addition, the number of classes in current small object datasets is less than in common datasets. Besides, most of the state-of-the-art detectors, in both one-stage and two-stage approaches, have struggled with detecting small objects. As a result, we presented an in-depth evaluation of existing deep learning models for detecting small objects in our prior work [16]. We evaluated three state-of-the-art models, including You Only Look Once (YOLO), Single Shot MultiBox Detector (SSD), and Faster R-CNN, with the related trade-off factors, i.e., accuracy, execution time, and resource constraints. This time, we not only make an extension by continuing to evaluate state-of-the-art and up-to-date detection models but also summarize the pros and cons as well as the design of the models rather than just introducing their ideas. Instead of focusing on real-time models, we evaluate state-of-the-art models both from the one-stage approach, which is able to run in real time, such as YOLOv3 and RetinaNet, and from the two-stage approach, which does not meet real-time detection but has high accuracy, such as Fast RCNN and Faster RCNN. We add these models to our evaluation for several reasons, taking our claims first from the original work on these models. Particularly, we pick YOLOv3 because this detector is a novel, state-of-the-art model which combines current advanced techniques such as residual blocks, skip connections, and multiscale detection. Similarly, RetinaNet is a detector that proposes an updated loss function to penalize the imbalance of classes in a dataset. Although Faster RCNN is the only model that was evaluated in our previous work, we want to evaluate this model with different backbones to consider how well the backbones work when combined with Faster RCNN. Furthermore, although Faster RCNN is an improvement of Fast RCNN, we still add Fast RCNN to our evaluation because this model works with an external algorithm to generate region proposals on the input image instead of on a feature map, unlike Faster RCNN. Besides, we evaluate these models with different backbones, such as ResNet 50, ResNet 101, ResNet 152, ResNeXT 101, and FPN, on small objects to consider how good these backbones are when combined with the models. We again make our evaluation on 2 datasets, namely, the small object dataset [13] and our dataset filtered from PASCAL VOC 2007 [11], with criteria such as accuracy, speed of processing, and resource consumption as well. Moreover, we provide analyses of the designs and the way the models work and explore how well the models can cope with multiscale objects. This helps readers form a preference for each model, from which they can choose a suitable model to meet their needs. Therefore, the following are our contributions:

(i) We made an extension for evaluating deep models from the two main approaches to detection, namely, the one-stage and two-stage approaches, such as YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, along with popular backbones such as FPN, ResNet, and ResNeXT.

(ii) We provided not only the disadvantages and advantages of the models relating to accuracy, resource consumption, and speed of processing in the context of small objects, as well as the changes in these factors when the object size is scaled up or down, but also a comparison between one-stage and two-stage methods.

2. Challenges

Overall, there are several problems relating to challenges that need to be solved in object detection. Object detection itself draws much attention from researchers, but after a period of time, only a part of the challenges has been tackled; particularly, the COCO challenges provide a standard with regard to small and medium detection, and the accuracy of most detectors is still low by this standard. Therefore, small object detection is harder for researchers because, apart from the normal challenges of object detection, it has its own particular challenges. Besides, the definition of small objects is not obviously clear. The following presentation makes it more obvious.

2.1. Small Appearances. Recently, small object detection has been considered an attractive problem in itself because there are many sorts of its own challenges that are very intriguing to researchers. First of all, the possible appearances of small objects are far more numerous than those of other objects; because of their small size, detectors get confused when spotting these objects among plenty of other objects located around them, some even of the same size or appearance. It is arduous to differentiate small objects from the clutter of the background. Furthermore, the pixels available to represent the information of small objects are much fewer than for normal objects. This means there are fewer informative representations for detectors to perform their task on. Besides, the key features that capture small objects in an image are vulnerable and can even be lost progressively when going through many kinds of layers of a deep network, such as convolutional or pooling layers. For example, in VGG16, if an object of interest occupies a 32 × 32 region, it will be represented by at most 1 pixel after going through the 5 pooling blocks. As a result, exhaustive searching such as sliding windows [14] or a drastic increase in the number of bounding boxes as in selective search [17] is unfeasible for achieving good outputs. Some samples of small objects are shown in Figure 1.
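To make the VGG16 example concrete, the following minimal sketch (plain Python, no framework assumed) traces how a 32 × 32 object shrinks through the five stride-2 pooling stages:

```python
# A minimal sketch of the VGG16 example above: a 32 x 32 object shrinks
# to a single spatial cell after five stride-2 pooling stages.
size = 32
for stage in range(1, 6):      # VGG16 contains five 2 x 2, stride-2 pooling layers
    size //= 2                 # each pooling stage halves the spatial side
    print(f"after pool {stage}: {size} x {size}")
# after pool 5: 1 x 1 -> barely any spatial evidence of the object remains
```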

2.2. Small Object Definitions. The definitional problem of small object detection is to clarify how small the scales or sizes of objects are, or how many pixels they occupy in an image. This is arduous, and it differs when we consider objects in images of high and low resolution. For example, an object occupying a 400 × 400 patch is assigned as a small object at a 2048 × 2048 resolution but is very big at a 500 × 500 one. Therefore, it causes difficulty for researchers when a dataset consists of images with various ranges of resolution. Up till now, there are some definitions of small objects, and these definitions are not clearly settled; it depends on the datasets used for evaluation and the characteristics of the objects of interest. Therefore, to perform the task of detecting small objects, researchers define


different definitions for different datasets instead of only using the size of the bounding box containing an object to decide whether the object is small or not. For example, Zhu et al. [18], when releasing their dataset of traffic signs, mentioned that small objects are objects whose sizes fill 20% of an image: if the traffic sign is square, it is a small object when the width of the bounding box is less than 20% of the image width and the height of the bounding box is less than the height of the image. In [19], Torralba et al. supposed small objects are less than or equal to 32 × 32 pixels. In the small object dataset [13], objects are small when they have a mean relative overlap (the overlap area between the bounding box and the image) from 0.08% to 0.58%, respectively 16 × 16 to 42 × 42 pixels in a VGA image. In this work, we reuse the above definitions, especially the definitions from [13, 18], as the main references because they are reliable resources and are widely accepted by other researchers.
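As an illustration of how these definitions can be applied in practice, the sketch below checks a ground-truth box against the definitions from [19] and [13]; the helper name is ours, and the thresholds simply restate the cited numbers as fractions:

```python
# A minimal sketch applying the cited small-object definitions to one box.
def is_small(box_w, box_h, img_w, img_h):
    # Torralba et al. [19]: at most 32 x 32 pixels
    torralba_small = box_w <= 32 and box_h <= 32
    # Small object dataset [13]: relative area between 0.08% and 0.58%
    rel_area = (box_w * box_h) / float(img_w * img_h)
    mra_small = 0.0008 <= rel_area <= 0.0058
    return torralba_small or mra_small

print(is_small(20, 24, 640, 480))    # True: satisfies both definitions
print(is_small(200, 180, 640, 480))  # False: a large object
```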

2.3. Datasets and Approaches. There are limited works concentrating on small objects, and this results in limited experience and knowledge on which to base comprehensive research. Previous approaches just focus on big objects and ignore the existence of small objects. In fact, we do not comprehend how well existing detection approaches perform when dealing with small objects. Hence, in this work, we assess the performance of existing state-of-the-art detectors to draw a general picture of their abilities for small object detection.

In terms of small object detection, there are just a few works regarding the problem of detecting small objects. So far, most of these works are designed to detect only single categories, such as traffic signs [18], vehicles [20–22], or pedestrians [23], and do not cover common or multiclass datasets from the real world. This results in a lack of evaluation of the approaches' abilities to detect different kinds of objects and variations of their shapes as well. Fortunately, Chen et al. [13] presented their small object dataset by combining the Microsoft COCO [12] and SUN [24] datasets; it consists of common objects such as "mouse", "telephone", "switch", "outlet", "clock", "tissue box", "faucet", "plate", and "jar". Chen also augmented the R-CNN algorithm with some modifications to improve the performance of detecting small objects. Following this idea, we conducted a small survey of existing datasets and found that PASCAL VOC has in common with the COCO and SUN datasets that it contains small objects of various categories. So we rely on existing and common definitions of small objects to filter objects that meet these definitions and form a dataset including 4 subsets corresponding to 4 different definitions of small objects, so as to objectively consider how different scales of objects affect detection performance. In addition, there is recently a small object dataset in a challenge called Vision Meets Drones: A Challenge (http://aiskyeye.com), and this dataset is considered challenging because it consists of several small objects, even tiny objects, in images in different contexts and conditions in the wild, but the views in the images are

snapshots from drones which fly above and take pictures with the high-resolution cameras attached to them. Unfortunately, this dataset does not have annotations for testing, so it is hard to use it for evaluation.

Therefore, in this work, we choose the small object dataset [13] and our filtered dataset for our evaluation, because these datasets contain common objects and the number of images is large, so the evaluations are objective.

3. Deep Models for Object Detection

Recently, amid the widespread development of deep learning, it is known that convolutional neural network (CNN) approaches have shown lots of improvements and achieved good results in various tasks. Therefore, they are commonly applied in well-known works. Most of these works have shown significant improvements in detecting objects filling medium or big parts of an image.

RCNN [1] is one of the pioneers. The following methods are improved forms of R-CNN, such as [2, 3, 15]. Especially, Faster R-CNN [15] is considered a state-of-the-art approach. This sequence of advanced works uses a lot of different and breakthrough ideas, from sliding windows to object proposals, and mostly achieves the best results as state-of-the-art methods on challenging datasets such as COCO, PASCAL VOC, and ILSVRC; however, their representations take much time to run completely over an image and may reduce the running performance of the detector. As a result, the detectors face difficulty in being used for detecting objects in real time despite achieving high accuracy. This means they just focus on accuracy and ignore the effects of processing speed. In addition, detecting objects of small sizes in the real world is as important as objects of big or medium sizes, even more necessary than we imagined. Especially in the automotive industry, smart cars, army projects, and smart transportation, data must be promptly and precisely processed to make sure that safety comes first; but in these cases, generally, the recorded data usually are far from our position, and the information is a small thing.

In terms of real-time detection, the one-stage methods, instead of using object proposals to get RoIs before moving to a classifier like two-stage approaches such as Faster R-CNN, use local information to predict objects; examples are YOLO and SSD. Both methods process images in real time, detect objects correctly, and still reach a high mAP. Nevertheless, their papers just mention that the models can detect small objects and have good results, but they do not show evidence to prove how much or to what extent small objects are handled. In this work, we evaluate these models from both approaches to find out their performance and to what extent they are good at detecting small objects. The following are the general ideas of the above-mentioned approaches.

3.1. R-CNN. R-CNN [1] is a novel and simple approach and an advanced pioneer, improving mean average precision (mAP) by more than 30% relative to previous works on PASCAL VOC.


The overall R-CNN architecture consists of four main phases, which are known as the new advances of this method. Firstly, R-CNN takes an image as input; then, the selective search algorithm [17] is applied to the image to generate 2000 candidate proposed bounding boxes, each warped to 227 × 227 and used as input to the CNN feature network. From each region, the network extracts a 4096-dimensional feature vector, computing the features for every region. Finally, a class-specific linear SVM classifier behind the last layer is used to classify the regions, deciding whether there are any objects and what the objects are.
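The pipeline just described can be sketched roughly as follows; `cnn_features` and `svm` are hypothetical stand-ins for the feature extractor and the class-specific linear SVMs, and the selective search call assumes the opencv-contrib package:

```python
# A rough sketch of the four R-CNN phases, under the assumptions noted above.
import cv2

img = cv2.imread("input.jpg")

# Selective search [17] proposes ~2000 candidate boxes
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()
proposals = ss.process()[:2000]              # (x, y, w, h) rectangles

detections = []
for (x, y, w, h) in proposals:
    warped = cv2.resize(img[y:y + h, x:x + w], (227, 227))  # fixed CNN input
    feat = cnn_features(warped)              # hypothetical: 4096-d feature vector
    label, score = svm.predict(feat)         # hypothetical class-specific SVMs
    detections.append((x, y, w, h, label, score))
```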

The major key to the success of R-CNN is that the features matter. In R-CNN, the low-level image features (e.g., HOG) are replaced with CNN features, which are arguably more discriminative representations. However, the evaluation of an image is extremely costly and wasteful, because R-CNN must apply the convolutional network 2000 times. Besides, warping each region to the low 227 × 227 resolution is a problem affecting small objects, which easily deform or even lose information as the resolution is changed far from the original size. The region proposals overlap, leading to the computation of similar features many times, and every region proposal must be stored to disk before feature extraction is performed. In addition, the many overlapping bounding boxes will result in a drop in mAP if small objects are close to big objects, because there is a bias toward choosing the bounding boxes which contain big objects and ignoring the bounding boxes for small objects.

3.2. Spatial Pyramid Pooling (SPP). The primary ideas of SPP [2] are motivated by limitations of the CNN architecture: the original CNN requires input images of a fixed size (224 × 224 for AlexNet), so practical use of a raw picture often needs cropping (a fixed-size patch that truncates the original image) or warping (the RoI of an input image must be a fixed-size patch). The fully connected layer needs a fixed-length input, while the convolutional layer can adapt to an arbitrary input size; thus, a bridge is needed as an intermediate layer between the convolutional layer and the fully connected layer, and that is the SPP layer. Particularly, SPP-net first finds 2000 candidate region proposals like the R-CNN method and then extracts the feature maps from the entire image. SPP maps each window of the features corresponding to a region proposal to a fixed-length representation regardless of the input size. Finally, 2 fully connected layers are used, with classification by SVM.

In short, on the detection task, SPP-net beats R-CNN: it is 100× faster, but training is very slow because of the multistage training steps (fine-tuning of the last layers, SVMs, and regressors), and it really takes a lot of disk space to save the feature vectors.
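A minimal sketch of the SPP layer itself is shown below, assuming an illustrative {4 × 4, 2 × 2, 1 × 1} pyramid (the level sizes are our assumption, not necessarily the exact configuration of [2]):

```python
# A minimal sketch of spatial pyramid pooling: any feature-map size maps
# to the same fixed-length vector.
import torch
import torch.nn as nn

def spp(feature_map, levels=(4, 2, 1)):
    pooled = []
    for bins in levels:
        pool = nn.AdaptiveMaxPool2d(bins)       # bins x bins output per channel
        pooled.append(pool(feature_map).flatten(1))
    return torch.cat(pooled, dim=1)             # fixed length: C * (16 + 4 + 1)

x_small = torch.randn(1, 256, 13, 13)
x_large = torch.randn(1, 256, 37, 50)
print(spp(x_small).shape, spp(x_large).shape)   # both: torch.Size([1, 5376])
```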

3.3. Fast R-CNN. Fast R-CNN [3] is an advanced method that presents various innovations to improve training and testing time and to efficiently classify object proposals while still increasing accuracy by using deep convolutional networks. The architecture of Fast R-CNN is trained end-to-end with a multitask loss. Specifically, the convolutional network takes an image of any size as input, along with several RoIs. Instead of applying RoIs to the input and warping them to feed into the network in the first step like RCNN, Fast RCNN applies these RoIs to a feature map produced after the several convolutional layers of the base network. From each RoI, a pooling layer extracts a fixed-size feature vector, which is then mapped to a feature vector by fully connected layers. The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets.

The most important feature of RoI pooling is sharing computation and memory in the forward and backward passes over the same image. The huge contribution of Fast R-CNN is that it proposes a new training method that fixes the drawbacks of R-CNN and SPP-net while improving their running time and accuracy. The advantage is that the mean average precision of detection is higher than for R-CNN and SPP-net; the training phase is a single stage using a multitask loss and can update all the network layers; and no disk storage is required for feature caching.
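The RoI pooling step at the heart of this design can be sketched with torchvision's operator; the feature-map size and the 1/16 spatial scale below are assumptions matching a VGG16-style backbone:

```python
# A minimal sketch of RoI pooling: each RoI yields the same fixed-size tensor.
import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 512, 38, 50)               # conv feature map of one image
rois = torch.tensor([[0, 16.0, 16.0, 320.0, 240.0],  # (batch_idx, x1, y1, x2, y2)
                     [0, 100.0, 40.0, 180.0, 120.0]])
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)   # torch.Size([2, 512, 7, 7]): one fixed-size tensor per RoI
```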

3.4. Faster R-CNN. Faster R-CNN [15] is an innovative approach improved from Fast R-CNN. Unlike the two previous approaches, instead of generating bounding boxes with external algorithms [17] like [1, 3], Faster R-CNN runs its own method, called the region proposal network (RPN), which is trained end-to-end to generate highly qualified region proposals. After deep features are gained from the early convolutional layers, the RPN comes into play, and windows slide over the feature map to extract features for each region proposal. The RPN is a fully convolutional network which simultaneously predicts object bounding boxes and objectness scores at each position. The input of the RPN is an image of any size, and it outputs a set of rectangular object proposals, each with an objectness score. Specifically, the RPN takes the image feature map of the fifth convolutional layer (conv5) as input and applies a 3 × 3 sliding window to the feature map. Then, an intermediate layer feeds into two different branches: one for the object score (determining whether the region is thing or stuff) and the other for regression (determining how the bounding box should change to become more similar to the ground truth). The RPN improves accuracy and running time and avoids generating an excess of proposal boxes, because it reduces cost by sharing computation on the convolutional features. The RPN and Fast R-CNN are merged into a single network by sharing their convolutional features. This combination gives Faster R-CNN leading accuracy but makes its architecture a two-stage network, which reduces the processing speed of this method.
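A rough sketch of RPN-style anchor generation over the conv5 feature map follows; the stride, scales, and ratios are the commonly used Faster R-CNN defaults and are assumptions rather than values reported in this paper:

```python
# A rough sketch of anchor generation: scales x ratios anchors per position.
import itertools
import numpy as np

def make_anchors(fm_h, fm_w, stride=16, scales=(128, 256, 512), ratios=(0.5, 1, 2)):
    anchors = []
    for y, x in itertools.product(range(fm_h), range(fm_w)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # center in image coords
        for s, r in itertools.product(scales, ratios):
            w, h = s / np.sqrt(r), s * np.sqrt(r)         # area ~ s^2, h/w = r
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

print(make_anchors(38, 50).shape)   # (17100, 4): 38 * 50 positions x 9 anchors
```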

3.5. You Only Look Once. Inheriting the advantages of the previous models introduced earlier, You Only Look Once (YOLO) [4] was considered a state-of-the-art real-time object detector for various categories at the time. YOLO currently has three versions [4–6], which improved substantially with each version.


The detailed analyses of the YOLO approaches, as a premise for applying them in practical applications, are as follows.

YOLOv1 [4]: it is widely known that YOLO, a unified or one-stage network, is a completely novel approach proposed by Redmon et al. that aims to tackle object detection in real time. Instead of performing object detection like the previous techniques based on complex tasks, such as [1, 4], which either use exhaustive sliding windows and then feed their outputs to classifiers at equally spaced locations over the whole image or use region proposals to generate bounding boxes which possibly contain objects and then feed them to convolutional neural networks, YOLO considers object detection a regression problem, simultaneously predicting the coordinates of bounding boxes and the class probabilities for those boxes. The key idea of YOLO's detection is that it separates images into grid views, which pushes up the running time as well as the accuracy in localizing objects. The goal of YOLO is to deal with two problems, namely, what objects are present and where they are in an image. YOLO's operation proceeds in three principal steps, simply and straightforwardly: firstly, YOLO takes an input image resized to a fixed size; it then runs a single convolutional network as a unified network on the image and ultimately thresholds the resulting detections by the model's confidence score. YOLO runs at 45 fps on a GPU, and the smaller Fast YOLO reaches 150 fps; this processing can run streaming video in real time. Although the design of the YOLO architecture affords end-to-end training and real-time detection, it still keeps a high average precision.

The network divides the input image into an S × S grid, where S × S is equal to the width and height of the tensor which presents the final prediction. If the center of an object falls in a grid cell, that grid cell takes responsibility for detecting the object. Moreover, each grid cell simultaneously predicts bounding boxes and confidence scores, which represent how confident the model is that the bounding box contains an object as well as how accurate it thinks the predicted box is.
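A minimal sketch of this grid assignment, with S = 7 as in YOLOv1 and an invented example object:

```python
# The cell containing the object's center is responsible for predicting it.
def responsible_cell(cx, cy, img_w, img_h, S=7):
    col = min(int(cx / img_w * S), S - 1)   # which of the S columns
    row = min(int(cy / img_h * S), S - 1)   # which of the S rows
    return row, col

# An object centered at (320, 290) in a 448 x 448 image falls in cell (4, 5),
# so that cell alone is responsible for predicting it:
print(responsible_cell(320, 290, 448, 448))
```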

The drawback of YOLO is that it lags behind the state-of-the-art detection systems in accuracy but beats them in running time. It makes less than half the number of background errors compared to Fast R-CNN. YOLO is highly generalizable, so it can quickly identify objects in an image, but it usually struggles to precisely localize some objects, especially small ones. Therefore, the authors introduced YOLOv2 to improve performance and fix the drawbacks of YOLO.

YOLOv2 [5] has a number of improvements over YOLOv1. Like the original, YOLOv2 runs on different fixed sizes of input image, but it introduced several new training methods for object detection and classification, such as batch normalization, multiscale training with higher-resolution input images, predicting the final detections on a higher-resolution spatial output, and using good default bounding boxes instead of fully connected layers.

However, this offers a trade-off between speed and accuracy. The details of the mAP improvements on PASCAL VOC 2007 are shown in Figure 2.

These novel improvements allow YOLOv2 to train on multiclass datasets like COCO or ImageNet. In addition, an attempt was made to train the detector to detect over 9000 different object classes. YOLOv2 uses a network architecture customized from the original network. YOLOv2 mainly concentrates on improving recall and localization while still achieving high classification accuracy in comparison with state-of-the-art detectors; the original YOLO makes significantly more localization errors but is far less likely to predict false detections where nothing exists. Although YOLOv2 improves accuracy, it does not work well on small objects, because the input downsampling results in a low-dimensional feature map being used for the final prediction. To solve these problems, the author recently introduced YOLOv3, with significant improvements on object detection, especially on small object detection. Generally, a variety of the latest networks tend to go deeper and yield good performance on their tasks with deep features learned from numerous layers.

YOLOv3 [6] is one of these approaches: instead of using Darknet-19 like the two older versions [4, 5], YOLOv3 develops a deeper network with 53 layers, called Darknet-53, and combines the network with state-of-the-art techniques such as residual blocks, skip connections, and upsampling. The residual blocks and skip connections are very popular in ResNet and related approaches, and the upsampling has recently also improved the recall, precision, and IoU metrics for object detection [25]. For the detection task, 53 more layers are stacked on, giving a 106-layer fully convolutional underlying architecture for YOLOv3. This is the reason behind the slowness of YOLOv3 compared to YOLOv2.

Second, YOLOv3 enables the detector to predict objects at three different outputs with three different scales, rather than making just one prediction at the last layer of the network, similar to its competitor SSD [26], which improved performance a lot on low-resolution images. This is useful for picking up diverse outcomes in order to improve detection performance. The final output is created by applying a 1 × 1 kernel to a feature map. Particularly, the detection is done by applying 1 × 1 detection kernels to feature maps of three different sizes at three different places in the network, partly similar to feature pyramid networks (FPNs) [27].

Third, YOLOv3 still uses K-means to generate anchor boxes, but instead of applying all 5 anchor boxes at the last detection, YOLOv3 generates 9 anchor boxes and separates them across the 3 locations. Each location applies 3 anchor boxes; hence, there are more bounding boxes per image. For example, for an image of 416 × 416, YOLOv2 predicts 13 × 13 × 5 = 845 boxes; in YOLOv3, the number of boxes is 10647, implying that YOLOv3 predicts roughly 10 times the number of boxes compared to YOLOv2.
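The box counts quoted above can be verified in a few lines (grid sizes 13, 26, and 52 for a 416 × 416 input):

```python
# The box counts quoted above, worked out for a 416 x 416 input.
yolov2 = 13 * 13 * 5                              # one 13 x 13 grid, 5 anchors
yolov3 = sum(s * s * 3 for s in (13, 26, 52))     # three grids, 3 anchors each
print(yolov2, yolov3, yolov3 / yolov2)            # 845 10647 ~12.6 (quoted as ~10x)
```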

Fourth, YOLOv3 also changes the way the cost function is calculated. If an anchor overlaps a ground truth more than any other bounding box does, the corresponding objectness score should be 1. Other anchor boxes with an overlap greater than a predefined threshold of 0.5 incur no cost. Each ground truth is associated with only one boundary box. If a bounding box is not assigned, it incurs no classification


and localization loss, just confidence loss on objectness. The loss function in the previous YOLO looks like

$$
\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(x_i - \hat{x}_i\right)^2 + \left(y_i - \hat{y}_i\right)^2 \right] \\
&\quad + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&\quad + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
&\quad + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2.
\end{aligned}
\tag{1}
$$

Currently, instead of using mean square error to calculate the classification loss in the last three terms, YOLOv3 uses binary cross-entropy loss for each label. In other words, YOLOv3 makes its prediction of an objectness score and a class prediction for each bounding box using logistic regression.

There is no more softmax function for class prediction. The reason is that the most commonly used classifiers assume that predicted labels are independent and mutually exclusive, implying that if an object belongs to one class, it cannot belong to another; this is only true if the output prediction really is mutually exclusive. However, a dataset may have multilabel classes, with labels which are not mutually exclusive, such as pedestrian and person. In that case, the sum of the possibility scores may be greater than 1 if the classifier is a softmax, so YOLOv3 switches the classifier for class prediction from the softmax function to independent logistic classifiers, which calculate the likeliness of the input belonging to a specific label.
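The difference can be seen in a small sketch: with invented logits for two labels that are not mutually exclusive, softmax forces them to compete, while independent sigmoids do not:

```python
# Softmax vs. independent logistic outputs for non-exclusive labels.
import torch

logits = torch.tensor([2.0, 1.8, -1.0])   # e.g., person, pedestrian, car

print(torch.softmax(logits, dim=0))  # ~[0.53, 0.44, 0.03]: mass is split, sums to 1
print(torch.sigmoid(logits))         # ~[0.88, 0.86, 0.27]: both labels can fire
```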

3.6. Single Shot MultiBox Detector. The Single Shot MultiBox Detector (SSD) [26] is a single-shot detector using a single, one-stage deep neural network designed for object detection in real time. By comparison, the state-of-the-art method in two-stage processing, Faster RCNN, uses its proposal network instead of an external method to generate object proposals and utilizes them to classify objects, moving toward real-time detection, but the whole process runs at only 7 FPS.

[Figure 5: Visualization of the detectors with the strongest backbones on the subsets PASCAL VOC_MRA_058, VOC_MRA_10, VOC_MRA_20, and VOC_WH_20, respectively: (a) YOLO Darknet-53; (b) Faster RCNN ResNeXT-101-64 × 4d-FPN; (c) RetinaNet ResNeXT-101-64 × 4d-FPN; (d) Fast RCNN ResNeXT-101-64 × 4d-FPN.]


SSD enhances the running time over the previous detectors by eliminating the need for the proposal network. This causes a small drop in mAP, which SSD compensates for with some improvements, including multiscale features and default boxes. These improvements allow SSD to match Faster RCNN's accuracy using lower-resolution images, which further speeds up SSD's processing. For a 300 × 300 input image, the best version of SSD gets 77.2% mAP at 46 FPS, better than Faster R-CNN's 73.2% and a little below the best version of YOLOv2 (544 × 544 input image, 78.6% mAP at 40 FPS), on VOC 2007 on an Nvidia Titan X.

Similarly, SSD consists of 2 parts, namely, extraction of feature maps and use of convolution filters to detect objects. SSD uses VGG16 as a base network to extract feature maps. Then, it combines 6 convolutional layers to make predictions. Each prediction contains a bounding box and N + 1 scores for each class, where N is the number of classes and one extra class stands for no object. Instead of using a region proposal network to generate boxes and feed them to a classifier for computing object locations and class scores, SSD simply uses small convolution filters: after the VGG16 base network extracts features into feature maps, SSD applies 3 × 3 convolution filters for each cell to predict objects. Each filter gives an output including N + 1 scores for each class and 4 attributes for one boundary box.

SSD differed from the other approaches of its time in that it makes predictions on multiple feature maps for detection independently, rather than on just one last layer. The CNN network spatially reduces the dimension of the image gradually, leading to a decrease in the resolution of the feature maps. As mentioned, SSD uses a lower-resolution input to detect objects; hence, early layers are used to detect small objects and lower-resolution layers to detect progressively larger-scale objects. Besides, SSD applies different scales of default boxes to different layers, as in the intuitive visualization in Figure 3: only the blue default box on the 8 × 8 feature map fits the ground truth of the cat, and only the red one on the 4 × 4 feature map matches the ground truth of the dog.
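A minimal sketch of how such multiscale heads size their outputs follows; the feature-map sizes use SSD300's commonly cited layout, and using k = 4 default boxes everywhere is a simplification (SSD varies k per layer, giving 8732 boxes in total):

```python
# How SSD-style heads scale: k default boxes per cell, each with 4 offsets
# and N + 1 class scores, predicted independently on every feature map.
num_classes = 21                       # N = 20 VOC classes + 1 background
k = 4                                  # default boxes per cell (assumed fixed)
feature_maps = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)]

total = 0
for h, w in feature_maps:
    boxes = h * w * k                  # one 3 x 3 conv filter set per cell
    total += boxes
    print(f"{h}x{w} map -> {boxes} boxes, {boxes * (4 + num_classes)} outputs")
print("total default boxes:", total)   # 7760 under this simplification
```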

Although SSD shows significant improvements in object detection from integrating the above parts, SSD is not good at detecting small objects, which can be improved by adding deconvolution layers with skip connections to introduce additional large-scale context [28]. Generally, SSD outperforms Faster RCNN, a state-of-the-art approach in accuracy, on PASCAL VOC and COCO while running detection in real time.

3.7. CNN Drawbacks. Most CNN models are currently designed as a hierarchy of various layers, such as convolutional and pooling layers, arranged in a certain order, not only in small networks but also in multilayer and state-of-the-art networks. Along with these layers, fully connected layers, known as FC layers, are added behind. The block consisting of the FC layers and the previous layers is designated the feature extractor, and it outputs the key features of the objects of interest as input for the classifiers coming behind. However, going deep through many kinds of layers is not good for small object detection, because in this task, the objects of interest have small sizes and appearances. Besides, small objects, unlike normal or big objects, which are less affected by resizing the image or passing through lots of different layers, are very vulnerable to changes in image size. When an image passes a convolutional layer, the size of the image is decreased by the receptive fields that slide over the image to extract useful features. This does not affect small objects if there are just a few layers, but in a CNN network, we have many layers like this, and it is very hard for small objects. Still, if small objects just went through convolutional layers, it would not be anything to mention; small objects, which have only a few informative pixels, also have to pass pooling layers, which help avoid overfitting and reduce computational cost by decreasing the number of parameters. To do this, these layers use fixed sliding windows that attend to a fixed target identified beforehand, such as maximum or average calculations of the values. For these reasons, GAN is an approach that may replace the CNN approach because of its advantages: we can take advantage of the way the approach generates data to overcome the data limitations of small objects in the training phase. Although images still have to pass layers such as convolutional and pooling layers, in this context the network just has fewer layers compared to others. Bai et al. [29] have proposed applying MTGAN to detect small objects by taking cropped inputs from a processing step performed by baseline detectors such as Faster RCNN [15] or Mask RCNN [9].

For the reasons mentioned, and following the survey of Liu et al. [30], which presents numerous works of survey and evaluation, none of which deals with small objects, in this work we assess popular and state-of-the-art models to find out the pros and cons of these models. Particularly, we evaluate 4 deep models, YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, with several base networks, for small object detection with different scales of objects. Of these models, YOLOv3 and RetinaNet belong to the one-stage approach; Fast RCNN and Faster RCNN are in the two-stage approach. We choose these models because YOLOv3 is the model combining state-of-the-art techniques and RetinaNet is the model with a new loss function which penalizes the imbalance of classes in a dataset. Besides, we choose RetinaNet to make comparisons between models in the same approach. Similarly, Fast RCNN and Faster RCNN are in the same approach and have nearly the same pipeline for object detection. The difference is that Fast RCNN utilizes an external method to generate object proposals based on input images, whereas Faster RCNN proposes its own network to generate object proposals on feature maps, which makes Faster RCNN easy to train end-to-end and work better.

4. Experimental Evaluation

In this section, we present the information on our experimental settings and the datasets which we use for evaluation.


4.1. Experimental Setting. We continue to train and evaluate various object detectors on the two datasets, PASCAL VOC [11] and a newly generated dataset [16]. The approaches evaluated this time consist of Faster RCNN [15], YOLOv3 [6], and RetinaNet [7] with different backbones. Except for YOLOv3, they are trained and evaluated with the Detectron Python code.

Currently, the original datasets commonly used in object detection are PASCAL VOC [11] and COCO [12]. Both datasets are constructed mostly of large objects or other kinds of objects whose sizes fill a big part of the image; these two datasets are not suitable for small object detection. In addition, there is another dataset for small object detection which is large-scale and includes a lot of classes, collected by drones and named the VisDrone dataset [31]. However, it does not publish the labels of the test set for evaluation, and the views of the images are top-down, which is not our case. As a result, in order to evaluate the detection performance of the models, we use a dataset which was published in [13]. This dataset, called the small object dataset, is a combination of the COCO [12] and SUN [24] datasets. There are 10 classes in the small object dataset, including mouse, telephone, switch, outlet, clock, toilet paper (t. paper), tissue box (t. box), faucet, plate, and jar. The whole dataset consists of 4925 images in total, with 3296 images for training and 1629 images for testing. The mouse class has the largest number of instances, 2137 instances in 1739 images, and the tissue box class has the smallest number of instances, 103 instances in 100 images. Apart from the small object dataset, we also filter subsets from PASCAL VOC 2007 following standard definitions. PASCAL VOC has 20 classes, but for small object detection, there are fewer classes under strict definitions of small objects. Table 1 lists the details of the number of small objects and the images containing them for the subsets of the dataset.

We trained all models on the small object dataset with the same parameters. Particularly, in the training phase, we trained the models for 70k iterations with the parameters, including momentum, decay, gamma, learning rate, batch size, step size, and training days, given in Table 2. At first, we attempted to start off the models with a higher learning rate, 10^-2, but the models diverged, the loss value becoming NaN or Inf after the first 100 iterations. Then we tried a lower learning rate, 10^-3, for the first 100 iterations, rising to 10^-2, to see whether the models could converge when starting off at a lower learning rate; however, nothing changed. We also saw that the models converged quickly during the first 10k iterations with 10^-3 and then progressively slowed down after 20k. Therefore, we decided to start the training with a learning rate of 10^-3 and decrease it to 10^-4 and 10^-5 at 25k and 35k iterations, respectively. With this setting, the loss value was stable from 40k, but we ran the training up to 70k to see how the loss value changed, and saw that it did not change much after 40k iterations. We tried to evaluate the models from 30k to 70k, and generally the performance of the models was not stable after 40k iterations. For this reason, we picked the weights for evaluation at 30k and 40k iterations: YOLO achieves its best results at 30k iterations, and the others get their best at 40k iterations. In the case of the subsets of PASCAL VOC 2007, we combine the train and valid sets from PASCAL VOC 2007 and 2012 to form a training set; PASCAL VOC 2012 works as a data augmentation set for PASCAL VOC 2007. We use this combined training set to train all models and test them on the subsets. All models are trained with the same parameters. First, due to memory limitations, we rescale all images to the same size, with the shortest side 600 and the longest side 1000, as in [15].
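The learning-rate schedule just described amounts to a simple step decay; a sketch using the values from the text and Table 2:

```python
# Step decay: 1e-3 at the start, 1e-4 from 25k, 1e-5 from 35k (gamma = 0.1).
def learning_rate(iteration, base_lr=1e-3, steps=(25000, 35000), gamma=0.1):
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= gamma               # drop by a factor of gamma at each step
    return lr

for it in (0, 24999, 25000, 35000, 40000):
    print(it, learning_rate(it))      # 1e-3, 1e-3, 1e-4, 1e-5, 1e-5
```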

For YOLOv3, we ran the K-means clustering algorithm to initialize 9 suitable default bounding boxes for the training and testing phases on our selected datasets, and we changed the anchor values. The following are the 9 anchors for the small object dataset after running the K-means algorithm: [10.3459, 14.4216], [26.2937, 19.0947], [21.4024, 36.3180], [47.9317, 29.1237], [40.4932, 63.7489], [83.6447, 51.3203], [72.2167, 119.9181], [172.7416, 117.0773], and [124.6597, 252.8465].
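For reference, anchor clustering of this kind is commonly implemented as k-means under an IoU distance; the following is a rough sketch of that procedure (our own illustration, not the exact script used here), assuming `boxes` is an (N, 2) array of ground-truth widths and heights:

```python
# K-means anchor clustering with distance = 1 - IoU on (width, height) pairs.
import numpy as np

def iou_wh(boxes, centroids):
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    boxes = boxes.astype(float)
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)  # nearest = max IoU
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = boxes[assign == j].mean(axis=0)
    return centroids[np.argsort(centroids.prod(axis=1))]      # sort by area
```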

For Faster R-CNN, to compare fairly with the prior work and deploy different backbones, we also directly reuse the anchor scales and aspect ratios from the paper [13], namely, anchor scales of 16 × 16, 40 × 40, and 100 × 100 pixels and aspect ratios of 0.5, 1, and 2, instead of having to cluster a set of default bounding boxes as for YOLOv3. Similarly, for RetinaNet, we keep the default training settings, such as gamma loss 2.0, alpha loss 0.25, anchor scale 4, and 3 scales per octave, following the authors, since this configuration has the optimized values.
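The focal loss that these gamma and alpha values parameterize can be sketched as follows; this is a simplified binary form, not the Detectron implementation:

```python
# Focal loss with the settings kept above (gamma = 2.0, alpha = 0.25);
# p is the predicted foreground probability for each anchor.
import torch

def focal_loss(p, target, gamma=2.0, alpha=0.25):
    p_t = torch.where(target == 1, p, 1 - p)        # prob. of the true class
    a_t = torch.where(target == 1, torch.tensor(alpha), torch.tensor(1 - alpha))
    return -(a_t * (1 - p_t) ** gamma * torch.log(p_t)).mean()

p = torch.tensor([0.9, 0.1, 0.6])   # easy positive, easy negative, hard positive
t = torch.tensor([1, 0, 1])
print(focal_loss(p, t))  # easy examples are down-weighted by (1 - p_t)^gamma
```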

4.2. Our Newly Generated Dataset. This time, to have an objective comparison, we also use our newly generated dataset, whose information is shown in Table 1. We use it to consider the effects of object sizes on factors including the models, processing time, accuracy, and resource consumption. The dataset consists of 4 subsets filtered from PASCAL VOC 2007.

[Figure 2: mAP of YOLOv2 on PASCAL VOC 2007 at each added part [5]: batch norm, hi-res classifier, convolutional anchor boxes, new network, dimension priors, location prediction, passthrough, multiscale, hi-res detector; mAP 63.4 → 65.8 → 69.5 → 69.2 → 69.6 → 74.4 → 75.4 → 76.8 → 78.6.]


The subsets are VOC_WH_20, VOC_MRA_058, VOC_MRA_10, and VOC_MRA_20, and detailed information is provided as follows:

(i) VOC_WH_20 contains objects whose width and height are less than 20% of the image's width and height. This subset has two classes fewer than PASCAL VOC 2007, dining table and sofa, because of the constraint of the definition.

(ii) VOC_MRA_058, VOC_MRA_10, and VOC_MRA_20 comprise objects occupying a maximum mean relative area of the original image under 0.58%, 1.0%, and 2.0%, respectively. Two of them have the same number of classes as PASCAL VOC 2007; the exception, VOC_MRA_058, has four classes fewer: dining table, dog, sofa, and train.

5. Results and Analyses

In this section, we show the results achieved through the experimental phase. All models mentioned in this section, except for models cited from other papers, were trained in the same environment with 1 GPU: Ubuntu 16.04.4 LTS, Intel(R) Xeon(R) Gold 6152 CPU @ 2.10 GHz, Tesla P100 GPU. In addition to the comparative accuracy, other comparisons are also provided to make our assessment results objective and clear.

5.1. Accuracy

5.1.1. Small Object Dataset. Following the detection results in Table 3, the methods belonging to the two-stage approach outperform the one-stage ones by about 8–10%. Specifically, Faster RCNN with the ResNeXT-101-64 × 4d-FPN backbone achieved the top mAP of the two-stage approaches, and of the whole table as well, at 41.2%; in comparison, the top of the one-stage approaches, YOLOv3 608 × 608 with Darknet-53, obtained 33.1%. Following [32], methods based on region proposals, such as Faster RCNN, are better than

methods based on regression or classification, such as YOLO and SSD. Actually, this holds once again in the context of the small object dataset.

Now consider the methods in each approach. Firstly, the two-stage approaches: Faster RCNN, which is an improvement of Fast RCNN, is only greater than Fast RCNN by about 1–2%, and only for ResNeXT backbones; it is equal to Fast RCNN for the rest. The difference here is not too much, and it means that the performance of an external region proposal method like selective search combined with RoI pooling is as good as an internal region proposal like the RPN with RoI align in this case. Besides, compared to R-CNN, we perceive that there is a boost of 8–10% when RoI pooling or RoI align is added, because R-CNN, which uses region proposals from selective search, feeds them into the network, and directly computes features from fc (fully connected) layers, only reaches 23.5% with AlexNet and 24.8% with VGG16 combined with proposals from the RPN. However, Fast RCNN and Faster RCNN with the two kinds of RoIs are much better: Fast RCNN achieves accuracy in a range of 31.7% to 39.6% depending on the backbone; similarly, Faster RCNN gets 30.1% to 41.2%. Secondly, in the one-stage approaches, YOLO outperforms SSD and RetinaNet: YOLO gets the highest outcome, 33.1%, while SSD and RetinaNet get 11.32% and 30%, respectively. YOLO and SSD are considered state-of-the-art methods in speed at the cost of accuracy. However, there is a large difference in accuracy between YOLO and SSD; the difference is that SSD adds multiple convolutional layers behind the backbone, each layer with its own ability, instead of using 2 fully connected layers like YOLO. Although RetinaNet is assigned to the one-stage approaches, it cannot run in real time. RetinaNet is the one proposed to deal with the imbalance between foreground and background through the focal loss; therefore, RetinaNet obtains a higher accuracy in comparison with the others, except for YOLOv3 (Darknet-53).

When it comes to the backbones, we realize that Darknet-53 is the best among the one-stage and real-time methods, and even far ahead of ResNet-50, although it has a similar number of layers to ResNet-50. In contrast, ResNeXT combined with FPN is the most powerful backbone in both the one-stage and two-stage methods if we only consider accuracy.

Table 1: The information of the subsets.

Subsets        Classes  Images  Instances
VOC_MRA_058    16       329     529
VOC_MRA_10     20       2231    5893
VOC_MRA_20     20       2970    7867
VOC_WH_20      18       1070    2313

Table 2: The parameters of the models.

Method        Momentum  Decay   Gamma  Learning_rate  Batch_size  Training_days  Stepsize
YOLOv2 [16]   0.9       0.0005  -      0.001          8           5              25000
YOLOv3        0.9       0.0005  -      0.001          32          3–4            25000
SSD300 [16]   0.9       0.0005  0.1    0.000004       12          9              40000, 80000
SSD512 [16]   0.9       0.0005  0.1    0.000004       12          12             100000, 120000
RetinaNet     0.9       0.0005  0.1    0.001          64          4->12 h        25000, 35000
Fast RCNN     0.9       0.0005  0.1    0.001          64          4->12 h        25000, 35000
Faster RCNN   0.9       0.0005  0.1    0.001          64          4->12 h        25000, 35000


Overall, there is an increase of about 1–3% when changing from the simple backbone to the complex one of each type. For example, when switching from the original ResNet to ResNet-FPN, the accuracy is boosted by 2 to 3%. It is clear that leveraging the advantages of the multiscale features of FPN is a common way to improve detection and to tackle the scale imbalance of input images and the bounding boxes of different objects. Similarly, when we switch ResNeXT-101-32 × 8d-FPN to ResNeXT-101-64 × 4d-FPN, the accuracy changes from 40.5% to 41.2% for Faster RCNN and from 38.7% to 39.6% for Fast RCNN. However, when considering ResNet-50-FPN against ResNet-101-FPN, the growth only happens for Fast RCNN, from 33.3% to 35.5%; there is a slight decrease of 0.1% for Faster RCNN. This reduction also happens with RetinaNet: while the simpler backbone ResNeXT-101-32 × 8d-FPN gets 30%, ResNeXT-101-64 × 4d-FPN gets just 25.1%. It means that very deep backbones do not guarantee an increase in accuracy, and the reason is that the advantage of a deeper network requires more parameters to learn: one must have a large amount of data to feed into the network to train and update the parameters, but in this case, the data of the small object dataset are not abundant enough to fit a very deep network, hence increasing the chances of overfitting. Besides, the features originally from the early layers of ResNet are not well generalized, because when they are combined with FPN, the accuracy improves by about 2–3%. When YOLO switches from Darknet-19 to Darknet-53, it really boosts accuracy: the highest accuracy for Darknet-19, at a resolution of 1024 × 1024, is just 24.02%, while YOLO 608 × 608 with Darknet-53 gets 33.1%. The explanation is that YOLOv3 with Darknet-53 has several improvements over Darknet-19: YOLOv3 has 3 scale locations to predict objects, one specialized in small objects, instead of only one like Darknet-19, and it also integrates cutting-edge advances such as residual blocks and shortcut connections. The reduction in accuracy happens again with YOLO when switching from ResNet-101 to ResNet-152, about 1–2%. Among these methods, YOLO and SSD are the only ones which allow multiple input sizes: the higher the resolution of the input images, the higher the accuracy the method receives. The reason is that a higher-resolution image allows more pixels to describe the visual information of small objects. However, if the resolution is far from the original size of the images, it results in a decrease in accuracy. For example, YOLO 1024 × 1024 with Darknet-19 gets a lower accuracy than the 800 × 800 resolution. In addition, we tried to increase the resolution of Darknet-53 from 608 to 1024, and the mAP decreases when the resolution is over 608 × 608. Therefore, the effect of image size is clear for models like SSD and YOLO. Generally, all the comparative mAP results on this dataset are dominated by classes very great in number, and this is caused by the data imbalance between the number of images and the instances in these images. For example, according to the statistics in [13], mouse is a major class significantly contributing to the mAP in Table 3, with the highest number of instances and images as well; however, tissue box has the least contribution, with the lowest AP, fundamentally affected by the amount of data.

Furthermore, the imbalanced data lead models to tend to detect frequent objects, implying that models will misidentify objects that have a nearly similar appearance to the dominant class as objects of interest, rather than detecting less frequent objects. As a result, false positives increase because of these problems. Figure 4 illustrates the detections with the strongest backbones. Following this visualization, the domination of classes such as mouse or faucet results in misdetections in areas which have a similar appearance to them. This misidentification tends to be stronger for weaker backbones in the comparison, and a one-stage method like YOLO, which primarily aims at speed, produces more misdetections than two-stage methods. A reason for these problems is the difference in the way the deep networks are trained [33]. One-stage methods such as YOLO use a soft sampling method that uses the whole dataset to update parameters, rather than only choosing samples from the training data. In contrast, two-stage methods such as the RCNN family tend to employ hard sampling methods that randomly sample a certain number of positive and negative bounding boxes to train the network, as sketched below.
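The following minimal sketch illustrates the hard-sampling idea; the thresholds and the 1:3 positive/negative budget follow commonly cited Fast RCNN settings and are assumptions here, not values reported in this paper.

```python
import random

def sample_rois(rois, ious_with_gt, batch_size=128, pos_fraction=0.25,
                pos_thresh=0.5, neg_range=(0.1, 0.5)):
    """Hard sampling in the spirit of the RCNN family: keep a fixed ratio of
    positive/negative proposals per image instead of training on every box."""
    positives = [r for r, iou in zip(rois, ious_with_gt) if iou >= pos_thresh]
    negatives = [r for r, iou in zip(rois, ious_with_gt)
                 if neg_range[0] <= iou < neg_range[1]]
    n_pos = min(len(positives), int(batch_size * pos_fraction))
    n_neg = min(len(negatives), batch_size - n_pos)
    return random.sample(positives, n_pos), random.sample(negatives, n_neg)
```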

5.1.2. Subsets of PASCAL. With 4 subsets of 4 different scales of objects in images, we want to find out how much the scales impact the models. The whole results are shown in Table 4. We separate the results into 2 groups, the one-stage and two-stage approaches, and Figure 5 is a visualization of the strongest backbones of each method on the subsets.

In the case of different scales like our subsets, there is a difference between the one-stage and two-stage approaches. Here, methods from the one-stage approach perform better than two-stage ones at most scales. This is really the opposite of the small object dataset. Specifically, two-stage methods are totally better than the one-stage real-time variants, just a bit better, about 10–20%, than the nonreal-time models on VOC_WH20, and similarly for smaller objects in VOC_MRA_0058 and VOC_MRA_010. However, for bigger objects in VOC_MRA_020, one-stage methods have significantly better outcomes than two-stage ones. In addition, only Faster RCNN performs well in most cases when compared to one-stage methods; Fast RCNN is only good at big objects in VOC_MRA_020 and fails to produce good detections on smaller objects.

In the one-stage approach, among the methods which allow multiple input sizes like YOLO and SSD, there are 2 kinds, namely, ones that can run in real time and ones that cannot, when the resolution is over 640 or 512 for YOLO and SSD, respectively. For the real-time ones, YOLO outperforms SSD at all scales of objects. Specifically, YOLOv2 with Darknet-19 is better than SSD by about 2.6% with objects in VOC_MRA_0058 and VOC_MRA_010, and by 4–15% for larger objects in VOC_MRA_020 and VOC_WH_20. YOLOv3 with Darknet-53 gets higher results, about 3–5%, in comparison with YOLOv2; hence, YOLOv3 also gets higher results than SSD. However, if we consider nonreal-time input images, SSD is greater than YOLO with objects in VOC_MRA_010. RetinaNet, the one in the one-stage approach that cannot run in real time, performs about the same as the nonreal-time YOLO variants


and better than SSD. RetinaNet is more stable than SSD and YOLO when the scales are changed: the bigger the objects are, the more stable it is. For example, the change is large, about 33%, when the scale increases from objects in VOC_MRA_0058 to ones in VOC_MRA_010 and VOC_MRA_020. However, this change is small, about 10%, with bigger objects, in comparison with 15–25% for YOLO. In the case of YOLO, this remarkable increase in accuracy when objects are larger is obviously good for a model. The change in SSD resembles the change in RetinaNet.

Concerning resolutions in YOLO and SSD, we see that increasing the image resolution generally pushes the accuracy up. YOLOv2 with Darknet-19, YOLOv3 with Darknet-53, and SSD all show an increase in accuracy when the resolution is larger, except for YOLOv2

with objects belonging to VOC_MRA_010 and VOC_MRA_020 when the image is over 800. In addition, YOLOv2 fluctuates with objects in VOC_WH20. As mentioned in our previous work, YOLO is better than SSD for objects occupying less than 10% of the image; however, in this case, YOLOv3 is good at all scales of objects. This is because YOLOv3 has 3 detection locations coming with more ratios of default boxes, and combining results from these 3 locations leads to a significantly better outcome.

When we switch to the two-stage approaches, Faster RCNN shows a significant improvement over Fast RCNN at most scales, except for objects in VOC_MRA_020, where both have the same accuracy. This shows that if objects are completely separated into different scales, RoI pooling does not work well with smaller objects and with ones in VOC_WH20. In

Table 3: Comparative results on the small object dataset.

Method | Backbone | Clock | Faucet | Jar | Mouse | Outlet | Plate | Switch | Telephone | Tissue box | Toilet paper | mAP
YOLO 416 [16] | Darknet-19 | 22.8 | 30.8 | 4.0 | 52.0 | 20.4 | 13.1 | 13.0 | 6.1 | 0 | 35.3 | 19.39
YOLO 448 [16] | Darknet-19 | 23.0 | 36.9 | 9.0 | 52.5 | 18.4 | 13.6 | 17.5 | 4.2 | 0 | 34.3 | 20.13
YOLO 480 [16] | Darknet-19 | 34.2 | 37.3 | 9.1 | 53.3 | 21.4 | 13.6 | 15.8 | 9.1 | 9.1 | 34.2 | 23.71
YOLO 512 [16] | Darknet-19 | 23.1 | 36.6 | 6.1 | 59.8 | 24.6 | 14.2 | 15.7 | 9.1 | 4.5 | 32.4 | 22.61
YOLO 554 [16] | Darknet-19 | 23.4 | 37.2 | 9.1 | 60.1 | 27.2 | 13.4 | 19.9 | 9.1 | 4.5 | 34.5 | 23.84
YOLO 640 [16] | Darknet-19 | 20.2 | 36.2 | 3.2 | 59.8 | 27.8 | 11.7 | 18.1 | 8.2 | 4.5 | 35.6 | 22.53
YOLO 800 [16] | Darknet-19 | 27.6 | 36.0 | 2.3 | 60.2 | 32.8 | 13.1 | 23.3 | 9.1 | 9.1 | 26.7 | 24.02
YOLO 1024 [16] | Darknet-19 | 21.7 | 29.3 | 1.4 | 58.3 | 26.4 | 11.8 | 17.5 | 9.1 | 9.1 | 15.7 | 20.03
YOLO 320 | Darknet-53 | 26.22 | 38.38 | 4.55 | 56.46 | 36.42 | 13.34 | 24.8 | 10.65 | 4.55 | 42.96 | 25.83
YOLO 416 | Darknet-53 | 28.47 | 47.15 | 10.83 | 60.49 | 43.15 | 15.87 | 30.73 | 15.15 | 2.62 | 48.3 | 30.28
YOLO 608 | Darknet-53 | 29.98 | 47.89 | 10.76 | 65.88 | 48.02 | 18.09 | 31.22 | 14.62 | 17.99 | 46.56 | 33.1
YOLO 320 | ResNet-50 | 19.57 | 25.73 | 0.67 | 45.17 | 14.37 | 9.38 | 13.84 | 9.09 | 9.09 | 23.7 | 17.06
YOLO 416 | ResNet-50 | 23.78 | 36.65 | 0.4 | 54.23 | 18.37 | 13.75 | 19.78 | 9.84 | 9.42 | 35.68 | 22.19
YOLO 608 | ResNet-50 | 26.92 | 40.65 | 1.77 | 61.86 | 29.18 | 15.04 | 20.24 | 10.09 | 13.29 | 36.01 | 25.5
YOLO 320 | ResNet-101 | 20.52 | 27.9 | 0.57 | 44.68 | 16.98 | 13.05 | 13.66 | 9.66 | 9.09 | 24.36 | 18.05
YOLO 416 | ResNet-101 | 25.72 | 35.6 | 3.03 | 55.73 | 22.4 | 15.61 | 17.26 | 9.32 | 3.03 | 38.71 | 22.64
YOLO 608 | ResNet-101 | 28.79 | 44.59 | 9.42 | 62.18 | 33.34 | 15.53 | 23.88 | 13.24 | 15.83 | 39.17 | 28.6
YOLO 320 | ResNet-152 | 21.64 | 27.56 | 3.03 | 48.06 | 17.39 | 11.12 | 14.51 | 9.09 | 4.55 | 31.88 | 18.88
YOLO 416 | ResNet-152 | 25.7 | 36.54 | 0.89 | 53.81 | 20.6 | 14.13 | 20.21 | 11.49 | 0.29 | 33.06 | 21.67
YOLO 608 | ResNet-152 | 26.01 | 44.54 | 4.55 | 61.0 | 31.76 | 13.02 | 22.67 | 12.35 | 9.93 | 39.99 | 26.58
SSD300 [16] | ResNet-101 | 5.5 | 9.1 | 0 | 25.5 | 6.1 | 4.5 | 0 | 4.5 | 9.1 | 18.2 | 8.25
SSD300 [16] | VGG16 | 9.1 | 17.1 | 0 | 26.1 | 9.1 | 9.1 | 0 | 4.5 | 0 | 16.7 | 9.16
SSD512 [16] | VGG16 | 9.1 | 17.1 | 0 | 43.0 | 9.1 | 9.1 | 9.1 | 9.1 | 0 | 7.6 | 11.32
RetinaNet | ResNet-50-FPN | 30.7 | 49.3 | 2.0 | 65.5 | 21.3 | 16.1 | 8.5 | 12.9 | 1.0 | 25.7 | 23.3
RetinaNet | ResNet-101-FPN | 30.6 | 48.7 | 7.1 | 64.7 | 20.0 | 15.9 | 11.8 | 10.7 | 2.9 | 38.7 | 25.1
RetinaNet | ResNeXT-101-32×8d-FPN | 35.5 | 55.0 | 12.1 | 66.5 | 23.9 | 18.4 | 9.8 | 16.2 | 9.4 | 53.7 | 30.0
RetinaNet | ResNeXT-101-64×4d-FPN | 31.4 | 50.2 | 8.9 | 66.3 | 20.8 | 15.3 | 9.4 | 14.0 | 2.2 | 32.4 | 25.1
R-CNN [13] | RPN prop. + VGG16 | 31.9 | 31.3 | 4.2 | 56.8 | 31.1 | 9.3 | 14.2 | 16.4 | 23.4 | 29.4 | 24.8
R-CNN [13] | Alexnet 7× 300 prop. | 32.4 | 27.2 | 5.1 | 56.9 | 28.0 | 9.8 | 13.6 | 12.4 | 17.9 | 35.6 | 23.9
R-CNN [13] | VGG16 7× 300 prop. | 37.3 | 30.3 | 7.2 | 60.6 | 41.5 | 15.8 | 21.5 | 13.7 | 22.0 | 33.3 | 28.4
R-CNN [13] | ContextNet (Alexnet 7×) | 32.7 | 26.8 | 4.6 | 56.4 | 26.3 | 9.9 | 12.9 | 12.2 | 18.7 | 34.0 | 23.5
Fast RCNN | ResNet-50-C4 | 32.4 | 46.3 | 6.5 | 65.8 | 38.3 | 20.1 | 25.3 | 16.6 | 14.1 | 52.0 | 31.7
Fast RCNN | ResNet-50-FPN | 37.4 | 47.3 | 7.3 | 68.9 | 46.7 | 21.0 | 32.1 | 17.1 | 9.3 | 45.9 | 33.3
Fast RCNN | ResNet-101-FPN | 39.3 | 50.3 | 10.6 | 68.3 | 47.1 | 20.4 | 33.3 | 18.6 | 15.4 | 51.4 | 35.5
Fast RCNN | ResNeXT-101-32×8d-FPN | 47.5 | 54.8 | 10.3 | 71.8 | 54.0 | 21.4 | 34.4 | 21.7 | 17.7 | 53.5 | 38.7
Fast RCNN | ResNeXT-101-64×4d-FPN | 45.4 | 55.7 | 10.9 | 72.5 | 53.3 | 24.0 | 36.9 | 22.9 | 16.0 | 58.1 | 39.6
Faster R-CNN [16] | VGG16 | 23.76 | 37.65 | 8.03 | 54.0 | 16.16 | 11.88 | 15.12 | 9.1 | 6.25 | 37.29 | 21.92
Faster RCNN | ResNet-50-C4 | 32.2 | 44.6 | 6.6 | 65.9 | 35.2 | 17.5 | 25.7 | 19.6 | 13.7 | 40.0 | 30.1
Faster RCNN | ResNet-50-FPN | 35.7 | 49.9 | 7.3 | 68.4 | 48.9 | 18.8 | 29.6 | 14.7 | 11.4 | 53.3 | 33.8
Faster RCNN | ResNet-101-FPN | 39.8 | 49.2 | 4.9 | 68.2 | 47.0 | 18.5 | 29.7 | 14.0 | 12.9 | 52.2 | 33.7
Faster RCNN | ResNeXT-101-32×8d-FPN | 49.8 | 56.6 | 11.4 | 72.1 | 56.3 | 23.2 | 37.0 | 20.8 | 18.8 | 58.7 | 40.5
Faster RCNN | ResNeXT-101-64×4d-FPN | 49.6 | 58.6 | 12.2 | 72.5 | 54.5 | 23.2 | 36.9 | 20.8 | 20.1 | 63.1 | 41.2
The values in bold represent the best among one-stage methods, and the ones in italics represent the highest among two-stage methods.


addition, if we compare it with one-stage methods, it is significantly lower than them. However, RoI align along with RPN performs well when scales are changed. When it comes to the backbones, there is a slight decrease in accuracy when changing from ResNet-50-FPN to ResNet-101-FPN or from ResNeXT-101-32×8d-FPN to ResNeXT-101-64×4d-FPN, with objects from all scales, for both Faster RCNN and Fast RCNN. The VGG16 backbone has an impressive outcome compared to strong backbones such as ResNet or ResNeXT: although its accuracy is less than that of the two strong backbones, VGG16 is still better with objects in VOC_WH20 and changes little in accuracy when switching to objects with big sizes.

5.2. Time Processing and Resource Consumption. Tables 5 and 6 show the performance comparison of the evaluated models with their base networks. Generally, we see that RAM consumption in testing and training increases as more layers are added. This means that the deeper the network is, the greater the processing demand, because depth leads to an increase in parameters and in the time needed to process data as well. YOLO is the model consuming the least memory in both the training and testing phases. Particularly, YOLO needs only 4 GB to 5 GB for training and 1.6 GB to 1.8 GB for testing with Darknet-53. YOLO is also the only one able to run in real time: it needs only about 0.03 s to 0.04 s to process an image, in comparison to more than 0.1 s and 0.2 s

with Faster RCNN and RetinaNet. This allows us to deploy these models on devices with modest memory. While RetinaNet is assigned to the one-stage approach, it is not fast enough for real-time detection. The inference time of Fast RCNN is a little lower than that of Faster RCNN and RetinaNet. In contrast, the RAM consumption of RetinaNet in training and testing is lower than that of Fast RCNN and Faster RCNN. Of all architectures, ResNet-50-C4 is the one requiring the most memory and time to process data, because the output size of ResNet-50-C4 is a bit bigger than the others [9]. If we consider ResNet or ResNeXT combined with FPN, Faster RCNN is over 100 MB higher compared to Fast RCNN and 300 MB higher compared to RetinaNet. In addition, according to Table 2, Faster RCNN and RetinaNet need less time for training, only a few hours to 1 day, rather than the 3–4 days of YOLO. This demonstrates that if we pay attention to performance and do not have much time for training, we should choose Faster RCNN or RetinaNet instead of YOLO. In contrast, if we only focus on processing speed while still achieving good performance, one-stage methods are always the good choice. In the same context of backbones, at testing time RetinaNet uses fewer resources than Fast RCNN and Faster RCNN, about 100 MB and 300 MB less, respectively. However, in training, RetinaNet uses much more memory than Fast RCNN, about 2.8 GB more, and than Faster RCNN, about 2.3 GB more, for ResNeXT-101-32×8d-FPN and ResNeXT-101-64×4d-FPN. If we consider this on

Figure 4: The location of the default boxes in different scales: (a) image with GT boxes; (b) 8×8 feature map; (c) 4×4 feature map. Each default box predicts location offsets Δ(cx, cy, w, h) and class confidences (c1, c2, …, cp).


Table 4: The comparative results on subsets of PASCAL VOC 2007.

Approach | Method | VOC_MRA_0058 | VOC_MRA_010 | VOC_MRA_020 | VOC_WH20
One stage | YOLOv2 416 [16] | 3.02 | 31.38 | 42.89 | 18.52
One stage | YOLOv2 448 [16] | 4.47 | 32.9 | 60.15 | 21.96
One stage | YOLOv2 480 [16] | 4.26 | 33.48 | 60.78 | 26.67
One stage | YOLOv2 512 [16] | 5.42 | 35.74 | 61.12 | 24.63
One stage | YOLOv2 544 [16] | 6.97 | 36.56 | 63.0 | 26.62
One stage | YOLOv2 640 [16] | 7.7 | 37.97 | 61.29 | 23.41
One stage | YOLOv2 800 [16] | 10.24 | 37.3 | 61.91 | 26.9
One stage | YOLOv2 1024 [16] | 10.69 | 29.93 | 55.14 | 28.97
One stage | YOLOv3 320 | 7.18 | 34.58 | 60.36 | 20.4
One stage | YOLOv3 416 | 10.2 | 38.97 | 62.53 | 24.12
One stage | YOLOv3 608 | 11.7 | 42.65 | 68.56 | 28.86
One stage | SSD 300 [16] | 1.71 | 32.76 | 46.26 | 16.91
One stage | SSD 512 [16] | 2.9 | 43.46 | 57.11 | 19.87
One stage | RetinaNet-ResNet-50-FPN | 8.84 | 41.5 | 50.2 | 28.14
One stage | RetinaNet-ResNet-101-FPN | 8.95 | 42.5 | 51.9 | 27.46
One stage | RetinaNet-ResNeXT-101-32×8d-FPN | 10.29 | 45.4 | 54.5 | 30.08
One stage | RetinaNet-ResNeXT-101-64×4d-FPN | 10.71 | 45.5 | 55.1 | 31.32
Two stage | Fast RCNN-ResNet-50-C4 | 0.23 | 13.2 | 49.9 | 3.93
Two stage | Fast RCNN-ResNet-50-FPN | 0.63 | 13.5 | 55.6 | 3.45
Two stage | Fast RCNN-ResNet-101-FPN | 0.39 | 15.9 | 57.6 | 3.12
Two stage | Fast RCNN-ResNeXT-101-32×8d-FPN | 0.51 | 14.4 | 57.9 | 3.33
Two stage | Fast RCNN-ResNeXT-101-64×4d-FPN | 0.29 | 14.2 | 57.3 | 3.76
Two stage | Faster RCNN-ResNet-50-C4 | 6.98 | 39.9 | 48.7 | 26.04
Two stage | Faster RCNN-ResNet-50-FPN | 10.74 | 45.6 | 56.3 | 29.79
Two stage | Faster RCNN-ResNet-101-FPN | 10.63 | 46.9 | 57.6 | 30.57
Two stage | Faster RCNN-ResNeXT-101-32×8d-FPN | 11.64 | 47.3 | 57.6 | 32.12
Two stage | Faster RCNN-ResNeXT-101-64×4d-FPN | 10.54 | 47.1 | 56.9 | 31.64
Two stage | Faster RCNN-VGG16 [16] | 5.73 | 35.58 | 44.14 | 41.11
This table illustrates how well models adapt to different scales of objects. The values in bold represent the best among one-stage methods, and the ones in italics represent the highest among two-stage methods.



the small object dataset, it does not help too much, because RetinaNet is about 10% lower than Faster RCNN in performance. Otherwise, on the different scales of the subsets, RetinaNet works well in comparison with Faster RCNN, and

the difference is just 2–4 percentage points. Although ResNet backbones combined with the others yield an improvement in accuracy, they do not work for YOLO on the small object dataset. YOLO with Darknet-53 utilizes more resources

Table 5: The comparison of consumption on the small object dataset.

Model | Backbone | Inference time (s) | Test RAM (MiB) | Train RAM (MiB)
YOLOv3 | Darknet-53 | 0.0331 | 1825 | 4759
YOLOv3 | ResNet-50 | 0.027 | 1285 | 3479
YOLOv3 | ResNet-101 | 0.0356 | 1829 | 5383
YOLOv3 | ResNet-152 | 0.0454 | 2443 | 7531
RetinaNet | ResNet-50-FPN | 0.102 | 2075 | 4435
RetinaNet | ResNet-101-FPN | 0.127 | 2723 | 5577
RetinaNet | ResNeXT-101-32×8d-FPN | 0.229 | 3767 | 7863
RetinaNet | ResNeXT-101-64×4d-FPN | 0.292 | 3719 | 7813
Fast RCNN | ResNet-50-C4 | 0.3 | 6449 | 5877
Fast RCNN | ResNet-50-FPN | 0.089 | 2277 | 4455
Fast RCNN | ResNet-101-FPN | 0.113 | 2947 | 5627
Fast RCNN | ResNeXT-101-32×8d-FPN | 0.212 | 3987 | 4961
Fast RCNN | ResNeXT-101-64×4d-FPN | 0.269 | 3885 | 4799
Faster RCNN | ResNet-50-C4 | 0.412 | 6609 | 6129
Faster RCNN | ResNet-50-FPN | 0.101 | 2387 | 5381
Faster RCNN | ResNet-101-FPN | 0.124 | 3001 | 6487
Faster RCNN | ResNeXT-101-32×8d-FPN | 0.256 | 4027 | 5333
Faster RCNN | ResNeXT-101-64×4d-FPN | 0.286 | 4003 | 5246

Table 6: The comparison of consumption on subsets filtered from PASCAL VOC.

Model | Backbone | Inference time (s) | Test RAM (MiB) | Train RAM (MiB)
YOLOv3 | Darknet-53 | 0.027 | 1645 | 4079
RetinaNet | ResNet-50-FPN | 0.1 | 1935 | 4133
RetinaNet | ResNet-101-FPN | 0.116 | 2585 | 5435
RetinaNet | ResNeXT-101-32×8d-FPN | 0.222 | 3641 | 7723
RetinaNet | ResNeXT-101-64×4d-FPN | 0.284 | 3561 | 7599
Fast RCNN | ResNet-50-C4 | 0.495 | 6371 | 5677
Fast RCNN | ResNet-50-FPN | 0.092 | 2131 | 4387
Fast RCNN | ResNet-101-FPN | 0.114 | 2819 | 5463
Fast RCNN | ResNeXT-101-32×8d-FPN | 0.213 | 3873 | 4637
Fast RCNN | ResNeXT-101-64×4d-FPN | 0.265 | 3735 | 4575
Faster RCNN | ResNet-50-C4 | 0.26 | 6141 | 5991
Faster RCNN | ResNet-50-FPN | 0.1 | 2245 | 5207
Faster RCNN | ResNet-101-FPN | 0.13 | 2855 | 6335
Faster RCNN | ResNeXT-101-32×8d-FPN | 0.225 | 3943 | 5087
Faster RCNN | ResNeXT-101-64×4d-FPN | 0.276 | 3885 | 4909


Figure 5: Highlight of bounding boxes from comparative backbones on the small object dataset. We here select YOLO with Darknet-53 and ResNet-50 for an objective comparison, because they obviously have the same kinds of layers in their networks, along with significant techniques such as skip connections and residual blocks. The bounding boxes show that ResNet-50 is more sensitive than Darknet-53 to areas which resemble the objects of interest. Similarly, ResNet-50-FPN and ResNet-50-C4 are chosen for consideration. The detections show that combining ResNet-50 with FPN yields a better performance than the original one; particularly, misdetections happen more densely with ResNet-50-C4 than with ResNet-50-FPN, such as in columns 4 and 5. Zoom in to see more detail.


than the ResNet ones, but it has the best accuracy among the models. Therefore, we only test YOLO with Darknet-53 on the subsets of PASCAL.
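For reference, per-image inference times like those in Tables 5 and 6 can be measured with a loop such as the one below. The paper does not publish its timing script, so the torchvision model, input size, and warm-up policy here are our own assumptions.

```python
import time
import torch
import torchvision

# Hedged timing sketch: average forward-pass latency over repeated runs.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None).eval()
image = [torch.rand(3, 600, 800)]  # one dummy image

with torch.no_grad():
    for _ in range(5):                 # warm-up iterations
        model(image)
    start = time.perf_counter()
    for _ in range(20):
        model(image)
print((time.perf_counter() - start) / 20, "s/image")
```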

5.3. Analyses of the Trade-Off among Detectors. Network designs and approaches like the one-stage approach prove their performance when applied to detecting general objects, both at small scales and at other kinds of scales. Although they are fast and accurate, there is still a drawback that always exists in these models, that is, the trade-off between accuracy and processing speed. For example, YOLOv3 proposes the idea of performing detection at three different scales, and the result is obviously impressive and yields good performance. However, to gain this advantage, YOLOv3 has to sacrifice processing time: instead of the model normally processing an input once for detection like YOLOv2, this idea must work 3 times. This trade-off is also partly affected by resolution, as we change it during training or testing of our models. In our previous work, we mentioned that we have to choose the right resolution to ensure our models work properly. In the case of the two-stage approaches, the idea of proposing region proposals to improve the localization of objects for detection is good as well. This is useful, but we have to take into account whether we should generate proposals on feature maps or directly on input images, because this strongly affects the way models run and identify representations of objects. If objects are normal or have a big or medium appearance, the models work well, but if objects are at multiple scales, this is a problem to consider and research deeply in order to balance as well as improve the performance. Therefore, to partly fix this problem, the one-stage approach allows us to choose a fixed input size for training and testing, but the benefit still depends on the characteristics of the datasets we evaluate and on the image size. After all, all models we chose to evaluate are affected by the scales of objects: when we change the scale, the accuracy of the models changes a lot, except for Faster RCNN, the only model that seems to be stable with scale, especially when combined with the VGG16 architecture. Although the accuracy of VGG16 is not better than the other architectures, the difference is that it does not change too much in accuracy. This only holds for big objects having an overlap between the bounding box and the image greater than 10%; otherwise, it is not assured.

Figure 1 shows that the possibility of small objects appearing is higher than for other objects. The black lens of the camera is somehow similar to the black mouse placed on a mouse pad. This possibility of small object presence causes more difficulties for detectors and leads to wrong detections. Small objects can appear anywhere in an image, and as a result, detectors produce many wrong detections on familiar appearances which they have seen. If we consider the visualization of the detections in Figure 4, the wrong detections are partly similar to the appearance of other objects in the dataset. This problem is caused by the data imbalance between classes and between instances in each class, which is originally known as the foreground-foreground class imbalance. In

other words, the common problems, which happen not only with small objects but with whole datasets, are intraclass similarity and interclass variation.

6. Conclusion

Small object detection is a challenging and interesting problem in the task of object detection and has drawn attention from researchers thanks to the development of deep learning, which is the motivation to improve the performance of tasks in computer vision. Although deep detection models originally tend to solve problems related to general object detection, they still work to a certain degree for small object detection. As an evaluation work on small object detection with deep models, our goal is to highlight remarkable achievements of popular and state-of-the-art deep models in order to provide a variety of views on applying deep models to small object detection. Particularly, we evaluate state-of-the-art detectors based on deep learning from the two approaches, namely, YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, on two datasets, namely, the small object dataset and subsets filtered from PASCAL VOC, with respect to the effects of different factors, objectively including accuracy, execution time, and resource usage.

In spite of the successful achievements of recent years, in which detection performance has improved significantly, there is still a huge gap in accuracy between normal objects and small objects. Under the criteria of the COCO dataset, the difference between the small scale and the medium and big scales is too large. Most models are good at detecting normal objects, and problems happen when applying them to detect small objects. As a result, to reduce the gap in small object detection, the first thing to do is to invest in datasets which have massive amounts of data to train models and a wide range of categories, to compete with the human visual system alike [12, 34].

So far, detection models are divided into two main approaches, namely, the one-stage approach and the two-stage approach. Models in the one-stage approach are known as detectors which have faster and more efficient detection in comparison to the other approach. The efficiency here is the potential to run in real time and the ability to apply them to practical applications. However, the trade-off between accuracy and speed is a difficult challenge which needs to be taken into account in order to close the gap. Models in the two-stage approach, on the other hand, have a reputation as region-based detectors which have high accuracy but are too slow to apply to the real world. This drawback comes from the computation of their networks.

Through our evaluation, there is the fact that the architectures utilized as base networks to extract deep features have significant effects on the frameworks. The deeper the architecture is, the higher the accuracy of detection is. Once a network increases in depth, it has more layers than normal ones and massive numbers of parameters to train. Hence, it needs a lot of data to fine-tune these parameters reasonably. If there is an increase in computation, resource consumption will also increase. As a result, it


will be difficult when we want to apply them in practical applications. Besides, the contextual exploitation in these models is definitely limited; this causes much useful and informative data to be ignored in training, especially in the context of small objects. Because small objects can appear anywhere in an input image, if the image is well exploited with its context, the performance of small object detection will be further improved.

For all the above reasons and according to our evaluation, if we want good performance and can ignore processing speed, two-stage methods like Faster RCNN perform well and demonstrate their network design on different datasets in many contexts of objects, including multiscale objects. Therefore, Faster RCNN is considered a giant baseline to base on or develop from. If our target is a balance of accuracy and speed, YOLO is a good choice in case we do not care about the training time, because its sacrifice between speed and accuracy is worth applying to practical applications. Otherwise, Faster RCNN or RetinaNet is still a substitution to work on. When it comes to backbones, we have to pay attention to the data in order to choose a reasonable backbone to combine with the methods, because the amount of data significantly impacts the model: if data are not abundant, a shallow network will fit them well. Besides, there is recently a novel approach promising for training deep models with less data, that is, weakly supervised learning such as zero-shot, one-shot, or few-shot learning. Therefore, these approaches will be considered in our future works. Following our recent investigation, to obtain better performance on object detection, we have to consider several factors to improve the mAP, such as multiscale training, superresolution for scaling up the visual information of small objects [35], or preprocessing data to avoid data imbalance, because there is a wide range of imbalance problems relating to data [33].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by Vietnam National University, Ho Chi Minh City (VNU-HCM), under grant no. B2017-26-01.

References

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, Columbus, OH, USA, June 2014.

[2] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proceedings of the European Conference on Computer Vision, pp. 346–361, Springer, Zurich, Switzerland, September 2014.

[3] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, Santiago, Chile, December 2015.

[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, Las Vegas, NV, USA, June 2016.

[5] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," 2016, https://arxiv.org/abs/1612.08242.

[6] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.

[7] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, "Focal loss for dense object detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[8] K. Zidek, A. Hosovsky, J. Pitel', and S. Bednar, "Recognition of assembly parts by convolutional neural networks," in Advances in Manufacturing Engineering and Materials, pp. 281–289, Springer, Cham, Switzerland, 2019.

[9] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, IEEE, Venice, Italy, October 2017.

[10] L.-C. Chen, A. Hermans, G. Papandreou et al., "Instance segmentation by refining object detection with semantic and direction features," 2017, https://arxiv.org/abs/1712.04837.

[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.

[12] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft COCO: common objects in context," in Proceedings of the European Conference on Computer Vision, pp. 740–755, Springer, Zurich, Switzerland, September 2014.

[13] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, "R-CNN for small object detection," in Proceedings of the Asian Conference on Computer Vision, pp. 214–230, Springer, Taipei, Taiwan, November 2016.

[14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.

[15] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," in Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS'15), pp. 91–99, MIT Press, Cambridge, MA, USA, 2015, http://dl.acm.org/citation.cfm?id=2969239.2969250.

[16] P. Pham, D. Nguyen, T. Do, T. D. Ngo, and D.-D. Le, "Evaluation of deep models for real-time small object detection," in Proceedings of the International Conference on Neural Information Processing, pp. 516–526, Springer, Guangzhou, China, November 2017.

[17] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.

[18] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu, "Traffic-sign detection and classification in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2110–2118, Las Vegas, NV, USA, June 2016.

[19] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: a large data set for nonparametric object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.

[20] A. Kembhavi, D. Harwood, and L. S. Davis, "Vehicle detection using partial least squares," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1250–1265, 2011.

[21] V. I. Morariu, E. Ahmed, V. Santhanam, D. Harwood, and L. S. Davis, "Composite discriminant factor analysis," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 564–571, IEEE, Steamboat Springs, CO, USA, March 2014.

[22] A. Andreas, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 2012.

[23] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: human trajectory prediction in crowded spaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971, Las Vegas, NV, USA, June 2016.

[24] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, "Sun database: exploring a large collection of scene categories," International Journal of Computer Vision, vol. 119, no. 1, pp. 3–22, 2016.

[25] E. Dong, Y. Zhu, Y. Ji, and S. Du, "An improved convolution neural network for object detection using YOLOv2," in Proceedings of the 2018 IEEE International Conference on Mechatronics and Automation (ICMA), pp. 1184–1188, IEEE, Changchun, China, August 2018.

[26] W. Liu, D. Anguelov, D. Erhan et al., "Single shot multibox detector," in Proceedings of the European Conference on Computer Vision, pp. 21–37, Springer, Amsterdam, The Netherlands, October 2016.

[27] T.-Y. Lin, P. Dollar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, p. 4, Honolulu, HI, USA, July 2017.

[28] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: deconvolutional single shot detector," 2017, https://arxiv.org/abs/1701.06659.

[29] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision, pp. 210–226, Springer, Munich, Germany, September 2018.

[30] L. Liu, W. Ouyang, X. Wang et al., "Deep learning for generic object detection: a survey," 2018, https://arxiv.org/abs/1809.02165.

[31] P. Zhu, L. Wen, X. Bian, L. Haibin, and Q. Hu, "Vision meets drones: a challenge," 2018, https://arxiv.org/abs/1804.07437.

[32] Z.-Q. Zhao, P. Zheng, S.-t. Xu, and X. Wu, "Object detection with deep learning: a review," 2018, https://arxiv.org/abs/1807.05511.

[33] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas, "Imbalance problems in object detection: a review," 2019, https://arxiv.org/abs/1909.00169.

[34] O. Russakovsky, J. Deng, H. Su et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[35] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 206–221, Munich, Germany, September 2018.



resolutions; if the resolution is low, it can hinder the detector from detecting small objects, since the visual information highlighting the locations of small objects will be significantly limited. In addition, small objects can be deformable or overlapped by other objects. A wide variety of detection methods have been proposed in recent years following the development of deep learning. Various ideas have been presented, with attached evaluations, to deal with the challenges of object detection, but those proposed detectors currently spend their ability on the detection of normal sizes, not small objects. Hence, an evaluation of small object detection approaches is indispensable and important in the study of object detection. Lately, object detection has significantly attracted attention from state-of-the-art approaches, and these have made efforts to tackle object detection and yield good performance on challenging and multiclass datasets such as PASCAL VOC and COCO. These cutting-edge methods are firstly trained on ImageNet and transferred to detection; for example, in

[2] the authors use a proposed network which applies a spatial pyramid pooling layer to extract features and compute them over an entire image regardless of image size, instead of employing part-based models [14]. R-CNN [1] is a pioneer of breakthrough object detection with several innovations over previous approaches: an image is resized to a fixed size to feed into the network, and an external algorithm is then applied to generate object proposals. Improved from [1], Fast R-CNN [3] applies regions of interest (RoIs) to extract a fixed-length feature from the feature maps for each proposal. Faster R-CNN [15] uses its own network to generate object proposals instead of applying an external algorithm.

So far, almost all detection models perform well on challenging datasets such as COCO and PASCAL VOC. These datasets commonly contain objects taking medium or big parts of an image and only a few small objects, which causes a data imbalance between objects of different sizes, resulting in a bias of models toward objects greater in numbers. In


Figure 1: Illustration of (a) objects such as a bus, planes, or cars that have a big appearance but occupy small parts of an image, taken from [11], and (b) objects that really own a small appearance, such as mouses or plates, taken from [13].


addition, the number of classes in current small object datasets is less than in common datasets. Besides, most of the state-of-the-art detectors, both in one-stage and two-stage approaches, have struggled with detecting small objects. As a result, we presented an in-depth evaluation of existing deep learning models for detecting small objects in our prior work [16]. We evaluated three state-of-the-art models, including You Only Look Once (YOLO), Single Shot MultiBox Detector (SSD), and Faster R-CNN, with related trade-off factors, i.e., accuracy, execution time, and resource constraints. This time, we not only make an extension by continuing to evaluate state-of-the-art and up-to-date detection models but also summarize the pros and cons as well as the design of the models rather than just introducing their ideas. Instead of focusing on real-time models, we evaluate state-of-the-art models both in the one-stage approach, which is able to run in real time, such as YOLOv3 and RetinaNet, and in the two-stage approach, which does not meet real-time detection but has high accuracy, such as Fast RCNN and Faster RCNN. We add these models to our evaluation for several reasons, firstly taking claims from the original works of these models. Particularly, we pick YOLOv3 because this detector is a novel state-of-the-art model which combines current advanced techniques such as residual blocks, skip connections, and multiscale detection. Similarly, RetinaNet is a detector that proposes an updated calculation of the loss function to penalize the imbalance of classes in a dataset. Although Faster RCNN is the only model that was evaluated in our previous work, we want to evaluate this model with different backbones to consider how well the backbones work when they are combined with Faster RCNN. Furthermore, Faster RCNN is an improvement of Fast RCNN, and we still add Fast RCNN to our evaluation because this model works with an external algorithm that generates region proposals on an input image instead of on a feature map like Faster RCNN. Besides, we evaluate these models with different backbones, such as ResNet 50, ResNet 101, ResNet 152, ResNeXT 101, and FPN, on small objects to consider how good these backbones are when combined with the models. We still make our evaluation on 2 datasets, namely, the small object dataset [13] and our dataset filtered from PASCAL VOC 2007 [11], with criteria such as accuracy, speed of processing, and resource consumption as well. Moreover, we provide analyses of the design and the way the models work and explore how well the models can cope with multiscale objects. This helps readers form a preference for each model, and from there, they can choose a suitable model to meet their needs. Therefore, the followings are our contributions:

(i) We made an extension for evaluating deep models in the two main approaches of detection, namely, the one-stage approach and the two-stage approach, with models such as YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, along with popular backbones such as FPN, ResNet, or ResNeXT.

(ii) We provided not only the disadvantages and advantages of the models relating to accuracy, resource consumption, and speed of processing in the context of

small objects, as well as the changes of these factors when an object size is scaled up or down, but also a comparison between one-stage and two-stage methods.

2. Challenges

Overall, there are several problems relating to challenges that need to be solved in object detection. Object detection itself draws much attention from researchers, but after a period of time, the challenges are only partly tackled; particularly, the COCO challenges provide a standard with regard to small and medium detection, and the accuracy of most detectors is still low under this standard. Therefore, small object detection is harder for researchers, because apart from the normal challenges of object detection, it owns particular challenges for small objects. Besides, the definition of small objects is not obviously clear. The following presentation makes this more obvious.

2.1. Small Appearances. Recently, small object detection has been considered an attractive problem in itself, because there are many sorts of its own challenges that are very intriguing to researchers. First of all, the possibilities of the appearance of small objects are much higher than for other objects; because of their small size, detectors get confused when spotting these objects among plenty of other objects located around them, possibly of the same size or appearance. It is arduous to differentiate small objects from the clutter of the background. Furthermore, the pixels available to represent the information of small objects are much fewer than for normal objects, meaning there are fewer informative representatives for detectors to perform their task. Besides, key features for obtaining small objects from an image are vulnerable and even lost progressively when going through many kinds of layers of a deep network, such as convolutional or pooling layers. For example, in VGG16, if the object of interest occupies a 32×32 area, it will be represented by at most 1 pixel after going through the 5 pooling blocks, as the sketch below illustrates. As a result, exhaustive searching such as the sliding window [14] or the drastic increase in the number of bounding boxes like selective search [17] is unfeasible to achieve good outputs. Some samples of small objects are shown in Figure 1.
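A back-of-the-envelope sketch of this shrinkage is below; the function name and the assumption of stride-2 pooling at every stage are illustrative.

```python
def feature_map_size(input_size, num_pool=5, stride=2):
    """How many units survive repeated 2x downsampling: a 32x32 object in
    VGG16 is reduced to a single unit after its 5 pooling stages."""
    size = input_size
    for _ in range(num_pool):
        size = max(1, size // stride)
    return size

print(feature_map_size(32))   # -> 1
print(feature_map_size(416))  # -> 13, the coarsest YOLO grid for a 416 input
```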

2.2. Small Object Definitions. The definition problem of small object detection is to clarify how small the scales or sizes of objects are, or how many pixels they occupy in an image. This is arduous and differs when we consider objects in images of high resolution and low resolution. For example, an object occupying a 400×400 part of a 2048×2048 image is assigned as a small object but is very big in a 500×500 one. Therefore, it causes difficulty for researchers when a dataset consists of images with various ranges of resolution. Up till now, there are some definitions of small objects, and these definitions are not clearly unified. It depends upon the datasets that are used for evaluation and the characteristics of the objects of interest. Therefore, to perform the task of detecting small objects, researchers define


different definitions for different datasets, instead of only using the size of the bounding box containing an object to consider whether the object is small or not. For example, Zhu et al. [18] considered small objects to be objects whose sizes fill 20% of an image when releasing their dataset about traffic signs: if a traffic sign has a square size, it is a small object when the width of the bounding box is less than 20% of the image width and the height of the bounding box is less than the image height. In [19], Torralba et al. supposed small objects are less than or equal to 32×32 pixels. In the small object dataset [13], objects are small when they have a mean relative overlap (the ratio of the bounding box area to the image area) from 0.08% to 0.58%, respectively 16×16 to 42×42 pixels in a VGA image. In this work, we reuse the above definitions, especially the definitions from [13, 18], as the main references, because they are reliable resources and are widely accepted by other researchers.
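A minimal sketch of how these definitions could be turned into a filter when building such subsets is given below, assuming axis-aligned boxes; the function name and the exact inequality forms are illustrative, not the filtering code used in this paper.

```python
def is_small_object(box_w, box_h, img_w, img_h, definition="chen"):
    """Decide whether a box is 'small' under the definitions discussed above."""
    if definition == "torralba":      # [19]: at most 32x32 pixels
        return box_w <= 32 and box_h <= 32
    if definition == "zhu":           # [18]: width under 20% of the image width
        return box_w < 0.2 * img_w and box_h < img_h
    if definition == "chen":          # [13]: relative area 0.08%..0.58%
        rel_area = (box_w * box_h) / float(img_w * img_h)
        return 0.0008 <= rel_area <= 0.0058
    raise ValueError(definition)
```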

2.3. Datasets and Approaches. There are limited works concentrating on sorts of small objects, and this results in limited experience and knowledge for a deeply comprehensive research. Previous approaches just focus on big objects and ignore the existence of small objects. In fact, we do not comprehend how well existing detection approaches perform when dealing with small objects. Hence, in this work, we assess the performance of existing state-of-the-art detectors to draw a general picture of their abilities for small object detection.

In terms of small object detection, there are just a few works regarding the problem of detecting small objects. So far, most of these works are designed to detect single categories, such as traffic signs [18], vehicles [20–22], or pedestrians [23], and do not cover common or multiclass datasets of the real world. This results in a lack of evaluation of the approaches' abilities to detect different kinds of objects and variations of their shapes as well. Fortunately, Chen et al. [13] present their small object dataset by combining the Microsoft COCO [12] and SUN [24] datasets; it consists of common objects such as "mouse," "telephone," "switch," "outlet," "clock," "tissue box," "faucet," "plate," and "jar." Chen also augments the R-CNN algorithm with some modifications to improve the performance of detecting small objects. Following this idea, we conducted a small survey on existing datasets and found that PASCAL VOC has, in common with the COCO and SUN datasets, small objects of various categories. So we rely on existing and common definitions of small objects to filter objects that meet these definitions and form a dataset including 4 subsets corresponding to 4 different definitions of small objects, so as to objectively consider how different scales of objects affect the performance of detection. In addition, there is recently a small object dataset in a challenge called Vision Meets Drones: A Challenge (http://aiskyeye.com), and this dataset is considered challenging because it consists of several small objects, even tiny objects, in images in different contexts and conditions in the wild, but the views in the images are

snapshots from drones which fly above and take pictures with the high resolution cameras attached to them. Unfortunately, this dataset does not have annotations for testing, so it is hard to take it for evaluation.

Therefore, in this work, we choose the small object dataset [13] and our filtered dataset for our evaluation, because these datasets contain common objects and the number of images is large, so the evaluations are objective.

3. Deep Models for Object Detection

Recently, with the widespread development of deep learning, it is known that convolutional neural network (CNN) approaches have shown lots of improvements and achieved good results in various tasks. Therefore, they are commonly applied in well-known works. Most of the works have shown significant improvements in detecting objects filling medium or big parts of an image.

RCNN [1] is one of the pioneers. The following methods are improved forms of R-CNN, such as [2, 3, 15]. Especially, Faster R-CNN [15] is considered a state-of-the-art approach. This sequence of advanced works uses a lot of different and breakthrough ideas, from sliding windows to object proposals, and mostly achieves the best results as state-of-the-art methods on challenging datasets such as COCO, PASCAL VOC, and ILSVRC; however, their representations take much time to fully process an image and may reduce the running performance of the detector. As a result, these detectors are difficult to use for detecting objects in real time despite achieving high accuracy. This means they just focus on accuracy and ignore the effects of processing speed. In addition, detecting objects with small sizes in the real world is as important as detecting objects with big or medium sizes, even more necessary than we imagined. Especially in the automotive industry, smart cars, army projects, and smart transportation, data must be promptly and precisely processed to make sure that safety comes first. But in these cases, generally, the recorded data are usually far from our position, and the information is a small thing.

In terms of real-time detection, the one-stage methods, instead of using object proposals to get RoIs before moving to a classifier like two-stage approaches such as Faster R-CNN, use local information to predict objects, such as YOLO and SSD. Both methods process images in real time, detect objects correctly, and still reach a high mAP. Nevertheless, these papers just mention that the models can detect small objects and have good results, but they do not show evidence to prove how much or to what extent small objects are solved. In this work, we evaluate these models from both approaches to find out their performance and to what extent they are good at detecting small objects. The following are the general ideas of the above-mentioned approaches.

3.1. R-CNN. R-CNN [1] is a novel and simple approach and a pioneer, providing more than 30% higher mean average precision (mAP) than the previous works on PASCAL VOC.


The overview of the R-CNN architecture consists of four main phases, which are known as the new advances of this method. Firstly, the R-CNN network resizes an image to 227×227 and takes it as an input. Then, the selective search algorithm [17] is applied to the image to generate 2000 candidate proposed bounding boxes as the warped regions used as the input of the CNN feature network. From the regions, the network extracts a 4096-dimensional feature vector for each region. Finally, a class-specific linear SVM classifier behind the last layer classifies the regions to consider whether there are any objects and what the objects are.

The major key to the success of R-CNN is the features matter: in R-CNN, the low-level image features (e.g., HOG) are replaced with CNN features, which are arguably more discriminative representations. However, the evaluation of an image is extremely costly and wasteful, because R-CNN must apply the convolutional network 2000 times. Besides, resizing the input down to 227×227 is a problem affecting small objects, which easily deform or even lose information when the resolution changes far from their original size. The region proposals overlap, leading to computing similar features many times, and every region proposal must be stored to disk before the feature extraction is performed. In addition, lots of overlapping bounding boxes will result in a drop in mAP if small objects are close to big objects, because there is a bias toward choosing the bounding boxes which contain big objects and ignoring the bounding boxes of small objects, as the suppression sketch below illustrates.
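That suppression step is usually a greedy non-maximum suppression over the scored boxes; a minimal sketch (not R-CNN's exact implementation) is:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: lower-scored overlapping boxes are discarded, which is one
    way a confident large box can swallow the box of a nearby small object."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```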

3.2. Spatial Pyramid Pooling (SPP). The primary ideas of SPP [2] are motivated by limitations of the CNN architecture, such as the original CNN requiring input images of a fixed size (224×224 for AlexNet), so the actual use of the raw picture often needs cropping (a fixed-size patch that truncates the original image) or warping (the RoI of an input image must be a fixed-size patch). The fully connected layer needs a fixed-length input, while the convolutional layer can adapt to an arbitrary input size; thus, a bridge is needed as an intermediate layer between the convolutional layer and the fully connected layer, and that is the SPP layer. Particularly, SPP-net firstly finds 2000 candidate region proposals like the R-CNN method and then extracts the feature maps from the entire image. SPP maps each window of the features corresponding to a region proposal to a fixed-length representation regardless of the input size. Finally, 2 fully connected layers are used for classification by SVM.
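A minimal PyTorch sketch of such a pyramid pooling layer follows; the (1, 2, 4) level choice is illustrative rather than the exact pyramid of [2].

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Pool a conv feature map into fixed grids (1x1, 2x2, 4x4) and concatenate,
    so any input size yields a vector of the same length."""
    n, c = feature_map.shape[:2]
    pooled = [F.adaptive_max_pool2d(feature_map, level).view(n, -1)
              for level in levels]
    return torch.cat(pooled, dim=1)  # length = c * (1 + 4 + 16)

x = torch.rand(1, 256, 13, 17)        # arbitrary spatial size
print(spatial_pyramid_pool(x).shape)  # torch.Size([1, 5376]) regardless of 13x17
```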

In short, on the detection task, SPP-net is better than R-CNN and up to 100× faster, but training is very slow because of the multistage training steps (fine-tuning of the last layers, SVMs, and regressors), and it takes a lot of disk space to save the feature vectors.

3.3. Fast R-CNN. Fast R-CNN [3] is an advanced method that presents various innovations to improve the time of the training and testing phases and to efficiently classify object proposals while still increasing the accuracy rate by using deep convolutional networks. The architecture of Fast

R-CNN is trained end-to-end with a multitask loss. Specifically, the convolutional network takes an image of any size as an input, along with several RoIs. Instead of applying RoIs on the input and warping them to feed into the network at the first step like RCNN, Fast RCNN applies these RoIs on a feature map obtained after the several convolutional layers of the base network. Each RoI is pooled into a fixed-size feature vector by a pooling layer and mapped to a feature vector by fully connected layers. The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets.

The most important feature of RoI pooling is sharing computation and memory in the forward and backward passes for the same image, as sketched below. The huge contribution of Fast R-CNN is a new training method that fixes the drawbacks of R-CNN and SPP-net while improving their running time and accuracy. The advantages are that the mean average precision of detection is higher than R-CNN and SPP-net, the training phase is single-stage using a multitask loss and can update all network layers, and no disk storage is required for feature caching.
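The sketch below shows the idea with torchvision's RoI pooling operator; the feature stride of 16 and the box coordinates are illustrative assumptions.

```python
import torch
from torchvision.ops import roi_pool

# Regions are cropped from the shared feature map, not from the image,
# so the backbone runs only once per image.
features = torch.rand(1, 256, 50, 50)  # feature map of an 800x800 image at stride 16
rois = torch.tensor([[0.0, 64.0, 64.0, 256.0, 320.0]])  # (batch_idx, x1, y1, x2, y2)
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([1, 256, 7, 7]), fed to the FC heads
```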

3.4. Faster R-CNN. Faster R-CNN [15] is an innovative approach improved from Fast R-CNN. Unlike its two predecessors, instead of generating bounding boxes with external algorithms [17] like [1, 3], Faster R-CNN runs its own method, called the region proposal network (RPN), which is trained end-to-end to generate highly qualified region proposals. After gaining deep features from the early convolutional layers, the RPN comes into play: windows slide over the feature map to extract features for each region proposal. The RPN is a fully convolutional network which simultaneously predicts bounding boxes of objects and objectness scores at each position. The input of the RPN is an image of any size, and it outputs a set of rectangular object proposals, each with an objectness score. Specifically, the RPN takes the image feature map of the fifth convolutional layer (conv5) as an input and applies a 3×3 sliding window on the feature map. Then, an intermediate layer feeds into two different branches, one for the object score (determining whether the region is thing or stuff) and the other for regression (determining how the bounding box should change to become more similar to the ground truth), as sketched below. The RPN improves accuracy and running time and avoids generating an excess of proposal boxes, because it reduces the cost by sharing computation on the convolutional features. The RPN and Fast R-CNN are merged into a single network by sharing their convolutional features. This combination helps Faster R-CNN achieve leading accuracy but makes its architecture a two-stage network, which reduces the processing speed of this method.
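A minimal sketch of such an RPN head in PyTorch is shown below; the channel widths follow the common VGG16-based description, and the class name is illustrative.

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """A 3x3 conv slides over the feature map; two 1x1 branches then score
    objectness and regress box offsets for k anchors per location (k = 9
    in the original paper)."""
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(512, k * 2, kernel_size=1)   # object vs background
        self.bbox_deltas = nn.Conv2d(512, k * 4, kernel_size=1)  # per-anchor offsets

    def forward(self, feature_map):
        x = self.conv(feature_map).relu()
        return self.objectness(x), self.bbox_deltas(x)
```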

3.5. You Only Look Once. Inheriting the advantages of the previously introduced models, You Only Look Once (YOLO) [4] was considered a state-of-the-art real-time object detector with various categories at that time. YOLO currently has three versions [4–6], which improve substantially with each version.


The detailed analyses of the YOLO approaches, as a premise to apply them in practical applications, are as follows.

YOLOv1 [4], a unified or one-stage network, is widely known as a completely novel approach aiming to tackle object detection in real time, proposed by Redmon et al. Instead of performing object detection like previous techniques based on complex tasks, such as [1, 4], which use an exhaustive sliding window and feed its outputs to classifiers at equally spaced locations over the whole image, or region proposals to generate bounding boxes that possibly contain objects before feeding them to convolutional neural networks, YOLO considers object detection a regression problem, simultaneously predicting the coordinates of bounding boxes and the class probabilities for these boxes. The key idea of YOLO's detection is that YOLO separates images into grid views, which pushes both the running time and the accuracy of localizing objects. The goal of YOLO is to deal with two problems, namely, what objects are present and where they are in an image. The summary of YOLO's operation proceeds in three principal steps, simply and straightforwardly: firstly, YOLO takes an input image resized to a fixed size; it then runs a single convolutional network as a unified network on the image; and it ultimately thresholds the resulting detections by the confidence score of the model. YOLO runs at 45 fps on a GPU, and the smaller Fast YOLO reaches 150 fps. This processing can run streaming video in real time. Although the design of the YOLO architecture affords end-to-end training and real-time detection, it still keeps a high average precision.

The network divides the input image into an S×S grid, where S×S is equal to the width and height of the tensor which presents the final prediction. If the center of an object falls in a grid cell, that grid cell takes responsibility for detecting the object. Moreover, each grid cell simultaneously predicts bounding boxes and confidence scores, which present how confident the model is that the bounding box contains an object, as well as how accurate it estimates the predicted bounding box to be.
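A minimal sketch of this center-to-cell assignment (the function name is illustrative):

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return the (row, col) of the grid cell containing an object's center;
    that cell is the one responsible for predicting the object."""
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col

print(responsible_cell(320, 240, 640, 480))  # -> (3, 3) for a centered object
```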

The drawback of YOLO is that it lags behind state-of-the-art detection systems in accuracy, though it beats them in running time. It makes less than half the number of background errors compared to Fast R-CNN. YOLO is highly generalizable, so it can quickly identify objects in an image, but it usually struggles to precisely localize some objects, especially small ones. Therefore, the authors introduced YOLOv2 to improve the performance and fix the drawbacks of YOLO.

YOLOv2 [5] has a number of improvements over YOLOv1. Similarly to the original, YOLOv2 runs on different fixed sizes of an input image, but it introduced several new training methods for object detection and classification, such as batch normalization, multiscale training with higher resolutions of input images, predicting the final detection on a higher spatial output, and using good default bounding boxes instead of fully connected layers.

However, this offers a trade-off between speed and accuracy. The details of the mAP improvements on PASCAL VOC 2007 are shown in Figure 2.

These novel improvements allow YOLOv2 to train on multiclass datasets like COCO or ImageNet. In addition, it was attempted to train the detector to detect over 9000 different object classes. YOLOv2 uses a network architecture customized from the original network. YOLOv2 mainly concentrates on improving recall and localization while still keeping a high classification accuracy in comparison with state-of-the-art detectors; the original YOLO makes significantly more localization errors but is far less likely to predict false detections in places where nothing exists. Although YOLOv2 has accuracy improvements, it does not work well on small objects, because the input downsampling results in the low dimension of the feature map which is used for the final prediction. To solve these problems, the authors recently introduced YOLOv3, with significant improvements in object detection, especially small object detection. Generally, a variety of the latest networks tend to go deeper and yield good performance on their tasks with deep features learned from numerous layers.

YOLOv3 [6] is one of these approaches: instead of using Darknet-19 like the two old versions [4, 5], YOLOv3 develops a deeper network with 53 layers, called Darknet-53, and combines the network with state-of-the-art techniques such as residual blocks, skip connections, and upsampling. The residual blocks and skip connections are very popular in ResNet and related approaches, and the upsampling recently also improved the recall, precision, and IOU metrics for object detection [25]. For the detection task, 53 more layers are stacked onto it, giving a 106-layer fully convolutional underlying architecture for YOLOv3. This is the reason behind the slowness of YOLOv3 compared to YOLOv2.

Second, YOLOv3 enables the detector to predict objects at three different outputs with three different scales rather than making just one prediction at the last layer of the network, similar to its competitor SSD [26], which has considerably improved performance on low-resolution images. This is useful for picking up diverse outcomes in order to improve detection performance. The final output is created by applying a 1 × 1 kernel on a feature map. Particularly, the detection is done by applying 1 × 1 detection kernels on feature maps of three different sizes at three different places in the network, partly similar to feature pyramid networks (FPNs) [27].

Third, YOLOv3 still uses K-means to generate anchor boxes, but instead of applying 5 anchor boxes at the last detection, YOLOv3 generates 9 anchor boxes and separates them across 3 locations. Each location applies 3 anchor boxes; hence, there are more bounding boxes per image. For example, for an image of 416 × 416, YOLOv2 predicts 13 × 13 × 5 = 845 boxes, whereas for YOLOv3 the number of boxes is 10647, implying that YOLOv3 predicts more than 10 times the number of boxes of YOLOv2.
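The box counts quoted above can be verified with a few lines of arithmetic, assuming the standard output strides of 32, 16, and 8 (13 × 13, 26 × 26, and 52 × 52 maps for a 416 × 416 input):

```python
# Reproducing the bounding-box counts for a 416x416 input: YOLOv2 predicts
# 5 anchors on one 13x13 map, while YOLOv3 predicts 3 anchors on each of
# three maps (13x13, 26x26, 52x52).
yolov2_boxes = 13 * 13 * 5
yolov3_boxes = sum(s * s * 3 for s in (13, 26, 52))
print(yolov2_boxes)                  # 845
print(yolov3_boxes)                  # 10647
print(yolov3_boxes / yolov2_boxes)   # ~12.6, i.e., over 10x more boxes
```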

Fourth, YOLOv3 also changes the way the cost function is calculated. If an anchor overlaps a ground truth more than the other bounding boxes do, the corresponding objectness score should be 1. Other anchor boxes with overlap greater than a predefined threshold (0.5) incur no cost. Each ground truth is only associated with one boundary box. If a bounding box is not assigned, it incurs no classification


and localization loss, just confidence loss on objectness. The loss function in the previous YOLO versions is

$$
\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&\quad+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&\quad+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}\left(C_i-\hat{C}_i\right)^2
+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{\text{noobj}}_{ij}\left(C_i-\hat{C}_i\right)^2 \\
&\quad+ \sum_{i=0}^{S^2} \mathbb{1}^{\text{obj}}_{i} \sum_{c\in \text{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned} \tag{1}
$$

Currently, instead of using mean squared error to calculate the classification loss in the last three terms, YOLOv3 uses binary cross-entropy loss for each label. In other words, YOLOv3 makes its prediction of an objectness score and class probabilities for each bounding box using logistic regression.
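For concreteness, a compact NumPy sketch of the localization and confidence terms of equation (1) is given below; it is our illustration written for readability, with the λ values set to the YOLOv1 defaults (5 and 0.5) and the class term omitted for brevity.

```python
# A compact, readable sketch of the localization/confidence parts of eq. (1).
import numpy as np

S, B = 7, 2
lam_coord, lam_noobj = 5.0, 0.5  # YOLOv1 default weights (assumed here)

def yolo_v1_loss(pred, target, resp_mask):
    """
    pred, target: (S, S, B, 5) arrays of (x, y, w, h, C) per box.
    resp_mask:    (S, S, B) -- 1 for the box responsible for an object.
    The class-probability term of eq. (1) is omitted to keep this short.
    """
    noobj_mask = 1.0 - resp_mask
    xy = ((pred[..., :2] - target[..., :2]) ** 2).sum(-1)
    wh = ((np.sqrt(pred[..., 2:4]) - np.sqrt(target[..., 2:4])) ** 2).sum(-1)
    conf = (pred[..., 4] - target[..., 4]) ** 2
    return (lam_coord * (resp_mask * (xy + wh)).sum()
            + (resp_mask * conf).sum()
            + lam_noobj * (noobj_mask * conf).sum())

pred = np.random.rand(S, S, B, 5)
target = np.random.rand(S, S, B, 5)
resp = np.zeros((S, S, B)); resp[3, 3, 0] = 1.0  # one responsible box
print(yolo_v1_loss(pred, target, resp))
```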

There is no more softmax function for class prediction. The reason is that most currently used classifiers assume that predicted labels are independent and mutually exclusive, implying that if an object belongs to one class, it cannot belong to another; this is only true if the output classes really are mutually exclusive. However, a dataset may have multilabel classes whose labels are not mutually exclusive, such as pedestrian and person. In that case, the sum of the probability scores may be greater than 1 if the classifier is a softmax, so YOLOv3 switches the classifier for class prediction from the softmax function to independent logistic classifiers to calculate the likelihood of the input belonging to a specific label.
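The following snippet illustrates the point with hypothetical logits for the multilabel case above: a softmax forces person and pedestrian to compete, while independent sigmoids, paired with per-label binary cross-entropy, can activate both.

```python
# Softmax vs. independent logistic classifiers on multilabel classes.
import numpy as np

logits = np.array([3.0, 2.5, -2.0])  # person, pedestrian, dog (hypothetical)

softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1.0 / (1.0 + np.exp(-logits))

print(softmax.round(3))  # scores forced to sum to 1: labels compete
print(sigmoid.round(3))  # both "person" and "pedestrian" can be high

# Binary cross-entropy is then applied per label instead of one softmax loss.
targets = np.array([1.0, 1.0, 0.0])
bce = -(targets * np.log(sigmoid) + (1 - targets) * np.log(1 - sigmoid))
print(bce.sum().round(3))
```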

3.6. Single Shot MultiBox Detector. Single Shot MultiBox Detector (SSD) [26] is a single-shot detector using a single, one-stage deep neural network designed for object detection in real time. By comparison, the state-of-the-art two-stage method, Faster RCNN, uses its own proposal network instead of an external method to generate object proposals and utilizes those to classify objects in order to move toward real-time detection, but the whole process runs at 7 FPS.

Figure 5: The visualization of detectors with the strongest backbones on subsets of PASCAL (VOC_MRA_058, VOC_MRA_10, VOC_MRA_20, and VOC_WH_20, respectively, in order). (a) YOLO, Darknet-53. (b) Faster RCNN, ResNeXT-101-64×4d-FPN. (c) RetinaNet, ResNeXT-101-64×4d-FPN. (d) Fast RCNN, ResNeXT-101-64×4d-FPN.


SSD runs faster than the previous detectors by eliminating the need for the proposal network. This causes a small drop in mAP, which SSD compensates for by applying some improvements, including multiscale features and default boxes. These improvements allow SSD to match the accuracy of Faster RCNN using lower-resolution images, which further speeds up processing. For a 300 × 300 input image, the best version of SSD gets 77.2% mAP at 46 FPS on VOC 2007 on an Nvidia Titan X, better than Faster R-CNN (73.2%) and a little lower than the best version of YOLOv2 (544 × 544 input image, 78.6% mAP at 40 FPS).

Similarly, SSD consists of 2 parts, namely, extraction of feature maps and use of convolution filters to detect objects. SSD uses VGG16 as a base network to extract feature maps. It then adds 6 convolutional layers to make predictions. Each prediction contains a bounding box and N + 1 scores for each class, where N is the number of classes and one extra class stands for no object. Instead of using a region proposal network to generate boxes and feed them to a classifier for computing the object location and class scores, SSD simply uses small convolution filters: after the VGG16 base network extracts features from the feature maps, SSD applies 3 × 3 convolution filters at each cell to predict objects. Each filter outputs N + 1 scores for each class and 4 attributes for one boundary box.
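The size of an SSD head follows directly from this description; the helper below (an illustration, with hypothetical parameter names) computes the number of output channels that a 3 × 3 detection filter bank needs for k default boxes per cell.

```python
# Output channels of an SSD detection head: for each of the k default boxes
# at a cell, a 3x3 filter predicts (N + 1) class scores plus 4 box offsets,
# so the head needs k * (N + 1 + 4) output channels.
def ssd_head_channels(num_classes, boxes_per_cell):
    return boxes_per_cell * (num_classes + 1 + 4)  # +1 for the "no object" class

print(ssd_head_channels(num_classes=20, boxes_per_cell=4))  # 100 channels
print(ssd_head_channels(num_classes=20, boxes_per_cell=6))  # 150 channels
```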

SSD differs from previous approaches in that it makes predictions on multiscale feature maps independently rather than only on the last layer. The CNN network spatially reduces the dimension of the image gradually, leading to a decrease in the resolution of the feature maps. As mentioned, SSD uses a lower-resolution input to detect objects; hence, early layers are used to detect small objects and lower-resolution layers to detect larger-scale objects progressively. Besides, SSD applies different scales of default boxes to different layers, as visualized intuitively in Figure 3: the blue default box on the 8 × 8 feature map fits the ground truth of the cat, and the red one on the 4 × 4 feature map matches the ground truth of the dog.

Although SSD achieves significant improvements in object detection by integrating the above parts, it is not good at detecting small objects, which can be improved by adding deconvolution layers with skip connections to introduce additional large-scale context [28]. Generally, SSD outperforms Faster RCNN, a state-of-the-art approach in accuracy, on PASCAL VOC and COCO while running in real time.

3.7. CNN Drawbacks. Most CNN models are currently designed as a hierarchy of various layers, such as convolutional and pooling layers, arranged in a certain order, from small networks through multilayer networks to state-of-the-art networks. Along with these layers, fully connected layers, known as FC layers, are added behind. The block consisting of the FC layers and the previous layers is designated as the feature extractor, and it outputs key features of the objects of interest as input for the classifiers coming behind. However, going deeply through many kinds

of layers is not good for small object detection because, in this task, the objects of interest have small sizes and appearances. Besides, small objects, unlike normal or big objects, which are less affected by resizing the image or passing through many different layers, are very vulnerable to changes in image size. When an image passes a convolutional layer, the size of the image is decreased by receptive fields that slide over the image to extract useful features. This does not affect small objects if there are just a few layers, but a CNN network has many layers like this, which is very hard on small objects. Still, if small objects only had to go through convolutional layers, it would not be worth mentioning: small objects, which have only a little informative presence, also have to pass pooling layers, which help avoid overfitting and reduce computational cost by decreasing the number of parameters. To do this, these layers use fixed sliding windows that compute a fixed, predefined target, such as the maximum or average of the values. For these reasons, GAN is an approach that may substitute for the CNN approach because of its advantages: we can take advantage of the way the approach generates data to overcome the limited data of small objects in the training phase. Although images still have to pass layers such as convolutional and pooling layers, in this context the network has fewer layers than the others. Bai et al. [29] have proposed applying MTGAN to detect small objects by taking cropped inputs from a processing step made by baseline detectors such as Faster RCNN [15] or Mask RCNN [9].
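A quick calculation makes this degradation concrete: a 32-pixel-wide object passing stride-2 pooling layers shrinks as follows (a simplified model that ignores padding and convolution effects).

```python
# Each stride-2 pooling halves the feature map, so a 32-pixel-wide object is
# reduced to a single activation after five poolings (a VGG16-style backbone).
side = 32
for stage in range(1, 6):
    side = max(side // 2, 1)
    print(f"after pooling {stage}: {side}x{side} pixels")
# after pooling 5: 1x1 pixels -- almost no signal left for detection
```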

For the mentioned reasons, and following the survey [30], in which Liu et al. present numerous works of survey and evaluation, there are no works that deal with small objects. Therefore, in this work, we assess popular and state-of-the-art models to find the pros and cons of these models. Particularly, we evaluate 4 deep models, YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, with several base networks, for small object detection with different scales of objects. Among these models, YOLOv3 and RetinaNet belong to the one-stage approach; Fast RCNN and Faster RCNN are in the two-stage approach. We choose these models because YOLOv3 is the model combining state-of-the-art techniques and RetinaNet is the model with a new loss function which penalizes the imbalance of classes in a dataset. Besides, we choose RetinaNet to make comparisons between models in the same approach. Similarly, Fast RCNN and Faster RCNN are in the same approach and have nearly the same pipeline in object detection. The difference is that Fast RCNN utilizes an external proposal method to generate object proposals from input images, whereas Faster RCNN proposes its own network to generate object proposals on feature maps, which makes Faster RCNN easy to train end-to-end and work better.

4. Experimental Evaluation

In this section, we present the information of our experimental settings and the datasets which we use for evaluation.


4.1. Experimental Setting. We continually train and evaluate various object detectors on the two datasets, PASCAL VOC [11] and a newly generated dataset [16]. The evaluated approaches this time consist of Fast RCNN [3], Faster RCNN [15], YOLOv3 [6], and RetinaNet [7] with different backbones. Except for YOLOv3, the others are trained and evaluated with the Detectron Python code.

Currently, the original datasets commonly used in object detection are PASCAL VOC [11] and COCO [12]. Both datasets are constructed mostly of large objects or other kinds of objects whose size fills a big part of the image; these two datasets are not suitable for small object detection. In addition, there is another dataset which is large-scale and includes a lot of classes for small object detection, collected by drones and named the VisDrone dataset [31]. However, it does not publish the labels of the test set for evaluation, and the views of the images are top-down, which is not our case. As a result, in order to evaluate the detection performance of the models, we use a dataset which was published in [13]. This dataset, called the small object dataset, is a combination of the COCO [12] and SUN [24] datasets. There are 10 classes in the small object dataset: mouse, telephone, switch, outlet, clock, toilet paper (t paper), tissue box (t box), faucet, plate, and jar. The whole dataset consists of 4925 images in total, with 3296 images for training and 1629 images for testing. The mouse class owns the largest number of instances (2137 instances in 1739 images), and the tissue box class has the smallest number (103 instances in 100 images). Apart from the small object dataset, we also filter subsets from PASCAL VOC 2007 following standard definitions. On PASCAL VOC, there are 20 classes, but for small object detection there are fewer classes under strict definitions of small objects. Table 1 lists the number of small objects and the images containing them for the subsets of the dataset.

We trained all models on the small object dataset with the same parameters. Particularly, in the training phase, we trained the models for 70k iterations with the momentum, decay, gamma, learning rate, batch size, step size, and training days given in Table 2. At first, we attempted to start off the models with a higher learning rate of 10^-2, but the models diverged, the loss value becoming NaN or Inf after the first 100 iterations. Then we tried a lower learning rate of 10^-3 for the first 100 iterations, raised to 10^-2, to check whether the models could converge when starting off at a lower learning rate; however, nothing changed. We also saw that the models converged quickly during the first 10k iterations with 10^-3 and then progressively

slowed down after 20k. Therefore, we decided to start off the training with a learning rate of 10^-3 and decrease it to 10^-4 and 10^-5 at 25k and 35k iterations, respectively. This setting shows that the loss value was stable from 40k, but we trained up to 70k to see how the loss value changes and saw that it did not change much after 40k iterations. We evaluated the models from 30k to 70k, and generally the performance of the models was not stable after 40k iterations. For this reason, we picked the weights for evaluation at 30k and 40k iterations: at 30k iterations YOLO achieves its best results, and the others get their best at 40k iterations. In the case of the subsets of PASCAL VOC 2007, we combine the train and validation sets from PASCAL VOC 2007 and 2012 to form a training set; PASCAL VOC 2012 works as a data augmentation set for PASCAL VOC 2007. We use this combined training set to train all models and test them on the subsets. All models are trained with the same parameters. First, due to the limitation of memory, we rescale all images to the same size, with the shortest side 600 and the longest side 1000, as in [15].
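This resizing rule can be expressed as a single scale-factor computation; the sketch below is our paraphrase of the rule in [15], not code from that work.

```python
# Rescale so the shortest side is 600, capping the longest side at 1000.
def rescale_size(w, h, shortest=600, longest=1000):
    scale = shortest / min(w, h)
    if max(w, h) * scale > longest:   # the longest side would exceed the cap
        scale = longest / max(w, h)
    return int(w * scale), int(h * scale)

print(rescale_size(500, 375))    # (800, 600)  -- shortest side reaches 600
print(rescale_size(1920, 1080))  # (1000, 562) -- longest side capped at 1000
```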

In YOLOv3, we run the K-means clustering algorithm to initialize 9 suitable default bounding boxes for the training and testing phases on our selected datasets, and we changed the anchor values. The following are the 9 anchors for the small object dataset after running the K-means algorithm: [10.3459, 14.4216], [26.2937, 19.0947], [21.4024, 36.3180], [47.9317, 29.1237], [40.4932, 63.7489], [83.6447, 51.3203], [72.2167, 119.9181], [172.7416, 117.0773], and [124.6597, 252.8465].
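A minimal version of this clustering step is sketched below, assuming the usual 1 − IoU distance between (width, height) pairs; the randomly generated box list is a stand-in for real annotations.

```python
# K-means over box (width, height) pairs with an IoU-based similarity,
# as commonly used to pick YOLO anchor sizes.
import random

def iou_wh(box, centroid):
    w = min(box[0], centroid[0]); h = min(box[1], centroid[1])
    inter = w * h
    return inter / (box[0] * box[1] + centroid[0] * centroid[1] - inter)

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    random.seed(seed)
    centroids = random.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:  # assign each box to the most-overlapping centroid
            best = max(range(k), key=lambda i: iou_wh(b, centroids[i]))
            clusters[best].append(b)
        centroids = [  # recompute centroids as per-cluster mean (w, h)
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centroids)

boxes = [(random.uniform(10, 300), random.uniform(10, 300)) for _ in range(500)]
print(kmeans_anchors(boxes, k=9))  # 9 anchor (width, height) pairs
```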

In Faster R-CNN, to compare fairly with the prior work and deploy on different backbones, we directly reuse the anchor scales and aspect ratios from the paper [13], namely, anchor scales of 16 × 16, 40 × 40, and 100 × 100 pixels and aspect ratios of 0.5, 1, and 2, instead of having to cluster a set of default bounding boxes as in YOLOv3. Similarly, in RetinaNet, we keep the default training settings, gamma loss 2.0, alpha loss 0.25, anchor scale 4, and scales per octave 3, because, following the authors, this configuration gives the optimized values.
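Under the common convention that a scale fixes the anchor area and a ratio fixes its height/width proportion (an assumption of this sketch, not spelled out in [13]), the nine anchor shapes can be enumerated as follows.

```python
# Enumerate anchor shapes from scales and aspect ratios, keeping the anchor
# area fixed at scale^2 while setting height/width = ratio.
scales = [16, 40, 100]
ratios = [0.5, 1.0, 2.0]

anchors = []
for s in scales:
    for r in ratios:
        w = s * (1.0 / r) ** 0.5  # chosen so that w * h = s * s
        h = s * r ** 0.5          # and h / w = r
        anchors.append((round(w, 1), round(h, 1)))
print(anchors)  # 9 anchor shapes, e.g., (22.6, 11.3) ... (70.7, 141.4)
```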

4.2. Our Newly Generated Dataset. This time, to have an objective comparison, we also use our newly generated dataset, whose information is shown in Table 1. We use it to consider the effects of object sizes on factors including models, processing time, accuracy, and resource consumption. The dataset consists of 4 subsets filtered from PASCAL VOC 2007, namely,

Figure 2: mAP of YOLOv2 on VOC 2007 at each added part [5]: batch norm, hi-res classifier, convolutional anchor boxes, new network, dimension priors, location prediction, passthrough, multiscale, and hi-res detector, improving the mAP from 63.4 through 65.8, 69.5, 69.2, 69.6, 74.4, 75.4, and 76.8 to 78.6.


VOC_WH_20, VOC_MRA_058, VOC_MRA_10, and VOC_MRA_20; detailed information is provided as follows.

(i) VOC_WH_20 contains objects whose width and height are less than 20% of the image's width and height. This subset has two classes fewer than PASCAL VOC 2007 (dining table and sofa) because of the constraint of the definition.

(ii) VOC_MRA_058, VOC_MRA_10, and VOC_MRA_20 comprise objects occupying a maximum mean relative area of the original image under 0.58%, 1.0%, and 2.0%, respectively. Two of them have the same number of classes as PASCAL VOC 2007; the exception, VOC_MRA_058, has four classes fewer (dining table, dog, sofa, and train).

5. Results and Analyses

In this section, we show the results achieved through the experimental phase. All models mentioned in this section, except for models cited from other papers, are trained in the same environment: Ubuntu 16.04.4 LTS, an Intel(R) Xeon(R) Gold 6152 CPU @ 2.10 GHz, and 1 GPU (Tesla P100). In addition to the comparative accuracy, other comparisons are also provided to make our assessment objective and clear.

5.1. Accuracy

5.1.1. Small Object Dataset. Following the detection results in Table 3, methods belonging to the two-stage approach outperform one-stage ones by about 8–10%. Specifically, Faster RCNN with the ResNeXT-101-64×4d-FPN backbone achieved the top mAP among two-stage approaches, and of the whole table as well, at 41.2%. In comparison, the top one-stage method, YOLOv3 608 × 608 with Darknet-53, obtained 33.1%. Following [32], methods based on region proposals such as Faster RCNN are better than

methods based on regression or classification such as YOLO and SSD. This holds once again in the context of the small object dataset.

Consider the methods in each approach. First, the two-stage approaches: Faster RCNN, an improvement of Fast RCNN, is only about 1–2% better than Fast RCNN, and only for the ResNeXT backbones; it is equal to Fast RCNN for the rest. The difference here is not much, meaning that an external region proposal method like selective search combined with RoI pooling performs as well as an internal region proposal network like RPN with RoI align in this case. Besides, compared to R-CNN, we perceive a boost of 8–10% when RoI pooling or RoI align is added: R-CNN, which uses region proposals from selective search, feeds them into the network, and directly computes features from fc (fully connected) layers, only reaches 23.5% with Alexnet and 24.8% with VGG16 combined with proposals from RPN. However, Fast RCNN and Faster RCNN with the two kinds of RoIs are much better: Fast RCNN achieves accuracy in a range of 31.7% to 39.6% depending on the backbone, and similarly, Faster RCNN gets 30.1% to 41.2%. Second, in the one-stage approaches, YOLO outperforms SSD and RetinaNet: YOLO gets the highest outcome, 33.1%, while SSD and RetinaNet get 11.32% and 30%, respectively. YOLO and SSD are considered state-of-the-art methods in speed, sacrificing accuracy. However, there is a large difference in accuracy between YOLO and SSD; the difference is that SSD adds multiple convolutional layers behind the backbone, each layer with its own ability, instead of using 2 fully connected layers like YOLO. Although RetinaNet is assigned to the one-stage approach, it cannot run in real time. RetinaNet is the one proposed to deal with the imbalance between foreground and background through the focal loss; therefore, RetinaNet obtains a higher accuracy in comparison with the others, except for YOLOv3 (Darknet-53).

When it comes to the backbones, we realized that Darknet-53 is the best among the one-stage and real-time methods, far higher even than ResNet-50, although it has a similar number of layers. In contrast, ResNeXT combined with FPN is the most powerful one in both one-stage and

Table 1: The information of the subsets.

Subsets | Classes | Images | Instances
VOC_MRA_058 | 16 | 329 | 529
VOC_MRA_10 | 20 | 2231 | 5893
VOC_MRA_20 | 20 | 2970 | 7867
VOC_WH_20 | 18 | 1070 | 2313

Table 2: The parameters of the models.

Method | Momentum | Decay | Gamma | Learning_rate | Batch_size | Training_days | Stepsize
YOLOv2 [16] | 0.9 | 0.0005 | — | 0.001 | 8 | 5 | 25000
YOLOv3 | 0.9 | 0.0005 | — | 0.001 | 32 | 3–4 | 25000
SSD300 [16] | 0.9 | 0.0005 | 0.1 | 0.000004 | 12 | 9 | 40000, 80000
SSD512 [16] | 0.9 | 0.0005 | 0.1 | 0.000004 | 12 | 12 | 100000, 120000
RetinaNet | 0.9 | 0.0005 | 0.1 | 0.001 | 64 | 4→12 h | 25000, 35000
Fast RCNN | 0.9 | 0.0005 | 0.1 | 0.001 | 64 | 4→12 h | 25000, 35000
Faster RCNN | 0.9 | 0.0005 | 0.1 | 0.001 | 64 | 4→12 h | 25000, 35000


two-stage methods if we only consider accuracy. Overall, there is an increase of about 1–3% when changing from a simple backbone to a complex one of the same type. For example, when switching from the original ResNet to ResNet-FPN, the accuracy is boosted by 2% to 3%. It is clear that leveraging the advantages of the multiscale features of FPN is a common way to improve detection and tackle the scale imbalance of input images and the bounding boxes of different objects. Similarly, when we switch from ResNeXT-101-32×8d-FPN to ResNeXT-101-64×4d-FPN, the accuracy changes from 40.5% to 41.2% for Faster RCNN and from 38.7% to 39.6% for Fast RCNN. However, between ResNet-50-FPN and ResNet-101-FPN, the growth only happens for Fast RCNN, from 33.3% to 35.5%; there is a slight decrease of 0.1% for Faster RCNN. This reduction also happens with RetinaNet, where the simpler backbone ResNeXT-101-32×8d-FPN gets 30% and ResNeXT-101-64×4d-FPN gets just 25.1%. This means that very deep backbones do not guarantee an increase in accuracy; the reason is that a deeper network needs more parameters to learn, so one must have a large amount of data to feed the network to train and update the parameters, but in this case the data of the small object dataset are not abundant enough to fit a very deep network, hence increasing the chances of overfitting. Besides, features originally from the early layers of ResNet are not well generalized, because when they are combined with FPN, the accuracy improves by about 2–3%. When YOLO switches from Darknet-19 to Darknet-53, the accuracy is really boosted: the highest accuracy with Darknet-19, at a resolution of 1024 × 1024, is just 24.02%, whereas YOLO 608 × 608 with Darknet-53 gets 33.1%. The explanation is that YOLOv3 with Darknet-53 has several improvements over Darknet-19: YOLOv3 has 3 scale locations to predict objects, one of them specialized in small objects, instead of only one as with Darknet-19, and it also integrates cutting-edge advances such as residual blocks and shortcut connections. The reduction in accuracy happens again with YOLO when switching from ResNet-101 to ResNet-152, by about 1–2%. Among these methods, YOLO and SSD are the only ones which allow multiple input sizes. The higher the resolution of the input images, the higher the accuracy the method receives; a higher-resolution image provides more pixels to describe the visual information of small objects. However, if the resolution is far from the original size of the images, it results in a decrease in accuracy: for example, YOLO 1024 × 1024 with Darknet-19 gets a lower accuracy than the 800 × 800 resolution. In addition, we tried increasing the resolution of Darknet-53 from 608 to 1024, and the mAP decreases when the resolution goes over 608 × 608. Therefore, the effect of image size is clear for models like SSD and YOLO. Generally, all comparative mAP results on this dataset are dominated by the classes very great in number, which is caused by the data imbalance between the number of images and the instances in those images. For example, according to the statistics in [13], mouse is a major class significantly contributing to the mAP in Table 3, with the highest number of instances and images; however, tissue box contributes least, with the lowest AP, originally affected by the amount of data.

Furthermore, the imbalanced data lead models to tend to detect frequent objects, implying that models will misunderstand objects having a nearly similar appearance to the dominating class as objects of interest rather than less frequent objects. As a result, false positives increase because of these problems. Figure 4 illustrates the detections with the strongest backbones. Following this visualization, the domination of classes such as mouse or faucet results in misdetections in areas which have a similar appearance to them. This misunderstanding tends to affect the weaker backbones in the comparison, and a one-stage method like YOLO, which primarily aims at speed, has more misdetections than the two-stage methods. A reason for these problems is the difference in the way deep networks are trained [33]: one-stage methods such as YOLO use a soft sampling method that uses the whole dataset to update parameters, rather than choosing only some samples from the training data, whereas two-stage methods such as the RCNN family tend to employ hard sampling methods that randomly sample a certain number of positive and negative bounding boxes to train the network.

5.1.2. Subsets of PASCAL. With 4 subsets of 4 different scales of objects in images, we want to find out how much the scales impact the models. The whole results are shown in Table 4. We separate the results into 2 groups, the one-stage and two-stage approaches, and Figure 5 is a visualization of the strongest backbones of each method on the subsets.

With different scales as in our subsets, there is a difference between the one-stage and two-stage approaches. In this case, methods from the one-stage approach perform better than two-stage ones at most scales; this is the exact opposite of the small object dataset. Specifically, two-stage methods are totally better than one-stage ones with real-time inputs and just a bit better, about 10–20%, than the non-real-time models on VOC_WH_20, with the same result for smaller objects in VOC_MRA_0058 and VOC_MRA_010. However, on the bigger objects in VOC_MRA_020, the one-stage methods have significantly better outcomes than the two-stage ones. In addition, only Faster RCNN has good performance in most cases compared to the one-stage methods; Fast RCNN is only good at the big objects in VOC_MRA_020 and fails to produce good detections on smaller objects.

In the one-stage approach, among the methods which allow multiple inputs like YOLO and SSD, there are 2 kinds, namely, those that can run in real time and those that cannot, when the resolution is over 640 or 512 for YOLO and SSD, respectively. For the real-time ones, YOLO outperforms SSD at all scales of objects. Specifically, YOLOv2 with Darknet-19 is better than SSD by 2.6% on objects in VOC_MRA_0058 and VOC_MRA_010 and by 4–15% on larger objects in VOC_MRA_020 and VOC_WH_20. YOLOv3 with Darknet-53 gets results about 3–5% higher than YOLOv2; hence, YOLOv3 also gets higher results than SSD. However, if we consider non-real-time input images, SSD is greater than YOLO on objects in VOC_MRA_010. RetinaNet, the one-stage method that cannot run in real time, performs on par with the non-real-time configurations of YOLO


and better than SSD. RetinaNet is more stable than SSD and YOLO when the scales change; the bigger the objects, the greater the stability. For example, the change is large, about 33%, when the scale increases from objects in VOC_MRA_0058 to those in VOC_MRA_010 and VOC_MRA_020; however, this change is small, about 10%, for bigger objects, in comparison with YOLO's 15–25%. In the case of YOLO, this remarkable increase in accuracy when objects are larger is obviously good for a model. The change in SSD resembles the change in RetinaNet.

Concerning the resolutions of YOLO and SSD, we see that increasing the image resolution generally pushes the accuracy up. YOLOv2 with Darknet-19, YOLOv3 with Darknet-53, and SSD all gain accuracy when the resolution is larger, except for YOLOv2

on objects belonging to VOC_MRA_010 and VOC_MRA_020 when the image is over 800. In addition, YOLOv2 fluctuates on objects in VOC_WH_20. As mentioned in our previous work, YOLO is better than SSD on objects occupying less than 10% of the image; however, in this case, YOLOv3 is good at all scales of objects. This is because YOLOv3 has 3 detection locations coming with more ratios of default boxes, which leads to a significant outcome when combining results from the 3 locations.

When we switch to the two-stage approaches, Faster RCNN shows a significant improvement over Fast RCNN at most scales, except for objects in VOC_MRA_020, where they have the same accuracy. This shows that if objects are completely separated into different scales, RoI pooling does not work well with smaller objects and those in VOC_WH_20.

Table 3: Comparative results on small object dataset.

Method | Backbone | Clock | Faucet | Jar | Mouse | Outlet | Plate | Switch | Tel | t box | t paper | mAP
YOLO 416 [16] | Darknet-19 | 22.8 | 30.8 | 4 | 52 | 20.4 | 13.1 | 13 | 6.1 | 0 | 35.3 | 19.39
YOLO 448 [16] | Darknet-19 | 23 | 36.9 | 9 | 52.5 | 18.4 | 13.6 | 17.5 | 4.2 | 0 | 34.3 | 20.13
YOLO 480 [16] | Darknet-19 | 34.2 | 37.3 | 9.1 | 53.3 | 21.4 | 13.6 | 15.8 | 9.1 | 9.1 | 34.2 | 23.71
YOLO 512 [16] | Darknet-19 | 23.1 | 36.6 | 6.1 | 59.8 | 24.6 | 14.2 | 15.7 | 9.1 | 4.5 | 32.4 | 22.61
YOLO 544 [16] | Darknet-19 | 23.4 | 37.2 | 9.1 | 60.1 | 27.2 | 13.4 | 19.9 | 9.1 | 4.5 | 34.5 | 23.84
YOLO 640 [16] | Darknet-19 | 20.2 | 36.2 | 3.2 | 59.8 | 27.8 | 11.7 | 18.1 | 8.2 | 4.5 | 35.6 | 22.53
YOLO 800 [16] | Darknet-19 | 27.6 | 36 | 2.3 | 60.2 | 32.8 | 13.1 | 23.3 | 9.1 | 9.1 | 26.7 | 24.02
YOLO 1024 [16] | Darknet-19 | 21.7 | 29.3 | 1.4 | 58.3 | 26.4 | 11.8 | 17.5 | 9.1 | 9.1 | 15.7 | 20.03
YOLO 320 | Darknet-53 | 26.22 | 38.38 | 4.55 | 56.46 | 36.42 | 13.34 | 2.48 | 10.65 | 4.55 | 42.96 | 25.83
YOLO 416 | Darknet-53 | 28.47 | 47.15 | 10.83 | 60.49 | 43.15 | 15.87 | 30.73 | 15.15 | 2.62 | 48.3 | 30.28
YOLO 608 | Darknet-53 | 29.98 | 47.89 | 10.76 | 65.88 | 48.02 | 18.09 | 31.22 | 14.62 | 17.99 | 46.56 | 33.1
YOLO 320 | ResNet-50 | 19.57 | 25.73 | 0.67 | 45.17 | 14.37 | 9.38 | 13.84 | 9.09 | 9.09 | 23.7 | 17.06
YOLO 416 | ResNet-50 | 23.78 | 36.65 | 0.4 | 54.23 | 18.37 | 13.75 | 19.78 | 9.84 | 9.42 | 35.68 | 22.19
YOLO 608 | ResNet-50 | 26.92 | 40.65 | 1.77 | 61.86 | 29.18 | 15.04 | 20.24 | 10.09 | 13.29 | 36.01 | 25.5
YOLO 320 | ResNet-101 | 20.52 | 27.9 | 0.57 | 44.68 | 16.98 | 13.05 | 13.66 | 9.66 | 9.09 | 24.36 | 18.05
YOLO 416 | ResNet-101 | 25.72 | 35.6 | 3.03 | 55.73 | 22.4 | 15.61 | 17.26 | 9.32 | 3.03 | 38.71 | 22.64
YOLO 608 | ResNet-101 | 28.79 | 44.59 | 9.42 | 62.18 | 33.34 | 15.53 | 23.88 | 13.24 | 15.83 | 39.17 | 28.6
YOLO 320 | ResNet-152 | 21.64 | 27.56 | 3.03 | 48.06 | 17.39 | 11.12 | 14.51 | 9.09 | 4.55 | 31.88 | 18.88
YOLO 416 | ResNet-152 | 25.7 | 36.54 | 0.89 | 53.81 | 20.6 | 14.13 | 20.21 | 11.49 | 0.29 | 33.06 | 21.67
YOLO 608 | ResNet-152 | 26.01 | 44.54 | 4.55 | 61 | 31.76 | 13.02 | 22.67 | 12.35 | 9.93 | 39.99 | 26.58
SSD300 [16] | ResNet-101 | 5.5 | 9.1 | 0 | 25.5 | 6.1 | 4.5 | 0 | 4.5 | 9.1 | 18.2 | 8.25
SSD300 [16] | VGG16 | 9.1 | 17.1 | 0 | 26.1 | 9.1 | 9.1 | 0 | 4.5 | 0 | 16.7 | 9.16
SSD512 [16] | VGG16 | 9.1 | 17.1 | 0 | 43 | 9.1 | 9.1 | 9.1 | 9.1 | 0 | 7.6 | 11.32
RetinaNet | ResNet-50-FPN | 30.7 | 49.3 | 2 | 65.5 | 21.3 | 16.1 | 8.5 | 12.9 | 1 | 25.7 | 23.3
RetinaNet | ResNet-101-FPN | 30.6 | 48.7 | 7.1 | 64.7 | 20 | 15.9 | 11.8 | 10.7 | 2.9 | 38.7 | 25.1
RetinaNet | ResNeXT-101-32×8d-FPN | 35.5 | 55 | 12.1 | 66.5 | 23.9 | 18.4 | 9.8 | 16.2 | 9.4 | 53.7 | 30
RetinaNet | ResNeXT-101-64×4d-FPN | 31.4 | 50.2 | 8.9 | 66.3 | 20.8 | 15.3 | 9.4 | 14 | 2.2 | 32.4 | 25.1
R-CNN [13] | RPN prop + VGG16 | 31.9 | 31.3 | 4.2 | 56.8 | 31.1 | 9.3 | 14.2 | 16.4 | 23.4 | 29.4 | 24.8
R-CNN [13] | Alexnet 7×, 300 prop | 32.4 | 27.2 | 5.1 | 56.9 | 28 | 9.8 | 13.6 | 12.4 | 17.9 | 35.6 | 23.9
R-CNN [13] | VGG16 7×, 300 prop | 37.3 | 30.3 | 7.2 | 60.6 | 41.5 | 15.8 | 21.5 | 13.7 | 22 | 33.3 | 28.4
R-CNN [13] | ContextNet (Alexnet 7×) | 32.7 | 26.8 | 4.6 | 56.4 | 26.3 | 9.9 | 12.9 | 12.2 | 18.7 | 34 | 23.5
Fast RCNN | ResNet-50-C4 | 32.4 | 46.3 | 6.5 | 65.8 | 38.3 | 20.1 | 25.3 | 16.6 | 14.1 | 52 | 31.7
Fast RCNN | ResNet-50-FPN | 37.4 | 47.3 | 7.3 | 68.9 | 46.7 | 21 | 32.1 | 17.1 | 9.3 | 45.9 | 33.3
Fast RCNN | ResNet-101-FPN | 39.3 | 50.3 | 10.6 | 68.3 | 47.1 | 20.4 | 33.3 | 18.6 | 15.4 | 51.4 | 35.5
Fast RCNN | ResNeXT-101-32×8d-FPN | 47.5 | 54.8 | 10.3 | 71.8 | 54 | 21.4 | 34.4 | 21.7 | 17.7 | 53.5 | 38.7
Fast RCNN | ResNeXT-101-64×4d-FPN | 45.4 | 55.7 | 10.9 | 72.5 | 53.3 | 24 | 36.9 | 22.9 | 16 | 58.1 | 39.6
Faster R-CNN [16] | VGG16 | 23.76 | 37.65 | 8.03 | 54 | 16.16 | 11.88 | 15.12 | 9.1 | 6.25 | 37.29 | 21.92
Faster RCNN | ResNet-50-C4 | 32.2 | 44.6 | 6.6 | 65.9 | 35.2 | 17.5 | 25.7 | 19.6 | 13.7 | 40 | 30.1
Faster RCNN | ResNet-50-FPN | 35.7 | 49.9 | 7.3 | 68.4 | 48.9 | 18.8 | 29.6 | 14.7 | 11.4 | 53.3 | 33.8
Faster RCNN | ResNet-101-FPN | 39.8 | 49.2 | 4.9 | 68.2 | 47 | 18.5 | 29.7 | 14 | 12.9 | 52.2 | 33.7
Faster RCNN | ResNeXT-101-32×8d-FPN | 49.8 | 56.6 | 11.4 | 72.1 | 56.3 | 23.2 | 37 | 20.8 | 18.8 | 58.7 | 40.5
Faster RCNN | ResNeXT-101-64×4d-FPN | 49.6 | 58.6 | 12.2 | 72.5 | 54.5 | 23.2 | 36.9 | 20.8 | 20.1 | 63.1 | 41.2

The values in bold represent the best among one-stage methods, and the ones in italics represent the highest among two-stage methods.


In addition, compared with the one-stage methods, it is significantly lower than them. However, RoI align along with RPN performs well when the scales are changed. When it comes to the backbones, there is a slight decrease in accuracy when changing from ResNet-50-FPN to ResNet-101-FPN or from ResNeXT-101-32×8d-FPN to ResNeXT-101-64×4d-FPN, with objects from all scales, for both Faster RCNN and Fast RCNN. The VGG16 backbone has an impressive outcome compared with strong backbones such as ResNet or ResNeXT: although its accuracy is less than the two strong backbones, VGG16 is still better on objects in VOC_WH_20 and changes little in accuracy as object sizes grow.

5.2. Time Processing and Resource Consumption. Tables 5 and 6 show the performance comparison of the evaluated models with their base networks. Generally, we see that RAM consumption in testing and training increases as more layers are added. This means that the deeper the network, the greater the processing demand, because of the increase in parameters and in the time to process data as well. YOLO is the model consuming the least memory in both the training and testing phases. Particularly, YOLO takes only 4 GB to 5 GB for training and 1.6 GB to 1.8 GB for testing with Darknet-53. YOLO is also the only one able to run in real time: it needs only about 0.03 s to 0.04 s to process an image, in comparison to more than 0.1 s and 0.2 s

for Faster RCNN and RetinaNet. This allows us to deploy such models on devices with modest memory. While RetinaNet is assigned to the one-stage approach, it is not fast enough to meet real-time detection. The inference time of Fast RCNN is a little lower than Faster RCNN and RetinaNet; in contrast, the RAM consumption in training and testing of RetinaNet is lower than Fast RCNN and Faster RCNN. Of all architectures, ResNet-50-C4 is the one requiring the most memory and time to process data, because the output size of ResNet-50-C4 is a bit bigger than the others [9]. If we consider ResNet or ResNeXT combined with FPN, Faster RCNN is over 100 MB above Fast RCNN and 300 MB above RetinaNet. In addition, according to Table 2, Faster RCNN and RetinaNet need less time for training, only a few hours to 1 day, rather than YOLO's 3–4 days. This demonstrates that if we pay attention to performance and do not have much time for training, we choose Faster RCNN or RetinaNet instead of YOLO. In contrast, if we only focus on processing speed while still achieving good performance, the one-stage methods are always the good ones. In the same context of backbones, RetinaNet uses fewer resources at testing time than Fast RCNN and Faster RCNN, by about 100 MB and 300 MB, respectively. However, at training time, RetinaNet uses much more memory than Fast RCNN, by about 2.8 GB, and than Faster RCNN, by about 2.3 GB, for ResNeXT-101-32×8d-FPN and ResNeXT-101-64×4d-FPN.

Figure 3: The location of the default boxes at different scales. (a) Image with GT boxes. (b) 8 × 8 feature map. (c) 4 × 4 feature map. Each default box predicts location offsets, loc: Δ(cx, cy, w, h), and class confidences, conf: (c1, c2, ..., cp).

Table 4: The comparative results on subsets of PASCAL VOC 2007.

Approach | Method | VOC_MRA_0058 | VOC_MRA_010 | VOC_MRA_020 | VOC_WH_20
One stage | YOLOv2 416 [16] | 3.02 | 31.38 | 42.89 | 18.52
One stage | YOLOv2 448 [16] | 4.47 | 32.9 | 60.15 | 21.96
One stage | YOLOv2 480 [16] | 4.26 | 33.48 | 60.78 | 26.67
One stage | YOLOv2 512 [16] | 5.42 | 35.74 | 61.12 | 24.63
One stage | YOLOv2 544 [16] | 6.97 | 36.56 | 63 | 26.62
One stage | YOLOv2 640 [16] | 7.7 | 37.97 | 61.29 | 23.41
One stage | YOLOv2 800 [16] | 10.24 | 37.3 | 61.91 | 26.9
One stage | YOLOv2 1024 [16] | 10.69 | 29.93 | 55.14 | 28.97
One stage | YOLOv3 320 | 7.18 | 34.58 | 60.36 | 20.4
One stage | YOLOv3 416 | 10.2 | 38.97 | 62.53 | 24.12
One stage | YOLOv3 608 | 11.7 | 42.65 | 68.56 | 28.86
One stage | SSD 300 [16] | 1.71 | 32.76 | 46.26 | 16.91
One stage | SSD 512 [16] | 2.9 | 43.46 | 57.11 | 19.87
One stage | RetinaNet-ResNet-50-FPN | 8.84 | 41.5 | 50.2 | 28.14
One stage | RetinaNet-ResNet-101-FPN | 8.95 | 42.5 | 51.9 | 27.46
One stage | RetinaNet-ResNeXT-101-32×8d-FPN | 10.29 | 45.4 | 54.5 | 30.08
One stage | RetinaNet-ResNeXT-101-64×4d-FPN | 10.71 | 45.5 | 55.1 | 31.32
Two stage | Fast RCNN-ResNet-50-C4 | 0.23 | 13.2 | 49.9 | 3.93
Two stage | Fast RCNN-ResNet-50-FPN | 0.63 | 13.5 | 55.6 | 3.45
Two stage | Fast RCNN-ResNet-101-FPN | 0.39 | 15.9 | 57.6 | 3.12
Two stage | Fast RCNN-ResNeXT-101-32×8d-FPN | 0.51 | 14.4 | 57.9 | 3.33
Two stage | Fast RCNN-ResNeXT-101-64×4d-FPN | 0.29 | 14.2 | 57.3 | 3.76
Two stage | Faster RCNN-ResNet-50-C4 | 6.98 | 39.9 | 48.7 | 26.04
Two stage | Faster RCNN-ResNet-50-FPN | 10.74 | 45.6 | 56.3 | 29.79
Two stage | Faster RCNN-ResNet-101-FPN | 10.63 | 46.9 | 57.6 | 30.57
Two stage | Faster RCNN-ResNeXT-101-32×8d-FPN | 11.64 | 47.3 | 57.6 | 32.12
Two stage | Faster RCNN-ResNeXT-101-64×4d-FPN | 10.54 | 47.1 | 56.9 | 31.64
Two stage | Faster RCNN-VGG16 [16] | 5.73 | 35.58 | 44.14 | 41.11

This table illustrates how well the models adapt to different scales of objects. The values in bold represent the best among one-stage methods, and the ones in italics represent the highest among two-stage methods.


On the small object dataset, this does not matter too much, because RetinaNet is about 10% lower than Faster RCNN in performance. Otherwise, on the different scales of the subsets, RetinaNet works well compared to Faster RCNN, and

the difference is just 2–4 percentage points. Although the ResNet backbones combined with the others yield an improvement in accuracy, they do not work for YOLO on small object datasets. YOLO with Darknet-53 utilizes more resources

Table 5: The comparison of consumption on small object dataset.

Model | Backbone | Inference time (s) | Test RAM (MiB) | Train RAM (MiB)
YOLOv3 | Darknet-53 | 0.0331 | 1825 | 4759
YOLOv3 | ResNet-50 | 0.027 | 1285 | 3479
YOLOv3 | ResNet-101 | 0.0356 | 1829 | 5383
YOLOv3 | ResNet-152 | 0.0454 | 2443 | 7531
RetinaNet | ResNet-50-FPN | 0.102 | 2075 | 4435
RetinaNet | ResNet-101-FPN | 0.127 | 2723 | 5577
RetinaNet | ResNeXT-101-32×8d-FPN | 0.229 | 3767 | 7863
RetinaNet | ResNeXT-101-64×4d-FPN | 0.292 | 3719 | 7813
Fast RCNN | ResNet-50-C4 | 0.3 | 6449 | 5877
Fast RCNN | ResNet-50-FPN | 0.089 | 2277 | 4455
Fast RCNN | ResNet-101-FPN | 0.113 | 2947 | 5627
Fast RCNN | ResNeXT-101-32×8d-FPN | 0.212 | 3987 | 4961
Fast RCNN | ResNeXT-101-64×4d-FPN | 0.269 | 3885 | 4799
Faster RCNN | ResNet-50-C4 | 0.412 | 6609 | 6129
Faster RCNN | ResNet-50-FPN | 0.101 | 2387 | 5381
Faster RCNN | ResNet-101-FPN | 0.124 | 3001 | 6487
Faster RCNN | ResNeXT-101-32×8d-FPN | 0.256 | 4027 | 5333
Faster RCNN | ResNeXT-101-64×4d-FPN | 0.286 | 4003 | 5246

Table 6: The comparison of consumption on subsets filtered from PASCAL VOC.

Model | Backbone | Inference time (s) | Test RAM (MiB) | Train RAM (MiB)
YOLOv3 | Darknet-53 | 0.027 | 1645 | 4079
RetinaNet | ResNet-50-FPN | 0.1 | 1935 | 4133
RetinaNet | ResNet-101-FPN | 0.116 | 2585 | 5435
RetinaNet | ResNeXT-101-32×8d-FPN | 0.222 | 3641 | 7723
RetinaNet | ResNeXT-101-64×4d-FPN | 0.284 | 3561 | 7599
Fast RCNN | ResNet-50-C4 | 0.495 | 6371 | 5677
Fast RCNN | ResNet-50-FPN | 0.092 | 2131 | 4387
Fast RCNN | ResNet-101-FPN | 0.114 | 2819 | 5463
Fast RCNN | ResNeXT-101-32×8d-FPN | 0.213 | 3873 | 4637
Fast RCNN | ResNeXT-101-64×4d-FPN | 0.265 | 3735 | 4575
Faster RCNN | ResNet-50-C4 | 0.26 | 6141 | 5991
Faster RCNN | ResNet-50-FPN | 0.1 | 2245 | 5207
Faster RCNN | ResNet-101-FPN | 0.13 | 2855 | 6335
Faster RCNN | ResNeXT-101-32×8d-FPN | 0.225 | 3943 | 5087
Faster RCNN | ResNeXT-101-64×4d-FPN | 0.276 | 3885 | 4909

Figure 4: Highlight of bounding boxes from comparative backbones on the small object dataset. We select YOLO with Darknet-53 and ResNet-50 for an objective comparison because the two networks have obviously the same number of layers along with the significant techniques such as skip connections and residual blocks. The bounding boxes show that ResNet-50 is more sensitive than Darknet-53 to areas resembling the objects of interest. Similarly, ResNet-50-FPN and ResNet-50-C4 are chosen for consideration; the detections show that combining ResNet-50 with FPN yields a better performance than the original one. Particularly, misdetections happen with more density for ResNet-50-C4 than ResNet-50-FPN, as in columns 4 and 5. Zoom in to see more detail.


than the ResNet ones, but it has the best accuracy among the models. Therefore, we only test YOLO with Darknet-53 on the subsets of PASCAL.

5.3. Analyses of the Trade-Off among Detectors. Network designs and approaches like the one-stage approach prove their performance when applied to detect general objects, at both small scales and other kinds of scales. Although they are fast and accurate, there is still a drawback that always exists in these models: the trade-off between accuracy and processing speed. For example, YOLOv3 proposes the idea of performing detection at three different scales, and the result is obviously impressive and yields good performance. However, to gain this advantage, YOLOv3 has to sacrifice processing time: instead of the model normally processing an input once for detection like YOLOv2, this idea must run 3 times. This trade-off is also partly affected by resolution, as we change it during training or testing our models; in our previous work, we mentioned that we have to choose the right resolution to ensure our models work properly. In the case of the two-stage approaches, the idea of proposing region proposals to improve the localization of objects for detection is good as well. This is useful, but we have to take into account whether we should generate proposals on feature maps or directly on input images, because this greatly affects the way models run and identify representations of objects. If objects are normal or have a big or medium appearance, the models work well, but if objects come at multiple scales, this is a problem to consider and research deeply in order to balance and improve performance. Therefore, to partly fix this problem, the one-stage approach allows us to choose a fixed input size for training and testing, but the support still depends on the characteristics of the datasets we evaluate and the image size. After all, all the models we chose to evaluate are affected by the scales of objects: when we change the scale, the accuracy of the models changes a lot, except for Faster RCNN, the only model that seems to be stable with scale, especially when combined with the VGG16 architecture. Although the accuracy of VGG16 is not better than the other architectures, the difference is that it does not change too much in accuracy. This only holds for big objects having an overlap between the bounding box and the image greater than 10%; if not, this is not assured.

Figure 1 shows that the possible locations of small objects outnumber those of other objects. The black lens of the camera is somehow similar to the black mouse placed on a mouse pad. This multiplicity of possible small object locations causes more difficulties for detectors and leads to wrong detections: small objects can be anywhere in an image, so detectors make many wrong detections on familiar appearances they have seen. If we consider the visualization of the detections in Figure 4, the wrong detections are partly similar to the appearance of the other objects in the dataset. This problem is caused by the data imbalance between classes and the instances in each class, originally known as the foreground-foreground class imbalance. In

other words, the common problems, which happen not only with small objects but also across whole datasets, are intraclass similarity and interclass variation.

6. Conclusion

Small object detection is a challenging and interesting problem in the task of object detection and has drawn attention from researchers thanks to the development of deep learning, which is the motivation for improving the performance of tasks in computer vision. Although deep detection models originally tend to solve problems relating to general object detection, they still work to a particular level of success on small object detection. As an evaluation work on small object detection for deep models, our goal is to highlight remarkable achievements of popular and state-of-the-art deep models in order to provide a variety of views on applying deep models to small object detection. Particularly, we evaluate state-of-the-art detectors based on deep learning from the two approaches, namely, YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, on two datasets, the small object dataset and subsets filtered from PASCAL VOC, regarding the effects of different factors objectively, including accuracy, execution time, and resource usage.

In spite of the successful achievements in recent years, whereby detection performance has improved significantly, there is still a huge gap in accuracy between normal objects and small objects. Under the criteria of the COCO dataset, the difference from the small scale to the medium and big scales is too large. Most models are good at detecting normal objects, and problems happen when applying them to detect small objects. As a result, to reduce the gap in small object detection, the first thing to do is invest in datasets which have massive amounts of data to train models and a wide range of categories, to compete with the human visual system, as in [12, 34].

So far, detection models are divided into two main approaches, namely, the one-stage approach and the two-stage approach. Models in the one-stage approach are known as detectors which have better and more efficient detection in comparison to the other approach. The efficiency here is the potential to run in real time and the ability to apply them to practical applications; however, the trade-off between accuracy and speed is a difficult challenge which needs to be taken into account in order to close the gap. Models in the two-stage approach, on the other hand, have their reputation as region-based detectors which have high accuracy but are too slow to apply to the real world. This drawback comes from the computation of the networks.

Through our evaluation, it is a fact that the architectures utilized as base networks to extract deep features have significant effects on the frameworks. The deeper the architecture, the higher the accuracy of detection. Once a network increases in depth, it has more layers than normal ones and massive numbers of parameters to train; hence, it needs a lot of data to fine-tune these parameters reasonably. With an increase in computation, resource consumption will also increase. As a result, it


will be difficult to apply them in practical applications. Besides, the contextual exploitation in models is definitely limited; this causes much useful and informative data to be ignored in training, especially in the context of small objects. Because small objects are able to appear anywhere in an input image, if the image is well exploited with its context, the performance of small object detection will be improved.

For all the above reasons and according to our evaluation, if we aim at good performance and ignore processing speed, two-stage methods like Faster RCNN perform well and demonstrate their network design on different datasets in many contexts of objects, including multiscale objects; therefore, Faster RCNN is considered a giant baseline to base on or develop from. If our target is a balance of accuracy and speed, YOLO is a good one, in case we do not care about the training time, because the sacrifice between speed and accuracy is worth applying it to practical applications; otherwise, Faster RCNN or RetinaNet is still a substitute to work on. When it comes to backbones, we have to consider the data in order to choose a reasonable backbone to combine with the methods, because the amount of data significantly impacts the model: if the data are not abundant, a shallow network will fit them well. Besides, there is recently a novel approach promising for training deep models with less data, weakly supervised learning, such as zero-shot, one-shot, or few-shot learning; these approaches will be considered in our future works. Following our recent research, to have better performance on object detection, we have to consider several factors to improve the mAP, such as multiscale training, superresolution for scaling up the visual information of small objects [35], or preprocessing data to avoid data imbalance, because there is a wide range of imbalance problems relating to data [33].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the Vietnam National University Ho Chi Minh City (VNU-HCM) under grant no. B2017-26-01.

References

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, Columbus, OH, USA, June 2014.

[2] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proceedings of the European Conference on Computer Vision, pp. 346–361, Springer, Zurich, Switzerland, September 2014.

[3] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, Santiago, Chile, December 2015.

[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, Las Vegas, NV, USA, June 2016.

[5] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," 2016, https://arxiv.org/abs/1612.08242.

[6] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.

[7] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE Transactions on Pattern Analysis and Machine Intelligence, Venice, Italy, October 2018.

[8] K. Zidek, A. Hosovsky, J. Pitel', and S. Bednar, "Recognition of assembly parts by convolutional neural networks," in Advances in Manufacturing Engineering and Materials, pp. 281–289, Springer, Cham, Switzerland, 2019.

[9] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, IEEE, Venice, Italy, October 2017.

[10] L.-C. Chen, A. Hermans, G. Papandreou et al., "Instance segmentation by refining object detection with semantic and direction features," 2017, https://arxiv.org/abs/1712.04837.

[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.

[12] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft COCO: common objects in context," in Proceedings of the European Conference on Computer Vision, pp. 740–755, Springer, Zurich, Switzerland, September 2014.

[13] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, "R-CNN for small object detection," in Proceedings of the Asian Conference on Computer Vision, pp. 214–230, Springer, Taipei, Taiwan, November 2016.

[14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.

[15] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," in Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS'15), pp. 91–99, MIT Press, Cambridge, MA, USA, 2015, http://dl.acm.org/citation.cfm?id=2969239.2969250.

[16] P. Pham, D. Nguyen, T. Do, T. D. Ngo, and D.-D. Le, "Evaluation of deep models for real-time small object detection," in Proceedings of the International Conference on Neural Information Processing, pp. 516–526, Springer, Guangzhou, China, November 2017.

[17] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.

[18] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu, "Traffic-sign detection and classification in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2110–2118, Las Vegas, NV, USA, June 2016.

[19] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: a large data set for nonparametric object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.

[20] A. Kembhavi, D. Harwood, and L. S. Davis, "Vehicle detection using partial least squares," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1250–1265, 2011.

[21] V. I. Morariu, E. Ahmed, V. Santhanam, D. Harwood, and L. S. Davis, "Composite discriminant factor analysis," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 564–571, IEEE, Steamboat Springs, CO, USA, March 2014.

[22] A. Andreas, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 2012.

[23] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: human trajectory prediction in crowded spaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971, Las Vegas, NV, USA, June 2016.

[24] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, "Sun database: exploring a large collection of scene categories," International Journal of Computer Vision, vol. 119, no. 1, pp. 3–22, 2016.

[25] E. Dong, Y. Zhu, Y. Ji, and S. Du, "An improved convolution neural network for object detection using YOLOv2," in Proceedings of the 2018 IEEE International Conference on Mechatronics and Automation (ICMA), pp. 1184–1188, IEEE, Changchun, China, August 2018.

[26] W. Liu, D. Anguelov, D. Erhan et al., "SSD: single shot multibox detector," in Proceedings of the European Conference on Computer Vision, pp. 21–37, Springer, Amsterdam, The Netherlands, October 2016.

[27] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, p. 4, Honolulu, HI, USA, July 2017.

[28] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: deconvolutional single shot detector," 2017, https://arxiv.org/abs/1701.06659.

[29] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision, pp. 210–226, Springer, Munich, Germany, September 2018.

[30] L. Liu, W. Ouyang, X. Wang et al., "Deep learning for generic object detection: a survey," 2018, https://arxiv.org/abs/1809.02165.

[31] P. Zhu, L. Wen, X. Bian, L. Haibin, and Q. Hu, "Vision meets drones: a challenge," 2018, https://arxiv.org/abs/1804.07437.

[32] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: a review," 2018, https://arxiv.org/abs/1807.05511.

[33] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas, "Imbalance problems in object detection: a review," 2019, https://arxiv.org/abs/1909.00169.

[34] O. Russakovsky, J. Deng, H. Su et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[35] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 206–221, Munich, Germany, September 2018.



In addition, the number of classes in current small object datasets is less than in common datasets. Besides, most of the state-of-the-art detectors, in both one-stage and two-stage approaches, have struggled with detecting small objects. As a result, we presented an in-depth evaluation of existing deep learning models for detecting small objects in our prior work [16], where we evaluated three state-of-the-art models, You Only Look Once (YOLO), Single Shot MultiBox Detector (SSD), and Faster R-CNN, with related trade-off factors, i.e., accuracy, execution time, and resource constraints. This time, we make an extension by continually evaluating state-of-the-art and up-to-date detection models, and we also summarize the pros and cons as well as the design of the models rather than just introducing their ideas. Instead of focusing on real-time models, we evaluate state-of-the-art models both in the one-stage approach, which is able to run in real time, such as YOLOv3 and RetinaNet, and in the two-stage approach, which does not meet real-time detection but offers high accuracy, such as Fast RCNN and Faster RCNN. We add these models to our evaluation for several reasons, taking claims from the original works of these models. Particularly, we pick YOLOv3 because this detector is a novel, state-of-the-art model which combines current advanced techniques such as residual blocks, skip connections, and multiscale detection. Similarly, RetinaNet is a detector that proposes an updated loss function to penalize the imbalance of classes in a dataset. Although Faster RCNN is the only model that was evaluated in our previous work, we want to evaluate this model with different backbones to consider how well the backbones work when combined with Faster RCNN. Furthermore, Faster RCNN is an improvement of Fast RCNN, and we still add Fast RCNN to our evaluation because this model works with an external algorithm to generate region proposals on an input image instead of on a feature map like Faster RCNN. Besides, we evaluate these models with different backbones, such as ResNet 50, ResNet 101, ResNet 152, ResNeXT 101, and FPN, on small objects to consider how good these backbones are when combined with the models. We still make our evaluation on 2 datasets, namely, the small object dataset [13] and our dataset filtered from PASCAL VOC 2007 [11], with criteria such as accuracy, processing speed, and resource consumption as well. Moreover, we provide analyses of the design and the way the models work and explore how well the models can cope with multiscale objects. This helps readers form a preference for each model, from which they can choose a suitable model to meet their needs. Therefore, the followings are our contributions:

(i) We made an extension for evaluating deep models in the two main approaches of detection, namely, the one-stage approach and the two-stage approach, such as YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, along with popular backbones such as FPN, ResNet, or ResNeXT.

(ii) We provided not only the disadvantages and advantages of the models relating to accuracy, resource consumption, and speed of processing in the context of small objects, as well as the changes of these factors when an object size is scaled up or down, but also a comparison between one-stage and two-stage methods.

2. Challenges

Overall, there are several challenges that need to be solved in object detection. Object detection itself draws much attention from researchers, but over time the challenges are only partly tackled; particularly, the COCO challenges provide a standard with regard to small and medium detection, and the accuracy of most detectors is still low under this standard. Small object detection is therefore harder for researchers because, apart from the normal challenges of object detection, it owns particular challenges for small objects. Besides, the definition of small objects is not obviously clear. The following presentation makes these points more concrete.

2.1. Small Appearances. Recently, small object detection has been considered an attractive problem in itself because it poses many challenges that are intriguing to researchers. First of all, the possible appearances of small objects far outnumber those of other objects; because of their small size, detectors easily get confused when trying to spot these objects among plenty of other objects located around them, even ones of the same size or appearance. It is arduous to differentiate small objects from the clutter of the background. Furthermore, the pixels available to represent the information of small objects are much fewer than for normal objects, meaning that there are fewer informative representations for detectors to perform their task. Besides, the key features for locating small objects in an image are vulnerable and are even lost progressively when going through many kinds of layers of a deep network, such as convolutional or pooling layers. For example, in VGG16, if the object of interest occupies a 32×32 patch, it will be represented by at most 1 pixel after going through the 5 pooling blocks. As a result, exhaustive searching such as sliding windows [14] or drastically increasing the number of bounding boxes as in selective search [17] is unfeasible for achieving good outputs. Some samples of small objects are shown in Figure 1.
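To make the pixel-budget argument concrete, the toy calculation below (ours, not from the paper) tracks how the spatial footprint of a 32×32 object shrinks across five stride-2 pooling stages, as in VGG16:

```python
# A minimal sketch: how many pixels survive for a small object after
# repeated 2x2 pooling stages (stride 2), as in the VGG16 example above.
def surviving_extent(object_px: int, num_pool_stages: int, stride: int = 2) -> int:
    extent = object_px
    for _ in range(num_pool_stages):
        extent = max(1, extent // stride)  # each pooling halves the spatial extent
    return extent

print(surviving_extent(32, 5))  # -> 1: a 32x32 object is ~1 pixel after 5 poolings
print(surviving_extent(256, 5))  # -> 8: a big object keeps a usable footprint
```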

2.2. Small Object Definitions. The definition problem of small object detection is to clarify how small the scales or sizes of objects are, or how many pixels they occupy in an image. This is arduous and differs if we consider objects in images of high and low resolution. For example, an object occupying a 400×400 patch is assigned as a small object on a 2048×2048 image but is very big on a 500×500 one. This causes a difficulty for researchers when a dataset consists of images with various ranges of resolution. Up till now, there are some definitions of small objects, but they are not clearly unified; the choice depends upon the datasets used for evaluation and the characteristics of the objects of interest. Therefore, to perform the task of detecting small objects, researchers define


different definitions for different datasets instead of only using the size of the bounding boxes containing objects to decide whether the objects are small or not. For example, Zhu et al. [18] mentioned that small objects are objects whose sizes fill up to 20% of an image when releasing their dataset about traffic signs: if a traffic sign is square, it is a small object when the width of its bounding box is less than 20% of the image width and the height of its bounding box is less than 20% of the image height. In [19], Torralba et al. supposed small objects are less than or equal to 32×32 pixels. In the small object dataset [13], objects are small when they have a mean relative overlap (the ratio between the bounding box area and the image area) from 0.08% to 0.58%, respectively 16×16 to 42×42 pixels in a VGA image. In this work, we reuse the above definitions, especially the definitions from [13, 18], as the main references because they are reliable resources and are widely accepted by other researchers.
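For illustration, the sketch below encodes the three cited definitions as predicates; the helper names are ours, and the thresholds are transcribed from the numbers above:

```python
# Hedged sketch of the cited small-object definitions; helper names are ours.
def is_small_zhu(box_w, box_h, img_w, img_h):
    # Zhu et al. [18]: box width/height below 20% of the image width/height
    return box_w < 0.20 * img_w and box_h < 0.20 * img_h

def is_small_torralba(box_w, box_h):
    # Torralba et al. [19]: objects of at most 32x32 pixels
    return box_w <= 32 and box_h <= 32

def is_small_mra(box_w, box_h, img_w, img_h):
    # Small object dataset [13]: relative box area roughly 0.08%-0.58% of the image
    rel = (box_w * box_h) / float(img_w * img_h)
    return 0.0008 <= rel <= 0.0058

print(is_small_mra(16, 16, 640, 480), is_small_mra(42, 42, 640, 480))  # True True
```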

2.3. Datasets and Approaches. There are limited works concentrating on sorts of small objects, and this results in limited experience and knowledge for conducting deep, comprehensive research. The previous approaches just focus on big objects and ignore the existence of small objects. In fact, we do not comprehend how well existing detection approaches perform when dealing with small objects. Hence, in this work, we assess the performance of existing state-of-the-art detectors to draw a general picture of their abilities for small object detection.

In terms of small object detection, there are just a few works regarding the problem of detecting small objects. So far, most of these works are designed to detect only single categories, such as traffic signs [18], vehicles [20–22], or pedestrians [23], and do not involve common or multiclass datasets from the real world. This results in a lack of evaluation of the approaches' ability to detect different kinds of objects and the variations of their shapes as well. Fortunately, Chen et al. [13] present their small object dataset, built by combining the Microsoft COCO [12] and SUN [24] datasets, which consists of common objects such as "mouse," "telephone," "switch," "outlet," "clock," "tissue box," "faucet," "plate," and "jar." Chen also augments the R-CNN algorithm with some modifications to improve the performance of detecting small objects. Following this idea, we conducted a small survey on existing datasets and found that PASCAL VOC has much in common with the COCO and SUN datasets, consisting of small objects of various categories. So we rely on existing and common definitions of small objects to filter objects that meet these definitions and form a dataset including 4 subsets corresponding to 4 different definitions of small objects, so as to objectively consider how different object scales affect detection performance. In addition, there is recently a small object dataset in a challenge called Vision Meets Drones: A Challenge (http://aiskyeye.com), and it is considered challenging because it consists of several small objects, even tiny objects, in images in different contexts and conditions in the wild; however, the views in the images are snapshots from drones which fly above and take pictures with the high-resolution cameras attached to them. Unfortunately, this dataset does not publish annotations for testing, so it is hard to use it for evaluation.

Therefore, in this work, we choose the small object dataset [13] and our filtered dataset for our evaluation, because these datasets contain common objects and the number of images is large, so the evaluations are objective.

3. Deep Models for Object Detection

Recently, with the widespread development of deep learning, it is known that convolutional neural network (CNN) approaches have shown many improvements and achieved good results in various tasks. Therefore, they are commonly applied in well-known works. Most of these works have shown significant improvements in detecting objects filling medium or big parts of an image.

R-CNN [1] is one of the pioneers. The following methods are improved forms of R-CNN, such as [2, 3, 15]. Especially, Faster R-CNN [15] is considered a state-of-the-art approach. Although this sequence of advanced works uses a lot of different and breakthrough ideas, from sliding windows to object proposals, and mostly achieves the best results as state-of-the-art methods on challenging datasets such as COCO, PASCAL VOC, and ILSVRC, their representations take much time to run on an image completely and may reduce the running performance of the detector. As a result, these detectors are difficult to use for detecting objects in real time despite achieving high accuracy; they focus on accuracy and ignore the effects of processing speed. In addition, detecting objects of small sizes in the real world is as important as detecting objects of big or medium sizes, even more necessary than we imagined. Especially in the automotive industry, smart cars, army projects, and smart transportation, data must be promptly and precisely processed to make sure that safety comes first, and in these cases, the recorded data are usually far from our position, so the information of interest is a small thing.

In terms of real-time detection, the one-stage methods, instead of using object proposals to get RoIs before moving to a classifier like two-stage approaches such as Faster R-CNN, use local information to predict objects; examples are YOLO and SSD. Both methods process images in real time, detect objects correctly, and still achieve a high mAP. Nevertheless, these papers just mention that the models can detect small objects and have good results, but they do not show evidence to prove how much, or to what extent, small objects are handled. In this work, we evaluate these models from both approaches to find out their performance and to what extent they are good at detecting small objects. The following are the general ideas of the above-mentioned approaches.

3.1. R-CNN. R-CNN [1] is a novel and simple approach and a pioneering advance, providing more than 30% higher mean average precision (mAP) than the previous works on PASCAL VOC.


The overview of the R-CNN architecture consists of four main phases, which are known as the new advances of this method. Firstly, the R-CNN network resizes an image to 227×227 and takes it as an input. Then, the selective search algorithm [17] is applied to the image and generates 2000 candidate proposed bounding boxes as the warped regions used as the input of the CNN feature network. Through these regions, the network extracts a 4096-dimensional feature vector from each region and then computes the features for each region. Finally, the class-specific linear SVM classifier behind the last layer is used to classify the regions to determine whether there are any objects and what the objects are.

The major key to the success of R-CNN is that the features matter. In R-CNN, the low-level image features (e.g., HOG) are replaced with CNN features, which are arguably more discriminative representations. However, the evaluation of an image is extremely costly and wasteful because R-CNN must apply the convolutional network 2000 times. Besides, resizing the input to the low 227×227 resolution is a problem affecting small objects, which easily deform or even lose information as the resolution moves far from their original sizes. The region proposals overlap, thus leading to the computation of the same features many times, and every region proposal must be stored to disk before performing the feature extraction. In addition, many overlapping bounding boxes result in a drop in mAP if small objects are close to big objects, because there is a bias toward choosing the bounding boxes which contain big objects and ignoring the bounding boxes of small objects.

3.2. Spatial Pyramid Pooling (SPP). The primary ideas of SPP [2] are motivated by limitations of the CNN architecture: the original CNN requires input images of a fixed size (224×224 for AlexNet), so actual use of a raw picture often needs cropping (a fixed-size patch that truncates the original image) or warping (the RoI of an input image must be a fixed-size patch). The fully connected layer needs a fixed-length input, while the convolutional layer can adapt to an arbitrary input size; thus, a bridge is needed as an intermediate layer between the convolutional layer and the fully connected layer, and that is the SPP layer. Particularly, SPP-net firstly finds 2000 candidate region proposals like the R-CNN method and then extracts the feature maps from the entire image. SPP maps each window of the features corresponding to a region proposal to a fixed-length representation regardless of the input size. Finally, 2 fully connected layers are used for classification by SVM.

In short, comparing SPP-net with R-CNN, the detection task is better and up to 100× faster than R-CNN, but training is very slow because of the multistage training steps (fine-tuning of the last layers, SVM, and regressions), and it really takes a lot of disk space to save the feature vectors.

3.3. Fast R-CNN. Fast R-CNN [3] is an advanced method that presents various innovations to improve the time of the training and testing phases and to efficiently classify object proposals while still increasing the accuracy rate by using deep convolutional networks. The architecture of Fast R-CNN is trained end-to-end with a multitask loss. Specifically, the convolutional network takes an image of any size as an input, along with several RoIs. Instead of applying RoIs to the input and warping them to feed into the network at the first step like R-CNN, Fast R-CNN applies these RoIs to a feature map produced after the several convolutional layers of the base network. From each RoI, a fixed-size feature vector is extracted by a pooling layer and mapped to a feature vector by fully connected layers. The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets.
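As a rough illustration of the RoI pooling step described above, the simplified sketch below (ours: max pooling only, a single image, no gradient handling; the real Fast R-CNN layer also covers batching and backpropagation) maps an RoI given in image coordinates onto a conv feature map and pools it into a fixed 7×7 grid:

```python
import numpy as np

def roi_pool(feature, roi, out_size=7, spatial_scale=1.0 / 16):
    """feature: (C, H, W) conv feature map; roi: (x1, y1, x2, y2) in image pixels."""
    x1, y1, x2, y2 = [int(round(c * spatial_scale)) for c in roi]
    region = feature[:, y1:y2 + 1, x1:x2 + 1]          # crop the RoI on the feature map
    C, H, W = region.shape
    pooled = np.zeros((C, out_size, out_size), dtype=feature.dtype)
    ys = np.linspace(0, H, out_size + 1).astype(int)   # bin edges along height
    xs = np.linspace(0, W, out_size + 1).astype(int)   # bin edges along width
    for i in range(out_size):
        for j in range(out_size):
            y_lo, y_hi = ys[i], max(ys[i + 1], ys[i] + 1)  # keep every bin non-empty
            x_lo, x_hi = xs[j], max(xs[j + 1], xs[j] + 1)
            pooled[:, i, j] = region[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return pooled

pooled = roi_pool(np.random.rand(256, 38, 50), roi=(64, 48, 320, 240))
print(pooled.shape)  # -> (256, 7, 7): a fixed-size vector regardless of the RoI size
```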

The most important feature of RoI pooling is sharing computation and memory in the forward and backward passes for the same image. The huge contribution of Fast R-CNN is that it proposes a new training method that fixes the drawbacks of R-CNN and SPP-net while improving their running time and accuracy rate. The advantage is that the mean average precision of detection is higher than that of R-CNN and SPP-net, the training phase is a single stage using a multitask loss and can update all network layers, and no disk storage is required for feature caching.

3.4. Faster R-CNN. Faster R-CNN [15] is an innovative approach improved from Fast R-CNN. Unlike its two predecessors, instead of generating bounding boxes by external algorithms [17] like [1, 3], Faster R-CNN runs its own method, called the region proposal network (RPN), which is trained end-to-end to generate highly qualified region proposals. After gaining deep features from the early convolutional layers, the RPN is taken into account, and windows slide over the feature map to extract features for each region proposal. The RPN is considered a fully convolutional network which simultaneously predicts bounding boxes of objects and objectness scores at each position. The input of the RPN is an image of any size, and it outputs a set of bounding boxes as rectangular object proposals, along with an objectness score for each proposal. Specifically, the RPN takes the image feature map of the fifth convolutional layer (conv5) as an input and applies a 3×3 sliding window on the feature map. Then, the intermediate layer feeds into two different branches: one for the object score (determining whether the region is thing or stuff) and the other for regression (determining how the bounding box should change to become more similar to the ground truth). The RPN improves accuracy and running time and avoids generating an excess of proposal boxes because it reduces the cost by sharing computation on convolutional features. The RPN and Fast R-CNN are merged into a single network by sharing their convolutional features. This combination gives Faster R-CNN leading accuracy but makes its architecture a two-stage network, which reduces the processing speed of this method.

3.5. You Only Look Once. Inheriting the advantages of the previous models introduced earlier, You Only Look Once (YOLO) [4] was considered a state-of-the-art real-time object detector for various categories at that time. YOLO currently has three versions [4–6], which improved substantially with each release.


The detailed analyses of the YOLO approaches, as a premise for applying them in practical applications, are as follows.

YOLOv1 [4], widely known as YOLO, a unified or one-stage network, is a completely novel approach aiming to tackle object detection in real time, proposed by Redmon et al. Instead of performing object detection like previous techniques built on complex pipelines, such as [1, 4], which either use exhaustive sliding windows and then feed their outputs to classifiers at equally spaced locations over the whole image or use region proposals to generate bounding boxes which possibly contain objects and then feed them to convolutional neural networks, YOLO treats object detection as a regression problem, simultaneously predicting the coordinates of bounding boxes and the class probabilities for these boxes. The key idea of YOLO's detection is that it separates images into grid views, which improves both the running time and the accuracy in localizing objects. The goal of YOLO is to deal with two problems, namely, what objects are present and where they are in an image. The summary of YOLO's operation proceeds in three principal steps, simply and straightforwardly: firstly, YOLO takes an input image resized to a fixed size; then it runs a single convolutional network as a unified network on the image; and ultimately it thresholds the resulting detections by the confidence score of the model. YOLO runs at 45 fps on a GPU, and the smaller Fast YOLO reaches 150 fps. This processing can run streaming video in real time. Although the design of the YOLO architecture affords end-to-end training and real-time detection, it still keeps a high average precision.

The network divides the input image into an S×S grid, where S×S is equal to the width and height of the tensor which presents the final prediction. If the center of an object falls in a grid cell, that grid cell takes responsibility for detecting the object. Moreover, each grid cell is simultaneously responsible for predicting bounding boxes and confidence scores, which represent how confident the model is that the bounding box contains an object, as well as how accurate it considers the predicted bounding box to be.
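The cell-responsibility rule can be written down in a few lines; this sketch (ours) simply maps an object's center to its grid cell for S = 7:

```python
# Sketch: which cell of the S x S grid is "responsible" for an object,
# given the object's center in image coordinates.
def responsible_cell(cx, cy, img_w, img_h, S=7):
    col = min(int(cx / img_w * S), S - 1)  # clamp in case the center sits on the edge
    row = min(int(cy / img_h * S), S - 1)
    return row, col

print(responsible_cell(cx=300, cy=200, img_w=448, img_h=448))  # -> (3, 4)
```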

The drawback of YOLO is that it lags behind state-of-the-art detection systems in accuracy but beats them in running time. It makes less than half the number of background errors compared to Fast R-CNN. YOLO is highly generalizable, so it can quickly identify objects in an image, but it usually struggles to precisely localize some objects, especially small ones. Therefore, the author introduced YOLOv2 to improve the performance and fix the drawbacks of YOLO.

YOLOv2 [5] has a number of improvements over YOLOv1. Like the original, YOLOv2 runs on different fixed sizes of an input image, but it introduced several new training methods for object detection and classification, such as batch normalization, multiscale training with higher resolutions of input images, predicting the final detection on a higher spatial output, and using good default bounding boxes instead of fully connected layers.

However, this offers a trade-off between speed and accuracy. The details of the mAP improvements on PASCAL VOC 2007 are shown in Figure 2.

These novel improvements allow YOLOv2 to train on multiclass datasets like COCO or ImageNet. In addition, the detector was trained to detect over 9000 different object classes. YOLOv2 uses a network architecture customized from the original network. YOLOv2 mainly concentrates on improving recall and localization while still keeping high classification accuracy in comparison with state-of-the-art detectors; the original YOLO makes significantly more localization errors but is far less likely to predict false detections in places where nothing exists. Although YOLOv2 brings accuracy improvements, it does not work well on small objects because the input downsampling results in a low-resolution feature map, which is used for the final prediction. To solve these problems, the author recently introduced YOLOv3, with significant improvements in object detection, especially small object detection. Generally, a variety of the latest networks tend to go deeper and yield good performance on their tasks with deep features learned from numerous layers.

YOLOv3 [6] is one of these approaches: instead of using Darknet-19 like the two older versions [4, 5], YOLOv3 develops a deeper network with 53 layers, called Darknet-53, and combines the network with state-of-the-art techniques such as residual blocks, skip connections, and upsampling. Residual blocks and skip connections are very popular in ResNet and related approaches, and upsampling has recently also improved the recall, precision, and IoU metrics for object detection [25]. For the detection task, 53 more layers are stacked onto it, giving a 106-layer fully convolutional underlying architecture for YOLOv3. This is the reason behind the slowness of YOLOv3 compared to YOLOv2.

Second, YOLOv3 enables the detector to predict objects at three different outputs with three different scales rather than just one prediction at the last layer of the network, similar to its competitor SSD [26], which much improved performance on low-resolution images. This is useful for picking up diverse outcomes in order to improve detection performance. The final output is created by applying a 1×1 kernel on a feature map. Particularly, the detection is done by applying 1×1 detection kernels on feature maps of three different sizes at three different places in the network, partly similar to feature pyramid networks (FPNs) [27].

Third, YOLOv3 still uses K-means to generate anchor boxes, but instead of applying 5 anchor boxes at the last detection, YOLOv3 generates 9 anchor boxes and separates them across the 3 locations. Each location applies 3 anchor boxes; hence, there are more bounding boxes per image. For example, for an image of 416×416, YOLOv2 predicts 13×13×5 = 845 boxes, while for YOLOv3 the number of boxes is 10,647, implying that YOLOv3 predicts more than 10 times the number of boxes of YOLOv2.
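The box-count arithmetic above can be checked directly; the small script below (ours) reproduces the 845 and 10,647 figures for a 416×416 input:

```python
# Reproducing the box counts above for a 416x416 input.
def yolov2_boxes(grid=13, anchors=5):
    return grid * grid * anchors              # one 13x13 output, 5 anchors per cell

def yolov3_boxes(grids=(13, 26, 52), anchors_per_scale=3):
    return sum(g * g * anchors_per_scale for g in grids)  # three output scales

print(yolov2_boxes())                          # -> 845
print(yolov3_boxes())                          # -> 10647
print(yolov3_boxes() / yolov2_boxes())         # -> ~12.6, i.e., over 10x more boxes
```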

Fourth, YOLOv3 also changes the way the cost function is calculated. If an anchor overlaps a ground truth more than any other bounding box does, the corresponding objectness score should be 1. Other anchor boxes with an overlap greater than a predefined threshold (0.5) incur no cost. Each ground truth is associated with only one boundary box. If a bounding box is not assigned, it incurs no classification


and localization loss, just confidence loss on objectness. The loss function in the previous YOLO looks like

\[
\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}\left[\left(x_i-\hat{x}_i\right)^2+\left(y_i-\hat{y}_i\right)^2\right] \\
&\quad+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&\quad+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}\left(C_i-\hat{C}_i\right)^2 \\
&\quad+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{\text{noobj}}_{ij}\left(C_i-\hat{C}_i\right)^2 \\
&\quad+ \sum_{i=0}^{S^2} \mathbb{1}^{\text{obj}}_{i}\sum_{c\in\text{classes}}\left(p_i(c)-\hat{p}_i(c)\right)^2. \qquad (1)
\end{aligned}
\]

Currently, instead of using mean square error in calculating the classification loss in the last three terms, YOLOv3 uses binary cross-entropy loss for each label. In other words, YOLOv3 makes its prediction of an objectness score and class prediction for each bounding box using logistic regression.

There is no more softmax function for class prediction. The reason is that the most commonly used classifiers assume that predicted labels are independent and mutually exclusive, implying that if an object belongs to one class, then it cannot belong to another; this is only true if the output prediction really is mutually exclusive. However, when a dataset has multilabel classes, there are labels which are not mutually exclusive, such as pedestrian and person. In that case, the sum of the probability scores may be greater than 1 if the classifier is a softmax, so YOLOv3 swaps the classifier for class prediction from the softmax function to independent logistic classifiers to calculate the likeliness of the input belonging to a specific label.
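The practical difference between the two classifiers is easy to see numerically; in the sketch below (ours), the sigmoid scores can be high for both "person" and "pedestrian" at once, while the softmax scores are forced to compete:

```python
import numpy as np

# Sketch of the change described above: independent logistic (sigmoid) classifiers
# allow multi-label outputs, while softmax makes class scores compete and sum to 1.
logits = np.array([2.0, 1.8, -1.0])  # e.g., person, pedestrian, car

softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1.0 / (1.0 + np.exp(-logits))

print(softmax.round(3))  # [0.535 0.438 0.027] -- sums to 1, labels are exclusive
print(sigmoid.round(3))  # [0.881 0.858 0.269] -- each label scored independently
```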

3.6. Single Shot MultiBox Detector. Single Shot MultiBox Detector (SSD) [26] is a single-shot detector using a single, one-stage deep neural network designed for real-time object detection. By comparison, the state-of-the-art method in two-stage processing, Faster RCNN, moves toward real-time detection by using its proposal network to generate object proposals, instead of an external method, and utilizing those to classify objects, but the whole process still runs at 7 FPS.

Figure 5: The visualization of detectors with the strongest backbones on the subsets of PASCAL, VOC_MRA_058, VOC_MRA_10, VOC_MRA_20, and VOC_WH_20, in order, respectively: (a) YOLO Darknet-53; (b) Faster RCNN ResNeXT-101-64×4d-FPN; (c) RetinaNet ResNeXT-101-64×4d-FPN; (d) Fast RCNN ResNeXT-101-64×4d-FPN.


SSD enhances the running speed over previous detectors by eliminating the need for the proposal network. This causes a small drop in mAP, which SSD compensates for by applying some improvements, including multiscale features and default boxes. These improvements allow SSD to match the accuracy of Faster RCNN using lower resolution images, which further speeds up SSD's processing. For a 300×300 input image, the best version of SSD gets 77.2% mAP at 46 FPS, better than Faster R-CNN's 73.2% and a little below the best version of YOLOv2 (544×544 input image, 78.6% mAP at 40 FPS), on VOC 2007 with an Nvidia Titan X.

Similarly, SSD consists of 2 parts, namely, extraction of feature maps and use of convolution filters to detect objects. SSD uses VGG16 as a base network to extract the feature maps. Then, it combines 6 convolutional layers to make predictions. Each prediction contains a bounding box and N + 1 scores, where N is the number of classes and one extra class stands for no object. Instead of using a region proposal network to generate boxes and feed them to a classifier for computing the object location and class scores, SSD simply uses small convolution filters. After the VGG16 base network extracts features from the feature maps, SSD applies 3×3 convolution filters for each cell to predict objects. Each filter gives an output including N + 1 scores for each class and 4 attributes for one boundary box.

SSD also differs from previous approaches of the same period in that it makes predictions on multiscale feature maps independently for detection, rather than on just one last layer. The CNN network spatially reduces the dimension of the image gradually, leading to a decrease in the resolution of the feature maps. As mentioned, SSD uses a lower resolution input to detect objects; hence, early layers are used to detect small objects, and lower resolution layers detect progressively larger scale objects. Besides, SSD applies different scales of default boxes to different layers, as intuitively visualized in Figure 3. Particularly, only the blue default box on the 8×8 feature map fits the ground truth of the cat, and only the red one on the 4×4 feature map matches the ground truth of the dog.
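For reference, the scale schedule the SSD paper assigns to its m prediction layers can be computed as below; the exact set of layers varies by implementation, so treat this as a sketch:

```python
# Default-box scale per prediction layer, following the linear schedule from the
# SSD paper: s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1).
def ssd_scales(m=6, s_min=0.2, s_max=0.9):
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

print([round(s, 2) for s in ssd_scales()])
# [0.2, 0.34, 0.48, 0.62, 0.76, 0.9] -- early (high-res) layers get the small boxes
```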

Although SSD shows significant improvements in object detection by integrating the above parts, it is not good at detecting small objects, which can be improved by adding deconvolution layers with skip connections to introduce additional large-scale context [28]. Generally, SSD outperforms Faster RCNN, a state-of-the-art approach, in accuracy on PASCAL VOC and COCO while running at real-time detection speeds.

3.7. CNN Drawbacks. Most CNN models are currently designed as a hierarchy of various layers, such as convolutional and pooling layers, arranged in a certain order, not only in small networks but also in multilayer and state-of-the-art networks. Along with these layers, fully connected layers, known as FC layers, are added behind them. The block consisting of the FC layers and the previous layers is designated the feature extractor, and it outputs the key features of the objects of interest as input for the classifiers coming behind. However, going deeply through many kinds of layers is not good for small object detection, because in this task the objects of interest have small sizes and appearances. Besides, small objects, unlike normal or big objects, which are less affected by resizing the image or passing through lots of different layers, are very vulnerable to changes in image size. When an image passes a convolutional layer, the size of the image is decreased by the receptive fields that slide over the image to extract useful features. This would not affect small objects if there were just a few layers, but a CNN network has many layers like this, which is very hard on small objects. Still, if small objects just went through convolutional layers, it would not be worth mentioning; however, small objects, which have only a little informative presence, also have to pass pooling layers, which help avoid overfitting and reduce computational cost by decreasing the number of parameters. To do this, these layers use fixed sliding windows that apply a fixed, predefined aggregation, such as the maximum or average of the values. For these reasons, GAN is an approach that may complement the CNN approach because of its advantages: we can take advantage of the way this approach generates data to overcome the limited data of small objects in the training phase. Although images still have to pass layers such as convolutional and pooling layers in this context, the network just has fewer layers compared to others. Bai et al. [29] have proposed applying MTGAN to detect small objects by taking cropped inputs from a processing step made by baseline detectors such as Faster RCNN [15] or Mask RCNN [9].

Because of the mentioned reasons, and following the survey [30], Liu et al. have presented numerous works of survey and evaluation, but none of them deal with small objects. Therefore, in this work, we assess popular state-of-the-art models to find out their pros and cons. Particularly, we evaluate 4 deep models, namely, YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, with several base networks, for small object detection with different scales of objects. Among these models, YOLOv3 and RetinaNet belong to the one-stage approach, while Fast RCNN and Faster RCNN are in the two-stage approach. We choose these models because YOLOv3 is the model combining state-of-the-art techniques and RetinaNet is the model with a new loss function which penalizes the imbalance of classes in a dataset; besides, we choose RetinaNet to make comparisons between models in the same approach. Similarly, Fast RCNN and Faster RCNN are both in the same approach and have nearly the same pipeline in object detection. The difference is that Fast RCNN utilizes an external proposal method to generate object proposals based on input images, whereas Faster RCNN proposes its own network to generate object proposals on feature maps, which makes Faster RCNN easy to train end-to-end and work better.

4. Experimental Evaluation

In this section, we present the information of our experimental setting and the datasets which we use for evaluation.


4.1. Experimental Setting. We continue to train and evaluate various object detectors on the two datasets, PASCAL VOC [11] and a newly generated dataset [16]. The approaches evaluated this time consist of Faster RCNN [15], YOLOv3 [6], and RetinaNet [7] with different backbones. Except for YOLOv3, the others are trained and evaluated with the Detectron Python code.

Currently, the original datasets commonly used in object detection are PASCAL VOC [11] and COCO [12]. Both datasets are constructed mostly of large objects or other kinds of objects whose size fills a big part of the image; these two datasets are not suitable for small object detection. In addition, there is another dataset which is large-scale and includes a lot of classes for small object detection, collected by drones and named the VisDrone dataset [31]. However, it does not publish the labels of the test set for evaluation, and the views of its images are top-down, which is not our case. As a result, in order to evaluate the detection performance of the models, we use a dataset which was published in [13]. This dataset is called the small object dataset, and it is a combination of the COCO [12] and SUN [24] datasets. There are 10 classes in the small object dataset, including mouse, telephone, switch, outlet, clock, toilet paper (t. paper), tissue box (t. box), faucet, plate, and jar. The whole dataset consists of 4925 images in total, with 3296 images for training and 1629 images for testing. The mouse class owns the largest number of instances, 2137 instances in 1739 images, and the tissue box class has the smallest number, 103 instances in 100 images. Apart from the small object dataset, we also filter subsets from PASCAL VOC 2007 following the standard definitions. PASCAL VOC has 20 classes, but for small object detection there are fewer classes under the strict definitions of small objects. Table 1 lists the number of small objects and the images containing them for the subsets of the dataset.

We trained all models on the small object dataset with the same parameters. Particularly, in the training phase, we trained the models for 70k iterations with the parameters, including momentum, decay, gamma, learning rate, batch size, step size, and training days, given in Table 2. At first, we attempted to start the models with a higher learning rate, 10^-2, but the models diverged, leading to the loss value being NaN or Inf after the first 100 iterations. Then, we tried a lower learning rate, 10^-3, for the first 100 iterations, rising to 10^-2, to check whether the models could converge when starting at a lower learning rate; however, nothing changed. We also saw that the models converged quickly during the first 10k iterations with 10^-3 and then progressively slowed down after 20k. Therefore, we decided to start the training with a learning rate of 10^-3 and decrease it to 10^-4 and 10^-5 at 25k and 35k iterations, respectively. This setting showed that the loss value was stable from 40k, but we ran the training up to 70k to observe how the loss value changed and saw that it did not change much after 40k iterations. We evaluated the models from 30k to 70k iterations, and generally, the performance of the models was not stable after 40k iterations. For this reason, we picked the weights for evaluation at 30k and 40k iterations: at 30k iterations YOLO achieves its best results, and the others get their best at 40k iterations. In the case of the subsets of PASCAL VOC 2007, we combine the train and validation sets from PASCAL VOC 2007 and 2012 to form a training set; PASCAL VOC 2012 works as a data augmentation set for PASCAL VOC 2007. We use this combined training set to train all models and test them on the subsets. All models are trained with the same parameters. First, due to memory limitations, we rescale all images to the same size, with the shortest side 600 and the longest side 1000, as in [15].
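The resulting schedule is a standard step decay; a minimal sketch (ours, with the values from the text) is:

```python
# Step learning-rate decay: 1e-3 at the start, 1e-4 from 25k, 1e-5 from 35k.
def learning_rate(iteration, base_lr=1e-3, steps=(25_000, 35_000), gamma=0.1):
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= gamma  # multiply by gamma at every passed step boundary
    return lr

for it in (0, 25_000, 35_000, 60_000):
    print(it, learning_rate(it))  # 0.001, 0.0001, 1e-05, 1e-05
```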

For YOLOv3, we ran the K-means clustering algorithm to initialize 9 suitable default bounding boxes for the training and testing phases of our selected datasets, and we changed the anchors value accordingly. The following are the 9 anchors for the small object dataset after running the K-means algorithm: [10.3459, 14.4216], [26.2937, 19.0947], [21.4024, 36.3180], [47.9317, 29.1237], [40.4932, 63.7489], [83.6447, 51.3203], [72.2167, 119.9181], [172.7416, 117.0773], and [124.6597, 252.8465].
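A minimal sketch of this clustering, using the usual 1 − IoU distance over (width, height) pairs as in YOLOv2, is given below; the seeding and stopping rule are our own simplifications, so the resulting anchors will differ somewhat from the ones listed above:

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (w, h) pairs, with boxes and anchors aligned at a common corner."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None] +
             (anchors[:, 0] * anchors[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = iou_wh(boxes, anchors).argmax(axis=1)  # nearest = highest IoU
        for j in range(k):
            members = boxes[assign == j]
            if len(members):
                anchors[j] = members.mean(axis=0)       # recenter each cluster
    return anchors[np.argsort(anchors.prod(axis=1))]    # sort by box area

widths_heights = np.random.default_rng(1).uniform(8, 300, size=(500, 2))
print(kmeans_anchors(widths_heights).round(2))  # 9 (w, h) anchors, small to large
```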

For Faster R-CNN, to compare fairly with the prior work and deploy on different backbones, we directly reuse the anchor scales and aspect ratios from the paper [13], namely, anchor scales of 16×16, 40×40, and 100×100 pixels and aspect ratios of 0.5, 1, and 2, instead of having to cluster a set of default bounding boxes as for YOLOv3. Similarly, for RetinaNet, we keep the default training settings, namely, loss gamma 2.0, loss alpha 0.25, anchor scale 4, and 3 scales per octave, following the authors, since this configuration uses the optimized values.
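These scales and ratios imply 9 base anchor shapes; the sketch below (ours) enumerates them, keeping the area near scale² while varying the aspect ratio. The real RPN additionally replicates these base anchors at every sliding-window position:

```python
from math import sqrt

# The 9 base anchor shapes implied by the scales and ratios reused from [13].
def base_anchors(scales=(16, 40, 100), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for s in scales:
        for r in ratios:
            w, h = s * sqrt(r), s / sqrt(r)  # area stays ~ s^2 while the shape varies
            anchors.append((round(w, 1), round(h, 1)))
    return anchors

print(base_anchors())  # 9 (w, h) pairs, from (11.3, 22.6) up to (141.4, 70.7)
```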

4.2. Our Newly Generated Dataset. This time, to have an objective comparison, we also use our newly generated dataset, whose information is shown in Table 1. We use it to study the effects of object sizes on factors including the models, time of processing, accuracy, and resource consumption. The dataset consists of 4 subsets filtered from PASCAL VOC 2007, namely,

Figure 2: mAP of YOLOv2 on VOC 2007 as each part is added (batch norm, hi-res classifier, convolutional anchor boxes, new network, dimension priors, location prediction, passthrough, multiscale, hi-res detector): 63.4, 65.8, 69.5, 69.2, 69.6, 74.4, 75.4, 76.8, 78.6 [5].


VOC_WH_20, VOC_MRA_058, VOC_MRA_10, and VOC_MRA_20; detailed information is provided as follows:

(i) VOC_WH_20 contains objects whose width and height are less than 20% of the image's width and height. This subset has two classes fewer than PASCAL VOC 2007, namely, dining table and sofa, because of the constraint of the definition.

(ii) VOC_MRA_058, VOC_MRA_10, and VOC_MRA_20 comprise objects occupying a maximum mean relative area of the original image under 0.58%, 1.0%, and 2.0%, respectively. Two of them have the same classes as PASCAL VOC 2007; the exception is VOC_MRA_058, which has four classes fewer, namely, dining table, dog, sofa, and train. A sketch of this filtering is given after the list.
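A hedged sketch of this filtering logic, with thresholds taken from the definitions above and helper names of our own choosing:

```python
# Which of the four subsets an annotated object would fall into (our sketch).
def subset_memberships(box_w, box_h, img_w, img_h):
    rel_area = (box_w * box_h) / float(img_w * img_h)
    subsets = []
    if box_w < 0.20 * img_w and box_h < 0.20 * img_h:
        subsets.append("VOC_WH_20")       # width/height rule
    if rel_area <= 0.0058:
        subsets.append("VOC_MRA_058")     # relative area <= 0.58%
    if rel_area <= 0.010:
        subsets.append("VOC_MRA_10")      # relative area <= 1.0%
    if rel_area <= 0.020:
        subsets.append("VOC_MRA_20")      # relative area <= 2.0%
    return subsets

print(subset_memberships(50, 40, 500, 375))  # ['VOC_WH_20', 'VOC_MRA_20'] (~1.1% area)
```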

5. Results and Analyses

In this section, we show the results achieved through the experimental phase. All models mentioned in this section, except for models cited from other papers, are trained in the same environment with 1 GPU: Ubuntu 16.04.4 LTS, Intel(R) Xeon(R) Gold 6152 CPU @ 2.10 GHz, and a Tesla P100 GPU. In addition to the comparative accuracy, other comparisons are also provided to make our assessment objective and clear.

5.1. Accuracy

5.1.1. Small Object Dataset. Following the detection results in Table 3, the methods belonging to the two-stage approach outperform the one-stage ones by about 8–10%. Specifically, Faster RCNN with the ResNeXT-101-64×4d-FPN backbone achieved the top mAP among the two-stage approaches, and of the whole table as well, 41.2%. In comparison, the top among the one-stage approaches, YOLOv3 608×608 with Darknet-53, obtained 33.1%. Following [32], methods based on region proposals, such as Faster RCNN, are better than methods based on regression or classification, such as YOLO and SSD; this holds once again in the context of the small object dataset.

Concerning the methods in each approach: firstly, among the two-stage approaches, Faster RCNN, an improvement of Fast RCNN, is only about 1–2% better than Fast RCNN, and only for the ResNeXT backbones; it is equal to Fast RCNN for the rest. The difference here is not too large, meaning that the performance of an external region proposal method like selective search combined with RoI pooling is as good as an internal region proposal like the RPN with RoI align in this case. Besides, compared to R-CNN, we perceive that there is a boost of 8–10% when RoI pooling or RoI align is added: R-CNN, which uses region proposals from selective search, feeds them into the network, and directly computes features from fc (fully connected) layers, only reaches 23.5% with AlexNet and 24.8% with VGG16 combined with proposals from the RPN. However, Fast RCNN and Faster RCNN, with their two kinds of RoI layers, are much better: Fast RCNN achieves accuracy in a range of 31.7% to 39.6% depending on the backbone, and similarly, Faster RCNN gets 30.1% to 41.2%. Secondly, among the one-stage approaches, YOLO outperforms SSD and RetinaNet: YOLO gets the highest outcome, 33.1%, while SSD and RetinaNet get 11.32% and 30%, respectively. YOLO and SSD are considered state-of-the-art methods in speed at the cost of accuracy. However, there is a large accuracy gap between YOLO and SSD; the difference is that SSD adds multiple convolutional layers behind the backbone, each layer with its own ability, instead of using 2 fully connected layers like YOLO. Although RetinaNet is classed as a one-stage method, it cannot run in real time. RetinaNet is the method proposed to deal with the imbalance between foreground and background via the focal loss; therefore, RetinaNet obtains a higher accuracy in comparison with the others, except for YOLOv3 (Darknet-53).
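For completeness, here is a small sketch of the focal loss with the settings used here (gamma 2.0, alpha 0.25); it shows how easy background examples are down-weighted so the scarce foreground is not drowned out. The helper is our simplification of the RetinaNet loss:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """p: predicted foreground probability; y: 1 for foreground, 0 for background."""
    p_t = np.where(y == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(np.clip(p_t, 1e-9, 1.0))

# An easy background (p = 0.01) contributes ~0; a hard one (p = 0.9) dominates.
print(focal_loss(np.array([0.01, 0.9]), np.array([0, 0])).round(4))  # [0. 1.3989]
```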

When it comes to the backbones, we realized that Darknet-53 is the best among the one-stage and real-time methods, and it is even far better than ResNet-50, although it has a similar number of layers. In contrast, ResNeXT combined with FPN is the most powerful one in both one-stage and two-stage methods if we only consider accuracy.

Table 1: The information of the subsets.

Subsets | Classes | Images | Instances
VOC_MRA_058 | 16 | 329 | 529
VOC_MRA_10 | 20 | 2231 | 5893
VOC_MRA_20 | 20 | 2970 | 7867
VOC_WH_20 | 18 | 1070 | 2313

Table 2: The parameters of the models.

Method | Momentum | Decay | Gamma | Learning_rate | Batch_size | Training_days | Stepsize
YOLOv2 [16] | 0.9 | 0.0005 | — | 0.001 | 8 | 5 | 25000
YOLOv3 | 0.9 | 0.0005 | — | 0.001 | 32 | 3–4 | 25000
SSD300 [16] | 0.9 | 0.0005 | 0.1 | 0.000004 | 12 | 9 | 40000, 80000
SSD512 [16] | 0.9 | 0.0005 | 0.1 | 0.000004 | 12 | 12 | 100000, 120000
RetinaNet | 0.9 | 0.0005 | 0.1 | 0.001 | 64 | 4–12 h | 25000, 35000
Fast RCNN | 0.9 | 0.0005 | 0.1 | 0.001 | 64 | 4–12 h | 25000, 35000
Faster RCNN | 0.9 | 0.0005 | 0.1 | 0.001 | 64 | 4–12 h | 25000, 35000


Overall, there is an increase of about 1–3% when changing from a simple backbone to a complex one of each type. For example, when switching from the original ResNet to ResNet-FPN, the accuracy is boosted by 2 to 3%. It is clear that leveraging the advantages of the multiscale features of FPN is a common way to improve detection and tackle the scale imbalance of input images and the bounding boxes of different objects. Similarly, when we switch ResNeXT-101-32×8d-FPN to ResNeXT-101-64×4d-FPN, the accuracy changes from 40.5% to 41.2% for Faster RCNN and from 38.7% to 39.6% for Fast RCNN. However, when comparing ResNet-50-FPN against ResNet-101-FPN, the growth only happens for Fast RCNN, from 33.3% to 35.5%; there is a small decrease, 0.1%, for Faster RCNN. This reduction also happens with RetinaNet: while the simpler backbone ResNeXT-101-32×8d-FPN gets 30%, ResNeXT-101-64×4d-FPN just gets 25.1%. This means that very deep backbones do not guarantee an increase in accuracy; the reason is that a deeper network needs more parameters to learn, so a large amount of data must be fed into the network to train and update its parameters, but in this case the data of the small object dataset are not abundant enough to fit the very deep networks, hence increasing the chances of overfitting. Besides, the features coming from the early layers of the original ResNet are not well generalized, because when they are combined with FPN, the accuracy improves by about 2–3%. When YOLO switches from Darknet-19 to Darknet-53, it really boosts the accuracy: the best Darknet-19 configuration (800×800) just gets 24.02%, whereas YOLO 608×608 with Darknet-53 gets 33.1%. The explanation is that YOLOv3 with Darknet-53 has several improvements over Darknet-19: YOLOv3 has 3 scale locations to predict objects, one of them specialized in small objects, instead of only one as in Darknet-19, and it also integrates cutting-edge advances such as residual blocks and shortcut connections. The reduction in accuracy happens again with YOLO when switching from ResNet-101 to ResNet-152, by about 1–2%. Among these methods, YOLO and SSD are the only ones which allow multiple input sizes. The higher the resolution of the input images, the higher the accuracy the method receives; the reason is that a higher resolution image provides more pixels to describe the visual information of small objects. However, if the resolution is far from the original size of the images, it results in a decrease in accuracy. For example, YOLO 1024×1024 with Darknet-19 gets a lower accuracy than the 800×800 resolution. In addition, we tried increasing the resolution of Darknet-53 from 608 to 1024, and the mAP decreases when the resolution is over 608×608. Therefore, the effect of image size is clear for models like SSD and YOLO. Generally, all comparative results of mAP on this dataset are dominated by the classes that are very great in numbers, and this is caused by the data imbalance between the number of images and the instances in these images. For example, according to the statistics in [13], mouse is a major class significantly contributing to the mAP in Table 3, with the highest number of instances and images as well; however, tissue box has the least contribution, with the lowest AP, originally affected by the amount of data.

Furthermore, the imbalanced data lead the models to tend to detect frequent objects, implying that the models will misunderstand objects having a nearly similar appearance to the dominating class as objects of interest rather than less frequent objects. As a result, false positives will increase because of these problems. Figure 4 illustrates the detection with the strongest backbones. Following this visualization, the domination of classes such as mouse or faucet results in misdetections in areas which have a similar appearance to them. This misunderstanding tends to affect the weaker backbones in the comparison more, and a one-stage method like YOLO, which primarily aims for speed, has more misdetections than the two-stage methods. A reason for these problems is the difference in the way the deep networks are trained [33]: one-stage methods such as YOLO use a soft sampling method that uses the whole dataset to update the parameters, rather than choosing only certain samples from the training data, whereas two-stage methods such as the RCNN family tend to employ hard sampling methods that randomly sample a certain number of positive and negative bounding boxes to train their networks.

5.1.2. Subsets of PASCAL. With 4 subsets of 4 different scales of objects in images, we want to find out how much the scales impact the models. The whole results are shown in Table 4. We separate the results into 2 groups, the one-stage and two-stage approaches, and Figure 5 is a visualization of the strongest backbones of each method on the subsets.

In the case of different scales, as in our subsets, there is a difference between the one-stage and two-stage approaches. Here, methods from the one-stage approach perform better than two-stage ones at most scales. This is the exact opposite of the small object dataset. Specifically, the two-stage methods are totally better than the one-stage ones restricted to real-time inputs and just a bit better, by about 10–20%, than the non-real-time models on VOC_WH_20, with the same result for smaller objects in VOC_MRA_058 and VOC_MRA_10. However, for bigger objects in VOC_MRA_20, the one-stage methods have significantly better outcomes than the two-stage ones. In addition, only Faster RCNN has good performance in most cases when compared to the one-stage methods; Fast RCNN is only good at big objects in VOC_MRA_20 and fails to detect smaller objects well.

In the one-stage approach, among the methods which allow multiple input sizes, like YOLO and SSD, there are 2 kinds, namely, ones that can run in real time and others that cannot, when the resolution is over 640 or 512 for YOLO and SSD, respectively. For the real-time ones, YOLO outperforms SSD at all scales of objects. Specifically, YOLOv2 with Darknet-19 is better than SSD by about 2.6% for objects in VOC_MRA_058 and VOC_MRA_10 and by 4–15% for larger objects in VOC_MRA_20 and VOC_WH_20. YOLOv3 with Darknet-53 gets results about 3–5% higher in comparison with YOLOv2; hence, YOLOv3 also beats SSD. However, if we consider non-real-time input images, SSD is greater than YOLO for objects in VOC_MRA_10. RetinaNet, the method in the one-stage approach that cannot run in real time, performs on par with the non-real-time configurations of YOLO


and better than SSD. RetinaNet is more stable than SSD and YOLO when the scales change; the bigger the objects, the greater the stability. For example, the change is large, about 33%, when the scale increases from objects in VOC_MRA_058 to ones in VOC_MRA_10 and VOC_MRA_20; however, this change is not much, about 10%, for bigger objects, in comparison with YOLO's 15–25%. In the case of YOLO, this remarkable increase in accuracy when objects are larger is obviously good for a model. The behavior of SSD resembles that of RetinaNet.

Concerning the resolutions in YOLO and SSD, we see that increasing the image resolution generally pushes the accuracy up. YOLOv2 with Darknet-19, YOLOv3 with Darknet-53, and SSD all show an increase in accuracy at larger resolutions, except for YOLOv2 with objects belonging to VOC_MRA_10 and VOC_MRA_20 when the image is over 800. In addition, YOLOv2 fluctuates on objects in VOC_WH_20. As mentioned in our previous work, YOLO is better than SSD on objects filling less than 10% of the image; in this case, however, YOLOv3 is good at all scales of objects. This is because YOLOv3 has 3 detection locations coming with more ratios of default boxes, and combining the results from the 3 locations leads to a significant outcome.

When we switch to the two-stage approaches, Faster RCNN shows a significant improvement at most scales compared to Fast RCNN, except for objects in VOC_MRA_20, where they have the same accuracy. This shows that if objects are completely separated into different scales, RoI pooling does not work well with smaller objects and with the ones in VOC_WH_20.

Table 3: Comparative results on the small object dataset.

Method | Backbone | Clock | Faucet | Jar | Mouse | Outlet | Plate | Switch | Tel. | t. box | t. paper | mAP
YOLO 416 [16] | Darknet-19 | 22.8 | 30.8 | 4 | 52 | 20.4 | 13.1 | 13 | 6.1 | 0 | 35.3 | 19.39
YOLO 448 [16] | Darknet-19 | 23 | 36.9 | 9 | 52.5 | 18.4 | 13.6 | 17.5 | 4.2 | 0 | 34.3 | 20.13
YOLO 480 [16] | Darknet-19 | 34.2 | 37.3 | 9.1 | 53.3 | 21.4 | 13.6 | 15.8 | 9.1 | 9.1 | 34.2 | 23.71
YOLO 512 [16] | Darknet-19 | 23.1 | 36.6 | 6.1 | 59.8 | 24.6 | 14.2 | 15.7 | 9.1 | 4.5 | 32.4 | 22.61
YOLO 544 [16] | Darknet-19 | 23.4 | 37.2 | 9.1 | 60.1 | 27.2 | 13.4 | 19.9 | 9.1 | 4.5 | 34.5 | 23.84
YOLO 640 [16] | Darknet-19 | 20.2 | 36.2 | 3.2 | 59.8 | 27.8 | 11.7 | 18.1 | 8.2 | 4.5 | 35.6 | 22.53
YOLO 800 [16] | Darknet-19 | 27.6 | 36 | 2.3 | 60.2 | 32.8 | 13.1 | 23.3 | 9.1 | 9.1 | 26.7 | 24.02
YOLO 1024 [16] | Darknet-19 | 21.7 | 29.3 | 1.4 | 58.3 | 26.4 | 11.8 | 17.5 | 9.1 | 9.1 | 15.7 | 20.03
YOLO 320 | Darknet-53 | 26.22 | 38.38 | 4.55 | 56.46 | 36.42 | 13.34 | 24.8 | 10.65 | 4.55 | 42.96 | 25.83
YOLO 416 | Darknet-53 | 28.47 | 47.15 | 10.83 | 60.49 | 43.15 | 15.87 | 30.73 | 15.15 | 2.62 | 48.3 | 30.28
YOLO 608 | Darknet-53 | 29.98 | 47.89 | 10.76 | 65.88 | 48.02 | 18.09 | 31.22 | 14.62 | 17.99 | 46.56 | 33.1
YOLO 320 | ResNet-50 | 19.57 | 25.73 | 0.67 | 45.17 | 14.37 | 9.38 | 13.84 | 9.09 | 9.09 | 23.7 | 17.06
YOLO 416 | ResNet-50 | 23.78 | 36.65 | 0.4 | 54.23 | 18.37 | 13.75 | 19.78 | 9.84 | 9.42 | 35.68 | 22.19
YOLO 608 | ResNet-50 | 26.92 | 40.65 | 1.77 | 61.86 | 29.18 | 15.04 | 20.24 | 10.09 | 13.29 | 36.01 | 25.5
YOLO 320 | ResNet-101 | 20.52 | 27.9 | 0.57 | 44.68 | 16.98 | 13.05 | 13.66 | 9.66 | 9.09 | 24.36 | 18.05
YOLO 416 | ResNet-101 | 25.72 | 35.6 | 3.03 | 55.73 | 22.4 | 15.61 | 17.26 | 9.32 | 3.03 | 38.71 | 22.64
YOLO 608 | ResNet-101 | 28.79 | 44.59 | 9.42 | 62.18 | 33.34 | 15.53 | 23.88 | 13.24 | 15.83 | 39.17 | 28.6
YOLO 320 | ResNet-152 | 21.64 | 27.56 | 3.03 | 48.06 | 17.39 | 11.12 | 14.51 | 9.09 | 4.55 | 31.88 | 18.88
YOLO 416 | ResNet-152 | 25.7 | 36.54 | 0.89 | 53.81 | 20.6 | 14.13 | 20.21 | 11.49 | 0.29 | 33.06 | 21.67
YOLO 608 | ResNet-152 | 26.01 | 44.54 | 4.55 | 61 | 31.76 | 13.02 | 22.67 | 12.35 | 9.93 | 39.99 | 26.58
SSD300 [16] | ResNet-101 | 5.5 | 9.1 | 0 | 25.5 | 6.1 | 4.5 | 0 | 4.5 | 9.1 | 18.2 | 8.25
SSD300 [16] | VGG16 | 9.1 | 17.1 | 0 | 26.1 | 9.1 | 9.1 | 0 | 4.5 | 0 | 16.7 | 9.16
SSD512 [16] | VGG16 | 9.1 | 17.1 | 0 | 43 | 9.1 | 9.1 | 9.1 | 9.1 | 0 | 7.6 | 11.32
RetinaNet | ResNet-50-FPN | 30.7 | 49.3 | 2 | 65.5 | 21.3 | 16.1 | 8.5 | 12.9 | 1 | 25.7 | 23.3
RetinaNet | ResNet-101-FPN | 30.6 | 48.7 | 7.1 | 64.7 | 20 | 15.9 | 11.8 | 10.7 | 2.9 | 38.7 | 25.1
RetinaNet | ResNeXT-101-32×8d-FPN | 35.5 | 55 | 12.1 | 66.5 | 23.9 | 18.4 | 9.8 | 16.2 | 9.4 | 53.7 | 30
RetinaNet | ResNeXT-101-64×4d-FPN | 31.4 | 50.2 | 8.9 | 66.3 | 20.8 | 15.3 | 9.4 | 14 | 2.2 | 32.4 | 25.1
R-CNN [13] | RPN prop. + VGG16 | 31.9 | 31.3 | 4.2 | 56.8 | 31.1 | 9.3 | 14.2 | 16.4 | 23.4 | 29.4 | 24.8
R-CNN [13] | AlexNet 7×, 300 prop. | 32.4 | 27.2 | 5.1 | 56.9 | 28 | 9.8 | 13.6 | 12.4 | 17.9 | 35.6 | 23.9
R-CNN [13] | VGG16 7×, 300 prop. | 37.3 | 30.3 | 7.2 | 60.6 | 41.5 | 15.8 | 21.5 | 13.7 | 22 | 33.3 | 28.4
R-CNN [13] | ContextNet (AlexNet 7×) | 32.7 | 26.8 | 4.6 | 56.4 | 26.3 | 9.9 | 12.9 | 12.2 | 18.7 | 34 | 23.5
Fast RCNN | ResNet-50-C4 | 32.4 | 46.3 | 6.5 | 65.8 | 38.3 | 20.1 | 25.3 | 16.6 | 14.1 | 52 | 31.7
Fast RCNN | ResNet-50-FPN | 37.4 | 47.3 | 7.3 | 68.9 | 46.7 | 21 | 32.1 | 17.1 | 9.3 | 45.9 | 33.3
Fast RCNN | ResNet-101-FPN | 39.3 | 50.3 | 10.6 | 68.3 | 47.1 | 20.4 | 33.3 | 18.6 | 15.4 | 51.4 | 35.5
Fast RCNN | ResNeXT-101-32×8d-FPN | 47.5 | 54.8 | 10.3 | 71.8 | 54 | 21.4 | 34.4 | 21.7 | 17.7 | 53.5 | 38.7
Fast RCNN | ResNeXT-101-64×4d-FPN | 45.4 | 55.7 | 10.9 | 72.5 | 53.3 | 24 | 36.9 | 22.9 | 16 | 58.1 | 39.6
Faster R-CNN [16] | VGG16 | 23.76 | 37.65 | 8.03 | 54 | 16.16 | 11.88 | 15.12 | 9.1 | 6.25 | 37.29 | 21.92
Faster RCNN | ResNet-50-C4 | 32.2 | 44.6 | 6.6 | 65.9 | 35.2 | 17.5 | 25.7 | 19.6 | 13.7 | 40 | 30.1
Faster RCNN | ResNet-50-FPN | 35.7 | 49.9 | 7.3 | 68.4 | 48.9 | 18.8 | 29.6 | 14.7 | 11.4 | 53.3 | 33.8
Faster RCNN | ResNet-101-FPN | 39.8 | 49.2 | 4.9 | 68.2 | 47 | 18.5 | 29.7 | 14 | 12.9 | 52.2 | 33.7
Faster RCNN | ResNeXT-101-32×8d-FPN | 49.8 | 56.6 | 11.4 | 72.1 | 56.3 | 23.2 | 37 | 20.8 | 18.8 | 58.7 | 40.5
Faster RCNN | ResNeXT-101-64×4d-FPN | 49.6 | 58.6 | 12.2 | 72.5 | 54.5 | 23.2 | 36.9 | 20.8 | 20.1 | 63.1 | 41.2

The values in bold represent the best among the one-stage methods, and the ones in italics represent the highest among the two-stage methods.


In addition, if we compare it with the one-stage methods, it is significantly lower than them. However, RoI align along with the RPN performs well when the scales are changed. When it comes to the backbones, there is a small decrease in accuracy when changing from ResNet-50-FPN to ResNet-101-FPN or from ResNeXT-101-32×8d-FPN to ResNeXT-101-64×4d-FPN, for objects of all scales, for both Faster RCNN and Fast RCNN. The VGG16 backbone has an impressive outcome compared with strong backbones such as ResNet or ResNeXT: although its accuracy is lower than the two strong backbones, VGG16 is still better with objects in VOC_WH_20 and shows little change in accuracy when moving to objects of big sizes.

5.2. Time Processing and Resource Consumption. Tables 5 and 6 show the performance comparison of the evaluated models with their base networks. Generally, we see that the RAM consumption in testing and training increases as more layers are added. This means that the deeper the network, the greater the need for processing, because this leads to an increase in parameters and in the time to process data as well. YOLO is the model consuming the least memory in both the training and testing phases; particularly, YOLO needs only about 4 to 5 GB for training and 1.6 to 1.8 GB for testing with Darknet-53. YOLO is also the only one able to run in real time: YOLO just needs about 0.03 s to 0.04 s to process an image, in comparison with more than 0.1 s and 0.2 s for Faster RCNN and RetinaNet. This allows us to deploy these models on devices with modest memory. While RetinaNet is assigned to the one-stage approach, it is not fast enough to meet real-time detection. The inference time of Fast RCNN is a little lower than that of Faster RCNN and RetinaNet; in contrast, the RAM consumption of RetinaNet in training and testing is lower than Fast RCNN and Faster RCNN. Of all the architectures, ResNet-50-C4 is the one requiring the most memory and time to process data, because the output size of ResNet-50-C4 is a bit bigger than the others [9]. If we consider ResNet or ResNeXT combined with FPN, Faster RCNN uses over 100 MB more than Fast RCNN and 300 MB more than RetinaNet. In addition, according to Table 2, Faster RCNN and RetinaNet need less time for training, only a few hours to 1 day, compared to YOLO's 3–4 days. This demonstrates that if we pay attention to performance and do not have much time for training, we should choose Faster RCNN or RetinaNet instead of YOLO; in contrast, if we only focus on processing speed while still achieving good performance, the one-stage methods are always the good choice. In the same context of backbones, RetinaNet uses fewer resources than Fast RCNN and Faster RCNN at testing time, by about 100 MB and 300 MB, respectively. However, at training time, RetinaNet uses much more memory than Fast RCNN, by about 2.8 GB, and than Faster RCNN, by about 2.3 GB, for ResNeXT-101-32×8d-FPN and ResNeXT-101-64×4d-FPN.

Figure 3: The location of the default boxes at different scales: (a) image with GT boxes; (b) 8×8 feature map; (c) 4×4 feature map. Each default box predicts location offsets loc: Δ(cx, cy, w, h) and confidences conf: (c1, c2, ..., cp).


Table 4: The comparative results on subsets of PASCAL VOC 2007.

Method | VOC_MRA_058 | VOC_MRA_10 | VOC_MRA_20 | VOC_WH_20

One stage:
YOLOv2 416 [16] | 3.02 | 31.38 | 42.89 | 18.52
YOLOv2 448 [16] | 4.47 | 32.9 | 60.15 | 21.96
YOLOv2 480 [16] | 4.26 | 33.48 | 60.78 | 26.67
YOLOv2 512 [16] | 5.42 | 35.74 | 61.12 | 24.63
YOLOv2 544 [16] | 6.97 | 36.56 | 63 | 26.62
YOLOv2 640 [16] | 7.7 | 37.97 | 61.29 | 23.41
YOLOv2 800 [16] | 10.24 | 37.3 | 61.91 | 26.9
YOLOv2 1024 [16] | 10.69 | 29.93 | 55.14 | 28.97
YOLOv3 320 | 7.18 | 34.58 | 60.36 | 20.4
YOLOv3 416 | 10.2 | 38.97 | 62.53 | 24.12
YOLOv3 608 | 11.7 | 42.65 | 68.56 | 28.86
SSD 300 [16] | 1.71 | 32.76 | 46.26 | 16.91
SSD 512 [16] | 2.9 | 43.46 | 57.11 | 19.87
RetinaNet-ResNet-50-FPN | 8.84 | 41.5 | 50.2 | 28.14
RetinaNet-ResNet-101-FPN | 8.95 | 42.5 | 51.9 | 27.46
RetinaNet-ResNeXT-101-32×8d-FPN | 10.29 | 45.4 | 54.5 | 30.08
RetinaNet-ResNeXT-101-64×4d-FPN | 10.71 | 45.5 | 55.1 | 31.32

Two stage:
Fast RCNN-ResNet-50-C4 | 0.23 | 1.32 | 49.9 | 3.93
Fast RCNN-ResNet-50-FPN | 0.63 | 1.35 | 55.6 | 3.45
Fast RCNN-ResNet-101-FPN | 0.39 | 1.59 | 57.6 | 3.12
Fast RCNN-ResNeXT-101-32×8d-FPN | 0.51 | 1.44 | 57.9 | 3.33
Fast RCNN-ResNeXT-101-64×4d-FPN | 0.29 | 1.42 | 57.3 | 3.76
Faster RCNN-ResNet-50-C4 | 6.98 | 39.9 | 48.7 | 26.04
Faster RCNN-ResNet-50-FPN | 10.74 | 45.6 | 56.3 | 29.79
Faster RCNN-ResNet-101-FPN | 10.63 | 46.9 | 57.6 | 30.57
Faster RCNN-ResNeXT-101-32×8d-FPN | 11.64 | 47.3 | 57.6 | 32.12
Faster RCNN-ResNeXT-101-64×4d-FPN | 10.54 | 47.1 | 56.9 | 31.64
Faster RCNN-VGG16 [16] | 5.73 | 35.58 | 44.14 | 41.11

This table illustrates how well the models adapt to different scales of objects. The values in bold represent the best among the one-stage methods, and the ones in italics represent the highest among the two-stage methods.



If we consider this on the small object dataset, it does not help much, because RetinaNet is about 10% lower than Faster RCNN in performance. Otherwise, on the different scales of the subsets, RetinaNet works well compared to Faster RCNN,

and the difference is just 2–4 percentage points. Although ResNet backbones yield an accuracy improvement when combined with the other detectors, they do not work for YOLO on the small object dataset. YOLO with Darknet-53 uses more resources

Table 5: The comparison of consumption on the small object dataset.

Model        Backbone                 Inference time (s)  Test RAM (MiB)  Train RAM (MiB)
YOLOv3       Darknet-53               0.0331              1825            4759
YOLOv3       ResNet-50                0.027               1285            3479
YOLOv3       ResNet-101               0.0356              1829            5383
YOLOv3       ResNet-152               0.0454              2443            7531
RetinaNet    ResNet-50-FPN            0.102               2075            4435
RetinaNet    ResNet-101-FPN           0.127               2723            5577
RetinaNet    ResNeXT-101-32×8d-FPN    0.229               3767            7863
RetinaNet    ResNeXT-101-64×4d-FPN    0.292               3719            7813
Fast RCNN    ResNet-50-C4             0.3                 6449            5877
Fast RCNN    ResNet-50-FPN            0.089               2277            4455
Fast RCNN    ResNet-101-FPN           0.113               2947            5627
Fast RCNN    ResNeXT-101-32×8d-FPN    0.212               3987            4961
Fast RCNN    ResNeXT-101-64×4d-FPN    0.269               3885            4799
Faster RCNN  ResNet-50-C4             0.412               6609            6129
Faster RCNN  ResNet-50-FPN            0.101               2387            5381
Faster RCNN  ResNet-101-FPN           0.124               3001            6487
Faster RCNN  ResNeXT-101-32×8d-FPN    0.256               4027            5333
Faster RCNN  ResNeXT-101-64×4d-FPN    0.286               4003            5246

Table 6: The comparison of consumption on the subsets filtered from PASCAL VOC.

Model        Backbone                 Inference time (s)  Test RAM (MiB)  Train RAM (MiB)
YOLOv3       Darknet-53               0.027               1645            4079
RetinaNet    ResNet-50-FPN            0.1                 1935            4133
RetinaNet    ResNet-101-FPN           0.116               2585            5435
RetinaNet    ResNeXT-101-32×8d-FPN    0.222               3641            7723
RetinaNet    ResNeXT-101-64×4d-FPN    0.284               3561            7599
Fast RCNN    ResNet-50-C4             0.495               6371            5677
Fast RCNN    ResNet-50-FPN            0.092               2131            4387
Fast RCNN    ResNet-101-FPN           0.114               2819            5463
Fast RCNN    ResNeXT-101-32×8d-FPN    0.213               3873            4637
Fast RCNN    ResNeXT-101-64×4d-FPN    0.265               3735            4575
Faster RCNN  ResNet-50-C4             0.26                6141            5991
Faster RCNN  ResNet-50-FPN            0.1                 2245            5207
Faster RCNN  ResNet-101-FPN           0.13                2855            6335
Faster RCNN  ResNeXT-101-32×8d-FPN    0.225               3943            5087
Faster RCNN  ResNeXT-101-64×4d-FPN    0.276               3885            4909

Figure 5: Highlight of bounding boxes from comparative backbones on the small object dataset. We select YOLO with Darknet-53 and ResNet-50 for an objective comparison because they have roughly the same number of layers along with the same significant techniques, such as skip connections and residual blocks. The bounding boxes show that ResNet-50 is more sensitive than Darknet-53 to areas that resemble the objects of interest. Similarly, ResNet-50-FPN and ResNet-50-C4 are chosen for comparison. The detections show that combining ResNet-50 with FPN yields better performance than the original one; in particular, misdetections happen more densely with ResNet-50-C4 than with ResNet-50-FPN, such as in columns 4 and 5. Zoom in to see more detail.


than the ResNet ones, but it has the best accuracy among the models. Therefore, we only test YOLO with Darknet-53 on the subsets of PASCAL VOC.
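The timing and memory figures in Tables 5 and 6 follow a simple protocol; the following is a minimal sketch of how such numbers can be collected, assuming a PyTorch detector (the model and images here are placeholders, and the averaging scheme is our assumption, not the exact script used in the experiments):

# Minimal sketch: average per-image inference latency and peak GPU memory.
import time
import torch

def benchmark(model, images, device="cuda"):
    model.eval().to(device)
    torch.cuda.reset_peak_memory_stats(device)
    latencies = []
    with torch.no_grad():
        for img in images:
            img = img.to(device)
            torch.cuda.synchronize(device)        # start from an idle GPU
            start = time.perf_counter()
            model([img])                          # torchvision-style detector call
            torch.cuda.synchronize(device)        # wait for GPU kernels to finish
            latencies.append(time.perf_counter() - start)
    peak_mib = torch.cuda.max_memory_allocated(device) / 2**20
    return sum(latencies) / len(latencies), peak_mib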

5.3. Analyses of the Trade-Off among Detectors. Network designs and approaches like the one-stage approach have proven their performance when applied to detect general objects, at both small scales and other scales. Although these models are fast and accurate, one drawback always remains: the trade-off between accuracy and processing speed. For example, YOLOv3 proposes performing detection at three different scales, and the result is impressive and yields good performance. However, to gain this advantage, YOLOv3 has to sacrifice processing time: instead of processing each input once for detection, as YOLOv2 normally does, it must predict three times. This trade-off is also partly affected by resolution, as we change it during training or testing our models. In our previous work, we mentioned that we have to choose the right resolution to ensure our models work properly. In the case of the two-stage approaches, the idea of region proposals to improve the localization of objects for detection is good as well. It is useful, but we have to take into account whether we should generate proposals on feature maps or directly on input images, because this strongly affects the way the models run and identify representations of objects. If objects are normal or have a big or medium appearance, the models work well; but if objects come at multiple scales, this is a problem that must be considered and researched deeply in order to balance and improve performance. To partly fix this problem, the one-stage approach allows us to choose a fixed input size for training and testing, but the benefit still depends on the characteristics of the datasets we evaluate and on the image size. After all, all the models we chose to evaluate are affected by the scales of objects: when we change the scale, the accuracy of the models changes a lot, except for Faster RCNN, the only model that seems to be stable with scale, especially when combined with the VGG16 architecture. Although the accuracy of VGG16 is not better than the other architectures, the difference is that it does not change much in accuracy. This only holds for big objects whose bounding box overlaps more than 10% of the image; otherwise, it is not assured.

Figure 1 shows that the possibility of small objects is higher than that of other objects. The black length of the camera is somewhat similar to the black mouse placed on a mouse pad. This possibility of small object presence causes more difficulties for detectors and leads to wrong detections. Small objects can appear anywhere in an image; as a result, detectors make many wrong detections on familiar appearances they have seen before. If we consider the visualization of the detections in Figure 4, the wrong detections are partly similar to the appearance of the other objects in the dataset. This problem is caused by the data imbalance between classes and between instances in each class, originally known as the foreground-foreground class imbalance. In other words, the common problems, which happen not only with small objects but for whole datasets, are intraclass similarity and interclass variation.

6. Conclusion

Small object detection is a challenging and interesting problem in the task of object detection. It has drawn attention from researchers thanks to the development of deep learning, which is the motivation to improve the performance of tasks in computer vision. Although deep detection models originally tend to solve problems relating to general object detection, they still work with a particular level of success on small object detection. As an evaluation work on small object detection for deep models, our goal is to highlight remarkable achievements of popular and state-of-the-art deep models in order to provide a variety of views on applying deep models to small object detection. Particularly, we evaluate state-of-the-art detectors based on deep learning from the two approaches, namely, YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, on two datasets, namely, the small object dataset and subsets filtered from PASCAL VOC, with respect to different objective factors including accuracy, execution time, and resource usage.

Despite the successful achievements of recent years, in which detection performance has improved significantly, there is still a huge gap in accuracy between normal objects and small objects. In the criteria of the COCO dataset, the difference from the small scale to the medium and big scales is too large. Most models are good at detecting normal objects, and problems arise when applying them to detect small objects. As a result, to reduce the gap in small object detection, the first thing to do is to invest in datasets which have a massive amount of data to train models and a wide range of categories to compete with the human visual system [12, 34].

So far, detection models are divided into two main approaches, namely, the one-stage approach and the two-stage approach. Models in the one-stage approach are known as detectors with faster and more efficient detection in comparison to the other approach. The efficiency here is the potential to run in real time and the ability to apply them to practical applications. However, the trade-off between accuracy and speed is a difficult challenge which needs to be taken into account in order to close the gap. Models in the two-stage approach, on the other hand, have a reputation as region-based detectors with high accuracy but speeds too low to apply them to the real world. This drawback comes from the computation of their networks.

Through our evaluation, it is a fact that the architectures utilized as base networks to extract deep features have significant effects on the frameworks. The deeper the architecture is, the higher the accuracy of detection is. Once a network increases in depth, it has more layers than normal ones and massive numbers of parameters to train. Hence, it needs a lot of data to fine-tune these parameters reasonably. An increase in computation also increases resource consumption. As a result, it


will be difficult to apply these models in practical applications. Besides, the exploitation of context in the models is definitely limited; this causes much useful and informative data to be ignored in training, especially in the context of small objects. Because small objects are able to appear anywhere in an input image, if the image is well exploited with its context, the performance of small object detection will be improved.

For all the above reasons, and according to our evaluation: if we aim at good performance and ignore processing speed, two-stage methods like Faster RCNN perform well and demonstrate their network design on different datasets with many contexts of objects, including multiscale objects. Therefore, Faster RCNN is considered a strong baseline to base on or develop from. If our target is a balance of accuracy and speed, YOLO is a good choice in case we do not care about training time, because its sacrifice between speed and accuracy is worth applying to practical applications; otherwise, Faster RCNN or RetinaNet is still a substitute to work with. When it comes to backbones, we have to consider the data in order to choose a reasonable backbone to combine with the methods, because the amount of data significantly impacts the model: if data are not abundant, a shallow network will fit it well. Besides, there are recently novel approaches promising for training deep models with less data, namely, weakly supervised learning such as zero-shot, one-shot, or few-shot learning. These approaches will be considered in our future work. Following our recent research, to obtain a better mAP on object detection we have to consider several factors, such as multiscale training, superresolution for scaling up the visual information of small objects [35], or preprocessing data to avoid imbalanced data, because a wide range of imbalance problems relate to data [33].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the Vietnam National University Ho Chi Minh City (VNU-HCM) under grant no. B2017-26-01.

References

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, Columbus, OH, USA, June 2014.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proceedings of the European Conference on Computer Vision, pp. 346–361, Springer, Zurich, Switzerland, September 2014.
[3] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, Santiago, Chile, December 2015.
[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, Las Vegas, NV, USA, June 2016.
[5] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," 2016, https://arxiv.org/abs/1612.08242.
[6] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.
[7] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, "Focal loss for dense object detection," in Proceedings of the IEEE Transactions on Pattern Analysis and Machine Intelligence, Venice, Italy, October 2018.
[8] K. Zidek, A. Hosovsky, J. Pitel', and S. Bednar, "Recognition of assembly parts by convolutional neural networks," in Advances in Manufacturing Engineering and Materials, pp. 281–289, Springer, Cham, Switzerland, 2019.
[9] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, IEEE, Venice, Italy, October 2017.
[10] L.-C. Chen, A. Hermans, G. Papandreou et al., "Instance segmentation by refining object detection with semantic and direction features," 2017, https://arxiv.org/abs/1712.04837.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[12] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft COCO: common objects in context," in Proceedings of the European Conference on Computer Vision, pp. 740–755, Springer, Zurich, Switzerland, September 2014.
[13] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, "R-CNN for small object detection," in Proceedings of the Asian Conference on Computer Vision, pp. 214–230, Springer, Taipei, Taiwan, November 2016.
[14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
[15] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," in Proceedings of the 28th International Conference on Neural Information Processing Systems, Ser. NIPS'15, pp. 91–99, MIT Press, Cambridge, MA, USA, 2015, http://dl.acm.org/citation.cfm?id=2969239.2969250.
[16] P. Pham, D. Nguyen, T. Do, T. D. Ngo, and D.-D. Le, "Evaluation of deep models for real-time small object detection," in Proceedings of the International Conference on Neural Information Processing, pp. 516–526, Springer, Guangzhou, China, November 2017.
[17] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.
[18] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu, "Traffic-sign detection and classification in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2110–2118, Las Vegas, NV, USA, June 2016.
[19] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: a large data set for nonparametric object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.
[20] A. Kembhavi, D. Harwood, and L. S. Davis, "Vehicle detection using partial least squares," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1250–1265, 2011.
[21] V. I. Morariu, E. Ahmed, V. Santhanam, D. Harwood, and L. S. Davis, "Composite discriminant factor analysis," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 564–571, IEEE, Steamboat Springs, CO, USA, March 2014.
[22] A. Andreas, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 2012.
[23] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: human trajectory prediction in crowded spaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971, Las Vegas, NV, USA, June 2016.
[24] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, "SUN database: exploring a large collection of scene categories," International Journal of Computer Vision, vol. 119, no. 1, pp. 3–22, 2016.
[25] E. Dong, Y. Zhu, Y. Ji, and S. Du, "An improved convolution neural network for object detection using YOLOv2," in Proceedings of the 2018 IEEE International Conference on Mechatronics and Automation (ICMA), pp. 1184–1188, IEEE, Changchun, China, August 2018.
[26] W. Liu, D. Anguelov, D. Erhan et al., "SSD: single shot multibox detector," in Proceedings of the European Conference on Computer Vision, pp. 21–37, Springer, Amsterdam, The Netherlands, October 2016.
[27] T.-Y. Lin, P. Dollar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, p. 4, Honolulu, HI, USA, July 2017.
[28] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: deconvolutional single shot detector," 2017, https://arxiv.org/abs/1701.06659.
[29] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision, pp. 210–226, Springer, Munich, Germany, September 2018.
[30] L. Liu, W. Ouyang, X. Wang et al., "Deep learning for generic object detection: a survey," 2018, https://arxiv.org/abs/1809.02165.
[31] P. Zhu, L. Wen, X. Bian, L. Haibin, and Q. Hu, "Vision meets drones: a challenge," 2018, https://arxiv.org/abs/1804.07437.
[32] Z.-Q. Zhao, P. Zheng, S.-t. Xu, and X. Wu, "Object detection with deep learning: a review," 2018, https://arxiv.org/abs/1807.05511.
[33] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas, "Imbalance problems in object detection: a review," 2019, https://arxiv.org/abs/1909.00169.
[34] O. Russakovsky, J. Deng, H. Su et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[35] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 206–221, Munich, Germany, September 2018.



different definitions for different datasets, instead of only using the size of the bounding box containing an object to decide whether the object is small or not. For example, Zhu et al. [18] defined small objects as objects whose sizes fill 20% of an image when releasing their dataset of traffic signs: a square traffic sign is a small object when the width of its bounding box is less than 20% of the image width and the height of its bounding box is less than 20% of the image height. In [19], Torralba et al. supposed small objects to be less than or equal to 32×32 pixels. In the small object dataset [13], objects are small when they have a mean relative overlap (the ratio between the bounding box area and the image area) from 0.08% to 0.58%, corresponding to 16×16 to 42×42 pixels in a VGA image. In this work, we reuse the above definitions, especially the definitions from [13, 18], as the main references because they are reliable resources and are widely accepted by other researchers.
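As an illustration, the two definitions we rely on can be checked for a single bounding box as follows (the helper names are ours; the thresholds follow [13, 18]):

# Sketch of the two small-object definitions applied to one bounding box.
def is_small_by_relative_area(box_w, box_h, img_w, img_h,
                              max_relative_area=0.0058):
    """Small object dataset [13]: relative overlap up to 0.58% of the image."""
    return (box_w * box_h) / (img_w * img_h) <= max_relative_area

def is_small_by_relative_size(box_w, box_h, img_w, img_h, ratio=0.20):
    """Zhu et al. [18]: width and height under 20% of the image's."""
    return box_w <= ratio * img_w and box_h <= ratio * img_h

# e.g., a 32x32 box in a 640x480 VGA image:
print(is_small_by_relative_area(32, 32, 640, 480))  # True (about 0.33%)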

2.3. Datasets and Approaches. There are limited works concentrating on sorts of small objects, which results in limited experience and knowledge on which to base comprehensive research. Previous approaches focus on big objects and ignore the existence of small objects. In fact, we do not comprehend how well existing detection approaches perform when dealing with small objects. Hence, in this work, we assess the performance of existing state-of-the-art detectors to draw a general picture of their abilities for small object detection.

In terms of small object detection, there are just a few works regarding the problem of detecting small objects. So far, most of these works are designed to detect only single categories, such as traffic signs [18], vehicles [20–22], or pedestrians [23], and do not cover common or multiclass datasets from the real world. This results in a lack of evaluation of the approaches' ability to detect different kinds of objects and the variation of their shapes as well. Fortunately, Chen et al. [13] present their small object dataset, combining the Microsoft COCO [12] and SUN [24] datasets, which consists of common objects such as "mouse", "telephone", "switch", "outlet", "clock", "tissue box", "faucet", "plate", and "jar". Chen also augments the R-CNN algorithm with some modifications to improve the performance of detecting small objects. Following this idea, we conducted a small survey of existing datasets and found that PASCAL VOC, like the COCO and SUN datasets, contains small objects of various categories. So we rely on existing and common definitions of small objects to filter objects that meet these definitions and form a dataset of 4 subsets, corresponding to 4 different definitions of small objects, so as to objectively consider how different scales of objects affect detection performance. In addition, there is recently a small object dataset in a challenge called Vision Meets Drones: A Challenge (http://aiskyeye.com); this dataset is considered challenging because it consists of several small objects, even tiny objects, in images taken in different contexts and conditions in the wild, but the views in the images are snapshots from drones flying above with high-resolution cameras attached. Unfortunately, this dataset does not have annotations for testing, so it is hard to use for evaluation.

Therefore, in this work, we choose the small object dataset [13] and our filtered dataset for our evaluation, because these datasets contain common objects and the number of images is large, so the evaluations are objective.

3. Deep Models for Object Detection

Recently, with the widespread development of deep learning, it is known that convolutional neural network (CNN) approaches have shown many improvements and achieved good results in various tasks. Therefore, they are commonly applied in well-known works. Most of these works have shown significant improvements in detecting objects filling medium or big parts of an image.

RCNN [1] is one of the pioneers. The following methods, such as [2, 3, 15], are improvements of R-CNN; especially, Faster R-CNN [15] is considered a state-of-the-art approach. This sequence of advanced works uses a lot of different and breakthrough ideas, from sliding windows to object proposals, and mostly achieves the best results as state-of-the-art methods on challenging datasets such as COCO, PASCAL VOC, and ILSVRC. However, their representations take much time to run over an image completely and may reduce the running performance of the detector. As a result, the detectors are difficult to use for detecting objects in real time, despite achieving high accuracy. This means they focus on accuracy and ignore the effects of processing speed. In addition, detecting small objects in the real world is as important as detecting big or medium objects, even more necessary than we imagined. Especially in the automotive industry, smart cars, army projects, and smart transportation, data must be promptly and precisely processed to make sure that safety comes first. But in these cases, generally, the recorded data usually are far from our position, and the information of interest is a small thing.

In terms of real-time detection, the one-stage methods, such as YOLO and SSD, use local information to predict objects instead of using object proposals to get RoIs before moving to a classifier, as two-stage approaches such as Faster R-CNN do. Both methods process images in real time, detect objects correctly, and still achieve a high mAP. Nevertheless, these papers only mention that the models can detect small objects and achieve good results, but they do not show evidence to prove to what extent small objects are solved. In this work, we evaluate these models from both approaches to find out their performance and to what extent they are good at detecting small objects. The following are the general ideas of the above-mentioned approaches.

3.1. R-CNN. R-CNN [1] is a novel and simple approach; as a pioneer, it provided more than a 30% improvement in mean average precision (mAP) over the previous works on PASCAL VOC.


The overview of the R-CNN architecture consists of four main phases, which are known as the new advances of this method. Firstly, the R-CNN network resizes an image to 227×227 and takes it as an input. Then, the selective search algorithm [17] is applied to the image and generates 2000 candidate proposed bounding boxes as warped regions, which are used as the input of the CNN feature network. From these regions, the network extracts a 4096-dimensional feature vector for each region. Finally, a class-specific linear SVM classifier behind the last layer classifies the regions to decide whether there are any objects and what the objects are.
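The four phases can be condensed into a short sketch; note that selective_search, cnn, and the per-class svms here are hypothetical placeholders standing in for the components named above, not a real API:

# Condensed sketch of the R-CNN pipeline; all callable objects are placeholders.
import numpy as np
from skimage.transform import resize

def rcnn_detect(image, selective_search, cnn, svms):
    proposals = selective_search(image)[:2000]       # ~2000 region candidates
    all_scores = []
    for (x, y, w, h) in proposals:
        region = image[y:y + h, x:x + w]
        warped = resize(region, (227, 227))          # warp to the fixed CNN input
        feature = cnn(warped)                        # 4096-d feature vector
        all_scores.append([svm.decision_function([feature])[0] for svm in svms])
    return proposals, np.array(all_scores)           # per-class SVM scores per box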

The major key to the success of R-CNN is that features matter. In R-CNN, the low-level image features (e.g., HOG) are replaced with CNN features, which are arguably more discriminative representations. However, the evaluation of an image is extremely costly and wasteful because R-CNN must apply the convolutional network 2000 times. Besides, resizing the input to a low 227×227 is a problem affecting small objects, which easily deform or even lose information as the resolution changes far from their original sizes. The region proposals overlap, leading to computing similar features many times, and every region proposal must be stored to disk before feature extraction. In addition, many overlapping bounding boxes result in a drop of mAP if small objects are close to big objects, because there is a bias toward choosing the bounding boxes containing big objects and ignoring the bounding boxes of small objects.

3.2. Spatial Pyramid Pooling (SPP). The primary ideas of SPP [2] are motivated by limitations of the CNN architecture: the original CNN requires input images of a fixed size (224×224 for AlexNet), so the raw picture often needs cropping (a fixed-size patch that truncates the original image) or warping (the RoI of an input image must be a fixed-size patch). The fully connected layer needs a fixed-length input, while the convolutional layers can adapt to an arbitrary input size; thus, a bridge is needed as an intermediate layer between the convolutional layers and the fully connected layer, and that is the SPP layer. Particularly, SPP-net first finds 2000 candidate region proposals, like the R-CNN method, and then extracts the feature maps from the entire image. SPP maps each window of the features corresponding to a region proposal to a fixed-length representation regardless of the input size. Finally, 2 fully connected layers are used, and classification is done by SVM.
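A minimal sketch of such an SPP layer, assuming PyTorch: pooling the feature map at a few pyramid levels and concatenating gives a fixed-length vector for any input size.

# Minimal SPP layer sketch: fixed-length output from arbitrary spatial sizes.
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    # feature_map: (N, C, H, W) with arbitrary H, W
    pooled = [F.adaptive_max_pool2d(feature_map, level).flatten(1)
              for level in levels]              # (N, C*level*level) each
    return torch.cat(pooled, dim=1)             # fixed length: C * sum(l*l)

x = torch.randn(1, 256, 13, 17)                 # any spatial size works
print(spatial_pyramid_pool(x).shape)            # torch.Size([1, 5376])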

In short, in the detection task, SPP-net is better than R-CNN and up to 100× faster, but its training is very slow because of the multistage training steps (fine-tuning of the last layers, SVMs, and regressors), and it takes a lot of disk space to store the feature vectors.

3.3. Fast R-CNN. Fast R-CNN [3] is an advanced method that presents various innovations to improve training and testing time and to efficiently classify object proposals while still increasing the accuracy rate by using deep convolutional networks. The architecture of Fast R-CNN is trained end-to-end with a multitask loss. Specifically, the convolutional network takes an image of any size as input, together with several RoIs. Instead of applying the RoIs to the input and warping them to feed into the network at the first step, like RCNN, Fast RCNN applies these RoIs to a feature map produced by the several convolutional layers of the base network. From each RoI, a pooling layer extracts a fixed-size feature vector, which fully connected layers then map to a feature vector. The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets.

The most important feature of RoI pooling is sharing computation and memory in the forward and backward passes for the same image. The huge contribution of Fast R-CNN is that it proposes a new training method that fixes the drawbacks of R-CNN and SPP-net while improving their running time and accuracy. The advantage is that the mean average precision of detection is higher than that of R-CNN and SPP-net. The training phase is a single stage using a multitask loss and can update all network layers, and no disk storage is required for feature caching.
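As an illustration with torchvision's RoI pooling operator (the feature map, boxes, and stride here are made-up values): all RoIs share one feature map, and each is pooled to a fixed 7×7 grid.

# Sketch of Fast R-CNN's RoI pooling on a shared feature map.
import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 256, 50, 50)                 # conv feature map of one image
rois = torch.tensor([[0, 10.0, 10.0, 200.0, 150.0],    # (batch_idx, x1, y1, x2, y2)
                     [0, 40.0, 60.0, 120.0, 300.0]])   # in input-image coordinates
pooled = roi_pool(features, rois, output_size=(7, 7),
                  spatial_scale=1.0 / 16)              # backbone stride of 16
print(pooled.shape)                                    # torch.Size([2, 256, 7, 7])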

3.4. Faster R-CNN. Faster R-CNN [15] is an innovative approach improved from Fast R-CNN. Unlike its two predecessors, instead of generating bounding boxes by external algorithms [17], as in [1, 3], Faster R-CNN runs its own method, called the region proposal network (RPN), which is trained end-to-end to generate highly qualified region proposals. After obtaining deep features from the early convolutional layers, the RPN is applied, and windows slide over the feature map to extract features for each region proposal. The RPN is a fully convolutional network which simultaneously predicts object bounding boxes and objectness scores at each position. The input of the RPN is an image of any size, and it outputs a set of rectangular object proposals, each with an objectness score. Specifically, the RPN takes the image feature map of the fifth convolutional layer (conv5) as input and applies a 3×3 sliding window on the feature map. Then an intermediate layer feeds into two different branches: one for the object score (determining whether the region is thing or stuff) and the other for regression (determining how the bounding box should change to become more similar to the ground truth). The RPN improves accuracy and running time and avoids generating an excess of proposal boxes, because it reduces cost by sharing computation on convolutional features. The RPN and Fast R-CNN are merged into a single network by sharing their convolutional features. This combination gives Faster R-CNN leading performance in accuracy but makes its architecture a two-stage network, which reduces the processing speed of this method.
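A minimal sketch of such an RPN head in PyTorch, following the description above; the channel sizes and k = 9 anchors per location follow the common VGG16 setting and are assumptions here, not the configuration used in our experiments.

# Minimal RPN head: 3x3 sliding window, then 1x1 objectness/regression branches.
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(512, k, kernel_size=1)       # object vs. not
        self.bbox_deltas = nn.Conv2d(512, 4 * k, kernel_size=1)  # per-anchor offsets

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))
        return self.objectness(h), self.bbox_deltas(h)

scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
print(scores.shape, deltas.shape)  # (1, 9, 38, 50) (1, 36, 38, 50)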

3.5. You Only Look Once. Inheriting the advantages of the previous models introduced earlier, You Only Look Once (YOLO) [4] was considered a state-of-the-art real-time object detector over various categories at that time. YOLO currently has three versions [4–6], which improve substantially with each version.


The detailed analyses of the YOLO approaches, as a premise for applying them to practical applications, are as follows.

YOLOv1 [4], a unified or one-stage network, is widely known as a completely novel approach aiming to tackle object detection in real time, proposed by Redmon et al. Instead of performing object detection like the previous techniques based on complex tasks, such as [1, 4], which either use an exhaustive sliding window and feed its outputs to classifiers at equally spaced locations over the whole image, or use region proposals to generate bounding boxes which possibly contain objects and then feed them to convolutional neural networks, YOLO considers object detection a regression problem, simultaneously predicting the coordinates of bounding boxes and the class probabilities for those boxes. The key idea behind YOLO's detection is that it separates images into grid views, which improves both running time and localization of objects. The goal of YOLO is to deal with two problems, namely, what objects are presented and where they are in an image. The operation of YOLO proceeds in three principal steps, simply and straightforwardly: firstly, YOLO takes an input image resized to a fixed size; then it runs a single convolutional network as a unified network on the image; and ultimately it thresholds the resulting detections by the confidence score of the model. YOLO runs at 45 fps on a GPU, and the smaller Fast YOLO reaches 150 fps; this can process streaming video in real time. Although the design of the YOLO architecture affords end-to-end training and real-time detection, it still keeps high average precision.

The network divides the input image into an S×S grid, where S×S is equal to the width and height of the tensor presenting the final prediction. If the center of an object falls in a grid cell, that grid cell takes responsibility for detecting the object. Moreover, each grid cell simultaneously predicts bounding boxes and confidence scores, which express how confident the model is that a bounding box contains an object as well as how accurately the bounding box is predicted.
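For instance, with the PASCAL VOC setting commonly used for YOLOv1 (S = 7, B = 2, C = 20, which we assume here), the prediction tensor has the following size:

# YOLOv1 prediction tensor: each cell predicts B boxes (5 numbers each)
# plus one shared set of C class probabilities.
S, B, C = 7, 2, 20
per_box = 5                                  # x, y, w, h, confidence
print("output tensor:", (S, S, B * per_box + C))   # (7, 7, 30)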

The drawback of YOLO is that it lags behind the state-of-the-art detection systems in accuracy, but it is better than them in running time. It makes less than half the number of background errors compared to Fast R-CNN. YOLO is highly generalizable, so it can quickly identify objects in an image, but it usually struggles to precisely localize some objects, especially small ones. Therefore, the author introduced YOLOv2 to improve performance and fix the drawbacks of YOLO.

YOLOv2 [5] has a number of improvements over YOLOv1. Like the original, YOLOv2 runs on different fixed sizes of input images, but it introduced several new training methods for object detection and classification, such as batch normalization, multiscale training with higher resolutions of input images, predicting final detections on a higher-resolution spatial output, and using good default bounding boxes instead of fully connected layers.

However, this offers a trade-off between speed and accuracy. The details of the mAP improvements on PASCAL VOC 2007 are shown in Figure 2.

These novel improvements allow YOLOv2 to train on multiclass datasets like COCO or ImageNet. In addition, it was attempted to train the detector to detect over 9000 different object classes. YOLOv2 uses a network architecture customized from the original network. YOLOv2 mainly concentrates on improving recall and localization while still achieving high classification accuracy in comparison with state-of-the-art detectors; the original YOLO makes significantly more localization errors but is far less likely to predict false detections where nothing exists. Although YOLOv2 has accuracy improvements, it does not work well on small objects, because the input downsampling results in a low-dimensional feature map for the final prediction. To solve these problems, the author recently introduced YOLOv3, with significant improvements on object detection, especially on small object detection. Generally, a variety of the latest networks tend toward greater depth and yield good performance on their tasks with deep features learned from numerous layers.

YOLOv3 [6] is one of these approaches: instead of using Darknet-19 like the two old versions [4, 5], YOLOv3 develops a deeper network with 53 layers, called Darknet-53, and combines the network with state-of-the-art techniques such as residual blocks, skip connections, and upsampling. The residual blocks and skip connections are very popular in ResNet and related approaches, and upsampling has recently also improved the recall, precision, and IOU metrics for object detection [25]. For the detection task, 53 more layers are stacked onto it, giving a 106-layer fully convolutional underlying architecture for YOLOv3. This is the reason behind the slowness of YOLOv3 compared to YOLOv2.

Second, YOLOv3 enables the detector to predict objects at three different outputs with three different scales, rather than just one prediction at the last layer of the network, similar to its competitor SSD [26], which has improved performance a lot on low-resolution images. This is useful for picking up diverse outcomes to improve detection performance. The final output is created by applying a 1×1 kernel on a feature map. Particularly, the detection is done by applying 1×1 detection kernels on feature maps of three different sizes at three different places in the network, partly similar to feature pyramid networks (FPNs) [27].

Third, YOLOv3 still uses K-means to generate anchor boxes, but instead of applying 5 anchor boxes at the last detection, YOLOv3 generates 9 anchor boxes and separates them across 3 locations. Each location applies 3 anchor boxes; hence, there are more bounding boxes per image. For example, for an image of 416×416, YOLOv2 predicts 13×13×5 = 845 boxes; in YOLOv3, the number of boxes is 10647, implying that YOLOv3 predicts about 10 times the number of boxes of YOLOv2.
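The box-count arithmetic can be verified directly:

# YOLOv2: one 13x13 output with 5 anchors; YOLOv3: three scales, 3 anchors each.
yolov2 = 13 * 13 * 5
yolov3 = sum(s * s * 3 for s in (13, 26, 52))  # strides 32, 16, 8 on a 416 input
print(yolov2, yolov3, yolov3 / yolov2)         # 845 10647 12.6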

Fourth, YOLOv3 also changes the way the cost function is calculated. If an anchor overlaps a ground truth more than all other bounding boxes, the corresponding objectness score should be 1. Other anchor boxes with overlap greater than a predefined threshold (0.5) incur no cost. Each ground truth is associated with only one boundary box. If a bounding box is not assigned, it incurs no classification


and localization loss, just confidence loss on objectness. The loss function in the previous YOLO looks like

$$
\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(x_i - \hat{x}_i\right)^2 + \left(y_i - \hat{y}_i\right)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2.
\end{aligned}
\tag{1}
$$

Currently, instead of using mean square error in calculating the classification loss of the last three terms, YOLOv3 uses binary cross-entropy loss for each label. In other words, YOLOv3 makes its prediction of an objectness score and class prediction for each bounding box using logistic regression.

There is no more softmax function for class prediction. The reason is that most currently used classifiers assume that predicted labels are independent and mutually exclusive, implying that if an object belongs to one class, it cannot belong to another; this is only true if the output prediction really is mutually exclusive. However, a dataset may have multilabel classes whose labels are not mutually exclusive, such as pedestrian and person. In that case, the sum of the probability scores may be greater than 1 if the classifier is a softmax, so YOLOv3 changes the classifier for class prediction from the softmax function to independent logistic classifiers to calculate the likelihood of the input belonging to a specific label.
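A small illustration of the difference, assuming PyTorch:

# Softmax forces class scores to sum to 1 (mutually exclusive), while
# independent sigmoids allow "person" and "pedestrian" to both score high.
import torch

logits = torch.tensor([2.0, 1.8, -1.0])   # e.g., person, pedestrian, car
print(torch.softmax(logits, dim=0))       # exclusive scores, sum to 1
print(torch.sigmoid(logits))              # independent per-class scores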

3.6. Single Shot MultiBox Detector. Single Shot MultiBox Detector (SSD) [26] is a single-shot detector using a single one-stage deep neural network designed for real-time object detection. By comparison, the state-of-the-art two-stage method, Faster RCNN, uses its proposal network rather than an external method to generate object proposals and uses them to classify objects in order to move toward real-time detection, but the whole process runs at 7 FPS. SSD speeds up the running

Figure 4: The visualization of detectors with the strongest backbones on subsets of PASCAL VOC_MRA_058, VOC_MRA_10, VOC_MRA_20, and VOC_WH_20, respectively: (a) YOLO Darknet-53; (b) Faster RCNN ResNeXT-101-64×4d-FPN; (c) RetinaNet ResNeXT-101-64×4d-FPN; (d) Fast RCNN ResNeXT-101-64×4d-FPN.


time compared to the previous detectors by eliminating the need for the proposal network. This causes a small drop in mAP, which SSD compensates for by applying some improvements, including multiscale features and default boxes. These improvements allow SSD to match the accuracy of Faster RCNN using lower-resolution images, which further speeds up the processing of SSD. For a 300×300 input image, the best version of SSD gets 77.2% mAP at 46 FPS, better than Faster R-CNN's 73.2% and a little below the best version of YOLOv2 (544×544 input image, 78.6% mAP at 40 FPS), on VOC 2007 on an Nvidia Titan X.

Similarly, SSD consists of 2 parts, namely, extraction of feature maps and use of convolution filters to detect objects. SSD uses VGG16 as a base network to extract feature maps. Then it adds 6 convolutional layers to make predictions. Each prediction contains a bounding box and N + 1 scores for each class, where N is the number of classes and one extra class is for no object. Instead of using a region proposal network to generate boxes and feed them to a classifier for computing object locations and class scores, SSD simply uses small convolution filters: after the VGG16 base network extracts features from the feature maps, SSD applies 3×3 convolution filters at each cell to predict objects. Each filter gives an output including N + 1 scores for each class and 4 attributes for one boundary box.
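A minimal sketch of one such prediction layer in PyTorch; the channel count and k = 4 default boxes per cell are illustrative assumptions:

# One SSD prediction layer: for each cell and each of k default boxes,
# emit N+1 class scores and 4 box offsets via a single 3x3 convolution.
import torch
import torch.nn as nn

num_classes, k = 20 + 1, 4                   # 20 classes + background, 4 boxes
head = nn.Conv2d(512, k * (num_classes + 4), kernel_size=3, padding=1)
feature_map = torch.randn(1, 512, 38, 38)    # e.g., a conv4_3-sized map for SSD300
out = head(feature_map)
print(out.shape)                             # (1, 100, 38, 38): 4 * (21 + 4) = 100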

SSD differs from previous approaches of the same time in that it makes predictions on multiscale feature maps independently for detection, rather than on just one last layer. The CNN network spatially reduces the dimension of the image gradually, leading to a decrease in the resolution of the feature maps. As mentioned, SSD uses a lower-resolution input to detect objects; hence, early layers are used to detect small objects and lower-resolution layers to detect progressively larger objects. Besides, SSD applies different scales of default boxes to different layers, visualized intuitively in Figure 3: the blue default box on the 8×8 feature map fits the ground truth of the cat, while the red one on the 4×4 feature map matches the ground truth of the dog.

Although SSD achieves significant improvements in object detection by integrating the above parts, it is not good at detecting small objects, which can be improved by adding deconvolution layers with skip connections to introduce additional large-scale context [28]. Generally, SSD outperforms Faster RCNN, a state-of-the-art approach, in accuracy on PASCAL VOC and COCO while running at real-time speed.

3.7. CNN Drawbacks. Most CNN models are currently designed as a hierarchy of various layers, such as convolutional and pooling layers arranged in a certain order, not only in small networks but also in multilayer and state-of-the-art networks. Along with these layers, fully connected layers, known as FC layers, are added behind. The block consisting of the FC layers and the previous layers is designated the feature extractor, and it outputs key features of the objects of interest as input for the classifiers coming behind. However, going deeply through many kinds of layers is not good for small object detection, because in this task the objects of interest have small sizes and appearance. Unlike normal or big objects, which are less affected by resizing the image or passing through lots of different layers, small objects are very vulnerable to changes in image size. When an image passes a convolutional layer, the size of the image is decreased by the receptive fields that slide over the image to extract useful features. This does not affect small objects if there are just a few layers, but a CNN network has many layers like this, and that is very hard on small objects. Still, if small objects only had to go through convolutional layers, it would not be worth mentioning; small objects, which have only a little informative presence, also have to pass pooling layers, which help avoid overfitting and reduce computational costs by decreasing the number of parameters. To do this, these layers use fixed sliding windows with a target identified beforehand, such as maximum or average calculations over values. For these reasons, GAN is an approach that may replace the CNN approach because of its advantages: we can take advantage of the way this approach generates data to overcome the limited data of small objects in the training phase. Although images still have to pass layers such as convolutional and pooling layers, in this context the network has fewer layers than others. Bai et al. [29] have proposed applying MTGAN to detect small objects by taking cropped inputs from a processing step made by baseline detectors such as Faster RCNN [15] or Mask RCNN [9].

Because of the mentioned reasons, and following the survey [30], in which Liu et al. present numerous works of survey and evaluation, there are no works among them addressing small objects. Therefore, in this work, we assess popular and state-of-the-art models to find out their pros and cons. Particularly, we evaluate 4 deep models, namely, YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, with several base networks, for small object detection at different scales of objects. Among these models, YOLOv3 and RetinaNet belong to the one-stage approach; Fast RCNN and Faster RCNN are in the two-stage approach. We choose these models because YOLOv3 combines state-of-the-art techniques and RetinaNet introduces a new loss function which penalizes the class imbalance of a dataset. Besides, we choose RetinaNet to make comparisons between models within the same approach. Similarly, Fast RCNN and Faster RCNN are in the same approach and have nearly the same detection pipeline. The difference is that Fast RCNN uses an external proposal method to generate object proposals from input images, whereas Faster RCNN proposes its own network to generate object proposals on feature maps, which lets Faster RCNN train end-to-end easily and work better.

4. Experimental Evaluation

In this section, we present the information about our experimental settings and the datasets which we use for evaluation.


4.1. Experimental Setting. We train and evaluate various object detectors on the two datasets, PASCAL VOC [11] and a newly generated dataset [16]. The approaches evaluated this time consist of Faster RCNN [15], YOLOv3 [6], and RetinaNet [7] with different backbones. Except for YOLOv3, the others are trained and evaluated with the Detectron Python code.

Currently, the original datasets commonly used in object detection are PASCAL VOC [11] and COCO [12]. Both datasets consist mostly of large objects or objects whose size fills a big part of the image; these two datasets are not suitable for small object detection. In addition, there is another dataset, which is large-scale and includes many classes for small object detection, collected by drones and named the VisDrone dataset [31]. However, it does not publish the labels of its test set for evaluation, and the views of its images are top-down, which is not our case. As a result, in order to evaluate the detection performance of the models, we use a dataset published in [13]. This dataset, called the small object dataset, is a combination of the COCO [12] and SUN [24] datasets. There are 10 classes in the small object dataset, including mouse, telephone, switch, outlet, clock, toilet paper (t paper), tissue box (t box), faucet, plate, and jar. The whole dataset consists of 4925 images in total; there are 3296 images for training and 1629 images for testing. The mouse class owns the largest number of instances, 2137 instances in 1739 images, and the tissue box class has the smallest number, 103 instances in 100 images. Apart from the small object dataset, we also filter subsets from PASCAL VOC 2007 following standard definitions. PASCAL VOC has 20 classes, but under strict definitions of small objects there are fewer classes. Table 1 lists the number of small objects and the images containing them for each subset of the dataset.

We trained all models on the small object dataset with the same parameters. Particularly, in the training phase, we trained the models for 70k iterations with the parameters, including momentum, decay, gamma, learning rate, batch size, step size, and training days, given in Table 2. At first, we attempted to start the models with a higher learning rate, 10^-2, but the models diverged, the loss value becoming NaN or Inf after the first 100 iterations. Then we tried a lower learning rate, 10^-3, for the first 100 iterations, rising to 10^-2, to see whether the models could converge when starting at a lower learning rate; however, nothing changed. We also saw that the models converged quickly during the first 10k iterations with 10^-3 and then progressively slowed down after 20k. Therefore, we decided to start the training with a learning rate of 10^-3 and decrease it to 10^-4 and 10^-5 at 25k and 35k iterations, respectively. This setting shows that the loss value was stable from 40k, but we set the training up to 70k to observe how the loss value changes, and we saw that it did not change much after 40k iterations. We evaluated the models from 30k to 70k, and generally their performance was not stable after 40k iterations. For this reason, we picked the weights for evaluation at 30k and 40k iterations: YOLO achieves its best results at 30k iterations, and the others at 40k iterations. In the case of the subsets of PASCAL VOC 2007, we combine the train and validation sets from PASCAL VOC 2007 and 2012 to form a training set; PASCAL VOC 2012 works as a data augmentation set for PASCAL VOC 2007. We use this combined training set to train all models and test them on the subsets. All models are trained with the same parameters. First, due to memory limitations, we rescale all images to the same size, with the shortest side 600 and the longest side 1000, as in [15].
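As a minimal sketch, the step schedule described above can be expressed as follows (the function name is ours):

# Step learning-rate schedule: 1e-3, then 10x decays at 25k and 35k iterations.
def learning_rate(iteration):
    if iteration < 25_000:
        return 1e-3
    if iteration < 35_000:
        return 1e-4
    return 1e-5

for it in (0, 24_999, 25_000, 40_000):
    print(it, learning_rate(it))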

In YOLOv3, we run the K-means clustering algorithm to initialize 9 suitable default bounding boxes for the training and testing phases on our selected datasets, and we changed the anchors value accordingly. The following are the 9 anchors for the small object dataset after running the K-means algorithm: [10.3459, 14.4216], [26.2937, 19.0947], [21.4024, 36.3180], [47.9317, 29.1237], [40.4932, 63.7489], [83.6447, 51.3203], [72.2167, 119.9181], [172.7416, 117.0773], and [124.6597, 252.8465].
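A sketch of this clustering, assuming the 1 - IoU distance over (width, height) pairs that is commonly used for YOLO anchors; the exact distance and initialization in the YOLO tooling may differ:

# K-means over (width, height) pairs with a 1 - IoU distance.
import numpy as np

def iou_wh(boxes, anchors):
    # boxes: (N, 2) float array; anchors: (k, 2) float array
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)  # max IoU = min distance
        for j in range(k):
            if np.any(assign == j):                         # skip empty clusters
                anchors[j] = boxes[assign == j].mean(axis=0)
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]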

In Faster R-CNN, to compare fairly with the prior work and deploy on different backbones, we directly reuse the anchor scales and aspect ratios from the paper [13], namely, anchor scales of 16×16, 40×40, and 100×100 pixels and aspect ratios of 0.5, 1, and 2, instead of having to cluster a set of default bounding boxes as in YOLOv3. Similarly, in RetinaNet, we keep the default training settings, such as loss gamma 2.0, loss alpha 0.25, anchor scale 4, and 3 scales per octave, following the authors, since this configuration has the optimized values.

4.2. Our Newly Generated Dataset. This time, to have an objective comparison, we also use our newly generated dataset, whose information is shown in Table 1. We use it to consider the effects of object sizes on factors including models, processing time, accuracy, and resource consumption. The dataset consists of 4 subsets filtered from PASCAL VOC 2007, namely,

Figure 2: mAP of YOLOv2 on VOC 2007 as each part is added (batch norm, hi-res classifier, convolutional + anchor boxes, new network, dimension priors + location prediction, passthrough, multiscale, hi-res detector): 63.4, 65.8, 69.5, 69.2, 69.6, 74.4, 75.4, 76.8, 78.6 [5].


VOC_WH_20, VOC_MRA_058, VOC_MRA_10, and VOC_MRA_20; detailed information is provided as follows:

(i) VOC_WH_20 contains objects whose width and height are less than 20% of an image's width and height. This subset has two classes fewer than PASCAL VOC 2007, namely, dining table and sofa, because of the constraint of the definition.

(ii) VOC_MRA_058, VOC_MRA_10, and VOC_MRA_20 comprise objects occupying a maximum mean relative area of the original image under 0.58%, 1.0%, and 2.0%, respectively. Two of them have the same number of classes as PASCAL VOC 2007; the exception, VOC_MRA_058, has four classes fewer, namely, dining table, dog, sofa, and train.

5. Results and Analyses

In this section, we show the results achieved through the experimental phase. All models mentioned in this section, except for models cited from other papers, are trained in the same environment with 1 GPU: Ubuntu 16.04.4 LTS, Intel(R) Xeon(R) Gold 6152 CPU @ 2.10 GHz, and a Tesla P100 GPU. In addition to the comparative accuracy, other comparisons are also provided to make our assessment results objective and clear.

5.1. Accuracy

5.1.1. Small Object Dataset. Following the detection results in Table 3, the methods belonging to the two-stage approach outperform the one-stage ones by about 8–10%. Specifically, Faster RCNN with the ResNeXT-101-64×4d-FPN backbone achieved the top mAP among the two-stage approaches, and the top of the table as well, 41.2%. In comparison, the top one-stage method, YOLOv3 608×608 with Darknet-53, obtained 33.1%. Following [32], methods based on region proposals, such as Faster RCNN, are better than methods based on regression or classification, such as YOLO and SSD. This proves true once again in the context of the small object dataset.

Consider the methods in each approach. First, the two-stage approaches: Faster RCNN, an improvement of Fast RCNN, is only about 1–2% greater than Fast RCNN, and only for the ResNeXT backbones; it is equal to Fast RCNN for the rest. The difference here is not too much, meaning that an external region proposal method like selective search combined with RoI pooling performs as well as an internal region proposal method like the RPN with RoI align in this case. Besides, compared to R-CNN, we perceive a boost of 8–10% when RoI pooling or RoI align is added: R-CNN, which uses region proposals from selective search, feeds them into the network, and directly computes features from fc (fully connected) layers, only reaches 23.5% with AlexNet and 24.8% with VGG16 combined with proposals from the RPN. However, Fast RCNN and Faster RCNN, with the two kinds of RoI layers, are much better: Fast RCNN achieves accuracy in a range of 31.7% to 39.6% depending on the backbone, and similarly Faster RCNN gets 30.1% to 41.2%. Second, in the one-stage approaches, YOLO outperforms SSD and RetinaNet: YOLO gets the highest outcome, 33.1%, while SSD and RetinaNet get 11.32% and 30%, respectively. YOLO and SSD are considered state-of-the-art methods in speed at the cost of accuracy. However, there is a large difference in accuracy between YOLO and SSD; the difference is that SSD adds multiple convolutional layers behind the backbone, each with its own ability, instead of using 2 fully connected layers like YOLO. Although RetinaNet is assigned to the one-stage approach, it cannot run in real time. RetinaNet is proposed to deal with the imbalance between foreground and background using the focal loss; therefore, it obtains a higher accuracy in comparison with the others, except for YOLOv3 (Darknet-53).

When it comes to the backbones, we realize that Darknet-53 is the best among the one-stage and real-time methods, and even far higher than ResNet-50, although it has roughly the same number of layers as ResNet-50. In contrast, ResNeXT combined with FPN is the most powerful one in both one-stage and two-

Table 1: The information of the subsets.

Subsets        Classes  Images  Instances
VOC_MRA_058    16       329     529
VOC_MRA_10     20       2231    5893
VOC_MRA_20     20       2970    7867
VOC_WH_20      18       1070    2313

Table 2: The parameters of the models.

Method        Momentum  Decay   Gamma  Learning_rate  Batch_size  Training_days  Stepsize
YOLOv2 [16]   0.9       0.0005  -      0.001          8           5              25000
YOLOv3        0.9       0.0005  -      0.001          32          3-4            25000
SSD300 [16]   0.9       0.0005  0.1    0.000004       12          9              40000, 80000
SSD512 [16]   0.9       0.0005  0.1    0.000004       12          12             100000, 120000
RetinaNet     0.9       0.0005  0.1    0.001          64          4-12 h         25000, 35000
Fast RCNN     0.9       0.0005  0.1    0.001          64          4-12 h         25000, 35000
Faster RCNN   0.9       0.0005  0.1    0.001          64          4-12 h         25000, 35000


Overall, there is an increase of about 1–3% when changing from a simple backbone to a complex one of the same type. For example, when switching from the original ResNet to ResNet-FPN, the accuracy is boosted by 2–3%. This makes clear that leveraging the multiscale features of FPN is a common way to improve detection and to tackle the scale imbalance of input images and of the bounding boxes of different objects. Similarly, when we switch from ResNeXT-101-32×8d-FPN to ResNeXT-101-64×4d-FPN, the accuracy changes from 40.5% to 41.2% for Faster RCNN and from 38.7% to 39.6% for Fast RCNN. However, when comparing ResNet-50-FPN and ResNet-101-FPN, growth only happens for Fast RCNN, from 33.3% to 35.5%; there is a slight decrease of 0.1% for Faster RCNN. This reduction also happens with RetinaNet: the simpler backbone, ResNeXT-101-32×8d-FPN, gets 30%, while ResNeXT-101-64×4d-FPN just gets 25.1%. This means that very deep backbones do not guarantee an increase in accuracy, and the reason is that a deeper network has more parameters to learn. The authors must have a large amount of data to feed into the network to train and update these parameters; in this case, the data of the small object dataset are not abundant enough to fit a very deep network, which increases the chances of overfitting. Besides, the features originally from the early layers of ResNet are not well generalized, because when they are combined with FPN the accuracy improves by about 2–3%. When YOLO switches from Darknet-19 to Darknet-53, the accuracy is really boosted: the highest accuracy of Darknet-19, achieved at the resolution of 800×800, is just 24.02%, whereas YOLO 608×608 with Darknet-53 gets 33.1%. The explanation is that YOLOv3 with Darknet-53 has several improvements over Darknet-19: YOLOv3 predicts objects at 3 scale locations, one of them specialized in small objects, instead of only one like Darknet-19, and it also integrates cutting-edge techniques such as residual blocks and shortcut connections. The reduction in accuracy happens again with YOLO when switching from ResNet-101 to ResNet-152, by about 1–2%. Among these methods, YOLO and SSD are the only ones which allow multiple input sizes. The higher the resolution of the input images, the higher the accuracy the method receives; the reason is that a higher resolution image provides more pixels to describe the visual information of small objects. However, if the resolution is far from the original size of the images, accuracy decreases. For example, YOLO 1024×1024 with Darknet-19 gets a lower accuracy than the 800×800 resolution. In addition, we tried to increase the resolution of Darknet-53 from 608 to 1024, and the mAP decreases when the resolution is over 608×608. Therefore, the effect of image size is clear for models like SSD and YOLO. Generally, all comparative mAP results on this dataset are dominated by the classes that are very great in number, and this is caused by the data imbalance between the number of images and the instances in these images. For example, according to the statistics in [13], mouse is a major class significantly contributing to the mAP in Table 3, with the highest number of instances and images as well. However, tissue box has the least contribution, with the lowest AP, originally affected by the amount of data.

Furthermore, the imbalanced data lead models to tend to detect frequent objects, implying that models will misidentify objects having a nearly similar appearance to the dominant class as objects of interest, rather than detecting less frequent objects. As a result, false positives increase because of these problems. Figure 5 illustrates the detections with the strongest backbones. Following this visualization, the domination of classes such as mouse or faucet results in misdetection in areas which have an appearance similar to them. This misunderstanding tends to be stronger for the weaker backbones in the comparison, and a one-stage method like YOLO, which primarily aims at speed, has more misdetections than two-stage methods. A reason for these problems is the difference in the way deep networks are trained [33]. One-stage methods such as YOLO use a soft sampling method, which uses the whole dataset to update parameters, rather than choosing only some samples from the training data. In contrast, two-stage methods such as the RCNN family tend to employ hard sampling methods, which randomly sample a certain number of positive and negative bounding boxes to train the network, as the sketch after this paragraph illustrates.
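To make the distinction concrete, below is a small sketch (our own simplification, not the exact Detectron logic) of such a hard-sampling step: a fixed-size minibatch of RoIs is drawn with a capped fraction of positives.

import numpy as np

def sample_rois(labels, batch_size=256, pos_fraction=0.25, rng=np.random):
    # labels: 1 = positive RoI, 0 = negative RoI, for all candidate boxes
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(int(batch_size * pos_fraction), len(pos))
    n_neg = min(batch_size - n_pos, len(neg))
    keep_pos = rng.choice(pos, n_pos, replace=False)
    keep_neg = rng.choice(neg, n_neg, replace=False)
    return np.concatenate([keep_pos, keep_neg])   # indices used for this update

labels = (np.random.rand(2000) < 0.05).astype(int)  # ~5% positives, as is typical
idx = sample_rois(labels)
print(len(idx), labels[idx].sum())                  # 256 RoIs, at most 64 positive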

5.1.2. Subsets of PASCAL. With 4 subsets of 4 different scales of objects in images, we want to find out how much the scales impact the models. The whole results are shown in Table 4. We separate the results into 2 groups, the one-stage and two-stage approaches, and Figure 2 is a visualization of the strongest backbones of each method on the subsets.

In the case of different scales, as in our subsets, there is a difference between the one-stage and two-stage approaches. Here, methods from the one-stage approach perform better than two-stage ones at most scales. This is really the opposite of the small object dataset. Specifically, two-stage methods are totally better than one-stage ones for real-time inputs, and just a bit better than nonreal-time models in VOC_WH20, by about 10–20%, with the same result for smaller objects in VOC_MRA_0058 and VOC_MRA_010. However, for bigger objects in VOC_MRA_020, the one-stage methods have significantly better outcomes than the two-stage ones. In addition, only Faster RCNN has good performance in most cases compared with the one-stage methods; Fast RCNN is only good at big objects in VOC_MRA_020 and fails to detect smaller objects well.

In the one-stage approach, among methods which allow multiple input sizes like YOLO and SSD, there are 2 kinds, namely, ones that can run in real time and the others that cannot, when the resolution is over 640 or 512 for YOLO and SSD, respectively. For the real-time ones, YOLO outperforms SSD at all scales of objects. Specifically, YOLOv2 with Darknet-19 is better than SSD by about 2–6% for objects in VOC_MRA_0058 and VOC_MRA_010 and by 4–15% for larger objects in VOC_MRA_020 and VOC_WH_20. YOLOv3 with Darknet-53 gets higher results, by about 3–5%, in comparison with YOLOv2; hence, YOLOv3 also gets higher results than SSD. However, if we consider nonreal-time input images, SSD is greater than YOLO for objects in VOC_MRA_010. Meanwhile, RetinaNet, the one that cannot run in real time in the one-stage approach, performs on par with the nonreal-time configurations of YOLO


and better than SSD. RetinaNet is more stable than SSD and YOLO when the scales are changed; the bigger the objects are, the more stable it is. For example, the change is large, about 33%, when the scale increases from objects in VOC_MRA_0058 to ones in VOC_MRA_010 and VOC_MRA_020. However, this change is not much, about 10%, for bigger objects, in comparison with YOLO (15–25%). In the case of YOLO, this remarkable increase in accuracy when objects are larger is obviously good for a model. The change in SSD resembles the change in RetinaNet.

Concerning the resolutions in YOLO and SSD, we see that increasing the image resolution generally pushes the accuracy up. YOLOv2 with Darknet-19, YOLOv3 with Darknet-53, and SSD all show an increase in accuracy when the resolution is large, except for YOLOv2

with objects belonging to VOC_MRA_010 and VOC_MRA_020 when the image is over 800. In addition, YOLOv2 fluctuates for objects in VOC_WH20. As mentioned in our previous work, YOLO is better than SSD for objects occupying less than 10% of the image; however, in this case, YOLOv3 is good at all scales of objects. This is because YOLOv3 has 3 detection locations coming with more ratios of default boxes, which leads to a significant outcome when combining results from the 3 locations.

When we switch to the two-stage approaches, Faster RCNN has a significant improvement at most scales over Fast RCNN, except for objects in VOC_MRA_020, which have the same accuracy. This shows that when objects are completely separated into different scales, RoI pooling does not work well with smaller objects and with the ones in VOC_WH20.

Table 3: Comparative results on small object dataset.

Method             Backbone                  Clock  Faucet  Jar    Mouse  Outlet  Plate  Switch  Tel    T.box  T.paper  mAP
YOLO 416 [16]      Darknet-19                22.8   30.8    4.0    52.0   20.4    13.1   13.0    6.1    0      35.3     19.39
YOLO 448 [16]      Darknet-19                23.0   36.9    9.0    52.5   18.4    13.6   17.5    4.2    0      34.3     20.13
YOLO 480 [16]      Darknet-19                34.2   37.3    9.1    53.3   21.4    13.6   15.8    9.1    9.1    34.2     23.71
YOLO 512 [16]      Darknet-19                23.1   36.6    6.1    59.8   24.6    14.2   15.7    9.1    4.5    32.4     22.61
YOLO 554 [16]      Darknet-19                23.4   37.2    9.1    60.1   27.2    13.4   19.9    9.1    4.5    34.5     23.84
YOLO 640 [16]      Darknet-19                20.2   36.2    3.2    59.8   27.8    11.7   18.1    8.2    4.5    35.6     22.53
YOLO 800 [16]      Darknet-19                27.6   36.0    2.3    60.2   32.8    13.1   23.3    9.1    9.1    26.7     24.02
YOLO 1024 [16]     Darknet-19                21.7   29.3    1.4    58.3   26.4    11.8   17.5    9.1    9.1    15.7     20.03
YOLO 320           Darknet-53                26.22  38.38   4.55   56.46  36.42   13.34  24.8    10.65  4.55   42.96    25.83
YOLO 416           Darknet-53                28.47  47.15   10.83  60.49  43.15   15.87  30.73   15.15  2.62   48.3     30.28
YOLO 608           Darknet-53                29.98  47.89   10.76  65.88  48.02   18.09  31.22   14.62  17.99  46.56    33.1
YOLO 320           ResNet-50                 19.57  25.73   0.67   45.17  14.37   9.38   13.84   9.09   9.09   23.7     17.06
YOLO 416           ResNet-50                 23.78  36.65   0.4    54.23  18.37   13.75  19.78   9.84   9.42   35.68    22.19
YOLO 608           ResNet-50                 26.92  40.65   1.77   61.86  29.18   15.04  20.24   10.09  13.29  36.01    25.5
YOLO 320           ResNet-101                20.52  27.9    0.57   44.68  16.98   13.05  13.66   9.66   9.09   24.36    18.05
YOLO 416           ResNet-101                25.72  35.6    3.03   55.73  22.4    15.61  17.26   9.32   3.03   38.71    22.64
YOLO 608           ResNet-101                28.79  44.59   9.42   62.18  33.34   15.53  23.88   13.24  15.83  39.17    28.6
YOLO 320           ResNet-152                21.64  27.56   3.03   48.06  17.39   11.12  14.51   9.09   4.55   31.88    18.88
YOLO 416           ResNet-152                25.7   36.54   0.89   53.81  20.6    14.13  20.21   11.49  0.29   33.06    21.67
YOLO 608           ResNet-152                26.01  44.54   4.55   61.0   31.76   13.02  22.67   12.35  9.93   39.99    26.58
SSD300 [16]        ResNet-101                5.5    9.1     0      25.5   6.1     4.5    0       4.5    9.1    18.2     8.25
SSD300 [16]        VGG16                     9.1    17.1    0      26.1   9.1     9.1    0       4.5    0      16.7     9.16
SSD512 [16]        VGG16                     9.1    17.1    0      43.0   9.1     9.1    9.1     9.1    0      7.6      11.32
RetinaNet          ResNet-50-FPN             30.7   49.3    2.0    65.5   21.3    16.1   8.5     12.9   1.0    25.7     23.3
RetinaNet          ResNet-101-FPN            30.6   48.7    7.1    64.7   20.0    15.9   11.8    10.7   2.9    38.7     25.1
RetinaNet          ResNeXT-101-32×8d-FPN     35.5   55.0    12.1   66.5   23.9    18.4   9.8     16.2   9.4    53.7     30.0
RetinaNet          ResNeXT-101-64×4d-FPN     31.4   50.2    8.9    66.3   20.8    15.3   9.4     14.0   2.2    32.4     25.1
R-CNN [13]         RPN prop + VGG16          31.9   31.3    4.2    56.8   31.1    9.3    14.2    16.4   23.4   29.4     24.8
R-CNN [13]         Alexnet 7× 300 pro        32.4   27.2    5.1    56.9   28.0    9.8    13.6    12.4   17.9   35.6     23.9
R-CNN [13]         VGG16 7× 300 pro          37.3   30.3    7.2    60.6   41.5    15.8   21.5    13.7   22.0   33.3     28.4
R-CNN [13]         ContextNet (Alexnet 7×)   32.7   26.8    4.6    56.4   26.3    9.9    12.9    12.2   18.7   34.0     23.5
Fast RCNN          ResNet-50-C4              32.4   46.3    6.5    65.8   38.3    20.1   25.3    16.6   14.1   52.0     31.7
Fast RCNN          ResNet-50-FPN             37.4   47.3    7.3    68.9   46.7    21.0   32.1    17.1   9.3    45.9     33.3
Fast RCNN          ResNet-101-FPN            39.3   50.3    10.6   68.3   47.1    20.4   33.3    18.6   15.4   51.4     35.5
Fast RCNN          ResNeXT-101-32×8d-FPN     47.5   54.8    10.3   71.8   54.0    21.4   34.4    21.7   17.7   53.5     38.7
Fast RCNN          ResNeXT-101-64×4d-FPN     45.4   55.7    10.9   72.5   53.3    24.0   36.9    22.9   16.0   58.1     39.6
Faster R-CNN [16]  VGG16                     23.76  37.65   8.03   54.0   16.16   11.88  15.12   9.1    6.25   37.29    21.92
Faster RCNN        ResNet-50-C4              32.2   44.6    6.6    65.9   35.2    17.5   25.7    19.6   13.7   40.0     30.1
Faster RCNN        ResNet-50-FPN             35.7   49.9    7.3    68.4   48.9    18.8   29.6    14.7   11.4   53.3     33.8
Faster RCNN        ResNet-101-FPN            39.8   49.2    4.9    68.2   47.0    18.5   29.7    14.0   12.9   52.2     33.7
Faster RCNN        ResNeXT-101-32×8d-FPN     49.8   56.6    11.4   72.1   56.3    23.2   37.0    20.8   18.8   58.7     40.5
Faster RCNN        ResNeXT-101-64×4d-FPN     49.6   58.6    12.2   72.5   54.5    23.2   36.9    20.8   20.1   63.1     41.2
The values in bold represent the best among one-stage methods, and the ones in italics represent the highest among two-stage methods.


In addition, if we compare with the one-stage methods, it is significantly lower than them. However, RoI align along with the RPN performs well when the scales are changed. When it comes to the backbones, there is a small decrease in accuracy when changing from ResNet-50-FPN to ResNet-101-FPN or from ResNeXT-101-32×8d-FPN to ResNeXT-101-64×4d-FPN, for objects of all scales, for both Faster RCNN and Fast RCNN. The VGG16 backbone has an impressive outcome compared with strong backbones such as ResNet or ResNeXT: although its accuracy is less than that of the two strong backbones, VGG16 is still better for objects in VOC_WH20 and shows little change in accuracy when switching to objects of big sizes.

5.2. Time Processing and Resource Consumption. Tables 5 and 6 show the performance comparison of the evaluated models with their base networks. Generally, we see that RAM consumption in testing and training increases as more layers are added. This means that the deeper the network is, the higher the processing demand, because depth leads to an increase in parameters and in the time needed to process data as well. YOLO is the model consuming the least memory in both the training and testing phases. Particularly, YOLO needs only about 4 GB to 5 GB for training and about 1.6 GB to 1.8 GB for testing with Darknet-53. YOLO is also the only one able to run in real time: it needs only about 0.03 s to 0.04 s to process an image, in comparison with more than 0.1 s and 0.2 s

for Faster RCNN and RetinaNet. This allows us to deploy such models on devices with modest memory. While RetinaNet is assigned to the one-stage approach, it is not fast enough for real-time detection. The inference time of Fast RCNN is a little lower than that of Faster RCNN and RetinaNet. In contrast, the RAM consumption in training and testing of RetinaNet is lower than that of Fast RCNN and Faster RCNN. Of all the architectures, ResNet-50-C4 is the one requiring the most memory and time to process data, because the output size of ResNet-50-C4 is a bit bigger than the others [9]. If we consider ResNet or ResNeXT combined with FPN, Faster RCNN is over 100 MB above Fast RCNN and 300 MB above RetinaNet. In addition, according to Table 2, Faster RCNN and RetinaNet need less time for training, only a few hours to 1 day, rather than the 3–4 days of YOLO. This demonstrates that if we pay attention to performance and do not have much time for training, we should choose Faster RCNN or RetinaNet instead of YOLO. In contrast, if we only focus on processing speed while still achieving good performance, one-stage methods are always the good choice. In the same context of backbones, RetinaNet uses fewer resources at testing time than Fast RCNN and Faster RCNN, by about 100 MB and 300 MB, respectively. However, in training, RetinaNet uses much more memory than Fast RCNN, by about 2.8 GB, and than Faster RCNN, by about 2.3 GB, for ResNeXT-101-32×8d-FPN and ResNeXT-101-64×4d-FPN.
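As a methodological note, per-image inference times of this kind can be measured with a loop of the following shape; model and images are placeholders for a detector callable and a list of inputs, not our actual evaluation harness.

import time

def mean_inference_time(model, images, warmup=10):
    # warm-up runs are excluded so one-time costs (e.g., CUDA context
    # creation) do not distort the per-image average
    for img in images[:warmup]:
        model(img)
    start = time.perf_counter()
    for img in images[warmup:]:
        model(img)
    return (time.perf_counter() - start) / max(1, len(images) - warmup)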

Figure 4: The location of the default boxes at different scales. (a) Image with GT boxes. (b) 8×8 feature map. (c) 4×4 feature map. Each default box predicts location offsets Δ(cx, cy, w, h) and confidences (c1, c2, …, cp).


Table 4: The comparative results on subsets of PASCAL VOC 2007.

Approach   Method                                  VOC_MRA_0058  VOC_MRA_010  VOC_MRA_020  VOC_WH20
One stage  YOLOv2 416 [16]                         3.02          31.38        42.89        18.52
One stage  YOLOv2 448 [16]                         4.47          32.9         60.15        21.96
One stage  YOLOv2 480 [16]                         4.26          33.48        60.78        26.67
One stage  YOLOv2 512 [16]                         5.42          35.74        61.12        24.63
One stage  YOLOv2 544 [16]                         6.97          36.56        63.0         26.62
One stage  YOLOv2 640 [16]                         7.7           37.97        61.29        23.41
One stage  YOLOv2 800 [16]                         10.24         37.3         61.91        26.9
One stage  YOLOv2 1024 [16]                        10.69         29.93        55.14        28.97
One stage  YOLOv3 320                              7.18          34.58        60.36        20.4
One stage  YOLOv3 416                              10.2          38.97        62.53        24.12
One stage  YOLOv3 608                              11.7          42.65        68.56        28.86
One stage  SSD 300 [16]                            1.71          32.76        46.26        16.91
One stage  SSD 512 [16]                            2.9           43.46        57.11        19.87
One stage  RetinaNet-ResNet-50-FPN                 8.84          41.5         50.2         28.14
One stage  RetinaNet-ResNet-101-FPN                8.95          42.5         51.9         27.46
One stage  RetinaNet-ResNeXT-101-32×8d-FPN         10.29         45.4         54.5         30.08
One stage  RetinaNet-ResNeXT-101-64×4d-FPN         10.71         45.5         55.1         31.32
Two stage  Fast RCNN-ResNet-50-C4                  0.23          13.2         49.9         3.93
Two stage  Fast RCNN-ResNet-50-FPN                 0.63          13.5         55.6         3.45
Two stage  Fast RCNN-ResNet-101-FPN                0.39          15.9         57.6         3.12
Two stage  Fast RCNN-ResNeXT-101-32×8d-FPN         0.51          14.4         57.9         3.33
Two stage  Fast RCNN-ResNeXT-101-64×4d-FPN         0.29          14.2         57.3         3.76
Two stage  Faster RCNN-ResNet-50-C4                6.98          39.9         48.7         26.04
Two stage  Faster RCNN-ResNet-50-FPN               10.74         45.6         56.3         29.79
Two stage  Faster RCNN-ResNet-101-FPN              10.63         46.9         57.6         30.57
Two stage  Faster RCNN-ResNeXT-101-32×8d-FPN       11.64         47.3         57.6         32.12
Two stage  Faster RCNN-ResNeXT-101-64×4d-FPN       10.54         47.1         56.9         31.64
Two stage  Faster RCNN-VGG16 [16]                  5.73          35.58        44.14        41.11
This table illustrates how well the models adapt to different scales of objects. The values in bold represent the best in one-stage methods, and the ones in italics represent the highest in two-stage methods.



If we consider this on the small object dataset, it does not matter too much, because RetinaNet is about 10% lower than Faster RCNN in performance. Otherwise, on the different scales of the subsets, RetinaNet works well when compared with Faster RCNN, and

the difference is just 2–4 percentage points. Although the ResNet backbones combined with the other techniques yield an improvement in accuracy, they do not work for YOLO on the small object dataset. YOLO with Darknet-53 utilizes more resources

Table 5: The comparison of consumption on small object dataset.

Model        Backbone                  Inference time (s)  Test RAM (MiB)  Train RAM (MiB)
YOLOv3       Darknet-53                0.0331              1825            4759
YOLOv3       ResNet-50                 0.027               1285            3479
YOLOv3       ResNet-101                0.0356              1829            5383
YOLOv3       ResNet-152                0.0454              2443            7531
RetinaNet    ResNet-50-FPN             0.102               2075            4435
RetinaNet    ResNet-101-FPN            0.127               2723            5577
RetinaNet    ResNeXT-101-32×8d-FPN     0.229               3767            7863
RetinaNet    ResNeXT-101-64×4d-FPN     0.292               3719            7813
Fast RCNN    ResNet-50-C4              0.3                 6449            5877
Fast RCNN    ResNet-50-FPN             0.089               2277            4455
Fast RCNN    ResNet-101-FPN            0.113               2947            5627
Fast RCNN    ResNeXT-101-32×8d-FPN     0.212               3987            4961
Fast RCNN    ResNeXT-101-64×4d-FPN     0.269               3885            4799
Faster RCNN  ResNet-50-C4              0.412               6609            6129
Faster RCNN  ResNet-50-FPN             0.101               2387            5381
Faster RCNN  ResNet-101-FPN            0.124               3001            6487
Faster RCNN  ResNeXT-101-32×8d-FPN     0.256               4027            5333
Faster RCNN  ResNeXT-101-64×4d-FPN     0.286               4003            5246

Table 6: The comparison of consumption on subsets filtered from PASCAL VOC.

Model        Backbone                  Inference time (s)  Test RAM (MiB)  Train RAM (MiB)
YOLOv3       Darknet-53                0.027               1645            4079
RetinaNet    ResNet-50-FPN             0.1                 1935            4133
RetinaNet    ResNet-101-FPN            0.116               2585            5435
RetinaNet    ResNeXT-101-32×8d-FPN     0.222               3641            7723
RetinaNet    ResNeXT-101-64×4d-FPN     0.284               3561            7599
Fast RCNN    ResNet-50-C4              0.495               6371            5677
Fast RCNN    ResNet-50-FPN             0.092               2131            4387
Fast RCNN    ResNet-101-FPN            0.114               2819            5463
Fast RCNN    ResNeXT-101-32×8d-FPN     0.213               3873            4637
Fast RCNN    ResNeXT-101-64×4d-FPN     0.265               3735            4575
Faster RCNN  ResNet-50-C4              0.26                6141            5991
Faster RCNN  ResNet-50-FPN             0.1                 2245            5207
Faster RCNN  ResNet-101-FPN            0.13                2855            6335
Faster RCNN  ResNeXT-101-32×8d-FPN     0.225               3943            5087
Faster RCNN  ResNeXT-101-64×4d-FPN     0.276               3885            4909


Figure 5: Highlight of bounding boxes from comparative backbones on the small object dataset. We select YOLO with Darknet-53 and ResNet-50 for an objective comparison because they obviously have the same number of layers in their networks, along with significant techniques such as skip connections and residual blocks. The bounding boxes show that ResNet-50 is more sensitive than Darknet-53 to areas which resemble the objects of interest. Similarly, ResNet-50-FPN and ResNet-50-C4 are chosen for consideration. The detections show that combining ResNet-50 with FPN yields better performance than the original one; particularly, misdetection happens with higher density for ResNet-50-C4 than for ResNet-50-FPN, such as in columns 4 and 5. Zoom in to see more detail.


than the ResNet ones, but it has the best accuracy among the models. Therefore, we only test YOLO with Darknet-53 on the subsets of PASCAL.

5.3. Analyses of the Trade-Off among Detectors. Network designs and approaches like the one-stage approach prove their performance when applied to detecting general objects, at both small scales and other scales. Although they are fast and accurate, there is still a drawback that always exists in these models, namely, the trade-off between accuracy and processing speed. For example, YOLOv3 proposes the idea of performing detection at three different scales, and the result is obviously impressive and yields good performance. However, to gain this advantage, YOLOv3 has to sacrifice processing time: instead of the input normally being processed once for detection, as in YOLOv2, this idea must work 3 times. This trade-off is also partly affected by the resolution, as we change it during training or testing. In our previous work, we mentioned that we have to choose the right resolution to ensure our models work properly. In the case of the two-stage approaches, the idea of generating region proposals to improve the localization of objects for detection is good as well. This is useful, but we have to take into account whether we should generate proposals on feature maps or directly on input images, because this strongly affects the way models run and identify representations of objects. If objects are normal or have a big or medium appearance, the models work well, but if objects come at multiple scales, this is a problem to consider and research deeply in order to balance, as well as improve, the performance. Therefore, to partly fix this problem, the one-stage approach allows us to choose a fixed input size for training and testing, but this support still depends on the characteristics of the datasets we evaluate and on the image size. After all, all the models we chose to evaluate are affected by the scales of objects: when we change the scale, the accuracy of the models changes a lot, except for Faster RCNN, the only model that seems to be stable with respect to scale, especially when combined with the VGG16 architecture. Although the accuracy of VGG16 is not better than the other architectures, the difference is that it does not change too much in accuracy. This only holds for big objects whose bounding box overlaps the image by more than 10%; if not, this is not assured.

Figure 1 shows that the possible locations of small objects are more numerous than for other objects. The black lens of the camera is somehow similar to the black mouse placed on a mouse pad. This ubiquity of small object presence causes more difficulties for detectors and leads to wrong detections: small objects can be anywhere in an image, with the result that detectors produce many wrong detections on familiar appearances they have seen. If we consider the visualization of the detections in Figure 5, the wrong detections are partly similar to the appearance of the other objects in the dataset. This problem is caused by the data imbalance between classes and between instances in each class, originally known as the foreground-foreground class imbalance. In

other words, the common problems, which happen not only with small objects but also for whole datasets, are intraclass similarity and interclass variation.

6. Conclusion

Small object detection is a challenging and interesting problem in the task of object detection, and it has drawn attention from researchers thanks to the development of deep learning, which is the motivation for improving the performance of computer vision tasks. Although deep detection models originally tend to solve problems related to general object detection, they still work to a certain degree for small object detection. As an evaluation work on small object detection for deep models, our goal is to highlight remarkable achievements of popular and state-of-the-art deep models in order to provide a variety of views on applying deep models to small object detection. Particularly, we evaluate state-of-the-art detectors based on deep learning from the two approaches, namely, YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, on two datasets, the small object dataset and subsets filtered from PASCAL VOC, with respect to the effects of different factors, objectively including accuracy, execution time, and resource usage.

In spite of the successful achievements of recent years, in which the performance of detection has improved significantly, there is still a huge gap in accuracy between normal objects and small objects. In the criteria of the COCO dataset, the difference from the small scale to the medium and big scales is very large. Most models are good at detecting normal objects, and problems arise when applying them to detect small objects. As a result, to reduce the gap in small object detection, the first thing to do is to invest in datasets which have massive amounts of data to train models and a wide range of categories, in order to compete with the human visual system [12, 34].

So far, detection models are divided into two main approaches, namely, the one-stage approach and the two-stage approach. Models in the one-stage approach are known as detectors with better and more efficient detection in comparison with the other approach. The efficiency here is the potential to run in real time and the ability to be applied to practical applications. However, the trade-off between accuracy and speed is a difficult challenge, which needs to be taken into account in order to close the gap. Models in the two-stage approach, on the other hand, have their reputation as region-based detectors which have high accuracy but are too slow to apply to the real world. This drawback comes from the computation of their networks.

Through our evaluation, it is a fact that the architectures utilized as base networks to extract deep features have significant effects on the frameworks. The deeper the architecture is, the higher the accuracy of detection. Once a network increases in depth, it has more layers than normal ones and massive numbers of parameters to train. Hence, it needs a lot of data to fine-tune these parameters reasonably. If there is an increase in computation, resource consumption will also increase. As a result, it


will be difficult to apply them in practical applications. Besides, the exploitation of context in these models is definitely limited; this causes much useful and informative data to be ignored in training, especially in the context of small objects. Because small objects are able to appear anywhere in an input image, if the context of the image is well exploited, the performance of small object detection will be improved.

For all the above reasons, and according to our evaluation, if we aim for good performance and ignore processing speed, two-stage methods like Faster RCNN perform well and demonstrate their network design on different datasets in many contexts of objects, including multiscale objects. Therefore, Faster RCNN is considered a major baseline to build on or develop from. If our target is a balance of accuracy and speed, YOLO is a good one, in case we do not care about the training time, because its compromise between speed and accuracy makes it worth applying to practical applications. Otherwise, Faster RCNN or RetinaNet is still a substitution to work with. When it comes to backbones, we have to consider the data in order to choose a reasonable backbone to combine with the methods, because the amount of data significantly impacts the model: if the data are not abundant, a shallow network will fit them well. Besides, there is recently a novel approach promising for training deep models with less data, that is, weakly supervised learning, such as zero-shot, one-shot, or few-shot learning; these approaches will be considered in our future works. Following our recent research, to achieve better performance on object detection, we have to consider several factors to improve the mAP, such as multiscale training, superresolution for scaling up the visual information of small objects [35], or preprocessing data to avoid data imbalance, because there is a wide range of imbalance problems relating to data [33].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the Vietnam National University Ho Chi Minh City (VNU-HCM) under grant no. B2017-26-01.

References

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, Columbus, OH, USA, June 2014.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proceedings of the European Conference on Computer Vision, pp. 346–361, Springer, Zurich, Switzerland, September 2014.
[3] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, Santiago, Chile, December 2015.
[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, Las Vegas, NV, USA, June 2016.
[5] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," 2016, https://arxiv.org/abs/1612.08242.
[6] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.
[7] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, "Focal loss for dense object detection," in Proceedings of the IEEE Transactions on Pattern Analysis and Machine Intelligence, Venice, Italy, October 2018.
[8] K. Zidek, A. Hosovsky, J. Pitel', and S. Bednar, "Recognition of assembly parts by convolutional neural networks," in Advances in Manufacturing Engineering and Materials, pp. 281–289, Springer, Cham, Switzerland, 2019.
[9] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, Venice, Italy, pp. 2980–2988, October 2017.
[10] L.-C. Chen, A. Hermans, G. Papandreou et al., "Instance segmentation by refining object detection with semantic and direction features," 2017, https://arxiv.org/abs/1712.04837.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[12] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft COCO: common objects in context," in Proceedings of the European Conference on Computer Vision, pp. 740–755, Springer, Zurich, Switzerland, September 2014.
[13] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, "R-CNN for small object detection," in Proceedings of the Asian Conference on Computer Vision, pp. 214–230, Springer, Taipei, Taiwan, November 2016.
[14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
[15] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," in Proceedings of the 28th International Conference on Neural Information Processing Systems, Ser. NIPS'15, pp. 91–99, MIT Press, Cambridge, MA, USA, 2015, http://dl.acm.org/citation.cfm?id=2969239.2969250.
[16] P. Pham, D. Nguyen, T. Do, T. D. Ngo, and D.-D. Le, "Evaluation of deep models for real-time small object detection," in Proceedings of the International Conference on Neural Information Processing, pp. 516–526, Springer, Guangzhou, China, November 2017.
[17] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.
[18] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu, "Traffic-sign detection and classification in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2110–2118, Las Vegas, NV, USA, June 2016.
[19] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: a large data set for nonparametric object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.
[20] A. Kembhavi, D. Harwood, and L. S. Davis, "Vehicle detection using partial least squares," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1250–1265, 2011.
[21] V. I. Morariu, E. Ahmed, V. Santhanam, D. Harwood, and L. S. Davis, "Composite discriminant factor analysis," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 564–571, IEEE, Steamboat Springs, CO, USA, March 2014.
[22] A. Andreas, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 2012.
[23] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: human trajectory prediction in crowded spaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971, Las Vegas, NV, USA, June 2016.
[24] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, "Sun database: exploring a large collection of scene categories," International Journal of Computer Vision, vol. 119, no. 1, pp. 3–22, 2016.
[25] E. Dong, Y. Zhu, Y. Ji, and S. Du, "An improved convolution neural network for object detection using YOLOv2," in Proceedings of the 2018 IEEE International Conference on Mechatronics and Automation (ICMA), pp. 1184–1188, IEEE, Changchun, China, August 2018.
[26] W. Liu, D. Anguelov, D. Erhan et al., "Single shot multibox detector," in Proceedings of the European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 21–37, October 2016.
[27] T.-Y. Lin, P. Dollar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, p. 4, Honolulu, HI, USA, July 2017.
[28] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: deconvolutional single shot detector," 2017, https://arxiv.org/abs/1701.06659.
[29] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision, pp. 210–226, Springer, Munich, Germany, September 2018.
[30] L. Liu, W. Ouyang, X. Wang et al., "Deep learning for generic object detection: a survey," 2018, https://arxiv.org/abs/1809.02165.
[31] P. Zhu, L. Wen, X. Bian, L. Haibin, and Q. Hu, "Vision meets drones: a challenge," 2018, https://arxiv.org/abs/1804.07437.
[32] Z.-Q. Zhao, P. Zheng, S.-t. Xu, and X. Wu, "Object detection with deep learning: a review," 2018, https://arxiv.org/abs/1807.05511.
[33] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas, "Imbalance problems in object detection: a review," 2019, https://arxiv.org/abs/1909.00169.
[34] O. Russakovsky, J. Deng, H. Su et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[35] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 206–221, Munich, Germany, September 2018.



3.1. R-CNN. The overview of the R-CNN architecture consists of four main phases, which are known as the new advances of this method. Firstly, the R-CNN network resizes an image to 227×227 and takes it as an input. Then, the selective search algorithm [17] is applied to the image and generates 2000 candidates of proposed bounding boxes, as warped regions used for the input of the CNN feature network. From each region, the network extracts and computes a 4096-dimensional feature vector. Finally, a class-specific linear SVM classifier behind the last layer classifies the regions, deciding whether there are any objects and what the objects are.
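The four phases can be summarized in the following structural sketch; selective_search, cnn_features, and svm_scores are simplified stand-ins for the real components, used only to show the data flow.

import numpy as np

def selective_search(image):
    # Stand-in for the external proposal algorithm [17]; the real method
    # groups superpixels hierarchically to produce ~2000 candidate boxes.
    h, w = image.shape[:2]
    x1 = np.random.randint(0, w - 1, 2000)
    y1 = np.random.randint(0, h - 1, 2000)
    x2 = np.minimum(x1 + np.random.randint(1, w, 2000), w - 1)
    y2 = np.minimum(y1 + np.random.randint(1, h, 2000), h - 1)
    return np.stack([x1, y1, x2, y2], axis=1)

def rcnn_detect(image, cnn_features, svm_scores):
    scores = []
    for x1, y1, x2, y2 in selective_search(image):   # 1. region proposals
        region = image[y1:y2 + 1, x1:x2 + 1]         # 2. crop (warp to 227x227 omitted)
        feat = cnn_features(region)                  # 3. 4096-d CNN feature vector
        scores.append(svm_scores(feat))              # 4. class-specific linear SVMs
    return np.array(scores)

# toy stand-ins so the sketch runs end to end:
scores = rcnn_detect(np.zeros((300, 400, 3)),
                     cnn_features=lambda r: np.zeros(4096),
                     svm_scores=lambda f: np.zeros(10))
print(scores.shape)  # (2000, 10): one score per class per proposal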

The major key to the success of R-CNN is that the features matter. In R-CNN, the low-level image features (e.g., HOG) are replaced with CNN features, which are arguably more discriminative representations. However, the evaluation of an image is extremely costly and wasteful, because R-CNN must apply the convolutional network 2000 times. Besides, resizing the input to a low 227×227 is a problem affecting small objects, which easily deform or even lose information as the resolution moves far from their original sizes. The region proposals overlap, thus leading to the computation of similar features many times, and every region proposal must be stored to disk before performing feature extraction. In addition, lots of overlapping bounding boxes will result in a drop in mAP when small objects are close to big objects, because there is a bias toward choosing the bounding boxes which contain big objects and ignoring the bounding boxes of small objects.

3.2. Spatial Pyramid Pooling (SPP). The primary ideas of SPP [2] are motivated by limitations of the CNN architecture: the original CNN requires input images of a fixed size (224×224 for AlexNet), so practical use of a raw picture often needs cropping (a fixed-size patch that truncates the original image) or warping (the RoI of an input image must be warped to a fixed-size patch). The fully connected layer needs a fixed-length input, while the convolutional layers can adapt to an arbitrary input size; thus, a bridge is needed as an intermediate layer between the convolutional layers and the fully connected layers, and that is the SPP layer. Particularly, SPP-net first finds 2000 candidates of region proposals like the R-CNN method and then extracts the feature map from the entire image. SPP maps each window of the features corresponding to a region proposal to a fixed-length representation, regardless of the input size. Finally, 2 fully connected layers are used, with classification by SVM.
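A minimal NumPy sketch of the pooling idea, with a 3-level pyramid chosen for illustration (our simplification of [2]): whatever the spatial size of the feature map, the concatenated output has a fixed length.

import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    # max-pool an (H, W, C) map into sum(n*n for n in levels) * C values
    h, w, c = feature_map.shape
    pooled = []
    for n in levels:
        ys = np.linspace(0, h, n + 1, dtype=int)   # n x n grid of bins
        xs = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                bin_ = feature_map[ys[i]:max(ys[i + 1], ys[i] + 1),
                                   xs[j]:max(xs[j + 1], xs[j] + 1)]
                pooled.append(bin_.max(axis=(0, 1)))
    return np.concatenate(pooled)   # length is independent of H and W

print(spp(np.random.rand(13, 17, 256)).shape)  # (5376,) = 21 bins x 256 channels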

In short, on the detection task, SPP-net is better and up to 100× faster than R-CNN, but training is still very slow because of the multistage training steps (fine-tuning of the last layers, SVMs, and regressors), and it takes a lot of disk space to save the feature vectors.

3.3. Fast R-CNN. Fast R-CNN [3] is an advanced method that introduces various innovations to improve training and testing time and to efficiently classify object proposals, while still increasing the accuracy rate by using deep convolutional networks. The architecture of Fast

R-CNN is trained end-to-end with a multitask loss. Specifically, the convolutional network takes an image of any size as an input, together with several RoIs. Instead of applying the RoIs to the input and warping them to feed into the network at the first step, like R-CNN, Fast R-CNN applies these RoIs to a feature map produced by the several convolutional layers of the base network. From each RoI, a pooling layer extracts a fixed-size feature vector, which fully connected layers then map to a final feature vector. The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets.
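The multitask loss can be sketched as follows, assuming the standard form from [3], with smooth L1 for the box-regression branch; the numbers in the usage lines are arbitrary.

import numpy as np

def smooth_l1(x):
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def multitask_loss(cls_probs, true_class, box_pred, box_target, lam=1.0):
    # L = -log p_u + lam * [u >= 1] * sum smooth_l1(t_u - v);
    # background RoIs (true_class == 0) incur no localization loss
    l_cls = -np.log(cls_probs[true_class])
    l_loc = smooth_l1(box_pred - box_target).sum() if true_class >= 1 else 0.0
    return l_cls + lam * l_loc

probs = np.array([0.1, 0.7, 0.2])   # softmax output over classes (0 = background)
print(multitask_loss(probs, 1, np.array([0.1, 0.2, 0.0, -0.1]), np.zeros(4)))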

The most important feature of RoI pooling is sharing computation and memory in the forward and backward passes over the same image. The huge contribution of Fast R-CNN is that it proposes a new training method that fixes the drawbacks of R-CNN and SPP-net while improving their running time and accuracy rate. The advantages are that the mean average precision of detection is higher than R-CNN and SPP-net, the training phase is a single stage using a multitask loss and can update all network layers, and no disk storage is required for feature caching.

3.4. Faster R-CNN. Faster R-CNN [15] is an innovative approach improved from Fast R-CNN. Unlike its two predecessors, instead of generating bounding boxes by external algorithms [17], as in [1, 3], Faster R-CNN runs its own method, called the region proposal network (RPN), which is trained end-to-end to generate highly qualified region proposals. After gaining deep features from the early convolutional layers, the RPN is brought into play, and windows slide over the feature map to extract features for each region proposal. The RPN is a fully convolutional network which simultaneously predicts object bounding boxes and objectness scores at each position. The input of the RPN is an image of any size, and it outputs a set of rectangular object proposals, each with an objectness score. Specifically, the RPN takes the image feature map of the fifth convolutional layer (conv5) as input and applies a 3×3 sliding window on the feature map. Then, an intermediate layer feeds into two different branches: one for the object score (determining whether the region is thing or stuff) and the other for regression (determining how the bounding box should change to become more similar to the ground truth). The RPN improves accuracy and running time, and it avoids generating an excess of proposal boxes, because it reduces cost by sharing computation on the convolutional features. The RPN and Fast R-CNN are merged into a single network by sharing their convolutional features. This combination gives Faster R-CNN leading accuracy but makes its architecture a two-stage network, which reduces the method's processing speed.
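At the shape level, the RPN head looks like the sketch below; random weights stand in for the learned convolutions (a 1×1 projection replaces the 3×3 convolution), so only the tensor shapes are meaningful.

import numpy as np

def rpn_head(conv5, k=9):
    # conv5: (h, w, c) feature map; k anchors per sliding-window location
    h, w, c = conv5.shape
    inter = np.maximum(conv5 @ np.random.rand(c, 256), 0)  # intermediate layer + ReLU
    cls = inter @ np.random.rand(256, 2 * k)   # (h, w, 2k): object vs. not, per anchor
    reg = inter @ np.random.rand(256, 4 * k)   # (h, w, 4k): box deltas, per anchor
    return cls, reg

cls, reg = rpn_head(np.random.rand(38, 50, 512))
print(cls.shape, reg.shape)  # (38, 50, 18) (38, 50, 36)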

3.5. You Only Look Once. Inheriting the advantages of the previously introduced models, You Only Look Once (YOLO) [4] was considered a state-of-the-art real-time object detector over various categories at the time. YOLO currently has three versions [4–6], which improve substantially from version to version.


The detailed analyses of the YOLO approaches, as a premise for applying them to practical applications, are as follows.

YOLOv1 [4]: it is widely known that YOLO, a unified or one-stage network, is a completely novel approach aiming to tackle object detection in real time, proposed by Redmon et al. Instead of performing object detection like previous techniques based on complex pipelines, such as [1, 4], which use exhaustive sliding windows whose outputs are fed to classifiers at equally spaced locations over the whole image, or which use region proposals to generate bounding boxes possibly containing objects and then feed them to convolutional neural networks, YOLO treats object detection as a regression problem, simultaneously predicting the coordinates of bounding boxes and the class probabilities for these boxes. The key idea of YOLO's detection is that it separates images into grid views, which boosts both the running time and the object-localization accuracy of YOLO. The goal of YOLO is to deal with two problems, namely, what objects are present and where they are in an image. The operation of YOLO proceeds in three principal steps, simply and straightforwardly: firstly, YOLO takes an input image resized to a fixed size; it then runs a single convolutional network, as a unified network, on the image; and it ultimately thresholds the resulting detections by the confidence score of the model. YOLO runs at 45 fps on a GPU, and the smaller Fast YOLO reaches 150 fps; this processing can run streaming video in real time. Although the design of the YOLO architecture affords end-to-end training and real-time detection, it still keeps high average precision.

The network divides the input image into an S×S grid, where S×S is equal to the width and height of the tensor which represents the final prediction. If the center of an object falls in a grid cell, that grid cell takes responsibility for detecting the object. Moreover, each grid cell is simultaneously responsible for predicting bounding boxes and confidence scores, which represent how confident the model is that the bounding box contains an object, as well as how accurately the bounding box is predicted.
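A one-function sketch of this cell-assignment rule (a square image is assumed for brevity):

def responsible_cell(box_center, image_size, S=7):
    # the cell containing the box center must detect the object
    cx, cy = box_center
    col = min(int(cx / image_size * S), S - 1)
    row = min(int(cy / image_size * S), S - 1)
    return row, col

# an object centered at (300, 150) in a 448 x 448 image falls in cell (2, 4):
print(responsible_cell((300, 150), 448))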

The drawback of YOLO is that it lags behind the state-of-the-art detection systems in accuracy, although it beats them in running time. It makes fewer than half the number of background errors compared with Fast R-CNN. YOLO is highly generalizable, so it can quickly identify objects in an image, but it usually struggles to precisely localize some objects, especially small ones. Therefore, the authors introduced YOLOv2 to improve the performance and fix the drawbacks of YOLO.

YOLOv2 [5] has a number of improvements over YOLOv1. Like the original, YOLOv2 runs on different fixed sizes of input image, but it introduced several new training methods for object detection and classification, such as batch normalization, multiscale training with higher resolutions of input images, predicting the final detections on a higher-resolution spatial output, and using good default bounding boxes instead of fully connected layers.

However, this offers a trade-off between speed and accuracy. The details of the mAP improvements on PASCAL VOC 2007 are shown in Figure 3.

These novel improvements allow YOLOv2 to train on multiclass datasets like COCO or ImageNet. In addition, the detector was even trained to detect over 9000 different object classes. YOLOv2 uses a network architecture customized from the original network. YOLOv2 mainly concentrates on improving recall and localization while still keeping high classification accuracy in comparison with state-of-the-art detectors; the original YOLO makes significantly more localization errors but is far less likely to predict false detections in places where nothing exists. Although YOLOv2 has accuracy improvements, it does not work well on small objects, because the input downsampling results in a low-dimensional feature map that is used for the final prediction. To solve these problems, the authors then introduced YOLOv3, with significant improvements on object detection, especially on small objects. Generally, a variety of the latest networks tend to be deeper and yield good performance on their tasks with deep features learned from numerous layers.

YOLOv3 [6] is one of these approaches: instead of using Darknet-19 like the two older versions [4, 5], YOLOv3 develops a deeper network with 53 layers, called Darknet-53, and combines the network with state-of-the-art techniques such as residual blocks, skip connections, and upsampling. The residual blocks and skip connections are very popular in ResNet and related approaches, and upsampling has also recently improved the recall, precision, and IoU metrics for object detection [25]. For the task of detection, 53 more layers are stacked onto it, giving a 106-layer fully convolutional underlying architecture for YOLOv3. This is the reason behind the slowness of YOLOv3 compared with YOLOv2.

Second, YOLOv3 enables the detector to predict objects at three different outputs with three different scales, rather than just one prediction at the last layer of the network, similarly to its competitor SSD [26], which improved performance a lot on low-resolution images. This is useful for picking up diverse outcomes in order to improve detection performance. The final output is created by applying a 1×1 kernel on a feature map. Particularly, the detection is done by applying 1×1 detection kernels on feature maps of three different sizes at three different places in the network, partly similar to feature pyramid networks (FPNs) [27].

Third, YOLOv3 still keeps using K-means to generate anchor boxes, but instead of applying all 5 anchor boxes at the last detection, YOLOv3 generates 9 anchor boxes and separates them into 3 locations. Each location applies 3 anchor boxes; hence, there are more bounding boxes per image. For example, for an image of 416×416, YOLOv2 predicts 13×13×5 = 845 boxes, while for YOLOv3 the number of boxes is 10647, implying that YOLOv3 predicts about 10 times the number of boxes of YOLOv2.
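These box counts follow from a line of arithmetic: for a 416×416 input, the three YOLOv3 output strides of 32, 16, and 8 give 13×13, 26×26, and 52×52 maps.

yolov2 = 13 * 13 * 5                                    # one scale, 5 anchor boxes
yolov3 = sum((416 // s) ** 2 for s in (32, 16, 8)) * 3  # three scales, 3 anchors each
print(yolov2, yolov3, yolov3 / yolov2)                  # 845 10647 ~12.6

The exact ratio is about 12.6×, which the text above rounds to roughly 10 times.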

Fourth, YOLOv3 also changes the way the cost function is calculated. If an anchor overlaps a ground truth more than the other bounding boxes do, the corresponding objectness score should be 1. For the other anchor boxes with overlap greater than a predefined threshold (0.5), no cost is incurred. Each ground truth is associated with only one boundary box. If a bounding box is not assigned, it incurs no classification


and localization loss, just confidence loss on objectness. The loss function in the previous YOLO looks like

$$
\begin{aligned}
&\lambda_{\mathrm{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}}\left[\left(x_i-\hat{x}_i\right)^2+\left(y_i-\hat{y}_i\right)^2\right]\\
&\quad+\lambda_{\mathrm{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right]\\
&\quad+\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}}\left(C_i-\hat{C}_i\right)^2+\lambda_{\mathrm{noobj}}\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{noobj}}\left(C_i-\hat{C}_i\right)^2\\
&\quad+\sum_{i=0}^{S^2} \mathbb{1}_{i}^{\mathrm{obj}}\sum_{c\in\mathrm{classes}}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
\tag{1}
$$

Currently, instead of using the mean square error for calculating the classification loss in the last three terms, YOLOv3 uses a binary cross-entropy loss for each label. In other words, YOLOv3 makes its prediction of an objectness score and a class prediction for each bounding box using logistic regression.

There is no more softmax function for class prediction. The reason is that most currently used classifiers assume that predicted labels are independent and mutually exclusive, implying that if an object belongs to one class, it cannot belong to another. This is only true if the output labels really are mutually exclusive; however, a dataset may have multilabel classes, with labels that are not mutually exclusive, such as pedestrian and person. In that case, the sum of the possibility scores may be greater than 1 if the classifier is a softmax, so YOLOv3 changes the classifier for class prediction from the softmax function to independent logistic classifiers, which calculate the likelihood of the input belonging to each specific label.
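The difference is easy to see numerically; the logits below are invented purely to illustrate the pedestrian/person case.

import numpy as np

logits = np.array([2.0, 1.5, -3.0])        # e.g., person, pedestrian, car

softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1.0 / (1.0 + np.exp(-logits))    # independent logistic classifiers

print(softmax.round(3))  # sums to 1: person and pedestrian compete for mass
print(sigmoid.round(3))  # each label scored on its own: both can be high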

3.6. Single Shot MultiBox Detector. Single Shot MultiBox Detector (SSD) [26] is a single-shot detector using a single, one-stage deep neural network designed for object detection in real time. By comparison, the state-of-the-art two-stage method, Faster RCNN, uses its own proposal network instead of an external method to generate object proposals and utilizes those to classify objects, moving toward real-time detection, but the whole process runs at only 7 FPS.

Figure 2: The visualization of detectors with the strongest backbones on subsets of PASCAL: VOC_MRA_058, VOC_MRA_10, VOC_MRA_20, and VOC_WH_20, in order, respectively. (a) YOLO Darknet-53. (b) Faster RCNN ResNeXT-101-64×4d-FPN. (c) RetinaNet ResNeXT-101-64×4d-FPN. (d) Fast RCNN ResNeXT-101-64×4d-FPN.


SSD enhances running speed beyond the previous detectors by eliminating the need for the proposal network. Although this causes a small drop in mAP, SSD compensates for it by applying some improvements, including multiscale features and default boxes. These improvements allow SSD to match Faster RCNN while using lower resolution images, which further speeds up processing. For a 300×300 input image, the best version of SSD gets 77.2% mAP at 46 FPS, better than Faster R-CNN (73.2%) and a little lower than the best version of YOLOv2 (554×554 input image, 78.6% mAP at 40 FPS), on VOC 2007 on an Nvidia Titan X.

Similarly, SSD consists of 2 parts, namely, the extraction of feature maps and the use of convolution filters to detect objects. SSD uses VGG16 as a base network to extract feature maps. Then, it adds 6 convolutional layers to make predictions. Each prediction contains a bounding box and N + 1 scores for each class, where N is the number of classes and the extra one is the no-object class. Instead of using a region proposal network to generate boxes and feed them to a classifier for computing the object locations and class scores, SSD simply uses small convolution filters: after the VGG16 base network extracts features from the feature maps, SSD applies 3×3 convolution filters for each cell to predict objects. Each filter gives an output including N + 1 scores for each class and 4 attributes for one boundary box.
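The channel count of each prediction filter bank therefore follows directly from the number of classes and default boxes; in the sketch below, 20 VOC classes and 6 default boxes per cell are example values only.

def ssd_head_channels(num_classes, boxes_per_cell):
    # (N + 1) class scores (including background) + 4 box offsets per default box
    return boxes_per_cell * (num_classes + 1 + 4)

c = ssd_head_channels(20, 6)
print(c, 19 * 19 * 6)   # 150 output channels; 2166 boxes from a 19 x 19 map alone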

SSD differs from previous approaches of the same period in that it makes predictions on multiscale feature maps independently, rather than only on the last layer. The CNN network gradually reduces the spatial dimension of the image, leading to a decrease in the resolution of the feature maps. As mentioned, SSD uses a lower resolution input to detect objects; hence, the early layers are used to detect small objects and the lower resolution layers to detect progressively larger-scale objects. Besides, SSD applies different scales of default boxes to different layers, as visualized intuitively in Figure 4. Particularly, only the blue default box on the 8×8 feature map fits the ground truth of the cat, and only the red one on the 4×4 feature map matches the ground truth of the dog.

Although SSD achieves significant improvements in object detection by integrating the above parts, it is not good at detecting small objects; this can be improved by adding deconvolution layers with skip connections to introduce additional large-scale context [28]. Generally, SSD outperforms Faster RCNN, a state-of-the-art approach, in accuracy on PASCAL VOC and COCO, while running at real-time detection speed.

3.7. CNN Drawbacks. Most CNN models are currently designed as a hierarchy of various layers, such as convolutional and pooling layers, arranged in a certain order, not only in small networks but also in multilayer, state-of-the-art networks. Behind these layers, fully connected layers, known as FC layers, are added. The block consisting of the FC layers and the previous layers is designated the feature extractor, and it outputs the key features of objects of interest as input for the classifiers that follow. However, going deeply through many kinds

of layers is not good for small object detection, because in this task the objects of interest have small sizes and appearance. Besides, small objects, unlike normal or big objects which are less affected by resizing the image or passing through many different layers, are very vulnerable to changes in image size. When an image passes through a convolutional layer, the size of the image is decreased by the receptive fields that slide over the image to extract useful features. This would not affect small objects if there were just a few such layers, but a CNN network has many layers like this, and it is very hard on small objects. Moreover, it is not only the convolutional layers: small objects, which have only a little informative presence, also have to pass through pooling layers, which help avoid overfitting and reduce the computational cost by decreasing the number of parameters. To do this, these layers use fixed sliding windows with a fixed target identified beforehand, such as maximum or average calculations of values. For these reasons, GAN is an approach that may replace the CNN approach, thanks to its advantages: we can exploit the way the approach generates data to overcome the limitations of small object data for the training phase. Although images still have to pass through layers such as convolutional and pooling layers, in this context the network has fewer layers than the others. Bai et al. [29] have proposed applying MTGAN to detect small objects, taking cropped inputs from a preprocessing step made by baseline detectors such as Faster RCNN [15] or Mask RCNN [9].

For the mentioned reasons, and following the survey [30], in which Liu et al. present numerous survey and evaluation works, none of which deal with small objects, in this work we assess popular and state-of-the-art models to find out their pros and cons. Particularly, we evaluate 4 deep models, YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, with several base networks, for small object detection at different scales of objects. Among these models, YOLOv3 and RetinaNet belong to the one-stage approach; Fast RCNN and Faster RCNN are in the two-stage approach. We choose these models because YOLOv3 is the model combining state-of-the-art techniques, and RetinaNet is the model with a new loss function which penalizes the class imbalance of a dataset. Besides, we choose RetinaNet to enable comparisons between models in the same approach. Similarly, Fast RCNN and Faster RCNN are in the same approach and have nearly the same object detection pipeline. The difference is that Fast RCNN utilizes an external proposal method to generate object proposals based on the input images, whereas Faster RCNN proposes its own network to generate object proposals on feature maps, which makes Faster RCNN easy to train end-to-end and lets it work better.

4. Experimental Evaluation

In this section, we present the information of our experimental setting and the datasets which we use for evaluation.


4.1. Experimental Setting. We continually train and evaluate various object detectors on two datasets, PASCAL VOC [11] and a newly generated dataset [16]. The approaches evaluated this time consist of Faster RCNN [15], YOLOv3 [6], and RetinaNet [7] with different backbones. Except for YOLOv3, the others are trained and evaluated with the Detectron Python code.

Currently, the original datasets commonly used in object detection are PASCAL VOC [11] and COCO [12]. Both datasets are constructed mostly of large objects, or of objects whose size fills a big part of the image; these two datasets are not suitable for small object detection. In addition, there is another dataset for small object detection which is large-scale and includes a lot of classes, collected by drones and named the VisDrone dataset [31]. However, it does not publish the labels of the test set for evaluation, and the views of the images are top-down, which is not our case. As a result, in order to evaluate the detection performance of the models, we use a dataset published in [13]. This dataset is called the small object dataset and is a combination of the COCO [12] and SUN [24] datasets. There are 10 classes in the small object dataset, including mouse, telephone, switch, outlet, clock, toilet paper (t. paper), tissue box (t. box), faucet, plate, and jar. The whole dataset consists of 4925 images in total, with 3296 images for training and 1629 images for testing. The mouse class owns the largest number of instances in images, 2137 instances in 1739 images, and the tissue box class has the smallest number of instances, 103 instances in 100 images. Apart from the small object dataset, we also filter subsets from PASCAL VOC 2007 following standard definitions. PASCAL VOC has 20 classes, but for small object detection there are fewer classes under strict definitions of small objects. Table 1 lists the details of the number of small objects and the images containing them for the subsets of the dataset.

We trained all models on the small object dataset with the same parameters. Particularly, in the training phase, we trained the models for 70k iterations with the parameters, including momentum, decay, gamma, learning rate, batch size, step size, and training days, listed in Table 2. At first, we attempted to start off the models with a higher learning rate, 10^-2, but the models diverged, the loss value becoming NaN or Inf after the first 100 iterations. Then we tried a lower learning rate, 10^-3, for the first 100 iterations, rising to 10^-2, to check whether the models can converge when starting off at a lower rate; however, nothing changed. We also saw that the models converged quickly during the first 10k iterations with 10^-3 and then progressively slowed down after 20k. Therefore, we decided to start off the training with a learning rate of 10^-3 and decrease it to 10^-4 and 10^-5 at 25k and 35k iterations, respectively. This setting shows that the loss value was stable from 40k, but we set the training up to 70k to observe how the loss value changes and saw that it did not change much after 40k iterations. We evaluated the models from 30k to 70k and, generally, the performance of the models was not stable after 40k iterations. For this reason, we picked the weights for evaluation at 30k and 40k iterations: at 30k iterations YOLO achieves its best results, and the others get their best at 40k iterations. In the case of the subsets of PASCAL VOC 2007, we combine the train and validation sets from PASCAL VOC 2007 and 2012 to form a training set; PASCAL VOC 2012 works as a data augmentation set for PASCAL VOC 2007. We use this combined training set to train all models and test them on the subsets. All models are trained with the same parameters. First, due to the limitation of memory, we rescale all images to the same size, with the shortest side 600 and the longest side 1000, as in [15].
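The resulting step schedule can be written compactly; this is a minimal sketch of the decay rule described above (function and variable names are ours):

```python
# Step learning-rate schedule used for the small object dataset runs:
# start at 1e-3, multiply by gamma = 0.1 at 25k and again at 35k iterations.

def learning_rate(iteration: int) -> float:
    base_lr, gamma = 1e-3, 0.1          # gamma as in Table 2
    steps = (25_000, 35_000)            # decay points
    decays = sum(iteration >= s for s in steps)
    return base_lr * gamma ** decays

for it in (0, 10_000, 25_000, 30_000, 35_000, 40_000):
    print(it, learning_rate(it))        # 1e-3, 1e-3, 1e-4, 1e-4, 1e-5, 1e-5
```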

In YOLOv3, we run the K-means clustering algorithm to initialize 9 suitable default bounding boxes for the training and testing phases on our selected datasets, and we changed the anchor values accordingly. The following are the 9 anchors for the small object dataset after running the K-means algorithm: [10.3459, 14.4216], [26.2937, 19.0947], [21.4024, 36.3180], [47.9317, 29.1237], [40.4932, 63.7489], [83.6447, 51.3203], [72.2167, 119.9181], [172.7416, 117.0773], and [124.6597, 252.8465].
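The clustering step can be sketched as follows; this is our own compact illustration of IoU-based K-means (the reference YOLO implementation differs in details such as initialization):

```python
import random

def iou_wh(a, b):
    """IoU of two (w, h) boxes assumed to share the same top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(wh_list, k=9, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs under the 1 - IoU distance and
    return k anchors sorted by area, as YOLO-style detectors expect."""
    random.seed(seed)
    centroids = random.sample(wh_list, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in wh_list:
            # assign each box to the centroid with the highest IoU
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster empties out
                centroids[i] = (sum(w for w, _ in cluster) / len(cluster),
                                sum(h for _, h in cluster) / len(cluster))
    return sorted(centroids, key=lambda wh: wh[0] * wh[1])
```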

In Faster R-CNN, to compare fairly with the prior work and deploy on different backbones, we directly reuse the anchor scales and aspect ratios from the paper [13], namely anchor scales of 16×16, 40×40, and 100×100 pixels and aspect ratios of 0.5, 1, and 2, instead of having to cluster a set of default bounding boxes as in YOLOv3. Similarly, in RetinaNet we keep the default training settings, namely loss gamma 2.0, loss alpha 0.25, anchor scale 4, and 3 scales per octave, following the authors, since this configuration has the optimized values.
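For reference, the focal loss behind these RetinaNet settings can be written for a single anchor as follows (a minimal sketch with the defaults we keep, gamma = 2.0 and alpha = 0.25):

```python
import math

def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t)
    for one anchor; p is the predicted foreground probability and y is 1
    for a foreground anchor, 0 for background."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))

# Easy, abundant background anchors contribute almost nothing to the loss,
# which is how RetinaNet counters the foreground-background imbalance:
print(focal_loss(0.1, 0))  # well-classified background: ~0.0008
print(focal_loss(0.9, 0))  # hard false positive: ~1.4
```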

4.2. Our Newly Generated Dataset. This time, to have an objective comparison, we also use our newly generated dataset, whose information is shown in Table 1. We use it to consider the effects of object sizes on factors including models, processing time, accuracy, and resource consumption. The dataset consists of 4 subsets filtered from PASCAL VOC 2007.

Figure 3: mAP of YOLOv2 on VOC 2007 as each part is added (batch norm, hi-res classifier, convolutional with anchor boxes, new network, dimension priors, location prediction, passthrough, multiscale, hi-res detector), rising from 63.4 (YOLO) to 78.6 (YOLOv2) [5].


The subsets are VOC_WH_20, VOC_MRA_058, VOC_MRA_10, and VOC_MRA_20, and detailed information is provided as follows:

(i) VOC_WH_20 contains objects whose width and height are less than 20% of the image's width and height. This subset has two classes fewer than PASCAL VOC 2007, namely dining table and sofa, because of the constraint of the definition.

(ii) VOC_MRA_058, VOC_MRA_10, and VOC_MRA_20 comprise objects occupying a maximum mean relative area of the original image under 0.58%, 1.0%, and 2.0%, respectively. Two of them have the same number of classes as PASCAL VOC 2007; the exception is VOC_MRA_058, which has four classes fewer, namely dining table, dog, sofa, and train (a sketch of this filtering follows the list).
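To make the two definitions concrete, the following is a minimal sketch of the filtering rules (our own illustration, not the code used to build the subsets; the annotation layout, a dict with a "bbox" tuple, is an assumption, and the mean-relative-area criterion is approximated here by a per-object relative-area test):

```python
# Hedged sketch of the subset-filtering rules above (illustrative only).
# Assumed annotation layout: obj = {"bbox": (xmin, ymin, xmax, ymax)}.

def relative_area(bbox, img_w, img_h):
    """Fraction of the image area occupied by a bounding box."""
    xmin, ymin, xmax, ymax = bbox
    return ((xmax - xmin) * (ymax - ymin)) / float(img_w * img_h)

def keep_for_mra(obj, img_w, img_h, threshold=0.0058):
    """VOC_MRA-style rule: relative area under the threshold
    (0.0058, 0.010, or 0.020 for the three MRA subsets)."""
    return relative_area(obj["bbox"], img_w, img_h) < threshold

def keep_for_wh_20(obj, img_w, img_h):
    """VOC_WH_20 rule: width and height each under 20% of the image's."""
    xmin, ymin, xmax, ymax = obj["bbox"]
    return (xmax - xmin) < 0.2 * img_w and (ymax - ymin) < 0.2 * img_h
```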

5. Results and Analyses

In this section, we show the results that we achieved through the experimental phase. All models mentioned in this section, except for models cited from other papers, are trained in the same environment with 1 GPU: Ubuntu 16.04.4 LTS, Intel(R) Xeon(R) Gold 6152 CPU @ 2.10 GHz, and a Tesla P100 GPU. In addition to the comparative accuracy, other comparisons are also provided to make our assessment results objective and clear.

5.1. Accuracy

5.1.1. Small Object Dataset. Following the detection results in Table 3, methods belonging to the two-stage approach outperform ones in the one-stage approach by about 8–10%. Specifically, Faster RCNN with the ResNeXT-101-64×4d-FPN backbone achieved the top mAP among the two-stage approaches, and the top of the table as well, 41.2%. In comparison, the top of the one-stage approaches, YOLOv3 608×608 with Darknet-53, obtained 33.1%. Following [32], methods based on region proposals, such as Faster RCNN, are better than methods based on regression or classification, such as YOLO and SSD. This proves right once again in the context of the small object dataset.

Consider the methods in each approach. Firstly, the two-stage approaches: Faster RCNN, which is an improvement of Fast RCNN, is only greater than Fast RCNN by about 1–2%, and only for the ResNeXT backbones, being equal to Fast RCNN for the rest. The difference here is not large, which means that the performance of an external region proposal method like selective search combined with RoI pooling is as good as an internal region proposal network like RPN with RoI align in this case. Besides, compared to R-CNN, we perceive a boost of 8–10% when RoI pooling or RoI align is added: R-CNN, which uses region proposals from selective search, feeds them into the network, and directly computes features from fc (fully connected) layers, only reaches 23.5% with Alexnet and 24.8% with VGG16 combined with proposals from RPN. However, Fast RCNN and Faster RCNN with the two kinds of RoIs are much better: Fast RCNN receives accuracy in a range of 31.7% to 39.6% depending on the backbone, and similarly Faster RCNN gets 30.1% to 41.2%. Secondly, in the one-stage approaches, YOLO outperforms SSD and RetinaNet: YOLO gets the highest outcome, 33.1%, while SSD and RetinaNet get 11.32% and 30%, respectively. YOLO and SSD are considered state-of-the-art methods in speed at the price of accuracy. However, there is a large difference in accuracy between YOLO and SSD; the difference is that SSD adds multiple convolutional layers behind the backbone, each layer with its own ability, instead of using 2 fully connected layers like YOLO. Although RetinaNet is assigned to the one-stage approach, it cannot run in real time. RetinaNet is the one proposed to deal with the imbalance between foreground and background by the focal loss; therefore, RetinaNet obtains a higher accuracy in comparison with the others, except for YOLOv3 (Darknet-53).

When it comes to the backbones, we realized that Darknet-53 is the best among the one-stage and real-time methods, even far above ResNet-50, although it has a similar number of layers to ResNet-50. In contrast, ResNeXT combined with FPN is the most powerful one in both one-stage and two-

Table 1: The information of the subsets.

Subsets        Classes  Images  Instances
VOC_MRA_058    16       329     529
VOC_MRA_10     20       2231    5893
VOC_MRA_20     20       2970    7867
VOC_WH_20      18       1070    2313

Table 2: The parameters of the models.

Method        Momentum  Decay   Gamma  Learning_rate  Batch_size  Training_days  Stepsize
YOLOv2 [16]   0.9       0.0005  —      0.001          8           5              25000
YOLOv3        0.9       0.0005  —      0.001          32          3–4            25000
SSD300 [16]   0.9       0.0005  0.1    0.000004       12          9              40000, 80000
SSD512 [16]   0.9       0.0005  0.1    0.000004       12          12             100000, 120000
RetinaNet     0.9       0.0005  0.1    0.001          64          4–12 h         25000, 35000
Fast RCNN     0.9       0.0005  0.1    0.001          64          4–12 h         25000, 35000
Faster RCNN   0.9       0.0005  0.1    0.001          64          4–12 h         25000, 35000


stage methods if we only consider accuracy. Overall, there is an increase of about 1–3% when changing from the simple backbone to the complex one of each type. For example, when switching from the original ResNet to ResNet-FPN, the accuracy is boosted by 2 to 3%. It is clear that leveraging the multiscale features of FPN is a common way to improve detection and tackle the scale imbalance of input images and of the bounding boxes of different objects. Similarly, when we switch ResNeXT-101-32×8d-FPN to ResNeXT-101-64×4d-FPN, the accuracy changes from 40.5% to 41.2% for Faster RCNN and from 38.7% to 39.6% for Fast RCNN. However, when comparing ResNet-50-FPN and ResNet-101-FPN, the growth only happens for Fast RCNN, from 33.3% to 35.5%; there is a slight decrease, 0.1%, for Faster RCNN. This reduction also happens with RetinaNet, where the simpler backbone, ResNeXT-101-32×8d-FPN, gets 30% and ResNeXT-101-64×4d-FPN just gets 25.1%. It means that very deep backbones do not guarantee an increase in accuracy; the reason is that the advantage of a deeper network requires more parameters to learn, so a large amount of data is needed to train the network and update those parameters, but in this case the data of the small object dataset are not abundant enough to fit a very deep network, which increases the chance of overfitting. Besides, features originally from the early layers of ResNet are not well generalized, because when they are combined with FPN the accuracy improves by about 2–3%. When YOLO switches from Darknet-19 to Darknet-53, the accuracy is really boosted: the highest accuracy achieved with Darknet-19, at the resolution of 800×800, is just 24.02%, whereas YOLO 608×608 with Darknet-53 gets 33.1%. The explanation is that YOLOv3 with Darknet-53 has several improvements over Darknet-19: YOLOv3 predicts objects at 3 scale locations, one of them specialized for small objects, instead of only one as with Darknet-19, and it also integrates cutting-edge advantages such as residual blocks and shortcut connections. A reduction in accuracy happens again with YOLO when switching from ResNet-101 to ResNet-152, by about 1–2%. Among these methods, YOLO and SSD are the only ones which allow multiple input sizes. The higher the resolution of the input images, the higher the accuracy the method receives; the reason is that a higher resolution image provides more pixels to describe the visual information of small objects. However, if the resolution is far from the original size of the images, it results in a decrease in accuracy. For example, YOLO 1024×1024 with Darknet-19 gets a lower accuracy than the resolution of 800×800. In addition, we tried increasing the resolution of Darknet-53 from 608 to 1024, and the mAP decreases when the resolution is over 608×608. Therefore, the effect of image size is clear for models like SSD and YOLO. Generally, all comparative mAP results on this dataset are dominated by the classes that are very great in number, which is caused by the data imbalance between the number of images and the instances in these images. For example, according to the statistics in [13], mouse is a major class significantly contributing to the mAP in Table 3, with the highest number of instances and images as well; however, tissue box has the least contribution, with the lowest AP, originally affected by its amount of data.

Furthermore, the imbalanced data lead models to tend to detect frequent objects, implying that models will misunderstand objects having a nearly similar appearance to the dominating class as the objects of interest, rather than detecting less frequent objects. As a result, false positives increase because of these problems. Figure 4 illustrates the detection with the strongest backbones. Following this visualization, the domination of classes such as mouse or faucet results in misdetection of areas which have the same appearance as them. This misunderstanding tends to happen more with the weaker backbones in the comparison, and one-stage methods like YOLO, which primarily aim at speed, have more misdetections than two-stage methods. A reason for these problems is the difference in the way of training deep networks [33]. One-stage methods such as YOLO use a soft sampling method, which uses the whole dataset to update parameters rather than choosing only some samples from the training data. In contrast, two-stage methods such as the RCNN family tend to employ hard sampling methods, which randomly sample a certain number of positive and negative bounding boxes to train the network, as sketched below.
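The contrast in sampling can be sketched as follows (a minimal illustration; the 128-RoI minibatch and 25% positive fraction are the common Fast/Faster RCNN convention, not values taken from this paper):

```python
import random

def sample_rois_hard(rois, batch_size=128, pos_fraction=0.25, seed=0):
    """Hard-sampling sketch in the RCNN-family style: draw a fixed-size
    minibatch of RoIs with a capped share of positives; the remainder
    are negatives. Each roi is a (features, is_positive) pair."""
    random.seed(seed)
    pos = [r for r in rois if r[1]]
    neg = [r for r in rois if not r[1]]
    n_pos = min(len(pos), int(batch_size * pos_fraction))
    n_neg = min(len(neg), batch_size - n_pos)
    batch = random.sample(pos, n_pos) + random.sample(neg, n_neg)
    random.shuffle(batch)
    return batch

# A one-stage "soft" scheme, by contrast, keeps every anchor in the loss
# (possibly re-weighted, as the focal loss does) instead of sampling.
```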

5.1.2. Subsets of PASCAL. With 4 subsets of 4 different scales of objects in images, we want to find out how much the scales impact the models. The whole results are shown in Table 4. We separate the results into 2 groups, the one-stage and two-stage approaches, and Figure 5 is a visualization of the strongest backbones of each method on the subsets.

In the case of different scales, as in our subsets, there is a difference between the one-stage and two-stage approaches. Here, methods from the one-stage approach perform better than two-stage ones at most scales, which is really the opposite of the small object dataset. Specifically, two-stage methods are totally better than one-stage ones in the case of real-time inputs, and just a bit better, about 10–20%, than nonreal-time models on VOC_WH20, with the same result for smaller objects in VOC_MRA_0058 and VOC_MRA_010. However, for bigger objects in VOC_MRA_020, methods of the one-stage approach have significantly better outcomes than two-stage ones. In addition, only Faster RCNN has good performance in most cases compared to the one-stage methods; Fast RCNN is only good at big objects in VOC_MRA_020 and fails to produce good detections on smaller objects.

In the one-stage approach, among methods which allow multiple input sizes like YOLO and SSD, there are 2 kinds, namely ones that can run in real time and others that cannot, when the resolution is over 640 or 512 for YOLO and SSD, respectively. For the real-time ones, YOLO outperforms SSD at all scales of objects. Specifically, YOLOv2 with Darknet-19 is better than SSD by about 2–6% on objects in VOC_MRA_0058 and VOC_MRA_010 and by 4–15% on larger objects in VOC_MRA_020 and VOC_WH_20. YOLOv3 with Darknet-53 gets higher results, by about 3–5%, in comparison with YOLOv2; hence, YOLOv3 also gets higher results than SSD. However, if we consider nonreal-time input images, SSD is greater than YOLO on objects in VOC_MRA_010. RetinaNet, the one-stage method that cannot run in real time, performs on par with the nonreal-time versions of YOLO


and better than SSD. RetinaNet is more stable than SSD and YOLO when the scales are changed; the bigger the objects are, the greater the stability. For example, the change is large, about 33%, when the scale increases from objects in VOC_MRA_0058 to ones in VOC_MRA_010 and VOC_MRA_020; however, this change is not much, about 10%, for bigger objects, in comparison with YOLO's 15–25%. In the case of YOLO, this remarkable increase in accuracy when objects are larger is obviously good for a model. The change in SSD resembles the change in RetinaNet.

Concerning resolutions in YOLO and SSD, we see that increasing the image resolution generally pushes the accuracy up. YOLOv2 with Darknet-19, YOLOv3 with Darknet-53, and SSD all have an increase in accuracy when the resolution is large, except for YOLOv2 on objects belonging to VOC_MRA_010 and VOC_MRA_020 when the image is over 800. In addition, YOLOv2 fluctuates on the objects in VOC_WH20. As mentioned in our previous work, YOLO is better than SSD on objects smaller than 10% of the image; however, in this case YOLOv3 is good at all scales of objects. This is because YOLOv3 has 3 detection locations coming with more ratios of default boxes, and this leads to a significant outcome when combining the results from the 3 locations.

When we switch to the two-stage approaches, Faster RCNN has a significant improvement at most scales over Fast RCNN, except for objects in VOC_MRA_020, where they have the same accuracy. This shows that, if objects are completely separated into different scales, RoI pooling does not work well with smaller objects and the ones in VOC_WH20. In

Table 3: Comparative results on the small object dataset (per-class AP and mAP, %).

Method              Backbone                  Clock  Faucet  Jar    Mouse  Outlet  Plate  Switch  Tel    t.box  t.paper  mAP
YOLO 416 [16]       Darknet-19                22.8   30.8    4      52     20.4    13.1   13      6.1    0      35.3     19.39
YOLO 448 [16]       Darknet-19                23     36.9    9      52.5   18.4    13.6   17.5    4.2    0      34.3     20.13
YOLO 480 [16]       Darknet-19                34.2   37.3    9.1    53.3   21.4    13.6   15.8    9.1    9.1    34.2     23.71
YOLO 512 [16]       Darknet-19                23.1   36.6    6.1    59.8   24.6    14.2   15.7    9.1    4.5    32.4     22.61
YOLO 554 [16]       Darknet-19                23.4   37.2    9.1    60.1   27.2    13.4   19.9    9.1    4.5    34.5     23.84
YOLO 640 [16]       Darknet-19                20.2   36.2    3.2    59.8   27.8    11.7   18.1    8.2    4.5    35.6     22.53
YOLO 800 [16]       Darknet-19                27.6   36      2.3    60.2   32.8    13.1   23.3    9.1    9.1    26.7     24.02
YOLO 1024 [16]      Darknet-19                21.7   29.3    1.4    58.3   26.4    11.8   17.5    9.1    9.1    15.7     20.03
YOLO 320            Darknet-53                26.22  38.38   4.55   56.46  36.42   13.34  24.8    10.65  4.55   42.96    25.83
YOLO 416            Darknet-53                28.47  47.15   10.83  60.49  43.15   15.87  30.73   15.15  2.62   48.3     30.28
YOLO 608            Darknet-53                29.98  47.89   10.76  65.88  48.02   18.09  31.22   14.62  17.99  46.56    33.1
YOLO 320            ResNet-50                 19.57  25.73   0.67   45.17  14.37   9.38   13.84   9.09   9.09   23.7     17.06
YOLO 416            ResNet-50                 23.78  36.65   0.4    54.23  18.37   13.75  19.78   9.84   9.42   35.68    22.19
YOLO 608            ResNet-50                 26.92  40.65   1.77   61.86  29.18   15.04  20.24   10.09  13.29  36.01    25.5
YOLO 320            ResNet-101                20.52  27.9    0.57   44.68  16.98   13.05  13.66   9.66   9.09   24.36    18.05
YOLO 416            ResNet-101                25.72  35.6    3.03   55.73  22.4    15.61  17.26   9.32   3.03   38.71    22.64
YOLO 608            ResNet-101                28.79  44.59   9.42   62.18  33.34   15.53  23.88   13.24  15.83  39.17    28.6
YOLO 320            ResNet-152                21.64  27.56   3.03   48.06  17.39   11.12  14.51   9.09   4.55   31.88    18.88
YOLO 416            ResNet-152                25.7   36.54   0.89   53.81  20.6    14.13  20.21   11.49  0.29   33.06    21.67
YOLO 608            ResNet-152                26.01  44.54   4.55   61     31.76   13.02  22.67   12.35  9.93   39.99    26.58
SSD300 [16]         ResNet-101                5.5    9.1     0      25.5   6.1     4.5    0       4.5    9.1    18.2     8.25
SSD300 [16]         VGG16                     9.1    17.1    0      26.1   9.1     9.1    0       4.5    0      16.7     9.16
SSD512 [16]         VGG16                     9.1    17.1    0      43     9.1     9.1    9.1     9.1    0      7.6      11.32
RetinaNet           ResNet-50-FPN             30.7   49.3    2      65.5   21.3    16.1   8.5     12.9   1      25.7     23.3
RetinaNet           ResNet-101-FPN            30.6   48.7    7.1    64.7   20      15.9   11.8    10.7   2.9    38.7     25.1
RetinaNet           ResNeXT-101-32×8d-FPN     35.5   55      12.1   66.5   23.9    18.4   9.8     16.2   9.4    53.7     30
RetinaNet           ResNeXT-101-64×4d-FPN     31.4   50.2    8.9    66.3   20.8    15.3   9.4     14     2.2    32.4     25.1
R-CNN [13]          RPN prop. + VGG16         31.9   31.3    4.2    56.8   31.1    9.3    14.2    16.4   23.4   29.4     24.8
R-CNN [13]          Alexnet 7×, 300 prop.     32.4   27.2    5.1    56.9   28      9.8    13.6    12.4   17.9   35.6     23.9
R-CNN [13]          VGG16 7×, 300 prop.       37.3   30.3    7.2    60.6   41.5    15.8   21.5    13.7   22     33.3     28.4
R-CNN [13]          ContextNet (Alexnet 7×)   32.7   26.8    4.6    56.4   26.3    9.9    12.9    12.2   18.7   34       23.5
Fast RCNN           ResNet-50-C4              32.4   46.3    6.5    65.8   38.3    20.1   25.3    16.6   14.1   52       31.7
Fast RCNN           ResNet-50-FPN             37.4   47.3    7.3    68.9   46.7    21     32.1    17.1   9.3    45.9     33.3
Fast RCNN           ResNet-101-FPN            39.3   50.3    10.6   68.3   47.1    20.4   33.3    18.6   15.4   51.4     35.5
Fast RCNN           ResNeXT-101-32×8d-FPN     47.5   54.8    10.3   71.8   54      21.4   34.4    21.7   17.7   53.5     38.7
Fast RCNN           ResNeXT-101-64×4d-FPN     45.4   55.7    10.9   72.5   53.3    24     36.9    22.9   16     58.1     39.6
Faster R-CNN [16]   VGG16                     23.76  37.65   8.03   54     16.16   11.88  15.12   9.1    6.25   37.29    21.92
Faster RCNN         ResNet-50-C4              32.2   44.6    6.6    65.9   35.2    17.5   25.7    19.6   13.7   40       30.1
Faster RCNN         ResNet-50-FPN             35.7   49.9    7.3    68.4   48.9    18.8   29.6    14.7   11.4   53.3     33.8
Faster RCNN         ResNet-101-FPN            39.8   49.2    4.9    68.2   47      18.5   29.7    14     12.9   52.2     33.7
Faster RCNN         ResNeXT-101-32×8d-FPN     49.8   56.6    11.4   72.1   56.3    23.2   37      20.8   18.8   58.7     40.5
Faster RCNN         ResNeXT-101-64×4d-FPN     49.6   58.6    12.2   72.5   54.5    23.2   36.9    20.8   20.1   63.1     41.2

The values in bold represent the best among one-stage methods, and the ones in italics represent the highest among two-stage methods.


addition, if we compare with the one-stage methods, it is significantly lower than them. However, RoI align along with RPN performs well when the scales are changed. When it comes to the backbones, there is a small decrease in accuracy when changing from ResNet-50-FPN to ResNet-101-FPN, or from ResNeXT-101-32×8d-FPN to ResNeXT-101-64×4d-FPN, on objects of all scales, for both Faster RCNN and Fast RCNN. The VGG16 backbone has an impressive outcome compared to strong backbones such as ResNet or ResNeXT: although its accuracy is less than the two strong backbones, VGG16 is still better on objects in VOC_WH20 and changes little in accuracy when moving to objects with big sizes.

5.2. Processing Time and Resource Consumption. Tables 5 and 6 show the performance comparison of the evaluated models with their base networks. Generally, we see that RAM consumption in testing and training increases as more layers are added. This means that the deeper the network is, the higher the processing requirements, because depth increases the number of parameters and the time to process data as well. YOLO is the model consuming the least memory in both the training and testing phases: with Darknet-53, YOLO takes only 4–5 GB for training and 1.6–1.8 GB for testing. YOLO is also the only one able to run in real time, needing only about 0.03 s to 0.04 s to process an image, compared to more than 0.1 s and 0.2 s with Faster RCNN and RetinaNet. This allows us to deploy these models on devices with modest memory. While RetinaNet is assigned to the one-stage approach, it is not fast enough for real-time detection. The inference time of Fast RCNN is a little lower than those of Faster RCNN and RetinaNet; in contrast, the RAM consumption in training and testing of RetinaNet is lower than those of Fast RCNN and Faster RCNN. Of all the architectures, ResNet-50-C4 is the one requiring the most memory and time to process data, because the output size of ResNet-50-C4 is a bit bigger than the others [9]. If we consider ResNet or ResNeXT combined with FPN, Faster RCNN is over 100 MiB above Fast RCNN and 300 MiB above RetinaNet. In addition, according to Table 2, Faster RCNN and RetinaNet need less time for training, only a few hours to 1 day, rather than the 3–4 days of YOLO. This demonstrates that, if we pay attention to performance and do not have much time for training, we choose Faster RCNN or RetinaNet instead of YOLO. In contrast, if we only focus on processing speed while still achieving good performance, the one-stage methods are always the good choice. In the same context of backbones, RetinaNet uses fewer resources than Fast RCNN and Faster RCNN, by about 100 MiB and 300 MiB, respectively, at testing time. However, at training time RetinaNet uses much more memory than Fast RCNN, by about 2.8 GB, and than Faster RCNN, by about 2.3 GB, for ResNeXT-101-32×8d-FPN and ResNeXT-101-64×4d-FPN. If we consider this on the

Figure 4: The location of the default boxes at different scales: (a) image with GT boxes; (b) 8×8 feature map; (c) 4×4 feature map. Each default box predicts a location offset loc: Δ(cx, cy, w, h) and class confidences conf: (c1, c2, …, cp).


Table 4: The comparative results on the subsets of PASCAL VOC 2007 (mAP, %).

Approach   Method                                VOC_MRA_0058  VOC_MRA_010  VOC_MRA_020  VOC_WH20
One stage  YOLOv2 416 [16]                       3.02          31.38        42.89        18.52
           YOLOv2 448 [16]                       4.47          32.9         60.15        21.96
           YOLOv2 480 [16]                       4.26          33.48        60.78        26.67
           YOLOv2 512 [16]                       5.42          35.74        61.12        24.63
           YOLOv2 544 [16]                       6.97          36.56        63           26.62
           YOLOv2 640 [16]                       7.7           37.97        61.29        23.41
           YOLOv2 800 [16]                       10.24         37.3         61.91        26.9
           YOLOv2 1024 [16]                      10.69         29.93        55.14        28.97
           YOLOv3 320                            7.18          34.58        60.36        20.4
           YOLOv3 416                            10.2          38.97        62.53        24.12
           YOLOv3 608                            11.7          42.65        68.56        28.86
           SSD 300 [16]                          1.71          32.76        46.26        16.91
           SSD 512 [16]                          2.9           43.46        57.11        19.87
           RetinaNet-ResNet-50-FPN               8.84          41.5         50.2         28.14
           RetinaNet-ResNet-101-FPN              8.95          42.5         51.9         27.46
           RetinaNet-ResNeXT-101-32×8d-FPN       10.29         45.4         54.5         30.08
           RetinaNet-ResNeXT-101-64×4d-FPN       10.71         45.5         55.1         31.32
Two stage  Fast RCNN-ResNet-50-C4                0.23          1.32         49.9         3.93
           Fast RCNN-ResNet-50-FPN               0.63          1.35         55.6         3.45
           Fast RCNN-ResNet-101-FPN              0.39          1.59         57.6         3.12
           Fast RCNN-ResNeXT-101-32×8d-FPN       0.51          1.44         57.9         3.33
           Fast RCNN-ResNeXT-101-64×4d-FPN       0.29          1.42         57.3         3.76
           Faster RCNN-ResNet-50-C4              6.98          39.9         48.7         26.04
           Faster RCNN-ResNet-50-FPN             10.74         45.6         56.3         29.79
           Faster RCNN-ResNet-101-FPN            10.63         46.9         57.6         30.57
           Faster RCNN-ResNeXT-101-32×8d-FPN     11.64         47.3         57.6         32.12
           Faster RCNN-ResNeXT-101-64×4d-FPN     10.54         47.1         56.9         31.64
           Faster RCNN-VGG16 [16]                5.73          35.58        44.14        41.11

This table illustrates how well the models adapt to different scales of objects. The values in bold represent the best among one-stage methods, and the ones in italics represent the highest among two-stage methods.



small object dataset, it does not work out too well, because RetinaNet is about 10% lower than Faster RCNN in performance. Otherwise, on the different scales of the subsets, RetinaNet works well compared to Faster RCNN, and the difference is just 2–4 percentage points. Although the ResNet backbones combined with the others yield an improvement in accuracy, they do not work for YOLO on the small object dataset. YOLO with Darknet-53 utilizes more resources

Table 5: The comparison of consumption on the small object dataset.

Model        Backbone                 Inference time (s)  Test RAM (MiB)  Train RAM (MiB)
YOLOv3       Darknet-53               0.0331              1825            4759
YOLOv3       ResNet-50                0.027               1285            3479
YOLOv3       ResNet-101               0.0356              1829            5383
YOLOv3       ResNet-152               0.0454              2443            7531
RetinaNet    ResNet-50-FPN            0.102               2075            4435
RetinaNet    ResNet-101-FPN           0.127               2723            5577
RetinaNet    ResNeXT-101-32×8d-FPN    0.229               3767            7863
RetinaNet    ResNeXT-101-64×4d-FPN    0.292               3719            7813
Fast RCNN    ResNet-50-C4             0.3                 6449            5877
Fast RCNN    ResNet-50-FPN            0.089               2277            4455
Fast RCNN    ResNet-101-FPN           0.113               2947            5627
Fast RCNN    ResNeXT-101-32×8d-FPN    0.212               3987            4961
Fast RCNN    ResNeXT-101-64×4d-FPN    0.269               3885            4799
Faster RCNN  ResNet-50-C4             0.412               6609            6129
Faster RCNN  ResNet-50-FPN            0.101               2387            5381
Faster RCNN  ResNet-101-FPN           0.124               3001            6487
Faster RCNN  ResNeXT-101-32×8d-FPN    0.256               4027            5333
Faster RCNN  ResNeXT-101-64×4d-FPN    0.286               4003            5246

Table 6: The comparison of consumption on the subsets filtered from PASCAL VOC.

Model        Backbone                 Inference time (s)  Test RAM (MiB)  Train RAM (MiB)
YOLOv3       Darknet-53               0.027               1645            4079
RetinaNet    ResNet-50-FPN            0.1                 1935            4133
RetinaNet    ResNet-101-FPN           0.116               2585            5435
RetinaNet    ResNeXT-101-32×8d-FPN    0.222               3641            7723
RetinaNet    ResNeXT-101-64×4d-FPN    0.284               3561            7599
Fast RCNN    ResNet-50-C4             0.495               6371            5677
Fast RCNN    ResNet-50-FPN            0.092               2131            4387
Fast RCNN    ResNet-101-FPN           0.114               2819            5463
Fast RCNN    ResNeXT-101-32×8d-FPN    0.213               3873            4637
Fast RCNN    ResNeXT-101-64×4d-FPN    0.265               3735            4575
Faster RCNN  ResNet-50-C4             0.26                6141            5991
Faster RCNN  ResNet-50-FPN            0.1                 2245            5207
Faster RCNN  ResNet-101-FPN           0.13                2855            6335
Faster RCNN  ResNeXT-101-32×8d-FPN    0.225               3943            5087
Faster RCNN  ResNeXT-101-64×4d-FPN    0.276               3885            4909

Figure 5: Highlight of bounding boxes from comparative backbones on the small object dataset. We select YOLO with Darknet-53 and ResNet-50 for an objective comparison because the two networks have obviously the same number of layers, along with the significant techniques such as skip connections and residual blocks. The bounding boxes show that ResNet-50 is more sensitive than Darknet-53 to areas which resemble the objects of interest. Similarly, ResNet-50-FPN and ResNet-50-C4 are chosen for consideration; the detections show that combining ResNet-50 with FPN yields a better performance than the original one. Particularly, with ResNet-50-C4, misdetection happens more densely than with ResNet-50-FPN, such as in columns 4 and 5. Zoom in to see more detail.


than the ResNet ones, but it has the best accuracy among the models. Therefore, we only test YOLO with Darknet-53 on the subsets of PASCAL.
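As a side note on methodology, per-image inference times like those in Tables 5 and 6 can be obtained with a simple wall-clock loop; the sketch below is a generic illustration (detect_fn and images are placeholders for whichever model and data are being profiled, and GPU pipelines additionally need a device synchronization before each timestamp, omitted here):

```python
import time

def mean_inference_time(detect_fn, images, warmup=10):
    """Average wall-clock seconds per image for a detector callable."""
    for img in images[:warmup]:        # warm-up runs: caches, cudnn autotune
        detect_fn(img)
    start = time.perf_counter()
    for img in images[warmup:]:
        detect_fn(img)
    return (time.perf_counter() - start) / max(1, len(images) - warmup)
```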

5.3. Analyses of the Trade-Off among Detectors. Network designs and approaches like the one-stage approach prove their performance when applied to detect general objects, at both small scales and other kinds of scales. Although they are fast and accurate, there is still a drawback always present in these models, namely the trade-off between accuracy and processing speed. For example, YOLOv3 proposes the idea of performing detection at three different scales, and this result is obviously impressive and yields good performance; however, to gain this advantage, YOLOv3 has to sacrifice processing time: instead of the input normally being processed one time for detection, as in YOLOv2, this idea must work 3 times. This trade-off is also partly affected by the resolution, as we change it during training or testing our models. In our previous work, we mentioned that we have to choose the right resolution to ensure our models work properly. In the case of the two-stage approaches, the idea of proposing region proposals to improve the localization of objects for detection is good as well. This is useful, but we have to take into account whether we should generate proposals on feature maps or directly on input images, because this strongly affects the way models run and identify the representations of objects. If objects are normal or have a big or medium appearance, the models work well; but if objects are at multiple scales, this is a problem to consider and research deeply in order to balance the performance as well as improve it. Therefore, to partly fix this problem, the one-stage approach allows us to choose a fixed input size for training and testing, but the benefit still depends on the characteristics of the datasets which we evaluate and on the image size. After all, all the models we chose to evaluate are affected by the scales of objects: when we change the scale, the accuracy of the models changes a lot, except for Faster RCNN, the only model that seems to be stable with respect to scale, especially when combined with the VGG16 architecture. Although the accuracy of VGG16 is not better than the other architectures, the difference is that it does not change too much in accuracy. This only holds for big objects whose bounding box overlaps more than 10% of the image; otherwise, it is not assured.

Figure 1 shows that the possibility of small objects is higher than that of other objects: the black length of the camera is somehow similar to the black mouse placed on a mouse pad. This possibility of small object presence causes more difficulties for detectors and leads to wrong detections. Small objects can appear anywhere in an image, with the result that detectors produce many wrong detections of familiar appearances which they have seen. If we consider the visualization of the detections in Figure 4, the wrong detections are partly similar to the appearance of the other objects in the dataset. This problem is caused by the data imbalance between classes and between the instances in each class, originally known as the foreground-foreground class imbalance. In other words, the common problems, which happen not only with small objects but also for whole datasets, are intraclass similarity and interclass variation.

6. Conclusion

Small object detection is a challenging and interesting problem in the task of object detection and has drawn attention from researchers thanks to the development of deep learning, which is the motivation for improving the performance of tasks in computer vision. Although deep detection models originally tend to solve problems relating to general object detection, they still work with a particular level of success on small object detection. As an evaluation work on small object detection for deep models, our goal is to highlight the remarkable achievements of popular and state-of-the-art deep models in order to provide a variety of views on applying deep models to small object detection. Particularly, we evaluate state-of-the-art detectors based on deep learning from the two approaches, namely YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, on two datasets, namely the small object dataset and the subsets filtered from PASCAL VOC, with respect to the effects of different factors, objectively including accuracy, execution time, and resource usage.

In spite of the successful achievements in recent years, with the performance of detection improving significantly, there is still a huge gap in accuracy between normal objects and small objects. Under the criteria of the COCO dataset, the difference from the small scale to the medium and big scales is too large. Most models are good at detecting normal objects, and problems are going to happen when applying them to detect small objects. As a result, to reduce the gap in small object detection, the first thing to do is to invest in datasets which have massive amounts of data to train models and a wide range of categories, to compete with the human visual system, like [12, 34].

So far, detection models are divided into two main approaches, namely the one-stage approach and the two-stage approach. Models of the one-stage approach are known as detectors which have faster and more efficient detection in comparison to the other approach. The efficiency here is the potential power to run in real time and the ability to apply them to practical applications; however, the trade-off between accuracy and speed is a difficult challenge which needs to be taken into account in order to close the gap. Models of the two-stage approach, on the other hand, have their reputation as region-based detectors which have high accuracy but are too slow to apply to the real world; this drawback comes from the computation of their networks.

Through our evaluation, there is the fact that the architectures utilized as base networks to extract deep features have significant effects on the frameworks. The deeper the architecture is, the higher the accuracy of detection. Once a network increases in depth, it has more layers than normal ones and will have massive numbers of parameters to train; hence, it needs a lot of data to fine-tune these parameters reasonably. If there is an increase in computation, resource consumption will also increase. As a result, it


will be difficult to apply them in practical applications. Besides, the exploitation of context in models is definitely limited; as a result, much useful and informative data are ignored in training, especially the context of small objects. Because small objects are able to appear anywhere in an input image, if the image is well exploited with its context, the performance of small object detection will be improved.

For all the above reasons, and according to our evaluation, if we want good performance and ignore the speed of processing, two-stage methods like Faster RCNN perform well and demonstrate their network design on different datasets in many contexts of objects, including multiscale objects; therefore, Faster RCNN is considered a giant baseline to base on or develop from. If our target is a balance of accuracy and speed, YOLO is a good one, in case we do not care about the training time, because its sacrifice between speed and accuracy is worth applying it to practical applications; otherwise, Faster RCNN or RetinaNet is still a substitute to work on. When it comes to backbones, we have to consider the data in order to choose a reasonable backbone to combine with the methods, because the amount of data will significantly impact the model: if data are not abundant, a shallow network will fit it well. Besides, there is recently a novel approach promising for training deep models with less data, that is, weakly supervised learning, such as zero-shot, one-shot, or few-shot learning; these approaches will be considered in our future works. Following our recent research, to have better performance on object detection, we have to consider several factors to improve the mAP, such as multiscale training, superresolution for scaling up the visual information of small objects [35], or preprocessing data to avoid data imbalance, because there is a wide range of imbalance problems relating to data [33].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the Vietnam National University Ho Chi Minh City (VNU-HCM) under grant no. B2017-26-01.

References

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, Columbus, OH, USA, June 2014.

[2] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proceedings of the European Conference on Computer Vision, pp. 346–361, Springer, Zurich, Switzerland, September 2014.

[3] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, Santiago, Chile, December 2015.

[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, Las Vegas, NV, USA, June 2016.

[5] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," 2016, https://arxiv.org/abs/1612.08242.

[6] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.

[7] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, "Focal loss for dense object detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[8] K. Zidek, A. Hosovsky, J. Pitel', and S. Bednar, "Recognition of assembly parts by convolutional neural networks," in Advances in Manufacturing Engineering and Materials, pp. 281–289, Springer, Cham, Switzerland, 2019.

[9] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, IEEE, Venice, Italy, October 2017.

[10] L.-C. Chen, A. Hermans, G. Papandreou et al., "Instance segmentation by refining object detection with semantic and direction features," 2017, https://arxiv.org/abs/1712.04837.

[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.

[12] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft COCO: common objects in context," in Proceedings of the European Conference on Computer Vision, pp. 740–755, Springer, Zurich, Switzerland, September 2014.

[13] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, "R-CNN for small object detection," in Proceedings of the Asian Conference on Computer Vision, pp. 214–230, Springer, Taipei, Taiwan, November 2016.

[14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.

[15] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," in Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS'15), pp. 91–99, MIT Press, Cambridge, MA, USA, 2015, http://dl.acm.org/citation.cfm?id=2969239.2969250.

[16] P. Pham, D. Nguyen, T. Do, T. D. Ngo, and D.-D. Le, "Evaluation of deep models for real-time small object detection," in Proceedings of the International Conference on Neural Information Processing, pp. 516–526, Springer, Guangzhou, China, November 2017.

[17] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.

[18] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu, "Traffic-sign detection and classification in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2110–2118, Las Vegas, NV, USA, June 2016.

[19] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: a large data set for nonparametric object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.

[20] A. Kembhavi, D. Harwood, and L. S. Davis, "Vehicle detection using partial least squares," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1250–1265, 2011.

[21] V. I. Morariu, E. Ahmed, V. Santhanam, D. Harwood, and L. S. Davis, "Composite discriminant factor analysis," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 564–571, IEEE, Steamboat Springs, CO, USA, March 2014.

[22] A. Andreas, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 2012.

[23] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: human trajectory prediction in crowded spaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971, Las Vegas, NV, USA, June 2016.

[24] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, "Sun database: exploring a large collection of scene categories," International Journal of Computer Vision, vol. 119, no. 1, pp. 3–22, 2016.

[25] E. Dong, Y. Zhu, Y. Ji, and S. Du, "An improved convolution neural network for object detection using YOLOv2," in Proceedings of the 2018 IEEE International Conference on Mechatronics and Automation (ICMA), pp. 1184–1188, IEEE, Changchun, China, August 2018.

[26] W. Liu, D. Anguelov, D. Erhan et al., "SSD: single shot multibox detector," in Proceedings of the European Conference on Computer Vision, pp. 21–37, Springer, Amsterdam, The Netherlands, October 2016.

[27] T.-Y. Lin, P. Dollar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, p. 4, Honolulu, HI, USA, July 2017.

[28] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: deconvolutional single shot detector," 2017, https://arxiv.org/abs/1701.06659.

[29] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision, pp. 210–226, Springer, Munich, Germany, September 2018.

[30] L. Liu, W. Ouyang, X. Wang et al., "Deep learning for generic object detection: a survey," 2018, https://arxiv.org/abs/1809.02165.

[31] P. Zhu, L. Wen, X. Bian, L. Haibin, and Q. Hu, "Vision meets drones: a challenge," 2018, https://arxiv.org/abs/1804.07437.

[32] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: a review," 2018, https://arxiv.org/abs/1807.05511.

[33] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas, "Imbalance problems in object detection: a review," 2019, https://arxiv.org/abs/1909.00169.

[34] O. Russakovsky, J. Deng, H. Su et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[35] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 206–221, Munich, Germany, September 2018.


Page 6: AnEvaluationofDeepLearningMethodsforSmall ObjectDetection

(e detail analyses of the YOLO approaches as a premise toapply it into practical applications are as follows

YOLOv1 [4] is widely known that YOLO an unified orone-stage network is a completely novel approach based onan idea that aims to tackle object detection in real timeproposed by Redmon et al implying that instead of per-forming the task of object detection like the previoustechniques based on complex tasks such as [1 4] which useexhausted sliding window and then feed outputs of this toclassifiers performing at equally spaced locations over thewhole image or region proposals to generate bounding boxeswhich possibly contain objects and then feed them toconvolutional neural networks YOLO considers objectdetection to be a regression problem simultaneously givingthe prediction for various coordinates of bounding boxesand class probabilities for these boxes (e key idea toperform the detection of YOLO is that YOLO separatesimages into grid views which push the running time as wellas accuracy in localizing objects of YOLO(e goal of YOLOis to deal with two problems namely what objects arepresented and where they are in an image (e summari-zation of YOLO operation proceeds with three principalsteps simply and straightforwardly Firstly YOLO takes aninput image resized to a fixed size then works a singleconvolutional network as a unified network on the imageand ultimately puts a threshold on the resulting detectionsby the confidence score of the model YOLO runs at 45 fpson GPU and the smaller Fast YOLO reaches 150 fps (isprocessing can run steaming video in real time Although thedesign of YOLO architecture affords end-to-end trainingand real-time detection it still keeps high average precision

(e network divides the input image into a Stimes S gridwhere Stimes S is equal to the width and height of the tensorwhich presents the final prediction In case the center of anobject is in a grid cell the gird cell takes responsibility fordetecting that object Moreover each gird cell is simulta-neously responsible for predicting bounding boxes andconfidence scores which present how confident the model ofbounding box contains an object as well as how accurate itindicates the bounding box is predicted

(e drawback of YOLO is that it lags behind the state-of-the-art detection systems about accuracy but is better thanthose about running time Itmakes less than half the number ofbackground errors compared to Fast R-CNN YOLO is highlygeneralizable so it can quickly identify objects in an image butit usually struggles to precisely localize some objects especiallysmall ones (erefore the author introduced YOLOv2 toimprove performance and fix drawbacks of YOLO as well

YOLOv2 [5] has a number of various improvements fromYOLOv1 Similarly to the origin YOLOv2 runs on differentfixed sizes of an input image but it introduced several newtraining methods for object detection and classification suchas batch normalization multiscale training with the higherresolutions of input images predicting final detection onhigher spatial output and using good default bounding boxesinstead of fully connected layers

However this offers a trade-off between speed and ac-curacy (e details of the mAP improvements in PASCALVOC 2007 are shown in Figure 2

(ese novel improvements allow YOLOv2 to train onmulticlass datasets like COCO or ImageNet In addition itwas attempted to train the detector to detect over 9000different object classes YOLOv2 uses a network architecturecustomized from the original network YOLOv2 mainlyconcentrates on a way of improving recall and localizationwhile still receiving high accuracy of classification incomparison with state-of-the-art detectors and the originYOLO significantly makes more localization errors but is farless likely to predict false detections on places where nothingexists Although YOLOv2 has accuracy improvementsYOLOv2 does not work well on small objects because theinput downsampling results in the low dimension of thefeature map which is used for the final prediction To solvethese problems recently the author introduces YOLOv3with significant improvements on object detection espe-cially on small object detection Generally a variety of latestnetworks tend to be toward deeper and yield good per-formance on their tasks with deep features learned fromnumerous layers

YOLOv3 [6] is one of these approaches instead of usingDarknet-19 like two old versions [4 5] YOLOv3 develops adeeper network with 53 layers called Darknet-53 and com-bines the network with state-of-the-art techniques such asresidual blocks skip connections and upsampling (e re-sidual blocks and skip connections are very popular in ResNetand relative approaches and the upsampling recently alsoimproves the recall precision and IOU metrics for objectdetection [25] For the task of detection 53 more layers arestacked onto it giving a 106-layer fully convolutional un-derlying architecture for YOLOv3 (is is the reason behindthe slowness of YOLOv3 compared to YOLOv2

Second YOLOv3 enables the detector to predict objectsat three different outputs with three different scales ratherthan just one prediction at the last layer of the networksimilar to its competitor SSD [26] which has improved a lotof performance on a low resolution image (is is useful topick up diverse outcomes in order to improve performanceof detection (e final output is created by applying a 1times 1kernel on a feature map Particularly the detection is doneby applying 1times 1 detection kernels on feature maps of threedifferent sizes at three different places in the network partlysimilar to feature pyramid networks (FPNs) [27]

(ird YOLOv3 still keeps using K-means to generateanchor boxes but instead of fully applying 5 anchor boxes atthe last detection YOLOv3 generates 9 anchor boxes andseparates them into 3 locations Each location applies 3anchor boxes hence there are more bounding boxes perimage For example if we have an image of 416times 416YOLOv2 predicts 13times 13times 5 845 boxes in YOLOv3 thenumber of boxes is 10647 implying that YOLOv3 predicts10 times the number of boxes compared to YOLOv2

Fourth YOLOv3 also changes the way to calculate thecost function If the anchor overlaps a ground truth morethan other bounding boxes the corresponding objectnessscore should be 1 For other anchor boxes with overlapgreater than a predefined threshold 05 they incur no costEach ground truth is only associated with one boundary boxIf a bounding box is not assigned it incurs no classification

6 Journal of Electrical and Computer Engineering

and localization lost just confidence loss on objectness (eloss function in previous YOLO looks like

λcoord 1113944

S2

i01113944

B

j01objij xi minus 1113954xi( 1113857

2+ yi minus 1113954yi( 1113857

21113960 1113961

+ λcoord 1113944

S2

i01113944

B

j01objij

wi

radicminus

1113954wi

1113969

1113874 11138752

+

hi

1113969

minus

1113954hi

1113969

1113874 11138752

1113890 1113891

+ 1113944S2

i01113944

B

j01objij Ci minus 1113954Ci1113872 1113873

2

+ λnoobj 1113944S2

i01113944

B

j01noobjij Ci minus 1113954Ci1113872 1113873

2

+ 1113944S2

i01obji 1113944

cisinclassespi(c) minus 1113954pi(c)( 1113857

2

(1)Currently instead of using mean square error in calcu-

lating the classification loss at the last three terms YOLOv3uses binary cross-entropy loss for each label In other wordsYOLOv3 makes its prediction of an objectness score and classprediction for each bounding box using logistic regression

(ere is no more softmax function for class prediction(e reason is that the most currently used classifiers assumethat predicted labels are independent and mutually exclusiveimplying that if an object belongs to one class then it cannotbelong to the other and this is solely true if output predictionis really mutual nevertheless in case dataset has multilabelclasses and there are labels which are not nonexclusive such aspedestrian and person At the time the sum of possibilityscores may be greater than 1 if the classifier is softmax soYOLOv3 alternates the classifier for class prediction from thesoftmax function to independent logistic classifiers to cal-culate the likeliness of the input belonging to a specific label

36 Single Shot MultiBox Detector Single Shot MultiBoxDetector (SSD) [26] is a single shot detector using a singleand one-stage deep neural network designed for objectdetection in real time By comparison the state-of-the-artmethod in two-stage processing Faster RCNN uses itsproposed network to generate object proposals and utilizesthose to classify objects in order to be toward real-timedetection instead of using an external method but the wholeprocess runs at 7 FPS SSD enhances the speed of running

(a)

(b)

(c)

(d)

Figure 2 (e visualization of detectors with the strongest backbones on subsets of PASCAL VOC_MRA_058 VOC_MRA_10VOC_MRA_20 and VOC_WH_20 in order respectively (a) YOLO Darknet-53 (b) Faster RCNN ResNeXT-101-64times 4d-FPN (c)RetinaNet ResNeXT-101-64times 4d-FPN (d) Fast RCNN ResNeXT-101-64times 4d-FPN

Journal of Electrical and Computer Engineering 7

time faster than the previous detectors by eliminating theneed of the proposal network(erefore it causes a few dropin mAP and SSD compensates this by applying some im-provements including multiscale features and default boxes(ese improvements allow SSD to gain the same of FasterRCNN using lower resolution images which then furtherspeeds up the processing of SSD For 300times 300 input imageas the best version SSD gets 772 mAP at 46 FPS betterthan Faster R-CNN 732 and a little smaller than the bestversion of YOLOv2 554times 554 input image and 786 mAPat 40 FPS on VOC 2007 on Nvidia Titan X

Similarly SSD consists of 2 parts namely extraction offeature maps and use of convolution filters to detect objectsSSD uses VGG16 as a base network to extract feature maps(en it combines 6 convolutional layers to make predictionEach prediction contains a bounding box and N+ 1 scoresfor each class where N is the number of classes and one forextraclass for no object Instead of using a region proposalnetwork to generate boxes and feed to a classifier forcomputing the object location and class scores SSD simplyuses small convolution filters After the VGG16 base net-work extracts features from feature maps SSD applies 3times 3convolution filters for each cell to predict objects Each filtergives an output including N+ 1 scores for each class and 4attributes for one boundary box

SSD has a difference from previous approaches at thesame time and it makes prediction on multiscale featuremaps for detection independently rather than just one lastlayer (e CNN network spatially reduces the dimension ofthe image gradually leading to the decrease in the resolutionof the feature maps As mentioned SSD uses a lower inputimage to detect objects hence early layers are used to detectsmall objects and lower resolution layers to detect largerscale objects progressively Besides SSD applies differentscales of default boxes to different layers and for intuitivevisualization in Figure 3 Particularly the only blue defaultbox on 8times 8 feature map fits to the ground truth of the catand the only red one on 4times 4 feature map matches to theground truth of the dog

Although SSD has significant improvements in objectdetection as integrating with these above parts SSD is notgood at detecting small objects which can be improved byadding deconvolution layers with skip connections to in-troduce additional large-scale context [28] Generally SSDoutperforms Faster RCNN which is a state-of-the-art ap-proach about accuracy on PASCAL VOC and COCO whilerunning at real-time detection

37CNNDrawbacks Most of the CNNmodels are currentlydesigned by the hierarchy of various layers such as con-volutional and pooling layers that are arranged in a certainorder not only on small networks but also on multilayernetworks to state-of-the-art networks Along with theselayers fully connected layers are added behind and known asFC layers (e block consisting of FC layers and previouslayers is designated as feature extractors and it outputs keyfeatures of objects of interest as an input for classifierscoming behind However deeply going through many kinds

of layers is a way that is not good for small object detectionbecause in the task of small object detection objects ofinterest are objects owning small sizes and appearanceBesides small objects unlike normal or big objects which areless affected by resizing the image or passing lots of differentlayers are very vulnerable to the changes in image sizesWhen an image passes a convolutional layer the size of theimage will be decreased by receptive fields that slide over theimage to extract useful features (is does not affect smallobjects if there are just a few layers but in a CNN networkwe have many layers like this and it is very hard for smallobjects Still if small objects just go through convolutionallayers it will not be anything to mention Small objectswhich just have a few informative presence have to passpooling layers which help in avoiding overfitting and re-ducing computational costs by decreasing a number ofparameters To do this these layers use fixed sliding windowsthat care about a fixed target that is identified before such asmaximum or average calculations of valuables For thesereasons GAN is an approach that may alter the CNN ap-proach because of its advantages We can take advantages ofa way that the approach generates data to overcome thelimitations of data of small objects for the training phaseAlthough images still have to pass layers such as convolu-tional and pooling layers in this context the network justhas less layers compared to others Bai el al [29] haveproposed to apply MTGAN to detect small objects by takingcrop inputs from a processing step made by baseline de-tectors such as Faster RCNN [15] or Mask RCNN [9]

Because of mentioned reasons and following the survey[30] Liu et al have presented numerous works of survey andevaluation but there are no works that do with small objectsin them(erefore in this work we assess popular and state-of-the-art models to find out pros and cons of these modelsParticularly we evaluate 4 deep models such as YOLOv3RetinaNet Fast RCNN and Faster RCNN with several basenetworks for small object detection with different scales ofobjects In these models YOLOv3 and RetinaNet belong tothe one-stage approach Fast RCNN and Faster RCNN are inthe two-stage approach We choose these models becauseYOLOv3 is the model with combination of state-of-the-arttechniques and RetinaNet is the model with a new lossfunction which penalizes the imbalance of classes in adataset Besides we choose RetinaNet to make comparisonsbetweenmodels in the same approach Similarly Fast RCNNand Faster RCNN are the same and both models are in thesame approach and have nearly the similar pipeline in objectdetection (ere is a difference is that Fast RCNN utilizes anexternal proposal to generate object proposals based oninput images However Faster RCNN proposes its ownnetwork to generate object proposals on feature maps andthis makes Faster RCNN train end-to-end easily and workbetter

4. Experimental Evaluation

In this section, we present our experimental settings and the datasets used for evaluation.


4.1. Experimental Setting. We continue to train and evaluate various object detectors on two datasets: PASCAL VOC [11] and a newly generated dataset [16]. The evaluated approaches consist of Faster RCNN [15], YOLOv3 [6], and RetinaNet [7] with different backbones. Except for YOLOv3, the models are trained and evaluated with the Detectron Python code.

Currently, the original datasets commonly used in object detection are PASCAL VOC [11] and COCO [12]. Both datasets are constructed mostly of large objects or other kinds of objects whose size fills a big part of the image, so these two datasets are not suitable for small object detection. In addition, there is another dataset that is large-scale and includes many classes for small object detection, collected by drones and named the VisDrone dataset [31]. However, it does not publish the labels of its test set for evaluation, and the views of its images are top-down, which is not our case. As a result, in order to evaluate the detection performance of the models, we use the dataset published in [13]. This dataset is called the small object dataset and is a combination of the COCO [12] and SUN [24] datasets. There are 10 classes in the small object dataset: mouse, telephone, switch, outlet, clock, toilet paper (t paper), tissue box (t box), faucet, plate, and jar. The whole dataset consists of 4925 images in total, with 3296 images for training and 1629 images for testing. The mouse class owns the largest number of instances (2137 instances in 1739 images), and the tissue box class has the smallest number (103 instances in 100 images). Apart from the small object dataset, we also filter subsets from PASCAL VOC 2007 following standard definitions. On PASCAL VOC, there are 20 classes, but for small object detection there are fewer classes under strict definitions of small objects. Table 1 lists the number of small objects and the images containing them for each subset.

We trained all models on the small object dataset with the same parameters. Particularly, in the training phase, we trained the models for 70k iterations with the parameters, including momentum, decay, gamma, learning rate, batch size, step size, and training days, listed in Table 2. At first, we attempted to start off the models with a higher learning rate of 10^-2, but the models diverged, leading to the loss value being NaN or Inf after the first 100 iterations. Then we tried a lower learning rate of 10^-3 for the first 100 iterations and raised it to 10^-2 to see whether the models could converge when starting off at a lower learning rate; however, nothing changed. We also saw that the models converged quickly during the first 10k iterations with 10^-3 and then progressively slowed down after 20k. Therefore, we decided to start off the training with a learning rate of 10^-3 and decrease it to 10^-4 and 10^-5 at 25k and 35k iterations, respectively. This setting shows that the loss value was stable from 40k, but we set the training up to 70k to see how the loss value changes and saw that it did not change much after 40k iterations. We tried to evaluate the models from 30k to 70k, and generally the performance of the models was not stable after 40k iterations. For this reason, we picked the weights for evaluation at 30k and 40k iterations: at 30k iterations YOLO achieves its best results, and the others get their best at 40k iterations. In the case of the subsets of PASCAL VOC 2007, we combine the train and validation sets from PASCAL VOC 2007 and 2012 to form a training set; PASCAL VOC 2012 works as a data augmentation set for PASCAL VOC 2007. We use this combined training set to train all models and test them on the subsets. All models are trained with the same parameters. First, due to the limitation of memory, we rescale all images to the same size, with the shortest side 600 and the longest side 1000, as in [15].
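As a minimal sketch of this step schedule (assuming only the values stated above: base learning rate 10^-3, gamma 0.1, and steps at 25k and 35k iterations):

    from math import isclose

    def learning_rate(iteration, base_lr=1e-3, gamma=0.1, steps=(25_000, 35_000)):
        # Piecewise-constant step decay used for the small object dataset runs.
        lr = base_lr
        for step in steps:
            if iteration >= step:
                lr *= gamma
        return lr

    assert isclose(learning_rate(10_000), 1e-3)
    assert isclose(learning_rate(30_000), 1e-4)
    assert isclose(learning_rate(40_000), 1e-5)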

In YOLOv3, we run the K-means clustering algorithm in order to initialize 9 suitable default bounding boxes for the training and testing phases on our selected datasets, and we changed the anchor values accordingly. The following are the 9 anchors for the small object dataset after running the K-means algorithm: (10.3459, 14.4216), (26.2937, 19.0947), (21.4024, 36.3180), (47.9317, 29.1237), (40.4932, 63.7489), (83.6447, 51.3203), (72.2167, 119.9181), (172.7416, 117.0773), and (124.6597, 252.8465).
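The clustering itself can be sketched as follows, using the standard 1 - IoU distance from YOLOv2/v3; the box data below are synthetic placeholders, and the real input would be the (width, height) pairs of all ground-truth boxes in the training set:

    import numpy as np

    def iou_wh(boxes, anchors):
        # IoU between boxes and anchors when both are centered at the origin.
        inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
                 np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
        union = ((boxes[:, 0] * boxes[:, 1])[:, None] +
                 (anchors[:, 0] * anchors[:, 1])[None, :] - inter)
        return inter / union

    def kmeans_anchors(boxes, k=9, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        anchors = boxes[rng.choice(len(boxes), k, replace=False)].astype(float)
        for _ in range(iters):
            assign = np.argmax(iou_wh(boxes, anchors), axis=1)  # nearest = highest IoU
            for j in range(k):
                if np.any(assign == j):
                    anchors[j] = np.median(boxes[assign == j], axis=0)
        return anchors[np.argsort(anchors.prod(axis=1))]

    boxes = np.abs(np.random.default_rng(1).normal(60, 40, (500, 2))) + 1
    print(kmeans_anchors(boxes))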

In Faster R-CNN, to compare fairly with the prior work and to deploy on different backbones, we directly reuse the anchor scales and aspect ratios from the paper [13], namely, anchor scales of 16×16, 40×40, and 100×100 pixels and aspect ratios of 0.5, 1, and 2, instead of having to cluster a set of default bounding boxes as in YOLOv3. Similarly, in RetinaNet we keep the default training settings, namely, loss gamma 2.0, loss alpha 0.25, anchor scale 4, and 3 scales per octave, following the authors, since this configuration has optimized values.
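For reference, the focal loss with these settings can be written as FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t); the numpy sketch below is our own illustration of the formula, not the Detectron implementation:

    import numpy as np

    def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
        # p: predicted foreground probability; y: binary ground-truth label.
        p_t = np.where(y == 1, p, 1.0 - p)
        alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
        return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)

    # Easy, well-classified examples are strongly down-weighted relative to
    # plain cross-entropy, which is the point of the loss.
    print(focal_loss(np.array([0.9, 0.1]), np.array([1, 0])))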

4.2. Our Newly Generated Dataset. To have an objective comparison, we also use our newly generated dataset, whose information is shown in Table 1. We use it to consider the effects of object sizes on factors including the models, processing time, accuracy, and resource consumption. The dataset consists of 4 subsets filtered from PASCAL VOC 2007, namely, VOC_WH_20, VOC_MRA_058, VOC_MRA_10, and VOC_MRA_20.

Figure 3: mAP of YOLOv2 on VOC2007 as each part is added on the path from YOLO to YOLOv2 (batch norm, hi-res classifier, convolutional, anchor boxes, new network, dimension priors, location prediction, passthrough, multiscale, hi-res detector): 63.4, 65.8, 69.5, 69.2, 69.6, 74.4, 75.4, 76.8, 78.6 [5].


Detailed information on the subsets is provided as follows (a small filtering sketch is given after the list):

(i) VOC_WH_20 contains objects whose width and height are less than 20% of the image's width and height. This subset has two classes fewer than PASCAL VOC 2007 (dining table and sofa) because of the constraint of the definition.

(ii) VOC_MRA_058, VOC_MRA_10, and VOC_MRA_20 comprise objects occupying a maximum mean relative area of the original image under 0.58%, 1.0%, and 2.0%, respectively. Two of them have the same classes as PASCAL VOC 2007; the exception is VOC_MRA_058, which has four classes fewer (dining table, dog, sofa, and train).
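A hedged sketch of the two filtering rules follows; the function names and the exact form of the mean-relative-area test are our own reading of the definitions, not the released filtering scripts:

    def in_voc_wh_20(box_w, box_h, img_w, img_h):
        # VOC_WH_20: both box dimensions under 20% of the image dimensions.
        return box_w < 0.2 * img_w and box_h < 0.2 * img_h

    def in_voc_mra(relative_areas, threshold):
        # VOC_MRA_*: the mean relative area of an object's instances stays
        # under the threshold (0.0058, 0.010, or 0.020).
        return sum(relative_areas) / len(relative_areas) < threshold

    print(in_voc_wh_20(50, 40, 500, 375))       # True: 50 < 100 and 40 < 75
    print(in_voc_mra([0.004, 0.006], 0.0058))   # True: mean 0.005 < 0.0058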

5. Results and Analyses

In this section, we show the results achieved through the experimental phase. All models mentioned in this section, except for models cited from other papers, were trained in the same environment with 1 GPU: Ubuntu 16.04.4 LTS, Intel(R) Xeon(R) Gold 6152 CPU @ 2.10 GHz, and a Tesla P100 GPU. In addition to the comparative accuracy, other comparisons are also provided to make our assessment objective and clear.

5.1. Accuracy

5.1.1. Small Object Dataset. Following the detection results in Table 3, methods belonging to the two-stage approach outperform those of the one-stage approach by about 8-10%. Specifically, Faster RCNN with the ResNeXT-101-64×4d-FPN backbone achieved the top mAP among two-stage approaches, and the top of the table as well, with 41.2%; in comparison, the top one-stage method, YOLOv3 608×608 with Darknet-53, obtained 33.1%. Following [32], methods based on region proposals, such as Faster RCNN, are better than methods based on regression or classification, such as YOLO and SSD; this holds once again in the context of the small object dataset.

For the methods in each approach: firstly, in the two-stage approaches, Faster RCNN, an improvement of Fast RCNN, is only about 1-2% better than Fast RCNN, and only with the ResNeXT backbones; it is equal to Fast RCNN for the rest. The difference here is not large, meaning that an external region proposal method like selective search combined with RoI pooling is as good as an internal region proposal network like RPN with RoI Align in this case. Besides, compared with R-CNN, we perceive a boost of 8-10% when RoI pooling or RoI Align is added: R-CNN, which takes region proposals from selective search, feeds them into the network, and directly computes features from fc (fully connected) layers, reaches only 23.5% with Alexnet and 24.8% with VGG16 combined with proposals from RPN. However, Fast RCNN and Faster RCNN, with their two kinds of RoI layers, are much better: Fast RCNN reaches accuracies in a range of 31.7% to 39.6% depending on the backbone, and similarly Faster RCNN gets 30.1% to 41.2%. Secondly, in the one-stage approaches, YOLO outperforms SSD and RetinaNet: YOLO gets the highest outcome, 33.1%, while SSD and RetinaNet get 11.32% and 30%, respectively. YOLO and SSD are considered state-of-the-art methods in speed while sacrificing accuracy; however, there is a large difference in accuracy between YOLO and SSD, and the difference is that SSD adds multiple convolutional layers behind the backbone, each layer with its own ability, instead of using 2 fully connected layers like YOLO. Although RetinaNet is assigned to the one-stage approach, it cannot run in real time. RetinaNet is the one proposed to deal with the imbalance between foreground and background via the focal loss; therefore, RetinaNet obtains a higher accuracy in comparison with the others, except for YOLOv3 (Darknet-53).

When it comes to the backbones, we see that Darknet-53 is the best among one-stage, real-time methods, far above even ResNet-50, although the two have a similar number of layers. In contrast, ResNeXT combined with FPN is the most powerful backbone in both one-stage and two-stage methods if we only consider accuracy.

Table 1: The information of the subsets.

Subsets     | Classes | Images | Instances
VOC_MRA_058 | 16      | 329    | 529
VOC_MRA_10  | 20      | 2231   | 5893
VOC_MRA_20  | 20      | 2970   | 7867
VOC_WH_20   | 18      | 1070   | 2313

Table 2: The parameters of the models.

Method       | Momentum | Decay  | Gamma | Learning_rate | Batch_size | Training_days | Stepsize
YOLOv2 [16]  | 0.9      | 0.0005 | -     | 0.001         | 8          | 5             | 25000
YOLOv3       | 0.9      | 0.0005 | -     | 0.001         | 32         | 3-4           | 25000
SSD300 [16]  | 0.9      | 0.0005 | 0.1   | 0.000004      | 12         | 9             | 40000, 80000
SSD512 [16]  | 0.9      | 0.0005 | 0.1   | 0.000004      | 12         | 12            | 100000, 120000
RetinaNet    | 0.9      | 0.0005 | 0.1   | 0.001         | 64         | 4-12 h        | 25000, 35000
Fast RCNN    | 0.9      | 0.0005 | 0.1   | 0.001         | 64         | 4-12 h        | 25000, 35000
Faster RCNN  | 0.9      | 0.0005 | 0.1   | 0.001         | 64         | 4-12 h        | 25000, 35000


Overall, there is an increase of about 1-3% when changing from the simple backbone to the complex one of each type. For example, when switching from the original ResNet to ResNet-FPN, the accuracy is boosted by 2 to 3%. This makes clear that leveraging the multiscale features of FPN is a common way to improve detection and to tackle the scale imbalance of input images and of the bounding boxes of different objects. Similarly, when we switch ResNeXT-101-32×8d-FPN to ResNeXT-101-64×4d-FPN, the accuracy changes from 40.5% to 41.2% for Faster RCNN and from 38.7% to 39.6% for Fast RCNN. However, when comparing ResNet-50-FPN with ResNet-101-FPN, the growth only happens for Fast RCNN, from 33.3% to 35.5%; there is a slight decrease of 0.1% for Faster RCNN. This reduction also happens with RetinaNet, where the simpler ResNeXT-101-32×8d-FPN gets 30% and ResNeXT-101-64×4d-FPN gets just 25.1%. This means that very deep backbones do not guarantee an increase in accuracy: a deeper network needs more parameters to learn, so a large amount of data is required to train and update those parameters, but in this case the data of the small object dataset are not abundant enough to fit a very deep network, hence increasing the chances of overfitting. Besides, the features from the early layers of ResNet are not well generalized, because when they are combined with FPN the accuracy improves by about 2-3%. When YOLO switches from Darknet-19 to Darknet-53, the accuracy really gets a boost: even at resolutions up to 1024×1024, Darknet-19 peaks at just 24.02%, whereas YOLO 608×608 with Darknet-53 gets 33.1%. The explanation is that YOLOv3 with Darknet-53 has several improvements over Darknet-19: YOLOv3 predicts objects at 3 scale locations, one of which specializes in small objects, instead of only one as with Darknet-19, and it also integrates cutting-edge advances such as residual blocks and shortcut connections. A reduction in accuracy happens again with YOLO when switching from ResNet-101 to ResNet-152, by about 1-2%. Among these methods, YOLO and SSD are the only ones that allow multiple input sizes. The higher the resolution of the input images, the higher the accuracy the method receives, since a higher-resolution image provides more pixels to describe the visual information of small objects. However, if the resolution is far from the original size of the images, it results in a decrease in accuracy: for example, YOLO 1024×1024 with Darknet-19 gets a lower accuracy than the 800×800 resolution. In addition, we tried increasing the resolution of Darknet-53 from 608 to 1024, and the mAP decreases once the resolution is over 608×608. Therefore, the effect of image size is clear for models like SSD and YOLO. Generally, all comparative mAP results on this dataset are dominated by the classes greatest in number, which is caused by the imbalance between the number of images and the number of instances in those images. For example, according to the statistics in [13], mouse is a major class significantly contributing to the mAP in Table 3, with the highest numbers of instances and images; tissue box, however, has the least contribution, with the lowest AP, originally caused by its small amount of data.

Furthermore, the imbalanced data lead models to favor frequent objects, implying that models will mistake objects having an appearance similar to the dominant class for objects of interest rather than for less frequent objects. As a result, false positives increase because of these problems. Figure 5 illustrates the detection with the strongest backbones. Following this visualization, the domination of classes such as mouse or faucet results in misdetections on areas with an appearance similar to theirs. This misunderstanding is more pronounced for the weaker backbones in the comparison, and a one-stage method like YOLO, which primarily aims at speed, produces more misdetections than two-stage methods. One reason for these problems is the difference in the way the deep networks are trained [33]: one-stage methods such as YOLO use a soft sampling method that uses the whole dataset to update parameters, rather than choosing only some samples from the training data, whereas two-stage methods such as the RCNN family tend to employ hard sampling methods that randomly sample a certain number of positive and negative bounding boxes to train the network.
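As a rough illustration of the hard sampling scheme used by the RCNN family, the sketch below draws a fixed-size RoI minibatch with a capped positive fraction; the 128-RoI batch and 25% positive fraction are the common convention, assumed here rather than taken from this paper's configuration:

    import numpy as np

    def sample_rois(labels, batch_size=128, pos_fraction=0.25, rng=None):
        # labels: 1 for positive RoIs, 0 for negatives; returns sampled indices.
        rng = rng or np.random.default_rng(0)
        pos = np.flatnonzero(labels == 1)
        neg = np.flatnonzero(labels == 0)
        n_pos = min(int(batch_size * pos_fraction), len(pos))
        n_neg = min(batch_size - n_pos, len(neg))
        return np.concatenate([rng.choice(pos, n_pos, replace=False),
                               rng.choice(neg, n_neg, replace=False)])

    labels = (np.random.default_rng(1).random(2000) < 0.05).astype(int)
    idx = sample_rois(labels)
    print(len(idx), labels[idx].mean())  # 128 RoIs, ~25% positives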

5.1.2. Subsets of PASCAL. With 4 subsets covering 4 different scales of objects in images, we want to find out how much object scale impacts the models. The full results are shown in Table 4, and we separate them into two groups, the one-stage and the two-stage approaches; Figure 2 visualizes the strongest backbone of each method on the subsets.

In the case of different scales, as in our subsets, there is a difference between the one-stage and two-stage approaches: here, the one-stage methods perform better than the two-stage ones at most scales, which is the opposite of the small object dataset results. Specifically, the two-stage methods are totally better than the one-stage ones with real-time inputs and only a bit better, about 10-20%, than the nonreal-time models on VOC_WH_20, with the same result for smaller objects in VOC_MRA_058 and VOC_MRA_10. However, for bigger objects in VOC_MRA_20, the one-stage methods have significantly better outcomes than the two-stage ones. In addition, only Faster RCNN performs well in most cases compared with the one-stage methods; Fast RCNN is only good at big objects in VOC_MRA_20 and fails to detect smaller objects well.

In the one-stage approach, among methods that allow multiple input sizes like YOLO and SSD, there are two kinds, namely, ones that can run in real time and ones that cannot, the latter when the resolution is over 640 or 512 for YOLO and SSD, respectively. For the real-time ones, YOLO outperforms SSD at all scales of objects. Specifically, YOLOv2 with Darknet-19 is better than SSD by about 2-6% on objects in VOC_MRA_058 and VOC_MRA_10 and by 4-15% for larger objects in VOC_MRA_20 and VOC_WH_20. YOLOv3 with Darknet-53 gets results about 3-5% higher than YOLOv2, so YOLOv3 also beats SSD. However, if we consider nonreal-time input sizes, SSD is greater than YOLO on objects in VOC_MRA_10. RetinaNet, the method of the one-stage approach that cannot run in real time, performs on par with the nonreal-time settings of YOLO


and better than SSD. RetinaNet is more stable than SSD and YOLO when the scales change: the bigger the objects are, the more stable the results are. For example, the change is large, about 33%, when the scale increases from objects in VOC_MRA_058 to those in VOC_MRA_10 and VOC_MRA_20, but the change is not much, about 10%, for bigger objects, in comparison with 15-25% for YOLO. In the case of YOLO, this remarkable increase in accuracy when objects are larger is obviously good for a model. The change in SSD resembles the change in RetinaNet.

Concerning resolutions in YOLO and SSD, we see that increasing the image resolution generally pushes the accuracy up. YOLOv2 with Darknet-19, YOLOv3 with Darknet-53, and SSD all gain accuracy at larger resolutions, except for YOLOv2 on objects belonging to VOC_MRA_10 and VOC_MRA_20 when the image is over 800; in addition, YOLOv2 fluctuates on objects in VOC_WH_20. As mentioned in our previous work, YOLO is better than SSD on objects occupying less than 10% of the image; moreover, in this case, YOLOv3 is good at all scales of objects. This is because YOLOv3 has 3 detection locations coming with more ratios of default boxes, which leads to a significant outcome when the results from the 3 locations are combined.
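A small worked example shows why the three locations matter: for a square input of side s (a multiple of 32), YOLOv3 predicts on grids of strides 32, 16, and 8 with 3 anchors per cell, and the finest grid, the one that helps small objects most, contributes the bulk of the boxes:

    def yolov3_predictions(input_size):
        # Grids at strides 32, 16, and 8, with 3 anchor boxes per cell.
        grids = [input_size // stride for stride in (32, 16, 8)]
        return {f"{g}x{g}": 3 * g * g for g in grids}

    for size in (320, 416, 608):
        counts = yolov3_predictions(size)
        print(size, counts, "total:", sum(counts.values()))
    # 608 -> grids 19x19, 38x38, 76x76: 1083 + 4332 + 17328 = 22743 boxes.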

When we switch to the two-stage approaches, Faster RCNN has a significant improvement over Fast RCNN at most scales, except for objects in VOC_MRA_20, where both have the same accuracy. This shows that when objects are completely separated into different scales, RoI pooling does not work well with smaller objects and with those in VOC_WH_20.

Table 3: Comparative results on the small object dataset (per-class AP and mAP, %).

Method            | Backbone                 | Clock | Faucet | Jar   | Mouse | Outlet | Plate | Switch | Tel   | t box | t paper | mAP
YOLO 416 [16]     | Darknet-19               | 22.8  | 30.8   | 0.4   | 52.0  | 20.4   | 13.1  | 13.0   | 6.1   | 0.0   | 35.3    | 19.39
YOLO 448 [16]     | Darknet-19               | 23.0  | 36.9   | 0.9   | 52.5  | 18.4   | 13.6  | 17.5   | 4.2   | 0.0   | 34.3    | 20.13
YOLO 480 [16]     | Darknet-19               | 34.2  | 37.3   | 9.1   | 53.3  | 21.4   | 13.6  | 15.8   | 9.1   | 9.1   | 34.2    | 23.71
YOLO 512 [16]     | Darknet-19               | 23.1  | 36.6   | 6.1   | 59.8  | 24.6   | 14.2  | 15.7   | 9.1   | 4.5   | 32.4    | 22.61
YOLO 554 [16]     | Darknet-19               | 23.4  | 37.2   | 9.1   | 60.1  | 27.2   | 13.4  | 19.9   | 9.1   | 4.5   | 34.5    | 23.84
YOLO 640 [16]     | Darknet-19               | 20.2  | 36.2   | 3.2   | 59.8  | 27.8   | 11.7  | 18.1   | 8.2   | 4.5   | 35.6    | 22.53
YOLO 800 [16]     | Darknet-19               | 27.6  | 36.0   | 2.3   | 60.2  | 32.8   | 13.1  | 23.3   | 9.1   | 9.1   | 26.7    | 24.02
YOLO 1024 [16]    | Darknet-19               | 21.7  | 29.3   | 1.4   | 58.3  | 26.4   | 11.8  | 17.5   | 9.1   | 9.1   | 15.7    | 20.03
YOLO 320          | Darknet-53               | 26.22 | 38.38  | 4.55  | 56.46 | 36.42  | 13.34 | 24.80  | 10.65 | 4.55  | 42.96   | 25.83
YOLO 416          | Darknet-53               | 28.47 | 47.15  | 10.83 | 60.49 | 43.15  | 15.87 | 30.73  | 15.15 | 2.62  | 48.30   | 30.28
YOLO 608          | Darknet-53               | 29.98 | 47.89  | 10.76 | 65.88 | 48.02  | 18.09 | 31.22  | 14.62 | 17.99 | 46.56   | 33.10
YOLO 320          | ResNet-50                | 19.57 | 25.73  | 0.67  | 45.17 | 14.37  | 9.38  | 13.84  | 9.09  | 9.09  | 23.70   | 17.06
YOLO 416          | ResNet-50                | 23.78 | 36.65  | 0.40  | 54.23 | 18.37  | 13.75 | 19.78  | 9.84  | 9.42  | 35.68   | 22.19
YOLO 608          | ResNet-50                | 26.92 | 40.65  | 1.77  | 61.86 | 29.18  | 15.04 | 20.24  | 10.09 | 13.29 | 36.01   | 25.50
YOLO 320          | ResNet-101               | 20.52 | 27.90  | 0.57  | 44.68 | 16.98  | 13.05 | 13.66  | 9.66  | 9.09  | 24.36   | 18.05
YOLO 416          | ResNet-101               | 25.72 | 35.60  | 3.03  | 55.73 | 22.40  | 15.61 | 17.26  | 9.32  | 3.03  | 38.71   | 22.64
YOLO 608          | ResNet-101               | 28.79 | 44.59  | 9.42  | 62.18 | 33.34  | 15.53 | 23.88  | 13.24 | 15.83 | 39.17   | 28.60
YOLO 320          | ResNet-152               | 21.64 | 27.56  | 3.03  | 48.06 | 17.39  | 11.12 | 14.51  | 9.09  | 4.55  | 31.88   | 18.88
YOLO 416          | ResNet-152               | 25.70 | 36.54  | 0.89  | 53.81 | 20.60  | 14.13 | 20.21  | 11.49 | 0.29  | 33.06   | 21.67
YOLO 608          | ResNet-152               | 26.01 | 44.54  | 4.55  | 61.00 | 31.76  | 13.02 | 22.67  | 12.35 | 9.93  | 39.99   | 26.58
SSD300 [16]       | ResNet-101               | 5.5   | 9.1    | 0.0   | 25.5  | 6.1    | 4.5   | 0.0    | 4.5   | 9.1   | 18.2    | 8.25
SSD300 [16]       | VGG16                    | 9.1   | 17.1   | 0.0   | 26.1  | 9.1    | 9.1   | 0.0    | 4.5   | 0.0   | 16.7    | 9.16
SSD512 [16]       | VGG16                    | 9.1   | 17.1   | 0.0   | 43.0  | 9.1    | 9.1   | 9.1    | 9.1   | 0.0   | 7.6     | 11.32
RetinaNet         | ResNet-50-FPN            | 30.7  | 49.3   | 2.0   | 65.5  | 21.3   | 16.1  | 8.5    | 12.9  | 1.0   | 25.7    | 23.3
RetinaNet         | ResNet-101-FPN           | 30.6  | 48.7   | 7.1   | 64.7  | 20.0   | 15.9  | 11.8   | 10.7  | 2.9   | 38.7    | 25.1
RetinaNet         | ResNeXT-101-32×8d-FPN    | 35.5  | 55.0   | 12.1  | 66.5  | 23.9   | 18.4  | 9.8    | 16.2  | 9.4   | 53.7    | 30.0
RetinaNet         | ResNeXT-101-64×4d-FPN    | 31.4  | 50.2   | 8.9   | 66.3  | 20.8   | 15.3  | 9.4    | 14.0  | 2.2   | 32.4    | 25.1
R-CNN [13]        | RPN prop. + VGG16        | 31.9  | 31.3   | 4.2   | 56.8  | 31.1   | 9.3   | 14.2   | 16.4  | 23.4  | 29.4    | 24.8
R-CNN [13]        | Alexnet, 7× 300 prop.    | 32.4  | 27.2   | 5.1   | 56.9  | 28.0   | 9.8   | 13.6   | 12.4  | 17.9  | 35.6    | 23.9
R-CNN [13]        | VGG16, 7× 300 prop.      | 37.3  | 30.3   | 7.2   | 60.6  | 41.5   | 15.8  | 21.5   | 13.7  | 22.0  | 33.3    | 28.4
R-CNN [13]        | ContextNet (Alexnet 7×)  | 32.7  | 26.8   | 4.6   | 56.4  | 26.3   | 9.9   | 12.9   | 12.2  | 18.7  | 34.0    | 23.5
Fast RCNN         | ResNet-50-C4             | 32.4  | 46.3   | 6.5   | 65.8  | 38.3   | 20.1  | 25.3   | 16.6  | 14.1  | 52.0    | 31.7
Fast RCNN         | ResNet-50-FPN            | 37.4  | 47.3   | 7.3   | 68.9  | 46.7   | 21.0  | 32.1   | 17.1  | 9.3   | 45.9    | 33.3
Fast RCNN         | ResNet-101-FPN           | 39.3  | 50.3   | 10.6  | 68.3  | 47.1   | 20.4  | 33.3   | 18.6  | 15.4  | 51.4    | 35.5
Fast RCNN         | ResNeXT-101-32×8d-FPN    | 47.5  | 54.8   | 10.3  | 71.8  | 54.0   | 21.4  | 34.4   | 21.7  | 17.7  | 53.5    | 38.7
Fast RCNN         | ResNeXT-101-64×4d-FPN    | 45.4  | 55.7   | 10.9  | 72.5  | 53.3   | 24.0  | 36.9   | 22.9  | 16.0  | 58.1    | 39.6
Faster R-CNN [16] | VGG16                    | 23.76 | 37.65  | 8.03  | 54.00 | 16.16  | 11.88 | 15.12  | 9.10  | 6.25  | 37.29   | 21.92
Faster RCNN       | ResNet-50-C4             | 32.2  | 44.6   | 6.6   | 65.9  | 35.2   | 17.5  | 25.7   | 19.6  | 13.7  | 40.0    | 30.1
Faster RCNN       | ResNet-50-FPN            | 35.7  | 49.9   | 7.3   | 68.4  | 48.9   | 18.8  | 29.6   | 14.7  | 11.4  | 53.3    | 33.8
Faster RCNN       | ResNet-101-FPN           | 39.8  | 49.2   | 4.9   | 68.2  | 47.0   | 18.5  | 29.7   | 14.0  | 12.9  | 52.2    | 33.7
Faster RCNN       | ResNeXT-101-32×8d-FPN    | 49.8  | 56.6   | 11.4  | 72.1  | 56.3   | 23.2  | 37.0   | 20.8  | 18.8  | 58.7    | 40.5
Faster RCNN       | ResNeXT-101-64×4d-FPN    | 49.6  | 58.6   | 12.2  | 72.5  | 54.5   | 23.2  | 36.9   | 20.8  | 20.1  | 63.1    | 41.2

In the original table, bold values mark the best one-stage method and italic values the best two-stage method.
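As a consistency check on Table 3, each mAP entry is the mean of the ten per-class APs; for example, for the best two-stage row:

    # Faster RCNN with ResNeXT-101-64x4d-FPN, per-class APs from Table 3.
    aps = [49.6, 58.6, 12.2, 72.5, 54.5, 23.2, 36.9, 20.8, 20.1, 63.1]
    print(round(sum(aps) / len(aps), 2))  # 41.15, reported as 41.2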


In addition, compared with the one-stage methods, it is significantly lower than them. However, RoI Align along with RPN performs well when the scales are changed. When it comes to the backbones, there is a small decrease in accuracy when changing from ResNet-50-FPN to ResNet-101-FPN or from ResNeXT-101-32×8d-FPN to ResNeXT-101-64×4d-FPN, for objects of all scales and for both Faster RCNN and Fast RCNN. The VGG16 backbone has an impressive outcome compared with strong backbones such as ResNet or ResNeXT: although its accuracy is lower than those two strong backbones, VGG16 is still better on objects in VOC_WH_20 and changes little in accuracy when moving to objects of big sizes.

5.2. Time Processing and Resource Consumption. Tables 5 and 6 show the performance comparison of the evaluated models with their base networks. Generally, we see that RAM consumption in testing and training increases as more layers are added. This means that the deeper the network, the greater the processing needs, because depth increases the number of parameters and the time to process data as well. YOLO is the model consuming the least memory in both the training and testing phases: with Darknet-53, YOLO needs only 4 to 5 GB for training and 1.6 to 1.8 GB for testing. YOLO is also the only one able to run in real time; it needs only about 0.03 s to 0.04 s to process an image, compared with more than 0.1 s and 0.2 s for Faster RCNN and RetinaNet. This allows us to deploy these models on devices with modest memory. While RetinaNet is assigned to the one-stage approach, it is not fast enough for real-time detection. The inference time of Fast RCNN is slightly lower than that of Faster RCNN and RetinaNet; in contrast, the RAM consumption in training and testing of RetinaNet is lower than that of Fast RCNN and Faster RCNN. Of all architectures, ResNet-50-C4 is the one requiring the most memory and time to process data, because the output size of ResNet-50-C4 is a bit bigger than the others [9]. If we consider ResNet or ResNeXT combined with FPN, Faster RCNN is over 100 MB above Fast RCNN and 300 MB above RetinaNet. In addition, according to Table 2, Faster RCNN and RetinaNet need less training time, only a few hours to 1 day, compared with 3-4 days for YOLO. This demonstrates that if we pay attention to performance and do not have much time for training, we should choose Faster RCNN or RetinaNet instead of YOLO. In contrast, if we only focus on processing speed while still achieving good performance, the one-stage methods are always the good ones. In the same context of backbones, RetinaNet uses fewer resources at testing time than Fast RCNN and Faster RCNN, by about 100 MB and 300 MB, respectively. However, at training time RetinaNet uses much more memory, about 2.8 GB more than Fast RCNN and about 2.3 GB more than Faster RCNN, for ResNeXT-101-32×8d-FPN and ResNeXT-101-64×4d-FPN.

Figure 4: The location of the default boxes in different scales: (a) image with GT boxes; (b) 8×8 feature map; (c) 4×4 feature map. Each default box carries a location offset Δ(cx, cy, w, h) and confidences (c1, c2, ..., cp).


Table 4: The comparative results on subsets of PASCAL VOC 2007 (mAP, %).

Approach  | Method                              | VOC_MRA_058 | VOC_MRA_10 | VOC_MRA_20 | VOC_WH_20
One-stage | YOLOv2 416 [16]                     | 3.02        | 31.38      | 42.89      | 18.52
One-stage | YOLOv2 448 [16]                     | 4.47        | 32.9       | 60.15      | 21.96
One-stage | YOLOv2 480 [16]                     | 4.26        | 33.48      | 60.78      | 26.67
One-stage | YOLOv2 512 [16]                     | 5.42        | 35.74      | 61.12      | 24.63
One-stage | YOLOv2 544 [16]                     | 6.97        | 36.56      | 63.0       | 26.62
One-stage | YOLOv2 640 [16]                     | 7.7         | 37.97      | 61.29      | 23.41
One-stage | YOLOv2 800 [16]                     | 10.24       | 37.3       | 61.91      | 26.9
One-stage | YOLOv2 1024 [16]                    | 10.69       | 29.93      | 55.14      | 28.97
One-stage | YOLOv3 320                          | 7.18        | 34.58      | 60.36      | 20.4
One-stage | YOLOv3 416                          | 10.2        | 38.97      | 62.53      | 24.12
One-stage | YOLOv3 608                          | 11.7        | 42.65      | 68.56      | 28.86
One-stage | SSD 300 [16]                        | 1.71        | 32.76      | 46.26      | 16.91
One-stage | SSD 512 [16]                        | 2.9         | 43.46      | 57.11      | 19.87
One-stage | RetinaNet-ResNet-50-FPN             | 8.84        | 41.5       | 50.2       | 28.14
One-stage | RetinaNet-ResNet-101-FPN            | 8.95        | 42.5       | 51.9       | 27.46
One-stage | RetinaNet-ResNeXT-101-32×8d-FPN     | 10.29       | 45.4       | 54.5       | 30.08
One-stage | RetinaNet-ResNeXT-101-64×4d-FPN     | 10.71       | 45.5       | 55.1       | 31.32
Two-stage | Fast RCNN-ResNet-50-C4              | 0.23        | 13.2       | 49.9       | 3.93
Two-stage | Fast RCNN-ResNet-50-FPN             | 0.63        | 13.5       | 55.6       | 3.45
Two-stage | Fast RCNN-ResNet-101-FPN            | 0.39        | 15.9       | 57.6       | 3.12
Two-stage | Fast RCNN-ResNeXT-101-32×8d-FPN     | 0.51        | 14.4       | 57.9       | 3.33
Two-stage | Fast RCNN-ResNeXT-101-64×4d-FPN     | 0.29        | 14.2       | 57.3       | 3.76
Two-stage | Faster RCNN-ResNet-50-C4            | 6.98        | 39.9       | 48.7       | 26.04
Two-stage | Faster RCNN-ResNet-50-FPN           | 10.74       | 45.6       | 56.3       | 29.79
Two-stage | Faster RCNN-ResNet-101-FPN          | 10.63       | 46.9       | 57.6       | 30.57
Two-stage | Faster RCNN-ResNeXT-101-32×8d-FPN   | 11.64       | 47.3       | 57.6       | 32.12
Two-stage | Faster RCNN-ResNeXT-101-64×4d-FPN   | 10.54       | 47.1       | 56.9       | 31.64
Two-stage | Faster RCNN-VGG16 [16]              | 5.73        | 35.58      | 44.14      | 41.11

This table illustrates how well the models adapt to different scales of objects. In the original table, bold values mark the best one-stage method and italic values the best two-stage method.



If we consider this on the small object dataset, it does not matter too much, because RetinaNet is about 10% lower than Faster RCNN in performance. Otherwise, on the different scales of the subsets, RetinaNet works well compared with Faster RCNN, and the difference is just 2-4 percentage points. Although the ResNet backbones combined with the other methods yield an improvement in accuracy, they do not work for YOLO on the small object dataset. YOLO with Darknet-53 utilizes more resources

Table 5: The comparison of consumption on the small object dataset.

Model       | Backbone                 | Inference time (s) | Test RAM (MiB) | Train RAM (MiB)
YOLOv3      | Darknet-53               | 0.0331             | 1825           | 4759
YOLOv3      | ResNet-50                | 0.027              | 1285           | 3479
YOLOv3      | ResNet-101               | 0.0356             | 1829           | 5383
YOLOv3      | ResNet-152               | 0.0454             | 2443           | 7531
RetinaNet   | ResNet-50-FPN            | 0.102              | 2075           | 4435
RetinaNet   | ResNet-101-FPN           | 0.127              | 2723           | 5577
RetinaNet   | ResNeXT-101-32×8d-FPN    | 0.229              | 3767           | 7863
RetinaNet   | ResNeXT-101-64×4d-FPN    | 0.292              | 3719           | 7813
Fast RCNN   | ResNet-50-C4             | 0.3                | 6449           | 5877
Fast RCNN   | ResNet-50-FPN            | 0.089              | 2277           | 4455
Fast RCNN   | ResNet-101-FPN           | 0.113              | 2947           | 5627
Fast RCNN   | ResNeXT-101-32×8d-FPN    | 0.212              | 3987           | 4961
Fast RCNN   | ResNeXT-101-64×4d-FPN    | 0.269              | 3885           | 4799
Faster RCNN | ResNet-50-C4             | 0.412              | 6609           | 6129
Faster RCNN | ResNet-50-FPN            | 0.101              | 2387           | 5381
Faster RCNN | ResNet-101-FPN           | 0.124              | 3001           | 6487
Faster RCNN | ResNeXT-101-32×8d-FPN    | 0.256              | 4027           | 5333
Faster RCNN | ResNeXT-101-64×4d-FPN    | 0.286              | 4003           | 5246

Table 6: The comparison of consumption on the subsets filtered from PASCAL VOC.

Model       | Backbone                 | Inference time (s) | Test RAM (MiB) | Train RAM (MiB)
YOLOv3      | Darknet-53               | 0.027              | 1645           | 4079
RetinaNet   | ResNet-50-FPN            | 0.1                | 1935           | 4133
RetinaNet   | ResNet-101-FPN           | 0.116              | 2585           | 5435
RetinaNet   | ResNeXT-101-32×8d-FPN    | 0.222              | 3641           | 7723
RetinaNet   | ResNeXT-101-64×4d-FPN    | 0.284              | 3561           | 7599
Fast RCNN   | ResNet-50-C4             | 0.495              | 6371           | 5677
Fast RCNN   | ResNet-50-FPN            | 0.092              | 2131           | 4387
Fast RCNN   | ResNet-101-FPN           | 0.114              | 2819           | 5463
Fast RCNN   | ResNeXT-101-32×8d-FPN    | 0.213              | 3873           | 4637
Fast RCNN   | ResNeXT-101-64×4d-FPN    | 0.265              | 3735           | 4575
Faster RCNN | ResNet-50-C4             | 0.26               | 6141           | 5991
Faster RCNN | ResNet-50-FPN            | 0.1                | 2245           | 5207
Faster RCNN | ResNet-101-FPN           | 0.13               | 2855           | 6335
Faster RCNN | ResNeXT-101-32×8d-FPN    | 0.225              | 3943           | 5087
Faster RCNN | ResNeXT-101-64×4d-FPN    | 0.276              | 3885           | 4909

Figure 5: Highlights of bounding boxes from comparative backbones on the small object dataset. We select YOLO with Darknet-53 and ResNet-50 for an objective comparison because the two networks have essentially the same number of layers along with the same significant techniques, such as skip connections and residual blocks. The bounding boxes show that ResNet-50 is more sensitive than Darknet-53 to areas that resemble the objects of interest. Similarly, ResNet-50-FPN and ResNet-50-C4 are chosen for consideration; the detections show that combining ResNet-50 with FPN yields better performance than the original network. In particular, misdetections occur more densely with ResNet-50-C4 than with ResNet-50-FPN, as in columns 4 and 5. Zoom in to see more detail.


than the ResNet ones, but it has the best accuracy among the models. Therefore, we only test YOLO with Darknet-53 on the subsets of PASCAL.
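The paper does not publish its timing harness, so the following is only a plausible sketch of how per-image latency and FPS can be measured; `detector` and its `predict` method are placeholders for any of the evaluated models:

    import time

    def measure_latency(detector, images, warmup=10):
        for img in images[:warmup]:          # warm-up excludes lazy initialization
            detector.predict(img)
        start = time.perf_counter()
        for img in images[warmup:]:
            detector.predict(img)
        per_image = (time.perf_counter() - start) / max(1, len(images) - warmup)
        return per_image, 1.0 / per_image    # seconds per image, FPS

    # Table 5's 0.0331 s per image for Darknet-53 corresponds to ~30 FPS,
    # which is why only YOLOv3 qualifies as real time here.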

5.3. Analyses of the Trade-Off among Detectors. Network designs and approaches like the one-stage approach prove their performance when applied to detect general objects, both at small scales and at other kinds of scales. Although these models are fast and accurate, one drawback always remains: the trade-off between accuracy and processing speed. For example, YOLOv3 proposes performing detection at three different scales, and this result is obviously impressive and yields good performance; however, to gain this advantage, YOLOv3 has to sacrifice processing time, because instead of the model processing an input once for detection as YOLOv2 does, this idea must run three times. This trade-off is also partly affected by resolution, as we change it during training or testing our models; in our previous work, we mentioned that we have to choose the right resolution to ensure our models work properly. In the case of the two-stage approaches, the idea of generating region proposals to improve the localization of objects for detection is good as well. This is useful, but we have to take into account whether we should generate proposals on feature maps or directly on input images, because this strongly affects how models run and identify representations of objects. If objects are normal or have a big or medium appearance, models work well, but if objects come at multiple scales, this is a problem to consider and research deeply in order to balance and improve performance. Therefore, to partly fix this problem, the one-stage approach allows us to choose a fixed input size for training and testing, but the support still depends on the characteristics of the datasets we evaluate and on the image size. After all, all the models we chose to evaluate are affected by the scales of objects: when we change the scale, the accuracy of the models changes a lot, except for Faster RCNN, the only model that seems stable with respect to scale, especially when combined with the VGG16 architecture. Although the accuracy of VGG16 is not better than the other architectures, the difference is that it does not change too much in accuracy. This only holds for big objects having an overlap of the bounding box and the image greater than 10%; otherwise, it is not assured.

Figure 1 shows that the possibility of small objects is higher than that of other objects; the black lens of the camera is somehow similar to the black mouse placed on a mouse pad. This possibility of small object presence causes more difficulties for detectors and leads to wrong detections. Small objects can be anywhere in an image, with the result that detectors make many wrong detections on familiar appearances they have seen. If we consider the visualization of the detections in Figure 5, the wrong detections are partly similar to the appearance of the other objects in the dataset. This problem is caused by the data imbalance between classes and between instances in each class, originally known as the foreground-foreground class imbalance. In other words, the common problems, which happen not only with small objects but also for whole datasets, are intraclass similarity and interclass variation.

6. Conclusion

Small object detection is a challenging and interesting problem in the task of object detection and has drawn attention from researchers thanks to the development of deep learning, which has been the motivation for improving the performance of tasks in computer vision. Although deep detection models originally tend to solve problems relating to general object detection, they still work to a particular degree for small object detection. As an evaluation work on small object detection for deep models, our goal is to highlight the remarkable achievements of popular and state-of-the-art deep models in order to provide a variety of views on applying deep models to small object detection. Particularly, we evaluate state-of-the-art detectors based on deep learning from the two approaches, namely, YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, on two datasets, namely, the small object dataset and the subsets filtered from PASCAL VOC, with respect to the effects of different factors, objectively including accuracy, execution time, and resource usage.

In spite of the successful achievements of recent years, in which the performance of detection has improved significantly, there is still a huge gap in accuracy between normal objects and small objects. Under the criteria of the COCO dataset, the difference from the small scale to the medium and big scales is too large. Most models are good at detecting normal objects, and problems arise when applying them to detect small objects. As a result, to reduce the gap in small object detection, the first thing to do is to invest in datasets which have massive amounts of data to train models and a wide range of categories, to compete with the human visual system [12, 34].

So far, detection models are divided into two main approaches, namely, the one-stage approach and the two-stage approach. Models in the one-stage approach are known as detectors with better and more efficient detection in comparison with the other approach. The efficiency here is the potential to run in real time and the ability to be applied to practical applications; however, the trade-off between accuracy and speed is a difficult challenge that needs to be taken into account in order to close the gap. Models in the two-stage approach, on the other hand, have a reputation as region-based detectors with high accuracy but speeds too low to apply them to the real world; this drawback comes from the computation of their networks.

Through our evaluation, it is a fact that the architectures utilized as base networks to extract deep features have significant effects on the frameworks. The deeper the architecture, the higher the accuracy of detection. Once a network increases in depth, it has more layers than normal ones and massive numbers of parameters to train; hence, it needs a lot of data to fine-tune these parameters reasonably. If there is an increase in computation, resource consumption will also increase. As a result, it will be difficult to apply such models in practical applications. Besides, the exploitation of context in these models is definitely limited; this causes much useful and informative data to be ignored in training, especially in the context of small objects. Because small objects are able to appear anywhere in an input image, if the image is well exploited with its context, the performance of small object detection will be further improved.

For all the above reasons and according to our evaluation, if we tend toward good performance and ignore processing speed, two-stage methods like Faster RCNN perform well and demonstrate their network design on different datasets and many contexts of objects, including multiscale objects; therefore, Faster RCNN is considered a giant baseline to base on or develop from. If our target is a balance of accuracy and speed, YOLO is a good one in case we do not care about the training time, because its sacrifice between speed and accuracy is worth applying to practical applications; otherwise, Faster RCNN or RetinaNet is still a substitute to work on. When it comes to backbones, we have to consider the data in order to choose a reasonable backbone to combine with the methods, because the amount of data will significantly impact the model: if data are not abundant, a shallow network will fit them well. Besides, there is recently a novel approach promising for training deep models with less data, that is, weakly supervised learning, such as zero-shot, one-shot, or few-shot learning; these approaches will be considered in our future works. Following our recent searching, to have better performance on object detection we have to consider several factors to improve the mAP, such as multiscale training, superresolution for scaling up the visual information of small objects [35], or preprocessing data to avoid imbalanced data, because we have a wide range of imbalance problems relating to data [33].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the Vietnam National University Ho Chi Minh City (VNU-HCM) under grant no. B2017-26-01.

References

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580-587, Columbus, OH, USA, June 2014.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proceedings of the European Conference on Computer Vision, pp. 346-361, Springer, Zurich, Switzerland, September 2014.
[3] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440-1448, Santiago, Chile, December 2015.
[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779-788, Las Vegas, NV, USA, June 2016.
[5] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," 2016, https://arxiv.org/abs/1612.08242.
[6] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.
[7] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, "Focal loss for dense object detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, Venice, Italy, October 2018.
[8] K. Zidek, A. Hosovsky, J. Pitel', and S. Bednar, "Recognition of assembly parts by convolutional neural networks," in Advances in Manufacturing Engineering and Materials, pp. 281-289, Springer, Cham, Switzerland, 2019.
[9] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2980-2988, IEEE, Venice, Italy, October 2017.
[10] L.-C. Chen, A. Hermans, G. Papandreou et al., "Instance segmentation by refining object detection with semantic and direction features," 2017, https://arxiv.org/abs/1712.04837.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303-338, 2010.
[12] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft COCO: common objects in context," in Proceedings of the European Conference on Computer Vision, pp. 740-755, Springer, Zurich, Switzerland, September 2014.
[13] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, "R-CNN for small object detection," in Proceedings of the Asian Conference on Computer Vision, pp. 214-230, Springer, Taipei, Taiwan, November 2016.
[14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627-1645, 2010.
[15] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," in Proceedings of the 28th International Conference on Neural Information Processing Systems, ser. NIPS'15, pp. 91-99, MIT Press, Cambridge, MA, USA, 2015, http://dl.acm.org/citation.cfm?id=2969239.2969250.
[16] P. Pham, D. Nguyen, T. Do, T. D. Ngo, and D.-D. Le, "Evaluation of deep models for real-time small object detection," in Proceedings of the International Conference on Neural Information Processing, pp. 516-526, Springer, Guangzhou, China, November 2017.
[17] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154-171, 2013.
[18] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu, "Traffic-sign detection and classification in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2110-2118, Las Vegas, NV, USA, June 2016.
[19] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: a large data set for nonparametric object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958-1970, 2008.
[20] A. Kembhavi, D. Harwood, and L. S. Davis, "Vehicle detection using partial least squares," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1250-1265, 2011.
[21] V. I. Morariu, E. Ahmed, V. Santhanam, D. Harwood, and L. S. Davis, "Composite discriminant factor analysis," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 564-571, IEEE, Steamboat Springs, CO, USA, March 2014.
[22] A. Andreas, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 2012.
[23] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: human trajectory prediction in crowded spaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961-971, Las Vegas, NV, USA, June 2016.
[24] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, "Sun database: exploring a large collection of scene categories," International Journal of Computer Vision, vol. 119, no. 1, pp. 3-22, 2016.
[25] E. Dong, Y. Zhu, Y. Ji, and S. Du, "An improved convolution neural network for object detection using YOLOv2," in Proceedings of the 2018 IEEE International Conference on Mechatronics and Automation (ICMA), pp. 1184-1188, IEEE, Changchun, China, August 2018.
[26] W. Liu, D. Anguelov, D. Erhan et al., "SSD: single shot multibox detector," in Proceedings of the European Conference on Computer Vision, pp. 21-37, Springer, Amsterdam, The Netherlands, October 2016.
[27] T.-Y. Lin, P. Dollar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, p. 4, Honolulu, HI, USA, July 2017.
[28] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: deconvolutional single shot detector," 2017, https://arxiv.org/abs/1701.06659.
[29] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision, pp. 210-226, Springer, Munich, Germany, September 2018.
[30] L. Liu, W. Ouyang, X. Wang et al., "Deep learning for generic object detection: a survey," 2018, https://arxiv.org/abs/1809.02165.
[31] P. Zhu, L. Wen, X. Bian, L. Haibin, and Q. Hu, "Vision meets drones: a challenge," 2018, https://arxiv.org/abs/1804.07437.
[32] Z.-Q. Zhao, P. Zheng, S.-t. Xu, and X. Wu, "Object detection with deep learning: a review," 2018, https://arxiv.org/abs/1807.05511.
[33] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas, "Imbalance problems in object detection: a review," 2019, https://arxiv.org/abs/1909.00169.
[34] O. Russakovsky, J. Deng, H. Su et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.
[35] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 206-221, Munich, Germany, September 2018.



Although SSD has significant improvements in objectdetection as integrating with these above parts SSD is notgood at detecting small objects which can be improved byadding deconvolution layers with skip connections to in-troduce additional large-scale context [28] Generally SSDoutperforms Faster RCNN which is a state-of-the-art ap-proach about accuracy on PASCAL VOC and COCO whilerunning at real-time detection

37CNNDrawbacks Most of the CNNmodels are currentlydesigned by the hierarchy of various layers such as con-volutional and pooling layers that are arranged in a certainorder not only on small networks but also on multilayernetworks to state-of-the-art networks Along with theselayers fully connected layers are added behind and known asFC layers (e block consisting of FC layers and previouslayers is designated as feature extractors and it outputs keyfeatures of objects of interest as an input for classifierscoming behind However deeply going through many kinds

of layers is a way that is not good for small object detectionbecause in the task of small object detection objects ofinterest are objects owning small sizes and appearanceBesides small objects unlike normal or big objects which areless affected by resizing the image or passing lots of differentlayers are very vulnerable to the changes in image sizesWhen an image passes a convolutional layer the size of theimage will be decreased by receptive fields that slide over theimage to extract useful features (is does not affect smallobjects if there are just a few layers but in a CNN networkwe have many layers like this and it is very hard for smallobjects Still if small objects just go through convolutionallayers it will not be anything to mention Small objectswhich just have a few informative presence have to passpooling layers which help in avoiding overfitting and re-ducing computational costs by decreasing a number ofparameters To do this these layers use fixed sliding windowsthat care about a fixed target that is identified before such asmaximum or average calculations of valuables For thesereasons GAN is an approach that may alter the CNN ap-proach because of its advantages We can take advantages ofa way that the approach generates data to overcome thelimitations of data of small objects for the training phaseAlthough images still have to pass layers such as convolu-tional and pooling layers in this context the network justhas less layers compared to others Bai el al [29] haveproposed to apply MTGAN to detect small objects by takingcrop inputs from a processing step made by baseline de-tectors such as Faster RCNN [15] or Mask RCNN [9]

Because of mentioned reasons and following the survey[30] Liu et al have presented numerous works of survey andevaluation but there are no works that do with small objectsin them(erefore in this work we assess popular and state-of-the-art models to find out pros and cons of these modelsParticularly we evaluate 4 deep models such as YOLOv3RetinaNet Fast RCNN and Faster RCNN with several basenetworks for small object detection with different scales ofobjects In these models YOLOv3 and RetinaNet belong tothe one-stage approach Fast RCNN and Faster RCNN are inthe two-stage approach We choose these models becauseYOLOv3 is the model with combination of state-of-the-arttechniques and RetinaNet is the model with a new lossfunction which penalizes the imbalance of classes in adataset Besides we choose RetinaNet to make comparisonsbetweenmodels in the same approach Similarly Fast RCNNand Faster RCNN are the same and both models are in thesame approach and have nearly the similar pipeline in objectdetection (ere is a difference is that Fast RCNN utilizes anexternal proposal to generate object proposals based oninput images However Faster RCNN proposes its ownnetwork to generate object proposals on feature maps andthis makes Faster RCNN train end-to-end easily and workbetter

4 Experimental Evaluation

In this section we present the information of our ex-perimental setting and datasets which we use forevaluation

8 Journal of Electrical and Computer Engineering

41 Experimental Setting We continually train and evaluatevarious object detectors on the two datasets such as PASCALVOC [11] and a newly generated dataset [16] (e evaluatedapproaches in this time consist of Faster RCNN [15]YOLOv3 [6] and RetinaNet [7] with different backbonesExcept for YOLOv3 the others are trained and evaluated bythe Detectron python code

Currently the original datasets which commonly areused in object detection are PASCAL VOC [11] and COCO[12] Both datasets are constructed by almost large objects orother kinds of objects whose size fill a big part in the image(ese two datasets are not suitable for small object detectionIn addition there is another dataset which is large-scale andincludes a lot of classes for small object detection collectedby drones and named VisDrone dataset [31] However itdoes not publish the labels for test set to evaluate and theviews of images are topdown which is not our case As aresult in order to evaluate the detection performance of themodels we use a dataset which was published in [13] (isdataset is called small object dataset which is the combi-nation between COCO [12] and SUN [24] dataset (ere are10 classes in small object dataset including mouse tele-phone switch outlet clock toilet paper (t paper) tissue box(t box) faucet plate and jar (e whole dataset consists of4925 images in total and there are 3296 images for trainingand 1629 images for testing (e mouse class owns thelargest number of instances in images 2137 instances in 1739images and the tissue box class has the smallest number ofinstances 103 instances in 100 images Apart from the smallobject dataset we also filter subsets from PASCAL VOC2007 following standard definitions On PASCAL VOCthere are 20 classes but with small object detection there arefewer classes on strict definitions of small objects Table 1lists the details of the number of small objects and imagescontaining them for subsets of the dataset

We trained all models on small object dataset with thesame parameters Particularly in the training phase wetrained the models with 70k iterations with the parametersincluding momentum decay gamma learning rate batchsize step size and training days in Table 2 At the firstmoment we attempted to start off the models with a higherlearning rate 10minus 2 but the models diverged leading to theloss value being NaN or Inf after 100 first iterations(en wetried at a lower learning rate 10minus3 at 100 first iterations andrise to 10minus2 to consider if the models can converge as startingoff at a lower learning rate However it remained unchangedanything We also saw that the models converged quicklyduring 10k first iterations with 10minus3 and then progressively

slow down after 20k (erefore we decided to start off thetraining with a learning rate at 10minus3 and decrease to 10minus 4 and10minus5 at 25k and 35k iterations respectively (is settingshows that the loss value was stable from 40k but we set thetraining up to 70k to consider how the loss value changesand saw that it did not change a lot after 40k iterations Wetried to evaluate the models from 30k to 70k and generallythe performance of the models was not stable after 40k it-erations For this reason we picked up the weight forevaluation at 30k and 40k iterations At 30k iterationsYOLO achieves the best results and others get the best one at40k iterations In case of subsets of PASCAL VOC 2007 wecombine train and valid set from PASCAL VOC 2007 and2012 to form a training set PASCAL VOC 2012 works as adata augmentation set for PASCAL VOC 2007 We use thiscombined training set to train all models and test them onsubsets All models train the same parameter First due tothe limitation of memory we rescale all the size of images tothe same size with the shortest side 600 and the lengthiestside 1000 as in [15]

In YOLOv3, we run the K-means clustering algorithm to initialize 9 suitable default bounding boxes for the training and testing phases on our selected datasets, and we changed the anchor values accordingly. The following are the 9 anchors for the small object dataset after running the K-means algorithm: (10.3459, 14.4216), (26.2937, 19.0947), (21.4024, 36.3180), (47.9317, 29.1237), (40.4932, 63.7489), (83.6447, 51.3203), (72.2167, 119.9181), (172.7416, 117.0773), and (124.6597, 252.8465).
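A minimal sketch of this clustering step is given below, assuming the ground-truth boxes are provided as (width, height) pairs in pixels; the 1 - IoU distance follows the usual YOLO anchor-clustering recipe, and the function names are ours:

    import numpy as np

    def iou_wh(boxes, anchors):
        # IoU between (w, h) pairs, assuming boxes and anchors share a corner,
        # so only width and height matter (standard YOLO anchor clustering).
        inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
                 * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
        union = (boxes[:, 0] * boxes[:, 1])[:, None] \
            + anchors[None, :, 0] * anchors[None, :, 1] - inter
        return inter / union

    def kmeans_anchors(boxes, k=9, iters=100, seed=0):
        boxes = np.asarray(boxes, dtype=float)
        rng = np.random.default_rng(seed)
        anchors = boxes[rng.choice(len(boxes), size=k, replace=False)]
        for _ in range(iters):
            # Each box joins the anchor of highest IoU (distance = 1 - IoU);
            # each anchor then moves to the median (w, h) of its cluster.
            nearest = iou_wh(boxes, anchors).argmax(axis=1)
            for j in range(k):
                members = boxes[nearest == j]
                if len(members):
                    anchors[j] = np.median(members, axis=0)
        return anchors[np.argsort(anchors.prod(axis=1))]  # sorted by area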

In Faster R-CNN, to compare fairly with the prior work and to deploy different backbones, we directly reuse the anchor scales and aspect ratios of [13]: anchor scales of 16×16, 40×40, and 100×100 pixels and aspect ratios of 0.5, 1, and 2, instead of having to cluster a set of default bounding boxes as with YOLOv3. Similarly, in RetinaNet we keep the default training settings (loss gamma 2.0, loss alpha 0.25, anchor scale 4, and 3 scales per octave), following the authors, since this configuration holds the optimized values.
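For context, the focal loss behind these defaults (gamma = 2.0, alpha = 0.25) has the following shape; this is a generic sketch of the published formulation, not the Detectron implementation itself:

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        # Binary focal loss as in RetinaNet: the (1 - p_t)^gamma factor
        # down-weights easy examples so training focuses on hard ones.
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p = torch.sigmoid(logits)
        p_t = p * targets + (1 - p) * (1 - targets)  # prob. of the true class
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        return (alpha_t * (1.0 - p_t) ** gamma * ce).sum()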

4.2. Our Newly Generated Dataset. This time, to make an objective comparison, we also use our newly generated dataset, whose information is shown in Table 1. We use it to study the effect of object sizes on factors including the models, processing time, accuracy, and resource consumption. The dataset consists of 4 subsets filtered from PASCAL VOC 2007.

[Figure 3: mAP of YOLOv2 on VOC 2007 at each added part (batch norm, hi-res classifier, convolutional, anchor boxes, new network, dimension priors, location prediction, passthrough, multiscale, hi-res detector), going from YOLO to YOLOv2: 63.4, 65.8, 69.5, 69.2, 69.6, 74.4, 75.4, 76.8, and 78.6 [5].]


The 4 subsets are VOC_WH_20, VOC_MRA_058, VOC_MRA_10, and VOC_MRA_20; detailed information is provided as follows:

(i) VOC_WH_20 contains objects whose width and height are less than 20% of the image's width and height. This subset has two classes fewer than PASCAL VOC 2007 (dining table and sofa) because of the constraint of the definition.

(ii) VOC_MRA_058, VOC_MRA_10, and VOC_MRA_20 comprise objects occupying a maximum mean relative area of the original image under 0.58%, 1.0%, and 2.0%, respectively. Two of them keep the same classes as PASCAL VOC 2007; the exception is VOC_MRA_058, which has four classes fewer (dining table, dog, sofa, and train). A sketch of this filtering step is given below.
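As a sketch of how such subsets can be filtered from the VOC annotations (the helper below is ours and simplifies the MRA criterion to a per-box relative-area test):

    import xml.etree.ElementTree as ET

    def small_boxes(annotation_xml, max_rel_area=0.0058, max_rel_wh=None):
        # Parse one PASCAL VOC annotation file and keep boxes that satisfy a
        # relative-area threshold (MRA-style) or a width/height threshold (WH-style).
        root = ET.parse(annotation_xml).getroot()
        img_w = int(root.find("size/width").text)
        img_h = int(root.find("size/height").text)
        kept = []
        for obj in root.iter("object"):
            box = obj.find("bndbox")
            w = int(box.find("xmax").text) - int(box.find("xmin").text)
            h = int(box.find("ymax").text) - int(box.find("ymin").text)
            if max_rel_wh is not None:
                ok = w / img_w < max_rel_wh and h / img_h < max_rel_wh
            else:
                ok = (w * h) / (img_w * img_h) < max_rel_area
            if ok:
                kept.append((obj.find("name").text, w, h))
        return kept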

5. Results and Analyses

In this section, we present the results achieved in the experimental phase. All models mentioned in this section, except for models cited from other papers, are trained in the same environment with 1 GPU: Ubuntu 16.04.4 LTS, an Intel(R) Xeon(R) Gold 6152 CPU @ 2.10 GHz, and a Tesla P100 GPU. In addition to comparative accuracy, other comparisons are also provided to make our assessment objective and clear.

5.1. Accuracy

5.1.1. Small Object Dataset. Following the detection results in Table 3, the methods belonging to the two-stage approach outperform those in the one-stage approach by about 8-10%. Specifically, Faster RCNN with the ResNeXT-101-64×4d-FPN backbone achieved the top mAP among two-stage approaches, and the top of the table as well, at 41.2%. In comparison, the best one-stage method, YOLOv3 608×608 with Darknet-53, obtained 33.1%. Following [32], methods based on region proposals, such as Faster RCNN, are better than

methods based on regression or classification, such as YOLO and SSD. This observation holds once again in the context of the small object dataset.

Considering the methods within each approach: first, the two-stage approaches. Faster RCNN, an improvement of Fast RCNN, is only about 1-2% better than Fast RCNN, and only with the ResNeXT backbones; it is equal to Fast RCNN for the rest. The difference is not large, meaning that an external region proposal method like selective search combined with RoI pooling performs as well as an internal one like RPN with RoI Align in this case. Besides, compared with R-CNN, we observe a boost of 8-10% when RoI pooling or RoI Align is added: R-CNN, which takes region proposals from selective search, feeds them into the network, and computes features directly from fc (fully connected) layers, only reaches 23.5% with AlexNet and 24.8% with VGG16 combined with proposals from RPN. However, Fast RCNN and Faster RCNN, with their two kinds of RoI layers, are much better: Fast RCNN achieves accuracies ranging from 31.7% to 39.6% across backbones, and similarly Faster RCNN ranges from 30.1% to 41.2%. Second, in the one-stage approaches, YOLO outperforms SSD and RetinaNet: YOLO gets the highest result at 33.1%, while SSD and RetinaNet reach 11.32% and 30.0%, respectively. YOLO and SSD are considered state-of-the-art methods in speed, at some sacrifice of accuracy. Still, there is a large difference in accuracy between YOLO and SSD; the difference is that SSD adds multiple convolutional layers behind the backbone, each layer with its own ability, instead of using 2 fully connected layers like YOLO. Although RetinaNet is assigned to the one-stage approach, it cannot run in real time. RetinaNet was proposed to deal with the imbalance between foreground and background through the focal loss; therefore, RetinaNet obtains a higher accuracy in comparison with the others, except for YOLOv3 (Darknet-53).
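To make the RoI pooling versus RoI Align distinction concrete, the snippet below contrasts the two operators using torchvision's implementations on a dummy feature map; the shapes, scale, and box are illustrative only:

    import torch
    from torchvision.ops import roi_align, roi_pool

    feat = torch.randn(1, 256, 38, 50)   # (N, C, H, W) feature map at 1/16 input scale
    boxes = torch.tensor([[0., 16., 16., 160., 120.]])  # (batch_idx, x1, y1, x2, y2)
    pooled = roi_pool(feat, boxes, output_size=(7, 7), spatial_scale=1 / 16)
    aligned = roi_align(feat, boxes, output_size=(7, 7), spatial_scale=1 / 16,
                        sampling_ratio=2)  # bilinear sampling avoids quantization
    print(pooled.shape, aligned.shape)    # both torch.Size([1, 256, 7, 7])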

When it comes to the backbones, we find that Darknet-53 is the best among the one-stage, real-time methods, and even far above ResNet-50, although it has a similar number of layers. In contrast, ResNeXT combined with FPN is the most powerful one in both one-stage and two-stage methods if we only consider accuracy.

Table 1: The information of the subsets.

Subsets        Classes   Images   Instances
VOC_MRA_058    16        329      529
VOC_MRA_10     20        2231     5893
VOC_MRA_20     20        2970     7867
VOC_WH_20      18        1070     2313

Table 2: The parameters of the models.

Method        Momentum   Decay    Gamma   Learning_rate   Batch_size   Training_days   Stepsize
YOLOv2 [16]   0.9        0.0005   n/a     0.001           8            5               25000
YOLOv3        0.9        0.0005   n/a     0.001           32           3-4             25000
SSD300 [16]   0.9        0.0005   0.1     0.000004        12           9               40000, 80000
SSD512 [16]   0.9        0.0005   0.1     0.000004        12           12              100000, 120000
RetinaNet     0.9        0.0005   0.1     0.001           64           4-12 h          25000, 35000
Fast RCNN     0.9        0.0005   0.1     0.001           64           4-12 h          25000, 35000
Faster RCNN   0.9        0.0005   0.1     0.001           64           4-12 h          25000, 35000


Overall, there is an increase of about 1-3% when changing from a simple backbone to a complex one of the same type. For example, when switching from the original ResNet to ResNet-FPN, accuracy is boosted by 2-3%; clearly, leveraging the multiscale features of FPN is a common way to improve detection and to tackle the scale imbalance of input images and of the bounding boxes of different objects. Similarly, switching from ResNeXT-101-32×8d-FPN to ResNeXT-101-64×4d-FPN changes the accuracy from 40.5% to 41.2% for Faster RCNN and from 38.7% to 39.6% for Fast RCNN. However, between ResNet-50-FPN and ResNet-101-FPN, growth only happens for Fast RCNN (from 33.3% to 35.5%); there is a slight decrease of 0.1% for Faster RCNN. This reduction also happens with RetinaNet: the simpler ResNeXT-101-32×8d-FPN backbone gets 30.0%, while ResNeXT-101-64×4d-FPN gets just 25.1%. This means that very deep backbones do not guarantee an increase in accuracy. The reason is that the advantage of a deeper network requires more parameters to learn, so a large amount of data must be fed into the network to train and update those parameters; in this case, the small object dataset is not abundant enough to fit a very deep network, which increases the chance of overfitting. Besides, the features coming from the early layers of ResNet are not well generalized, because when they are combined with FPN, accuracy improves by about 2-3%. When YOLO switches from Darknet-19 to Darknet-53, accuracy really is boosted: the highest accuracy with Darknet-19, at a resolution of 800×800, is just 24.02%, whereas YOLO 608×608 with Darknet-53 gets 33.1%. The explanation is that YOLOv3 with Darknet-53 carries several improvements over Darknet-19: YOLOv3 predicts objects at 3 scale locations, one of them specialized for small objects, instead of only one as with Darknet-19, and it also integrates cutting-edge advances such as residual blocks and shortcut connections. The reduction in accuracy appears again with YOLO when switching from ResNet-101 to ResNet-152, by about 1-2%. Among these methods, YOLO and SSD are the only ones that allow multiple input sizes. The higher the resolution of the input images, the higher the accuracy the method obtains, since a higher-resolution image provides more pixels to describe the visual information of small objects. However, if the resolution is far from the original size of the images, accuracy decreases; for example, YOLO 1024×1024 with Darknet-19 gets a lower accuracy than the 800×800 resolution. In addition, we tried increasing the resolution of Darknet-53 from 608 to 1024, and the mAP decreases once the resolution exceeds 608×608. The effect of image size is therefore clear for models like SSD and YOLO. Generally, all the comparative mAP results on this dataset are dominated by the classes with very large numbers of samples, which is caused by the data imbalance between the number of images and the instances they contain. For example, according to the statistics in [13], mouse is a major class contributing significantly to the mAP in Table 3, with the highest number of instances and images as well, whereas tissue box has the least contribution, with the lowest AP, originally caused by its small amount of data.
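Since mAP is the unweighted mean of the per-class APs, a frequent class with a high AP (mouse) pulls the mean up, while a rare class (tissue box) drags it down. A brief sketch of the PASCAL-style AP computation, using the all-point interpolation (the helper is ours):

    import numpy as np

    def average_precision(recalls, precisions):
        # Area under the precision-recall curve with the precision envelope
        # (the all-point interpolation used by PASCAL VOC since 2010).
        r = np.concatenate(([0.0], recalls, [1.0]))
        p = np.concatenate(([0.0], precisions, [0.0]))
        p = np.maximum.accumulate(p[::-1])[::-1]   # monotonically decreasing envelope
        idx = np.flatnonzero(r[1:] != r[:-1])
        return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])

    # mAP = np.mean([average_precision(r_c, p_c) for each class c])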

Furthermore, the imbalanced data bias the models toward detecting frequent objects, implying that the models will mistake objects with a nearly similar appearance to the dominant class for objects of interest, rather than for the less frequent classes. As a result, false positives increase. Figure 5 illustrates the detection results with the strongest backbones. Following this visualization, the domination of classes such as mouse or faucet results in misdetection on areas with an appearance similar to theirs. This confusion tends to be stronger for the weaker backbones in the comparison, and a one-stage method like YOLO, which primarily aims at speed, shows more misdetections than the two-stage methods. A reason for these problems is the difference in the way the deep networks are trained [33]: one-stage methods such as YOLO use a soft sampling method that uses the whole dataset to update the parameters, rather than choosing only some samples from the training data, whereas two-stage methods such as the RCNN family tend to employ hard sampling methods that randomly sample a certain number of positive and negative bounding boxes to train the network.
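A minimal sketch of the "hard sampling" scheme mentioned above, in the spirit of the RoI minibatch sampling of the RCNN family (the function name and the 256/0.25 defaults are illustrative):

    import numpy as np

    def sample_rois(labels, batch_size=256, fg_fraction=0.25, rng=None):
        # Randomly draw a fixed quota of foreground (label > 0) and
        # background (label == 0) RoIs to form the per-image training batch.
        rng = rng or np.random.default_rng()
        fg, bg = np.flatnonzero(labels > 0), np.flatnonzero(labels == 0)
        n_fg = min(int(batch_size * fg_fraction), len(fg))
        n_bg = min(batch_size - n_fg, len(bg))
        picks = []
        if n_fg:
            picks.append(rng.choice(fg, n_fg, replace=False))
        if n_bg:
            picks.append(rng.choice(bg, n_bg, replace=False))
        return np.concatenate(picks) if picks else np.empty(0, dtype=int)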

5.1.2. Subsets of PASCAL. With 4 subsets covering 4 different scales of objects in images, we want to find out how much the scales impact the models. The full results are shown in Table 4. We separate the results into 2 groups, the one-stage and two-stage approaches, and Figure 5 visualizes the strongest backbones of each method on the subsets.

In the case of different scales, as in our subsets, there is a difference between the one-stage and two-stage approaches: here, methods from the one-stage approach perform better than two-stage ones at most scales, which is the opposite of the small object dataset. Specifically, two-stage methods are clearly better than the one-stage ones restricted to real-time input sizes, and just slightly better than the nonreal-time models on VOC_WH_20, by about 10-20%, with the same result for the smaller objects in VOC_MRA_058 and VOC_MRA_10. However, for the bigger objects in VOC_MRA_20, the one-stage methods obtain significantly better outcomes than the two-stage ones. In addition, only Faster RCNN performs well enough in most cases to compare with the one-stage methods; Fast RCNN is only good at the big objects in VOC_MRA_20 and fails to detect smaller objects well.

In the one-stage approach, among the methods that allow multiple input sizes, such as YOLO and SSD, there are 2 kinds: those that can run in real time and those that cannot, once the resolution exceeds 640 or 512 for YOLO and SSD, respectively. For the real-time ones, YOLO outperforms SSD at all scales of objects. Specifically, YOLOv2 with Darknet-19 is better than SSD by about 2-6% for objects in VOC_MRA_058 and VOC_MRA_10 and by 4-15% for larger objects in VOC_MRA_20 and VOC_WH_20. YOLOv3 with Darknet-53 obtains results about 3-5% higher than YOLOv2, and hence also higher than SSD. However, if we consider nonreal-time input sizes, SSD is better than YOLO for objects in VOC_MRA_10. RetinaNet, the one-stage method that cannot run in real time, performs on par with the nonreal-time configurations of YOLO


and better than SSD. RetinaNet is more stable than SSD and YOLO when the scales change: the bigger the objects, the greater the stability. For example, the change is large, about 33%, when the scale increases from objects in VOC_MRA_058 to those in VOC_MRA_10 and VOC_MRA_20, but the change is small, about 10%, for the bigger objects, compared with 15-25% for YOLO. In YOLO's case, this remarkable increase in accuracy as objects get larger is obviously good for a model. The change in SSD resembles the change in RetinaNet.

Concerning the resolutions of YOLO and SSD, we see that increasing the image resolution generally pushes accuracy up. YOLOv2 with Darknet-19, YOLOv3 with Darknet-53, and SSD all gain accuracy at larger resolutions, except for YOLOv2 with objects belonging to VOC_MRA_10 and VOC_MRA_20 when the image is over 800; in addition, YOLOv2 fluctuates for objects in VOC_WH_20. As mentioned in our previous work, YOLO is better than SSD for objects smaller than 10% of the image; in this case, however, YOLOv3 is good at all scales of objects. This is because YOLOv3 has 3 detection locations coming with more ratios of default boxes, and combining the results from the 3 locations leads to a significant outcome.

When we switch to the two-stage approaches, Faster RCNN shows a significant improvement over Fast RCNN at most scales, except for objects in VOC_MRA_20, where the two have the same accuracy. This shows that when objects are strictly separated into different scales, RoI pooling does not work well with the smaller objects or with those in VOC_WH_20.

Table 3: Comparative results on the small object dataset (per-class AP and mAP, %).

Method              Backbone                  Clock  Faucet  Jar    Mouse  Outlet  Plate  Switch  Tel    t box  t paper  mAP
YOLO 416 [16]       Darknet-19                22.8   30.8    0.4    52.0   20.4    13.1   13.0    6.1    0      35.3     19.39
YOLO 448 [16]       Darknet-19                23.0   36.9    0.9    52.5   18.4    13.6   17.5    4.2    0      34.3     20.13
YOLO 480 [16]       Darknet-19                34.2   37.3    9.1    53.3   21.4    13.6   15.8    9.1    9.1    34.2     23.71
YOLO 512 [16]       Darknet-19                23.1   36.6    6.1    59.8   24.6    14.2   15.7    9.1    4.5    32.4     22.61
YOLO 554 [16]       Darknet-19                23.4   37.2    9.1    60.1   27.2    13.4   19.9    9.1    4.5    34.5     23.84
YOLO 640 [16]       Darknet-19                20.2   36.2    3.2    59.8   27.8    11.7   18.1    8.2    4.5    35.6     22.53
YOLO 800 [16]       Darknet-19                27.6   36.0    2.3    60.2   32.8    13.1   23.3    9.1    9.1    26.7     24.02
YOLO 1024 [16]      Darknet-19                21.7   29.3    1.4    58.3   26.4    11.8   17.5    9.1    9.1    15.7     20.03
YOLO 320            Darknet-53                26.22  38.38   4.55   56.46  36.42   13.34  24.8    10.65  4.55   42.96    25.83
YOLO 416            Darknet-53                28.47  47.15   10.83  60.49  43.15   15.87  30.73   15.15  2.62   48.3     30.28
YOLO 608            Darknet-53                29.98  47.89   10.76  65.88  48.02   18.09  31.22   14.62  17.99  46.56    33.1
YOLO 320            ResNet-50                 19.57  25.73   0.67   45.17  14.37   9.38   13.84   9.09   9.09   23.7     17.06
YOLO 416            ResNet-50                 23.78  36.65   0.4    54.23  18.37   13.75  19.78   9.84   9.42   35.68    22.19
YOLO 608            ResNet-50                 26.92  40.65   1.77   61.86  29.18   15.04  20.24   10.09  13.29  36.01    25.5
YOLO 320            ResNet-101                20.52  27.9    0.57   44.68  16.98   13.05  13.66   9.66   9.09   24.36    18.05
YOLO 416            ResNet-101                25.72  35.6    3.03   55.73  22.4    15.61  17.26   9.32   3.03   38.71    22.64
YOLO 608            ResNet-101                28.79  44.59   9.42   62.18  33.34   15.53  23.88   13.24  15.83  39.17    28.6
YOLO 320            ResNet-152                21.64  27.56   3.03   48.06  17.39   11.12  14.51   9.09   4.55   31.88    18.88
YOLO 416            ResNet-152                25.7   36.54   0.89   53.81  20.6    14.13  20.21   11.49  0.29   33.06    21.67
YOLO 608            ResNet-152                26.01  44.54   4.55   61.0   31.76   13.02  22.67   12.35  9.93   39.99    26.58
SSD300 [16]         ResNet-101                5.5    9.1     0      25.5   6.1     4.5    0       4.5    9.1    18.2     8.25
SSD300 [16]         VGG16                     9.1    17.1    0      26.1   9.1     9.1    0       4.5    0      16.7     9.16
SSD512 [16]         VGG16                     9.1    17.1    0      43.0   9.1     9.1    9.1     9.1    0      7.6      11.32
RetinaNet           ResNet-50-FPN             30.7   49.3    2.0    65.5   21.3    16.1   8.5     12.9   1.0    25.7     23.3
RetinaNet           ResNet-101-FPN            30.6   48.7    7.1    64.7   20.0    15.9   11.8    10.7   2.9    38.7     25.1
RetinaNet           ResNeXT-101-32×8d-FPN     35.5   55.0    12.1   66.5   23.9    18.4   9.8     16.2   9.4    53.7     30.0
RetinaNet           ResNeXT-101-64×4d-FPN     31.4   50.2    8.9    66.3   20.8    15.3   9.4     14.0   2.2    32.4     25.1
R-CNN [13]          RPN prop. + VGG16         31.9   31.3    4.2    56.8   31.1    9.3    14.2    16.4   23.4   29.4     24.8
R-CNN [13]          AlexNet 7×, 300 prop.     32.4   27.2    5.1    56.9   28.0    9.8    13.6    12.4   17.9   35.6     23.9
R-CNN [13]          VGG16 7×, 300 prop.       37.3   30.3    7.2    60.6   41.5    15.8   21.5    13.7   22.0   33.3     28.4
R-CNN [13]          ContextNet (AlexNet 7×)   32.7   26.8    4.6    56.4   26.3    9.9    12.9    12.2   18.7   34.0     23.5
Fast RCNN           ResNet-50-C4              32.4   46.3    6.5    65.8   38.3    20.1   25.3    16.6   14.1   52.0     31.7
Fast RCNN           ResNet-50-FPN             37.4   47.3    7.3    68.9   46.7    21.0   32.1    17.1   9.3    45.9     33.3
Fast RCNN           ResNet-101-FPN            39.3   50.3    10.6   68.3   47.1    20.4   33.3    18.6   15.4   51.4     35.5
Fast RCNN           ResNeXT-101-32×8d-FPN     47.5   54.8    10.3   71.8   54.0    21.4   34.4    21.7   17.7   53.5     38.7
Fast RCNN           ResNeXT-101-64×4d-FPN     45.4   55.7    10.9   72.5   53.3    24.0   36.9    22.9   16.0   58.1     39.6
Faster R-CNN [16]   VGG16                     23.76  37.65   8.03   54.0   16.16   11.88  15.12   9.1    6.25   37.29    21.92
Faster RCNN         ResNet-50-C4              32.2   44.6    6.6    65.9   35.2    17.5   25.7    19.6   13.7   40.0     30.1
Faster RCNN         ResNet-50-FPN             35.7   49.9    7.3    68.4   48.9    18.8   29.6    14.7   11.4   53.3     33.8
Faster RCNN         ResNet-101-FPN            39.8   49.2    4.9    68.2   47.0    18.5   29.7    14.0   12.9   52.2     33.7
Faster RCNN         ResNeXT-101-32×8d-FPN     49.8   56.6    11.4   72.1   56.3    23.2   37.0    20.8   18.8   58.7     40.5
Faster RCNN         ResNeXT-101-64×4d-FPN     49.6   58.6    12.2   72.5   54.5    23.2   36.9    20.8   20.1   63.1     41.2

(In the original layout, the values in bold represent the best among one-stage methods, and the ones in italics represent the highest among two-stage methods.)


In addition, compared with the one-stage methods, Fast RCNN is significantly lower, whereas RoI Align along with RPN performs well when the scales change. When it comes to the backbones, there is a small decrease in accuracy when changing from ResNet-50-FPN to ResNet-101-FPN, or from ResNeXT-101-32×8d-FPN to ResNeXT-101-64×4d-FPN, for objects of all scales, for both Faster RCNN and Fast RCNN. The VGG16 backbone has a surprisingly good outcome compared with strong backbones such as ResNet or ResNeXT: although its accuracy is below the two strong backbones, VGG16 is still better for objects in VOC_WH_20 and changes little in accuracy as objects grow in size.

5.2. Time Processing and Resource Consumption. Tables 5 and 6 show the performance comparison of the evaluated models with their base networks. Generally, RAM consumption in testing and training increases as more layers are added: the deeper the network, the greater the processing demand, because of the increase in parameters and in the time to process the data. YOLO is the model consuming the least memory in both the training and testing phases; in particular, YOLO needs only about 4 GB to 5 GB for training and 1.6 GB to 1.8 GB for testing with Darknet-53. YOLO is also the only one able to run in real time: it needs only about 0.03 s to 0.04 s to process an image, compared with more than 0.1 s and 0.2 s for Faster RCNN and RetinaNet. This allows us to deploy such models on devices with modest memory. While RetinaNet is assigned to the one-stage approach, it is not fast enough for real-time detection. The inference time of Fast RCNN is a little lower than those of Faster RCNN and RetinaNet; in contrast, the RAM consumption in training and testing of RetinaNet is lower than those of Fast RCNN and Faster RCNN. Of all the architectures, ResNet-50-C4 is the one requiring the most memory and time to process data, because the output size of ResNet-50-C4 is a bit bigger than the others [9]. If we consider ResNet or ResNeXT combined with FPN, Faster RCNN is over 100 MB above Fast RCNN and 300 MB above RetinaNet. In addition, according to Table 2, Faster RCNN and RetinaNet need less training time, only a few hours to 1 day, rather than the 3-4 days of YOLO. This demonstrates that if we focus on performance and do not have much time for training, we should choose Faster RCNN or RetinaNet instead of YOLO; in contrast, if we only focus on processing speed while still achieving good performance, the one-stage methods are always the good choice. In the same context of backbones, RetinaNet uses fewer resources at testing time than Fast RCNN and Faster RCNN, by about 100 MB and 300 MB, respectively. However, at training time RetinaNet uses much more memory, about 2.8 GB more than Fast RCNN and about 2.3 GB more than Faster RCNN, for ResNeXT-101-32×8d-FPN and ResNeXT-101-64×4d-FPN.

[Figure 4: The location of the default boxes at different scales: (a) image with GT boxes; (b) 8×8 feature map; (c) 4×4 feature map. Each default box predicts location offsets Δ(cx, cy, w, h) and confidences (c1, c2, ..., cp).]


Table 4: The comparative results on subsets of PASCAL VOC 2007 (mAP, %). This table illustrates how well the models adapt to different scales of objects. (In the original layout, the values in bold represent the best among one-stage methods, and the ones in italics represent the highest among two-stage methods.)

Approach    Method                              VOC_MRA_058   VOC_MRA_10   VOC_MRA_20   VOC_WH_20
One stage   YOLOv2 416 [16]                     3.02          31.38        42.89        18.52
One stage   YOLOv2 448 [16]                     4.47          32.9         60.15        21.96
One stage   YOLOv2 480 [16]                     4.26          33.48        60.78        26.67
One stage   YOLOv2 512 [16]                     5.42          35.74        61.12        24.63
One stage   YOLOv2 544 [16]                     6.97          36.56        63.0         26.62
One stage   YOLOv2 640 [16]                     7.7           37.97        61.29        23.41
One stage   YOLOv2 800 [16]                     10.24         37.3         61.91        26.9
One stage   YOLOv2 1024 [16]                    10.69         29.93        55.14        28.97
One stage   YOLOv3 320                          7.18          34.58        60.36        20.4
One stage   YOLOv3 416                          10.2          38.97        62.53        24.12
One stage   YOLOv3 608                          11.7          42.65        68.56        28.86
One stage   SSD 300 [16]                        1.71          32.76        46.26        16.91
One stage   SSD 512 [16]                        2.9           43.46        57.11        19.87
One stage   RetinaNet-ResNet-50-FPN             8.84          41.5         50.2         28.14
One stage   RetinaNet-ResNet-101-FPN            8.95          42.5         51.9         27.46
One stage   RetinaNet-ResNeXT-101-32×8d-FPN     10.29         45.4         54.5         30.08
One stage   RetinaNet-ResNeXT-101-64×4d-FPN     10.71         45.5         55.1         31.32
Two stage   Fast RCNN-ResNet-50-C4              0.23          13.2         49.9         3.93
Two stage   Fast RCNN-ResNet-50-FPN             0.63          13.5         55.6         3.45
Two stage   Fast RCNN-ResNet-101-FPN            0.39          15.9         57.6         3.12
Two stage   Fast RCNN-ResNeXT-101-32×8d-FPN     0.51          14.4         57.9         3.33
Two stage   Fast RCNN-ResNeXT-101-64×4d-FPN     0.29          14.2         57.3         3.76
Two stage   Faster RCNN-ResNet-50-C4            6.98          39.9         48.7         26.04
Two stage   Faster RCNN-ResNet-50-FPN           10.74         45.6         56.3         29.79
Two stage   Faster RCNN-ResNet-101-FPN          10.63         46.9         57.6         30.57
Two stage   Faster RCNN-ResNeXT-101-32×8d-FPN   11.64         47.3         57.6         32.12
Two stage   Faster RCNN-ResNeXT-101-64×4d-FPN   10.54         47.1         56.9         31.64
Two stage   Faster RCNN-VGG16 [16]              5.73          35.58        44.14        41.11


If we consider this on the small object dataset, it does not help much, since RetinaNet is about 10% below Faster RCNN in performance. On the different scales of the subsets, however, RetinaNet works well in comparison with Faster RCNN, and the difference is just 2-4 percentage points. Although the ResNet backbones combined with the other components yield an improvement in accuracy, they do not work for YOLO on the small object dataset.

Table 5: The comparison of consumption on the small object dataset.

Model         Backbone                  Inference time (s)   Test RAM (MiB)   Train RAM (MiB)
YOLOv3        Darknet-53                0.0331               1825             4759
YOLOv3        ResNet-50                 0.027                1285             3479
YOLOv3        ResNet-101                0.0356               1829             5383
YOLOv3        ResNet-152                0.0454               2443             7531
RetinaNet     ResNet-50-FPN             0.102                2075             4435
RetinaNet     ResNet-101-FPN            0.127                2723             5577
RetinaNet     ResNeXT-101-32×8d-FPN     0.229                3767             7863
RetinaNet     ResNeXT-101-64×4d-FPN     0.292                3719             7813
Fast RCNN     ResNet-50-C4              0.3                  6449             5877
Fast RCNN     ResNet-50-FPN             0.089                2277             4455
Fast RCNN     ResNet-101-FPN            0.113                2947             5627
Fast RCNN     ResNeXT-101-32×8d-FPN     0.212                3987             4961
Fast RCNN     ResNeXT-101-64×4d-FPN     0.269                3885             4799
Faster RCNN   ResNet-50-C4              0.412                6609             6129
Faster RCNN   ResNet-50-FPN             0.101                2387             5381
Faster RCNN   ResNet-101-FPN            0.124                3001             6487
Faster RCNN   ResNeXT-101-32×8d-FPN     0.256                4027             5333
Faster RCNN   ResNeXT-101-64×4d-FPN     0.286                4003             5246

Table 6: The comparison of consumption on subsets filtered from PASCAL VOC.

Model         Backbone                  Inference time (s)   Test RAM (MiB)   Train RAM (MiB)
YOLOv3        Darknet-53                0.027                1645             4079
RetinaNet     ResNet-50-FPN             0.1                  1935             4133
RetinaNet     ResNet-101-FPN            0.116                2585             5435
RetinaNet     ResNeXT-101-32×8d-FPN     0.222                3641             7723
RetinaNet     ResNeXT-101-64×4d-FPN     0.284                3561             7599
Fast RCNN     ResNet-50-C4              0.495                6371             5677
Fast RCNN     ResNet-50-FPN             0.092                2131             4387
Fast RCNN     ResNet-101-FPN            0.114                2819             5463
Fast RCNN     ResNeXT-101-32×8d-FPN     0.213                3873             4637
Fast RCNN     ResNeXT-101-64×4d-FPN     0.265                3735             4575
Faster RCNN   ResNet-50-C4              0.26                 6141             5991
Faster RCNN   ResNet-50-FPN             0.1                  2245             5207
Faster RCNN   ResNet-101-FPN            0.13                 2855             6335
Faster RCNN   ResNeXT-101-32×8d-FPN     0.225                3943             5087
Faster RCNN   ResNeXT-101-64×4d-FPN     0.276                3885             4909

[Figure 5: Highlights of bounding boxes from comparative backbones on the small object dataset, panels (a)-(d). We select YOLO with Darknet-53 and ResNet-50 for an objective comparison because the two networks have roughly the same number of layers along with the same significant techniques, such as skip connections and residual blocks. The bounding boxes show that ResNet-50 is more sensitive than Darknet-53 to areas resembling the objects of interest. Similarly, ResNet-50-FPN and ResNet-50-C4 are chosen for consideration: the detections show that combining ResNet-50 with FPN yields better performance than the original network; in particular, misdetections occur more densely for ResNet-50-C4 than for ResNet-50-FPN, as in columns 4 and 5. Zoom in to see more detail.]


YOLO with Darknet-53 utilizes more resources than the ResNet backbones, but it has the best accuracy among the models; therefore, we only test YOLO with Darknet-53 on the subsets of PASCAL.
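Latency and memory figures of this kind can be collected with a simple harness like the sketch below (PyTorch-based and ours; the numbers in Tables 5 and 6 depend on the specific frameworks and hardware used):

    import time
    import torch

    @torch.no_grad()
    def measure(model, images, warmup=10):
        # Rough per-image latency (s) and peak GPU memory (MiB); `images` is
        # an iterable of preprocessed input tensors for the given model.
        model.eval().cuda()
        for img in images[:warmup]:
            model(img.cuda())                  # warm up kernels and caches
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
        start = time.perf_counter()
        for img in images:
            model(img.cuda())
        torch.cuda.synchronize()
        latency = (time.perf_counter() - start) / len(images)
        return latency, torch.cuda.max_memory_allocated() / 2**20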

5.3. Analyses of the Trade-Off among Detectors. Network designs and approaches like the one-stage approach prove their performance when applied to detecting general objects, at small scales as well as other scales. Although these models are fast and accurate, one drawback persists in all of them: the trade-off between accuracy and processing speed. For example, YOLOv3 proposes performing detection at three different scales, and this result is obviously impressive and yields good performance; however, to gain this advantage, YOLOv3 has to sacrifice processing time, since instead of the input being processed once for detection, as in YOLOv2, this idea must run 3 times. This trade-off is also partly affected by the resolution, as we change it during training or testing. In our previous work, we mentioned that we have to choose the right resolution to ensure our models work properly. In the case of the two-stage approaches, the idea of generating region proposals to improve the localization of objects for detection is good as well. It is useful, but we have to take into account whether we should generate proposals on feature maps or directly on input images, because this strongly affects the way the models run and identify representations of objects. If objects are normal, or have a big or medium appearance, the models work well; but if objects come at multiple scales, this is a problem to consider and research deeply in order to balance and improve performance. To partly fix this problem, the one-stage approach allows us to choose a fixed input size for training and testing, but the benefit still depends on the characteristics of the datasets we evaluate and on the image size. After all, all the models we chose to evaluate are affected by the scales of objects: when we change the scale, the accuracy of the models changes a lot, except for Faster RCNN, the only model that seems stable with respect to scale, especially when combined with the VGG16 architecture. Although the accuracy of VGG16 is not better than that of the other architectures, the difference is that it does not change too much in accuracy. This only holds for big objects whose bounding boxes overlap more than 10% of the image; otherwise, it is not assured.

Figure 1 shows that small objects present more possibilities for confusion than other objects: the black lens of the camera is somehow similar to the black mouse placed on a mouse pad. This possibility of small object presence causes more difficulties for detectors and leads to wrong detections. Small objects can appear anywhere in an image, so detectors produce many wrong detections on familiar appearances they have already seen. If we consider the visualization of the detections in Figure 5, the wrong detections are partly similar in appearance to the other objects in the dataset. This problem is caused by the data imbalance between classes and between the instances of each class, originally known as the foreground-foreground class imbalance. In other words, the common problems, which arise not only for small objects but for whole datasets, are intraclass similarity and interclass variation.

6. Conclusion

Small object detection is a challenging and interesting problem in the task of object detection, and it has drawn attention from researchers thanks to the development of deep learning, which is the motivation for improving the performance of computer vision tasks. Although deep detection models originally tend to solve problems related to general object detection, they still contribute at a particular level to the success of small object detection. As an evaluation work on small object detection with deep models, our goal is to highlight remarkable achievements of popular and state-of-the-art deep models in order to provide a variety of views on applying deep models to small object detection. In particular, we evaluate state-of-the-art detectors based on deep learning from the two approaches, namely YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, on two datasets, the small object dataset and the subsets filtered from PASCAL VOC, with respect to the effects of different factors, objectively including accuracy, execution time, and resource usage.

In spite of the successful achievements of recent years, in which detection performance has improved significantly, there is still a huge gap in accuracy between normal objects and small objects. Under the criteria of the COCO dataset, the difference from the small scale to the medium and big scales is too large: most models are good at detecting normal objects, and problems arise when applying them to small objects. As a result, to reduce the gap in small object detection, the first thing to do is to invest in datasets that have massive amounts of data to train models and a wide range of categories, so as to compete with the human visual system [12, 34].

So far, detection models are divided into two main approaches, namely the one-stage approach and the two-stage approach. Models in the one-stage approach are known as detectors with better and more efficient detection in comparison with the other approach; this efficiency gives them the potential to run in real time and makes them applicable to practical applications. However, the trade-off between accuracy and speed is a difficult challenge that needs to be taken into account in order to close the gap. Models in the two-stage approach, on the other hand, have their reputation as region-based detectors with high accuracy, but they are too slow to apply to the real world; this drawback comes from the computation of their networks.

Through our evaluation, it is a fact that the architectures utilized as base networks to extract deep features have significant effects on the frameworks. The deeper the architecture, the higher the detection accuracy. Once a network grows in depth, it has more layers than normal ones and a massive number of parameters to train; hence, it needs a lot of data to fine-tune these parameters reasonably. And as computation increases, resource consumption will also increase.


As a result, it will be difficult to take these models into practical applications. Besides, the exploitation of context in the models is definitely limited; this causes much useful and informative data to be ignored during training, especially in the context of small objects. Because small objects can appear anywhere in an input image, if the context of the image is well exploited, the performance of small object detection will be improved.

For all the above reasons, and according to our evaluation, if we aim at good performance and can ignore processing speed, two-stage methods like Faster RCNN perform well and have demonstrated their network design on different datasets and many contexts of objects, including multiscale objects; therefore, Faster RCNN is considered a giant baseline to build on or develop from. If our target is a balance of accuracy and speed, YOLO is a good choice, provided we do not care about the training time, because its trade-off between speed and accuracy is worth applying to practical applications; otherwise, Faster RCNN or RetinaNet is still a substitute to work with. When it comes to backbones, we have to consider the data in order to choose a reasonable backbone to combine with the methods, because the amount of data significantly impacts the model: if the data are not abundant, a shallow network will fit them well. Besides, there is recently a novel approach promising for training deep models with less data, that is, weakly supervised learning such as zero-shot, one-shot, or few-shot learning; these approaches will be considered in our future works. Following our recent investigation, to obtain better performance on object detection we have to consider several factors to improve the mAP, such as multiscale training, superresolution for scaling up the visual information of small objects [35], or preprocessing the data to avoid imbalance, because a wide range of imbalance problems relate to data [33].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant no. B2017-26-01.

References

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580-587, Columbus, OH, USA, June 2014.

[2] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proceedings of the European Conference on Computer Vision, pp. 346-361, Springer, Zurich, Switzerland, September 2014.

[3] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440-1448, Santiago, Chile, December 2015.

[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779-788, Las Vegas, NV, USA, June 2016.

[5] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," 2016, https://arxiv.org/abs/1612.08242.

[6] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.

[7] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[8] K. Židek, A. Hošovský, J. Piteľ, and S. Bednár, "Recognition of assembly parts by convolutional neural networks," in Advances in Manufacturing Engineering and Materials, pp. 281-289, Springer, Cham, Switzerland, 2019.

[9] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, Venice, Italy, pp. 2980-2988, October 2017.

[10] L.-C. Chen, A. Hermans, G. Papandreou et al., "Instance segmentation by refining object detection with semantic and direction features," 2017, https://arxiv.org/abs/1712.04837.

[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303-338, 2010.

[12] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft COCO: common objects in context," in Proceedings of the European Conference on Computer Vision, pp. 740-755, Springer, Zurich, Switzerland, September 2014.

[13] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, "R-CNN for small object detection," in Proceedings of the Asian Conference on Computer Vision, pp. 214-230, Springer, Taipei, Taiwan, November 2016.

[14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627-1645, 2010.

[15] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," in Proceedings of the 28th International Conference on Neural Information Processing Systems, Ser. NIPS'15, pp. 91-99, MIT Press, Cambridge, MA, USA, 2015, http://dl.acm.org/citation.cfm?id=2969239.2969250.

[16] P. Pham, D. Nguyen, T. Do, T. D. Ngo, and D.-D. Le, "Evaluation of deep models for real-time small object detection," in Proceedings of the International Conference on Neural Information Processing, pp. 516-526, Springer, Guangzhou, China, November 2017.

[17] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154-171, 2013.

[18] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu, "Traffic-sign detection and classification in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2110-2118, Las Vegas, NV, USA, June 2016.

[19] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: a large data set for nonparametric object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958-1970, 2008.

[20] A. Kembhavi, D. Harwood, and L. S. Davis, "Vehicle detection using partial least squares," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1250-1265, 2011.

[21] V. I. Morariu, E. Ahmed, V. Santhanam, D. Harwood, and L. S. Davis, "Composite discriminant factor analysis," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 564-571, IEEE, Steamboat Springs, CO, USA, March 2014.

[22] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 2012.

[23] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: human trajectory prediction in crowded spaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961-971, Las Vegas, NV, USA, June 2016.

[24] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, "Sun database: exploring a large collection of scene categories," International Journal of Computer Vision, vol. 119, no. 1, pp. 3-22, 2016.

[25] E. Dong, Y. Zhu, Y. Ji, and S. Du, "An improved convolution neural network for object detection using YOLOv2," in Proceedings of the 2018 IEEE International Conference on Mechatronics and Automation (ICMA), pp. 1184-1188, IEEE, Changchun, China, August 2018.

[26] W. Liu, D. Anguelov, D. Erhan et al., "SSD: single shot multibox detector," in Proceedings of the European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 21-37, October 2016.

[27] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, p. 4, Honolulu, HI, USA, July 2017.

[28] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: deconvolutional single shot detector," 2017, https://arxiv.org/abs/1701.06659.

[29] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision, pp. 210-226, Springer, Munich, Germany, September 2018.

[30] L. Liu, W. Ouyang, X. Wang et al., "Deep learning for generic object detection: a survey," 2018, https://arxiv.org/abs/1809.02165.

[31] P. Zhu, L. Wen, X. Bian, H. Ling, and Q. Hu, "Vision meets drones: a challenge," 2018, https://arxiv.org/abs/1804.07437.

[32] Z.-Q. Zhao, P. Zheng, S.-t. Xu, and X. Wu, "Object detection with deep learning: a review," 2018, https://arxiv.org/abs/1807.05511.

[33] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas, "Imbalance problems in object detection: a review," 2019, https://arxiv.org/abs/1909.00169.

[34] O. Russakovsky, J. Deng, H. Su et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.

[35] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 206-221, Munich, Germany, September 2018.



41 Experimental Setting We continually train and evaluatevarious object detectors on the two datasets such as PASCALVOC [11] and a newly generated dataset [16] (e evaluatedapproaches in this time consist of Faster RCNN [15]YOLOv3 [6] and RetinaNet [7] with different backbonesExcept for YOLOv3 the others are trained and evaluated bythe Detectron python code

Currently the original datasets which commonly areused in object detection are PASCAL VOC [11] and COCO[12] Both datasets are constructed by almost large objects orother kinds of objects whose size fill a big part in the image(ese two datasets are not suitable for small object detectionIn addition there is another dataset which is large-scale andincludes a lot of classes for small object detection collectedby drones and named VisDrone dataset [31] However itdoes not publish the labels for test set to evaluate and theviews of images are topdown which is not our case As aresult in order to evaluate the detection performance of themodels we use a dataset which was published in [13] (isdataset is called small object dataset which is the combi-nation between COCO [12] and SUN [24] dataset (ere are10 classes in small object dataset including mouse tele-phone switch outlet clock toilet paper (t paper) tissue box(t box) faucet plate and jar (e whole dataset consists of4925 images in total and there are 3296 images for trainingand 1629 images for testing (e mouse class owns thelargest number of instances in images 2137 instances in 1739images and the tissue box class has the smallest number ofinstances 103 instances in 100 images Apart from the smallobject dataset we also filter subsets from PASCAL VOC2007 following standard definitions On PASCAL VOCthere are 20 classes but with small object detection there arefewer classes on strict definitions of small objects Table 1lists the details of the number of small objects and imagescontaining them for subsets of the dataset

We trained all models on small object dataset with thesame parameters Particularly in the training phase wetrained the models with 70k iterations with the parametersincluding momentum decay gamma learning rate batchsize step size and training days in Table 2 At the firstmoment we attempted to start off the models with a higherlearning rate 10minus 2 but the models diverged leading to theloss value being NaN or Inf after 100 first iterations(en wetried at a lower learning rate 10minus3 at 100 first iterations andrise to 10minus2 to consider if the models can converge as startingoff at a lower learning rate However it remained unchangedanything We also saw that the models converged quicklyduring 10k first iterations with 10minus3 and then progressively

slow down after 20k (erefore we decided to start off thetraining with a learning rate at 10minus3 and decrease to 10minus 4 and10minus5 at 25k and 35k iterations respectively (is settingshows that the loss value was stable from 40k but we set thetraining up to 70k to consider how the loss value changesand saw that it did not change a lot after 40k iterations Wetried to evaluate the models from 30k to 70k and generallythe performance of the models was not stable after 40k it-erations For this reason we picked up the weight forevaluation at 30k and 40k iterations At 30k iterationsYOLO achieves the best results and others get the best one at40k iterations In case of subsets of PASCAL VOC 2007 wecombine train and valid set from PASCAL VOC 2007 and2012 to form a training set PASCAL VOC 2012 works as adata augmentation set for PASCAL VOC 2007 We use thiscombined training set to train all models and test them onsubsets All models train the same parameter First due tothe limitation of memory we rescale all the size of images tothe same size with the shortest side 600 and the lengthiestside 1000 as in [15]

In YOLOv3 we run the K-means clustering algorithm inorder to initialize 9 suitable default bounding boxes fortraining and testing phases of our selected datasets and wechanged the anchors value (e following are 9 anchors forsmall object dataset after running the K-means algorithm[103459 144216] [262937 190947] [214024 363180][479317 291237] [404932 637489] [836447 513203][722167 1199181] [1727416 1170773] and [12465972528465]

In Faster R-CNN to fairly compare with the prior workand deploy on different backbones we also reuse directly theanchor scales and aspect ratios following the paper [13] suchas anchor scales 16times16 40times 40 and 100times100 pixels andaspect ratio 05 1 and 2 instead of having to cluster a set ofdefault bounding boxes similar to YOLOv3 Similarly inRetinaNet we keep the default setting for training such asgamma loss 20 alpha loss 025 anchor scale 4 andscalers per octave 3 because of following authors and thisconfiguration is the optimized valuables

42 Our Newly Generated Dataset In this time to have anobjective comparison we also use our newly generateddataset and the information of this dataset is shown inTable 1 We use it to consider the effects of object sizesamong factors including models time of processing accu-racy and resource consumption (e dataset consists of 4subsets filtered from PASCAL VOC 2007 such as

Batch normHi-res classifierConvolutionalAnchor boxes

New neworkDimension priors

Location predictionPassthrough

MultiscaleHi-res detector

YOLO YOLOv2

VOC2007 mAP

634 658 695 692 696 744 754 768 786

Figure 3 mAP of YOLOv2 at each added part [5]

Journal of Electrical and Computer Engineering 9

VOC_WH_20 VOC_MRA_058 VOC_MRA_10 andVOC_MRA_20 and detail information is provided asfollows

(i) VOC2007_WH_02 contains objects whose widthand height are less than 20 of an imagersquos width andheight (is one has fewer than PASCAL VOC 2007two classes such as dining table and sofa because ofthe constraint of the definition

(ii) VOC_MRA_058 VOC_MRA_10 andVOC_MRA_20 compose objects occupying themaximum mean relative area of the original imageunder 058 10 and 20 respectively Two ofthem have the same number of PASCAL VOC 2007classes except for VOC_MRA_058 and the one hasfewer four classes such as dining table dog sofa andtrain

5 Results and Analyses

In this section we show results that we achieved through theexperimental phase All models mentioned in this sectionexcept for models cited from other papers are trained on thesame environment and 1 GPU Ubuntu 16044 LTS Intel(R) Xeon (R) Gold 6152 CPU 210GHz GPU Tesla P100In addition to the comparative accuracy other comparisonsare also provided to make our objective and clear assessmentresults

51 Accuracy

511 Small Object Dataset Following the detection resultsin Table 3 methods which belong to two-stage approachesoutperform ones in one-stage approaches about 8ndash10Specifically Faster RCNN with ResNeXT-101-64times 4d-FPNbackbone achieved the top mAP in two-stage approachesand the top of the table as well 412 In comparison withthe top in one-stage approaches YOLOv3 608times 608 withDarknet-53 obtained 331 Following [32] methods basedon region proposal such as Faster RCNN are better than

methods based on regression or classification such as YOLOand SSD Actually this is also right once again as in contextof small object dataset

For methods in each approach Firstly two-stage ap-proaches Faster RCNN which is an improvement of FastRCNN is only greater than Fast RCNN about 1ndash2 but onlyfor ResNeXT backbones and equal to Fast RCNN for the rest(e difference here is not too much and it means that theperformance of external region proposal like selective searchcombined with ROI pooling is as good as internal regionproposal like RPN with ROI aligned in this case Besidescompared to R-CNN we perceive that there is a boost 8ndash10when RoI pooling or RoI aligned is added because R-CNNwhich uses region proposals from selective search then feedsthem into the network and directly computes features from fc(fully connected) layers only receives 235 with Alexnetand 248 with VGG16 combined with proposals from RPNHowever Fast RCNN and Faster RCNN with two kinds ofRoIs are much better Fast RCNN receives accuracy in a rangeof 317 to 396 based on different backbones SimilarlyFaster RCNN gets 301 to 412 Secondly in one-stageapproaches YOLO outperforms SSD and RetinaNet How-ever YOLO gets the highest outcome 331 and SSD andRetinaNet get 1132 and 30 respectively YOLO and SSDare considered as state-of-the art methods in speed andsacrificing accuracy However there is a large difference inaccuracy between YOLO and SSD the difference here is thatSSD adds multiple convolutional layers behind the backboneand each layer has their own ability instead of using 2 fullyconnected layers like YOLO Although RetinaNet is assignedinto a method in one-stage approaches it cannot run in realtime RetinaNet is one which is proposed to deal with theimbalance between foreground and background by the focalloss (erefore RetinaNet obtains a higher accuracy incomparison with others except for YOLOv3 (Darknet-53)

When it comes to the backbones we realized that Dar-knet-53 is the best in one-stage and real-time methods andeven far higher than ResNet-50 although it similarly has thesame layers with ResNet-50 In contrast ResNeXT combinedwith FPN is themost powerful one in both one-stage and two-

Table 1 (e information of the subsets

Subsets Classes Images InstancesVOC_MRA_058 16 329 529VOC_MRA_10 20 2231 5893VOC_MRA_20 20 2970 7867VOC_WH_20 18 1070 2313

Table 2 (e parameters of models

Method Momentum Decay Gamma Learning_rate Batch_size Training_days StepsizeYOLOv2 [16] 09 00005 0001 8 5 25000YOLOv3 09 00005 0001 32 3ndash4 25000SSD300 [16] 09 00005 01 0000004 12 9 40000 80000SSD512 [16] 09 00005 01 0000004 12 12 100000 120000RetinaNet 09 00005 01 0001 64 4-gt12 h 25000 35000Fast RCNN 09 00005 01 0001 64 4-gt12 h 25000 35000Faster RCNN 09 00005 01 0001 64 4-gt12 h 25000 35000

10 Journal of Electrical and Computer Engineering

stage methods if we only consider accuracy Overall there isan increase about 1ndash3 for changing the simple backbone tothe complex one in each type For example when switchingfrom original ResNet to ResNet-FPN the accuracy is boostedfrom 2 to 3(is is clear that leveraging the advantages frommultiscale features of FPN is a common way to improvedetection and tackle the scale imbalance of input images andbounding boxes of different objects Similarly we switchResNeXT-101-32times 8d-FPN to ResNeXT-101-64times 4d-FPNand the accuracy changes from 405 to 412 for FasterRCNN and from 387 to 396 for Fast RCNN Howeverwhen considering between ResNet-50-FPN and ResNet-101-FPN the growth only happens in Fast RCNN from 333 to355(ere is a little bit decrease 01 for Faster RCNN(isreduction also happens with RetinaNet while the simplerbackbone ResNeXT-101-32times 8d-FPN gets 30 and theResNeXT-101-64times 4d-FPN just gets 251 It means that thevery deeper backbones do not guarantee the increase inaccuracy and the reason is that an advantage of a deepernetwork needsmore parameters to learn It means the authorsmust have a large number of data to feed into the network totrain and update parameters itself but in this case the data ofsmall object dataset are not abundant too much to fit the verydeep network and hence increasing the chances of overfittingBesides features which are originally from the early layer ofResNet are not well-generalized because when they arecombined with FPN the accuracy has an improvement about2ndash3 When YOLO switches from Darknet-19 to Darknet-53 it really boosts the accuracy (e highest accuracy belongsto the Darknet-19 with the resolution of 1024times1024 whichjust gets 2402 However YOLO 608times 608 with Darknet-53gets 331 (e explanation for this reason is that YOLOv3with Darknet-53 has several improvements from Darknet-19YOLOv3 has 3 location of scales to predict objects especiallyone specialized in small objects instead of only one likeDarknet-19 and it is also integrated cutting-edge advantagessuch as residual blocks and shortcut connections (e re-duction in accuracy happens again with YOLO whenswitching from ResNet-101 to ResNet-152 about 1ndash2 Inthese methods YOLO and SSD are the only ones which allowmultiple input sizes(e higher the resolution of input imagesare the higher accuracy the method receives (e reason isthat a higher resolution image allows more pixels to describethe visual information for small objects However if theresolution is far from the original size of images it results in adecrease in accuracy For example YOLO 1024times1024 withDarknet-19 gets a lower accuracy than the resolution of800times 800 In addition we have tried to increase in resolutionof Darknet-53 from 608 to 1024 and themAP decreases whenthe resolution is over 608times 608 (erefore the effect of imagesize is clear for models like SSD and YOLO Generally allcomparative results of mAP on this dataset have the domi-nation of classes very great in numbers and this is caused bythe imbalance data between the number of images and in-stances in these images For example according to the sta-tistics in [13] mouse is a major class significantly contributingto mAP in Table 3 with the highest number of instances andimages as well However tissue has least contribution with thelowest AP originally affected by the number of data

Furthermore, the imbalanced data lead the models to tend toward detecting the frequent objects, implying that the models will mistake objects with a nearly similar appearance to the dominant class for objects of interest rather than finding the less frequent objects. As a result, false positives increase because of these problems. Figure 4 illustrates the detections with the strongest backbones. Following this visualization, the domination of classes such as mouse or faucet results in misdetections on areas which have a similar appearance to them. This misunderstanding is more pronounced for the weaker backbones in the comparison, and a one-stage method like YOLO, which primarily aims at speed, has more misdetections than the two-stage methods. A reason for these problems is the difference in the way the deep networks are trained [33]: one-stage methods such as YOLO use a soft sampling method that uses the whole dataset to update the parameters, rather than choosing only certain samples from the training data, whereas two-stage methods such as the RCNN family tend to employ hard sampling methods that randomly sample a fixed number of positive and negative bounding boxes to train the network, as sketched below.
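As a rough illustration of the hard sampling scheme just described, the sketch below mimics an RCNN-family minibatch sampler. The 512-RoI batch, 25% positive fraction, and 0.5 IoU threshold are common defaults assumed here for illustration, not values reported by this study:

```python
import numpy as np

def sample_rois(max_ious, batch=512, pos_fraction=0.25, pos_thresh=0.5,
                rng=np.random.default_rng(0)):
    """Hard sampling as in the RCNN family: keep a fixed number of
    randomly chosen positive and negative RoIs per image instead of
    training on every candidate box.
    max_ious: per-RoI maximum IoU with any ground-truth box."""
    pos = np.flatnonzero(max_ious >= pos_thresh)
    neg = np.flatnonzero(max_ious < pos_thresh)
    n_pos = min(len(pos), int(batch * pos_fraction))
    n_neg = min(len(neg), batch - n_pos)
    keep = np.concatenate([rng.choice(pos, n_pos, replace=False),
                           rng.choice(neg, n_neg, replace=False)])
    return keep

fake_ious = np.random.default_rng(1).uniform(0, 1, size=2000)  # 2000 candidate RoIs
print(len(sample_rois(fake_ious)))  # 512 RoIs contribute to the loss; the rest are ignored
```

Capping the negatives this way keeps rare foreground classes from being drowned out during training, which is one reason the two-stage models produce fewer of the false positives discussed above.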

5.1.2. Subsets of PASCAL. With 4 subsets containing objects at 4 different scales, we want to find out how much the scales impact the models. The whole results are shown in Table 4. We separate the results into 2 groups, the one-stage and two-stage approaches, and Figure 5 visualizes the strongest backbone of each method on the subsets.

In the case of different scales, as in our subsets, there is a difference between the one-stage and two-stage approaches. Here, methods from the one-stage approach perform better than two-stage ones at most scales, which is the opposite of the small object dataset. Specifically, the two-stage methods are clearly better than the one-stage ones running at real-time input sizes, and only a bit better, by about 10–20%, than the non-real-time configurations on VOC_WH20, with the same result for the smaller objects in VOC_MRA_0058 and VOC_MRA_010. However, for the bigger objects in VOC_MRA_020, the one-stage methods achieve significantly better outcomes than the two-stage ones. In addition, only Faster RCNN performs well in most cases compared with the one-stage methods; Fast RCNN is only good at the big objects in VOC_MRA_020 and fails to detect the smaller objects well.

In the one-stage approach, among the methods which allow multiple input sizes like YOLO and SSD, there are 2 kinds, namely ones that can run in real time and others that cannot, once the resolution is over 640 or 512 for YOLO and SSD, respectively. For the real-time ones, YOLO outperforms SSD at all scales of objects. Specifically, YOLOv2 with Darknet-19 is better than SSD by about 2–6% with objects in VOC_MRA_0058 and VOC_MRA_010 and by 4–15% for the larger objects in VOC_MRA_020 and VOC_WH_20. YOLOv3 with Darknet-53 gets results about 3–5% higher in comparison with YOLOv2; hence, YOLOv3 also gets higher results than SSD. However, if we consider non-real-time input sizes, SSD is greater than YOLO with objects in VOC_MRA_010. RetinaNet, the method in the one-stage approach that cannot run in real time, performs about the same as the non-real-time configurations of YOLO


and better than SSD. RetinaNet is more stable than SSD and YOLO when the scales are changed; the bigger the objects are, the greater the stability. For example, the change is large, about 33%, when the scale increases from objects in VOC_MRA_0058 to ones in VOC_MRA_010 and VOC_MRA_020, but the change is small, about 10%, for the bigger objects, in comparison with 15–25% for YOLO. In the case of YOLO, this remarkable increase in accuracy when objects are larger is obviously good for a model. The change in SSD resembles the change in RetinaNet.

Concerning resolutions in YOLO and SSD, we see that increasing the image resolution generally pushes the accuracy up. YOLOv2 with Darknet-19, YOLOv3 with Darknet-53, and SSD all gain accuracy at larger resolutions, except for YOLOv2

with objects belonging to VOC_MRA_010 and VOC_MRA_020 when the image size is over 800. In addition, YOLOv2 fluctuates on the objects in VOC_WH20. As mentioned in our previous work, YOLO is better than SSD on objects occupying less than 10% of the image; in this case, however, YOLOv3 is good at all scales of objects. This is because YOLOv3 has 3 detection locations coming with more ratios of default boxes, and combining the results from the 3 locations leads to a significant outcome, as the sketch below illustrates.
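To make the three detection locations concrete, the sketch below computes the prediction grids YOLOv3 produces at strides 32, 16, and 8 for a given input size (three anchors per cell is the standard YOLOv3 configuration and is assumed here):

```python
def yolov3_grids(input_size=608, strides=(32, 16, 8), anchors_per_cell=3):
    """Grid size and number of box predictions at each YOLOv3 scale;
    the stride-8 grid is the one specialized for small objects."""
    return [(f"{input_size // s}x{input_size // s}",
             (input_size // s) ** 2 * anchors_per_cell) for s in strides]

print(yolov3_grids(608))
# [('19x19', 1083), ('38x38', 4332), ('76x76', 17328)]
```

At 608×608, over 17,000 of the roughly 22,000 predicted boxes come from the stride-8 grid, which is why adding that scale helps small objects so much.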

When we switch to the two-stage approaches, Faster RCNN shows a significant improvement at most scales over Fast RCNN, except for the objects in VOC_MRA_020, where the two have the same accuracy. This shows that when objects are completely separated into different scales, RoI pooling does not work well with the smaller objects or with the ones in VOC_WH20. In

Table 3: Comparative results on small object dataset.

Method | Backbone | Clock | Faucet | Jar | Mouse | Outlet | Plate | Switch | Tel | t box | t paper | mAP
YOLO 416 [16] | Darknet-19 | 22.8 | 30.8 | 0.4 | 52.0 | 20.4 | 13.1 | 13.0 | 6.1 | 0 | 35.3 | 19.39
YOLO 448 [16] | Darknet-19 | 23.0 | 36.9 | 0.9 | 52.5 | 18.4 | 13.6 | 17.5 | 4.2 | 0 | 34.3 | 20.13
YOLO 480 [16] | Darknet-19 | 34.2 | 37.3 | 9.1 | 53.3 | 21.4 | 13.6 | 15.8 | 9.1 | 9.1 | 34.2 | 23.71
YOLO 512 [16] | Darknet-19 | 23.1 | 36.6 | 6.1 | 59.8 | 24.6 | 14.2 | 15.7 | 9.1 | 4.5 | 32.4 | 22.61
YOLO 554 [16] | Darknet-19 | 23.4 | 37.2 | 9.1 | 60.1 | 27.2 | 13.4 | 19.9 | 9.1 | 4.5 | 34.5 | 23.84
YOLO 640 [16] | Darknet-19 | 20.2 | 36.2 | 3.2 | 59.8 | 27.8 | 11.7 | 18.1 | 8.2 | 4.5 | 35.6 | 22.53
YOLO 800 [16] | Darknet-19 | 27.6 | 36.0 | 2.3 | 60.2 | 32.8 | 13.1 | 23.3 | 9.1 | 9.1 | 26.7 | 24.02
YOLO 1024 [16] | Darknet-19 | 21.7 | 29.3 | 1.4 | 58.3 | 26.4 | 11.8 | 17.5 | 9.1 | 9.1 | 15.7 | 20.03
YOLO 320 | Darknet-53 | 26.22 | 38.38 | 4.55 | 56.46 | 36.42 | 13.34 | 24.80 | 10.65 | 4.55 | 42.96 | 25.83
YOLO 416 | Darknet-53 | 28.47 | 47.15 | 10.83 | 60.49 | 43.15 | 15.87 | 30.73 | 15.15 | 2.62 | 48.30 | 30.28
YOLO 608 | Darknet-53 | 29.98 | 47.89 | 10.76 | 65.88 | 48.02 | 18.09 | 31.22 | 14.62 | 17.99 | 46.56 | 33.10
YOLO 320 | ResNet-50 | 19.57 | 25.73 | 0.67 | 45.17 | 14.37 | 9.38 | 13.84 | 9.09 | 9.09 | 23.70 | 17.06
YOLO 416 | ResNet-50 | 23.78 | 36.65 | 0.40 | 54.23 | 18.37 | 13.75 | 19.78 | 9.84 | 9.42 | 35.68 | 22.19
YOLO 608 | ResNet-50 | 26.92 | 40.65 | 1.77 | 61.86 | 29.18 | 15.04 | 20.24 | 10.09 | 13.29 | 36.01 | 25.50
YOLO 320 | ResNet-101 | 20.52 | 27.90 | 0.57 | 44.68 | 16.98 | 13.05 | 13.66 | 9.66 | 9.09 | 24.36 | 18.05
YOLO 416 | ResNet-101 | 25.72 | 35.60 | 3.03 | 55.73 | 22.40 | 15.61 | 17.26 | 9.32 | 3.03 | 38.71 | 22.64
YOLO 608 | ResNet-101 | 28.79 | 44.59 | 9.42 | 62.18 | 33.34 | 15.53 | 23.88 | 13.24 | 15.83 | 39.17 | 28.60
YOLO 320 | ResNet-152 | 21.64 | 27.56 | 3.03 | 48.06 | 17.39 | 11.12 | 14.51 | 9.09 | 4.55 | 31.88 | 18.88
YOLO 416 | ResNet-152 | 25.70 | 36.54 | 0.89 | 53.81 | 20.60 | 14.13 | 20.21 | 11.49 | 0.29 | 33.06 | 21.67
YOLO 608 | ResNet-152 | 26.01 | 44.54 | 4.55 | 61.00 | 31.76 | 13.02 | 22.67 | 12.35 | 9.93 | 39.99 | 26.58
SSD300 [16] | ResNet-101 | 5.5 | 9.1 | 0 | 25.5 | 6.1 | 4.5 | 0 | 4.5 | 9.1 | 18.2 | 8.25
SSD300 [16] | VGG16 | 9.1 | 17.1 | 0 | 26.1 | 9.1 | 9.1 | 0 | 4.5 | 0 | 16.7 | 9.16
SSD512 [16] | VGG16 | 9.1 | 17.1 | 0 | 43.0 | 9.1 | 9.1 | 9.1 | 9.1 | 0 | 7.6 | 11.32
RetinaNet | ResNet-50-FPN | 30.7 | 49.3 | 2.0 | 65.5 | 21.3 | 16.1 | 8.5 | 12.9 | 1.0 | 25.7 | 23.3
RetinaNet | ResNet-101-FPN | 30.6 | 48.7 | 7.1 | 64.7 | 20.0 | 15.9 | 11.8 | 10.7 | 2.9 | 38.7 | 25.1
RetinaNet | ResNeXT-101-32×8d-FPN | 35.5 | 55.0 | 12.1 | 66.5 | 23.9 | 18.4 | 9.8 | 16.2 | 9.4 | 53.7 | 30.0
RetinaNet | ResNeXT-101-64×4d-FPN | 31.4 | 50.2 | 8.9 | 66.3 | 20.8 | 15.3 | 9.4 | 14.0 | 2.2 | 32.4 | 25.1
R-CNN [13] | RPN prop. + VGG16 | 31.9 | 31.3 | 4.2 | 56.8 | 31.1 | 9.3 | 14.2 | 16.4 | 23.4 | 29.4 | 24.8
R-CNN [13] | Alexnet 7×, 300 prop. | 32.4 | 27.2 | 5.1 | 56.9 | 28.0 | 9.8 | 13.6 | 12.4 | 17.9 | 35.6 | 23.9
R-CNN [13] | VGG16 7×, 300 prop. | 37.3 | 30.3 | 7.2 | 60.6 | 41.5 | 15.8 | 21.5 | 13.7 | 22.0 | 33.3 | 28.4
R-CNN [13] | ContextNet (Alexnet 7×) | 32.7 | 26.8 | 4.6 | 56.4 | 26.3 | 9.9 | 12.9 | 12.2 | 18.7 | 34.0 | 23.5
Fast RCNN | ResNet-50-C4 | 32.4 | 46.3 | 6.5 | 65.8 | 38.3 | 20.1 | 25.3 | 16.6 | 14.1 | 52.0 | 31.7
Fast RCNN | ResNet-50-FPN | 37.4 | 47.3 | 7.3 | 68.9 | 46.7 | 21.0 | 32.1 | 17.1 | 9.3 | 45.9 | 33.3
Fast RCNN | ResNet-101-FPN | 39.3 | 50.3 | 10.6 | 68.3 | 47.1 | 20.4 | 33.3 | 18.6 | 15.4 | 51.4 | 35.5
Fast RCNN | ResNeXT-101-32×8d-FPN | 47.5 | 54.8 | 10.3 | 71.8 | 54.0 | 21.4 | 34.4 | 21.7 | 17.7 | 53.5 | 38.7
Fast RCNN | ResNeXT-101-64×4d-FPN | 45.4 | 55.7 | 10.9 | 72.5 | 53.3 | 24.0 | 36.9 | 22.9 | 16.0 | 58.1 | 39.6
Faster R-CNN [16] | VGG16 | 23.76 | 37.65 | 8.03 | 54.00 | 16.16 | 11.88 | 15.12 | 9.10 | 6.25 | 37.29 | 21.92
Faster RCNN | ResNet-50-C4 | 32.2 | 44.6 | 6.6 | 65.9 | 35.2 | 17.5 | 25.7 | 19.6 | 13.7 | 40.0 | 30.1
Faster RCNN | ResNet-50-FPN | 35.7 | 49.9 | 7.3 | 68.4 | 48.9 | 18.8 | 29.6 | 14.7 | 11.4 | 53.3 | 33.8
Faster RCNN | ResNet-101-FPN | 39.8 | 49.2 | 4.9 | 68.2 | 47.0 | 18.5 | 29.7 | 14.0 | 12.9 | 52.2 | 33.7
Faster RCNN | ResNeXT-101-32×8d-FPN | 49.8 | 56.6 | 11.4 | 72.1 | 56.3 | 23.2 | 37.0 | 20.8 | 18.8 | 58.7 | 40.5
Faster RCNN | ResNeXT-101-64×4d-FPN | 49.6 | 58.6 | 12.2 | 72.5 | 54.5 | 23.2 | 36.9 | 20.8 | 20.1 | 63.1 | 41.2
The values in bold in the original table represent the best in one-stage methods, and the ones in italics represent the highest in two-stage methods.


addition, if we compare it with the one-stage methods, it is significantly lower than them. However, RoI align together with RPN performs well when the scales are changed. When it comes to the backbones, there is a small decrease in accuracy when changing from ResNet-50-FPN to ResNet-101-FPN or from ResNeXT-101-32×8d-FPN to ResNeXT-101-64×4d-FPN with objects from all scales, for both Faster RCNN and Fast RCNN. The VGG16 backbone has an impressive outcome next to strong backbones such as ResNet or ResNeXT: although its accuracy is lower than the two strong backbones, VGG16 is still better with objects in VOC_WH20 and changes little in accuracy when moving to objects of big sizes.
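The weakness of RoI pooling on the smaller objects noted above stems largely from coordinate quantization, which RoIAlign avoids by sampling at fractional positions with bilinear interpolation. A minimal numeric sketch (the stride-16 feature map and the box coordinates are illustrative assumptions):

```python
import math

def roi_pool_snap(x1, x2, stride=16):
    """RoI pooling snaps box edges to the feature-map grid
    (floor/ceil at the feature stride); RoIAlign instead keeps the
    fractional coordinates x1/stride and x2/stride and bilinearly
    interpolates the feature values there."""
    return math.floor(x1 / stride) * stride, math.ceil(x2 / stride) * stride

print(roi_pool_snap(23, 43))  # (16, 48): a 20-px-wide box becomes 32 px wide
```

For a 20-pixel object the snapped box is 60% too wide, while for a 200-pixel object the same rounding of up to 16 px per edge is negligible, which matches the scale-dependent behavior seen in Table 4.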

5.2. Time Processing and Resource Consumption. Tables 5 and 6 show the performance comparison of the evaluated models with their base networks. Generally, we see that RAM consumption in testing and training increases as more layers are added. This means that the deeper the network is, the greater the processing requirement, because the number of parameters and the time to process data increase as well. YOLO is the model consuming the least memory in both the training and testing phases. Particularly, YOLO takes only 4 GB to 5 GB for training and 1.6 GB to 1.8 GB for testing with Darknet-53, and it is the only one able to run in real time: YOLO needs only about 0.03 s to 0.04 s to process an image, in comparison to more than 0.1 s and 0.2 s

with Faster RCNN and RetinaNet. This allows us to deploy these models on devices with modest memory. While RetinaNet is assigned to the one-stage approach, it is not fast enough for real-time detection. The inference time of Fast RCNN is a little lower than that of Faster RCNN and RetinaNet; in contrast, the RAM consumption in training and testing of RetinaNet is lower than that of Fast RCNN and Faster RCNN. Of all the architectures, ResNet-50-C4 is the one requiring the most memory and time to process data, because the output size of ResNet-50-C4 is a bit bigger than the others [9]. If we consider ResNet or ResNeXT combined with FPN, Faster RCNN is over 100 MB above Fast RCNN and 300 MB above RetinaNet. In addition, according to Table 2, Faster RCNN and RetinaNet need less time for training, only a few hours to 1 day, rather than the 3–4 days of YOLO. This demonstrates that if we pay attention to performance and do not have much time for training, we should choose Faster RCNN or RetinaNet instead of YOLO; in contrast, if we focus on processing speed and still want good performance, the one-stage methods are always the good choice. For the same backbones, RetinaNet uses fewer resources at testing time than Fast RCNN and Faster RCNN, by about 100 MB and 300 MB, respectively. However, in training, RetinaNet uses much more memory, about 2.8 GB more than Fast RCNN and about 2.3 GB more than Faster RCNN, for ResNeXT-101-32×8d-FPN and ResNeXT-101-64×4d-FPN. If we consider this on

Figure 4: The location of the default boxes in different scales: (a) image with GT boxes; (b) 8×8 feature map; (c) 4×4 feature map. Each default box predicts location offsets loc: Δ(cx, cy, w, h) and class confidences conf: (c1, c2, ..., cp).


Table 4: The comparative results on subsets of PASCAL VOC 2007.

Approach | Method | VOC_MRA_0058 | VOC_MRA_010 | VOC_MRA_020 | VOC_WH20
One stage | YOLOv2 416 [16] | 3.02 | 31.38 | 42.89 | 18.52
One stage | YOLOv2 448 [16] | 4.47 | 32.90 | 60.15 | 21.96
One stage | YOLOv2 480 [16] | 4.26 | 33.48 | 60.78 | 26.67
One stage | YOLOv2 512 [16] | 5.42 | 35.74 | 61.12 | 24.63
One stage | YOLOv2 544 [16] | 6.97 | 36.56 | 63.00 | 26.62
One stage | YOLOv2 640 [16] | 7.70 | 37.97 | 61.29 | 23.41
One stage | YOLOv2 800 [16] | 10.24 | 37.30 | 61.91 | 26.90
One stage | YOLOv2 1024 [16] | 10.69 | 29.93 | 55.14 | 28.97
One stage | YOLOv3 320 | 7.18 | 34.58 | 60.36 | 20.40
One stage | YOLOv3 416 | 10.20 | 38.97 | 62.53 | 24.12
One stage | YOLOv3 608 | 11.70 | 42.65 | 68.56 | 28.86
One stage | SSD 300 [16] | 1.71 | 32.76 | 46.26 | 16.91
One stage | SSD 512 [16] | 2.90 | 43.46 | 57.11 | 19.87
One stage | RetinaNet-ResNet-50-FPN | 8.84 | 41.5 | 50.2 | 28.14
One stage | RetinaNet-ResNet-101-FPN | 8.95 | 42.5 | 51.9 | 27.46
One stage | RetinaNet-ResNeXT-101-32×8d-FPN | 10.29 | 45.4 | 54.5 | 30.08
One stage | RetinaNet-ResNeXT-101-64×4d-FPN | 10.71 | 45.5 | 55.1 | 31.32
Two stage | Fast RCNN-ResNet-50-C4 | 0.23 | 13.2 | 49.9 | 3.93
Two stage | Fast RCNN-ResNet-50-FPN | 0.63 | 13.5 | 55.6 | 3.45
Two stage | Fast RCNN-ResNet-101-FPN | 0.39 | 15.9 | 57.6 | 3.12
Two stage | Fast RCNN-ResNeXT-101-32×8d-FPN | 0.51 | 14.4 | 57.9 | 3.33
Two stage | Fast RCNN-ResNeXT-101-64×4d-FPN | 0.29 | 14.2 | 57.3 | 3.76
Two stage | Faster RCNN-ResNet-50-C4 | 6.98 | 39.9 | 48.7 | 26.04
Two stage | Faster RCNN-ResNet-50-FPN | 10.74 | 45.6 | 56.3 | 29.79
Two stage | Faster RCNN-ResNet-101-FPN | 10.63 | 46.9 | 57.6 | 30.57
Two stage | Faster RCNN-ResNeXT-101-32×8d-FPN | 11.64 | 47.3 | 57.6 | 32.12
Two stage | Faster RCNN-ResNeXT-101-64×4d-FPN | 10.54 | 47.1 | 56.9 | 31.64
Two stage | Faster RCNN-VGG16 [16] | 5.73 | 35.58 | 44.14 | 41.11
This table illustrates how well models adapt to different scales of objects. The values in bold in the original table represent the best in one-stage methods, and the ones in italics represent the highest in two-stage methods.



the small object dataset, it does not help much, because RetinaNet is about 10% lower than Faster RCNN in performance. Otherwise, on the different scales of the subsets, RetinaNet works well in comparison with Faster RCNN, and

the difference is just 2–4 percentage points. Although the ResNet backbones combined with the other methods yield an improvement in accuracy, they do not work for YOLO on the small object dataset. YOLO with Darknet-53 utilizes more resources

Table 5: The comparison of consumption on the small object dataset.

Model | Backbone | Inference time (s) | Test RAM (MiB) | Train RAM (MiB)
YOLOv3 | Darknet-53 | 0.0331 | 1825 | 4759
YOLOv3 | ResNet-50 | 0.027 | 1285 | 3479
YOLOv3 | ResNet-101 | 0.0356 | 1829 | 5383
YOLOv3 | ResNet-152 | 0.0454 | 2443 | 7531
RetinaNet | ResNet-50-FPN | 0.102 | 2075 | 4435
RetinaNet | ResNet-101-FPN | 0.127 | 2723 | 5577
RetinaNet | ResNeXT-101-32×8d-FPN | 0.229 | 3767 | 7863
RetinaNet | ResNeXT-101-64×4d-FPN | 0.292 | 3719 | 7813
Fast RCNN | ResNet-50-C4 | 0.3 | 6449 | 5877
Fast RCNN | ResNet-50-FPN | 0.089 | 2277 | 4455
Fast RCNN | ResNet-101-FPN | 0.113 | 2947 | 5627
Fast RCNN | ResNeXT-101-32×8d-FPN | 0.212 | 3987 | 4961
Fast RCNN | ResNeXT-101-64×4d-FPN | 0.269 | 3885 | 4799
Faster RCNN | ResNet-50-C4 | 0.412 | 6609 | 6129
Faster RCNN | ResNet-50-FPN | 0.101 | 2387 | 5381
Faster RCNN | ResNet-101-FPN | 0.124 | 3001 | 6487
Faster RCNN | ResNeXT-101-32×8d-FPN | 0.256 | 4027 | 5333
Faster RCNN | ResNeXT-101-64×4d-FPN | 0.286 | 4003 | 5246

Table 6: The comparison of consumption on subsets filtered from PASCAL VOC.

Model | Backbone | Inference time (s) | Test RAM (MiB) | Train RAM (MiB)
YOLOv3 | Darknet-53 | 0.027 | 1645 | 4079
RetinaNet | ResNet-50-FPN | 0.1 | 1935 | 4133
RetinaNet | ResNet-101-FPN | 0.116 | 2585 | 5435
RetinaNet | ResNeXT-101-32×8d-FPN | 0.222 | 3641 | 7723
RetinaNet | ResNeXT-101-64×4d-FPN | 0.284 | 3561 | 7599
Fast RCNN | ResNet-50-C4 | 0.495 | 6371 | 5677
Fast RCNN | ResNet-50-FPN | 0.092 | 2131 | 4387
Fast RCNN | ResNet-101-FPN | 0.114 | 2819 | 5463
Fast RCNN | ResNeXT-101-32×8d-FPN | 0.213 | 3873 | 4637
Fast RCNN | ResNeXT-101-64×4d-FPN | 0.265 | 3735 | 4575
Faster RCNN | ResNet-50-C4 | 0.26 | 6141 | 5991
Faster RCNN | ResNet-50-FPN | 0.1 | 2245 | 5207
Faster RCNN | ResNet-101-FPN | 0.13 | 2855 | 6335
Faster RCNN | ResNeXT-101-32×8d-FPN | 0.225 | 3943 | 5087
Faster RCNN | ResNeXT-101-64×4d-FPN | 0.276 | 3885 | 4909


Figure 5: Highlight of bounding boxes from comparative backbones on the small object dataset. We select YOLO with Darknet-53 and ResNet-50 for an objective comparison because their networks obviously have the same number of layers along with the same significant techniques, such as skip connections and residual blocks. The bounding boxes show that ResNet-50 is more sensitive than Darknet-53 to areas which resemble the objects of interest. Similarly, ResNet-50-FPN and ResNet-50-C4 are chosen for consideration. The detections show that combining ResNet-50 with FPN yields a better performance than the original one; particularly, misdetections occur more densely with ResNet-50-C4 than with ResNet-50-FPN, such as in columns 4 and 5. Zoom in to see more detail.


than the ResNet ones, but it has the best accuracy among the models. Therefore, we only test YOLO with Darknet-53 on the subsets of PASCAL.
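For reproducibility, the sketch below shows one way to measure per-image inference times of the kind reported in Tables 5 and 6. The exact timing protocol of our experiments is not specified here, so the warmup and synchronization details are assumptions:

```python
import time
import torch

def time_model(model, batch, warmup=10, iters=100):
    """Average per-image latency: run warmup passes first, and
    synchronize the GPU around the timed loop so that queued
    asynchronous kernels are actually counted."""
    sync = torch.cuda.synchronize if torch.cuda.is_available() else (lambda: None)
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            model(batch)
        sync()
        t0 = time.perf_counter()
        for _ in range(iters):
            model(batch)
        sync()
    return (time.perf_counter() - t0) / iters

# e.g. latency = time_model(detector, torch.randn(1, 3, 608, 608, device="cuda"))
```

Without the synchronization calls, CUDA's asynchronous execution can make a model appear far faster than it is, so timings from different frameworks should be compared under the same protocol.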

5.3. Analyses of the Trade-Off among Detectors. Network designs and approaches like the one-stage approach prove their performance when applied to detect general objects, at small scales as well as at other kinds of scales. Although they are fast and accurate, there is still a drawback that always exists in these models, namely the trade-off between accuracy and processing speed. For example, YOLOv3 proposes the idea of performing detection at three different scales, and the result is obviously impressive and yields good performance. However, to gain this advantage, YOLOv3 has to sacrifice processing time: instead of the input normally being processed once for detection, as in YOLOv2, this idea must work 3 times. This trade-off is also partly affected by resolution, as we change it during the training or testing of our models; in our previous work, we mentioned that we have to choose the right resolution to ensure our models work properly. In the case of the two-stage approaches, the idea of proposing region proposals to improve the localization of objects for detection is good as well. This is useful, but we have to take into account whether we should generate proposals on feature maps or directly on input images, because this strongly affects the way the models run and identify representations of objects. If objects are normal or have a big or medium appearance, the models work well, but if objects come at multiple scales, this is a problem to consider and research deeply in order to balance and improve the performance. To partly fix this problem, the one-stage approach allows us to choose a fixed input size for training and testing, but the benefit still depends on the characteristics of the datasets we evaluate and on the image size. After all, all the models we chose to evaluate are affected by the scales of objects: when we change the scale, the accuracy of the models changes a lot, except for Faster RCNN, the only model that seems to be stable with scale, especially when combined with the VGG16 architecture. Although the accuracy of VGG16 is not better than that of the other architectures, the difference is that it does not change too much in accuracy. This only holds for big objects whose bounding box overlaps more than 10% of the image; otherwise, it is not assured.

Figure 1 shows that the possibility of confusion is greater for small objects than for other objects: the black lens of the camera is somehow similar to the black mouse placed on a mouse pad. This possibility of small object presence causes more difficulties for detectors and leads to wrong detections. Small objects can appear anywhere in an image, with the result that detectors produce many wrong detections on familiar appearances which they have seen before. If we consider the visualization of the detections in Figure 4, the wrong detections are partly similar in appearance to the other objects in the dataset. This problem is caused by the data imbalance between classes and between the instances in each class, which is originally known as the foreground-foreground class imbalance. In

other words, the common problems, which happen not only with small objects but for whole datasets, are intraclass similarity and interclass variation.

6. Conclusion

Small object detection is a challenging and interesting problem in the task of object detection, and it has drawn attention from researchers thanks to the development of deep learning, which is the motivation for improving the performance of tasks in computer vision. Although deep detection models were originally designed to solve problems of general object detection, they still work to a particular degree for small object detection. As an evaluation work on small object detection for deep models, our goal is to highlight the remarkable achievements of popular and state-of-the-art deep models in order to provide a variety of views on applying deep models to small object detection. Particularly, we evaluated state-of-the-art detectors based on deep learning from the two approaches, namely YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, on two datasets, the small object dataset and the subsets filtered from PASCAL VOC, with respect to the effects of different factors, objectively including accuracy, execution time, and resource usage.

In spite of the successful achievements of recent years, in which the performance of detection has improved significantly, there is still a huge gap in accuracy between normal objects and small objects. Under the criteria of the COCO dataset, the difference from the small scale to the medium and big scales is very large. Most models are good at detecting normal objects, and problems arise when applying them to detect small objects. As a result, to reduce the gap in small object detection, the first thing to do is to invest in datasets which have massive amounts of data to train models and a wide range of categories, so as to compete with the human visual system [12, 34].
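For reference, the COCO criteria mentioned above bucket objects by pixel area, with thresholds of 32^2 and 96^2 pixels; a minimal sketch:

```python
def coco_size_bucket(box_w, box_h):
    """COCO's standard object-size buckets by pixel area."""
    area = box_w * box_h
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

print(coco_size_bucket(20, 25))   # 'small'  (500 px^2)
print(coco_size_bucket(120, 90))  # 'large'  (10800 px^2)
```

AP-small, AP-medium, and AP-large are then simply the AP computed over the ground-truth boxes in each bucket, which is how the small-versus-large gap is quantified.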

So far, detection models are divided into two main approaches, namely the one-stage approach and the two-stage approach. Models in the one-stage approach are known as detectors which have faster and more efficient detection in comparison with the other approach; the efficiency here is the potential to run in real time and thus to be applied in practical applications. However, the trade-off between accuracy and speed is a difficult challenge which needs to be taken into account in order to balance the gap. Models in the two-stage approach, by contrast, have their reputation as region-based detectors with high accuracy, but they are too slow to be applied in the real world; this drawback comes from the computational cost of their networks.

Through our evaluation, it is clear that the architectures utilized as base networks to extract deep features have significant effects on the frameworks. The deeper the architecture is, the higher the accuracy of detection is. Once a network increases in depth, it has more layers than normal ones and massive numbers of parameters to train; hence, it needs a lot of data to fine-tune these parameters reasonably. If computation increases, resource consumption will also increase. As a result, it


will be difficult to take them into practical applications. Besides, the exploitation of context in these models is definitely limited; this causes much useful and informative data to be ignored in training, especially the context of small objects. Because small objects are able to appear anywhere in an input image, if the image context is well exploited, the performance of small object detection will be improved further.

For all the above reasons, and according to our evaluation, if we aim at good performance and ignore processing speed, two-stage methods like Faster RCNN perform well and demonstrate their network design on different datasets with many contexts of objects, including multiscale objects. Therefore, Faster RCNN is considered a giant baseline to base new work on or to develop from. If our target is a balance of accuracy and speed, YOLO is a good choice, provided we do not care about the training time, because its trade between speed and accuracy is worth applying to practical applications; otherwise, Faster RCNN or RetinaNet is still a substitute to work with. When it comes to backbones, we have to study the data in order to choose a reasonable backbone to combine with the methods, because the amount of data significantly impacts the model: if the data are not abundant, a shallow network will fit them well. Besides, there are recently novel approaches promising for training deep models with less data, namely weakly supervised learning such as zero-shot, one-shot, or few-shot learning; these approaches will be considered in our future work. Following our recent survey, to obtain better performance on object detection, we have to consider several factors to improve the mAP, such as multiscale training, superresolution for scaling up the visual information of small objects [35], or preprocessing the data to avoid imbalance, because a wide range of imbalance problems relate to data [33].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant no. B2017-26-01.

References

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, Columbus, OH, USA, June 2014.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proceedings of the European Conference on Computer Vision, pp. 346–361, Springer, Zurich, Switzerland, September 2014.
[3] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, Santiago, Chile, December 2015.
[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, Las Vegas, NV, USA, June 2016.
[5] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," 2016, https://arxiv.org/abs/1612.08242.
[6] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.
[7] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, "Focal loss for dense object detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[8] K. Zidek, A. Hosovsky, J. Pitel', and S. Bednar, "Recognition of assembly parts by convolutional neural networks," in Advances in Manufacturing Engineering and Materials, pp. 281–289, Springer, Cham, Switzerland, 2019.
[9] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, IEEE, Venice, Italy, October 2017.
[10] L.-C. Chen, A. Hermans, G. Papandreou et al., "Instance segmentation by refining object detection with semantic and direction features," 2017, https://arxiv.org/abs/1712.04837.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[12] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft COCO: common objects in context," in Proceedings of the European Conference on Computer Vision, pp. 740–755, Springer, Zurich, Switzerland, September 2014.
[13] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, "R-CNN for small object detection," in Proceedings of the Asian Conference on Computer Vision, pp. 214–230, Springer, Taipei, Taiwan, November 2016.
[14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
[15] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," in Proceedings of the 28th International Conference on Neural Information Processing Systems, Ser. NIPS'15, pp. 91–99, MIT Press, Cambridge, MA, USA, 2015, http://dl.acm.org/citation.cfm?id=2969239.2969250.
[16] P. Pham, D. Nguyen, T. Do, T. D. Ngo, and D.-D. Le, "Evaluation of deep models for real-time small object detection," in Proceedings of the International Conference on Neural Information Processing, pp. 516–526, Springer, Guangzhou, China, November 2017.
[17] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.
[18] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu, "Traffic-sign detection and classification in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2110–2118, Las Vegas, NV, USA, June 2016.
[19] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: a large data set for nonparametric object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.
[20] A. Kembhavi, D. Harwood, and L. S. Davis, "Vehicle detection using partial least squares," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1250–1265, 2011.
[21] V. I. Morariu, E. Ahmed, V. Santhanam, D. Harwood, and L. S. Davis, "Composite discriminant factor analysis," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 564–571, IEEE, Steamboat Springs, CO, USA, March 2014.
[22] A. Andreas, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 2012.
[23] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: human trajectory prediction in crowded spaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971, Las Vegas, NV, USA, June 2016.
[24] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, "Sun database: exploring a large collection of scene categories," International Journal of Computer Vision, vol. 119, no. 1, pp. 3–22, 2016.
[25] E. Dong, Y. Zhu, Y. Ji, and S. Du, "An improved convolution neural network for object detection using YOLOv2," in Proceedings of the 2018 IEEE International Conference on Mechatronics and Automation (ICMA), pp. 1184–1188, IEEE, Changchun, China, August 2018.
[26] W. Liu, D. Anguelov, D. Erhan et al., "SSD: single shot multibox detector," in Proceedings of the European Conference on Computer Vision, pp. 21–37, Springer, Amsterdam, The Netherlands, October 2016.
[27] T.-Y. Lin, P. Dollar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, p. 4, Honolulu, HI, USA, July 2017.
[28] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: deconvolutional single shot detector," 2017, https://arxiv.org/abs/1701.06659.
[29] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision, pp. 210–226, Springer, Munich, Germany, September 2018.
[30] L. Liu, W. Ouyang, X. Wang et al., "Deep learning for generic object detection: a survey," 2018, https://arxiv.org/abs/1809.02165.
[31] P. Zhu, L. Wen, X. Bian, L. Haibin, and Q. Hu, "Vision meets drones: a challenge," 2018, https://arxiv.org/abs/1804.07437.
[32] Z.-Q. Zhao, P. Zheng, S.-t. Xu, and X. Wu, "Object detection with deep learning: a review," 2018, https://arxiv.org/abs/1807.05511.
[33] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas, "Imbalance problems in object detection: a review," 2019, https://arxiv.org/abs/1909.00169.
[34] O. Russakovsky, J. Deng, H. Su et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[35] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 206–221, Munich, Germany, September 2018.


Table 2 (e parameters of models

Method Momentum Decay Gamma Learning_rate Batch_size Training_days StepsizeYOLOv2 [16] 09 00005 0001 8 5 25000YOLOv3 09 00005 0001 32 3ndash4 25000SSD300 [16] 09 00005 01 0000004 12 9 40000 80000SSD512 [16] 09 00005 01 0000004 12 12 100000 120000RetinaNet 09 00005 01 0001 64 4-gt12 h 25000 35000Fast RCNN 09 00005 01 0001 64 4-gt12 h 25000 35000Faster RCNN 09 00005 01 0001 64 4-gt12 h 25000 35000

10 Journal of Electrical and Computer Engineering

stage methods if we only consider accuracy Overall there isan increase about 1ndash3 for changing the simple backbone tothe complex one in each type For example when switchingfrom original ResNet to ResNet-FPN the accuracy is boostedfrom 2 to 3(is is clear that leveraging the advantages frommultiscale features of FPN is a common way to improvedetection and tackle the scale imbalance of input images andbounding boxes of different objects Similarly we switchResNeXT-101-32times 8d-FPN to ResNeXT-101-64times 4d-FPNand the accuracy changes from 405 to 412 for FasterRCNN and from 387 to 396 for Fast RCNN Howeverwhen considering between ResNet-50-FPN and ResNet-101-FPN the growth only happens in Fast RCNN from 333 to355(ere is a little bit decrease 01 for Faster RCNN(isreduction also happens with RetinaNet while the simplerbackbone ResNeXT-101-32times 8d-FPN gets 30 and theResNeXT-101-64times 4d-FPN just gets 251 It means that thevery deeper backbones do not guarantee the increase inaccuracy and the reason is that an advantage of a deepernetwork needsmore parameters to learn It means the authorsmust have a large number of data to feed into the network totrain and update parameters itself but in this case the data ofsmall object dataset are not abundant too much to fit the verydeep network and hence increasing the chances of overfittingBesides features which are originally from the early layer ofResNet are not well-generalized because when they arecombined with FPN the accuracy has an improvement about2ndash3 When YOLO switches from Darknet-19 to Darknet-53 it really boosts the accuracy (e highest accuracy belongsto the Darknet-19 with the resolution of 1024times1024 whichjust gets 2402 However YOLO 608times 608 with Darknet-53gets 331 (e explanation for this reason is that YOLOv3with Darknet-53 has several improvements from Darknet-19YOLOv3 has 3 location of scales to predict objects especiallyone specialized in small objects instead of only one likeDarknet-19 and it is also integrated cutting-edge advantagessuch as residual blocks and shortcut connections (e re-duction in accuracy happens again with YOLO whenswitching from ResNet-101 to ResNet-152 about 1ndash2 Inthese methods YOLO and SSD are the only ones which allowmultiple input sizes(e higher the resolution of input imagesare the higher accuracy the method receives (e reason isthat a higher resolution image allows more pixels to describethe visual information for small objects However if theresolution is far from the original size of images it results in adecrease in accuracy For example YOLO 1024times1024 withDarknet-19 gets a lower accuracy than the resolution of800times 800 In addition we have tried to increase in resolutionof Darknet-53 from 608 to 1024 and themAP decreases whenthe resolution is over 608times 608 (erefore the effect of imagesize is clear for models like SSD and YOLO Generally allcomparative results of mAP on this dataset have the domi-nation of classes very great in numbers and this is caused bythe imbalance data between the number of images and in-stances in these images For example according to the sta-tistics in [13] mouse is a major class significantly contributingto mAP in Table 3 with the highest number of instances andimages as well However tissue has least contribution with thelowest AP originally affected by the number of data

Furthermore the imbalance data lead models tending todetect frequent objects implying that models will misun-derstand objects having a nearly similar appearance with thedomination class as the objects of interest rather than lessfrequent objects As a result false positives will increase bythese problems Figure 4 illustrates the detection withstrongest backbones Following this visualization the domina-tion of the classes such asmouse or faucet results inmisdetectionwith areas which have a same appearance to them (is mis-understanding has a tendency to weaker backbones in thecomparison and one-stage method like YOLO which primarilyheads to speed has more misdetection than two-stage methodsA reason that causes these problems are the difference in thewayof training deep networks [33] One-stage methods such asYOLO use a soft sampling method that uses a whole dataset toupdate parameters rather than only choosing samples fromtraining data However two-stage methods such as RCNNfamily tend to employ hard sampling methods that randomlysample a certain number of positive and negative boundingboxes to train its network

512 Subsets of PASCAL With 4 subsets of 4 different scalesof objects in images we want to find out howmuch the scalesimpact on the models (e whole results are shown in Ta-ble 4 We separate the results into 2 groups as the one-stageand two-stage approaches and Figure 5 is a visualization forthe strongest backbones in each method on subsets

In case of different scales like our subsets there is a differencebetween one-stage approaches and two-stage approaches In thiscase methods from the one-stage approach have a better per-formance than two-stage ones inmost of scales(is is really theopposite of small object dataset Specifically two-stage methodsare totally better than one-stage ones in case of real-time inputsand just better a bit than nonreal-time models in VOC_WH20about 10ndash20 and the same result with smaller objects inVOC_MRA_0058 and VOC_MRA_010 However in biggerobjects in VOC_MRA_020 methods in one-stage approacheshave significant outcomes rather than two-stage ones In ad-dition there is just Faster RCNN that has good performance inmost cases to compare to methods in one-stage ones FastRCNN is only good at big objects in VOC_MRA_020 and failsto have good detection in smaller objects

In the one-stage approach in methods which allowmultiple inputs like YOLO and SSD there are 2 kindsnamely ones that can run in real time and the others thatcannot if the resolution is over 640 or 512 for YOLO andSSD respectively For real-time ones YOLO outperformsSSD for all scales of objects Specifically YOLOv2 withDarknet-19 is better than SSD 26 with objects inVOC_MRA_0058 and VOC_MRA_010 and 4ndash15 forlarger objects in VOC_MRA_020 and VOC_WH_20YOLOv3 with Darknet-53 gets higher results about 3ndash5 incomparison with YOLOv2 hence YOLOv3 also gets higherresults compared to SSD However if we consider nonreal-time input images SSD is greater than YOLO with objects inVOC_MRA_010 However RetinaNet which is the one thatcannot run in real time in the one-stage approach performsthe same results compared to ones in nonreal time in YOLO

Journal of Electrical and Computer Engineering 11

and better than SSD RetinaNet is more stable than SSD andYOLO when the scales are changed (e bigger the objectsare the more the stability is For example the change is somuch about 33 when the scale increases from objects inVOC_MRA_0058 to ones in VOC_MRA_010 andVOC_MRA_020 However this change is not much about10 with bigger objects in comparison with YOLO 15ndash25In case of YOLO this remarkable increase in accuracy whenobjects are larger is obviously good for a model (e changein SSD resembles the change in RetinaNet

Concerning resolutions in YOLO and SSD we see thatwhen image resolution is increased they push the accuracyto improve in general For YOLOv2 with Darknet-19 andYOLOv3 with Darknet-53 and SSD they all have an increasein accuracy when the resolution is large except for YOLOv2

with objects belonging to VOC_MRA_010 andVOC_MRA_020 when the image is over 800 In additionYOLOv2 has a fluctuation with those objects inVOC_WH20 As mentioned in our previous work YOLO isbetter than SSD in those objects less than 10 of the imageshowever in this case YOLOv3 is good at all scales of objects(is is because YOLOv3 has 3 detection locations comingwith more ratios of default boxes and it leads to a significantoutcome when combining results from 3 locations

When we switch to the two-stage approaches FasterRCNN has a significant improvement in most scales ratherthan Fast RCNN except for objects in VOC_MRA_020 whichhave the same accuracy (is shows that if objects are com-pletely separated into different scales the RoI pooling does notwork well with smaller objects and ones in VOC_WH20 In

Table 3 Comparative results on small object dataset

Method Backbone Clock Faucet Jar Mouse Outlet Plate Switch Tel t box t paper mAPYOLO 416 [16]

Darknet-19

228 308 4 52 204 131 13 61 0 353 1939YOLO 448 [16] 23 369 9 525 184 136 175 42 0 343 2013YOLO 480 [16] 342 373 91 533 214 136 158 91 91 342 2371YOLO 512 [16] 231 366 61 598 246 142 157 91 45 324 2261YOLO 554 [16] 234 372 91 601 272 134 199 91 45 345 2384YOLO 640 [16] 202 362 32 598 278 117 181 82 45 356 2253YOLO 800 [16] 276 36 23 602 328 131 233 91 91 267 2402YOLO 1024 [16] 217 293 14 583 264 118 175 91 91 157 2003YOLO 320

Darknet-532622 3838 455 5646 3642 1334 248 1065 455 4296 2583

YOLO 416 2847 4715 1083 6049 4315 1587 3073 1515 262 483 3028YOLO 608 2998 4789 1076 6588 4802 1809 3122 1462 1799 4656 331YOLO 320

ResNet-501957 2573 067 4517 1437 938 1384 909 909 237 1706

YOLO 416 2378 3665 04 5423 1837 1375 1978 984 942 3568 2219YOLO 608 2692 4065 177 6186 2918 1504 2024 1009 1329 3601 255YOLO 320

ResNet-1012052 279 057 4468 1698 1305 1366 966 909 2436 1805

YOLO 416 2572 356 303 5573 224 1561 1726 932 303 3871 2264YOLO 608 2879 4459 942 6218 3334 1553 2388 1324 1583 3917 286YOLO 320

ResNet-1522164 2756 303 4806 1739 1112 1451 909 455 3188 1888

YOLO 416 257 3654 089 5381 206 1413 2021 1149 029 3306 2167YOLO 608 2601 4454 455 61 3176 1302 2267 1235 993 3999 2658SSD300 [16] ResNet-101 55 91 0 255 61 45 0 45 91 182 825SSD300 [16] VGG16 91 171 0 261 91 91 0 45 0 167 916SSD512 [16] VGG16 91 171 0 43 91 91 91 91 0 76 1132RetinaNet ResNet-50-FPN 307 493 2 655 213 161 85 129 1 257 233RetinaNet ResNet-101-FPN 306 487 71 647 20 159 118 107 29 387 251RetinaNet ResNeXT-101-32times 8d-FPN 355 55 121 665 239 184 98 162 94 537 30RetinaNet ResNeXT-101-64times 4d-FPN 314 502 89 663 208 153 94 14 22 324 251R-CNN [13] RPN prop +VGG16 319 313 42 568 311 93 142 164 234 294 248R-CNN [13] Alexnet 7times 300 pro 324 272 51 569 28 98 136 124 179 356 239R-CNN [13] VGG16 7times 300 pro 373 303 72 606 415 158 215 137 22 333 284R-CNN [13] ContextNet (Alexnet 7times) 327 268 46 564 263 99 129 122 187 34 235Fast RCNN ResNet-50-C4 324 463 65 658 383 201 253 166 141 52 317Fast RCNN ResNet-50-FPN 374 473 73 689 467 21 321 171 93 459 333Fast RCNN ResNet-101-FPN 393 503 106 683 471 204 333 186 154 514 355Fast RCNN ResNeXT-101-32times 8d-FPN 475 548 103 718 54 214 344 217 177 535 387Fast RCNN ResNeXT-101-64times 4d-FPN 454 557 109 725 533 24 369 229 16 581 396Faster R-CNN [16] VGG16 2376 3765 803 54 1616 1188 1512 91 625 3729 2192Faster RCNN ResNet-50-C4 322 446 66 659 352 175 257 196 137 40 301Faster RCNN ResNet-50-FPN 357 499 73 684 489 188 296 147 114 533 338Faster RCNN ResNet-101-FPN 398 492 49 682 47 185 297 14 129 522 337Faster RCNN ResNeXT-101-32times 8d-FPN 498 566 114 721 563 232 37 208 188 587 405Faster RCNN ResNeXT-101-64times 4d-FPN 496 586 122 725 545 232 369 208 201 631 412(e values in bold represent the best in one-stage methods and the ones in italics represent the highest in two-stage methods

12 Journal of Electrical and Computer Engineering

addition if we compare with one-stage methods it is signif-icantly lower than them However RoI align along with RPN iswell performed when scales are changedWhen it comes to thebackbones there is a few decrease in accuracy when changingfrom ResNet-50-FPN to ResNet-101-FPN or from ResNeXT-101-32times 8d-FPN to ResNeXT-101-64times 4d-FPN with objectsfrom all scales for both Faster RCNN and Fast RCNN (eVGG16 backbone has an impressive outcome rather thanstrong backbones such as ResNet or ResNeXT Although theaccuracy is less than two strong backbones VGG16 is stillbetter with objects in VOC_WH20 and has a few change inaccuracy when changing objects with big sizes

52 Time Processing and Resource Consumption Tables 5and 6 show us the performance comparison of the eval-uated models with base networks that belong to the modelsGenerally we see that when RAM consumption in testingand training increases more layers are added (is meansthat if the network is more deeper the need of processingalso increases because this leads to the increase in pa-rameters and time to process data as well YOLO is themodel consuming the least memory in both two-phasetraining and testing Particularly YOLO is only from 4G to5G for training and from 16G to 18G for testing withDarknet-53 YOLO is the only one which is able to run inreal time YOLO just needs about 03ms to 04ms toprocess an image in comparison to more than 01 s and 02 s

with Faster RCNN and RetinaNet (is allows us to pick upthese models on devices which own the modest memoryWhile RetinaNet is assigned to the one-stage approach it isnot good enough to meet real-time detection(e inferencetime in Fast RCNN is lower a little bit than Faster RCNNand RetinaNet In contrast the RAM consumption intraining and testing of RetinaNet is lower than Fast RCNNand Faster RCNN Of all architectures the ResNet-50-C4 isthe one requiring the highest memory and time to processdata because the output size of ResNet-50-C4 is bigger a bitthan others [9] If we consider ResNet or ResNeXT com-bined with FPN Faster RCNN is over 100Mb compared toFast RCNN and 300Mb with RetinaNet In additionaccording to Table 2 the number of training days of FasterRCNN and RetinaNet need less time for training only a fewhours to 1 day rather than YOLO 3ndash4 days (is dem-onstrates that if we pay our attention to performance anddo not have much time for training we choose FasterRCNN or RetinaNet instead of YOLO one In contrast ifwe only focus on processing speed and still achieve goodperformance one-stage methods are always the good oneIn the same context of backbones RetinaNet uses a lowerresource than Fast RCNN and Faster RCNN about 100Mband 300Mb for Fast RCNN and Faster RCNN respectivelyin testing time However the training time of RetinaNetuses much memory more than Fast RCNN about 28 G andFaster RCNN about 23 G for ResNeXT-101-32times 8d-FPNand ResNeXT-101-64 times 4d-FPN If we consider this on

(a) (b)

loc ∆ (cx cy w h) conf (c1 c2 hellip cp)

(c)

Figure 4 (e location of the default boxes in different scales (a) image with GT boxes (b) 8times 8 feature map (c) 4times 4 feature map

Journal of Electrical and Computer Engineering 13

Table 4 (e comparative results on subsets of PASCAL VOC 2007

Approach Method VOC_MRA_0058 VOC_MRA_010 VOC_MRA_020 VOC_WH20

One stage

YOLOv2 416 [16] 302 3138 4289 1852YOLOv2 448 [16] 447 329 6015 2196YOLOv2 480 [16] 426 3348 6078 2667YOLOv2 512 [16] 542 3574 6112 2463YOLOv2 544 [16] 697 3656 63 2662YOLOv2 640 [16] 77 3797 6129 2341YOLOv2 800 [16] 1024 373 6191 269YOLOv2 1024 [16] 1069 2993 5514 2897

YOLOv3 320 718 3458 6036 204YOLOv3 416 102 3897 6253 2412YOLOv3 608 117 4265 6856 2886SSD 300 [16] 171 3276 4626 1691SSD 512 [16] 29 4346 5711 1987

RetinaNet-ResNet-50-FPN 884 415 502 2814RetinaNet-ResNet-101-FPN 895 425 519 2746

RetinaNet-ResNeXT-101-32times 8d-FPN 1029 454 545 3008RetinaNet-ResNeXT-101-64times 4d-FPN 1071 455 551 3132

Two stage

Fast RCNN-ResNet-50-C4 023 132 499 393Fast RCNN-ResNet-50-FPN 063 135 556 345Fast RCNN-ResNet-101-FPN 039 159 576 312

Fast RCNN-ResNeXT-101-32times 8d-FPN 051 144 579 333Fast RCNN-ResNeXT-101-64times 4d-FPN 029 142 573 376

Faster RCNN-ResNet-50-C4 698 399 487 2604Faster RCNN-ResNet-50-FPN 1074 456 563 2979Faster RCNN-ResNet-101-FPN 1063 469 576 3057

Faster RCNN-ResNeXT-101-32times 8d-FPN 1164 473 576 3212Faster RCNN-ResNeXT-101-64times 4d-FPN 1054 471 569 3164

Faster RCNN-VGG16 [16] 573 3558 4414 4111(is table illustrates how well models adapt to different scales of objects (e values in bold represent the best in one-stage methods and the ones in italicsrepresent the highest in two-stage methods


If we consider this on the small object dataset, it does not work as well, because RetinaNet is about 10% lower than Faster RCNN in performance. Otherwise, on the different scales of the subsets, RetinaNet works well compared to Faster RCNN, and the difference is just 2–4 percentage points. Although ResNet backbones combined with the other methods yield an improvement in accuracy, they do not work for YOLO on the small object dataset. YOLO with Darknet-53 utilizes more resources than the ResNet ones, but it has the best accuracy among the models. Therefore, we only test YOLO with Darknet-53 on the subsets of PASCAL VOC.

Table 5: The comparison of consumption on the small object dataset.

Model        Backbone                  Inference time (s)  Test RAM (MiB)  Train RAM (MiB)
YOLOv3       Darknet-53                0.0331              1825            4759
YOLOv3       ResNet-50                 0.027               1285            3479
YOLOv3       ResNet-101                0.0356              1829            5383
YOLOv3       ResNet-152                0.0454              2443            7531
RetinaNet    ResNet-50-FPN             0.102               2075            4435
RetinaNet    ResNet-101-FPN            0.127               2723            5577
RetinaNet    ResNeXT-101-32×8d-FPN     0.229               3767            7863
RetinaNet    ResNeXT-101-64×4d-FPN     0.292               3719            7813
Fast RCNN    ResNet-50-C4              0.3                 6449            5877
Fast RCNN    ResNet-50-FPN             0.089               2277            4455
Fast RCNN    ResNet-101-FPN            0.113               2947            5627
Fast RCNN    ResNeXT-101-32×8d-FPN     0.212               3987            4961
Fast RCNN    ResNeXT-101-64×4d-FPN     0.269               3885            4799
Faster RCNN  ResNet-50-C4              0.412               6609            6129
Faster RCNN  ResNet-50-FPN             0.101               2387            5381
Faster RCNN  ResNet-101-FPN            0.124               3001            6487
Faster RCNN  ResNeXT-101-32×8d-FPN     0.256               4027            5333
Faster RCNN  ResNeXT-101-64×4d-FPN     0.286               4003            5246

Table 6: The comparison of consumption on the subsets filtered from PASCAL VOC.

Model        Backbone                  Inference time (s)  Test RAM (MiB)  Train RAM (MiB)
YOLOv3       Darknet-53                0.027               1645            4079
RetinaNet    ResNet-50-FPN             0.1                 1935            4133
RetinaNet    ResNet-101-FPN            0.116               2585            5435
RetinaNet    ResNeXT-101-32×8d-FPN     0.222               3641            7723
RetinaNet    ResNeXT-101-64×4d-FPN     0.284               3561            7599
Fast RCNN    ResNet-50-C4              0.495               6371            5677
Fast RCNN    ResNet-50-FPN             0.092               2131            4387
Fast RCNN    ResNet-101-FPN            0.114               2819            5463
Fast RCNN    ResNeXT-101-32×8d-FPN     0.213               3873            4637
Fast RCNN    ResNeXT-101-64×4d-FPN     0.265               3735            4575
Faster RCNN  ResNet-50-C4              0.26                6141            5991
Faster RCNN  ResNet-50-FPN             0.1                 2245            5207
Faster RCNN  ResNet-101-FPN            0.13                2855            6335
Faster RCNN  ResNeXT-101-32×8d-FPN     0.225               3943            5087
Faster RCNN  ResNeXT-101-64×4d-FPN     0.276               3885            4909


Figure 5: Highlight of bounding boxes from comparative backbones on the small object dataset. Here we select YOLO with Darknet-53 and ResNet-50 for an objective comparison because they have the same number of layers in their networks along with the same significant techniques, such as skip connections and residual blocks. The bounding boxes show that ResNet-50 is more sensitive than Darknet-53 to areas which resemble the objects of interest. Similarly, ResNet-50-FPN and ResNet-50-C4 are chosen for consideration. The detections show that combining ResNet-50 with FPN yields a better performance than the original one; in particular, misdetection happens more densely with ResNet-50-C4 than with ResNet-50-FPN, such as in columns 4 and 5. Zoom in to see more detail.



5.3. Analyses of the Trade-Off among Detectors

Network designs in the one-stage approach have proved their performance when applied to detecting general objects, at both small scales and other kinds of scales. Although these models are fast and accurate, there is still a drawback that always exists in them, namely, the trade-off between accuracy and processing speed. For example, YOLOv3 proposes the idea of performing detection at three different scales, and this obviously yields an impressive result and good performance. However, to gain this advantage, YOLOv3 has to sacrifice processing time: instead of the model processing each input once for detection like YOLOv2, this idea must work three times. This trade-off is also partly affected by resolution, as we change it during training or testing our models. In our previous work, we mentioned that we have to choose the right resolution to ensure that our models work properly. In the case of the two-stage approaches, the idea of generating region proposals to improve the localization of objects for detection is good as well. This is useful, but we have to take into account whether we should generate proposals on feature maps or directly on input images, because this strongly affects the way models run and identify representations of objects. If objects are normal or have a big or medium appearance, the models work well; but if objects come at multiple scales, this is a problem to consider and research deeply in order to balance as well as improve the performance. Therefore, to partly fix this problem, the one-stage approach allows us to choose a fixed input size for training and testing, but the benefit still depends on the characteristics of the datasets which we evaluate and on the image size. After all, all models we chose to evaluate are affected by the scales of objects: when we change the scale, the accuracy of the models changes a lot, except for Faster RCNN, the only model that seems to be stable with respect to scale, especially when combined with the VGG16 architecture. Although the accuracy of VGG16 is not better than that of the other architectures, the difference is that it does not change too much in accuracy. This only holds for big objects whose bounding box overlaps more than 10% of the image; otherwise, it is not assured.
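To make this trade-off concrete, the hedged snippet below pairs three representative small object dataset mAP values from Table 3 with the corresponding inference times from Table 5 and reports throughput. The model selection is ours, for illustration only.

```python
# mAP (%) from Table 3 and per-image inference time (s) from Table 5.
models = {
    "YOLOv3 608 (Darknet-53)":             (33.1, 0.0331),
    "RetinaNet (ResNeXT-101-64x4d-FPN)":   (25.1, 0.292),
    "Faster RCNN (ResNeXT-101-64x4d-FPN)": (41.2, 0.286),
}

# Sort fastest-first: only YOLOv3 clears a real-time frame rate, while
# Faster RCNN trades roughly 9x slower inference for the highest mAP.
for name, (map_pct, sec) in sorted(models.items(), key=lambda kv: kv[1][1]):
    print(f"{name:38s} mAP {map_pct:5.1f}%  {1.0 / sec:6.1f} FPS")
```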

Figure 1 shows that the possibility of confusion is higher for small objects than for other objects. The black lens of the camera is somehow similar to the black mouse placed on a mouse pad. This possibility of small object presence causes more difficulties for detectors and leads to wrong detections. Small objects can appear anywhere in an image; as a result, detectors produce many wrong detections on familiar appearances which they have seen. If we consider the visualization of the detections in Figure 4, the wrong detections are partly similar to the appearance of the other objects in the dataset. This problem is caused by the data imbalance between classes and between instances in each class, which is originally known as the foreground-foreground class imbalance. In other words, the common problems, which happen not only with small objects but also for whole datasets, are intraclass similarity and interclass variation.
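This foreground-foreground imbalance can be made visible simply by counting ground-truth instances per class. A small sketch with hypothetical labels follows; real counts would come from the dataset's annotation files.

```python
from collections import Counter

# Hypothetical per-instance class labels; in practice these would be
# parsed from the dataset's ground-truth annotations.
labels = ["mouse", "mouse", "mouse", "faucet", "faucet", "plate", "tissue"]

counts = Counter(labels)
total = sum(counts.values())
for cls, n in counts.most_common():
    # Classes with many instances dominate training and pull mAP upward.
    print(f"{cls:8s} {n:3d} instances ({100 * n / total:4.1f}%)")
```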

6. Conclusion

Small object detection is a challenging and interesting problem in the task of object detection and has drawn attention from researchers thanks to the development of deep learning, which is the motivation for improving the performance of tasks in computer vision. Although deep detection models originally tend to solve problems relating to general object detection, they still work at a particular level of success on small object detection. As an evaluation work on small object detection with deep models, our goal is to highlight the remarkable achievements of popular and state-of-the-art deep models in order to provide a variety of views on applying deep models to small object detection. Particularly, we evaluated state-of-the-art detectors based on deep learning from the two approaches, namely, YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, on two datasets, namely, the small object dataset and subsets filtered from PASCAL VOC, with respect to the effects of different factors, objectively including accuracy, execution time, and resource usage.

In spite of the successful achievements of recent years, in which the performance of detection has improved significantly, there is still a huge gap in accuracy between normal objects and small objects. Under the criteria of the COCO dataset, the difference from the small scale to the medium and big scales is too large. Most models are good at detecting normal objects, and problems happen when applying them to detect small objects. As a result, to reduce the gap in small object detection, the first thing to do is to invest in datasets which have a massive amount of data to train models and a wide range of categories, so as to compete with the human visual system [12, 34].
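For reference, the COCO criteria mentioned above bucket ground-truth boxes by pixel area, which is what makes the small/medium/large accuracy gap measurable. A direct transcription of those thresholds:

```python
def coco_size_bucket(width, height):
    """COCO's standard size buckets by ground-truth box area in pixels:
    small (< 32^2), medium (32^2 to 96^2), and large (> 96^2)."""
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

print(coco_size_bucket(20, 25))    # small
print(coco_size_bucket(50, 60))    # medium
print(coco_size_bucket(120, 110))  # large
```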

So far, detection models have been divided into two main approaches, namely, the one-stage approach and the two-stage approach. Models in the one-stage approach are known as detectors which have faster and more efficient detection in comparison to the other approach. The efficiency here is the potential to run in real time and the ability to apply them to practical applications. However, the trade-off between accuracy and speed is a difficult challenge which needs to be taken into account in order to close the gap. Models in the two-stage approach, on the other hand, have their reputation as region-based detectors which have high accuracy but are too slow to apply to the real world. This drawback comes from the computation of the networks.

Through our evaluation, there is the fact that the architectures which are utilized as base networks to extract deep features have significant effects on the frameworks. The deeper the architecture is, the higher the accuracy of detection is. Once a network increases in depth, it has more layers than normal ones and will have a massive number of parameters to train. Hence, this needs a lot of data to fine-tune these parameters reasonably. If there is an increase in computation, resource consumption will also increase. As a result, it will be difficult to apply these models in practical applications. Besides, the contextual exploitation in the models is definitely limited; this causes much useful and informative data to be ignored in training, especially in the context of small objects. Because small objects are able to appear anywhere in an input image, if the image is well exploited with its context, the performance of small object detection will be improved.
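As a quick sanity check of the depth-versus-parameters point, assuming torchvision is available, the ResNet backbones compared in this study can be instantiated and their parameters counted:

```python
import torchvision.models as models

# Each step deeper adds tens of millions of weights, which in turn
# demands more training data and more memory, as argued above.
for name, ctor in [("ResNet-50", models.resnet50),
                   ("ResNet-101", models.resnet101),
                   ("ResNet-152", models.resnet152)]:
    n = sum(p.numel() for p in ctor().parameters())
    print(f"{name}: {n / 1e6:.1f}M parameters")
```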

For all the above reasons and according to our evaluation, if we aim for good performance and can ignore processing speed, two-stage methods like Faster RCNN perform well and demonstrate their network design on different datasets and many contexts of objects, including multiscale objects. Therefore, Faster RCNN is considered a strong baseline to base on or develop from. If our target is a balance of accuracy and speed, YOLO is a good choice in case we do not care about training time, because its sacrifice between speed and accuracy makes it worth applying to practical applications. Otherwise, Faster RCNN or RetinaNet is still a substitute to work on. When it comes to backbones, we have to consider the data in order to choose a reasonable backbone to combine with the methods, because the amount of data will significantly impact the model: if data are not abundant, a shallow network will fit them better. Besides, there are recently novel approaches promising for training deep models with less data, namely, weakly supervised learning such as zero-shot, one-shot, or few-shot learning. These approaches will be considered in our future work. Following our recent research, to obtain better performance on object detection, we have to consider several factors to improve the mAP, such as multiscale training, superresolution for scaling up the visual information of small objects [35], or preprocessing data to avoid imbalanced data, because there is a wide range of imbalance problems relating to data [33].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the Vietnam National University Ho Chi Minh City (VNU-HCM) under grant no. B2017-26-01.

References

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, Columbus, OH, USA, June 2014.

[2] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proceedings of the European Conference on Computer Vision, pp. 346–361, Springer, Zurich, Switzerland, September 2014.

[3] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, Santiago, Chile, December 2015.

[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, Las Vegas, NV, USA, June 2016.

[5] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," 2016, https://arxiv.org/abs/1612.08242.

[6] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.

[7] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, "Focal loss for dense object detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[8] K. Zidek, A. Hosovsky, J. Pitel', and S. Bednar, "Recognition of assembly parts by convolutional neural networks," in Advances in Manufacturing Engineering and Materials, pp. 281–289, Springer, Cham, Switzerland, 2019.

[9] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, IEEE, Venice, Italy, October 2017.

[10] L.-C. Chen, A. Hermans, G. Papandreou et al., "Instance segmentation by refining object detection with semantic and direction features," 2017, https://arxiv.org/abs/1712.04837.

[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.

[12] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft COCO: common objects in context," in Proceedings of the European Conference on Computer Vision, pp. 740–755, Springer, Zurich, Switzerland, September 2014.

[13] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, "R-CNN for small object detection," in Proceedings of the Asian Conference on Computer Vision, pp. 214–230, Springer, Taipei, Taiwan, November 2016.

[14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.

[15] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," in Proceedings of the 28th International Conference on Neural Information Processing Systems, Ser. NIPS'15, pp. 91–99, MIT Press, Cambridge, MA, USA, 2015, http://dl.acm.org/citation.cfm?id=2969239.2969250.

[16] P. Pham, D. Nguyen, T. Do, T. D. Ngo, and D.-D. Le, "Evaluation of deep models for real-time small object detection," in Proceedings of the International Conference on Neural Information Processing, pp. 516–526, Springer, Guangzhou, China, November 2017.

[17] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.

[18] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu, "Traffic-sign detection and classification in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2110–2118, Las Vegas, NV, USA, June 2016.

[19] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: a large data set for nonparametric object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.

[20] A. Kembhavi, D. Harwood, and L. S. Davis, "Vehicle detection using partial least squares," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1250–1265, 2011.

[21] V. I. Morariu, E. Ahmed, V. Santhanam, D. Harwood, and L. S. Davis, "Composite discriminant factor analysis," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 564–571, IEEE, Steamboat Springs, CO, USA, March 2014.

[22] A. Andreas, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 2012.

[23] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: human trajectory prediction in crowded spaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971, Las Vegas, NV, USA, June 2016.

[24] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, "Sun database: exploring a large collection of scene categories," International Journal of Computer Vision, vol. 119, no. 1, pp. 3–22, 2016.

[25] E. Dong, Y. Zhu, Y. Ji, and S. Du, "An improved convolution neural network for object detection using YOLOv2," in Proceedings of the 2018 IEEE International Conference on Mechatronics and Automation (ICMA), pp. 1184–1188, IEEE, Changchun, China, August 2018.

[26] W. Liu, D. Anguelov, D. Erhan et al., "SSD: single shot multibox detector," in Proceedings of the European Conference on Computer Vision, pp. 21–37, Springer, Amsterdam, The Netherlands, October 2016.

[27] T.-Y. Lin, P. Dollar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, p. 4, Honolulu, HI, USA, July 2017.

[28] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: deconvolutional single shot detector," 2017, https://arxiv.org/abs/1701.06659.

[29] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision, pp. 210–226, Springer, Munich, Germany, September 2018.

[30] L. Liu, W. Ouyang, X. Wang et al., "Deep learning for generic object detection: a survey," 2018, https://arxiv.org/abs/1809.02165.

[31] P. Zhu, L. Wen, X. Bian, L. Haibin, and Q. Hu, "Vision meets drones: a challenge," 2018, https://arxiv.org/abs/1804.07437.

[32] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: a review," 2018, https://arxiv.org/abs/1807.05511.

[33] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas, "Imbalance problems in object detection: a review," 2019, https://arxiv.org/abs/1909.00169.

[34] O. Russakovsky, J. Deng, H. Su et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[35] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 206–221, Munich, Germany, September 2018.


Page 10: AnEvaluationofDeepLearningMethodsforSmall ObjectDetection

VOC_WH_20 VOC_MRA_058 VOC_MRA_10 andVOC_MRA_20 and detail information is provided asfollows

(i) VOC2007_WH_02 contains objects whose widthand height are less than 20 of an imagersquos width andheight (is one has fewer than PASCAL VOC 2007two classes such as dining table and sofa because ofthe constraint of the definition

(ii) VOC_MRA_058 VOC_MRA_10 andVOC_MRA_20 compose objects occupying themaximum mean relative area of the original imageunder 058 10 and 20 respectively Two ofthem have the same number of PASCAL VOC 2007classes except for VOC_MRA_058 and the one hasfewer four classes such as dining table dog sofa andtrain

5 Results and Analyses

In this section we show results that we achieved through theexperimental phase All models mentioned in this sectionexcept for models cited from other papers are trained on thesame environment and 1 GPU Ubuntu 16044 LTS Intel(R) Xeon (R) Gold 6152 CPU 210GHz GPU Tesla P100In addition to the comparative accuracy other comparisonsare also provided to make our objective and clear assessmentresults

51 Accuracy

511 Small Object Dataset Following the detection resultsin Table 3 methods which belong to two-stage approachesoutperform ones in one-stage approaches about 8ndash10Specifically Faster RCNN with ResNeXT-101-64times 4d-FPNbackbone achieved the top mAP in two-stage approachesand the top of the table as well 412 In comparison withthe top in one-stage approaches YOLOv3 608times 608 withDarknet-53 obtained 331 Following [32] methods basedon region proposal such as Faster RCNN are better than

methods based on regression or classification such as YOLOand SSD Actually this is also right once again as in contextof small object dataset

For methods in each approach Firstly two-stage ap-proaches Faster RCNN which is an improvement of FastRCNN is only greater than Fast RCNN about 1ndash2 but onlyfor ResNeXT backbones and equal to Fast RCNN for the rest(e difference here is not too much and it means that theperformance of external region proposal like selective searchcombined with ROI pooling is as good as internal regionproposal like RPN with ROI aligned in this case Besidescompared to R-CNN we perceive that there is a boost 8ndash10when RoI pooling or RoI aligned is added because R-CNNwhich uses region proposals from selective search then feedsthem into the network and directly computes features from fc(fully connected) layers only receives 235 with Alexnetand 248 with VGG16 combined with proposals from RPNHowever Fast RCNN and Faster RCNN with two kinds ofRoIs are much better Fast RCNN receives accuracy in a rangeof 317 to 396 based on different backbones SimilarlyFaster RCNN gets 301 to 412 Secondly in one-stageapproaches YOLO outperforms SSD and RetinaNet How-ever YOLO gets the highest outcome 331 and SSD andRetinaNet get 1132 and 30 respectively YOLO and SSDare considered as state-of-the art methods in speed andsacrificing accuracy However there is a large difference inaccuracy between YOLO and SSD the difference here is thatSSD adds multiple convolutional layers behind the backboneand each layer has their own ability instead of using 2 fullyconnected layers like YOLO Although RetinaNet is assignedinto a method in one-stage approaches it cannot run in realtime RetinaNet is one which is proposed to deal with theimbalance between foreground and background by the focalloss (erefore RetinaNet obtains a higher accuracy incomparison with others except for YOLOv3 (Darknet-53)

When it comes to the backbones we realized that Dar-knet-53 is the best in one-stage and real-time methods andeven far higher than ResNet-50 although it similarly has thesame layers with ResNet-50 In contrast ResNeXT combinedwith FPN is themost powerful one in both one-stage and two-

Table 1 (e information of the subsets

Subsets Classes Images InstancesVOC_MRA_058 16 329 529VOC_MRA_10 20 2231 5893VOC_MRA_20 20 2970 7867VOC_WH_20 18 1070 2313

Table 2 (e parameters of models

Method Momentum Decay Gamma Learning_rate Batch_size Training_days StepsizeYOLOv2 [16] 09 00005 0001 8 5 25000YOLOv3 09 00005 0001 32 3ndash4 25000SSD300 [16] 09 00005 01 0000004 12 9 40000 80000SSD512 [16] 09 00005 01 0000004 12 12 100000 120000RetinaNet 09 00005 01 0001 64 4-gt12 h 25000 35000Fast RCNN 09 00005 01 0001 64 4-gt12 h 25000 35000Faster RCNN 09 00005 01 0001 64 4-gt12 h 25000 35000

10 Journal of Electrical and Computer Engineering

stage methods if we only consider accuracy Overall there isan increase about 1ndash3 for changing the simple backbone tothe complex one in each type For example when switchingfrom original ResNet to ResNet-FPN the accuracy is boostedfrom 2 to 3(is is clear that leveraging the advantages frommultiscale features of FPN is a common way to improvedetection and tackle the scale imbalance of input images andbounding boxes of different objects Similarly we switchResNeXT-101-32times 8d-FPN to ResNeXT-101-64times 4d-FPNand the accuracy changes from 405 to 412 for FasterRCNN and from 387 to 396 for Fast RCNN Howeverwhen considering between ResNet-50-FPN and ResNet-101-FPN the growth only happens in Fast RCNN from 333 to355(ere is a little bit decrease 01 for Faster RCNN(isreduction also happens with RetinaNet while the simplerbackbone ResNeXT-101-32times 8d-FPN gets 30 and theResNeXT-101-64times 4d-FPN just gets 251 It means that thevery deeper backbones do not guarantee the increase inaccuracy and the reason is that an advantage of a deepernetwork needsmore parameters to learn It means the authorsmust have a large number of data to feed into the network totrain and update parameters itself but in this case the data ofsmall object dataset are not abundant too much to fit the verydeep network and hence increasing the chances of overfittingBesides features which are originally from the early layer ofResNet are not well-generalized because when they arecombined with FPN the accuracy has an improvement about2ndash3 When YOLO switches from Darknet-19 to Darknet-53 it really boosts the accuracy (e highest accuracy belongsto the Darknet-19 with the resolution of 1024times1024 whichjust gets 2402 However YOLO 608times 608 with Darknet-53gets 331 (e explanation for this reason is that YOLOv3with Darknet-53 has several improvements from Darknet-19YOLOv3 has 3 location of scales to predict objects especiallyone specialized in small objects instead of only one likeDarknet-19 and it is also integrated cutting-edge advantagessuch as residual blocks and shortcut connections (e re-duction in accuracy happens again with YOLO whenswitching from ResNet-101 to ResNet-152 about 1ndash2 Inthese methods YOLO and SSD are the only ones which allowmultiple input sizes(e higher the resolution of input imagesare the higher accuracy the method receives (e reason isthat a higher resolution image allows more pixels to describethe visual information for small objects However if theresolution is far from the original size of images it results in adecrease in accuracy For example YOLO 1024times1024 withDarknet-19 gets a lower accuracy than the resolution of800times 800 In addition we have tried to increase in resolutionof Darknet-53 from 608 to 1024 and themAP decreases whenthe resolution is over 608times 608 (erefore the effect of imagesize is clear for models like SSD and YOLO Generally allcomparative results of mAP on this dataset have the domi-nation of classes very great in numbers and this is caused bythe imbalance data between the number of images and in-stances in these images For example according to the sta-tistics in [13] mouse is a major class significantly contributingto mAP in Table 3 with the highest number of instances andimages as well However tissue has least contribution with thelowest AP originally affected by the number of data

Furthermore the imbalance data lead models tending todetect frequent objects implying that models will misun-derstand objects having a nearly similar appearance with thedomination class as the objects of interest rather than lessfrequent objects As a result false positives will increase bythese problems Figure 4 illustrates the detection withstrongest backbones Following this visualization the domina-tion of the classes such asmouse or faucet results inmisdetectionwith areas which have a same appearance to them (is mis-understanding has a tendency to weaker backbones in thecomparison and one-stage method like YOLO which primarilyheads to speed has more misdetection than two-stage methodsA reason that causes these problems are the difference in thewayof training deep networks [33] One-stage methods such asYOLO use a soft sampling method that uses a whole dataset toupdate parameters rather than only choosing samples fromtraining data However two-stage methods such as RCNNfamily tend to employ hard sampling methods that randomlysample a certain number of positive and negative boundingboxes to train its network

512 Subsets of PASCAL With 4 subsets of 4 different scalesof objects in images we want to find out howmuch the scalesimpact on the models (e whole results are shown in Ta-ble 4 We separate the results into 2 groups as the one-stageand two-stage approaches and Figure 5 is a visualization forthe strongest backbones in each method on subsets

In case of different scales like our subsets there is a differencebetween one-stage approaches and two-stage approaches In thiscase methods from the one-stage approach have a better per-formance than two-stage ones inmost of scales(is is really theopposite of small object dataset Specifically two-stage methodsare totally better than one-stage ones in case of real-time inputsand just better a bit than nonreal-time models in VOC_WH20about 10ndash20 and the same result with smaller objects inVOC_MRA_0058 and VOC_MRA_010 However in biggerobjects in VOC_MRA_020 methods in one-stage approacheshave significant outcomes rather than two-stage ones In ad-dition there is just Faster RCNN that has good performance inmost cases to compare to methods in one-stage ones FastRCNN is only good at big objects in VOC_MRA_020 and failsto have good detection in smaller objects

In the one-stage approach in methods which allowmultiple inputs like YOLO and SSD there are 2 kindsnamely ones that can run in real time and the others thatcannot if the resolution is over 640 or 512 for YOLO andSSD respectively For real-time ones YOLO outperformsSSD for all scales of objects Specifically YOLOv2 withDarknet-19 is better than SSD 26 with objects inVOC_MRA_0058 and VOC_MRA_010 and 4ndash15 forlarger objects in VOC_MRA_020 and VOC_WH_20YOLOv3 with Darknet-53 gets higher results about 3ndash5 incomparison with YOLOv2 hence YOLOv3 also gets higherresults compared to SSD However if we consider nonreal-time input images SSD is greater than YOLO with objects inVOC_MRA_010 However RetinaNet which is the one thatcannot run in real time in the one-stage approach performsthe same results compared to ones in nonreal time in YOLO

Journal of Electrical and Computer Engineering 11

and better than SSD RetinaNet is more stable than SSD andYOLO when the scales are changed (e bigger the objectsare the more the stability is For example the change is somuch about 33 when the scale increases from objects inVOC_MRA_0058 to ones in VOC_MRA_010 andVOC_MRA_020 However this change is not much about10 with bigger objects in comparison with YOLO 15ndash25In case of YOLO this remarkable increase in accuracy whenobjects are larger is obviously good for a model (e changein SSD resembles the change in RetinaNet

Concerning resolutions in YOLO and SSD we see thatwhen image resolution is increased they push the accuracyto improve in general For YOLOv2 with Darknet-19 andYOLOv3 with Darknet-53 and SSD they all have an increasein accuracy when the resolution is large except for YOLOv2

with objects belonging to VOC_MRA_010 andVOC_MRA_020 when the image is over 800 In additionYOLOv2 has a fluctuation with those objects inVOC_WH20 As mentioned in our previous work YOLO isbetter than SSD in those objects less than 10 of the imageshowever in this case YOLOv3 is good at all scales of objects(is is because YOLOv3 has 3 detection locations comingwith more ratios of default boxes and it leads to a significantoutcome when combining results from 3 locations

When we switch to the two-stage approaches FasterRCNN has a significant improvement in most scales ratherthan Fast RCNN except for objects in VOC_MRA_020 whichhave the same accuracy (is shows that if objects are com-pletely separated into different scales the RoI pooling does notwork well with smaller objects and ones in VOC_WH20 In

Table 3 Comparative results on small object dataset

Method Backbone Clock Faucet Jar Mouse Outlet Plate Switch Tel t box t paper mAPYOLO 416 [16]

Darknet-19

228 308 4 52 204 131 13 61 0 353 1939YOLO 448 [16] 23 369 9 525 184 136 175 42 0 343 2013YOLO 480 [16] 342 373 91 533 214 136 158 91 91 342 2371YOLO 512 [16] 231 366 61 598 246 142 157 91 45 324 2261YOLO 554 [16] 234 372 91 601 272 134 199 91 45 345 2384YOLO 640 [16] 202 362 32 598 278 117 181 82 45 356 2253YOLO 800 [16] 276 36 23 602 328 131 233 91 91 267 2402YOLO 1024 [16] 217 293 14 583 264 118 175 91 91 157 2003YOLO 320

Darknet-532622 3838 455 5646 3642 1334 248 1065 455 4296 2583

YOLO 416 2847 4715 1083 6049 4315 1587 3073 1515 262 483 3028YOLO 608 2998 4789 1076 6588 4802 1809 3122 1462 1799 4656 331YOLO 320

ResNet-501957 2573 067 4517 1437 938 1384 909 909 237 1706

YOLO 416 2378 3665 04 5423 1837 1375 1978 984 942 3568 2219YOLO 608 2692 4065 177 6186 2918 1504 2024 1009 1329 3601 255YOLO 320

ResNet-1012052 279 057 4468 1698 1305 1366 966 909 2436 1805

YOLO 416 2572 356 303 5573 224 1561 1726 932 303 3871 2264YOLO 608 2879 4459 942 6218 3334 1553 2388 1324 1583 3917 286YOLO 320

ResNet-1522164 2756 303 4806 1739 1112 1451 909 455 3188 1888

YOLO 416 257 3654 089 5381 206 1413 2021 1149 029 3306 2167YOLO 608 2601 4454 455 61 3176 1302 2267 1235 993 3999 2658SSD300 [16] ResNet-101 55 91 0 255 61 45 0 45 91 182 825SSD300 [16] VGG16 91 171 0 261 91 91 0 45 0 167 916SSD512 [16] VGG16 91 171 0 43 91 91 91 91 0 76 1132RetinaNet ResNet-50-FPN 307 493 2 655 213 161 85 129 1 257 233RetinaNet ResNet-101-FPN 306 487 71 647 20 159 118 107 29 387 251RetinaNet ResNeXT-101-32times 8d-FPN 355 55 121 665 239 184 98 162 94 537 30RetinaNet ResNeXT-101-64times 4d-FPN 314 502 89 663 208 153 94 14 22 324 251R-CNN [13] RPN prop +VGG16 319 313 42 568 311 93 142 164 234 294 248R-CNN [13] Alexnet 7times 300 pro 324 272 51 569 28 98 136 124 179 356 239R-CNN [13] VGG16 7times 300 pro 373 303 72 606 415 158 215 137 22 333 284R-CNN [13] ContextNet (Alexnet 7times) 327 268 46 564 263 99 129 122 187 34 235Fast RCNN ResNet-50-C4 324 463 65 658 383 201 253 166 141 52 317Fast RCNN ResNet-50-FPN 374 473 73 689 467 21 321 171 93 459 333Fast RCNN ResNet-101-FPN 393 503 106 683 471 204 333 186 154 514 355Fast RCNN ResNeXT-101-32times 8d-FPN 475 548 103 718 54 214 344 217 177 535 387Fast RCNN ResNeXT-101-64times 4d-FPN 454 557 109 725 533 24 369 229 16 581 396Faster R-CNN [16] VGG16 2376 3765 803 54 1616 1188 1512 91 625 3729 2192Faster RCNN ResNet-50-C4 322 446 66 659 352 175 257 196 137 40 301Faster RCNN ResNet-50-FPN 357 499 73 684 489 188 296 147 114 533 338Faster RCNN ResNet-101-FPN 398 492 49 682 47 185 297 14 129 522 337Faster RCNN ResNeXT-101-32times 8d-FPN 498 566 114 721 563 232 37 208 188 587 405Faster RCNN ResNeXT-101-64times 4d-FPN 496 586 122 725 545 232 369 208 201 631 412(e values in bold represent the best in one-stage methods and the ones in italics represent the highest in two-stage methods

12 Journal of Electrical and Computer Engineering

addition if we compare with one-stage methods it is signif-icantly lower than them However RoI align along with RPN iswell performed when scales are changedWhen it comes to thebackbones there is a few decrease in accuracy when changingfrom ResNet-50-FPN to ResNet-101-FPN or from ResNeXT-101-32times 8d-FPN to ResNeXT-101-64times 4d-FPN with objectsfrom all scales for both Faster RCNN and Fast RCNN (eVGG16 backbone has an impressive outcome rather thanstrong backbones such as ResNet or ResNeXT Although theaccuracy is less than two strong backbones VGG16 is stillbetter with objects in VOC_WH20 and has a few change inaccuracy when changing objects with big sizes

52 Time Processing and Resource Consumption Tables 5and 6 show us the performance comparison of the eval-uated models with base networks that belong to the modelsGenerally we see that when RAM consumption in testingand training increases more layers are added (is meansthat if the network is more deeper the need of processingalso increases because this leads to the increase in pa-rameters and time to process data as well YOLO is themodel consuming the least memory in both two-phasetraining and testing Particularly YOLO is only from 4G to5G for training and from 16G to 18G for testing withDarknet-53 YOLO is the only one which is able to run inreal time YOLO just needs about 03ms to 04ms toprocess an image in comparison to more than 01 s and 02 s

with Faster RCNN and RetinaNet (is allows us to pick upthese models on devices which own the modest memoryWhile RetinaNet is assigned to the one-stage approach it isnot good enough to meet real-time detection(e inferencetime in Fast RCNN is lower a little bit than Faster RCNNand RetinaNet In contrast the RAM consumption intraining and testing of RetinaNet is lower than Fast RCNNand Faster RCNN Of all architectures the ResNet-50-C4 isthe one requiring the highest memory and time to processdata because the output size of ResNet-50-C4 is bigger a bitthan others [9] If we consider ResNet or ResNeXT com-bined with FPN Faster RCNN is over 100Mb compared toFast RCNN and 300Mb with RetinaNet In additionaccording to Table 2 the number of training days of FasterRCNN and RetinaNet need less time for training only a fewhours to 1 day rather than YOLO 3ndash4 days (is dem-onstrates that if we pay our attention to performance anddo not have much time for training we choose FasterRCNN or RetinaNet instead of YOLO one In contrast ifwe only focus on processing speed and still achieve goodperformance one-stage methods are always the good oneIn the same context of backbones RetinaNet uses a lowerresource than Fast RCNN and Faster RCNN about 100Mband 300Mb for Fast RCNN and Faster RCNN respectivelyin testing time However the training time of RetinaNetuses much memory more than Fast RCNN about 28 G andFaster RCNN about 23 G for ResNeXT-101-32times 8d-FPNand ResNeXT-101-64 times 4d-FPN If we consider this on

(a) (b)

loc ∆ (cx cy w h) conf (c1 c2 hellip cp)

(c)

Figure 4 (e location of the default boxes in different scales (a) image with GT boxes (b) 8times 8 feature map (c) 4times 4 feature map

Journal of Electrical and Computer Engineering 13

Table 4 (e comparative results on subsets of PASCAL VOC 2007

Approach Method VOC_MRA_0058 VOC_MRA_010 VOC_MRA_020 VOC_WH20

One stage

YOLOv2 416 [16] 302 3138 4289 1852YOLOv2 448 [16] 447 329 6015 2196YOLOv2 480 [16] 426 3348 6078 2667YOLOv2 512 [16] 542 3574 6112 2463YOLOv2 544 [16] 697 3656 63 2662YOLOv2 640 [16] 77 3797 6129 2341YOLOv2 800 [16] 1024 373 6191 269YOLOv2 1024 [16] 1069 2993 5514 2897

YOLOv3 320 718 3458 6036 204YOLOv3 416 102 3897 6253 2412YOLOv3 608 117 4265 6856 2886SSD 300 [16] 171 3276 4626 1691SSD 512 [16] 29 4346 5711 1987

RetinaNet-ResNet-50-FPN 884 415 502 2814RetinaNet-ResNet-101-FPN 895 425 519 2746

RetinaNet-ResNeXT-101-32times 8d-FPN 1029 454 545 3008RetinaNet-ResNeXT-101-64times 4d-FPN 1071 455 551 3132

Two stage

Fast RCNN-ResNet-50-C4 023 132 499 393Fast RCNN-ResNet-50-FPN 063 135 556 345Fast RCNN-ResNet-101-FPN 039 159 576 312

Fast RCNN-ResNeXT-101-32times 8d-FPN 051 144 579 333Fast RCNN-ResNeXT-101-64times 4d-FPN 029 142 573 376

Faster RCNN-ResNet-50-C4 698 399 487 2604Faster RCNN-ResNet-50-FPN 1074 456 563 2979Faster RCNN-ResNet-101-FPN 1063 469 576 3057

Faster RCNN-ResNeXT-101-32times 8d-FPN 1164 473 576 3212Faster RCNN-ResNeXT-101-64times 4d-FPN 1054 471 569 3164

Faster RCNN-VGG16 [16] 573 3558 4414 4111(is table illustrates how well models adapt to different scales of objects (e values in bold represent the best in one-stage methods and the ones in italicsrepresent the highest in two-stage methods

(a)

(b)

(c)

Figure 5 Continued

14 Journal of Electrical and Computer Engineering

small object dataset it does not work too much becauseRetinaNet is lower than Faster RCNN about 10 in per-formance Otherwise on different scales of subsets Reti-naNet works well when comparing to Faster RCNN and

the difference is just 2ndash4 percentages Although ResNetbackbones combined with the others yield an improvementin accuracy they do not work for YOLO on small objectdatasets YOLO with Darknet-53 utilizes more resource

Table 5 (e comparison of consumption on small object dataset

Model Backbone Inference time (s) Test RAM (MiB) Train RAM (MiB)YOLOv3 Darknet-53 00331 1825 4759YOLOv3 ResNet-50 0027 1285 3479YOLOv3 ResNet-101 00356 1829 5383YOLOv3 ResNet-152 00454 2443 7531RetinaNet ResNet-50-FPN 0102 2075 4435RetinaNet ResNet-101-FPN 0127 2723 5577RetinaNet ResNeXT-101-32times 8d-FPN 0229 3767 7863RetinaNet ResNeXT-101-64times 4d-FPN 0292 3719 7813Fast RCNN ResNet-50-C4 03 6449 5877Fast RCNN ResNet-50-FPN 0089 2277 4455Fast RCNN ResNet-101-FPN 0113 2947 5627Fast RCNN ResNeXT-101-32times 8d-FPN 0212 3987 4961Fast RCNN ResNeXT-101-64times 4d-FPN 0269 3885 4799Faster RCNN ResNet-50-C4 0412 6609 6129Faster RCNN ResNet-50-FPN 0101 2387 5381Faster RCNN ResNet-101-FPN 0124 3001 6487Faster RCNN ResNeXT-101-32times 8d-FPN 0256 4027 5333Faster RCNN ResNeXT-101-64times 4d-FPN 0286 4003 5246

Table 6 (e comparison of consumption on subsets filtered from PASCAL VOC

Model Backbone Inference time (s) Test RAM (MiB) Train RAM (MiB)YOLOv3 Darknet-53 0027 1645 4079RetinaNet ResNet-50-FPN 01 1935 4133RetinaNet ResNet-101-FPN 0116 2585 5435RetinaNet ResNeXT-101-32times 8d-FPN 0222 3641 7723RetinaNet ResNeXT-101-64times 4d-FPN 0284 3561 7599Fast RCNN ResNet-50-C4 0495 6371 5677Fast RCNN ResNet-50-FPN 0092 2131 4387Fast RCNN ResNet-101-FPN 0114 2819 5463Fast RCNN ResNeXT-101-32times 8d-FPN 0213 3873 4637Fast RCNN ResNeXT-101-64times 4d-FPN 0265 3735 4575Faster RCNN ResNet-50-C4 026 6141 5991Faster RCNN ResNet-50-FPN 01 2245 5207Faster RCNN ResNet-101-FPN 013 2855 6335Faster RCNN ResNeXT-101-32times 8d-FPN 0225 3943 5087Faster RCNN ResNeXT-101-64times 4d-FPN 0276 3885 4909

(d)

Figure 5 Highlight of bounding boxes from comparative backbones on small object dataset We here select YOLO with Darknet-53 andResNet-50 for objective comparison because there have obviously the same layers in their networks along with the significant techniquessuch as skip connections and residual blocks (e bounding boxes show that ResNet-50 has the sensitivity to areas which resembles theobjects of interest than Darknet-53 Similarly ResNet-50-FPN and ResNet-50-C4 are chosen to consider (e detection shows thatcombining ResNet-50 with FPN outputs a better performance rather than the original one Particularly misdetection happens in moredensity than ResNet-50-FPN such as in columns 4 and 5 Zoom in to see more detail

Journal of Electrical and Computer Engineering 15

than ResNet ones but it has the best accuracy amongmodels (erefore we only test YOLO with Darknet-53 insubsets of PASCAL

53 Analyses of the Trade-Off among Detectors Networkdesigns and approaches alike the one-stage approach proveits performance as applying them to detect general objectsboth small scales and other kinds of scales Although they arefast and accurate there is still a drawback always existing inthese models that is the trade-off between accuracy andspeed of processing For example YOLOv3 proposes theidea that performs detection at three different scales and thisresult is obviously impressive and yields good performanceHowever to gain this advantage YOLOv3 has to sacrifice thetime to process Instead of all inputs of the model normallyprocessing one time for detection like YOLOv2 this ideamust work 3 times (is trade-off is also partly affected byresolution as we change it during training or testing ourmodels In our previous work we have mentioned that wehave to choose a right resolution to ensure our models towork properly In case of the two-stage approaches the ideathat proposes region proposals to improve the localization ofobjects to serve for detection is good as well (is is usefulbut we have to take it into the account that we shouldgenerate proposals on feature maps or directly on inputimages because this affects a lot on the way which modelsintend to run and identify representations of objects Ifobjects are normal or have a big or medium appearance it isgood for models to work but if objects are in multiscalesthis is a problem to consider and research deeply in order tobalance the performance as well as improve it (erefore topartly fix this problem the one-stage approach allows us tochoose a fixed size of an input for training and testing butthe support still depends on characteristics of datasets whichwe evaluate or the image size After all all models we chooseto evaluate are affected by the scales of objects when wechange the scale and accuracy of models change a lot exceptfor Faster RCNN the only one model that seems to be stablewith the scale especially when combining with the VGG16architecture Although the accuracy of VGG16 is not betterthan the other architectures the difference here is that it doesnot change too much in accuracy (is is only right for bigobjects having the overlap of the bounding box and theimage greater than 10 if not this is not assured

Figure 1 shows that the possibility of small objects ismore than other objects (e black length of the camera issomehow similar to the black mouse placed on a mouse pad(is possibility of small object presence causes more diffi-culties to detectors and leads to wrong detection Anywherein an image can be small objects it results in a fact thatdetectors have much wrong detection with familiar ap-pearance which they have seen If we consider the visuali-zation of the detection in Figure 4 the wrong detection ispartly similar to the appearance of the other objects in thedataset (is problem is caused by the data imbalance be-tween classes and instances in each class which originally isknown as the foreground-foreground class imbalance In

other words the common problems which not only happenwith small objects but also for whole datasets are theintraclass similarity and interclass variation

6 Conclusion

Small object detection is a challenging and interestingproblem in the task of object detection and has drawn at-tention from researchers thanks to the development of deeplearning which is motivation to improve performance oftasks in computer vision Although deep models belongingto detection originally tend to solve problems relating togeneral object detection they still work at a particular levelto the success of small object detection As evaluation workson small object detection for deep models our goal is tohighlight remarkable achievements of popular and state-of-the-art deep models in order to provide a variety of views asapplying deep models in small object detection Particularlywe evaluate state-of-the-art real-time detectors based ondeep learning from two approaches such as YOLOv3RetinaNet Fast RCNN and Faster RCNN on two datasetsnamely small object dataset and subsets filtered fromPASCAL VOC about effects of different factors objectivelyincluding accuracy execution time and resource usage

In spite of the successful achievements in recent yearsthe performance of detection has improved significantlyand there is still a huge gap in accuracy between normalobjects and small objects In the criteria of the COCOdataset the difference from the small scale to medium andbig scale is too much Most models are good at detection ofnormal objects and problems are going to happen whenapplying them to detect small objects As a result to reducethe gap of small object detection the first thing to do is investdatasets which have massive amount of data to train modelsand have a wide range of categories to compete with thehuman visual system alike [12 34]

So far detection models are divided into two mainapproaches namely one-stage approach and two-stageapproach Models in the one-stage approach is known asdetectors which have better and more efficient detection incomparison to another approach (e efficiency here has thepotential power to run in real time and is able to apply themto practical applications However the trade-off betweenaccuracy and speed is a difficult challenge which needs to betaken into the account in order to balance the gap Howevermodels in the two-stage approach have their reputation ofregion-based detectors which have high accuracy but are toolow in speed to apply them to real world (is drawbackcomes from the computation of networks

(rough our evaluation there is a fact that architectureswhich are utilized as base networks to extract deep featureshave significant effects on frameworks (e deeper the ar-chitecture is the higher the accuracy of detection is Once anetwork has an increase in the depth this means it has morelayers than normal ones and it will have massive parametersto train Hence this needs a lot of data to fine tune theseparameters reasonably If there is an increase in computa-tion resource consumption will also increase As a result it

16 Journal of Electrical and Computer Engineering

will be difficult as we want to take them to apply in practicalapplications Besides the contextual exploit in models isdefinitely limited this results cause ignoring much usefuland informative data in training especially in context ofsmall objects Because small objects are able to appearanywhere in an input image if the image is well-exploitedwith the context the performance of small object detectionwill be improved better

For all above reasons and according to our evalua-tion if we tend to have good performance and ignore thespeed of processing two-stage methods like FasterRCNN are well-performed and demonstrate its networkdesign with the different datasets on many contexts ofobjects including multiscale objects (erefore FasterRCNN is considered as a giant baseline in order to baseon or develop from it If our target has a balance ofaccuracy and speed YOLO is a good one in case we donot care the training time because the sacrifice betweenthe speed and accuracy is worth applying it into practicalapplications Otherwise Faster RCNN or RetinaNet isstill a substitution to work on When it comes tobackbones we have to concern about the data to choose areasonable backbone to combine with the methodsBecause the amount of data will significant impact on themodel if data are not abundant the shallow network willfit it well Besides there is recently a novel approachpromising in training deep models with less data that isweakly supervised learning such as zero-shot one-shotor few-shot learning (erefore these approaches will beconsidered in our future works and following our recentsearching to have better performance on object detec-tion we have to consider several factors to improve themAP such as multiscale training superresolution forscaling up the visual information to small objects [35] orpreprocessing data to avoid the imbalance data becausewe have a wide range of imbalance problems relating todata [33]

Data Availability

(e data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

(e authors declare that they have no conflicts of interest

Acknowledgments

(is research was funded by the Vietnam National Uni-versity HoChiMinh City (VNU-HCM) under grant noB2017-26-01

References

[1] R Girshick J Donahue T Darrell and J Malik ldquoRich featurehierarchies for accurate object detection and semantic seg-mentationrdquo in Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition pp 580ndash587 ColumbusOH USA June 2014

[2] K He X Zhang S Ren and J Sun ldquoSpatial pyramid poolingin deep convolutional networks for visual recognitionrdquo inProceedings of the European Conference on Computer Visionpp 346ndash361 Springer Zurich Switzerland September 2014

[3] R Girshick ldquoFast R-CNNrdquo in Proceedings of the IEEE In-ternational Conference on Computer Vision pp 1440ndash1448Santiago Chile December 2015

[4] J Redmon S Divvala R Girshick and A Farhadi ldquoYou onlylook once unified real-time object detectionrdquo in Proceedingsof the IEEE Conference on Computer Vision And PatternRecognition pp 779ndash788 Las Vegas NV USA June 2016

[5] J Redmon and A Farhadi ldquoYOLO9000 better fasterstrongerrdquo 2016 httpsarxivorgabs161208242

[6] J Redmon and A Farhadi ldquoYOLOv3 an incremental im-provementrdquo 2018 httpsarxivorgabs180402767

[7] T-Y Lin P Goyal R Girshick K He and P Dollar ldquoFocalloss for dense object detectionrdquo in Proceedings of the IEEEtransactions on pattern analysis and machine intelligenceVenice Italy October 2018

[8] K Zidek A Hosovsky J Pitelrsquo and S Bednar ldquoRecognition ofassembly parts by convolutional neural networksrdquo in Ad-vances in Manufacturing Engineering and Materialspp 281ndash289 Springer Cham Switzerland 2019

[9] K He G Gkioxari P Dollar and R Girshick ldquoMaskR-CNNrdquo in Proceedings of the IEEE International Conferenceon computer vision (ICCV) IEEE Venice Italy pp 2980ndash2988 October 2017

[10] L-C Chen A Hermans G Papandreou et al ldquoInstancesegmentation by refining object detection with semantic anddirection featuresrdquo 2017 httpsarxivorgabs171204837

[11] M Everingham L Van Gool C K I Williams J Winn andA Zisserman ldquo(e pascal visual object classes (VOC) chal-lengerdquo International Journal of Computer Vision vol 88no 2 pp 303ndash338 2010

[12] T-Y Lin M Maire S Belongie et al ldquoMicrosoft COCOcommon objects in contextrdquo in Proceedings of the EuropeanConference on Computer Vision pp 740ndash755 SpringerZurich Switzerland September 2014

[13] C Chen M-Y Liu O Tuzel and J Xiao ldquoR-CNN for smallobject detectionrdquo in Proceedings of the Asian Conference onComputer Vision pp 214ndash230 Springer Taipei TaiwanNovember 2016

[14] P F Felzenszwalb R B Girshick D McAllester andD Ramanan ldquoObject detection with discriminatively trainedpart-based modelsrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 32 no 9 pp 1627ndash1645 2010


stage methods if we only consider accuracy. Overall, there is an increase of about 1–3% when changing from the simple backbone to the complex one in each type. For example, when switching from the original ResNet to ResNet-FPN, the accuracy is boosted by 2 to 3%. It is clear that leveraging the multiscale feature maps of FPN is a common way to improve detection and to tackle the scale imbalance of input images and of the bounding boxes of different objects. Similarly, when we switch from ResNeXT-101-32×8d-FPN to ResNeXT-101-64×4d-FPN, the accuracy changes from 40.5% to 41.2% for Faster RCNN and from 38.7% to 39.6% for Fast RCNN. However, when comparing ResNet-50-FPN with ResNet-101-FPN, growth only happens for Fast RCNN, from 33.3% to 35.5%; there is a slight decrease of 0.1% for Faster RCNN. This reduction also happens with RetinaNet: while the simpler backbone, ResNeXT-101-32×8d-FPN, gets 30.0%, ResNeXT-101-64×4d-FPN gets just 25.1%. This means that very deep backbones do not guarantee an increase in accuracy; the reason is that a deeper network needs more parameters to learn, so a large amount of data must be fed into the network to train and update those parameters. In this case, the data of the small object dataset are not abundant enough to fit a very deep network, which increases the chance of overfitting. Besides, the features coming from the early layers of ResNet alone are not well generalized, because when they are combined with FPN, the accuracy improves by about 2–3%. When YOLO switches from Darknet-19 to Darknet-53, the accuracy really gets a boost: the highest accuracy with Darknet-19, obtained at the resolution of 800×800, is just 24.02%, whereas YOLO 608×608 with Darknet-53 gets 33.1%. The explanation is that YOLOv3 with Darknet-53 brings several improvements over Darknet-19: YOLOv3 predicts objects at 3 scale locations, one of which is specialized for small objects, instead of only one location as with Darknet-19, and it also integrates cutting-edge advances such as residual blocks and shortcut connections. The reduction in accuracy happens again with YOLO, by about 1–2%, when switching from ResNet-101 to ResNet-152. Among these methods, YOLO and SSD are the only ones which allow multiple input sizes. The higher the resolution of the input images, the higher the accuracy the method achieves, because a higher-resolution image provides more pixels to describe the visual information of small objects. However, if the resolution is too far from the original size of the images, the accuracy decreases. For example, YOLO 1024×1024 with Darknet-19 gets a lower accuracy than the 800×800 resolution. In addition, we tried increasing the resolution of Darknet-53 from 608 to 1024, and the mAP decreases once the resolution exceeds 608×608. Therefore, the effect of image size is clear for models like SSD and YOLO. Generally, all the comparative mAP results on this dataset are dominated by the classes greatest in number, which is caused by the data imbalance between the number of images and the number of instances in those images. For example, according to the statistics in [13], mouse is a major class contributing significantly to the mAP in Table 3, with the highest numbers of instances and images, whereas tissue box has the least contribution, with the lowest AP, which originally stems from the amount of data.
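To make the role of FPN concrete, the following is a minimal sketch of an FPN top-down pathway with lateral connections in the style of [27]; the channel widths and the number of pyramid levels are illustrative assumptions, not the exact configuration of the evaluated Detectron-style models.

```python
# Minimal sketch of an FPN top-down pathway (after [27]); channel widths and
# level count are illustrative assumptions, not the evaluated configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs project each backbone stage (C2..C5) to one width.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convs smooth the merged maps into the pyramid levels (P2..P5).
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels)

    def forward(self, feats):  # feats = [C2, C3, C4, C5], finest first
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # Top-down: upsample each coarser map and add it to the finer lateral,
        # so high-resolution levels also carry strong semantic features.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [sm(lat) for sm, lat in zip(self.smooth, laterals)]  # [P2..P5]
```

Because every pyramid level shares the same detection head, small objects are matched to the fine, semantically enriched levels instead of to a single coarse map, which is why the FPN variants gain 2–3% above.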

Furthermore, the imbalanced data lead models to tend toward detecting the frequent objects, implying that models will mistake objects with an appearance similar to the dominant class for objects of interest rather than detecting the less frequent classes. As a result, false positives increase because of these problems. Figure 4 illustrates the detections with the strongest backbones. Following this visualization, the domination of classes such as mouse or faucet results in misdetections in areas which have the same appearance as them. This misunderstanding is more pronounced for the weaker backbones in the comparison, and a one-stage method like YOLO, which primarily aims at speed, has more misdetections than the two-stage methods. A reason behind these problems is the difference in the way the deep networks are trained [33]. One-stage methods such as YOLO use a soft sampling method that uses the whole dataset to update parameters rather than choosing only some samples from the training data. In contrast, two-stage methods such as the RCNN family tend to employ hard sampling methods that randomly sample a certain number of positive and negative bounding boxes to train the network.
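The difference between the two training regimes can be made concrete with a short sketch. The 128-RoI quota with a 1:4 positive fraction is a common Fast RCNN-style setting and is assumed here purely for illustration.

```python
# Sketch of the hard sampling used by the RCNN family: per image, keep a
# fixed random quota of positive and negative RoIs, so most easy negatives
# never contribute to the loss. One-stage "soft sampling" instead keeps every
# location and reweights its loss. The 128-RoI batch with a 0.25 positive
# fraction is an assumed, Fast RCNN-style setting.
import random

def sample_rois(roi_labels, batch_size=128, pos_fraction=0.25):
    positives = [i for i, lab in enumerate(roi_labels) if lab > 0]
    negatives = [i for i, lab in enumerate(roi_labels) if lab == 0]
    n_pos = min(len(positives), int(batch_size * pos_fraction))
    n_neg = min(len(negatives), batch_size - n_pos)
    return random.sample(positives, n_pos) + random.sample(negatives, n_neg)

# e.g., 2000 candidate RoIs of which ~2% are positive:
labels = [1] * 40 + [0] * 1960
kept = sample_rois(labels)  # 32 positives + 96 randomly chosen negatives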

5.1.2. Subsets of PASCAL. With the 4 subsets containing 4 different scales of objects, we want to find out how much the scales impact the models. The whole results are shown in Table 4. We separate the results into 2 groups, the one-stage and the two-stage approaches, and Figure 5 visualizes the strongest backbone of each method on the subsets.
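As a reproducibility aid, the sketch below buckets images by object scale. The thresholds are assumptions read off the subset names (MRA taken as the maximum relative area of a box: 0.058, 0.10, 0.20; WH20 as width and height each under 20% of the image); the exact filtering rules of the subsets may differ.

```python
# Sketch of bucketing PASCAL VOC images into scale subsets; the thresholds
# below are assumptions inferred from the subset names, not verified rules.
def relative_area(box, img_w, img_h):
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1) / float(img_w * img_h)

def subsets_for_image(boxes, img_w, img_h):
    names = []
    if all((x2 - x1) < 0.2 * img_w and (y2 - y1) < 0.2 * img_h
           for (x1, y1, x2, y2) in boxes):
        names.append("VOC_WH20")
    mra = max(relative_area(b, img_w, img_h) for b in boxes)
    for name, thr in (("VOC_MRA_0058", 0.058), ("VOC_MRA_010", 0.10),
                      ("VOC_MRA_020", 0.20)):
        if mra <= thr:
            names.append(name)
            break
    return names
```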

For different scales, as in our subsets, there is a difference between the one-stage and the two-stage approaches. In this case, the methods from the one-stage approach perform better than the two-stage ones at most scales, which is really the opposite of the small object dataset. Specifically, the two-stage methods are totally better than one-stage ones at real-time input sizes, by about 10–20%, and just a bit better than the non-real-time models in VOC_WH20, with the same result for the smaller objects in VOC_MRA_0058 and VOC_MRA_010. However, for the bigger objects in VOC_MRA_020, the one-stage approaches give significantly better outcomes than the two-stage ones. In addition, only Faster RCNN performs well in most cases when compared with the one-stage methods; Fast RCNN is only good at the big objects in VOC_MRA_020 and fails to detect the smaller objects well.

In the one-stage approach, among the methods which allow multiple input sizes, like YOLO and SSD, there are 2 kinds, namely, ones that can run in real time and others that cannot, once the resolution is over 640 or 512 for YOLO and SSD, respectively. For the real-time ones, YOLO outperforms SSD at all scales of objects. Specifically, YOLOv2 with Darknet-19 is better than SSD by about 2.6% with the objects in VOC_MRA_0058 and VOC_MRA_010 and by 4–15% for the larger objects in VOC_MRA_020 and VOC_WH_20. YOLOv3 with Darknet-53 gets results about 3–5% higher in comparison with YOLOv2; hence, YOLOv3 also beats SSD. However, if we consider the non-real-time input sizes, SSD is greater than YOLO with the objects in VOC_MRA_010. Meanwhile, RetinaNet, the one-stage method that cannot run in real time, performs on par with the non-real-time configurations of YOLO and better than SSD. RetinaNet is also more stable than SSD and YOLO when the scales are changed: the bigger the objects are, the greater the stability. For example, the change is large, about 33%, when the scale increases from objects in VOC_MRA_0058 to ones in VOC_MRA_010 and VOC_MRA_020; however, this change is small, about 10%, with the bigger objects, in comparison with 15–25% for YOLO. In the case of YOLO, this remarkable increase in accuracy when objects are larger is obviously good for a model. The change in SSD resembles the change in RetinaNet.
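RetinaNet's stability is commonly attributed to its focal loss [7], which lets training use every anchor ("soft sampling") while down-weighting easy examples. A minimal sketch with the default α and γ reported in [7]:

```python
# Binary focal loss for a single anchor, with the default alpha and gamma
# from [7]; p is the predicted foreground probability.
import math

def focal_loss(p, is_positive, alpha=0.25, gamma=2.0):
    p_t = p if is_positive else 1.0 - p
    alpha_t = alpha if is_positive else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))

# An easy negative (p = 0.01) costs orders of magnitude less than a hard
# negative (p = 0.9), so all anchors can be trained without resampling:
easy, hard = focal_loss(0.01, False), focal_loss(0.9, False)  # ~7.5e-7 vs ~1.4
```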

Concerning the resolutions of YOLO and SSD, we see that increasing the image resolution generally pushes the accuracy up. YOLOv2 with Darknet-19, YOLOv3 with Darknet-53, and SSD all gain accuracy at larger resolutions, except for YOLOv2 with the objects belonging to VOC_MRA_010 and VOC_MRA_020 when the image is over 800. In addition, YOLOv2 fluctuates on the objects in VOC_WH20. As mentioned in our previous work, YOLO is better than SSD on objects occupying less than 10% of the image; however, in this case, YOLOv3 is good at all scales of objects. This is because YOLOv3 has 3 detection locations coming with more ratios of default boxes, which leads to a significant outcome when the results from the 3 locations are combined.
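The effect of resolution on YOLOv3's 3 detection locations can be quantified directly: a fully convolutional detector predicts on grids of input/stride cells, so the number of predictions grows quadratically with resolution. The strides and anchor counts below are YOLOv3's published values.

```python
# Back-of-envelope view of why resolution matters for YOLOv3: its three
# detection locations use strides 32, 16, and 8 with 3 anchors per cell.
def yolov3_predictions(resolution, strides=(32, 16, 8), anchors_per_cell=3):
    return sum((resolution // s) ** 2 * anchors_per_cell for s in strides)

for r in (320, 416, 608):
    print(r, yolov3_predictions(r))  # 320: 6300, 416: 10647, 608: 22743
```

Most of the extra predictions come from the stride-8 grid, the location specialized for small objects, which matches the gains observed on the small-scale subsets.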

When we switch to the two-stage approaches, Faster RCNN shows a significant improvement over Fast RCNN at most scales, except for the objects in VOC_MRA_020, where the two have the same accuracy. This shows that when objects are completely separated into different scales, RoI pooling does not work well with the smaller objects and with the ones in VOC_WH20.

Table 3: Comparative results on the small object dataset (AP per class and mAP, %).

| Method | Backbone | Clock | Faucet | Jar | Mouse | Outlet | Plate | Switch | Tel | t box | t paper | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLO 416 [16] | Darknet-19 | 22.8 | 30.8 | 0.4 | 52.0 | 20.4 | 13.1 | 13.0 | 6.1 | 0 | 35.3 | 19.39 |
| YOLO 448 [16] | Darknet-19 | 23.0 | 36.9 | 0.9 | 52.5 | 18.4 | 13.6 | 17.5 | 4.2 | 0 | 34.3 | 20.13 |
| YOLO 480 [16] | Darknet-19 | 34.2 | 37.3 | 9.1 | 53.3 | 21.4 | 13.6 | 15.8 | 9.1 | 9.1 | 34.2 | 23.71 |
| YOLO 512 [16] | Darknet-19 | 23.1 | 36.6 | 6.1 | 59.8 | 24.6 | 14.2 | 15.7 | 9.1 | 4.5 | 32.4 | 22.61 |
| YOLO 544 [16] | Darknet-19 | 23.4 | 37.2 | 9.1 | 60.1 | 27.2 | 13.4 | 19.9 | 9.1 | 4.5 | 34.5 | 23.84 |
| YOLO 640 [16] | Darknet-19 | 20.2 | 36.2 | 3.2 | 59.8 | 27.8 | 11.7 | 18.1 | 8.2 | 4.5 | 35.6 | 22.53 |
| YOLO 800 [16] | Darknet-19 | 27.6 | 36.0 | 2.3 | 60.2 | 32.8 | 13.1 | 23.3 | 9.1 | 9.1 | 26.7 | 24.02 |
| YOLO 1024 [16] | Darknet-19 | 21.7 | 29.3 | 1.4 | 58.3 | 26.4 | 11.8 | 17.5 | 9.1 | 9.1 | 15.7 | 20.03 |
| YOLO 320 | Darknet-53 | 26.22 | 38.38 | 4.55 | 56.46 | 36.42 | 13.34 | 24.8 | 10.65 | 4.55 | 42.96 | 25.83 |
| YOLO 416 | Darknet-53 | 28.47 | 47.15 | 10.83 | 60.49 | 43.15 | 15.87 | 30.73 | 15.15 | 2.62 | 48.3 | 30.28 |
| YOLO 608 | Darknet-53 | 29.98 | 47.89 | 10.76 | 65.88 | **48.02** | 18.09 | **31.22** | 14.62 | **17.99** | 46.56 | **33.1** |
| YOLO 320 | ResNet-50 | 19.57 | 25.73 | 0.67 | 45.17 | 14.37 | 9.38 | 13.84 | 9.09 | 9.09 | 23.7 | 17.06 |
| YOLO 416 | ResNet-50 | 23.78 | 36.65 | 0.4 | 54.23 | 18.37 | 13.75 | 19.78 | 9.84 | 9.42 | 35.68 | 22.19 |
| YOLO 608 | ResNet-50 | 26.92 | 40.65 | 1.77 | 61.86 | 29.18 | 15.04 | 20.24 | 10.09 | 13.29 | 36.01 | 25.5 |
| YOLO 320 | ResNet-101 | 20.52 | 27.9 | 0.57 | 44.68 | 16.98 | 13.05 | 13.66 | 9.66 | 9.09 | 24.36 | 18.05 |
| YOLO 416 | ResNet-101 | 25.72 | 35.6 | 3.03 | 55.73 | 22.4 | 15.61 | 17.26 | 9.32 | 3.03 | 38.71 | 22.64 |
| YOLO 608 | ResNet-101 | 28.79 | 44.59 | 9.42 | 62.18 | 33.34 | 15.53 | 23.88 | 13.24 | 15.83 | 39.17 | 28.6 |
| YOLO 320 | ResNet-152 | 21.64 | 27.56 | 3.03 | 48.06 | 17.39 | 11.12 | 14.51 | 9.09 | 4.55 | 31.88 | 18.88 |
| YOLO 416 | ResNet-152 | 25.7 | 36.54 | 0.89 | 53.81 | 20.6 | 14.13 | 20.21 | 11.49 | 0.29 | 33.06 | 21.67 |
| YOLO 608 | ResNet-152 | 26.01 | 44.54 | 4.55 | 61.0 | 31.76 | 13.02 | 22.67 | 12.35 | 9.93 | 39.99 | 26.58 |
| SSD300 [16] | ResNet-101 | 5.5 | 9.1 | 0 | 25.5 | 6.1 | 4.5 | 0 | 4.5 | 9.1 | 18.2 | 8.25 |
| SSD300 [16] | VGG16 | 9.1 | 17.1 | 0 | 26.1 | 9.1 | 9.1 | 0 | 4.5 | 0 | 16.7 | 9.16 |
| SSD512 [16] | VGG16 | 9.1 | 17.1 | 0 | 43.0 | 9.1 | 9.1 | 9.1 | 9.1 | 0 | 7.6 | 11.32 |
| RetinaNet | ResNet-50-FPN | 30.7 | 49.3 | 2.0 | 65.5 | 21.3 | 16.1 | 8.5 | 12.9 | 1.0 | 25.7 | 23.3 |
| RetinaNet | ResNet-101-FPN | 30.6 | 48.7 | 7.1 | 64.7 | 20.0 | 15.9 | 11.8 | 10.7 | 2.9 | 38.7 | 25.1 |
| RetinaNet | ResNeXT-101-32×8d-FPN | **35.5** | **55.0** | **12.1** | **66.5** | 23.9 | **18.4** | 9.8 | **16.2** | 9.4 | **53.7** | 30.0 |
| RetinaNet | ResNeXT-101-64×4d-FPN | 31.4 | 50.2 | 8.9 | 66.3 | 20.8 | 15.3 | 9.4 | 14.0 | 2.2 | 32.4 | 25.1 |
| R-CNN [13] | RPN prop. + VGG16 | 31.9 | 31.3 | 4.2 | 56.8 | 31.1 | 9.3 | 14.2 | 16.4 | *23.4* | 29.4 | 24.8 |
| R-CNN [13] | AlexNet 7×, 300 prop. | 32.4 | 27.2 | 5.1 | 56.9 | 28.0 | 9.8 | 13.6 | 12.4 | 17.9 | 35.6 | 23.9 |
| R-CNN [13] | VGG16 7×, 300 prop. | 37.3 | 30.3 | 7.2 | 60.6 | 41.5 | 15.8 | 21.5 | 13.7 | 22.0 | 33.3 | 28.4 |
| R-CNN [13] | ContextNet (AlexNet 7×) | 32.7 | 26.8 | 4.6 | 56.4 | 26.3 | 9.9 | 12.9 | 12.2 | 18.7 | 34.0 | 23.5 |
| Fast RCNN | ResNet-50-C4 | 32.4 | 46.3 | 6.5 | 65.8 | 38.3 | 20.1 | 25.3 | 16.6 | 14.1 | 52.0 | 31.7 |
| Fast RCNN | ResNet-50-FPN | 37.4 | 47.3 | 7.3 | 68.9 | 46.7 | 21.0 | 32.1 | 17.1 | 9.3 | 45.9 | 33.3 |
| Fast RCNN | ResNet-101-FPN | 39.3 | 50.3 | 10.6 | 68.3 | 47.1 | 20.4 | 33.3 | 18.6 | 15.4 | 51.4 | 35.5 |
| Fast RCNN | ResNeXT-101-32×8d-FPN | 47.5 | 54.8 | 10.3 | 71.8 | 54.0 | 21.4 | 34.4 | 21.7 | 17.7 | 53.5 | 38.7 |
| Fast RCNN | ResNeXT-101-64×4d-FPN | 45.4 | 55.7 | 10.9 | *72.5* | 53.3 | *24.0* | 36.9 | *22.9* | 16.0 | 58.1 | 39.6 |
| Faster R-CNN [16] | VGG16 | 23.76 | 37.65 | 8.03 | 54.0 | 16.16 | 11.88 | 15.12 | 9.1 | 6.25 | 37.29 | 21.92 |
| Faster RCNN | ResNet-50-C4 | 32.2 | 44.6 | 6.6 | 65.9 | 35.2 | 17.5 | 25.7 | 19.6 | 13.7 | 40.0 | 30.1 |
| Faster RCNN | ResNet-50-FPN | 35.7 | 49.9 | 7.3 | 68.4 | 48.9 | 18.8 | 29.6 | 14.7 | 11.4 | 53.3 | 33.8 |
| Faster RCNN | ResNet-101-FPN | 39.8 | 49.2 | 4.9 | 68.2 | 47.0 | 18.5 | 29.7 | 14.0 | 12.9 | 52.2 | 33.7 |
| Faster RCNN | ResNeXT-101-32×8d-FPN | *49.8* | 56.6 | 11.4 | 72.1 | *56.3* | 23.2 | *37.0* | 20.8 | 18.8 | 58.7 | 40.5 |
| Faster RCNN | ResNeXT-101-64×4d-FPN | 49.6 | *58.6* | *12.2* | *72.5* | 54.5 | 23.2 | 36.9 | 20.8 | 20.1 | *63.1* | *41.2* |

The values in bold represent the best among the one-stage methods, and the ones in italics represent the highest among the two-stage methods.


In addition, if we compare Fast RCNN with the one-stage methods, it is significantly lower than them. However, RoI align along with RPN performs well when the scales are changed. When it comes to the backbones, there is a small decrease in accuracy when changing from ResNet-50-FPN to ResNet-101-FPN or from ResNeXT-101-32×8d-FPN to ResNeXT-101-64×4d-FPN, with objects from all scales, for both Faster RCNN and Fast RCNN. The VGG16 backbone has an impressive outcome compared with strong backbones such as ResNet or ResNeXT: although its accuracy is lower than that of the two strong backbones, VGG16 is still better with the objects in VOC_WH20 and changes little in accuracy when moving to objects of big sizes.
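A small numeric sketch illustrates why RoI pooling degrades on small objects while RoI align does not; the stride and the box coordinates below are illustrative, not taken from the experiments.

```python
# RoI pooling snaps box coordinates to integer feature-map cells, so at
# stride 16 a small box collapses onto a couple of cells; RoI align keeps
# fractional coordinates and samples the feature map bilinearly instead.
stride = 16
box = (23.0, 31.0, 54.0, 49.0)  # a 31 x 18 px object in image coordinates

# RoI pooling: quantize the projected box (a second quantization later
# splits the region into pooling bins).
quantized = tuple(int(v / stride) for v in box)
print(quantized)  # (1, 1, 3, 3) -> only a 2 x 2 patch of feature cells

# RoI align: keep real-valued coordinates; bin values are interpolated.
aligned = tuple(v / stride for v in box)
print(aligned)    # (1.4375, 1.9375, 3.375, 3.0625)
```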

5.2. Time Processing and Resource Consumption. Tables 5 and 6 show the performance comparison of the evaluated models with their base networks. Generally, we see that RAM consumption in testing and training increases as more layers are added. This means that the deeper the network is, the greater the processing demand, because depth increases the number of parameters and the time needed to process the data as well. YOLO is the model consuming the least memory in both the training and testing phases: with Darknet-53, YOLO takes only 4–5 GB for training and 1.6–1.8 GB for testing. YOLO is also the only one able to run in real time; it needs only about 0.03 s to 0.04 s to process an image, in comparison with more than 0.1 s and 0.2 s for Faster RCNN and RetinaNet. This allows us to deploy such models on devices with modest memory. While RetinaNet is assigned to the one-stage approach, it is not fast enough to meet real-time detection. The inference time of Fast RCNN is a little lower than that of Faster RCNN and RetinaNet. In contrast, the RAM consumption of RetinaNet in training and testing is lower than that of Fast RCNN and Faster RCNN. Of all the architectures, ResNet-50-C4 is the one requiring the most memory and time to process data, because the output size of ResNet-50-C4 is a bit bigger than the others [9]. If we consider ResNet or ResNeXT combined with FPN, Faster RCNN is over 100 MB above Fast RCNN and 300 MB above RetinaNet. In addition, according to Table 2, Faster RCNN and RetinaNet need less time for training, only a few hours to 1 day, rather than the 3–4 days of YOLO. This demonstrates that if we pay attention to performance and do not have much time for training, we should choose Faster RCNN or RetinaNet instead of YOLO. In contrast, if we focus on processing speed while still achieving good performance, the one-stage methods are always the good choice. In the same context of backbones, RetinaNet uses fewer resources at testing time than Fast RCNN and Faster RCNN, by about 100 MB and 300 MB, respectively. However, at training time, RetinaNet uses much more memory, by about 2.8 GB compared with Fast RCNN and about 2.3 GB compared with Faster RCNN, for ResNeXT-101-32×8d-FPN and ResNeXT-101-64×4d-FPN. If we consider this on the small object dataset, it does not matter too much, because RetinaNet is about 10% lower than Faster RCNN in performance; otherwise, on the different scales of the subsets, RetinaNet works well compared with Faster RCNN, and the difference is just 2–4 percentage points. Although the ResNet backbones combined with the other methods yield an improvement in accuracy, they do not work for YOLO on the small object dataset. YOLO with Darknet-53 utilizes more resources than the ResNet ones, but it has the best accuracy among the models; therefore, we only test YOLO with Darknet-53 on the subsets of PASCAL.
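As a sketch of how such figures can be gathered (not our exact instrumentation), per-image latency can be averaged over the test set and resident memory read from the process; `model` and `images` are placeholders for any detector under test, and psutil is a third-party package.

```python
# Average per-image wall-clock time and resident memory of the current
# process. For GPU models, a device synchronization belongs before each
# clock read so that asynchronous kernels are fully accounted for.
import time
import psutil

def profile(model, images, warmup=10):
    for img in images[:warmup]:      # warm up caches and lazy initialization
        model(img)
    start = time.perf_counter()
    for img in images:
        model(img)
    per_image_s = (time.perf_counter() - start) / len(images)
    ram_mib = psutil.Process().memory_info().rss / 2**20
    return per_image_s, ram_mib
```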

Figure 4: The location of the default boxes in different scales: (a) image with GT boxes; (b) 8×8 feature map; (c) 4×4 feature map. Each default box predicts location offsets, loc: Δ(cx, cy, w, h), and per-class confidences, conf: (c1, c2, …, cp).


Table 4: The comparative results on subsets of PASCAL VOC 2007 (mAP, %).

| Approach | Method | VOC_MRA_0058 | VOC_MRA_010 | VOC_MRA_020 | VOC_WH20 |
|---|---|---|---|---|---|
| One stage | YOLOv2 416 [16] | 3.02 | 31.38 | 42.89 | 18.52 |
| One stage | YOLOv2 448 [16] | 4.47 | 32.9 | 60.15 | 21.96 |
| One stage | YOLOv2 480 [16] | 4.26 | 33.48 | 60.78 | 26.67 |
| One stage | YOLOv2 512 [16] | 5.42 | 35.74 | 61.12 | 24.63 |
| One stage | YOLOv2 544 [16] | 6.97 | 36.56 | 63.0 | 26.62 |
| One stage | YOLOv2 640 [16] | 7.7 | 37.97 | 61.29 | 23.41 |
| One stage | YOLOv2 800 [16] | 10.24 | 37.3 | 61.91 | 26.9 |
| One stage | YOLOv2 1024 [16] | 10.69 | 29.93 | 55.14 | 28.97 |
| One stage | YOLOv3 320 | 7.18 | 34.58 | 60.36 | 20.4 |
| One stage | YOLOv3 416 | 10.2 | 38.97 | 62.53 | 24.12 |
| One stage | YOLOv3 608 | **11.7** | 42.65 | **68.56** | 28.86 |
| One stage | SSD 300 [16] | 1.71 | 32.76 | 46.26 | 16.91 |
| One stage | SSD 512 [16] | 2.9 | 43.46 | 57.11 | 19.87 |
| One stage | RetinaNet-ResNet-50-FPN | 8.84 | 41.5 | 50.2 | 28.14 |
| One stage | RetinaNet-ResNet-101-FPN | 8.95 | 42.5 | 51.9 | 27.46 |
| One stage | RetinaNet-ResNeXT-101-32×8d-FPN | 10.29 | 45.4 | 54.5 | 30.08 |
| One stage | RetinaNet-ResNeXT-101-64×4d-FPN | 10.71 | **45.5** | 55.1 | **31.32** |
| Two stage | Fast RCNN-ResNet-50-C4 | 0.23 | 1.32 | 49.9 | 3.93 |
| Two stage | Fast RCNN-ResNet-50-FPN | 0.63 | 1.35 | 55.6 | 3.45 |
| Two stage | Fast RCNN-ResNet-101-FPN | 0.39 | 1.59 | 57.6 | 3.12 |
| Two stage | Fast RCNN-ResNeXT-101-32×8d-FPN | 0.51 | 1.44 | *57.9* | 3.33 |
| Two stage | Fast RCNN-ResNeXT-101-64×4d-FPN | 0.29 | 1.42 | 57.3 | 3.76 |
| Two stage | Faster RCNN-ResNet-50-C4 | 6.98 | 39.9 | 48.7 | 26.04 |
| Two stage | Faster RCNN-ResNet-50-FPN | 10.74 | 45.6 | 56.3 | 29.79 |
| Two stage | Faster RCNN-ResNet-101-FPN | 10.63 | 46.9 | 57.6 | 30.57 |
| Two stage | Faster RCNN-ResNeXT-101-32×8d-FPN | *11.64* | *47.3* | 57.6 | 32.12 |
| Two stage | Faster RCNN-ResNeXT-101-64×4d-FPN | 10.54 | 47.1 | 56.9 | 31.64 |
| Two stage | Faster RCNN-VGG16 [16] | 5.73 | 35.58 | 44.14 | *41.11* |

This table illustrates how well the models adapt to different scales of objects. The values in bold represent the best among the one-stage methods, and the ones in italics represent the highest among the two-stage methods.




Table 5: The comparison of consumption on the small object dataset.

| Model | Backbone | Inference time (s) | Test RAM (MiB) | Train RAM (MiB) |
|---|---|---|---|---|
| YOLOv3 | Darknet-53 | 0.0331 | 1825 | 4759 |
| YOLOv3 | ResNet-50 | 0.027 | 1285 | 3479 |
| YOLOv3 | ResNet-101 | 0.0356 | 1829 | 5383 |
| YOLOv3 | ResNet-152 | 0.0454 | 2443 | 7531 |
| RetinaNet | ResNet-50-FPN | 0.102 | 2075 | 4435 |
| RetinaNet | ResNet-101-FPN | 0.127 | 2723 | 5577 |
| RetinaNet | ResNeXT-101-32×8d-FPN | 0.229 | 3767 | 7863 |
| RetinaNet | ResNeXT-101-64×4d-FPN | 0.292 | 3719 | 7813 |
| Fast RCNN | ResNet-50-C4 | 0.3 | 6449 | 5877 |
| Fast RCNN | ResNet-50-FPN | 0.089 | 2277 | 4455 |
| Fast RCNN | ResNet-101-FPN | 0.113 | 2947 | 5627 |
| Fast RCNN | ResNeXT-101-32×8d-FPN | 0.212 | 3987 | 4961 |
| Fast RCNN | ResNeXT-101-64×4d-FPN | 0.269 | 3885 | 4799 |
| Faster RCNN | ResNet-50-C4 | 0.412 | 6609 | 6129 |
| Faster RCNN | ResNet-50-FPN | 0.101 | 2387 | 5381 |
| Faster RCNN | ResNet-101-FPN | 0.124 | 3001 | 6487 |
| Faster RCNN | ResNeXT-101-32×8d-FPN | 0.256 | 4027 | 5333 |
| Faster RCNN | ResNeXT-101-64×4d-FPN | 0.286 | 4003 | 5246 |

Table 6: The comparison of consumption on the subsets filtered from PASCAL VOC.

| Model | Backbone | Inference time (s) | Test RAM (MiB) | Train RAM (MiB) |
|---|---|---|---|---|
| YOLOv3 | Darknet-53 | 0.027 | 1645 | 4079 |
| RetinaNet | ResNet-50-FPN | 0.1 | 1935 | 4133 |
| RetinaNet | ResNet-101-FPN | 0.116 | 2585 | 5435 |
| RetinaNet | ResNeXT-101-32×8d-FPN | 0.222 | 3641 | 7723 |
| RetinaNet | ResNeXT-101-64×4d-FPN | 0.284 | 3561 | 7599 |
| Fast RCNN | ResNet-50-C4 | 0.495 | 6371 | 5677 |
| Fast RCNN | ResNet-50-FPN | 0.092 | 2131 | 4387 |
| Fast RCNN | ResNet-101-FPN | 0.114 | 2819 | 5463 |
| Fast RCNN | ResNeXT-101-32×8d-FPN | 0.213 | 3873 | 4637 |
| Fast RCNN | ResNeXT-101-64×4d-FPN | 0.265 | 3735 | 4575 |
| Faster RCNN | ResNet-50-C4 | 0.26 | 6141 | 5991 |
| Faster RCNN | ResNet-50-FPN | 0.1 | 2245 | 5207 |
| Faster RCNN | ResNet-101-FPN | 0.13 | 2855 | 6335 |
| Faster RCNN | ResNeXT-101-32×8d-FPN | 0.225 | 3943 | 5087 |
| Faster RCNN | ResNeXT-101-64×4d-FPN | 0.276 | 3885 | 4909 |

Figure 5: (a–d) Highlights of bounding boxes from the comparative backbones on the small object dataset. We select YOLO with Darknet-53 and ResNet-50 for an objective comparison because their networks obviously have the same number of layers along with the significant techniques such as skip connections and residual blocks. The bounding boxes show that ResNet-50 is more sensitive than Darknet-53 to areas which resemble the objects of interest. Similarly, ResNet-50-FPN and ResNet-50-C4 are chosen for consideration. The detections show that combining ResNet-50 with FPN yields a better performance than the original one; particularly, misdetections occur more densely with ResNet-50-C4 than with ResNet-50-FPN, such as in columns 4 and 5. Zoom in to see more detail.



5.3. Analyses of the Trade-Off among Detectors. Network designs and approaches like the one-stage approach prove their performance when applied to detect general objects, at both small scales and other kinds of scales. Although they are fast and accurate, there is still a drawback that always exists in these models, that is, the trade-off between accuracy and processing speed. For example, YOLOv3 proposes the idea of performing detection at three different scales, and the result is obviously impressive and yields good performance. However, to gain this advantage, YOLOv3 has to sacrifice processing time: instead of processing each input once for detection, as YOLOv2 normally does, this idea must work 3 times. This trade-off is also partly affected by the resolution, as we change it during training or testing of our models. In our previous work, we mentioned that we have to choose the right resolution to ensure our models work properly. In the case of the two-stage approaches, the idea of region proposals improving the localization of objects for detection is good as well. This is useful, but we have to take into account whether we should generate proposals on feature maps or directly on input images, because this strongly affects the way models run and identify representations of objects. If objects are normal or have a big or medium appearance, models work well; but if objects come at multiple scales, this is a problem to consider and to research deeply in order to balance and improve the performance. Therefore, to partly fix this problem, the one-stage approach allows us to choose a fixed input size for training and testing, but the support still depends on the characteristics of the datasets being evaluated and on the image size. After all, all the models we chose to evaluate are affected by the scales of objects: when we change the scale, the accuracy of the models changes a lot, except for Faster RCNN, the only model that seems stable with respect to scale, especially when combined with the VGG16 architecture. Although the accuracy of VGG16 is not better than that of the other architectures, the difference is that it does not change too much in accuracy. This only holds for big objects whose bounding boxes overlap more than 10% of the image; otherwise, it is not assured.
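One practical way to act on this trade-off is to treat detector selection as maximizing accuracy under a latency budget; the sketch below uses (mAP, inference time) pairs taken from Tables 3 and 5 for the small object dataset.

```python
# Detector choice as accuracy maximization under a latency budget; the
# (mAP %, seconds/image) pairs are copied from Tables 3 and 5.
candidates = [
    ("YOLOv3 608, Darknet-53",             33.1, 0.0331),
    ("RetinaNet, ResNeXT-101-32x8d-FPN",   30.0, 0.229),
    ("Faster RCNN, ResNeXT-101-64x4d-FPN", 41.2, 0.286),
]

def best_under(budget_s):
    feasible = [c for c in candidates if c[2] <= budget_s]
    return max(feasible, key=lambda c: c[1], default=None)

print(best_under(0.05))  # real-time budget -> YOLOv3
print(best_under(0.30))  # offline budget   -> Faster RCNN
```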

Figure 1 shows that the possibility of small objects appearing is higher than that of other objects; the black lens of the camera is somehow similar to the black mouse placed on a mouse pad. This possibility of small object presence causes more difficulties for detectors and leads to wrong detections. Small objects can be anywhere in an image, with the result that detectors produce many wrong detections on familiar appearances they have seen. If we consider the visualization of the detections in Figure 4, the wrong detections are partly similar to the appearance of the other objects in the dataset. This problem is caused by the data imbalance between classes and between the instances in each class, which is originally known as the foreground–foreground class imbalance. In other words, the common problems, which happen not only with small objects but also for whole datasets, are intraclass similarity and interclass variation.
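A first diagnostic for this foreground–foreground imbalance is simply the per-class instance histogram of the annotations, which can then be compared against the per-class APs in Table 3; `annotations` here is a placeholder structure, not the dataset's actual format.

```python
# Per-class share of instances, to be compared with per-class AP: dominant
# classes (e.g., mouse in Table 3) tend to pull the mAP up and to generate
# look-alike false positives on less frequent classes.
from collections import Counter

def class_shares(annotations):  # annotations: list of (image_id, class_label)
    counts = Counter(label for _, label in annotations)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}
```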

6. Conclusion

Small object detection is a challenging and interesting problem in the task of object detection and has drawn attention from researchers thanks to the development of deep learning, which is the motivation for improving the performance of tasks in computer vision. Although deep detection models originally tend to solve problems relating to general object detection, they still work at a particular level for small object detection. As an evaluation work on small object detection with deep models, our goal is to highlight the remarkable achievements of popular and state-of-the-art deep models in order to provide a variety of views on applying deep models to small object detection. Particularly, we evaluate state-of-the-art detectors based on deep learning from the two approaches, namely, YOLOv3, RetinaNet, Fast RCNN, and Faster RCNN, on two datasets, the small object dataset and the subsets filtered from PASCAL VOC, with respect to the effects of different factors, objectively including accuracy, execution time, and resource usage.

In spite of the successful achievements of recent years, in which detection performance has improved significantly, there is still a huge gap in accuracy between normal objects and small objects. Under the criteria of the COCO dataset, the difference from the small scale to the medium and big scales is too large. Most models are good at detecting normal objects, and problems arise when applying them to detect small objects. As a result, to reduce the gap in small object detection, the first thing to do is to invest in datasets which have a massive amount of data to train models and a wide range of categories, so as to compete with the human visual system [12, 34].
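For reference, the COCO criteria mentioned here bin objects by pixel area, with AP reported separately per bin:

```python
# COCO size criteria: objects are binned by pixel area, and AP is reported
# per bin (AP_small / AP_medium / AP_large).
def coco_scale(area_px):
    if area_px < 32 ** 2:
        return "small"
    if area_px < 96 ** 2:
        return "medium"
    return "large"
```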

So far, detection models are divided into two main approaches, namely, the one-stage approach and the two-stage approach. Models in the one-stage approach are known as detectors with faster and more efficient detection in comparison with the other approach. The efficiency here is the potential to run in real time and the ability to apply them to practical applications. However, the trade-off between accuracy and speed is a difficult challenge which needs to be taken into account in order to close the gap. Models in the two-stage approach, in contrast, have a reputation as region-based detectors with high accuracy, but they are too slow to apply to the real world. This drawback comes from the computation of their networks.

Through our evaluation, it is a fact that the architectures utilized as base networks to extract deep features have significant effects on the frameworks. The deeper the architecture is, the higher the detection accuracy is. Once a network increases in depth, it has more layers than normal ones and massive numbers of parameters to train; hence, it needs a lot of data to fine-tune these parameters reasonably. If there is an increase in computation, resource consumption will also increase. As a result, it will be difficult to apply such models in practical applications. Besides, the exploitation of context in these models is definitely limited, which causes much useful and informative data to be ignored in training, especially in the context of small objects. Because small objects can appear anywhere in an input image, if the image's context is well exploited, the performance of small object detection will improve.

For all the above reasons and according to our evaluation, if we aim at good performance and ignore processing speed, two-stage methods like Faster RCNN perform well and demonstrate their network design on different datasets with many contexts of objects, including multiscale objects. Therefore, Faster RCNN is considered a strong baseline to build on or develop from. If our target is a balance of accuracy and speed, YOLO is a good choice in case we do not care about the training time, because its compromise between speed and accuracy is worth applying to practical applications; otherwise, Faster RCNN or RetinaNet is still a substitute to work on. When it comes to backbones, we have to consider the data in order to choose a reasonable backbone to combine with the methods, because the amount of data significantly impacts the model: if data are not abundant, a shallow network will fit them well. Besides, there is recently a novel, promising approach for training deep models with less data, that is, weakly supervised learning such as zero-shot, one-shot, or few-shot learning; these approaches will be considered in our future work. Following our recent research, to obtain better performance on object detection, we have to consider several factors to improve the mAP, such as multiscale training, superresolution for scaling up the visual information of small objects [35], or preprocessing the data to avoid data imbalance, because there is a wide range of imbalance problems relating to data [33].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the Vietnam National University Ho Chi Minh City (VNU-HCM) under grant no. B2017-26-01.

References

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, Columbus, OH, USA, June 2014.

[2] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proceedings of the European Conference on Computer Vision, pp. 346–361, Springer, Zurich, Switzerland, September 2014.

[3] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, Santiago, Chile, December 2015.

[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, Las Vegas, NV, USA, June 2016.

[5] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," 2016, https://arxiv.org/abs/1612.08242.

[6] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.

[7] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, October 2017.

[8] K. Zidek, A. Hosovsky, J. Pitel', and S. Bednar, "Recognition of assembly parts by convolutional neural networks," in Advances in Manufacturing Engineering and Materials, pp. 281–289, Springer, Cham, Switzerland, 2019.

[9] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, IEEE, Venice, Italy, October 2017.

[10] L.-C. Chen, A. Hermans, G. Papandreou et al., "Instance segmentation by refining object detection with semantic and direction features," 2017, https://arxiv.org/abs/1712.04837.

[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.

[12] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft COCO: common objects in context," in Proceedings of the European Conference on Computer Vision, pp. 740–755, Springer, Zurich, Switzerland, September 2014.

[13] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, "R-CNN for small object detection," in Proceedings of the Asian Conference on Computer Vision, pp. 214–230, Springer, Taipei, Taiwan, November 2016.

[14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.

[15] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," in Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS'15), pp. 91–99, MIT Press, Cambridge, MA, USA, 2015, http://dl.acm.org/citation.cfm?id=2969239.2969250.

[16] P. Pham, D. Nguyen, T. Do, T. D. Ngo, and D.-D. Le, "Evaluation of deep models for real-time small object detection," in Proceedings of the International Conference on Neural Information Processing, pp. 516–526, Springer, Guangzhou, China, November 2017.

[17] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.

[18] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu, "Traffic-sign detection and classification in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2110–2118, Las Vegas, NV, USA, June 2016.

[19] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: a large data set for nonparametric object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.

[20] A. Kembhavi, D. Harwood, and L. S. Davis, "Vehicle detection using partial least squares," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1250–1265, 2011.

[21] V. I. Morariu, E. Ahmed, V. Santhanam, D. Harwood, and L. S. Davis, "Composite discriminant factor analysis," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 564–571, IEEE, Steamboat Springs, CO, USA, March 2014.

[22] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 2012.

[23] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: human trajectory prediction in crowded spaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971, Las Vegas, NV, USA, June 2016.

[24] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, "SUN database: exploring a large collection of scene categories," International Journal of Computer Vision, vol. 119, no. 1, pp. 3–22, 2016.

[25] E. Dong, Y. Zhu, Y. Ji, and S. Du, "An improved convolution neural network for object detection using YOLOv2," in Proceedings of the 2018 IEEE International Conference on Mechatronics and Automation (ICMA), pp. 1184–1188, IEEE, Changchun, China, August 2018.

[26] W. Liu, D. Anguelov, D. Erhan et al., "SSD: single shot multibox detector," in Proceedings of the European Conference on Computer Vision, pp. 21–37, Springer, Amsterdam, The Netherlands, October 2016.

[27] T.-Y. Lin, P. Dollar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, July 2017.

[28] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: deconvolutional single shot detector," 2017, https://arxiv.org/abs/1701.06659.

[29] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision, pp. 210–226, Springer, Munich, Germany, September 2018.

[30] L. Liu, W. Ouyang, X. Wang et al., "Deep learning for generic object detection: a survey," 2018, https://arxiv.org/abs/1809.02165.

[31] P. Zhu, L. Wen, X. Bian, L. Haibin, and Q. Hu, "Vision meets drones: a challenge," 2018, https://arxiv.org/abs/1804.07437.

[32] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: a review," 2018, https://arxiv.org/abs/1807.05511.

[33] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas, "Imbalance problems in object detection: a review," 2019, https://arxiv.org/abs/1909.00169.

[34] O. Russakovsky, J. Deng, H. Su et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[35] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 206–221, Munich, Germany, September 2018.

18 Journal of Electrical and Computer Engineering

Page 12: AnEvaluationofDeepLearningMethodsforSmall ObjectDetection

and better than SSD RetinaNet is more stable than SSD andYOLO when the scales are changed (e bigger the objectsare the more the stability is For example the change is somuch about 33 when the scale increases from objects inVOC_MRA_0058 to ones in VOC_MRA_010 andVOC_MRA_020 However this change is not much about10 with bigger objects in comparison with YOLO 15ndash25In case of YOLO this remarkable increase in accuracy whenobjects are larger is obviously good for a model (e changein SSD resembles the change in RetinaNet

Concerning resolutions in YOLO and SSD we see thatwhen image resolution is increased they push the accuracyto improve in general For YOLOv2 with Darknet-19 andYOLOv3 with Darknet-53 and SSD they all have an increasein accuracy when the resolution is large except for YOLOv2

with objects belonging to VOC_MRA_010 andVOC_MRA_020 when the image is over 800 In additionYOLOv2 has a fluctuation with those objects inVOC_WH20 As mentioned in our previous work YOLO isbetter than SSD in those objects less than 10 of the imageshowever in this case YOLOv3 is good at all scales of objects(is is because YOLOv3 has 3 detection locations comingwith more ratios of default boxes and it leads to a significantoutcome when combining results from 3 locations

When we switch to the two-stage approaches FasterRCNN has a significant improvement in most scales ratherthan Fast RCNN except for objects in VOC_MRA_020 whichhave the same accuracy (is shows that if objects are com-pletely separated into different scales the RoI pooling does notwork well with smaller objects and ones in VOC_WH20 In

Table 3 Comparative results on small object dataset

Method Backbone Clock Faucet Jar Mouse Outlet Plate Switch Tel t box t paper mAPYOLO 416 [16]

Darknet-19

228 308 4 52 204 131 13 61 0 353 1939YOLO 448 [16] 23 369 9 525 184 136 175 42 0 343 2013YOLO 480 [16] 342 373 91 533 214 136 158 91 91 342 2371YOLO 512 [16] 231 366 61 598 246 142 157 91 45 324 2261YOLO 554 [16] 234 372 91 601 272 134 199 91 45 345 2384YOLO 640 [16] 202 362 32 598 278 117 181 82 45 356 2253YOLO 800 [16] 276 36 23 602 328 131 233 91 91 267 2402YOLO 1024 [16] 217 293 14 583 264 118 175 91 91 157 2003YOLO 320

Darknet-532622 3838 455 5646 3642 1334 248 1065 455 4296 2583

YOLO 416 2847 4715 1083 6049 4315 1587 3073 1515 262 483 3028YOLO 608 2998 4789 1076 6588 4802 1809 3122 1462 1799 4656 331YOLO 320

ResNet-501957 2573 067 4517 1437 938 1384 909 909 237 1706

YOLO 416 2378 3665 04 5423 1837 1375 1978 984 942 3568 2219YOLO 608 2692 4065 177 6186 2918 1504 2024 1009 1329 3601 255YOLO 320

ResNet-1012052 279 057 4468 1698 1305 1366 966 909 2436 1805

YOLO 416 2572 356 303 5573 224 1561 1726 932 303 3871 2264YOLO 608 2879 4459 942 6218 3334 1553 2388 1324 1583 3917 286YOLO 320

ResNet-1522164 2756 303 4806 1739 1112 1451 909 455 3188 1888

YOLO 416 257 3654 089 5381 206 1413 2021 1149 029 3306 2167YOLO 608 2601 4454 455 61 3176 1302 2267 1235 993 3999 2658SSD300 [16] ResNet-101 55 91 0 255 61 45 0 45 91 182 825SSD300 [16] VGG16 91 171 0 261 91 91 0 45 0 167 916SSD512 [16] VGG16 91 171 0 43 91 91 91 91 0 76 1132RetinaNet ResNet-50-FPN 307 493 2 655 213 161 85 129 1 257 233RetinaNet ResNet-101-FPN 306 487 71 647 20 159 118 107 29 387 251RetinaNet ResNeXT-101-32times 8d-FPN 355 55 121 665 239 184 98 162 94 537 30RetinaNet ResNeXT-101-64times 4d-FPN 314 502 89 663 208 153 94 14 22 324 251R-CNN [13] RPN prop +VGG16 319 313 42 568 311 93 142 164 234 294 248R-CNN [13] Alexnet 7times 300 pro 324 272 51 569 28 98 136 124 179 356 239R-CNN [13] VGG16 7times 300 pro 373 303 72 606 415 158 215 137 22 333 284R-CNN [13] ContextNet (Alexnet 7times) 327 268 46 564 263 99 129 122 187 34 235Fast RCNN ResNet-50-C4 324 463 65 658 383 201 253 166 141 52 317Fast RCNN ResNet-50-FPN 374 473 73 689 467 21 321 171 93 459 333Fast RCNN ResNet-101-FPN 393 503 106 683 471 204 333 186 154 514 355Fast RCNN ResNeXT-101-32times 8d-FPN 475 548 103 718 54 214 344 217 177 535 387Fast RCNN ResNeXT-101-64times 4d-FPN 454 557 109 725 533 24 369 229 16 581 396Faster R-CNN [16] VGG16 2376 3765 803 54 1616 1188 1512 91 625 3729 2192Faster RCNN ResNet-50-C4 322 446 66 659 352 175 257 196 137 40 301Faster RCNN ResNet-50-FPN 357 499 73 684 489 188 296 147 114 533 338Faster RCNN ResNet-101-FPN 398 492 49 682 47 185 297 14 129 522 337Faster RCNN ResNeXT-101-32times 8d-FPN 498 566 114 721 563 232 37 208 188 587 405Faster RCNN ResNeXT-101-64times 4d-FPN 496 586 122 725 545 232 369 208 201 631 412(e values in bold represent the best in one-stage methods and the ones in italics represent the highest in two-stage methods

12 Journal of Electrical and Computer Engineering

addition if we compare with one-stage methods it is signif-icantly lower than them However RoI align along with RPN iswell performed when scales are changedWhen it comes to thebackbones there is a few decrease in accuracy when changingfrom ResNet-50-FPN to ResNet-101-FPN or from ResNeXT-101-32times 8d-FPN to ResNeXT-101-64times 4d-FPN with objectsfrom all scales for both Faster RCNN and Fast RCNN (eVGG16 backbone has an impressive outcome rather thanstrong backbones such as ResNet or ResNeXT Although theaccuracy is less than two strong backbones VGG16 is stillbetter with objects in VOC_WH20 and has a few change inaccuracy when changing objects with big sizes

52 Time Processing and Resource Consumption Tables 5and 6 show us the performance comparison of the eval-uated models with base networks that belong to the modelsGenerally we see that when RAM consumption in testingand training increases more layers are added (is meansthat if the network is more deeper the need of processingalso increases because this leads to the increase in pa-rameters and time to process data as well YOLO is themodel consuming the least memory in both two-phasetraining and testing Particularly YOLO is only from 4G to5G for training and from 16G to 18G for testing withDarknet-53 YOLO is the only one which is able to run inreal time YOLO just needs about 03ms to 04ms toprocess an image in comparison to more than 01 s and 02 s

with Faster RCNN and RetinaNet (is allows us to pick upthese models on devices which own the modest memoryWhile RetinaNet is assigned to the one-stage approach it isnot good enough to meet real-time detection(e inferencetime in Fast RCNN is lower a little bit than Faster RCNNand RetinaNet In contrast the RAM consumption intraining and testing of RetinaNet is lower than Fast RCNNand Faster RCNN Of all architectures the ResNet-50-C4 isthe one requiring the highest memory and time to processdata because the output size of ResNet-50-C4 is bigger a bitthan others [9] If we consider ResNet or ResNeXT com-bined with FPN Faster RCNN is over 100Mb compared toFast RCNN and 300Mb with RetinaNet In additionaccording to Table 2 the number of training days of FasterRCNN and RetinaNet need less time for training only a fewhours to 1 day rather than YOLO 3ndash4 days (is dem-onstrates that if we pay our attention to performance anddo not have much time for training we choose FasterRCNN or RetinaNet instead of YOLO one In contrast ifwe only focus on processing speed and still achieve goodperformance one-stage methods are always the good oneIn the same context of backbones RetinaNet uses a lowerresource than Fast RCNN and Faster RCNN about 100Mband 300Mb for Fast RCNN and Faster RCNN respectivelyin testing time However the training time of RetinaNetuses much memory more than Fast RCNN about 28 G andFaster RCNN about 23 G for ResNeXT-101-32times 8d-FPNand ResNeXT-101-64 times 4d-FPN If we consider this on

(a) (b)

loc ∆ (cx cy w h) conf (c1 c2 hellip cp)

(c)

Figure 4 (e location of the default boxes in different scales (a) image with GT boxes (b) 8times 8 feature map (c) 4times 4 feature map

Journal of Electrical and Computer Engineering 13

Table 4 (e comparative results on subsets of PASCAL VOC 2007

Approach Method VOC_MRA_0058 VOC_MRA_010 VOC_MRA_020 VOC_WH20

One stage

YOLOv2 416 [16] 302 3138 4289 1852YOLOv2 448 [16] 447 329 6015 2196YOLOv2 480 [16] 426 3348 6078 2667YOLOv2 512 [16] 542 3574 6112 2463YOLOv2 544 [16] 697 3656 63 2662YOLOv2 640 [16] 77 3797 6129 2341YOLOv2 800 [16] 1024 373 6191 269YOLOv2 1024 [16] 1069 2993 5514 2897

YOLOv3 320 718 3458 6036 204YOLOv3 416 102 3897 6253 2412YOLOv3 608 117 4265 6856 2886SSD 300 [16] 171 3276 4626 1691SSD 512 [16] 29 4346 5711 1987

RetinaNet-ResNet-50-FPN 884 415 502 2814RetinaNet-ResNet-101-FPN 895 425 519 2746

RetinaNet-ResNeXT-101-32times 8d-FPN 1029 454 545 3008RetinaNet-ResNeXT-101-64times 4d-FPN 1071 455 551 3132

Two stage

Fast RCNN-ResNet-50-C4 023 132 499 393Fast RCNN-ResNet-50-FPN 063 135 556 345Fast RCNN-ResNet-101-FPN 039 159 576 312

Fast RCNN-ResNeXT-101-32times 8d-FPN 051 144 579 333Fast RCNN-ResNeXT-101-64times 4d-FPN 029 142 573 376

Faster RCNN-ResNet-50-C4 698 399 487 2604Faster RCNN-ResNet-50-FPN 1074 456 563 2979Faster RCNN-ResNet-101-FPN 1063 469 576 3057

Faster RCNN-ResNeXT-101-32times 8d-FPN 1164 473 576 3212Faster RCNN-ResNeXT-101-64times 4d-FPN 1054 471 569 3164

Faster RCNN-VGG16 [16] 573 3558 4414 4111(is table illustrates how well models adapt to different scales of objects (e values in bold represent the best in one-stage methods and the ones in italicsrepresent the highest in two-stage methods

(a)

(b)

(c)

Figure 5 Continued

14 Journal of Electrical and Computer Engineering

small object dataset it does not work too much becauseRetinaNet is lower than Faster RCNN about 10 in per-formance Otherwise on different scales of subsets Reti-naNet works well when comparing to Faster RCNN and

the difference is just 2ndash4 percentages Although ResNetbackbones combined with the others yield an improvementin accuracy they do not work for YOLO on small objectdatasets YOLO with Darknet-53 utilizes more resource

Table 5 (e comparison of consumption on small object dataset

Model Backbone Inference time (s) Test RAM (MiB) Train RAM (MiB)YOLOv3 Darknet-53 00331 1825 4759YOLOv3 ResNet-50 0027 1285 3479YOLOv3 ResNet-101 00356 1829 5383YOLOv3 ResNet-152 00454 2443 7531RetinaNet ResNet-50-FPN 0102 2075 4435RetinaNet ResNet-101-FPN 0127 2723 5577RetinaNet ResNeXT-101-32times 8d-FPN 0229 3767 7863RetinaNet ResNeXT-101-64times 4d-FPN 0292 3719 7813Fast RCNN ResNet-50-C4 03 6449 5877Fast RCNN ResNet-50-FPN 0089 2277 4455Fast RCNN ResNet-101-FPN 0113 2947 5627Fast RCNN ResNeXT-101-32times 8d-FPN 0212 3987 4961Fast RCNN ResNeXT-101-64times 4d-FPN 0269 3885 4799Faster RCNN ResNet-50-C4 0412 6609 6129Faster RCNN ResNet-50-FPN 0101 2387 5381Faster RCNN ResNet-101-FPN 0124 3001 6487Faster RCNN ResNeXT-101-32times 8d-FPN 0256 4027 5333Faster RCNN ResNeXT-101-64times 4d-FPN 0286 4003 5246

Table 6 (e comparison of consumption on subsets filtered from PASCAL VOC

Model Backbone Inference time (s) Test RAM (MiB) Train RAM (MiB)YOLOv3 Darknet-53 0027 1645 4079RetinaNet ResNet-50-FPN 01 1935 4133RetinaNet ResNet-101-FPN 0116 2585 5435RetinaNet ResNeXT-101-32times 8d-FPN 0222 3641 7723RetinaNet ResNeXT-101-64times 4d-FPN 0284 3561 7599Fast RCNN ResNet-50-C4 0495 6371 5677Fast RCNN ResNet-50-FPN 0092 2131 4387Fast RCNN ResNet-101-FPN 0114 2819 5463Fast RCNN ResNeXT-101-32times 8d-FPN 0213 3873 4637Fast RCNN ResNeXT-101-64times 4d-FPN 0265 3735 4575Faster RCNN ResNet-50-C4 026 6141 5991Faster RCNN ResNet-50-FPN 01 2245 5207Faster RCNN ResNet-101-FPN 013 2855 6335Faster RCNN ResNeXT-101-32times 8d-FPN 0225 3943 5087Faster RCNN ResNeXT-101-64times 4d-FPN 0276 3885 4909

(d)

Figure 5 Highlight of bounding boxes from comparative backbones on small object dataset We here select YOLO with Darknet-53 andResNet-50 for objective comparison because there have obviously the same layers in their networks along with the significant techniquessuch as skip connections and residual blocks (e bounding boxes show that ResNet-50 has the sensitivity to areas which resembles theobjects of interest than Darknet-53 Similarly ResNet-50-FPN and ResNet-50-C4 are chosen to consider (e detection shows thatcombining ResNet-50 with FPN outputs a better performance rather than the original one Particularly misdetection happens in moredensity than ResNet-50-FPN such as in columns 4 and 5 Zoom in to see more detail

Journal of Electrical and Computer Engineering 15

than ResNet ones but it has the best accuracy amongmodels (erefore we only test YOLO with Darknet-53 insubsets of PASCAL

53 Analyses of the Trade-Off among Detectors Networkdesigns and approaches alike the one-stage approach proveits performance as applying them to detect general objectsboth small scales and other kinds of scales Although they arefast and accurate there is still a drawback always existing inthese models that is the trade-off between accuracy andspeed of processing For example YOLOv3 proposes theidea that performs detection at three different scales and thisresult is obviously impressive and yields good performanceHowever to gain this advantage YOLOv3 has to sacrifice thetime to process Instead of all inputs of the model normallyprocessing one time for detection like YOLOv2 this ideamust work 3 times (is trade-off is also partly affected byresolution as we change it during training or testing ourmodels In our previous work we have mentioned that wehave to choose a right resolution to ensure our models towork properly In case of the two-stage approaches the ideathat proposes region proposals to improve the localization ofobjects to serve for detection is good as well (is is usefulbut we have to take it into the account that we shouldgenerate proposals on feature maps or directly on inputimages because this affects a lot on the way which modelsintend to run and identify representations of objects Ifobjects are normal or have a big or medium appearance it isgood for models to work but if objects are in multiscalesthis is a problem to consider and research deeply in order tobalance the performance as well as improve it (erefore topartly fix this problem the one-stage approach allows us tochoose a fixed size of an input for training and testing butthe support still depends on characteristics of datasets whichwe evaluate or the image size After all all models we chooseto evaluate are affected by the scales of objects when wechange the scale and accuracy of models change a lot exceptfor Faster RCNN the only one model that seems to be stablewith the scale especially when combining with the VGG16architecture Although the accuracy of VGG16 is not betterthan the other architectures the difference here is that it doesnot change too much in accuracy (is is only right for bigobjects having the overlap of the bounding box and theimage greater than 10 if not this is not assured

Figure 1 shows that the possibility of small objects ismore than other objects (e black length of the camera issomehow similar to the black mouse placed on a mouse pad(is possibility of small object presence causes more diffi-culties to detectors and leads to wrong detection Anywherein an image can be small objects it results in a fact thatdetectors have much wrong detection with familiar ap-pearance which they have seen If we consider the visuali-zation of the detection in Figure 4 the wrong detection ispartly similar to the appearance of the other objects in thedataset (is problem is caused by the data imbalance be-tween classes and instances in each class which originally isknown as the foreground-foreground class imbalance In

other words the common problems which not only happenwith small objects but also for whole datasets are theintraclass similarity and interclass variation

6 Conclusion

Small object detection is a challenging and interestingproblem in the task of object detection and has drawn at-tention from researchers thanks to the development of deeplearning which is motivation to improve performance oftasks in computer vision Although deep models belongingto detection originally tend to solve problems relating togeneral object detection they still work at a particular levelto the success of small object detection As evaluation workson small object detection for deep models our goal is tohighlight remarkable achievements of popular and state-of-the-art deep models in order to provide a variety of views asapplying deep models in small object detection Particularlywe evaluate state-of-the-art real-time detectors based ondeep learning from two approaches such as YOLOv3RetinaNet Fast RCNN and Faster RCNN on two datasetsnamely small object dataset and subsets filtered fromPASCAL VOC about effects of different factors objectivelyincluding accuracy execution time and resource usage

In spite of the successful achievements in recent yearsthe performance of detection has improved significantlyand there is still a huge gap in accuracy between normalobjects and small objects In the criteria of the COCOdataset the difference from the small scale to medium andbig scale is too much Most models are good at detection ofnormal objects and problems are going to happen whenapplying them to detect small objects As a result to reducethe gap of small object detection the first thing to do is investdatasets which have massive amount of data to train modelsand have a wide range of categories to compete with thehuman visual system alike [12 34]

So far detection models are divided into two mainapproaches namely one-stage approach and two-stageapproach Models in the one-stage approach is known asdetectors which have better and more efficient detection incomparison to another approach (e efficiency here has thepotential power to run in real time and is able to apply themto practical applications However the trade-off betweenaccuracy and speed is a difficult challenge which needs to betaken into the account in order to balance the gap Howevermodels in the two-stage approach have their reputation ofregion-based detectors which have high accuracy but are toolow in speed to apply them to real world (is drawbackcomes from the computation of networks

(rough our evaluation there is a fact that architectureswhich are utilized as base networks to extract deep featureshave significant effects on frameworks (e deeper the ar-chitecture is the higher the accuracy of detection is Once anetwork has an increase in the depth this means it has morelayers than normal ones and it will have massive parametersto train Hence this needs a lot of data to fine tune theseparameters reasonably If there is an increase in computa-tion resource consumption will also increase As a result it

16 Journal of Electrical and Computer Engineering

will be difficult as we want to take them to apply in practicalapplications Besides the contextual exploit in models isdefinitely limited this results cause ignoring much usefuland informative data in training especially in context ofsmall objects Because small objects are able to appearanywhere in an input image if the image is well-exploitedwith the context the performance of small object detectionwill be improved better

For all above reasons and according to our evalua-tion if we tend to have good performance and ignore thespeed of processing two-stage methods like FasterRCNN are well-performed and demonstrate its networkdesign with the different datasets on many contexts ofobjects including multiscale objects (erefore FasterRCNN is considered as a giant baseline in order to baseon or develop from it If our target has a balance ofaccuracy and speed YOLO is a good one in case we donot care the training time because the sacrifice betweenthe speed and accuracy is worth applying it into practicalapplications Otherwise Faster RCNN or RetinaNet isstill a substitution to work on When it comes tobackbones we have to concern about the data to choose areasonable backbone to combine with the methodsBecause the amount of data will significant impact on themodel if data are not abundant the shallow network willfit it well Besides there is recently a novel approachpromising in training deep models with less data that isweakly supervised learning such as zero-shot one-shotor few-shot learning (erefore these approaches will beconsidered in our future works and following our recentsearching to have better performance on object detec-tion we have to consider several factors to improve themAP such as multiscale training superresolution forscaling up the visual information to small objects [35] orpreprocessing data to avoid the imbalance data becausewe have a wide range of imbalance problems relating todata [33]

Data Availability

(e data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

(e authors declare that they have no conflicts of interest

Acknowledgments

(is research was funded by the Vietnam National Uni-versity HoChiMinh City (VNU-HCM) under grant noB2017-26-01

References

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, Columbus, OH, USA, June 2014.

[2] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proceedings of the European Conference on Computer Vision, pp. 346–361, Springer, Zurich, Switzerland, September 2014.

[3] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, Santiago, Chile, December 2015.

[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, Las Vegas, NV, USA, June 2016.

[5] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," 2016, https://arxiv.org/abs/1612.08242.

[6] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.

[7] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, October 2017.

[8] K. Zidek, A. Hosovsky, J. Pitel', and S. Bednar, "Recognition of assembly parts by convolutional neural networks," in Advances in Manufacturing Engineering and Materials, pp. 281–289, Springer, Cham, Switzerland, 2019.

[9] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, IEEE, Venice, Italy, October 2017.

[10] L.-C. Chen, A. Hermans, G. Papandreou et al., "Instance segmentation by refining object detection with semantic and direction features," 2017, https://arxiv.org/abs/1712.04837.

[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.

[12] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft COCO: common objects in context," in Proceedings of the European Conference on Computer Vision, pp. 740–755, Springer, Zurich, Switzerland, September 2014.

[13] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, "R-CNN for small object detection," in Proceedings of the Asian Conference on Computer Vision, pp. 214–230, Springer, Taipei, Taiwan, November 2016.

[14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.

[15] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," in Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS'15), pp. 91–99, MIT Press, Cambridge, MA, USA, 2015, http://dl.acm.org/citation.cfm?id=2969239.2969250.

[16] P. Pham, D. Nguyen, T. Do, T. D. Ngo, and D.-D. Le, "Evaluation of deep models for real-time small object detection," in Proceedings of the International Conference on Neural Information Processing, pp. 516–526, Springer, Guangzhou, China, November 2017.

[17] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.

[18] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu, "Traffic-sign detection and classification in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2110–2118, Las Vegas, NV, USA, June 2016.

[19] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: a large data set for nonparametric object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.

[20] A. Kembhavi, D. Harwood, and L. S. Davis, "Vehicle detection using partial least squares," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1250–1265, 2011.

[21] V. I. Morariu, E. Ahmed, V. Santhanam, D. Harwood, and L. S. Davis, "Composite discriminant factor analysis," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 564–571, IEEE, Steamboat Springs, CO, USA, March 2014.

[22] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 2012.

[23] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: human trajectory prediction in crowded spaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971, Las Vegas, NV, USA, June 2016.

[24] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, "Sun database: exploring a large collection of scene categories," International Journal of Computer Vision, vol. 119, no. 1, pp. 3–22, 2016.

[25] E. Dong, Y. Zhu, Y. Ji, and S. Du, "An improved convolution neural network for object detection using YOLOv2," in Proceedings of the 2018 IEEE International Conference on Mechatronics and Automation (ICMA), pp. 1184–1188, IEEE, Changchun, China, August 2018.

[26] W. Liu, D. Anguelov, D. Erhan et al., "SSD: single shot multibox detector," in Proceedings of the European Conference on Computer Vision, pp. 21–37, Springer, Amsterdam, The Netherlands, October 2016.

[27] T.-Y. Lin, P. Dollar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, July 2017.

[28] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: deconvolutional single shot detector," 2017, https://arxiv.org/abs/1701.06659.

[29] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision, pp. 210–226, Springer, Munich, Germany, September 2018.

[30] L. Liu, W. Ouyang, X. Wang et al., "Deep learning for generic object detection: a survey," 2018, https://arxiv.org/abs/1809.02165.

[31] P. Zhu, L. Wen, X. Bian, H. Ling, and Q. Hu, "Vision meets drones: a challenge," 2018, https://arxiv.org/abs/1804.07437.

[32] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: a review," 2018, https://arxiv.org/abs/1807.05511.

[33] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas, "Imbalance problems in object detection: a review," 2019, https://arxiv.org/abs/1909.00169.

[34] O. Russakovsky, J. Deng, H. Su et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[35] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 206–221, Munich, Germany, September 2018.



So far detection models are divided into two mainapproaches namely one-stage approach and two-stageapproach Models in the one-stage approach is known asdetectors which have better and more efficient detection incomparison to another approach (e efficiency here has thepotential power to run in real time and is able to apply themto practical applications However the trade-off betweenaccuracy and speed is a difficult challenge which needs to betaken into the account in order to balance the gap Howevermodels in the two-stage approach have their reputation ofregion-based detectors which have high accuracy but are toolow in speed to apply them to real world (is drawbackcomes from the computation of networks

(rough our evaluation there is a fact that architectureswhich are utilized as base networks to extract deep featureshave significant effects on frameworks (e deeper the ar-chitecture is the higher the accuracy of detection is Once anetwork has an increase in the depth this means it has morelayers than normal ones and it will have massive parametersto train Hence this needs a lot of data to fine tune theseparameters reasonably If there is an increase in computa-tion resource consumption will also increase As a result it

16 Journal of Electrical and Computer Engineering

will be difficult as we want to take them to apply in practicalapplications Besides the contextual exploit in models isdefinitely limited this results cause ignoring much usefuland informative data in training especially in context ofsmall objects Because small objects are able to appearanywhere in an input image if the image is well-exploitedwith the context the performance of small object detectionwill be improved better

For all above reasons and according to our evalua-tion if we tend to have good performance and ignore thespeed of processing two-stage methods like FasterRCNN are well-performed and demonstrate its networkdesign with the different datasets on many contexts ofobjects including multiscale objects (erefore FasterRCNN is considered as a giant baseline in order to baseon or develop from it If our target has a balance ofaccuracy and speed YOLO is a good one in case we donot care the training time because the sacrifice betweenthe speed and accuracy is worth applying it into practicalapplications Otherwise Faster RCNN or RetinaNet isstill a substitution to work on When it comes tobackbones we have to concern about the data to choose areasonable backbone to combine with the methodsBecause the amount of data will significant impact on themodel if data are not abundant the shallow network willfit it well Besides there is recently a novel approachpromising in training deep models with less data that isweakly supervised learning such as zero-shot one-shotor few-shot learning (erefore these approaches will beconsidered in our future works and following our recentsearching to have better performance on object detec-tion we have to consider several factors to improve themAP such as multiscale training superresolution forscaling up the visual information to small objects [35] orpreprocessing data to avoid the imbalance data becausewe have a wide range of imbalance problems relating todata [33]

Data Availability

(e data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

(e authors declare that they have no conflicts of interest

Acknowledgments

(is research was funded by the Vietnam National Uni-versity HoChiMinh City (VNU-HCM) under grant noB2017-26-01

References

[1] R Girshick J Donahue T Darrell and J Malik ldquoRich featurehierarchies for accurate object detection and semantic seg-mentationrdquo in Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition pp 580ndash587 ColumbusOH USA June 2014

[2] K He X Zhang S Ren and J Sun ldquoSpatial pyramid poolingin deep convolutional networks for visual recognitionrdquo inProceedings of the European Conference on Computer Visionpp 346ndash361 Springer Zurich Switzerland September 2014

[3] R Girshick ldquoFast R-CNNrdquo in Proceedings of the IEEE In-ternational Conference on Computer Vision pp 1440ndash1448Santiago Chile December 2015

[4] J Redmon S Divvala R Girshick and A Farhadi ldquoYou onlylook once unified real-time object detectionrdquo in Proceedingsof the IEEE Conference on Computer Vision And PatternRecognition pp 779ndash788 Las Vegas NV USA June 2016

[5] J Redmon and A Farhadi ldquoYOLO9000 better fasterstrongerrdquo 2016 httpsarxivorgabs161208242

[6] J Redmon and A Farhadi ldquoYOLOv3 an incremental im-provementrdquo 2018 httpsarxivorgabs180402767

[7] T-Y Lin P Goyal R Girshick K He and P Dollar ldquoFocalloss for dense object detectionrdquo in Proceedings of the IEEEtransactions on pattern analysis and machine intelligenceVenice Italy October 2018

[8] K Zidek A Hosovsky J Pitelrsquo and S Bednar ldquoRecognition ofassembly parts by convolutional neural networksrdquo in Ad-vances in Manufacturing Engineering and Materialspp 281ndash289 Springer Cham Switzerland 2019

[9] K He G Gkioxari P Dollar and R Girshick ldquoMaskR-CNNrdquo in Proceedings of the IEEE International Conferenceon computer vision (ICCV) IEEE Venice Italy pp 2980ndash2988 October 2017

[10] L-C Chen A Hermans G Papandreou et al ldquoInstancesegmentation by refining object detection with semantic anddirection featuresrdquo 2017 httpsarxivorgabs171204837

[11] M Everingham L Van Gool C K I Williams J Winn andA Zisserman ldquo(e pascal visual object classes (VOC) chal-lengerdquo International Journal of Computer Vision vol 88no 2 pp 303ndash338 2010

[12] T-Y Lin M Maire S Belongie et al ldquoMicrosoft COCOcommon objects in contextrdquo in Proceedings of the EuropeanConference on Computer Vision pp 740ndash755 SpringerZurich Switzerland September 2014

[13] C Chen M-Y Liu O Tuzel and J Xiao ldquoR-CNN for smallobject detectionrdquo in Proceedings of the Asian Conference onComputer Vision pp 214ndash230 Springer Taipei TaiwanNovember 2016

[14] P F Felzenszwalb R B Girshick D McAllester andD Ramanan ldquoObject detection with discriminatively trainedpart-based modelsrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 32 no 9 pp 1627ndash1645 2010

[15] S Ren K He R Girshick and J Sun ldquoFaster R-CNN towardsreal-time object detection with region proposal networksrdquo inProceedings of the 28th International Conference on NeuralInformation Processing Systems Ser NIPSrsquo15 pp 91ndash99 MITPress Cambridge MA USA 2015 httpdlacmorgcitationcfmid=29692392969250

[16] P Pham D Nguyen T Do T D Ngo and D-D LeldquoEvaluation of deep models for real-time small object de-tectionrdquo in Proceedings of the International Conference onNeural Information Processing pp 516ndash526 SpringerGuangzhou China November 2017

[17] J R R Uijlings K E A Van De Sande T Gevers andA W M Smeulders ldquoSelective search for object recognitionrdquoInternational Journal of Computer Vision vol 104 no 2pp 154ndash171 2013

[18] Z Zhu D Liang S Zhang X Huang B Li and S HuldquoTraffic-sign detection and classification in the wildrdquo inProceedings of the IEEE Conference on Computer Vision And

Journal of Electrical and Computer Engineering 17

Pattern Recognition pp 2110ndash2118 Vegas NV USA June2016

[19] A Torralba R Fergus and W T Freeman ldquo80 million tinyimages a large data set for nonparametric object and scenerecognitionrdquo IEEE Transactions on Pattern Analysis andMachine Intelligence vol 30 no 11 pp 1958ndash1970 2008

[20] A Kembhavi D Harwood and L S Davis ldquoVehicle detectionusing partial least squaresrdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 33 no 6 pp 1250ndash1265 2011

[21] V I Morariu E Ahmed V Santhanam D Harwood andL S Davis ldquoComposite discriminant factor analysisrdquo inProceedings of the IEEE Winter Conference on Applications ofComputer Vision pp 564ndash571 IEEE Steamboat Springs COUSA March 2014

[22] A Andreas P Lenz and R Urtasun ldquoAre we ready forautonomous driving the kitti vision benchmark suiterdquo inProceedings of the IEEE Conference on Computer Vision AndPattern Recognition Providence RI USA June 2012

[23] A Alahi K Goel V Ramanathan A Robicquet L Fei-Feiand S Savarese ldquoSocial LSTM human trajectory prediction incrowded spacesrdquo in Proceedings of the IEEE Conference onComputer Vision And Pattern Recognition pp 961ndash971 LasVegas NV USA June 2016

[24] J Xiao K A Ehinger J Hays A Torralba and A Oliva ldquoSundatabase exploring a large collection of scene categoriesrdquoInternational Journal of Computer Vision vol 119 no 1pp 3ndash22 2016

[25] E Dong Y Zhu Y Ji and S Du ldquoAn improved convolutionneural network for object detection using YOLOv2rdquoin Pro-ceedings of the 2018 IEEE International Conference onMechatronics and Automation (ICMA) pp 1184ndash1188 IEEEChangchun China August 2018

[26] W Liu D Anguelov D Erhan et al ldquoSingle shot multiboxdetectorrdquo in Proceedings of the European Conference onComputer Vision Springer Amsterdam (e Netherlandspp 21ndash37 October 2016

[27] T-Y Lin P Dollar R B Girshick K He B Hariharan andS J Belongie ldquoFeature pyramid networks for object detec-tionrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR) vol 1 no 2 p 4Honolulu HI USA July 2017

[28] C-Y Fu W Liu A Ranga A Tyagi and A C Berg ldquoDSSDdeconvolutional single shot detectorrdquo 2017 httpsarxivorgabs170106659

[29] Y Bai Y Zhang M Ding and B Ghanem ldquoSOD-MTGANsmall object detection via multi-task generative adversarialnetworkrdquo in Proceedings of the European Conference onComputer Vision pp 210ndash226 Springer Munich GermanySeptember 2018

[30] L Liu W Ouyang X Wang et al ldquoDeep learning for genericobject detection a surveyrdquo 2018 httpsarxivorgabs180902165

[31] P Zhu L Wen X Bian L Haibin and Q Hu ldquoVision meetsdrones a challengerdquo 2018 httpsarxivorgabs180407437

[32] Z-Q Zhao P Zheng S-t Xu and X Wu ldquoObject detectionwith deep learning a reviewrdquo 2018 httpsarxivorgabs180705511

[33] K Oksuz B C Cam S Kalkan and E Akbas ldquoImbalanceproblems in object detection a reviewrdquo 2019 httpsarxivorgabs190900169

[34] O Russakovsky J Deng H Su et al ldquoImagenet large scalevisual recognition challengerdquo International Journal of Com-puter Vision vol 115 no 3 pp 211ndash252 2015

[35] Y Bai Y Zhang M Ding and B Ghanem ldquoSOD-MTGANsmall object detection via multi-task generative adversarialnetworkrdquo in Proceedings of the European Conference onComputer Vision (ECCV) pp 206ndash221 Munich GermanySeptember 2018

18 Journal of Electrical and Computer Engineering

Page 14: AnEvaluationofDeepLearningMethodsforSmall ObjectDetection

Table 4 (e comparative results on subsets of PASCAL VOC 2007

Approach Method VOC_MRA_0058 VOC_MRA_010 VOC_MRA_020 VOC_WH20

One stage

YOLOv2 416 [16] 302 3138 4289 1852YOLOv2 448 [16] 447 329 6015 2196YOLOv2 480 [16] 426 3348 6078 2667YOLOv2 512 [16] 542 3574 6112 2463YOLOv2 544 [16] 697 3656 63 2662YOLOv2 640 [16] 77 3797 6129 2341YOLOv2 800 [16] 1024 373 6191 269YOLOv2 1024 [16] 1069 2993 5514 2897

YOLOv3 320 718 3458 6036 204YOLOv3 416 102 3897 6253 2412YOLOv3 608 117 4265 6856 2886SSD 300 [16] 171 3276 4626 1691SSD 512 [16] 29 4346 5711 1987

RetinaNet-ResNet-50-FPN 884 415 502 2814RetinaNet-ResNet-101-FPN 895 425 519 2746

RetinaNet-ResNeXT-101-32times 8d-FPN 1029 454 545 3008RetinaNet-ResNeXT-101-64times 4d-FPN 1071 455 551 3132

Two stage

Fast RCNN-ResNet-50-C4 023 132 499 393Fast RCNN-ResNet-50-FPN 063 135 556 345Fast RCNN-ResNet-101-FPN 039 159 576 312

Fast RCNN-ResNeXT-101-32times 8d-FPN 051 144 579 333Fast RCNN-ResNeXT-101-64times 4d-FPN 029 142 573 376

Faster RCNN-ResNet-50-C4 698 399 487 2604Faster RCNN-ResNet-50-FPN 1074 456 563 2979Faster RCNN-ResNet-101-FPN 1063 469 576 3057

Faster RCNN-ResNeXT-101-32times 8d-FPN 1164 473 576 3212Faster RCNN-ResNeXT-101-64times 4d-FPN 1054 471 569 3164

Faster RCNN-VGG16 [16] 573 3558 4414 4111(is table illustrates how well models adapt to different scales of objects (e values in bold represent the best in one-stage methods and the ones in italicsrepresent the highest in two-stage methods

(a)

(b)

(c)

Figure 5 Continued

14 Journal of Electrical and Computer Engineering

small object dataset it does not work too much becauseRetinaNet is lower than Faster RCNN about 10 in per-formance Otherwise on different scales of subsets Reti-naNet works well when comparing to Faster RCNN and

the difference is just 2ndash4 percentages Although ResNetbackbones combined with the others yield an improvementin accuracy they do not work for YOLO on small objectdatasets YOLO with Darknet-53 utilizes more resource

Table 5 (e comparison of consumption on small object dataset

Model Backbone Inference time (s) Test RAM (MiB) Train RAM (MiB)YOLOv3 Darknet-53 00331 1825 4759YOLOv3 ResNet-50 0027 1285 3479YOLOv3 ResNet-101 00356 1829 5383YOLOv3 ResNet-152 00454 2443 7531RetinaNet ResNet-50-FPN 0102 2075 4435RetinaNet ResNet-101-FPN 0127 2723 5577RetinaNet ResNeXT-101-32times 8d-FPN 0229 3767 7863RetinaNet ResNeXT-101-64times 4d-FPN 0292 3719 7813Fast RCNN ResNet-50-C4 03 6449 5877Fast RCNN ResNet-50-FPN 0089 2277 4455Fast RCNN ResNet-101-FPN 0113 2947 5627Fast RCNN ResNeXT-101-32times 8d-FPN 0212 3987 4961Fast RCNN ResNeXT-101-64times 4d-FPN 0269 3885 4799Faster RCNN ResNet-50-C4 0412 6609 6129Faster RCNN ResNet-50-FPN 0101 2387 5381Faster RCNN ResNet-101-FPN 0124 3001 6487Faster RCNN ResNeXT-101-32times 8d-FPN 0256 4027 5333Faster RCNN ResNeXT-101-64times 4d-FPN 0286 4003 5246

Table 6 (e comparison of consumption on subsets filtered from PASCAL VOC

Model Backbone Inference time (s) Test RAM (MiB) Train RAM (MiB)YOLOv3 Darknet-53 0027 1645 4079RetinaNet ResNet-50-FPN 01 1935 4133RetinaNet ResNet-101-FPN 0116 2585 5435RetinaNet ResNeXT-101-32times 8d-FPN 0222 3641 7723RetinaNet ResNeXT-101-64times 4d-FPN 0284 3561 7599Fast RCNN ResNet-50-C4 0495 6371 5677Fast RCNN ResNet-50-FPN 0092 2131 4387Fast RCNN ResNet-101-FPN 0114 2819 5463Fast RCNN ResNeXT-101-32times 8d-FPN 0213 3873 4637Fast RCNN ResNeXT-101-64times 4d-FPN 0265 3735 4575Faster RCNN ResNet-50-C4 026 6141 5991Faster RCNN ResNet-50-FPN 01 2245 5207Faster RCNN ResNet-101-FPN 013 2855 6335Faster RCNN ResNeXT-101-32times 8d-FPN 0225 3943 5087Faster RCNN ResNeXT-101-64times 4d-FPN 0276 3885 4909

(d)

Figure 5 Highlight of bounding boxes from comparative backbones on small object dataset We here select YOLO with Darknet-53 andResNet-50 for objective comparison because there have obviously the same layers in their networks along with the significant techniquessuch as skip connections and residual blocks (e bounding boxes show that ResNet-50 has the sensitivity to areas which resembles theobjects of interest than Darknet-53 Similarly ResNet-50-FPN and ResNet-50-C4 are chosen to consider (e detection shows thatcombining ResNet-50 with FPN outputs a better performance rather than the original one Particularly misdetection happens in moredensity than ResNet-50-FPN such as in columns 4 and 5 Zoom in to see more detail

Journal of Electrical and Computer Engineering 15

than ResNet ones but it has the best accuracy amongmodels (erefore we only test YOLO with Darknet-53 insubsets of PASCAL

53 Analyses of the Trade-Off among Detectors Networkdesigns and approaches alike the one-stage approach proveits performance as applying them to detect general objectsboth small scales and other kinds of scales Although they arefast and accurate there is still a drawback always existing inthese models that is the trade-off between accuracy andspeed of processing For example YOLOv3 proposes theidea that performs detection at three different scales and thisresult is obviously impressive and yields good performanceHowever to gain this advantage YOLOv3 has to sacrifice thetime to process Instead of all inputs of the model normallyprocessing one time for detection like YOLOv2 this ideamust work 3 times (is trade-off is also partly affected byresolution as we change it during training or testing ourmodels In our previous work we have mentioned that wehave to choose a right resolution to ensure our models towork properly In case of the two-stage approaches the ideathat proposes region proposals to improve the localization ofobjects to serve for detection is good as well (is is usefulbut we have to take it into the account that we shouldgenerate proposals on feature maps or directly on inputimages because this affects a lot on the way which modelsintend to run and identify representations of objects Ifobjects are normal or have a big or medium appearance it isgood for models to work but if objects are in multiscalesthis is a problem to consider and research deeply in order tobalance the performance as well as improve it (erefore topartly fix this problem the one-stage approach allows us tochoose a fixed size of an input for training and testing butthe support still depends on characteristics of datasets whichwe evaluate or the image size After all all models we chooseto evaluate are affected by the scales of objects when wechange the scale and accuracy of models change a lot exceptfor Faster RCNN the only one model that seems to be stablewith the scale especially when combining with the VGG16architecture Although the accuracy of VGG16 is not betterthan the other architectures the difference here is that it doesnot change too much in accuracy (is is only right for bigobjects having the overlap of the bounding box and theimage greater than 10 if not this is not assured

Figure 1 shows that the possibility of small objects ismore than other objects (e black length of the camera issomehow similar to the black mouse placed on a mouse pad(is possibility of small object presence causes more diffi-culties to detectors and leads to wrong detection Anywherein an image can be small objects it results in a fact thatdetectors have much wrong detection with familiar ap-pearance which they have seen If we consider the visuali-zation of the detection in Figure 4 the wrong detection ispartly similar to the appearance of the other objects in thedataset (is problem is caused by the data imbalance be-tween classes and instances in each class which originally isknown as the foreground-foreground class imbalance In

other words the common problems which not only happenwith small objects but also for whole datasets are theintraclass similarity and interclass variation

6 Conclusion

Small object detection is a challenging and interestingproblem in the task of object detection and has drawn at-tention from researchers thanks to the development of deeplearning which is motivation to improve performance oftasks in computer vision Although deep models belongingto detection originally tend to solve problems relating togeneral object detection they still work at a particular levelto the success of small object detection As evaluation workson small object detection for deep models our goal is tohighlight remarkable achievements of popular and state-of-the-art deep models in order to provide a variety of views asapplying deep models in small object detection Particularlywe evaluate state-of-the-art real-time detectors based ondeep learning from two approaches such as YOLOv3RetinaNet Fast RCNN and Faster RCNN on two datasetsnamely small object dataset and subsets filtered fromPASCAL VOC about effects of different factors objectivelyincluding accuracy execution time and resource usage

In spite of the successful achievements in recent yearsthe performance of detection has improved significantlyand there is still a huge gap in accuracy between normalobjects and small objects In the criteria of the COCOdataset the difference from the small scale to medium andbig scale is too much Most models are good at detection ofnormal objects and problems are going to happen whenapplying them to detect small objects As a result to reducethe gap of small object detection the first thing to do is investdatasets which have massive amount of data to train modelsand have a wide range of categories to compete with thehuman visual system alike [12 34]

So far detection models are divided into two mainapproaches namely one-stage approach and two-stageapproach Models in the one-stage approach is known asdetectors which have better and more efficient detection incomparison to another approach (e efficiency here has thepotential power to run in real time and is able to apply themto practical applications However the trade-off betweenaccuracy and speed is a difficult challenge which needs to betaken into the account in order to balance the gap Howevermodels in the two-stage approach have their reputation ofregion-based detectors which have high accuracy but are toolow in speed to apply them to real world (is drawbackcomes from the computation of networks

(rough our evaluation there is a fact that architectureswhich are utilized as base networks to extract deep featureshave significant effects on frameworks (e deeper the ar-chitecture is the higher the accuracy of detection is Once anetwork has an increase in the depth this means it has morelayers than normal ones and it will have massive parametersto train Hence this needs a lot of data to fine tune theseparameters reasonably If there is an increase in computa-tion resource consumption will also increase As a result it

16 Journal of Electrical and Computer Engineering

will be difficult as we want to take them to apply in practicalapplications Besides the contextual exploit in models isdefinitely limited this results cause ignoring much usefuland informative data in training especially in context ofsmall objects Because small objects are able to appearanywhere in an input image if the image is well-exploitedwith the context the performance of small object detectionwill be improved better

For all above reasons and according to our evalua-tion if we tend to have good performance and ignore thespeed of processing two-stage methods like FasterRCNN are well-performed and demonstrate its networkdesign with the different datasets on many contexts ofobjects including multiscale objects (erefore FasterRCNN is considered as a giant baseline in order to baseon or develop from it If our target has a balance ofaccuracy and speed YOLO is a good one in case we donot care the training time because the sacrifice betweenthe speed and accuracy is worth applying it into practicalapplications Otherwise Faster RCNN or RetinaNet isstill a substitution to work on When it comes tobackbones we have to concern about the data to choose areasonable backbone to combine with the methodsBecause the amount of data will significant impact on themodel if data are not abundant the shallow network willfit it well Besides there is recently a novel approachpromising in training deep models with less data that isweakly supervised learning such as zero-shot one-shotor few-shot learning (erefore these approaches will beconsidered in our future works and following our recentsearching to have better performance on object detec-tion we have to consider several factors to improve themAP such as multiscale training superresolution forscaling up the visual information to small objects [35] orpreprocessing data to avoid the imbalance data becausewe have a wide range of imbalance problems relating todata [33]

Data Availability

(e data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

(e authors declare that they have no conflicts of interest

Acknowledgments

(is research was funded by the Vietnam National Uni-versity HoChiMinh City (VNU-HCM) under grant noB2017-26-01

References

[1] R Girshick J Donahue T Darrell and J Malik ldquoRich featurehierarchies for accurate object detection and semantic seg-mentationrdquo in Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition pp 580ndash587 ColumbusOH USA June 2014

[2] K He X Zhang S Ren and J Sun ldquoSpatial pyramid poolingin deep convolutional networks for visual recognitionrdquo inProceedings of the European Conference on Computer Visionpp 346ndash361 Springer Zurich Switzerland September 2014

[3] R Girshick ldquoFast R-CNNrdquo in Proceedings of the IEEE In-ternational Conference on Computer Vision pp 1440ndash1448Santiago Chile December 2015

[4] J Redmon S Divvala R Girshick and A Farhadi ldquoYou onlylook once unified real-time object detectionrdquo in Proceedingsof the IEEE Conference on Computer Vision And PatternRecognition pp 779ndash788 Las Vegas NV USA June 2016

[5] J Redmon and A Farhadi ldquoYOLO9000 better fasterstrongerrdquo 2016 httpsarxivorgabs161208242

[6] J Redmon and A Farhadi ldquoYOLOv3 an incremental im-provementrdquo 2018 httpsarxivorgabs180402767

[7] T-Y Lin P Goyal R Girshick K He and P Dollar ldquoFocalloss for dense object detectionrdquo in Proceedings of the IEEEtransactions on pattern analysis and machine intelligenceVenice Italy October 2018

[8] K Zidek A Hosovsky J Pitelrsquo and S Bednar ldquoRecognition ofassembly parts by convolutional neural networksrdquo in Ad-vances in Manufacturing Engineering and Materialspp 281ndash289 Springer Cham Switzerland 2019

[9] K He G Gkioxari P Dollar and R Girshick ldquoMaskR-CNNrdquo in Proceedings of the IEEE International Conferenceon computer vision (ICCV) IEEE Venice Italy pp 2980ndash2988 October 2017

[10] L-C Chen A Hermans G Papandreou et al ldquoInstancesegmentation by refining object detection with semantic anddirection featuresrdquo 2017 httpsarxivorgabs171204837

[11] M Everingham L Van Gool C K I Williams J Winn andA Zisserman ldquo(e pascal visual object classes (VOC) chal-lengerdquo International Journal of Computer Vision vol 88no 2 pp 303ndash338 2010

[12] T-Y Lin M Maire S Belongie et al ldquoMicrosoft COCOcommon objects in contextrdquo in Proceedings of the EuropeanConference on Computer Vision pp 740ndash755 SpringerZurich Switzerland September 2014

[13] C Chen M-Y Liu O Tuzel and J Xiao ldquoR-CNN for smallobject detectionrdquo in Proceedings of the Asian Conference onComputer Vision pp 214ndash230 Springer Taipei TaiwanNovember 2016

[14] P F Felzenszwalb R B Girshick D McAllester andD Ramanan ldquoObject detection with discriminatively trainedpart-based modelsrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 32 no 9 pp 1627ndash1645 2010

[15] S Ren K He R Girshick and J Sun ldquoFaster R-CNN towardsreal-time object detection with region proposal networksrdquo inProceedings of the 28th International Conference on NeuralInformation Processing Systems Ser NIPSrsquo15 pp 91ndash99 MITPress Cambridge MA USA 2015 httpdlacmorgcitationcfmid=29692392969250

[16] P Pham D Nguyen T Do T D Ngo and D-D LeldquoEvaluation of deep models for real-time small object de-tectionrdquo in Proceedings of the International Conference onNeural Information Processing pp 516ndash526 SpringerGuangzhou China November 2017

[17] J R R Uijlings K E A Van De Sande T Gevers andA W M Smeulders ldquoSelective search for object recognitionrdquoInternational Journal of Computer Vision vol 104 no 2pp 154ndash171 2013

[18] Z Zhu D Liang S Zhang X Huang B Li and S HuldquoTraffic-sign detection and classification in the wildrdquo inProceedings of the IEEE Conference on Computer Vision And

Journal of Electrical and Computer Engineering 17

Pattern Recognition pp 2110ndash2118 Vegas NV USA June2016

[19] A Torralba R Fergus and W T Freeman ldquo80 million tinyimages a large data set for nonparametric object and scenerecognitionrdquo IEEE Transactions on Pattern Analysis andMachine Intelligence vol 30 no 11 pp 1958ndash1970 2008

[20] A Kembhavi D Harwood and L S Davis ldquoVehicle detectionusing partial least squaresrdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 33 no 6 pp 1250ndash1265 2011

[21] V I Morariu E Ahmed V Santhanam D Harwood andL S Davis ldquoComposite discriminant factor analysisrdquo inProceedings of the IEEE Winter Conference on Applications ofComputer Vision pp 564ndash571 IEEE Steamboat Springs COUSA March 2014

[22] A Andreas P Lenz and R Urtasun ldquoAre we ready forautonomous driving the kitti vision benchmark suiterdquo inProceedings of the IEEE Conference on Computer Vision AndPattern Recognition Providence RI USA June 2012

[23] A Alahi K Goel V Ramanathan A Robicquet L Fei-Feiand S Savarese ldquoSocial LSTM human trajectory prediction incrowded spacesrdquo in Proceedings of the IEEE Conference onComputer Vision And Pattern Recognition pp 961ndash971 LasVegas NV USA June 2016

[24] J Xiao K A Ehinger J Hays A Torralba and A Oliva ldquoSundatabase exploring a large collection of scene categoriesrdquoInternational Journal of Computer Vision vol 119 no 1pp 3ndash22 2016

[25] E Dong Y Zhu Y Ji and S Du ldquoAn improved convolutionneural network for object detection using YOLOv2rdquoin Pro-ceedings of the 2018 IEEE International Conference onMechatronics and Automation (ICMA) pp 1184ndash1188 IEEEChangchun China August 2018

[26] W Liu D Anguelov D Erhan et al ldquoSingle shot multiboxdetectorrdquo in Proceedings of the European Conference onComputer Vision Springer Amsterdam (e Netherlandspp 21ndash37 October 2016

[27] T-Y Lin P Dollar R B Girshick K He B Hariharan andS J Belongie ldquoFeature pyramid networks for object detec-tionrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR) vol 1 no 2 p 4Honolulu HI USA July 2017

[28] C-Y Fu W Liu A Ranga A Tyagi and A C Berg ldquoDSSDdeconvolutional single shot detectorrdquo 2017 httpsarxivorgabs170106659

[29] Y Bai Y Zhang M Ding and B Ghanem ldquoSOD-MTGANsmall object detection via multi-task generative adversarialnetworkrdquo in Proceedings of the European Conference onComputer Vision pp 210ndash226 Springer Munich GermanySeptember 2018

[30] L Liu W Ouyang X Wang et al ldquoDeep learning for genericobject detection a surveyrdquo 2018 httpsarxivorgabs180902165

[31] P Zhu L Wen X Bian L Haibin and Q Hu ldquoVision meetsdrones a challengerdquo 2018 httpsarxivorgabs180407437

[32] Z-Q Zhao P Zheng S-t Xu and X Wu ldquoObject detectionwith deep learning a reviewrdquo 2018 httpsarxivorgabs180705511

[33] K Oksuz B C Cam S Kalkan and E Akbas ldquoImbalanceproblems in object detection a reviewrdquo 2019 httpsarxivorgabs190900169

[34] O Russakovsky J Deng H Su et al ldquoImagenet large scalevisual recognition challengerdquo International Journal of Com-puter Vision vol 115 no 3 pp 211ndash252 2015

[35] Y Bai Y Zhang M Ding and B Ghanem ldquoSOD-MTGANsmall object detection via multi-task generative adversarialnetworkrdquo in Proceedings of the European Conference onComputer Vision (ECCV) pp 206ndash221 Munich GermanySeptember 2018

18 Journal of Electrical and Computer Engineering

Page 15: AnEvaluationofDeepLearningMethodsforSmall ObjectDetection

small object dataset it does not work too much becauseRetinaNet is lower than Faster RCNN about 10 in per-formance Otherwise on different scales of subsets Reti-naNet works well when comparing to Faster RCNN and

the difference is just 2ndash4 percentages Although ResNetbackbones combined with the others yield an improvementin accuracy they do not work for YOLO on small objectdatasets YOLO with Darknet-53 utilizes more resource

Table 5 (e comparison of consumption on small object dataset

Model Backbone Inference time (s) Test RAM (MiB) Train RAM (MiB)YOLOv3 Darknet-53 00331 1825 4759YOLOv3 ResNet-50 0027 1285 3479YOLOv3 ResNet-101 00356 1829 5383YOLOv3 ResNet-152 00454 2443 7531RetinaNet ResNet-50-FPN 0102 2075 4435RetinaNet ResNet-101-FPN 0127 2723 5577RetinaNet ResNeXT-101-32times 8d-FPN 0229 3767 7863RetinaNet ResNeXT-101-64times 4d-FPN 0292 3719 7813Fast RCNN ResNet-50-C4 03 6449 5877Fast RCNN ResNet-50-FPN 0089 2277 4455Fast RCNN ResNet-101-FPN 0113 2947 5627Fast RCNN ResNeXT-101-32times 8d-FPN 0212 3987 4961Fast RCNN ResNeXT-101-64times 4d-FPN 0269 3885 4799Faster RCNN ResNet-50-C4 0412 6609 6129Faster RCNN ResNet-50-FPN 0101 2387 5381Faster RCNN ResNet-101-FPN 0124 3001 6487Faster RCNN ResNeXT-101-32times 8d-FPN 0256 4027 5333Faster RCNN ResNeXT-101-64times 4d-FPN 0286 4003 5246

Table 6 (e comparison of consumption on subsets filtered from PASCAL VOC

Model Backbone Inference time (s) Test RAM (MiB) Train RAM (MiB)YOLOv3 Darknet-53 0027 1645 4079RetinaNet ResNet-50-FPN 01 1935 4133RetinaNet ResNet-101-FPN 0116 2585 5435RetinaNet ResNeXT-101-32times 8d-FPN 0222 3641 7723RetinaNet ResNeXT-101-64times 4d-FPN 0284 3561 7599Fast RCNN ResNet-50-C4 0495 6371 5677Fast RCNN ResNet-50-FPN 0092 2131 4387Fast RCNN ResNet-101-FPN 0114 2819 5463Fast RCNN ResNeXT-101-32times 8d-FPN 0213 3873 4637Fast RCNN ResNeXT-101-64times 4d-FPN 0265 3735 4575Faster RCNN ResNet-50-C4 026 6141 5991Faster RCNN ResNet-50-FPN 01 2245 5207Faster RCNN ResNet-101-FPN 013 2855 6335Faster RCNN ResNeXT-101-32times 8d-FPN 0225 3943 5087Faster RCNN ResNeXT-101-64times 4d-FPN 0276 3885 4909

(d)

Figure 5 Highlight of bounding boxes from comparative backbones on small object dataset We here select YOLO with Darknet-53 andResNet-50 for objective comparison because there have obviously the same layers in their networks along with the significant techniquessuch as skip connections and residual blocks (e bounding boxes show that ResNet-50 has the sensitivity to areas which resembles theobjects of interest than Darknet-53 Similarly ResNet-50-FPN and ResNet-50-C4 are chosen to consider (e detection shows thatcombining ResNet-50 with FPN outputs a better performance rather than the original one Particularly misdetection happens in moredensity than ResNet-50-FPN such as in columns 4 and 5 Zoom in to see more detail

Journal of Electrical and Computer Engineering 15

than ResNet ones but it has the best accuracy amongmodels (erefore we only test YOLO with Darknet-53 insubsets of PASCAL

53 Analyses of the Trade-Off among Detectors Networkdesigns and approaches alike the one-stage approach proveits performance as applying them to detect general objectsboth small scales and other kinds of scales Although they arefast and accurate there is still a drawback always existing inthese models that is the trade-off between accuracy andspeed of processing For example YOLOv3 proposes theidea that performs detection at three different scales and thisresult is obviously impressive and yields good performanceHowever to gain this advantage YOLOv3 has to sacrifice thetime to process Instead of all inputs of the model normallyprocessing one time for detection like YOLOv2 this ideamust work 3 times (is trade-off is also partly affected byresolution as we change it during training or testing ourmodels In our previous work we have mentioned that wehave to choose a right resolution to ensure our models towork properly In case of the two-stage approaches the ideathat proposes region proposals to improve the localization ofobjects to serve for detection is good as well (is is usefulbut we have to take it into the account that we shouldgenerate proposals on feature maps or directly on inputimages because this affects a lot on the way which modelsintend to run and identify representations of objects Ifobjects are normal or have a big or medium appearance it isgood for models to work but if objects are in multiscalesthis is a problem to consider and research deeply in order tobalance the performance as well as improve it (erefore topartly fix this problem the one-stage approach allows us tochoose a fixed size of an input for training and testing butthe support still depends on characteristics of datasets whichwe evaluate or the image size After all all models we chooseto evaluate are affected by the scales of objects when wechange the scale and accuracy of models change a lot exceptfor Faster RCNN the only one model that seems to be stablewith the scale especially when combining with the VGG16architecture Although the accuracy of VGG16 is not betterthan the other architectures the difference here is that it doesnot change too much in accuracy (is is only right for bigobjects having the overlap of the bounding box and theimage greater than 10 if not this is not assured

Figure 1 shows that the possibility of small objects ismore than other objects (e black length of the camera issomehow similar to the black mouse placed on a mouse pad(is possibility of small object presence causes more diffi-culties to detectors and leads to wrong detection Anywherein an image can be small objects it results in a fact thatdetectors have much wrong detection with familiar ap-pearance which they have seen If we consider the visuali-zation of the detection in Figure 4 the wrong detection ispartly similar to the appearance of the other objects in thedataset (is problem is caused by the data imbalance be-tween classes and instances in each class which originally isknown as the foreground-foreground class imbalance In

other words the common problems which not only happenwith small objects but also for whole datasets are theintraclass similarity and interclass variation

6 Conclusion

Small object detection is a challenging and interestingproblem in the task of object detection and has drawn at-tention from researchers thanks to the development of deeplearning which is motivation to improve performance oftasks in computer vision Although deep models belongingto detection originally tend to solve problems relating togeneral object detection they still work at a particular levelto the success of small object detection As evaluation workson small object detection for deep models our goal is tohighlight remarkable achievements of popular and state-of-the-art deep models in order to provide a variety of views asapplying deep models in small object detection Particularlywe evaluate state-of-the-art real-time detectors based ondeep learning from two approaches such as YOLOv3RetinaNet Fast RCNN and Faster RCNN on two datasetsnamely small object dataset and subsets filtered fromPASCAL VOC about effects of different factors objectivelyincluding accuracy execution time and resource usage

In spite of the successes of recent years, in which detection performance has improved significantly, there is still a large accuracy gap between normal objects and small objects. Under the criteria of the COCO dataset, the difference between the small scale and the medium and large scales is substantial. Most models detect normal objects well, and problems arise when they are applied to small objects. As a result, to reduce the gap in small object detection, the first step is to invest in datasets with massive amounts of training data and a wide range of categories, so as to approach the human visual system [12, 34].
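For reference, the COCO criteria mentioned above define object sizes by pixel area: small below 32^2, medium between 32^2 and 96^2, and large above 96^2. COCO measures segmentation-mask area; the short sketch below uses bounding-box area as a common approximation:

    def coco_size_bucket(box):
        """Classify an (x, y, w, h) box by COCO's area thresholds."""
        area = box[2] * box[3]
        if area < 32 ** 2:
            return "small"       # area < 1024 px^2
        if area < 96 ** 2:
            return "medium"      # 1024 <= area < 9216 px^2
        return "large"

    print(coco_size_bucket((10, 10, 20, 25)))    # small  (500 px^2)
    print(coco_size_bucket((0, 0, 100, 100)))    # large  (10000 px^2)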

So far, detection models are divided into two main approaches, namely, the one-stage approach and the two-stage approach. Models in the one-stage approach are known as detectors with faster and more efficient inference than the other approach; this efficiency gives them the potential to run in real time and to be applied in practical applications. However, their trade-off between accuracy and speed is a difficult challenge that must be taken into account in order to close the gap. Models in the two-stage approach, by contrast, have a reputation as region-based detectors with high accuracy but speeds too low for real-world application, a drawback that stems from the computation of their networks.

Through our evaluation, it is clear that the architectures used as base networks to extract deep features have a significant effect on the frameworks: the deeper the architecture, the higher the detection accuracy. Once a network grows deeper, it has more layers than usual and therefore many more parameters to train; this demands a lot of data to fine-tune those parameters reasonably. An increase in computation also increases resource consumption, which makes such models difficult to deploy in practical applications. Besides, the exploitation of context in these models is limited, which causes much useful and informative data to be ignored during training, especially in the context of small objects. Because small objects can appear anywhere in an input image, exploiting the image context well would further improve small object detection.
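As a rough illustration of the parameter counts at stake (assuming torchvision is installed; the choice of backbones is ours), they can be listed in a few lines. Note that depth is not the only driver of size: VGG16, though shallower, carries large fully connected layers.

    import torchvision

    for name, ctor in [("vgg16", torchvision.models.vgg16),
                       ("resnet50", torchvision.models.resnet50),
                       ("resnet101", torchvision.models.resnet101)]:
        # Sum the element counts of all weight tensors in the backbone.
        n_params = sum(p.numel() for p in ctor().parameters())
        print(f"{name:>10}: {n_params / 1e6:.1f}M parameters")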

For all the above reasons, and according to our evaluation, if the goal is good performance regardless of processing speed, two-stage methods such as Faster RCNN perform well and validate their network design across different datasets and many object contexts, including multiscale objects. Faster RCNN can therefore be considered a strong baseline to build on or develop from. If the target is a balance of accuracy and speed, YOLO is a good choice when training time is not a concern, because its sacrifice between speed and accuracy makes it worth applying in practical applications; otherwise, Faster RCNN or RetinaNet remains a viable substitute. When it comes to backbones, the data must be considered in order to choose a reasonable backbone to combine with a method, because the amount of data significantly impacts the model: if data are not abundant, a shallower network will fit them better. Besides, weakly supervised learning, such as zero-shot, one-shot, or few-shot learning, is a promising recent approach to training deep models with less data, and we will consider these approaches in future work. Following our recent survey, several factors should be considered to improve the mAP, such as multiscale training (sketched below), super-resolution to scale up the visual information of small objects [35], or preprocessing the data to avoid imbalance, since a wide range of imbalance problems relate to data [33].
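The following is a minimal sketch of the multiscale-training idea referenced above: each batch is resized to a randomly chosen resolution so the network sees objects at many scales. The sizes are illustrative, and in a real detection pipeline the ground-truth boxes must be rescaled by the same factor.

    import random
    import torch
    import torch.nn.functional as F

    def multiscale_batch(images, sizes=(320, 416, 512, 608)):
        """Resize an (N, C, H, W) batch to one randomly picked square size."""
        side = random.choice(sizes)
        return F.interpolate(images, size=(side, side),
                             mode="bilinear", align_corners=False)

    batch = torch.rand(4, 3, 416, 416)
    print(multiscale_batch(batch).shape)   # e.g. torch.Size([4, 3, 608, 608])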

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the Vietnam National University Ho Chi Minh City (VNU-HCM) under grant no. B2017-26-01.

References

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, Columbus, OH, USA, June 2014.

[2] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proceedings of the European Conference on Computer Vision, pp. 346–361, Springer, Zurich, Switzerland, September 2014.

[3] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, Santiago, Chile, December 2015.

[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, Las Vegas, NV, USA, June 2016.

[5] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," 2016, https://arxiv.org/abs/1612.08242.

[6] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.

[7] T.-Y. Lin, P. Dollar, R. Girshick, K. He, and P. Dollar, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, October 2017.

[8] K. Zidek, A. Hosovsky, J. Pitel', and S. Bednar, "Recognition of assembly parts by convolutional neural networks," in Advances in Manufacturing Engineering and Materials, pp. 281–289, Springer, Cham, Switzerland, 2019.

[9] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, IEEE, Venice, Italy, October 2017.

[10] L.-C. Chen, A. Hermans, G. Papandreou et al., "Instance segmentation by refining object detection with semantic and direction features," 2017, https://arxiv.org/abs/1712.04837.

[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.

[12] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft COCO: common objects in context," in Proceedings of the European Conference on Computer Vision, pp. 740–755, Springer, Zurich, Switzerland, September 2014.

[13] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, "R-CNN for small object detection," in Proceedings of the Asian Conference on Computer Vision, pp. 214–230, Springer, Taipei, Taiwan, November 2016.

[14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.

[15] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," in Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS'15), pp. 91–99, MIT Press, Cambridge, MA, USA, 2015, http://dl.acm.org/citation.cfm?id=2969239.2969250.

[16] P. Pham, D. Nguyen, T. Do, T. D. Ngo, and D.-D. Le, "Evaluation of deep models for real-time small object detection," in Proceedings of the International Conference on Neural Information Processing, pp. 516–526, Springer, Guangzhou, China, November 2017.

[17] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.

[18] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu, "Traffic-sign detection and classification in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2110–2118, Las Vegas, NV, USA, June 2016.

[19] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: a large data set for nonparametric object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.

[20] A. Kembhavi, D. Harwood, and L. S. Davis, "Vehicle detection using partial least squares," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1250–1265, 2011.

[21] V. I. Morariu, E. Ahmed, V. Santhanam, D. Harwood, and L. S. Davis, "Composite discriminant factor analysis," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 564–571, IEEE, Steamboat Springs, CO, USA, March 2014.

[22] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 2012.

[23] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: human trajectory prediction in crowded spaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971, Las Vegas, NV, USA, June 2016.

[24] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, "SUN database: exploring a large collection of scene categories," International Journal of Computer Vision, vol. 119, no. 1, pp. 3–22, 2016.

[25] E. Dong, Y. Zhu, Y. Ji, and S. Du, "An improved convolution neural network for object detection using YOLOv2," in Proceedings of the 2018 IEEE International Conference on Mechatronics and Automation (ICMA), pp. 1184–1188, IEEE, Changchun, China, August 2018.

[26] W. Liu, D. Anguelov, D. Erhan et al., "Single shot multibox detector," in Proceedings of the European Conference on Computer Vision, pp. 21–37, Springer, Amsterdam, The Netherlands, October 2016.

[27] T.-Y. Lin, P. Dollar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, July 2017.

[28] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: deconvolutional single shot detector," 2017, https://arxiv.org/abs/1701.06659.

[29] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision, pp. 210–226, Springer, Munich, Germany, September 2018.

[30] L. Liu, W. Ouyang, X. Wang et al., "Deep learning for generic object detection: a survey," 2018, https://arxiv.org/abs/1809.02165.

[31] P. Zhu, L. Wen, X. Bian, H. Ling, and Q. Hu, "Vision meets drones: a challenge," 2018, https://arxiv.org/abs/1804.07437.

[32] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: a review," 2018, https://arxiv.org/abs/1807.05511.

[33] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas, "Imbalance problems in object detection: a review," 2019, https://arxiv.org/abs/1909.00169.

[34] O. Russakovsky, J. Deng, H. Su et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[35] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "SOD-MTGAN: small object detection via multi-task generative adversarial network," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 206–221, Munich, Germany, September 2018.
