Research Article

Learning Feature Fusion in Deep Learning-Based Object Detector

Ehtesham Hassan,1 Yasser Khalil,2 and Imtiaz Ahmad3

1 Department of Computer Science and Engineering, Kuwait College of Science and Technology, Kuwait City, Kuwait
2 University of Ottawa, Ottawa, Canada
3 Department of Computer Engineering, Kuwait University, Kuwait City, Kuwait

Correspondence should be addressed to Ehtesham Hassan; [email protected]

Received 24 August 2019; Revised 12 April 2020; Accepted 27 April 2020; Published 22 May 2020

Academic Editor: Kevser Dincer

Copyright © 2020 Ehtesham Hassan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Journal of Engineering, Volume 2020, Article ID 7286187, 11 pages. https://doi.org/10.1155/2020/7286187

Object detection in real images is a challenging problem in computer vision. Despite several advancements in detection and recognition techniques, robust and accurate localization of objects of interest in images from real-life scenarios remains unsolved because of the difficulties posed by intraclass and interclass variations, occlusion, lighting, and scale changes at different levels. In this work, we present an object detection framework based on the learning-based fusion of handcrafted features with deep features. Deep features characterize different regions of interest in a test image with a rich set of statistical features. Our hypothesis is to reinforce these features with handcrafted features by learning the optimal fusion during network training. Our detection framework is based on the recent version of the YOLO object detection architecture. Experimental evaluation on the PASCAL-VOC and MS-COCO datasets achieved detection rate increases of 11.4% and 1.9% on the mAP scale in comparison with the YOLO version-3 detector (Redmon and Farhadi, 2018). An important step in the proposed learning-based feature fusion strategy is to correctly identify the layer for feeding in the new features. The present work shows a qualitative approach to identify the best layer for fusion and design steps for feeding in the additional feature sets in convolutional network-based detectors.

1. Introduction

Object detection in natural scenes is an important problem that drives many real-life applications. In the last few decades, a significant amount of research effort has gone into the understanding of the object detection problem with general as well as domain-specific challenges [1-5]. As a result, several novel and innovative object detection methods have been developed. Despite continuous efforts from researchers from diverse backgrounds, the present state of the art is far from satisfactory, as seen in recent results on the standard datasets PASCAL-VOC [6] and MS-COCO [7].

Most recent works on object detection have invariably applied convolutional neural network (CNN)-based detector models. These models are learned end-to-end, addressing feature extraction, parameter learning, and postprocessing in a one- or at most two-stage training process. These methods have significantly improved the state of the art in object detection. The representational capability of features extracted by a CNN depends on the complexity of the detection problem at testing time and on the variability captured within the training dataset. In real-world applications, a target object's appearance undergoes significant variations due to view angles, lighting, background clutter, and occlusions. Handcrafted features that are designed with domain understanding have often been shown to be more distinctive and reliable in many situations. An intelligent fusion of both modalities of features is expected to achieve better detection performance. In this work, we propose a feature-enhanced CNN-based object detection framework built on the learning-based fusion of handcrafted features with deep features in the embedding space. We use fundamental color channels (RGB, HSV, and LUV) in combination with gradient and orientation histograms to enhance deep features for overcoming the prediction errors caused by appearance variations.


To effectively combine handcrafted features with deep learning-based features, we also investigate the problem of identifying the optimum layer at which to inject the handcrafted features for feature fusion. Our detection framework is based on the state-of-the-art YOLO [8] architecture, where we inject the handcrafted features at the appropriate layer(s) to regularize the network weights during the training procedure. We hypothesize that fusing handcrafted features during network learning guides the CNN to extract a more accurate and robust feature set by exploiting the complementary information available in the handcrafted feature set. The major contributions of this paper are summarized as follows:

(1) We propose a novel object detection approach based on CNN learning that fuses simple feature descriptors, such as color channels and gradient histograms [9], through learning-based fusion to train a more robust and accurate detection model.

(2) We use the latest YOLO object detection architecture as our base and describe the proposed learning-based feature fusion strategy, which uses a qualitative approach to select the best layer for feature injection. This constitutes a novel strategy for fusing features in CNN-based object detectors.

(3) We demonstrate the comparative performance of the proposed feature fusion approach on the PASCAL-VOC and MS-COCO datasets, which shows a significant improvement over the original YOLO object detection rate.

The paper is organized as follows. Section 2 presents the relevant previous research works. Section 3 revisits the object detection problem. We briefly discuss the latest YOLO architecture and integral channel features in Section 4. Section 5 discusses the details and implementation issues of the proposed object detection method. Experimental evaluation of the proposed object detection framework on different datasets is presented in Section 6. We conclude the paper in Section 7, presenting the perspective of our work and future research directions.

2. Related Works

Our work in this paper and the proposed approach therein are related to CNN-based object detection using vision. There has been a significant amount of work in this research area; a thorough review of the literature on object detection and CNN-based classification and regression is beyond the scope of this paper. The following discussion reviews the prominent and critical works in the context of the proposed approach for object detection.

Object detection in natural scenes can be pursued using different approaches: model-based, learning-based, and auxiliary [10]. Model-based methods depend on heuristically determined models using color, shape, and intensity attributes. The object characterization in such methods is also referred to as handcrafted features, where experts engineer the features that best describe representations in images. Learning-based methods are those where features of objects are automatically learned by classifiers and are later used for detection. Last, the auxiliary approach is where the location of objects is used as prior information for visualizing them. The auxiliary approach is expensive to deploy, as the infrastructure of objects needs to be replaced for it to be effective. Prior knowledge of the object's location reduces the computational search cost and improves accuracy; furthermore, it helps in eliminating a large number of false positives.

The last few decades of advancements in the state of the art in object detection have seen novel designs of handcrafted features such as the histogram of oriented gradients (HOG) [11], the scale-invariant feature transform (SIFT) [12], and pyramid HOG [13]. These methods used simple discriminative classifiers by scanning the image space or by feature matching. Viola and Jones [14] used integral images, which can be computed quickly, for feature extraction in face detection. These works generate fixed- or variable-size object descriptors as a set of local feature vectors. In [15], the authors proposed local-contour-based features with two-stage partially supervised learning for object detection. Among the earliest works on learning-based object detection, the authors in [16] proposed learning a sparse part-based object representation. In [17], the authors introduced multiple-component learning for parts-based object detection by automatic learning and combination of individual classifiers. The authors in [9] subsequently combined HOG with heterogeneous color channels for pedestrian detection. These features, widely known as integral channel features (ICF), again utilize integral images for fast feature computation, which is an important requirement for image-scan-based object detection.

In [18], the authors proposed to learn the structure between different object parts using an SVM formulation for modeling object structures at multiple scales with mixtures of deformable part models. An extensive review of pedestrian detection using handcrafted features has been presented in [1]. In [19], visual attention-based modeling is used for salient object detection via a bootstrap learning model. In [20], the authors proposed regionlets, the integration of different types of features computed locally; these features were used to model an object class with a cascaded boosting classifier.

With the recent rise of CNN-based learning methods, several CNN-based object detection models have been proposed. Girshick et al. [21] proposed region-based convolutional neural networks (R-CNN) as object detection models. The R-CNN model trained independent components for generating region proposals as bounding boxes using selective-search-based image scanning and for classifying the region proposals into one of the object categories. The model was further improved as Fast R-CNN [22]; nevertheless, these models were slow and difficult to optimize. Faster R-CNN [3] alleviated the difficulties of its previous versions by replacing selective search with a region proposal network (RPN), which is trained as a single neural network with Fast R-CNN, sharing the convolutional feature set. The RPN component of Faster R-CNN guides the unified network to look into different regions of interest. Although Faster R-CNN is hitherto considered the most robust and accurate object detector, these models still lack real-time performance because proposals are first generated and subsequently labeled with the known categories.


Dai et al. [23] extended the shared fully convolutional network architecture originally proposed for image segmentation [24] to a two-stage detection strategy comprising region proposal and region classification. The end-to-end learning of the network constructs a set of position-sensitive score maps to deal with the translation variance of target objects. The scores are generated by a bank of convolutional layers that encodes relative spatial position information, incorporating translation variance into the learning.

Some recent works on object detection also followed a regression-based approach, which includes the Single Shot Multibox Detector (SSD) [25], the Deconvolutional Single Shot Detector (DSSD) [26], and You Only Look Once (YOLO) [8]. SSD adopts the Multibox approach for bounding box regression by using a set of default boxes for predicting the shape offset and category confidence at each location in the image. The authors used VGG-16 as the base network architecture, where the network combines predictions from feature maps with different resolutions. DSSD further improved SSD by shifting to the "encoder-decoder" type of network to incorporate context information in the learning and by replacing the VGG-16 architecture with Residual-101.

YOLO [27], unlike its predecessors, trained a single regressor network for direct prediction of object bounding boxes with category confidences. This approach of complete regression, without any classification step, yielded superior real-time performance in terms of detection speed. YOLO also bettered detection accuracy in comparison with all other detection systems except Faster R-CNN. With the incorporation of anchor boxes for guiding the bounding box prediction in a newly designed network, YOLOv2 bettered all other existing state of the art in the PASCAL VOC-2007 challenge. The latest YOLO (from here onwards we refer to this version as YOLOv3) claims to improve its performance with the incorporation of a much deeper network architecture combining predictions at multiple scales. The YOLO algorithms see the entire image all at once through the forward pass of the network, which gives more accurate and comprehensive information. This helps the detector avoid false positives better than classification-based detection systems, which focus only on the region proposals. Despite remarkable progress in object detection in natural images due to deep learning-based strategies, the performance of existing models requires improvement from the accuracy and real-time performance standpoints. The latest results on the PASCAL VOC challenge [6] and MS-COCO [7] datasets establish that more effort is required to solve the detection problem.

In addition, several CNN-based detection models have been designed for specific target objects [28-30]. These methods have raised the benchmark in target object detection; nevertheless, the problem is far from being solved [5, 31-33]. An important distinction in specific object detection is the availability of auxiliary information sources or user feedback, which has helped researchers to solve the problem up to an acceptable limit.

In this work we propose a feature enriched object de-tector based on the latest YOLO architecture Our choice of

YOLO architecture is based on its comparative detectionaccuracy and superior detection rate It is expected that CNNbased detection model should be able to learn the object-specific salient features required for detection neverthelessit cannot be guaranteed in single objective-based learningformulation (is direction of research exploring learning-based enrichment of deep features by including additionalmodalities has been not been explored for object detectionIn this context the recent work on foreground extraction forvideo analysis [34] a deep neural network based frameworkexploits on the multistage fusion of the combination ofresidual features from different convolutional layers On theother hand our approach is to feed in the handcraftedfeatures in the network to guide the feature learning processleading towards more accurate detection For learning-basedfeature fusion we use ICF descriptor which has shownremarkable object detection performance as discussedabove

3. Revisiting the Object Detection Problem

Object detection consists of localizing the region capturing the object and assigning its label by processing the localized region. The localization problem can be formulated as (i) a regression problem that predicts the regions of interest or (ii) a binary classification problem focusing on the detection of foreground regions containing object segments. The outcome of either of these methods is a predicted object boundary given as a set of four coordinates (xmin, xmax, ymin, and ymax), as shown in Figure 1.

Label identification of the object region is an N-class classification problem, where N denotes the number of classes, for example, person, bird, dog, bicycle, cat, car, or tree. In both stages of the object detection problem, challenges arise because of the underlying variations in lighting conditions, scaling, occlusion, partial views, and orientation. Figure 2 shows some such challenging situations from the COCO dataset.
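The boundary coordinates above also give the usual overlap criterion used to score a localization, namely, the Intersection over Union (IoU) between a predicted box and a ground-truth box. The following minimal sketch is our illustration, not code from the paper, and the example coordinates are hypothetical.

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(ix_max - ix_min, 0.0) * max(iy_max - iy_min, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Example: a predicted box against a slightly shifted ground-truth annotation.
print(iou((105, 221, 187, 407), (110, 230, 190, 400)))  # about 0.83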

4. Preliminaries

Before presenting the object detection methodology using learning-based feature fusion, we briefly discuss the integral channel features and YOLOv3's architecture. We also make some modifications to the YOLOv3 architecture as proposed in [35], which are also described in this section.

4.1. You Only Look Once Object Detector (YOLOv3). The class of YOLO algorithms [8, 27, 36] looks at the entire image when detecting and recognizing objects and extracts deep information about classes and their appearance, unlike other approaches such as sliding-window-based methods or R-CNN-based algorithms. These algorithms treat the detection of objects as a single regression problem, giving a faster response with a reduction in the design complexity of the detector. Despite the significant speed achievement, the algorithms lag in terms of accuracy, especially with small objects.


The latest algorithm in the YOLO family, that is, YOLOv3, has proven its performance compared to other state-of-the-art detectors. The YOLOv3 architecture has 107 layers in total, distributed as 75 convolutional, 4 route, 23 residual, 2 upsample, and 3 detection layers. The architecture uses a new model for feature extraction referred to as Darknet-53. The new model is significantly larger than the models used in the earlier versions; nevertheless, it has proven to be more efficient than other state of the arts. The model uses 53 convolutional layers and takes an input image of size 416 × 416. Figure 3 shows the architecture of the YOLOv3 object detector. The Darknet-53 model is pretrained on ImageNet [37]. For the detection task, the network is modified by removing its last layer and then stacking up additional layers, resulting in the final network architecture. The first 75 layers in the network represent 52 convolutional layers of the Darknet-53 model pretrained on ImageNet. The remaining 32 layers are added to qualify YOLOv3 for object detection on different datasets with further training. Besides, YOLOv3 applies new residual layers, similar to skip connections, that combine feature maps from two layers using elementwise addition, resulting in finer-grained information.

YOLOv3 replaces the softmax-based activation used in older versions with independent logistic classifiers. Features are extracted using a concept similar to feature pyramid networks. Also, binary cross-entropy loss is now used for class predictions, which is beneficial when faced with images having overlapping labels. K-means is used for anchor box generation; however, 9 bounding boxes are now used rather than 5, and they are distributed equally across the three detection scales. In the present design, a route layer is also used, which outputs a layer's feature maps.
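As an illustration of the anchor generation step, the sketch below clusters ground-truth (width, height) pairs into 9 anchors with a plain Euclidean k-means; this is a simplification under assumed stand-in data, since the YOLO papers cluster with an IoU-based distance.

import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster (width, height) pairs into k anchor boxes."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        dist = ((wh[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dist.argmin(axis=1)
        centers = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j)
                            else centers[j] for j in range(k)])
    return centers[np.argsort(centers.prod(axis=1))]   # sorted by area, 3 per detection scale

wh = np.abs(np.random.randn(1000, 2)) * 100 + 20        # stand-in box sizes in pixels
print(kmeans_anchors(wh).round(1))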

4.1.1. Prediction. YOLOv3 processes images by dividing them into an N × N grid; if the center of an object falls in a grid cell, then that cell is responsible for detecting the object. The network predicts bounding boxes at three different scales. The first detection scale is used for detecting large-sized objects, the second detection scale for medium-sized objects, and the last for small objects. The three detection layers are displayed in red in Figure 3. Each cell predicts B bounding boxes, and each prediction has 5 values: x, y, w, h, and a confidence score representing the measure of the prediction containing an object. The x and y parameters are the center of the box relative to the grid cell, while w and h are the width and height of the predicted box relative to the entire image. The confidence score is the Intersection over Union (IoU) between the predicted box and the ground truth. The output prediction is an N × N × B × (5 + C) tensor, where 5 is the number of predictions per bounding box (x, y, w, h, and the confidence value) and C is the total number of object categories. Figure 4 shows the prediction algorithm of YOLOv3 for N × N cells, where each cell has B bounding boxes.
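A minimal sketch of reading out such a prediction tensor is given below; the grid size, number of anchors, class count, and threshold are illustrative assumptions, and the tensor is filled with random values rather than real network output.

import numpy as np

N, B, C = 13, 3, 80                      # assumed grid size, anchors per cell, classes
pred = np.random.rand(N, N, B, 5 + C)    # x, y, w, h, objectness, then class scores

conf_thresh = 0.5
detections = []
for i in range(N):
    for j in range(N):
        for b in range(B):
            x, y, w, h, obj = pred[i, j, b, :5]
            if obj < conf_thresh:
                continue                  # keep only confident boxes
            cls = int(np.argmax(pred[i, j, b, 5:]))
            detections.append((i, j, x, y, w, h, obj, cls))

print(len(detections), "candidate boxes above the confidence threshold")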

Figure 1: Object detection in an example image. The detection algorithm maps the input image to detection results, listing box coordinates and labels (105 221 187 407 dog; 156 501 124 341 bicycle; 480 587 84 176 truck).

Figure 2: Example images from the MS-COCO dataset.


4.1.2. Loss Function. YOLOv3 includes a loss function, given in equation (1), that instructs the network to correctly predict bounding boxes and accurately classify the detected objects, with a provision to penalize false positives:

\lambda_{\text{coord}} \sum_{i=0}^{N^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]
+ \lambda_{\text{coord}} \sum_{i=0}^{N^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]
+ \sum_{i=0}^{N^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2
+ \lambda_{\text{noobj}} \sum_{i=0}^{N^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2
- \sum_{i=0}^{N^2} \sum_{j=0}^{B} \left[ \delta_i \log\left(\hat{p}_i(c)\right) + (1 - \delta_i) \log\left(1 - \hat{p}_i(c)\right) \right]    (1)

The symbols are explained in Table 1; the symbols under a hat represent the corresponding predicted values.

Table 1: Symbols used in the YOLOv3 loss function.
N: number of grid cells in an image
B: number of anchor boxes
C_i: confidence score of the jth bounding box in grid cell i
p_i(c): conditional probability of class c in grid cell i
x_i, y_i, w_i, h_i: location and size of a bounding box
1^obj_i: 1 if an object appears in grid cell i, 0 otherwise
1^obj_ij: 1 if an object is present in the ith grid cell and the jth "responsible" bounding box, 0 otherwise
1^noobj_ij: 1 if no object is present in the ith grid cell and the jth "responsible" bounding box, 0 otherwise
δ_i: 1 if the predicted label matches the ground-truth label, 0 otherwise
λ_coord: constant (default 5.0)
λ_noobj: constant (default 0.5)

Figure 3: YOLOv3 network architecture. (a) Convolutional layers in YOLOv3. (b) Detection layers in YOLOv3.

The loss function in equation (1) has three error components: localization, confidence, and classification. The different loss components are combined in a sum-squared form, as this is easier to optimize. The localization loss is responsible for minimizing the error between the "responsible" bounding box and the ground-truth object if an object is detected in a grid cell.
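For concreteness, the sketch below evaluates a loss of the same form as equation (1) with dense numpy arrays; the shapes, mask construction, and example values are our assumptions and do not reproduce the Darknet implementation.

import numpy as np

def yolo_loss(pred, truth, obj_mask, noobj_mask, p_hat, delta,
              lambda_coord=5.0, lambda_noobj=0.5):
    """pred, truth: (S, S, B, 5) arrays holding x, y, w, h, confidence.
    obj_mask, noobj_mask: (S, S, B) indicator arrays (1^obj_ij, 1^noobj_ij).
    p_hat: (S, S, B) predicted class probability; delta: (S, S, B) label-match indicator."""
    xy = np.sum(obj_mask * ((pred[..., 0] - truth[..., 0]) ** 2 +
                            (pred[..., 1] - truth[..., 1]) ** 2))
    wh = np.sum(obj_mask * ((np.sqrt(pred[..., 2]) - np.sqrt(truth[..., 2])) ** 2 +
                            (np.sqrt(pred[..., 3]) - np.sqrt(truth[..., 3])) ** 2))
    conf = np.sum(obj_mask * (pred[..., 4] - truth[..., 4]) ** 2)
    noobj = np.sum(noobj_mask * (pred[..., 4] - truth[..., 4]) ** 2)
    cls = -np.sum(delta * np.log(p_hat) + (1.0 - delta) * np.log(1.0 - p_hat))
    return lambda_coord * (xy + wh) + conf + lambda_noobj * noobj + cls

# Toy example with S = 13 grid cells per side and B = 3 boxes per cell.
S, B = 13, 3
pred = np.random.rand(S, S, B, 5)
truth = np.random.rand(S, S, B, 5)
obj = (np.random.rand(S, S, B) > 0.9).astype(np.float32)
p_hat = np.clip(np.random.rand(S, S, B), 1e-6, 1 - 1e-6)
delta = (np.random.rand(S, S, B) > 0.5).astype(np.float32)
print(yolo_loss(pred, truth, obj, 1.0 - obj, p_hat, delta))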

4.2. Integral Channel Features (ICF). In this work, we fuse the different channels of information proposed in ICF [9] into the YOLOv3 architecture. The ICF descriptor is a series of image channels computed from the input image using linear and nonlinear transformations, where a channel refers to a representation of the input image. Next, first-order features are extracted by computing sums over rectangular regions. The evaluated channels were color channels (RGB, Gray, HSV, and LUV), gradient magnitude, and gradient histograms. The authors in [9] reported that the most informative channel in their independent evaluation for pedestrian detection was HOG, and that a combination of LUV, gradient, and HOG gave the best detection rate. Our proposed work reevaluates the combination of channels because of the varying challenges in the present application scenario. In our work, the window size parameter for the HOG computation depends on the 2D dimensions of the deep features with which they are to be fused (discussed in Section 4.1). As an example, if the 2D dimensions of the layer where features are to be fused are 13 × 13 and the input image is of size 416 × 416, then the window size will be 32 × 32, so that it results in 13 × 13 HOGs. For each window, we compute the HOG using six bins. Other HOG parameters, including the cell and block sizes used for normalization purposes, are also set equal to the window size. The block stride parameter is also set equal to the window size, so there is no overlap between two windows. As shown in Figure 5, the ICF for a given image is the linear concatenation of the individual color channels, after normalization, with the corresponding HOG features. The preprocessing in the feature computation includes resizing the input image to align it with the network layer that receives this input. For the experiments discussed in this paper, we follow the same steps and parameters for HOG computation as discussed in [9].
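The sketch below computes one plausible version of these channels and pools them onto the fusion grid; the exact transforms, bin edges, and normalization here are assumptions made for illustration rather than the precise parameters of [9].

import cv2
import numpy as np

def icf_channels(bgr, grid=13, input_size=416, n_bins=6):
    """Return a (channels, grid, grid) array of pooled ICF-style channels."""
    img = cv2.resize(bgr, (input_size, input_size)).astype(np.float32) / 255.0
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    luv = cv2.cvtColor(img, cv2.COLOR_BGR2LUV)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned gradient orientation

    cell = input_size // grid                         # e.g., 416 // 13 = 32-pixel windows
    def pool(ch):                                     # average over each cell x cell window
        return ch.reshape(grid, cell, grid, cell).mean(axis=(1, 3))

    chans = [pool(img[..., k]) for k in range(3)]     # color planes (loaded as BGR)
    chans += [pool(gray)]
    chans += [pool(hsv[..., k]) for k in range(3)]
    chans += [pool(luv[..., k]) for k in range(3)]
    chans += [pool(mag)]
    for b in range(n_bins):                           # 6-bin orientation histogram (HOG-like)
        sel = (ang >= b * np.pi / n_bins) & (ang < (b + 1) * np.pi / n_bins)
        chans.append((mag * sel).reshape(grid, cell, grid, cell).sum(axis=(1, 3)))
    return np.stack(chans)                            # 17 channels for the settings above

# Usage (hypothetical file name): icf = icf_channels(cv2.imread("example.jpg"), grid=13)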

5. Learning-Based Feature Fusion in YOLOv3

Before describing the proposed feature fusion, we set out the required steps, which are described subsequently:

(1) Identify the candidate convolutional layers for feature injection.

(2) Evaluate the layers for feature injection using the complete set of ICF channels.

(3) Evaluate different combinations of ICF channels using the best network layer obtained in the previous step.

(4) Train and test the detector, injecting the best ICF channel combination at the best layer position.

Figure 4: YOLOv3 prediction output for an example image; cell(i, j) corresponds to the image region within the ith row and jth column of the N × N grid.

Our proposal for feature fusion starts with identifying a location and space for the handcrafted features within the YOLOv3 architecture. For injecting an additional set of features into this architecture, we first need to identify the convolutional layer that will receive the additional information. We adopt a qualitative approach to decide on the layer for feature fusion by validating feature fusion at different layers. The depth of a convolutional layer represents the number of feature maps that hold the deep features, hereafter referred to as filters. We need to identify the specific number of filters within the specific layer that will store the additional features. First, we propose to double the original number of filters in the specific layer, as shown in Figure 6.

Depending on the number of channels in the ICF set, the same number of filters needs to be reserved for storing the handcrafted feature values. The ICF descriptor is injected into the network at the beginning of the training procedure. In the layer selected for ICF injection, the additional filters (introduced by doubling the layer depth) are filled with the ICF descriptor from the rear end, as shown in Figure 6; the remaining filters are set to zero in the beginning. As the training progresses, the layer weights, that is, the filter values, are updated following the learning algorithm. The output generated by this layer depends on the values of all filters. The proposed method for feature fusion diverges from the simple approach of stacking handcrafted features with the original filters in the selected layer. The extra filters, in addition to the filters used for storing the ICF descriptor values, help regularize the layer weights. We maintain the strategy of doubling the filters for feature fusion regardless of the specific convolutional layer under validation. An example showing the procedure of doubling the number of filters at a specific layer to make space for ICF fusion is illustrated in Figure 6. In this example, fusion was selected to occur at the 62nd layer of YOLOv3's network. The top side of the figure shows a section of the original YOLOv3 network, and the bottom side shows the 62nd layer with its number of filters doubled, from 1024 to 2048. The light orange denotes the new filter depth (2048) and consists of the original filter depth (1024) concatenated with the additional filters of the same depth, represented in green. Additionally, doubling the number of filters in the fusion layer poses fewer design challenges than direct stacking of handcrafted features onto the filters.
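A minimal numpy sketch of this injection step is shown below, with shapes following the Figure 6 example (1024 original filters doubled to 2048 and 17 ICF channels written into the rearmost maps); the function name, zero initialization of the extra maps, and random inputs are our illustrative assumptions.

import numpy as np

def inject_icf(deep_maps, icf, doubled_depth=2048):
    """deep_maps: (1024, 13, 13) activations of the selected layer;
    icf: (17, 13, 13) handcrafted channels computed for the same image."""
    fused = np.zeros((doubled_depth, *deep_maps.shape[1:]), dtype=deep_maps.dtype)
    fused[:deep_maps.shape[0]] = deep_maps          # original deep feature maps
    fused[-icf.shape[0]:] = icf                     # ICF copied in from the rear end
    return fused                                    # remaining extra maps stay zero

deep_maps = np.random.rand(1024, 13, 13).astype(np.float32)
icf = np.random.rand(17, 13, 13).astype(np.float32)
print(inject_icf(deep_maps, icf).shape)             # (2048, 13, 13)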

5.1. Design Issues with Injection of the ICF Descriptor. To fuse handcrafted features with deep features, their 2D dimensions have to match the width and height dimensions of the layer. Considering the example in Figure 6, at the 62nd layer, if the input image size is 416 × 416, then the filter dimensions will be 13 × 13. Hence, all channels selected for fusion should be of size 13 × 13 to fit into the additional filters. In the testing phase of YOLOv3, the size of the input image is fixed at 416 × 416, so the dimensions of the ICF descriptor at the 62nd layer will always be 13 × 13. On the other hand, in the training phase of YOLOv3, the size of the input image changes based on a random parameter every 10 iterations, where the size of the image is defined as a multiple of 32, starting from 320 × 320 up to 608 × 608. If the input size is 320 × 320 or 608 × 608, then at the 62nd layer the filter size becomes 10 × 10 or 19 × 19, respectively; the filter size is the input image size downscaled by a factor of 32. After making sure that the handcrafted features are of the right size, their values can replace the additional filters in the chosen layer.
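In other words, the ICF grid must track the multiscale training schedule, since the spatial size of the fusion layer is the input size divided by the network stride of 32. A short check of this relation:

# The fusion layer's spatial size is the input size divided by the stride (32),
# so the ICF descriptor has to be recomputed whenever the training scale changes.
for size in range(320, 609, 32):           # multiscale training sizes 320, 352, ..., 608
    print(size, "->", size // 32)           # 320 -> 10, 416 -> 13, 608 -> 19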

Figure 5: Flowchart for the ICF computation for an example image (preprocessing; Gray, HSV, LUV, gradient magnitude, and HOG channels; normalization; and concatenation into the ICF handcrafted feature vector).

The flowchart of the proposed object detector is shown in Figure 7. For training purposes, the desired ICF channels of the input images at different scales are calculated offline; these images at different scales are used to incorporate variability into the training process. In the testing phase, the ICF channels are calculated only once for each image at a fixed scale. The right side of the figure shows that, as our proposed detector is being trained or tested, fusion takes place at the layer named "nth YOLOv3 convolutional layer". The doubled number of filters and the number of ICF channels are denoted 2x and y, respectively. The figure shows how the ICF overlaps with a portion of the deep features residing in some of the additional filters.

6. Experimental Evaluation

A preliminary evaluation of the proposed framework is performed on the PASCAL-VOC dataset (VOC2007 detection task). The original YOLOv3 detector achieved an mAP of 67.5% on the testing set with an input image dimension of 416 × 416. The YOLOv3 implementation made available by [35] is used for benchmarking our experimental results. As mentioned in Section 5, we first identify the convolutional layer for injecting the handcrafted features and the required space (i.e., the number of filters). First, we determine the best layer location at which to fuse the ICF channels; subsequently, by another set of experiments, we determine the best combination of ICF channels. We separate a validation set comprising one-tenth of the training examples by random selection. The validation set is used for testing the convolutional layers and combinations of ICF channels for optimum selection. For determining the best layer for fusion, we use all channels of information in ICF, as used in the original work by Dollár et al. [9]. This is done by doubling the number of filters at the layer under evaluation, fusing the 17 channels of the ICF set, and checking the resulting mAP. We evaluated the two convolutional layers preceding a detection layer in the YOLOv3 architecture. For each experiment, the network is trained for 25,000 iterations; the remaining training parameters are set as in the original YOLOv3 detector. We performed the first evaluation on the 80th layer, which is the last convolutional layer before the detection layer.

Doubling the number of filters at the 80th layer and injecting the full set of ICF channels resulted in an mAP of 69.8%. In comparison with the YOLOv3-based benchmark performance, we achieved an increase of 1.7% on the mAP scale. The next experiment, on the 79th layer, achieved an mAP of 71.5%, which is more effective than the previous experiment. Table 2 lists the mAP scores for the experiments conducted to determine the best layer position for feature fusion. As observed, the best mAP score of 71.5% is achieved at the 79th layer.

Table 2: VOC dataset detection rate with fusion of the complete set of ICF channels.
Layer number    Number of filters after doubling    mAP (%)
59              512                                 66.1
60              1024                                68.4
62              2048                                66.2
77              1024                                69.8
78              2048                                70.9
79              1024                                71.5
80              2048                                69.8

After determining the most suitable layer for ICF fusion, we conducted experiments to discover the combination of ICF channels most suitable for the object detection task on the VOC dataset. Again, in this set of experiments, we fuse ICF channels by doubling the number of filters at the 79th layer. In this experiment, we run the training process for 50,000 iterations. Fusing all ICF channels (color, gradient magnitude, and HOG) achieved an mAP of 77.7%; the entire ICF descriptor summed up to 17 channels, considering 6 channels for HOG. Removing the RGB channels from the 17 ICF channels increased the mAP to 78.2%, a minuscule increase of 0.5% over the preceding experiment. Considering that the original input images are in RGB color values, the increase is not significant. We also explored other channel combinations; the corresponding mAP scores are presented in Table 3.

Table 3: VOC dataset detection rate with fusion of ICF channel combinations at the 79th layer.
ICF channel combination                     Number of channels    mAP (%)
RGB + Gray + HSV + LUV + grad + HOG         17                    77.7
Gray + HSV + LUV + grad + HOG               14                    77.4
HSV + LUV + grad + HOG                      13                    78.6
Gray + LUV + grad + HOG                     11                    78.0
LUV + grad + HOG                            10                    79.1
Gray + HSV + grad + HOG                     11                    76.8
Gray + HSV + LUV                            7                     76.2
Gray + HSV + LUV + HOG                      13                    77.1

As seen, the best ICF channel combination for the VOC2007 detection task consists of Gray, HSV, LUV, gradient magnitude, and HOG, achieving a maximum mAP score of 79.1%. We retrain the YOLOv3 detector on the complete training set, fusing the finalized channels of the ICF descriptor. The network achieved an mAP of 78.7% on the testing set, which is 11.2% more than the detection rate achieved by the original YOLOv3 detector.

Figure 6: ICF descriptor injection in the specific convolutional layer.

6.1. Evaluation on the MS-COCO Dataset. The MS-COCO dataset consists of 82,783 training, 40,504 validation, and 40,775 testing images belonging to 80 categories. In the literature, YOLOv3 [8] reported an mAP of 55.3% on the COCO dataset for an input image size of 416 × 416. All experiments reported in this work were executed on an NVIDIA GeForce GTX 1080 GPU desktop, which is modest in comparison with the infrastructure used in the original paper [8]. Therefore, we adopt an alternate strategy for benchmarking, where the YOLO detector (original and with the proposed fusion strategy) is retrained on the MS-COCO dataset and we compare the detection performance at 50,000 iterations. The original YOLOv3 evaluation on the COCO testing dataset achieved an mAP of 36.5% for 50,000 iterations.

After determining the best location for ICF fusion using the mAP values achieved on the validation set, as shown in Table 4, we performed experiments to determine the combination of ICF channels that best suits the COCO dataset (Table 5). As shown in Table 4, doubling the 78th layer resulted in the highest mAP score (37.3%). Therefore, all subsequent experiments related to the COCO dataset were conducted with the doubled number of filters at the 78th layer. All experiments were evaluated at 50,000 iterations. Fusing all ICF channels (color, gradient magnitude, and HOG) resulted in an mAP of 37.7%. Removing RGB from the 17 ICF channels led to an mAP of 37.3%. Other channel combinations were tried but did not give good results, such as fusing only HOG features (6 channels), LUV + HOG (9 channels), Gray + LUV + gradient + HOG (11 channels), and Gray + HSV + LUV + gradient (8 channels). The combination that gave the highest mAP score consisted of the fusion of all ICF channels.

Table 4: COCO dataset detection rate with fusion of the complete set of ICF channels.
Layer number    Number of filters after doubling    mAP (%)
57              1024                                35.2
59              512                                 33.7
60              1024                                35.6
73              2048                                35.1
75              1024                                36.6
77              1024                                36.3
78              2048                                37.3
79              1024                                36.9

Table 5: MS-COCO detection rate with fusion of ICF channel combinations at the 78th layer.
ICF channel combination                     Number of channels    mAP (%)
RGB + Gray + HSV + LUV + grad + HOG         17                    37.7
Gray + HSV + LUV + grad + HOG               14                    37.3
HSV + LUV + grad + HOG                      13                    37.4
Gray + LUV + grad + HOG                     11                    36.4

Furthermore, the evaluation on the testing set with fusion of the complete set of ICF channels at the 78th layer achieved an mAP value of 37.8%. With the COCO dataset, we also experimented with the effect of normalization on the ICF channels. To measure the effect of normalization on the performance of the network, we tried two different normalization techniques. All the results discussed so far were achieved without any normalization.

Figure 7: Flowchart of the proposed object detector.

We first experimented with the max normalization technique, dividing each channel by its maximum range so that all values fall in the range of 0 to 1. The fusion of max-normalized ICF channels in YOLOv3 achieved an mAP of 38.4% on the testing set, which is 0.6% higher than the best result achieved with unnormalized features and 1.9% higher than the original YOLOv3 performance. This agrees with the training principles of deep networks, which suggest input normalization as one of the tools for performance improvement. For the second normalization technique, we used Z-score-based normalization, which converts the data in all channels to have mean 0 and standard deviation 1. However, the Z-score normalization did not perform as well as the previous normalization technique.
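The two normalization schemes compared above can be sketched as follows; this is a per-channel illustration in numpy, and the epsilon guards and channel-wise treatment are our assumptions.

import numpy as np

def max_normalize(icf, eps=1e-8):
    """Divide each channel by its maximum so that values lie in [0, 1]."""
    peak = icf.reshape(icf.shape[0], -1).max(axis=1).reshape(-1, 1, 1)
    return icf / np.maximum(peak, eps)

def zscore_normalize(icf, eps=1e-8):
    """Shift and scale each channel to zero mean and unit standard deviation."""
    flat = icf.reshape(icf.shape[0], -1)
    mu = flat.mean(axis=1).reshape(-1, 1, 1)
    sd = flat.std(axis=1).reshape(-1, 1, 1)
    return (icf - mu) / np.maximum(sd, eps)

icf = np.random.rand(17, 13, 13) * 100.0     # stand-in ICF tensor (channels, H, W)
print(max_normalize(icf).max(), zscore_normalize(icf).mean().round(6))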

7. Conclusion and Future Work

In this work, we present an approach to fuse handcrafted features into a convolutional neural network-based object detector. The fusion of handcrafted features with learning-based features has proved effective in many earlier works. In this work, we demonstrate the methodological steps to fuse handcrafted features into the latest version of the YOLO detector. Our experiments with the fusion of a combination of simple integral channel features in YOLO have yielded a substantial improvement in the detection rate on the PASCAL-VOC and MS-COCO datasets. In conventional machine learning, early, late, and learning-based fusion have been the primary strategies for feature fusion. However, deep learning networks are designed for specific input sizes, which poses a challenge for the algorithm designer wishing to feed extra information into the network unless the network is redesigned. The presented work offers a novel approach based on methodological steps to address both problems. In future work, we plan to explore the approach for sequence-based learning for object detection and tracking.

Data Availability

This work utilizes standard datasets, which are open access.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: an evaluation of the state of the art," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, 2012.

[2] D. T. Nguyen, W. Li, and P. O. Ogunbona, "Human detection from images and videos: a survey," Pattern Recognition, vol. 51, pp. 148–175, 2016.

[3] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.

[4] S. Zafeiriou, C. Zhang, and Z. Zhang, "A survey on face detection in the wild: past, present and future," Computer Vision and Image Understanding, vol. 138, pp. 1–24, 2015.

[5] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele, "How far are we from solving pedestrian detection?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, NV, USA, pp. 1259–1267, June 2016.

[6] PASCAL Visual Object Classes homepage (PASCAL-VOC): http://host.robots.ox.ac.uk/pascal/VOC/.

[7] COCO: Common Objects in Context (MS-COCO): http://cocodataset.org/.

[8] J. Redmon and A. Farhadi, YOLOv3: An Incremental Improvement, 2018, https://arxiv.org/abs/1804.02767.

[9] P. Dollár, Z. Tu, P. Perona, and S. Belongie, Integral Channel Features, BMVC Press, Swansea, UK, 2009.

[10] F. Arman and J. K. Aggarwal, "Model-based object recognition in dense-range images: a review," ACM Computing Surveys (CSUR), vol. 25, no. 1, pp. 5–43, 1993.

[11] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," Computer Vision and Pattern Recognition, vol. 1, pp. 886–893, 2005.

[12] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[13] A. Bosch, A. Zisserman, and X. Munoz, "Representing shape with a spatial pyramid kernel," in Proceedings of the 6th ACM International Conference on Image and Video Retrieval, pp. 401–408, Association for Computing Machinery, Amsterdam, The Netherlands, July 2007.

[14] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[15] J. Shotton, A. Blake, and R. Cipolla, "Contour-based learning for object detection," in Proceedings of the Tenth IEEE International Conference on Computer Vision, vol. 1, IEEE, Beijing, China, pp. 503–510, October 2005.

[16] S. Agarwal and D. Roth, "Learning a sparse representation for object detection," in Proceedings of the European Conference on Computer Vision, pp. 113–127, Copenhagen, Denmark, May 2002.

[17] P. Dollár, B. Babenko, S. Belongie, P. Perona, and Z. Tu, "Multiple component learning for object detection," in Proceedings of the 10th European Conference on Computer Vision, vol. 10, Springer, Berlin, Germany, pp. 211–224, June 2008.

[18] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.

[19] N. Tong, H. Lu, X. Ruan, and M.-H. Yang, "Salient object detection via bootstrap learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, MA, USA, pp. 1884–1892, June 2015.

[20] X. Wang, M. Yang, S. Zhu, and Y. Lin, "Regionlets for generic object detection," in Proceedings of the IEEE International Conference on Computer Vision, IEEE, Sydney, NSW, Australia, pp. 17–24, December 2013.

[21] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, OH, USA, pp. 580–587, June 2014.

[22] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448, Santiago, Chile, 2015.

[23] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: object detection via region-based fully convolutional networks," in Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 379–387, Barcelona, Spain, December 2016.

[24] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Boston, MA, USA, pp. 3431–3440, June 2015.

[25] W. Liu, D. Anguelov, D. Erhan et al., "SSD: single shot multibox detector," in European Conference on Computer Vision, pp. 21–37, Amsterdam, Netherlands, 2016.

[26] C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, Deconvolutional Single Shot Detector, 2017, https://arxiv.org/abs/1701.06659.

[27] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, NV, USA, pp. 779–788, June 2016.

[28] Y. Tian, P. Luo, X. Wang, and X. Tang, "Deep learning strong parts for pedestrian detection," in Proceedings of the IEEE International Conference on Computer Vision, IEEE, Santiago, Chile, pp. 1904–1912, December 2015.

[29] Y. Tian, P. Luo, X. Wang, and X. Tang, "Pedestrian detection aided by deep learning semantic tasks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, MA, USA, pp. 5079–5087, June 2015.

[30] Y. Zhou, L. Liu, L. Shao, and M. Mellor, "DAVE: a unified framework for fast vehicle detection and annotation," in Proceedings of the European Conference on Computer Vision, pp. 278–293, Amsterdam, The Netherlands, October 2016.

[31] B. Li, T. Zhang, and T. Xia, "Vehicle detection from 3D lidar using fully convolutional network," 2016, http://arxiv.org/abs/1608.07916.

[32] L. Zhang, L. Lin, X. Liang, and K. He, "Is Faster R-CNN doing well for pedestrian detection?" in Proceedings of the European Conference on Computer Vision, pp. 443–457, Amsterdam, The Netherlands, October 2016.

[33] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele, "Towards reaching human performance in pedestrian detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 973–986, 2018.

[34] T. Akilan, Q. M. J. Wu, and W. Zhang, "Video foreground extraction using multi-view receptive field and encoder-decoder DCNN for traffic and surveillance applications," IEEE Transactions on Vehicular Technology, vol. 68, no. 10, pp. 9478–9493, 2019.

[35] A. B. Alexey, Windows and Linux Version of Darknet YOLO v3 & v2 Neural Networks for Object Detection, 2018.

[36] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Seattle, Washington, USA, pp. 6517–6525, July 2017.

[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.


Page 2: ResearchArticle LearningFeatureFusioninDeepLearning …downloads.hindawi.com/journals/je/2020/7286187.pdf · 2020-05-22 · qualitative approach to select the best layer for feature

features for feature fusion Our detection framework is basedon state-of-the-art YOLO [8] architecture where we injectthe handcrafted features at appropriate layer(s) to regularizenetwork weights during the training procedure Wehypothesise that fusing handcrafted features during networklearning guides the CNN to extract a more accurate androbust feature set by exploiting upon the complementaryinformation available in the handcrafted feature set (emajor contributions in this paper are summarized as follows

(1) We propose a novel object detection approach basedon CNN learning by fusing simple feature descrip-tors like color channels and gradient histograms [9]by learning-based fusion to train a more robust andaccurate detection model

(2) We use the latest YOLO objection detection archi-tecture as our base We describe the proposedlearning-based feature fusion strategy that uses aqualitative approach to select the best layer forfeature injection In this work the presented workpresents a novel strategy to fuse features in CNNbased object detectors

(3) We demonstrate the comparative performance of theproposed feature fusion approach on PASCAL-VOCand MS-COCO datasets which shows improvementin the original YOLO object detection rate by asignificant measure

(e paper is organized as follows Section 2 presents therelevant previous research works We briefly discuss thelatest YOLO architecture and Integral channel features inSection 4 Section 5 discussed the details and imple-mentation issues of the proposed object detection methodExperimental evaluation of the proposed object detectionframework on different datasets is presented in Section 6We conclude the paper in Section 7 presenting the per-spective of our work and future research directions

2 Related Works

Our work in this paper and the proposed approach thereinare related to the CNN based object detection using vision(ere has been a significant amount of work in this researcharea A thorough review of the literature on object detectionand CNN based classification and regression is beyond thescope of this paper (e following discussion reviews theprominent and critical works in the context of the proposedapproach for object detection

Object detection in natural scenes can be pursued usingdifferent approaches model-based learning-based and aux-iliary [10] Model-based methods depend on heuristic deter-mined models using color shape and intensity attributes (eobject characterization in such methods is also referred to ashandcrafted features where they are based on the expertsrsquoengineering features that best describe representations inimages Learning-based methods are where features fromobjects are automatically learned from classifiers and are laterused for detection Last the auxiliary approach is where thelocation of objects is used as prior information for visualizing

them (e auxiliary approach is expensive to deploy as theinfrastructure of objects needs to be replaced for them to beeffective Prior knowledge of the objectrsquos location reduces thecomputational search cost and improves accuracy Further-more it helps in eliminating a large number of false positives

(e last few decades of advancements in the state of theart in object detection solutions have seen novel design ofhandcrafted features such as histogram of gradients (HOG)[11] scale-invariant feature transform (SIFT) [12] andpyramid HOG [13] (ese methods used simple discrimi-native classifiers by scanning the image space or by featurematching Jones et al [14] used integral images for featureextraction for face detection which can be quickly computed(ese works generate fixed or variable size object descriptorsas a set of local feature vectors In [15] the authors proposedlocal-contour based features with two-staged partially su-pervised learning for object detection Among the earliestworks on learning-based object detection the authors inreference [16] proposed learning a sparse part-based objectrepresentation In [17] the authors introduced multiple-component learning-based parts based object detection byautomatic learning and combination of individual classifiers(e authors in reference [9] subsequently combined HOGwith heterogeneous color channels for pedestrian detection(ese features widely known as integral channel features(ICF) again utilize integral images for fast feature compu-tation which is an important requirement for image scanbased object detection

In [18] the authors proposed to learn structure betweendifferent object parts using SVM formulation for modeling ofobject structures atmultiple scales usingmixtures of deformablepart models An extensive review of pedestrian detection usinghandcrafted features has been presented by [1] In [19] thevisual attention based modeling is used for salient object de-tection by using a bootstrap learningmodel In [20] the authorsproposed regionletsmdashthe integration of different types of fea-tures that were computed locally (ese features were used tomodel an object class by a cascaded boosting classifier

With the recent rise of CNN based learning methods, several CNN based object detection models have been proposed. Girshick et al. [21] proposed region-based convolutional neural networks (R-CNN) as object detection models. The R-CNN model trained independent components for generating region proposals as bounding boxes by using selective search-based image scanning and for classifying the region proposals into one of the object categories. The model was further improved as Fast R-CNN [22]; nevertheless, these models were slow and difficult to optimize. Faster R-CNN [3] alleviated the difficulties of its previous versions by replacing selective search with a region proposal network (RPN), which is trained as a single neural network together with Fast R-CNN, sharing the convolutional feature set. The RPN component of Faster R-CNN guides the unified network to look into different regions of interest. Although Faster R-CNN is hitherto considered the most robust and accurate object detector, these models still lack real-time performance because proposals are first generated and subsequently labeled into the known categories.


Dai et al. [23] extended the shared fully convolutional network architecture originally proposed for image segmentation [24] to a two-stage detection strategy comprising region proposal and region classification. The end-to-end learning of the network constructs a set of position-sensitive score maps to deal with the translation variance of target objects. The scores are generated by a bank of convolutional layers which encodes the relative spatial position information, incorporating translation variance in the learning.

Some recent works on object detection have also followed a regression-based approach, which includes the Single Shot Multibox Detector (SSD) [25], the Deconvolutional Single Shot Detector (DSSD) [26], and You Only Look Once (YOLO) [8]. SSD combines the Multibox approach for bounding box regression with a set of default boxes for predicting the shape offset and category confidence at each location in the image. The authors used VGG-16 as the base network architecture, where the network combines predictions from different feature maps with different resolutions. DSSD further improved SSD by shifting to the "encoder-decoder" type of network to incorporate context information in the learning and by replacing the VGG-16 architecture with Residual-101.

YOLO [27], unlike its predecessors, trained a single regressor network for direct prediction of object bounding boxes with category confidences. This approach of complete regression without any classification step yielded superior real-time performance in terms of detection speed. YOLO also bettered detection accuracy in comparison with all other detection systems except Faster R-CNN. With the incorporation of anchor boxes for guiding the bounding box prediction in a newly designed network, YOLOv2 bettered all other existing state of the art in the PASCAL VOC-2007 challenge. The latest YOLO (here onwards we refer to this version as YOLOv3) claims to improve its performance with the incorporation of a much deeper network architecture combining predictions at multiple scales. The YOLO algorithms see the entire image all at once through the forward pass of the network, which gives more accurate and comprehensive information. This helps the detector in avoiding false positives better than the classification based detection systems, which focus only on the region proposals. Despite remarkable progress in object detection in natural images due to deep learning-based strategies, the performance of existing models requires improvement from the accuracy and real-time performance standpoint. The latest results on the PASCAL VOC challenge [6] and MS-COCO [7] datasets establish that more effort is required to solve the detection problem.

In addition, several CNN based detection models have been designed for specific target objects [28-30]. These methods have raised the benchmark in target object detection; nevertheless, the problem is far from being solved [5, 31-33]. An important distinction in specific object detection is the availability of auxiliary information sources or user feedback, which has helped researchers to solve the problem up to an acceptable limit.

In this work, we propose a feature enriched object detector based on the latest YOLO architecture. Our choice of the YOLO architecture is based on its comparative detection accuracy and superior detection rate. It is expected that a CNN based detection model should be able to learn the object-specific salient features required for detection; nevertheless, this cannot be guaranteed in a single objective-based learning formulation. This direction of research, exploring learning-based enrichment of deep features by including additional modalities, has not been explored for object detection. In this context, the recent work on foreground extraction for video analysis [34], a deep neural network based framework, exploits the multistage fusion of combinations of residual features from different convolutional layers. On the other hand, our approach is to feed the handcrafted features into the network to guide the feature learning process, leading towards more accurate detection. For learning-based feature fusion, we use the ICF descriptor, which has shown remarkable object detection performance as discussed above.

3 Revisiting the Object Detection Problem

Object detection consists of localizing the regions capturing the objects and assigning their labels by processing the localized regions. The localization problem can be formulated as (i) a regression problem that predicts the regions of interest or (ii) a binary classification problem focusing on detection of foreground regions having object segments. The outcome of either of these methods predicts the object boundary as a set of four coordinates (xmin, xmax, ymin, and ymax), as shown in Figure 1.

Label identification of the object region is an N-class classification problem, where N denotes the number of classes, for example, tree, person, bird, dog, bicycle, car, or cat. In both stages of the object detection problem, challenges arise because of the underlying variations in lighting conditions, scaling, occlusion, partial view, and orientation. Figure 2 shows some such challenging situations from the COCO dataset.
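To make the two box parameterizations concrete, the following is a minimal Python sketch (the function names are ours, for illustration only) converting between the corner form (xmin, ymin, xmax, ymax) mentioned above and the center form (x, y, w, h) that regression-based detectors such as YOLO predict.

```python
def corners_to_center(xmin, ymin, xmax, ymax):
    """Corner-form box -> center form (x, y, w, h)."""
    w, h = xmax - xmin, ymax - ymin
    return xmin + w / 2.0, ymin + h / 2.0, w, h


def center_to_corners(x, y, w, h):
    """Center-form box -> corner coordinates."""
    return x - w / 2.0, y - h / 2.0, x + w / 2.0, y + h / 2.0


# Example: a 100 x 80 box whose top-left corner is at (50, 40)
print(corners_to_center(50, 40, 150, 120))   # (100.0, 80.0, 100, 80)
```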

4 Preliminaries

Before presenting the object detection methodology using learning-based feature fusion, we briefly discuss the integral channel features and YOLOv3's architecture. We also make some modifications in the YOLOv3 architecture as proposed in [35], which are also described in this section.

4.1. You Only Look Once Object Detector (YOLOv3). The class of YOLO algorithms [8, 27, 36] looks at the entire image when detecting and recognizing objects and extracts deep information about classes and their appearance, unlike other approaches such as sliding window-based methods or R-CNN based algorithms. These algorithms treat the detection of objects as a single regression problem, giving a faster response with a reduction in the design complexity of the detector. Despite the significant gain in speed, the algorithms lag in terms of accuracy, especially with small objects.


The latest algorithm in the YOLO family, that is, YOLOv3, has proven its performance compared to other state-of-the-art detectors. The YOLOv3 architecture has 107 layers in total, distributed as convolutional: 75, route: 4, residual: 23, upsample: 2, and detection: 3. The architecture uses a new model for feature extraction referred to as Darknet-53. The new model is significantly larger than the models used in the earlier versions; nevertheless, it has proven to be more efficient than other state of the arts. The model uses 53 convolution layers and takes an input image of size 416 × 416. Figure 3 shows the architecture of the YOLOv3 object detector. The Darknet-53 model is pretrained on ImageNet [37]. For the detection task, the network is modified by removing its last layer and then stacking up additional layers, resulting in the final network architecture. The first 75 layers in the network represent the 52 convolutional layers of the Darknet-53 model pretrained on ImageNet. The remaining 32 layers are added to qualify YOLOv3 for object detection on different datasets with further training. Besides, YOLOv3 applies new residual layers, similar to skip connections, that combine feature maps from two layers using elementwise addition, resulting in finer-grained information.
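As an illustration of the residual layers described above (elementwise addition of feature maps from two layers), here is a minimal PyTorch-style sketch of a Darknet-53 type block; the class name and the 1 × 1 bottleneck followed by a 3 × 3 convolution are assumptions for illustration, not the reference Darknet implementation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """1x1 bottleneck, 3x3 convolution, then elementwise addition with the
    block input (the skip connection)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1))
        self.conv2 = nn.Sequential(
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1))

    def forward(self, x):
        # elementwise addition keeps finer-grained information from the earlier layer
        return x + self.conv2(self.conv1(x))

# a 256-channel feature map keeps its shape through the block
out = ResidualBlock(256)(torch.randn(1, 256, 52, 52))
print(out.shape)   # torch.Size([1, 256, 52, 52])
```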

YOLOv3 replaces the softmax based activation used in older versions with independent logistic classifiers. Features are extracted using a concept similar to feature pyramid networks. Also, binary cross-entropy loss is now used for class predictions, which is beneficial when faced with images having overlapping labels. K-means is used for anchor box generation; however, 9 bounding boxes are now used rather than 5. The number of bounding boxes is distributed equally across the three detection scales. In the present design, a route layer is also used, which outputs a layer's feature maps.
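The switch from a softmax to independent logistic classifiers can be sketched as follows (a NumPy illustration with hypothetical helper names, not YOLOv3's actual code): each class gets its own sigmoid score, and the class loss is a binary cross-entropy per class, which allows one box to carry several overlapping labels.

```python
import numpy as np

def class_scores(logits):
    """Independent logistic (sigmoid) score per class, instead of a softmax."""
    return 1.0 / (1.0 + np.exp(-logits))

def bce_loss(targets, scores, eps=1e-7):
    """Binary cross-entropy over the per-class predictions."""
    scores = np.clip(scores, eps, 1.0 - eps)
    return -np.mean(targets * np.log(scores) + (1 - targets) * np.log(1 - scores))

logits = np.array([2.0, 1.5, -3.0])   # e.g. two overlapping labels both active
targets = np.array([1.0, 1.0, 0.0])   # no "sums to one" constraint across classes
print(class_scores(logits), bce_loss(targets, class_scores(logits)))
```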

4.1.1. Prediction. YOLOv3 processes images by dividing them into an N × N grid, and if the center of an object falls in a grid cell, then that cell is responsible for detecting the object. The network predicts bounding boxes at three different scales. The first detection scale is used for detecting large-sized objects, the second detection scale for medium-sized objects, and the last for small objects. The three detection layers are displayed in red in Figure 3. Each cell predicts B bounding boxes, and each prediction has 5 values: x, y, w, h, and a confidence score representing the measure of the prediction containing an object. The x and y parameters are the center of the box relative to the grid cell, while w and h are the width and height of the predicted box relative to the entire image. The confidence score is the Intersection over Union (IoU) between the predicted box and the ground truth. The output prediction is an N × N × B × (5 + C) tensor, where 5 is the number of predictions per bounding box (x, y, w, h, and the confidence value) and C is the total number of object categories. Figure 4 shows the prediction output of YOLOv3 for N × N cells, where each cell has B bounding boxes.
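A minimal sketch (our own helper, in Python) of the two quantities used above: the IoU that defines the confidence target, and the shape of the output tensor at one detection scale.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union for boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175, roughly 0.143

# Output tensor at the coarsest scale for a 416 x 416 input and C = 80 classes:
N, B, C = 13, 3, 80
print(np.zeros((N, N, B, 5 + C)).shape)      # (13, 13, 3, 85)
```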

Figure 1: Object detection in an example image.

Figure 2: Example images from the MS-COCO dataset.


4.1.2. Loss Function. YOLOv3 includes a loss function (1) that instructs the network to correctly predict bounding boxes and accurately classify the detected objects, with a provision to penalize false positives.

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{N^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{\text{coord}} \sum_{i=0}^{N^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right] \\
&+ \sum_{i=0}^{N^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{N^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&- \sum_{i=0}^{N^2} \sum_{j=0}^{B} \left[ \delta_i \log \hat{p}_i(c) + \left(1 - \delta_i\right) \log\left(1 - \hat{p}_i(c)\right) \right] .
\end{aligned}
\tag{1}
$$

The symbols are explained in Table 1; the symbols under a hat represent the corresponding predicted values. The loss function in the equation has three error components, localization, confidence, and classification, as observed in equation (1).

Figure 3: YOLOv3 network architecture. (a) Convolutional layers in YOLOv3. (b) Detection layers in YOLOv3.


The different loss components are combined by a sum-squared error formulation, as it is easier to optimize. The localization loss is responsible for minimizing the error between the "responsible" bounding box and the ground truth object if an object is detected in a grid cell.
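For concreteness, here is a NumPy sketch of the three components of equation (1); the array shapes and the dictionary layout are assumptions made for illustration and do not reproduce the actual Darknet implementation.

```python
import numpy as np

def yolo_loss_sketch(pred, gt, obj_mask, noobj_mask, delta, p_hat,
                     lambda_coord=5.0, lambda_noobj=0.5):
    """Illustrative computation of the three components of equation (1).

    pred, gt   : dicts of arrays shaped (N*N, B) with keys 'x', 'y', 'w', 'h', 'C'
    obj_mask   : 1 where box j of cell i is responsible for an object, else 0
    noobj_mask : 1 where box j of cell i contains no object, else 0
    delta      : per-cell label-match indicator, shape (N*N,)
    p_hat      : predicted probability of the ground-truth class, shape (N*N,)
    """
    # localization: centre offsets plus square-rooted width/height errors
    loc = lambda_coord * np.sum(obj_mask * ((pred['x'] - gt['x']) ** 2 +
                                            (pred['y'] - gt['y']) ** 2))
    loc += lambda_coord * np.sum(obj_mask * ((np.sqrt(pred['w']) - np.sqrt(gt['w'])) ** 2 +
                                             (np.sqrt(pred['h']) - np.sqrt(gt['h'])) ** 2))
    # confidence: responsible boxes plus down-weighted empty boxes
    conf = np.sum(obj_mask * (pred['C'] - gt['C']) ** 2)
    conf += lambda_noobj * np.sum(noobj_mask * (pred['C'] - gt['C']) ** 2)
    # classification: per-cell binary cross-entropy term
    cls = -np.sum(delta * np.log(p_hat) + (1 - delta) * np.log(1 - p_hat))
    return loc + conf + cls
```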

4.2. Integral Channel Features (ICF). In this work, we fuse the different channels of information proposed in the ICF [9] into the YOLOv3 architecture. The ICF descriptor is a series of image channels computed from the input image using linear and nonlinear transformations. A channel refers to a representation of the input image. Next, first-order features are extracted by computing sums over rectangular regions. The evaluated channels were color channels (RGB, Gray, HSV, and LUV), gradient magnitude, and gradient histograms. The authors in [9] claimed that the most informative channel among all, in independent evaluation for pedestrian detection, was HOG. Further, a combination of LUV, gradient, and HOG gave the best detection rate. Our proposed work reevaluates the combination of channels because of the varying challenges in the present application scenario. In our work, the window size parameter for HOG computation depends on the 2D dimensions of the deep features with which they are to be fused (discussed in Section 4.1). As an example, if the 2D dimensions of the layer where features are to be fused are 13 × 13 and the input image is of size 416 × 416, then the window size will be 32 × 32 so that it results in 13 × 13 HOGs. For each window, we compute the HOG using six bins. Other HOG parameters, including the cell and block sizes used for normalization purposes, are also set to the window size. The block stride parameter is also set equal to the window size, leaving no overlap between two windows. As shown in Figure 5, the ICF for a given image is the linear concatenation of the individual color channels, after normalization, with the corresponding HOG feature. The preprocessing in feature computation includes resizing the input image to align it with the network layer that receives this input. For the experiments discussed in this paper, we follow the same steps and parameters for HOG computation as discussed in [9].
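The channel computation can be sketched as follows, using OpenCV and scikit-image for illustration; the library choice, the resize-based pooling of the color and gradient channels, and the helper name are our assumptions (the original work uses integral-image sums over rectangular regions), and the RGB channels would be appended to obtain the full 17-channel set.

```python
import cv2
import numpy as np
from skimage.feature import hog

def icf_channels(bgr_image, fused_layer_hw=13, hog_bins=6):
    """Sketch of an ICF-style channel stack: gray, HSV, LUV, gradient magnitude,
    and a HOG whose window size makes its grid match the fused layer's 2D size."""
    h, _ = bgr_image.shape[:2]
    win = h // fused_layer_hw                      # e.g. 416 / 13 = 32
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    luv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LUV)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)

    # per-pixel channels, brought to the fused layer's width/height
    maps = [gray, hsv[..., 0], hsv[..., 1], hsv[..., 2],
            luv[..., 0], luv[..., 1], luv[..., 2], grad_mag]
    maps = [cv2.resize(m.astype(np.float32), (fused_layer_hw, fused_layer_hw))
            for m in maps]

    # HOG with six bins; cell, block, and window sizes equal (no overlap)
    hog_maps = hog(gray, orientations=hog_bins, pixels_per_cell=(win, win),
                   cells_per_block=(1, 1), feature_vector=False)
    hog_maps = hog_maps.reshape(fused_layer_hw, fused_layer_hw, hog_bins)
    maps += [hog_maps[..., b] for b in range(hog_bins)]
    return np.stack(maps, axis=0)                  # (14, 13, 13) before adding RGB
```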

5 Learning-Based Feature Fusion in YOLOv3

Before we describe the proposed feature fusion, we set out the required steps, which are elaborated subsequently:

(1) Identify the candidate convolution layers for feature injection.

(2) Evaluate the layers for feature injection using the complete set of ICF channels.

(3) Evaluate different combinations of ICF channels using the best network layer obtained in the previous step.

(4) Train and test the detector, injecting the best ICF channel combination at the best layer position.

Our proposal for feature fusion starts with identifying a location and space for handcrafted features within the YOLOv3 architecture. For injecting an additional set of features in this architecture, we first need to identify the convolution layer to feed the additional information. We adopt a qualitative approach to decide on the layer for feature fusion by validating feature fusion on different layers.

Figure 4: YOLOv3 prediction output for an example image; cell(i, j) corresponds to the image region within the ith row and jth column of the N × N grid.


The depth of a convolutional layer represents the number of feature maps that hold the deep features, hereafter referred to as filters. We need to identify the specific number of filters within the specific layer which will store the additional features. First, we propose to double the original number of filters in the specific layer, as shown in Figure 6. Depending on the number of channels in the ICF set, the same number of filters needs to be reserved for storing the handcrafted feature values. The ICF descriptor is injected into the network at the beginning of the training procedure. In the layer selected for ICF injection, the additional filters (introduced by doubling the layer depth) are copied with the ICF descriptor from the rear end, as shown in Figure 6. The remaining filters are set to zero in the beginning. As the training progresses, the layer weights, that is, the filter values, update following the learning algorithm. The output generated by this layer depends on the values of all filters. The proposed method for feature fusion diverges from the simple approach of stacking handcrafted features onto the original filters in the selected layer. The extra filters, in addition to the filters used for storing the ICF descriptor values, help regularize the layer weights. We maintain the strategy of doubling the filters for feature fusion regardless of the specific convolutional layer under validation. An example showing the procedure of doubling the number of filters at a specific layer to provide space for ICF fusion is illustrated in Figure 6. In this example, fusion was selected to occur at the 62nd layer of YOLOv3's network. The top side of the figure shows a section from the original YOLOv3's network, and the bottom side is where the number of filters for the 62nd layer is doubled. The number of filters was changed from 1024 to 2048. The light orange is the new filter depth (2048) and consists of the original filter depth (1024) concatenated with the additional filters of the same depth, represented in green. Additionally, doubling the number of filters in the fusion layer poses fewer design challenges than direct stacking of handcrafted features onto the original filters.

5.1. Design Issues with Injection of the ICF Descriptor. To fuse handcrafted features with deep features, their 2D dimensions have to match the width and height dimensions of the layer. Considering the example in Figure 6, at the 62nd layer, if the input image size is 416 × 416, then the filter dimensions will be 13 × 13. Hence, all channels selected for fusion should be of size 13 × 13 to fit into the additional filters. In the testing phase of YOLOv3, the size of the input image is fixed at 416 × 416, so the dimensions of the ICF descriptor at the 62nd layer will always be 13 × 13. On the other hand, in the training phase of YOLOv3, the size of the input image changes based on a random parameter every 10 iterations, where the size of the image is defined as a multiple of 32, starting from 320 × 320 up to 608 × 608. If the input size is 320 × 320 or 608 × 608, then at the 62nd layer the filter size becomes 10 × 10 or 19 × 19, respectively; that is, the spatial size is downsampled by a factor of 32 relative to the input image. After making sure that the handcrafted features are of the right size, their values can replace the additional filters in the chosen layer.
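A minimal NumPy sketch of the injection mechanics described above (names and shapes are illustrative assumptions): the selected layer's filter count has been doubled to 2x feature maps, and the y ICF channels overwrite the rear-most maps, while the remaining extra maps keep their (initially zero) learned values.

```python
import numpy as np

def inject_icf(deep_maps, icf_maps):
    """deep_maps: (2x, H, W) output maps of the doubled layer.
    icf_maps : (y, H, W) handcrafted channels with matching spatial size, y <= x."""
    assert deep_maps.shape[1:] == icf_maps.shape[1:], "2D dimensions must match"
    fused = deep_maps.copy()
    fused[-icf_maps.shape[0]:] = icf_maps   # copy the ICF descriptor in from the rear end
    return fused

def fused_layer_size(input_size, downsample=32):
    """Spatial size of the fusion layer for a given input scale (416 -> 13, 320 -> 10, 608 -> 19)."""
    return input_size // downsample

fused = inject_icf(np.zeros((2048, 13, 13), dtype=np.float32),
                   np.ones((17, 13, 13), dtype=np.float32))
print(fused.shape, fused_layer_size(608))   # (2048, 13, 13) 19
```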

The flowchart of the proposed object detector is shown in Figure 7. For training purposes, the desired ICF channels of the input images at different scales are calculated offline. These images at different scales are used to incorporate variability in the training process. In the testing phase, ICF channels are calculated only once for each image at a fixed scale. The right side of the figure shows that, as our proposed detector is being trained or tested, fusion takes place at the layer named the "nth YOLOv3 convolutional layer".

Figure 5: Flowchart of the ICF computation for an example image.

Table 1: Symbols used in the YOLOv3 loss function.

Symbol | Definition
N | Number of grid cells in an image
B | Number of anchor boxes
Ci | Confidence score of the jth bounding box in grid cell i
pi(c) | Conditional probability of class c in grid cell i
xi, yi, wi, hi | Location and size of a bounding box
1_i^obj | 1 if an object appears in grid cell i; 0 otherwise
1_ij^obj | 1 if an object is present in the ith grid cell and the jth "responsible" bounding box; 0 otherwise
1_ij^noobj | 1 if no object is present in the ith grid cell and the jth "responsible" bounding box; 0 otherwise
δi | 1 if the predicted label matches the ground truth label; 0 otherwise
λcoord | Constant (default 5.0)
λnoobj | Constant (default 0.5)


The doubled number of filters and the number of ICF channels are represented as 2x and y, respectively. The figure shows how the ICF overlaps with a portion of the deep features, residing in some of the additional filters.

6 Experimental Evaluation

A preliminary evaluation of the proposed framework is performed on the PASCAL-VOC dataset (VOC2007 detection task). The original YOLOv3 detector achieved an mAP of 67.5% on the testing set with an input image dimension of 416 × 416. The YOLOv3 implementation made available by [35] is used for benchmarking our experimental results. As mentioned in Section 5, we first identify the convolutional layer for injecting the handcrafted features and the required space (i.e., the number of filters). First, we figure out the best layer location to fuse the ICF channels. Subsequently, by another set of experiments, we determine the best combination of ICF channels. We separate a validation set comprising one-tenth of the training examples by random selection. The validation set is used for testing the convolutional layers and combinations of ICF channels for optimum selection. For determining the best layer for fusion, we use all channels of information in ICF as used in the original work by Dollar et al. [9]. This is done by doubling the number of filters at the layer under evaluation, fusing the 17 channels of the ICF set, and checking the resulting mAP. We evaluated the two convolutional layers preceding a detection layer in the YOLOv3 architecture. For each experiment, the network is trained for 25000 iterations. The remaining training parameters are set as in the original YOLOv3 detector. We performed the first evaluation on the 80th layer, which is the last convolutional layer before the detection layer.

Doubling the number of filters for the 80th layer and injecting the full set of ICF channels resulted in an mAP of 69.8%. In comparison with the benchmark YOLOv3 performance, we achieved an increase of 1.7% on the mAP scale. The next experiment, on the 79th layer, achieved an mAP of 71.5%, which is more effective than the previous experiment. Table 2 lists the mAP scores for the experiments conducted to determine the best layer position for feature fusion. As observed, the best mAP score of 71.5% is achieved at the 79th layer.

After determining the most suitable layer for ICF fusion, we conduct experiments to discover the combination of ICF channels for fusion that is most suitable for the object detection task on the VOC dataset. Again, in this set of experiments, we fuse ICF channels by doubling the number of filters at the 79th layer. In this experiment, we run the training process for 50000 iterations. Fusing all ICF channels (color, gradient magnitude, and HOG) achieved an mAP of 77.7%; the entire ICF descriptor summed up to 17 channels, considering 6 channels for the HOG. Removing the RGB channels from the 17 ICF channels increased the mAP to 78.2%, a minuscule increase of 0.5% over the preceding experiment. Considering that the original input images are already in RGB color values, the increase is not significant. We also explored other channel combinations; the corresponding mAP scores are presented in Table 3.

As seen, the best ICF channel combination for the VOC2007 detection task consists of Gray, HSV, LUV, gradient magnitudes, and HOG, achieving a maximum mAP score of 79.1%. We retrain the YOLOv3 detector on the complete training set, fusing the finalized channels of the ICF descriptor. The network achieved an mAP of 78.7% on the testing set, which is 11.2% more than the detection rate achieved by the original YOLOv3 detector.

6.1. Evaluation on the MS-COCO Dataset. The MS-COCO dataset consists of 82783 training, 40504 validation, and 40775 testing images belonging to 80 categories. In the literature, YOLOv3 [8] reported an mAP of 55.3% on the COCO dataset for an input image size of 416 × 416. All experiments reported in this work were executed on an NVIDIA GeForce GTX 1080 GPU desktop.

Figure 6: ICF descriptor injection in the specific convolutional layer.


This infrastructure is modest in comparison with that used in the original paper [8]. Therefore, we adopt an alternate strategy for benchmarking, where the YOLO detector (original and with the proposed fusion strategy) is retrained on the MS-COCO dataset and the detection performance is compared at 50000 iterations. The original YOLOv3 evaluation on the COCO testing dataset achieved an mAP of 36.5% at 50000 iterations.

After determining the best location for ICF fusion using the mAP values achieved on the validation set, as shown in Table 4, we performed experiments to determine the combination of ICF channels that best suits the COCO dataset (Table 5). As shown in Table 4, doubling the 78th layer resulted in the highest mAP score (37.3%). Therefore, all subsequent experiments related to the COCO dataset were conducted with the doubled number of filters at the 78th layer. All experiments were evaluated at 50000 iterations. Fusing all ICF channels (color, gradient magnitude, and HOG) resulted in an mAP of 37.7%. Removing RGB from the 17 ICF channels led to an mAP of 37.3%. Other channel combinations were tried out but did not give good results, such as fusing only the HOG features (6 channels), LUV + HOG (9 channels), gray + LUV + gradient + HOG (11 channels), and gray + HSV + LUV + gradient (8 channels). The best combination, which gave the highest mAP score, consisted of the fusion of all ICF channels.

Furthermore, the evaluation on the testing set with fusion of the complete set of ICF channels at the 78th layer achieved an mAP value of 37.8%. With the COCO dataset, we also experimented with the effect of normalization on the ICF channels. To measure the effect of normalization on the performance of the network, we tried out two different normalization techniques. All the results discussed so far were achieved without any normalization. We first experimented with the max normalization technique, dividing each channel by its maximum range so that all values fall in the range of 0 to 1. The fusion of max-normalized ICF channels in YOLOv3 achieved an mAP of 38.4% on the testing set, which is 0.6% higher than the best result achieved with unnormalized features and 1.9% higher than the original YOLOv3 performance. This agrees with the training principle of deep networks, which suggests the use of input normalization as one of the tools for performance improvement.

Table 2: VOC dataset detection rate with fusion of the complete set of ICF channels.

Layer number | Number of filters after doubling | mAP (%)
59 | 512 | 66.1
60 | 1024 | 68.4
62 | 2048 | 66.2
77 | 1024 | 69.8
78 | 2048 | 70.9
79 | 1024 | 71.5
80 | 2048 | 69.8

Table 3: VOC dataset detection rate with fusion of ICF channel combinations at the 79th layer.

ICF channel combination | Number of channels | mAP (%)
RGB + gray + HSV + LUV + grad + HOG | 17 | 77.7
Gray + HSV + LUV + grad + HOG | 14 | 77.4
HSV + LUV + grad + HOG | 13 | 78.6
Gray + LUV + grad + HOG | 11 | 78.0
LUV + grad + HOG | 10 | 79.1
Gray + HSV + grad + HOG | 11 | 76.8
Gray + HSV + LUV | 7 | 76.2
Gray + HSV + LUV + HOG | 13 | 77.1

Figure 7: Flowchart of the proposed object detector.


In the second normalization technique, we used Z-score based normalization, which converts the data in all channels to have mean 0 and standard deviation 1. However, the Z-score normalization did not perform as well as the previous normalization technique.
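The two normalization variants compared above can be sketched as follows (a small Python illustration with hypothetical helper names).

```python
import numpy as np

def max_normalize(channel, max_range):
    """Divide by the channel's maximum range so that values fall in [0, 1]."""
    return channel.astype(np.float32) / max_range

def z_score_normalize(channel, eps=1e-7):
    """Shift and scale a channel to zero mean and unit standard deviation."""
    c = channel.astype(np.float32)
    return (c - c.mean()) / (c.std() + eps)

gray = np.random.randint(0, 256, size=(13, 13))
print(max_normalize(gray, 255.0).max() <= 1.0, round(z_score_normalize(gray).mean(), 6))
```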

7 Conclusion and Future Work

In this work, we present an approach to fuse handcrafted features into a convolutional neural network based object detector. The fusion of handcrafted features with learning-based features has proved to be effective in many earlier works. In this work, we demonstrate the methodological steps to fuse handcrafted features into the latest version of the YOLO detector. Our experiments with the fusion of a combination of simple integral channel features in YOLO have yielded a substantial improvement in the detection rate on the PASCAL-VOC and MS-COCO datasets. In conventional machine learning, early, late, and learning-based fusion have been the primary strategies for feature fusion. However, deep learning networks are designed for specific input sizes, which poses a challenge for the algorithm designer to feed extra information into the network unless the network is redesigned. The present work offered a novel approach based on methodological steps to address both problems. In future work, we plan to explore the approach for sequence-based learning for object detection and tracking.

Data Availability

This work utilizes standard datasets, which are open access.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: an evaluation of the state of the art," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, 2012.

[2] D. T. Nguyen, W. Li, and P. O. Ogunbona, "Human detection from images and videos: a survey," Pattern Recognition, vol. 51, pp. 148–175, 2016.

[3] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.

[4] S. Zafeiriou, C. Zhang, and Z. Zhang, "A survey on face detection in the wild: past, present and future," Computer Vision and Image Understanding, vol. 138, pp. 1–24, 2015.

[5] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele, "How far are we from solving pedestrian detection?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1259–1267, IEEE, Las Vegas, NV, USA, June 2016.

[6] PASCAL Visual Object Classes homepage (PASCAL-VOC), http://host.robots.ox.ac.uk/pascal/VOC/.

[7] COCO - Common Objects in Context (MS-COCO), http://cocodataset.org/.

[8] J. Redmon and A. Farhadi, YOLOv3: An Incremental Improvement, 2018, https://arxiv.org/abs/1804.02767.

[9] P. Dollar, Z. Tu, P. Perona, and S. Belongie, Integral Channel Features, BMVC Press, Swansea, UK, 2009.

[10] F. Arman and J. K. Aggarwal, "Model-based object recognition in dense-range images - a review," ACM Computing Surveys (CSUR), vol. 25, no. 1, pp. 5–43, 1993.

[11] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," Computer Vision and Pattern Recognition, vol. 1, pp. 886–893, 2005.

[12] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[13] A. Bosch, A. Zisserman, and X. Munoz, "Representing shape with a spatial pyramid kernel," in Proceedings of the 6th ACM International Conference on Image and Video Retrieval, pp. 401–408, Association for Computing Machinery, Amsterdam, The Netherlands, July 2007.

[14] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[15] J. Shotton, A. Blake, and R. Cipolla, "Contour-based learning for object detection," in Proceedings of the Tenth IEEE International Conference on Computer Vision, vol. 1, pp. 503–510, IEEE, Beijing, China, October 2005.

[16] S. Agarwal and D. Roth, "Learning a sparse representation for object detection," in Proceedings of the European Conference on Computer Vision, pp. 113–127, Copenhagen, Denmark, May 2002.

[17] P. Dollar, B. Babenko, S. Belongie, P. Perona, and Z. Tu, "Multiple component learning for object detection," in Proceedings of the 10th European Conference on Computer Vision, vol. 10, pp. 211–224, Springer, Berlin, Germany, June 2008.

[18] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.

Table 4: COCO dataset detection rate with fusion of the complete set of ICF channels.

Layer number | Number of filters after doubling | mAP (%)
57 | 1024 | 35.2
59 | 512 | 33.7
60 | 1024 | 35.6
73 | 2048 | 35.1
75 | 1024 | 36.6
77 | 1024 | 36.3
78 | 2048 | 37.3
79 | 1024 | 36.9

Table 5: MS-COCO detection rate with fusion of ICF channel combinations at the 78th layer.

ICF channel combination | Number of channels | mAP (%)
RGB + gray + HSV + LUV + grad + HOG | 17 | 37.7
Gray + HSV + LUV + grad + HOG | 14 | 37.3
HSV + LUV + grad + HOG | 13 | 37.4
Gray + LUV + grad + HOG | 11 | 36.4

[19] N. Tong, H. Lu, X. Ruan, and M.-H. Yang, "Salient object detection via bootstrap learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1884–1892, IEEE, Boston, MA, USA, June 2015.

[20] X. Wang, M. Yang, S. Zhu, and Y. Lin, "Regionlets for generic object detection," in Proceedings of the IEEE International Conference on Computer Vision, pp. 17–24, IEEE, Sydney, NSW, Australia, December 2013.

[21] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, IEEE, Columbus, OH, USA, June 2014.

[22] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448, Santiago, Chile, 2015.

[23] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: object detection via region-based fully convolutional networks," in Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 379–387, Barcelona, Spain, December 2016.

[24] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440, IEEE, Boston, MA, USA, June 2015.

[25] W. Liu, D. Anguelov, D. Erhan et al., "SSD: single shot multibox detector," in European Conference on Computer Vision, pp. 21–37, Amsterdam, Netherlands, 2016.

[26] C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, Deconvolutional Single Shot Detector, 2017, https://arxiv.org/abs/1701.06659.

[27] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, IEEE, Las Vegas, NV, USA, June 2016.

[28] Y. Tian, P. Luo, X. Wang, and X. Tang, "Deep learning strong parts for pedestrian detection," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1904–1912, IEEE, Santiago, Chile, December 2015.

[29] Y. Tian, P. Luo, X. Wang, and X. Tang, "Pedestrian detection aided by deep learning semantic tasks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5079–5087, IEEE, Boston, MA, USA, June 2015.

[30] Y. Zhou, L. Liu, L. Shao, and M. Mellor, "DAVE: a unified framework for fast vehicle detection and annotation," in Proceedings of the European Conference on Computer Vision, pp. 278–293, Amsterdam, The Netherlands, October 2016.

[31] B. Li, T. Zhang, and T. Xia, "Vehicle detection from 3D lidar using fully convolutional network," 2016, http://arxiv.org/abs/1608.07916.

[32] L. Zhang, L. Lin, X. Liang, and K. He, "Is Faster R-CNN doing well for pedestrian detection?" in Proceedings of the European Conference on Computer Vision, pp. 443–457, Amsterdam, The Netherlands, October 2016.

[33] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele, "Towards reaching human performance in pedestrian detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 973–986, 2018.

[34] T. Akilan, Q. M. J. Wu, and W. Zhang, "Video foreground extraction using multi-view receptive field and encoder-decoder DCNN for traffic and surveillance applications," IEEE Transactions on Vehicular Technology, vol. 68, no. 10, pp. 9478–9493, 2019.

[35] A. B. Alexey, Windows and Linux Version of Darknet YOLO v3 & v2 Neural Networks for Object Detection, 2018.

[36] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525, IEEE, Seattle, Washington, USA, July 2017.

[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.


Page 3: ResearchArticle LearningFeatureFusioninDeepLearning …downloads.hindawi.com/journals/je/2020/7286187.pdf · 2020-05-22 · qualitative approach to select the best layer for feature

Dai et al [23] extended the shared fully convolutionalnetwork architecture originally proposed for image seg-mentation ([24]) to two-stage detection strategy havingregion proposal and region classification (e end-to-endlearning of the network constructs a set of position-sensitivescore maps to deal with the translation variance of targetobjects (e scores are generated by a bank of convolutionallayers which encodes the relative spatial position informa-tion incorporating translation variance in the learning

Some recent works on object detection also followed aregression-based approach which includes Single ShotMultibox Detector (SSD) [25] Deconvolutional Single ShotDetector (DSSD) [26] and You Only Look Once (YOLO)[8] SSD combines the Multibox approach for bounding boxregression by using a set of default boxes for predicting theshape offset and category confidence at each location in theimage (e authors used VGG-16 as the base network ar-chitecture where the network combines predictions fromdifferent feature maps with different resolutions DSSDfurther improvised SSD by shifting to the ldquoencoder-decoderrdquotype of networks to incorporate context information in thelearning by replacing the VGG-16 architecture with resid-ual-101

YOLO [27] unlike its predecessors trained single re-gressor network for direct prediction of object boundingboxes with category confidences (is approach of completeregression without any classification step yielded superiorreal-time performance in terms of the detection speed AlsoYOLO bettered detection accuracy in comparison with allother detection systems except Faster-RCNN With theincorporation of anchor boxes for guiding the bounding boxprediction in a newly designed network YOLOv2 betteredall other existing state of the art in PASCAL VOC-2017challenge (e latest YOLO (here onwards we refer to thisversion as YOLOv3) claims to improve its performance withthe incorporation of much deeper network architecturecombining predictions at multiple scales (e YOLO algo-rithms see the entire image all at once through the forwardpass of the network which gives more accurate and com-prehensive information (is helps the detector in avoidingfalse positives than the classification based detection systemswhich focus only on the region proposals Despite re-markable progress in object detection in natural images dueto deep learning-based strategies the performance ofexisting models requires improvement from the accuracyand real-time performance standpoint (e latest results onPascal VOC-challenge [6] and MS-COCO [7] datasets es-tablish that more effort is required to solve the detectionproblem

In addition several CNN based detection models havebeen designed for specific target objects [28ndash30] (esemethods have raised the benchmark in target objectiondetection nevertheless the problem is far from being solved[531ndash33] An important distinction in specific object de-tection is the availability of auxiliary information sources oruser feedback which has helped researchers to solve theproblem up to the acceptable limit

In this work we propose a feature enriched object de-tector based on the latest YOLO architecture Our choice of

YOLO architecture is based on its comparative detectionaccuracy and superior detection rate It is expected that CNNbased detection model should be able to learn the object-specific salient features required for detection neverthelessit cannot be guaranteed in single objective-based learningformulation (is direction of research exploring learning-based enrichment of deep features by including additionalmodalities has been not been explored for object detectionIn this context the recent work on foreground extraction forvideo analysis [34] a deep neural network based frameworkexploits on the multistage fusion of the combination ofresidual features from different convolutional layers On theother hand our approach is to feed in the handcraftedfeatures in the network to guide the feature learning processleading towards more accurate detection For learning-basedfeature fusion we use ICF descriptor which has shownremarkable object detection performance as discussedabove

3 Revisiting the Object Detection Problem

Object detection consists of localization of region capturingthe object and assigning their labels by processing localizedregion (e localization problem can be formulated as (i)regression problem that predicts the region of interests or(ii) binary classification problem focusing on detection offoreground region having object segments (e outcome ofany of these methods would predict object boundary as set offour coordinates (xmin xmax ymin and ymax) as shown inFigure 1

Label identification of the object region is a N-classclassification problem where N denotes the number ofclasses for example tree person bird dog bicycle or catcar tree In both the stages of object detection problemchallenges arise because of the underlying variations inlighting conditions scaling occlusion partial view andorientation Figure 2 shows some such challenging situationsfrom the COCO dataset

4 Preliminaries

Before presenting the object detection methodology usinglearning-based feature fusion we briefly discuss the integralchannel features and YOLOv3rsquos architecture We also makesome modifications in the YOLOv3 architecture as proposedin [35] which are also described in this section

41 You Only Look Once Object Detector-YOLOv3 (e classof YOLO algorithms [82736] look at the entire image whendetecting and recognizing objects and extract deep infor-mation about classes and their appearance unlike otherapproaches such as sliding window-based methods orRndashCNN based algorithms (ese algorithms treat the de-tection of objects as a single regression problem giving fasterresponse with a reduction in the design complexity of thedetector (ough the significant speed achievement thealgorithms lag in terms of accuracy especially with smallobjects

Journal of Engineering 3

(e latest algorithm in the YOLO family that isYOLOv3 has proven its performance compared to otherstate-of-the-art detectors (e YOLOv3 architecture has 107layers in total distributed as convolutional 75 route 4residual 23 upsample 2 detection 3 (e architectureuses a new model for feature extraction referred to asDarknet-53 (e new model is significantly larger thanmodels used in the earlier version nevertheless it hasproven to be more efficient than other state of the arts (emodel uses 53 convolution layers which takes an input imageof size 416times 416 Figure 3 shows the architecture of theYOLOv3 object detector (e Darknet-53 model is pre-trained on ImageNet [37] For the detection task the net-work is modified by removing its last layer and then stackingup additional layers resulting in the final network archi-tecture (e first 75 layers in the network represent 52convolutional layers of the Darknet-53 model pretrained onImageNet (e remaining 32 layers are added to qualifyYOLOv3 for object detection on different datasets withfurther training Besides YOLOv3 applies new residuallayers similar to skip connections that combine feature mapsfrom two layers using elementwise addition resulting infiner-grained information

(eYOLOv3 replaces the softmax based activation used inolder versions with independent logistic classifiers Featuresare extracted using a similar concept to feature pyramidnetworks Also binary cross-entropy loss is now used for classpredictions which is beneficial when faced with images

having overlapping labels (e K-means is used for anchorbox generation however 9 bounding boxes are now usedrather than 5 (e number of bounding boxes is distributedequally across the three detection scales In the present designa route layer is also used which outputs a layerrsquos feature maps

411 Prediction YOLOv3 processes images by dividingthem in N times N grid and if the center of the object falls in agrid cell then that cell is responsible for detecting the object(e network predicts bounding boxes at three differentscales (e first detection scale is used for detecting large-sized objects(e second detection scale is used for medium-sized objects and the last for small objects (e three de-tection layers are displayed in red in Figure 3 Each cellpredicts B bounding boxes and each prediction has 5predictions x y w h and confidence score representing themeasure of prediction having an object (e x and y pa-rameters are the center of the box for the grid cell while w

and h are the width and height of the predicted box for theentire image (e confidence score is the Intersection overUnion (IoU) between the predicted box and the groundtruth (e output prediction is N times N times B times (5 + C) tensorwhere 5 is the number of predictions per bounding box(x y w h and the confidence value) and C is the totalnumber of object categories Figure 4 shows the predictionalgorithm of YOLOv3 for N times N cells where each cell has B

bounding boxes

Input Detection algorithm Output

Objectdetector

Detection results

105 221 187 407 dog156 501 124 341 bicycle480 587 84 176 truck

TruckBicycle

Dog

Figure 1 Object detection in an example image

Figure 2 Example images from MS-COCO dataset

4 Journal of Engineering

412 Loss Function (e YOLOv3 includes a loss function(1) that instructs the network to correctly predict bounding

boxes and accurately classify the detected objects with aprovision to penalize false positives

λcoord 1113944

N2

i01113944

B

j01

objij xi minus 1113954xi( 1113857

2+ yi minus 1113954yi( 1113857

21113960 1113961 + λcoord 1113944

N2

i01113944

B

j01objij

w

radici minus

1113954wi

1113969

1113874 11138752

+h

radic

i minus

1113954hi

1113969

1113874 11138752

1113890 1113891

+ 1113944N2

i01113944

B

j01objij Ci minus 1113954Ci1113872 1113873

2+ λnoobj 1113944

N2

i01113944

B

j01noobjij Ci minus 1113954Ci1113872 1113873

2minus 1113944

N2

i01113944

B

j0δilog 1113954pi(c)( 1113857 + 1 minus δi( 1113857log 1 minus 1113954pi(c)( 1113857

(1)

(e symbols are explained in Table 1 (e symbolsunder hat represent corresponding prediction values

(e loss function in the equation has three error com-ponents localization confidence and classification as

416

416

33

416 208

208208

208

104104 52 52

52

256 256 512

52104

128 128

10411

1 331 1

26

26

1

33

416

33 1

1 33

332

6464

L0-conv0 3 times 3 times 64 L1-conv1 3 times 3 times 64-s2 L2-conv2 1 times 1 times 32 L5-conv43 times 3 times 128-s2

L12-conv93 times 3 times 256-s2 L12-conv9 3 times 3 times 256-s2

L3-conv3 3 times 3 times 64L4-res-1

times 1L6-conv5 1 times 1 times 64L7-conv6 3 times 3 times 128L8-res-3

times 2

L13-conv10 1 times 1 times 128L14-conv11 3 times 3 times 256L15-res-3

times 8

(a)

L38-conv27 1 times 1 times 256L39-conv28 3 times 3 times 512

L84-route-4 L86-route-1 61

L98-route-1 36

L84-conv59 1 times 1 times 256

L95-route-4L96-conv67 1 times 1 times 128

L85-upsample times 2

L97-upsample times 2

L40-res-3

times 8L63-conv44 1 times 1 times 512L64-conv45 3 times 3 times 1024L65-res-3

times 4L75-conv52 1 times 1 times 512L76-conv53 3 times 3 times 1024

L81-conv58 1 times 1 times ZL82-first detection layer

L93-conv66 1 times 1 times ZL94-second detection layer

L105-conv74 1 times 1 times ZL106-third detection layer

times 3

L87-conv60 1 times 1 times 256L88-conv61 3 times 3 times 512

times3

L99-conv68 1 times 1 times 128L100-conv69 3 times 3 times 256

times 3

L62-conv43 3 times 3 times 1024-s2

26

26

3

3

512

13 13

13 13

11

1311

1024 1024

13

1313

11

1024 Z

13

13

256256

26

26 26

26

512

26

26

11

768 (256 + 512)

384 (128 + 256)

11

Z

26

26

Z

52

52

52

256

52

52

52

52

128128

5226

26

11

11

(b)

Figure 3 YOLOv3 network architecture (a) Convolutional layers in the YOLOv3 (b) Detection layers in the YOLOv3

Journal of Engineering 5

observed in equation (1) Different loss components arecombined by sum-squared algorithm as it is easier foroptimization (e localization loss is responsible forminimizing the error between the ldquoresponsiblerdquobounding box and the ground truth object if an object isdetected in a grid cell

42 Integral Channel Features (ICF) In this work we fusedifferent channels of information proposed in the ICF [9]in the YOLOv3 architecture ICF descriptor is a series ofimage channels computed from the input image usinglinear and nonlinear transformations A channel refers to arepresentation of the input image Next first-order featuresare extracted by computing the sum of rectangular regions(e evaluated channels were color channels (RGB GrayHSV and LUV) gradient magnitude and gradients his-togram (e authors in [9] claimed that the most infor-mative channel among all in independent evaluation forpedestrian detection was HOG Further a combination ofLUV gradient and HOG gave the best detection rate Ourproposed work reevaluates the combination of channelsbecause of varying challenges in the present applicationscenario In our work the window size parameter for HOGcomputation depends on the 2D dimensions of the deepfeatures where they are to be fused with (discussed inSection 41) As an example if the 2D dimensions of a layerwhere features are to be fused are 13times13 and the inputimage is of size 416times 416 then the window size will be32times 32 so that it results in 13times13 HOGs For each windowwe compute the HOG using six bins Other HOG pa-rameters including cell and block size used for normali-zation purposes are also set the same as the window size(e block stride parameter is also set equal to the window

size having no overlap between two windows As shown inFigure 5 the ICF for a given image is the linear concate-nation of individual color channels after normalization thecorresponding HOG feature (e preprocessing in featurecomputation includes the resizing of the input image toalign it with the network layer that receives this input Forexperiments discussed in this paper we follow the samesteps and parameters for HOG computation as discussed in[9]

5 Learning-Based Feature Fusion in YOLOv3

Before we establish the proposed feature fusion we set outthe required steps which will be described subsequently

(1) Identify the candidate convolution layers for featureinjection

(2) Evaluate the layers for feature injection usingcomplete set of ICF channels

(3) Evaluate different combinations of ICF channelsusing the best network layer obtained in the previousstep

(4) Train and test the detector injecting the best ICFchannel combination at the best layer position

Our proposal for feature fusion starts with identifying alocation and space for handcrafted features within theYOLOv3 architecture For injecting an additional set offeatures in this architecture we first need to identify theconvolution layer to feed additional information We adopta qualitative approach to decide on the layer for featurefusion by validating feature fusion on different layers (edepth space of a convolutional layer represents the numberof feature maps that hold the deep featuremdashhereafter

YOLOv3Cell (0 1)

Cell (0 0)

x y w h objectness airphane 05 shark 045

x y w h objectness kite 03 car 001

x y w h objectness bird 012 truck 002

Cell (N N)

Figure 4 YOLOv3 prediction output for an example image cell(i j) corresponds to the image region within the ith row and jth column ofthe N times N grid

6 Journal of Engineering

referred to as filters We need to identify the specific numberof filters within the specific layer which will store the ad-ditional features First we propose to double the originalnumber of filters in the specific layer as shown in Figure 6

Depending on the number of channels in the ICF set thesame number of filters needs to be reserved for storinghandcrafted feature values (e ICF descriptor is injectedinto the network at the beginning of the training procedureIn the layer selected for ICF injection the additional filters(introduced by doubling the layer depth) are copied withICF descriptor from the rear end as shown in Figure 6 (eremaining filters are set to zero in the beginning As thetraining progresses the layer weights that is filter valuesupdate following the learning algorithm (e output gen-erated by this layer depends on the values of all filters (eproposedmethod for feature fusion diverges from the simpleapproach of the stacking of handcrafted features withoriginal filters in the selected layer (e extra filters in ad-dition to the filters used for storing ICF descriptor valueshelp regularize the layer weightsWemaintain the strategy ofdoubling the filters for feature fusion regardless of the

specific convolutional layer under validation An exampleshowing the procedure of doubling the number of filters at aspecific layer to issue space for ICF fusion is illustrated inFigure 6 In this example fusion was selected to occur at62n d layer of YOLOv3rsquos network (e top side of the figureshows a section from the original YOLOv3rsquos network and thebottom side is where the number of filters for the 62n d layeris doubled (e number of filters was changed from 1024 to2048 (e light orange is the new filter depth (2048) andconsists of the original filter depth (1024) concatenated withthe additional filters of the same depth represented in greenAdditionally doubling the number of filters in the fusionlayer poses fewer design challenges than with filters withdirect stacking of handcrafted features

51 Design Issues with Injection of ICF Descriptor To fusehandcrafted features with deep features their 2D dimen-sions have to match the width and height dimensions of thelayer Considering the example in Figure 6 at 62n d layer ifthe input image size is 416times416 then the filter dimensionswill be 13times13 Hence all selected channels for fusion shouldbe of size 13times13 to fit into the additional filters In thetesting phase of YOLOv3 the size of the input image is fixedat 416times 416 so the dimensions of the ICF descriptor at the62n d layer will always be 13times13 On the other hand in thetraining phase of YOLOv3 the size of input image changesbased on a random parameter in every 10 iterations wherethe size of the image is defined as a factor of 32 starting from320times 320 up to 608times 608 If the input size is 320times 320 or608times 608 then at the 62n d layer the filter size becomes10times10 or 19times19 respectively (e filter size compared withthe input image size is downsized by a factor of 32 Aftermaking sure that the handcrafted features are of the rightsize their values can replace the additional filters in thechosen layer

(e flowchart of the proposed object detector is shown inFigure 7 For training purposes the desired ICF channels ofinput images at different scales are calculated offline (eseimages at different scales are used to incorporate variabilityin the training process In the testing phase ICF channels arecalculated only once for each image at a fixed scale(e rightside of the figure shows that as our proposed detector isbeing trainedtested fusion takes place at the layer named``nth YOLOv3 convolutional layerrdquo (e doubled number of

Figure 5: Flowchart for the ICF computation for an example image (input image → preprocessing → Gray, HSV, LUV, gradient-magnitude, and HOG channels → normalization → concatenation into the ICF handcrafted feature vector).
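The channel stack of Figure 5 could be approximated as in the sketch below, which assumes OpenCV for the colour conversions and substitutes a plain 6-bin, per-cell gradient-orientation histogram for the full HOG channels; the cell size and the normalization step are assumptions rather than the exact parameters of [9].

```python
import numpy as np
import cv2  # assumed available

def icf_channels(bgr, cell=32, bins=6):
    """Simplified ICF stack: RGB, gray, HSV, LUV, gradient magnitude, and a
    coarse per-cell orientation histogram standing in for the HOG channels."""
    img = bgr.astype(np.float32)
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    luv = cv2.cvtColor(bgr, cv2.COLOR_BGR2Luv).astype(np.float32)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)

    # 11 pixel-level channels: B, G, R, gray, H, S, V, L, u, v, |grad|.
    pixel = np.stack([img[..., 0], img[..., 1], img[..., 2], gray,
                      hsv[..., 0], hsv[..., 1], hsv[..., 2],
                      luv[..., 0], luv[..., 1], luv[..., 2], mag], axis=0)

    # 6 HOG-like channels, one value per non-overlapping cell.
    h, w = gray.shape
    hog = np.zeros((bins, h // cell, w // cell), dtype=np.float32)
    for i in range(h // cell):
        for j in range(w // cell):
            m = mag[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell].ravel()
            a = ang[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell].ravel()
            hog[:, i, j], _ = np.histogram(a, bins=bins, range=(0, np.pi), weights=m)

    # Per-channel max normalization before concatenation (Section 6.1 also
    # compares the unnormalized and Z-score variants).
    pixel /= pixel.max(axis=(1, 2), keepdims=True) + 1e-6
    hog /= hog.max(axis=(1, 2), keepdims=True) + 1e-6
    return pixel, hog   # both are later resized to the fusion-layer grid
```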

Table 1: Symbols used in the YOLOv3 loss function.

Symbol | Definition
N | Number of grid cells in an image
B | Number of anchor boxes
C_i | Confidence score of the jth bounding box in grid cell i
p_i(c) | Conditional probability of class c in grid cell i
x_i, y_i, w_i, h_i | Location and size of a bounding box
1_i^obj | 1 if an object appears in grid cell i; 0 otherwise
1_ij^obj | 1 if an object is present in the ith grid cell and the jth "responsible" bounding box; 0 otherwise
1_ij^noobj | 1 if no object is present in the ith grid cell and the jth "responsible" bounding box; 0 otherwise
δ_i | 1 if the predicted label matches the ground-truth label; 0 otherwise
λ_coord | Constant (default 5.0)
λ_noobj | Constant (default 0.5)


The doubled number of filters and the number of ICF channels are represented as 2x and y, respectively. The figure shows how the ICF overlaps with a portion of the deep features residing in some of the additional filters.
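A minimal sketch of this per-layer ordering (convolution and batch normalization elided, fusion applied to the post-bias feature maps before the activation) is given below; the leaky-ReLU slope and all names are assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    return np.where(x > 0, x, slope * x)

def fuse_icf(feature_maps, icf):
    """feature_maps: (2x, H, W) maps after batch normalization and bias;
    icf: (y, H, W) descriptor resized to the same grid."""
    y = icf.shape[0]
    fused = feature_maps.copy()
    fused[-y:] = icf          # the ICF occupies the rear y feature maps
    return fused

# Example with the dimensions used in the paper: 2x = 2048 maps, y = 17 channels.
maps = np.random.randn(2048, 13, 13).astype(np.float32)
icf = np.random.rand(17, 13, 13).astype(np.float32)
activated = leaky_relu(fuse_icf(maps, icf))
```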

6. Experimental Evaluation

A preliminary evaluation of the proposed framework is performed on the PASCAL-VOC dataset (VOC2007 detection task). The original YOLOv3 detector achieved a mAP of 67.5% on the testing set with an input image dimension of 416 × 416. The YOLOv3 implementation made available by [35] is used for benchmarking our experimental results. As mentioned in Section 5, we first identify the convolutional layer for injecting the handcrafted features and the required space (i.e., the number of filters). First, we find the best layer location to fuse the ICF channels; subsequently, with another set of experiments, we determine the best combination of ICF channels. We separate a validation set comprising one-tenth of the training examples by random selection. The validation set is used for evaluating the candidate convolutional layers and the combinations of ICF channels for optimum selection. For determining the best layer for fusion, we use all channels of information in ICF, as in the original work by Dollar et al. [9]. This is done by doubling the number of filters at the layer under evaluation, fusing the 17 channels of the ICF set, and checking the resulting mAP. We evaluated the convolutional layers preceding a detection layer in the YOLOv3 architecture. For each experiment, the network is trained for 25,000 iterations. The remaining training parameters are set as in the original YOLOv3 detector. We performed the first evaluation on the 80th layer, which is the last convolutional layer before the detection layer.

Doubling the number of filters for the 80th layer and injecting the full set of ICF channels resulted in a mAP of 69.8%. In comparison with the YOLOv3 benchmark performance, we achieved an increase of 1.7% on the mAP scale. The next experiment, on the 79th layer, achieved a mAP of 71.5%, which is more effective than the previous experiment. Table 2 lists the mAP scores for the experiments conducted to determine the best layer position for feature fusion. As observed, the best mAP score of 71.5% is achieved at the 79th layer.

After determining the most suitable layer for ICF fusion, we conduct experiments to discover the combination of ICF channels that best suits the object detection task on the VOC dataset. Again, in this set of experiments, we fuse ICF channels by doubling the number of filters at the 79th layer, and we run the training process for 50,000 iterations. Fusing all ICF channels (color, gradient magnitude, and HOG) achieved a mAP of 77.7%. The entire ICF descriptor sums up to 17 channels, considering 6 channels for HOG. Removing the RGB channels from the 17 ICF channels increased the mAP to 78.2%, a minuscule increase of 0.5% over the preceding experiment. Considering that the original input images are already in RGB color values, this increase is not significant. We also explored other channel combinations; the corresponding mAP scores are presented in Table 3.
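The two-stage selection described above can be outlined as follows; train_and_eval is a hypothetical helper (not part of the released Darknet code) that trains the modified detector for the stated number of iterations and returns the mAP on the held-out validation split.

```python
# Illustrative outline of the two-stage selection, not the authors' scripts.
ALL_ICF = ("RGB", "Gray", "HSV", "LUV", "grad", "HOG")   # six groups, 17 channels in total

def select_fusion_layer(candidate_layers, train_and_eval, iters=25000):
    # Stage 1: fuse the full ICF set at each candidate layer and keep the best.
    scores = {layer: train_and_eval(layer=layer, channels=ALL_ICF, iterations=iters)
              for layer in candidate_layers}
    return max(scores, key=scores.get)

def select_channel_combo(layer, combos, train_and_eval, iters=50000):
    # Stage 2: with the layer fixed, compare channel combinations.
    scores = {combo: train_and_eval(layer=layer, channels=combo, iterations=iters)
              for combo in combos}
    return max(scores, key=scores.get)
```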

As seen, the best ICF channel combination for the VOC2007 detection task consists of Gray, HSV, LUV, gradient magnitude, and HOG, achieving a maximum mAP score of 79.1%. We retrain the YOLOv3 detector on the complete training set, fusing the finalized channels of the ICF descriptor. The network achieved a mAP of 78.7% on the testing set, which is 11.2% more than the detection rate achieved by the original YOLOv3 detector.

6.1. Evaluation on MS-COCO Dataset. The MS-COCO dataset consists of 82,783 training, 40,504 validation, and 40,775 testing images belonging to 80 categories. In the literature, YOLOv3 [8] reported a mAP of 55.3% on the COCO dataset for an input image size of 416 × 416. All experiments reported in this work were executed on an NVIDIA GeForce GTX 1080 GPU desktop, which is modest in

Figure 6: ICF descriptor injection in the specific convolutional layer (top: the original section of the YOLOv3 network around layer L62-conv43, 3 × 3 × 1024, stride 2; bottom: the same section with the 62nd layer's filter count doubled to 2048, the original 1024 filters concatenated with the additional filters reserved for ICF fusion).


comparison with the infrastructure used in the original paper [8]. Therefore, we adopt an alternative strategy for benchmarking, where the YOLO detector (original and with the proposed fusion strategy) is retrained on the MS-COCO dataset and the detection performance is compared at 50,000 iterations. The original YOLOv3 evaluation on the COCO testing dataset achieved a mAP of 36.5% at 50,000 iterations.

After determining the best location for ICF fusion using the mAP values achieved on the validation set, as shown in Table 4, we performed experiments to determine the combination of ICF channels that best suits the COCO dataset (Table 5). As shown in Table 4, doubling the 78th layer resulted in the highest mAP score (37.3%). Therefore, all subsequent experiments related to the COCO dataset were conducted with the doubled number of filters at the 78th layer. All experiments were evaluated at 50,000 iterations. Fusing all ICF channels (color, gradient magnitude, and HOG) resulted in a mAP of 37.7%. Removing RGB from the 17 ICF channels led to a mAP of 37.3%. Other channel combinations were also tried but did not give good results, such as fusing only HOG features (6 channels), LUV + HOG (9 channels), gray + LUV + gradient + HOG (11 channels), and gray + HSV + LUV + gradient (8 channels). The best combination, giving the highest mAP score, consisted of the fusion of all ICF channels.

Furthermore, the evaluation on the testing set with fusion of the complete set of ICF channels at the 78th layer achieved a mAP value of 37.8%. With the COCO dataset, we also experimented with the effect of normalization of the ICF channels. To measure its effect on the performance of the network, we tried two different normalization techniques. All the results discussed so far were achieved without any normalization.

We first experimented with the max normalization technique, dividing each channel by its maximum range so that all values fall in the range of 0 to 1. The fusion of max-normalized ICF channels in YOLOv3 achieved a mAP of 38.4% on the testing set, which is 0.6% higher than the best result achieved with unnormalized features and 1.9% higher than the original YOLOv3 performance. This agrees with the training principle of deep networks, which suggests the use of input normalization as one of the tools

Table 2: VOC dataset detection rate with fusion of the complete set of ICF channels.

Layer number | Number of filters after doubling | mAP
59 | 512 | 66.1
60 | 1024 | 68.4
62 | 2048 | 66.2
77 | 1024 | 69.8
78 | 2048 | 70.9
79 | 1024 | 71.5
80 | 2048 | 69.8

Table 3: VOC dataset detection rate with fusion of ICF channel combinations at the 79th layer.

ICF channel combination | Number of channels | mAP
RGB + gray + HSV + LUV + grad + HOG | 17 | 77.7
Gray + HSV + LUV + grad + HOG | 14 | 77.4
HSV + LUV + grad + HOG | 13 | 78.6
Gray + LUV + grad + HOG | 11 | 78.0
LUV + grad + HOG | 10 | 79.1
Gray + HSV + grad + HOG | 11 | 76.8
Gray + HSV + LUV | 7 | 76.2
Gray + HSV + LUV + HOG | 13 | 77.1

Figure 7: Flowchart of the proposed object detector (offline branch: for every training image, the ICF is computed and stored at each input scale; online branch: the YOLOv3 forward pass through its 107 layers, where, at the nth convolutional layer, the computed ICF (y channels) is fused into the 2x feature maps after convolution, batch normalization, and bias addition and before the activation).


for performance improvement. As the second normalization technique, we used Z-score based normalization, which converts the data in all channels to have mean 0 and standard deviation 1. However, the Z-score normalization did not perform as well as the previous technique.
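A minimal sketch of the two normalization options compared here is given below; channels denotes the (y, H, W) ICF stack, and the small epsilon guard is an assumption.

```python
import numpy as np

def max_normalize(channels):
    # Divide each channel by its maximum value so that all entries fall in [0, 1].
    peak = channels.max(axis=(1, 2), keepdims=True)
    return channels / np.maximum(peak, 1e-6)

def zscore_normalize(channels):
    # Shift and scale each channel to zero mean and unit standard deviation.
    mean = channels.mean(axis=(1, 2), keepdims=True)
    std = channels.std(axis=(1, 2), keepdims=True)
    return (channels - mean) / np.maximum(std, 1e-6)
```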

7. Conclusion and Future Work

In this work, we present an approach to fuse handcrafted features in a convolutional neural network based object detector. The fusion of handcrafted features with learning-based features has proved effective in many earlier works. Here, we demonstrate the methodological steps to fuse handcrafted features into the latest version of the YOLO detector. Our experiments with fusion of simple integral channel features in YOLO have yielded substantial improvements in the detection rate on the PASCAL-VOC and MS-COCO datasets. In conventional machine learning, early, late, and learning-based fusion have been the primary strategies for feature fusion. Deep learning networks, however, are designed for specific input sizes, which poses a challenge for the algorithm designer who wants to feed extra information into the network unless the network is redesigned. The present work offers a novel approach, based on methodological steps, to address both problems. In future work, we plan to extend the approach to sequence-based learning for object detection and tracking.

Data Availability

This work utilizes standard datasets, which are open access.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: an evaluation of the state of the art," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, 2012.

[2] D. T. Nguyen, W. Li, and P. O. Ogunbona, "Human detection from images and videos: a survey," Pattern Recognition, vol. 51, pp. 148–175, 2016.

[3] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.

[4] S. Zafeiriou, C. Zhang, and Z. Zhang, "A survey on face detection in the wild: past, present and future," Computer Vision and Image Understanding, vol. 138, pp. 1–24, 2015.

[5] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele, "How far are we from solving pedestrian detection?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, NV, USA, pp. 1259–1267, June 2016.

[6] PASCAL Visual Object Classes homepage (PASCAL-VOC), http://host.robots.ox.ac.uk/pascal/VOC/.

[7] COCO: Common Objects in Context (MS-COCO), http://cocodataset.org/.

[8] J. Redmon and A. Farhadi, YOLOv3: An Incremental Improvement, 2018, https://arxiv.org/abs/1804.02767.

[9] P. Dollar, Z. Tu, P. Perona, and S. Belongie, Integral Channel Features, BMVC Press, Swansea, UK, 2009.

[10] F. Arman and J. K. Aggarwal, "Model-based object recognition in dense-range images: a review," ACM Computing Surveys (CSUR), vol. 25, no. 1, pp. 5–43, 1993.

[11] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," Computer Vision and Pattern Recognition, vol. 1, pp. 886–893, 2005.

[12] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[13] A. Bosch, A. Zisserman, and X. Munoz, "Representing shape with a spatial pyramid kernel," in Proceedings of the 6th ACM International Conference on Image and Video Retrieval, pp. 401–408, Association for Computing Machinery, Amsterdam, The Netherlands, July 2007.

[14] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[15] J. Shotton, A. Blake, and R. Cipolla, "Contour-based learning for object detection," in Proceedings of the Tenth IEEE International Conference on Computer Vision, vol. 1, IEEE, Beijing, China, pp. 503–510, October 2005.

[16] S. Agarwal and D. Roth, "Learning a sparse representation for object detection," in Proceedings of the European Conference on Computer Vision, pp. 113–127, Copenhagen, Denmark, May 2002.

[17] P. Dollar, B. Babenko, S. Belongie, P. Perona, and Z. Tu, "Multiple component learning for object detection," in Proceedings of the 10th European Conference on Computer Vision, vol. 10, Springer, Berlin, Germany, pp. 211–224, June 2008.

[18] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.

[19] N. Tong, H. Lu, X. Ruan, and M.-H. Yang, "Salient object detection via bootstrap learning," in Proceedings of the IEEE

Table 4: COCO dataset detection rate with fusion of the complete set of ICF channels.

Layer number | Number of filters after doubling | mAP
57 | 1024 | 35.2
59 | 512 | 33.7
60 | 1024 | 35.6
73 | 2048 | 35.1
75 | 1024 | 36.6
77 | 1024 | 36.3
78 | 2048 | 37.3
79 | 1024 | 36.9

Table 5: MS-COCO detection rate with fusion of ICF channel combinations at the 78th layer.

ICF channel combination | Number of channels | mAP
RGB + gray + HSV + LUV + grad + HOG | 17 | 37.7
Gray + HSV + LUV + grad + HOG | 14 | 37.3
HSV + LUV + grad + HOG | 13 | 37.4
Gray + LUV + grad + HOG | 11 | 36.4


Conference on Computer Vision and Pattern Recognition, IEEE, Boston, MA, USA, pp. 1884–1892, June 2015.

[20] X. Wang, M. Yang, S. Zhu, and Y. Lin, "Regionlets for generic object detection," in Proceedings of the IEEE International Conference on Computer Vision, IEEE, Sydney, NSW, Australia, pp. 17–24, December 2013.

[21] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, OH, USA, pp. 580–587, June 2014.

[22] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448, Santiago, Chile, 2015.

[23] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: object detection via region-based fully convolutional networks," in Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 379–387, Barcelona, Spain, December 2016.

[24] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Boston, MA, USA, pp. 3431–3440, June 2015.

[25] W. Liu, D. Anguelov, D. Erhan et al., "SSD: single shot multibox detector," in European Conference on Computer Vision, pp. 21–37, Amsterdam, Netherlands, 2016.

[26] C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, Deconvolutional Single Shot Detector, 2017, https://arxiv.org/abs/1701.06659.

[27] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, NV, USA, pp. 779–788, June 2016.

[28] Y. Tian, P. Luo, X. Wang, and X. Tang, "Deep learning strong parts for pedestrian detection," in Proceedings of the IEEE International Conference on Computer Vision, IEEE, Santiago, Chile, pp. 1904–1912, December 2015.

[29] Y. Tian, P. Luo, X. Wang, and X. Tang, "Pedestrian detection aided by deep learning semantic tasks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, MA, USA, pp. 5079–5087, June 2015.

[30] Y. Zhou, L. Liu, L. Shao, and M. Mellor, "DAVE: a unified framework for fast vehicle detection and annotation," in Proceedings of the European Conference on Computer Vision, pp. 278–293, Amsterdam, The Netherlands, October 2016.

[31] B. Li, T. Zhang, and T. Xia, "Vehicle detection from 3D lidar using fully convolutional network," 2016, http://arxiv.org/abs/1608.07916.

[32] L. Zhang, L. Lin, X. Liang, and K. He, "Is Faster R-CNN doing well for pedestrian detection?" in Proceedings of the European Conference on Computer Vision, pp. 443–457, Amsterdam, The Netherlands, October 2016.

[33] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele, "Towards reaching human performance in pedestrian detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 973–986, 2018.

[34] T. Akilan, Q. M. J. Wu, and W. Zhang, "Video foreground extraction using multi-view receptive field and encoder-decoder DCNN for traffic and surveillance applications," IEEE Transactions on Vehicular Technology, vol. 68, no. 10, pp. 9478–9493, 2019.

[35] A. B. Alexey, Windows and Linux Version of Darknet YOLO v3 & v2 Neural Networks for Object Detection, 2018.

[36] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp. 6517–6525, July 2017.

[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.



referred to as filters We need to identify the specific numberof filters within the specific layer which will store the ad-ditional features First we propose to double the originalnumber of filters in the specific layer as shown in Figure 6

Depending on the number of channels in the ICF set thesame number of filters needs to be reserved for storinghandcrafted feature values (e ICF descriptor is injectedinto the network at the beginning of the training procedureIn the layer selected for ICF injection the additional filters(introduced by doubling the layer depth) are copied withICF descriptor from the rear end as shown in Figure 6 (eremaining filters are set to zero in the beginning As thetraining progresses the layer weights that is filter valuesupdate following the learning algorithm (e output gen-erated by this layer depends on the values of all filters (eproposedmethod for feature fusion diverges from the simpleapproach of the stacking of handcrafted features withoriginal filters in the selected layer (e extra filters in ad-dition to the filters used for storing ICF descriptor valueshelp regularize the layer weightsWemaintain the strategy ofdoubling the filters for feature fusion regardless of the

specific convolutional layer under validation An exampleshowing the procedure of doubling the number of filters at aspecific layer to issue space for ICF fusion is illustrated inFigure 6 In this example fusion was selected to occur at62n d layer of YOLOv3rsquos network (e top side of the figureshows a section from the original YOLOv3rsquos network and thebottom side is where the number of filters for the 62n d layeris doubled (e number of filters was changed from 1024 to2048 (e light orange is the new filter depth (2048) andconsists of the original filter depth (1024) concatenated withthe additional filters of the same depth represented in greenAdditionally doubling the number of filters in the fusionlayer poses fewer design challenges than with filters withdirect stacking of handcrafted features

51 Design Issues with Injection of ICF Descriptor To fusehandcrafted features with deep features their 2D dimen-sions have to match the width and height dimensions of thelayer Considering the example in Figure 6 at 62n d layer ifthe input image size is 416times416 then the filter dimensionswill be 13times13 Hence all selected channels for fusion shouldbe of size 13times13 to fit into the additional filters In thetesting phase of YOLOv3 the size of the input image is fixedat 416times 416 so the dimensions of the ICF descriptor at the62n d layer will always be 13times13 On the other hand in thetraining phase of YOLOv3 the size of input image changesbased on a random parameter in every 10 iterations wherethe size of the image is defined as a factor of 32 starting from320times 320 up to 608times 608 If the input size is 320times 320 or608times 608 then at the 62n d layer the filter size becomes10times10 or 19times19 respectively (e filter size compared withthe input image size is downsized by a factor of 32 Aftermaking sure that the handcrafted features are of the rightsize their values can replace the additional filters in thechosen layer

(e flowchart of the proposed object detector is shown inFigure 7 For training purposes the desired ICF channels ofinput images at different scales are calculated offline (eseimages at different scales are used to incorporate variabilityin the training process In the testing phase ICF channels arecalculated only once for each image at a fixed scale(e rightside of the figure shows that as our proposed detector isbeing trainedtested fusion takes place at the layer named``nth YOLOv3 convolutional layerrdquo (e doubled number of

Image

Preprocessing

Gray HSV LUV Gradmagnitude HOG

Normalization

Concatenation

ICF handcraftedfeatures vector

Figure 5 Flowchart for the ICF computation for an exampleimage

Table 1 Symbols used in YOLOv3 loss function

Symbol DefinitionN Number of grid cells in an imageB Number of anchor boxesCi Confidence score of jth bounding box in grid cell ipi(c) Conditional probability of class c in grid cell ixi yi wi hi Location and size of a bounding box1obji 1 if an object appears in grid cell i 0 otherwise1objij 1 if an object is present in ith grid cell and the jth ldquoresponsiblerdquo bounding box 0 otherwise1noobj

ij 1 if no object is present in ith grid cell and the jth ldquoresponsiblerdquo bounding box 0 otherwiseδi 1 if predicted label matches the ground truth label 0 otherwiseλcoord Constant (default 50)λnoobj Constant (default 05)

Journal of Engineering 7

filters and the number of ICF channels are represented as 2x

and y respectively (e figure shows how the ICF overlapswith a portion of the deep features residing in some of theadditional filters

6 Experimental Evaluation

A preliminary evaluation of the proposed framework isperformed on the PASCAL-VOC dataset (VOC2007 de-tection task) (e original YOLOv3 detector achieved themAP of 675 on the testing set with input image dimensionas 416times 416 (e YOLOv3 implementation available by [35]is used for benchmarking our experimental results Asmentioned in Section 5 we first identify the convolutionallayer for injecting the handcrafted features and the requiredspace (ie the number of filters) First we figure out the bestlayer location to fuse ICF channels Subsequently by anotherset of experiments we determine the best combination ofICF channels We separate a validation set comprising one-tenth of the training examples by random selection (evalidation set is used for testing the convolutional layers andcombination of ICF channels for optimum selection Fordetermining the best layer for fusion we use all channels ofinformation in ICF as used in the original work by Dollaret al [9](is is done by doubling the number of filters at thelayer under evaluation fusing 17 channels of ICF set andchecking the resulting mAP We evaluated the two con-volutional layers preceding a detection layer in YOLOv3architecture For each experiment the network is trained for25000 iterations (e remaining training parameters are setas the original YOLOv3 detector We performed the firstevaluation on 80th layer which is the last convolutional layerbefore the detection layer

Doubling the number of filters for 80th layer andinjecting the full set of ICF channels resulted in the mAP of698 In comparison with the benchmark performancebased on YOLOv3 performance we achieved an increase of17 on mAP scale (e next experiment on 79th layer

achieved the mAP of 715 which is more effective than theprevious experiment Table 2 lists the mAP scores for ex-periments conducted for determining the best layer positionfor feature fusion As observed the best mAP score of 715is achieved at 79th layer

After determining the most suitable layer for ICF fusionwe conduct experiments to discover the combination of ICFchannels for fusion that is most suitable for the object de-tection task in the VOC dataset Again in this set of ex-periments we fuse ICF channels by doubling the number offilters at the 79th layer In this experiment we run thetraining process for 50000 iterations Fusing all ICFchannels color gradient magnitude and HOG achieved themAP of 777 (e entire ICF descriptor summed up to 17channels considering 6 channels for HOG Removing theRGB channel from the 17 ICF channels increased the mAPto 782 showing a minuscule increase of 05 from thepreceding experiment Considering original input imagesare in RGB color values the increase is not significant Wealso explored other channel combinations the corre-sponding mAP scores are presented in Table 3

As seen the best ICF channel combination for VOC2007detection task consists of Gray HSV LUV gradient mag-nitudes and HOG achieving maximummAP score of 791We retrain the YOLOv3 detector on complete training setfusing the finalized channels of ICF descriptor (e networkachieved the mAP of 787 on the testing set which is 112more than the detection rate achieved by original YOLOv3detector

61 Evaluation on MS-COCO Dataset MS-COCO datasetconsists of 82783 training 40504 validation and 40775testing images belonging to 80 categories In the literaturethe YOLOv3 [8] reported the mAP of 553 on the COCOdataset for an input image size of 416times 416 All experimentsreported in this work were executed on a NVIDIA GeForceGTX 1080 GPU desktop which is unsubstantial in

L63-conv44 1 times 1 times 512L64-conv45 3 times 3 times 1024

L62-conv433 times 3 times 1024-s2

L62-conv43 3 times 3 times 2048-s2

L65-res-3times 4

L63-conv44 1 times 1 times 512L64-conv45 3 times 3 times 1024L65-res-3

times 4

26

26512

512

1024 102413

131

113

131

1

102413

1313

13

1024

11 1

11024

13

13

13

131

1

33

26

26

33

Original

Doubling

2048 ICF fusion

Figure 6 ICF descriptor injection in the specific convolutional layer

8 Journal of Engineering

comparison with the infrastructure used in the originalpaper [8] (erefore we adopt an alternate strategy forbenchmarking where the YOLO detector (original and withproposed fusion strategy) is retrained on the MS-COCOdataset and we compare the detection performance at 50000iterations (e original YOLOv3 evaluation on COCOtesting dataset achieved the mAP of 365 for 50000iterations

After determining the best location for ICF fusion usingthe mAP values achieved on the validation set as shown inTable 4 we performed experiments to determine the bestcombination of ICF channel fusion that best suits the COCOdataset (Table 5) As shown in Table 4 doubling the 78th

layer resulted in the highest mAP score (373) (ereforefor all subsequent experiments related to the COCO datasetwere conducted with the doubled number of filters at the78th layer All experiments were evaluated at 50000 itera-tions Fusing all ICF channels (color gradient magnitudeand HOG) resulted in the mAP of 377 Removing RGBfrom the 17 ICF channels led to the mAP of 373 Otherchannel combinations were tried out but did not give goodresults such as fusing only HOG features (6 channels)LUV+HOG (9 channels) gray + LUV+ gradient +HOG (11channels) and gray +HSV+LUV+ gradient (8 channels)(e best combination that gave the highest mAP scoreconsisted of fusion of the all ICF channels

Furthermore the evaluation on testing set with fusionof the complete set of ICF channels at the 78th layerachieved 378 of the mAP value With COCO datasetwe also experimented with the effect of normalization onthe ICF channels To measure the effect of normalizationon the performance of the network we tried out twodifferent normalization techniques All the results dis-cussed so far were achieved without any normalization

We first experimented with max normalization techniquedividing each channel by its maximum range so that allvalues fall in the range of 0 to 1 (e fusion of maxnormalized ICF channels in YOLOv3 achieved the mAPof 384 on the testing set which is 06 higher than thebest result achieved with unnormalized features and 19with the original YOLOv3 performance (is confirmswith the training principle of deep networks whichsuggests the use of input normalization as one of the tools

Table 2 VOC dataset detection rate with fusion of complete set ofICF channels

Layer number Number of filters after doubling mAP59 512 66160 1024 68462 2048 66277 1024 69878 2048 70979 1024 71580 2048 698

Table 3 VOC dataset detection rate with fusion of ICF channelcombinations at the 79th layer

ICF channel combination Number ofchannels mAP

RGB+ gray +HSV+LUV+ grad +HOG 17 777Gray +HSV+LUV+ grad +HOG 14 774HSV+LUV+ grad +HOG 13 786Gray + LUV+ grad +HOG 11 780LUV+ grad +HOG 10 791Gray +HSV+ grad +HOG 11 768Gray +HSV+LUV 7 762Gray +HSV+LUV+HOG 13 771

Offline Online

Imagedataset

ComputedICF

Read inputimage

Read inputimage

ICF computation

Store features

Trainingset

No

No

Yes

Yes

Morescales

Change inputimage scale

Imagedataset

1st YOLOv3layer

2nd YOLOv3layer

106th YOLOv3layer

107th YOLOv3layer

nth YOLOv3convolutional

layer

Convolution

Batchnormalization

Add bias

Fuse features

Activation

ComputedICF

Deep features

Deep features

ICF

ICF

2x y

y2x - y

Figure 7 Flowchart of the proposed object detector

Journal of Engineering 9

for performance improvement In the second normali-zation technique we used Z-score based normalizationwhich converts data in all channels to have mean 0 andstandard deviation 1 However the Z-score normaliza-tion did not perform as good as the previous normali-zation technique

7 Conclusion and Future Work

In this work we present an approach to fuse handcraftedfeatures in the convolutional neural network based objectdetector (e fusion of handcrafted features with learn-ing-based features has proved to be effective in manyearlier works In this work we demonstrate the meth-odological steps to fuse handcrafted features in the latestversion of the YOLO detector Our experiments with acombination of simple integral channel features fusion inYOLO have yielded substantial improvement in the de-tection rate on PASCAL-VOC and MS-COCO datasetsIn conventional machine learning early late andlearning-based fusion have been primary strategies forfeature fusion However deep learning networks aredesigned for specific input sizes which pose a challengefor algorithm designer to feed in extra information in thenetwork unless the network is redesigned (e workpresented a novel approach based on methodologicalsteps to address both the problems In future work weplan to explore the approach for sequence-based learningfor object detection and tracking

Data Availability

(is work utilizes standard datasets which are open access

Conflicts of Interest

(e authors declare that they have no conflicts of interest

References

[1] P Dollar C Wojek B Schiele and P Perona ldquoPedestriandetection an evaluation of the state of the artrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 34 no 4 pp 743ndash761 2012

[2] D T Nguyen W Li and P O Ogunbona ldquoHuman detectionfrom images and videos a surveyrdquo Pattern Recognitionvol 51 pp 148ndash175 2016

[3] S Ren K He R Girshick and J Sun ldquoFaster r-cnn towardsreal-time object detection with region proposal networksrdquoIEEE Transactions on Pattern Analysis and Machine Intelli-gence vol 39 no 6 pp 1137ndash1149 2017

[4] S Zafeiriou C Zhang and Z Zhang ldquoA survey on facedetection in the wild past present and futurerdquo ComputerVision and Image Understanding vol 138 pp 1ndash24 2015

[5] S Zhang R Benenson M Omran J Hosang and B SchieleldquoHow far are we from solving pedestrian detectionrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition IEEE Las Vegas NV USA pp 1259ndash1267 June 2016

[6] Pascal visual object classes homepage (pascal-voc) httphostrobotsoxacukpascalVOC

[7] Coco-common objects in context (ms-coco) httpcocodatasetorg

[8] J Redmon and A Farhadi Yolov3 An Incremental Im-provement 2018 httpsarxivorgabs180402767

[9] P Dollar Z Tu P Perona and S Belongie Integral ChannelFeatures BMVC Press Swansea UK 2009

[10] F Arman and J K Aggarwal ldquoModel-based object recog-nition in dense-range images-a reviewrdquo ACM ComputingSurveys (CSUR) vol 25 no 1 pp 5ndash43 1993

[11] N Dalal and B Triggs ldquoHistograms of oriented gradients forhuman detectionrdquo Computer Vision and Pattern Recognitionvol 1 pp 886ndash893 2005

[12] D G Lowe ldquoDistinctive image features from scale-invariantkeypointsrdquo International Journal of Computer Vision vol 60no 2 pp 91ndash110 2004

[13] A Bosch A Zisserman and X Munoz ldquoRepresenting shapewith a spatial pyramid kernelrdquo in Proceedings of the 6th ACMInternational Conference on Image and Video Retrievalpp 401ndash408 Association for Computing MachineryAmsterdam (e Netherlands July 2007

[14] P Viola and M J Jones ldquoRobust real-time face detectionrdquoInternational Journal of Computer Vision vol 57 no 2pp 137ndash154 2004

[15] J Shotton A Blake and R Cipolla ldquoContour-based learningfor object detection in Computer Visionrdquo in Proceedings ofthe Tenth IEEE International Conference on Computer Visionvol 1 IEEE Beijing China pp 503ndash510 October 2005

[16] S Agarwal and D Roth ldquoLearning a sparse representation forobject detectionrdquo in Proceedings of the European Conferenceon Computer Vision pp 113ndash127 Copenhagen Denmar May2002

[17] P Dollar B Babenko S Belongie P Perona and Z TuldquoMultiple component learning for object detectionrdquo in Pro-ceedings of the 10th European Conference on Computer vol 10Springer Berlin Germany pp 211ndash224 June 2008

[18] P F Felzenszwalb R B Girshick D McAllester andD Ramanan ldquoObject detection with discriminatively trainedpart-based modelsrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 32 no 9 pp 1627ndash1645 2010

[19] N Tong H Lu X Ruan and M-H Yang ldquoSalient objectdetection via bootstrap learningrdquo in Proceedings of the IEEE

Table 4 COCO dataset detection rate with fusion of complete setof ICF channels

Layer number Number of filters after doubling mAP57 1024 35259 512 33760 1024 35673 2048 35175 1024 36677 1024 36378 2048 37379 1024 369

Table 5 MS-COCO detection rate with fusion of ICF channelcombinations at the 78th layer

ICF channel combination Number ofchannels mAP

RGB+ gray +HSV+LUV+ grad +HOG 17 377Gray +HSV+LUV+ grad +HOG 14 373HSV+LUV+ grad +HOG 13 374Gray + LUV+ grad +HOG 11 364

10 Journal of Engineering

Conference on Computer Vision and Pattern RecognitionIEEE Boston MA USA pp 1884ndash1892 June 2015

[20] X Wang M Yang S Zhu and Y Lin ldquoRegionlets for genericobject detectionrdquo in Proceedings of the IEEE InternationalConference on Computer Vision IEEE Sydney NSW Aus-tralia pp 17ndash24 December 2013

[21] R Girshick J Donahue T Darrell and J Malik ldquoRich featurehierarchies for accurate object detection and semantic seg-mentationrdquo in Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition IEEE Columbus OHUSA pp 580ndash587 June 2014

[22] R Girshick ldquoFast R-Cnnrdquo in Proceedings of IEEE Interna-tional Conference on Computer Vision (ICCV) pp 1440ndash1448Santiago Chile 2015

[23] J Dai Y Li K He and J Sun ldquoR-FCN object detection viaregion-based fully convolutional networksrdquo in Proceedings ofthe 30th International Conference on Neural InformationProcessing Systems pp 379ndash387 Barcelona Spain December2016

[24] J Long E Shelhamer and T Darrell ldquoFully convolutionalnetworks for semantic segmentationrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition(CVPR) IEEE Boston MA USA pp 3431ndash3440 June 2015

[25] W Liu D Anguelov D Erhan et al ldquoSSD single shotmultibox detectorrdquo in European Conference on ComputerVision pp 21ndash37 Amsterdam Netherlands 2016

[26] C Fu W Liu A Ranga A Tyagi and A C Berg Decon-volutional Single Shot Detector 2017 httpsarxivorgabs170106659

[27] J Redmon S Divvala R Girshick and A Farhadi ldquoYou onlylook once unified real-time object detectionrdquo in Proceedingsof the IEEE Conference on Computer Vision and PatternRecognition IEEE Las Vegas NV USA pp 779ndash788 June2016

[28] Y Tian P Luo X Wang and X Tang ldquoDeep learning strongparts for pedestrian detectionrdquo in Proceedings of the IEEEInternational Conference on Computer Vision IEEE SantiagoChile pp 1904ndash1912 December 2015

[29] Y Tian P Luo X Wang and X Tang ldquoPedestrian detectionaided by deep learning semantic tasksrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern RecognitionIEEE Boston MA USA pp 5079ndash5087 June 2015

[30] Y Zhou L Liu L Shao and M Mellor ldquoDave a unifiedframework for fast vehicle detection and annotationrdquo inProceedings of the European Conference on Computer Visionpp 278ndash293 Amsterdam (e Netherlands October 2016

[31] B Li T Zhang and T Xia ldquoVehicle detection from 3d lidarusing fully convolutional networkrdquo 2016 httparxivorgabs160807916

[32] L Zhang L Lin X Liang and K He ldquoIs faster r-cnn doingwell for pedestrian detectionrdquo in Proceedings of the EuropeanConference on Computer Vision pp 443ndash457 Amsterdam(e Netherlands October 2016

[33] S Zhang R Benenson M Omran J Hosang and B SchieleldquoTowards reaching human performance in pedestrian de-tectionrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 40 no 4 pp 973ndash986 2018

[34] T Akilan Q M J Wu and W Zhang ldquoVideo foregroundextraction using multi-view receptive field and encoder-de-coder dcnn for traffic and surveillance applicationsrdquo IEEETransactions on Vehicular Technology vol 68 no 10pp 9478ndash9493 2019

[35] A B AlexeyWindows and Linux Version of Darknet Yolo V3amp V2 Neural Networks for Object Detection pp 6ndash2 2018

[36] J Redmon and A Farhadi ldquoYolo9000 better faster stron-gerrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR) IEEE SeattleWashington USA pp 6517ndash6525 July 2017

[37] A Krizhevsky I Sutskever and G E Hinton ldquoImagenetclassification with deep convolutional neural networksrdquoAdvances in Neural Information Processing Systemspp 1097ndash1105 2012

Journal of Engineering 11

Page 5: ResearchArticle LearningFeatureFusioninDeepLearning …downloads.hindawi.com/journals/je/2020/7286187.pdf · 2020-05-22 · qualitative approach to select the best layer for feature

412 Loss Function (e YOLOv3 includes a loss function(1) that instructs the network to correctly predict bounding

boxes and accurately classify the detected objects with aprovision to penalize false positives

λcoord 1113944

N2

i01113944

B

j01

objij xi minus 1113954xi( 1113857

2+ yi minus 1113954yi( 1113857

21113960 1113961 + λcoord 1113944

N2

i01113944

B

j01objij

w

radici minus

1113954wi

1113969

1113874 11138752

+h

radic

i minus

1113954hi

1113969

1113874 11138752

1113890 1113891

+ 1113944N2

i01113944

B

j01objij Ci minus 1113954Ci1113872 1113873

2+ λnoobj 1113944

N2

i01113944

B

j01noobjij Ci minus 1113954Ci1113872 1113873

2minus 1113944

N2

i01113944

B

j0δilog 1113954pi(c)( 1113857 + 1 minus δi( 1113857log 1 minus 1113954pi(c)( 1113857

(1)

(e symbols are explained in Table 1 (e symbolsunder hat represent corresponding prediction values

(e loss function in the equation has three error com-ponents localization confidence and classification as

416

416

33

416 208

208208

208

104104 52 52

52

256 256 512

52104

128 128

10411

1 331 1

26

26

1

33

416

33 1

1 33

332

6464

L0-conv0 3 times 3 times 64 L1-conv1 3 times 3 times 64-s2 L2-conv2 1 times 1 times 32 L5-conv43 times 3 times 128-s2

L12-conv93 times 3 times 256-s2 L12-conv9 3 times 3 times 256-s2

L3-conv3 3 times 3 times 64L4-res-1

times 1L6-conv5 1 times 1 times 64L7-conv6 3 times 3 times 128L8-res-3

times 2

L13-conv10 1 times 1 times 128L14-conv11 3 times 3 times 256L15-res-3

times 8

(a)

L38-conv27 1 times 1 times 256L39-conv28 3 times 3 times 512

L84-route-4 L86-route-1 61

L98-route-1 36

L84-conv59 1 times 1 times 256

L95-route-4L96-conv67 1 times 1 times 128

L85-upsample times 2

L97-upsample times 2

L40-res-3

times 8L63-conv44 1 times 1 times 512L64-conv45 3 times 3 times 1024L65-res-3

times 4L75-conv52 1 times 1 times 512L76-conv53 3 times 3 times 1024

L81-conv58 1 times 1 times ZL82-first detection layer

L93-conv66 1 times 1 times ZL94-second detection layer

L105-conv74 1 times 1 times ZL106-third detection layer

times 3

L87-conv60 1 times 1 times 256L88-conv61 3 times 3 times 512

times3

L99-conv68 1 times 1 times 128L100-conv69 3 times 3 times 256

times 3

L62-conv43 3 times 3 times 1024-s2

26

26

3

3

512

13 13

13 13

11

1311

1024 1024

13

1313

11

1024 Z

13

13

256256

26

26 26

26

512

26

26

11

768 (256 + 512)

384 (128 + 256)

11

Z

26

26

Z

52

52

52

256

52

52

52

52

128128

5226

26

11

11

(b)

Figure 3 YOLOv3 network architecture (a) Convolutional layers in the YOLOv3 (b) Detection layers in the YOLOv3

Journal of Engineering 5

observed in equation (1) Different loss components arecombined by sum-squared algorithm as it is easier foroptimization (e localization loss is responsible forminimizing the error between the ldquoresponsiblerdquobounding box and the ground truth object if an object isdetected in a grid cell

42 Integral Channel Features (ICF) In this work we fusedifferent channels of information proposed in the ICF [9]in the YOLOv3 architecture ICF descriptor is a series ofimage channels computed from the input image usinglinear and nonlinear transformations A channel refers to arepresentation of the input image Next first-order featuresare extracted by computing the sum of rectangular regions(e evaluated channels were color channels (RGB GrayHSV and LUV) gradient magnitude and gradients his-togram (e authors in [9] claimed that the most infor-mative channel among all in independent evaluation forpedestrian detection was HOG Further a combination ofLUV gradient and HOG gave the best detection rate Ourproposed work reevaluates the combination of channelsbecause of varying challenges in the present applicationscenario In our work the window size parameter for HOGcomputation depends on the 2D dimensions of the deepfeatures where they are to be fused with (discussed inSection 41) As an example if the 2D dimensions of a layerwhere features are to be fused are 13times13 and the inputimage is of size 416times 416 then the window size will be32times 32 so that it results in 13times13 HOGs For each windowwe compute the HOG using six bins Other HOG pa-rameters including cell and block size used for normali-zation purposes are also set the same as the window size(e block stride parameter is also set equal to the window

size having no overlap between two windows As shown inFigure 5 the ICF for a given image is the linear concate-nation of individual color channels after normalization thecorresponding HOG feature (e preprocessing in featurecomputation includes the resizing of the input image toalign it with the network layer that receives this input Forexperiments discussed in this paper we follow the samesteps and parameters for HOG computation as discussed in[9]

5 Learning-Based Feature Fusion in YOLOv3

Before we establish the proposed feature fusion we set outthe required steps which will be described subsequently

(1) Identify the candidate convolution layers for featureinjection

(2) Evaluate the layers for feature injection usingcomplete set of ICF channels

(3) Evaluate different combinations of ICF channelsusing the best network layer obtained in the previousstep

(4) Train and test the detector injecting the best ICFchannel combination at the best layer position

Our proposal for feature fusion starts with identifying alocation and space for handcrafted features within theYOLOv3 architecture For injecting an additional set offeatures in this architecture we first need to identify theconvolution layer to feed additional information We adopta qualitative approach to decide on the layer for featurefusion by validating feature fusion on different layers (edepth space of a convolutional layer represents the numberof feature maps that hold the deep featuremdashhereafter

YOLOv3Cell (0 1)

Cell (0 0)

x y w h objectness airphane 05 shark 045

x y w h objectness kite 03 car 001

x y w h objectness bird 012 truck 002

Cell (N N)

Figure 4 YOLOv3 prediction output for an example image cell(i j) corresponds to the image region within the ith row and jth column ofthe N times N grid

6 Journal of Engineering

referred to as filters We need to identify the specific numberof filters within the specific layer which will store the ad-ditional features First we propose to double the originalnumber of filters in the specific layer as shown in Figure 6

Depending on the number of channels in the ICF set thesame number of filters needs to be reserved for storinghandcrafted feature values (e ICF descriptor is injectedinto the network at the beginning of the training procedureIn the layer selected for ICF injection the additional filters(introduced by doubling the layer depth) are copied withICF descriptor from the rear end as shown in Figure 6 (eremaining filters are set to zero in the beginning As thetraining progresses the layer weights that is filter valuesupdate following the learning algorithm (e output gen-erated by this layer depends on the values of all filters (eproposedmethod for feature fusion diverges from the simpleapproach of the stacking of handcrafted features withoriginal filters in the selected layer (e extra filters in ad-dition to the filters used for storing ICF descriptor valueshelp regularize the layer weightsWemaintain the strategy ofdoubling the filters for feature fusion regardless of the

specific convolutional layer under validation An exampleshowing the procedure of doubling the number of filters at aspecific layer to issue space for ICF fusion is illustrated inFigure 6 In this example fusion was selected to occur at62n d layer of YOLOv3rsquos network (e top side of the figureshows a section from the original YOLOv3rsquos network and thebottom side is where the number of filters for the 62n d layeris doubled (e number of filters was changed from 1024 to2048 (e light orange is the new filter depth (2048) andconsists of the original filter depth (1024) concatenated withthe additional filters of the same depth represented in greenAdditionally doubling the number of filters in the fusionlayer poses fewer design challenges than with filters withdirect stacking of handcrafted features

51 Design Issues with Injection of ICF Descriptor To fusehandcrafted features with deep features their 2D dimen-sions have to match the width and height dimensions of thelayer Considering the example in Figure 6 at 62n d layer ifthe input image size is 416times416 then the filter dimensionswill be 13times13 Hence all selected channels for fusion shouldbe of size 13times13 to fit into the additional filters In thetesting phase of YOLOv3 the size of the input image is fixedat 416times 416 so the dimensions of the ICF descriptor at the62n d layer will always be 13times13 On the other hand in thetraining phase of YOLOv3 the size of input image changesbased on a random parameter in every 10 iterations wherethe size of the image is defined as a factor of 32 starting from320times 320 up to 608times 608 If the input size is 320times 320 or608times 608 then at the 62n d layer the filter size becomes10times10 or 19times19 respectively (e filter size compared withthe input image size is downsized by a factor of 32 Aftermaking sure that the handcrafted features are of the rightsize their values can replace the additional filters in thechosen layer

The flowchart of the proposed object detector is shown in Figure 7. For training purposes, the desired ICF channels of the input images at different scales are calculated offline. These images at different scales are used to incorporate variability in the training process. In the testing phase, ICF channels are calculated only once for each image at a fixed scale. The right side of the figure shows that, as our proposed detector is being trained or tested, fusion takes place at the layer named "nth YOLOv3 convolutional layer". The doubled number of filters and the number of ICF channels are represented as 2x and y, respectively. The figure shows how the ICF overlaps with a portion of the deep features residing in some of the additional filters.

Figure 7: Flowchart of the proposed object detector (left: offline ICF computation over the training scales; right: the online detector, in which the nth YOLOv3 convolutional layer applies convolution, batch normalization, and bias addition, fuses the y ICF channels with the 2x − y deep feature channels, and then applies the activation).

Table 1: Symbols used in the YOLOv3 loss function.
N: number of grid cells in an image.
B: number of anchor boxes.
Ci: confidence score of the jth bounding box in grid cell i.
pi(c): conditional probability of class c in grid cell i.
xi, yi, wi, hi: location and size of a bounding box.
1_i^obj: 1 if an object appears in grid cell i, 0 otherwise.
1_ij^obj: 1 if an object is present in the ith grid cell and the jth "responsible" bounding box, 0 otherwise.
1_ij^noobj: 1 if no object is present in the ith grid cell and the jth "responsible" bounding box, 0 otherwise.
δi: 1 if the predicted label matches the ground-truth label, 0 otherwise.
λcoord: constant (default 5.0).
λnoobj: constant (default 0.5).

Figure 5: Flowchart of the ICF computation for an example image (preprocessing, followed by computation of the Gray, HSV, LUV, gradient magnitude, and HOG channels, normalization, and concatenation into the ICF handcrafted feature vector).
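A rough Python sketch of the offline ICF computation summarized in Figure 5 is shown below. It uses OpenCV color conversions and a simple six-bin, magnitude-weighted orientation histogram per 32 × 32 window in place of a full HOG implementation; the exact HOG parameters of the paper are not reproduced, the function name is illustrative, and appending the three RGB planes would give the full 17-channel set evaluated in Section 6.

```python
import cv2
import numpy as np

def compute_icf(image_bgr, grid=13):
    """Return a (channels, grid, grid) ICF-style descriptor: gray, HSV, LUV,
    gradient magnitude, and a 6-bin orientation histogram per window.
    Illustrative sketch; not the authors' exact ICF implementation."""
    img = cv2.resize(image_bgr, (grid * 32, grid * 32)).astype(np.float32) / 255.0
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    luv = cv2.cvtColor(img, cv2.COLOR_BGR2LUV)

    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    mag, ang = cv2.cartToPolar(gx, gy)          # gradient magnitude and angle

    def pool(ch):                                # average-pool a channel to grid x grid
        return cv2.resize(ch, (grid, grid), interpolation=cv2.INTER_AREA)

    channels = [pool(gray)]
    channels += [pool(hsv[..., i]) for i in range(3)]
    channels += [pool(luv[..., i]) for i in range(3)]
    channels.append(pool(mag))

    # 6-bin orientation histogram (magnitude-weighted) per 32 x 32 window.
    bins = np.minimum((ang / (2 * np.pi) * 6).astype(np.int32), 5)
    hog = np.zeros((6, grid, grid), dtype=np.float32)
    win = 32
    for i in range(grid):
        for j in range(grid):
            b = bins[i * win:(i + 1) * win, j * win:(j + 1) * win].ravel()
            m = mag[i * win:(i + 1) * win, j * win:(j + 1) * win].ravel()
            for k in range(6):
                hog[k, i, j] = m[b == k].sum()
    return np.concatenate([np.stack(channels), hog], axis=0)  # (14, grid, grid)
```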

6. Experimental Evaluation

A preliminary evaluation of the proposed framework is performed on the PASCAL-VOC dataset (VOC2007 detection task). The original YOLOv3 detector achieved an mAP of 67.5 on the testing set with an input image dimension of 416 × 416. The YOLOv3 implementation made available by [35] is used for benchmarking our experimental results. As mentioned in Section 5, we first identify the convolutional layer for injecting the handcrafted features and the required space (i.e., the number of filters). First, we determine the best layer location to fuse the ICF channels. Subsequently, by another set of experiments, we determine the best combination of ICF channels. We separate a validation set comprising one-tenth of the training examples by random selection. The validation set is used for testing the convolutional layers and the combinations of ICF channels for optimum selection. For determining the best layer for fusion, we use all channels of information in ICF as used in the original work by Dollar et al. [9]. This is done by doubling the number of filters at the layer under evaluation, fusing the 17 channels of the ICF set, and checking the resulting mAP. We evaluated the two convolutional layers preceding a detection layer in the YOLOv3 architecture. For each experiment, the network is trained for 25,000 iterations. The remaining training parameters are set as in the original YOLOv3 detector. We performed the first evaluation on the 80th layer, which is the last convolutional layer before the detection layer.

Doubling the number of filters for the 80th layer and injecting the full set of ICF channels resulted in an mAP of 69.8. In comparison with the benchmark YOLOv3 performance, we achieved an increase of 1.7 on the mAP scale. The next experiment, on the 79th layer, achieved an mAP of 71.5, which is more effective than the previous experiment. Table 2 lists the mAP scores for the experiments conducted to determine the best layer position for feature fusion. As observed, the best mAP score of 71.5 is achieved at the 79th layer.

Table 2: VOC dataset detection rate with fusion of the complete set of ICF channels (layer number, number of filters after doubling, mAP).
Layer 59: 512 filters, mAP 66.1
Layer 60: 1024 filters, mAP 68.4
Layer 62: 2048 filters, mAP 66.2
Layer 77: 1024 filters, mAP 69.8
Layer 78: 2048 filters, mAP 70.9
Layer 79: 1024 filters, mAP 71.5
Layer 80: 2048 filters, mAP 69.8
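The layer-selection step can be read as a simple search loop over candidate layers. The sketch below is only schematic: the helper functions are placeholders standing in for whatever retraining and validation-mAP routines are used, not functions from the paper's code base.

```python
# Schematic sketch of the layer-selection search (Section 6).
def train_with_fusion(layer, icf_channels, iterations):
    """Placeholder: retrain YOLOv3 with filters doubled at `layer` and the
    ICF channels injected, for the given number of iterations."""
    pass

def evaluate_map(layer):
    """Placeholder: return validation mAP for the detector fused at `layer`."""
    return 0.0

candidate_layers = [59, 60, 62, 77, 78, 79, 80]     # layers evaluated for VOC (Table 2)
results = {}
for layer in candidate_layers:
    train_with_fusion(layer, icf_channels=17, iterations=25000)
    results[layer] = evaluate_map(layer)
best_layer = max(results, key=results.get)           # 79 in the VOC experiments
```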

After determining the most suitable layer for ICF fusion, we conduct experiments to discover the combination of ICF channels that is most suitable for the object detection task on the VOC dataset. Again, in this set of experiments, we fuse the ICF channels by doubling the number of filters at the 79th layer. In this experiment, we run the training process for 50,000 iterations. Fusing all ICF channels (color, gradient magnitude, and HOG) achieved an mAP of 77.7. The entire ICF descriptor summed up to 17 channels, considering 6 channels for HOG. Removing the RGB channels from the 17 ICF channels increased the mAP to 78.2, a minuscule increase of 0.5 over the preceding experiment. Considering that the original input images are in RGB color values, the increase is not significant. We also explored other channel combinations; the corresponding mAP scores are presented in Table 3.

Table 3: VOC dataset detection rate with fusion of ICF channel combinations at the 79th layer (combination, number of channels, mAP).
RGB + Gray + HSV + LUV + grad + HOG: 17 channels, mAP 77.7
Gray + HSV + LUV + grad + HOG: 14 channels, mAP 77.4
HSV + LUV + grad + HOG: 13 channels, mAP 78.6
Gray + LUV + grad + HOG: 11 channels, mAP 78.0
LUV + grad + HOG: 10 channels, mAP 79.1
Gray + HSV + grad + HOG: 11 channels, mAP 76.8
Gray + HSV + LUV: 7 channels, mAP 76.2
Gray + HSV + LUV + HOG: 13 channels, mAP 77.1

As seen, the best ICF channel combination for the VOC2007 detection task consists of Gray, HSV, LUV, gradient magnitude, and HOG, achieving a maximum mAP score of 79.1. We retrain the YOLOv3 detector on the complete training set, fusing the finalized channels of the ICF descriptor. The network achieved an mAP of 78.7 on the testing set, which is 11.2 more than the detection rate achieved by the original YOLOv3 detector.

6.1. Evaluation on the MS-COCO Dataset. The MS-COCO dataset consists of 82,783 training, 40,504 validation, and 40,775 testing images belonging to 80 categories. In the literature, YOLOv3 [8] reported an mAP of 55.3 on the COCO dataset for an input image size of 416 × 416. All experiments reported in this work were executed on an NVIDIA GeForce GTX 1080 GPU desktop, which is modest in comparison with the infrastructure used in the original paper [8]. Therefore, we adopt an alternate strategy for benchmarking, where the YOLO detector (original and with the proposed fusion strategy) is retrained on the MS-COCO dataset and we compare the detection performance at 50,000 iterations. The original YOLOv3 evaluation on the COCO testing dataset achieved an mAP of 36.5 for 50,000 iterations.

After determining the best location for ICF fusion using the mAP values achieved on the validation set, as shown in Table 4, we performed experiments to determine the combination of ICF channels that best suits the COCO dataset (Table 5). As shown in Table 4, doubling the 78th layer resulted in the highest mAP score (37.3). Therefore, all subsequent experiments related to the COCO dataset were conducted with the doubled number of filters at the 78th layer. All experiments were evaluated at 50,000 iterations. Fusing all ICF channels (color, gradient magnitude, and HOG) resulted in an mAP of 37.7. Removing RGB from the 17 ICF channels led to an mAP of 37.3. Other channel combinations were tried but did not give good results, such as fusing only HOG features (6 channels), LUV + HOG (9 channels), Gray + LUV + gradient + HOG (11 channels), and Gray + HSV + LUV + gradient (8 channels). The combination that gave the highest mAP score consisted of the fusion of all ICF channels.

Table 4: COCO dataset detection rate with fusion of the complete set of ICF channels (layer number, number of filters after doubling, mAP).
Layer 57: 1024 filters, mAP 35.2
Layer 59: 512 filters, mAP 33.7
Layer 60: 1024 filters, mAP 35.6
Layer 73: 2048 filters, mAP 35.1
Layer 75: 1024 filters, mAP 36.6
Layer 77: 1024 filters, mAP 36.3
Layer 78: 2048 filters, mAP 37.3
Layer 79: 1024 filters, mAP 36.9

Table 5: MS-COCO detection rate with fusion of ICF channel combinations at the 78th layer (combination, number of channels, mAP).
RGB + Gray + HSV + LUV + grad + HOG: 17 channels, mAP 37.7
Gray + HSV + LUV + grad + HOG: 14 channels, mAP 37.3
HSV + LUV + grad + HOG: 13 channels, mAP 37.4
Gray + LUV + grad + HOG: 11 channels, mAP 36.4

Furthermore, the evaluation on the testing set with fusion of the complete set of ICF channels at the 78th layer achieved an mAP value of 37.8. With the COCO dataset, we also experimented with the effect of normalization on the ICF channels. To measure the effect of normalization on the performance of the network, we tried two different normalization techniques. All the results discussed so far were achieved without any normalization. We first experimented with the max normalization technique, dividing each channel by its maximum range so that all values fall in the range of 0 to 1. The fusion of max-normalized ICF channels in YOLOv3 achieved an mAP of 38.4 on the testing set, which is 0.6 higher than the best result achieved with unnormalized features and 1.9 higher than the original YOLOv3 performance. This agrees with the training principle of deep networks, which suggests the use of input normalization as one of the tools for performance improvement. As the second normalization technique, we used Z-score based normalization, which converts the data in all channels to have mean 0 and standard deviation 1. However, the Z-score normalization did not perform as well as the previous normalization technique.
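The two normalization schemes compared here are straightforward per-channel operations; a minimal sketch, assuming each ICF channel is a NumPy array:

```python
import numpy as np

def max_normalize(channel, value_range=None):
    # Divide by the channel's maximum range (or its observed maximum)
    # so all entries fall in [0, 1].
    denom = value_range if value_range is not None else np.max(channel)
    return channel / (denom + 1e-12)

def zscore_normalize(channel):
    # Zero mean, unit standard deviation per channel.
    return (channel - channel.mean()) / (channel.std() + 1e-12)
```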

7. Conclusion and Future Work

In this work, we present an approach to fuse handcrafted features into a convolutional neural network based object detector. The fusion of handcrafted features with learning-based features has proved effective in many earlier works. Here, we demonstrate the methodological steps to fuse handcrafted features into the latest version of the YOLO detector. Our experiments with the fusion of simple integral channel features in YOLO have yielded substantial improvements in the detection rate on the PASCAL-VOC and MS-COCO datasets. In conventional machine learning, early, late, and learning-based fusion have been the primary strategies for feature fusion. However, deep learning networks are designed for specific input sizes, which poses a challenge for the algorithm designer who wants to feed extra information into the network unless the network is redesigned. The presented work describes a novel approach, based on methodological steps, to address both problems. In future work, we plan to extend the approach to sequence-based learning for object detection and tracking.

Data Availability

This work utilizes standard datasets, which are open access.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: an evaluation of the state of the art," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, 2012.
[2] D. T. Nguyen, W. Li, and P. O. Ogunbona, "Human detection from images and videos: a survey," Pattern Recognition, vol. 51, pp. 148–175, 2016.
[3] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.
[4] S. Zafeiriou, C. Zhang, and Z. Zhang, "A survey on face detection in the wild: past, present and future," Computer Vision and Image Understanding, vol. 138, pp. 1–24, 2015.
[5] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele, "How far are we from solving pedestrian detection?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, NV, USA, pp. 1259–1267, June 2016.
[6] Pascal visual object classes homepage (PASCAL-VOC), http://host.robots.ox.ac.uk/pascal/VOC.
[7] COCO: common objects in context (MS-COCO), http://cocodataset.org.
[8] J. Redmon and A. Farhadi, YOLOv3: An Incremental Improvement, 2018, https://arxiv.org/abs/1804.02767.
[9] P. Dollar, Z. Tu, P. Perona, and S. Belongie, Integral Channel Features, BMVC Press, Swansea, UK, 2009.
[10] F. Arman and J. K. Aggarwal, "Model-based object recognition in dense-range images: a review," ACM Computing Surveys (CSUR), vol. 25, no. 1, pp. 5–43, 1993.
[11] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," Computer Vision and Pattern Recognition, vol. 1, pp. 886–893, 2005.
[12] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[13] A. Bosch, A. Zisserman, and X. Munoz, "Representing shape with a spatial pyramid kernel," in Proceedings of the 6th ACM International Conference on Image and Video Retrieval, pp. 401–408, Association for Computing Machinery, Amsterdam, The Netherlands, July 2007.
[14] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[15] J. Shotton, A. Blake, and R. Cipolla, "Contour-based learning for object detection," in Proceedings of the Tenth IEEE International Conference on Computer Vision, vol. 1, IEEE, Beijing, China, pp. 503–510, October 2005.
[16] S. Agarwal and D. Roth, "Learning a sparse representation for object detection," in Proceedings of the European Conference on Computer Vision, pp. 113–127, Copenhagen, Denmark, May 2002.
[17] P. Dollar, B. Babenko, S. Belongie, P. Perona, and Z. Tu, "Multiple component learning for object detection," in Proceedings of the 10th European Conference on Computer Vision, vol. 10, Springer, Berlin, Germany, pp. 211–224, June 2008.
[18] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.

[19] N. Tong, H. Lu, X. Ruan, and M.-H. Yang, "Salient object detection via bootstrap learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, MA, USA, pp. 1884–1892, June 2015.


[20] X. Wang, M. Yang, S. Zhu, and Y. Lin, "Regionlets for generic object detection," in Proceedings of the IEEE International Conference on Computer Vision, IEEE, Sydney, NSW, Australia, pp. 17–24, December 2013.
[21] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, OH, USA, pp. 580–587, June 2014.
[22] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448, Santiago, Chile, 2015.
[23] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: object detection via region-based fully convolutional networks," in Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 379–387, Barcelona, Spain, December 2016.
[24] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Boston, MA, USA, pp. 3431–3440, June 2015.
[25] W. Liu, D. Anguelov, D. Erhan et al., "SSD: single shot multibox detector," in European Conference on Computer Vision, pp. 21–37, Amsterdam, Netherlands, 2016.
[26] C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, Deconvolutional Single Shot Detector, 2017, https://arxiv.org/abs/1701.06659.
[27] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, NV, USA, pp. 779–788, June 2016.
[28] Y. Tian, P. Luo, X. Wang, and X. Tang, "Deep learning strong parts for pedestrian detection," in Proceedings of the IEEE International Conference on Computer Vision, IEEE, Santiago, Chile, pp. 1904–1912, December 2015.
[29] Y. Tian, P. Luo, X. Wang, and X. Tang, "Pedestrian detection aided by deep learning semantic tasks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, MA, USA, pp. 5079–5087, June 2015.
[30] Y. Zhou, L. Liu, L. Shao, and M. Mellor, "DAVE: a unified framework for fast vehicle detection and annotation," in Proceedings of the European Conference on Computer Vision, pp. 278–293, Amsterdam, The Netherlands, October 2016.
[31] B. Li, T. Zhang, and T. Xia, "Vehicle detection from 3D lidar using fully convolutional network," 2016, http://arxiv.org/abs/1608.07916.
[32] L. Zhang, L. Lin, X. Liang, and K. He, "Is faster R-CNN doing well for pedestrian detection?" in Proceedings of the European Conference on Computer Vision, pp. 443–457, Amsterdam, The Netherlands, October 2016.
[33] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele, "Towards reaching human performance in pedestrian detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 973–986, 2018.
[34] T. Akilan, Q. M. J. Wu, and W. Zhang, "Video foreground extraction using multi-view receptive field and encoder-decoder DCNN for traffic and surveillance applications," IEEE Transactions on Vehicular Technology, vol. 68, no. 10, pp. 9478–9493, 2019.
[35] A. B. Alexey, Windows and Linux Version of Darknet Yolo V3 & V2 Neural Networks for Object Detection, 2018.
[36] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Seattle, Washington, USA, pp. 6517–6525, July 2017.
[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.




Page 7: ResearchArticle LearningFeatureFusioninDeepLearning …downloads.hindawi.com/journals/je/2020/7286187.pdf · 2020-05-22 · qualitative approach to select the best layer for feature

referred to as filters We need to identify the specific numberof filters within the specific layer which will store the ad-ditional features First we propose to double the originalnumber of filters in the specific layer as shown in Figure 6

Depending on the number of channels in the ICF set thesame number of filters needs to be reserved for storinghandcrafted feature values (e ICF descriptor is injectedinto the network at the beginning of the training procedureIn the layer selected for ICF injection the additional filters(introduced by doubling the layer depth) are copied withICF descriptor from the rear end as shown in Figure 6 (eremaining filters are set to zero in the beginning As thetraining progresses the layer weights that is filter valuesupdate following the learning algorithm (e output gen-erated by this layer depends on the values of all filters (eproposedmethod for feature fusion diverges from the simpleapproach of the stacking of handcrafted features withoriginal filters in the selected layer (e extra filters in ad-dition to the filters used for storing ICF descriptor valueshelp regularize the layer weightsWemaintain the strategy ofdoubling the filters for feature fusion regardless of the

specific convolutional layer under validation An exampleshowing the procedure of doubling the number of filters at aspecific layer to issue space for ICF fusion is illustrated inFigure 6 In this example fusion was selected to occur at62n d layer of YOLOv3rsquos network (e top side of the figureshows a section from the original YOLOv3rsquos network and thebottom side is where the number of filters for the 62n d layeris doubled (e number of filters was changed from 1024 to2048 (e light orange is the new filter depth (2048) andconsists of the original filter depth (1024) concatenated withthe additional filters of the same depth represented in greenAdditionally doubling the number of filters in the fusionlayer poses fewer design challenges than with filters withdirect stacking of handcrafted features

51 Design Issues with Injection of ICF Descriptor To fusehandcrafted features with deep features their 2D dimen-sions have to match the width and height dimensions of thelayer Considering the example in Figure 6 at 62n d layer ifthe input image size is 416times416 then the filter dimensionswill be 13times13 Hence all selected channels for fusion shouldbe of size 13times13 to fit into the additional filters In thetesting phase of YOLOv3 the size of the input image is fixedat 416times 416 so the dimensions of the ICF descriptor at the62n d layer will always be 13times13 On the other hand in thetraining phase of YOLOv3 the size of input image changesbased on a random parameter in every 10 iterations wherethe size of the image is defined as a factor of 32 starting from320times 320 up to 608times 608 If the input size is 320times 320 or608times 608 then at the 62n d layer the filter size becomes10times10 or 19times19 respectively (e filter size compared withthe input image size is downsized by a factor of 32 Aftermaking sure that the handcrafted features are of the rightsize their values can replace the additional filters in thechosen layer

(e flowchart of the proposed object detector is shown inFigure 7 For training purposes the desired ICF channels ofinput images at different scales are calculated offline (eseimages at different scales are used to incorporate variabilityin the training process In the testing phase ICF channels arecalculated only once for each image at a fixed scale(e rightside of the figure shows that as our proposed detector isbeing trainedtested fusion takes place at the layer named``nth YOLOv3 convolutional layerrdquo (e doubled number of

Image

Preprocessing

Gray HSV LUV Gradmagnitude HOG

Normalization

Concatenation

ICF handcraftedfeatures vector

Figure 5 Flowchart for the ICF computation for an exampleimage

Table 1 Symbols used in YOLOv3 loss function

Symbol DefinitionN Number of grid cells in an imageB Number of anchor boxesCi Confidence score of jth bounding box in grid cell ipi(c) Conditional probability of class c in grid cell ixi yi wi hi Location and size of a bounding box1obji 1 if an object appears in grid cell i 0 otherwise1objij 1 if an object is present in ith grid cell and the jth ldquoresponsiblerdquo bounding box 0 otherwise1noobj

ij 1 if no object is present in ith grid cell and the jth ldquoresponsiblerdquo bounding box 0 otherwiseδi 1 if predicted label matches the ground truth label 0 otherwiseλcoord Constant (default 50)λnoobj Constant (default 05)

Journal of Engineering 7

filters and the number of ICF channels are represented as 2x

and y respectively (e figure shows how the ICF overlapswith a portion of the deep features residing in some of theadditional filters

6 Experimental Evaluation

A preliminary evaluation of the proposed framework isperformed on the PASCAL-VOC dataset (VOC2007 de-tection task) (e original YOLOv3 detector achieved themAP of 675 on the testing set with input image dimensionas 416times 416 (e YOLOv3 implementation available by [35]is used for benchmarking our experimental results Asmentioned in Section 5 we first identify the convolutionallayer for injecting the handcrafted features and the requiredspace (ie the number of filters) First we figure out the bestlayer location to fuse ICF channels Subsequently by anotherset of experiments we determine the best combination ofICF channels We separate a validation set comprising one-tenth of the training examples by random selection (evalidation set is used for testing the convolutional layers andcombination of ICF channels for optimum selection Fordetermining the best layer for fusion we use all channels ofinformation in ICF as used in the original work by Dollaret al [9](is is done by doubling the number of filters at thelayer under evaluation fusing 17 channels of ICF set andchecking the resulting mAP We evaluated the two con-volutional layers preceding a detection layer in YOLOv3architecture For each experiment the network is trained for25000 iterations (e remaining training parameters are setas the original YOLOv3 detector We performed the firstevaluation on 80th layer which is the last convolutional layerbefore the detection layer

Doubling the number of filters for 80th layer andinjecting the full set of ICF channels resulted in the mAP of698 In comparison with the benchmark performancebased on YOLOv3 performance we achieved an increase of17 on mAP scale (e next experiment on 79th layer

achieved the mAP of 715 which is more effective than theprevious experiment Table 2 lists the mAP scores for ex-periments conducted for determining the best layer positionfor feature fusion As observed the best mAP score of 715is achieved at 79th layer

After determining the most suitable layer for ICF fusionwe conduct experiments to discover the combination of ICFchannels for fusion that is most suitable for the object de-tection task in the VOC dataset Again in this set of ex-periments we fuse ICF channels by doubling the number offilters at the 79th layer In this experiment we run thetraining process for 50000 iterations Fusing all ICFchannels color gradient magnitude and HOG achieved themAP of 777 (e entire ICF descriptor summed up to 17channels considering 6 channels for HOG Removing theRGB channel from the 17 ICF channels increased the mAPto 782 showing a minuscule increase of 05 from thepreceding experiment Considering original input imagesare in RGB color values the increase is not significant Wealso explored other channel combinations the corre-sponding mAP scores are presented in Table 3

As seen the best ICF channel combination for VOC2007detection task consists of Gray HSV LUV gradient mag-nitudes and HOG achieving maximummAP score of 791We retrain the YOLOv3 detector on complete training setfusing the finalized channels of ICF descriptor (e networkachieved the mAP of 787 on the testing set which is 112more than the detection rate achieved by original YOLOv3detector

61 Evaluation on MS-COCO Dataset MS-COCO datasetconsists of 82783 training 40504 validation and 40775testing images belonging to 80 categories In the literaturethe YOLOv3 [8] reported the mAP of 553 on the COCOdataset for an input image size of 416times 416 All experimentsreported in this work were executed on a NVIDIA GeForceGTX 1080 GPU desktop which is unsubstantial in

L63-conv44 1 times 1 times 512L64-conv45 3 times 3 times 1024

L62-conv433 times 3 times 1024-s2

L62-conv43 3 times 3 times 2048-s2

L65-res-3times 4

L63-conv44 1 times 1 times 512L64-conv45 3 times 3 times 1024L65-res-3

times 4

26

26512

512

1024 102413

131

113

131

1

102413

1313

13

1024

11 1

11024

13

13

13

131

1

33

26

26

33

Original

Doubling

2048 ICF fusion

Figure 6 ICF descriptor injection in the specific convolutional layer

8 Journal of Engineering

comparison with the infrastructure used in the originalpaper [8] (erefore we adopt an alternate strategy forbenchmarking where the YOLO detector (original and withproposed fusion strategy) is retrained on the MS-COCOdataset and we compare the detection performance at 50000iterations (e original YOLOv3 evaluation on COCOtesting dataset achieved the mAP of 365 for 50000iterations

After determining the best location for ICF fusion usingthe mAP values achieved on the validation set as shown inTable 4 we performed experiments to determine the bestcombination of ICF channel fusion that best suits the COCOdataset (Table 5) As shown in Table 4 doubling the 78th

layer resulted in the highest mAP score (373) (ereforefor all subsequent experiments related to the COCO datasetwere conducted with the doubled number of filters at the78th layer All experiments were evaluated at 50000 itera-tions Fusing all ICF channels (color gradient magnitudeand HOG) resulted in the mAP of 377 Removing RGBfrom the 17 ICF channels led to the mAP of 373 Otherchannel combinations were tried out but did not give goodresults such as fusing only HOG features (6 channels)LUV+HOG (9 channels) gray + LUV+ gradient +HOG (11channels) and gray +HSV+LUV+ gradient (8 channels)(e best combination that gave the highest mAP scoreconsisted of fusion of the all ICF channels

Furthermore the evaluation on testing set with fusionof the complete set of ICF channels at the 78th layerachieved 378 of the mAP value With COCO datasetwe also experimented with the effect of normalization onthe ICF channels To measure the effect of normalizationon the performance of the network we tried out twodifferent normalization techniques All the results dis-cussed so far were achieved without any normalization

We first experimented with max normalization techniquedividing each channel by its maximum range so that allvalues fall in the range of 0 to 1 (e fusion of maxnormalized ICF channels in YOLOv3 achieved the mAPof 384 on the testing set which is 06 higher than thebest result achieved with unnormalized features and 19with the original YOLOv3 performance (is confirmswith the training principle of deep networks whichsuggests the use of input normalization as one of the tools

Table 2 VOC dataset detection rate with fusion of complete set ofICF channels

Layer number Number of filters after doubling mAP59 512 66160 1024 68462 2048 66277 1024 69878 2048 70979 1024 71580 2048 698

Table 3 VOC dataset detection rate with fusion of ICF channelcombinations at the 79th layer

ICF channel combination Number ofchannels mAP

RGB+ gray +HSV+LUV+ grad +HOG 17 777Gray +HSV+LUV+ grad +HOG 14 774HSV+LUV+ grad +HOG 13 786Gray + LUV+ grad +HOG 11 780LUV+ grad +HOG 10 791Gray +HSV+ grad +HOG 11 768Gray +HSV+LUV 7 762Gray +HSV+LUV+HOG 13 771

Offline Online

Imagedataset

ComputedICF

Read inputimage

Read inputimage

ICF computation

Store features

Trainingset

No

No

Yes

Yes

Morescales

Change inputimage scale

Imagedataset

1st YOLOv3layer

2nd YOLOv3layer

106th YOLOv3layer

107th YOLOv3layer

nth YOLOv3convolutional

layer

Convolution

Batchnormalization

Add bias

Fuse features

Activation

ComputedICF

Deep features

Deep features

ICF

ICF

2x y

y2x - y

Figure 7 Flowchart of the proposed object detector

Journal of Engineering 9

for performance improvement In the second normali-zation technique we used Z-score based normalizationwhich converts data in all channels to have mean 0 andstandard deviation 1 However the Z-score normaliza-tion did not perform as good as the previous normali-zation technique

7 Conclusion and Future Work

In this work we present an approach to fuse handcraftedfeatures in the convolutional neural network based objectdetector (e fusion of handcrafted features with learn-ing-based features has proved to be effective in manyearlier works In this work we demonstrate the meth-odological steps to fuse handcrafted features in the latestversion of the YOLO detector Our experiments with acombination of simple integral channel features fusion inYOLO have yielded substantial improvement in the de-tection rate on PASCAL-VOC and MS-COCO datasetsIn conventional machine learning early late andlearning-based fusion have been primary strategies forfeature fusion However deep learning networks aredesigned for specific input sizes which pose a challengefor algorithm designer to feed in extra information in thenetwork unless the network is redesigned (e workpresented a novel approach based on methodologicalsteps to address both the problems In future work weplan to explore the approach for sequence-based learningfor object detection and tracking

Data Availability

(is work utilizes standard datasets which are open access

Conflicts of Interest

(e authors declare that they have no conflicts of interest

References

[1] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: an evaluation of the state of the art," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, 2012.
[2] D. T. Nguyen, W. Li, and P. O. Ogunbona, "Human detection from images and videos: a survey," Pattern Recognition, vol. 51, pp. 148–175, 2016.
[3] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.
[4] S. Zafeiriou, C. Zhang, and Z. Zhang, "A survey on face detection in the wild: past, present and future," Computer Vision and Image Understanding, vol. 138, pp. 1–24, 2015.
[5] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele, "How far are we from solving pedestrian detection?," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, NV, USA, pp. 1259–1267, June 2016.
[6] PASCAL Visual Object Classes homepage (PASCAL-VOC), http://host.robots.ox.ac.uk/pascal/VOC/.
[7] COCO: Common Objects in Context (MS-COCO), http://cocodataset.org/.
[8] J. Redmon and A. Farhadi, YOLOv3: An Incremental Improvement, 2018, https://arxiv.org/abs/1804.02767.
[9] P. Dollar, Z. Tu, P. Perona, and S. Belongie, Integral Channel Features, BMVC Press, Swansea, UK, 2009.
[10] F. Arman and J. K. Aggarwal, "Model-based object recognition in dense-range images - a review," ACM Computing Surveys (CSUR), vol. 25, no. 1, pp. 5–43, 1993.
[11] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," Computer Vision and Pattern Recognition, vol. 1, pp. 886–893, 2005.
[12] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[13] A. Bosch, A. Zisserman, and X. Munoz, "Representing shape with a spatial pyramid kernel," in Proceedings of the 6th ACM International Conference on Image and Video Retrieval, pp. 401–408, Association for Computing Machinery, Amsterdam, The Netherlands, July 2007.
[14] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[15] J. Shotton, A. Blake, and R. Cipolla, "Contour-based learning for object detection," in Proceedings of the Tenth IEEE International Conference on Computer Vision, vol. 1, IEEE, Beijing, China, pp. 503–510, October 2005.
[16] S. Agarwal and D. Roth, "Learning a sparse representation for object detection," in Proceedings of the European Conference on Computer Vision, pp. 113–127, Copenhagen, Denmark, May 2002.
[17] P. Dollar, B. Babenko, S. Belongie, P. Perona, and Z. Tu, "Multiple component learning for object detection," in Proceedings of the 10th European Conference on Computer Vision, vol. 10, Springer, Berlin, Germany, pp. 211–224, June 2008.
[18] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
[19] N. Tong, H. Lu, X. Ruan, and M.-H. Yang, "Salient object detection via bootstrap learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, MA, USA, pp. 1884–1892, June 2015.



[20] X. Wang, M. Yang, S. Zhu, and Y. Lin, "Regionlets for generic object detection," in Proceedings of the IEEE International Conference on Computer Vision, IEEE, Sydney, NSW, Australia, pp. 17–24, December 2013.
[21] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, OH, USA, pp. 580–587, June 2014.
[22] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448, Santiago, Chile, 2015.
[23] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: object detection via region-based fully convolutional networks," in Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 379–387, Barcelona, Spain, December 2016.
[24] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Boston, MA, USA, pp. 3431–3440, June 2015.
[25] W. Liu, D. Anguelov, D. Erhan et al., "SSD: single shot multibox detector," in European Conference on Computer Vision, pp. 21–37, Amsterdam, Netherlands, 2016.
[26] C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, Deconvolutional Single Shot Detector, 2017, https://arxiv.org/abs/1701.06659.
[27] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, NV, USA, pp. 779–788, June 2016.
[28] Y. Tian, P. Luo, X. Wang, and X. Tang, "Deep learning strong parts for pedestrian detection," in Proceedings of the IEEE International Conference on Computer Vision, IEEE, Santiago, Chile, pp. 1904–1912, December 2015.
[29] Y. Tian, P. Luo, X. Wang, and X. Tang, "Pedestrian detection aided by deep learning semantic tasks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, MA, USA, pp. 5079–5087, June 2015.
[30] Y. Zhou, L. Liu, L. Shao, and M. Mellor, "DAVE: a unified framework for fast vehicle detection and annotation," in Proceedings of the European Conference on Computer Vision, pp. 278–293, Amsterdam, The Netherlands, October 2016.
[31] B. Li, T. Zhang, and T. Xia, "Vehicle detection from 3D lidar using fully convolutional network," 2016, http://arxiv.org/abs/1608.07916.
[32] L. Zhang, L. Lin, X. Liang, and K. He, "Is Faster R-CNN doing well for pedestrian detection?," in Proceedings of the European Conference on Computer Vision, pp. 443–457, Amsterdam, The Netherlands, October 2016.
[33] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele, "Towards reaching human performance in pedestrian detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 973–986, 2018.
[34] T. Akilan, Q. M. J. Wu, and W. Zhang, "Video foreground extraction using multi-view receptive field and encoder-decoder DCNN for traffic and surveillance applications," IEEE Transactions on Vehicular Technology, vol. 68, no. 10, pp. 9478–9493, 2019.
[35] A. B. Alexey, Windows and Linux Version of Darknet Yolo v3 & v2 Neural Networks for Object Detection, 2018.
[36] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Seattle, Washington, USA, pp. 6517–6525, July 2017.
[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.






