
Intelligent Surveillance as an Edge Network Service: from Haar-Cascade, SVM to a Lightweight CNN

Seyed Yahya Nikouei, Yu Chen, Sejun Song, Ronghua Xu, Baek-Young Choi, Timothy R. Faughnan

Abstract—Edge computing efficiently extends the realm of information technology beyond the boundary defined by the cloud computing paradigm. By performing computation near the source and destination, edge computing is promising to address the challenges in many delay-sensitive applications, like real-time surveillance. Leveraging ubiquitously connected cameras and smart mobile devices, it enables video analytics at the edge. However, traditional human-object detection and tracking approaches are still computationally too expensive for edge devices. Aiming at intelligent surveillance as an edge network service, this work explores the feasibility of two popular human-object detection schemes, Haar-Cascade and SVM, at the edge. Understanding the existing constraints of these algorithms, a lightweight Convolutional Neural Network (L-CNN) is proposed using the depthwise separable convolution. The proposed L-CNN considerably reduces the number of parameters without affecting the quality of the output; thus it is ideal for edge devices. Trained with the Single Shot Multi-box Detector (SSD) to pinpoint each human object's location, it outputs the coordinates of a bounding box around each object. We implemented and tested the L-CNN on a Raspberry PI 3 as the edge device. An intensive experimental comparison study has validated that the proposed L-CNN is a feasible design for real-time human-object detection as an edge service.

Keywords—Edge Computing, Smart Surveillance, Lightweight Convolutional Neural Network (L-CNN), Human Detection.

I. INTRODUCTION

Nowadays almost every person can be connected to the network using their pocket-sized mobile devices wherever and whenever. Moreover, the CPUs and GPUs in a modern mobile device have more computational power than the computers once used in the Apollo missions. For example, a Cray X-MP supercomputer used for scientific work in the 1980s has less power than a recent iPhone. The advancement of cyber-physical technologies and their interconnection through elastic communication networks facilitate the concept of Smart Cities that improve the life quality of residents. Attracted by the convenient lifestyle in bigger cities, the world's population has been increasingly concentrated in urban areas at an unprecedented scale and speed [9]. For public safety, cities

S. Y. Nikouei, Y. Chen, and R. Xu are with the Department of Electrical and Computer Engineering, Binghamton University, SUNY, Binghamton, NY 13902 USA, email: {snikoue1, ychen, rxu22}@binghamton.edu.

S. Song and B. Y. Choi are with the School of Computing and Engineering, University of Missouri-Kansas City, Kansas City, MO 64110 USA, email: {songsej, choiby}@umkc.edu.

T. R. Faughnan is with the New York State University Police, Binghamton University, SUNY, Binghamton, NY 13902 USA, email: [email protected].

Manuscript received in March 2018.

face the daunting task of accommodating the urban dynamics with smarter management strategies [50]. Efficient surveillance is essential for situational awareness (SAW), particularly for many dynamic data-driven tasks [4]. For example, in 2016 North America alone had more than 62 million public surveillance cameras watching the movement of civilian traffic for safety [30]. Understanding and analyzing such large-scale, complex, dynamic surveillance data is of great importance to protect both human lives and urban environments. However, significant challenges exist in supporting big dynamic data collection and analytics in real time [10]. A typical low frame rate (1.25 Hz) wide area motion imagery (WAMI) sequence alone can generate over 100 MB of data per second (400 GB per hour). The data scale for high frame rate and higher resolution videos [38] can be significantly more substantial. According to a recent study, video data dominates real-time traffic and creates a heavy workload on communication networks: online video accounted for 74% of all online traffic in 2017, and 78% of mobile traffic will be video by 2021 [12]. Although many video streaming protocols [49] have been proposed in support of higher efficiency, the delay and the dependence on communication networks still hinder real-time interactions. Also, concerning video data storage, although many organizations keep improving their storage capacity, in practice video footage is chronologically deleted after 60 days due to storage capacity limits, without reflecting any perceived value of the content.

The surveillance community has been aware of the growing demand for human resources to interpret the data due to the ubiquitous deployment of networked static and mobile cameras, and has made many efforts in the past decades [36]. For example, many automated anomaly detection algorithms have been investigated using machine learning [40] and statistical analysis [19] approaches. Although these intelligent approaches are powerful, they are computationally very expensive; hence they are implemented as a central cloud service. Researchers are also trying to help operating personnel become aware of events using event-driven visualization mechanisms [18], re-configuring the networked cameras [37], and mapping conventional real-time images to 3D camera images [46]. However, traditional human-in-the-loop solutions are still significantly challenged by the demands of real-time surveillance systems due to the lack of scalability. For example, video analysis mostly relies on teams of specially trained officers manually watching thousands of hours of videos of different formats and quality, looking for one specific target. Due to the manual coordination and tracking and the mechanical pan-tilt-zoom (PTZ) remote control,


it is challenging to achieve adequate real-time surveillance.

Edge computing extends the realm of information technology (IT) beyond the boundary of the cloud computing paradigm by devolving various services (i.e., computation) to the edge of the network. Edge computing services are promising to address the challenges in many delay-sensitive, mission-critical applications, like real-time surveillance, which can leverage ubiquitous networked cameras and smart mobile devices. Figure 1 presents an edge-fog-cloud hierarchical architecture in which many intelligent approaches for real-time surveillance systems are an ideal fit. However, human-object detection and tracking services are still computationally expensive and are conducted in well-powered cloud centers.

In this paper, we propose to devolve more intelligence to the edge to significantly improve many smart tasks such as fast object detection and tracking. Adopting recent research in machine learning, we choose the Convolutional Neural Network (CNN), which incurs comparatively less pre-processing overhead than other human image classification algorithms. We efficiently tailored the CNN to fit resource-constrained edge devices, following the observation that surveillance systems are mainly for the safety and security of human beings. According to our experimental study, the L-CNN algorithm can process an average of 1.79 and up to 2.06 frames per second (FPS) on the selected edge device, a Raspberry PI 3 board. This meets the design goal considering the limited computing power and the requirements of the application. Intuitively, tracking a car in real time requires much higher computational power than monitoring a pedestrian due to the higher mobility; humans move slowly in comparison to the high frame rate of a camera. Consequently, a human-object detection approach only needs to find newly arriving human objects in a frame, because previously identified human objects are already accounted for in earlier frames and human objects walking out of the frame can be automatically removed. Hence the real-time human-object detection task can be accomplished by executing the detection algorithm at only two frames per second.

On the other hand, SSD (Single Shot Multi-box Detector) GoogleNet [34] is one of the best object detection algorithms. It achieves roughly one quarter of the frame rate (0.4 FPS) and requires almost three times the memory (320.4 MB) of the L-CNN algorithm (for details and discussion, see Section VI). The proposed L-CNN algorithm, in contrast, is adequate for small edge devices: it averages 1.79 FPS, close to the 2 FPS requirement, and needs just 122.5 MB of memory given the limited RAM (i.e., 1 GB) of typical edge devices. Typically, an algorithm's top five guesses for the object in an image are the criteria used to evaluate the performance of CNN-based classifiers; the error rate can be calculated based on the top five guesses as well as the top guess of each algorithm. The L-CNN is designed to distinguish human from non-human objects, however, so there are no other classes for the L-CNN to guess further. Thus, instead of waiting for the top five predictions, it can quickly take the first prediction. Rather than giving a label to the whole image based on the dominant object, the SSD outputs the bounding box coordinates around each object of

Fig. 1. Edge-Fog-Cloud hierarchical architecture.

interest. An accuracy between 85% and 90% is achieved, varying with the color and texture of the background and the number of human objects in a single frame. These results are based on real-life, on-campus security camera footage.

In summary, the major contributions of this work are highlighted below:

1) Aiming at intelligent surveillance as an edge network service, a thorough study of two well-known human-object detection schemes, Haar-Cascade and HOG+SVM, has been conducted, evaluating their feasibility of running on resource-limited edge devices;

2) A lightweight Convolutional Neural Network (L-CNN) is proposed to enable real-time human-object identification at the network edge;

3) Instead of simulation, system-oriented research has been conducted. The L-CNN, SSD GoogleNet, Haar-Cascade, and HOG+SVM algorithms are implemented on a Raspberry PI Model 3 board as the edge device; and

4) An extensive experimental validation study has been conducted using real-world surveillance video data. Compared with SSD GoogleNet, Haar-Cascade, and HOG+SVM, the L-CNN is a promising approach for delay-sensitive, mission-critical applications like real-time smart surveillance.

The rest of the paper is organized as follows. Section II provides background on closely related work. Section III discusses two widely used pedestrian detection methods and evaluates their performance in an environment where resources are not as abundant as in a cloud center for video processing. Section IV then introduces the proposed lightweight CNN architecture, and the training of the L-CNN is discussed in Section V. In Section VI, a comparison study of the performance of the L-CNN and the other algorithms, all implemented on a Raspberry PI 3 Model B device as the edge computing device, is presented. Finally, Section VII wraps up this paper with conclusions and a discussion of our ongoing efforts.

II. BACKGROUND KNOWLEDGE AND RELATED WORK

A. Smart Surveillance as an Edge Service

Traditional surveillance systems depend on human operators to manipulate the processing of captured video [7]. However, there are many shortcomings with this approach. Not only is it unrealistic to expect a human operator to maintain full concentration on the video for a long time, but the approach also does not scale as the number of cameras and sensors grows significantly. More recently proposed smart systems aim at minimizing the role human operators play in object detection, with the responsibility for abnormal behavior detection taken over by more intelligent machine learning algorithms [10], [45]. These algorithms automatically process the collected video frames in a cloud to detect, track, and report any unusual circumstances. Under the hierarchical edge-fog-cloud paradigm shown in Fig. 1, the functions in a smart surveillance system are classified into three levels:

• Level 1: each object of interest is identified through low-level feature extraction from video data;

• Level 2: the behavior or intention of each object of interest is detected/recognized, and quick alarms are raised; and

• Level 3: anomalous or suspicious activity profiles are built and historical statistical analysis is performed.

Ideally, the minimum delay and communication overhead would be achieved if all the functions were conducted on-site at the network edge and decisions were made instantly [8], [9]. However, it is not realistic to accomplish the operations of both Level 1 and Level 2 on the edge devices. Therefore, once the Level 1 detection and tracking tasks are done, the results are outsourced to the fog layer for further data contextualization and decision making. The computationally expensive Level 3 functions can be positioned at the fog computing level or even further, in cloud centers, considering the constraints on edge processing power; functions like long-term profile building are not required to be accomplished instantly. In a smart surveillance system, the Level 1 functions are fundamental. More specifically, human-object detection is vital, as missing any object in the frame will lead to undetected behavior. Also, the false positive rate should be minimized, because wrongly identifying an object in the frame as a human may result in false alarms.

B. Human Object Detection

In this paper, we focus on quick and accurate detection of human objects. Detecting a human being in video can be more challenging than detecting a moving car or other pattern-following objects, due to occlusion and the inconstant appearance, illumination, and background. It is vital for the algorithm to output the exact position coordinates of each object of interest for tracking purposes.

Haar-like feature extraction is well suited for face and eye detection [13]. Haar models are lightweight and very fast, which makes them appealing candidates for edge implementation. However, the human body can appear in different clothing, which makes it harder for this type of model to achieve high detection accuracy [23]; thus, pedestrians in videos are often missed by this model. Grids of Histograms of Oriented Gradients (HOG) can produce reliable features for human detection [15]. While Haar features fail to detect humans when the body angle toward the camera changes, HOG features continue to perform well. These features are given to a Support Vector Machine (SVM) classifier to create a human detection algorithm called HOG+SVM [35]. One downside of using HOG feature extraction at the edge is that this method creates a burden in a limited-resource environment; performance results are discussed and compared in Section VI. Scale Invariant Feature Transform (SIFT) is another well-known algorithm for human detection; it extracts distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene [25].

C. Machine Learning at the Edge

Powerful machine learning algorithms are recognized as the solution for taking full advantage of big data in many areas [32], [51]. However, when big data comes to the edge, the demand for computing and storage resources makes them unfit; the edge environment necessitates lightweight but robust algorithms. Applications of some simple machine learning algorithms have been investigated in resource-constrained environments such as wireless sensor networks (WSNs) and IoT devices [2]. There are also efforts to build large-scale distributed machine learning systems that leverage heterogeneous computing devices and reduce communication overhead [1], [42]. Although none of the reported algorithms ideally fits the resource-constrained edge environment, they have laid a solid foundation for us. In fact, the community has recognized the importance of efficient models for mobile and embedded applications [3], [27], [48].

GoogleNet [43] and Microsoft ResNet [24] are well-known and powerful. They can take a picture as input and classify up to one thousand different objects in the picture. Such a network has as many filters as needed to create a feature map that can differentiate among all the possible objects [26], [28]. If only human objects need to be classified or detected, the network architecture requires fewer filters for the same performance. However, the authors could not find any deep learning network designed specifically with human-object detection in mind.

It is attractive to create faster deep learning networks that require fewer resources without losing performance. As its name implies, SqueezeNet achieves the same performance as AlexNet but takes less memory [29]. MobileNet is created to work on resource-constrained devices [27]. It is not only memory efficient but also very fast because of the different convolutional architecture it is built on. It has been mathematically shown that this architecture creates less computational burden while having fewer parameters. MobileNet produces results comparable to GoogleNet and ResNet [24]. ResNet has higher accuracy than AlexNet and is one of the best performing networks used in R-CNN.

D. CNN Training

A CNN needs to be well trained before being deployed to conduct its tasks. Usually, the training process requires considerable computing resources and large storage space that allow the training images to be loaded and fed to the network in batches. Also, the filters and other parameters should be pre-loaded into memory. In addition, the back-propagation operation incurs math-intensive matrix and differential calculations. Clearly, the edge environment is not an ideal place for training.

There are several widely used frameworks to serve this purpose, such as TensorFlow [1], Keras [11], and Caffe [31]. TensorFlow is an open source software library for machine learning and artificial intelligence in general, introduced by Google's Machine Intelligence research organization in 2015. Based on numerical computation using data flow graphs, one prominent benefit of this framework is that it is optimized for a distributed environment; many GPUs, in particular, can collaborate to increase training speed. Also, a light version of TensorFlow was recently introduced for mobile devices, which allows loading CNN models without additional libraries.

Keras gained popularity for its user-friendly, easy-to-learn environment. It uses TensorFlow as a back-end engine. The Keras libraries, accessed using Python, create a bridge between Python syntax and TensorFlow, and are designed to make it easy for the user to generate and test deep models as quickly as possible. Many of the TensorFlow parameters are given default values, and a high-level interface is provided for the end user. The trade-off is that, in allowing spontaneous coding in Python, the low-level flexibility of TensorFlow is sacrificed. Moreover, one problem that makes Keras not the best choice for edge devices is that while the OpenCV library supports deep learning models, it still fails to import Keras-based networks. To use Keras, the library itself has to be installed on the edge device, and the results then need to be loaded into the OpenCV library.

Introduced in 2014 and written in C++, Caffe is a well-known tool in the deep learning community. It is a low-level library for working with CNNs, and its fast speed makes it suitable for research experiments and industry deployment. Caffe can process over 60M images per day with a single NVIDIA K40 GPU [31]. On the other hand, Caffe has two main weaknesses. The lack of documentation of its commands makes coding hard, specifically in the SSD realization of Caffe, where layers needed for SSD deployment come with very little information about their functionality. Additionally, a Caffe architecture is written as a plain text file, which is harder to manage as more layers are added to the architecture.

In this work, the proposed L-CNN is trained using MXNet because of its implementation of SSD and good documentation. It will be discussed in Section V.

III. HAAR-CASCADE AND SVM AT THE EDGE

Haar Feature-based Cascade Classifier algorithms are widely adopted on small devices such as cellphone cameras for object detection [44]. They are very fast once trained and do not need a lot of memory to store data. These algorithms are used to detect faces, eyes, license plate numbers, and even facial gestures; they can also be used for full-body human-object detection. The Support Vector Machine (SVM) [14] classifier is popular for solving many practical classification problems. Its best performance is achieved when the classification features are well distinguished. Using the raw pixel values to identify a human being would produce poor results, as it is not

Fig. 2. Examples of Haar-like features: (a) two-rectangle features; (b) three-rectangle features; (c) four-rectangle features.

robust against environmental noise. To tackle this shortcoming, the Histogram of Oriented Gradients (HOG) feature extraction method has been recognized as the best complement to SVM for human-object detection [5].

These algorithms have been studied in different scenarios; interested readers may find more details in the literature [14], [44]. This study is focused on their performance on edge devices.

A. Haar-Cascade

The first step is feature detection. Haar-like features consist of three shapes that can appear at any position and scale within the input image:

• Two-rectangle features: the sum of pixel values in one half of the filter is compared with the sum in the other half of the selected region of the input image;

• Three-rectangle features: the filter has three parts, where two are considered together and their sum is subtracted from the third part; and

• Four-rectangle features: a rectangular shape is considered, and the sum over two of its parts is subtracted from the sum over the other two parts.

Figure 2 shows examples from the Haar-like feature set. These filters convolve over an input image, and at each position the sum of pixel values under the black rectangles is subtracted from the sum under the white rectangles. This process performs better for objects that are closer to the camera, since the features are based on the pixel values of the human body. Similarly, when the capture angle changes, the same filter may produce very different results, and a simple classifier may miss an object if the features are not fully reliable for detection. One may also argue that, with a better training image set, Haar-Cascade will provide more accurate results.

When all possible shapes and sizes of Haar features are convolved, even a 24 × 24 image produces more than 160 thousand features, since the filters in Fig. 2 can have any combination of size, rotation, and position. The learning process is computationally expensive and needs to take place on CPU clusters, which might not be available to everyone. However, once training is finished, a feature set is ready and only the selected features are stored for future classification; thus, the computational complexity of the overall algorithm is small. Ready-to-use .xml files for different object detection tasks are publicly available on GitHub [22], and the OpenCV libraries make it easy to use these files for multi-detection purposes.
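
As a concrete illustration, such a pre-trained cascade file can be loaded and applied with a few lines of OpenCV Python code. The sketch below assumes a full-body cascade file named haarcascade_fullbody.xml and a test image frame.jpg are available locally (both names are placeholders):

    import cv2

    # Load a pre-trained Haar cascade (hypothetical local path; such
    # .xml files are distributed through the OpenCV GitHub repository).
    cascade = cv2.CascadeClassifier("haarcascade_fullbody.xml")

    # Haar features operate on single-channel pixel intensities,
    # so the frame is converted to grayscale first.
    frame = cv2.imread("frame.jpg")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Scan the image at multiple scales; each detection is (x, y, w, h).
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)
    for (x, y, w, h) in boxes:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)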

During training, these features are applied to each image, the results are calculated, and the best-performing features are selected. The selection is performed by the AdaBoost (Adaptive Boosting) algorithm, which is constructed from classifiers called "weak learners". The algorithm forms a weighted sum of the weak learners' results (Eq. 1), where h(x) denotes a weak learner applied to input x. During the learning process, each weak learner receives a weight in the sum, and the error (Eq. 2) is calculated based on the previously boosted classifier. This is supervised learning, and the goal is to minimize the error over every input i at learning iteration t.

$$F_T(x) = \sum_t f_t(x), \quad \text{where } f_t(x) = \alpha_t h_t(x) \tag{1}$$

$$E_t = \sum_i E\left[F_{t-1}(x_i) + \alpha_t h(x_i)\right] \tag{2}$$
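
A minimal sketch of evaluating the boosted sum of Eq. 1, assuming the weak learners are simple threshold functions returning ±1 (all names hypothetical):

    import numpy as np

    def boosted_classify(x, weak_learners, alphas):
        # Strong classifier F_T(x) = sum_t alpha_t * h_t(x), thresholded at 0.
        score = sum(a * h(x) for h, a in zip(weak_learners, alphas))
        return 1 if score >= 0 else -1

    # Toy usage: two threshold "stumps" acting on different feature indices.
    stumps = [lambda x: 1 if x[0] > 0.5 else -1,
              lambda x: 1 if x[1] > 0.2 else -1]
    alphas = [0.8, 0.4]
    print(boosted_classify(np.array([0.7, 0.1]), stumps, alphas))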

A tremendous number of images, both positive and negative, are collected for training. Positive images contain the object of interest against different backgrounds, together with the position and coordinates of each sample; negative images do not include the object of interest. It is common to use around 2000 positive and 1000 negative images. In practice, many images are used twice by mirroring them: the mirror of a positive image is counted as another positive image, and any small change in the object position is likewise treated as another positive sample. Training produces a file that contains the selected features to be applied to input images. At run time, the object recognition algorithm applies the features from this file and, according to the results, performs the classification and outputs the object coordinates.

Fig. 3. (a) HOG convolution filter. (b) HOG representation of an image. (c) Cell representation of the HOG [34].

As revealed by the training procedure and the working flow of the algorithm, Haar-Cascade object classification does not perform well if the training set does not contain all possible angles. Also, if the object of interest is far away from the camera, as is common for surveillance cameras, the algorithm may fail to detect the object. Furthermore, the simple structure may imply a loss of robustness.

B. SVM Classifiers

Figure 3(a) depicts a filter placed on a pixel shown in white with its four neighboring pixels in black. The X and Y derivatives are calculated simply by subtracting the horizontal and vertical neighboring pixel values, respectively, for the white pixel. In particular, the X derivatives are fired by vertical lines and the Y derivatives by horizontal lines, which makes the overall system sensitive to edges. Changing the representation to amplitude and angle results in unsigned gradients for each given pixel. In practice, a filter can convolve over the image and at each step calculate the gradient for a given pixel. Because of the unsigned gradients, the angular values lie between 0° and 180°. If nine bins of 20° each are considered, the amplitudes of the gradients can be represented in nine bins. It is worth mentioning that if the angular value is not at the center of a bin, the amplitude is divided between the two bins closest to the gradient's angle. If the input image has more than one channel, such as RGB, the channel with the highest gradient amplitude is chosen, and its angle is used for the histogram representation.

The accuracy and performance of the algorithm depend on the number of pixels chosen per cell. While eight pixels allow fast subroutines to be used, other values such as six or nine degrade performance drastically. Also, making this window smaller gives more accuracy, but multiple detections might be reported. Figure 3(b) shows one such histogram: each of the nine bins corresponds to 20 degrees. This histogram is taken from a patch of pixels crossed by a line, so the angular range of 120° to 140° has the highest count. The values represented in this figure are normalized.

The normalization is based on larger cells of 32 × 32 pixels: instead of using the second norm of the nine bins of each 8 × 8 superpixel, the 16 gradient histogram vectors within the larger cell are considered together for normalization to make the algorithm faster. A vector of size 144 is thus created for further classification. As an example, the HOG algorithm output can be rendered as an image, as in Fig. 3(c), where a 64 × 64 cell is chosen to represent each gradient cell (enlarged to be visible to human eyes). This figure shows each of the gradients in white, for parts of the input frame, in an angle-and-amplitude manner.
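
The per-cell binning just described can be sketched in a few lines of Python. This is one common variant, assuming bin edges at multiples of 20° and magnitude split between the two nearest bins (function and variable names are hypothetical):

    import cv2
    import numpy as np

    def hog_cell_histogram(cell):
        # Nine-bin gradient histogram for one grayscale cell (e.g., 8x8).
        # ksize=1 gives the simple [-1, 0, 1] neighbor-difference filters.
        gx = cv2.Sobel(cell, cv2.CV_32F, 1, 0, ksize=1)
        gy = cv2.Sobel(cell, cv2.CV_32F, 0, 1, ksize=1)
        mag, ang = cv2.cartToPolar(gx, gy, angleInDegrees=True)
        ang = ang % 180.0          # unsigned gradients: 0 to 180 degrees
        hist = np.zeros(9, dtype=np.float32)
        for m, a in zip(mag.ravel(), ang.ravel()):
            b = a / 20.0           # fractional bin position
            lo, w = int(b) % 9, b - int(b)
            hist[lo] += m * (1.0 - w)       # share for the nearer bin
            hist[(lo + 1) % 9] += m * w     # share for the next bin
        return hist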

Fig. 4. Image pyramid used in the HOG feature extraction method.

To capture features at all distances, an image pyramid is usually leveraged. The image with the most detail is considered first; then some pixels in each row and column are discarded to create a lower-resolution version of the same image, and the same HOG algorithm generates another feature map. The steps iterate until it is no longer feasible to conduct classification on the image. The SVM classifies objects of interest at each stage, so multiple detection reports are possible, and in different scenes fine-tuning of the variables might be needed. Figure 4 shows an image pyramid, where the top-left is the actual input and the bottom-right has the fewest pixels but the largest pixel size, which preserves the dimensionality of the input image.
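
A minimal image-pyramid generator, assuming a fixed downscale factor and a minimum size matching the detection window (names hypothetical):

    import cv2

    def image_pyramid(image, scale=1.25, min_size=(64, 128)):
        # Yield progressively smaller copies of `image` until it is
        # too small to contain the detection window.
        yield image
        while True:
            h, w = image.shape[:2]
            new_w, new_h = int(w / scale), int(h / scale)
            if new_w < min_size[0] or new_h < min_size[1]:
                break
            image = cv2.resize(image, (new_w, new_h))
            yield image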

Figure 5 shows an example where the output detects a single human object several times, because several feature maps are provided to the SVM. Therefore, an extra step is needed to keep only one of the bounding boxes and discard the rest. One commonly used method is to keep the biggest bounding box as the object. This approach may lead to inaccurate detection; the effect is more noticeable when multiple human objects are close to each other, and the method may then fail to detect all objects. Although the detection rate can be improved by fine-tuning the filter size and values, in practice it is non-trivial to reconfigure cameras once they have been installed.

As explained above, the HOG algorithm extracts features and a trained SVM classifies the humans. Unfortunately, while the feature extraction produces useful information, it is too expensive for edge devices: these computing-intensive tasks are executed repeatedly for each frame, which makes the overall algorithm unsuitable for an edge device where human detection must run frequently in a real-time environment.

The HOG+SVM algorithm is computationally expensive and suffers from the same problem as the Haar-Cascade algorithm. It can recognize the features and detect human objects well if a pedestrian is captured by the surveillance camera from the same angle as in the training data, but it is not able to detect objects captured from different angles. Algorithms capable of identifying human objects across varying capture angles are thus still needed.
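
For reference, OpenCV ships a default HOG+SVM pedestrian detector whose standard usage looks like the sketch below (frame.jpg is a hypothetical test image; the parameters shown are common defaults, not the paper's tuned values):

    import cv2

    # HOG descriptor with OpenCV's pre-trained pedestrian SVM coefficients.
    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

    frame = cv2.imread("frame.jpg")

    # Slide a 64x128 window over an internally built image pyramid;
    # winStride and scale trade accuracy for speed on constrained devices.
    boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8),
                                          padding=(8, 8), scale=1.05)
    for (x, y, w, h) in boxes:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)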

IV. LIGHTWEIGHT CNN

CNNs have been widely applied as a powerful tool for object classification. However, it is non-trivial to fit CNNs into

Fig. 5. False multiple detection for a single human object.

network edge devices due to the very strict resource constraints. Even if the time-consuming, computing-intensive training can be outsourced to the cloud, edge devices still cannot afford the storage space for the parameters and filter weights of these deep neural networks. For example, the well-known ResNet comes with 50, 101, or 152 layers, which implies a huge number of parameters. Therefore, a lightweight CNN design is needed in the edge environment.

The L-CNN is proposed in order to fit this intelligence into edge computing devices. The first step of the L-CNN is to employ the depthwise separable convolution [27], [41] to reduce the computational cost of the CNN itself without sacrificing much of the network's accuracy. The network is also specialized for human detection, avoiding unnecessarily large filter numbers in each layer. Secondly, because the main goal here is a surveillance system that can protect humans, the lightweight architecture with the best performance is chosen. The accurate position of each pedestrian is needed to make decisions based on their movement; thus, the state-of-the-art SSD architecture is used to obtain fast and reliable object detection results.

A. Depthwise Separable Convolution

By splitting each conventional convolution layer into two parts, a depthwise convolution and a pointwise convolution [41], the computational complexity becomes more suitable for edge devices. These architectures are introduced first, as they are part of the foundation of the L-CNN.

More specifically, a conventional convolution takes an input F of dimension Df × Df with M channels and maps it into G, which has N channels of dimension Dg × Dg. This is done by the filter K, a set of N filters, each of size Dk × Dk with M channels, as calculated in Eq. 3:

$$G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{k+i-1,\,l+j-1,\,m} \tag{3}$$

The computational complexity is

$$CC_{\text{conventional}} = D_k \times D_k \times M \times N \times D_f \times D_f \tag{4}$$

Figure 6 compares the depthwise separable convolution and the conventional convolution. The depthwise separable convolution consists of two parts. The first is a depthwise convolution layer of M filters of size Dk × Dk × 1, which produces M outputs. Next is a pointwise convolution layer in which the filters are N filters of size 1 × 1. With the same input F as before, the depthwise layer produces its output as in Eq. 5:

$$\hat{G}_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot F_{k+i-1,\,l+j-1,\,m} \tag{5}$$

where K̂ is the depthwise convolutional filter with the special dimension Dk × Dk × M, and the m-th filter in K̂ is applied to the m-th channel of F. The computational complexity of the depthwise separable convolution is


Fig. 6. Comparison between (a) the conventional convolution and (b) the depthwise separable convolution.

$$CC_{\text{depth}} = D_k \times D_k \times M \times D_f \times D_f + N \times M \times D_f \times D_f \tag{6}$$

Based on Eq. 6 and Eq. 4, the computational complexity is reduced by the factor given in Eq. 7 [27]. This makes a faster and more efficient network that is an ideal fit for edge devices.

$$\frac{CC_{\text{depth}}}{CC_{\text{conventional}}} = \frac{1}{N} + \frac{1}{D_k^2} \tag{7}$$

Immediately after each convolutional step, there is a batch normalization layer and a ReLU layer, introducing normalization and nonlinearity, as found in many convolutional architectures.
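
As an illustration, a depthwise separable unit of this kind (depthwise 3 × 3 convolution, then pointwise 1 × 1 convolution, each followed by batch normalization and ReLU) can be sketched with MXNet's Gluon API, consistent with the MXNet training setup described in Section V. The channel counts are placeholders, not the L-CNN's actual layer sizes:

    from mxnet.gluon import nn

    def depthwise_separable_block(in_channels, out_channels, stride=1):
        # Depthwise 3x3 conv + pointwise 1x1 conv, each with BN and ReLU.
        block = nn.HybridSequential()
        # Depthwise: one 3x3 filter per input channel (groups == channels).
        block.add(nn.Conv2D(in_channels, kernel_size=3, strides=stride,
                            padding=1, groups=in_channels, use_bias=False))
        block.add(nn.BatchNorm())
        block.add(nn.Activation('relu'))
        # Pointwise: 1x1 convolution combining channels into out_channels.
        block.add(nn.Conv2D(out_channels, kernel_size=1, use_bias=False))
        block.add(nn.BatchNorm())
        block.add(nn.Activation('relu'))
        return block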

B. The L-CNN Architecture

The proposed L-CNN network architecture has 26 layers when depthwise and pointwise convolutions are counted as separate layers, not counting the final classifier, softmax, and regression layers that produce a bounding box around each detected object. At the end of the network, a simple fully connected neural network classifier takes the prior probabilities of each window of objects, identifies the objects within each proposed window, and adds the label for the output bounding boxes. Figure 7 depicts the filter specifications for each layer of the network.

In this architecture, a pooling layer appears only once, at the final step of the convolutional layers; downsizing is otherwise achieved through the stride parameter of the filters. The first convolutional layer of the L-CNN architecture is a conventional convolution, but the rest of the network uses depthwise along with pointwise convolutions.

To classify thousands of objects, the classifier at the end of a conventional deep learning network must process the final feature map and produce a percentage output for each class. Therefore, in the case of a fully connected neural network, 1000 to 2024 neurons are used in the hidden layer and 1000 neurons in the output layer (one neuron per class). In contrast, the L-CNN is focused on human-object detection, so the network is used only for pedestrian detection and there is no need for a complicated classifier at the end of the network. This further simplifies the network and decreases the number of parameters to store.

Fig. 7. L-CNN network layers specification.

C. Human Object Detection Methods

The output of a CNN is an object prediction for the image. In surveillance tasks, the exact position of the object is also important: there might be more than one pedestrian in the frame, and it is not enough to tell whether there is a human object; the object's coordinates are essential for tracking purposes. Therefore, the network architecture and training methods used for object detection differ from those used for image classification.

There are several models for finding an object within an input image. Using the feature map, Region-CNN (R-CNN) receives an input frame and proposes places where an object may exist [21]. After computing the feature map from the convolutional layers, R-CNN tests the proposals, detects objects in each proposal, and classifies them. R-CNN treats the network as a couple of separate parts: once the feature map is created, it is used to extract bounding boxes and to identify the most probable object class corresponding to each bounding box. Taking another step with this model, Faster R-CNN [20] reduces the number of proposals to meet the requirements of real-time detection applications. According to [39], Faster R-CNN performs about 200 times better than the original R-CNN, depending on the network used. However, it may not be the best option for an edge device, because it may still be too slow.

Introduced in late 2016, the Single Shot Multi-box Detector (SSD) method is in general faster than R-CNN [34]. The name comes from the fact that results are generated in one feed-forward pass through the network, with no need for the extra steps taken in R-CNN. It is a unified framework for object detection with a single network. For training purposes, the SSD architecture needs more libraries and commands than a usual CNN; it comes with distributions of Caffe and MXNet, and once installed it receives the input image and outputs the coordinates of each detected object along with a label. It modifies the proposal generator to produce a class probability instead of the mere existence of an object in the proposal. Instead of the classical sliding window checked for an object at each position, SSD creates, at the beginning layer of convolutional filters, a set of default bounding boxes over different aspect ratios, and then scales them with each feature map through the convolutional layers along the image itself. In the end, it checks for the presence of each object category based on the prior probabilities of the objects in the bounding boxes, and finally adjusts each bounding box to better fit the detection. One downside of SSD is that detection accuracy for smaller objects is low if prior-probability extraction is performed in only one layer. In smart surveillance, this can lead to a loss of generality, because the goal is to detect every human object regardless of its distance to the camera or angle toward it. However, if the outputs of feature maps from different layers are used [39], the detection rate can be increased.

The proposed L-CNN network uses the feature maps from layers 11, 13, 15, and 17 to create more accurate detections. This approach gives the network high speed and makes it suitable for real-time human detection; furthermore, there is no need to redesign the whole classification part of the network. As reported, Faster R-CNN achieved seven FPS on a Titan X graphics card while SSD achieves 46 FPS [34], and SSD has also been used for even faster and more accurate small-object detection [6]. After the training phase, the video is processed through the OpenCV 3.3 library; therefore, the edge device only needs to run OpenCV, which is eventually used for object tracking as well.
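
A minimal sketch of running an SSD-style detector through OpenCV's dnn module is shown below. It assumes the trained network has been exported to a Caffe-compatible pair deploy.prototxt / model.caffemodel (hypothetical file names; OpenCV 3.3 loads Caffe-format nets directly, and the mean values shown are common defaults rather than the L-CNN's trained statistics):

    import cv2

    # Hypothetical exported model files.
    net = cv2.dnn.readNetFromCaffe("deploy.prototxt", "model.caffemodel")

    frame = cv2.imread("frame.jpg")
    h, w = frame.shape[:2]

    # Resize to the fixed 224x224 network input and subtract channel means.
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (224, 224)), 1.0,
                                 (224, 224), (104.0, 117.0, 123.0))
    net.setInput(blob)
    detections = net.forward()

    # Each detection row: [image_id, class_id, confidence, x1, y1, x2, y2]
    # with box corners normalized to [0, 1].
    for i in range(detections.shape[2]):
        if detections[0, 0, i, 2] > 0.5:          # confidence threshold
            x1, y1, x2, y2 = (detections[0, 0, i, 3:7] *
                              [w, h, w, h]).astype(int)
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)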

V. L-CNN TRAINING

As explained before, the fully trained network is fast and deployable on edge devices. Instead of TensorFlow or Caffe, the proposed L-CNN model was implemented and trained on MXNet.

MXNet is a flexible and scalable deep learning framework that many cloud centers provide today. Writing a deep learning architecture becomes an easy task because of its support for several languages, such as C++, Java, and Python. MXNet has a strong community and documentation that help reach results faster; moreover, the SSD branch of MXNet has a responsive community that helps programmers familiarize themselves with SSD. The L-CNN uses MXNet in Python, as the structure can be generated easily with Python functions. The L-CNN architecture file, called a symbol file and ready for training, has fewer than 80 lines with the addition of the SSD code at the end. Through the training phase, a .json text file of the architecture along with another file containing the network weights is created [47], which can later be used instead of the symbol file for better speed. These outputs are closely related to Caffe's files (the architecture .prototxt and the .caffemodel weights). To the best of the authors' knowledge, MXNet and Caffe are the two frameworks for CNN training that support SSD-based object detection, and Caffe is not considered because of its documentation shortcomings, especially in the SSD branch.

One of the most important steps is to choose a data set that contains human objects captured by surveillance cameras from different angles and distances, so that more general features can be extracted. Different image sets were considered for training the network. Among them, INRIAPerson [15] and COCO [33] contain many images. The main drawback of these image sets is that they do not contain many of the possible human poses. For example, INRIAPerson includes many images containing people of different genders, sizes, and ages; however, all of these images are the same size and are taken from the same point of view. This image set was created for different purposes, and it is not helpful for our L-CNN training: the resulting system may only detect people from the side, because all the images are taken from that angle, or only at a certain distance from the camera, because the training samples are all alike. To preserve the generality needed to detect any human from any angle, a huge set of images with varied sizes and capture angles is needed.

While many image sets are available for classification, object detection image sets are less common: each object in an image must be classified, and its coordinates must be marked separately. Among the existing data sets, the Pascal Visual Object Classes (VOC) challenges [17] from 2005 to 2012 provide images at different resolutions in twenty classes, one of which is human. The COCO image set can also be considered, as its objects come with segmentations that provide boundaries for each object class. Another source might be the video segments YouTube provides.

In this work, images from VOC along with ImageNet [16] are used to train the proposed L-CNN network. ImageNet is the biggest image set, containing more than 14 million images from more than one thousand classes of objects. In this application, only the images of the human class are used. ImageNet provides bounding box coordinates for some of the images in particular classes. Note that ImageNet uses synsets to name the classes; although this naming system is machine-readable, it is harder for humans to understand. A combination of sub-classes of human images, with coordinates from the ImageNet website, is employed.

The images have to match the input size of the network, which accepts colored (RGB) images of 224 × 224 pixels. For training, blobs of 32 three-channel images are created, and for validation, blobs of 16 images; 75% of the total set was used for training and 25% for validation. Lightning Memory-Mapped Database (LMDB) files were produced, which store data as {key, value} pairs; converting the image set to this format leads to faster reading. Furthermore, before the training, the data is normalized by calculating the mean value of each RGB channel.

In total, more than 8000 images were gathered from the different sources mentioned above. Training was done on a server machine with 28 Intel(R) Xeon(R) CPU cores at a base frequency of 2.4 GHz and 256 GB of physical memory, and took 9.7 days. Several stop criteria were introduced, such as a maximum of 400 iterations, where each epoch consists of 250 batches; for every iteration, one validation test took place. Also, every 40 iterations a snapshot of the weights was created to save progress. The training error is calculated as in Eq. 8:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{8}$$

where ŷi is the value calculated by the network and yi is the actual value. This mean square error represents the error in object detection; the linear regression used for fine-tuning the bounding box uses a different error metric.
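
In code form, Eq. 8 is a one-liner in NumPy (a sketch; y_true and y_pred are hypothetical arrays of actual and predicted values):

    import numpy as np

    def mse(y_true, y_pred):
        # Mean square error of Eq. 8.
        return np.mean((y_true - y_pred) ** 2)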

VI. EXPERIMENTAL RESULTS

A. Experimental Setup

All of the above-discussed methods are implemented on an edge device, a Raspberry PI 3 Model B with an ARMv7 1.2 GHz processor and 1 GB of RAM.

Single Board Computers (SBCs), which run a full operating system and have sufficient peripherals (memory, CPU, power regulation) to begin execution without additional hardware, originally targeted industrial platforms such as vending machines. The Raspberry Pi Foundation made the SBC accessible to almost anyone at a low cost ($35) through the Raspberry Pi product family. Several advantages make the Raspberry Pi an excellent choice for implementing IoT-based edge computing applications:

1) The low cost of the hardware, its easy availability, and the advantages of a substantial community allow developers to easily set up their work on the Raspberry Pi platform.

2) The Raspberry Pi is based on an ARM architecture and runs Raspbian, a Debian-based Linux operating system. Binary compatibility with other architectures, like X86/64, is supported when migrating applications, which helps address the heterogeneity of the IoT devices used in edge computing applications.

3) The exposed General Purpose Input-Output (GPIO) connection pins can bridge the gap to the physical world by connecting a huge variety of electronic components, enabling context-aware intelligent edge computing.

4) As a powerful prototyping platform, the Raspberry Pi makes it convenient for researchers to build a fully functional prototype to test and validate an idea or hypothesis.

Above all, given merits like commodity hardware, support for high-level programming languages (e.g., Python), and popular variants of Unix-based operating systems, the Raspberry Pi is an ideal platform for edge computing.

The CPU and memory utilization of the algorithms are captured by a third-party application named memory_profiler.

Fig. 8. Performance in FPS, CPU and Memory Utilization, and Average False Positive Rate (FPR%).

This software works with Python applications and can track the CPU and memory used by a process; it saves the data and later plots it using the Python Matplotlib library. Frames Per Second (FPS) is the major parameter for evaluating the performance of these algorithms. Figure 8 shows the average FPS over 30 seconds of run time for each algorithm. Once again, note that the other CNN architectures had to be retrained using SSD or other object detection models before they could be used here.
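
For reproducibility, the measurement can be sketched with the memory_profiler Python package, assuming the detection loop is wrapped in a function (run_detector is a hypothetical placeholder):

    from memory_profiler import memory_usage

    def run_detector():
        # Placeholder for the detection loop being profiled.
        pass

    # Sample the resident memory of this process every 0.1 s while
    # run_detector executes; returns a list of readings in MiB.
    readings = memory_usage((run_detector, (), {}), interval=0.1)
    print(max(readings), "MiB peak")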

B. Results and Discussions

Figure 8 summarizes the experimental results. The fastest algorithm is Haar-Cascade; the proposed L-CNN is second and very close behind. The figure also shows that Haar-Cascade is the best in terms of resource efficiency, with the L-CNN again second and very close. However, in terms of average false positive rate (FPR), our L-CNN achieves a very decent 6.6%, much better than Haar-Cascade (26.3%). In fact, the L-CNN's accuracy is comparable with SSD GoogleNet (5.3%), but the latter demands far more resources and runs at an extremely low speed (0.39 FPS). In contrast, the L-CNN averages 1.79 FPS with a peak of 2.06 FPS, more than six times faster than the computationally heavy HOG+SVM algorithm (0.3 FPS). Although both HOG+SVM and SSD GoogleNet show good accuracy (HOG+SVM achieves this through fine-tuning), their slow speed and high CPU and memory demands can be prohibitive for edge applications.

Typically, the number of parameters of a CNN architecture is reported. However, since the L-CNN is compared here with algorithms other than convolutional networks, the memory taken during object detection execution is reported instead. In Fig. 8 the memory used by the different algorithms is shown as an average over 30 seconds of run time. GoogleNet needs more memory in general due to its huge number of filters and its architecture, which must be loaded into memory at run time.

Fig. 9. Haar-Cascade.

Fig. 10. HOG+SVM.

GoogleNet does not use a huge amount of memory here, contrary to other reports, because this is an SSD-based GoogleNet and the images used for training have only 21 classes instead of one thousand. As shown in Fig. 8 and Table I, fewer classes require less memory. To compute these accuracy measures, real-life surveillance video is used; as there is only one usage model, the accuracy reported here may be higher than the general-purpose accuracy reported in other literature.

Figures 9 to 12 show the results of Haar Cascaded, HOG+SVM, GoogleNet, and L-CNN in processing a sample surveillance video. The footage is resized to 224 × 224 pixels for all algorithms: the smaller the image, the less computational resource is required, and the deep model architectures only accept fixed-size inputs. Therefore, to compare all the algorithms fairly, they are all fed images of the same size. Because in practice surveillance videos are not allowed to be exposed to the public, the figures included in this paper are footage from an open-source online video.
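For reference, a minimal sketch of this fixed-size feeding through OpenCV's dnn module is shown below; the model and video file names are placeholders, since the trained weights are not distributed with this text.

    # Sketch: feed same-size frames to a Caffe-trained SSD detector with
    # OpenCV; the prototxt/caffemodel/video names are placeholders.
    import cv2

    net = cv2.dnn.readNetFromCaffe("l_cnn_deploy.prototxt", "l_cnn.caffemodel")
    cap = cv2.VideoCapture("open_source_footage.mp4")

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Every frame is resized to the fixed 224x224 input the network expects.
        blob = cv2.dnn.blobFromImage(frame, scalefactor=1.0, size=(224, 224))
        net.setInput(blob)
        detections = net.forward()  # SSD output shape: [1, 1, N, 7]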

The Haar Cascaded algorithm gives false detections by misidentifying the stone and the tripod as humans, as shown in Fig. 9. Meanwhile, the HOG+SVM algorithm does not make the same mistakes, as illustrated in Fig. 10. However, two other issues are observed. First, the bounding box is not tight around the human objects, which may lead to inaccurate tracking performance in later steps. Second, in the middle of the frame, objects that are very close to each other are merged into a single detection in some frames, although separate boxes are created for each person in later frames.

Fig. 11. SSD-GoogleNet.

Fig. 12. L-CNN.

As discussed earlier, the accuracy of the HOG+SVM algorithm can be improved by fine-tuning different parameters, but it is still unable to accommodate changes in the background and the scene. Figures 11 and 12 verify the high accuracy achieved by the CNNs. While it is no surprise that GoogleNet detects almost all human objects with a false positive rate of 5.3%, the performance of the proposed L-CNN is very decent. Compared to SSD GoogleNet, the proposed L-CNN takes about a third of the memory while its detection accuracy does not drop, and its speed in FPS is more than four times that of SSD GoogleNet.
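Continuing the OpenCV sketch above, the raw SSD output can be thresholded and rendered as bounding boxes roughly as follows; the 0.5 confidence threshold is an illustrative choice, not the tuned value used in the experiments.

    # Sketch: filter SSD detections by confidence and draw bounding boxes.
    # Reuses frame, detections, and cv2 from the sketch above. Each row of
    # detections[0, 0] is [image_id, class_id, conf, x1, y1, x2, y2], with
    # corner coordinates normalized to [0, 1].
    h, w = frame.shape[:2]
    for det in detections[0, 0]:
        confidence = float(det[2])
        if confidence < 0.5:  # illustrative threshold
            continue
        x1, y1, x2, y2 = (det[3:7] * [w, h, w, h]).astype(int)
        cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)),
                      (0, 255, 0), 2)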

Figure 13 highlights the results of the L-CNN algorithm in processing video frames in which human objects are captured from various angles and distances. These are challenging scenarios in which a detection algorithm must decide whether or not the objects are human beings. Not only do the features vary with angle and distance, but the human body is also sometimes only partially visible or posed differently. For example, in the top-right subfigure, the legs of the worker standing in the middle overlap, and the second person has only the head and part of the left arm captured. In the bottom-left subfigure, both legs of the pedestrian are invisible. Many algorithms either cannot identify such an object as a human body or incur a very high false positive rate, as exemplified earlier in Fig. 5.


Fig. 13. L-CNN: a human object captured from various angles and distances.

The experimental results are very encouraging: the L-CNN is capable of handling such difficult situations on resource-constrained edge devices.

Table I compares different CNN architectures with the proposed L-CNN algorithm, including several well-known architectures such as VGG, GoogleNet, and the simplified lightweight MobileNet. Note that SSD networks, because of their modified architecture and their need for training images annotated with object locations, are dissimilar to classification CNNs; the SSD CNNs in this table are therefore architectures trained using SSD. This test is performed on a desktop machine without a graphics card, with an Intel(R) Core(TM) i7 (3.40 GHz) and 16 GB of RAM. The result matches our intuition very well: many heavy algorithms are not good choices for an edge device, as they require up to 20 times more memory.

Algorithms       Memory (MB)
VGG              2459.8
SSD-GoogleNet    320.4
SqueezeNet       145.3
SSD-MobileNet    172.2
SSD-L-CNN        139.5

TABLE I. MEMORY UTILITY OF CNNS.

VII. CONCLUSIONS

To deliver proactive urban surveillance and human behavior recognition and prediction as edge network services, timely and accurate human object detection at the edge is the essential first step. While there are many algorithms for human detection, they are not suitable for the edge computing environment. In this paper, leveraging depthwise separable convolution, a lightweight CNN architecture is introduced for human object detection at the edge. The model was trained using parts of the ImageNet and VOC07 datasets, which contain the coordinates of the objects of interest. The Caffe framework was used for training on a server, and OpenCV libraries were later used for the implementation on the edge device.
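To make the parameter saving of depthwise separable convolution concrete, the back-of-the-envelope comparison below contrasts one standard convolution layer with its depthwise separable counterpart; the layer shape is an arbitrary example, not a specific L-CNN layer.

    # Parameter count of one 3x3 convolution, standard vs. depthwise
    # separable (biases ignored); the layer shape is an arbitrary example.
    k, c_in, c_out = 3, 128, 256

    standard = k * k * c_in * c_out          # 294,912 weights
    separable = k * k * c_in + c_in * c_out  # 1,152 + 32,768 = 33,920 weights

    print(round(standard / separable, 1))    # ~8.7x fewer parameters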

This paper has studied the advantages and constraints of two widely used human object detection algorithms, namely the Haar Cascaded object detector and the HOG+SVM human detector, in the context of edge computing. Along with GoogLeNet, they were implemented on a Raspberry PI as an edge device for a comparison study. The experimental results have verified that the proposed L-CNN algorithm meets the design goals. The L-CNN achieves satisfactory FPS (maximum 2.03, average 1.79) and high accuracy (a false positive rate of 6.6%); it uses half as many parameters as GoogleNet and occupies 2.6 times less memory than SSD GoogleNet. Furthermore, the experimental results verified that the L-CNN algorithm can handle more complex situations, where surveillance videos are captured from different angles and distances or the human body in the frame is incomplete.

With the capability of immediate human object identification, our ongoing efforts include the following tasks: (1) lightweight object tracking, (2) human behavior recognition, (3) suspicious activity prediction and early alarm, and (4) video clip marking for batch replay. Our ultimate goal is a proactive surveillance system that enables a safer and more secure community by identifying suspicious activities and raising alerts before damage is caused.
