
ADAPTIVE VIDEO SUBSAMPLING FOR ENERGY-EFFICIENT OBJECT DETECTION

Divya Mohan 2,3 *, Sameeksha Katoch 1,2 *, Suren Jayasuriya 1,2, Pavan Turaga 1,2, Andreas Spanias 1,2

1 School of Electrical, Computer and Energy Engineering, Arizona State University
2 Sensor Signal and Information Processing (SenSIP) Center

3 Department of Electrical Engineering and Computer Sciences, University of California, Berkeley

ABSTRACT

Energy-efficient computer vision is vitally important for embedded and mobile platforms, where a longer battery life allows increased deployment in the field. In image sensors, one of the primary causes of energy expenditure is the sampling and digitization process. Smart, task-specific subsampling of the image array can result in significant energy savings. We present an adaptive algorithm for video subsampling, aimed at enabling accurate object detection while saving sampling energy. The approach utilizes objectness measures, which we show can be accurately estimated even from subsampled frames, and then uses that information to determine the adaptive sampling for the subsequent frame. In experiments, we show energy savings of 18-67% with only a slight degradation in object detection accuracy. These results motivated us to further explore energy-efficient subsampling using advanced techniques such as reinforcement learning and Kalman filtering. Experiments using these techniques are underway and provide ample support for adaptive subsampling as a promising avenue for embedded computer vision.

Index Terms— Objectness, Image and Video Subsampling, Energy-efficient computer vision

1. INTRODUCTION

A critical performance requirement for embedded computer vision is energy efficiency, in order to preserve battery life for mobile and autonomous platforms. In particular, the image sensor and readout can consume a significant share of the energy in a computer vision pipeline, particularly if the sensor is capturing and processing video data in real time. For instance, Google Glass performing continuous face detection drains its battery in 45 minutes, with image sensing taking up 50% of the energy budget [1].

The primary mechanism by which image sensors can save energy is to limit their readout to portions of the array known as regions-of-interest (ROIs). This is a form of spatial subsampling, and can be achieved using windowing,

*These authors contributed equally. This work is funded in part by the NSF CPS Program #1646542 and the SenSIP Center.

column/row skipping, or binning in the image sensor [2]. Such a content-driven approach can result in significant energy savings. However, this comes at the cost of a potential loss of visual detail for objects, detail that may be necessary for end-task performance. This will certainly be the case if the subsampling approach is agnostic to the semantic information in the frames. Thus, there is an opportunity to design smart sampling approaches that determine the sampling pattern based on scene content, saving energy while preserving computer vision task performance.

In this work-in-progress, we investigate the effectiveness of video subsampling for the task of video object detection. Object detection in videos is a critical application with implications for self-driving cars, surveillance, and autonomous robotics. To enable energy-efficient video object detection, we propose an adaptive algorithm to subsample video frames that uses a metric for objectness [3, 4] and intensity-based segmentation. The algorithm utilizes semantic information from a previous key frame in the video to determine subsampling patterns for future frames. We show that this adaptive algorithm achieves better trade-offs between energy savings and object detection accuracy than the naive subsampling methods of uniform and random sampling.

2. RELATED WORK

Several algorithms for adaptive spatial subsampling have been proposed in the context of image and video compression [5, 6, 7]. However, we are concerned with saving energy while performing object detection on real-time video, and thus cannot rely on video compression algorithms, which require the full video to be captured first.

LiKamWa et al. investigated the energy savings of traditional image sensors for windowing, binning, and column/row skipping [2], and Buckler et al. showed that one can subsample by 4× from a high-resolution image in an ISP pipeline while losing only 1% of accuracy in image classification on the CIFAR-10 dataset [8]. Buckler et al. also propose a sensor mechanism to realize image subsampling by turning off pixel ADCs during rolling-shutter readout [8]. Guo et al. investigated subsampling within a single image for deep neural network classification [9]. They developed a technique


called AdaSkip, which changes the row resolution of the image sensor, and showed 15× energy savings with almost no loss in accuracy.

Another technique to save energy is compressive sensing, which takes coded measurements of the scene at a sub-Nyquist rate and then reconstructs the signal [10]. Cameras such as the single-pixel camera [11] and video compressive sensing systems [12] save energy through a reduced number of pixel measurements. However, compressive sensing relies on computationally heavy post-processing to recover the image in software. Reconstruction-free inference approaches have also been proposed; these usually do better than reconstruct-then-infer, but still incur significant drops in end performance [13] and are computationally intensive.

Our work differs from all these previous works by developing an adaptive algorithm that predicts subsampling patterns for future frames in the video based on previous key frame(s). The algorithm is designed to run in real time on incoming streaming frames.

3. METHOD

In this section, we present our adaptive subsampling algorithm for video object detection. We require that the algorithm operate at run time, determining future subsampling patterns based only on prior frames (i.e., a causal system), so that it can work on incoming video frames. The algorithm is conceptually simple by design, since we wanted to minimize the computational overhead required for adaptive sampling. Nevertheless, we show that this method suffers only a slight degradation in object detection performance while saving energy. The whole algorithm is summarized in Figure 2.

One constraint we placed on our method is that it must work on embedded platforms with limited resources. This includes the assumption that there is no GPU on the platform, and thus no way to retrain the object detection neural network to adapt to the subsampling pattern. Doing so could yield even better object detection accuracy for the same energy savings, but how to retrain the network for different subsampling schemes is left as future work. The advantage of our method is that it is immediately deployable to existing systems such as UAVs and robotic platforms, and requires no training or GPUs on board.

Objectness as semantic information: The first key question we consider is how to extract semantic information from previous frame(s). While several techniques could provide generic visual features, including CNN features [14], we utilize an algorithm that trains a measure of objectness for a given image [3, 4]. This makes our algorithm highly tuned for object detection, and does not require an additional neural network to be stored on the embedded device to extract visual features. The objectness algorithm quantifies how likely it is for an image window to cover an object of any

class. It does so by considering four image cues: multi-scale saliency, color contrast, edge density, and superpixel straddling [3, 4]. Combining scores over different image windows, the algorithm produces an objectness map, as shown in Figure 1. The figure illustrates that the objectness map can still identify the primary objects even when operating on different types of subsampled imagery. We utilize these objectness maps to help determine our spatial subsampling in the video.
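As an illustration of how per-window objectness scores can be combined into a per-pixel map, the following sketch accumulates window scores into a normalized heatmap. The `objectness_map` helper, its box format, and the uniform accumulation are our assumptions for illustration, not the exact procedure of [3, 4]:

```python
import numpy as np

def objectness_map(shape, windows, scores):
    """Accumulate per-window objectness scores into a per-pixel map.

    `windows` are (x0, y0, x1, y1) boxes and `scores` their objectness
    values, e.g. as produced by the measure of Alexe et al. [3, 4].
    """
    m = np.zeros(shape, dtype=np.float64)
    for (x0, y0, x1, y1), s in zip(windows, scores):
        m[y0:y1, x0:x1] += s          # every pixel covered by a window gains its score
    if m.max() > 0:
        m /= m.max()                  # normalize to [0, 1] for thresholding later
    return m
```

Pixels covered by many high-scoring windows end up near 1, giving the gray-scale map that the thresholding step below operates on.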

Fig. 1: The first row shows the original image and its resulting objectness map. The next three rows show the same process for three different forms of image subsampling on the original image: Random Pixelation, Checkerboard Mask, and Adaptive Video Sampling.

Adaptive Subsampling Algorithm: We now present our adaptive algorithm, which couples the objectness map with intensity changes in the video to determine a spatial sampling pattern. Let I(x, y, t) represent a video, where (x, y) denotes pixel location and t the frame index in time. Let N_1 and N_2 denote the number of rows and columns in a frame, respectively; consequently, the number of pixels in a gray-scale frame is P = N_1 N_2. Let M_i, for 1 ≤ i ≤ T, denote the objectness maps described above, where T is the total number of frames in the video.

The algorithm begins by considering the importance map of the first frame (the reference frame), M_1. We calculate a histogram of the importance map and, based on this histogram and an empirically chosen threshold, convert the gray-scale objectness map to a binary mask B_1. This threshold


is called the objectness threshold, and is determined either by empirically chosen values or by Otsu's method [15].
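When Otsu's method is used, the objectness threshold can be computed directly from the histogram of the map. A minimal NumPy sketch of Otsu's method [15], assuming map values normalized to [0, 1]:

```python
import numpy as np

def otsu_threshold(gray, nbins=256):
    """Otsu's method [15]: pick the threshold that maximizes the
    between-class variance of the histogram. `gray` is assumed in [0, 1]."""
    hist, edges = np.histogram(gray, bins=nbins, range=(0.0, 1.0))
    p = hist.astype(np.float64) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)                 # probability of the "background" class
    mu = np.cumsum(p * centers)       # cumulative mean
    mu_t = mu[-1]                     # global mean
    w1 = 1.0 - w0
    valid = (w0 > 0) & (w1 > 0)
    # Between-class variance for every candidate threshold
    sigma_b = np.zeros(nbins)
    sigma_b[valid] = (mu_t * w0[valid] - mu[valid]) ** 2 / (w0[valid] * w1[valid])
    return centers[np.argmax(sigma_b)]
```

Applying `objectness >= otsu_threshold(objectness)` then yields the binary mask B_1.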

The object blobs in the binary mask are labelled based on their neighboring pixel connections. Once these object blobs in the binary mask are identified, we compute the area of these blobs and select only objects with area greater than a threshold of 2000 pixels to obtain an updated binary mask B_1^u. This binary mask is used for our subsampling of the next consecutive frame.
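The mask-refinement step (binarization followed by removal of small blobs) can be sketched as follows. The 4-connected flood fill stands in for whatever connected-component labelling an implementation would use, and the function name and default parameters are illustrative:

```python
import numpy as np
from collections import deque

def refine_mask(objectness, obj_thresh=0.15, min_area=2000):
    """Binarize an objectness map and keep only blobs with area >= min_area."""
    binary = objectness >= obj_thresh
    refined = np.zeros_like(binary)
    seen = np.zeros_like(binary)
    h, w = binary.shape
    for sy in range(h):
        for sx in range(w):
            if not binary[sy, sx] or seen[sy, sx]:
                continue
            # 4-connected flood fill collecting one blob
            blob, q = [], deque([(sy, sx)])
            seen[sy, sx] = True
            while q:
                y, x = q.popleft()
                blob.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        q.append((ny, nx))
            if len(blob) >= min_area:        # drop blobs below the area threshold
                for y, x in blob:
                    refined[y, x] = True
    return refined
```

The returned mask plays the role of B_1^u: True pixels stay on in subsequent frames, False pixels are turned off.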

In other words, the updated binary image is the final mask which is used to turn off pixels in the reference frame. We utilize this mask to subsample the consecutive frames in the video. However, the underlying assumption is that the objects in the scene do not move significantly, so that the mask remains relevant for subsampling. To check the continued validity of this assumption, we calculate the absolute mean intensity difference between the reference frame and the current subsampled frame, as shown in (1):

$$\left| \sum_{(x,y)} I(x, y, t+j) \; - \; \sum_{(x,y)} I(x, y, t+k) \right| \leq \tau, \qquad (1)$$

where I(x, y, t+j) represents the reference frame and I(x, y, t+k) the current frame. Note that the choice of the frame intensity threshold τ is critical for determining whether to update the reference frame, and hence whether the binary mask overlaps only partially with objects in the reference image. A smaller threshold means less energy savings, as more reference frames need to be fully sampled, but the resulting subsampling will more accurately track object motion.
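A minimal sketch of the reference-update test in (1). Scaling τ by the pixel count, so that it acts on the mean rather than the raw sum, is our assumption, consistent with the "mean intensity difference" wording; frames are assumed normalized to [0, 1]:

```python
import numpy as np

def needs_new_reference(ref_frame, cur_frame, tau=0.1):
    """Decide whether to capture a new fully sampled reference frame.

    Implements the check of Eq. (1): compare total frame intensities
    against the frame intensity threshold tau (scaled by pixel count).
    """
    diff = abs(ref_frame.sum() - cur_frame.sum())
    return diff > tau * ref_frame.size
```

When the function returns True, the next frame is fully sampled and a fresh objectness mask is computed from it.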

To further validate this object-motion assumption, we use another constraint obtained from the optical flow between two frames. We use Lucas-Kanade optical flow [16]; if the mean magnitude of the optical flow is less than a fixed threshold φ, we reuse the same binary subsampling mask as for the previous frame. If the constraint is not satisfied, we capture a new reference frame. In Section 4, we compare using the frame intensity threshold versus the optical flow magnitude threshold.
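The flow-based test can be sketched with a single-window least-squares solve. This global (u, v) fit is a deliberate simplification of the pyramidal Lucas-Kanade tracker [16], used here only to illustrate the decision logic:

```python
import numpy as np

def mean_flow_magnitude(prev, cur):
    """Estimate a global optical-flow magnitude via one Lucas-Kanade-style
    least-squares solve over the whole frame (a simplified stand-in for
    the pyramidal tracker of [16])."""
    ix = np.gradient(prev, axis=1)       # spatial gradients
    iy = np.gradient(prev, axis=0)
    it = cur - prev                      # temporal gradient
    # Normal equations of sum((ix*u + iy*v + it)^2)
    a = np.array([[(ix * ix).sum(), (ix * iy).sum()],
                  [(ix * iy).sum(), (iy * iy).sum()]])
    b = -np.array([(ix * it).sum(), (iy * it).sum()])
    u, v = np.linalg.lstsq(a, b, rcond=None)[0]
    return float(np.hypot(u, v))

def reuse_mask(prev, cur, phi=0.005):
    """Reuse the previous subsampling mask only if apparent motion is small."""
    return mean_flow_magnitude(prev, cur) < phi
```

A full implementation would instead track feature points with pyramidal Lucas-Kanade and threshold the mean magnitude of the point tracks.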

4. RESULTS

Dataset: For the video subsampling algorithm, we use the ILSVRC2015 ImageNet VID dataset [17], which has 555 video snippets covering 30 classes. For our experiments, we consider videos from 6 classes: Bird, Watercraft, Car, Dog, Horse, and Train. We performed object detection using an implementation of Faster R-CNN, an object detection algorithm [18]. The accepted metric for object detection, mean Average Precision (mAP), per class, is obtained from the bounding boxes on the video frames.

We compare four types of subsampling: (1) random subsampling, where each pixel has a probability α of being turned

Fig. 2: Flowchart explaining the adaptive video subsampling algorithm.

off; (2) our adaptive sampling algorithm using Otsu's method for the objectness threshold and values of 0.1, 0.3, 0.5 for the frame intensity threshold; (3) the adaptive subsampling algorithm with Otsu's method for the objectness threshold and an optical flow magnitude threshold with values 0.0015, 0.005, 0.015; and (4) adaptive subsampling with the tuned parameters of 0.15 for the objectness threshold and 0.1 for the frame intensity threshold. These parameters were tuned on one separate video from the dataset, not included in the test set we consider.

Energy modeling: For our energy model, we assume that the proportion of pixels turned off is proportional to the savings in readout energy [2]. As described in Section 3, τ (the frame intensity threshold) is one of the most important parameters for controlling the energy savings while keeping object detection accuracy at almost the same level. If the constraint is too strong (i.e., τ is very low), the subsampling pattern is recomputed for nearly every consecutive frame, resulting in high computation time and making the algorithm inefficient for use in camera sensors. However, if τ is very large, the subsampling strategy may neglect changes due to object motion. The choice of φ (the flow magnitude threshold) can be justified similarly.
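Under this proportional model, the per-video savings reported in Table 2 amount to counting turned-off pixels over the whole video. A small sketch (function name and mask convention are ours):

```python
import numpy as np

def readout_energy_savings(masks):
    """Estimate readout-energy savings as the percentage of pixels turned
    off across a video, following the proportional model of [2].
    `masks` is a sequence of boolean arrays where True = pixel sampled."""
    total = sum(m.size for m in masks)
    sampled = sum(int(m.sum()) for m in masks)
    return 100.0 * (total - sampled) / total   # percent of pixels off
```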

Qualitative Results: In Fig. 3, we show some visuals of detected objects from the adaptive subsampling strategy. For


| Subsampling strategy | Threshold | mAP |
|---|---|---|
| Fully sampled | – | 55.5 |
| Random subsampling (α) | 0.15 | 15.4 |
| Random subsampling (α) | 0.25 | 5.9 |
| Random subsampling (α) | 0.35 | 0.9 |
| Adaptive (Otsu objectness + frame intensity τ) | τ = 0.1 | 40.1 |
| Adaptive (Otsu objectness + frame intensity τ) | τ = 0.3 | 37 |
| Adaptive (Otsu objectness + frame intensity τ) | τ = 0.5 | 38 |
| Adaptive (flow magnitude φ) | φ = 1.5 × 10^-3 | 41.8 |
| Adaptive (flow magnitude φ) | φ = 5.0 × 10^-3 | 41.7 |
| Adaptive (flow magnitude φ) | φ = 15.0 × 10^-3 | 28.6 |
| Adaptive (objectness 0.15 + frame intensity 0.1) | 0.15 + 0.1 | 50.1 |

Table 1: mAP scores for different subsampling strategies.

| Class | Random α = 0.15 | α = 0.25 | α = 0.35 | Otsu + τ = 0.1 | τ = 0.3 | τ = 0.5 | φ = 1.5 × 10^-3 | φ = 5.0 × 10^-3 | φ = 15.0 × 10^-3 | 0.15 + 0.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Bird | 14.16 | 22.75 | 30.80 | 87.43 | 86.64 | 86.43 | 92.23 | 92.24 | 92.22 | 54.04 |
| Watercraft | 13.26 | 22.32 | 31.62 | 79.80 | 79.91 | 80.09 | 83.15 | 83.01 | 88.30 | 50.04 |
| Dog | 17.71 | 29.39 | 40.59 | 11.83 | 11.87 | 11.86 | 68.76 | 68.76 | 68.76 | 18.44 |
| Car | 18.13 | 30.66 | 42.86 | 30.42 | 30.10 | 30.18 | 90.44 | 90.72 | 88.94 | 67.87 |
| Horse | 21.21 | 34.96 | 48.01 | 25.82 | 26.26 | 26.46 | 75.65 | 75.65 | 75.90 | 38.85 |
| Train | 22.24 | 29.60 | 37.41 | 21.05 | 21.02 | 21.07 | 71.19 | 71.19 | 71.19 | 55.97 |

Table 2: Energy efficiency, in terms of the percentage of turned-off pixels in a video, for different subsampling strategies: random subsampling (α); adaptive subsampling (Otsu's objectness threshold + frame intensity threshold τ); adaptive subsampling (flow magnitude threshold φ); and adaptive subsampling (objectness threshold 0.15 + frame intensity threshold 0.1).

the shown result, a frame is chosen from the Car video, and the bounding box generated on each subsampled frame is shown. It is evident that even after turning off a large number of pixels, Faster R-CNN is able to detect the object in most cases. These benefits, coupled with the energy savings, make this a decent approach for video subsampling.

Fig. 3: The first column shows the object detection for three different frame intensity thresholds (τ = 0.1, 0.3 and 0.5). The second column shows the same process for three different optical flow magnitude thresholds (φ = (1.5, 5.0, 15.0) × 10^-3).

Quantitative Results: To test whether the proposed subsampling strategy achieves the desired energy savings along with computer vision task accuracy, we report mean Average Precision (mAP) scores for fully sampled, randomly subsampled, and adaptively subsampled videos in Table 1. It is evident that random subsampling results in the worst mAP scores compared to the adaptive subsampling strategies. As mentioned in Section 3, the adaptive subsampling strategy uses a binary mask to obtain the subsampled frames; this binary mask is obtained using an objectness threshold chosen either empirically or by Otsu's method. As shown in Table 1, the empirical objectness threshold resulted in a better mAP score than Otsu's objectness threshold. Between the two reference-update criteria, i.e., optical flow magnitude and frame intensity, the frame intensity threshold performed slightly better: with an empirically chosen objectness threshold it gives a mAP score of 50.1%, closest to the fully sampled video's mAP score of 55.5%.

In Table 2, we show the percentage of pixels turned off for each subsampling strategy. Note that the strategy that received the best mAP score (adaptive subsampling with objectness threshold and frame intensity threshold) saves 18-67% of energy.

5. DISCUSSION AND FUTURE WORK

The proposed image subsampling patterns hold promise for embedded and mobile platforms performing computer vision tasks. One advantage of our method is that it is conceptually simple and requires very few resources to implement on an embedded vision platform (as opposed to neural network or machine learning approaches). Further, the method requires no training data and no addition or modification of hardware. Finally, the method is agnostic to the type of feature detection, and could be modified to use CNN features or


other appropriate features suited for a particular application domain.

However, there are several avenues for future improvement. The current adaptive sampling algorithm uses a style of key-framing to determine the ROI for sampling, and keeps that ROI for subsequent frames. A promising improvement is to warp the ROI to follow the motion of the object(s), using either Kalman filtering or optical flow warping. Further, the method needs to be tested on multi-object detection/tracking, as well as tracking in bad weather or poor visibility conditions such as nighttime, fog, snow, and rain. It also needs to be compared against reconstruction-free object tracking algorithms for compressive sensing [19], although this will require a single-pixel camera to save energy. Finally, determining future sampling patterns while tracking objects could utilize reinforcement learning techniques, letting the image sensor adapt to the input data in an energy-efficient manner. All these proposed improvements could yield even higher energy savings while maintaining high object detection and tracking accuracy.

6. REFERENCES

[1] R. LiKamWa, Z. Wang, A. Carroll, F. X. Lin, and L. Zhong, "Draining our glass: An energy and heat characterization of Google Glass," in Proceedings of the 5th Asia-Pacific Workshop on Systems. ACM, 2014, p. 10.

[2] R. LiKamWa, B. Priyantha, M. Philipose, L. Zhong, and P. Bahl, "Energy characterization and optimization of image sensing toward continuous mobile vision," in Proceedings of the 11th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 2013, pp. 69–82.

[3] B. Alexe, T. Deselaers, and V. Ferrari, "What is an object?" in 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010, pp. 73–80.

[4] B. Alexe et al., "Measuring the objectness of image windows," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2189–2202, 2012.

[5] W. Lin and L. Dong, "Adaptive downsampling to improve image compression at low bit rates," IEEE Transactions on Image Processing, vol. 15, no. 9, pp. 2513–2521, 2006.

[6] J. Dong and Y. Ye, "Adaptive downsampling for high-definition video coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 3, pp. 480–488, 2014.

[7] R. A. Belfor, M. P. Hesp, R. L. Lagendijk, and J. Biemond, "Spatially adaptive subsampling of image sequences," IEEE Transactions on Image Processing, vol. 3, no. 5, pp. 492–500, 1994.

[8] M. Buckler, S. Jayasuriya, and A. Sampson, "Reconfiguring the imaging pipeline for computer vision," in 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 975–984.

[9] J. Guo, H. Gu, and M. Potkonjak, "Efficient image sensor subsampling for DNN-based image classification," in Proceedings of the International Symposium on Low Power Electronics and Design. ACM, 2018, p. 40.

[10] E. J. Candes, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006.

[11] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk, "Single-pixel imaging via compressive sampling," IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 83–91, 2008.

[12] R. G. Baraniuk, T. Goldstein, A. C. Sankaranarayanan, C. Studer, A. Veeraraghavan, and M. B. Wakin, "Compressive video sensing: Algorithms, architectures, and applications," IEEE Signal Processing Magazine, vol. 34, no. 1, pp. 52–66, 2017.

[13] K. Kulkarni and P. Turaga, "Reconstruction-free action inference from compressive imagers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 4, pp. 772–784, 2016.

[14] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 806–813.

[15] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.

[16] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," 1981.

[17] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2009, pp. 248–255.

[18] J. Yang, J. Lu, D. Batra, and D. Parikh, "A faster PyTorch implementation of Faster R-CNN," https://github.com/jwyang/faster-rcnn.pytorch, 2017.

[19] H. Braun, P. Turaga, and A. Spanias, "Direct tracking from compressive imagers: A proof of concept,"


in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 8139–8142.