
SyNERGY: An energy measurement and prediction framework for Convolutional Neural Networks on Jetson TX1

Crefeda Faviola Rodrigues, Graham Riley, and Mikel Luján
School of Computer Science, The University of Manchester, Manchester, United Kingdom

Abstract— There is a huge demand for on-device execution of deep learning algorithms on mobile and embedded platforms. These devices present constraints on the application due to limited hardware resources and power. However, current evaluation studies in existing deep learning frameworks (for example, Caffe, Tensorflow, Torch and others) are limited to performance measurements of these applications on high-end CPUs and GPUs. In this work, we propose "SyNERGY", a fine-grained energy measurement (that is, at specific layers) and prediction framework for deep neural networks on embedded platforms. We integrate ARM's Streamline Performance Analyser with standard deep learning frameworks such as Caffe and CuDNNv5 to quantify the energy-use of deep convolutional neural networks on the Nvidia Jetson Tegra X1. Our measurement framework provides an accurate breakdown of actual energy consumption and performance across all layers in the neural network, while our prediction framework models the energy-use in terms of target-specific performance counters, such as SIMD instructions and bus accesses, and application-specific parameters, such as Multiply-and-Accumulate (MAC) counts. Our experimental results using 9 representative deep Convolutional Neural Networks show that a multi-variable linear regression model based on hardware performance counters alone achieves an average prediction test error of 8.04 ± 5.96% compared to actual energy measurements. Surprisingly, we find that it is possible to refine the model to predict the number of SIMD instructions and main memory accesses solely from the application's Multiply-Accumulate (MAC) counts, with average prediction test errors of 0.81 ± 0.77% and 17.97 ± 15.29% respectively. This alleviates the need for actual measurements, giving a final average prediction test error of 7.08 ± 5.05% using solely the application's MAC counts as input.

Keywords: Energy Measurement, Energy prediction, Convolutional Neural Networks, mobile computing

1. Introduction

The aim of contemporary and future computing systems is to deliver higher performance at lower power budgets [1]. This includes embedded systems or "edge-devices" that add further limits to energy-usage due to limited battery life. Measurement of power consumption is understudied, especially when designing software for deep learning algorithms.

These algorithms have become core components of applications such as key-word spotting, facial recognition, language translation and others [2]. Specifically, Convolutional Neural Networks (hereafter referred to as ConvNets) have achieved state-of-the-art results in various vision and natural language processing domains [2]. To enable such applications on embedded devices, there are numerous efforts to optimize these algorithms in terms of computation and memory resources. However, there are very few studies that quantify the actual energy-use of deep learning workloads [3], [4], [5]. These optimization efforts are spread across various levels. At the algorithmic level, newer compact neural network designs [6], [7], compression and pruning techniques [8], [9] or reduced precision [2] are proposed to save memory and increase throughput. At the software level, device-specific software implementations such as TensorRT, the ARM Compute Library, Qualcomm's Snapdragon Neural Processing Engine (NPE) [1] and others aim to accelerate deep learning inference, or deployment, on existing mobile platforms. These libraries are complementary to existing deep learning frameworks such as Caffe2 [10], Tensorflow, Torch and others, in which deep learning models must first be designed and trained. At the hardware level, application-specific hardware has emerged, such as specialised GPUs (e.g. Jetson Tegra TX1 and TX2), FPGAs and ASICs [1].

We consolidate our observations from these works and attribute the lack of adoption of energy-use as an evaluation criterion to the following reasons:

• Lack of energy-measurement support in existing deep learning frameworks: currently, popular frameworks such as Caffe, Torch, Tensorflow and others provide designers with the tools to benchmark their application's performance through timing measurements. There is no support for energy measurements, as these are challenging to obtain consistently across platforms. For example, power measurement facilities can vary from system to system. This includes different types of power meters [5] and power sensors [3].

• Accuracy as a key metric to evaluate models: the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [11] has been a major test-bed for the development of innovative ConvNet models. However, a given published accuracy is often achieved by averaging the accuracy from an ensemble of models that are executed on desktop or server systems [5]. This implies that more computational resources are used to achieve the desired accuracy. However, on embedded platforms or specialised hardware, resource budgets are a major concern, leading to pruned versions of existing models or smaller models being chosen for deployment [7].

In this paper, we propose an energy measurement and prediction framework called SyNERGY, built by integrating Caffe, a deep learning framework, with vendor-specific tools such as ARM Streamline Performance Analyzer [12]. Figure 1 shows the set-up for measuring the energy and performance (or execution time) of ConvNet models. Our methodology focuses on power measurements made using the on-board power monitoring sensor (TI-INA3221x, available on the Jetson TX1) for single-image inferences on the CPU using an optimised OpenBLAS library. Our contributions include the following:

• To the best of our knowledge, this is the first energy prediction framework that models the energy-use per layer of a neural network against actual measurements of Jetson TX1 hardware performance counters, such as SIMD instructions executed and bus accesses (or main memory accesses), with an average prediction error rate of 8.04 ± 5.96% over actual energy measurements.

• Based on the experimental data gathered, SyNERGY further develops a regression model to predict the energy-use of the convolutional layers solely from the application's MAC count, with an average prediction test error of 7.08 ± 5.05% over actual energy measurements. This is achieved by developing models to predict the individual performance counters, SIMD and bus accesses, with average prediction test error rates of 0.81 ± 0.77% and 17.97 ± 15.29% respectively.

• Our energy and performance measurement framework provides a detailed breakdown of the energy consumed in every layer of the whole neural network model, thereby enabling us to identify the energy bottlenecks in the model.

The organisation of the paper is as follows. Section 2 gives a general background on convolutional neural networks. Section 3 provides details of the energy measurement framework and the per-layer energy breakdown of models. Section 4 covers the prediction framework and the experimental results. Section 5 details the related work in energy measurement and analysis. Finally, Section 6 concludes and highlights possible future directions.

2. Primer on Convolutional Neural Networks

To provide computers with the ability to perform intelligent tasks such as understanding images and audio, learning, and others, the field of Machine Learning focuses on developing mathematical models or algorithms that acquire knowledge by extracting information from raw data.

Fig. 1: Overview of the measurement framework

Fig. 2: Standard Convolution in ConvNets

A Convolutional Neural Network (or ConvNet) extracts information, or features such as edges and colour blobs, from images through a process called feature extraction, and uses this information to provide a classification output (or a decision). It is composed of layers that transform the raw input data into a meaningful probabilistic output. Figure 2 shows the data dimensions involved in a convolution (or conv) operation. Other typical layers found are pooling (pool), batch norm (norm), Rectified Linear Unit (ReLU) and fully-connected (fc), which lend the model certain properties1. Each layer in a model has an associated amount of computation, known as its computational complexity, and an associated storage cost or bandwidth, and each of these has implications in terms of energy-use [2].

The computational complexity of a conv layer is estimated from its number of Multiply-Accumulates (MACs), and its storage cost is the sum of the sizes of the input feature map (Ix × Iy × Iz) to the layer, the corresponding filter weights (Kx × Ky × Iz × Oz) and biases, and the output feature map (Ox × Oy × Oz). Multiple input images can be grouped into batches; the batch size is denoted by N. Here, x, y and z represent the Cartesian axes. The computational complexity of a standard convolution operation is given by:

Ox × Oy × Oz × Kx × Ky × Iz    (1)

1This has not been explained for the sake of simplicity and the reader is advised to refer to [2] for more information.


Table 1: ConvNet models in the literature

ConvNet | Naming convention in graphs | Top-5 accuracy (%) | Dataset | # Layers | Parameters | Model Size
AlexNet | alexNet [11] | 80.3 | ImageNet | 5 conv + 3 fc | 62 M | 244 MB
GoogleNet | googleNet [6] | 90.85 | ImageNet | 57 conv + 1 fc | 6.9 M | 54 MB
Residual Net | resNet50 [13] | 93.29 | ImageNet | 53 conv + 1 fc | 25 M | 103 MB
SqueezeNet | squeezeNet [7] | 80.3 | ImageNet | 26 conv | 1.2 M | 5 MB
SqueezeNet with Residual Connections | squeezeNetRes [7] | 82.5 | ImageNet | 26 conv | 1.2 M | 6.3 MB
VGG | vgg-small [14] | 86.9 | ImageNet | 5 conv + 3 fc | 102 M | 393 MB
Places-CDNS-8s | Places-CDNS-8s [15] | 86.8 | ImageNet | 8 conv + 3 fc | 60 M | 241.6 MB
Inception-BN | Inception-BN [16] | 89.0 | ImageNet | 69 conv + 1 fc | 1.4 B | 134.6 MB
ALL-CNN-C | ALL-CNN-C [17] | 90.92 | CIFAR 10 | 9 conv | 1.3 M | 5.5 MB

The storage cost or bandwidth (in counts) is given by:

(Ix × Iy × Iz + Kx × Ky × Iz × Oz + Ox × Oy × Oz)    (2)
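For concreteness, both quantities can be computed directly from a conv layer's shape parameters. Below is a minimal Python sketch; the layer dimensions used in the example are purely illustrative and are not taken from any model in Table 1.

```python
def conv_macs(Ox, Oy, Oz, Kx, Ky, Iz):
    """MAC count of a standard convolution, as in Equation (1)."""
    return Ox * Oy * Oz * Kx * Ky * Iz

def conv_storage(Ix, Iy, Iz, Kx, Ky, Oz, Ox, Oy):
    """Storage cost (in element counts) of a conv layer, as in Equation (2):
    input feature map + filter weights + output feature map."""
    return Ix * Iy * Iz + Kx * Ky * Iz * Oz + Ox * Oy * Oz

# Illustrative dimensions: a 3x3 convolution over a 56x56x64 input
# producing a 56x56x64 output feature map.
macs = conv_macs(Ox=56, Oy=56, Oz=64, Kx=3, Ky=3, Iz=64)
elements = conv_storage(Ix=56, Iy=56, Iz=64, Kx=3, Ky=3, Oz=64, Ox=56, Oy=56)
print(macs, elements)
```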

Table 1 lists the models chosen for this study, where Column 5 gives the counts of the two types of layers, conv and fc, present in each model. Recently, in newer ConvNet models, for example SqueezeNet [7], a global average pooling layer has replaced the traditional fully-connected layers [6]. Column 7 shows the size of the model, which is stored in 32-bit floating-point precision and is therefore four times its number of parameters. The top-5 accuracy of a model measures how often the correct object category of a given image from the ImageNet or CIFAR 10 dataset appears among the model's top-5 predictions, and indicates how well the model performs on the image classification task [11]. In our study, we target the conv layers, as these are known to consume 85-90% of the computational time [1].

3. Energy and performance measurement framework

The embedded system chosen for the power and performance measurements is the Jetson TX1, which has a quad-core ARM Cortex A57/A53 CPU running at 1.9 GHz and a 256-CUDA-core Maxwell GPU running at 1 GHz. The TX1 is cross-compiled with a modified Linux kernel version (3.10.96+) using Jetpack 2.3 and has a host operating system of Ubuntu 16.04. The modification enables the on-board power monitor sensor TI-INA3221x and integrates the gator module required by the ARM Streamline tool. This power monitor provides system-level, CPU-level and GPU-level power. This is the post-regulation power, after AC-to-DC conversion of the wall power and DC-to-DC conversion (pre-regulation power conversion), as required by the system-on-chip (SoC). Power values (mW) are instantaneous and are accessible from the hardware counters on the /sys file system (sysfs). We use the system-level power to the SoC, as this accounts for the power due to the processing cores, DRAM memory and peripherals. The power measurements are gathered with the default interactive Linux governor.

The measurement framework is composed of two components: the hardware component, which includes the target device and the power monitor that provides direct power values, and the software component shown in Figure 1, which implements the methodology to acquire accurate and consistent power and performance measurements.

To carry out an inference, we chose the Caffe framework, a well-known and widely used deep learning framework. Our Caffe (version 1.0.0-rc3) was compiled for the GPU with CUDA (8.0) and CuDNN (5.1.5), and for the CPU with OpenBLAS (libopenblas_cortexa57p-r.0.2.20.dev.a) with a maximum of 4 threads. The application selected was an inference using a ConvNet model on a single RGB image from the ImageNet or CIFAR 10 datasets [11]. The size of the input image is given in the model specification of each ConvNet. All computations are in 32-bit floating point. The pre-trained weight files for the selected ConvNet models, given in Table 1, are available in Caffe's model-zoo repository.
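For reference, a single-image CPU inference of this kind can be driven through Caffe's Python interface (pycaffe). The sketch below is a simplified outline; the file names and the 'data'/'prob' blob names follow the usual model-zoo deploy conventions and are placeholders rather than the exact scripts used in this work.

```python
import numpy as np
import caffe  # pycaffe, built alongside Caffe 1.0.0-rc3

caffe.set_mode_cpu()  # CPU inference, backed here by OpenBLAS

# Placeholder file names; the deploy prototxt and pre-trained weights
# for each ConvNet in Table 1 come from Caffe's model-zoo.
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

# A single preprocessed RGB image, resized to the input shape declared
# in the model specification (227x227 is used by AlexNet, for example).
image = np.random.rand(3, 227, 227).astype(np.float32)  # stand-in for a real image
net.blobs['data'].reshape(1, 3, 227, 227)                # batch size N = 1
net.blobs['data'].data[...] = image

output = net.forward()          # executes every layer of the ConvNet
print(output['prob'].argmax())  # index of the predicted class
```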

To map per-layer functions in Caffe, we use Streamline's annotation library to demarcate the beginning and end of a layer's execution. The annotations are collected by the gator module and fed to the Streamline tool running on a host machine. The sampling frequency was set to 1 kHz. A power sampler script collects the power values, and further processing is required to obtain the energy as per Equation 3.

E_inference = Σ_{i=0}^{T} E_{i+1} = Σ_{i=0}^{T} P_{i+1} × dt    (3)

Here, E_{i+1} is calculated using the (i+1)-th power sample P_{i+1} over the duration dt = t_{i+1} − t_i. The total energy for the inference, E_inference, is the sum of all the rectangular areas over the duration of the inference T. In our study, we report the execution time or performance per image (sec/image) and the energy per image (mJ/image), averaged over 5 separate runs of single-image inferences.
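As a concrete illustration of the power sampler and of Equation 3, the sketch below polls the power rail and integrates the samples with the rectangle rule. The sysfs node shown is an assumption: the exact path of the TI-INA3221x rail depends on the kernel and device-tree configuration.

```python
import time

# Assumed sysfs node exposing the instantaneous system (VDD_IN) power in mW;
# the actual path on a given TX1 kernel configuration may differ.
POWER_NODE = '/sys/bus/i2c/devices/1-0040/iio_device/in_power0_input'

def sample_power(duration_s, period_s=0.001):
    """Poll the power rail at roughly 1 kHz, returning (timestamp, mW) pairs."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        with open(POWER_NODE) as f:
            samples.append((time.time(), float(f.read())))
        time.sleep(period_s)
    return samples

def energy_mj(samples):
    """Equation 3: sum of P_{i+1} * dt over consecutive samples (mW * s = mJ)."""
    total = 0.0
    for (t_i, _), (t_next, p_next) in zip(samples, samples[1:]):
        total += p_next * (t_next - t_i)
    return total
```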


Fig. 3: Per-layer system energy profile for SqueezeNet batch 1 inference running on CPU

Fig. 4: Per-layer system energy profile for GoogleNet batch 1 inference running on CPU

3.1 Per-layer breakdown of energy consumed

To study the execution behaviour at a fine-grained level, we extract per-layer system energy measurements for each ConvNet. Figures 3 and 4 visualize the per-layer execution behaviour of two recent ConvNet models, SqueezeNet [7] and GoogleNet [6], for inferences executing on the CPU on a single image. SqueezeNet encapsulates a combination of 1 × 1 and 3 × 3 conv layers inside its fire module, while GoogleNet arranges 1 × 1 and 3 × 3 conv filters along with max pooling into an inception module. In the two cases, the conv layers account for 82% and 77% of the total energy consumed, respectively, while the pooling layers account for 17% to 21% of the total energy consumed. GoogleNet uses a single fully-connected layer (the loss3 layer), contributing 1.1% of the total energy consumed. In the next section, we focus on energy prediction models for the conv layers, as they are the main component of the overall energy consumed by a model.

4. Energy prediction framework

We adopt the standard supervised learning approach in machine learning to build a regression model [18]. Our prediction framework focuses on modelling the energy at a per-layer level by first running a set of models in the training set to capture their execution behaviour at a per-layer granularity on the target platform. The Jetson TX1 system has a memory hierarchy with L1 and L2 caches and main memory. Therefore, a simple multi-variable regression model can be built by extracting the execution profile of the layers based on the number of SIMD instructions executed and the number of bus accesses (equivalent to last-level cache misses) on the target platform. While more sophisticated prediction models could be built on other performance counters, such as L1 cache misses, it is well known that computation and main memory accesses are the most expensive operations in terms of energy [19].

The training data helps to establish a relationship, if any, between energy consumption and hardware-specific performance counters. Once this prediction model is built, it can be tested on ConvNet models not seen during the training phase. The energy values and performance counter data thus obtained are fed into a learning algorithm to learn the regression coefficients x1 for bus accesses and x2 for SIMD instructions. This captures any relationship between application and hardware, as given in Equation 4.

Ê_conv = x1 × bus_accesses_conv + x2 × SIMD_conv    (4)
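A minimal sketch of fitting such a model with scikit-learn is shown below. The feature and energy arrays are illustrative placeholders, not measurements from this paper, and fit_intercept=False mirrors the form of Equation 4, which has no constant term.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder training data: one row per conv layer, with columns
# [bus_accesses, SIMD]; y holds the corresponding measured energy in mJ.
X_train = np.array([[1.0e6, 1.5e7],
                    [4.0e6, 6.0e7],
                    [2.5e6, 3.5e7]])
y_train = np.array([80.0, 320.0, 190.0])

model = LinearRegression(fit_intercept=False).fit(X_train, y_train)
x1, x2 = model.coef_                 # coefficients for bus accesses and SIMD
energy_hat = model.predict(X_train)  # Ê_conv for each layer
```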

Next, we demonstrate the quality of the CPU-based energy prediction models for conv layers with single-threaded execution. An extension would aim at energy prediction models in the context of multi-threaded executions.

4.1 Energy prediction model from measured SIMD and bus accesses

Initially, the models from Table 1 were qualitatively selected to form a training set on the following basis: AlexNet and VggNet-small represent structurally similar models that have fewer layers but a relatively large number of parameters (62M-102M); SqueezeNet and SqueezeNetRes are an important class of ConvNets that are trained to keep the model size low without compromising accuracy (1.2M parameters, ∼80% top-5 accuracy); GoogleNet has the best trade-off of size and accuracy (6.9M, 90.85%); and ResNet50 represents the current state-of-the-art model in terms of accuracy on image classification tasks (25M, 93.2%).

To evaluate the robustness of the regression-based prediction model, we train on this subset of ConvNet models with SIMD and bus access counts as the inputs. The set of prediction models was obtained by excluding a single ConvNet during the training phase, known as leave-one-out cross-validation [18]. In Table 2, each model given in Column 1 indicates the model excluded during training; the first row, for example, represents a single experiment in which a regression model is formed by excluding AlexNet as a data point during training. Column 2 gives the coefficient for the total bus accesses (x1) and Column 3 the coefficient for the total SIMD counts (x2) over all the conv layers in a given ConvNet. We then used the derived coefficients to predict the energy consumed by the conv layers and compared it against actual energy measurements.
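A sketch of this leave-one-out procedure over the training ConvNets, again with placeholder per-network data, could look as follows; the relative-error expression anticipates Equation 5.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder [bus_accesses, SIMD] features and measured conv-layer energy (mJ)
# per training ConvNet; the names follow Table 2, the numbers do not.
nets = {
    'alexNet':   (np.array([[1.2e7, 1.6e8]]), np.array([900.0])),
    'googleNet': (np.array([[2.8e7, 3.8e8]]), np.array([2000.0])),
    'resNet50':  (np.array([[6.0e7, 9.4e8]]), np.array([5200.0])),
}

for held_out in nets:
    # Train on every network except the one held out.
    X_train = np.vstack([X for name, (X, _) in nets.items() if name != held_out])
    y_train = np.hstack([y for name, (_, y) in nets.items() if name != held_out])
    model = LinearRegression(fit_intercept=False).fit(X_train, y_train)

    # Validate on the held-out network.
    X_test, y_test = nets[held_out]
    rel_err = abs(model.predict(X_test) - y_test) / y_test * 100  # relative error (%)
    print(held_out, model.coef_, rel_err)
```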

The relative error, given by Equation 5, quantifies the performance of the predictor with respect to the baseline measured energy value.

Rel. Err (%) = (|predicted_value − actual_value| / actual_value) × 100    (5)

For example, in the first row, the average relative training error over all the ConvNets in the training set (which does not include AlexNet) is 5.36 ± 3.36% and the relative test error on AlexNet is 2.23%. Figure 5 shows the average bus access and SIMD counts over 5 runs, which are used as inputs to the model. The corresponding average measured energy and average measured time are given in Columns 5 and 6 of Table 2. We provide the timing measurements here for the sake of completeness. We find that the coefficient for bus accesses contributes more to the energy consumption, which is consistent with the fact that a main memory access costs more energy than a SIMD operation [19].

Finally, we consider the average relative training error (4.81 ± 3.19%) obtained by including all the ConvNets ("allNets") in the training set. This is lower than when excluding individual models, which implies that the given subset of models sufficiently captures the execution behaviour of the conv layers on the CPU of the target system. Given a scenario where we can measure performance counters such as SIMD instructions and bus accesses, we are able to predict the energy consumption of unseen test ConvNets with an average relative test error of approximately 8% compared to actual energy measurements.

To alleviate the need to measure SIMD and bus access counts, we then explored the possibility of building prediction models for the two dependent variables, bus_accesses_conv and SIMD_conv, themselves. This data is then fed into our current prediction model based on allNets to obtain a final estimate of the energy consumption of any given ConvNet on the CPU of the Jetson TX1 platform, as discussed in the next subsections.

4.2 Predicting conv layer SIMD counts

SIMD operations exploit the data parallelism in matrix-matrix multiplication to obtain higher efficiency. Since the computation in a conv layer can be transformed into a matrix-matrix multiplication operation, we explore the relationship between the application's MAC count and its measured SIMD instructions for every conv layer. We use Equation 1 in Section 2 to determine the MAC count of a conv layer. The total MAC count for all the conv layers in a given ConvNet is tabulated in Column 3 of Table 3. This total MAC count can then be used to build a simple linear regression model, as given in Equation 6, to predict the SIMD counts for all the conv layers in a ConvNet. The measured SIMD (y) and predicted SIMD (ŷ) counts are shown in Columns 2 and 4 of Table 3. We use the same training set as in subsection 4.1 to build this simple SIMD predictor from MAC counts.

ŷ = c1 × x    (6)

Fig. 5: Bus Access versus SIMD count

Fig. 6: SIMD versus MAC counts

We obtain a slope of 0.24, as shown in Figure 6, which confirms that the SIMD width is 4 for the ARM A57 cores on the Jetson TX1. Therefore, given an appropriate calculation of the MAC count from the application, we can build a SIMD predictor that achieves an average relative test error of 0.81 ± 0.77% compared to actual SIMD measurements.
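The single-coefficient fit of Equation 6 reduces to a one-line least-squares estimate. The sketch below uses the MAC and measured SIMD totals of the six training ConvNets from Table 3; this simple fit lands close to the 0.24 slope reported above, although the paper's exact fitting procedure may differ.

```python
import numpy as np

# Total conv-layer MAC (x) and measured SIMD (y) counts for the
# training ConvNets, taken from Table 3.
mac  = np.array([665784864, 3855925248, 861339936,
                 1581647872, 861339936, 2541337632], dtype=float)
simd = np.array([166326858, 936965249, 212510630,
                 383528521, 213932097, 638627941], dtype=float)

# Least-squares fit of y = c1 * x with no intercept (Equation 6).
c1 = (mac @ simd) / (mac @ mac)
print(c1)  # ~0.245 with this fit, i.e. roughly one SIMD instruction per 4 MACs

# Predict the SIMD count of an unseen network from its MAC count alone,
# e.g. Places-CNDS-8s from Table 3.
print(c1 * 1967702016)
```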

4.3 Predicting conv layer bus accesses

Conv layers are often preceded and succeeded by data transformation operations, such as im2col to transform the input into a matrix-matrix computation, and col2im to transform the result back into the original 2D image layout [1]. This is because the conv layers are interleaved with pool, ReLU and other layers which require data in a specific 2D format. Even though we can calculate the bandwidth of each conv layer as given in Equation 2, the data re-structuring and the associated data movement between layers within the complex cache memory hierarchy make the relationship between bandwidth and performance counters, such as cache and bus accesses, non-trivial.



Table 2: Regression Model to predict Energy

Excluded ConvNet | Bus access (x1) | SIMD (x2) | Predicted Energy (Ê) (mJ) | Avg. Measured Energy (E) (mJ) | Avg. Measured Time (sec) | Avg. Relative Train error (%) | Relative validation error (%)
alexNet | 3.37E-05 | 3.16E-06 | 951.28 | 930.44 | 0.1682 | 5.36 ± 3.36 | 2.23
resNet50 | 3.89E-05 | 2.47E-06 | 4686.75 | 5261.42 | 0.9468 | 2.03 ± 2.06 | 10.92
squeezeNet | 4.09E-05 | 2.70E-06 | 1388.74 | 1240.29 | 0.2652 | 5.26 ± 1.88 | 11.96
googleNet | 3.76E-05 | 2.93E-06 | 2212.37 | 2072.48 | 0.4228 | 5.76 ± 3.58 | 6.74
squeezeNetRes | 3.30E-05 | 3.20E-06 | 1365.02 | 1371.62 | 0.2558 | 5.66 ± 2.5 | 0.48
vggNet-small | 1.27E-05 | 4.75E-06 | 3509.11 | 3027.99 | 0.5646 | 3.41 ± 2.67 | 15.88
allNets | 3.34E-05 | 3.18E-06 | — | — | — | 4.81 ± 3.19 | —
Avg. Rel. Validation Error (excluding a ConvNet) | | | | | | | 8.04 ± 5.96

Table 3: SIMD prediction table

ConvNet | Avg. Measured SIMD (y) | MAC (x) | Predicted SIMD (ŷ) | Relative error (%)
alexNet | 166326858 | 665784864 | 163383605 | 1.76
resNet-50 | 936965249 | 3855925248 | 946244055 | 0.99
squeezeNet | 212510630 | 861339936 | 211372820 | 0.53
googleNet | 383528521 | 1581647872 | 388136387 | 1.20
squeezenetRes | 213932097 | 861339936 | 211372820 | 1.19
vgg-small | 638627941 | 2541337632 | 623644254 | 2.34
Test set: | | | |
Places-CNDS-8s | 492978185 | 1967702016 | 482874074 | 2.04
ALL-CNN-C | 66909070 | 270798336 | 66453911 | 0.37
Inception-BN | 834842927 | 3400527872 | 834489539 | 0.02
Avg. Relative Test Error (%) | | | | 0.81 ± 0.77

Table 4: Bus Access prediction table

ConvNet | Predicted SIMD (ŷ) | Avg. Measured Bus access (z) | Predicted Bus Access (ẑ) | Relative error (%)
alexNet | 163383605 | 12635625 | 10847037 | 14.15
resNet-50 | 946244055 | 61100440 | 62821142 | 2.81
squeezeNet | 211372820 | 19929941 | 14033041 | 29.58
googleNet | 388136387 | 28927569 | 25768374 | 10.92
squeezenetRes | 211372820 | 20600111 | 14033041 | 31.87
vgg-small | 623644254 | 37448187 | 41403742 | 10.56
Places-CNDS-8s | 482874074 | 31498902 | 32058009 | 1.77
ALL-CNN-C | 66453911 | 7172165 | 4411875 | 38.48
Inception-BN | 834489539 | 64169256 | 55401760 | 13.66
Avg. Rel. Test Error (%) | | | | 17.97 ± 15.29

Surprisingly, we found that a linear relationship exists between the total number of measured bus accesses and SIMD counts in the conv layers, which can be seen in Figure 5. Therefore, a similar linear regression predictor, as given in Equation 6, was built to determine the bus access counts of the conv layers from the measured SIMD counts. We find that a linear relationship exists with a coefficient of 0.0663 between SIMD and bus access counts. We can now use the predicted SIMD (ŷ) counts obtained previously to predict the bus access counts (ẑ) for all the ConvNets. The predicted bus access (ẑ) is given in Column 4 of Table 4 and is thus obtained without the need for actual measurements of bus accesses. For most ConvNets we obtain a good prediction of bus access counts from predicted SIMD, with an average relative test error of 17.97 ± 15.29%.
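The analogous single-coefficient fit for bus accesses can be sketched in the same way, here from the measured SIMD and bus access counts of the six training ConvNets (Tables 3 and 4); this simple fit lands close to the 0.0663 coefficient quoted above.

```python
import numpy as np

# Measured SIMD (Table 3) and measured bus access (Table 4) totals
# for the six training ConvNets.
simd = np.array([166326858, 936965249, 212510630,
                 383528521, 213932097, 638627941], dtype=float)
bus  = np.array([12635625, 61100440, 19929941,
                 28927569, 20600111, 37448187], dtype=float)

# Least-squares fit of z = c_bus * y with no intercept, analogous to Equation 6.
c_bus = (simd @ bus) / (simd @ simd)
print(c_bus)  # about 0.066, i.e. roughly one main memory access per 15 SIMD instructions
```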

4.4 Energy prediction model from MAC count

Our final step is to estimate the energy consumption of all the conv layers in a ConvNet directly from the application parameters, which in our case is the conv layer MAC count. The predictions for the SIMD (ŷ) and bus access (ẑ) counts of the conv layers can now be fed into our final energy prediction model obtained from the allNets training set. This enables us to make predictions using the regression models without having to execute and measure the energy of a new ConvNet. The results are tabulated in Table 5, where the predicted energy (Ê) is given in Column 2. Therefore, we are able to predict the energy consumption of the conv layers of any given ConvNet, solely using its MAC count, with an average relative test error of 7.08 ± 5.05%.
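Putting the pieces together, the MAC-only predictor is simply the chained coefficients. The sketch below plugs in the rounded coefficients reported in Sections 4.1-4.3, so its numbers differ slightly from the more precise values behind Table 5.

```python
# Rounded coefficients from Sections 4.1-4.3 (allNets energy model, SIMD and
# bus access predictors); the paper's exact values carry more digits.
C_SIMD_PER_MAC = 0.24      # SIMD instructions per MAC (Section 4.2)
C_BUS_PER_SIMD = 0.0663    # bus accesses per SIMD instruction (Section 4.3)
X1_BUS = 3.34e-05          # mJ per bus access (allNets model, Table 2)
X2_SIMD = 3.18e-06         # mJ per SIMD instruction (allNets model, Table 2)

def predict_conv_energy_mj(mac_count):
    """Predict conv-layer energy (mJ) solely from the application's MAC count."""
    simd = C_SIMD_PER_MAC * mac_count
    bus = C_BUS_PER_SIMD * simd
    return X1_BUS * bus + X2_SIMD * simd

# Example: AlexNet's total conv MAC count from Table 3.
print(predict_conv_energy_mj(665784864))  # ~862 mJ, in the same range as Table 5's 881 mJ
```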

5. Related Work

Performance measurement and analysis: Several studies, such as Fathom [20], ConvNet-Benchmarks [21] and a GPU performance analysis [22], characterize the behaviour of deep learning models based on execution time. While these performance studies explore the space of desktop- and server-based CPUs and GPUs, they offer limited insight into the execution behaviour of deep learning models on resource-constrained embedded platforms.

Energy measurement and analysis: A few studies have emerged that report the energy and performance of deep learning models on the TX1 platform [9], [5]. However, these studies are often ad hoc, use a limited set of deep learning models, and aim at platform-to-platform comparisons. They lack a consistent methodology for acquiring power on the Jetson TX1 and provide minimal detail on the adopted method. Our work instead develops a detailed methodology to acquire power measurements using the power sensor on-board the TX1.

BenchIP [23] aims at evaluating the efficiency of deep learning workloads, comprising ConvNet and Long Short-Term Memory (LSTM) models, on representative platforms from the desktop, server and embedded domains. The authors evaluate each layer in isolation as well as overall end-to-end model executions in Caffe. However, this differs from our approach, where we study each layer in isolation in the context of an actual inference. To the best of our knowledge, this benchmark suite is yet to be open-sourced.

Finally, our work shares similarities with the energy profiling methodology adopted by [4]. The authors provide the energy consumption at a functional level through the use of code annotations to demarcate specific phases of a decision-tree application. However, it is unclear how transferable the approach would be, as the tool is not open-sourced. Our work instead builds an energy measurement and prediction framework on top of an existing open-sourced deep learning software framework, Caffe, that is widely adopted by the machine learning community.


Table 5: Energy prediction results for coefficients x1 = 3.34E-05 and x2 = 3.18E-06

ConvNet | Predicted Energy (Ê) (mJ) | Avg. Measured Energy (E) (mJ) | Avg. Measured Time (sec) | Relative error (%)
alexNet | 881.41 | 930.45 | 0.17 | 5.26
resNet-50 | 5104.76 | 5261.42 | 0.95 | 2.97
squeezeNet | 1140.30 | 1240.30 | 0.27 | 8.06
googleNet | 2093.90 | 2072.49 | 0.42 | 1.03
squeezeNetRes | 1140.30 | 1371.62 | 0.25 | 16.86
vgg-small | 3364.41 | 3028.00 | 0.56 | 11.11
Places-CNDS-8s | 2604.99 | 2613.46 | 0.46 | 0.32
ALL-CNN-C | 358.50 | 422.29 | 0.08 | 15.10
Inception-BN | 4501.87 | 4641.14 | 0.84 | 3.00
Average Relative Test Error (%) | | | | 7.08 ± 5.05


Energy estimation methodologies: While benchmarking efforts continue to grow, benchmarking is time-consuming. Therefore, energy estimation tools have been proposed to evaluate deep learning models [9]. There, the authors propose a model of energy consumption based on the number of MACs and the bandwidth, and associate an energy cost with each operation to estimate the energy consumption of a neural net on a specialised ASIC. However, such an estimation methodology may not transfer to models executing on general-purpose hardware. Our work focuses on building a model of energy consumption based on the actual execution behaviour on the CPU of the target platform, and on deriving a model that estimates the energy of a layer solely from its MAC count.

6. Conclusions and Future Work

Deployment of deep learning applications on mobile and embedded platforms remains a challenge due to the limited power budgets available on such devices. Efforts to improve the energy consumption of deep learning applications have begun to emerge, ranging from the development of compact ConvNet models to the building of specialised hardware.

We propose a framework to enable fine-grained performance and energy measurement of deep learning applications targeting embedded platforms such as the Jetson TX1. We demonstrate a systematic methodology by integrating a widely used deep learning framework, Caffe, with ARM's Streamline Performance Analyser. With the help of our energy measurement framework, we were able to build an initial energy prediction model using 9 representative ConvNet models.

Our initial energy prediction model was based on hardware performance counters, namely SIMD instructions and bus accesses for the CPU, obtained from actual execution runs of these models. Using this approach, we are able to predict the energy-use of all the conv layers in a ConvNet model with an average relative test error of 8.04 ± 5.96% over actual energy measurements. Furthermore, we refined this model to make predictions directly from the application parameters. This is achieved by developing predictors for each of the hardware performance counters, SIMD and bus accesses, with prediction errors of 0.81 ± 0.77% and 17.97 ± 15.29% respectively. Using these as input, our final energy predictor achieves a 7.08 ± 5.05% average relative test error using solely the Multiply-Accumulate (MAC) counts obtained from the application description.

Future work includes extending this energy prediction model to other performance counters, such as L1 and L2 cache accesses, and to other hardware platforms. Opportunities for in-depth analysis to guide the use of power management techniques, such as DVFS, to reduce the energy consumption of specific layers could also be explored.

Acknowledgment

This research was conducted with support for C. Rodrigues and G.D. Riley from the IS-ENES2 project, funded under the European FP7-INFRASTRUCTURES-2012-1 call (GA No: 312979). C. Rodrigues is also part-funded by ARM Ltd. under a PhD Studentship Agreement. Mikel Luján is supported by a Royal Society University Research Fellowship. The authors would also like to acknowledge the valuable insights and guidance of colleagues from the Advanced Processors Technology group at The University of Manchester and the ARM Ltd. research group, Cambridge.

References

[1] M. Verhelst and B. Moons, "Embedded deep neural network processing: Algorithmic and processor techniques bring deep learning to IoT and edge devices," vol. 9, pp. 55–65, 11 2018.
[2] F. Iandola, "Exploring the design space of deep convolutional neural networks at large scale," arXiv preprint arXiv:1612.06519, 2016.
[3] C. F. Rodrigues, G. Riley, and M. Lujan, "Fine-grained energy profiling for deep convolutional neural networks on the Jetson TX1," in 2017 IEEE International Symposium on Workload Characterization (IISWC), Oct 2017, pp. 114–115.


[4] E. Garcia-Martin, N. Lavesson, and H. Grahn, "Identification of energy hotspots: A case study of the very fast decision tree," in International Conference on Green, Pervasive, and Cloud Computing. Springer, 2017, pp. 267–281.
[5] A. Canziani, A. Paszke, and E. Culurciello, "An analysis of deep neural network models for practical applications," CoRR, vol. abs/1605.07678, 2016. [Online]. Available: http://arxiv.org/abs/1605.07678
[6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[7] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size," arXiv preprint arXiv:1602.07360, 2016.
[8] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[9] T.-J. Yang, Y.-H. Chen, and V. Sze, "Designing energy-efficient convolutional neural networks using energy-aware pruning," arXiv preprint arXiv:1611.05128, 2016.
[10] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[12] "DS-5 development studio," 2016. [Online]. Available: https://developer.arm.com/products/software-development-tools/ds-5-development-studio/streamline
[13] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
[14] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[15] L. Wang, C.-Y. Lee, Z. Tu, and S. Lazebnik, "Training deeper convolutional networks with deep supervision," arXiv preprint arXiv:1505.02496, 2015.
[16] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, 2015, pp. 448–456.
[17] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller, "Striving for simplicity: The all convolutional net," CoRR, vol. abs/1412.6806, 2014. [Online]. Available: http://arxiv.org/abs/1412.6806
[18] C. Robert, "Machine learning, a probabilistic perspective," 2014.
[19] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural networks," in Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
[20] R. Adolf, S. Rama, B. Reagen, G.-Y. Wei, and D. Brooks, "Fathom: Reference workloads for modern deep learning methods," in Workload Characterization (IISWC), 2016 IEEE International Symposium on. IEEE, 2016, pp. 1–10.
[21] "Convnet benchmarks," 2016. [Online]. Available: https://github.com/soumith/convnet-benchmarks
[22] X. Li, G. Zhang, H. H. Huang, Z. Wang, and W. Zheng, "Performance analysis of GPU-based convolutional neural networks," in Parallel Processing (ICPP), 2016 45th International Conference on. IEEE, 2016, pp. 67–76.
[23] J.-H. Tao, Z.-D. Du, Q. Guo, H.-Y. Lan, L. Zhang, S.-Y. Zhou, L.-J. Xu, C. Liu, H.-F. Liu, S. Tang, et al., "BenchIP: Benchmarking intelligence processors," Journal of Computer Science and Technology, vol. 33, no. 1, pp. 1–23, 2018.
