

DAC: Data-free Automatic Acceleration of Convolutional Networks

Xin Li∗†‡, Shuai Zhang∗†, Bolan Jiang†, Yingyong Qi†, Mooi Choo Chuah‡, and Ning Bi†

†Qualcomm AI Research
‡Department of Computer Science and Engineering, Lehigh University

[email protected], [email protected], [email protected]

[email protected], [email protected], [email protected]

Abstract

Deploying a deep learning model on mobile/IoT devices is a challenging task. The difficulty lies in the trade-off between computation speed and accuracy. A complex deep learning model with high accuracy runs slowly on resource-limited devices, while a light-weight model that runs much faster loses accuracy. In this paper, we propose a novel decomposition method, namely DAC, that is capable of factorizing an ordinary convolutional layer into two layers with far fewer parameters. DAC computes the corresponding weights for the newly generated layers directly from the weights of the original convolutional layer. Thus, no training (or fine-tuning) and no data are needed. The experimental results show that DAC removes a large number of floating-point operations (FLOPs) while maintaining the high accuracy of a pre-trained model. If a 2% accuracy drop is acceptable, DAC saves 53% of the FLOPs of the VGG16 image classification model on the ImageNet dataset, 29% of the FLOPs of the SSD300 object detection model on the PASCAL VOC2007 dataset, and 46% of the FLOPs of a multi-person pose estimation model on the Microsoft COCO dataset. Compared to other existing decomposition methods, DAC achieves better performance.

1. Introduction

Deep learning techniques have been applied to many areas of artificial intelligence, affecting our daily lives. For example, smart surveillance video systems that can detect and identify suspects help law enforcement personnel maintain a safer living environment. Self-driving cars liberate drivers from steering wheels so that they can do more meaningful things, e.g., read business news. As the technology for high-performance mobile and edge computing devices continues to improve, more and more deep learning models are deployed on these devices; e.g., face recognition systems are used on cell phones to unlock screens.

∗ Xin Li and Shuai Zhang contributed equally. This work was done while Xin Li was interning at Qualcomm.

However, some of these AI tasks, e.g., voice recognition, require internet access, which means the model does not run entirely on mobile/IoT devices. The major reason is that most deep learning models with high accuracy run too slowly on resource-limited devices. Many techniques to reduce the size of neural network models, e.g., quantizing neural networks to use fewer bits, have been proposed to facilitate their implementation on mobile chips [28, 26, 27, 25]. However, limited by current hardware structures and the tolerance for accuracy drop, most of these quantization methods for real applications focus only on the 8-bit format. To further accelerate neural network models, it is more important to reduce computation complexity directly at the level of the network architecture. Some research [8, 21, 32, 17, 13, 14] has been done to simplify these models before running them on mobile/IoT devices. Such research can be roughly categorized into two classes:

Designing new light-weight network architectures: MobileNet, proposed by Howard et al. in [8, 21], is an excellent example. The model is based on a streamlined architecture that uses depthwise separable convolutions to build a light-weight deep neural network. The model achieves good accuracy and runs fast on mobile devices. Similar to MobileNet, ShuffleNet [32, 18] is another type of light-weight network architecture based on depthwise separable layers for acceleration. However, these models require powerful servers and massive data to tune the weights. This is not a practical solution for those who cannot access such resources.

Modifying an existing model into a slim version: Another solution is to produce a slimmer version of an existing model. Unfortunately, the training data in some cases is exclusively available to the original designer of a model, which prevents other researchers from re-training the model after modification. Besides, it is costly and time-consuming to train a model from scratch. Thus, compared to designing new models and training them from scratch, accelerating an existing model based on its pre-trained weights is a better solution. Network pruning and parameter decomposition are two common methods for this purpose. Network pruning is a practical tool for speeding up existing deep neural networks [19]. He et al. propose a channel pruning method


[7] that utilizes LASSO regression to prune input channels in each convolutional layer. Even though such a network pruning scheme simplifies models, it still has some weaknesses. Network pruning is based on the statistical results of a set of samples. Thus: (1) it still requires data to discover which channels to prune, and (2) the accuracy of the model drops after pruning because the statistical results are not suitable for all data during testing. Louizos et al. incorporate an l0 relaxation [17] into the training loss function to enforce compactness of network parameters; thus, this l0 pruning method can only be used during the training process. Parameter decomposition is another way to simplify an existing model. It is a layer-wise operation that decomposes a layer into one or multiple smaller layers, either having smaller kernel sizes or fewer channels. Although there will be more layers after decomposition, the total number of weights and the computational complexity will be reduced. Decomposition methods only use the pre-trained weights of a layer, exploiting the fact that most neural network models have many redundant parameters and can be largely simplified with low-rank constraints. In this paper, we propose a new parameter decomposition method that does not require access to data or retraining.

The contributions of this paper are:

1. We propose a novel decomposition method that replaces standard convolutional layers in a pre-trained model with separable layers to significantly reduce the number of FLOPs.

2. The newly generated model maintains high accuracy without using any data or training process.

3. The experimental results on three computer vision application scenarios show that DAC maintains high accuracy even when a vast number of FLOPs is trimmed.

The rest of this paper is organized as follows. Related work is summarized in Section 2. In Section 3, we describe the architecture of DAC and our factorization method. The experimental results are reported in Section 4, followed by the conclusion in Section 5.

2. Related Work

Much work has been done on parameter decomposition. In this section, we discuss some prior work that decomposes convolutional layers. To simplify the description, we assume the weights of the convolutional layer to be decomposed have size (n × kw × kh × c), where n is the number of kernels, kw and kh are the spatial width and height of a kernel respectively, and c is the number of channels of the input feature map.

First, Jaderberg et al. [9] propose a spatial decomposition method. The method decomposes a convolutional layer with (n × kw × kh × c) kernel size into two layers: one has horizontal filters with (c′ × kw × 1 × c) kernel size, and the other consists of vertical filters with (n × 1 × kh × c′) kernel size. In theory, this method indeed reduces parameters. However, running the decomposed model on a mobile device with limited resources does not result in a significant speed-up. This is due to the caching behavior of data. A feature map is horizontally (or vertically) loaded into a continuous block of memory. When we compute a convolution using horizontal (vertical) filters, we access the memory sequentially, so there is no impact on running time. However, if we compute the convolution using vertical (horizontal) filters, we can no longer access memory sequentially, which results in more cache misses and hence longer computation time.

Then, Zhang et al. describe a channel decomposition method in [33]. It decomposes a convolutional layer with (n × kw × kh × c) kernel size into a convolutional layer with fewer output channels and a pointwise convolutional layer. The newly generated convolutional layer has (c′ × kw × kh × c) kernel size, and the pointwise convolutional layer has (n × 1 × 1 × c′) kernel size. Notice that the first layer is still an ordinary convolutional layer, so it does not improve the situation fundamentally.

Direct tensor decomposition methods, including CP decomposition [12] and Tucker decomposition [10], have also been applied to accelerate networks. After these tensor decompositions, one convolutional layer is factorized into 3 or 4 small layers with a bottleneck structure, opposite to the architecture of [21]. One big disadvantage of these tensor decomposition methods is that the depth of the network architecture is tripled (3x) compared to the original model; this increases the memory access cost (MAC) and largely offsets the gains from the reduction in FLOPs, as claimed in [18].

There are also many network decomposition works that use low-rank constraints in the training process or solve layer-wise regression problems with data samples [24, 2]. However, all these methods require access to sufficient data from the training/test domain.

Our research focuses on real application scenarios with limited access to data. In this paper, we propose a novel data-free decomposition method for convolutional layers and compare its performance to the two most related works [33, 9]. (After this paper was accepted, we found that Guo et al. proposed a similar solution in [6]. These two works are independent and concurrent.)

3. Proposed Solution

The intuition behind our proposed scheme is that the depthwise + pointwise combination has already been proven by MobileNet [8] to run efficiently on mobile devices. It would therefore be useful to convert an ordinary convolutional layer into such a structure and compute its weights directly from the original layer. The feasibility of decomposing the weights of a convolutional layer has been mathematically proved by Zhang et al. [33].


Figure 1. The architecture of our proposed DAC. An input feature map consisting of c channels (c = 3 in this figure) is marked with different colors. In the "Depthwise Layer", kernels are only applied to the channel with the same color; thus, each channel is processed by r kernels. (The original convolutional layer (n × kw × kh × c) is replaced by a depthwise layer (rC × kw × kh × 1), producing an intermediate feature map with rC channels, followed by a pointwise layer (n × 1 × 1 × rC); the arrows labeled Td and Ts indicate weights assignment.)

3.1. Convolutional Layer Factorization

In this section, we propose a novel factorization method for convolutional layers. Figure 1 shows the details of our scheme. An ordinary convolutional layer with the shape (n × kw × kh × c) is decomposed into two layers: a depthwise layer with the shape (rC × kw × kh × 1) and a pointwise layer with the shape (n × 1 × 1 × rC), where rC = r · c and r is a factor used to balance the trade-off between model compression ratio and accuracy drop. There is no bias in the depthwise layer; the bias vector of the original layer is assigned to the pointwise layer.

Even though our scheme is inspired by MobileNet, it is worth highlighting the differences between MobileNet and DAC. DAC has no non-linear layers (batch normalization layers and activation layers) between the depthwise and the pointwise layers. The absence of non-linear layers makes DAC quantization-friendly and hence suitable for further hardware acceleration, which Sheng et al. [22] have already experimentally verified.
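To make this structure concrete, here is a minimal Keras sketch of the replacement block (our illustration, not the authors' released code; the function name, the strides/padding defaults, and the use of tensorflow.keras are our assumptions). It chains a depthwise convolution with depth multiplier r directly into a 1x1 pointwise convolution, with no batch normalization or activation in between:

```python
from tensorflow.keras import layers

def dac_block(x, n, kernel_size, r, strides=1, padding='same'):
    # Depthwise layer (rC = r * c filters, no bias), as in Figure 1.
    x = layers.DepthwiseConv2D(kernel_size, strides=strides, padding=padding,
                               depth_multiplier=r, use_bias=False)(x)
    # Pointwise layer (n output channels); the original layer's bias goes here.
    # Note: no BN/activation between the two layers, unlike MobileNet.
    return layers.Conv2D(n, 1, use_bias=True)(x)
```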

3.2. Weights Decomposition

Once a convolutional layer is factorized, we want to compute the weights for the newly generated layers (a depthwise and a pointwise layer) directly from the original weights. We assume T is the trained weight tensor of the original convolutional layer, with shape (n × kw × kh × c). We denote Td ∈ D := R^(rC×kw×kh×1) as the weights of the depthwise layer and Ts ∈ S := R^(n×1×1×rC) as the weights of the pointwise layer. Then, the objective function of factorizing a convolutional layer is:

$$\min_{T_d \in \mathcal{D},\, T_s \in \mathcal{S}} \|T - T_s * T_d\|_F^2, \tag{1}$$

where the operator $*$ denotes the composition of the depthwise and pointwise convolution operations, and $\|\cdot\|_F$ is the Frobenius norm of a tensor/matrix. Thus,

$$\begin{aligned}
\min_{T_d \in \mathcal{D},\, T_s \in \mathcal{S}} \|T - T_s * T_d\|_F^2
&= \min_{T_d \in \mathcal{D},\, T_s \in \mathcal{S}} \sum_{i=1}^{C} \|T_i - T_{s_i} * T_{d_i}\|_F^2 \\
&= \sum_{i=1}^{C} \min_{T_{d_i},\, T_{s_i}} \|T_i - T_{s_i} * T_{d_i}\|_F^2 \\
&= \sum_{i=1}^{C} \min_{S_i,\, D_i} \|M_i - S_i D_i\|_F^2.
\end{aligned}$$

Here the matrices $M_i$, $S_i$, and $D_i$ are transformed from the tensors $T_i$, $T_{s_i}$, and $T_{d_i}$, respectively.

According to SVD theory, the solution of the minimization problem $\min_{S_i, D_i} \|M_i - S_i D_i\|_F^2$ is given by the singular matrices of rank $r$, where the top $r$ singular values can be merged into either $S_i$ or $D_i$. Also, the Frobenius norm $\|\cdot\|_F$ can be defined as the norm $\|\cdot\|_{2,2}$ induced by the $L_2$ vector norm, so the above DAC minimization objective function can be considered as

$$\min_{T_d \in \mathcal{D},\, T_s \in \mathcal{S}} \|T - T_s * T_d\|_F^2
= \min_{T_d \in \mathcal{D},\, T_s \in \mathcal{S}} \sup_{\|F\|_2 \neq 0} \frac{\|(T - T_s * T_d) F\|_2}{\|F\|_2},$$


where $F$ denotes the input feature maps and $\|F\|_2$ is the vector $L_2$ norm. In this form, the objective minimizes the approximation error of the output feature maps, measured in Euclidean space, under the constraint of the decomposition rank $r$ (the factor used to balance the trade-off between model compression ratio and accuracy drop). The process of weights decomposition is described in Algorithm 1.

Algorithm 1: DAC Weights Decomposition
Input:  Weights of a convolutional layer: T ∈ R^(n×kw×kh×c); decomposition rank: r.
Output: Weights of the depthwise layer: Td ∈ R^(rC×kw×kh×1);
        weights of the pointwise layer: Ts ∈ R^(n×1×1×rC).

 1  begin
 2    list_d ∈ R^(c×r×kw×kh×1) ← ∅
 3    list_s ∈ R^(n×1×1×r×c) ← ∅
 4    for i ∈ {1, …, c} do
 5      Ti ← T[:, :, :, i] ∈ R^(n×kw×kh)
 6      Mi ← Reshape(Ti, (n, kw·kh)) ∈ R^(n×kw·kh)
 7      Di, Si ← Decompose(Mi, r)
 8      list_d[i, :, :, :, :] ← Di ∈ R^(r×kw×kh×1)
 9      list_s[:, :, :, :, i] ← Si ∈ R^(n×1×1×r)
10    Td ← Reshape(list_d, (r·c, kw, kh, 1))
11    Ts ← Reshape(list_s, (n, 1, 1, r·c))

12  function Decompose(M, r)
13  begin
14    U, Sigma, V ← SVD(M)
15    Ur ← U[:, :r] ∈ R^(n×r)
16    Vr ← V[:r, :] ∈ R^(r×kw·kh)
17    Sr ← Sigma[:r, :r] ∈ R^(r×r)
18    D ← Reshape(Vr, (r, kw, kh, 1))
19    S ← Ur · Sr
20    S ← Reshape(S, (n, 1, 1, r))
21    return D, S
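The following NumPy sketch is our reading of Algorithm 1 (the function name and the sanity check are ours). It assumes the weight tensor uses the paper's (n × kw × kh × c) layout; mapping Td and Ts into a particular framework's kernel layout (e.g., Keras stores depthwise kernels as (kh, kw, c, depth_multiplier)) would require an additional transpose.

```python
import numpy as np

def dac_decompose(T, r):
    """Decompose conv weights T of shape (n, kw, kh, c) into depthwise
    weights Td of shape (r*c, kw, kh, 1) and pointwise weights Ts of
    shape (n, 1, 1, r*c), following Algorithm 1."""
    n, kw, kh, c = T.shape
    Td = np.zeros((r * c, kw, kh, 1))
    Ts = np.zeros((n, 1, 1, r * c))
    for i in range(c):
        Mi = T[:, :, :, i].reshape(n, kw * kh)      # flatten spatial dims
        U, sigma, Vt = np.linalg.svd(Mi, full_matrices=False)
        # Rank-r truncation; singular values are merged into the pointwise part.
        Td[i * r:(i + 1) * r] = Vt[:r].reshape(r, kw, kh, 1)
        Ts[:, 0, 0, i * r:(i + 1) * r] = U[:, :r] * sigma[:r]
    return Td, Ts

# Sanity check: per input channel, Ts @ Td equals the rank-r SVD approximation.
T = np.random.randn(64, 3, 3, 32)
Td, Ts = dac_decompose(T, r=4)
M0 = T[:, :, :, 0].reshape(64, 9)
M0_hat = Ts[:, 0, 0, :4] @ Td[:4].reshape(4, 9)
U, s, Vt = np.linalg.svd(M0, full_matrices=False)
assert np.allclose(M0_hat, (U[:, :4] * s[:4]) @ Vt[:4])
```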

3.3. Computation Reduction

We consider an original convolutional layer with (n × kw × kh × c) kernel size that takes a (Wf × Hf × c) feature map F as input and produces a (Wf × Hf × n) feature map G, where Wf and Hf are the spatial width and height of the feature maps. Here, we assume the output feature map has the same spatial size as the input for simplicity. Then, the computation cost of the convolutional layer is Wf × Hf × c × kw × kh × n.

The computation cost depends on the number of input channels c, the number of output channels n, the kernel size kw × kh, and the input feature map size Wf × Hf. After decomposition, the newly generated depthwise and pointwise layers in total have a cost of Wf × Hf × kw × kh × rC + Wf × Hf × rC × n, where rC = r · c, and the ratio of the new computation cost to the original is

$$\frac{W_f H_f\, k_w k_h\, rC + W_f H_f\, rC\, n}{W_f H_f\, c\, k_w k_h\, n} = \frac{r}{n} + \frac{r}{k_w k_h}.$$
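For example, a (3x3) convolutional layer with n = 256 output kernels decomposed at rank r = 4 retains roughly 4/256 + 4/9 ≈ 0.46 of the original cost, i.e., about 54% of its FLOPs are saved.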

4. Experimental Results

To demonstrate the universality of our proposed scheme, we apply DAC to three major application scenarios in the field of computer vision: (1) image classification, (2) object detection, and (3) multi-person pose estimation. We implement our scheme in Python using the Keras library [4] with a TensorFlow backend [1].

4.1. Datasets

Four datasets are used in this paper:

CIFAR-10 dataset: The CIFAR-10 dataset [11] consists of 50,000 training images and 10,000 test images in 10 categories. It is a small dataset, from which we can quickly get results after tuning parameters. Thus, we use it for an ablation study to gain some insights about DAC, e.g., the impact of using different ranks or decomposing different layers.

ImageNet dataset: The ImageNet dataset [20] has 50,000 ILSVRC validation images in 1,000 object categories. We use this ILSVRC validation subset to evaluate the performance of DAC on the task of image classification.

Pascal VOC2007 dataset: For the object detection task, the Pascal VOC2007 dataset [5] is used. It consists of 4,952 testing images for object detection. The bounding box and label of each object from twenty target classes have been annotated. Each image contains one or multiple objects.

Microsoft COCO dataset: The Microsoft COCO dataset [15] is used to evaluate the performance of DAC on the task of multi-person pose estimation. We use the COCO 2017 keypoints subset, which consists of 5,000 validation images and 40K testing images.

4.2. Ablation Study

Here, we use a CIFAR-VGG model^i, a simple convolutional neural network pre-trained on the CIFAR-10 dataset, as our original model. Figure 2 shows the architecture of the CIFAR-VGG. In total, the CIFAR-VGG model has 13 convolutional layers. The original model (trained on the CIFAR-10 training subset) achieves 93.6% accuracy on the CIFAR-10 testing subset.

Figure 2. The architecture of the CIFAR-VGG: thirteen 3x3 convolutional layers (64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512 filters) with 2x2 pooling between blocks (spatial size 32x32 → 16x16 → 8x8 → 4x4 → 2x2), followed by Flatten, FC-512, and FC-10.

First, we decompose a single convolutional layer to explore the impact of decomposing different layers. Table 1 shows the testing accuracy when applying decompositions of varying rank (rank 1 to rank 5) to different layers of the CIFAR-VGG model. Each time, we only modify one layer.

i: https://github.com/geifmany/cifar-vgg


All results are collected using the decomposed weights directly (no access to data or any training process).

Table 1. Testing accuracy (%) on the CIFAR-10 dataset when decomposing different layers of the CIFAR-VGG model using various ranks. The original model achieves 93.6%.

Decomposed Layer   Rank1   Rank2   Rank3   Rank4   Rank5
conv2d_1            18.6    76.4    86.7    91.9    92.8
conv2d_2            39.1    86.5    91.6    92.7    93.2
conv2d_3            54.0    87.8    92.6    93.2    93.4
conv2d_4            31.4    83.7    91.8    92.8    93.2
conv2d_5            80.1    90.2    92.6    93.1    93.8
conv2d_6            84.3    90.9    92.8    93.3    93.4
conv2d_7            66.0    89.5    92.6    93.0    93.3
conv2d_8            83.2    91.1    92.6    93.0    93.2
conv2d_9            91.2    93.1    93.3    93.4    93.5
conv2d_10           91.7    93.2    93.3    93.3    93.4
conv2d_11           93.1    93.4    93.3    93.4    93.4
conv2d_12           93.3    93.4    93.4    93.4    93.5
conv2d_13           92.9    93.4    93.4    93.4    93.5

From Table 1, we gain two insights: (a) Decomposing the first few layers of a model causes large drops in accuracy (a 75% drop when rank-1 decomposition is applied to layer conv2d_1), while decomposing the last few layers has a much smaller impact (less than a 1% drop when rank-1 decomposition is applied to layer conv2d_13). (b) Decomposing a layer using a larger rank helps to maintain accuracy, as can be observed by comparing different columns in the same row. These two insights are consistent with our intuition: (a) decomposing a layer generates small errors, and if such errors occur at the beginning of a model, they accumulate into larger errors at the final prediction; (b) compared to smaller ranks, larger ranks generate more parameters in the depthwise layers, so the newly generated layers are more capable of replicating the behavior of the original layer.

Next, we explore the performance of DAC when multiple convolutional layers are decomposed. We decompose the model in two opposite directions: (1) from the last layer to the first, and (2) from the first layer to the last. To simplify the experiment, we use the same rank to decompose all chosen layers. The experimental results are reported in Figure 3. First, one can quickly notice that most decomposition cases (solid points) achieve high accuracy (higher than 91.6%, i.e., less than a 2% drop). Second, after saving 42% of the FLOPs, DAC still achieves 92.7% accuracy (a drop of less than 1%). Both observations show that our proposed DAC is capable of maintaining accuracy when the number of FLOPs is substantially reduced.

Besides, in Figure 3, red-star points (Rank 5) achieve high accuracy. If we compare the solid (open) red-star marks to the other solid (open) marks, we can notice that the above insights also hold when multiple convolutional layers are decomposed. Ten (eight) out of twelve Rank-5 decomposition cases (solid red-star points) drop accuracy by less than 2% (1%). The worst solid red-star case, which achieves 91.2% (a 2.4% accuracy drop), is caused by the decomposition of the first layers of the model (the first insight discussed above). It is worth highlighting that these decomposed models, which maintain high accuracy, are generated by DAC without access to data or any training process.

Figure 3. Classification accuracy (%) versus saved FLOPs ratio (%) on the CIFAR-10 dataset; the original model achieves 93.6%. Each curve has 12 points that correspond to different numbers of decomposed layers (2 to 13 layers from left to right). Solid points indicate cases where the last few layers are decomposed (layer "conv2d_13" included). Open points are cases where the first few layers are decomposed (layer "conv2d_1" included).

4.3. Image Classification

For the task of image classification, we use the VGG16 model proposed by Simonyan et al. [23]. It includes 13 (3x3) convolutional layers. We downloaded a model^ii pre-trained on the ImageNet dataset. All convolutional layers but the first one are decomposed, considering the first insight from our ablation study.

Table 2. Top-5 (Top-1) validation accuracy on the ImageNet dataset.

Top-5 (Top-1) Accuracy (%)
VGG16 [23] (Baseline): 88.9 (69.2)

Method                 Saved 40%     Saved 50%     Saved 60%
Channel Decomp. [33]   86.5 (65.6)   74.4 (48.7)   43.3 (20.8)
Spatial Decomp. [9]    88.6 (68.5)   86.3 (65.0)   78.0 (52.5)
DAC (Ours)             88.6 (68.5)   87.5 (66.8)   84.7 (62.5)

Here we compare our approach with two schemes: the filter reconstruction optimization proposed by Jaderberg et al. in [9] (Spatial Decomp. in Table 2) and the channel decomposition method proposed by Zhang et al. in [33] (Channel Decomp. in Table 2). Spatial Decomposition, like DAC, needs neither data nor training, as discussed in Section 2. Although Channel Decomposition requires some data, we can still use the method as a filter reconstruction without accessing any data or training process. We implemented these two algorithms ourselves. For a fair comparison, we choose appropriate parameters for Channel Decomposition and Spatial Decomposition so that all schemes save roughly the same number of FLOPs.

ii: https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels.h5


Given a rank r for DAC, the number of filters c′_c in the first newly generated layer of Channel Decomposition can be computed as

$$c'_c = r \cdot \frac{c\,(n + k_h k_w)}{c\,k_h k_w + n}, \tag{2}$$

and for Spatial Decomposition, the number of filters c′_s in the first newly generated layer is

$$c'_s = r \cdot \frac{c\,(n + k_h k_w)}{c\,k_w + n\,k_h}, \tag{3}$$

where n is the number of kernels in the original convolutional layer, kw and kh are the spatial width and height of a kernel, respectively, and c is the number of channels of the input feature map.
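As a small illustration (a helper of our own, using only the quantities defined above), the FLOPs-matched filter counts of Equations 2 and 3 can be computed directly:

```python
def matched_filter_counts(n, kw, kh, c, r):
    """Filter counts for the first layer of Channel Decomposition (Eq. 2)
    and Spatial Decomposition (Eq. 3) that match DAC's FLOPs at rank r."""
    cc = r * c * (n + kh * kw) / (c * kh * kw + n)   # Channel Decomp.
    cs = r * c * (n + kh * kw) / (c * kw + n * kh)   # Spatial Decomp.
    return round(cc), round(cs)

# e.g., a 3x3 layer with 256 input and 256 output channels at rank 4:
print(matched_filter_counts(n=256, kw=3, kh=3, c=256, r=4))  # -> (106, 177)
```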

Table 2 shows the accuracy of the model (after saving 40%, 50%, and 60% of the FLOPs, respectively) on the ImageNet validation set. First, DAC maintains high Top-1 and Top-5 accuracy even when a significant number of FLOPs is removed. Second, compared to Channel Decomposition and Spatial Decomposition, DAC performs much better. In particular, when 60% of the FLOPs are saved, DAC achieves 41.4% higher accuracy than Channel Decomposition and 6.7% higher accuracy than Spatial Decomposition.

4.4. Multi-person Pose Estimation

For the task of multi-person pose estimation, we use the scheme proposed by Cao et al. [3]. Figure 4 shows the architecture extracted from their paper. After a feature map F is generated by a convolutional network (initialized with the first 10 layers of VGG-19 [23] and fine-tuned), the model is split into two branches: the top branch predicts the confidence maps, and the bottom branch predicts the affinity fields.

Figure 4. The model architecture, extracted from [3].

We download an implementation of Cao's model^iii that was pre-trained on the Microsoft COCO dataset as our original model. It achieves 57.9% average precision (AP) on the validation subset of the 2017 COCO keypoints challenge. This model consists of six stages, i.e., t ∈ {2, 3, 4, 5, 6} in Figure 4. The first stage (Stage 1) has 6 convolutional layers (3x3 kernel size), and each of the following stages (Stage 2 to Stage 6) includes 10 convolutional layers (7x7 kernel size).

iii: https://github.com/anatolix/keras_Realtime_Multi-Person_Pose_Estimation

Based on the above two insights, we decompose the model from the bottom to the top with various ranks (from Rank20 to Rank3). Because the full rank of a (3x3) convolutional kernel (in Stage 1) is 9, we set the maximum rank used to decompose these (3x3) convolutional layers to 5 to obtain a large compression ratio.

Figure 5. AP (%) versus saved FLOPs ratio (%) on the Microsoft COCO dataset when decomposing the last 1 to 6 stages; the original model achieves 57.9% AP. Each curve has 18 points that correspond to different ranks (Rank20 to Rank3 from left to right).

Figure 5 shows the experimental results. First, it is obvious that in the task of person pose estimation, DAC also maintains high accuracy without any retraining when large numbers of FLOPs are saved. Our proposed DAC saves up to 46% of the FLOPs when a 2% AP drop is allowed. Second, for each curve, the AP decreases with decreasing decomposition rank; this observation is consistent with the second insight above. Third, we notice that "Decompose last 6 stages" achieves results (saved ratios and APs) similar to "Decompose last 5 stages". This can be explained as follows: "Decompose last 6 stages" includes Stage 1, in which all decomposed convolutional layers (6 layers) have (3x3) kernel size. Compared to a convolutional layer with (7x7) kernel size, these layers have much fewer parameters, so decomposing them does not contribute much.

Table 3. Results on the COCO 2017 keypoint challenge.

Mean Average Precision (%)
Openpose [3] (Original): 57.9

Method                 Saved 40%   Saved 50%   Saved 60%
Channel Decomp. [33]   25.9        5.0         0
Spatial Decomp. [9]    55.9        54.4        45.4
DAC (Ours)             56.7        55.6        52.5

Table 3 shows the accuracy of the model (after saving 40%, 50%, and 60% of the FLOPs, respectively) on the COCO 2017 keypoint challenge. The parameters of Channel Decomposition and Spatial Decomposition are computed using Equations 2 and 3, respectively. Compared to Channel and Spatial Decomposition, DAC achieves higher accuracy even when a significant number of FLOPs is removed. After saving 60% of the FLOPs, Channel Decomposition cannot correctly detect any person's pose, while DAC still achieves 7.1% higher accuracy than Spatial Decomposition.



Figure 6. Visualized results on the COCO dataset. The first row shows the results generated using the original weights, while the second row shows the results created using the model that saves 50% of the FLOPs.

Figure 6 shows the visualized multi-person pose estimation results on the COCO dataset. It shows that after being decomposed using DAC, the model still works well, with only small changes observed. For example, the decomposed model misses a leg of a person in the first example (the second person on the right side) and in the third example (the second person on the left side). Please refer to our Appendix for more visualized results.

4.5. Object Detection

Next, we evaluate the performance of DAC on the task of object detection using the Single Shot MultiBox Detector (SSD) model proposed by Liu et al. [16]. Figure 7 shows the framework of SSD.

Figure 7. SSD architecture, extracted from [16].

We use a model^iv pre-trained on the Pascal VOC2007 and VOC2012 trainval subsets. The model uses VGG-16 [23] as its base network and has a (300x300) input size. Ten extra convolutional layers are added to the VGG-16 model to provide extra information. In total, 18 (3x3) convolutional layers and 5 (1x1) convolutional layers are used to generate multi-scale feature maps for detection, and 12 (3x3) convolutional layers are used to produce a fixed set of detection predictions. This model achieves 76.5% mAP on the VOC2007 testing set.

There is no benefit in decomposing a convolutional layer with (1x1) kernel size, so we only decompose the layers with (3x3) kernel size. Furthermore, considering that decomposing the first layers causes large drops in accuracy, we do not decompose the first convolutional layer of the model. To simplify the description, we denote the 18 layers (the first layer, conv1_1, is not decomposed) that generate multi-scale feature maps by "Feature Convolutional Layers (FL)" and the 12 layers that produce detection predictions by "Detector Convolutional Layers (DL)".

iv: https://github.com/pierluigiferrari/ssd_keras


Figure 8. Object detection results (mAP, %) versus saved FLOPs ratio (%) on the PASCAL VOC2007 testing set; the original model achieves 76.5% mAP. The nine points on each curve correspond to Rank9 through Rank1 from left to right.

We present the experimental results in Figure 8. First, one can see that if a 2% mAP drop is acceptable, DAC saves up to 29% of the FLOPs. Second, decreasing the decomposition rank results in a drop in mAP, as also observed in the previous experiments. Third, compared to "DL", "FL" achieves a larger saved-FLOPs ratio. This is because there are fewer layers in "DL", and each layer in "DL" has fewer channels than the layers in "FL". In addition, for this model the maximum decomposition rank is 9, so when the decomposition rank is set to 9, the number of parameters increases after decomposition. This is because all layers we decompose in this model have (3x3) kernel size, whose full rank is 9: the newly generated depthwise layer with Rank9 has the same number of parameters as the decomposed layer, while an extra pointwise layer with n × 1 × 1 × rC parameters is added.

Table 5 compares the detection accuracy of the different schemes on the PASCAL VOC2007 dataset.



Figure 9. Visualized results on the PASCAL VOC2007 dataset. The first row shows the results generated using the original weights, while the second row shows the results created using a model that saves 40% of the FLOPs. Red dashed rectangles are ground truths. The first two samples are examples where the model works well after being decomposed, the third sample shows an example where DAC helps improve the performance, and the following three samples show different kinds of errors caused by decomposition.

Table 4. Detection results on the PASCAL VOC2007 testing set. All results except the original are collected using the SSD300 model decomposed with Rank6. "DL" indicates that only Detector Convolutional Layers are decomposed and "FL" indicates that only Feature Convolutional Layers are decomposed.

Model                 mAP(%)  aero  bike  bird  boat  bottle  bus   car   cat   chair  cow   table  dog   horse  mbike  person  plant  sheep  sofa  train  tv
SSD[16] (Original)    76.5    78.6  83.9  75.3  67.8  48.5    86.7  84.7  87.7  58.1   79.3  75.0   85.9  87.5   82.6   77.5    51.2   77.1   79.5  87.2   76.5
(DL) Channel[33]      76.1    77.6  83.8  75.8  66.6  45.7    86.4  84.5  87.6  58.1   78.6  74.5   86.1  87.4   82.5   77.0    50.3   76.8   79.7  87.8   75.1
(DL) Spatial[9]       76.1    77.9  83.2  75.3  67.2  46.0    86.3  84.4  86.9  58.2   78.9  74.5   85.5  87.3   82.7   76.9    51.0   76.5   79.4  87.7   75.7
(DL) DAC(Ours)        76.3    78.4  82.9  74.5  68.3  47.8    86.7  84.4  88.4  58.0   79.4  74.9   85.6  86.5   83.1   77.3    50.7   77.3   79.0  87.5   76.1
(FL) Channel[33]      62.2    70.4  69.7  63.8  52.9  38.3    75.1  79.8  72.8  42.2   73.2  38.0   65.7  76.3   69.6   64.0    38.8   66.6   53.4  75.6   57.3
(FL) Spatial[9]       63.2    73.7  69.7  64.6  52.0  39.0    75.6  79.9  77.6  42.6   73.2  39.5   70.7  76.3   71.3   65.5    37.9   67.0   53.0  77.9   56.1
(FL) DAC(Ours)        75.3    78.2  83.0  73.0  67.1  44.3    86.3  83.3  87.7  56.6   78.5  75.2   84.2  85.9   82.8   75.8    48.8   75.3   78.6  86.4   75.6
(DL+FL) Channel[33]   62.2    70.6  69.4  64.1  51.1  36.1    75.7  79.8  72.8  43.0   72.9  39.9   66.4  74.5   70.2   63.7    38.5   65.9   55.1  75.4   58.7
(DL+FL) Spatial[9]    63.1    73.8  70.3  64.1  50.8  37.8    75.3  79.8  76.9  43.0   74.3  39.8   69.7  75.8   70.9   64.5    38.6   68.9   54.1  77.9   56.2
(DL+FL) DAC(Ours)     74.8    76.4  81.1  73.1  66.0  44.6    85.9  83.1  88.1  56.5   76.8  74.0   84.3  86.1   83.1   75.5    47.7   74.1   77.5  86.1   75.7

Table 5. Object detection results on the PASCAL VOC2007 dataset.

Mean Average Precision (%)
SSD [16] (Original): 76.5

Method                 Saved 30%   Saved 40%   Saved 50%
Channel Decomp. [33]   62.2        60.0        52.4
Spatial Decomp. [9]    63.1        62.2        60.6
DAC (Ours)             74.8        71.4        60.8

One can see that DAC achieves higher accuracy than the other schemes. In Table 4, we list the detailed detection results on the PASCAL VOC2007 testing set. Comparing the results of DAC to the original model, one can see that decomposing the model using DAC does not impact the performance much, for any category: the accuracy of each category changes only within a small range.

Figure 9 shows the visualized object detection results on the PASCAL VOC2007 testing set. From the first two samples, one can see that after being decomposed, the model can still correctly detect objects; the locations and sizes of the detected bounding boxes change only slightly. The third sample is an example where the original model does not detect an object (the bottle) that is successfully detected by our decomposed model. The fourth sample shows an extra false positive (an unexpected potted plant is detected), the fifth sample shows a miss (the car on the right is missed), and the last sample is an example where the detected label changed (from bird to dog). Please refer to our Appendix for more visualized results.

5. Conclusion

In this paper, we propose a novel decomposition method, namely DAC. Given a pre-trained model, DAC is able to factorize an ordinary convolutional layer into two layers with far fewer parameters and computes their weights by decomposing the original weights directly. Thus, no training (or fine-tuning) and no data are needed. The experimental results on three computer vision tasks show that DAC removes a large number of FLOPs while maintaining the high accuracy of a pre-trained model.

We plan to evaluate the performance of DAC on deep learning models in other fields, e.g., voice recognition, language translation, etc. We also want to explore the possibility of adapting DAC to other types of layers, e.g., 3D convolutional layers, and to compare it with other tensor decomposition formats [10, 12]. Another research direction is to combine low-rank constraints with weight decomposition. These constraints could be convex regularizations like the nuclear norm and the Frobenius norm, or non-convex quasi-norms like Schatten-p and TS1 [30, 29, 31].

6. Acknowledgements

We would like to express our appreciation to Jacob Nelson for his useful discussions.


References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] J. M. Alvarez and M. Salzmann. Compression-aware training of deep networks. In Advances in Neural Information Processing Systems, pages 856–867, 2017.

[3] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[4] F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.

[5] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

[6] J. Guo, Y. Li, W. Lin, Y. Chen, and J. Li. Network decoupling: From regular to depthwise separable convolutions. In BMVC, 2018.

[7] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In The IEEE International Conference on Computer Vision (ICCV), 2017.

[8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[9] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.

[10] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.

[11] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[12] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.

[13] X. Li and M. C. Chuah. SBGAR: Semantics based group activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[14] X. Li and M. C. Chuah. ReHAR: Robust and efficient human activity recognition. In Applications of Computer Vision (WACV), 2018 IEEE Winter Conference on. IEEE, 2018.

[15] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[16] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.

[17] C. Louizos, M. Welling, and D. P. Kingma. Learning sparse neural networks through l0 regularization. arXiv preprint arXiv:1712.01312, 2017.

[18] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.

[19] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.

[20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015.

[21] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.

[22] T. Sheng, C. Feng, S. Zhuo, X. Zhang, L. Shen, and M. Aleksic. A quantization-friendly separable convolution for MobileNets. arXiv preprint arXiv:1803.08607, 2018.

[23] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[24] W. Wen, C. Xu, C. Wu, Y. Wang, Y. Chen, and H. Li. Coordinating filters for faster deep neural networks. In The IEEE International Conference on Computer Vision (ICCV), 2017.

[25] Y. Xu, Y. Wang, A. Zhou, W. Lin, and H. Xiong. Deep neural network compression with single and multiple level quantization. arXiv preprint arXiv:1803.03289, 2018.

[26] P. Yin, S. Zhang, J. Lyu, S. Osher, Y. Qi, and J. Xin. BinaryRelax: A relaxation approach for training deep neural networks with quantized weights. arXiv preprint arXiv:1801.06313, 2018.

[27] P. Yin, S. Zhang, J. Lyu, S. Osher, Y. Qi, and J. Xin. Blended coarse gradient descent for full quantization of deep neural networks. arXiv preprint arXiv:1808.05240, 2018.

[28] P. Yin, S. Zhang, Y. Qi, and J. Xin. Quantization and training of low bit-width convolutional neural networks for object detection. Journal of Computational Mathematics, 2019.

[29] S. Zhang and J. Xin. Minimization of transformed l1 penalty: Closed form representation and iterative thresholding algorithms. Communications in Mathematical Sciences, 2017.

[30] S. Zhang and J. Xin. Minimization of transformed l1 penalty: theory, difference of convex function algorithm, and robust application in compressed sensing. Mathematical Programming, 2018.

[31] S. Zhang, P. Yin, and J. Xin. Transformed Schatten-1 iterative thresholding algorithms for low rank matrix completion. Communications in Mathematical Sciences, 2017.

[32] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. 2017.

[33] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943–1955, 2016.