Efficient Super Resolution Using Binarized Neural Network
Yinglan Ma∗
Adobe Inc. Research
Hongyu Xiong∗
Facebook Research
Zhe Hu
Hikvision Research
Lizhuang Ma
East China Normal University
Abstract
Deep convolutional neural networks (DCNNs) have recently demonstrated high-quality results in single-image super-resolution (SR). DCNNs often suffer from over-parametrization and large amounts of redundancy, which results in inefficient inference and high memory usage and prevents wide deployment on mobile devices. Binarized neural networks significantly reduce model size and computation time, but so far they have only been shown to excel on semantic-level tasks such as image classification and recognition. Little effort has been spent on network quantization for image enhancement tasks like SR, as quantization is usually assumed to sacrifice pixel-level accuracy. In this work, we explore a network-binarization approach for SR tasks that does not sacrifice much reconstruction accuracy. To achieve this, we binarize the convolutional filters only in the residual blocks, and adopt a learnable weight for each binary filter. We evaluate this idea on several state-of-the-art DCNN-based architectures, and show that binarized SR networks achieve qualitative and quantitative results comparable to those of their real-weight counterparts. Moreover, the proposed binarization strategy can reduce model size by 80% when applied to SRResNet [18], and could potentially speed up inference by 5×.
1. Introduction
The challenging task of estimating a high-resolution (HR) image from its low-resolution (LR) counterpart is referred to as super-resolution (SR). SR, particularly single-image SR, has received substantial attention within the computer vision research community, and has been widely used in applications ranging from HDTV and surveillance imaging to medical imaging. The difficulty of SR is to reconstruct high-frequency details from the low-frequency information in the input image. In other words, SR must invert the non-invertible process of low-pass filtering and downsampling that produces LR images.

∗These two authors contributed equally.
Recent literature has shown promising progress in SR using convolutional neural networks [6, 13]. However, because of over-parametrization, inefficiency and large model size are major obstacles to the practical application of deep neural networks. To address these issues, neural network quantization, e.g., using binary weights and operations [5, 23], has been proposed for semantic-level tasks like classification and recognition. Binarized values have huge advantages from an efficiency perspective, since multiplications can be eliminated and bit-wise operations can be used to further reduce computational cost. Nevertheless, little effort has been spent on neural network quantization for image enhancement tasks like SR, as it was assumed to sacrifice the pixel-level accuracy those tasks require.
Figure 1. Example of binarized networks based on SRResNet [18], with a factor of 4×. (a) Real-weight network; (b) Fully-binarized network using the binarization strategy of [23]; (c) Our binarized network.
In this work, we explore a network binarization approach for SR methods. To the best of our knowledge, it is the first work to explore neural network binarization for the image SR task. We show that simply binarizing the whole SR network does not generate satisfactory results. Therefore, we propose a binarization strategy for the SR task that (1) applies binarization only to residual blocks, and (2) uses learnable weights to binarize convolutional filters.
We apply this strategy to a few state-of-the-art SR neural networks to verify its effectiveness. The experimental results show that the binarized SR networks perform similarly to the real-weight networks without sacrificing much image quality, but with significant savings in model size and computational load. This makes it possible to apply state-of-the-art SR algorithms in mobile device applications, video streaming, and Internet-of-Things (IoT) edge-device processing.
2. Related Work
Single-image super-resolution has achieved significant progress in recent years. Pioneering methods for single-image super-resolution are discussed in detail in the surveys [21, 29]. In this work we focus our discussion on the most recent SR paradigms based on convolutional neural networks (CNNs), and on the progress in neural network quantization.
CNN-based super resolution. As one of the first SR methods based on a convolutional structure, SRCNN [6] trains an end-to-end network using the bicubic upsampling of the LR image as the input. The VDSR network [13] demonstrates significant improvement over SRCNN by increasing the network depth. To facilitate training a deeper model with fast convergence, VDSR predicts the residuals rather than the actual pixel values. In [14], the authors provide a deep convolutional structure that allows recursive forward propagation through a certain layer, giving decent performance while reducing the number of parameters. ESPCN [24] improves both accuracy and speed by directly learning the upsampling filters and using the LR images as input instead of their bicubic-upsampled versions. The work [7] adopts a similar idea to ESPCN but with more layers and fewer parameters. DCSCN [28] proposes a shallower CNN than VDSR by introducing skip connections at different levels; directly using the LR image as the input, DCSCN divides the network's function into a feature extraction module and a reconstruction module, giving one of the highest super-resolution performances. Johnson et al. [12] investigate a perceptual loss related to human perception of high-resolution images, which measures the difference between high-level features of the predicted and target images extracted from a pretrained VGG network. SRGAN [18] introduces generative adversarial networks (GANs) [8, 22] into SR, with a combined content loss and adversarial loss as the perceptual loss function; its generator model, also described as SRResNet, adopts a deep residual framework for feature extraction and uses sub-pixel convolutional layers for upscaling reconstruction. EDSR [20] moves beyond SRResNet by simplifying the residual blocks and proposes a multi-scale SR system, achieving even higher performance. LapSRN [16] adopts a Laplacian pyramid to predict the HR image; given a fixed upscaling factor for a single level, multiple pyramid levels can be stacked for a larger upscaling factor, and the convolutional filters are shared between pyramid levels, significantly reducing the number of parameters.
Neural network with low-precision weights. A great deal of effort has gone into speeding up and compressing CNNs during training, feed-forward inference, or both. Among existing methods, restricting CNN weights to low-precision versions (such as binary or bit-quantized values) has attracted great attention from researchers and developers. Soudry et al. [25] propose expectation back-propagation (EBP) to estimate the posterior distribution of the network weights, which are constrained to +1 and −1 during feed-forward inference in a probabilistic way. BinaryConnect [4] extends the idea of EBP by binarizing the network weights directly during the training phase and updating the real-value weights during the backward pass based on the gradients of the binarized weights. BinaryConnect achieves state-of-the-art classification performance on small datasets such as MNIST [17] and CIFAR-10 [15], showing that binarized CNNs can perform extremely close to real-value networks. BinaryNet [5] moves beyond BinaryConnect by binarizing both weights and activations. XNOR-net [23] extends BinaryNet and BinaryConnect further by incorporating binarized convolution operations and binarized inputs during feed-forward inference, showing a significant reduction in memory usage and a large boost in computation speed, at the cost of some accuracy. Later, beyond binarized networks, a series of efforts trained CNNs with low-precision weights, low-precision activations and even low-precision gradients, including but not limited to ternary weight networks (TWN) [19], DoReFa-Net [32], quantized neural networks (QNN) [11], and incremental network quantization (INQ) [31]; in this paper we focus on binary networks. CNNs with low-precision weights (such as TWN, XNOR-net, QNN and INQ) have been shown to perform extremely close to their real-value counterparts on semantic-level tasks such as image classification [4]; however, it is widely presumed that CNNs with low-precision weights would fail on pixel-level tasks such as image reconstruction and super-resolution, due to the reduced model complexity.
In this work, we propose an efficient network binarization strategy for image super-resolution tasks, based on BinaryNet and XNOR-net [5, 23], to speed up inference while achieving image quality similar to that of the full-precision network. To verify the effectiveness of the proposed strategy for general CNN-based SR methods, we evaluate two state-of-the-art SR models, SRGAN [18] and LapSRN [16], and compare them with their real-value counterparts on three benchmark metrics: peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [27], and information fidelity criterion (IFC), all calculated on the Y channel.
3. SR Network Binarization
The original binary neural networks [5, 23] are designed to binarize the parameters of convolution layers, and are shown to be effective for semantic-level tasks without sacrificing much accuracy. Unlike semantic-level tasks, SR is an image reconstruction task with a focus on pixel-level accuracy, and simply applying the binarization strategy of [5, 23] to SR networks does not work well. Figure 1(b) shows an example from a fully binarized SRResNet network [18] (all the parameters of the convolution layers are binarized) with a factor of 4×, using the strategy of [23]. The fully binarized network does not generate satisfactory results. In fact, this network does not converge well, due to the limited network capacity and the variation of the pixel-level reconstruction. Therefore, network binarization needs to be specially designed for the SR task, as well as for other image reconstruction tasks.
3.1. Motivation
For a neural network, a deeper structure usually means more representation capacity and therefore better performance, but also greater difficulty in convergence. Residual blocks [9] are designed to facilitate the training of very deep neural networks through a skip connection that carries the original feature. Recent SR algorithms use residual structures in their deep neural networks to effectively forward color and structure information from the input image, and achieve decent results.

The residual structure has some similarities to the multi-scale image pyramids that are commonly used for image reconstruction tasks [1, 3]. Image pyramids represent an image with a series of band-pass filtered images. The base layer contains the low-frequency information, and the gradient layers capture sub-band information and most high-frequency details. Similarly, SR tasks can be considered as reconstructing high-frequency information from a low-pass filtered image (the LR image). Recent residual-structure-based SR networks function in a similar way to the image-pyramid reconstruction process: the residual structure and the skip connection in the neural network resemble, respectively, the gradient layers and the upsampled images from coarser scales in the image pyramid.
Figure 2. Statistics on histograms of the gradient layers in the Laplacian pyramid. The range of the image pixel intensity is [0, 1].

In an image pyramid, the gradient layers are known to be sparse. We represent images using the Laplacian pyramid [3], and collect the statistics of the intensity values in the gradient layers (see Figure 2). As shown in the figure, the value distribution is highly concentrated around 0, and a large share of the pixels (≈ 40%) have a value of 0. These facts indicate that the gradient layer of a Laplacian pyramid can be approximated by a layer with a small number of values, and therefore motivate us to represent it via binary filters with scaling factors in a neural network.
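The sparsity statistic described above can be probed with a small NumPy sketch. This is an illustration under stated assumptions, not the procedure used for Figure 2: a 2×2 box filter and nearest-neighbour upsampling stand in for the Gaussian kernel of the Laplacian pyramid [3], and the helper names (`downsample2`, `upsample2`, `laplacian_layers`, `zero_fraction`) and the tolerance of one 8-bit intensity step are hypothetical choices.

```python
import numpy as np

def downsample2(img):
    """Halve resolution by averaging each 2x2 block (box filter, an assumption)."""
    h, w = img.shape
    img = img[: h - h % 2, : w - w % 2]
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(img):
    """Double resolution by nearest-neighbour repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_layers(img, levels=3):
    """Return the gradient (band-pass) layers and the low-frequency base layer."""
    layers = []
    current = img
    for _ in range(levels):
        h, w = current.shape
        current = current[: h - h % 2, : w - w % 2]  # keep dimensions even
        low = downsample2(current)
        layers.append(current - upsample2(low))      # detail lost by downsampling
        current = low
    return layers, current

def zero_fraction(layer, tol=1.0 / 255):
    """Fraction of entries whose magnitude is below one 8-bit intensity step."""
    return float(np.mean(np.abs(layer) < tol))
```

Calling `zero_fraction` on each layer returned by `laplacian_layers` for a grayscale image in [0, 1] gives a rough per-layer sparsity figure; the exact numbers depend on the filter and tolerance chosen here.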
3.2. Binarization Method
By their quantized nature, binary weights lose information compared with real-value weights. Since SR tasks aim at enhancing image details beyond those of the original image, binary weights should be used carefully in the network. Based on the analysis in Section 3.1, we propose an SR network binarization strategy that binarizes the residual blocks and couples each binary filter with a learnable weight in the network. The convolutional filters outside the residual blocks are kept in real values, and their propagation and parameter update steps are the same as usual.
Algorithm 1 Binary Forward Propagation
for l = 1 to L layer do
    for k = 1 to K output channel do
        $\alpha_{lk}$ ← UpdateAlpha() // using (4)
        $B_{lk}$ ← sign($W_{lk}$)
    end for
end for
$I^{SR}$ ← BinaryForward($I^{LR}$, $\alpha B$)
Figure 3. System architecture of the generator network in SRResNet/SRGAN [18] that transforms the low-resolution input image $I^{LR}$ to the high-resolution image $I^{SR}$. The red boxes indicate the binarized layers in our network. k is the filter size, n denotes the number of filters and s represents the stride number in the convolution layer.

Figure 4. System architecture of the Laplacian Pyramid network [16] for SR. The red boxes indicate binarized layers in our network.

To binarize residual blocks in the network, we first define our binarization function as in [5] to transform real values to +1 or −1:

$$x^b = \operatorname{sign}(x) = \begin{cases} +1, & \text{if } x > 0, \\ -1, & \text{otherwise,} \end{cases} \tag{1}$$

where $x$ is the real-value variable, and $x^b$ is the corresponding binarized weight. In practical applications, booleans can be used to represent the binarized weights. In the binarized network, a scaling factor α is assigned to each binarized filter, so a real-value weight filter $W$ in the convolutional neural network is replaced with $\alpha \cdot \operatorname{sign}(W)$. The factor α is a vector of the same length as the output channel size of the corresponding convolutional layer.
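As a minimal sketch of the replacement of $W$ by $\alpha \cdot \operatorname{sign}(W)$, the NumPy snippet below binarizes a bank of filters with one scale per output channel and packs the signs into bits. The function names and the (F, C, K, K) weight layout are assumptions made for illustration. Note that `np.sign` maps 0 to 0, so `np.where` is used to follow the convention in (1), where 0 maps to −1.

```python
import numpy as np

def binarize(x):
    """Eq. (1): +1 where x > 0, -1 otherwise (so 0 maps to -1)."""
    return np.where(x > 0, 1.0, -1.0)

def binary_filters(W, alpha):
    """Replace each real-valued filter W[k] by alpha[k] * sign(W[k]).

    W has shape (F, C, K, K); alpha has shape (F,), one scale per output channel.
    """
    return alpha[:, None, None, None] * binarize(W)

def pack_signs(W):
    """Store the signs as booleans, 8 weights per byte."""
    return np.packbits((W > 0).ravel())
```

With this storage, a filter bank costs one bit per weight plus one real value per output channel for α.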
At each iteration of the training phase, the procedure takes LR image patches $I^{LR}$, the corresponding HR image patches $I^{HR}$, the cost function $C(I^{HR}, I^{SR})$ and the learning rate $\eta$ as inputs, and outputs the updated weights $W^{t}$. For the forward propagation, we first update the factor α and binarize the real-value weight filter, $B = \operatorname{sign}(W)$; then $\alpha B$ is used for the following computation. Algorithm 1 shows the procedure of forward propagation, and we describe the α updating strategy in Section 3.3. For the backward propagation, the gradients of the weights are kept in real values. Similar to [23], we take the derivative of the $l$-th layer cost $C_l$ with respect to the binarized weight $B_{lk}$ as $\frac{\partial C_l}{\partial B_{lk}}$. The gradients are clipped to the range (−5, 5) for stability, and are used to update the real weight by $\frac{1}{\alpha_{lk}} \frac{\partial C_l}{\partial B_{lk}}$. The real-value parameters and binary parameters are then updated by accumulating gradients as shown in Algorithm 3.
Algorithm 2 Binary Backward Propagation
for l = L to 1 layer do
    for k = 1 to K output channel do
        $g_{W_{lk}}$ ← Clip($\frac{1}{\alpha_{lk}} \frac{\partial C_l}{\partial B_{lk}}$, −5, 5)
        $W_{lk}$ ← UpdateBinaryParameter($W_{lk}$, $\eta$, $g_{W_{lk}}$)
    end for
end for
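The per-layer step of Algorithm 2 can be sketched as follows. This is a hedged illustration: the paper does not say which optimizer implements UpdateBinaryParameter, so a plain gradient-descent step is assumed here, and `backward_step` and its argument names are hypothetical.

```python
import numpy as np

def backward_step(W, alpha, grad_B, lr, clip=5.0):
    """One update of the real-valued weights, following Algorithm 2.

    W, grad_B: shape (F, C, K, K); alpha: shape (F,).
    Plain SGD stands in for UpdateBinaryParameter (an assumption).
    """
    g_W = grad_B / alpha[:, None, None, None]  # scale by 1 / alpha per output channel
    g_W = np.clip(g_W, -clip, clip)            # keep gradients within [-5, 5]
    return W - lr * g_W
```

The real-valued `W` is what accumulates these updates; the binary filter is re-derived from its sign at the next forward pass.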
3.3. Updating Scaling Factors for Binary Filters
In [23], the scaling vector α is determined in a deterministic way, by solving the following optimization:

$$J(B, \alpha) = \|W - \alpha B\|^2, \tag{2}$$

where the binary filter $B = \operatorname{sign}(W)$ is obtained by the binarization function. The optimal value of the scaling factor $\alpha^*$ is calculated as:

$$\alpha^* = \frac{1}{n}\|W\|_{\ell_1}, \tag{3}$$

where $n$ is the number of elements in $W$. However, this $\alpha^*$ is optimized for approximating $W$ with $\alpha B$, rather than for closing the pixel-level gap between the prediction and the ground truth.

Figure 5. System architecture of the discriminator network in SRGAN [18] that is trained to distinguish super-resolution results $I^{SR}$ from the high-resolution image $I^{HR}$. k is the filter size, n denotes the number of filters and s represents the stride number in the convolution layer.
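To see why the closed form in (3) is the minimizer of (2), expand $J = \|W\|^2 - 2\alpha \sum_i |W_i| + n\alpha^2$ (using $W_i \operatorname{sign}(W_i) = |W_i|$) and set the derivative with respect to α to zero. The short NumPy sketch below checks this numerically for a single filter; the function names are illustrative only.

```python
import numpy as np

def optimal_alpha(W):
    """Closed-form minimizer of ||W - alpha * sign(W)||^2 for one filter: mean |W|."""
    return float(np.mean(np.abs(W)))

def approximation_error(W, alpha):
    """Value of J(B, alpha) in (2) with B = sign(W)."""
    B = np.where(W > 0, 1.0, -1.0)
    return float(np.sum((W - alpha * B) ** 2))
```

Perturbing α away from `optimal_alpha(W)` in either direction increases `approximation_error`, which is a quick sanity check on the formula.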
Algorithm 3 Accumulating Parameter Gradients
for t = 0 to T do
    $\theta^{t+1}$ ← UpdateNonBinaryParameter($\theta^{t}$, $\eta^{t}$, $g_{\theta}$)
    $W^{t+1}$ ← UpdateBinaryParameter($W^{t}$, $\eta^{t}$, $g_{W}$)
    $\eta^{t+1}$ ← UpdateLearningRate($\eta^{t}$)
end for
To better approximate the gradient information, we propose to train α as a parameter in the network,

$$\alpha_c^{t} \leftarrow \mathrm{UpdateAlpha}(\alpha_c^{t-1}, \eta, g_{\alpha_c}), \tag{4}$$

instead of

$$\alpha_c \leftarrow \operatorname{avg}(|W_c|). \tag{5}$$

Experimental comparison between these two methods will be provided in Section 4.
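A minimal sketch of the learnable update in (4), next to the deterministic rule in (5), is given below. It assumes the effective filter is $\alpha_c \operatorname{sign}(W_c)$, so that the chain rule gives $g_{\alpha_c} = \sum_i \frac{\partial C}{\partial \tilde{W}_{c,i}} \operatorname{sign}(W_{c,i})$. The plain gradient step standing in for UpdateAlpha and the use of (5) as a starting value are assumptions, since the paper does not spell out either.

```python
import numpy as np

def alpha_gradient(grad_W_eff, W):
    """Gradient of the cost w.r.t. alpha[c], given the gradient w.r.t. alpha[c] * sign(W[c])."""
    B = np.where(W > 0, 1.0, -1.0)
    return np.sum(grad_W_eff * B, axis=(1, 2, 3))

def update_alpha(alpha, g_alpha, lr):
    """Plain gradient step standing in for UpdateAlpha in (4) (an assumption)."""
    return alpha - lr * g_alpha

def deterministic_alpha(W):
    """Rule (5): per-channel mean absolute weight; could also serve as an initial value."""
    return np.mean(np.abs(W), axis=(1, 2, 3))
```

Unlike (5), the learned α receives its gradient from the reconstruction cost, so it is driven by the pixel-level error rather than by how well $\alpha B$ matches $W$.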
4. Experiment
To verify the proposed binarization strategy for SR tasks, we evaluate it on several state-of-the-art SR architectures: SRResNet/SRGAN [18] and LapSRN [16]. We use the NTIRE 2017 dataset [26], whose 800 HR training images and 100 HR validation images serve as our training and test datasets, respectively. To train and test the 2× and 4× models, the corresponding LR images are generated using bicubic downsampling with a scale factor of 2× and 4×, respectively.
4.1. Training Details
We test our binarization strategy on three different models, SRResNet, SRGAN, and LapSRN, all of which employ residual blocks in their networks. We describe the models and training details for each method below. For all three models, we set the learning rate of the binary version 3-4× larger than that of the real-value network, because larger updates are needed for the binary weights to switch signs. With this learning-rate setting, the binary network converges at a similar pace to the real-value network. We show the learning curves for both real-value networks and their binary-weight counterparts in the supplementary material. We use a batch size of 16 for all training. For data augmentation, we adopt the following approaches: 1) randomly cropping patches; 2) randomly flipping horizontally or vertically; 3) randomly rotating images by {0, 90, 180, 270} degrees; 4) adding Gaussian noise to HR training patches.
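The four augmentation steps listed above could be combined roughly as in the NumPy sketch below. The patch size, noise standard deviation and flip probability are illustrative values that the paper does not state, and `augment` is a hypothetical helper name.

```python
import numpy as np

def augment(hr, patch=96, noise_std=0.01, rng=None):
    """Randomly crop, flip, rotate and add noise to an HR image of shape (H, W, C) in [0, 1].

    patch and noise_std are illustrative values, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = hr.shape[:2]
    top = int(rng.integers(0, h - patch + 1))
    left = int(rng.integers(0, w - patch + 1))
    out = hr[top : top + patch, left : left + patch]  # 1) random crop
    if rng.random() < 0.5:
        out = out[:, ::-1]                            # 2) horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :]                            # 2) vertical flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))    # 3) rotate by 0/90/180/270 degrees
    out = out + rng.normal(0.0, noise_std, size=out.shape)  # 4) Gaussian noise
    return np.clip(out, 0.0, 1.0)
```

The matching LR patch would then be produced from the augmented HR patch by bicubic downsampling, which is not shown here.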
SRResNet. We provide the binarized SRResNet structure in Figure 3 and show the binarized layers using red boxes. We binarize all the convolutional layers in the residual blocks, using Algorithms 1, 2 and 3. We train real-weight and binary-weight SRResNets for 500 epochs, with each epoch containing 50 iterations. We use $1 \times 10^{-4}$ and $3 \times 10^{-4}$, respectively, as the initial learning rates, with a decay of 0.9 every 20 epochs.
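The learning-rate schedule can be written as a one-line step decay. This reads "a decay of 0.9 every 20 epochs" as multiplying the rate by 0.9 at each 20-epoch boundary, which is an interpretation rather than something the text states explicitly; `learning_rate` is a hypothetical helper.

```python
def learning_rate(epoch, initial_lr, decay=0.9, step=20):
    """Step decay: multiply the rate by `decay` once every `step` epochs."""
    return initial_lr * decay ** (epoch // step)
```

For example, `learning_rate(epoch, 3e-4)` would give the binary-weight SRResNet schedule and `learning_rate(epoch, 1e-4)` the real-weight one under this reading.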
SRGAN. We apply the same binarization method used for SRResNet to the generator of SRGAN. We keep the discriminator in real weights, as the purpose of binarization is to improve inference performance in terms of speed and model storage size, and the discriminator does not affect inference performance. We pretrain real-weight and binary-weight generators of SRGAN for 100 epochs with learning rates of $1 \times 10^{-4}$ and $3 \times 10^{-4}$, respectively. Then we jointly train the generator and discriminator for 500 epochs with an initial learning rate of $1 \times 10^{-4}$ and a decay of 0.9 every 20 epochs.
LapSRN. The Laplacian Pyramid network proposed by Lai et al. [16] consists of D residual blocks in the Feature Extraction Branch. We binarize all the weights of the convolutional layers in the residual blocks, as shown in Figure 4. We train real-weight and binary-weight LapSRNs for 300 epochs, with each epoch containing 800 iterations. We use $3 \times 10^{-5}$ and $1 \times 10^{-4}$, respectively, as the initial learning rates, with a decay of 0.8 every 30 epochs.
4.2. Evaluation Results
We evaluate our models on three widely used benchmark datasets: Set5 [2], Set14 [30], and Urban100 [10]. To evaluate our method, we provide comparisons of the SR images from the following models and their corresponding binarized versions: SRResNet [18], SRGAN [18], and LapSRN [16]. The results of the bicubic upsampling method are provided as a baseline. More experimental results can be found in the supplementary material.

Figure 6. Comparisons between real-weight networks and their binarized versions with an upsampling factor of 2. Panels: ground truth, bicubic, real-weight LapSRN, binary-weight LapSRN, real-weight SRResNet, binary-weight SRResNet, real-weight SRGAN, binary-weight SRGAN.

Figure 7. Comparisons between real-weight networks and their binarized versions with an upsampling factor of 4. Panels: ground truth, bicubic, real-weight LapSRN, binary-weight LapSRN, real-weight SRResNet, binary-weight SRResNet, real-weight SRGAN, binary-weight SRGAN.

Table 1. Quantitative evaluation of state-of-the-art SR algorithms and their binarized versions. Each cell lists PSNR / SSIM / IFC.

Algorithm                    Scale   Set5                     Set14                    Urban100
Bicubic                      4       28.43 / 0.811 / 2.337    25.90 / 0.704 / 2.246    23.15 / 0.660 / 2.386
SRResNet [18]                4       31.10 / 0.875 / 3.091    27.59 / 0.764 / 2.743    25.09 / 0.750 / 3.040
SRResNet Binary (ours)       4       30.34 / 0.864 / 3.052    27.16 / 0.756 / 2.749    24.48 / 0.728 / 2.913
SRGAN [18]                   4       30.43 / 0.855 / 2.862    27.00 / 0.749 / 2.493    24.75 / 0.734 / 2.865
SRGAN Binary (ours)          4       30.13 / 0.853 / 2.547    26.98 / 0.746 / 2.336    24.31 / 0.716 / 2.499
LapSRN (4x) [16]             4       30.28 / 0.858 / 2.851    27.15 / 0.748 / 2.566    24.37 / 0.722 / 2.755
LapSRN Binary (4x) (ours)    4       30.21 / 0.857 / 2.865    27.13 / 0.751 / 2.616    24.31 / 0.720 / 2.735
Bicubic                      2       33.69 / 0.931 / 6.166    30.25 / 0.870 / 6.126    26.89 / 0.841 / 6.319
SRResNet [18]                2       36.36 / 0.952 / 6.714    32.16 / 0.904 / 6.577    29.96 / 0.901 / 7.016
SRResNet Binary (ours)       2       35.66 / 0.946 / 5.936    31.56 / 0.897 / 5.893    28.76 / 0.882 / 6.112
SRGAN [18]                   2       35.31 / 0.941 / 6.332    31.81 / 0.901 / 6.280    29.63 / 0.897 / 6.646
SRGAN Binary (ours)          2       34.91 / 0.938 / 5.840    30.92 / 0.892 / 6.354    28.55 / 0.878 / 6.490
As shown in Figures 6 and 7, the binarized networks using the proposed method perform similarly to their real-weight counterparts. Although the results are comparable, the binarized networks produce slightly more aliasing artifacts than their real-weight versions. That is, the binarization strategy still compromises performance in terms of high-frequency details, due to the limited network capacity.
Table 2. Comparison of networks using deterministic and learnable scaling factors. The results are evaluated on the SRResNet [18] and LapSRN [16] structures in PSNR / SSIM / IFC.

Model (4x)    Deterministic            Learnable
SRResNet      27.06 / 0.754 / 2.712    27.16 / 0.756 / 2.749
LapSRN        27.02 / 0.747 / 2.604    27.13 / 0.751 / 2.616
For quantitative evaluation, we calculate the PSNR, SSIM and IFC metrics [27] between the generated images and the ground truths, and all the values reported in Table 1 are calculated on the Y channel of the YUV color space. Both the real-weight and the binary-weight networks achieve better results than the bicubic baseline in terms of these metrics. In most cases, the binary-weight networks perform similarly to their real-weight counterparts, with a difference of less than 0.5 in PSNR or 0.005 in SSIM. The margin becomes even smaller for the upsampling factor of 4.
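For reference, PSNR on the luminance channel can be computed as in the sketch below. The BT.601 luma weights are an assumed convention (the paper only says "Y channel of the YUV color space"), images are assumed to be RGB arrays in [0, 1], and the function names are illustrative.

```python
import numpy as np

def luma(rgb):
    """Y channel from RGB in [0, 1] using BT.601 weights (an assumed convention)."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two arrays with values in [0, peak]."""
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))
```

A call such as `psnr(luma(hr), luma(sr))` compares an HR image with its SR estimate; since PSNR is logarithmic in the mean squared error, a 0.5 dB gap corresponds to roughly a 12% difference in MSE.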
Table 3. Performance of binarized SRResNet (4x) with 8, 16 and 24 residual blocks on the Set14 [30] dataset.

# of Res-Blocks    PSNR     SSIM     IFC
8                  26.74    0.738    2.136
16                 27.16    0.756    2.749
24                 27.19    0.754    2.774
4.3. Model Analysis
To verify the effectiveness of the proposed learnable scaling factor for binary filters, we compare networks using the learnable scaling factors with networks using the deterministic ones in Table 2. In general, learnable scaling factors converge faster and achieve better results than the deterministic ones.
The binarized networks reduce the computational load and enable efficient inference. As deeper networks have better representation abilities and usually generate better results, we also evaluate the binarized networks in terms of network depth. We train SRResNet-Binary (4x) with 8, 16 and 24 residual blocks, and show the quantitative results in Table 3. Note that we use 16 residual blocks for SRResNet in all other experiments, the same as in [18]. The results show that deeper networks generally perform better than shallower ones. However, only marginal improvement is achieved beyond 16 residual blocks, which suggests that the binarized layers in 16 residual blocks are sufficient to represent the high-frequency details of images.
4.4. Model Size and Computational Complexity
Table 4. Number of parameters and model size of binary and real-weight networks (32-bit floating point). The parameters of binary networks include binary parameters of filters and float parameters assigned to filters. Paired entries are given as binary network / real network.

Algorithm        # of Res-Blocks / # of Res-Conv-Layers    # of Binary Params    # of Real Params        Model Size
SRResNet (2x)    16 / 32                                   1,179,648 / 0         197,059 / 1,374,659     0.928 / 5.499 MB
SRResNet (4x)    8 / 16                                    589,824 / 0           339,651 / 928,451       1.428 / 3.714 MB
SRResNet (4x)    16 / 32                                   1,179,648 / 0         344,771 / 1,522,371     1.518 / 6.089 MB
SRResNet (4x)    24 / 48                                   1,769,472 / 0         349,801 / 2,116,201     1.608 / 8.465 MB
LapSRN (4x)      10 / 10                                   368,640 / 0           152,656 / 520,656       1.494 / 2.083 MB

Table 5. User study for 2× and 4× SR results from binary and real-weight networks. The test images are randomly selected from Set5 [2] and Set14 [30].

Vote Percentage    2×      4×
SRResNet           0.50    0.45
SRResNet-Bin       0.50    0.55
SRGAN              0.60    0.62
SRGAN-Bin          0.40    0.38
LapSRN             0.51    0.55
LapSRN-Bin         0.49    0.45

In Table 4, we list the number of parameters and the model size of the networks that we test. Real parameters are represented in 32- or 64-bit floating-point precision, and binarization reduces each of them to a single bit, which cuts the memory used for storing those parameters by a factor of 32 or 64. Even though the binarization strategy applies only to the residual blocks in the network, it still yields 50%-80% model compression, depending on the network structure. A higher proportion of residual components yields a higher compression rate. This also provides a design guideline for applications with memory limitations, such as on-device processing in IoT edge devices and mobile platforms.
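The storage arithmetic behind these figures can be sketched in a few lines. The helper below assumes one bit per binary weight, 32 bits per real parameter and 1 MB = 10^6 bytes; `model_size_mb` is a hypothetical name. With the SRResNet (4x, 16 blocks) counts from Table 4, it reproduces the 6.089 MB real-weight size and lands within about 0.01 MB of the listed binary size.

```python
def model_size_mb(n_binary, n_real, real_bits=32):
    """Storage in MB (10**6 bytes): 1 bit per binary weight, real_bits per real parameter."""
    total_bits = n_binary * 1 + n_real * real_bits
    return total_bits / 8 / 1e6

# Parameter counts for SRResNet (4x) with 16 residual blocks, taken from Table 4.
real_net = model_size_mb(0, 1_522_371)
binary_net = model_size_mb(1_179_648, 344_771)
compression = 1.0 - binary_net / real_net
```

Under these assumptions `compression` comes out at roughly 75%, within the 50%-80% range quoted above.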
Another benefit of network binarization is reduced computational complexity during inference, as binarized filter weights enable bit operations instead of floating-point multiplications when computing convolutions. Consider a single convolution layer with input tensor size $(W, H, C)$ and convolutional filters of size $(K, K, C, F)$, where $W$, $H$, $C$ represent the width, height, and number of input channels of the intermediate state, and $F$ represents the number of output channels. This layer requires $O(WHCFK^2)$ floating-point multiplications in total, which are replaced with bit operations after binarization. Modern computers can compute 4 flops or 512 bit operations per clock cycle, and addition is ∼3× faster than multiplication. For an SRResNet (4x) model with 16 residual blocks on an image of size 1200 × 800 × 3, there are ∼$10^{14}$ flops in total, split about equally between multiplications and additions. In this model, more than 75% of the multiplication flops are replaced with bit operations after binarization. Thus, there is a potential speedup of ∼2× for this SRResNet (4x) model. Applying the same calculation to the SRResNet (2x) model results in a ∼5× computational gain.
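The per-layer count $WHCFK^2$ is easy to evaluate directly. The sketch below assumes a stride-1 convolution whose output has the same spatial size as its input; `conv_multiplications` is a hypothetical helper. The example uses a 3×3 layer with 64 input and 64 output channels, which matches the 36,864 binary weights per residual convolution implied by Table 4 (1,179,648 / 32).

```python
def conv_multiplications(width, height, c_in, c_out, kernel):
    """Multiplications for one stride-1, same-size convolution: W * H * C * F * K^2."""
    return width * height * c_in * c_out * kernel * kernel

# One 3x3, 64-to-64-channel residual convolution on a 1200 x 800 feature map.
per_layer = conv_multiplications(1200, 800, 64, 64, 3)
```

Here `per_layer` is 35,389,440,000 multiplications, each of which becomes a bit operation once the layer's filters are binarized.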
4.5. User Study
To better evaluate the visual quality of the results from real-weight networks and their binarized versions, we conducted a user study on the SR results. We developed a web-based system to display images and collect study results. The system presents two side-by-side images at a time, one from the real-weight network with scaling factor 2× or 4×, and the other from its binary counterpart. Each pair of images is randomly selected from the datasets Set5 [2] and Set14 [30] and placed in random order. We collected the preferences of 24 users, each of whom was asked to rate 20 pairs of images. The results are shown in Table 5: our binarized SR models perform similarly to the real-weight SR models.
5. Limitation and Future Work
The proposed binarization strategy is designed for SR tasks, but it is possible to apply it to other pixel-level tasks such as denoising and deblurring. However, specific network designs are needed when applying binarization to other tasks; e.g., special handling may be needed for the kernel estimation process in deblurring. How to apply the binarization strategy to other pixel-level image reconstruction tasks could be a good direction for future research.

Although we are able to replace the floating-point multiplications with bit-wise operations when computing convolutions, there is still room for efficiency improvement in binary networks. How to binarize input images for further bit-wise computation remains a difficult and open question.
6. Conclusions
In this paper, we introduce a network binarization strategy for super-resolution that does not lose much per-pixel accuracy. Inspired by the structure of the Laplacian pyramid and the statistics of its gradient histograms, we argue that it is appropriate to place binarized components in residual architectures, and to assign a learnable parameter to each binary convolutional filter. Qualitative and quantitative experiments show that residual-based SR networks with binarized components can generate results comparable to those of their real-weight counterparts, while obtaining significant improvements in model size and computational complexity.
References
[1] Edward H Adelson, Charles H Anderson, James R Bergen,
Peter J Burt, and Joan M Ogden. Pyramid methods in image
processing. RCA engineer, 29(6):33–41, 1984. 3
[2] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and
Marie Line Alberi-Morel. Low-complexity single-image
super-resolution based on nonnegative neighbor embedding.
2012. 5, 8
[3] Peter J Burt and Edward H Adelson. The laplacian pyramid
as a compact image code. In Readings in Computer Vision,
pages 671–679. Elsevier, 1987. 3
[4] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre
David. Binaryconnect: Training deep neural networks with
binary weights during propagations. In Advances in Neural
Information Processing Systems 28, pages 3123–3131. 2015.
2
[5] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-
Yaniv, and Yoshua Bengio. Binarynet: Training deep neural
networks with weights and activations constrained to +1 or
-1. CoRR, abs/1602.02830, 2016. 1, 2, 3
[6] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou
Tang. Image super-resolution using deep convolutional net-
works. IEEE transactions on pattern analysis and machine
intelligence, 38(2):295–307, 2016. 1, 2
[7] Chao Dong, Chen Change Loy, and Xiaoou Tang. Acceler-
ating the super-resolution convolutional neural network. In
European Conference on Computer Vision, pages 391–407.
Springer, 2016. 2
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. In Advances
in Neural Information Processing Systems 27, pages 2672–
2680. Curran Associates, Inc., 2014. 2
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun. Deep residual learning for image recognition. CoRR,
abs/1512.03385, 2015. 3
[10] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Sin-
gle image super-resolution from transformed self-exemplars.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 5197–5206, 2015. 5
[11] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-
Yaniv, and Yoshua Bengio. Quantized neural networks:
Training neural networks with low precision weights and ac-
tivations. arXiv preprint arXiv:1609.07061, 2016. 2
[12] Justin Johnson, Alexandre Alahi, and Fei-Fei Li. Percep-
tual losses for real-time style transfer and super-resolution.
CoRR, abs/1603.08155, 2016. 2
[13] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate
image super-resolution using very deep convolutional net-
works. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 1646–1654, 2016. 1,
2
[14] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-
recursive convolutional network for image super-resolution.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 1637–1645, 2016. 2
[15] Alex Krizhevsky and Geoffrey Hinton. Learning multiple
layers of features from tiny images. 2009. 2
[16] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and
Ming-Hsuan Yang. Deep Laplacian pyramid networks for fast and
accurate super-resolution. CoRR, abs/1704.03915, 2017. 2,
3, 4, 5, 7
[17] Yann LeCun, Corinna Cortes, and Christopher JC Burges.
MNIST handwritten digit database. AT&T Labs [Online].
Available: http://yann.lecun.com/exdb/mnist, 2, 2010. 2
[18] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero,
Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan
Wang, and Wenzhe Shi. Photo-realistic single image super-
resolution using a generative adversarial network. CoRR,
abs/1609.04802, 2016. 1, 2, 3, 4, 5, 7
[19] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks.
arXiv preprint arXiv:1605.04711, 2016. 2
[20] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and
Kyoung Mu Lee. Enhanced deep residual networks for sin-
gle image super-resolution. In Computer Vision and Pattern
Recognition Workshops (CVPRW), 2017 IEEE Conference
on, pages 1132–1140. IEEE, 2017. 2
[21] Kamal Nasrollahi and Thomas B. Moeslund. Super-
resolution: A comprehensive survey. Mach. Vision Appl.,
25(6):1423–1468, Aug. 2014. 2
[22] Alec Radford, Luke Metz, and Soumith Chintala. Unsuper-
vised representation learning with deep convolutional gen-
erative adversarial networks. CoRR, abs/1511.06434, 2015.
2
[23] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon,
and Ali Farhadi. XNOR-Net: ImageNet classification using
binary convolutional neural networks. CoRR, abs/1603.05279,
2016. 1, 2, 3, 4
[24] Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz,
Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan
Wang. Real-time single image and video super-resolution us-
ing an efficient sub-pixel convolutional neural network. 2016
IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR), pages 1874–1883, 2016. 2
[25] Daniel Soudry, Itay Hubara, and Ron Meir. Expectation
backpropagation: Parameter-free training of multilayer neu-
ral networks with continuous or discrete weights. In Z.
Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and
K. Q. Weinberger, editors, Advances in Neural Information
Processing Systems 27, pages 963–971. 2014. 2
[26] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-
Hsuan Yang, Lei Zhang, Bee Lim, Sanghyun Son, Heewon
Kim, Seungjun Nah, Kyoung Mu Lee, et al. NTIRE 2017
challenge on single image super-resolution: Methods and re-
sults. In Computer Vision and Pattern Recognition Work-
shops (CVPRW), 2017 IEEE Conference on, pages 1110–
1121. IEEE, 2017. 5
[27] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simon-
celli. Image quality assessment: from error visibility to struc-
tural similarity. IEEE Transactions on Image Processing,
13(4):600–612, April 2004. 3, 7
[28] Jin Yamanaka, Shigesumi Kuwashima, and Takio Kurita.
Fast and accurate image super resolution by deep CNN with
skip connection and network in network. In International
Conference on Neural Information Processing, pages 217–
225. Springer, 2017. 2
[29] Chih-Yuan Yang, Chao Ma, and Ming-Hsuan Yang. Single-
image super-resolution: A benchmark. In Proceedings of
European Conference on Computer Vision, 2014. 2
[30] Roman Zeyde, Michael Elad, and Matan Protter. On sin-
gle image scale-up using sparse-representations. In Interna-
tional conference on curves and surfaces, pages 711–730.
Springer, 2010. 5, 7, 8
[31] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and
Yurong Chen. Incremental network quantization: Towards
lossless CNNs with low-precision weights. arXiv preprint
arXiv:1702.03044, 2017. 2
[32] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen,
and Yuheng Zou. DoReFa-Net: Training low bitwidth
convolutional neural networks with low bitwidth gradients. arXiv
preprint arXiv:1606.06160, 2016. 2