
Image Super-Resolution via RL-CSC: When Residual Learning Meets Convolutional Sparse Coding

Menglei Zhang, Zhou Liu, Lei Yu
School of Electronic and Information, Wuhan University, China

{zmlhome, liuzhou, ly.wd}@whu.edu.cn

Abstract

We propose a simple yet effective model for Single Image Super-Resolution (SISR) by combining the merits of Residual Learning and Convolutional Sparse Coding (RL-CSC). Our model is inspired by the Learned Iterative Shrinkage-Threshold Algorithm (LISTA). We extend LISTA to its convolutional version and build the main part of our model by strictly following the convolutional form, which improves the network's interpretability. Specifically, the convolutional sparse codes of the input feature maps are learned in a recursive manner, and high-frequency information can be recovered from these CSCs. More importantly, residual learning is applied to alleviate the training difficulty when the network goes deeper. Extensive experiments on benchmark datasets demonstrate the effectiveness of our method. RL-CSC (30 layers) outperforms several recent state-of-the-art models, e.g., DRRN (52 layers) and MemNet (80 layers), in both accuracy and visual quality. Code and more results are available at https://github.com/axzml/RL-CSC.

1. Introduction

Single Image Super-Resolution (SISR), which aims to restore a visually pleasing high-resolution (HR) image from its low-resolution (LR) version, is still a challenging task within the computer vision research community [25, 27]. Since multiple solutions exist for the mapping from LR to HR space, SISR is highly ill-posed, and a variety of algorithms, especially the current leading learning-based methods [26, 4, 13, 14, 23, 28], have been proposed to address this problem.

In recent years, Convolutional Neural Networks (CNNs) have shown remarkable performance on various computer vision tasks [9, 36, 33] owing to their powerful capabilities of learning informative hierarchical representations. Dong et al. [4] first proposed the seminal CNN model for SR, termed SRCNN, which exploits a shallow convolutional

[Figure 1: scatter plot of PSNR (dB) versus number of parameters (K); plotted models: SRCNN (3), SCN (5), VDSR (20), DRCN (20), RED-Net (30), DRRN (52), MemNet (80), RL-CSC (30) and RL-CSC (53).]

Figure 1: PSNRs of recent CNN models versus the number of parameters for scale factor ×3 on Set5 [1]. The number of layers is marked in parentheses. Red points represent our models. RL-CSC with 25 recursions achieves competitive performance with MemNet [24]. When increasing the number of recursions without introducing any parameters, the performance of RL-CSC can be further improved.

neural network to learn a nonlinear LR-HR mapping in an end-to-end manner and dramatically overshadows conventional methods [32]. Inspired by VGG-net [21], Kim et al. [13] first constructed a very deep network of up to 20 layers named VDSR, which shows significant improvements over SRCNN. Techniques like skip connections and adjustable gradient clipping were introduced to mitigate the vanishing-gradient problem when the network goes deeper. Kim et al. further proposed a deeply-recursive convolutional network (DRCN) [14] with a very deep recursive layer, whose performance can be improved by increasing the recursion depth without introducing new parameters. Following the extraordinary success of ResNet [9] in image recognition, numerous ResNet- or residual-unit-based models for SR have emerged. SRResNet [15], made up of 16 residual units, sets a new state of the art for large upscaling factors (×4). EDSR [16]


removes Batch Normalization (BN) [12] layers in residual units and produces astonishing results in both qualitative and quantitative measurements. Tai et al. [23] proposed DRRN, in which modified residual units are learned in a recursive manner, leading to a deeper yet concise network. They further introduced memory blocks to build MemNet [24] based on dense connections. In [17], an encoding-decoding network named RED-Net was proposed to take full advantage of many symmetric skip connections.

Despite achieving amazing success in SR, the aforementioned models usually lack convincing analyses of why they work. Numerous questions are still to be explored, e.g., what role each module plays in the network, whether BN is needed, etc. In the past decades, sparse representation, with its strong theoretical support, has been widely used [32, 37] due to its good performance. It is still valuable even nowadays, when data-driven models have become more and more popular. Wang et al. [29] introduced a sparse coding based network (SCN) for the image super-resolution task by combining the powerful learning ability of neural networks and people's domain expertise in sparse coding, which fully exploits the approximation of sparse coding learned from LISTA [6]. With considerable improvements of SCN over traditional sparse coding methods [32] and SRCNN [4] observed, the authors claim that people's domain knowledge is still valuable and that, when combined with the merits of deep learning, results can benefit a lot. However, layers in SCN strictly correspond to each step in the procedure of traditional sparse coding based image SR, so the network still attempts to learn the mapping from LR to HR images. This turns out to be inefficient, as indicated in [13, 14, 23], which limits further improvement of the results. Moreover, as the experimental results of SCN show no observable advancement when the number of recurrent stages k is increased, the authors finally choose k = 1, causing SCN to become a shallow network.

Convolutional Sparse Coding (CSC) has attracted much attention from researchers [34, 2, 10, 5] for years. As CSC inherently takes the consistency constraint of pixels in overlapped patches into consideration, Gu et al. [7] proposed the CSC based SR (CSC-SR) model and revealed the potential of CSC for image super-resolution over conventional sparse coding methods. In order to build a computationally efficient CSC model, Sreter et al. [22] introduced a convolutional recurrent sparse auto-encoder by extending the LISTA method to a convolutional version, and demonstrated its efficiency in image denoising and inpainting tasks.

To add more interpretability to CNN models for SR and inspire more research on this topic, we propose a novel, simple yet effective model for SR that combines the merits of Residual Learning and Convolutional Sparse Coding (RL-CSC). In a nutshell, the contributions of

this paper are three-fold:

1. Unlike many researchers who refer to networks proposed in the field of image recognition for inspiration, our model, termed RL-CSC, is deduced from LISTA. We thus provide a new, effective way to facilitate model construction, in which every module has well-defined interpretability.

2. Analyses of the advantages over [29, 14, 23] are discussed in detail.

3. Thanks to the guidance of sparse coding theory, RL-CSC (30 layers) achieves competitive results with DRRN [23] (52 layers) and MemNet [24] (up to 80 layers) in the image super-resolution task. Figure 1 shows the performance of several recent CNN models [4, 29, 13, 14, 17, 23, 24] on the SR task.

2. Related Work

2.1. Sparse Coding and LISTA

Sparse Coding (SC) has been widely used in a variety of applications such as image classification, super-resolution and visual tracking [37]. The most popular form of sparse coding attempts to find the optimal sparse code that minimizes the objective function (1), which combines a data fitting term and an $\ell_1$-norm sparsity-inducing regularization:

$$\arg\min_{z}\ \frac{1}{2}\|y - Dz\|_2^2 + \lambda\|z\|_1, \qquad (1)$$

where $z \in \mathbb{R}^m$ is the sparse representation of a given input signal $y \in \mathbb{R}^n$ w.r.t. an $n \times m$ dictionary $D$, and the regularization coefficient $\lambda$ is used to control the sparsity penalty. $m > n$ is satisfied when $D$ is overcomplete.

One popular method to optimize (1) is the so-called Iterative Shrinkage Thresholding Algorithm (ISTA) [3, 37]. At the $k$th iteration, the sparse code is updated as:

$$z_{k+1} = h_{\lambda/L}\left(z_k + \frac{1}{L} D^T (y - D z_k)\right), \qquad (2)$$

where $L \geq \mu_{\max}$, with $\mu_{\max}$ denoting the largest eigenvalue of $D^T D$, and $h_\theta(\cdot)$ is an element-wise soft threshold operator defined as

$$h_\theta(\alpha) = \mathrm{sign}(\alpha)\max(|\alpha| - \theta, 0). \qquad (3)$$
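As a concrete illustration of the update in (2)–(3), the following is a minimal NumPy sketch of ISTA; the dictionary, signal and iteration count are made-up toy values, not quantities from the paper.

```python
import numpy as np

def soft_threshold(alpha, theta):
    # Eq. (3): element-wise soft thresholding.
    return np.sign(alpha) * np.maximum(np.abs(alpha) - theta, 0.0)

def ista(y, D, lam, num_iters=100):
    # Eq. (2): z_{k+1} = h_{lam/L}(z_k + (1/L) D^T (y - D z_k)),
    # with L set to the largest eigenvalue of D^T D.
    L = np.linalg.eigvalsh(D.T @ D).max()
    z = np.zeros(D.shape[1])
    for _ in range(num_iters):
        z = soft_threshold(z + (D.T @ (y - D @ z)) / L, lam / L)
    return z

# Toy example: overcomplete dictionary with n = 32, m = 64.
rng = np.random.default_rng(0)
D = rng.standard_normal((32, 64))
y = rng.standard_normal(32)
z = ista(y, D, lam=0.1)
```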

However, ISTA suffers from slow convergence, which limits its application in real-time situations. To address this issue, Gregor and LeCun [6] proposed a fast algorithm termed Learned ISTA (LISTA) that produces approximate estimates of the sparse code with the power of a neural network. LISTA can be obtained by rewriting (2) as

$$z_{k+1} = h_\theta\left(W_e y + G z_k\right), \qquad (4)$$


[Figure 2: schematic of the RL-CSC framework — feature extraction layers F0 and F1, a convolutional LISTA sub-network (W1, S) unrolled for K recursions, reconstruction layers W2 and H, and a global residual addition; see the caption below.]

Figure 2: The proposed RL-CSC framework. Our model takes an interpolated LR image Iy as input and predicts the residual component R. Two convolution layers F0 and F1 are used for feature extraction and output the feature maps y, which then go through a convolutional LISTA based sub-network (with K recursions, surrounded by the dashed box). When the sparse feature maps z are obtained, W2 is utilized to recover the features of high-frequency information and the convolution layer H maps the features to the residual image R. The final HR image Ix can be restored by the addition of the ILR image Iy and the residual image R. The unfolded version of the recursive sub-network is shown at the bottom.

where $\frac{1}{L}D^T$ is replaced with $W_e \in \mathbb{R}^{m \times n}$, $I - \frac{1}{L}D^T D$ with $G \in \mathbb{R}^{m \times m}$, and $\frac{\lambda}{L}$ with a vector $\theta \in \mathbb{R}^m_+$ (so every entry has its own threshold value). Unlike ISTA, the parameters $W_e$, $G$ and $\theta$ in LISTA are all learned from training samples using the back-propagation procedure. After a fixed number of iterations, the best possible approximation of the sparse code is produced.
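To make the unrolling in (4) concrete, below is a minimal PyTorch sketch of a LISTA encoder with a fixed number of iterations and the weight G shared across them; the layer sizes and iteration count are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class LISTA(nn.Module):
    """Sketch of LISTA (Eq. (4)): z_{k+1} = h_theta(W_e y + G z_k),
    with W_e, G and the per-entry thresholds theta learned by back-propagation."""
    def __init__(self, n=32, m=64, num_iters=3):
        super().__init__()
        self.We = nn.Linear(n, m, bias=False)   # replaces (1/L) D^T
        self.G = nn.Linear(m, m, bias=False)    # replaces I - (1/L) D^T D, shared over iterations
        self.theta = nn.Parameter(0.1 * torch.ones(m))  # one threshold per entry
        self.num_iters = num_iters

    def soft_threshold(self, x):
        # Eq. (3), applied element-wise with a learned threshold vector.
        return torch.sign(x) * torch.relu(torch.abs(x) - self.theta)

    def forward(self, y):
        Wy = self.We(y)
        z = self.soft_threshold(Wy)
        for _ in range(self.num_iters - 1):
            z = self.soft_threshold(Wy + self.G(z))
        return z
```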

2.2. Convolutional Sparse Coding (CSC)

Most conventional sparse coding based algorithms divide the whole image into overlapped patches and cope with them separately, so the consistency constraint, i.e., that pixels in the overlapping area of adjacent patches should be exactly the same, is not considered. The convolutional sparse coding (CSC) model [34, 2, 10, 7, 22, 5] is inherently suitable for this issue, as it processes the whole image directly:

$$\arg\min_{f, Z}\ \frac{1}{2}\left\|Y - \sum_{i=1}^{N} f_i \otimes Z_i\right\|_2^2 + \lambda \sum_{i=1}^{N} \|Z_i\|_1, \qquad (5)$$

where $Y \in \mathbb{R}^{m \times n}$ represents an input image, and $\{f_i\}_{i=1}^{N}$ is a group of $s \times s$ convolution filters with their respective sparse feature maps $Z_i \in \mathbb{R}^{m \times n}$. The reconstructed image can be derived as a summation of convolution results:

$$Y = \sum_{i=1}^{N} f_i \otimes Z_i. \qquad (6)$$
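As a small illustration of (6), the sum of per-filter convolutions can be evaluated with a single conv2d call by stacking the feature maps as input channels; the sizes below are made-up toy values.

```python
import torch
import torch.nn.functional as F

# Toy sizes: N filters of spatial size s x s, an image of size m x n.
N, s, m, n = 8, 5, 64, 64
filters = torch.randn(N, s, s)        # {f_i}
feature_maps = torch.randn(N, m, n)   # {Z_i}

# Eq. (6): Y = sum_i f_i (convolved with) Z_i. With the N feature maps as input
# channels and the N filters forming one output channel, conv2d performs the
# per-channel products and the summation in one call. conv2d actually computes
# cross-correlation, so the kernels are flipped to obtain true convolution.
weight = filters.flip(-1).flip(-2).unsqueeze(0)                   # (1, N, s, s)
Y = F.conv2d(feature_maps.unsqueeze(0), weight, padding=s // 2)   # (1, 1, m, n)
```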

The current leading strategies for CSC are based on the Alternating Direction Method of Multipliers (ADMM) [2, 30, 10, 5]. However, when these methods are utilized to solve (5), the whole training set is optimized at once, which tends to cause a heavy memory burden.

Gu et al. [7] proposed a CSC based SR (CSC-SR) method which takes the consistency constraint of neighboring patches into consideration for better image reconstruction. SA-ADMM [38] is used in their work to alleviate the memory burden issue of ADMM.

2.3. Residual Learning

Residual learning for SR was first introduced in VDSR [13] to tackle the vanishing/exploding gradients issue when the network goes deeper. As the LR image and the HR image are similar to a large extent, fitting the residual mapping seems easier for optimization. Given a training set of $N$ LR-HR pairs $\{I_y^{(i)}, I_x^{(i)}\}_{i=1}^{N}$, the residual image is defined as $r^{(i)} = I_x^{(i)} - I_y^{(i)}$, and the goal is to learn a model $f$ with parameters $\Theta$ that minimizes the following objective function:

$$\mathcal{L}(\Theta) = \frac{1}{N}\sum_{i=1}^{N}\left\| f\left(I_y^{(i)}\right) - r^{(i)} \right\|_2^2. \qquad (7)$$

VDSR [13] uses a single skip connection to link the input Interpolated LR (ILR) image and the final output of the network for HR image reconstruction, termed Global


Residual Learning (GRL), which benefits both the convergence speed and the reconstruction accuracy a great deal. In DRRN [23], both Global and Local Residual Learning (LRL) are adopted to help the gradient flow.

3. Proposed Method

3.1. Feature Extraction from ILR Image

As illustrated in Figure 2, our model takes the Interpolated Low-Resolution (ILR) image $I_y$ as input and predicts the output HR image $I_x$. Two convolution layers, $F_0 \in \mathbb{R}^{n \times c \times s \times s}$, consisting of $n$ filters of size $c \times s \times s$, and $F_1 \in \mathbb{R}^{n \times n \times s \times s}$, containing $n$ filters of size $n \times s \times s$, are utilized for hierarchical feature extraction from the ILR image:

$$y = \mathrm{ReLU}\left(F_1 \otimes \mathrm{ReLU}(F_0 \otimes I_y)\right), \qquad (8)$$

where $\otimes$ is the convolution operator, and $\mathrm{ReLU}(\cdot)$ denotes the Rectified Linear Unit (ReLU) activation function.

3.2. Learning CSC of ILR Features

It is worth noting that the CSC model can be considered as a special case of the conventional SC model, since the convolution operation can be viewed as a matrix multiplication by converting one of the inputs into a Toeplitz matrix. So the CSC model (5) has a similar structure to the traditional SC model (1) when the convolution operation is transformed to matrix multiplication. In addition, LISTA is an efficient and effective tool to learn the approximate sparse coding vector of (1). It takes the exact form of equation (4), with the weights $W_e$ and $G$ represented as linear layers, and it can be viewed as a feed-forward neural network with $G$ shared over layers.

In order to solve (5) efficiently, we extend (4) to its convolutional version by replacing $W_e \in \mathbb{R}^{m \times n}$ with $W_1 \in \mathbb{R}^{m \times n \times s \times s}$ and $G \in \mathbb{R}^{m \times m}$ with $S \in \mathbb{R}^{m \times m \times s \times s}$. The convolutional case of (4) can be reformulated as:

$$z_{k+1} = h_\theta\left(W_1 \otimes y + S \otimes z_k\right). \qquad (9)$$

The sparse feature maps $z \in \mathbb{R}^{m \times m \times c \times c}$ are learned after $K$ recursions.

As for the activation function $h_\theta$, [19] reveals two important facts: (1) the expressiveness of the sparsity inspired model is not affected even by restricting the coefficients to be nonnegative; (2) the ReLU and the soft nonnegative thresholding operator are equal, that is:

$$h^+_\theta(\alpha) = \max(\alpha - \theta, 0) = \mathrm{ReLU}(\alpha - \theta). \qquad (10)$$

So we choose ReLU as the activation function in RL-CSC.
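The recursive sub-network defined by (9) and (10) can be written compactly in PyTorch; the following is a minimal sketch under assumed channel sizes, where the thresholds θ are absorbed into the convolution biases so that ReLU alone realizes the soft nonnegative thresholding.

```python
import torch
import torch.nn as nn

class ConvLISTA(nn.Module):
    """Sketch of the convolutional LISTA recursion of Eq. (9),
    z_{k+1} = ReLU(W1 * y + S * z_k), with S shared over all K recursions."""
    def __init__(self, in_channels=128, code_channels=256, kernel_size=3, K=25):
        super().__init__()
        pad = kernel_size // 2
        self.W1 = nn.Conv2d(in_channels, code_channels, kernel_size, padding=pad)
        self.S = nn.Conv2d(code_channels, code_channels, kernel_size, padding=pad)
        self.K = K

    def forward(self, y):
        Wy = self.W1(y)                      # W1 (convolved with) y, computed once
        z = torch.relu(Wy)                   # first recursion: z_1 = h_theta(W1 * y)
        for _ in range(self.K - 1):
            z = torch.relu(Wy + self.S(z))   # Eq. (9) with ReLU as the threshold (Eq. (10))
        return z
```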

3.3. Recovery of Residual Image

When the sparse feature maps $z$ are obtained, they are fed into a convolution layer $W_2 \in \mathbb{R}^{m \times n \times s \times s}$ to recover the features of high-frequency information. The last convolution layer $H \in \mathbb{R}^{c \times n \times s \times s}$ is used for high-frequency information reconstruction:

$$R = H \otimes \mathrm{ReLU}(W_2 \otimes z). \qquad (11)$$

Note that we pad zeros before all convolution operations to keep all the feature maps the same size, which is a common strategy used in a variety of methods [13, 14, 23, 24]. So the residual image $R$ has the same size as the input ILR image $I_y$, and the final HR image $I_x$ is reconstructed by the addition of $I_y$ and $R$:

$$I_x = I_y + R. \qquad (12)$$

We build our model by strictly following these analyses.

3.4. Network Structure

The entire network structure of RL-CSC is illustrated in Figure 2. There are 6 trainable layers in our model in total: two convolution layers F0 and F1 used for feature extraction, W1 and S for learning the CSC, and W2 and H for residual image reconstruction. The weight of S is shared across every recursion. When K recursions are applied in the training process, the depth d of the network can be calculated as:

d = K + 5. (13)
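Putting (8), (9), (11) and (12) together, the whole network can be sketched in PyTorch as below; the channel widths and kernel sizes are illustrative assumptions chosen so the layers chain consistently, not a verbatim reproduction of the released code. With K = 25 this corresponds to the 30-layer model used in the experiments.

```python
import torch
import torch.nn as nn

class RLCSC(nn.Module):
    """Sketch of RL-CSC: feature extraction (F0, F1), K recursions of
    convolutional LISTA (W1, S), residual recovery (W2, H), and the
    global residual addition of Eq. (12). Depth d = K + 5 (Eq. (13))."""
    def __init__(self, channels=1, feat=128, code=256, kernel_size=3, K=25):
        super().__init__()
        pad = kernel_size // 2
        self.F0 = nn.Conv2d(channels, feat, kernel_size, padding=pad)
        self.F1 = nn.Conv2d(feat, feat, kernel_size, padding=pad)
        self.W1 = nn.Conv2d(feat, code, kernel_size, padding=pad)
        self.S = nn.Conv2d(code, code, kernel_size, padding=pad)   # shared over recursions
        self.W2 = nn.Conv2d(code, feat, kernel_size, padding=pad)
        self.H = nn.Conv2d(feat, channels, kernel_size, padding=pad)
        self.K = K

    def forward(self, ilr):
        # Eq. (8): feature extraction from the interpolated LR image.
        y = torch.relu(self.F1(torch.relu(self.F0(ilr))))
        # Eq. (9)/(10): K recursions of convolutional LISTA with ReLU thresholding.
        Wy = self.W1(y)
        z = torch.relu(Wy)
        for _ in range(self.K - 1):
            z = torch.relu(Wy + self.S(z))
        # Eq. (11): recover the residual image from the sparse feature maps.
        residual = self.H(torch.relu(self.W2(z)))
        # Eq. (12): global residual learning.
        return ilr + residual
```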

The Mean Square Error (MSE) loss function is exploited in our training process. Given $N$ LR-HR image patch pairs $\{I_y^{(i)}, I_x^{(i)}\}_{i=1}^{N}$ as a training set, our goal is to minimize the following objective function with RL-CSC:

$$\mathcal{L}(\Theta) = \frac{1}{N}\sum_{i=1}^{N}\left\| \text{RL-CSC}\left(I_y^{(i)}\right) + I_y^{(i)} - I_x^{(i)} \right\|_2^2, \qquad (14)$$

where $\Theta$ denotes the learnable parameters. Stochastic gradient descent (SGD) is used for optimization, and we implement our model using the PyTorch [20] framework.

4. Discussions

In this section, we discuss the advantages of RL-CSC over several recent CNN models for SR in which a recursive learning strategy is applied. Specifically, DRRN [23], SCN [29] and DRCN [14] are used for comparison. The simplified structures of these models are shown in Figure 3. "Conv" is the abbreviation for Convolution layer, "BN" represents Batch Normalization [12], "Linear" stands for Linear layer and "Threshold" means Soft Threshold operator. The digits on the left of the recursion line indicate the number of recursions.

Difference to DRRN. The main part of DRRN [23] is the recursive block structure, in which several residual units are stacked. To further improve the performance, a


[Figure 3: simplified block diagrams of (a) DRRN (25 recursions of pre-activated BN-ReLU-Conv residual units), (b) SCN (Conv and Linear/Threshold layers with 1 recursion), (c) DRCN (16 recursions with weighted outputs ω1, ω2, ..., ω16), and (d) RL-CSC (25 recursions of pre-activated convolutions with residual additions); see the caption below.]

Figure 3: Network structures of: (a) DRRN [23]. (b) SCN [29]. (c) DRCN [14]. (d) Our model.

multi-path structure (all residual units share the same input) and a pre-activation structure (activation layers come before the weight layers) are utilized. These strategies are proved to be effective. The interesting part is that RL-CSC, deduced from LISTA [6], inherently includes a multi-path structure and uses pre-activation. In addition, guided by (9), RL-CSC contains no BN layers at all. BN consumes a large amount of GPU memory and increases computational complexity. Experiments on this topic are conducted in Section 5.4. Furthermore, every module in RL-CSC has good interpretability, which helps the choice of parameter settings for better performance. Experimental results on benchmark datasets under commonly-used assessments demonstrate the superiority of RL-CSC in Section 5.3.

Difference to SCN. There are three main differences between SCN [29] and RL-CSC. Firstly, RL-CSC (30 layers) is much deeper than SCN (5 layers). As indicated in [13], a deeper network has a larger receptive field, which means the network can utilize more contextual information in an image to infer image details. Secondly, we extend LISTA to its convolutional version in (9), instead of using linear layers, so more hierarchical information is extracted. Last but not least, RL-CSC adopts residual learning, which is a powerful tool for training deeper networks. With the help of residual learning, we can use more recursions, i.e., 25 or even 48, to achieve better performance.

Difference to DRCN. In the recursive part, RL-CSC differs from DRCN [14] in two aspects: one is Local Residual Learning (LRL) [23] and the other is pre-activation. Besides, DRCN is not easy to train, so recursive supervision and skip connections are introduced to help the network converge. Moreover, an ensemble strategy (in Figure 3(c), the final output is the weighted average of all intermediate predictions) is used to further improve the performance. RL-CSC is free from these strategies and can be easily trained with more recursions. Advantages of RL-CSC are further illustrated in Section 5.3.

5. Experimental Results

In this section, the performance of our method is evaluated on four benchmark datasets. We first give a brief introduction to the datasets used for training and testing. Then the implementation details are provided. Finally, comparisons with state-of-the-arts are presented and more analyses of RL-CSC are illustrated.

5.1. Datasets

Following [13, 23], our training set consists of 291 images, where 91 of these images are from Yang et al. [32], with the addition of 200 images from the Berkeley Segmentation Dataset [18]. During testing, we choose the datasets Set5 [1] and Set14 [35], which are widely used for benchmarking. Moreover, BSD100 [18], consisting of 100 natural images, is used for testing. Finally, Urban100, a set of 100 urban images introduced by Huang et al. [11], is also employed. Both the Peak Signal-to-Noise Ratio (PSNR) and the Structural SIMilarity (SSIM) on the Y channel (i.e., luminance) of the transformed YCbCr space are calculated for evaluation.
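For reference, a minimal sketch of the Y-channel PSNR computation described above is given below (ITU-R BT.601 luminance, with optional border cropping as in Section 5.3); the exact conversion and cropping used by the authors may differ slightly.

```python
import numpy as np

def rgb_to_y(img):
    # Y channel of the BT.601 YCbCr transform, for RGB values in [0, 255].
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(hr, sr, shave=0):
    # PSNR on the luminance channel; `shave` crops border pixels before evaluation.
    y_hr = rgb_to_y(hr.astype(np.float64))
    y_sr = rgb_to_y(sr.astype(np.float64))
    if shave > 0:
        y_hr = y_hr[shave:-shave, shave:-shave]
        y_sr = y_sr[shave:-shave, shave:-shave]
    mse = np.mean((y_hr - y_sr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```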

5.2. Implementation details

To enlarge the training set, data augmentation, which includes flipping (horizontally and vertically), rotating (90, 180, and 270 degrees) and scaling (0.7, 0.5 and 0.4), is performed on each image of the 291-image dataset. In addition, inspired by prior works, i.e., VDSR [13] and DRRN [23], we also train a single multi-scale model, which means scale augmentation is exploited by combining images of different scales (×2, ×3 and ×4) into one training set. This is done not only for network scalability, but also for fair comparison with other state-of-the-arts. Furthermore, all training images are partitioned into 33 × 33 patches with a stride of 33, providing a total of 1,929,728 LR-HR training pairs.

The dimensions of all convolution layers are determined as follows: $F_0 \in \mathbb{R}^{128 \times 1 \times 3 \times 3}$, $F_1 \in \mathbb{R}^{128 \times 128 \times 3 \times 3}$, $W_1 \in \mathbb{R}^{256 \times 128 \times 3 \times 3}$, $S \in \mathbb{R}^{256 \times 256 \times 3 \times 3}$, $W_2 \in \mathbb{R}^{256 \times 128 \times 3 \times 3}$, $H \in \mathbb{R}^{1 \times 128 \times 3 \times 3}$. As for the number of recursions, we choose K = 25 in our final model, so the depth of RL-CSC is 30 according to (13). Further discussions about the number of network parameters and K are given in Section 5.4.

We follow the same strategy as He et al. [8] for weight initialization, where all weights are drawn from a normal distribution with zero mean and variance $2/n_{\mathrm{out}}$, where $n_{\mathrm{out}}$ is the number of output units. The network is optimized using SGD with a mini-batch size of 128, a momentum parameter of 0.9 and a weight decay of $10^{-4}$. The learning


Dataset    Scale  Bicubic        SRCNN          VDSR           DRCN           DRRN           MemNet         RL-CSC
Set5       ×2     33.66/0.9299   36.66/0.9542   37.53/0.9587   37.63/0.9588   37.74/0.9591   37.78/0.9597   37.79/0.9600
           ×3     30.39/0.8682   32.75/0.9090   33.66/0.9213   33.82/0.9226   34.03/0.9244   34.09/0.9248   34.11/0.9254
           ×4     28.42/0.8104   30.48/0.8628   31.35/0.8838   31.53/0.8854   31.68/0.8888   31.74/0.8893   31.82/0.8907
Set14      ×2     30.24/0.8688   32.45/0.9067   33.03/0.9124   33.04/0.9118   33.23/0.9136   33.28/0.9142   33.33/0.9152
           ×3     27.55/0.7742   29.30/0.8215   29.77/0.8314   29.76/0.8311   29.96/0.8349   30.00/0.8350   29.99/0.8359
           ×4     26.00/0.7027   27.50/0.7513   28.01/0.7674   28.02/0.7670   28.21/0.7721   28.26/0.7723   28.29/0.7741
BSD100     ×2     29.56/0.8431   31.36/0.8879   31.90/0.8960   31.85/0.8942   32.05/0.8973   32.08/0.8978   32.09/0.8985
           ×3     27.21/0.7385   28.41/0.7863   28.82/0.7976   28.80/0.7963   28.95/0.8004   28.96/0.8001   28.99/0.8021
           ×4     25.96/0.6675   26.90/0.7101   27.29/0.7251   27.23/0.7233   27.38/0.7284   27.40/0.7281   27.44/0.7302
Urban100   ×2     26.88/0.8403   29.50/0.8946   30.76/0.9140   30.75/0.9133   31.23/0.9188   31.31/0.9195   31.36/0.9207
           ×3     24.46/0.7349   26.24/0.7989   27.14/0.8279   27.15/0.8276   27.53/0.8378   27.56/0.8376   27.64/0.8403
           ×4     23.14/0.6577   24.52/0.7221   25.18/0.7524   25.14/0.7510   25.44/0.7638   25.50/0.7630   25.59/0.7680

Table 1: Benchmark results. Average PSNR/SSIMs for scale factor ×2, ×3 and ×4 on datasets Set5, Set14, BSD100 and Urban100. Red color indicates the best performance and blue color indicates the second best performance.

Dataset    Scale  Bicubic  SRCNN [4]  SelfEx [11]  VDSR [13]  DRRN B1U9  DRRN B1U25  RL-CSC
Set5       ×2     6.083    8.036      7.811        8.569      8.583      8.671       9.095
           ×3     3.580    4.658      4.748        5.221      5.241      5.397       5.565
           ×4     2.329    2.991      3.166        3.547      3.581      3.703       3.791
Set14      ×2     6.105    7.784      7.591        8.178      8.181      8.320       8.656
           ×3     3.473    4.338      4.371        4.730      4.732      4.878       4.992
           ×4     2.237    2.751      2.893        3.133      3.147      3.252       3.324
Urban100   ×2     6.245    7.989      7.937        8.645      8.653      8.917       9.372
           ×3     3.620    4.584      4.843        5.194      5.259      5.456       5.662
           ×4     2.361    2.963      3.314        3.496      3.536      3.676       3.816

Table 2: Benchmark results. Average IFCs for scale factor ×2, ×3 and ×4 on datasets Set5, Set14 and Urban100. Red color indicates the best performance and blue color indicates the second best performance.

rate is initially set to 0.1 and then decreased by a factor of 10 every 10 epochs. We train for a total of 35 epochs, as no further improvements of the loss are observed after that. For maximal convergence speed, we utilize the adjustable gradient clipping strategy stated in [13], with gradients clipped to [−θ, θ], where θ = 0.4 is the gradient clipping parameter. An NVIDIA Titan Xp GPU is used to train our model with K = 25, which takes approximately four and a half days.
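The training setup described above can be summarized in the following PyTorch sketch; it assumes an RL-CSC-style model that outputs the restored image (as in the sketch of Section 3.4) and a data loader yielding (ILR, HR) patch batches, and the helper names are hypothetical.

```python
import torch
import torch.nn as nn

def init_weights(module):
    # He et al. [8]: zero-mean normal with variance 2 / n_out.
    if isinstance(module, nn.Conv2d):
        n_out = module.out_channels * module.kernel_size[0] * module.kernel_size[1]
        module.weight.data.normal_(0.0, (2.0 / n_out) ** 0.5)
        if module.bias is not None:
            module.bias.data.zero_()

def train(model, loader, epochs=35, lr=0.1, clip=0.4):
    model.apply(init_weights)
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=1e-4)
    # Decrease the learning rate by a factor of 10 every 10 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    for _ in range(epochs):
        for ilr, hr in loader:
            sr = model(ilr)              # model adds the global residual internally (Eq. (12))
            loss = criterion(sr, hr)     # equivalent to the objective of Eq. (14)
            optimizer.zero_grad()
            loss.backward()
            # Gradient clipping to [-0.4, 0.4] as stated in the text.
            nn.utils.clip_grad_value_(model.parameters(), clip)
            optimizer.step()
        scheduler.step()
```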

5.3. Comparison with State of the Arts

We now compare the proposed RL-CSC model with other recent state-of-the-art methods. Specifically, SRCNN [4], VDSR [13], DRCN [14], DRRN [23] and MemNet [24] are used as benchmarks. All of these models apply bicubic interpolation to the original LR images before passing them to the networks. As the prior methods crop image pixels near the borders before evaluation, for fair comparison we crop the same amount of pixels as well, even though this is unnecessary for our method.

Table 1 shows the PSNR and SSIM on the four benchmark testing sets, where the results of other methods are obtained from [13, 14, 23, 24]. Our RL-CSC model with 30 layers outperforms DRRN (52 layers) and MemNet (80 layers) on all datasets and scale factors (in both PSNR and SSIM).

Furthermore, the Information Fidelity Criterion (IFC) metric, which has the highest correlation with perceptual scores for SR evaluation [31], is also used for comparison. Experimental results are summarized in Table 2. The IFCs of [4, 11, 13] and DRRN are obtained from [23]. Following [23], BSD100 is not evaluated. It is obvious that our method achieves better performance than the other methods on all datasets and scale factors.

Qualitative results are provided in Figures 4, 5 and 6. Our method tends to produce sharper edges and more correct textures, while the other methods' images may be blurred or distorted.

5.4. K, Residual Learning, Number of Filters and Batch Normalization

The number of recursions K is a key parameter in our model. When K is increased, a deeper RL-CSC model is constructed. We have trained and tested RL-CSC with 15, 20, 25 and 48 recursions; according to (13), the depths of these models are 20, 25, 30 and 53, respectively. The results are presented in Figure 7. The performance curves clearly show that increasing K promotes the final performance (K = 15: 33.98 dB, K = 20: 34.06 dB, K = 25: 34.11 dB, K = 48: 34.16 dB), which indicates that deeper is better. Similar conclusions are observed in LISTA [6], where more iterations help decrease the prediction error. However, when we extend LISTA to its convolutional version and attempt to combine it with the powerful learning ability of CNNs, the characteristics of CNNs themselves must also be considered. With


[Figure 4 panels with (PSNR, SSIM): Ground Truth; Bicubic (28.50, 0.8285); SRCNN [4] (29.40, 0.8561); VDSR [13] (29.54, 0.8651); DRCN [14] (30.29, 0.8653); DRRN [23] (29.74, 0.8671); MemNet [24] (30.19, 0.8698); RL-CSC (30.54, 0.8705).]

Figure 4: SR results of “8023” from BSD100 with scale factor ×4. The direction of the stripes on the feathers is correctly restored in RL-CSC, while other methods fail to recover the pattern.

[Figure 5 panels with (PSNR, SSIM): Ground Truth; Bicubic (23.71, 0.8746); SRCNN [4] (27.04, 0.9392); VDSR [13] (27.86, 0.9616); DRCN [14] (27.67, 0.9609); DRRN [23] (28.70, 0.9702); MemNet [24] (28.92, 0.9711); RL-CSC (28.60, 0.9705).]

Figure 5: SR results of “ppt3” from Set14 with scale factor ×3. Texts in RL-CSC are sharp while character edges are blurry in other methods.

[Figure 6 panels with (PSNR, SSIM): Ground Truth; Bicubic (21.57, 0.6283); SRCNN [4] (22.03, 0.6779); VDSR [13] (22.15, 0.6920); DRCN [14] (22.11, 0.6867); DRRN [23] (21.92, 0.6897); MemNet [24] (22.11, 0.6927); RL-CSC (22.24, 0.6976).]

Figure 6: SR results of “img076” from Urban100 with scale factor ×4. More details are recovered by RL-CSC, while others produce blurry visual results.

more recursions used, deeper networks tend to be bothered by gradient vanishing/exploding problems. Residual learning is a useful tool that not only solves these difficulties but also helps the network converge faster. We remove the identity branch in RL-CSC and use the same parameter settings as stated in Section 5.2 to train the new network. The results are summarized in Table 3. Without residual learning, the new network cannot even converge. We stop the training process early, at the 14th epoch.

We also evaluate our model with different numbers of filters. Specifically, two types of parameter settings are applied:

i) $F_0 \in \mathbb{R}^{128 \times 1 \times 3 \times 3}$, $F_1 \in \mathbb{R}^{128 \times 128 \times 3 \times 3}$, $W_1 \in \mathbb{R}^{256 \times 128 \times 3 \times 3}$, $S \in \mathbb{R}^{256 \times 256 \times 3 \times 3}$, $W_2 \in \mathbb{R}^{256 \times 128 \times 3 \times 3}$, $H \in \mathbb{R}^{1 \times 128 \times 3 \times 3}$, and K = 15. The total number of parameters is about 1,329K;

[Figure 7: PSNR (dB) on Set5 (×3) over training epochs for K = 15, 20, 25 and 48.]

Figure 7: PSNR for RL-CSC with different numbers of recursions. The models are tested on Set5 with scale factor ×3.

Epoch           1       5       10      13
With Residual   33.24   33.61   33.61   34.02
No Residual     6.54    6.54    6.54    6.54

Table 3: PSNR (dB) for RL-CSC and its non-residual counterpart. Tests on Set5 with scale factor ×3.

[Figure 8: PSNR (dB) on Set5 (×3) over training epochs for parameter settings 1 and 2.]

Figure 8: Results on two types of parameter settings. The tests are conducted on Set5 with scale factor ×3.

ii) $F_0 \in \mathbb{R}^{128 \times 1 \times 3 \times 3}$, $F_1 \in \mathbb{R}^{128 \times 128 \times 3 \times 3}$, $W_1 \in \mathbb{R}^{128 \times 128 \times 3 \times 3}$, $S \in \mathbb{R}^{128 \times 128 \times 3 \times 3}$, $W_2 \in \mathbb{R}^{128 \times 128 \times 3 \times 3}$, $H \in \mathbb{R}^{1 \times 128 \times 3 \times 3}$, and K = 15. The total number of parameters is approximately 592K, which is less than VDSR's 664K.

Model         Framework  Batch size  Patch size  Memory
DRRN (52)     Caffe      128         31          > 12 GB
DRRN (20)     Caffe      64          31          9,043 MB
RL-CSC (30)   PyTorch    64          31          4,955 MB
RL-CSC (30)   PyTorch    128         31          9,263 MB
RL-CSC (30)   PyTorch    128         33          9,421 MB

Table 4: GPU memory usage of different models. RL-CSC with 30 layers is evaluated, compared to DRRN [23] with 20 and 52 layers. A Titan Xp with 12 GB is used.

Results are shown in Figure 8. Increasing the number of filters benefits the performance, and our model with fewer parameters, e.g., 592K, still outperforms VDSR, whose PSNR is 33.66 dB for scale factor ×3 on Set5. Our final model uses the parameter settings illustrated in Section 5.2.

Although RL-CSC has more parameters than DRRN [23], in our experiments we find that DRRN consumes much more GPU memory. We test both models with different batch sizes and patch sizes of training data, and the results are summarized in Table 4. The memory usage data are derived from the nvidia-smi tool. A patch size of 31 is the default setting in DRRN. Training DRRN¹ with 52 layers is difficult with one Titan Xp GPU using the default settings given by the authors, because of the Out-of-Memory (OOM) issue. The reason is that the recursive unit of DRRN is based on the residual unit of ResNet [9], so BN layers are exploited, which tend to be memory intensive and increase the computational burden. Guided by the analyses presented in Section 3.2, BN layers are not needed in our design. As for inference time, RL-CSC takes 0.15 seconds to process a 288 × 288 image on a Titan Xp GPU.

6. Conclusions

In this work, we have proposed a novel network for the image super-resolution task by combining the merits of Residual Learning and Convolutional Sparse Coding. Our model is derived from LISTA, so it has inherently good interpretability. We extend the LISTA method to its convolutional version and build the main part of our model by strictly following the convolutional form. Furthermore, residual learning is adopted in our model, with which we are able to construct a deeper network by utilizing more recursions without introducing any new parameters. Extensive experiments show that our model achieves competitive results with state-of-the-arts and demonstrate its superiority in SR.

¹ Code can be found at: https://github.com/tyshiwo/DRRN_CVPR17


References

[1] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. Alberi-Morel. Low-Complexity Single-Image Super-Resolution based on Nonnegative Neighbor Embedding. In BMVC, pages 135.1–135.10, 2012.
[2] H. Bristow, A. Eriksson, and S. Lucey. Fast Convolutional Sparse Coding. In CVPR, pages 391–398. IEEE, 2013.
[3] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.
[4] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE TPAMI, 38(2):295–307, 2016.
[5] C. Garcia-Cardona and B. Wohlberg. Convolutional Dictionary Learning: A Comparative Review and New Algorithms. IEEE Transactions on Computational Imaging, 4(3):366–381, 2018.
[6] K. Gregor and Y. LeCun. Learning Fast Approximations of Sparse Coding. In ICML, 2010.
[7] S. Gu, W. Zuo, Q. Xie, D. Meng, X. Feng, and L. Zhang. Convolutional Sparse Coding for Image Super-Resolution. In ICCV, pages 1823–1831, 2015.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In ICCV, 2015.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, pages 770–778. IEEE, 2016.
[10] F. Heide, W. Heidrich, and G. Wetzstein. Fast and flexible convolutional sparse coding. In CVPR, pages 5135–5143, 2015.
[11] J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In CVPR, pages 5197–5206, 2015.
[12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[13] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, pages 1646–1654, 2016.
[14] J. Kim, J. K. Lee, and K. M. Lee. Deeply-Recursive Convolutional Network for Image Super-Resolution. In CVPR, pages 1637–1645, 2016.
[15] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In CVPR, pages 105–114. IEEE, 2017.
[16] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image super-resolution. In CVPR, volume 1, page 4, 2017.
[17] X. Mao, C. Shen, and Y.-B. Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In NIPS, pages 2802–2810, 2016.
[18] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Volume 2, pages 416–423, 2001.
[19] V. Papyan, Y. Romano, and M. Elad. Convolutional neural networks analyzed via convolutional sparse coding. JMLR, 18(1):2887–2938, 2017.
[20] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[22] H. Sreter and R. Giryes. Learned convolutional sparse coding. In ICASSP, pages 2191–2195. IEEE, 2018.
[23] Y. Tai, J. Yang, and X. Liu. Image Super-Resolution via Deep Recursive Residual Network. In CVPR, pages 2790–2798, 2017.
[24] Y. Tai, J. Yang, X. Liu, and C. Xu. MemNet: A persistent memory network for image restoration. In CVPR, pages 4539–4547, 2017.
[25] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, L. Zhang, B. Lim, S. Son, H. Kim, S. Nah, K. M. Lee, et al. NTIRE 2017 challenge on single image super-resolution: Methods and results. In CVPRW, pages 1110–1121. IEEE, 2017.
[26] R. Timofte, V. De Smet, and L. Van Gool. A+: Adjusted anchored neighborhood regression for fast super-resolution. In ACCV, pages 111–126. Springer, 2014.
[27] R. Timofte, S. Gu, J. Wu, and L. Van Gool. NTIRE 2018 Challenge on Single Image Super-Resolution: Methods and Results. In CVPR Workshops, 2018.
[28] T. Tong, G. Li, X. Liu, and Q. Gao. Image super-resolution using dense skip connections. In ICCV, pages 4809–4817. IEEE, 2017.
[29] Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang. Deep networks for image super-resolution with sparse prior. In ICCV, pages 370–378, 2015.
[30] B. Wohlberg. Efficient convolutional sparse coding. In ICASSP, pages 7173–7177. IEEE, 2014.
[31] C.-Y. Yang, C. Ma, and M.-H. Yang. Single-image super-resolution: A benchmark. In ECCV, pages 372–386. Springer, 2014.
[32] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE TIP, 19(11):2861–2873, 2010.
[33] W. Yang, X. Zhang, Y. Tian, W. Wang, and J.-H. Xue. Deep learning for single image super-resolution: A brief review. arXiv preprint arXiv:1808.03344, 2018.
[34] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In CVPR, 2010.
[35] R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711–730. Springer, 2010.
[36] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE TIP, 26(7):3142–3155, 2017.
[37] Z. Zhang, Y. Xu, J. Yang, X. Li, and D. Zhang. A Survey of Sparse Representation: Algorithms and Applications. IEEE Access, 3:490–530, 2015.
[38] W. Zhong and J. T.-Y. Kwok. Fast Stochastic Alternating Direction Method of Multipliers. In ICML, 2014.