

Light Field Image Super-Resolution Using Deformable Convolution

Yingqian Wang, Jungang Yang, Longguang Wang, Xinyi Ying, Tianhao Wu, Wei An, and Yulan Guo

Abstract—Light field (LF) cameras can record scenes from multiple perspectives, and thus introduce beneficial angular information for image super-resolution (SR). However, it is challenging to incorporate angular information due to disparities among LF images. In this paper, we propose a deformable convolution network (i.e., LF-DFnet) to handle the disparity problem for LF image SR. Specifically, we design an angular deformable alignment module (ADAM) for feature-level alignment. Based on ADAM, we further propose a collect-and-distribute approach to perform bidirectional alignment between the center-view feature and each side-view feature. Using our approach, angular information can be well incorporated and encoded into the features of each view, which benefits the SR reconstruction of all LF images. Moreover, we develop a baseline-adjustable LF dataset to evaluate SR performance under different disparities. Experiments on both public and our self-developed datasets have demonstrated the superiority of our method. Our LF-DFnet can generate high-resolution images with more faithful details and achieves state-of-the-art reconstruction accuracy. Besides, our LF-DFnet is more robust to disparity variations, which has not been well addressed in the literature.

Index Terms—Light field, super-resolution, deformable convolution, dataset

I. INTRODUCTION

ALTHOUGH light field (LF) cameras enable many attractive functions such as post-capture refocusing [1]–[3], depth sensing [4]–[9], saliency detection [10]–[14], and de-occlusion [15]–[17], the resolution of a sub-aperture image (SAI) is much lower than the total sensor resolution. This low spatial resolution problem hinders the development of LF imaging [18]. Since high-resolution (HR) images are required in various LF applications, it is necessary to reconstruct HR images from low-resolution (LR) observations, namely, to perform LF image super-resolution (SR).

To achieve high SR performance, information both within a single view (i.e., spatial information) and among different views (i.e., angular information) is important. Several models have been proposed in early LF image SR methods, such as the variational model [19], the Gaussian mixture model [20], and the PCA analysis model [21]. Although different delicately handcrafted image priors have been investigated in these traditional methods [19]–[24], their performance is relatively limited due to their inferiority in spatial information exploitation.

This work was supported in part by the National Natural Science Foundation of China (Nos. 61972435, 61401474, 61921001).

Y. Wang, J. Yang, L. Wang, X. Ying, T. Wu, W. An, and Y. Guo are with the College of Electronic Science and Technology, National University of Defense Technology, P. R. China. Y. Guo is also with the School of Electronics and Communication Engineering, Sun Yat-sen University, P. R. China. Emails: {wangyingqian16, yangjungang, wanglongguang15, yingxinyi18, wutianhao16, anwei, yulan.guo}@nudt.edu.cn. (Corresponding authors: Jungang Yang, Yulan Guo)

Fig. 1: Results achieved by bicubic interpolation, RCAN [34], LFSSR [30], resLF [28], LF-ATO [35], LF-InterNet [31], and our LF-DFnet for 2× and 4× SR. Here, scene Origami from the HCInew dataset [36] (2×SR: Bicubic 34.30/0.970, RCAN 40.24/0.990, LFSSR 41.27/0.993, resLF 40.02/0.991, LF-ATO 41.63/0.993, LF-InterNet 41.79/0.994, LF-DFnet 42.47/0.995) and scene Cards from the STFgantry dataset [37] (4×SR: Bicubic 23.60/0.778, RCAN 26.83/0.897, LFSSR 27.58/0.918, resLF 26.76/0.895, LF-ATO 27.87/0.920, LF-InterNet 27.78/0.919, LF-DFnet 28.37/0.929) are used for evaluation. Our method achieves superior visual performance with significant improvements in terms of PSNR and SSIM.

In contrast, recent deep learning-based methods [25]–[31] enhance spatial information exploitation via cascaded convolutions, and thus achieve improved performance as compared to traditional methods. Yoon et al. [25], [26] proposed the first CNN-based method, LFCNN, for LF image SR. Specifically, SAIs are first super-resolved using SRCNN [32], and then fine-tuned in pairs to incorporate angular information. Similarly, Yuan et al. [27] super-resolved each SAI separately using EDSR [33], and then proposed an EPI-enhancement network to refine the results. Although several recent deep learning-based methods [28]–[31] have been proposed to achieve state-of-the-art performance, the disparity issue in LF image SR is still under-investigated [25]–[31].

In real-world scenes, objects at different depths have different disparity values in LF images. Existing CNN-based LF image SR methods [25]–[31] do not explicitly address the disparity issue. Instead, they use cascaded convolutions to achieve a large receptive field that covers the disparity range. As demonstrated in [38], [39], it is difficult for SR networks to learn the non-linear mapping between LR and HR images under complex motion patterns.


Consequently, the misalignment impedes the incorporation of angular information and leads to performance degradation. Therefore, specific mechanisms should be designed to handle the disparity problem in LF image SR.

Inspired by the success of deformable convolution [40], [41] in video SR [42]–[46], in this paper, we propose a deformable convolution network (namely, LF-DFnet) to handle the disparity problem for LF image SR. Specifically, we design an angular deformable alignment module (ADAM) and a collect-and-distribute approach to achieve feature-level alignment and angular information incorporation. In ADAM, all side-view features are first aligned with the center-view feature to achieve feature collection. These collected features are then fused and distributed to their corresponding views by performing alignment with their original features. Through feature collection and distribution, angular information can be incorporated and encoded into each view. Consequently, the SR performance is evenly improved among different views. Moreover, we develop a novel LF dataset named NUDT to evaluate the performance of LF image SR methods under different disparity variations. All scenes in our NUDT dataset are rendered using 3dsMax¹ and the baseline of the virtual camera arrays is adjustable. In summary, the main contributions of this paper are as follows:

• We propose an LF-DFnet to achieve state-of-the-art LF image SR performance (as shown in Fig. 1) by addressing the disparity problem.

• We propose an angular deformable alignment module and a collect-and-distribute approach to achieve high-quality reconstruction of each LF image. Compared to [28], our approach avoids repetitive feature extraction and can exploit angular information from all SAIs.

• We develop a novel NUDT dataset by rendering synthetic scenes with adjustable camera baselines. Experiments on the NUDT dataset have demonstrated the robustness of our method with respect to disparity variations.

The rest of this paper is organized as follows: In Section II, we briefly review the related work. In Section III, we introduce the architecture of our LF-DFnet in detail. In Section IV, we introduce our self-developed dataset. Experimental results are presented in Section V. Finally, we conclude this paper in Section VI.

II. RELATED WORK

In this section, we briefly review the major works in single image SR (SISR), LF image SR, and deformable convolution.

A. Single Image SR

The task of SISR is to generate a clear HR image from its blurry LR counterpart. Since an input LR image can be associated with multiple HR outputs, SISR is a highly ill-posed problem. Recently, several surveys [47]–[49] have been published to comprehensively review SISR methods. Here, we only describe several milestone works in the literature.

¹https://www.autodesk.eu/products/3ds-max/overview

Since Dong et al. [32], [50] proposed SRCNN, the seminal CNN-based SISR method, deep learning-based methods have dominated this area due to their remarkable performance in terms of both accuracy and efficiency. So far, various networks have been proposed to continuously improve the SISR performance. Kim et al. [51] proposed a very deep SR network (i.e., VDSR) and achieved a significant performance improvement over SRCNN [32], [50]. Lim et al. [33] proposed an enhanced deep SR network (i.e., EDSR). With the combination of local and global residual connections, EDSR [33] won the NTIRE 2017 SISR challenge [52]. Zhang et al. [53], [54] proposed a residual dense network (i.e., RDN), which achieved a further improvement over the state-of-the-arts at that time. Subsequently, Zhang et al. [34] proposed a residual channel attention network (i.e., RCAN) by introducing a channel attention module and a residual-in-residual mechanism. More recently, Dai et al. [55] proposed SAN by applying a second-order attention mechanism to SISR. Note that, RCAN [34] and SAN [55] achieve the state-of-the-art SISR performance to date in terms of PSNR and SSIM.

In summary, SISR networks are becoming increasingly deep and complicated, resulting in continuously improved capability in spatial information exploitation. Note that, performing SISR on LF images is a straightforward scheme to achieve LF image SR. However, the angular information is discarded in this scheme, resulting in limited performance.

B. LF Image SR

In the area of LF image SR, both traditional and deep learning-based methods are widely used. For traditional methods, various models have been developed for problem formulation. Wanner et al. [19] proposed a variational method for LF image SR based on estimated depth information. Mitra et al. [20] encoded the LF structure via a Gaussian mixture model to achieve depth estimation, view synthesis, and LF image SR. Farrugia et al. [21] decomposed HR-LR patches into subspaces and proposed a linear subspace projection method for LF image SR. Alain et al. proposed LFBM5D for LF image denoising [56] and LF image SR [24] by extending BM3D filtering [57] to LFs. Rossi et al. [22] developed a graph-based method to achieve LF image SR via graph optimization. Although the LF structure is well encoded by these models [19]–[22], [24], the spatial information cannot be fully exploited due to the poor representation capability of these handcrafted image priors.

Recently, deep learning-based SISR methods have been demonstrated to be superior to traditional methods in spatial information exploitation. Inspired by these works, recent LF image SR methods adopt deep CNNs to improve their performance. In the pioneering work LFCNN [25], [26], SAIs were first separately super-resolved via SRCNN [50], and then fine-tuned in pairs to enhance both spatial and angular resolution. Subsequently, Yuan et al. [27] proposed LF-DCNN to improve LFCNN by super-resolving each SAI via a more powerful SISR network, EDSR [33], and fine-tuning the initial results using a specially designed EPI-enhancement network.


Apart from these two-stage SR methods, a number of one-stage network architectures have been designed for LF image SR. Wang et al. proposed a bidirectional recurrent network, LFNet [29], by extending BRCN [58] to LFs. Zhang et al. [28] proposed a multi-stream residual network, resLF, which stacks SAIs along different angular directions as inputs to super-resolve the center-view SAI. Yeung et al. [30] proposed LFSSR to alternately shuffle LF features between the SAI pattern and the macro-pixel image pattern for convolution. More recently, Jin et al. [35] proposed an all-to-one LF image SR method (i.e., LF-ATO) and performed structural consistency regularization to preserve the parallax structure among reconstructed views. Wang et al. [31] proposed LF-InterNet to interact spatial and angular information for LF image SR. LF-ATO [35] and LF-InterNet [31] are the state-of-the-art LF image SR methods to date and can achieve a high reconstruction accuracy.

Although the performance is continuously improved by recent networks, the disparity problem has not been well addressed in the literature. Several methods [25]–[28] use stacked SAIs as their inputs, making pixels of the same objects vary in spatial locations. In LFSSR [30] and LF-InterNet [31], LF features are organized into a macro-pixel image pattern to incorporate angular information. However, pixels can fall into different macro-pixels due to the disparity problem. In summary, due to the lack of a disparity handling mechanism, the performance of these methods degrades when handling scenes with large disparities. Note that, LFNet [29] achieves LF image SR in a video SR framework and implicitly addresses the disparity issue via recurrent networks. Although all angular views can contribute to the final SR performance, the recurrent mechanism in LFNet [29] only takes SAIs from the same row or column as its inputs. Therefore, the angular information in LFs cannot be efficiently used.

C. Deformable Convolution

The fixed kernel configuration in regular CNNs hinders the exploitation of long-range information. To address this problem, Dai et al. [40] proposed deformable convolution by introducing additional offsets, which can be learned adaptively to make the convolution kernel process features far away from its local neighborhood. Deformable convolutions have been applied to both high-level vision tasks [40], [59]–[61] and low-level vision tasks such as video SR [42]–[45]. Specifically, Tian et al. [42] proposed a temporally deformable alignment network (i.e., TDAN) by applying deformable convolution to align input video frames without explicit motion estimation. Wang et al. [43] proposed an enhanced deformable video restoration network (i.e., EDVR) by introducing a pyramid, cascading and deformable alignment module to handle large motions between frames. EDVR [43] won the NTIRE19 video restoration and enhancement challenges [62]. More recently, deformable convolution has been integrated with non-local operations [44], convolutional LSTMs [45], and 3D convolutions [46] to further enhance the video SR performance.

In summary, existing deformable convolution-based video SR methods [42]–[46] only perform unidirectional alignments to align neighboring frames to the reference frame. However, in LF image SR, it is computationally expensive to repetitively perform unidirectional alignments for each view to super-resolve all LF images. Consequently, we propose a collect-and-distribute approach to achieve bidirectional alignments using deformable convolutions. To the best of our knowledge, this is the first work to apply deformable convolutions to LF image SR.

III. NETWORK ARCHITECTURE

In this section, we introduce our LF-DFnet in detail. Following [27]–[31], we convert input images from the RGB color space to the YCbCr color space and only super-resolve the Y channel images, while the Cb and Cr channel images are bicubically upscaled. Consequently, without considering the channel dimension, an LF can be formulated as a 4D tensor $\mathcal{L} \in \mathbb{R}^{U \times V \times H \times W}$, where $U$ and $V$ represent the angular dimensions, and $H$ and $W$ represent the spatial dimensions. Specifically, a 4D LF can be considered as a $U \times V$ array of SAIs, and the resolution of each SAI is $H \times W$. Following [27]–[31], we achieve LF image SR using SAIs distributed in a square array, i.e., $U = V = A$.
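To make this representation concrete, the following minimal PyTorch snippet builds such an $A \times A$ array of SAIs and separates the center view from the $A^2-1$ side views; the tensor sizes and variable names are illustrative assumptions, not part of the paper.

```python
import torch

# A 4D light field (Y channel only) stored as a U x V array of sub-aperture images,
# i.e., L in R^{U x V x H x W} with U = V = A.
A, H, W = 5, 64, 64                      # hypothetical angular/spatial sizes
lf = torch.randn(A, A, H, W)             # lf[u, v] is the SAI at angular position (u, v)

center_view = lf[A // 2, A // 2]         # center-view SAI, shape (H, W)
side_views = [lf[u, v] for u in range(A) for v in range(A)
              if (u, v) != (A // 2, A // 2)]   # the A^2 - 1 side views
print(center_view.shape, len(side_views))      # torch.Size([64, 64]) 24
```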

As illustrated in Fig. 2(a), our LF-DFnet takes LR SAIs as its inputs and sequentially performs feature extraction (Section III-A), angular deformable alignment (Section III-B), and reconstruction and upsampling (Section III-C).

A. Feature Extraction Module

Discriminative feature representation with rich spatial context information is beneficial to the subsequent feature alignment and SR reconstruction steps. Therefore, a large receptive field with a dense pixel sampling rate is required to extract hierarchical features. To this end, we follow [63] and use a residual atrous spatial pyramid pooling (ASPP) module as the feature extraction module in our LF-DFnet.

As shown in Fig. 2(a), input SAIs are first processed by a $1\times 1$ convolution to generate initial features, and then fed to residual ASPP modules (Fig. 2(b)) and residual blocks (Fig. 2(c)) for deep feature extraction. Note that, each view is processed separately and the weights in our feature extraction module are shared among these views. In each residual ASPP block, three $3\times 3$ dilated convolutions (with dilation rates of 1, 2, and 4, respectively) are combined in parallel to extract hierarchical features with dense sampling rates. After activation with a Leaky ReLU layer (with a leaky factor of 0.1), the features of these three branches are concatenated and fused by a $1\times 1$ convolution. Finally, both the center-view feature $F_c \in \mathbb{R}^{H\times W\times C}$ and the side-view features $F_i \in \mathbb{R}^{H\times W\times C}$ $(i = 1, 2, \cdots, A^2-1)$ are generated by our feature extraction module. Following [28], we set the feature depth to 32 (i.e., $C = 32$). The effectiveness of the residual ASPP module is demonstrated in Section V-C.
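As a rough illustration of this design, the following sketch implements one residual ASPP block with the stated configuration (three parallel 3×3 dilated convolutions with rates 1, 2, and 4, a Leaky ReLU with slope 0.1, concatenation, and a 1×1 fusion convolution); the class name, the residual placement, and the exact layer ordering are our assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class ResASPPBlock(nn.Module):
    """Sketch of a residual ASPP block: three parallel 3x3 dilated convolutions
    (dilation 1, 2, 4), LeakyReLU(0.1), concatenation, and a 1x1 fusion convolution,
    wrapped in a residual connection."""
    def __init__(self, channels=32):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 2, 4)
        ])
        self.act = nn.LeakyReLU(0.1, inplace=True)
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x):
        feats = [self.act(branch(x)) for branch in self.branches]
        return x + self.fuse(torch.cat(feats, dim=1))

# Each view is processed separately with shared weights, e.g.:
# f_i = ResASPPBlock(32)(initial_feature_i)
```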

B. Angular Deformable Alignment Module (ADAM)

Given the features generated by the feature extraction module, the main objective of ADAM is to perform alignment between the center-view feature and each side-view feature.


Fig. 2: An overview of our LF-DFnet. (a) Overall architecture. (b) Residual ASPP block. (c) Residual block. (d) Angular deformable alignment module (ADAM). (e) Reconstruction module. (f) Information multi-distillation block (IMDB).

Here, we propose a bidirectional alignment approach (i.e., collect-and-distribute) to incorporate angular information. Specifically, side-view features are first warped to the center view and aligned with the center-view feature (i.e., feature collection). These aligned features are fused by a $1\times 1$ convolution to incorporate angular information. Afterwards, the fused feature is warped to the side views by performing alignment with their original features (i.e., feature distribution). In this way, angular information can be jointly incorporated into each angular view, and the SR performance of all perspectives can be evenly improved. In this paper, we cascade $K$ ADAMs to perform feature collection and feature distribution. Without loss of generality, we take the $k$th $(k = 1, 2, \cdots, K)$ ADAM as an example to introduce its mechanism, as shown in Fig. 2(d).

The core component of ADAM is the deformable convolution, which is used to align features according to their corresponding offsets. In our implementation, we use one deformable convolution for feature collection and another deformable convolution with opposite offset values for feature distribution. The first deformable convolution, which is used for feature collection, takes the $(k-1)$th side-view feature $F_{i}^{k-1}$ and the learnable offsets $\Delta P_{i}^{k}$ as its inputs to generate the $k$th feature $F_{i\rightarrow c}^{k}$ (which is aligned to the center view). That is,

$$F_{i\rightarrow c}^{k} = H_{dcn}^{k}\left(F_{i}^{k-1}, \Delta P_{i}^{k}\right), \tag{1}$$

where $H_{dcn}^{k}$ represents the deformable convolution in the $k$th deformable block, and $\Delta P_{i}^{k} = \{\Delta p_{n}\} \in \mathbb{R}^{H\times W\times C'}$ is the offset of $F_{i}^{k-1}$ with respect to $F_{c}$. More specifically, for each position $p_{0} = (x_{0}, y_{0})$ on $F_{i\rightarrow c}^{k}$, we have

$$F_{i\rightarrow c}^{k}(p_{0}) = \sum_{p_{n}\in\mathcal{R}} w(p_{n}) \cdot F_{i}^{k-1}\left(p_{0} + p_{n} + \Delta p_{n}\right), \tag{2}$$

where $\mathcal{R} = \{(-1,-1), (-1,0), \cdots, (0,1), (1,1)\}$ represents a $3\times 3$ neighborhood region centered at $p_{0}$, and $p_{n}\in\mathcal{R}$ is the predefined integral offset. $\Delta p_{n}$ is an additional learnable offset, which is added to the predefined offset $p_{n}$ to make the positions of the deformable kernels spatially variant. Thus, information far away from $p_{0}$ can be adaptively processed by the deformable convolution. Since $\Delta p_{n}$ can be fractional, we follow [40] and perform bilinear interpolation in our implementation.

Since an accurate offset is beneficial to deformable alignment, we design an offset generation branch to learn the offset $\Delta P_{i}^{k}$ in Eq. (1). As illustrated in Fig. 2(d), the side-view feature $F_{i}^{k-1}$ is first concatenated with the center-view feature $F_{c}$, and then fed to a $1\times 1$ convolution for feature depth reduction. To handle the complicated and large motions between $F_{i}^{k-1}$ and $F_{c}$, a residual ASPP module (which is identical to that in Section III-A) is applied to enlarge the receptive field while maintaining a dense sampling rate. The residual ASPP module enhances the exploitation of angular dependencies between the center view and the side views, resulting in improved SR performance. The effectiveness of the residual ASPP module in the offset generation branch is investigated in Section V-C. Finally, another $1\times 1$ convolution with $C' = 18$ output channels is used to generate the offset feature.

Once all side-view features are aligned to the center view, a $1\times 1$ convolution is performed to fuse the angular information in these aligned features:

$$F_{c}^{k} = H_{1\times 1}^{k}\left(\left[F_{1\rightarrow c}^{k}, F_{2\rightarrow c}^{k}, \cdots, F_{(A^{2}-1)\rightarrow c}^{k}, F_{c}\right]\right), \tag{3}$$

where $[\cdot\,,\cdot]$ denotes concatenation and $H_{1\times 1}^{k}$ denotes a $1\times 1$ convolution.

To super-resolve all LF images, the incorporated angular information needs to be encoded into each side view.


Fig. 3: Example images and their groundtruth depth maps in our NUDT dataset (center views and GT depth for scenes in the buildings, landscapes, rooms, and objects categories).

Consequently, we perform feature distribution to propagate the incorporated angular information to the side views. Since the disparities between the side-view features and the center-view feature are mutual, we do not perform additional offset learning. Instead, we use the opposite offset $-\Delta P_{i}^{k}$ to warp the fused center-view feature $F_{c}^{k}$ to the $i$th side view. That is,

$$F_{i}^{k} = H_{dcn}^{k}\left(F_{c}^{k}, -\Delta P_{i}^{k}\right). \tag{4}$$

After feature distribution, both the center-view feature $F_{c}^{k}$ and the side-view features $F_{i}^{k}$ $(i = 1, 2, \cdots, A^{2}-1)$ are produced by the $k$th ADAM. In this paper, we cascade four ADAMs to achieve repetitive feature collection and distribution. Consequently, angular information can be repetitively incorporated into the center view and then propagated to all side views, resulting in continuous performance improvements (see Section V-C).
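Putting these pieces together, one ADAM could be sketched as below: all side-view features are collected to the center view, fused by a 1×1 convolution (Eq. (3)), and the fused feature is distributed back to each side view with the negated offsets (Eq. (4)). The sketch builds on the CollectAlign module above; using separate weights for the distribution deformable convolution and all naming are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class ADAM(nn.Module):
    """Sketch of one angular deformable alignment module (collect-and-distribute)."""
    def __init__(self, channels=32, num_views=25):   # num_views = A^2 (e.g., 5 x 5)
        super().__init__()
        self.collect = CollectAlign(channels)
        self.fuse = nn.Conv2d(num_views * channels, channels, 1)   # 1x1 fusion of Eq. (3)
        # separate weights for the distribution deformable convolution (assumption)
        self.dist_weight = nn.Parameter(torch.randn(channels, channels, 3, 3) * 0.01)
        self.dist_bias = nn.Parameter(torch.zeros(channels))

    def forward(self, f_center, f_sides):
        # feature collection: align every side-view feature to the center view
        aligned, offsets = zip(*[self.collect(f, f_center) for f in f_sides])
        # fuse aligned side-view features with the center-view feature
        f_center_fused = self.fuse(torch.cat(list(aligned) + [f_center], dim=1))
        # feature distribution: reuse the learned offsets with opposite sign (Eq. (4))
        f_sides_out = [deform_conv2d(f_center_fused, -off, self.dist_weight,
                                     self.dist_bias, padding=1)
                       for off in offsets]
        return f_center_fused, f_sides_out
```

Cascading K such modules then amounts to feeding the outputs of one ADAM as the inputs of the next.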

C. Reconstruction & Upsampling Module

To achieve high reconstruction accuracy, the spatial and angular information has to be incorporated. Since the preceding modules in our LF-DFnet have produced angular-aligned hierarchical features, a reconstruction module is needed to fuse these features for LF image SR. Following [64], we propose a reconstruction module with information multi-distillation blocks (IMDB). By adopting a distillation mechanism to gradually extract and process hierarchical features, superior SR performance can be achieved with a small number of parameters and a low computational cost [65].

The overall architecture of our reconstruction module is illustrated in Fig. 2(e). For each view, the outputs of the feature extraction module and each ADAM are concatenated and processed by a $1\times 1$ convolution for coarse fusion. Then, the coarsely-fused feature (with 128 channels) is fed to several stacked IMDBs for deep feature fusion. The structure of an IMDB is illustrated in Fig. 2(f). Specifically, in each IMDB, the input feature is first processed by a $3\times 3$ convolution and a Leaky ReLU. The processed feature is then split into two parts along the channel dimension, resulting in a narrow feature (with 32 channels) and a wide feature (with 96 channels). The narrow feature is preserved and directly fed to the final bottleneck of this IMDB, while the wide feature is fed to a $3\times 3$ convolution that enlarges its channels to 128 for further refinement.

Fig. 4: An illustration of the concentric configuration. (a) Configuration of scene Robots. Here, $3\times 3$ camera arrays with 5 different baseline settings are used as examples. Blocks on the translucent yellow plane denote virtual cameras, where camera arrays with different baselines are drawn in different colors. (b) $3\times 3$ concentric configuration with 5 different baseline settings. (c) $5\times 5$ concentric configuration with 3 different baseline settings.

In this way, useful information can be gradually distilled, and the SR performance is improved in an efficient manner. Finally, the features of different stages in the IMDB are concatenated and processed by a $1\times 1$ convolution for local residual learning. Moreover, the feature produced by the last IMDB is processed by a $3\times 3$ convolution to reduce its depth from 128 to 32 for global residual learning.
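One possible reading of this block is sketched below: each stage applies a 3×3 convolution with a Leaky ReLU, keeps a 32-channel distilled feature, and expands the remaining 96 channels back to 128 for the next stage; the number of stages (three splits plus a final 32-channel stage, so that 4×32 = 128 channels enter the 1×1 bottleneck) is our interpretation of Fig. 2(f), not a verified reproduction of the released code.

```python
import torch
import torch.nn as nn

class IMDB(nn.Module):
    """Sketch of an information multi-distillation block with a local residual connection."""
    def __init__(self, channels=128, distilled=32):
        super().__init__()
        wide = channels - distilled                                # 96 channels
        self.convs = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=1)
                                    for _ in range(3)])
        self.expands = nn.ModuleList([nn.Conv2d(wide, channels, 3, padding=1)
                                      for _ in range(3)])
        self.last = nn.Conv2d(channels, distilled, 3, padding=1)   # final 32-channel stage
        self.act = nn.LeakyReLU(0.1, inplace=True)
        self.bottleneck = nn.Conv2d(4 * distilled, channels, 1)    # 1x1 fusion for local residual
        self.distilled = distilled

    def forward(self, x):
        keep, feat = [], x
        for conv, expand in zip(self.convs, self.expands):
            out = self.act(conv(feat))
            narrow, wide = torch.split(out, [self.distilled, out.size(1) - self.distilled], dim=1)
            keep.append(narrow)                  # preserved (distilled) feature
            feat = self.act(expand(wide))        # refined wide feature for the next stage
        keep.append(self.act(self.last(feat)))
        return x + self.bottleneck(torch.cat(keep, dim=1))
```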

The features obtained from the reconstruction module are finally fed to an upsampling module. Specifically, a $1\times 1$ convolution is first applied to the reconstructed features to extend their depth to $\alpha^2 C$, where $\alpha$ is the upsampling factor. Then, pixel shuffling is performed to upscale the reconstructed feature to the target resolution $\alpha H \times \alpha W$. Finally, a $1\times 1$ convolution is applied to squeeze the number of feature channels to 1 to generate the super-resolved SAIs.
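This upsampling step can be written compactly as in the following sketch (the module name and default arguments are ours).

```python
import torch
import torch.nn as nn

class UpsamplingModule(nn.Module):
    """Sketch of the upsampling module: 1x1 conv to alpha^2 * C channels,
    PixelShuffle to (alpha*H, alpha*W), then a 1x1 conv down to one Y-channel SAI."""
    def __init__(self, channels=32, scale=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels * scale ** 2, 1),
            nn.PixelShuffle(scale),
            nn.Conv2d(channels, 1, 1),
        )

    def forward(self, x):
        return self.body(x)

# e.g., a (1, 32, 32, 32) reconstructed feature becomes a (1, 1, 64, 64) SAI for 2x SR.
```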

IV. THE NUDT DATASET

LF images captured by different devices (especially camera arrays) usually have significantly different baseline lengths. It is therefore necessary to know how existing LF algorithms perform under baseline variations, including those developed for depth estimation [72]–[77], view synthesis [78]–[86], and image SR [35], [87]–[90]. However, all existing LF datasets [16], [36], [37], [70], [71] only include images with fixed baselines. To facilitate the study of LF algorithms under baseline variations, we introduce a novel LF dataset (namely, the NUDT dataset) with adjustable baselines, which is available at: https://github.com/YingqianWang/NUDT-Dataset.

A. Technical Details

Our NUDT dataset has 32 synthetic scenes and covers diverse scenarios (see Fig. 3). All scenes in our dataset are rendered using the 3dsMax software², and have an angular resolution of $9\times 9$ and a spatial resolution of $1024\times 1024$. Groundtruth depth maps are available for LF depth/disparity estimation methods. During the image rendering process, all virtual cameras in the array have identical internal parameters and are coplanar, with parallel optical axes.

²https://www.autodesk.eu/products/3ds-max/overview


TABLE I: Main characteristics of several popular LF datasets. Note that, average scores are reported for spatial resolution (SpaRes) and the perceptual quality metrics (i.e., BRISQUE [66], NIQE [67], CEIQ [68], and ENIQA [69]). Images in our NUDT dataset have high spatial resolution and perceptual quality.

Datasets        Type           #Scenes  AngRes  SpaRes     GT Depth  BRISQUE (↓)  NIQE (↓)  CEIQ (↑)  ENIQA (↓)
EPFL [70]       real (Lytro)   119      14×14   0.034 Mpx  no        47.19        5.820     3.286     0.212
HCInew [36]     synthetic      24       9×9     0.026 Mpx  yes       14.80        3.833     3.153     0.087
HCIold [71]     synthetic      12       9×9     0.070 Mpx  yes       24.17        2.985     3.369     0.117
INRIA [16]      real (Lytro)   57       14×14   0.027 Mpx  no        23.56        5.338     3.184     0.160
STFgantry [37]  real (gantry)  12       17×17   0.118 Mpx  no        25.28        4.246     2.781     0.232
NUDT (Ours)     synthetic      32       9×9     0.105 Mpx  yes       8.901        3.593     3.375     0.041

Note: 1) Mpx denotes megapixels per image. 2) The best results are in bold face and the second best results are underlined. 3) Lower scores of BRISQUE [66], NIQE [67], and ENIQA [69] and higher scores of CEIQ [68] indicate better perceptual quality.

TABLE II: Public datasets used in our experiments.

Datasets        #Training  #Test
EPFL [70]       70         10
HCInew [36]     20         4
HCIold [71]     10         2
INRIA [16]      35         5
STFgantry [37]  9          2
Total           144        23

To capture LF images with different baselines, we used a concentric configuration to align the camera arrays at the center views. In this way, LF images of different baselines share the same center-view SAI and groundtruth depth map. An illustration of our concentric configuration is shown in Fig. 4. For each scene, we rendered LF images with 10 different baselines. Note that, we tuned the parameters (e.g., lighting and depth range) to better reflect real scenes. Consequently, our dataset has a high perceptual quality, as shown in the next subsection.

B. Comparison to Existing Datasets

In this section, we compare our NUDT dataset to several popular LF datasets [16], [36], [37], [70], [71]. Following [91], we use four no-reference image quality assessment (NRIQA) metrics to evaluate the perceptual quality of LF images in these datasets. These NRIQA metrics, including the blind/referenceless image spatial quality evaluator (BRISQUE) [66], the natural image quality evaluator (NIQE) [67], the contrast enhancement based image quality evaluator (CEIQ) [68], and the entropy-based image quality assessment (ENIQA) [69], are highly correlated with human perception. As shown in Table I, our NUDT dataset achieves the best scores in BRISQUE [66], CEIQ [68], and ENIQA [69], and the second best score in NIQE [67]. That is, images in our NUDT dataset have high perceptual quality. Meanwhile, our dataset has more scenes (see #Scenes) and higher image resolution (see SpaRes) than the HCInew [36] and HCIold [71] datasets.

V. EXPERIMENTS

In this section, we first introduce our implementation details. Then, we compare our LF-DFnet to state-of-the-art SISR and LF image SR methods. Finally, we present ablation studies to investigate our network.

A. Implementation Details

As listed in Table II, we used 5 public LF datasets in our experiments for both training and testing. All LFs in these datasets have an angular resolution of $9\times 9$. In the training stage, we cropped each SAI into HR patches with a stride of 32, and used bicubic downsampling to generate LR patches with a resolution of $64\times 64$. We performed random horizontal flipping, vertical flipping, and 90-degree rotation to augment the training data by 8 times. Note that, both the spatial and angular dimensions need to be flipped or rotated during data augmentation to maintain LF structures.
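A minimal sketch of such LF-structure-preserving augmentation for tensors shaped (U, V, H, W) is given below; the function name and the 50% application probabilities are assumptions.

```python
import random
import torch

def augment_lf(lf_lr, lf_hr):
    """Flips and 90-degree rotations applied jointly to the angular and spatial
    dimensions of (U, V, H, W) light fields, so the epipolar structure is preserved."""
    if random.random() < 0.5:                       # horizontal flip: W and V together
        lf_lr, lf_hr = lf_lr.flip(dims=(1, 3)), lf_hr.flip(dims=(1, 3))
    if random.random() < 0.5:                       # vertical flip: H and U together
        lf_lr, lf_hr = lf_lr.flip(dims=(0, 2)), lf_hr.flip(dims=(0, 2))
    if random.random() < 0.5:                       # 90-degree rotation of both domains
        lf_lr = lf_lr.rot90(1, dims=(2, 3)).rot90(1, dims=(0, 1))
        lf_hr = lf_hr.rot90(1, dims=(2, 3)).rot90(1, dims=(0, 1))
    return lf_lr, lf_hr
```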

By default, we used the model with $K = 4$, $N = 4$, $C = 32$, and an angular resolution of $5\times 5$ for both 2× and 4× SR. We also investigated several variants of our LF-DFnet in Section V-C. The L1 loss function was used to train our network due to its robustness to outliers [94].

Following [28], [29], [31], [34], [35], [95], we used PSNR and SSIM as quantitative metrics for performance evaluation. Both PSNR and SSIM were calculated separately on the Y channel of each SAI. To obtain the overall metric score for a dataset with $M$ test scenes (each scene with an angular resolution of $A\times A$), we first obtained the score for a scene by averaging its $A^2$ scores, and then generated the overall score by averaging the scores of all $M$ scenes.
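This aggregation protocol can be summarized by the following sketch (the function and variable names are ours).

```python
import numpy as np

def dataset_score(per_view_scores):
    """per_view_scores[m] holds the A*A per-view PSNR (or SSIM) values of the m-th
    test scene; scores are averaged over the A^2 views of each scene first, and then
    over the M scenes of the dataset."""
    per_scene = [np.mean(scene) for scene in per_view_scores]
    return float(np.mean(per_scene))

# e.g., dataset_score([[40.1, 40.3, 40.2], [38.7, 38.9, 38.8]])
```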

Our LF-DFnet was implemented in PyTorch on a PC with two NVidia RTX GPUs. Our model was initialized using the Xavier method [96] and optimized using the Adam method [97]. The batch size was set to 8, and the learning rate was initially set to $4\times 10^{-4}$ and decreased by a factor of 0.5 every 10 epochs. The training was stopped after 50 epochs and took about 1.5 days.
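For reference, the stated optimization settings correspond to a configuration along the lines of the sketch below; the placeholder model and the commented-out inner loop are ours, not the released training script.

```python
import torch

model = torch.nn.Conv2d(1, 1, 3, padding=1)          # placeholder for the LF-DFnet instance
torch.nn.init.xavier_uniform_(model.weight)          # Xavier initialization [96]
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
criterion = torch.nn.L1Loss()                        # L1 loss

for epoch in range(50):                              # training stopped after 50 epochs
    # for lr_patch, hr_patch in loader:              # batch size 8, 64x64 LR patches
    #     loss = criterion(model(lr_patch), hr_patch)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()                                 # halve the learning rate every 10 epochs
```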

B. Comparison to the State-of-the-arts

We compare our method to several state-of-the-art methods, including 6 single image SR methods (i.e., VDSR [51], EDSR [33], RCAN [34], SAN [55], SRGAN [92], and ESRGAN [93]) and 7 LF image SR methods (i.e., LFBM5D [24], GB [22], LFNet [29], LFSSR [30], resLF [28], LF-ATO [35], and LF-InterNet [31]). We also use the bicubic interpolation method to provide baseline results.

1) Quantitative Results: Quantitative results are presented in Table III. Our LF-DFnet achieves the highest SSIM scores on all 5 datasets for both 2× and 4× SR. In terms of PSNR, our method achieves the best performance on the HCInew and STFgantry datasets for 2× and 4× SR, and on the HCIold dataset for 4× SR.


TABLE III: PSNR/SSIM values achieved by different methods for 2× and 4× SR. The best results are in red and the second best results are in blue.

Method            Scale  EPFL [70]      HCInew [36]    HCIold [71]    INRIA [16]     STFgantry [37]
Bicubic           2×     29.50 / 0.935  31.69 / 0.934  37.46 / 0.978  31.10 / 0.956  30.82 / 0.947
VDSR [51]         2×     32.01 / 0.959  34.37 / 0.956  40.34 / 0.986  33.80 / 0.972  35.80 / 0.980
EDSR [33]         2×     32.86 / 0.965  35.02 / 0.961  41.11 / 0.988  34.61 / 0.977  37.08 / 0.985
RCAN [34]         2×     33.46 / 0.967  35.56 / 0.963  41.59 / 0.989  35.18 / 0.978  38.18 / 0.988
SAN [55]          2×     33.36 / 0.967  35.51 / 0.963  41.47 / 0.989  35.15 / 0.978  37.98 / 0.987
LFBM5D [24]       2×     31.15 / 0.956  33.72 / 0.955  39.62 / 0.985  32.85 / 0.969  33.55 / 0.972
GB [22]           2×     31.22 / 0.959  35.25 / 0.969  40.21 / 0.988  32.76 / 0.972  35.44 / 0.984
LFNet [29]        2×     31.79 / 0.950  33.52 / 0.943  39.44 / 0.982  33.49 / 0.966  32.76 / 0.957
LFSSR [30]        2×     34.15 / 0.973  36.98 / 0.974  43.29 / 0.993  35.76 / 0.982  37.67 / 0.989
resLF [28]        2×     33.22 / 0.969  35.79 / 0.969  42.30 / 0.991  34.86 / 0.979  36.28 / 0.985
LF-ATO [35]       2×     34.49 / 0.976  37.28 / 0.977  43.76 / 0.994  36.21 / 0.984  39.06 / 0.992
LF-InterNet [31]  2×     34.76 / 0.976  37.20 / 0.976  44.65 / 0.995  36.64 / 0.984  38.48 / 0.991
LF-DFnet (Ours)   2×     34.37 / 0.977  37.77 / 0.979  44.64 / 0.995  36.17 / 0.985  40.17 / 0.994
Bicubic           4×     25.14 / 0.831  27.61 / 0.851  32.42 / 0.934  26.82 / 0.886  25.93 / 0.843
VDSR [51]         4×     26.82 / 0.869  29.12 / 0.876  34.01 / 0.943  28.87 / 0.914  28.31 / 0.893
EDSR [33]         4×     27.82 / 0.892  29.94 / 0.893  35.53 / 0.957  29.86 / 0.931  29.43 / 0.921
RCAN [34]         4×     28.31 / 0.899  30.25 / 0.896  35.89 / 0.959  30.36 / 0.936  30.25 / 0.934
SAN [55]          4×     28.30 / 0.899  30.25 / 0.898  35.88 / 0.960  30.29 / 0.936  30.30 / 0.933
SRGAN [92]        4×     26.85 / 0.870  28.95 / 0.873  34.03 / 0.942  28.85 / 0.916  28.19 / 0.898
ESRGAN [93]       4×     25.59 / 0.836  26.96 / 0.819  33.53 / 0.933  27.54 / 0.880  28.00 / 0.905
LFBM5D [24]       4×     26.61 / 0.869  29.13 / 0.882  34.23 / 0.951  28.49 / 0.914  28.30 / 0.900
GB [22]           4×     26.02 / 0.863  28.92 / 0.884  33.74 / 0.950  27.73 / 0.909  28.11 / 0.901
LFNet [29]        4×     25.95 / 0.854  28.14 / 0.862  33.17 / 0.941  27.79 / 0.904  26.60 / 0.858
LFSSR [30]        4×     29.16 / 0.915  30.88 / 0.913  36.90 / 0.970  31.03 / 0.944  30.14 / 0.937
resLF [28]        4×     27.86 / 0.899  30.37 / 0.907  36.12 / 0.966  29.72 / 0.936  29.64 / 0.927
LF-ATO [35]       4×     29.16 / 0.917  31.08 / 0.917  37.23 / 0.971  31.21 / 0.950  30.78 / 0.944
LF-InterNet [31]  4×     29.52 / 0.917  31.01 / 0.917  37.23 / 0.972  31.65 / 0.950  30.44 / 0.941
LF-DFnet (Ours)   4×     28.92 / 0.919  31.33 / 0.921  37.46 / 0.973  31.00 / 0.952  31.29 / 0.952

On datasets captured by Lytro cameras (i.e., EPFL and INRIA), our method is marginally inferior to LF-ATO and LF-InterNet but significantly better than other methods (e.g., RCAN, SAN, and resLF). It is worth noting that the superiority of our LF-DFnet is very significant on the STFgantry dataset for 2× SR. That is because scenes in the STFgantry dataset are captured by a moving camera mounted on a gantry, and thus have relatively large baselines and significant disparity variations. Our LF-DFnet can handle this disparity problem by using deformable convolutions for angular alignment, while maintaining promising performance for LFs with small baselines (e.g., LFs in the EPFL and INRIA datasets). More analyses with respect to different baseline lengths are presented in Section V-B5.

2) Qualitative Results: Qualitative results for 2× and 4× SR are shown in Figs. 5 and 6, respectively. As compared to the state-of-the-art SISR and LF image SR methods, our method produces images with more faithful details and fewer artifacts. Specifically, for 2× SR, the images generated by our LF-DFnet are very close to the groundtruth images. Note that, the stairway in scene INRIA_Sculpture is faithfully recovered by our method without blurring or artifacts, and the tile edges in scene HCIold_buddha are as sharp as in the groundtruth image. For 4× SR, the state-of-the-art SISR methods RCAN and SAN produce blurry results with warped textures, and the perceptual-oriented SISR method ESRGAN generates images with fake textures. That is because the SR problem becomes highly ill-posed for 4× SR, and the spatial information in a single image is insufficient to reconstruct high-quality HR images. In contrast, our LF-DFnet can use complementary information among different views to recover missing details, and thus achieves superior SR performance.

TABLE IV: Comparison of the number of parameters (i.e., #Params.) and FLOPs for 2× and 4× SR. Note that, FLOPs are calculated on an input LF with a size of 5×5×32×32. Here, we use PSNR and SSIM values averaged over 5 datasets [16], [36], [37], [70], [71] to represent the reconstruction accuracy.

Method            Scale  #Params.  FLOPs (G)  PSNR / SSIM
RCAN [34]         2×     15.44M    15.71×25   36.79 / 0.977
SAN [55]          2×     15.71M    16.05×25   36.69 / 0.977
resLF [28]        2×     6.35M     37.06      36.49 / 0.979
LF-ATO [35]       2×     1.51M     597.66     38.16 / 0.985
LF-InterNet [31]  2×     4.80M     47.46      38.35 / 0.985
LF-DFnet          2×     4.62M     57.21      38.62 / 0.986
RCAN [34]         4×     15.59M    16.34×25   31.01 / 0.925
SAN [55]          4×     15.86M    16.67×25   31.00 / 0.925
resLF [28]        4×     6.79M     39.70      30.74 / 0.927
LF-ATO [35]       4×     1.66M     686.99     31.89 / 0.940
LF-InterNet [31]  4×     5.23M     50.10      31.97 / 0.939
LF-DFnet          4×     4.64M     59.85      32.00 / 0.943

3) Computational Efficiency: We compare our LF-DFnet to several competitive methods [28], [31], [34], [35], [55] in terms of the number of parameters (i.e., #Params.) and FLOPs. As shown in Table IV, our method achieves the highest PSNR and SSIM scores with a small number of parameters and FLOPs. Note that, the FLOPs of our method are significantly lower than those of RCAN, SAN, and LF-ATO, but marginally higher than those of resLF and LF-InterNet. That is because our LF-DFnet uses more complicated feature extraction and reconstruction modules than resLF and LF-InterNet. These modules introduce a notable performance improvement at the cost of a reasonable increase in FLOPs.


Fig. 5: Visual results of 2× SR on scenes INRIA_Sculpture and HCIold_buddha (Groundtruth, Bicubic, VDSR, EDSR, RCAN, SAN, LFBM5D, GB, LFNet, LFSSR, resLF, LF-ATO, LF-InterNet, and LF-DFnet).

Fig. 6: Visual results of 4× SR on scenes EPFL_Palais and HCInew_bedroom (Groundtruth, Bicubic, VDSR, EDSR, RCAN, SAN, SRGAN, ESRGAN, LFBM5D, GB, LFNet, LFSSR, resLF, LF-ATO, LF-InterNet, and LF-DFnet).

4) Performance w.r.t. Perspectives: Since LF image SR methods aim at super-resolving all SAIs in an LF, we compare our method to resLF under different perspectives. We used the central 3×3, 5×5, 7×7, and 9×9 SAIs in the HCInew dataset to perform 2× SR, and used PSNR values for performance evaluation, as visualized in Fig. 7. Note that, due to the changing perspectives, the contents of different SAIs are not identical, resulting in inherent PSNR variations among perspectives. Therefore, we evaluate this variation by using RCAN to perform SISR on each SAI. As shown in Fig. 7, RCAN achieves a relatively balanced PSNR distribution (Std=0.0327 for 9×9 LFs). This demonstrates that the inherent PSNR variation among perspectives is relatively small. It can be observed that resLF achieves notable performance improvements over RCAN under all angular resolutions. However, since resLF uses only part of the views for LF image SR, the PSNR scores achieved by resLF on side views are relatively low.

As compared to RCAN and resLF, our method uses all SAIs to super-resolve each view and handles the disparity problem. Consequently, our method can achieve better SR performance (i.e., higher PSNR values) with a more balanced distribution (i.e., lower Std scores). It can also be observed that our SR performance is continuously improved as the angular resolution increases from 3×3 to 7×7.


Fig. 7: PSNR values achieved by RCAN [34], resLF [28], and our LF-DFnet on each SAI of the HCInew dataset [36]. Here, the central 3×3, 5×5, 7×7, and 9×9 input views are used to perform 2× SR. We use the standard deviation (Std) to measure the uniformity of the PSNR distribution. Note that, our LF-DFnet achieves high reconstruction quality (i.e., high PSNR values) and a balanced distribution (i.e., low Std scores) among different perspectives.

That is because the additional views can introduce more angular information, which is beneficial to SR reconstruction. Note that, our LF-DFnet achieves comparable performance with 7×7 and 9×9 input views (37.91 vs. 37.89 in average PSNR). That is, the angular information tends to be saturated when the angular resolution is larger than 7×7, and a further increase in angular resolution cannot introduce significant performance improvement.

5) Performance w.r.t. Disparity Variations: We selected 4 scenes (see Fig. 8) from our NUDT dataset, and rendered them with linearly increased baselines. Note that, the disparities in a specific scene are proportional to the baseline length when the camera intrinsic parameters (e.g., focal length) are fixed. Consequently, we can investigate the performance of LF image SR algorithms with respect to disparity variations by straightforwardly applying them to the same scenes rendered with different baselines. Following the HCInew dataset [36], we calculated the disparity range of each scene using its groundtruth depth values. As shown in Fig. 8, the reconstruction accuracy (i.e., PSNR) of both methods tends to decrease with increasing disparities, except for LF-DFnet on scene Robots (see Fig. 8(c)). Note that, the superiority of our LF-DFnet becomes more significant on LF images with larger disparity variations (i.e., wider baselines).

Fig. 8: PSNR values achieved by resLF [28] and LF-DFnet on 4 scenes of our NUDT dataset under linearly increased disparities for 2× SR. (a) Scene Bedroom, disparity range [-0.6, 0.8]×Kd. (b) Scene Chess, disparity range [-1.2, 1.6]×Kd. (c) Scene Robot, disparity range [-2.4, 4.0]×Kd. (d) Scene Study, disparity range [-0.7, 0.7]×Kd. Our LF-DFnet achieves better performance than resLF, especially on LF images with large disparity variations (i.e., wide baselines).

That is because large disparities can result in large misalignments among LF images and thus introduce difficulties in angular information exploitation. Since deformable convolution is used by our method to perform angular alignment, our LF-DFnet is more robust to disparity variations, and thus achieves better performance on LF images with wide baselines.
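For completeness, the proportionality between disparity and baseline used above follows from the standard rectified two-view relation (a worked relation under the usual pinhole assumptions; the symbols are ours): for a focal length $f$ (in pixels), a baseline $B$, and a scene depth $Z$,

$$d = \frac{f\,B}{Z}, \qquad \text{so that} \qquad d' = \frac{f\,(K_d B)}{Z} = K_d\, d,$$

i.e., scaling the baseline by a factor $K_d$ scales every disparity, and hence the disparity range of the scene, by the same factor.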

6) Performance Under Real-World Degradations: We compare our method to RCAN, ESRGAN, resLF, LF-ATO, and LF-InterNet under real-world degradation by directly applying them to LFs in the EPFL dataset. Since no groundtruth HR images are available in this dataset, we compare the visual performance in Fig. 9. Our LF-DFnet recovers much finer details from the input LF images, and produces fewer artifacts than RCAN and ESRGAN. Since the LF structure remains unchanged between bicubic and real-world degradation, our method can successfully learn to incorporate spatial and angular information from bicubically downsampled training data, and generalizes well to LF images under real degradation.

C. Ablation Study

In this subsection, we compare our LF-DFnet with several variants to investigate the potential benefits introduced by our network modules.

1) ADAMs: As the core component of our LF-DFnet, ADAM can perform feature alignment between the center view and each side view.


Fig. 9: Visual results achieved by different methods (Input, RCAN, ESRGAN, resLF, LF-InterNet, and LF-DFnet) under real-world degradation.

Here, we investigate ADAM by introducing the following three variants:

• RegularCNN: obtained by replacing the deformable convolutions with regular 3×3 convolutions in both the feature collection and feature distribution stages.

• RegularDist: obtained by replacing the deformable convolutions with regular 3×3 convolutions only in the feature distribution stage.

• LearnDist: obtained by performing offset learning during feature distribution rather than using the opposite offset values.

Apart from these three variants, the SR performance is also influenced by the number of ADAMs in the network. To investigate the effect of these coupled factors, we trained 20 models with 4 design options and 5 different numbers of ADAMs from scratch. Comparative results of these 20 models are shown in Fig. 10.

It can be observed from Figs. 10(a) and 10(d) that RegularDist, LearnDist, and our model achieve comparable results in center-view PSNR/SSIM, while RegularCNN achieves relatively lower scores. That is because, by using deformable convolutions for feature collection, contributive information can be effectively collected and used to reconstruct the center views. Similarly, deformable convolutions in the feature distribution stage also play an important role in LF image SR. As shown in Figs. 10(c) and 10(f), RegularCNN and RegularDist achieve lower PSNR/SSIM scores than LearnDist and the proposed model. That is because, without deformable convolutions for feature distribution, the incorporated information cannot be effectively fused to the side views, resulting in a lower minimum reconstruction accuracy. In terms of average PSNR/SSIM (see Figs. 10(b) and 10(e)), the proposed model and LearnDist achieve comparable results. However, LearnDist performs offset learning twice in each ADAM, and thus has a larger model size and higher FLOPs than the proposed model, as shown in Figs. 10(g) and 10(h). In summary, our proposed model can achieve good SR performance on all angular views while maintaining a reasonable computational cost.

[Fig. 10 panels: (a) center-view PSNR, (b) average PSNR, (c) minimum PSNR, (d) center-view SSIM, (e) average SSIM, (f) minimum SSIM, (g) #Params., (h) FLOPs, each plotted against the number of ADAMs (# Blocks) for RegularCNN, RegularDist, LearnDist, and the proposed model.]

Fig. 10: Comparative results achieved on 5 datasets [16], [36], [37], [70], [71] by LF-DFnet and its variants with 1–5 ADAMs for 2× SR. Here, the center-view, average, and minimum PSNR and SSIM values are reported to comprehensively evaluate the reconstruction accuracy. Moreover, the number of parameters (i.e., #Params.) and FLOPs (calculated with an input SAI of size 160 × 160) are also reported to show their computational efficiency.

TABLE V: Average PSNR and SSIM values achieved on 5 datasets [16], [36], [37], [70], [71] by LF-DFnet and its variants for 2× SR. Here, the number of parameters (i.e., #Params.) and FLOPs (calculated with an input SAI of size 160 × 160) are reported to show their computational efficiency.

Model                    PSNR    SSIM     #Params.   FLOPs
LF-DFnet woASPPinFEM     38.46   0.9856   4.60M      56.6G
LF-DFnet woASPPinOFS     38.44   0.9856   4.57M      56.1G
LF-DFnet                 38.59   0.9858   4.62M      57.2G

model can achieve good SR performance on all angular views while maintaining a reasonable computational cost.

Moreover, it can be observed in Figs. 10(a)–10(f) that the reconstruction accuracy is improved as the number of ADAMs increases. However, the performance tends to saturate when the number of ADAMs is increased from 4 to 5. Since the model size (i.e., #Params.) and computational cost (i.e., FLOPs) grow linearly with respect to the number of ADAMs (as shown in Figs. 10(g) and 10(h)), we finally use 4 ADAMs in our network (i.e., K = 4) to achieve a tradeoff between reconstruction accuracy and computational efficiency.

2) Residual ASPP Module: The residual ASPP module is used in our LF-DFnet for both feature extraction and offset learning. To demonstrate its effectiveness, we introduced two variants, LF-DFnet woASPPinFEM and LF-DFnet woASPPinOFS, by replacing the residual ASPP blocks with residual blocks in


the feature extraction module and the offset learning branch, respectively. As shown in Table V, LF-DFnet woASPPinFEM suffers a 0.13 dB decrease in PSNR as compared to LF-DFnet. This is because the residual ASPP module extracts hierarchical features from input images, which are beneficial to LF image SR. Similarly, a 0.16 dB PSNR decrease is introduced when the ASPP module is removed from the offset learning branch. This is because the ASPP module achieves accurate offset learning through multi-scale feature representation and the enlargement of receptive fields.
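As a reference for what is being ablated here, the block below is a minimal sketch of a residual ASPP block: parallel dilated 3 × 3 convolutions fused by a 1 × 1 convolution with a residual connection. The dilation rates, channel width, and activation are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class ResidualASPPBlock(nn.Module):
    """Residual ASPP block: multi-scale dilated convolutions with a skip connection."""

    def __init__(self, channels: int = 64, rates: tuple = (1, 2, 4)):
        super().__init__()
        # one branch per dilation rate; padding equals the rate to keep spatial size
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r),
                nn.LeakyReLU(0.1, inplace=True),
            )
            for r in rates
        ])
        # 1x1 convolution fuses the concatenated multi-scale features
        self.fuse = nn.Conv2d(len(rates) * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return x + self.fuse(multi_scale)  # residual connection


# usage with a hypothetical feature map
feat = torch.rand(1, 64, 32, 32)
out = ResidualASPPBlock()(feat)  # same shape as the input
```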

VI. CONCLUSION

In this paper, we proposed LF-DFnet to handle the disparity problem for LF image SR. By performing feature alignment with our angular deformable alignment module, angular information can be well incorporated and the SR performance is significantly improved. Moreover, we developed a baseline-adjustable LF dataset for performance evaluation. Experimental results on both public and our self-developed datasets have demonstrated the superiority of our method. Our LF-DFnet achieves state-of-the-art quantitative and qualitative SR performance, and is more robust to disparity variations.

REFERENCES

[1] C.-T. Huang, Y.-W. Wang, L.-R. Huang, J. Chin, and L.-G. Chen, “Fast physically correct refocusing for sparse light fields using block-based multi-rate view interpolation,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 603–618, 2016.
[2] Z. Xiao, Q. Wang, G. Zhou, and J. Yu, “Aliasing detection and reduction scheme on angularly undersampled light fields,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2103–2115, 2017.
[3] Y. Wang, J. Yang, Y. Guo, C. Xiao, and W. An, “Selective light field refocusing for camera arrays using bokeh rendering and superresolution,” IEEE Signal Processing Letters, vol. 26, no. 1, pp. 204–208, 2018.
[4] K. Mishiba, “Fast depth estimation for light field cameras,” IEEE Transactions on Image Processing, vol. 29, pp. 4232–4242, 2020.
[5] A. Chuchvara, A. Barsi, and A. Gotchev, “Fast and accurate depth estimation from sparse light fields,” IEEE Transactions on Image Processing, 2019.
[6] W. Zhou, E. Zhou, G. Liu, L. Lin, and A. Lumsdaine, “Unsupervised monocular depth estimation from light field image,” IEEE Transactions on Image Processing, vol. 29, pp. 1606–1617, 2019.
[7] F. Liu, S. Zhou, Y. Wang, G. Hou, Z. Sun, and T. Tan, “Binocular light-field: Imaging theory and occlusion-robust depth perception application,” IEEE Transactions on Image Processing, vol. 29, pp. 1628–1640, 2019.
[8] J. Shi, X. Jiang, and C. Guillemot, “A framework for learning depth from a flexible subset of dense and sparse light field views,” IEEE Transactions on Image Processing, vol. 28, no. 12, pp. 5867–5880, 2019.
[9] H. Sheng, S. Zhang, X. Cao, Y. Fang, and Z. Xiong, “Geometric occlusion analysis in depth estimation using integral guided filter for light-field image,” IEEE Transactions on Image Processing, vol. 26, no. 12, pp. 5758–5771, 2017.
[10] Y. Piao, X. Li, M. Zhang, J. Yu, and H. Lu, “Saliency detection via depth-induced cellular automata on light field,” IEEE Transactions on Image Processing, vol. 29, pp. 1879–1889, 2019.
[11] J. Zhang, Y. Liu, S. Zhang, R. Poppe, and M. Wang, “Light field saliency detection with deep convolutional networks,” IEEE Transactions on Image Processing, vol. 29, pp. 4421–4434, 2020.
[12] M. Zhang, J. Li, J. Wei, Y. Piao, and H. Lu, “Memory-oriented decoder for light field salient object detection,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 896–906.
[13] T. Wang, Y. Piao, X. Li, L. Zhang, and H. Lu, “Deep learning for light field saliency detection,” in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 8838–8848.
[14] N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu, “Saliency detection on light field,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 2806–2813.
[15] T. Li, D. P. Lun, Y.-H. Chan et al., “Robust reflection removal based on light field imaging,” IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1798–1812, 2018.
[16] M. Le Pendu, X. Jiang, and C. Guillemot, “Light field inpainting propagation via low rank matrix completion,” IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1981–1993, 2018.
[17] Y. Wang, T. Wu, J. Yang, L. Wang, W. An, and Y. Guo, “DeOccNet: Learning to see through foreground occlusions in light fields,” in Winter Conference on Applications of Computer Vision (WACV). IEEE, 2020.
[18] G. Wu, B. Masia, A. Jarabo, Y. Zhang, L. Wang, Q. Dai, T. Chai, and Y. Liu, “Light field image processing: An overview,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 7, pp. 926–954, 2017.
[19] S. Wanner and B. Goldluecke, “Variational light field analysis for disparity estimation and super-resolution,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 606–619, 2013.
[20] K. Mitra and A. Veeraraghavan, “Light field denoising, light field superresolution and stereo camera based refocussing using a GMM light field patch prior,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2012, pp. 22–28.
[21] R. A. Farrugia, C. Galea, and C. Guillemot, “Super resolution of light field images using linear subspace projection of patch-volumes,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 7, pp. 1058–1071, 2017.
[22] M. Rossi and P. Frossard, “Geometry-consistent light field super-resolution via graph-based regularization,” IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4207–4218, 2018.
[23] V. K. Ghassab and N. Bouguila, “Light field super-resolution using edge-preserved graph-based regularization,” IEEE Transactions on Multimedia, 2019.
[24] M. Alain and A. Smolic, “Light field super-resolution via LFBM5D sparse coding,” in IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 2501–2505.
[25] Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, and I. So Kweon, “Learning a deep convolutional network for light-field image super-resolution,” in IEEE International Conference on Computer Vision Workshops (ICCVW), 2015, pp. 24–32.
[26] Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, and I. S. Kweon, “Light-field image super-resolution using convolutional neural network,” IEEE Signal Processing Letters, vol. 24, no. 6, pp. 848–852, 2017.
[27] Y. Yuan, Z. Cao, and L. Su, “Light-field image superresolution using a combined deep CNN based on EPI,” IEEE Signal Processing Letters, vol. 25, no. 9, pp. 1359–1363, 2018.
[28] S. Zhang, Y. Lin, and H. Sheng, “Residual networks for light field image super-resolution,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11046–11055.
[29] Y. Wang, F. Liu, K. Zhang, G. Hou, Z. Sun, and T. Tan, “LFNet: A novel bidirectional recurrent convolutional neural network for light-field image super-resolution,” IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4274–4286, 2018.
[30] H. W. F. Yeung, J. Hou, X. Chen, J. Chen, Z. Chen, and Y. Y. Chung, “Light field spatial super-resolution using deep efficient spatial-angular separable convolution,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2319–2330, 2018.
[31] Y. Wang, L. Wang, J. Yang, W. An, J. Yu, and Y. Guo, “Spatial-angular interaction for light field image super-resolution,” in European Conference on Computer Vision (ECCV), 2020.
[32] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2015.
[33] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 136–144.
[34] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in European Conference on Computer Vision (ECCV), 2018, pp. 286–301.
[35] J. Jin, J. Hou, J. Chen, and S. Kwong, “Light field spatial super-resolution via deep combinatorial geometry embedding and structural consistency regularization,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2260–2269.
[36] K. Honauer, O. Johannsen, D. Kondermann, and B. Goldluecke, “A dataset and evaluation methodology for depth estimation on 4D light fields,” in Asian Conference on Computer Vision (ACCV). Springer, 2016, pp. 19–34.
[37] V. Vaish and A. Adams, “The (new) Stanford light field archive,” Computer Graphics Laboratory, Stanford University, vol. 6, no. 7, 2008.


[38] L. Wang, Y. Guo, Z. Lin, X. Deng, and W. An, “Learning for video super-resolution through HR optical flow estimation,” in Asian Conference on Computer Vision (ACCV). Springer, 2018, pp. 514–529.
[39] L. Wang, Y. Guo, L. Liu, Z. Lin, X. Deng, and W. An, “Deep video super-resolution using HR optical flow estimation,” IEEE Transactions on Image Processing, 2020.
[40] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 764–773.
[41] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable ConvNets v2: More deformable, better results,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9308–9316.
[42] Y. Tian, Y. Zhang, Y. Fu, and C. Xu, “TDAN: Temporally deformable alignment network for video super-resolution,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[43] X. Wang, K. C. Chan, K. Yu, C. Dong, and C. Change Loy, “EDVR: Video restoration with enhanced deformable convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.
[44] H. Wang, D. Su, C. Liu, L. Jin, X. Sun, and X. Peng, “Deformable non-local network for video super-resolution,” IEEE Access, vol. 7, pp. 177734–177744, 2019.
[45] X. Xiang, Y. Tian, Y. Zhang, Y. Fu, A. Jan, and C. Xu, “Zooming Slow-Mo: Fast and accurate one-stage space-time video super-resolution,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[46] X. Ying, L. Wang, Y. Wang, W. Sheng, W. An, and Y. Guo, “Deformable 3D convolution for video super-resolution,” arXiv preprint, 2020.
[47] S. Anwar, S. Khan, and N. Barnes, “A deep journey into super-resolution: A survey,” ACM Computing Surveys (CSUR), vol. 53, no. 3, pp. 1–34, 2020.
[48] W. Yang, X. Zhang, Y. Tian, W. Wang, J.-H. Xue, and Q. Liao, “Deep learning for single image super-resolution: A brief review,” IEEE Transactions on Multimedia, 2019.
[49] Z. Wang, J. Chen, and S. C. Hoi, “Deep learning for image super-resolution: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[50] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 184–199.
[51] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1646–1654.
[52] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, and L. Zhang, “NTIRE 2017 challenge on single image super-resolution: Methods and results,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 114–125.
[53] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2472–2481.
[54] ——, “Residual dense network for image restoration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[55] T. Dai, J. Cai, Y. Zhang, S.-T. Xia, and L. Zhang, “Second-order attention network for single image super-resolution,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11065–11074.
[56] M. Alain and A. Smolic, “Light field denoising by sparse 5D transform domain collaborative filtering,” in International Workshop on Multimedia Signal Processing (MMSP). IEEE, 2017, pp. 1–6.
[57] K. Egiazarian and V. Katkovnik, “Single image super-resolution via BM3D sparse coding,” in European Signal Processing Conference (EUSIPCO). IEEE, 2015, pp. 2849–2853.
[58] Y. Huang, W. Wang, and L. Wang, “Bidirectional recurrent convolutional networks for multi-frame super-resolution,” in Advances in Neural Information Processing Systems (NeurIPS), 2015, pp. 235–243.
[59] G. Bertasius, L. Torresani, and J. Shi, “Object detection in video with spatiotemporal sampling networks,” in European Conference on Computer Vision (ECCV), 2018, pp. 331–346.
[60] Y. Zhao, Y. Xiong, and D. Lin, “Trajectory convolution for action recognition,” in Advances in Neural Information Processing Systems (NeurIPS), 2018, pp. 2204–2215.
[61] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei, “Integral human pose regression,” in European Conference on Computer Vision (ECCV), 2018, pp. 529–545.
[62] S. Nah, R. Timofte, S. Baik, S. Hong, G. Moon, S. Son, and K. Mu Lee, “NTIRE 2019 challenge on video deblurring: Methods and results,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.
[63] L. Wang, Y. Wang, Z. Liang, Z. Lin, J. Yang, W. An, and Y. Guo, “Learning parallax attention for stereo image super-resolution,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[64] Z. Hui, X. Gao, Y. Yang, and X. Wang, “Lightweight image super-resolution with information multi-distillation network,” in The 27th ACM International Conference on Multimedia. ACM, 2019, pp. 2024–2032.
[65] K. Zhang, S. Gu, R. Timofte, Z. Hui, X. Wang, X. Gao, D. Xiong, S. Liu, R. Gang, N. Nan et al., “AIM 2019 challenge on constrained super-resolution: Methods and results,” in IEEE International Conference on Computer Vision Workshop (ICCVW). IEEE, 2019, pp. 3565–3574.
[66] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695–4708, 2012.
[67] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a completely blind image quality analyzer,” IEEE Signal Processing Letters, vol. 20, no. 3, pp. 209–212, 2012.
[68] J. Yan, J. Li, and X. Fu, “No-reference quality assessment of contrast-distorted images using contrast enhancement,” arXiv preprint, 2019.
[69] X. Chen, Q. Zhang, M. Lin, G. Yang, and C. He, “No-reference color image quality assessment: From entropy to perceptual quality,” EURASIP Journal on Image and Video Processing, vol. 2019, no. 1, p. 77, 2019.
[70] M. Rerabek and T. Ebrahimi, “New light field image dataset,” in International Conference on Quality of Multimedia Experience (QoMEX), 2016.
[71] S. Wanner, S. Meister, and B. Goldluecke, “Datasets and benchmarks for densely sampled 4D light fields,” in Vision, Modelling and Visualization (VMV), vol. 13. Citeseer, 2013, pp. 225–226.
[72] C. Shin, H.-G. Jeon, Y. Yoon, I. So Kweon, and S. Joo Kim, “EPINET: A fully-convolutional neural network using epipolar geometry for depth from light field images,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4748–4757.
[73] I. K. Park, K. M. Lee et al., “Robust light field depth estimation using occlusion-noise aware data costs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 10, pp. 2484–2497, 2017.
[74] J. Y. Lee and R.-H. Park, “Complex-valued disparity: Unified depth model of depth from stereo, depth from focus, and depth from defocus based on the light field gradient,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[75] H.-G. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y.-W. Tai, and I. S. Kweon, “Depth from a light field image with learning-based matching costs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 297–310, 2018.
[76] H. Sheng, P. Zhao, S. Zhang, J. Zhang, and D. Yang, “Occlusion-aware depth estimation for light field using multi-orientation EPIs,” Pattern Recognition, vol. 74, pp. 587–599, 2018.
[77] S. Zhang, H. Sheng, C. Li, J. Zhang, and Z. Xiong, “Robust depth estimation for light field via spinning parallelogram operator,” Computer Vision and Image Understanding, vol. 145, pp. 148–159, 2016.
[78] G. Wu, M. Zhao, L. Wang, Q. Dai, T. Chai, and Y. Liu, “Light field reconstruction using deep convolutional network on EPI,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6319–6327.
[79] G. Wu, Y. Liu, L. Fang, Q. Dai, and T. Chai, “Light field reconstruction using convolutional network on EPI and extended applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 7, pp. 1681–1694, 2018.
[80] G. Wu, Y. Liu, Q. Dai, and T. Chai, “Learning sheared EPI structure for light field reconstruction,” IEEE Transactions on Image Processing, vol. 28, no. 7, pp. 3261–3273, 2019.
[81] S. Vagharshakyan, R. Bregovic, and A. Gotchev, “Light field reconstruction using shearlet transform,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 1, pp. 133–147, 2017.
[82] H. Wing Fung Yeung, J. Hou, J. Chen, Y. Ying Chung, and X. Chen, “Fast light field reconstruction with deep coarse-to-fine modeling of spatial-angular clues,” in European Conference on Computer Vision (ECCV), 2018, pp. 137–152.
[83] Y. Wang, F. Liu, Z. Wang, G. Hou, Z. Sun, and T. Tan, “End-to-end view synthesis for light field imaging with pseudo 4DCNN,” in European Conference on Computer Vision (ECCV), 2018, pp. 333–348.
[84] S. Zhang, H. Sheng, D. Yang, J. Zhang, and Z. Xiong, “Micro-lens-based matching for scene recovery in lenslet cameras,” IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1060–1075, 2017.
[85] N. Meng, H. K.-H. So, X. Sun, and E. Lam, “High-dimensional dense residual convolutional neural network for light field reconstruction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.


[86] J. Jin, J. Hou, H. Yuan, and S. Kwong, “Learning light field angular super-resolution via a geometry-aware network,” in AAAI Conference on Artificial Intelligence, 2020.
[87] N. Meng, X. Wu, J. Liu, and E. Y. Lam, “High-order residual network for light field super-resolution,” in AAAI Conference on Artificial Intelligence, 2020.
[88] R. Farrugia and C. Guillemot, “Light field super-resolution using a low-rank prior and deep convolutional neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[89] R. A. Farrugia and C. Guillemot, “A simple framework to leverage state-of-the-art single-image super-resolution methods to restore light fields,” Signal Processing: Image Communication, vol. 80, p. 115638, 2020.
[90] M. S. K. Gul and B. K. Gunturk, “Spatial and angular resolution enhancement of light fields using convolutional neural networks,” IEEE Transactions on Image Processing, vol. 27, no. 5, pp. 2146–2159, 2018.
[91] Y. Wang, L. Wang, J. Yang, W. An, and Y. Guo, “Flickr1024: A large-scale dataset for stereo image super-resolution,” in IEEE International Conference on Computer Vision Workshops (ICCVW), 2019.
[92] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4681–4690.
[93] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, “ESRGAN: Enhanced super-resolution generative adversarial networks,” in European Conference on Computer Vision (ECCV), 2018.
[94] Y. Anagun, S. Isik, and E. Seke, “SRLibrary: Comparing different loss functions for super-resolution over various convolutional architectures,” Journal of Visual Communication and Image Representation, vol. 61, pp. 178–187, 2019.
[95] X. Ying, Y. Wang, L. Wang, W. Sheng, W. An, and Y. Guo, “A stereo attention module for stereo image super-resolution,” IEEE Signal Processing Letters, vol. 27, pp. 496–500, 2020.
[96] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[97] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.

Yingqian Wang received the B.E. degree in electrical engineering from Shandong University (SDU), Jinan, China, in 2016, and the M.E. degree in information and communication engineering from National University of Defense Technology (NUDT), Changsha, China, in 2018. He is currently pursuing the Ph.D. degree with the College of Electronic Science and Technology, NUDT. His research interests focus on low-level vision, particularly on light field imaging and image super-resolution.

Jungang Yang received the B.E. and Ph.D. degrees from National University of Defense Technology (NUDT), in 2007 and 2013, respectively. He was a visiting Ph.D. student with the University of Edinburgh, Edinburgh, from 2011 to 2012. He is currently an associate professor with the College of Electronic Science, NUDT. His research interests include computational imaging, image processing, compressive sensing, and sparse representation. Dr. Yang received the New Scholar Award of the Chinese Ministry of Education in 2012, and the Youth Innovation Award and the Youth Outstanding Talent of NUDT in 2016.

Longguang Wang received the B.E. degree in electrical engineering from Shandong University (SDU), Jinan, China, in 2015, and the M.E. degree in information and communication engineering from National University of Defense Technology (NUDT), Changsha, China, in 2017. He is currently pursuing the Ph.D. degree with the College of Electronic Science and Technology, NUDT. His research interests include low-level vision and deep learning.

Xinyi Ying is currently pursuing the M.E. degree with the College of Electronic Science and Technology, National University of Defense Technology (NUDT). Her research interests focus on low-level vision, particularly on image and video super-resolution.

Tianhao Wu received the B.E. degree in electronic engineering from National University of Defense Technology (NUDT), Changsha, China, in 2020. He is currently pursuing the M.E. degree with the College of Electronic Science and Technology, NUDT. His research interests include light field imaging and camera calibration.

Wei An received the Ph.D. degree from the National University of Defense Technology (NUDT), Changsha, China, in 1999. She was a Senior Visiting Scholar with the University of Southampton, Southampton, U.K., in 2016. She is currently a Professor with the College of Electronic Science and Technology, NUDT. She has authored or co-authored over 100 journal and conference publications. Her current research interests include signal processing and image processing.

Yulan Guo is currently an associate professor. He received the B.E. and Ph.D. degrees from National University of Defense Technology (NUDT) in 2008 and 2015, respectively. He was a visiting Ph.D. student with the University of Western Australia from 2011 to 2014. He worked as a postdoctoral research fellow with the Institute of Computing Technology, Chinese Academy of Sciences, from 2016 to 2018. He has authored over 90 articles in journals and conferences, such as the IEEE TPAMI and IJCV. His current research interests focus on 3D vision, particularly on 3D feature learning, 3D modeling, 3D object recognition, and scene understanding. Dr. Guo received the CAAI Outstanding Doctoral Dissertation Award in 2016 and the CAAI Wu-Wenjun Outstanding AI Youth Award in 2019. He served/will serve as an associate editor for IET Computer Vision and IET Image Processing, a guest editor for IEEE TPAMI, an area chair for CVPR 2021 and ICPR 2020, and an organizer for a tutorial in CVPR 2016 and a workshop in CVPR 2019. Dr. Guo is a senior member of IEEE.