
Efficient Video Super-Resolution through Recurrent Latent Space Propagation

Dario Fuoli    Shuhang Gu    Radu Timofte
Computer Vision Lab, ETH Zurich, Switzerland

{dario.fuoli, shuhang.gu, radu.timofte}@vision.ee.ethz.ch

Abstract

With the recent trend towards ultra high definition displays, the demand for high quality and efficient video super-resolution (VSR) has become more important than ever. Previous methods adopt complex motion compensation strategies to exploit temporal information when estimating the missing high frequency details. However, as motion estimation is itself a highly challenging problem, inaccurate motion compensation may affect the performance of VSR algorithms. Furthermore, the complex motion compensation module may also introduce a heavy computational burden, which limits the application of these methods in real systems. In this paper, we propose an efficient recurrent latent space propagation (RLSP) algorithm for fast VSR. RLSP introduces high-dimensional latent states to propagate temporal information between frames in an implicit manner. Our experimental results show that RLSP is a highly efficient and effective method to deal with the VSR problem. We outperform the current state-of-the-art method [16] with an over 70× speed-up.

1. Introduction

Super-resolution aims to obtain high-resolution (HR) images from low-resolution (LR) observations. It provides a practical solution to enhance existing images as well as to alleviate the pressure of data transmission. One category of methods [6, 18, 23, 25, 22, 12] takes a single LR image as input. The single image super-resolution (SISR) problem has been intensively studied for many years and is still an active topic in the area of computer vision. Another category of approaches [15, 33, 16, 7, 2, 38], i.e. video super-resolution (VSR), takes LR video as input. In contrast to SISR methods, which can only rely on natural image priors for the estimation of high resolution details, VSR exploits temporal information for improved recovery of image details.

A key issue for the success of VSR algorithms is how to take full advantage of temporal information [30]. In the early years, different methods were suggested to model the subpixel-level motion between LR observations;

[Figure 1 plots PSNR on Vid4 (dB) against per-frame runtime (ms, log scale) for RLSP 7-48/7-64/7-128/7-256, DUF-16/28/52, FRVSR 3-64 and FRVSR 10-128.]

Figure 1. Quantitative comparison of PSNR values on Vid4 and computation times to produce a single Full HD (1920x1080) frame with other state-of-the-art methods FRVSR [33] and DUF [16].

among others, the bilateral prior [8] and the Bayesian estimation model [26] have been adopted to solve the VSR problem. In recent years, the success of deep learning in other vision tasks inspired research into applying convolutional neural networks (CNN) also to VSR. Following a strategy already adopted in conventional VSR algorithms, most existing deep learning based VSR methods divide the task into two sub-problems: motion estimation and a subsequent compensation procedure. In the last years, a large number of elaborately designed models have been proposed to capture the subpixel motion between input LR frames. However, as subpixel-level alignment of images is highly challenging, these approaches may generate blurred estimations when the motion compensation module fails to produce accurate motion estimates. Furthermore, the complex motion estimation and compensation modules are often computationally expensive, which makes these methods unable to handle HR video in real time.

To address the accuracy issue, Jo et al. [16] proposed dynamic upsampling filters (DUF) to perform VSR without explicit motion compensation. In their solution, motion information is implicitly captured with dynamic upsampling filters and the HR frame is directly constructed by local


filtering of the center input frame. Such an implicit formulation avoids conducting motion compensation in the image space and helps DUF to obtain state-of-the-art VSR results. However, as DUF needs to estimate dynamic filters at each location, the algorithm suffers from heavy computation as well as a high memory burden when processing large images.

In this paper, we propose a recurrent latent space propagation (RLSP) method¹ for efficient VSR. RLSP follows a similar strategy as FRVSR [33], which utilizes a recurrent architecture to avoid processing LR input frames multiple times. In contrast to FRVSR, which adopts explicit motion estimation and warping operations to exploit temporal information, RLSP introduces high-dimensional latent states to propagate temporal information in an implicit way.

In Fig. 1, we present the trade-off between runtime and accuracy (average PSNR) for state-of-the-art VSR approaches on the Vid4 dataset [26]. The proposed RLSP approach achieves a better balance between speed and performance than the competing methods. RLSP achieves about 10× and 70× speed-ups over FRVSR and DUF, respectively, while maintaining similar accuracy. Furthermore, despite its efficiency, by utilizing more filters in our model, RLSP can also be pushed to pursue state-of-the-art VSR accuracy. Our model RLSP 7-256 achieves the highest PSNR on the Vid4 benchmark.

2. Related Work

Single Image Super-Resolution  With the rise of deep learning, especially convolutional neural networks (CNN) [21], learning based super-resolution models have been shown to be superior in terms of accuracy compared to classical interpolation methods, such as bilinear and bicubic interpolation and similar approaches. One of the earliest methods to apply convolution for super-resolution is SRCNN, proposed by [6]. SRCNN uses a shallow network of only 3 convolutional layers. VDSR [18] shows substantial improvements by using a much deeper network of 20 layers combined with residual learning. In order to obtain visually more pleasing, photorealistic and natural looking images, accuracy with respect to the ground truth is traded off by methods such as SRGAN [23], EnhanceNet [32], and [31, 4, 28], which introduce alternative loss functions [11] to super-resolution. An overview of recent methods in the field of SISR is provided by [36].

Video Super-Resolution  Super-resolution can be generalized from images to videos. Videos additionally provide temporal information among frames, which can be exploited to improve interpolation quality. Non-deep-learning video super-resolution problems are often solved by

¹Our code and models are publicly available at: https://github.com/dariofuoli/RLSP

formulating demanding optimization problems, leading to slow evaluation times [1, 8, 26].

Many deep learning based VSR methods are composed of multiple independent processing pipelines, motivated by prior knowledge and inspired by traditional computer vision tools. To leverage temporal information, a natural extension of SISR is combining multiple low-resolution frames to produce a single high-resolution estimate [24, 29, 5]. Kappeler et al. [17] combine several adjacent frames. Non-center frames are motion compensated by calculating optical flow and warping towards the center frame. All frames are then concatenated and processed by 3 convolution layers. Tao et al. [35] produce a single high-resolution frame y_t from up to 7 low-resolution input frames x_{t-3:t+3}. First, motion is estimated in low resolution and a preliminary high-resolution frame is computed through a sub-pixel motion compensation layer. The final output is computed by applying an encoder-decoder style network with an intermediate convolutional LSTM [13] layer. Liu et al. [27] calculate multiple high-resolution estimates in parallel branches, each processing an increasing number of low-resolution frames. Additionally, a temporal modulation branch computes weights according to which the respective high-resolution estimates are aggregated, forming the final high-resolution output. Caballero et al. [3] extract motion flow maps between adjacent frames and the center frame x_t. The frames are warped according to the flow maps towards frame x_t. These frames are then processed with a spatio-temporal network, by either direct concatenation and convolution, gradually applying several convolution and concatenation steps, or applying 3D convolutions [37]. Jo et al. [16] propose DUF, a network without explicit motion estimation. Dynamic upsampling filters and residuals are calculated from a batch of adjacent input frames. The center frame is filtered and the residuals are added to obtain the final output.

A more powerful approach to processing sequential data like video is to use recurrent connections between time steps. Methods using a fixed number of input frames are inherently limited by the information content in those frames. Recurrent models, however, are able to leverage information from a potentially unlimited number of frames. Sajjadi et al. [33] use an optical flow network, followed by a super-resolution network. Optical flow is calculated between x_{t-1} and x_t to warp the previous output y_{t-1} towards t. The final output y_t is calculated from the warped previous output and the current low-resolution input frame x_t. The two networks are trained jointly. Huang et al. [14] propose a bidirectional recurrent network using 2D and 3D convolutions with recurrent connections between time steps. A forward pass and a backward pass are combined to produce the final output frames. Because of its nature, this method cannot be applied online.

[Figure 2 diagram: the RLSP cell at time t. Inputs x_{t-1}, x_t, x_{t+1}, the hidden state h_{t-1} and the previous output y_{t-1} (shuffled down /4) are concatenated, processed by layers 1..n-1 of Conv 3x3xf + ReLU, and a final Conv 3x3x16 (layer n) produces residuals that are added to x*_t and shuffled up (x4) to form y_t; h_t and y_t are passed to the next cell.]

Figure 2. RLSP architecture. The recurrent cell is shown at time t, ⊗ denotes concatenation along the channel dimension, ⊕ denotes element-wise addition. Information is propagated over time through hidden state h and feedback.

RLSP does not rely on a dedicated motion compensation module and instead introduces a recurrent hidden state to efficiently leverage temporal information implicitly.

3. Method

Video super-resolution (VSR) maps an LR video x to an HR video y by a given scaling factor r. At time t, a single frame y_t ∈ R^{rH×rW×C} represents the reconstructed HR frame of x_t ∈ R^{H×W×C}, where H and W are the spatial dimensions and C is the number of color channels. In contrast to SISR, the temporal dimension provides additional information, which can be leveraged when generating a single frame y_t. As a natural choice for sequential data, we therefore propose a recurrent neural network (RNN). The model is fully defined by its cell, illustrated in Fig. 2.

As done in many non-recurrent methods, we feed several adjacent LR frames x_{t-1:t+1} to produce y_t. These frames are concatenated along the channel axis, together with the recurrent inputs h_{t-1} and y_{t-1}. To concatenate and align the previous HR output y_{t-1} with the LR tensors, y_{t-1} is shuffled down by the scaling factor r, see Sec. 3.1. The combined input is then fed to n convolution layers with ReLU activation functions. In the last stage, the hidden state h_t for the next iteration and the HR output's residuals in LR space are produced. The residuals are added to the nearest-neighbor interpolated frame x*_t represented in LR space and shuffled up by the scaling factor r to finally generate the output y_t (see Sec. 3.2). All input frames are in RGB color space, while the output represents the brightness channel Y of the YCbCr color space. Chroma channels are upscaled separately with bicubic interpolation. All our models are trained with a scaling factor of r = 4.

The model is a recurrent, fully convolutional network. It is therefore not limited to a fixed input size and can accommodate video data of any dimensions. With the exception of higher-level contextual information (along the spatial and temporal axes), super-resolution is a highly locality-based interpolation problem. Thus, a convolutional neural network is a sensible choice. Since the receptive field grows with the number of convolution layers, the network is still able to detect complex structure across an extended neighborhood. In order to optimize information flow, much care is taken to keep local alignment throughout the network, which is achieved by using operations that preserve local integrity, see Sec. 3.1 and Sec. 3.2. We aim for efficiency and therefore use a fixed number of n = 7 layers; the filters' spatial dimensions are set to 3 × 3 for all models. The number of filters f is adapted per model.

In the following sections, the model’s core elements arediscussed in more detail.

3.1. Shuffling

To realize the mapping from LR to HR, the spatial dimensions need to be transformed at some point in the model. The most relevant processing is kept in LR and the spatial expansion is executed at the last stage of the processing chain. Because the output is fed back, an inverse transformation from HR to LR is also applied. Shuffling performs these transformations by reducing the channel dimension Z of a tensor t by a factor r² while extending both spatial dimensions H and W by a factor r, and vice versa for the inverse transformation. Since shuffling is a bijective transformation, these operations are reversible:

$$t_{LR} \in \mathbb{R}^{H \times W \times Z} \;\xrightarrow{\times r}\; t_{HR} \in \mathbb{R}^{rH \times rW \times Z/r^2} \qquad (1)$$

$$t_{HR} \in \mathbb{R}^{H \times W \times Z} \;\xrightarrow{/r}\; t_{LR} \in \mathbb{R}^{H/r \times W/r \times r^2 Z} \qquad (2)$$

To get a single-channel HR output image with upscaling factor r = 4, the LR tensor's channel dimension needs to be Z = 16. Therefore, the last layer has 16 filters. A very important characteristic of this transformation is that it retains local integrity: all pixels along the channel dimension in LR are rearranged into their corresponding local HR interpolation area. This enables a smooth, localized information flow from LR input to HR output. The shuffling operation has been adopted in previous works [34, 33] to change the spatial resolution of image/feature maps.
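As a concrete illustration (a minimal sketch, not the authors' code), the transformations in Eqs. (1) and (2) correspond to the channel-to-space rearrangement available as pixel shuffle / unshuffle in PyTorch; the shapes below assume r = 4 and Z = 16 as in our setting:

```python
import torch
import torch.nn.functional as F

r = 4                                    # upscaling factor
t_lr = torch.randn(1, 16, 64, 64)        # LR tensor with Z = r^2 = 16 channels

# Eq. (1): shuffle up, (1, 16, 64, 64) -> (1, 1, 256, 256)
t_hr = F.pixel_shuffle(t_lr, r)

# Eq. (2): shuffle down, (1, 1, 256, 256) -> (1, 16, 64, 64)
t_lr_back = F.pixel_unshuffle(t_hr, r)

# The transformation is bijective, so no information is lost.
assert torch.equal(t_lr, t_lr_back)
```

Each group of 16 channel entries at one LR location fills the corresponding 4 × 4 HR block, which is exactly the local integrity property described above.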

[Figure 3 diagram: a 2×2×4 tensor is rearranged into a 4×4×1 tensor; the four channel entries of each LR pixel fill the corresponding 2×2 HR block.]

Figure 3. Shuffling: $t_{LR} \in \mathbb{R}^{2 \times 2 \times 4} \xrightarrow{\times 2} t_{HR} \in \mathbb{R}^{4 \times 4 \times 1}$.

3.2. Residual Learning

Reducing the sample rate inherently leads to a loss of frequency components above the Nyquist frequency. Due to the lower Nyquist frequency in LR space, the main information loss occurs in the spatial high-frequency components. Low-frequency components below the Nyquist frequency can be fully retained when downsampling. This a priori knowledge is used by introducing a residual connection directly from the LR input frame x_t to the output frame y_t. First, x_t is converted from RGB color space to the brightness channel Y (of YCbCr color space) and replicated 16 times to match the residual's dimension. This procedure is effectively nearest-neighbor interpolation. No information is altered during this process, which means the network does not need to allocate capacity to learn this transformation, as would be the case with other methods, e.g. bicubic interpolation. In contrast to FRVSR, which does not adopt this strategy, all capacity can be used to reconstruct only the missing high-frequency components. All LR inputs and the nearest-neighbor interpolated HR output (represented in LR space) are properly aligned.
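To make the replication argument concrete, the following sketch (assuming PyTorch; not the authors' code) verifies that replicating the brightness channel 16 times and shuffling up reproduces nearest-neighbor interpolation exactly:

```python
import torch
import torch.nn.functional as F

r = 4
y_lr = torch.randn(1, 1, 64, 64)                   # brightness channel Y of x_t in LR

# Replicate 16x along the channel axis and shuffle up ...
nn_via_shuffle = F.pixel_shuffle(y_lr.repeat(1, r * r, 1, 1), r)

# ... which matches nearest-neighbor interpolation pixel for pixel.
nn_direct = F.interpolate(y_lr, scale_factor=r, mode="nearest")
assert torch.equal(nn_via_shuffle, nn_direct)
```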

3.3. Feedback

Feeding back the output naturally helps to improve continuity between frames and reduces flickering, which can occur in models with limited temporal connectivity. Because of the high correlation between adjacent frames, having a reference of the previous output y_{t-1} also supplies additional, already processed HR information when producing the HR estimate y_t.

3.4. Hidden State

In order to propagate complex, abstract information across time, a hidden state h_t is added to the processing chain. The hidden state is realized by carrying forward the feature maps from the previous iteration and feeding them back to the input through concatenation. This keeps the network structure in line with the prior on locality. Since the whole network is fully convolutional, the hidden state's spatial dimensions are dynamically adjusted according to the input size of x. The hidden state can be seen as a set of vectors in a locality-based latent space R^f, characterized by vectors v ∈ R^f with entries along the channel axis, see Fig. 4. Since every position in a feature map is calculated by the same convolution kernel, each feature map represents a dimension in the latent space R^f.

[Figure 4 diagram: entries along the channel dimension of the H × W × f hidden state form per-location vectors v_1, ..., v_f.]

Figure 4. Locality based hidden state. Entries along the channel dimension represent vectors v ∈ R^f.

In contrast to using only feedback as done in [33], which is bound to pass HR information from y_{t-1} only, the hidden state is not limited in the type of information that can be propagated. It is theoretically possible to propagate past information across the whole time axis.

Because at time instance t the next frame x_{t+1} is already fed to the input, the hidden state also allows every frame x_t to be processed twice before estimating y_t. This essentially increases the receptive field and the number of processing layers for a single frame, even though both processing steps share their weights. Because of recurrence, these two steps can be efficiently distributed over two time steps.
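Putting Secs. 3.1-3.4 together, a minimal sketch of the cell could look as follows. This reflects our reading of Fig. 2 rather than the released code: the RGB-to-Y conversion is simplified to a channel mean, and the exact placement of the hidden-state branch is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RLSPCell(nn.Module):
    """Sketch of the recurrent RLSP cell (Sec. 3, Fig. 2); details are assumptions."""

    def __init__(self, n_layers=7, f=64, r=4):
        super().__init__()
        self.r = r
        in_ch = 3 * 3 + f + r * r            # three RGB frames + h_{t-1} + shuffled y_{t-1}
        body = [nn.Conv2d(in_ch, f, 3, padding=1)]
        body += [nn.Conv2d(f, f, 3, padding=1) for _ in range(n_layers - 2)]
        self.body = nn.ModuleList(body)                    # layers 1 .. n-1, each with ReLU
        self.last = nn.Conv2d(f, r * r, 3, padding=1)      # layer n: 16 residual channels

    def forward(self, x_prev, x_t, x_next, h_prev, y_prev):
        # Concatenate LR frames, hidden state and the shuffled-down previous HR output.
        inp = torch.cat(
            [x_prev, x_t, x_next, h_prev, F.pixel_unshuffle(y_prev, self.r)], dim=1)
        feat = inp
        for conv in self.body:
            feat = F.relu(conv(feat))
        h_t = feat                            # hidden state: feature maps after layer n-1
        res = self.last(feat)                 # residuals in LR space
        # Residual connection: replicated brightness channel of x_t (nearest neighbor).
        # (Placeholder: the paper converts RGB to the Y channel of YCbCr.)
        base = x_t.mean(dim=1, keepdim=True).repeat(1, self.r * self.r, 1, 1)
        y_t = F.pixel_shuffle(base + res, self.r)
        return y_t, h_t
```

At t = 0, both h and y are initialized with zeros, as described in Sec. 4.2.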

3.5. Loss

The loss function is defined as the pixel-wise mean squared error (MSE) between all k pixels of the ground-truth frames y* and the network's output y:

$$L = \frac{1}{k}\,\lVert y^{*} - y \rVert_2^2 \qquad (3)$$

4. Experimental Setup

4.1. Dataset

We adhere to the experimental setup from [33] and use the same dataset. Originally, it contained 40 high-resolution videos (720p, 1080p, 4k) downloaded from vimeo.com, but 3 videos were no longer online, so we train on the 37 available videos instead. Following the same procedure as in [33], we produce 40,000 random crops of size 20 × 256 × 256 × 3, which serve as HR ground-truth sequences y* ∈ R^{T×H×W×C}. To get the corresponding LR sequences x, Gaussian blur with σ = 1.5 is applied and every 4-th pixel in both spatial dimensions is sampled, as done in both methods [33, 16] that we use for comparison. To monitor training progress and generalization, we downloaded 10 additional high-resolution videos from youtube.com and generated validation sequences with the same procedure as used for generating the training data.
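A sketch of this LR generation step (Gaussian blur with σ = 1.5, then keeping every 4-th pixel) is shown below; the kernel size is not specified in the text, so the value here is an assumption:

```python
import torch
import torch.nn.functional as F

def make_lr(hr, r=4, sigma=1.5, ksize=13):
    """Blur an HR batch (B, C, H, W) with a Gaussian and keep every r-th pixel."""
    ax = torch.arange(ksize, dtype=torch.float32) - (ksize - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    kernel = torch.outer(g, g)
    kernel = (kernel / kernel.sum()).to(hr)
    c = hr.shape[1]
    weight = kernel.repeat(c, 1, 1, 1)                  # depthwise kernel (C, 1, k, k)
    blurred = F.conv2d(hr, weight, padding=ksize // 2, groups=c)
    return blurred[:, :, ::r, ::r]                      # sample every r-th pixel
```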

4.2. Training

To train our RNN, the cell is unrolled along the time axis to accommodate a training clip length of 10 frames. For training, we randomly sample 12 consecutive frames from the training clips, which contain 20 frames. The 2 additional frames are used to feed x_{t-1} at the beginning and x_{t+1} at the end. The weights are initialized with Xavier initialization [10] and the network is trained with batches of size 4 using the Adam optimizer [20]. Because of the network's recurrent structure, the hidden state h_{t-1} and the previous estimate y_{t-1} need to be initialized; both tensors are initialized with zeros. In our experimental setup, our fastest model has 7 layers with 48 filters per layer, denoted as RLSP 7-48, while our most accurate model is implemented with 7 layers and 256 filters (RLSP 7-256). The networks are trained with a decreasing learning rate, starting at 10^{-4}. For RLSP 7-48 the learning rate is divided by 10 after 2M and 3M iterations. For all other models, the learning rate is divided by 10 after 2M and 4M steps. The models are selected according to the lowest moving average of the validation loss at convergence.
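For illustration, a minimal unrolled training step under these settings might look as follows. This is a sketch built on the RLSPCell sketch from Sec. 3, not the released training code; the per-step loss averaging is an assumption.

```python
import torch

model = RLSPCell(n_layers=7, f=64, r=4)                  # RLSP 7-64
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(x, y_star, f=64, r=4):
    """x: (B, 12, 3, H, W) LR clip, y_star: (B, 12, 1, rH, rW) ground-truth Y channel."""
    b, T, _, h, w = x.shape
    h_t = torch.zeros(b, f, h, w)                        # hidden state initialized to zero
    y_t = torch.zeros(b, 1, r * h, r * w)                # previous estimate initialized to zero
    loss = 0.0
    for t in range(1, T - 1):                            # 10 supervised steps per clip
        y_t, h_t = model(x[:, t - 1], x[:, t], x[:, t + 1], h_t, y_t)
        loss = loss + torch.mean((y_star[:, t] - y_t) ** 2)   # pixel-wise MSE, Eq. (3)
    loss = loss / (T - 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```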

5. Results and Discussion

We investigate our models in terms of runtime, accuracy, temporal consistency and information flow, and provide images for qualitative comparison. We compare our models with the state-of-the-art video super-resolution methods DUF [16] and FRVSR [33] on the Vid4 [26] benchmark. To the best of our knowledge, DUF achieves the highest accuracy, while FRVSR is very efficient with the best trade-off between accuracy and runtime to date.

5.1. Ablation

We compare different configurations found in other VSR methods by applying them to our architecture and assess the impact of each part in terms of accuracy and runtime. A SISR implementation of our network serves as a baseline: all recurrent connections are removed and only the current frame x_t is fed to produce y_t. As an extension to SISR, a batch of 3 consecutive frames x_{t-1:t+1} is fed to the network. This approach is further expanded by introducing a recurrent feedback connection y_{t-1}. Finally, our proposed locality-based hidden state is added. All configurations are trained on the same data with the settings described in Sec. 4.2. The experiments are conducted with n = 7 layers and f = 64 filters. The PSNR values on Vid4 and Full HD runtimes are shown in Tab. 1.

Adding adjacent frames already improves PSNR substantially, by 1.3dB compared to the SISR implementation. Feeding back the previous output improves PSNR by 0.23dB. Our proposed locality-based hidden state further improves the PSNR score substantially, by 0.55dB. The experiments show that adjacent frames and our locality-based hidden state have the strongest impact on the video performance. As expected, the runtime increases for more complex configurations. The models without recurrence need 12ms to produce a single Full HD frame. Adding feedback (y_{t-1}) and the hidden state (h_{t-1}) to the network increases runtime by 2ms and another 5ms, respectively. The complete configuration RLSP 7-64 achieves a PSNR value comparable to recent state of the art, gaining 1.98dB compared to the SISR implementation. RLSP 7-64 can generate 50fps Full HD video in real time.

Inputs                            | PSNR [dB] | Runtime [ms]
x_t                               | 24.91     | 12
x_{t-1:t+1}                       | 26.20     | 12
x_{t-1:t+1} + y_{t-1}             | 26.43     | 14
x_{t-1:t+1} + y_{t-1} + h_{t-1}   | 26.89     | 19

Table 1. Results for different network configurations on Vid4.

5.2. Temporal Consistency

An important aspect in VSR is temporal consistency between consecutive frames. Methods with limited temporal connectivity often expose flickering and other artefacts along the time domain. This property cannot be analyzed with per-frame PSNR values. Therefore, we provide temporal profiles to visually assess the temporal continuity of our RLSP method compared to others. For this, a single pixel line (red line in Fig. 5) is recorded along the whole sequence and stacked vertically. Temporal profiles of high-quality videos with smooth temporal transitions appear as sharp, detailed images. DUF-52 is the most limited method in terms of exploiting temporal information, as it only uses adjacent frames without any recurrent connections. FRVSR 10-128 uses feedback to leverage temporal information. Our method additionally propagates a latent state, which increases temporal connectivity even further. These properties are also reflected in the temporal profiles in Fig. 5. The vertical stripes in DUF-52's profile are blurred out, which indicates discontinuities between consecutive frames. FRVSR 10-128 shows increased performance, while our method RLSP 7-128 exhibits the sharpest stripes among all methods. Our method therefore produces the best temporal consistency overall.
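A temporal profile of the kind shown in Fig. 5 can be produced by stacking one pixel row over all frames; a small sketch (NumPy, hypothetical helper, not the authors' code):

```python
import numpy as np

def temporal_profile(frames, row):
    """frames: array (T, H, W, 3); returns a (T, W, 3) image of one row over time."""
    return np.stack([frame[row] for frame in frames], axis=0)
```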

5.3. Information Flow over Time

Unlike many existing methods, our model is not limited by the number of LR input frames when extracting information over time. To investigate the range of information flow, the same model (RLSP 7-128) is evaluated on a Full HD sequence from the validation set, but initialized at different instances in time. The first run is started 100 frames ahead of the second run. Therefore, the first run has already accumulated information over 100 frames when the second run is initialized at frame 0. The respective PSNR values per frame and the difference between the two runs are plotted for 200 frames in Fig. 6.

Figure 5. Temporal profiles for calendar. The profiles are produced from the red line, shown on the left.

[Figure 6 plots, for frames 0-200, the absolute PSNR of RLSP 7-128 Run 1 and Run 2 (top) and the per-frame PSNR difference between the two runs (bottom).]

Figure 6. Information flow over time for model RLSP 7-128 on a validation video. Top: absolute PSNR values. Bottom: difference in PSNR per frame.

The experiment shows that the model can propagate information over almost 175 frames until the two runs finally converge. Accumulated information from the first 100 frames can be leveraged to obtain up to 0.2dB higher PSNR values over a long period of 150 frames. Interestingly, the two runs collapse at the beginning but quickly separate again, which leads to the conclusion that the model saves information based on considering a large horizon.

5.4. Initialization

Due to its recurrent nature, our method depends on previously processed information. Because the available information content is at its lowest at the beginning, it takes a couple of frames to gather temporal information. This phenomenon can be observed in Fig. 7, where the first 6 frames of sequence city are shown. We compare our method RLSP 7-128 with DUF-52, which suffers from initialization until frame 2 as well. The fine structure of the building in the first frame cannot be fully reconstructed, but as more information is processed, the quality increases quickly until the correct structure is revealed after frame 4. This behaviour is also reflected in the PSNR values in Fig. 8. Our models start with lower PSNR at the beginning, but quickly catch up with DUF and then surpass it. Unfortunately, the first two and the last two frames of FRVSR 10-128 are not provided by the authors, probably because these frames are not considered for PSNR evaluation.

5.5. Accuracy and Runtimes

To compare the performance of the proposed network with other methods, we calculate the average PSNR over all sequences from Vid4 and measure runtimes to produce a single Full HD (1920x1080) frame. The ideal method attains fast runtimes and high PSNR. Video PSNR is calculated from the MSE over the total number of pixels in a single video sequence. The final reported value is the average of each sequence's PSNR value. Some methods use encoder-decoder parts in their architectures and therefore need to restrict input sizes; it is therefore common to crop the spatial dimensions on test sets to accommodate this drawback. Our method can process any input dimension and cropping would not be required. However, to objectively compare our method to others, we follow the evaluation strategy described in [16] and directly include the reported values from that paper. The values for FRVSR 10-128 [33] are recalculated in the same way from the provided output images. Because the outputs for model FRVSR 3-64 are not provided, we simply add to the value reported in the paper the same PSNR difference that was obtained when recalculating the value for FRVSR 10-128. Our runtimes are measured on an NVIDIA TITAN Xp with our unoptimized code. DUF and FRVSR runtimes are taken from the respective papers. The results are listed in Tab. 2 and displayed in Fig. 1.
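The evaluation protocol described above amounts to the following computation (a sketch assuming brightness-channel values normalized to [0, 1]; the border handling of [16] is omitted):

```python
import numpy as np

def sequence_psnr(pred, gt):
    """PSNR from the MSE over all pixels of one video sequence."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return 10.0 * np.log10(1.0 / mse)

def vid4_psnr(pred_seqs, gt_seqs):
    """Average of the per-sequence PSNR values, as reported in Tab. 2."""
    return float(np.mean([sequence_psnr(p, g) for p, g in zip(pred_seqs, gt_seqs)]))
```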


Figure 7. Initialization artefacts for DUF-52 and RLSP 7-128. The first 6 frames of city are shown. DUF-52 is fully initialized at frame 2, when all input frames x_{t-2:t+2} are available. RLSP 7-128 reconstructs the full structure starting from frame 4, while DUF-52 still has artefacts.

[Figure 8 plots per-frame average PSNR on Vid4 (frames 0-32) for RLSP 7-64, RLSP 7-128, FRVSR 10-128 and DUF-52.]

Figure 8. Average PSNR values over all 4 videos in Vid4 for the top models of FRVSR [33], DUF [16] and our models RLSP 7-64 and RLSP 7-128. The full average is calculated for the first 34 frames, constrained by the shortest sequence city. The first and last two frames of FRVSR 10-128 are not provided by the authors.

Method       | Bicubic | FRVSR 3-64 | FRVSR 10-128 | DUF-16 | DUF-28 | DUF-52 | RLSP 7-48 | RLSP 7-64 | RLSP 7-128 | RLSP 7-256
PSNR [dB]    | 23.79   | 26.38      | 26.90        | 26.81  | 26.99  | 27.34  | 26.65     | 26.89     | 27.46      | 27.55
Runtime [ms] | -       | 74         | 191          | 403    | 838    | 2819   | 15        | 19        | 38         | 92

Table 2. Comparison of PSNR values on Vid4 and runtimes between methods FRVSR [33], DUF [16] and ours. Bicubic interpolation is included as a baseline. All runtimes are computed on Full HD.

RLSP 7-64 shows comparable PSNR to FRVSR 10-128 and DUF-16, but is 10× and 20× faster, respectively. RLSP 7-128 achieves a good trade-off between accuracy and speed. It reaches state-of-the-art accuracy, while reducing runtimes by two orders of magnitude compared to DUF-52. Our direct implementation without any optimization can produce 25fps of Full HD video in real time. Because the focus in this work is on performance, we keep the layer count constant at 7 and instead vary the number of filters. Due to the high parallelizability, this allows increasing complexity without putting too much burden on computation time. By further increasing the number of filters, RLSP 7-256 achieves the highest reported PSNR to date on Vid4, improving 0.65dB over FRVSR 10-128 and 0.21dB over DUF-52, while still being more than 2× faster than FRVSR and 30× faster than DUF, respectively.

To investigate the evolution of accuracy across time, the average PSNR per frame over all sequences in Vid4 is computed and plotted in Fig. 8. All methods have to deal with incomplete initialization. Since our RLSP approach profits greatly from past information, PSNR values are lower at the beginning compared to DUF-52. However, RLSP 7-128 improves quickly and is able to surpass all other methods from frame 6 until the end.

We also provide images for visual comparison in Fig. 9 for each sequence in Vid4.

5.6. AIM 2019 Challenge

The RLSP architecture has been adapted to the upscaling factor ×16 in order to set a baseline for the AIM 2019 challenge on video extreme super-resolution [9, 19]. For this, we introduced 4 stages of intermediate upscaling with factor ×2 inside the RLSP cell, see Fig. 10. The number of layers and filters is set to 14 and 128, respectively. This baseline ranked 3rd, while being the fastest.

[Figure 9 panels, left to right: GT, Bicubic, FRVSR 10-128 [33], DUF-52 [16], RLSP 7-64, RLSP 7-128.]

Figure 9. Visual comparison on Vid4. From top to bottom: calendar, foliage, city, walk.

[Figure 10 diagram: the ×16 RLSP cell with 14 Conv 3x3x128 + ReLU layers and 4 intermediate ×2 upscaling stages; the previous output is shuffled down by /16.]

Figure 10. RLSP adapted to factor ×16. We use 4 intermediate upscaling stages of factor ×2.

6. Conclusion

We introduced RLSP, a new end-to-end trainable recurrent video super-resolution architecture with locality-based latent space propagation that does not rely on a dedicated motion estimation module. Due to its ability to effectively leverage temporal information over long periods of time, RLSP reduces runtimes drastically while still maintaining state-of-the-art accuracy. Because our network is shallow and wide, a large amount of computation can be run in parallel, which is also responsible for its efficiency and could enable even faster runtimes on dedicated hardware implementations. Even though the network structure is designed to be highly efficient, we showed that it is still possible to improve accuracy by increasing complexity, e.g. by adding more filters. Our RLSP achieves the best accuracy on the Vid4 benchmark while being more than 70× faster than DUF, the former state of the art. Accuracy could be further improved by investigating different configurations of kernel sizes, alternative convolution types or numbers of layers.

Acknowledgments. This work was partly supported by the ETH General Fund and by Nvidia through a hardware grant.

References

[1] S. P. Belekos, N. P. Galatsanos, and A. K. Katsaggelos. Maximum a posteriori video super-resolution using a new multichannel image prior. IEEE Transactions on Image Processing, 19(6):1451–1464, June 2010. 2

[2] R. A. Borsoi, G. H. Costa, and J. C. M. Bermudez. A new adaptive video super-resolution algorithm with improved robustness to innovations. IEEE Transactions on Image Processing, 28(2):673–686, Feb 2019. 1

[3] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi. Real-time video super-resolution with spatio-temporal networks and motion compensation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. 2

[4] R. Dahl, M. Norouzi, and J. Shlens. Pixel recursive super resolution. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017. 2

[5] Q. Dai, S. Yoo, A. Kappeler, and A. K. Katsaggelos. Sparse representation-based multiple frame video super-resolution. IEEE Transactions on Image Processing, 26(2):765–781, Feb 2017. 2

[6] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, Feb 2016. 1, 2

[7] S. Erfanian Ebadi, V. Guerra Ones, and E. Izquierdo. UHD video super-resolution using low-rank and sparse decomposition. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2017. 1

[8] S. Farsiu, M. D. Robinson, M. Elad, and P. Milanfar. Fast and robust multiframe super resolution. IEEE Transactions on Image Processing, 13:1327–1344, Oct. 2004. 1, 2

[9] D. Fuoli, S. Gu, R. Timofte, X. Tao, W. Li, T. Guo, Z. Deng, L. Lu, T. Dai, X. Shen, S. Xia, Y. Dai, J. Jia, P. Y. Z. W. K. Jiang, J. Jiang, J. Ma, Z. Zhong, C. Wang, J. Jiang, and X. Liu. AIM 2019 challenge on video extreme super-resolution: Methods and results. In ICCV Workshops, 2019. 7

[10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. Journal of Machine Learning Research - Proceedings Track, 9:249–256, 2010. 5

[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014. 2

[12] M. Haris, G. Shakhnarovich, and N. Ukita. Deep back-projection networks for super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 1

[13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. 2

[14] Y. Huang, W. Wang, and L. Wang. Bidirectional recurrent convolutional networks for multi-frame super-resolution. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 235–243, Cambridge, MA, USA, 2015. MIT Press. 2

[15] T. Hyun Kim, M. S. Sajjadi, M. Hirsch, and B. Scholkopf. Spatio-temporal transformer network for video restoration. In Proceedings of the European Conference on Computer Vision (ECCV), pages 106–122, 2018. 1

[16] Y. Jo, S. Wug Oh, J. Kang, and S. Joo Kim. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 1, 2, 4, 5, 6, 7, 8

[17] A. Kappeler, S. Yoo, Q. Dai, and A. Katsaggelos. Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2016. 2

[18] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. 1, 2

[19] S. Kim, G. Li, D. Fuoli, M. Danelljan, Z. Huang, S. Gu, and R. Timofte. The Vid3oC and IntVID datasets for video super resolution and quality mapping. In ICCV Workshops, 2019. 7

[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. 5

[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Neural Information Processing Systems, 25, 2012. 2

[22] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep Laplacian pyramid networks for fast and accurate super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. 1

[23] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. 1, 2

[24] R. Liao, X. Tao, R. Li, Z. Ma, and J. Jia. Video super-resolution via deep draft-ensemble learning. In The IEEE International Conference on Computer Vision (ICCV), December 2015. 2

[25] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 136–144, 2017. 1

[26] C. Liu and D. Sun. A Bayesian approach to adaptive video super resolution. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '11, pages 209–216, Washington, DC, USA, 2011. IEEE Computer Society. 1, 2, 5

[27] D. Liu, Z. Wang, Y. Fan, X. Liu, Z. Wang, S. Chang, and T. Huang. Robust video super-resolution with learned temporal dynamics. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017. 2

[28] A. Lucas, S. Lopez Tapia, R. Molina, and A. K. Katsaggelos. Generative Adversarial Networks and Perceptual Losses for Video Super-Resolution. arXiv e-prints, June 2018. 2

[29] O. Makansi, E. Ilg, and T. Brox. End-to-End Learning of Video Super-Resolution with Motion Compensation. arXiv e-prints, July 2017. 2

[30] S. C. Park, M. K. Park, and M. G. Kang. Super-resolution image reconstruction: a technical overview. IEEE Signal Processing Magazine, 20(3):21–36, 2003. 1

[31] E. Perez-Pellitero, M. S. M. Sajjadi, M. Hirsch, and B. Scholkopf. Photorealistic Video Super Resolution. arXiv e-prints, July 2018. 2

[32] M. S. M. Sajjadi, B. Scholkopf, and M. Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017. 2

[33] M. S. M. Sajjadi, R. Vemulapalli, and M. Brown. Frame-Recurrent Video Super-Resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 1, 2, 3, 4, 5, 6, 7, 8

[34] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016. 3

[35] X. Tao, H. Gao, R. Liao, J. Wang, and J. Jia. Detail-revealing deep video super-resolution. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017. 2

[36] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, and L. Zhang. NTIRE 2017 challenge on single image super-resolution: Methods and results. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017. 2

[37] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), December 2015. 2

[38] Z. Wang, P. Yi, K. Jiang, J. Jiang, Z. Han, T. Lu, and J. Ma. Multi-memory convolutional neural network for video super-resolution. IEEE Transactions on Image Processing, 28(5):2530–2544, May 2019. 1