10
Self-Supervised Fast Adaptation for Denoising via Meta-Learning Seunghwan Lee 1 , Donghyeon Cho 2 , Jiwon Kim 3 , and Tae Hyun Kim 1 1 Department of Computer Science, Hanyang University, Seoul, Korea [email protected], [email protected] 2 Department of Electronic Engineering, Chungnam National University, Daejeon, Korea [email protected] 3 SK T-Brain, Seoul, Korea [email protected] Abstract Under certain statistical assumptions of noise, recent self-supervised approaches for denoising have been intro- duced to learn network parameters without true clean im- ages, and these methods can restore an image by exploit- ing information available from the given input (i.e., inter- nal statistics) at test time. However, self-supervised meth- ods are not yet combined with conventional supervised de- noising methods which train the denoising networks with a large number of external training samples. Thus, we pro- pose a new denoising approach that can greatly outperform the state-of-the-art supervised denoising methods by adapt- ing their network parameters to the given input through self- supervision without changing the networks architectures. Moreover, we propose a meta-learning algorithm to enable quick adaptation of parameters to the specific input at test time. We demonstrate that the proposed method can be eas- ily employed with state-of-the-art denoising networks with- out additional parameters, and achieve state-of-the-art per- formance on numerous benchmark datasets. 1. Introduction When a scene is captured by imaging devices, a desired clean image X is corrupted by noise n. We usually assume that the noise n is an Additional White Gaussian Noise (AWGN), and the observed image Y can be expressed as Y = X + n. In particular, noise n increases in envi- ronments with high ISO, short exposure times, and low- light conditions. Image denoising is a task that restores the clean image X by removing noise n from the noisy in- put Y, and is a highly ill-posed problem. Thus, substan- tial literature concerning denoising problem has been intro- duced [16, 29, 8, 25, 11, 12, 27, 6]. Recent deep learning technologies have been used not only to obtain an image prior model via discriminative learning but also to design feed-forward denoising networks that directly produce denoised outputs. These methods train networks for denoising by using pairs of input im- ages and true clean images (Noise2Truth), and have per- formed well. However, Noise2Truth-based methods are limited in performance when the noise distribution of the test image is considerably different from the distribution of the training dataset, i.e. when domain misalignment oc- curs. To overcome these issues, researchers have proposed new training methods recently, such as Noise2Noise [21], Noise2Void [19], and Noise2Self [5], which allow to train the denoising networks without using the true clean images. These methods are based on statistical assumptions, such as zero-mean noise (i.e., E(n)=0). In this study, we improve the performance of existing Noise2Truth-based networks through a method that updates the network parameters adaptively using the information available from the given noisy input image. First, we start with a pre-trained network by the Noise2Truth technique to fully explore the large external database. Then, the net- work is fine-tuned using the Noise2Self method using the input test image during the inference phase. This approach not only solves the domain misalignment problem, but also improves the denoising performance by exploiting the self- similarity present in the input image. Self-similarity is a property that a large number of corresponding patches are existing within a single image (patch-recurrence), and it has been employed in numerous super-resolution tasks to en- hance the restoration quality [15, 30, 18]. We experimentally show that the adaptation via self- supervision during the inference stage can consistently in- 1 arXiv:2001.02899v1 [cs.CV] 9 Jan 2020

Self-Supervised Fast Adaptation for Denoising via …formance by exploring the large external datasets, and ex-ploiting internal information available from the given in-put image,

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

  • Self-Supervised Fast Adaptation for Denoising via Meta-Learning

    Seunghwan Lee1, Donghyeon Cho2, Jiwon Kim3, and Tae Hyun Kim1

    1Department of Computer Science, Hanyang University, Seoul, [email protected], [email protected]

    2Department of Electronic Engineering, Chungnam National University, Daejeon, [email protected]

    3SK T-Brain, Seoul, [email protected]

    Abstract

    Under certain statistical assumptions of noise, recentself-supervised approaches for denoising have been intro-duced to learn network parameters without true clean im-ages, and these methods can restore an image by exploit-ing information available from the given input (i.e., inter-nal statistics) at test time. However, self-supervised meth-ods are not yet combined with conventional supervised de-noising methods which train the denoising networks with alarge number of external training samples. Thus, we pro-pose a new denoising approach that can greatly outperformthe state-of-the-art supervised denoising methods by adapt-ing their network parameters to the given input through self-supervision without changing the networks architectures.Moreover, we propose a meta-learning algorithm to enablequick adaptation of parameters to the specific input at testtime. We demonstrate that the proposed method can be eas-ily employed with state-of-the-art denoising networks with-out additional parameters, and achieve state-of-the-art per-formance on numerous benchmark datasets.

    1. Introduction

    When a scene is captured by imaging devices, a desiredclean image X is corrupted by noise n. We usually assumethat the noise n is an Additional White Gaussian Noise(AWGN), and the observed image Y can be expressed asY = X + n. In particular, noise n increases in envi-ronments with high ISO, short exposure times, and low-light conditions. Image denoising is a task that restoresthe clean image X by removing noise n from the noisy in-put Y, and is a highly ill-posed problem. Thus, substan-tial literature concerning denoising problem has been intro-

    duced [16, 29, 8, 25, 11, 12, 27, 6].

    Recent deep learning technologies have been used notonly to obtain an image prior model via discriminativelearning but also to design feed-forward denoising networksthat directly produce denoised outputs. These methodstrain networks for denoising by using pairs of input im-ages and true clean images (Noise2Truth), and have per-formed well. However, Noise2Truth-based methods arelimited in performance when the noise distribution of thetest image is considerably different from the distributionof the training dataset, i.e. when domain misalignment oc-curs. To overcome these issues, researchers have proposednew training methods recently, such as Noise2Noise [21],Noise2Void [19], and Noise2Self [5], which allow to trainthe denoising networks without using the true clean images.These methods are based on statistical assumptions, such aszero-mean noise (i.e., E(n) = 0).

    In this study, we improve the performance of existingNoise2Truth-based networks through a method that updatesthe network parameters adaptively using the informationavailable from the given noisy input image. First, we startwith a pre-trained network by the Noise2Truth techniqueto fully explore the large external database. Then, the net-work is fine-tuned using the Noise2Self method using theinput test image during the inference phase. This approachnot only solves the domain misalignment problem, but alsoimproves the denoising performance by exploiting the self-similarity present in the input image. Self-similarity is aproperty that a large number of corresponding patches areexisting within a single image (patch-recurrence), and it hasbeen employed in numerous super-resolution tasks to en-hance the restoration quality [15, 30, 18].

    We experimentally show that the adaptation via self-supervision during the inference stage can consistently in-

    1

    arX

    iv:2

    001.

    0289

    9v1

    [cs

    .CV

    ] 9

    Jan

    202

    0

  • crease denoising performance regardless of the deep learn-ing architectures and target datasets. Furthermore, we adopta meta-learning technique [28] to train the denoising net-works to be quickly adapted to the specific input imagesat test time. Overall, our method obtains generalizationbased on Noise2Truth by using the large external trainingdata while breaking the limit of previously achieved perfor-mance through adoption of the Noise2Self approaches.

    In this study, we present a new learning method whichallows to train the denoising networks by supervision andself-supervision, and boosts the inference speed by trainingthe network with a meta-learning algorithm. To the best ofour knowledge, this work is the first attempt to seriouslyexplore meta-learning for the denoising task. The contribu-tions of this paper are summarized as follows:

    • Conventional supervised denoising networks can befurther improved by self-supervision during the testtime. A two-phase approach, which utilizes the inter-nal statistics of a natural image (self-supervision), isproposed to enhance restoration quality during the testtime.

    • A meta-learning-based denoising algorithm, which fa-cilitates the denoising network to quickly adapt param-eters to the given test image, is introduced.

    • The proposed algorithm can be easily applied to manyconventional denoising networks without changing thenetwork architectures and improve performance by alarge margin.

    2. Related WorkIn this section, we review numerous denoising methods

    with and without the use of true clean images for training.Image denoising is an actively studied area in image

    processing, and various denoising methods have been in-troduced, such as self-similarity-based methods [7, 9, 14],sparse-representation-based methods [26, 10], and externaldatabase exploiting methods [4, 3, 24, 33]. With the re-cent development of deep learning technologies, the denois-ing area also has been improved, and remarkable progresshas been achieved in this field. Specifically, after Xie etal. [32] adopted deep neural networks for denoising and in-painting tasks, numerous follow-up studies have been pro-posed [34, 35, 36, 38, 20, 23, 37, 17, 2].

    Based on deep CNN, Zhang et al. [34] proposed adeep neural network to learn a residual image with a skipconnection between the input and output of the network,and accelerate training speed and enhance denoising per-formance. Zhang et al. [35] also proposed IRCNN tolearn a Gaussian denoiser and this network can be com-bined with conventional model-based optimization meth-ods to solve various image restoration problems such as

    denoising, super-resolution, and deblurring. Furthermore,Zhang et al. [36] proposed a fast and efficient denoisingnetwork FFDNet, which takes cropped sub-images and anoise level map as inputs. In addition to being fast, FFDNetcan handle locally varying and a wide range of noise levels.Zhang et al. [38] introduced a very deep residual dense net-work (RDN) which is composed of multiple residual denseblocks. RDN achieves superior performance by exploit-ing all the hierarchical local and global features throughdensely connected convolutional layers and dense featurefusion. To incorporate long-range dependencies among pix-els, Zhang et al. [37] proposed a residual non-local attentionnetwork (RNAN), which consists of a trunk and (non-) lo-cal mask branches. In [23], the non-local block was usedwith a recurrent mechanism to increase the receptive fieldof the denoising network. Recently, CBDNet [17] and RID-Net [2] were introduced to handle noise in real photographswhere the noise level is unknown (blind denoising). CBD-Net is a two-step approach that combines noise estimationand non-blind denoising tasks, whereas RIDNet is a single-stage method that employs feature attention.

    After deep CNN was adopted to increase denoisingperformance, various research directions, such as residuallearning for constructing deeper networks, non-local or hi-erarchical features for enlarging the receptive fields, andnoise level estimation for real photographs, have been con-sidered. However, such works remain limited to the casesin which networks are supervised by true clean images(Noise2Truth). Recently, several self-supervision-basedstudies have been conducted to leverage only noisy im-ages for network training without true clean images. Lehti-nen et al. [21] demonstrated that a denoising network canbe trained without clean images. The network was trainedwith pairs of noisy patches (Noise2Noise) based on statis-tical reasoning that the expectation of randomly corruptedsignal is close to the clean target data. Furthermore, toavoid constructing pairs of noisy images, Krull et al. [19]proposed a Noise2Void method and introduced a blind-spotnetwork. Specifically, only the center pixel of the inputpatch was considered in the loss function, and the net-work was trained to predict its center pixel without any trueclean dataset. Similarly, Baston and Royer [5] introduced aNoise2Self method for training the network without know-ing the ground truth data.

    However, these self-supervision-based methods can notoutperform supervised methods where the distribution ofthe input is identical to training sample distribution. Weuse the supervised approach (i.e., Noise2Truth) during thetraining phase to achieve state-of-the-art performance bygeneralization, and use a self-supervised approach (e.g.,Noise2Void, Noise2Self) on the test input image during theinference stage to further improve the performance by adap-tation. To do so, we can employ the conventional meta-

  • learning algorithms [13, 28, 22] with our denoising net-works to enable quick adaption of the network parametersto the given input image. In the end, our supervised networkcan be adapted to the input image during the inference phasebased on the self-supervision with only few gradient updatesteps.

    The proposed method can achieve state-of-the-art per-formance by exploring the large external datasets, and ex-ploiting internal information available from the given in-put image, such as self-similarity as in [21, 5, 19]. To thebest of our knowledge, this work is the first attempt to ap-ply the meta-learning to enable quick adaptation with self-supervision for blind and non-blind denoising tasks.

    3. Supervision vs. Self-SupervisionRecent learning-based denoising works [21, 5, 19] have

    attempted to remove noise and restore the clean image byself-supervision without relying on a large training dataset.Natural images have similar patches that are redundantwithin a single image [15, 18, 30]; thus, we can estimateclean patches by using the corresponding but differentlycorrupted patches, assuming that the expected value of theadded random noise is zero [5, 19].

    In general, these self-supervised methods can effectivelyremove unseen noise (i.e., noise from an unknown distri-bution) by exploiting self-similarity with specially designedloss functions at test-time, whereas conventional supervisedmethods cannot handle unexpected and unseen noise whichis not sampled from the trained distribution. Conventionalsupervised denoising networks that learned using a large ex-ternal dataset cannot exploit self-similarity at test-time dueto the limited capacity of the network architectures (e.g., re-ceptive field), and thus the performance is limited. In con-trast, self-supervision-based methods cannot outperform theconventional supervision-based methods when the noisy in-put image is sampled under a learned distribution, becauseself-supervised methods do not learn from a large externaldataset.

    Therefore, we aim to improve the performance ofthe conventional supervised methods by merging super-vised and self-supervised methods to utilize large externaldatasets and exploit the given test image. However, integra-tion techniques have yet to be investigated actively.

    We first simply combine the self-supervision method andthe conventional supervised Gaussian denoising network byusing the fully pre-trained parameters of the Gaussian de-noiser as initial parameters of the self-supervised network.After initialization, the parameters of the self-supervisednetwork are updated (fine-tuned) using the test input with-out knowing the ground truth version, as in [19]. However,as shown in Table 1, this naive integration even degrades theperformance of the supervised baseline model [34] whenthe test image is corrupted by noise with learned distri-

    Supervision [34] Supervision [34] +Self-supervision [19]PSNR

    (σ = 40)27.84 27.39

    Table 1. Gaussian denoising results with and without using self-supervision. The backbone network of the self-supervision basedmethod N2V [19] is DnCNN [34]. N2V is initialized with fullytrained parameters then updated using the input image as in [19].Notably, naive integration degrades the performance of the base-line model (i.e., DnCNN).

    bution (i.e., Gaussian noise). Therefore, in this work, wepresent a novel denoiser that improves restoration perfor-mance by deriving the benefits of a large external train-ing dataset and self-similarity from an input image. Suchbenefits are derived by integrating both supervised and self-supervised methods.

    4. Proposed Method4.1. Two-phase denoising approach

    Many self-supervised methods [21, 19, 5] assume thatthe input image is corrupted by independent and identicallydistributed (i.i.d) noise n, and an optimal denoiser for theinput can be estimated by minimizing the self-supervisedloss function as follows:

    Loss(θ) =E‖f(Yi; θ)−Yj‖2

    =E‖f(Yi; θ)−Xi‖2 + E‖Yj −Xi‖2,(1)

    where Yi and Yj are independently corrupted correspond-ing patches with i.i.d noise n, Xi denotes their clean andground-truth version, and E[Yi|Xi] = Xi. A mappingfunction f is our denoiser and our goal is to estimate theparameters θ.

    Ideally, we can learn optimal parameters θ by mini-mizing the self-supervised loss with corresponding noisypatches [21]. Therefore, in the learning process, we shouldcollect redundant and self-similar patches within the givenimage. However, the number of corresponding patches isnot infinitely many in practice; thus, minimizing (1) doesnot provide an optimal solution. Moreover, finding corre-sponding patches within the given noisy image is a difficultand time-consuming task. Therefore, recent self-supervisedapproaches slightly modify the loss function, and consideronly the center pixel value as follows:

    Loss(θ) ≈∑i

    ‖Mi(f(Yi; θ))−Mi′(Yi)‖2, (2)

    where Mi extracts a single pixel value at the center locationof patch Yi, and Mi′ randomly takes a pixel value from thesurrounding area of the center pixel within the same patch

  • 𝒀: Input Denoiser (g)ഥ𝒀⊕

    r: Random noise

    Denoiser (f) ഥ𝒀Backprop.for N times

    (a) Adaption (fine-tune) step with input image. (Stage Ⅰ) (b) Inference step. (Stage Ⅱ)

    Denoiser (f)𝒀: Input ෩𝑿𝒁

    ෩𝑿

    Figure 1. Overall flow of the proposed method. Note that the denoiser g is non-trainable while f is trainable.

    (a) (b)

    Figure 2. (a) Noisy input image. (b) Denoised images at differ-ent image scales. Yellow patches are corresponding to each other.Clean (yellow) patches at different image scales in (b) can be usedto remove the noise within the (yellow) patch in (a).

    Yi, assuming that neighboring pixel values (e.g., color) aresimilar. Therefore, the self-similarity exploitation abilityis considerably limited because we can generate only 256samples per patch (the number of samples can be used tocalculate the expected value for Yi) because surroundingpixel values should be in between [0, 255] in gray-scale,and thus increase the variance of the estimator. Moreover,self-supervised methods take much time in training becausethey compare only a single pixel value in the loss functionwhile taking a large patch as input.

    To alleviate this problem, we present a novel solution inthis study. Specifically, we can reduce the amount of noisein the patch Yi by using an arbitrary denoiser g, and wepropose to minimize a new self-supervision loss as follows:

    Loss(θ) = E‖f(Yi; θ)− Ȳj‖2, (3)

    where Ȳj denotes the denoised version of Yj by the de-noiser g. Note that Yi, and Yj are corresponding.

    If we assume that Ȳj = Xi + n′ where the remaining(residual) noise n′ is still i.i.d, then our denoiser f can learnbetter parameters compared with those obtained by mini-mizing (2) because the noise level of Ȳj is lower than thatof Yj (i.e., V ar(n′) < V ar(n)). In addition, we can gen-erate a new noisy signal Zj = Ȳj + r by adding somerandom noise r into the denoised patch Ȳj , and our lossfunction with respect to θ can be reformulated as

    Loss(θ) = E‖f(Zj ; θ)− Ȳj‖2

    = E‖f(Zi; θ)− Ȳi‖2,(4)

    when the distribution of Zj is identical to the distributionof Yj . Therefore, we can obtain the optimally denoisedversion of the Yj by minimizing the proposed loss func-tion, and it also becomes the clean version of Yi becausethey are corresponding. We no longer need to find corre-sponding patches from the given test image in (4), and wecan compare patch by patch in the proposed loss function incontrast to the previous self-supervised works that calculatethe loss pixel by pixel.

    To be specific, if we generate N noisy patches {Zi} forthe patch Yi, we can obtain a total of MN self-similarpatches when M corresponding patches exist within thegiven image. Then, the denoised patch X̃i is given by,

    X̃i =1

    M

    M∑i=1

    (1

    N

    N∑n=1

    Zni ), (5)

    where n denotes the index of the realized sample Zi. WhenN approaches infinity, X̃i can be approximated as:

    X̃i ≈1

    M

    M∑j=1

    Ȳi =1

    M

    M∑i=1

    (Xi + r), (6)

    and the denoised patch becomes the average value of Mcorresponding patches denoised by g.

    Ideally, we can generate a maximum of N = 256H×W

    samples from an H ×W patch Ȳi, and the variance of ourestimator can be reduced by a factor of M as N → ∞.However, the previous self-supervised methods can gener-ate only a limited number of samples (N=256) per patch;thus, the variances of the estimators in [5, 19] are higherthan our proposed estimator. We can train the network fwith pairs of images (Zi, Ȳi). Note that we can generatemany Zi correspond to Ȳi, and thus the training procedurebecomes super-efficient. Moreover, in our experiments, theproposed loss function remains valid with images at differ-ent scales because self-similar patches are existing acrossscales [30, 15], as shown in Fig. 2. This property allowsthe use of self-similarity in a larger space and can increasethe number of corresponding patches M within the givendataset.

  • Based on the proposed loss function in (4), we present anew denoising network that can integrate the state-of-the-artsupervised and self-supervised methods into a single net-work to utilize the power of deep learning with a large ex-ternal database and internal statistics. The sketch of the pro-posed two-phase denoising approach is illustrated in Fig. 1.

    4.2. Fast adaptation via meta-learning

    32.1

    32.2

    32.3

    32.4

    32.5

    0 200 400 600 800 1000 1200 1400 1600 1800 2000

    PSN

    R (dB)

    Iteration

    pre-trained L1 L2

    Figure 3. Fully pre-trained DnCNN [34] on the DIV2K trainingset is given. For each image X in the DIV2K test set, we generate2000 train samples {Z}, and minimize the proposed loss functionin (4) at test time. The Average PSNR value goes up as iterationnumber increases. We also show the result by different metrics(i.e., L1 norm) which is also widely used in many recent works.

    We can further update the parameters of fully traineddenoising networks during the testing phase by minimiz-ing the proposed loss function with the test input. Fig. 3shows denoising performance of fully trained DnCNN [34]in terms of PSNR while updating the network parametersthrough the minimization of (4) using the DIV2K 10 vali-dation set. For the experiment, we use Gaussian noise forn and r (σ = 20), and the fully pre-trained DnCNN on theDIV2K training set is used as g and initial f . Accordingto the steps shown in Fig. 1(a), we update the network fwith Ȳ and differently corrupted Z for 2000 iterations (i.e.,N = 2000 and mini-batch size = 1), and denoising per-formance improves as the update (fine-tune) procedure pro-gresses, as shown in Fig. 3. Notably, the PSNR value atiteration 0 denotes the performance of the initial f (i.e.,black solid line). Although PSNR drops for the first fewiterations, we can elevate the performance of f up to ap-proximately 0.25dB through 2000 updates without using theground truth image X.

    However, as we use the full-resolution image during theupdate procedure in Fig. 1(a), it takes much time duringthe testing phase. Therefore, we propose a fast update al-gorithm that allows a quick adaptation of the network pa-rameters during the testing phase by embedding the recentmeta-learning algorithms [13, 28, 22] into our two-phasedenoising algorithm.

    Meta-learning algorithms can be used to find initial pa-rameters of the network in the training stage, which facil-

    itate fast adaptation at test time. In general, meta-learningalgorithms require ground-truth training samples for param-eter adaptation at test time, but only a single noisy image isavailable in our denoising task. Thus, the use of the meta-learning scheme is restricted. However, as shown in Fig. 3,we have shown that we can train the network f in an un-supervised manner using Ȳ and a large number of {Z}.Thus, we can adopt the meta-learning algorithms by usingthe training samples composed of Ȳ and {Z} to efficientlyadapt our parameters at test time. The overall flow of theproposed meta-learning process to initialize the parametersof f for test-time adaptation with the external dataset is il-lustrated in Fig. 4. Then, the meta-learned network param-eters θT can be used as the initial parameters of f for thetest-time updates in the two-phase denoising algorithm.

    Specifically, our meta-learning integrated denoising al-gorithm is not limited to any specific meta-learning al-gorithm, and recent methods, such as MAML [13], Rep-tile [28], and Meta-SGD [22], which aim for fast adapta-tion can be used. In our experiments, Reptile[28] fromOpenAI shows consistently better results compared withMAML [13]. Thus, we provide the detailed steps of ourmeta-learning algorithm with Reptile [28] in Algorithm 1,and the inference algorithm during test time is given in Al-gorithm 2. We believe Reptile outperforms MAML in ourtask, because our task requires relatively numerous itera-tions (updates) at test time. Notably, our denoiser in Al-gorithm 2 can solve blind (unknown noise level) and non-blind (known noise level) denoising tasks without changingthe training scheme in Algorithm 1. We only need to deter-mine the noise level of r as a random or fixed value duringtest-time adaptation depending on the given task.

    5. ExperimentsPlease refer to our supplementary material for the ex-

    tensive experimental results, and the code will be publiclyavailable upon acceptance.

    5.1. Implementation details

    In our experiments, we evaluate the proposed methodsusing different state-of-the-art denoisers on DIV2K, Ur-ban100, and BSD68 datasets. We first pre-train the state-of-the-art denoisers under fair conditions using an NVIDIA2080Ti graphics card. DnCNN [34], RIDNet [2], andRDN [38] are trained on the DIV2K training set with Gaus-sian noise until convergence.

    Currently, RDN shows the best performance on publicbenchmark tests [1] in removing Gaussian noise, and re-cent RIDNet shows competitive results. We use the lightversion of RDN (D = 10,C = 4,G = 16) due to the limitedmemory size of our graphics unit. The standard deviationof the Gaussian noise is randomly selected from [0, 50]during pre-training, and a conventional data augmentation

  • Meta-learning with external dataset (Stage 0)

    𝑿〮〮〮

    〮〮〮

    External dataset

    Sampling

    gഥ𝒀

    Noise: n

    Noise: r

    𝒀

    Meta Learningw.r.t. 𝜃𝜽𝑡

    𝜽𝑇

    {𝒁}

    Figure 4. Parameter initialization via meta-earning process with external large training dataset.

    Algorithm 1:Training via meta-learning. (Stage 0)

    Input: Fully pre-trained params.: θ0Output: θTData: Clean images {X}

    1 for t = 1 to T do2 θ0 ← θt−13 for k = 1 to K do4 X ∼ {X}// image batch sample5 σn ∼ rand(0, σmax), n ∼ N(0, σn)6 Y ← X + n7 Ȳ ← g(Y) // g:non-trainable8 σr ∼ rand(0, σmax), r ∼ N(0, σr)9 Z← Ȳ + r

    10 θk ← minimizeθ Loss(θ|Ȳ,Z, θk−1)// loss in (4) with ADAM

    11 θt ← θt−1 + �(θK − θt−1)// �: small update step

    technique is applied. We minimize the distance betweenthe ground truth image and the prediction.

    For meta-learning, we set T = 2000, K = 256, σmax =50, and � = 1e-5 in Algorithm 1 and in Algorithm 2, andthe pre-trained networks (i.e., DnCNN, RIDNet, RDN) areused as denoiser g in Fig. 1.

    5.2. Self-similarity exploitation

    First, we fine-tune the fully trained DnCNN for 200 it-erations on the Urban100 dataset using the proposed two-phase denoising algorithm (w/o meta-learning). For theupdates, we use different image scales and resize Ȳ withdifferent scaling factors from 0.4 to 1.2. At each update,we measure the average PSNR values by removing Gaus-

    Algorithm 2:Inference through adaptation. (Stage 1 + Stage 2)

    Input: Noisy input: Y, Initial params.: θT , NOutput: Denoised image: X̃

    1 Ȳ ← g(Y)2 for n = 1 to N do

    3 σ =

    {Const. if known (non-blind)rand(0, σmax) otherwise (blind)

    4 r ∼ N(0, σ)5 Z← Ȳ + r6 θ′ ← minimizeθ Loss(θ|Ȳ,Z, θT+n−1)

    // loss in (4) with ADAM7 θT+n ← θT+n−1 + �(θ′ − θT+n−1)

    // �: small update step

    8 X̃← f(Y; θT+N )

    sian noise with σ = 20. The results are shown in Fig. 5.As we expected, PSNR values still increase as N increasesat different image scales because similar patches are exist-ing across different image scales, as shown in Fig. 2. In-terestingly, we can achieve the best performance when thescaling factor is 0.8 and N < 200 because the noise level(i.e., residual noise V ar(n′)) further decreases by resizing.Moreover, we also perform updates by choosing the scalerandomly using a normal distribution ∼ N (µ = 0.8, σ =0.1). The chosen scale value is clipped if it is larger than 1.0or smaller than 0.6. Thus, we can update the networks usingmultiples scales (solid brown line). Although, the final per-formance of the multi-scale training after 200 iterations islower compared to the result obtained when the scale factoris 0.8 (solid yellow line), we use a random multiple-scalefactor in Algorithm 2 during inference because the multi-scale version takes less time with smaller images and shows

  • Urban100 BSD68 DIV2KPSNR gain 0.44 0.17 0.23

    Table 2. After 200 updates, PSNR gains on Urban100, BSD68,and DIV2K test set are measured. Performance gain is particularlyhuge on the Urban100 dataset.

    σ = 10 σ = 20 σ = 30 σ = 40Small patch 35.94 32.52 30.68 29.15Large patch 35.94 32.57 30.74 29.23

    Table 3. Performance comparison by updating DnCNN using dif-ferent sizes of patches. Parameters trained with large patches pro-vide consistently better results for various Gaussian noise levels.

    slightly better performances when the number of iterationsis small (∼ 5), which well suits to our real testing scenario.During the meta-learning procedure, we use a fixed scalefactor (= 1) in Algorithm 1.

    We also perform updates for 200 iterations on theBSD68, and DIV2K test sets as well as the Urban100dataset under the same condition (scale factor = 0.8). After200 iterations, we measure the performance gain, and theresults are given in Table. 2. The gain from the Urban100dataset is much larger than others because urban imagesgenerally include a large number of self-similar patchesfrom man-made repeated structures.

    We perform additional experiments to see whether self-similarity is a significant factor in the proposed method.We fine-tune the fully pre-trained DnCNN on the Ur-ban100 dataset, and update the parameters with differ-ent sizes of patches. To be specific, we collect 64×64and 128×128 patches from the Urban100 dataset wherethe 64×64 patches are centrally cropped version of therandomly chosen 128×128 patches. We compare thePSNR values obtained results by parameters learned from128×128 patches and 64×64 patches respectively. For theevaluation of parameters updated with 128×128 patches,we measure the PSNR on the 64×64 central parts of the128×128 patches to carry out a fair comparison. The com-parison results on different Gaussian noises are given in Ta-ble. 3, and the parameters updated with larger patches ren-der better results because larger patches are likely to includemore corresponding patches (i.e., large M ).

    In this ablation study, we demonstrate that our algorithmcan exploit the self-similarity within the given input, andthus the proposed method can produce better results whereN and M are large.

    5.3. Denoising results via meta-learning

    In Fig. 6, we evaluate the performance of DnCNN dur-ing the meta-learning procedure in Algorithm 1. We useDIV2K training set for meta-learning and set T = 2000.

    31.9

    32

    32.1

    32.2

    32.3

    32.4

    32.5

    0 40 80 120 160 200

    PSN

    R (dB)

    Iteration

    pre-trained scale 0.4 scale 0.6 scale 0.8

    scale 1.0 scale 1.1 scale 1.2 scale [0.6, 1.0]

    Figure 5. Denoising results with our two-phase denoising algo-rithm. Performances are evaluated by changing the image scalesused in update. DnCNN is used for removing a Gaussian noise(σ = 20) on the Urban100 dataset.

    32

    32.05

    32.1

    32.15

    32.2

    32.25

    32.3

    32.35

    0 200 400 600 800 1000 1200 1400 1600 1800 2000

    PSN

    R (dB)

    Iteration

    Fully pre-trained Non-blind_0 Non-blind_5 Blind_5

    Blind_10 Blind_15 Blind_20

    Figure 6. Performance evaluation during meta-learning procedurefor blind and non-blind denoising with DnCNN.

    At each meta-learning iteration, degraded Urban100 datasetwith Gaussian noise (σ = 20) is restored using the methodin Algorithm 2. We measure the performance by chang-ing the number of iterations N in Algorithm 2 from 0 to20 in blind and non-blind manner. As meta-learning pro-gresses (i.e., t → T ) inference accuracy improves grad-ually. DnCNN provides consistently superior results withlarge N for both blind and non-blind denoising.

    In Table 4, we provide quantitative comparisons re-sults. Blind and non-blind denoising results from meta-learned RIDNet (MetaRIDNet), RDN (MetaRDN), andDnCNN (MetaDnCNN) are measured with different set-tings, and compared with results by conventional meth-ods (BM3D [9], MemNet [31], and FFDNet [36]). Ourmeta-learned blind/non-blind denoisers can produce bet-ter results with a small number of updates because theycan adapt their parameters quickly to the specific input,and can outperform the pre-trained baseline models withonly 5 iterations. Note that the performance gaps betweenthe naive fine-tuning (Finetune 5) and our meta-learning-based adaptation (Bind 5) for 5 iterations are large particu-larly when the noise level is high, and these results demon-

  • strate that the proposed method can improve the perfor-mance more quickly than naive fine-tuning. With NVIDIA2080Ti Graphics card, it takes around 0.99, 2.49, and 2.64seconds to restore a 1000×600 image with MetaDnCNN,MetaRDN, and MetaRIDNet updated for 5 times respec-tively.

    In Fig. 7, we provide qualitative comparison results. Theinputs are corrupted with high-level Gaussian noise (σ =40), and the proposed methods restore the clean images inblind and non-blind manners. In particular, with more itera-tions during inference, our blind and non-blind methods canproduce visually much better results and restore tiny detailscompared to the fully pre-trained baseline models.

    6. ConclusionConsidering that we can improve the performance of the

    conventional supervision-based denoising methods duringtest time using the self-similarity property from the givennoisy input image with the proposed loss function. Thus,we introduce a new two-phase denoising approach that al-lows the update of the network parameters from the fullytrained version at test time and enhance the image qual-ity significantly by exploiting self-similarity. Furthermore,we integrate meta-learning technique while updating (fine-tuning) our denoiser to enable quick parameter adaptationand accurate inference at test time. Our proposed algorithmcan be generally applicable to many denoising networks,and we improve the restoration quality significantly withoutchanging the architectures of the state-of-the-art denoisingmethods. Experimental results demonstrate the superiorityof the proposed method.

    AcknowledgementThis work was supported by the research fund of SK

    Telecom T-Brain and Hanyang University(HY-2018).

    References[1] Leaderboards. https://paperswithcode.com/

    task/image-denoising/latest. Accessed: 2019-11-15. 5

    [2] Saeed Anwar and Nick Barnes. Real image denoising withfeature attention. In Proceedings of the IEEE InternationalConference on Computer Vision (ICCV), 2019. 2, 5, 9

    [3] Saeed Anwar, Cong Phuoc Huynh, and Fatih Porikli. Com-bined internal and external category-specific image denois-ing. In Proceedings of the British Machine Vision Confer-ence (BMVC), 2017. 2

    [4] Saeed Anwar, Fatih Murat Porikli, and Cong Phuoc Huynh.Category-specific object image denoising. IEEE Transac-tions on Image Processing, 26:5506–5518, 2017. 2

    [5] Joshua Batson and Loic Royer. Noise2self: Blind denoisingby self-supervision. In International Conference on MachineLearning (ICML), 2019. 1, 2, 3, 4

    [6] A. Buades, B. Coll, and J. . Morel. A non-local algorithmfor image denoising. In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR),volume 2, pages 60–65 vol. 2, 2005. 1

    [7] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. Anon-local algorithm for image denoising. In Proceedingsof the IEEE Conference on Computer Vision and PatternRecognition (CVPR), pages 60–65, 2005. 2

    [8] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Im-age denoising by sparse 3-d transform-domain collabora-tive filtering. IEEE Transactions on Image Processing,16(8):2080–2095, 2007. 1

    [9] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, andKaren O. Egiazarian. Image denoising by sparse 3-dtransform-domain collaborative filtering. IEEE Transactionson Image Processing, 16:2080–2095, 2007. 2, 7, 9

    [10] Weisheng Dong, Xin Li, Lei Zhang, and Guangming Shi.Sparsity-based image denoising via dictionary learning andstructural clustering. In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR), pages457–464, 2011. 2

    [11] W. Dong, L. Zhang, G. Shi, and X. Li. Nonlocally central-ized sparse representation for image restoration. IEEE Trans-actions on Image Processing, 22(4):1620–1630, 2013. 1

    [12] M. Elad and M. Aharon. Image denoising via sparseand redundant representations over learned dictionaries.IEEE Transactions on Image Processing, 15(12):3736–3745,2006. 1

    [13] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks.In International Conference on Machine Learning (ICML),2017. 3, 5

    [14] A. Foi, V. Katkovnik, and K. Egiazarian. Pointwise shape-adaptive dct for high-quality denoising and deblocking ofgrayscale and color images. IEEE Transactions on ImageProcessing, 16(5):1395–1411, 2007. 2

    [15] Daniel Glasner, Shai Bagon, and Michal Irani. Super-resolution from a single image. In Proceedings of the IEEEInternational Conference on Computer Vision (ICCV), 2009.1, 3, 4

    https://paperswithcode.com/task/image-denoising/latesthttps://paperswithcode.com/task/image-denoising/latest

  • GT Noisy Pre-trained

    Non-blind_0 Non-blind_5 Blind_20Noisy Input (σ=40)

    GT Noisy Pre-trained

    Non-blind_0 Non-blind_5 Blind_20Noisy Input (σ=40)

    (a)

    GT Pre-trained

    Non-blind_0 Non-blind_5 Blind_20Noisy Input (σ=40)

    GT Noisy Pre-trained

    Non-blind_0 Non-blind_5 Blind_20Noisy Input (σ=40)

    (b)

    Noisy

    Figure 7. Visual comparisons. (a) Denoising results with our MetaRIDNet [2]. (b) Denoising results with our MetaRDN [38]. Notably,Non-blind 0 denotes the results obtained by parameters θT which is the initial parameters of the inference step in Algorithm 2.

    Dataset Urban100 DIV2K BSD68Method Adaptation Noise σ = 10 σ = 20 σ = 30 σ = 40 σ = 10 σ = 20 σ = 30 σ = 40 σ = 10 σ = 20 σ = 30 σ = 40

    BM3D [9] - 35.77 31.92 29.38 27.06 36.15 32.13 29.58 27.49 35.75 31.52 29.07 27.19MemNet [31] - 35.66 32.32 30.32 28.87 36.60 33.01 30.97 29.55 36.07 32.27 30.21 28.83FFDNet [36] - 35.43 31.87 29.51 27.60 36.36 32.55 30.08 28.11 35.82 31.87 29.55 27.82

    MetaRIDNet(ours)

    Fully pre-trainedFinetune 5

    Non-blind 5Blind 5

    Blind 10Blind 15Blind 20

    35.6735.7635.9435.8835.8635.8635.89

    32.4032.5132.7432.7032.7232.7532.76

    30.4430.5630.8330.7930.8330.8530.87

    29.0229.1429.4329.3929.4329.4629.48

    36.6336.6836.8336.7436.7736.7836.79

    33.0833.1333.2833.2433.2633.2833.29

    31.0631.1231.2831.2431.2731.2831.30

    29.6529.7229.8829.8429.8729.8929.90

    36.1036.1536.2236.1736.1836.1936.20

    32.3232.3732.4632.4432.4432.4532.47

    30.2730.3330.4230.4030.4230.4330.44

    28.8928.9629.0629.0329.0529.0729.07

    MetaRDN(ours)

    Fully pre-trainedFinetune 5

    Non-blind 5Blind 5

    Blind 10Blind 15Blind 20

    35.4635.5535.7635.7135.7235.7335.74

    32.1132.2232.5032.4732.5132.5332.55

    30.1030.2230.5530.5230.5730.6030.63

    28.6528.7729.1229.0829.1429.1729.20

    36.4436.5330.6936.6436.6436.6636.67

    32.8732.9433.1333.1133.1333.1433.16

    30.8230.9031.0931.0831.1031.1231.13

    29.4029.4829.6829.6529.6829.6929.71

    35.9936.0736.1436.1136.1136.1136.13

    32.1932.2632.3632.3532.3732.3732.39

    30.1230.1830.3130.3030.3230.3330.34

    28.7428.8028.9328.9128.9428.9528.96

    MetaDnCNN(ours)

    Fully pre-trainedFinetune 5

    Non-blind 5Blind 5

    Blind 10Blind 15Blind 20

    35.4635.5535.6835.5735.5635.5735.59

    32.0132.1232.3032.2532.2832.3032.32

    29.9730.0930.3030.2630.2930.3130.35

    28.4828.6228.8528.7928.8128.8428.87

    36.4136.4636.5636.4736.4936.5136.52

    32.8032.8532.9832.9332.9432.9632.97

    30.7630.8130.9430.8730.9130.9330.94

    29.3429.3929.5129.4329.4729.4829.51

    35.9936.0436.0936.0236.0336.0436.05

    32.1632.2132.2932.2532.2632.2832.28

    30.0930.1430.2230.1730.1930.2130.22

    28.7028.7328.8228.7928.8028.8128.83

    Table 4. Quantitative comparisons. Non-blind N and Blind N indiciate that the network is updated for N iterations during the testingphase with and without knowing the Gaussian noise level (non-blind and blind denoising respectively).

    [16] S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclearnorm minimization with application to image denoising. InProceedings of the IEEE International Conference on Com-puter Vision (ICCV), pages 2862–2869, 2014. 1

    [17] Shi Guo, Zifei Yan, Kai Zhang, Wangmeng Zuo, and LeiZhang. Toward convolutional blind denoising of real pho-tographs. In Proceedings of the IEEE Conference on Com-

    puter Vision and Pattern Recognition (CVPR), 2018. 2[18] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Sin-

    gle image super-resolution from transformed self-exemplars.In Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition (CVPR), 2015. 1, 3

    [19] Alexander Krull, Tim-Oliver Buchholz, and Florian Jug.Noise2void-learning denoising from single noisy images. In

  • Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition (CVPR), 2019. 1, 2, 3, 4

    [20] Stamatios Lefkimmiatis. Non-local color image denoisingwith convolutional neural networks. In Proceedings of theIEEE Conference on Computer Vision and Pattern Recogni-tion (CVPR), pages 5882–5891, 2016. 2

    [21] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, SamuliLaine, Tero Karras, Miika Aittala, and Timo Aila.Noise2Noise: Learning image restoration without clean data.In International Conference on Machine Learning (ICML),volume 80, pages 2965–2974, 2018. 1, 2, 3

    [22] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-sgd: Learning to learn quickly for few-shot learning. arXivpreprint arXiv:1707.09835, 2017. 3, 5

    [23] Ding Liu, Bihan Wen, Yuchen Fan, Chen Change Loy, andThomas S. Huang. Non-local recurrent network for imagerestoration. In Advances in Neural Information ProcessingSystems (NIPS), pages 1680–1689, 2018. 2

    [24] Enming Luo, Stanley H. Chan, and Truong Q. Nguyen.Adaptive image denoising by targeted databases. IEEETransactions on Image Processing, 24:2167–2181, 2015. 2

    [25] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman.Non-local sparse models for image restoration. In Proceed-ings of the IEEE International Conference on Computer Vi-sion (ICCV), pages 2272–2279, 2009. 1

    [26] Julien Mairal, Francis Bach, J. Ponce, Guillermo Sapiro,and Andrew Zisserman. Non-local sparse models for im-age restoration. In Proceedings of the IEEE InternationalConference on Computer Vision (ICCV), pages 2272–2279,2009. 2

    [27] J. Mairal, M. Elad, and G. Sapiro. Sparse representation forcolor image restoration. IEEE Transactions on Image Pro-cessing, 17(1):53–69, 2008. 1

    [28] Alex Nichol, Joshua Achiam, and John Schulman. Onfirst-order meta-learning algorithms. arXiv preprintarXiv:1803.02999, 2018. 2, 3, 5

    [29] J. Portilla, V. Strela, M. J. Wainwright, and E. P. Simoncelli.Image denoising using scale mixtures of gaussians in thewavelet domain. IEEE Transactions on Image Processing,12(11):1338–1351, 2003. 1

    [30] Assaf Shocher, Nadav Cohen, and Michal Irani. zero-shotsuper-resolution using deep internal learning. In Proceed-ings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR), 2018. 1, 3, 4

    [31] Ying Tai, Jian Yang, Xiaoming Liu, and Chunyan Xu. Mem-net: A persistent memory network for image restoration. InProceedings of the IEEE International Conference on Com-puter Vision (ICCV), 2017. 7, 9

    [32] Junyuan Xie, Linli Xu, and Enhong Chen. Image denoisingand inpainting with deep neural networks. In Advances inNeural Information Processing Systems (NIPS), pages 341–349, 2012. 2

    [33] Huanjing Yue, Xiaoyan Sun, Jing yu Yang, and Feng Wu.Image denoising by exploring external and internal correla-tions. IEEE Transactions on Image Processing, 24:1967–1982, 2015. 2

    [34] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, andLei Zhang. Beyond a gaussian denoiser: Residual learning ofdeep cnn for image denoising. IEEE Transactions on ImageProcessing, 26:3142–3155, 2017. 2, 3, 5

    [35] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang.Learning deep cnn denoiser prior for image restoration. InProceedings of the IEEE Conference on Computer Visionand Pattern Recognition (CVPR), pages 2808–2817, 2017.2

    [36] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Ffdnet: To-ward a fast and flexible solution for cnn based image de-noising. IEEE Transactions on Image Processing, 27:4608–4622, 2018. 2, 7, 9

    [37] Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and YunFu. Residual non-local attention networks for image restora-tion. In Proceedings of the International Conference onLearning Representations (ICLR), 2019. 2

    [38] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, andYun Fu. Residual dense network for image restoration.CoRR, abs/1812.10477, 2018. 2, 5, 9