
Fully Automated Image De-fencing using Conditional Generative Adversarial Networks

Divyanshu Gupta1, Shorya Jain1, Utkarsh Tripathi1, Pratik Chattopadhyay1, Lipo Wang2

1Department of Computer Science and Engineering, IIT(B.H.U), Varanasi-221005, India
2School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore

1{divyanshu.gupta.cse15,shorya.jain.cse16,utkarsh.tripathi.cse16}@iitbhu.ac.in
[email protected]

[email protected]

Abstract

Image de-fencing is an important aspect of recreational photography, in which the objective is to remove the fence texture present in an image and generate an aesthetically pleasing version of the same image without the fence texture. In this paper, we aim to develop an automated and effective technique for fence removal and image reconstruction using conditional Generative Adversarial Networks (cGANs). These networks have been successfully applied in several domains of Computer Vision focusing on image generation and rendering. Our initial approach is based on a two-stage architecture involving two cGANs that generate the fence mask and the inpainted image, respectively. Training of these networks is carried out independently and, during evaluation, the input image is passed through the two generators in succession to obtain the de-fenced image. The results obtained from this approach are satisfactory, but the response time is long since the image has to pass through two sets of convolution layers. To reduce the response time, we propose a second approach involving only a single cGAN architecture that is trained using ground-truth pairs of fenced and de-fenced images along with the edge map of the fenced image produced by the Canny filter. Incorporation of the edge map helps the network to precisely detect the edges present in the input image, and also imparts it the ability to carry out high-quality de-fencing efficiently, even with fewer layers than the two-stage network. Qualitative and quantitative experimental results reported in the manuscript reveal that the de-fenced images generated by the single-stage de-fencing network have visual quality similar to those produced by the two-stage network. Comparative performance analysis also emphasizes the effectiveness of our approach over state-of-the-art image de-fencing techniques.

1 Introduction

In recent times, due to the availability of inexpensive image capturing devices such as smart-phones and tablets with sophisticated cameras, there has been a steady rise in the number of images captured and shared over the internet. Despite these technological advances, it is often difficult to take a snapshot of the intended object due to the presence of certain obstructions between the camera and the object. For example, consider that a person wishes to capture the image of an animal within a cage in a zoo. It is understandable that he/she will not be able to capture a clear image of the animal from a distance due to the presence of a fence or cage bars in front. To date, a number of research articles on image de-fencing have been proposed in the literature (e.g., Park et al. [2010], Khasare et al. [2013], Jonna et al. [2015a, 2016], Liu et al. [2008], Farid et al. [2016]). However, these methods are either

Preprint. Under review.


semi-automated or time-intensive due to the involvement of complex computations. Usually, image de-fencing is viewed as a combination of two separate sub-problems: (i) fence mask generation, which essentially clusters the image region into two groups, i.e., fence and non-fence regions, and (ii) image inpainting, which deals with artificially synthesizing colors for the fence regions to make the rendered image look realistic. This is explained with the help of Figure 1.


Figure 1: (a) Input image with fence, (b) fence detection, (c) de-fenced image after inpainting.

Conditional Generative Adversarial Networks (cGANs) (Reed et al. [2016], Zhang et al. [2017], Huang et al. [2018]) have already demonstrated strong potential in generating realistic images in accordance with a set of user-defined conditions or constraints (Yeh et al. [2017], Brkic et al. [2017], Radford et al. [2015], Chen et al. [2016]), and are possibly the best networks available today for performing any image-to-image translation task. Since image de-fencing can also be viewed as a sub-class of image-to-image translation problems, we propose to use cGANs in our work as well.

To the best of our knowledge, the present paper is the first work on image de-fencing that exploits the powerful generalization capability of GANs to generate de-fenced images. Specifically, we propose two different GAN-based de-fencing algorithms: (i) a two-stage network that consists of two sub-networks to carry out fence mask generation and image inpainting in succession, and (ii) a single-stage network which directly performs image de-fencing without any intermediate fence mask generation step. Experimental results on a public fence segmentation data set (Du et al. [2018]) and our own artificially created data set show the effectiveness of our work.

The rest of the paper is organized as follows: in Section 2, we briefly review the related work in automatic fence detection, image inpainting and image de-fencing. The problem formalization and the proposed cGAN-based de-fencing approaches are described in detail in Section 3. An extensive experimental campaign is reported in Section 4. Finally, conclusions and future research directions are highlighted in Section 5.

2 Related Work

As explained before, traditionally, the task of image de-fencing is viewed as a two-step problem: (i) segmenting the fence in the image, and (ii) recovering the eliminated area with plausible content via image inpainting. Hence, in addition to reviewing the image de-fencing techniques in the literature, we also study existing solutions to fence mask generation and image inpainting, which are briefly described in Sub-sections 2.1 and 2.2, respectively. Finally, in Sub-section 2.3, we present a review of the state-of-the-art image de-fencing techniques.

2.1 Fence Mask Detection

Fence mask detection is the process of segmenting an input image containing a fence into two clusters, such that all the fence pixels are assigned to one cluster and all other pixels are assigned to the other. A lot of research has been done in the domain of regular and near-regular pattern detection (Hays et al. [2006], Park et al. [2009], Lin and Liu [2007], Park et al. [2010]). The work in Hays et al. [2006] used higher-order feature matching to discover the lattices of near-regular patterns in real images based on the principal eigenvector of the affinity matrix. Park et al. [2009] proposed a method for detection of deformed 2D wallpaper patterns in real-world images by mapping the 2D lattice detection problem into a multi-target tracking problem, which can be solved within a Markov Random Field framework. In another work by Park et al. [2008], the problem of near-regular fence detection was handled by employing an efficient Mean-Shift Belief Propagation method to extract the underlying deformed lattice in the image. Mu et al. [2014] proposed a soft fence-detection method using visual parallax as the cue to differentiate the fence from the non-fenced regions.


2.2 Image Inpainting

Image inpainting is the process of restoring the unfilled portions of an image with plausible content/color. Image inpainting methods used in the literature can be broadly divided into two categories: (a) diffusion-based methods (Bertalmio et al. [2003], Levin et al. [2008]) and (b) exemplar-based methods (Criminisi et al. [2004], Xu and Sun [2010], Darabi et al. [2012], Huang et al. [2014]). The former category of approaches uses smoothness priors to propagate information from known regions to the unknown region, while the latter category fills in the occluded regions by means of similar patches from other locations in the image. Exemplar-based methods have the potential of filling up large occluded regions and recreating missing textures, but they are unable to recover the high-frequency details of the image properly. To the best of our knowledge, the Context Encoder (Pathak et al. [2016]) is the first deep learning approach used for image inpainting, in which an encoder maps an image with missing regions to a low-dimensional feature space, which is next used by the decoder to reconstruct the output image. Yang et al. [2017] improved upon the work of Pathak et al. [2016] by using a pre-trained VGG network that minimizes the feature differences in the image background. Yeh et al. [2017] proposed a GAN-based approach for image inpainting by applying a set of conditions on the available data. Contextual attention (Yu et al. [2018]) is another approach in which the missing regions are first estimated, followed by an attention mechanism to sharpen the results. Nazeri et al. [2019] developed a two-stage adversarial model consisting of an edge generator and an image completion network; the edge generator detects edges of missing regions, which the image completion network uses as a prior to fill in the missing regions.

2.3 Image De-fencing

The image de-fencing problem was first addressed in Liu et al. [2008], where the fence patterns were segmented by means of spatial regularity, and the fence pixels were filled in with suitable colors by applying an appropriate inpainting algorithm (Criminisi et al. [2004]). Park et al. [2010] extended the work of Liu et al. [2008] by employing multiple images for extracting the information of occluded image data from additional frames. Park et al. [2010] also used a deformable lattice detection method similar to that of Park et al. [2009] discussed in Section 2.1. Khasare et al. [2013] proposed an improved multi-frame de-fencing technique using loopy belief propagation (Felzenszwalb and Huttenlocher [2006]). This method uses image matting (Zheng and Kambhamettu [2009]) for fence segmentation, but its main drawback is that it involves significant user interaction and is therefore not very suitable for practical purposes. Jonna et al. [2015b] proposed a multimodal approach for image de-fencing in video frames, in which the fence mask was first extracted in each frame with the aid of depth maps corresponding to the color images obtained from a Kinect sensor, and next an optical flow algorithm was used to find correspondences between adjacent frames. Finally, the de-fenced image was estimated by modeling it as a Markov Random Field and obtaining its maximum a-posteriori estimate through loopy belief propagation. Kumar et al. [2016] used signal demixing to capture the sparsity and regularity of the different image regions, thereby detecting fences, following which inpainting was performed to fill in the fence pixels with suitable colors. Farid et al. [2016] used a semi-automated approach in which a user is requested to manually mark several fence pixels in the image. A Bayesian classifier is next employed to classify each pixel as a fence or non-fence pixel based on the knowledge of the color distribution of the marked and non-marked pixels. This approach is prone to human error and is also highly time-intensive.

Recently, a few deep learning based video de-fencing approaches have been developed. For example, the work by Jonna et al. [2015a] is a semi-automated approach which first employs a CNN-based algorithm to detect the fence pixels in an input image, and next uses a sparsity-based optimization framework to fill in the fence pixels. Jonna et al. [2016] utilized a pre-trained CNN coupled with an SVM classifier for fence texel joint detection, and then connected the joints to obtain scribbles for image matting. Du et al. [2018] presented an approach for fence segmentation using fully convolutional networks (FCN) (Long et al. [2015]) and a fast robust recovery algorithm employing occlusion-aware optical flow.

The main contributions of the paper are as follows:

1. Developing, for the first time, fully automated and efficient image de-fencing algorithms based on conditional Generative Adversarial Networks (cGANs).


2. Proposing a two-stage image de-fencing network that involves a fence mask generator and an image recovering network, each of which is based on a cGAN.

3. Making the algorithm more time-efficient by developing a single-stage end-to-end cGAN network without any intermediate fence mask generation step. Edge-based features along with ground-truth pairs of fenced and de-fenced images are used to train the model so that it can generate high-quality de-fenced images even with fewer layers than the two-stage network.

4. Performing extensive experimental evaluation, and also making the code and data set used in the experiments publicly available to the research community for further comparisons.

3 Proposed Techniques for Image De-fencing using cGANs

The proposed cGAN-based two-stage and single-stage architectures for image de-fencing are explained in detail in the following two sub-sections.

3.1 Two-stage Image De-fencing Network

The two-stage architecture shown in Figure 2 follows a work-flow similar to that of most existing approaches, i.e., (i) fence mask generation and (ii) recovering the missing parts of the image behind the fence (refer to Section 1). The difference from the existing techniques is that we use adversarial learning to carry out both these steps. More specifically, we use a pair of cGANs, each consisting of a generator-discriminator pair, to perform the two steps. The generator consists of an encoder network with seven down-sampling layers, followed by a decoder network with seven up-sampling layers, as in Isola et al. [2017]. The discriminator is a 16×16 Markovian discriminator, i.e., PatchGAN (Isola et al. [2017]), which classifies each 16×16 patch in the image as real or fake and averages all the responses to provide the final output. The proposed de-fencing algorithm is discussed in Sections 3.1.1 and 3.1.2, denoting by D1 and G1 the discriminator and generator of the fence mask generator, and by D2 and G2 the discriminator and generator of the image recovering network, respectively.
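The structure described above can be made concrete with a minimal PyTorch-style sketch (an assumption on our part, not the released implementation): an encoder-decoder generator with seven down- and seven up-sampling layers, and a patch-based discriminator whose grid of scores can be averaged; the channel widths and kernel sizes are purely illustrative.

import torch
import torch.nn as nn

def down(cin, cout):
    # stride-2 convolution halves the spatial resolution
    return nn.Sequential(nn.Conv2d(cin, cout, 4, 2, 1),
                         nn.BatchNorm2d(cout), nn.LeakyReLU(0.2))

def up(cin, cout):
    # stride-2 transposed convolution doubles the spatial resolution
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, 2, 1),
                         nn.BatchNorm2d(cout), nn.ReLU())

class Generator(nn.Module):
    """Encoder-decoder generator with seven down- and seven up-sampling layers."""
    def __init__(self, in_ch=3, out_ch=3):
        super().__init__()
        widths = [64, 128, 256, 512, 512, 512, 512]          # illustrative widths
        enc, c = [], in_ch
        for w in widths:
            enc.append(down(c, w)); c = w
        dec = []
        for w in reversed(widths[:-1]):
            dec.append(up(c, w)); c = w
        dec.append(nn.Sequential(nn.ConvTranspose2d(c, out_ch, 4, 2, 1), nn.Tanh()))
        self.encoder, self.decoder = nn.Sequential(*enc), nn.Sequential(*dec)

    def forward(self, x):
        return self.decoder(self.encoder(x))

class PatchDiscriminator(nn.Module):
    """Patch-based discriminator: scores local patches as real/fake."""
    def __init__(self, in_ch=6):                              # conditioned input pair
        super().__init__()
        self.net = nn.Sequential(down(in_ch, 64), down(64, 128), down(128, 256),
                                 nn.Conv2d(256, 1, 4, 1, 1))  # map of patch scores
    def forward(self, x, cond):
        # grid of per-patch real/fake probabilities; callers may average them
        return torch.sigmoid(self.net(torch.cat([x, cond], dim=1)))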

Figure 2: Two-stage image de-fencing network: (a) fence mask network, (b) image recovering network.

3.1.1 Fence Mask Generator

The task of the fence mask generator can be viewed as an image-to-image translation problem in which we convert a fenced image to a fence mask image. Let Imask be the ground-truth fence mask image, and Ifen the input fenced image. During the training phase, the generator takes Ifen as input, conditioned on Imask, and predicts the fence mask image Ipred. Mathematically, the generator function can be represented as:

$I_{pred} = G_1(I_{fen}, I_{mask})$.   (1)

We use Imask and Ipred, conditioned on Ifen, as the input of the discriminator to predict the fence mask image as real or fake. The network is trained with an objective function comprising only an adversarial loss and an L1 loss, as shown in (2):

$\min_{G_1} \max_{D_1} L_{G_1} = \min_{G_1}\big(\alpha_1 \max_{D_1}(L_{adv,1}) + \beta_1 L_{L1,1}\big)$,   (2)


where $\alpha_1$ and $\beta_1$ are regularization parameters, with $\alpha_1 = 1$ and $\beta_1 = 10$. The adversarial loss $L_{adv,1}$ and the L1 loss are defined in the following two equations:

$L_{adv,1} = \mathbb{E}_{(I_{fen}, I_{mask})}[\log(D_1(I_{mask}, I_{fen}))] + \mathbb{E}_{(I_{fen}, I_{pred})}[\log(1 - D_1(I_{pred}, I_{fen}))]$,   (3)

and
$L_{L1,1} = \mathbb{E}[\,\|I_{pred} - I_{mask}\|_1\,]$,   (4)

where $\mathbb{E}$ denotes the expectation operator. The network is trained over multiple epochs, and training is stopped when the absolute difference of the loss function between two successive epochs is less than a small threshold $\varepsilon$. We consider the value of $\varepsilon$ to be $10^{-3}$.
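A hedged sketch of how the objective in (2)-(4) might be computed per batch, assuming the PyTorch modules sketched earlier and a discriminator that returns patch-level probabilities; the helper name is hypothetical, the weights follow the values quoted above, and the generator term uses the common non-saturating form rather than the exact minimax expression.

import torch
import torch.nn.functional as F

def fence_mask_losses(G1, D1, I_fen, I_mask, alpha1=1.0, beta1=10.0):
    """Adversarial + L1 objective of Eqs. (2)-(4) for the fence mask generator."""
    eps = 1e-7                                            # numerical safety for log
    I_pred = G1(I_fen)                                    # predicted fence mask, Eq. (1)
    # Eq. (3): discriminator scores (mask, fenced image) pairs as real or fake
    d_real = D1(I_mask, I_fen).clamp(eps, 1 - eps)
    d_fake = D1(I_pred.detach(), I_fen).clamp(eps, 1 - eps)
    L_D = -(torch.log(d_real) + torch.log(1 - d_fake)).mean()       # D maximizes Eq. (3)
    # Generator side: non-saturating adversarial term plus the Eq. (4) L1 term
    L_adv_G = -torch.log(D1(I_pred, I_fen).clamp(eps, 1 - eps)).mean()
    L_G = alpha1 * L_adv_G + beta1 * F.l1_loss(I_pred, I_mask)      # Eq. (2)
    return L_G, L_D

Training would alternate between minimizing L_G and L_D, and the stopping rule above would halt when the epoch-to-epoch change in the loss drops below 1e-3.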

3.1.2 Image Recovering Network

Let Idef be the ground-truth de-fenced image. The input fenced image Ifen is masked with the fence mask image M obtained from the fence mask generator to produce Ĩfen, i.e., Ĩfen = Ifen ⊙ (1 − M), representing the unfilled image without the fence, where ⊙ denotes the Hadamard product. The image recovering network takes Ĩfen as input, conditioned on Idef, and generates a de-fenced image Ipred in which the unfilled part of Ĩfen is filled with plausible content, at the same resolution as the input image, as shown in (5):

$I_{pred} = G_2(\tilde{I}_{fen}, I_{def})$.   (5)
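As a small illustration of the masking step above (a sketch, assuming the fence mask M is a single-channel tensor with 1 on fence pixels):

import torch

def mask_fence(I_fen: torch.Tensor, M: torch.Tensor) -> torch.Tensor:
    """Remove fence pixels via the element-wise (Hadamard) product with (1 - M).

    I_fen : (B, 3, H, W) fenced image
    M     : (B, 1, H, W) binary fence mask, 1 on fence pixels
    """
    return I_fen * (1.0 - M)     # broadcasting applies the mask to every channel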

The image recovering network is trained until convergence using a joint objective function based on Nazeri et al. [2019] that considers an adversarial loss, a perceptual loss, a style loss and an L1 loss, along with an SSIM loss (Wang et al. [2004], Zhao et al. [2017]). The adversarial loss function for G2 is defined as:

$L_{adv,2} = \mathbb{E}_{(\tilde{I}_{fen}, I_{def})}[\log(D_2(I_{def}, \tilde{I}_{fen}))] + \mathbb{E}_{(\tilde{I}_{fen}, I_{pred})}[\log(1 - D_2(I_{pred}, \tilde{I}_{fen}))]$.   (6)

The perceptual loss term $L_{perc}$ (Johnson et al. [2016]) computes the differences between the high-level feature representations of the ground-truth and the generated images, extracted from a pre-trained CNN. If the features of the predicted image are dissimilar from those of the actual image, a higher penalty is imposed through this loss term. Mathematically, the perceptual loss is defined as follows:

$L_{perc} = \mathbb{E}\Big[\sum_i \frac{1}{N_i}\,\|a_i(I_{def}) - a_i(I_{pred})\|_1\Big]$,   (7)

where $a_i$ is the activation map of the $i$-th layer of a pre-trained VGG-19 network and $N_i$ is the number of elements in that activation map. We also use the style loss (Gatys et al. [2016]), which measures the difference between the style representations of two images. The style representation of an image at a particular layer is given by the Gram matrix $G$, each element of which is the inner product between a pair of vectorized feature maps at that layer. For a vectorized feature map of size $C_j \times H_j \times W_j$, the style loss is mathematically defined as follows:

$L_{sty} = \mathbb{E}_j\big[\,\|G_j^{a}(I_{pred}) - G_j^{a}(I_{def})\|_1\,\big]$,   (8)

where $G_j^{a}$ is a $C_j \times C_j$ Gram matrix corresponding to the feature map $a_j$. The L1 loss function is computed as follows:

$L_{L1,2} = \mathbb{E}[\,\|I_{pred} - I_{def}\|_1\,]$.   (9)
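Below is a minimal sketch of how the perceptual loss (7) and the Gram-matrix style loss (8) could be computed, assuming torchvision's pre-trained VGG-19; the particular activation layers are a hypothetical choice, since the layers actually used are not stated above.

import torch
import torch.nn as nn
from torchvision import models

class VGGFeatures(nn.Module):
    """Frozen VGG-19 feature extractor shared by L_perc and L_sty."""
    def __init__(self, layer_ids=(3, 8, 17, 26)):          # hypothetical ReLU layers
        super().__init__()
        vgg = models.vgg19(pretrained=True).features.eval()
        for p in vgg.parameters():
            p.requires_grad = False                         # keep the extractor fixed
        self.vgg, self.layer_ids = vgg, set(layer_ids)

    def forward(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

def gram(a):
    """C x C Gram matrix of a (B, C, H, W) feature map, normalized by its size."""
    b, c, h, w = a.shape
    f = a.view(b, c, h * w)                                 # vectorize each feature map
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def perceptual_and_style_losses(extractor, I_def, I_pred):
    """Eq. (7): mean L1 over activations; Eq. (8): L1 between Gram matrices."""
    L_perc, L_sty = 0.0, 0.0
    for a_def, a_pred in zip(extractor(I_def), extractor(I_pred)):
        L_perc = L_perc + torch.mean(torch.abs(a_def - a_pred))        # (1/N_i) ||.||_1
        L_sty = L_sty + torch.mean(torch.abs(gram(a_pred) - gram(a_def)))
    return L_perc, L_sty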

To obtain visually pleasing images from the generator, we also incorporate a structural similarity loss term (Wang et al. [2004], Zhao et al. [2017]), as shown in (10), which indicates the differences in luminance, contrast, and structure between the generated de-fenced image and the ground-truth de-fenced image:

$L_{SSIM} = \frac{1}{N}\sum_{p}\big(1 - SSIM(p)\big)$,   (10)

where p refers to a particular pixel position, and

$SSIM(p) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$   (11)

refers to the structural similarity index between the ground-truth and the generated image at pixel position p. In the above expression, $\mu_x$ and $\mu_y$ represent mean intensities in the neighborhood of p, $\sigma_x$ and $\sigma_y$ represent the standard deviations of the two non-negative image signals x and y, respectively, $\sigma_{xy}$ represents the covariance of x and y, and $C_1$ and $C_2$ are constants. Appropriate values for $C_1$ and $C_2$ to compute the SSIM loss can be found in https://github.com/keras-team/keras-contrib/blob/master/keras_contrib/losses/dssim.py.
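An illustrative SSIM-loss computation for (10)-(11), assuming single-channel inputs in [0, 1], an 11×11 uniform local window, and the commonly used constants C1 = 0.01^2 and C2 = 0.03^2 (the values at the linked URL may differ):

import torch
import torch.nn.functional as F

def ssim_loss(x, y, window=11, C1=0.01 ** 2, C2=0.03 ** 2):
    """Eq. (10): mean of (1 - SSIM(p)) over all pixel positions p."""
    pad = window // 2
    kernel = torch.ones(1, 1, window, window, device=x.device) / window ** 2
    mu_x = F.conv2d(x, kernel, padding=pad)                          # local means
    mu_y = F.conv2d(y, kernel, padding=pad)
    sigma_x = F.conv2d(x * x, kernel, padding=pad) - mu_x ** 2       # local variances
    sigma_y = F.conv2d(y * y, kernel, padding=pad) - mu_y ** 2
    sigma_xy = F.conv2d(x * y, kernel, padding=pad) - mu_x * mu_y    # local covariance
    # Eq. (11): per-pixel structural similarity index
    ssim_map = ((2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)) / \
               ((mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2))
    return torch.mean(1.0 - ssim_map)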

The overall loss function for the image recovering network is therefore computed as:

$\min_{G_2} \max_{D_2} L_{G_2} = \min_{G_2}\big(\alpha_2 \max_{D_2}(L_{adv,2}) + \beta_2 L_{L1,2} + \gamma L_{perc} + \delta L_{sty} + \eta L_{SSIM}\big)$,   (12)

where $\alpha_2$, $\beta_2$, $\gamma$, $\delta$ and $\eta$ are regularization parameters. In our experiments, we use $\alpha_2 = 0.1$, $\beta_2 = 10$, $\gamma = 2$, $\delta = 1$ and $\eta = 1$.
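Putting the terms together, a sketch of the generator-side objective in (12) with the weights quoted above; the individual loss values are assumed to come from helpers like the hypothetical ones sketched earlier.

def recovering_generator_loss(L_adv, L_l1, L_perc, L_sty, L_ssim,
                              alpha2=0.1, beta2=10.0, gamma=2.0, delta=1.0, eta=1.0):
    """Weighted sum of the loss terms on the generator side of Eq. (12)."""
    return (alpha2 * L_adv + beta2 * L_l1 + gamma * L_perc
            + delta * L_sty + eta * L_ssim)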

3.2 Single-Stage End-to-End Image De-fencing Network

Figure 3: Single-stage image de-fencing network.

The use of two generators makes the image de-fencing process time-intensive. To reduce the response time, we propose to use a single end-to-end architecture with a single generator-discriminator pair for translating the input image with a fence structure directly to its de-fenced version, without any fence generation step in between. Since only a single generator is used here, the single-stage network has fewer layers than the two-stage network described in Section 3.1. A convolutional network with fewer layers has a tendency to lose significant contextual information. To boost the performance of the single-stage network, we propose to train the model with the given ground-truth pairs of fenced and de-fenced images as well as with an edge map of the fenced image given by the Canny filter. We observe that appending the Canny edge map to the input fenced image facilitates high-quality image de-fencing even with fewer layers. The network architectures for the generator and discriminator are similar to those of the image recovering network in Section 3.1.2. Also, an objective function similar to (12) is used to train the single-stage network until convergence.
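A minimal sketch of how the Canny edge map could be appended to the fenced image before it enters the single-stage generator, assuming OpenCV and a 4-channel generator input; the Canny thresholds are illustrative, since they are not stated above.

import cv2
import numpy as np
import torch

def fenced_image_with_edges(path, low=100, high=200, size=256):
    """Return a (1, 4, size, size) tensor: RGB fenced image plus its Canny edge map."""
    bgr = cv2.imread(path)
    bgr = cv2.resize(bgr, (size, size))
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high).astype(np.float32) / 255.0    # edge map in {0, 1}
    x = np.concatenate([rgb, edges[..., None]], axis=2)              # H x W x 4
    return torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0)         # 1 x 4 x H x W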

4 Experiments and Results

4.1 Data Set and System Description

We use three Graphics Processing Units (GPUs) to train the models, of which one is an Nvidia Titan Xp (with 12 GB RAM, total FB memory of 12196 MB and total BAR1 memory of 256 MB), and the other two are Nvidia GeForce GTX 1080 Ti (with 11 GB RAM, total FB memory of 11178 MB and total BAR1 memory of 256 MB). The experimental protocols are briefly discussed next. For training the two-stage image de-fencing network, the fence segmentation dataset provided by Du et al. [2018] is used along with a synthetically generated data set formed by adding artificial fence structures to a set of images from the Pascal VOC dataset (Everingham et al. [2010]). For training the image recovering network, we also created an artificial data set using images from the Pascal VOC dataset (Everingham et al. [2010]) and the COCO dataset (Lin et al. [2014]). The data set for training the single-stage end-to-end image de-fencing network is constructed in a similar manner by applying random fence structures to a set of images from the Pascal VOC and COCO data sets. The test set consists of a total of 245 images and is formed by selecting images from the above-mentioned public data sets as well as some images captured by our research team. Before making a forward pass through the network, the dimensions of each training and test image are made equal to 256×256.
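The synthetic training pairs described above could be formed along the following lines (a hedged sketch: the actual fence patterns and compositing used are not specified; here a binary fence mask image is simply pasted over a clean photo with a flat gray color).

import cv2
import numpy as np

def synthesize_fenced_pair(clean_path, fence_mask_path, size=256,
                           fence_color=(160, 160, 160)):
    """Create a (fenced, de-fenced, mask) training triple from a clean image."""
    clean = cv2.resize(cv2.imread(clean_path), (size, size))
    mask = cv2.resize(cv2.imread(fence_mask_path, cv2.IMREAD_GRAYSCALE), (size, size))
    mask = (mask > 127).astype(np.uint8)                      # binary fence mask
    fenced = clean.copy()
    fenced[mask == 1] = fence_color                           # paint fence pixels
    return fenced, clean, mask * 255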


4.2 Experiments with Two-Stage Image De-fencing Network

Figure 4 shows the qualitative performance of the two-stage image de-fencing network on a sample set of fenced images. The first row in the figure represents the input, the second row represents the output of the mask generator, the third row corresponds to the input to the image recovery network, and the last row shows the output of the image recovery network.

Figure 4: Qualitative results for the two-stage image de-fencing network.

It is seen from the figure that the output results are visually quite appealing.

4.3 Experiments with Single-Stage Image De-fencing Network

As discussed in Section 3.2, in a bid to further reduce the response time of image de-fencing, we propose an efficient single-stage image de-fencing network. In the fourth row of Figure 5, the outputs given by the proposed single-stage de-fencing network (with supervision from the Canny edge map) are shown for eight different test images. The first and second rows of the figure respectively depict the input image and the corresponding ground-truth, the third row shows the images generated by the single-stage image de-fencing network without the extra supervision of Canny edge maps, and the fifth row shows the images generated by the two-stage de-fencing network discussed in Section 3.1.

Figure 5: Qualitative results by the single-stage end-to-end image de-fencing network.

The qualitative results clearly show that the output images given by the proposed single-stage image de-fencing network are comparable with those of the two-stage de-fencing network.

5 Conclusions and Future Work

In the present manuscript, we explore the applicability of conditional Generative Adversarial Networks (cGANs) for image de-fencing in two different settings: (i) a two-stage image de-fencing network


which employs a fence mask generator network and an image recovering network in succession, and (ii) a single-stage image de-fencing network, which uses a single generator to render the de-fenced image from the original fenced image and its edge map. Experimental results show that both the proposed methods can produce de-fenced images with high visual quality, but the latter approach is more advantageous if a fast response is desired. Comparison with state-of-the-art techniques also emphasizes the effectiveness of employing GANs in the image de-fencing task. In the future, we plan to study how to make the algorithm more time-efficient, so that it can be applied to real-time video de-fencing. Another challenging area of research is how to effectively de-fence images with irregular or repeated fence structures, which the present work, as well as other existing techniques, is unable to handle properly.

References

Marcelo Bertalmio, Luminita Vese, Guillermo Sapiro, and Stanley Osher. Simultaneous structure and texture image inpainting. IEEE Transactions on Image Processing, 12(8):882–889, 2003.

Karla Brkic, Ivan Sikiric, Tomislav Hrkac, and Zoran Kalafatic. I know that person: Generative full body and face de-identification of people in images. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1319–1328. IEEE, 2017.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

Antonio Criminisi, Patrick Pérez, and Kentaro Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing, 13(9):1200–1212, 2004.

Soheil Darabi, Eli Shechtman, Connelly Barnes, Dan B Goldman, and Pradeep Sen. Image melding: Combining inconsistent images using patch-based synthesis. ACM Transactions on Graphics, 31(4):82, 2012.

Chen Du, Byeongkeun Kang, Zheng Xu, Ji Dai, and Truong Nguyen. Accurate and efficient video de-fencing using convolutional neural networks and temporal information. In Proceedings of the IEEE International Conference on Multimedia and Expo, pages 1–6. IEEE, 2018.

Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

Muhammad Shahid Farid, Arif Mahmood, and Marco Grangetto. Image de-fencing framework with hybrid inpainting algorithm. Signal, Image and Video Processing, 10(7):1193–1201, 2016.

Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient belief propagation for early vision. International Journal of Computer Vision, 70(1):41–54, 2006.

Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.

James Hays, Marius Leordeanu, Alexei A. Efros, and Yanxi Liu. Discovering texture regularity as a higher-order correspondence problem. In European Conference on Computer Vision, pages 522–535. Springer, 2006.

He Huang, Philip S. Yu, and Changhu Wang. An introduction to image synthesis with generative adversarial nets. CoRR, abs/1803.04469, 2018.

Jia-Bin Huang, Sing Bing Kang, Narendra Ahuja, and Johannes Kopf. Image completion using planar structure guidance. ACM Transactions on Graphics (TOG), 33(4):129, 2014.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.

Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.

Sankaraganesh Jonna, Krishna K. Nakka, and Rajiv R. Sahay. My camera can see through fences: A deep learning approach for image de-fencing. In Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pages 261–265. IEEE, 2015a.

Sankaraganesh Jonna, Vikram S. Voleti, Rajiv R. Sahay, and Mohan S. Kankanhalli. A multimodal approach for image de-fencing and depth inpainting. In Proceedings of the 8th International Conference on Advances in Pattern Recognition (ICAPR), pages 1–6. IEEE, 2015b.

Sankaraganesh Jonna, Krishna K. Nakka, and Rajiv R. Sahay. Deep learning based fence segmentation and removal from an image using a video sequence. In European Conference on Computer Vision, pages 836–851. Springer, 2016.

Vrushali S. Khasare, Rajiv R. Sahay, and Mohan S. Kankanhalli. Seeing through the fence: Image de-fencing using a video sequence. In Proceedings of the IEEE International Conference on Image Processing, pages 1351–1355. IEEE, 2013.

Veepin Kumar, Jayanta Mukherjee, and Shyamal Kumar Das Mandal. Image defencing via signal demixing. In Proceedings of the 10th Indian Conference on Computer Vision, Graphics and Image Processing, page 11. ACM, 2016.

Anat Levin, Dani Lischinski, and Yair Weiss. A closed-form solution to natural image matting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):228–242, 2008.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

Wen-Chieh Lin and Yanxi Liu. A lattice-based MRF model for dynamic near-regular texture tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):777–792, 2007.

Yanxi Liu, Tamara Belkina, James Hays, and Roberto Lublinerman. Image de-fencing. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

Yadong Mu, Wei Liu, and Shuicheng Yan. Video de-fencing. IEEE Transactions on Circuits and Systems for Video Technology, 24(7):1111–1121, 2014.

Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z. Qureshi, and Mehran Ebrahimi. EdgeConnect: Generative image inpainting with adversarial edge learning. CoRR, abs/1901.00212, 2019.

Minwoo Park, Robert T. Collins, and Yanxi Liu. Deformed lattice discovery via efficient mean-shift belief propagation. In European Conference on Computer Vision, pages 474–485. Springer, 2008.

Minwoo Park, Kyle Brocklehurst, Robert T. Collins, and Yanxi Liu. Deformed lattice detection in real-world images using mean-shift belief propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10):1804–1816, 2009.

Minwoo Park, Kyle Brocklehurst, Robert T. Collins, and Yanxi Liu. Image de-fencing revisited. In Asian Conference on Computer Vision, pages 422–434. Springer, 2010.

Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.

Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

Zongben Xu and Jian Sun. Image inpainting by patch propagation using patch sparsity. IEEE Transactions on Image Processing, 19(5):1153–1165, 2010.

Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li. High-resolution image inpainting using multi-scale neural patch synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6721–6729, 2017.

Raymond A. Yeh, Chen Chen, Teck Yian Lim, Alexander G. Schwing, Mark Hasegawa-Johnson, and Minh N. Do. Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5485–5493, 2017.

Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Generative image inpainting with contextual attention. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5505–5514, 2018.

Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907–5915, 2017.

Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz. Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging, 3(1):47–57, 2017.

Yuanjie Zheng and Chandra Kambhamettu. Learning based digital matting. In Proceedings of the IEEE 12th International Conference on Computer Vision, pages 889–896. IEEE, 2009.
