
Unsupervised Eyeglasses Removal in the Wild



Bingwen Hu, Wankou Yang, and Mingwu Ren

Abstract—Eyeglasses removal is challenging in removing different kinds of eyeglasses, e.g., rimless glasses, full-rim glasses and sunglasses, and recovering appropriate eyes. Due to the significant visual variants, the conventional methods lack scalability. Most existing works focus on frontal face images in controlled environments such as the laboratory and need to design specific systems for different eyeglass types. To address this limitation, we propose a unified eyeglass removal model called Eyeglasses Removal Generative Adversarial Network (ERGAN), which can handle different types of glasses in the wild. The proposed method does not depend on dense annotation of the eyeglasses location but benefits from large-scale face images with weak annotations. Specifically, we study two relevant tasks simultaneously, i.e., removing and wearing eyeglasses. Given two facial images with and without eyeglasses, the proposed model learns to swap the eye area of the two faces. The generation mechanism focuses on the eye area and avoids the difficulty of generating an entirely new face. In the experiment, we show that the proposed method achieves a competitive removal quality in terms of realism and diversity. Furthermore, we evaluate our method on several subsequent tasks, such as face verification and facial expression recognition. The experiment shows that our method could serve as a pre-processing step for these tasks.

Index Terms—Eyeglasses Removal, Image Manipulation, Generative Adversarial Network

I. INTRODUCTION

THE eye is viewed as "a window to the soul" [1], containing rich biometric information, e.g., identity, gender, and age. In recent years, there has been increasing interest in face-related applications. Among these applications, eyeglasses are usually considered one kind of occlusion in face images. As a result, the occlusion compromises subsequent tasks, such as face verification [2]–[5] and expression recognition [6]–[8]. One way to address occlusion is to ignore the occluded area, which has been successfully applied in the field of person re-identification [9]–[12]. Different from the human body, the face is rich in identity information, and the eye area is the most discriminative facial region. When the occluded area is ignored, the retained information is insufficient to support an accurate decision. Therefore, we propose an eyeglasses removal method that transforms the occluded area into a non-occluded area. Despite significant advances in image manipulation, eyeglasses removal in the unconstrained environment, also known as in the wild, has not been well studied. In this work, we intend to fill this gap.

The pioneering eyeglasses removal works mainly focus on eyeglasses in a controlled environment. Most works [13]–[17] require pair-wise training samples.

B. Hu and M. Ren are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210092, China (e-mail: [email protected]; [email protected]).

W. Yang is with the School of Automation, Southeast University, Nanjing 210092, China (e-mail: [email protected]).

Fig. 1: (a) Examples of different types of eyeglasses in the wild. Different types of eyeglasses have significant visual variants, such as color, style, and transparency. Besides, faces in the wild are usually in arbitrary poses with various lighting and backgrounds. (b) A brief pipeline of the proposed method. The proposed Eyeglasses Removal Generative Adversarial Network (ERGAN) simultaneously takes the two relevant tasks, i.e., wearing and removing glasses, into consideration.

Every pair of images contains two frontal faces of the same identity with and without glasses. When testing, given one frontal test image, the framework first detects the eye area and then replaces the original eye area with a reconstructed eye. The crafted eye area fuses the no-glasses training samples by Principal Component Analysis (PCA) [18]. This line of works adopts a strong assumption that faces are in a frontal pose and that the types of eyeglasses are limited. In a realistic scenario, however, there are three main drawbacks: (1) It is hard to collect a large number of pair-wise training images of the same people with and without eyeglasses; (2) Eyeglasses are usually made of different materials with significant visual variants, such as color, style, and transparency (see Fig. 1 (a)). It is infeasible to train dedicated removal systems for different eyeglasses types; (3) Existing works could not be applied to face images in different poses. A model trained in the laboratory environment usually lacks scalability to significant visual variants.

This paper addresses these three challenges. First, the previous methods [13]–[17] demand pair-wise training face images of the same person in a controlled environment. Some works [13]–[15] also require an extra detector to locate the eyeglasses. Instead, we propose an unsupervised eyeglasses removal method. The only information that we need is whether the training image contains eyeglasses on the face.


In particular, we collect large-scale training images from public datasets on the web and divide them into two sets, i.e., images with glasses and images without glasses, respectively. The weak demand on the training set largely saves the annotation cost.

Second, eyeglasses usually have significant intra-class variants in terms of geometric shape and appearance. It is infeasible to build dedicated models for every kind of eyeglasses. We therefore exploit an alternative method that lets the model "see" various eyeglasses and learn scalable weights from a large number of face images. In particular, we propose the Eyeglasses Removal GAN (ERGAN) to learn the general structure of glasses and encode different types of eyeglasses. The proposed method not only removes the eyeglasses but also demands that the model generate eyeglasses, which further enforces the model to learn the local patterns of different eyeglasses (see Fig. 1 (b)).

Third, the problem of eyeglasses removal from arbitrary face poses is common and challenging in a realistic scenario. The previous works mostly assume that we could obtain frontal face images and accurately align the eye area. In the realistic scenario, however, the user may upload non-frontal face images whose eye area is not well aligned. One solution is to resort to complicated alignment calibration, e.g., generating the frontal faces. By contrast, the proposed method does not need dense alignment. The only pre-processing we need is to rotate the face images according to the center of the face. Thanks to an extensive collection of face images in arbitrary poses, the proposed method does not need complicated alignment and is robust to various pose variants.

In an attempt to address the challenges mentioned above, we propose a unified eyeglasses removal generative adversarial network (ERGAN) to fully take advantage of large-scale training images with weak annotations. We learn two representations of the input face image, i.e., a face appearance code and an eye-area attribute code. The face appearance code mainly contains the low-level geometric structure of the face, while the eye-area attribute code captures the semantic pattern in the eye area. In more detail, we first utilize two different encoders to decompose the face image into the face appearance code and the eye-area attribute code, respectively. Then, one decoder is learned to combine the face appearance code with the eye-area attribute code to reconstruct the face image. The self-reconstruction loss is applied to ensure that the two latent codes are complementary to each other and preserve the information of the original image. To enforce the model to focus on the eye area of the input image, we introduce the eye-area reconstruction loss. Furthermore, the face appearance reconstruction loss and the cycle consistency loss are adopted to encourage the mapping between the reconstructed face image and the two latent codes to be invertible. When testing, we can combine the face appearance code of the target face with an eye-area attribute code to remove or wear any specific eyeglasses.

To evaluate the generation quality, we adopt FID [19] and LPIPS [20] as indicators to test the realism and diversity of the generated face images, respectively. Extensive qualitative and quantitative experiments show that our method achieves superior performance to several existing approaches [21]–[23] and could serve as a good pre-processing tool for subsequent tasks, such as face verification and facial expression recognition.

In summary, the main contributions of this work are as follows:

• We propose a unified framework called ERGAN that enables removal of different types of eyeglasses from faces in the wild. Compared with previous works, the proposed method needs neither densely-aligned faces nor pair-wise training samples. The only weak annotation that we need is whether the training data contains eyeglasses or not.

• Due to the large visual variants of eyeglasses, including shape and color, eyeglasses removal demands a robust model. In this work, we propose a dual learning scheme to simultaneously learn two inverse manipulations, i.e., removing eyeglasses and wearing eyeglasses. Furthermore, we utilize the eye-area reconstruction loss to explicitly make the model pay more attention to the eye area. The ablation study verifies the effectiveness of the dual learning scheme and the eye-area reconstruction loss.

• The qualitative and quantitative experiments show that our method outperforms other competitive approaches in terms of realism and diversity. Furthermore, we evaluate the generated faces on other face-related tasks. Owing to the high-fidelity eyeglasses removal, the generated faces benefit subsequent tasks, such as face verification and facial expression recognition.

II. RELATED WORK

A. Statistical Learning

Most pioneering works on eyeglasses removal are based on statistical learning [13], [14], [16], [17], [24]. One line of works adopts the assumption that the target faces could be reconstructed from other faces without eyeglasses. Based on this assumption, previous methods widely adopt Principal Component Analysis (PCA) [18] to learn the shared components among face images. In one of the early works, Wu et al. [13] propose a find-and-replace approach to remove eyeglasses from frontal face images. This method first finds the location of the eyeglasses with an eye-region detector and then replaces it with a synthesized glasses-free image. The synthesized glasses-free image is inferred by combining the original eye area and differently weighted glasses-free faces in the training data. Furthermore, Park et al. [14] apply a recursive process of PCA reconstruction and error compensation to generate facial images without glasses. Due to the temperature difference between glasses and human faces, another line of works takes advantage of thermal images to remove the eyeglasses. Wong et al. [17] propose a nonlinear eyeglasses removal algorithm for thermal images based on kernel principal component analysis (KPCA) [25]. This method performs KPCA to transfer the visible reconstruction information from the visible feature space to the thermal feature space, and then applies image reconstruction to remove eyeglasses from the thermal face image.


Different from the PCA-based methods mentioned above, some researchers resort to sparse coding and expectation-maximization to reconstruct faces. Yi et al. [16] deploy the sparse representation technique in a local feature space to deal with the issue of eyeglasses occlusion. De Smet et al. [24] propose a generalized expectation-maximization approach, which applies a visibility map to inpaint the occluded areas of the face image.

However, these methods [13], [14], [16], [17], [24] are usually designed for frontal faces and specific eyeglasses in a controlled environment. Different from the existing works, our model is trained on a large number of face images collected from realistic scenarios and is scalable to visual variants, such as pose, illumination, and different types of glasses.

B. Generative Adversarial Network

Recent advances in facial manipulation are due to two factors: (1) large-scale public face datasets with attribute annotations, e.g., CelebA [26]; (2) the high-fidelity images generated by Generative Adversarial Networks (GANs). A GAN is a generative model that benefits from the competition between a generator and a discriminator [27]. Facial image manipulation algorithms based on GANs have taken significant steps [21]–[23], [28]–[30]. One line of previous works directly learns the image-to-image translation between different facial attributes. Shen et al. [28] present a GAN-based method using residual image learning and dual learning to manipulate the attribute-specific face area. Liu et al. [30] present a unified selective transfer network for arbitrary image attribute editing (STGAN), which combines a difference attribute vector and selective transfer units (STUs) in an autoencoder network. Another line of works first learns an embedding of the face attributes and then decodes the learned feature to generate images. Liu et al. [22] propose an unsupervised image-to-image translation (UNIT) framework, which combines the spirit of both GANs and variational autoencoders (VAEs) [48]. UNIT adopts the assumption that a pair of images in different domains can be mapped to a shared latent feature space and, thus, can conduct face image translation by manipulating the latent code. Furthermore, Huang et al. propose a method for multimodal unsupervised image-to-image translation (MUNIT) [23]. MUNIT assumes that a pair of images in different domains share the same content space but not the same style space. Sampling different style codes can produce diverse and multimodal outputs while preserving the principal content.

In this work, we also adopt a GAN-based network, called ERGAN, but we differ from these GAN-based approaches [21]–[23] in the following main aspects: (1) Owing to the eye-area loss and invertible eye generation, the proposed ERGAN explicitly focuses on the manipulation of the eye region, while the existing works focus on generating the whole face. The conventional generation mechanism inevitably introduces noise to other areas of the original face when removing glasses; (2) We adopt an instance generation mechanism, which can swap the eye area of any two facial images. Compared with the CycleGAN-based methods, we can generate more diverse images;

(3) Different from previous works that treat faces with different attributes as two or multiple domains, we view the face images as one domain with two codes, i.e., a face appearance code and an eye-area attribute code. We can further manipulate the codes to generate conditional outputs that meet user demands.

III. METHODOLOGY

We first formulate the problem of eyeglasses removal and provide an overview of the proposed eyeglasses removal generative adversarial network (ERGAN) in Section III-A. In Section III-B, we describe the details of each component of ERGAN, followed by the full objective function in Section III-C.

A. Formulation

The architecture of the proposed unified eyeglasses removal generative adversarial network (ERGAN) is presented in Fig. 2 (a). In this paper, we attempt to handle two inverse manipulations at the same time, i.e., eyeglasses removal and eyeglasses wearing. We denote face images with glasses and without glasses as X and Y (X, Y ⊂ R^{H×W×3}), respectively. The goal of the proposed method is to learn two mappings, X → Y and Y → X, that can transfer an image x ∈ X to another image y ∈ Y, and vice versa. In this work, we assume each face image can be decomposed into a face appearance code and an eye-area attribute code. We then combine a face appearance code with an eye-area attribute code from an example face image to remove or wear eyeglasses. In this way, the two inverse manipulations can both be achieved.

Generator. As illustrated in Fig. 2 (a), given a pair of images (x, y), the generator of the proposed method adopts an auto-encoder-like framework, which consists of face appearance encoders (E^f_x, E^f_y), eye-area attribute encoders (E^e_x, E^e_y), and decoders (G_x, G_y). Specifically, the role of the encoders (E^f_·, E^e_·) is to encode a given face image into a face appearance code f_· and an eye-area attribute code e_·, respectively. Moreover, f_· and e_· are combined to generate a new image with the decoder G_·.

Discriminator. (D_x, D_y) are two discriminators for X and Y, respectively. For instance, given a pair of images x ∈ X and y ∈ Y, the discriminator D_x aims to distinguish images generated by the decoder G_x(E^f_y(y), E^e_x(x)) from real images in X. In the same way, the discriminator D_y aims to distinguish images generated by the decoder G_y(E^f_x(x), E^e_y(y)) from real images in Y.

B. Eyeglasses Removal Generative Adversarial Network

Face image self-reconstruction. As shown in Fig. 2 (b), to enforce the generator to focus on the eye area of the input face image x, we first mask out the eye region to obtain x_mask and then encode x and x_mask with the encoders (E^f_x, E^e_x) to obtain the eye-area attribute code e_x and the face appearance code f_x. Finally, e_x and f_x are combined to generate the self-reconstructed image x_recon with the decoder G_x, where x_recon = G_x(E^f_x(x), E^e_x(x)). It is straightforward that the generated result of self-reconstruction should approximate the source image.


Fig. 2: An overview of the Eyeglasses Removal Generative Adversarial Network (ERGAN). (a) The proposed method implements two mappings: X → Y and Y → X. The input face image is decomposed into a face appearance code f and an eye-area attribute code e by the encoders E^f_· and E^e_·, respectively. Encoders E^f_x and E^f_y share weights. The black dashed line denotes that the eye region of the input image to the face appearance encoder E^f_· is masked. (b) The self-reconstruction process of input image x to generate x_recon. x_mask denotes x after the eye area is masked. (c) Illustration of example-guided eyeglasses removal. e_x and f_y are combined to generate v by G_x. (d) We propose a dual learning scheme to simultaneously learn two inverse manipulations (removing eyeglasses and wearing eyeglasses) by swapping eye-area attribute codes.

We introduce the face self-reconstruction loss, which is defined as:

L^face_recon = E[||G_x(E^f_x(x), E^e_x(x)) − x||_1] + E[||G_y(E^f_y(y), E^e_y(y)) − y||_1],    (1)

where x ∈ X and y ∈ Y. The pixel-wise l1-norm ||·||_1 is employed to preserve the sharpness of the self-reconstructed images. The face self-reconstruction loss enforces the encoders {E^f_x, E^e_x} to learn two representations of the input face image, i.e., the eye-area attribute code and the face appearance code, respectively.
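To make Eq. (1) concrete, the following is a minimal PyTorch-style sketch of the face self-reconstruction loss, assuming the encoders and decoders described above are available as callables; all names are illustrative rather than the authors' actual implementation.

import torch.nn.functional as F

def face_self_reconstruction_loss(x, y, Ef_x, Ee_x, Gx, Ef_y, Ee_y, Gy):
    # Eq. (1): each image is rebuilt from its own appearance code and eye-area code,
    # and an L1 penalty keeps the reconstruction sharp and close to the input.
    x_recon = Gx(Ef_x(x), Ee_x(x))
    y_recon = Gy(Ef_y(y), Ee_y(y))
    return F.l1_loss(x_recon, x) + F.l1_loss(y_recon, y)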

Eye-area reconstruction. The self-reconstruction loss encourages the model to focus on the global features of the input image. For the task of glasses removal, to enforce the model to pay more attention to the eye region, we introduce the eye-area reconstruction loss:

L^eye_recon = E[||x^eye_recon − x^eye||_1] + E[||y^eye_recon − y^eye||_1],    (2)

where x^eye and x^eye_recon denote the eye area of x and x_recon, respectively. Similarly, y^eye and y^eye_recon denote the eye area of y and y_recon. In practice, since the face images are all center-aligned, the eye area is defined as the rectangle spanning rows 0.4h to 0.65h and columns 0.2w to 0.75w of the input image, where (h, w) is the size of the input face image.
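A possible implementation of the eye-area crop and Eq. (2) is sketched below; the row/column fractions follow the definition above and should be treated as an approximation of the paper's exact crop.

import torch.nn.functional as F

def crop_eye_area(img):
    # img: (B, C, H, W) tensor of center-aligned faces.
    # Rows 0.4h to 0.65h and columns 0.2w to 0.75w, per the eye-area definition above.
    h, w = img.shape[-2], img.shape[-1]
    return img[..., int(0.40 * h):int(0.65 * h), int(0.20 * w):int(0.75 * w)]

def eye_area_reconstruction_loss(x, x_recon, y, y_recon):
    # Eq. (2): L1 loss restricted to the cropped eye area.
    return (F.l1_loss(crop_eye_area(x_recon), crop_eye_area(x)) +
            F.l1_loss(crop_eye_area(y_recon), crop_eye_area(y)))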

Dual learning. The proposed method is based on the assumption that a face image can be decomposed into a face appearance code and an eye-area attribute code. When the face appearance code is fixed, we combine it with the eye-area attribute code from a target face image to generate an image with or without glasses. Fig. 2 (c) shows an example of eyeglasses removal. The two related tasks of removing glasses and wearing glasses can be regarded as a dual process. Therefore, we apply a dual learning scheme [32] to learn the two inverse manipulations simultaneously. The ablation study (in Section IV-F) confirms the effectiveness of the dual learning scheme.

We show the process of dual learning in Fig. 2 (d). Given images x and y, we encode them into (f_x, e_x) and (f_y, e_y), where (f_x, e_x) = (E^f_x(x), E^e_x(x)) and (f_y, e_y) = (E^f_y(y), E^e_y(y)). The dual tasks (i.e., removing eyeglasses and wearing eyeglasses) are performed by swapping codes. We adopt the decoders G_x and G_y to generate the final output images u = G_y(f_x, e_y) and v = G_x(f_y, e_x). We learn the face appearance code and the eye-area attribute code of the input face image, respectively. The face appearance code mainly contains the low-level geometric structure of the face. Since the geometries of different faces are similar, we can map face appearance codes into the same latent space. During training, therefore, we employ weight sharing between E^f_x and E^f_y to update the model synchronously.
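The code-swapping step in Fig. 2 (d) can be sketched as follows; Ef denotes the weight-shared face appearance encoder, and the module names are illustrative.

def swap_eye_area_codes(x, y, Ef, Ee_x, Ee_y, Gx, Gy):
    # Encode both images into appearance and eye-area attribute codes.
    f_x, f_y = Ef(x), Ef(y)          # E^f_x and E^f_y share weights, so one module suffices
    e_x, e_y = Ee_x(x), Ee_y(y)      # domain-specific eye-area attribute encoders
    # Swap the eye-area codes to perform the two inverse manipulations.
    u = Gy(f_x, e_y)                 # x with its eyeglasses removed
    v = Gx(f_y, e_x)                 # y wearing the eyeglasses of x
    return u, v, (f_x, e_x), (f_y, e_y)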


Fig. 3: Results of eyeglasses removal from face images on the CelebA dataset. From top to bottom: real images, CycleGAN [21], UNIT [22], MUNIT [23], and our method.

By contrast, the eye-area attribute code captures the semantic pattern in the eye area; as a result, the eye-area attributes of face images with glasses and without glasses are significantly different. Therefore, we do not share weights between E^e_x and E^e_y.

Furthermore, the proposed method should be able to reconstruct (f_x, e_x) and (f_y, e_y) after decoding u and v. As illustrated in Fig. 2 (d), we apply the encoders E^f_x and E^e_y to obtain the face appearance code f_x and the eye-area attribute code e_y; then f_x and e_y are combined to generate u. A similar process is utilized to generate v. We then encode u and v into (f_u, e_u) and (f_v, e_v). To ensure that the generated images u and v retain the original information, we define the face appearance reconstruction loss L^f_recon and the eye-area attribute reconstruction loss L^e_recon. We formulate L^f_recon and L^e_recon as follows:

L^f_recon = E[||f_u − f_x||_1] + E[||f_v − f_y||_1],    (3)

L^e_recon = E[||e_u − e_x||_1] + E[||e_v − e_y||_1],    (4)

where (f_u, f_v) = (E^f_x(u), E^f_y(v)) and (e_u, e_v) = (E^e_x(u), E^e_y(v)). Notably, we find that the performance of the model declines when the eye-area attribute reconstruction loss L^e_recon is introduced. To achieve the highest performance, we ignore this loss. A detailed discussion is given in our ablation study (in Section IV-F).
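Eq. (3), the only latent reconstruction term kept in the final objective, could be implemented along the following lines (eye masking of the encoder input is omitted here for brevity):

import torch.nn.functional as F

def face_appearance_reconstruction_loss(u, v, f_x, f_y, Ef):
    # Eq. (3): re-encode the swapped outputs and require the appearance codes to survive.
    f_u, f_v = Ef(u), Ef(v)
    return F.l1_loss(f_u, f_x) + F.l1_loss(f_v, f_y)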

We then recombine the face appearance codes (f_u, f_v) and the eye-area attribute codes (e_u, e_v) to generate x̂ = G_x(f_u, e_v) = G_x(E^f_x(u), E^e_y(v)) and ŷ = G_y(f_v, e_u) = G_y(E^f_y(v), E^e_x(u)), respectively. To ensure that the generated images u and v can reconstruct the original images x and y, we introduce the cycle consistency loss [21]. The cycle consistency loss L_cc is defined as:

L_cc = E[||G_x(E^f_x(u), E^e_y(v)) − x||_1] + E[||G_y(E^f_y(v), E^e_x(u)) − y||_1],    (5)

where u = G_y(f_x, e_y) and v = G_x(f_y, e_x).

Adversarial loss. To encourage the generated face images to be indistinguishable from real face images, we adopt the adversarial loss [27]. In this case, G_x and G_y attempt to generate high-fidelity face images (i.e., with glasses and without glasses), while D_x and D_y attempt to distinguish real face images from generated face images. The adversarial loss is defined as:

L_adv = E[log D_x(x)] + E[log(1 − D_x(G_x(f_y, e_x)))] + E[log D_y(y)] + E[log(1 − D_y(G_y(f_x, e_y)))].    (6)
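As a sketch, the discriminator side of Eq. (6) can be written as below, assuming D_x and D_y output probabilities; note that the implementation in Section IV-C actually replaces this log-likelihood form with the LSGAN objective [46].

import torch

def adversarial_loss_d(Dx, Dy, x, y, u, v, eps=1e-8):
    # Real images should be classified as real, swapped-code generations as fake.
    # u = Gy(f_x, e_y) and v = Gx(f_y, e_x) are detached so only the discriminators update.
    loss_x = -(torch.log(Dx(x) + eps) + torch.log(1 - Dx(v.detach()) + eps)).mean()
    loss_y = -(torch.log(Dy(y) + eps) + torch.log(1 - Dy(u.detach()) + eps)).mean()
    return loss_x + loss_y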

C. Objective Function

Taking all of the above loss functions into account, we jointly train the encoders, decoders, and discriminators. We formulate the full objective function as:

min_{E^e_·, E^f_·, G_x, G_y} max_{D_x, D_y} L_total(E^e_·, E^f_·, G_x, G_y, D_x, D_y) = λ_face L^face_recon + λ_eye L^eye_recon + L^f_recon + L_cc + L_adv,    (7)

where the hyper-parameters λ_face and λ_eye control the weights of the face self-reconstruction loss and the eye-area reconstruction loss.
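Putting the terms together, a sketch of the generator-side objective in Eq. (7) is shown below with the weights reported in Section IV-C; in practice the min-max is realized by alternating discriminator and generator updates.

def generator_objective(loss_face, loss_eye, loss_f, loss_cc, loss_adv_g,
                        lambda_face=10.0, lambda_eye=10.0):
    # Eq. (7) without L^e_recon, which the ablation study drops from the final model.
    return (lambda_face * loss_face + lambda_eye * loss_eye
            + loss_f + loss_cc + loss_adv_g)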

IV. EXPERIMENT

A. Datasets

CelebA. The CelebA dataset [26] consists of 202,599 aligned face images collected from 10,177 celebrities in the wild.


Fig. 4: Results of eyeglasses removal from face images on the LFW dataset. From top to bottom: real images, CycleGAN [21], UNIT [22], MUNIT [23], and our method.

Fig. 5: The residual images obtained by our method and three competitive methods, i.e., CycleGAN [21], UNIT [22], and MUNIT [23].

Each image in CelebA is annotated with 40 face attributes (e.g., with/without eyeglasses, smiling/not smiling). We split the CelebA dataset into one subset with glasses and another without glasses, based on the annotated attributes. Accordingly, we obtain 13,193 images with glasses and 189,406 images without glasses. All face images are center-cropped to 160 × 160 and resized to 224 × 224, with horizontal flipping applied with a probability of 0.5.
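A simple way to obtain the two subsets is to filter the public CelebA attribute file by its "Eyeglasses" column, as in the hedged sketch below; the file name and its layout follow the standard CelebA release and are assumptions here.

def split_celeba_by_eyeglasses(attr_path="list_attr_celeba.txt"):
    # Returns two lists of image file names: faces with and without eyeglasses.
    with open(attr_path) as f:
        lines = f.read().splitlines()
    attr_names = lines[1].split()           # line 0: image count, line 1: attribute names
    eye_idx = attr_names.index("Eyeglasses")
    with_glasses, without_glasses = [], []
    for row in lines[2:]:
        parts = row.split()
        name, values = parts[0], parts[1:]  # values are "1" (present) or "-1" (absent)
        (with_glasses if values[eye_idx] == "1" else without_glasses).append(name)
    return with_glasses, without_glasses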

LFW. The LFW dataset [33] contains 13,233 face images of 5,749 identities collected in uncontrolled surroundings. All images are aligned by deep funneling [34]. We choose 1,600 face images with glasses and 8,000 face images without glasses from the LFW dataset to verify the effectiveness of the proposed method. The preprocessing of the LFW dataset is the same as for the CelebA dataset.

B. Evaluation Metrics

Fréchet Inception Distance (FID). Most image generation tasks [35]–[38] adopt the FID metric [19], which measures the distance between generated images and real images through features extracted by an Inception network [39]. In this work, we apply the FID metric to measure the realism of the generated face images.

Learned Perceptual Image Patch Similarity (LPIPS). The LPIPS distance [20] is used to evaluate the diversity of generated images. Following several typical image-to-image translation approaches [23], [37], [40], [41], we employ the LPIPS distance to measure the diversity of the manipulated face images. Note that the proposed method focuses on the task of eyeglasses removal. Therefore, to fairly compare the diversity of the generated images from different methods, we only use the eye area to measure the LPIPS distance. We denote this modified LPIPS metric as eLPIPS.
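One way to compute the eLPIPS score is to crop the eye area of both images and feed the crops to the public lpips package, as sketched below; crop_eye_area() refers to the helper sketched in Section III-B, and the exact crop is an assumption.

import lpips

lpips_fn = lpips.LPIPS(net="alex")  # standard LPIPS backbone

def eLPIPS(img0, img1):
    # img0, img1: (B, 3, H, W) tensors scaled to [-1, 1]; only the eye area is compared.
    return lpips_fn(crop_eye_area(img0), crop_eye_area(img1)).mean().item()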

C. Implementation Details

Here we provide details about the network architecture of ERGAN. (1) The eye-area attribute encoder E^e_· consists of several convolutional layers and residual blocks [42] as well as a global average pooling layer and a fully connected layer. (2) The face appearance encoder E^f_· consists only of several convolutional layers and residual blocks. (3) The decoder G_· combines the outputs of E^e_· and E^f_· through four residual blocks followed by several convolutional layers and upsampling. In particular, we finally append a refine block, which consists of a convolutional layer and a fully connected layer. The refine block is used to concatenate the reconstructed image with the input image to produce a higher quality generated image. Additionally, all convolutional layers are followed by instance normalization [43]. Similar to MUNIT [23], each residual block contains two adaptive instance normalization layers [44]. (4) For the discriminator D, we apply the multi-scale discriminators (PatchGAN) proposed by Wang et al. [45]. We adopt Leaky ReLU with slope 0.2 in both the generator and the discriminator. We show the detailed network architectures of E^e_·, E^f_·, and G_· in Tab. I, Tab. II, and Tab. III. During training, we adopt the Adam optimizer [31] to optimize the generator and the discriminators.


TABLE I: The network architecture of the eye-area attribute encoder E^e_·.

Layer     | Parameters               | Output Size
Input     | -                        | 3 × 224 × 224
Conv1     | [3×3, 42]                | 42 × 224 × 224
Conv2     | [3×3, 42]                | 42 × 224 × 224
Conv3     | [3×3, 84]                | 84 × 224 × 224
ResBlocks | [3×3, 84; 3×3, 84] ×2    | 84 × 224 × 224
Conv4     | [3×3, 168]               | 168 × 224 × 224
ResBlocks | [3×3, 168; 3×3, 168] ×2  | 168 × 224 × 224
Conv5     | [3×3, 168]               | 168 × 224 × 224
Conv6     | [3×3, 168]               | 168 × 224 × 224
Conv7     | [1×1, 168]               | 168 × 224 × 224

TABLE II: The network architecture of the face appearance encoder E^f_·.

Layer     | Parameters               | Output Size
Input     | -                        | 3 × 224 × 224
Conv1     | [3×3, 42]                | 42 × 224 × 224
Conv2     | [3×3, 42]                | 42 × 224 × 224
Conv3     | [4×4, 84]                | 84 × 224 × 224
ResBlocks | [3×3, 84; 3×3, 84] ×2    | 84 × 224 × 224
Conv4     | [4×4, 168]               | 168 × 224 × 224
ResBlocks | [3×3, 168; 3×3, 168] ×2  | 168 × 224 × 224

In addition, we set the initial learning rate to 0.0001, the weight decay to 0.0005, and the exponential decay rates to (β1, β2) = (0, 0.999). Following several typical image-to-image translation works [22], [23], [41], we set the hyper-parameter λ_face = 10 for the face self-reconstruction loss. To encourage the model to focus on the eye region, we also set a large weight of λ_eye = 10 for the eye-area reconstruction loss. For the adversarial loss L_adv, we employ the LSGAN objective proposed by Mao et al. [46]. Moreover, the gradient penalty strategy [47] is also adopted to stabilize our model training procedure.
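For reference, the optimizer configuration stated above corresponds to a setup along the following lines, assuming the encoder, decoder, and discriminator modules from Section III are instantiated as PyTorch modules (names illustrative):

import itertools
import torch

gen_params = itertools.chain(Ee_x.parameters(), Ee_y.parameters(),
                             Ef.parameters(), Gx.parameters(), Gy.parameters())
dis_params = itertools.chain(Dx.parameters(), Dy.parameters())

gen_opt = torch.optim.Adam(gen_params, lr=1e-4, betas=(0.0, 0.999), weight_decay=5e-4)
dis_opt = torch.optim.Adam(dis_params, lr=1e-4, betas=(0.0, 0.999), weight_decay=5e-4)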

D. Competitive Methods

We compare the proposed method with several two-domain image-to-image translation frameworks, including CycleGAN [21], UNIT [22], and MUNIT [23]. To fairly evaluate our method against the three competitive methods, we train the proposed method and these methods with the same training data and test on both CelebA and LFW.

CycleGAN [21]. CycleGAN combines an adversarial loss and a cycle consistency loss to train two generative adversarial networks for image-to-image translation.

TABLE III: The network architecture of the decoder G_·.

Layer     | Parameters               | Output Size
Input     | -                        | 168 × 224 × 224
ResBlocks | [3×3, 168; 3×3, 168] ×4  | 168 × 224 × 224
Upsample  | -                        | 168 × 224 × 224
Conv1     | [3×3, 84]                | 84 × 224 × 224
Upsample  | -                        | 84 × 224 × 224
Conv2     | [3×3, 42]                | 42 × 224 × 224
Conv3     | [3×3, 42]                | 42 × 224 × 224
Conv4     | [3×3, 3]                 | 3 × 224 × 224
Conv5     | [3×3, 3]                 | 3 × 224 × 224
Conv6     | [1×1, 3]                 | 3 × 224 × 224

UNIT [22]. The UNIT model assumes that images from different domains can be mapped to the same latent representation, and the images generated by the model can be associated with the input images of the different domains through variational autoencoders [48].

MUNIT [23]. The MUNIT model is a multimodal unsupervised image-to-image translation framework. It assumes that images from different domains can be decomposed into a shared content space and a domain-specific style space. Therefore, an image can be translated to the target domain by recombining the content code of the image with a random style code from the target style space.

These competitive methods usually treat faces with different attributes as two or multiple domains. By contrast, we view the face images as one domain with two codes, i.e., a face appearance code and an eye-area attribute code. Besides, these three competitive methods manipulate the whole image to remove glasses, while our method only operates on the eye area and leaves the other regions unchanged.

E. Evaluations

Qualitative evaluation. We first qualitatively compare our method with the three generative approaches mentioned above. As shown in Fig. 3 and Fig. 4, we evaluate the quality of the generated results after eyeglasses removal from face images on the CelebA dataset and the LFW dataset, respectively. We re-implement the competitive methods, i.e., CycleGAN [21], UNIT [22], and MUNIT [23], using their open-source code. As shown in Fig. 3 and Fig. 4, CycleGAN, UNIT, and MUNIT still have limitations in the manipulation of eyeglasses removal. These competitive methods are prone to generating blurry or over-smoothed results and removing glasses insufficiently. In comparison, our generated images are natural and realistic, suggesting that the proposed method is effective in removing eyeglasses from face images. In particular, as illustrated in Fig. 5, the eyeglasses removal manipulation of the three competitive methods also changes the regions outside the eye area, which dramatically reduces the quality of the generated image. By contrast, our method only manipulates the eye region while keeping the other regions unchanged.


Fig. 6: Examples of our generated images by swapping eye-area attribute codes on the CelebA dataset.

Additionally, we show example-guided eye-area attribute manipulation results in Fig. 6. We observe that our method achieves high-quality results for both eyeglasses wearing and eyeglasses removal. We also find that the eye-area features are faithfully maintained. Furthermore, we perform linear interpolation between two eye-area attribute codes and generate the corresponding face images, as shown in Fig. 7. The linear interpolation results demonstrate that the eye-area attribute of the face images changes smoothly with the latent codes.

Quantitative evaluation. Here we report the results of quantitative evaluations based on the FID and eLPIPS metrics, aiming to measure the realism and the diversity of our generated face images. As shown in Tab. IV, our method obtains the minimum FID and the maximum eLPIPS on the CelebA dataset, suggesting that the face images generated by our method have the closest distribution to the real face images. We proceed to evaluate our method on the LFW dataset. As shown in Tab. V, our method also achieves the minimum FID and the maximum eLPIPS. The result indicates that the proposed method scales better than the three competitive methods. Moreover, the quantitative evaluation is consistent with the qualitative evaluation, and our method is significantly superior to the three competitive methods in both realism and diversity.

F. Ablation Study

To study the contribution of each component of the proposed method, we implement several variants of ERGAN and evaluate them on the CelebA dataset. Specifically, we evaluate six variants: (1) Ours w/o L^face_recon: our method without the face self-reconstruction loss term. (2) Ours w/o L^eye_recon: our method without the eye-area reconstruction loss term. (3) Ours w/o L^f_recon: our method without the face appearance reconstruction loss term. (4) Ours w/o L_cc: our method without the cycle consistency loss term. (5) Ours (half): our method without the dual learning scheme, implementing only the task of eyeglasses removal. (6) Ours w/ L^e_recon: our method with the eye-area attribute reconstruction loss term.

We show the qualitative results of the six variants in Fig. 8. Without using the face self-reconstruction loss, our method can still generate plausible results. However, some noise is introduced into the generated results.

TABLE IV: Quantitative results. Comparison of FID (lower is better) and eLPIPS (higher is better) to evaluate the realism and diversity of the generated face images and the real data on the CelebA dataset.

Method               | FID ↓ (Wearing) | FID ↓ (Removal) | eLPIPS ↑ (Wearing) | eLPIPS ↑ (Removal)
Real data            | 6.02  | 5.62  | 0.579 | 0.518
CycleGAN [21]        | 15.65 | 20.67 | -     | -
UNIT [22]            | 18.80 | 18.86 | 0.114 | 0.074
MUNIT [23]           | 29.42 | 18.85 | 0.283 | 0.144
Ours w/o L^face_recon | 14.12 | 16.60 | 0.005 | 0.002
Ours w/o L^eye_recon  | 12.79 | 16.50 | 0.384 | 0.162
Ours w/o L^f_recon    | 14.42 | 15.68 | 0.432 | 0.208
Ours w/o L_cc         | 12.46 | 16.27 | 0.426 | 0.234
Ours (half)           | -     | 15.87 | -     | 0.010
Ours w/ L^e_recon     | 12.96 | 16.33 | 0.435 | 0.220
Ours (full)           | 11.96 | 15.07 | 0.428 | 0.240

TABLE V: Quantitative results. Comparison of FID (lower is better) and eLPIPS (higher is better) to evaluate the realism and diversity of the generated face images and the real data on the LFW dataset.

Method        | FID ↓ (Wearing) | FID ↓ (Removal) | eLPIPS ↑ (Wearing) | eLPIPS ↑ (Removal)
Real data     | 24.17 | 6.96  | 0.499 | 0.472
CycleGAN [21] | 40.37 | 23.33 | -     | -
UNIT [22]     | 40.30 | 35.31 | 0.132 | 0.076
MUNIT [23]    | 51.74 | 42.83 | 0.201 | 0.141
Ours (full)   | 26.58 | 19.87 | 0.367 | 0.260

This suggests that the face self-reconstruction loss is a critical factor in keeping the regions outside the eye area unchanged. Without adopting the face appearance reconstruction loss or the cycle consistency loss, our method generates slightly distorted results, which do not preserve the consistency of the input image. Without the dual learning scheme, i.e., implementing only the task of eyeglasses removal, the model produces much lower quality results, demonstrating that the dual learning scheme is effective for feature learning.


Fig. 7: Linear interpolation. Image generation results with linear interpolation between two eye-area attribute codes.

Fig. 8: Qualitative ablation study. Visualization results of the six variants.

Fig. 9: Success and failure cases of eyeglasses removal with our method.

We also evaluate the variant of our method with the eye-area attribute reconstruction loss. The results show that adopting the eye-area attribute reconstruction loss does not improve the quality of the generated images. We suspect that this constraint is too strong, increasing the interference in such a small region as the eye area. To achieve the best performance of our method, we remove the eye-area attribute reconstruction loss.

We report the quantitative ablation study results of the six variants in Tab. IV. It can be observed that the full method obtains a lower FID score than the six variants, suggesting that the proposed method can generate more realistic images on both tasks. For diversity, the scores drop dramatically without the face self-reconstruction loss, which indicates that this constraint is the key component for generating diverse outputs. In comparison to the variant that only implements the single task of eyeglasses removal, our method achieves lower FID scores and higher eLPIPS scores.

TABLE VI: Quantitative face verification results. Comparison of the face verification accuracy on P1, P2, and P2r to evaluate the efficiency of eyeglasses removal. P2 contains face images with glasses selected from the original images.

Face images                                   | Accuracy
Original test images P1                       | 97.80%
Selected images P2                            | 96.35%
Selected images after eyeglasses removal P2r  | 96.67%

TABLE VII: Quantitative facial expression recognition results. Comparison of the facial expression recognition accuracy on S1, S2, and S2r to evaluate the efficiency of eyeglasses removal. S1 contains face images without glasses. S2 only includes face images with glasses. S2r contains the same images as S2 after glasses removal with our method.

Face images                          | Accuracy
Images without eyeglasses S1         | 80.61%
Images with eyeglasses S2            | 73.65%
Images after eyeglasses removal S2r  | 78.59%

This demonstrates that adopting the dual learning scheme generates higher quality images in terms of realism and diversity. We observe that the variant of our method with the eye-area attribute reconstruction loss performs slightly better on glasses wearing under the diversity metric. The proposed method achieves the best performance on all the other indicators, especially on eyeglasses removal under the realism metric.

G. Further Analysis and Discussions

Limitation. We notice that the proposed method tends to produce low-fidelity results when the input face image pair has large pose changes. As shown in Fig. 9, given two face images with a similar pose, our method generates high-quality results. By contrast, the quality of the generated images is degraded when the input images have significant pose changes (e.g., one frontal face and one profile face). There are two main reasons for this limitation. One is the limited amount of training data, due to which our method is unable to thoroughly learn the two representations (i.e., the face appearance code and the eye-area attribute code). The other reason is that a profile face usually contains noise such as background and hair, compromising the learning of the eye-area attribute code. As a result, the generated face can hardly simulate the target eye.


Fig. 10: Qualitative face verification results. Comparison of the Euclidean distance (lower is better) to evaluate the effectiveness of eyeglasses removal by our method. The left image is the input face image with eyeglasses. The middle images are face images without glasses that match the left image. The right image is the image generated by our method. The number on the line indicates the Euclidean distance between two images.

Fig. 11: Qualitative facial expression recognition results. The red word in the upper left corner of each image indicates the result of facial expression recognition. First row: expression recognition results of the original images. Second row: expression recognition results of the generated images after glasses removal.


Face verification. To further evaluate the performance of the proposed glasses removal method, we first test whether the method is beneficial to face verification. We apply the off-the-shelf face verification model FaceNet [2]. Three test subsets of the LFW dataset are taken into consideration, i.e., the original test images P1, the selected images P2 with eyeglasses, and the selected images P2r with glasses removed by the proposed method. P1 is the standard test set of the LFW dataset. Due to the limited number of face images with eyeglasses, we further sample a test set P2 from the original test set, which contains 736 matched pairs (one with glasses and another without glasses) and 3,864 mismatched pairs (both with glasses). The set P2r contains the same images as P2, but manipulated with glasses removal by our method. We report the face verification results in Tab. VI. Comparing the face verification results obtained on P1 and P2, we observe that the accuracy obtained on the test set P2 is lower than that obtained on P1. This result shows that the occlusion of glasses compromises the accuracy of the face verification model. Moreover, comparing the results between P2 and P2r, we find that the performance of face verification is slightly improved after eyeglasses removal by the proposed method. One possible reason that the improvement is slight is that the matched pairs (one with glasses and another without glasses) in the LFW dataset are limited. Furthermore, we show the qualitative face verification results in Fig. 10. They show that the Euclidean distance between the matched images and the images generated by our method decreases.
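The verification protocol can be summarized by the sketch below: both face crops are embedded with an off-the-shelf FaceNet-style model [2] and the pair is accepted when the Euclidean distance falls below a threshold; embed() and the threshold value are hypothetical placeholders, not the authors' exact setup.

import torch

def verify_pair(img_a, img_b, embed, threshold=1.1):
    # embed() maps a (1, 3, H, W) face tensor to an L2-normalized embedding.
    ea, eb = embed(img_a), embed(img_b)
    dist = torch.dist(ea, eb, p=2).item()   # Euclidean distance, as reported in Fig. 10
    return dist, dist < threshold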

Facial expression recognition. Besides, we demonstrate that the proposed method benefits facial expression recognition. To evaluate our method, we adopt a facial expression recognition algorithm¹, which classifies the emotion into six categories (i.e., angry, scared, happy, sad, surprised, neutral). We perform experiments on the CelebA dataset. Notably, the CelebA dataset only labels the attribute of smiling or not. Accordingly, we regard smiling as representing the emotion of happiness, and not smiling as representing the other emotions. During the test, three sets are taken into consideration. S1 denotes the set containing face images with the attributes smiling and no-glasses, and S2 denotes the set containing face images with the attributes smiling and glasses. S2r includes the same samples as S2, but with eyeglasses removed by our method.

We show the qualitative results of facial expression recognition in Fig. 11. We observe that the face images with glasses in the first row are all misidentified as other expressions. By contrast, after eyeglasses removal by our method, all images in the second row are recognized as the happy expression. We present the quantitative results in Tab. VII. Comparing the facial expression recognition results between S1 and S2, we find that the accuracy drops by 6.9%. This result indicates that facial expression recognition is affected by the occlusion of glasses. Besides, comparing the results between S2 and S2r, the accuracy of expression recognition improves by 4.9%. The performance is close to the result on S1, which demonstrates the benefit of eyeglasses removal for facial expression recognition.

V. CONCLUSION

In this paper, we have proposed a GAN-based framework for eyeglasses removal in the wild. We adopt a dual learning scheme to simultaneously learn two inverse manipulations (removing glasses and wearing glasses), which enforces the model to generate high-quality results. Extensive qualitative and quantitative experiments demonstrate that our method outperforms previous state-of-the-art methods in terms of realism and diversity. Furthermore, we remark that the proposed method has the potential to serve as a pre-processing tool for other face-related tasks, e.g., face verification and facial expression recognition.

¹ https://github.com/opconty/keras-shufflenetV2



REFERENCES

[1] L. Zebrowitz, Reading Faces: Window to the Soul? Routledge, 2018.
[2] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in CVPR, 2015.
[3] N. McLaughlin, J. Ming, and D. Crookes, "Largest matching areas for illumination and occlusion robust face recognition," IEEE Transactions on Cybernetics, vol. 47, no. 3, pp. 796–808, 2016.
[4] Y. Guo, L. Jiao, S. Wang, S. Wang, and F. Liu, "Fuzzy sparse autoencoder framework for single image per person face recognition," IEEE Transactions on Cybernetics, vol. 48, no. 8, pp. 2402–2415, 2017.
[5] Y. Wang, Y. Y. Tang, L. Li, and H. Chen, "Modal regression-based atomic representation for robust face recognition and reconstruction," IEEE Transactions on Cybernetics, 2019.
[6] P. Liu, S. Han, Z. Meng, and Y. Tong, "Facial expression recognition via a boosted deep belief network," in CVPR, 2014.
[7] P. Liu, J. T. Zhou, I. W.-H. Tsang, Z. Meng, S. Han, and Y. Tong, "Feature disentangling machine: a novel approach of feature selection and disentangling in facial expression analysis," in ECCV, 2014.
[8] H. Yang, U. Ciftci, and L. Yin, "Facial expression recognition by de-expression residue learning," in CVPR, 2018.
[9] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, "Random erasing data augmentation," arXiv preprint arXiv:1708.04896, 2017.
[10] Z. Zheng, L. Zheng, and Y. Yang, "Pedestrian alignment network for large-scale person re-identification," IEEE Transactions on Circuits and Systems for Video Technology, 2018.
[11] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, "Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline)," in ECCV, 2018.
[12] Y. Sun, Q. Xu, Y. Li, C. Zhang, Y. Li, S. Wang, and J. Sun, "Perceive where to focus: Learning visibility-aware part-level features for partial person re-identification," in CVPR, 2019.
[13] C. Wu, C. Liu, H.-Y. Shum, Y.-Q. Xu, and Z. Zhang, "Automatic eyeglasses removal from face images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 3, pp. 322–336, 2004.
[14] J.-S. Park, Y. H. Oh, S. C. Ahn, and S.-W. Lee, "Glasses removal from facial image using recursive error compensation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 805–811, 2005.
[15] C. Du and G. Su, "Eyeglasses removal from facial images," Pattern Recognition Letters, vol. 26, no. 14, pp. 2215–2220, 2005.
[16] D. Yi and S. Z. Li, "Learning sparse feature for eyeglasses problem in face recognition," in FG. IEEE, 2011.
[17] W. K. Wong and H. Zhao, "Eyeglasses removal of thermal image based on visible information," Information Fusion, vol. 14, no. 2, pp. 163–176, 2013.
[18] K. Pearson, "LIII. On lines and planes of closest fit to systems of points in space," The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 2, no. 11, pp. 559–572, 1901.
[19] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in NIPS, 2017.
[20] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in CVPR, 2018.
[21] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in ICCV, 2017.
[22] M.-Y. Liu, T. Breuel, and J. Kautz, "Unsupervised image-to-image translation networks," in NIPS, 2017.
[23] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, "Multimodal unsupervised image-to-image translation," in ECCV, 2018.
[24] M. De Smet, R. Fransens, and L. Van Gool, "A generalized EM approach for 3D model based face recognition under occlusions," in CVPR. IEEE, 2006.
[25] S. Mika, B. Scholkopf, A. J. Smola, K.-R. Muller, M. Scholz, and G. Ratsch, "Kernel PCA and de-noising in feature spaces," in Advances in Neural Information Processing Systems, 1999, pp. 536–542.
[26] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in ICCV, 2015.
[27] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in NIPS, 2014.
[28] W. Shen and R. Liu, "Learning residual images for face attribute manipulation," in CVPR, 2017.
[29] G. Zhang, M. Kan, S. Shan, and X. Chen, "Generative adversarial network with spatial attention for face attribute editing," in ECCV, 2018.
[30] M. Liu, Y. Ding, M. Xia, X. Liu, E. Ding, W. Zuo, and S. Wen, "STGAN: A unified selective transfer network for arbitrary image attribute editing," in CVPR, 2019.
[31] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[32] Z. Yi, H. Zhang, P. Tan, and M. Gong, "DualGAN: Unsupervised dual learning for image-to-image translation," in ICCV, 2017.
[33] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," University of Massachusetts, Amherst, Tech. Rep. 07-49, 2007.
[34] G. B. Huang, M. Mattar, H. Lee, and E. Learned-Miller, "Learning to align from scratch," in NIPS, 2012.
[35] A. Bulat, J. Yang, and G. Tzimiropoulos, "To learn image super-resolution, use a GAN to learn how to do image degradation first," in ECCV, 2018.
[36] Z. Zheng, X. Yang, Z. Yu, L. Zheng, Y. Yang, and J. Kautz, "Joint discriminative and generative learning for person re-identification," 2019.
[37] Q. Mao, H.-Y. Lee, H.-Y. Tseng, S. Ma, and M.-H. Yang, "Mode seeking generative adversarial networks for diverse image synthesis," arXiv preprint arXiv:1903.05628, 2019.
[38] A. Razavi, A. v. d. Oord, and O. Vinyals, "Generating diverse high-fidelity images with VQ-VAE-2," arXiv preprint arXiv:1906.00446, 2019.
[39] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015.
[40] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman, "Toward multimodal image-to-image translation," in NIPS, 2017.
[41] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang, "Diverse image-to-image translation via disentangled representations," in ECCV, 2018.
[42] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[43] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis," in CVPR, 2017.
[44] X. Huang and S. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in CVPR, 2017.
[45] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, "High-resolution image synthesis and semantic manipulation with conditional GANs," in CVPR, 2018.
[46] X. Mao, Q. Li, H. Xie, R. Lau, Z. Wang, and S. Smolley, "Least squares generative adversarial networks," in ICCV, 2017.
[47] L. Mescheder, A. Geiger, and S. Nowozin, "Which training methods for GANs do actually converge?" in ICML, 2018.
[48] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," CoRR, vol. abs/1312.6114, 2014.