Self Supervised Learning in Facial Attributes in Videos

Huaer Li (u6364576)

A report submitted for the course COMP3770 CS PROJECT

Supervised by: Fatemeh Saleh, Stephen Gould
The Australian National University

October 2019

© Huaer Li 2019

Except where otherwise indicated, this report is my own original work.

Huaer Li
25 October 2019

Acknowledgments

I would like to express my deep gratitude to my supervisors Fatemeh Saleh and Professor Stephen Gould for their patient guidance, continuous support and useful advice throughout this year of research work. I would also like to thank the Robotic Vision group at ANU for their great advice and assistance in introducing state-of-the-art works and giving constructive suggestions for this project. I also wish to acknowledge the help provided by Mr Rong Xin for his support and suggestions in implementing the experiments of this project.

I would also like to extend my thanks to the CECS IT service department for their help in offering me the server to run the program.

Lastly, I would like to express my great appreciation to my parents and friends for their support and encouragement throughout this year.

Abstract

Increasing attention has been drawn to the research field of computer vision and deep neural networks, and a demand for large-scale labelled datasets has emerged in pursuit of better performance when learning visual features using convolutional neural networks. To avoid the great expense of labelling large-scale datasets, more and more self-supervised learning methods have been proposed to learn visual representations from massive amounts of unlabelled data. Most prior work focused on self-supervised tasks that learn general visual features from input images; this work focuses on investigating self-supervised learning tasks that learn facial attributes which can be transferred to downstream tasks related to faces.

This paper provides a review of CNN-based self-supervised approaches and evaluates the results of transferring learning from chosen pretext tasks. We conducted experiments by transferring knowledge learnt in various self-supervised and fully supervised learning tasks; this learnt knowledge is used as weight initialisation for the chosen downstream task, facial expression classification. Different weight initialisations are implemented, including random weight initialisation, fully supervised learning pre-trained with ImageNet and CK+, and self-supervised learning with image colourisation and FAb-Net. Quantitative performance comparisons of these methods are discussed. Our results show that using FAb-Net as the pretext task can effectively learn visual features of faces from images; they also indicate a positive correlation between the amount of training data used in the pretext task for self-supervised learning and the performance of classification on the downstream task. Limitations and future work of this project are also discussed.

Contents

Acknowledgments

Abstract

1 Introduction
  1.1 Contribution
  1.2 Report Outline

2 Background and Related Work
  2.1 Background
    2.1.1 Common Convolutional Neural Network Architectures
    2.1.2 Transfer Learning
    2.1.3 Self-supervised Learning
  2.2 Related work
    2.2.1 Self-supervised Learning of Facial Attributes
    2.2.2 Facial Expression Classification
  2.3 Summary

3 Experimental Methodology
  3.1 Methodologies
    3.1.1 Supervised Learning
    3.1.2 Self-supervised Learning
    3.1.3 Evaluation on Facial Classification

4 Experiment Setup
  4.1 Dataset
    4.1.1 ImageNet and CK+
    4.1.2 VoxCeleb 1
    4.1.3 CelebA
    4.1.4 EmotioNet
  4.2 Software platform
  4.3 Hardware platform
  4.4 Data Preparation

5 Results and Evaluation
  5.1 Baseline
  5.2 Result
    5.2.1 Supervised versus Self-supervised Learning
    5.2.2 Training Data for Pretext Task
    5.2.3 Comparing with State-of-Art
  5.3 Discussion

6 Conclusion
  6.1 Limitation
  6.2 Future Work

Bibliography

List of Figures

2.1 A General CNN Architecture
2.2 ResNet Building Block with Skip Connection
2.3 Architecture Proposed by Self-supervised Learning of a Facial Attribute Embedding from Video
2.4 Architecture Proposed by Facial Expression Recognition via Deep Learning
3.1 Architecture for ResNet-18 Trained on ImageNet [Napoletano et al., 2018]
3.2 CNN Architecture Used in Image Colourisation
3.3 FAb-Net Architectures
4.1 Comparison between Images before and after Cropping using Face++
5.1 Accuracy of Facial Expression Classification using Different Weight Initialisation Methods
5.2 Accuracy of Facial Expression Classification using Image Colourisation as Pretext Task with Different Data Sizes for Training
5.3 Accuracy of Facial Expression Classification using FAb-Net as Pretext Task with Different Data Sizes for Training

List of Tables

4.1 Datasets Used in the Evaluation
4.2 Distribution of Different Classes in the EmotioNet Dataset
4.3 Python Packages Used in Our Evaluation for Different Tasks
4.4 Hardware Implementation Details
5.1 Comparison between Self-supervised Models and State-of-Art Work

Chapter 1

Introduction

Deep neural networks have been favoured by researchers in the field of computer vision due to their exceptional ability to learn and capture an enormous number of visual features. Owing to the outstanding performance such architectures provide, deep neural networks have been applied in various fields such as object detection, face detection and recognition, and semantic segmentation. However, the performance of deep neural networks depends heavily not only on their architectures, but also on the amount of training data.

This gives rise to an immense demand for large-scale datasets, such as ImageNet [Deng et al., 2009], in different areas of study. There are two possible solutions to this rising problem: 1) providing more large-scale human-labelled datasets, or 2) proposing deep neural networks that do not heavily rely on human-labelled datasets.

Unfortunately, datasets of enormous scale are highly expensive and time-consuming to collect and produce. Over the past few years of development in computer vision and deep convolutional neural networks, some large-scale datasets have been widely used. For instance, ImageNet, a collection of over 1.3 million human-labelled images across 1000 classes with one label associated with each image, is an excellent image dataset for training 2D convolutional networks. However, it is not feasible for every specific field of research in computer vision to obtain datasets of similar scale. Therefore, researchers attempt to seek better solutions through alternative ways.

With the limited number of large-scale datasets, transfer learning is a reasonably good solution as it is more robust and less dependent on the size of datasets. Given the lack of large-scale datasets in certain areas of study, pre-trained models provide a great starting point with previously learnt features to avoid the problem of overfitting when training on relatively small datasets [Weiss et al., 2016][Pan and Yang, 2010]. However, with the development of deeper neural networks, more and more data is required to train such models; even a dataset of a similar scale to ImageNet is not sufficient. The limited types of labels provided by datasets also restrict various research possibilities. At the same time, a vast amount of unlabelled images exists in real life, and the potential of such data is huge. Therefore, it is a significant research challenge to find an approach that is not limited by the amount of labelled data and can successfully learn from non-human-annotated data.

As a result of this motivation, a significant amount of attention has been drawn to self-supervised learning approaches that benefit from large-scale datasets without human annotation [Jing and Tian, 2019]. In supervised learning, each training sample is human-annotated with a label. To produce such labels in self-supervised learning without the great expense of manual labour, various pretext tasks have been introduced to formulate a label for each training sample. Most pretext tasks are designed to relieve the burden of annotating large datasets and are solved to learn visual features using automatically generated pseudo-labels. By training to solve such pretext tasks, models learn representations that become useful in solving other specific downstream tasks, such as facial expression classification.

Learning on faces has long been a vital research area in computer vision, yet it is extremely labour-expensive to obtain a large-scale annotated dataset for face attributes, as one image may need up to 50 labels (e.g., for facial landmark detection). Even with the help of face detection algorithms, the expense of producing a dataset that fulfils all research interests is overly high. Large-scale datasets in the field of facial expression classification are also extremely lacking, as most well-annotated facial expression datasets are extremely small. Research in facial expression classification has hence been limited.

Therefore, knowing the great advantage of using cost-effective datasets, more and more research has started to explore the potential of self-supervised learning, in the hope of achieving results comparable to fully supervised learning.

With these motivations, in this project we explore various self-supervised and fully supervised learning approaches to transfer learning of facial attributes. This work aims to evaluate their performance and effectiveness by comparing their resulting accuracy after transfer to a target task, facial expression classification, and how close the performance of the proposed self-supervised learning methods comes to state-of-the-art results using fully supervised learning. Experiments are conducted in two stages: first, models are trained with a chosen pretext task, which may be fully supervised or self-supervised; then the weights learnt in the first stage are transferred to the facial expression classification task as different weight initialisations, including:

• Random initial weights
• Pretrained with ImageNet
• Pretrained with CK+
• Self-supervised learning with image colourisation
• Self-supervised learning with FAb-Net

By comparing the performance of transferring different pretext tasks to facial expression learning, statistical analyses of the results are performed and the outcomes of the experiments are thoroughly discussed.

1.1 Contribution

The main contributions of this project are listed below:

• Exploration of different pretext tasks and evaluation of their performance on a face-related target task

• A suitable data pre-processing method that is able to improve the performance of the downstream task

• An observed relation between the amount of training data used for pretext tasks and the performance of classification

1.2 Report Outline

This report is organised into six chapters.

• Background and related work of this project are presented in Chapter 2; some state-of-the-art works are mentioned.
• Chapter 3 outlines the various self-supervised and fully supervised learning methods we implemented, together with their detailed methodologies and architectures.
• Chapter 4 gives a detailed overview of the experimental setup and datasets used in this work; data pre-processing methods and the subsets of data used are also specified.
• Chapter 5 explains the results and discusses the findings from the experiments. It also compares our work to some state-of-the-art results.
• Chapter 6 concludes this report by summarising the contributions, findings, limitations and future work.

Chapter 2

Background and Related Work

This chapter briefly introduces the necessary background knowledge for this work. Since we explore a variety of deep learning approaches to transfer learning of facial attributes, this chapter covers the work done in convolutional neural networks, self-supervised learning and facial expression classification, as this work sits at the intersection of these three areas.

Section 2.1 gives background material on the architecture of CNNs and an overview of self-supervised learning and transfer learning; related work in self-supervised learning on faces is given in Section 2.2. Section 2.2 also discusses some state-of-the-art supervised learning methods for facial expression classification and recognition.

2.1 Background

2.1.1 Common Convolutional Neural Network Architectures

Despite the different forms of supervision they take towards various vision-related problems, these approaches share a similar network architecture, the convolutional neural network (CNN). It is a type of neural network commonly employed in the field of computer vision due to the great feature extraction and discrimination ability it possesses [LeCun et al., 2015]. Intense interest was drawn to CNNs in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where some CNN architectures outperformed other contestants and showed state-of-the-art performance on image classification tasks [Krizhevsky et al., 2017][Simonyan and Zisserman, 2014][He et al., 2015]. Since then, more and more researchers have discovered their great learning abilities in vision-related tasks. One common feature across different CNNs is that a typical architecture consists of a varying number of convolution and pooling layers, with one or more fully connected layers attached at the end. These components and the final arrangement of layers are vital to the performance of a CNN; hence, the function of each layer is briefly discussed in this section.

Figure 2.1: A General CNN Architecture

Convolutional Layer

The main purpose of the convolutional layer is to extract features from each input image. Each layer is composed of several kernels, which are represented by neurons, and these neurons are arranged into a feature map; each neuron is also associated with a small section of the image, known as its receptive field [Aloysius and Geetha, 2017]. By learning features within a given receptive field, a CNN is able to preserve the spatial relationship between pixels. After dividing each input image into several receptive fields, the linear operation of multiplying a set of weights with the elements in each receptive field is performed. In other words, a feature map is generated by systematically applying a kernel with the same weights across the image; however, different feature maps can exist within the same convolutional layer. By having different feature maps associated with different weights, a CNN is able to extract multiple features at each location [LeCun et al., 2015]. Once a feature map is created, it is passed through a non-linear activation function. Formally, the convolution operation can be expressed as:

$$Y_k = f(W_k \ast i) \tag{2.1}$$

where $Y_k$ denotes the $k$th feature map and $W_k$ represents the weights associated with the $k$th feature map; the input image is represented by $i$, and the $\ast$ operation denotes the dot product between the elements in each receptive field and the weights; $f(\cdot)$ is the non-linear activation function, such as ReLU.

Generally, the chosen activation function affects network training time, which further affects the performance of large CNNs on large datasets [Rawat and Wang, 2017]. Typical activation functions include sigmoid, tanh and ReLU; here we only briefly explain ReLU, the most commonly used activation function nowadays. ReLU stands for "Rectified Linear Unit" and is usually placed between two convolutional layers [Yaseen and Saud, 2018]. ReLU keeps only the positive part of the activation and reduces the negative part to zero, which allows for faster training. Only activated features, in this case the positive parts, are passed to the next layer.
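To make Equation 2.1 concrete, a minimal PyTorch sketch of a single convolutional layer followed by ReLU is given below; the layer sizes and the dummy input are illustrative choices, not the settings used in our experiments.

```python
import torch
import torch.nn as nn

# One convolutional layer with 8 kernels, each sliding a 3x3 receptive field
# over a 3-channel input, followed by ReLU: Y_k = f(W_k * i).
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
relu = nn.ReLU()

image = torch.randn(1, 3, 224, 224)  # a dummy RGB input image
feature_maps = relu(conv(image))     # 8 feature maps, one per kernel
print(feature_maps.shape)            # torch.Size([1, 8, 224, 224])
```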

Pooling Layer

Pooling layers are commonly arranged between successive convolutional layers in a ConvNet architecture. Their function is to reduce the spatial size of the feature maps and the number of parameters in the network, and hence control the problem of overfitting [Aloysius and Geetha, 2017]. In previous works, average pooling was commonly used to pass the average of all input values in the neighbourhood of a receptive field to the next layer. However, most recent works prefer max-pooling, which summarises similar information in adjacent receptive fields by outputting the maximum value within each local region to the next layer.

Fully Connected Layer

Fully connected layers are generally added after the stacks of convolutional and pooling layers. The aim of such a layer is to interpret the abstract features learnt by previous layers and perform classification. Unlike convolution and pooling, which are local operations, a fully connected layer performs a global operation; it has full connections to all activations in the preceding layer [Khan et al., 2019]. It is worth noting that fully connected layers have an identical functional form to convolutional layers, since both compute dot products. The only difference is that neurons in fully connected layers are connected to all activations in preceding layers, whereas neurons in convolutional layers are connected to inputs in local regions. To produce class probabilities for classification problems, softmax is generally used.
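Putting these three layer types together, a minimal CNN in the spirit of Figure 2.1 can be sketched as follows; the channel sizes and depth are illustrative and do not correspond to any network used in our experiments.

```python
import torch
import torch.nn as nn

# A toy CNN: stacked convolution + pooling stages followed by a fully
# connected classifier head with softmax, as in Figure 2.1.
class TinyCNN(nn.Module):
    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # halves spatial size, keeping the max of each 2x2 region
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)  # global, fully connected operation

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return torch.softmax(self.classifier(x), dim=1)  # class probabilities

model = TinyCNN()
probs = model(torch.randn(1, 3, 224, 224))
```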

ResNet

Although various CNN models, such as VGG, have demonstrated compelling performance on various image classification tasks, we chose to use ResNet due to its low complexity, greater depth, lower training time, ability to preserve fine-grained features and outstanding results.

Previous work on CNNs has demonstrated that deeper neural networks have the potential to obtain better performance on complex inputs, such as images; however, experiments showed the opposite. As the depth of proposed network architectures continued to increase, it was found that neurons in earlier layers learned very slowly due to their negligibly small gradients. This problem is known as the vanishing gradient problem.

[He et al., 2015] introduced the idea of skip connections with their proposed model, ResNet, in order to reduce the impact of this problem. The core idea of skip connections is an identity shortcut connection that skips one or more layers; this allows earlier features to be sent to the next convolutional block, which effectively mitigates the vanishing gradient problem. Instead of widening the model, He et al. chose to increase its depth, which introduces fewer extra parameters to the network. This low complexity is also achieved by applying residual learning to every three stacked layers and learning the residual function with reference to the layer inputs. Due to its superior performance, ResNet is favoured for various computer vision tasks.

Figure 2.2: ResNet Building Block with Skip Connection
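A minimal sketch of the basic residual block in Figure 2.2 is shown below; it assumes equal input and output channel counts, so the identity shortcut needs no projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Basic residual block: the input x "skips" the two convolutions and is
# added back before the final activation, i.e. the block learns F(x) + x.
class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x                           # the shortcut connection
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + identity)          # residual learning

block = BasicBlock(64)
y = block(torch.randn(1, 64, 56, 56))
```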

2.1.2 Transfer Learning

Transfer learning is a learning approach that aims to improve learning performance in one domain by transferring information previously learnt in a related domain. It is commonly used in machine learning to combat the problem of insufficient training data. In transfer learning, information is first learnt by training on a source domain with sufficient data, and the learnt knowledge is then transferred to the target domain without the need to train from scratch. Thus, the demand for training data and training time in the target domain is effectively reduced.
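In PyTorch this amounts to only a few lines; the sketch below, which mirrors the setup used later in this report, loads a ResNet-18 pre-trained on a source domain (ImageNet) and replaces the classifier head for a 12-class target task.

```python
import torch.nn as nn
from torchvision import models

# Transfer learning sketch: start from weights learnt on the source domain
# and re-purpose the network for the target domain.
model = models.resnet18(pretrained=True)        # source-domain knowledge
model.fc = nn.Linear(model.fc.in_features, 12)  # new head for the target task
# The model can now be fine-tuned on the (smaller) target-domain dataset.
```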

2.1.3 Self-supervised Learning

Self-supervised learning is a general learning approach that relies on performing various pretext tasks on unlabelled data with automatically generated labels. A pseudo-label is automatically generated from the dataset, and the data with its pseudo-label is then provided to a CNN in order to solve a chosen pretext task. In the process of accomplishing the given pretext task, the network learns visual features and image representations. Transfer learning is then performed to the target task with the visual features captured by the pretext task. Self-supervised learning has been used in a wide range of vision-related topics [Zhai et al., 2019]. In this paper, we employ some self-supervised learning techniques that are milestones in the development of self-supervised learning.

Doersch et al. [Doersch et al., 2015] proposed a pretext task that classifies the relative position between two randomly selected non-overlapping image patches; follow-up papers proposed similar pretext tasks, such as solving jigsaw puzzles [Noroozi and Favaro, 2016]. The jigsaw puzzle is a context-based pretext task. The aim of this task is to find the relative spatial positions of nine random image patches sampled from the original data. The pseudo-label for each patch is its relative location, and all nine patches are sent through the same network. In the process of predicting the spatial positions, visual representations are forced to be learnt. To avoid shortcuts that rely on low-level image statistics, such as edge alignment, the proposed network delays the computation across different tiles; the network focuses on learning features within each tile itself, and only at the end does it determine the relative arrangement using the learnt representations of the nine tiles. It also samples the patches with random gaps between them to further avoid classification via edge alignment. However, unlike finding relative positions, all tiles are observed at the same time in the jigsaw puzzle problem.

Another efficient pretext task is predicting the angle of rotation of a given input image [Gidaris et al., 2018]. Four copies of an input image rotated by (0°, 90°, 180°, 270°) are fed into a single CNN architecture, and the pretext task classifies the given input into one of the four categories. This simple pretext task has been shown to achieve comparable results.
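A sketch of the pseudo-label generation for this rotation task is given below; the batch size and image size are placeholders.

```python
import torch

# For each image, create four copies rotated by 0, 90, 180 and 270 degrees;
# the rotation index serves as the automatically generated pseudo-label.
def rotation_batch(images):  # images: (N, C, H, W)
    rotated = [torch.rot90(images, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(images.size(0))  # one label per copy
    return torch.cat(rotated, dim=0), labels

x, y = rotation_batch(torch.randn(8, 3, 224, 224))  # 32 inputs, 32 pseudo-labels
```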

Image colourisation is a generation-based pretext task that operates at the image level. This pretext task first converts an RGB image into a gray-scale image and uses the gray-scale image as the input to the network, in order to train a network that can predict pixel colour from a monochrome input. The RGB image acts as the pseudo-label, and the goal of the pretext network is to minimise the loss over colour variations at each pixel. Visual features are thoroughly learnt in this process. [Zhang et al., 2016] introduced the idea of using a loss tailored to the colourisation problem instead of a standard loss such as the Euclidean error. They predict a distribution of possible colours for each pixel and use the annealed mean of this distribution to produce the final output. This enables the colourisation to be more vibrant and realistic compared to previous colourisation approaches.

All these pretext tasks can become extremely useful in downstream tasks, including object detection and image classification, due to the great learning ability of self-supervised learning.

2.2 Related work

2.2.1 Self-supervised Learning of Facial Attributes

As self-supervised learning gradually draws more attention in computer vision, research on facial attributes has also attempted to integrate self-supervised learning methods. A recent work [Sharma et al., 2019] proposed a Siamese network trained with self-supervised learning. This network takes a pair of images as input; the face images from a video cluster are ranked according to their Euclidean distance to the source frame, and the most similar and dissimilar pairs are treated as the positive and negative pairs respectively. This work is useful in the study of video face clustering.

Y. Li et al. proposed a work focused on facial action units [Li et al., 2019]. Since facial action units are movements of muscles, they treated the movements as transformations between two face images in different frames. Such transformations can be greatly influenced by head movements; hence, this paper proposed a Twin-Cycle Autoencoder to disentangle the movements of facial action units from head movements. This enables the network to learn representations that are discriminative for AU detection.

Another work was proposed by Wiles et al. [Wiles et al., 2018]: a pretext task that specifically targets learning facial attributes (FAb-Net). It proposed two different architectures, using a single source frame and multiple source frames. Both architectures intend to predict a frame from the source frame(s) and a target frame; a mapping that predicts offsets for each pixel location in the target frame is learnt. This mapping can be used to obtain the predicted frame by sampling from the source frame given these offsets. The multiple-source-frame architecture follows a similar approach to the single-source-frame architecture, except that an additional confidence heatmap is also predicted, which denotes the confidence level at each pixel location of the predicted frame. The architectures are illustrated in Figure 2.3.

Figure 2.3: Architecture Proposed by Self-supervised Learning of a Facial Attribute Embedding from Video

2.2.2 Facial Expression Classification

Facial expression classification or recognition is one of the most vital tasks in human communication, since expressions capture humans' feelings, opinions and nature. Due to its importance in human communication, great interest has been drawn to research on facial expression classification using computer vision. While the majority of traditional approaches use shallow learning on manually hand-crafted features (e.g. MLP (Multi-layer Perceptron), SVM (Support Vector Machines) and LBP (Local Binary Patterns)), the study of facial expression classification has gradually shifted to deep learning due to the outstanding performance of CNNs on other computer vision tasks [Vyas et al., 2019]. Given the purpose of our paper, only approaches using supervised learning and CNNs are discussed.

There are two common categories of classification labels in facial expression recognition: 1) Action Units (AU), which describe the basic movements of face muscles, and 2) labels that describe human emotions such as happy, sad and angry. Depending on the dataset used in particular papers, different sets of labels are used.

R. Kumar et al. [Kumar et al., 2017] proposed a nine-layer convolutional neural network to classify seven different expression labels on the FER-2013 dataset. The proposed model is not only able to classify the dominant emotion, but is also capable of analysing the percentages of all the presented emotions, with an accuracy of around 90%.

Figure 2.4: Architecture Proposed by Facial Expression Recognition via Deep Learning

Another deep learning approach was introduced by A. Fathallah et al. [Fathallah et al., 2017]. In this paper, they first fine-tuned a pretrained VGG model using a mixture of datasets, including the CK+, KDEF, RaFD and MUG datasets; this first model was then used as the fine-tuning model for the final model. The proposed architecture is illustrated in Figure 2.4. This approach performed exceptionally well on the CK+ dataset, with an accuracy of 99.33%.

On the other hand, CNNs have also been applied to classification problems on AUs. In [Al-Darraji et al., 2017], the authors proposed a CNN architecture with parameters optimised by a genetic algorithm. This optimised CNN architecture was able to classify the 23 most relevant AUs and achieved an average accuracy of 90.85%.

2.3 Summary

In this chapter, we gave a brief overview of CNN architectures and the broad application of CNNs. The power of self-supervised learning to learn visual representations was also shown; several classical pretext tasks, including the jigsaw puzzle problem, image rotation and image colourisation, were introduced. We also took a look at some self-supervised learning approaches toward facial expression problems, and compared them with some classic fully supervised learning architectures proposed for this problem.

With knowledge of the state-of-the-art approaches in the field of self-supervised learning on faces, we next take a look at the experimental setups and datasets used in our work.

Chapter 3

Experimental Methodology

3.1 Methodologies

The aim of this work is to train several networks with different pretext tasks (self-supervised learning) and fine-tuning tasks (fully supervised learning). These networks are expected to learn visual representations related to facial attributes in self-supervised or fully supervised ways. The learnt representations are evaluated by their performance on a chosen target task, facial expression classification, as our aim is to evaluate which pretext task is able to boost performance on the downstream task. To achieve this aim, we trained several networks employing different datasets and architectures with various parameters. These tasks are further described in the following subsections.

3.1.1 Supervised Learning

Two models were pre-trained with fully supervised learning, i.e. with each image attached to a human-annotated label. Both networks used ResNet-18; the ImageNet pre-trained model was loaded directly from the PyTorch package, whereas the model pre-trained on CK+ was trained by ourselves.

CK+

To train a ResNet-18 model with CK+, we resized the inputs to 224 × 224 and normalised the input images. Since the CK+ dataset is quite small, we trained the network with a learning rate of 0.05 for 100 epochs. The loss function used in this pre-trained model was the cross-entropy loss, a loss function used to learn the probability distribution of the data; it predicts the classification results with an associated confidence. The cross-entropy loss is denoted as

$$-\sum_{c=1}^{M} y_{o,c}\,\log(p_{o,c}) \tag{3.1}$$

where $M$ denotes the total number of classes, $y_{o,c}$ indicates whether observation $o$ is a sample from class $c$, and $p_{o,c}$ denotes the predicted probability that observation $o$ belongs to class $c$.

Figure 3.1: Architecture for ResNet-18 trained on ImageNet [Napoletano et al., 2018]
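In PyTorch, Equation 3.1 corresponds to nn.CrossEntropyLoss, which applies log-softmax to the raw network outputs internally; a minimal usage sketch with dummy values is:

```python
import torch
import torch.nn as nn

# Cross-entropy loss over 7 CK+ emotion classes (Equation 3.1), averaged
# over a batch of 4 observations.
criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 7)            # raw network outputs
targets = torch.tensor([0, 3, 6, 2])  # ground-truth class index per observation
loss = criterion(logits, targets)
```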

3.1.2 Self-supervised Learning

For the self-supervised learning approaches, a much larger amount of images was used to train the models on the pretext tasks compared to the data used in supervised learning. Hence, the learning rates were smaller and the numbers of epochs greater.

Image Colourisation

In order to train a CNN for image colourisation, the network needs to take gray-scale images as input and output colour images. Therefore, the architecture we used for image colourisation is composed of two parts: the first part is the same as a ResNet-18 architecture for learning features, while the second part of the network upscales the representation in spatial resolution to map from black-and-white images to colour images [luk, 2018]. Our approach first converts the input RGB image to a LAB image and extracts the lightness channel as the gray-scale input image to the CNN. This architecture is illustrated in Figure 3.2. To train this pretext task, we used a learning rate of 0.001 for 300 epochs. The batch size was chosen to be 64.

In order for the colourised image to be close to the original image, the squared distance between the predicted colour value and the ground-truth colour value was minimised. The mean squared error can be formally denoted as

$$\sum_{i=0}^{n} (y_i - p_i)^2 \tag{3.2}$$

Figure 3.2: CNN Architecture used in Image Colourisation

where $n$ represents the number of samples, $y_i$ represents the ground-truth value for sample $i$ and $p_i$ is the prediction for sample $i$. Adam optimisation was used to optimise the loss function.
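A sketch of this input preparation and loss, assuming OpenCV's LAB conversion and a placeholder image path ("face.jpg"), is shown below; the actual encoder-decoder model is omitted.

```python
import cv2
import numpy as np
import torch
import torch.nn as nn

# Convert an image to LAB, keep the lightness (L) channel as the network
# input, and treat the colour (a, b) channels as the regression target.
bgr = cv2.imread("face.jpg")  # placeholder path
lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
lightness = torch.from_numpy(lab[:, :, 0]).float()[None, None] / 255.0
ab_target = torch.from_numpy(lab[:, :, 1:].astype(np.float32)).permute(2, 0, 1)[None] / 255.0

criterion = nn.MSELoss()  # squared error of Equation 3.2, averaged over pixels
# loss = criterion(model(lightness), ab_target)  # model: ResNet-18 encoder + upsampling decoder
```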

To transfer the learnt visual features from this network, we extracted only the features learnt in the first part of the architecture and transferred those weights to the downstream task; for the layers that did not have any initialisation weights learnt in the prior pretext task, random initial weights were used.

FAb-Net

The pretext task trained with FAb-Net using the single-source-frame architecture [Wiles et al., 2018] was implemented with a learning rate of 0.001 and a batch size of 32. As shown in Figure 2.3 and Figure 3.3, this architecture takes a pair of inputs, a source frame $s$ and a target frame $t$, from the same video sequence containing the same identity; this input pair is used to obtain a predicted frame $s_p$. The decoder in this network predicts offsets for each pixel location in frame $t$, and the offsets are used to obtain $s_p$ by sampling the source frame at the offset pixel locations; formally, $s_p(x, y) = s(x + O_x, y + O_y)$, where $(O_x, O_y)$ denotes the offsets predicted by the decoder. The loss function used in this pretext task is the L1 norm between the target frame and the predicted frame, denoted as

$$\left\| s_p - t \right\|_1 \tag{3.3}$$

The cross-entropy loss was then used in the downstream classification task after transfer.
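A simplified sketch of the warping and reconstruction loss described above is given below; it assumes offsets already normalised to the [-1, 1] coordinate convention of grid_sample and uses zero offsets purely for illustration.

```python
import torch
import torch.nn.functional as F

# Sample the source frame at offset pixel locations to obtain the predicted
# frame s_p(x, y) = s(x + O_x, y + O_y), then take the L1 loss against t.
def warp(source, offsets):  # source: (N, C, H, W), offsets: (N, H, W, 2)
    n, _, h, w = source.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w))
    base_grid = torch.stack((xs, ys), dim=-1).expand(n, h, w, 2)
    return F.grid_sample(source, base_grid + offsets)

source = torch.randn(1, 3, 64, 64)
target = torch.randn(1, 3, 64, 64)
predicted = warp(source, torch.zeros(1, 64, 64, 2))  # zero offsets: identity warp
loss = (predicted - target).abs().mean()             # L1 loss of Equation 3.3
```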

Figure 3.3: FAb-Net Architectures. Each convolution is followed by a leaky ReLU and a batch-norm layer.

3.1.3 Evaluation on Facial Classification

Accuracy and the area under the ROC curve (AUC) were used to evaluate the performance of the downstream task. Apart from the model transferred from FAb-Net, all models used accuracy to compare their performance against other models. For the FAb-Net model, we additionally report the area under the ROC curve to further evaluate performance. This AUC was obtained by first computing the AUC for each class independently and then averaging the results over the 12 classes.
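A sketch of this macro-averaged AUC computation, using scikit-learn's roc_auc_score on dummy predictions, is:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Compute the AUC of each of the 12 AU classes independently, then average.
def mean_auc(labels, scores):  # labels: (N, 12) binary, scores: (N, 12)
    per_class = [roc_auc_score(labels[:, c], scores[:, c]) for c in range(labels.shape[1])]
    return float(np.mean(per_class))

labels = np.random.randint(0, 2, size=(100, 12))  # dummy ground truth
scores = np.random.rand(100, 12)                  # dummy predicted scores
print(mean_auc(labels, scores))
```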

Chapter 4

Experiment Setup

In this chapter, we introduce the datasets used for conducting the experiments and the implementation details for each experiment, for both the pretext tasks and the downstream tasks.

Various datasets have been used in the process of conducting the experiments and evaluation. Some of them were used in the training of pretext tasks or supervised learning tasks before transfer learning, while others were used to train on the downstream task. The list of datasets and their sizes can be found in Table 4.1.

The experimental setup for evaluation is also explained in detail; the software and hardware platforms are shown in Section 4.2 and Section 4.3. Section 4.4 introduces the pre-processing applied to the input images and datasets, while the detailed methodologies for each task can be found in Section 3.1.

4.1 Dataset

4.1.1 ImageNet and CK+

Two supervised learning tasks were trained for transfer to the downstream task, using ImageNet and CK+ respectively.

ImageNet is a large collection of human-annotated images with 21 thousand different classes [Deng et al., 2009]; it contains more than 14 million images. The dataset used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a subset of the whole ImageNet dataset that contains a million images across 1000 classes. We used the ResNet-18 model that was pre-trained on this one-million-image subset of ImageNet as the fine-tuning model before transfer learning.

Table 4.1: Datasets used in the evaluation.

Dataset       Number of Images   Identities   Classes
ImageNet      1 million          None         1000
CK+           981                327          7
VoxCeleb 1    >1.2 million       1,251        None
CelebA        202,599            10,177       None
EmotioNet     >30,000            None         12

Another dataset we used is CK+ [Lucey et al., 2010], the extended version of the Cohn-Kanade AU-Coded Expression Database. This dataset includes 327 labelled facial video sequences. Since the size of each sequence is not identical, we used the version of the dataset that extracts the last three frames from each video, which in total contains 981 facial expression images. This dataset contains 7 classes representing human emotions. We used this facial expression dataset to pre-train a ResNet-18 model and then transferred this model to learning the target task.

4.1.2 VoxCeleb 1

VoxCeleb 1 is a large video dataset that contains over 100,000 utterances for 1,251 celebrities; the short clips of this dataset are extracted from interview videos on YouTube [Nagrani et al., 2017]. We used this dataset to replicate the training process of FAb-Net; since the original work of FAb-Net treats videos as sets of images, we trained our FAb-Net architecture with the cropped images extracted from the videos at 1 fps, which amounts to over 1.2 million images.

We also obtained a pretrained FAb-Net model trained with both VoxCeleb 1 and VoxCeleb 2, where VoxCeleb 2 is an even larger audio-visual dataset with over a million utterances for 6,112 identities [Chung et al., 2018]. This model was pretrained by [Wiles et al., 2018].

4.1.3 CelebA

CelebA is a large-scale facial image dataset containing more than 200K celebrity images of over 10,000 identities; it is commonly used in computer vision tasks such as face detection, face recognition and facial landmark detection [Liu et al., 2015]. It contains 5 landmark annotations for each image; none of these labels is used in the training process. This dataset shares celebrity identities with the VoxCeleb 1 dataset; thus, it was chosen to train the image colourisation pretext task.

4.1.4 EmotioNet

The chosen downstream task in our work is to classify facial expressions. Hence, we chose a large-scale dataset, EmotioNet, that contains more than 950,000 images annotated with action units [Benitez-Quiroz et al., 2016]. A subset of the EmotioNet database was obtained for our purpose, containing over 30,000 images across 12 different AUs. These action units were used as the label for each image; for our purposes, the subsets with different labels are assumed to be disjoint from each other. The distribution of the action unit labels can be seen in Table 4.2.

Table 4.2: Distribution of Different Classes in the EmotioNet dataset

Label   Action Unit         Number of Images
1       Inner brow raiser   2,029
12      Lip corner puller   13,337
17      Chin raiser         8,589
2       Outer brow raiser   3,038
20      Lip stretcher       2,415
25      Lips part           17,766
26      Jaw drop            17,296
4       Brow lowerer        4,843
5       Upper lid raiser    5,416
43      Eyes closed         4,402
6       Cheek raiser        8,325
9       Nose wrinkler       5,785

Table 4.3: Python packages used in our evaluation for different tasks.

Python Package       Colourisation   FAb-Net   CK+   Downstream Task
python 3.6           X               X         X     X
torch 1.1.0          X               X         X     X
PIL (pillow) 6.0.0   X               X         X
torchvision          X               X         X     X
tensorboardX                         X
cv2                  X

Table 4.4: Hardware Implementation details

Hardware   Implementation details
GPU        1× NVIDIA Tesla K80 (with two GPUs), 1× NVIDIA Tesla K40
CPU        2× Hexa-core 2.0 GHz Intel Xeon CPU E5-2620 v3
Memory     128 GB

Figure 4.1: Comparison between Images before and after Cropping using Face++

4.2 Software platform

We conducted all training and testing using PyTorch. The packages used and their details are listed in Table 4.3. All processes were implemented and run on Ubuntu 14.04.

4.3 Hardware platform

The training and testing for this work's experiments were performed on GPUs to reduce training time. The GPUs used were NVIDIA K80 and K40 GPUs. Table 4.4 specifies all the hardware details.

4.4 Data Preparation

To enable the images to be colourised, all input images were converted to gray-scale before training on the image colourisation pretext task.

To improve the accuracy of classification in the downstream task, a commercial face detection API (Face++) was used to crop out the face inside each image [fac]. The comparison between images before and after cropping is shown in Figure 4.1. This crops the image to contain only the facial region and eliminates the noise introduced by the background; the same cropping technique was applied to the test images.

The input images were then resized to the required input size defined for each individual task. For image colourisation, FAb-Net and all downstream tasks, the corresponding datasets were randomly partitioned into train/validation/test sets (with a split of 75/15/10).
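A sketch of this partitioning with torchvision and torch.utils.data.random_split is shown below; the directory path is a placeholder and the split sizes follow the 75/15/10 ratio above.

```python
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Resize images to the network input size and split 75/15/10.
transform = transforms.Compose([transforms.Resize((224, 224)),
                                transforms.ToTensor()])
dataset = datasets.ImageFolder("data/cropped_faces", transform=transform)  # placeholder path

n = len(dataset)
n_train, n_val = int(0.75 * n), int(0.15 * n)
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n - n_train - n_val])
```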

Chapter 5

Results and Evaluation

We evaluated the results obtained from training the pretext tasks and transferring the learning onto facial expression classification. The parameters for each transferred downstream task were identical to the parameters used in its source task. As this work aims to compare the performance of classifying facial expressions using different transfer learning approaches, self-supervised and fully supervised, the accuracy of facial classification was recorded. Observations were made on these experimental results and we drew some conclusions from them.

In Section 5.1, we discuss the baseline against which all the transfer learning models are compared. In Section 5.2, the performance of using learnt representations from various self-supervised learning methods is compared to the results transferred from supervised methods. The results of the experiments are discussed in Section 5.3.

5.1 Baseline

Random Initial Weights. To compare our results obtained from transfer learning against a standard baseline, a model with random weight initialisation was trained on EmotioNet to classify the facial expressions. This model used ResNet-18 with cross-entropy loss as its loss function, optimised with SGD. The last fully connected layer was replaced in order to classify 12 classes. This model was trained with a learning rate of 0.01 for 300 epochs. Results for this baseline have been evaluated on images both before and after cropping, to test the effectiveness of this pre-processing method.

This trained-from-scratch ResNet-18 network sets a baseline for evaluating the performance of transfer learning; there was no prior knowledge learnt by this network, and thus the comparison between the performance of a transfer-learning model and this baseline indicates how well the pretext task learnt visual representations of facial attributes.
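A minimal sketch of this baseline configuration (ResNet-18 with random weights, a 12-class head, cross-entropy loss and SGD at learning rate 0.01) is:

```python
import torch
import torch.nn as nn
from torchvision import models

# Baseline: no prior knowledge, random weight initialisation.
model = models.resnet18(pretrained=False)
model.fc = nn.Linear(model.fc.in_features, 12)  # replace head to classify 12 classes
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```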

Figure 5.1: Accuracy of Facial Expression Classification using Different Weight Initialisation Methods. The performance of classification is evaluated by accuracy. Higher accuracy is better.

5.2 Result

5.2.1 Supervised versus Self-supervised Learning

The results of our experiments are shown in Figure 5.1. As we can see, the baseline model trained from scratch achieved an accuracy of 32% when classifying over 12 different classes; all other models initialised with weights transferred from some form of prior learning outperformed the baseline model. This indicates that, in learning the pretext tasks, some features of facial attributes were learnt; the difference in accuracy between the baseline model and a model using transfer learning explicitly shows the effectiveness of that transfer and indicates whether facial attributes were learnt in the pretext task.

Overall, the classification accuracy was not high across all these different approaches; this was likely caused by the extremely imbalanced number of samples per class in the dataset, as can be seen in Table 4.2. Imbalanced data can cause major problems in classification as it has a severe impact on accuracy. This may explain why none of the initialisation methods performed well in this case.

We also obtained results from training a model with random weights without any image pre-processing except resizing to the accepted input size (224 × 224) for ResNet-18. The accuracy was 24%. Compared to this result, we saw a significant improvement in accuracy after introducing the pre-processing method discussed in Section 4.4.

It is evident in Figure 5.1 that the two chosen self-supervised learning approaches outperformed all models transferred using fully supervised learning. We expected that the model pre-trained with ImageNet would perform well after transfer to facial expression classification, as it should have learnt enough visual features to perform image classification. However, since ImageNet is a dataset containing 1000 classes of objects rather than a facial image dataset, not enough features helpful for facial attributes were learnt. On the other hand, CK+ is a facial expression dataset with seven labels; although this dataset is small, it achieved slightly better results on the target task compared to the model pre-trained with ImageNet. We suspect that using a dataset relevant to the target task boosts performance on the downstream task.

However, in the field of learning facial attributes, it is not feasible to obtain a well-annotated large-scale dataset in every specific domain of work. In this case, self-supervised learning becomes extremely useful as it can be trained on unlabelled data. The best result using image colourisation as the pretext task was trained on over 1.4 million face images; it would be nearly impossible to acquire an annotated face image dataset of a similar scale. With this large amount of data, the performance of transferring from image colourisation improved by 11.92% over the baseline model. Even though image colourisation is a general and straightforward pretext task, it showed its effectiveness in learning facial attributes. In order to colour a gray-scale image, the network is expected to learn the context of the image, in this case the facial features. It is also expected to learn facial attributes from detecting edges and texture changes in the face images. Our experiments showed that self-supervised learning is capable of outperforming results on a downstream task when sufficient unlabelled data is used to train it.

Our best result for facial expression classification was obtained using the FAb-Net architecture trained with over 1.2 million face images extracted from videos, with an accuracy of 52.2%. This result improved by 20.20% over the baseline and by nearly 10% over the second-best result, achieved by image colourisation. This architecture is expected to learn facial attributes in the training process, as it is designed to predict a face image from given source frames. In the process of this prediction, the network is expected to learn essential features of the face, such as facial landmarks, in order to transfer the face in the source frame to the predicted frame. Since we classify action units in the target task, the features learnt by FAb-Net should be able to discriminate changes in muscle actions. The results confirmed our hypothesis.

Although both image colourisation trained on the same dataset, VoxCeleb 1, ac-curacy of transferring these two pretext tasks were not the same; image colourisationachieved an accuracy of 43.06% while FAb-Net got an accuracy of 52.2%. We concludethat FAb-Net was more effective in terms of learning facial features in the processof training pretext tasks compared to image colourisation. These may due to twomain reasons 1)FAb-Net had a more complex architecture that designed to learnfacial features. In contrast, image colourisation was a more general pretext task thatlearnt more general image features. This results in the learnt weights from FAb-Net


Figure 5.2: Accuracy of facial expression classification using image colourisation as the pretext task with different data sizes for training. The performance of classification is evaluated using accuracy. Higher accuracy is better.

2) The image colourisation architecture implemented in this work learnt an embedding and weights only for the first part of the ResNet-18 architecture, while the weights for the second half of the architecture were still initialised randomly. This provides relatively less information to the downstream task and may therefore result in lower accuracy.
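The sketch below, assuming PyTorch and torchvision, illustrates how such a partial initialisation can be performed: tensors whose names and shapes match the pretext checkpoint are copied, while all remaining layers keep their random initialisation. The checkpoint path and the seven-class output head are placeholders, not our exact configuration.

```python
# Sketch of partially initialising the downstream classifier from a
# pretext checkpoint (assumed PyTorch/torchvision; path and class count
# are hypothetical).
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=7)      # downstream head
pretext_state = torch.load("pretext_checkpoint.pth")    # assumed state_dict

# keep only tensors whose names and shapes match the classifier; the rest
# of the network (e.g. the second half under colourisation) stays random
own_state = model.state_dict()
matched = {k: v for k, v in pretext_state.items()
           if k in own_state and v.shape == own_state[k].shape}
own_state.update(matched)
model.load_state_dict(own_state)
print(f"initialised {len(matched)}/{len(own_state)} tensors from the pretext task")
```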

5.2.2 Training Data for Pretext Task

A great advantage of self-supervised learning is that it does not rely on any human annotation of the training data; hence, one can employ as much data as is available to train a pretext task. As shown in Figures 5.2 and 5.3, the amount of training data employed in the pretext task has a positive effect on the performance of facial expression classification. This explicitly indicates that the more data provided to train the pretext task, the more visual representations of facial attributes are learnt in the process, and these learnt representations provide a better initialisation point for the downstream-task network to start from.

In Figure 5.2, the accuracy of facial expression classification rose from 37.6% to 43.92% when an additional 1.2 million images were provided to train the image colourisation pretext task. A similar trend was observed for the AUC of FAb-Net when extra video sequences were given for training; the area under the ROC curve increased by 22.9%, from 59.2 to 72.9.
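For reference, the following is a minimal sketch of the AUC metric used here, assuming scikit-learn; the scores, labels, and number of action units are random placeholders rather than our experimental outputs.

```python
# Sketch of the macro-averaged AUC metric over action units
# (assumed scikit-learn; all data below are random placeholders).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_samples, n_aus = 500, 12                              # hypothetical AU count
labels = rng.integers(0, 2, size=(n_samples, n_aus))    # binary AU labels
scores = rng.random(size=(n_samples, n_aus))            # predicted scores

# macro-average the area under the ROC curve over all action units,
# matching the "higher AUC is better" reading of Figure 5.3
auc = roc_auc_score(labels, scores, average="macro")
print(f"macro AUC: {auc:.3f}")
```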


Figure 5.3: Performance of facial expression classification using FAb-Net as the pretext task with different data sizes for training. The performance of classification is evaluated using AUC. Higher AUC is better.

Table 5.1: Comparison between our self-supervised models and state-of-the-art work.

Architecture                    Supervised Learning Method    Accuracy

FAb-Net                         Self-supervised               52.2%
Image Colourisation             Self-supervised               43.9%
PingAn-GammaLab [emo, 2018]     Fully supervised              94.46%
[Zhao et al., 2018]             Weakly supervised             91.3%

5.2.3 Comparison with the State of the Art

Although FAb-Net outperformed all the other experiments we conducted, it performed relatively poorly compared with the state-of-the-art results on classifying EmotioNet. Table 5.1 shows some state-of-the-art results using the EmotioNet dataset. It is worth noting that the whole dataset was used in the works mentioned in the table; [Zhao et al., 2018] classified the images into only seven different AUs, and their result was obtained using clustering rather than any CNN architecture. Clearly, there is still a large gap between state-of-the-art performance and the results obtained in our work.

This gap might be narrowed by employing the whole dataset for training the downstream task, or by introducing more data into the training of the pretext tasks. Although the results achieved with the current setting are not comparable with the fully supervised results, self-supervised learning still showed great advantage and potential in learning facial features.


5.3 Discussion

In conclusion, the results of the conducted experiments have been evaluated and analysed. FAb-Net outperformed all other methods owing to its greater ability to learn facial features. A positive relationship was observed between the amount of training data used to train the pretext tasks in self-supervised learning and the accuracy of the downstream classification task: the more data provided to train a pretext task, the better the performance on the chosen downstream task. We also compared our results with the current state-of-the-art work.


Chapter 6

Conclusion

In this work, we use several self-supervised approaches to learn facial attribute features from large-scale video datasets without any labels. The learnt embeddings are then transferred to a target task of classifying facial expressions. To evaluate the effectiveness of our self-supervised approaches, self-supervised learning and fully supervised transfer learning are compared on the accuracy of the chosen classification task; specifically, five different weight initialisation methods are implemented: random weight initialisation, fully supervised transfer learning pre-trained on ImageNet and CK+ respectively, and self-supervised learning using image colourisation and FAb-Net.

Through the analysis of the experimental results, we conclude that FAb-Net outperformed all other approaches owing to its complexity and its ability to learn complex facial attributes. Our experiments also illustrate that the amount of data used to train pretext tasks in self-supervised learning has a positive effect on the performance of the downstream task transferred from the pretext task. Moreover, self-supervised learning proves useful and applicable in circumstances where the field of the target task lacks large-scale human-labelled datasets, or where such datasets are hard to obtain.

6.1 Limitations

We are aware of a few limitations in our experiments and implementation that led to performance not comparable with state-of-the-art approaches. Due to limitations of equipment and time, the parameters selected for each model were not fully optimised for performance; parameters with reasonably good performance were chosen, and further tuning may improve the results.

Moreover, as discussed in Section 5.2, the accuracy of our model may be affected by the unbalanced number of samples in each class. Further improvement could be made by obtaining a balanced dataset for training.

Lastly, the loss function we chose for image colourisation was the MSE loss, which heavily penalises bright but incorrect colours; as a result, the implemented model favours desaturated colours. To obtain more vibrant images, the approach of [Zhang et al., 2016] could be adopted in future work.
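As a rough illustration of that alternative, the sketch below replaces MSE regression with the classification-based objective of [Zhang et al., 2016]: the ab colour space is quantised into discrete bins, and a class-rebalancing weight up-weights rare, saturated bins. This is a simplification (the original also uses soft-encoded targets), and all tensors here are placeholders.

```python
# Sketch of the class-rebalanced colourisation loss from Zhang et al.
# (assumed PyTorch; simplified relative to the paper's soft encoding).
import torch
import torch.nn as nn

NUM_BINS = 313  # number of quantised ab-space bins used by Zhang et al.

def rebalanced_colour_loss(logits, bin_targets, class_weights):
    """logits: (B, NUM_BINS, H, W); bin_targets: (B, H, W) integer bin
    indices; class_weights: (NUM_BINS,) inverse-frequency weights that
    up-weight rare, saturated colour bins."""
    return nn.functional.cross_entropy(logits, bin_targets, weight=class_weights)

# toy usage with random tensors, purely illustrative
logits = torch.randn(2, NUM_BINS, 8, 8)
targets = torch.randint(0, NUM_BINS, (2, 8, 8))
weights = torch.ones(NUM_BINS)
print(rebalanced_colour_loss(logits, targets, weights))
```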


6.2 Future Work

Despite the limitations mentioned in the section above, several possible directions for future work can be considered, including the following:

1. Further extending this work by evaluating a greater variety of general self-supervised learning approaches, as well as self-supervised approaches targeted at learning facial attributes [Li et al., 2019] [Sharma et al., 2019].

2. Exploring the fusion of some of the well-performing self-supervised learning techniques through multi-task learning [Doersch and Zisserman, 2017]. Various tasks can be combined to jointly train a network in order to achieve better performance on downstream tasks. Deeper networks are also expected to yield better results than those obtained in this work.

3. As mentioned in Section 5.2, the amount of data used for training pretext tasks has an impact on performance on the target task; therefore, we expect to use even larger datasets to train the pretext tasks. As no labels are required for training pretext tasks, data in the wild may also be useful.


Bibliography

Face - face cognitive services. https://www.faceplusplus.com/. (cited on page 21)

2018. Emotionet challenge 2018. http://cbcsl.ece.ohio-state.edu/EmotionNetChallenge/index.html#2018results. (cited on page 27)

2018. Image colorization with convolutional neural networks. https://lukemelas.github.io/image-colorization.html. (cited on page 14)

Al-Darraji, S.; Berns, K.; and Rodic, A., 2017. Action unit based facial expression recognition using deep learning. Vol. 540, 413–420. doi:10.1007/978-3-319-49058-8_45. (cited on page 11)

Aloysius, N. and Geetha, M., 2017. A review on deep convolutional neural networks. 2017 International Conference on Communication and Signal Processing (ICCSP), (2017). doi:10.1109/iccsp.2017.8286426. (cited on pages 6 and 7)

Benitez-Quiroz, F.; Srinivasan, R.; and Martinez, A., 2016. Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. doi:10.1109/CVPR.2016.600. (cited on page 18)

Chung, J. S.; Nagrani, A.; and Zisserman, A., 2018. Voxceleb2: Deep speaker recognition. In INTERSPEECH. (cited on page 18)

Deng, J.; Dong, W.; Socher, R.; Li, L.; Kai Li; and Li Fei-Fei, 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. doi:10.1109/CVPR.2009.5206848. (cited on pages 1 and 17)

Doersch, C.; Gupta, A.; and Efros, A. A., 2015. Unsupervised visual representation learning by context prediction. (cited on page 8)

Doersch, C. and Zisserman, A., 2017. Multi-task self-supervised visual learning. (cited on page 30)

Fathallah, A.; Abdi, L.; and Douik, A., 2017. Facial expression recognition via deep learning. 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications (AICCSA), (2017). doi:10.1109/aiccsa.2017.124. (cited on page 11)

Gidaris, S.; Singh, P.; and Komodakis, N., 2018. Unsupervised representation learning by predicting image rotations. (cited on page 9)


He, K.; Zhang, X.; Ren, S.; and Sun, J., 2015. Deep residual learning for image recognition. (cited on pages 5 and 7)

Jing, L. and Tian, Y., 2019. Self-supervised visual feature learning with deep neural networks: A survey. (cited on page 2)

Khan, A.; Sohail, A.; Zahoora, U.; and Qureshi, A. S., 2019. A survey of the recent architectures of deep convolutional neural networks. (cited on page 7)

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E., 2017. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60, 6 (May 2017), 84–90. doi:10.1145/3065386. http://doi.acm.org/10.1145/3065386. (cited on page 5)

Kumar, G. A. R.; Kumar, R. K.; and Sanyal, G., 2017. Facial emotion analysis using deep convolution neural network. 2017 International Conference on Signal Processing and Communication (ICSPC), (2017). doi:10.1109/cspc.2017.8305872. (cited on page 11)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521 (2015), 436–444. (cited on pages 5 and 6)

Li, Y.; Zeng, J.; Shan, S.; and Chen, X., 2019. Self-supervised representation learning from videos for facial action unit detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (cited on pages 9 and 30)

Liu, Z.; Luo, P.; Wang, X.; and Tang, X., 2015. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV). (cited on page 18)

Lucey, P.; Cohn, J. F.; Kanade, T.; Saragih, J.; Ambadar, Z.; and Matthews, I., 2010. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, (2010). doi:10.1109/cvprw.2010.5543262. (cited on page 18)

Nagrani, A.; Chung, J. S.; and Zisserman, A., 2017. Voxceleb: A large-scale speaker identification dataset. In INTERSPEECH. (cited on page 18)

Napoletano, P.; Piccoli, F.; and Schettini, R., 2018. Anomaly detection in nanofibrous materials by CNN-based self-similarity. Sensors (Basel, Switzerland), 18 (01 2018). doi:10.3390/s18010209. (cited on pages ix and 14)

Noroozi, M. and Favaro, P., 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. (cited on page 8)

Pan, S. J. and Yang, Q., 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22, 10 (2010), 1345–1359. doi:10.1109/tkde.2009.191. (cited on page 1)


Rawat, W. and Wang, Z., 2017. Deep convolutional neural networks for image classification: A comprehensive review. Neural Computation, 29, 9 (2017), 2352–2449. doi:10.1162/neco_a_00990. (cited on page 6)

Sharma, V.; Tapaswi, M.; Sarfraz, M. S.; and Stiefelhagen, R., 2019. Self-supervised learning of face representations for video face clustering. (cited on pages 9 and 30)

Simonyan, K. and Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. (cited on page 5)

Vyas, A. S.; Prajapati, H. B.; and Dabhi, V. K., 2019. Survey on face expression recognition using CNN. 2019 5th International Conference on Advanced Computing and Communication Systems (ICACCS), (2019). doi:10.1109/icaccs.2019.8728330. (cited on page 10)

Weiss, K.; Khoshgoftaar, T. M.; and Wang, D., 2016. A survey of transfer learning. Journal of Big Data, 3, 1 (2016). doi:10.1186/s40537-016-0043-6. (cited on page 1)

Wiles, O.; Koepke, A. S.; and Zisserman, A., 2018. Self-supervised learning of a facial attribute embedding from video. (cited on pages 10, 15, and 18)

Yaseen, A. F. and Saud, L. J., 2018. A survey on the layers of convolutional neural networks. (cited on page 6)

Zhai, X.; Oliver, A.; Kolesnikov, A.; and Beyer, L., 2019. S4L: Self-supervised semi-supervised learning. (cited on page 8)

Zhang, R.; Isola, P.; and Efros, A. A., 2016. Colorful image colorization. In ECCV. (cited on pages 9 and 29)

Zhao, K.; Chu, W.-S.; and Martinez, A. M., 2018. Learning facial action units from web images with scalable weakly supervised clustering. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2018). doi:10.1109/cvpr.2018.00223. (cited on page 27)
