
  • THE AUSTRALIAN NATIONAL UNIVERSITY

    THESIS OF COMP8755

Adaptive Integration of Multiple Fine-tuning Models in Transfer

    Learning for Image Classification

Author: Yu WANG (U5762606)

Supervisors: Professor Tom Gedeon

    Ms Jo Plested

A thesis submitted in fulfillment of the requirements for the degree of Master of Computing

    in the

    Research School of Computer Science

    June 11, 2020


  • ii

Declaration of Authorship

I, Yu WANG (U5762606), declare that this thesis titled, "Adaptive Integration of Multiple Fine-tuning Models in Transfer Learning for Image Classification", and the work presented in it are my own, except where otherwise indicated.

  • iii

    THE AUSTRALIAN NATIONAL UNIVERSITY

Abstract

Research School of Computer Science

Master of Computing

Adaptive Integration of Multiple Fine-tuning Models in Transfer Learning for Image Classification

by Yu WANG (U5762606)

Transfer Learning (TL) has been widely used as a Deep Learning (DL) technique to solve computer vision related problems, especially when the problem is image classification employing Convolutional Neural Networks (CNN). Traditionally, there are two ways to implement TL: one is freezing all the weights learnt from the source dataset, and the other is fine-tuning most or all of the weights learnt from the source dataset. In this paper, a novel TL approach that can adaptively integrate multiple models with different fine-tuning settings is proposed, which is denoted as MultiTune.

To evaluate the performance of MultiTune, it is compared to a state-of-the-art TL technique called SpotTune, which can be considered an adaptive fine-tuning approach that is able to generate the optimal fine-tuning strategy for every image in the target dataset. Two image datasets are used to evaluate the performance of SpotTune and MultiTune: the FGVC-Aircraft dataset and the CIFAR100 dataset. These datasets are smaller-sized (72 pixels for the shorter edge) versions taken from the Visual Decathlon Challenge.

Results obtained in this paper show that MultiTune achieves a validation accuracy of 59.59% on the Aircraft dataset and 79.31% on the CIFAR100 dataset, while SpotTune achieves a validation accuracy of 55.15% on the Aircraft dataset and 78.45% on the CIFAR100 dataset. To study their performance on small target datasets, they are also evaluated on smaller-sized (fewer images per class) Aircraft and CIFAR100 datasets. MultiTune outperforms SpotTune on most of these smaller-sized datasets as well. Besides, MultiTune is computationally cheaper than SpotTune and requires less training time for each dataset used in this paper. Future work could focus on integrating more fine-tuning models with different settings, and on further tuning of the hyper-parameters used in MultiTune.


  • iv

Acknowledgements

I would like to express my sincere appreciation to my supervisors, Ms Jo Plested and Professor Tom Gedeon, for their continuous guidance and support throughout the whole project. Their knowledge, understanding of Deep Learning and self-discipline inspired me to execute every task with effort. It is their weekly feedback that gave me the motivation and led me in the right direction for my project.

In particular, I am deeply grateful to Ms Jo Plested for all the assistance she gave me in the project meetings. Thank you for giving me the instructions and clarifying my doubts about many questions.

Finally, I would like to thank my family and friends. Without your help, I would not have been able to accomplish this project on time.

  • v

    Contents

    Declaration of Authorship ii

    Abstract iii

    Acknowledgements iv

1 Introduction 1
  1.1 Introduction 1
  1.2 Motivation 2
  1.3 Project Scope 2

2 Background and Related Work 4
  2.1 Background 4
    2.1.1 Neural Networks and Deep Learning 4
    2.1.2 Convolutional Neural Network 5
      Convolutional Layer 6
      Pooling Layer 7
      ReLU Layer 7
      Fully Connected Layer 8
      Loss Layer 8
    2.1.3 Deep Residual Network 8
    2.1.4 Attention Mechanism 10
    2.1.5 L2 Regularization 10
  2.2 Related Works 10
    2.2.1 Image Classification by Using TL 10
    2.2.2 Input-dependent/Selective Execution 11
    2.2.3 SpotTune 12
    2.2.4 L2-SP Regularization 13
    2.2.5 Attention Mechanism in CNNs 14

3 Methodology and Experiment 16
  3.1 Introduction of Datasets 16
  3.2 Methodology 17
    3.2.1 CNN architecture 17
    3.2.2 Data Loading and Transformation 18
      Data Loading 18
      Data Transformation 18
    3.2.3 Learning Method 19
    3.2.4 Transfer Learning Method 20
    3.2.5 Implementation of MultiTune 20
    3.2.6 Activation Function, Loss Function and Optimizer 21
  3.3 Experiment 22
    3.3.1 Environments Used 22
    3.3.2 Data Preparation 23
    3.3.3 Baseline Preparation 23
    3.3.4 MultiTune Setup 24
      Basic Settings 24
      Settings of the Fine-tuning Models 24
      Settings of the MultiTune model 25
    3.3.5 Overall Approach 26

4 Results and Analysis 27
  4.1 Results of Baseline 27
  4.2 Results of MultiTune 29
  4.3 Analysis of the Results 30

5 Conclusion and Future Work 35
  5.1 Conclusion 35
  5.2 Future Work 36

A Project Contract 37

Bibliography 40

  • vii

    List of Figures

2.1 An Example of Perceptron Algorithm 4
2.2 A visualization of AlexNet 6
2.3 An example of convolution process 6
2.4 An example of max pooling 7
2.5 A visualization of a residual block 9
2.6 A visualization of ResNet34 9
2.7 Different architectures of D2NN 12
2.8 Illustration of SpotTune's working procedure 13
2.9 A visualization of the global and local attention mechanism 15
3.1 Visualization of ResNet26 17
3.2 Visualization of MultiTune 21
4.1 Validation Accuracy after every Epoch on Aircraft and CIFAR100 Datasets for SpotTune 28
4.2 Validation Accuracy after every Epoch on Smaller-sized Aircraft Dataset for SpotTune 28
4.3 Validation Accuracy after every Epoch on Smaller-sized CIFAR100 Dataset for SpotTune 29
4.4 Validation Accuracy versus the Number of Epochs of SpotTune and MultiTune on Aircraft and CIFAR100 Datasets 30
4.5 Validation Accuracy of SpotTune and MultiTune after Running 5 Iterations on Aircraft Dataset 31
4.6 Validation Accuracy after every Epoch on Smaller-sized Aircraft Dataset for SpotTune and MultiTune 32
4.7 Validation Accuracy after every Epoch on Smaller-sized CIFAR100 Dataset for SpotTune and MultiTune 32
4.8 Running time comparison between SpotTune and MultiTune 33

  • viii

    List of Tables

3.1 The Visual Decathlon Datasets 17
3.2 Mean and Standard Deviation of the Datasets 19
3.3 Settings of the Two Fine-tuning Models 24
3.4 Settings of the MultiTune model 25
4.1 Results of SpotTune on Aircraft and CIFAR100 27
4.2 Results of MultiTune on Aircraft and CIFAR100 29
4.3 Results of 5 Iterations for MultiTune and SpotTune on Aircraft Dataset 30

  • ix

    List of Equations

2.1 Convolution 6
2.2 Rectified Linear Units (ReLU) 7
2.3 Cross Entropy Loss 8
2.4 Residual function 9
2.5 CE with L2 regularizer 10
2.6 Gumbel Softmax 12
2.7 L2-SP Regularization without the Last Layer 14
2.8 L2-SP Regularization with Last Layer 14
3.1 Image Normalization 19
3.2 MultiTune Mechanism 21
3.3 SoftMax 21
3.4 CE with L2-SP regularizer 22
3.5 SGD with Momentum 22

  • x

    List of Abbreviations

API Application Programming Interface
BP Back Propagation
CE Cross Entropy Loss
COCO Common Objects in Context
CV Computer Vision
CNN Convolutional Neural Network
DL Deep Learning
DNN Deep Artificial Neural Network
D2NN Dynamic Deep Neural Network
FC Fully Connected
ML Machine Learning
MSE Mean Square Error Loss
NLP Natural Language Processing
NN Neural Network
PCA Principal Component Analysis
ReLU Rectified Linear Units
ResNet Deep Residual Network
RL Reinforcement Learning
RNN Recurrent Neural Network
SGD Stochastic Gradient Descent
tanh Hyperbolic Tangent
TL Transfer Learning

  • xi

  • 1

    Chapter 1

    Introduction

    1.1 Introduction

Transfer learning (TL) is a recent research problem in machine learning (ML), which focuses on applying knowledge obtained from one problem to a different but related problem. It can be regarded as a simulation of the human being's learning process. Humans are able to use inherent ways to transfer their knowledge between tasks. This means we usually apply relevant knowledge obtained from our previous learning experiences when we have new tasks. Usually, the more related the new task is to our previous learning, the more easily we can handle it. Common machine learning algorithms are often designed to solve single and isolated tasks. However, the study of TL aims to develop methods to transfer knowledge learnt from one or more source tasks and apply this knowledge to improve the learning process in a different but related target task. (Torrey and Shavlik, 2009)

In general, we use TL to extract weights learnt from the source task and apply these weights to a problem on a related target task. There are two important concepts frequently used in TL: freezing and fine-tuning. Freezing means to freeze the weights that are learnt from the source task and only update the weights in the last classification layer. (Azizpour et al., 2016) The method of freezing weights in TL is also called feature extraction. By using this freezing approach, most of the weights in the neural network (NN) are frozen and act as a feature extractor to solve the target task. The other concept, fine-tuning, is the opposite of freezing: all or most of the weights learnt from the source task are retrained and updated to fit the target task. These pre-trained weights act as a regularizer that prevents overfitting during the learning process of the target task. (Agrawal, Girshick, and Malik, 2014)

In this thesis, a novel technique that can be used in TL is proposed, which enables the adaptive integration of multiple fine-tuning models with different fine-tuning settings. It is denoted as MultiTune and will be referred to by this name in the later chapters. The MultiTune approach is used in TL for image classification in this project. The methodology, details of experiments, results and analysis will be covered in the later chapters.

  • 2 Chapter 1. Introduction

    1.2 Motivation

There have been numerous studies on the transferability of features in deep neural networks (DNN). One of the most thorough is Yosinski et al.'s paper published in 2014, which discussed the transferability of features when using convolutional neural networks (CNNs). It is stated in their paper that features on the first layer seem to occur regardless of the exact loss function and natural image dataset and can be considered 'general', while the features on the last layer depend largely on the dataset and task and can be considered 'specific'. Their study quantified the extent to which a particular layer is general or specific and found that even features transferred from distant tasks are better than random weights. (Yosinski et al., 2014)

The results of Yosinski et al. have been widely adopted by many researchers and their method has become one of the best-practice paradigms when applying TL with CNNs. (Plested and Gedeon, 2019) However, the paradigm is not adopted by everyone, and some recent studies showed results different from Yosinski et al.'s. For instance, Azizpour et al. showed that all the layers except the last one of a CNN should be transferred when the target task is similar to the source task. Furthermore, they also found that all but the final two or three layers of a CNN should be transferred when the target task is the least related to the source task. (Azizpour et al., 2016)

Therefore, there are no golden rules or a perfect paradigm that can be followed and applied to a modern TL problem. The number of layers that should be transferred to the target task still depends on the similarity between the source and target tasks, so each TL problem has a different optimal number of layers to be transferred. That means transferability is still problem-based or problem-dependent. As a result, research in this particular area has focused on studying the transferability of the layers in DNNs and the optimal fine-tuning settings for different tasks, and also on finding an automatic way to decide how many layers should be transferred for different TL tasks. Guo et al.'s paper "SpotTune: Transfer Learning through Adaptive Fine-tuning", published in 2018, discussed a way of adaptive fine-tuning called SpotTune. By using SpotTune, an NN can find the optimal fine-tuning strategy per instance for the target data. The details of SpotTune will be elaborated in the next chapter. In short, it trains a policy network to make routing decisions on whether to pass the image through the fine-tuned layers or the frozen layers. (Guo et al., 2018) However, SpotTune is also not a perfect solution that fits every task in TL. There are some flaws in its algorithm and the method can be improved in several ways.

    1.3 Project Scope

In this project, the final deliverable is to successfully apply MultiTune and related transfer learning techniques to improve the performance of specific

  • 1.3. Project Scope 3

image classification tasks. TL will be applied in CNNs for image classification using the same datasets as SpotTune. The results obtained by SpotTune will be used as the baseline of this paper, as SpotTune can be regarded as one of the most recent and effective methods for TL. To achieve the final objective, all the research and experiments conducted during the project duration are contained in the project scope. These include the background research of the related areas and concepts such as NN, DL and CNNs, applying state-of-the-art techniques to improve TL in different ways, and the analysis of the eventually obtained results.

  • 4

    Chapter 2

    Background and Related Work

    2.1 Background

    2.1.1 Neural Networks and Deep Learning

A standard NN consists of many neurons that can be regarded as simple, connected processing units. Each neuron in an NN produces a sequence of real-valued activations. In general, input neurons get activated through sensors perceiving the environment or through the user's input to the NN; other neurons are activated by the weights connecting them to the previous layer. (Schmidhuber, 2015)

    FIGURE 2.1: An example of perceptron algorithm. (Arunava,2018)

The NN and the architecture of NNs are bio-inspired; the concept of the artificial NN and the artificial neuron can be dated back to the 1940s. In 1943, McCulloch and Pitts proposed the initial concept of the artificial NN with the mathematical model of an artificial neuron, which led to the era of artificial NNs. (McCulloch and Pitts, 1943) However, this early NN architecture was not able to learn. (Schmidhuber, 2015) The artificial NN was then further developed by Frank Rosenblatt in 1958. The perceptron algorithm was invented by him at the Cornell Aeronautical Laboratory. In fact, the perceptron was initially intended as a machine rather than an algorithm and was used in the recognition of simple images. The perceptron algorithm is able to be trained and is able

  • 2.1. Background 5

to learn during the training process. (Rosenblatt, 1958) However, the perceptron was single-layered at that time, which caused it to be only capable of learning linearly separable patterns. For instance, a single-layered perceptron was unable to handle a simple XOR gate. For visualisation, an example of the basic concept of the perceptron algorithm is shown in Figure 2.1.

After the invention of the perceptron algorithm, NNs developed slowly until the term Backpropagation (BP) and its general use in NNs was announced by Rumelhart, Hinton and Williams in 1986. (Rumelhart, Hinton, and Williams, 1986) Deep Learning (DL) can be considered a kind of deep artificial NN. However, the word 'deep' is abstract and hard to define. In general, NNs with more than three layers (including the input layer and output layer) can be qualified as DL. (Shu, 2020) When the number of layers in an NN reaches more than 10, it is usually considered Very Deep Learning. (Schmidhuber, 2015)

In recent years, DL has been applied in different industrial fields, obtained significant achievements and become one of the most popular topics of the 21st century, especially in the last decade. Deep Artificial Neural Networks (DNN) have won numerous contests in pattern recognition and machine learning. (Schmidhuber, 2015) The idea of DL is only introduced here; the details will be covered later in this chapter.

    2.1.2 Convolutional Neural Network

In the area of DL, the CNN is usually referred to as a class of DNN which is most commonly applied to image-related problems. The history of the CNN can be dated back to the 1950s and 1960s, when Hubel and Wiesel identified two basic visual cell types in the brain. They called these two types of visual cell simple cells and complex cells (receptive fields) and proposed a cascading model of these two types of cells for use in pattern recognition tasks. (Hubel and Wiesel, 1968) Yann LeCun et al. proposed a method to use BP to learn the weights of convolution kernels directly from images of hand-written numbers. Their method enabled the fully automatic learning of CNNs, which performed better than manually designed weights. It was proven to be suited to a broader range of image recognition problems. (LeCun et al., 1998) It is believed that their approach is the foundation of modern computer vision (CV). After that, the CNN has attracted numerous studies and become one of the most popular topics of DL, especially after 2012 when the architecture of AlexNet came out. AlexNet is a CNN architecture invented by Alex Krizhevsky et al. which won the ImageNet Large Scale Visual Recognition Challenge 2012. In fact, AlexNet not only won the championship but also outperformed the second place by 11% in terms of accuracy. (Krizhevsky, Sutskever, and Hinton, 2012) A visualization of the structure of AlexNet is shown in Figure 2.2.

A typical CNN usually contains five different building blocks, which are the convolutional layer, ReLU layer, pooling layer, fully connected layer and loss layer.

  • 6 Chapter 2. Background and Related Work

    FIGURE 2.2: A visualization of AlexNet. (Han et al., 2017)

    Convolutional Layer

Convolution is a basic operation usually used in fields related to computer vision (CV), which involves a mathematical calculation between a filter (also called a kernel) and the image. The convolution process can be implemented by using the formula listed below, where F is the filter, H is the original image and G is the image after the convolution process.

G[i, j] = ∑_{u=−k}^{k} ∑_{v=−k}^{k} H[u, v] F[i − u, j − v]    (2.1)

The basic idea of convolution is to use a sliding window to apply a filter to the image region by region and extract the features of each region. Usually, one convolutional kernel is able to extract a specific type of feature, so a typical CNN architecture usually contains several convolutional kernels to detect various features. The following Figure 2.3 visualizes the process of convolution.

    FIGURE 2.3: An example of convolution process. (Wicht, 2018)

To control the number of free parameters, a parameter sharing scheme is usually used in convolutional layers. The assumption is made that a patch feature that is useful to compute at some spatial positions should also be useful to compute at other positions. Assuming a convolutional layer has a dimension of [55x55x96] and denoting a slice of depth as a depth slice, the neurons in each depth slice will be constrained to use the same weights and

  • 2.1. Background 7

bias. Thus, in this situation, there will be only 96 unique sets of weights. By using parameter sharing, the number of free parameters is significantly reduced.
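As a minimal PyTorch sketch of this point (the 11 x 11 kernel size, 3 input channels and 227 x 227 input are assumed for illustration and are not taken from this thesis), the parameter count of a convolutional layer depends only on the filter shape, not on the 55 x 55 output resolution:

```python
import torch
import torch.nn as nn

# 96 filters shared across all spatial positions of the output feature map.
conv = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4)

x = torch.randn(1, 3, 227, 227)            # dummy input batch
print(conv(x).shape)                       # torch.Size([1, 96, 55, 55])

# Only 96 unique filters (plus biases) exist, regardless of output size.
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)                            # 96 * (3*11*11) + 96 = 34,944
```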

    Pooling Layer

The pooling layer is another important part of a CNN. It can be regarded as a non-linear down-sampling. There are several widely used functions to implement pooling, such as max pooling and average pooling. The former, max pooling, is the most common pooling method, which divides the image uniformly into small regions and outputs the maximum pixel for each region. Figure 2.4 demonstrates a simple example of how max pooling works. Intuitively, the exact location of a feature is not essential if the rough location of this feature is known. That is one of the reasons why pooling is used in CNNs. By adding pooling layers into a CNN, the number of parameters and amount of computation can be significantly reduced. It is a rule of thumb to insert a pooling layer between two neighbouring convolutional layers in a CNN. The most common form of the pooling layer is a 2x2 sized filter with a stride of 2. The stride here means the step size; in other words, it is the number of pixels shifted each time. When the stride is 2, the filter is moved 2 pixels each time.

    FIGURE 2.4: An example of max pooling.
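A short sketch of the 2x2, stride-2 max pooling described above (the input values are arbitrary examples):

```python
import torch
import torch.nn as nn

# Each non-overlapping 2x2 region is replaced by its maximum value,
# halving the spatial resolution.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.tensor([[[[1., 3., 2., 4.],
                    [5., 6., 1., 2.],
                    [7., 2., 9., 1.],
                    [3., 4., 6., 8.]]]])   # shape (1, 1, 4, 4)
print(pool(x))
# tensor([[[[6., 4.],
#           [7., 9.]]]])
```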

    ReLU Layer

Rectified Linear Units (ReLU) is a non-linear activation function widely used in DL. The ReLU layer is actually a layer of the activation function. It is called the ReLU layer because ReLU is the most commonly used activation function in CNNs. In fact, other activation functions can also be used in this layer, such as the hyperbolic tangent (tanh) and the sigmoid function. The formula of ReLU is shown below.

f(x) = max(0, x)    (2.2)

  • 8 Chapter 2. Background and Related Work

The ReLU activation function compares each value in an activation map and replaces the value with 0 if it is negative. As a result, the nonlinear properties of the decision function are increased. ReLU is the most common activation function used in CNNs because it offers much faster training than other activation functions. Due to the simplicity of its derivative, the computation during the BP process is much lighter than for other activation functions.

    Fully Connected Layer

The Fully Connected (FC) layer is exactly the same as the normal hidden layer in a regular artificial NN. Neurons in a fully connected layer connect to all the activations in the previous layer. The purpose of the FC layer is to reason about high-level features of the images after they have gone through the convolutional layers and ReLU layers.

    Loss Layer

Similar to the FC layer, the loss layer is the same as the normal loss layer used in a regular artificial NN. Usually, the loss layer is the final layer of an NN and can be regarded as a layer of the loss function. The loss function is used to calculate the difference between the predicted output of the NN and the target output of the images. After getting the loss, the BP process can be conducted from the final layer back to the input layer to update the weights in between. There are several loss functions used in modern DL, such as Mean Square Error (MSE) loss used for predicting real values and Cross Entropy (CE) loss used for the classification of multiple classes. For an image classification problem, CE loss is the most common loss function used in CNNs. The formula of CE loss is shown below, where ti is the ground truth and yi is the CNN's output for each class i in the dataset C.

CE = − ∑_{i}^{C} t_i log(y_i)    (2.3)
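A minimal sketch of Equation 2.3 in PyTorch (the logits and target below are arbitrary example values): only the log-probability of the true class contributes to the loss, and the built-in loss folds the softmax and logarithm into one call.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw network outputs for 3 classes
target = torch.tensor([0])                  # ground-truth class index

probs = F.softmax(logits, dim=1)
manual_ce = -torch.log(probs[0, target])    # -sum_i t_i * log(y_i) with one-hot t

builtin_ce = F.cross_entropy(logits, target)  # softmax + log handled internally
print(manual_ce.item(), builtin_ce.item())    # both print the same value
```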

    2.1.3 Deep Residual Network

Recent studies on image recognition and image classification usually rely on very deep NNs. However, stacking more layers in an NN does not always lead to a better result. This is because very deep NNs usually suffer from vanishing/exploding gradient issues, which hamper convergence of the network from the beginning. Although the effect of these problems has been significantly reduced by normalized initialization and batch normalization (Ioffe and Szegedy, 2015), it is still a huge obstacle when the depth of the NN becomes very large. Besides, He et al. found that when an NN becomes deeper, it also

  • 2.1. Background 9

suffers from a degradation problem: both training and testing accuracy get saturated and then degrade rapidly. (He et al., 2015) In their paper, this degradation problem is addressed by using a deep residual learning framework. A CNN which uses this framework is called a Deep Residual Network (ResNet).

Instead of fitting a desired underlying mapping, residual learning focuses on letting the NN layers fit a residual mapping. Denoting the underlying mapping as H(x), residual learning lets the stacked nonlinear layers fit a residual mapping of F(x) := H(x) − x. Then, the original mapping becomes F(x) + x. He et al. hypothesized that the optimization of the residual mapping should be easier than that of the original mapping. ResNet uses identity mapping by shortcuts to enable residual learning. (He et al., 2015) A visualization of a residual block is shown below in Figure 2.5.

    FIGURE 2.5: A visualization of a residual block. (He et al., 2015)

Usually, residual learning is implemented every few stacked layers. The output of the layers can be expressed using the following Equation 2.4, where x and y are the input and output vectors of the layers. The function F represents the residual mapping. For the residual block shown in Figure 2.5, F = W2σ(W1x), where σ is the ReLU activation function and W is the weight between layers. (He et al., 2015) By using ResNet, CNNs can be constructed with significantly more layers with minimized degradation and gradient vanishing problems. This has been proven by the ILSVRC 2015 classification task, where ResNet won first place with a 152-layer CNN. As a comparison, the previously mentioned AlexNet only has 8 layers. For visualization, a ResNet-34 is shown in Figure 2.6.

    y = F(x, {Wi}) + x (2.4)

    FIGURE 2.6: A visualization of ResNet34. (He et al., 2015)
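A minimal sketch of a residual block in the spirit of Figure 2.5 and Equation 2.4: two 3x3 convolutions form the residual mapping F(x) and the identity shortcut adds x back. The batch-normalization placement and channel count are assumptions for illustration, not details taken from He et al. or this thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(residual + x)          # identity shortcut: F(x) + x

block = ResidualBlock(64)
x = torch.randn(1, 64, 18, 18)
print(block(x).shape)                        # torch.Size([1, 64, 18, 18])
```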

  • 10 Chapter 2. Background and Related Work

    2.1.4 Attention Mechanism

The attention mechanism is usually used in Recurrent Neural Networks (RNN) for Natural Language Processing (NLP) related tasks. The traditional seq2seq model used for machine translation has a critical and apparent disadvantage: its fixed-length context vector design is incapable of remembering long sentences. (Sutskever, Vinyals, and Le, 2014) The attention mechanism was born to resolve this problem and help memorize long sentences in neural machine translation. The attention mechanism creates shortcut connections between the context vector and the entire source input. Also, the weights of these shortcut connections can be modified for each output element. Due to these shortcut connections and the weights in between, the context vector is able to access the entire input sequence. Therefore, the forgetting issue is addressed. (Bahdanau, Cho, and Bengio, 2014)

    2.1.5 L2 Regularization

Regularization is widely used in ML and DL to prevent the model from overfitting. Overfitting happens in an NN when the model fits the training dataset too closely, which makes the model incapable of predicting unseen data. In other words, when the training accuracy is very high but the testing accuracy is low, it may indicate that overfitting exists in the model.

Also, regularization is implemented in the NN to improve the generalization of a learned model. By implementing regularization, the learned model will be simplified and sparse. Usually, regularization is applied to the NN by adding a penalty term, called a regularizer, into the loss function. The L2 regularizer is the most common penalty term used in NNs and DL. In L2 regularization, the L2 penalty is the sum of the squares of all the weights, which is added to the loss function. When using the CE loss function, L2 regularization can be implemented by using Equation 2.5 listed below, where w is the weight and λ is a parameter controlling how much penalty is applied to the model.

L(y, t) = − ∑_{i}^{C} t_i log(y_i) + (λ/2) ∑_{i}^{W} w_i^2    (2.5)
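A minimal sketch of Equation 2.5 (the linear model, λ value and batch below are placeholders for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 3)
lam = 1e-4                                   # regularization strength λ (assumed value)

logits = model(torch.randn(4, 10))
targets = torch.randint(0, 3, (4,))

# (λ/2) * sum of squared parameters added to the CE loss.
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = F.cross_entropy(logits, targets) + 0.5 * lam * l2_penalty
loss.backward()
```

In practice the same effect is usually obtained through the optimizer's weight_decay argument, e.g. torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=lam).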

    2.2 Related Works

    2.2.1 Image Classification by Using TL

Image classification is a very popular topic in the research fields of CV and DL. The CNN has been proven to be a successful technique and has been widely used in image classification. In the last decade, many effective CNN models were proposed for image classification and CV related tasks, such as AlexNet, VGG, and ResNet. (Krizhevsky, Sutskever, and Hinton, 2012; Simonyan and Zisserman, 2014; He et al., 2015) However, training a CNN

  • 2.2. Related Works 11

from scratch is usually unable to let the model learn very well, especially for a small image dataset. (Kornblith, Shlens, and Le, 2018) Therefore, TL is used to transfer the pre-trained weights from a source dataset, which is usually very large, to the target dataset, which usually has a smaller size than the source dataset. As mentioned before, freezing and fine-tuning are the two main ways to implement TL. It is found that fine-tuning the transferred weights gives better performance than freezing the transferred weights, both when the source dataset is highly related to the target dataset and when it is not so related. (Yosinski et al., 2014; Kornblith, Shlens, and Le, 2018) A commonly used source dataset is the ImageNet dataset, as it is a very large image dataset containing more than 1.2 million images and also a generic dataset with 1000 categories. (Deng et al., 2009) Therefore, it is a common practice in image classification to pre-train the model on the ImageNet dataset, transfer the learnt weights to the target dataset and fine-tune these weights. It has been shown in Kornblith et al.'s research that image classification using this approach can achieve extraordinary results on different target datasets. (Kornblith, Shlens, and Le, 2018)

    2.2.2 Input-dependent/Selective Execution

Input-dependent/selective execution has been widely used in computer vision tasks such as cascade detectors and hierarchical classification. The Dynamic Deep Neural Network (D2NN) proposed by Liu and Deng in 2017 is one type of NN that uses selective execution. A D2NN is a feed-forward deep neural network which contains one or more control modules. A control module is defined as a sub-network which makes decisions on whether neurons will execute or not. Therefore, the D2NN enables selective execution in an NN. (Liu and Deng, 2017)

In a D2NN, the normal NN and the control modules are trained together from the beginning to the end. This is realized by the integration of BP and Reinforcement Learning (RL). In their paper, it is stated that Q-learning is used as the form of RL. The architecture of a D2NN is similar to a normal CNN except that it contains control edges that can cause some neurons to be dropped. A control edge is active only if it has the highest score among all the control edges. Several architectures are proposed in their paper, including the high-low capacity D2NN motivated by the capacity of the nodes, the cascade D2NN motivated by the cascade design commonly used in computer vision, the chain D2NN that is like a chain containing a control node selecting between two regular nodes at every link, and the hierarchical D2NN which classifies images into coarse classes first, followed by fine classes. (Liu and Deng, 2017) Figure 2.7 shows these four different structures of D2NN.

The idea of the D2NN is very interesting: the optimal path of the images through the NN is determined during the training process of the NN. If it could be applied to TL, the efficiency and effectiveness of the fine-tuning process would be highly likely to improve. However, their paper was published without code.

  • 12 Chapter 2. Background and Related Work

    FIGURE 2.7: Different architectures of D2NN. (Liu and Deng,2017)

So, it is very hard to know how they implemented the mentioned different architectures of D2NN and how they combined their D2NN with a CNN.

    2.2.3 SpotTune

As mentioned before, SpotTune is an adaptive fine-tuning method which is able to determine which layers should be frozen and which layers should be fine-tuned per training example. The adaptive fine-tuning is achieved by training a policy network together with two parallel CNN models. One of the CNN models has all its layers frozen, while the other CNN model has all its layers fine-tuned. The policy network outputs a decision vector containing 0 or 1 for each layer, where 0 means the image will go through the frozen layer, and 1 means the image will go through the fine-tuned layer. As a result, the optimal route of an image in terms of freezing or fine-tuning can be determined. (Guo et al., 2018)

However, as mentioned, the output of the policy network is either 0 or 1, which means it is discrete. As a result, the policy network is not differentiable. To solve this problem and make the policy network differentiable, the Gumbel Softmax method is introduced in their paper. The Gumbel-Max trick is a simple and effective way to draw samples from a categorical distribution. (Maddison, Mnih, and Teh, 2016) Equation 2.6 illustrates how the Gumbel Softmax works. In the equation, αi is the output of the policy network for each layer, Gi is sampled from the Gumbel distribution, and τ is a temperature parameter that controls the discreteness of the output vector. (Guo et al., 2018)

Y_i = exp((log α_i + G_i)/τ) / ∑_{j=1}^{z} exp((log α_j + G_j)/τ)    for i = 1, ..., z    (2.6)
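A minimal sketch of the Gumbel-Softmax relaxation in Equation 2.6: Gumbel noise is added to the log-probabilities before a temperature-scaled softmax, giving a differentiable, nearly one-hot routing decision. The two-option setup and the τ value below are illustrative assumptions; PyTorch also provides this directly as torch.nn.functional.gumbel_softmax.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax(log_alpha, tau=0.1):
    # Sample Gumbel(0, 1) noise via -log(-log(U)) with U ~ Uniform(0, 1).
    gumbel = -torch.log(-torch.log(torch.rand_like(log_alpha)))
    return F.softmax((log_alpha + gumbel) / tau, dim=-1)

# Two options per residual block: index 0 = frozen path, index 1 = fine-tuned path.
log_alpha = torch.log(torch.tensor([[0.3, 0.7]]))
decision = gumbel_softmax(log_alpha)
print(decision)    # close to a one-hot vector, e.g. tensor([[0.99..., 0.00...]])
```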

ResNet is used as the CNN architecture in their paper due to its resilience to residual block swapping. (Guo et al., 2018) This means the residual blocks can be easily swapped between the frozen and fine-tuned models with little effect on the final performance. Figure 2.8, taken from their paper, illustrates the working procedure of SpotTune.

  • 2.2. Related Works 13

FIGURE 2.8: Illustration of SpotTune's working procedure. (Guo et al., 2018)

However, the policy network used in SpotTune is trained from scratch. It usually takes some time to train up the policy network before it gives meaningful decisions on whether to freeze or fine-tune at the beginning of the training process. As a result, the initial stage of the training process is slow compared to other TL paradigms such as standard fine-tuning and Yosinski's approach. Also, due to the additional policy network and the parallel CNN model, the SpotTune method needs to train 3 NNs, including 2 CNNs, concurrently, which makes this approach extremely computationally expensive. Therefore, the entire training process of SpotTune is very slow. The original intention of TL is to apply learnt knowledge from the source task to the target task, so that the learning process for the target task can be easy and fast with good accuracy. But it seems that the performance of SpotTune in terms of efficiency contradicts the intention of TL.

    2.2.4 L2-SP Regularization

In TL, it is assumed that the pre-trained model extracts generic features, and these generic features are then fine-tuned to be more specific to fit the target task if fine-tuning is used. Thus, when using fine-tuning to solve a related target task, the NN is initialized with pre-trained parameters (e.g. weights, biases) learned from the source task. However, it is found that some of these parameters may be tuned very far away from their initial values during the process of fine-tuning. This may cause significant losses of the initial knowledge transferred from the source task, which is assumed to be relevant to the target task. (Li, Grandvalet, and Davoine, 2018)

Parameter regularization is widely used in DL nowadays, especially when learning from small datasets. As mentioned before, regularization is used to facilitate optimization and avoid overfitting when learning from scratch. In TL, the role of regularization is about the same. However, the starting point of the fine-tuning process should convey information which fits the source task. Li et al. proposed a novel type of regularization to reduce losses of the initially transferred knowledge. In their paper, the pre-trained model is not

  • 14 Chapter 2. Background and Related Work

only used as the starting point of the fine-tuning process but also used as the reference in the penalty to encode an explicit inductive bias. This novel type of regularization is called L2-SP regularization, with SP referring to the Starting Point of the fine-tuning process. (Li, Grandvalet, and Davoine, 2018) Let w be the weights used in the model of the target task and w0 be the weights of the model learned from the source task; the formula of the L2-SP regularizer is shown in Equation 2.7. (Li, Grandvalet, and Davoine, 2018)

Ω(w) = (α/2) ||w − w0||_2^2    (2.7)

However, as mentioned in Chapter 1, the last classification layer is usually modified or changed to fit the purpose of the target task. Due to the change of architecture in the last layer, there is no one-to-one mapping between w and w0 in this layer. Therefore, two penalties are introduced in L2-SP regularization to solve this problem. Defining the weights of the layers except for the last one as w and the weights of the last layer as wS̄, the complete version of the L2-SP regularizer is shown in Equation 2.8. (Li, Grandvalet, and Davoine, 2018) α and β in this equation are the regularization factors that control the strength of the penalty.

Ω(w) = (α/2) ||w − w0||_2^2 + (β/2) ||wS̄||_2^2    (2.8)
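A minimal sketch of the L2-SP penalty in Equation 2.8: transferred weights are pulled towards their starting point w0, while the freshly initialised last layer receives a plain L2 penalty. The parameter-name prefix, α and β values are hypothetical and not taken from Li et al. or this thesis.

```python
import torch

def l2_sp_penalty(model, source_state, last_layer_prefix="fc", alpha=0.01, beta=0.01):
    sp_term, l2_term = 0.0, 0.0
    for name, w in model.named_parameters():
        if name.startswith(last_layer_prefix):
            # New classification layer: no counterpart in w0, plain L2 penalty.
            l2_term = l2_term + (w ** 2).sum()
        else:
            # Transferred layer: penalise distance to the starting point w0.
            w0 = source_state[name].detach()
            sp_term = sp_term + ((w - w0) ** 2).sum()
    return 0.5 * alpha * sp_term + 0.5 * beta * l2_term
```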

Li et al. applied L2-SP regularization in ResNet and did experiments related to image classification. Also, to study the effectiveness of this type of regularization, they trained their models on two source datasets for comparison: one is ImageNet for generic object recognition, and the other is Places365 (Zhou et al., 2016) for scene classification. The target tasks include generic image classification, specific image classification, and scene classification. In their results, it is found that L2-SP is much more effective than the standard L2 penalty that is commonly used in fine-tuning. (Li, Grandvalet, and Davoine, 2018)

The idea of L2-SP is very interesting and intuitive. It can prevent overfitting and also retain the knowledge learnt from the source task. It can be regarded as the state-of-the-art regularization used in TL. L2-SP regularization will be used as one of the techniques in this project to improve TL.

    2.2.5 Attention Mechanism in CNNs

Research on the attention mechanism has been extended into the computer vision field recently, and various other forms of attention mechanisms have been explored by researchers. Xu et al. applied attention mechanisms in computer vision and proposed two important concepts of the attention mechanism: soft attention and hard attention. Soft attention calculates weights for all patches in the source image, which results in a smooth and differentiable model but may be expensive if the source input is large. On

  • 2.2. Related Works 15

the contrary, hard attention only pays attention to one patch at a time, which results in less calculation, but the model is non-differentiable and hard to train. (Xu et al., 2015) Similarly, Luong et al. introduced "global" and "local" attention, where global attention can be regarded as soft attention and local attention is a combination of hard and soft attention. (Luong, Pham, and Manning, 2015) A visualization of the global and local attention mechanism is shown in Figure 2.9.

FIGURE 2.9: A visualization of the global and local attention mechanism. (Luong, Pham, and Manning, 2015)

The application of the attention mechanism in computer vision and CNNs makes it a potential technique to be used in TL as well. Multiple CNN models with different fine-tuning settings can be trained concurrently and combined together by using the attention mechanism. The performance of TL can potentially be improved by using the attention mechanism in such a way.

  • 16

    Chapter 3

    Methodology and Experiment

    3.1 Introduction of Datasets

As mentioned previously, the results of SpotTune are taken as the baseline of this project. So, for ease of comparison, the same datasets are used in this project, which are the Visual Decathlon datasets. (Rebuffi, Bilen, and Vedaldi, 2017) The Visual Decathlon challenge contains 10 datasets from multiple visual domains. These datasets are listed in Table 3.1 with a short description for each of them.

In order to reduce the computational burden of the evaluation process, the images in the Visual Decathlon datasets are resized isotropically to have a shorter side of 72 pixels. As shown in Table 3.1, these 10 datasets cover different visual domains. Some of them are aimed at a more specific domain, such as FGVC-Aircraft, the Describable Texture Dataset and Flowers102, and some of them are aimed at a more generic domain, such as CIFAR100 and ILSVRC12. Due to computational limitations, it is impossible to use all of these 10 datasets in this project. Inspired by Li et al.'s paper, where they test their method with different target domains, different target domains are also selected to test the performance of the model in this project. (Li, Grandvalet, and Davoine, 2018) It is hypothesized that the method proposed in this thesis should outperform SpotTune on both a more generic target image dataset and a more specific target image dataset. So, FGVC-Aircraft and CIFAR100 are used throughout this project, which represent a more specific and a more generic dataset, respectively.

Dataset                              Description
FGVC-Aircraft                        10,000 images of aircraft, 100 images for each of 100 different
                                     aircraft models (e.g. Boeing 737-400, Airbus A310). Training,
                                     validation and testing sets are equally divided, with around
                                     3,333 images each.
CIFAR100                             60,000 colour images for 100 object categories. 40,000 for
                                     training, 10,000 for validation, 10,000 for testing.
Daimler Mono Pedestrian              50,000 grayscale pedestrian and non-pedestrian images.
Describable Texture Dataset          A texture dataset, which contains 5,640 images and 47 categories.
The German Traffic Sign Recognition  43 common traffic sign categories in different resolutions.
Flowers102                           102 flower categories from the UK with 40 to 258 images for
                                     every category.
ILSVRC12                             ImageNet12, contains 1,000 categories and 1.2 million images.
Omniglot                             1,623 different handwritten characters from 50 different alphabets.
The Street View House Numbers        Real-world digit recognition dataset, contains around 70,000 images.
UCF101                               An action recognition dataset of realistic human action images,
                                     contains 13,320 images.

TABLE 3.1: The Visual Decathlon Datasets. (Rebuffi, Bilen, and Vedaldi, 2017)
[FGVC-Aircraft and CIFAR100 are the datasets used in this project.]

    3.2 Methodology

    3.2.1 CNN architecture

The CNN architecture used in this project is the same as the architecture used in SpotTune, which was originally proposed by Rebuffi et al. The architecture is a type of ResNet with 26 layers, which is denoted as ResNet-26. (Rebuffi, Bilen, and Vedaldi, 2018) There are 3 macro blocks of convolutional layers in this CNN. The first block has 64 output feature channels, the second block has 128 output feature channels, and the last block has 256 output feature channels. Also, each macro block contains 4 residual blocks, and every residual block consists of 2 convolutional layers with 3 x 3 filters and the shortcut connection usually used in ResNet. (Rebuffi, Bilen, and Vedaldi, 2018) Average pooling with a stride of 2 is used to perform the downsampling in this CNN architecture, and ReLU layers are used as the activation layers. Besides, this architecture also contains a convolutional layer at the beginning and a fully connected layer at the end, which brings the total number of layers in this architecture to 26. A visualization of this CNN architecture is shown in Figure 3.1.

FIGURE 3.1: Visualization of ResNet26. (Rebuffi, Bilen, and Vedaldi, 2018)

  • 18 Chapter 3. Methodology and Experiment

As mentioned before, SpotTune uses two parallel CNN models, one for freezing and the other for fine-tuning, and a policy network is also used in SpotTune to generate the routing decision for each image per layer. Similar to SpotTune, two parallel CNN models are also used in this project. As a result, there will be two ResNet-26 CNN models in the CNN architecture for this project. The purpose of using two parallel CNN models is to easily change the fine-tuning settings of these two models. This architecture also facilitates the implementation of MultiTune in this project.

    3.2.2 Data Loading and Transformation

    Data Loading

SpotTune uses the COCO API to prepare the data loader and uses PyTorch's (Paszke et al., 2019) ImageFolder to load the images. COCO stands for Common Objects in Context, which is a large-scale object detection, segmentation, and captioning dataset. (Lin et al., 2014) The COCO API in Python aims to match the annotations of the images in the COCO dataset, so that the images in the COCO dataset can be easily loaded.

In fact, the Aircraft and CIFAR100 datasets can be easily loaded purely by using PyTorch's ImageFolder function without the COCO API. Using the COCO API does not provide any help for image loading but makes the code more complex and harder to understand. Both Aircraft and CIFAR100 have training, validation, and testing sets already defined, so a data split is not required in this project. However, it is required to create a different data loader for each set; for example, a training data loader for the training set and a validation loader for the validation set.
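A minimal sketch of this ImageFolder-only loading, assuming a directory layout with one sub-folder per class under train/ and val/ (the paths and batch size are placeholders, not the thesis settings):

```python
import torch
from torchvision import datasets, transforms

transform = transforms.ToTensor()   # the full transformation pipeline is described below

train_set = datasets.ImageFolder("data/aircraft/train", transform=transform)
val_set = datasets.ImageFolder("data/aircraft/val", transform=transform)

# Separate loaders for the predefined training and validation splits.
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=128, shuffle=False)
```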

    Data Transformation

As the images of the Visual Decathlon datasets are all small, with a shorter side of 72 pixels, the images are first resized accordingly, with a shorter side of 72 pixels. Resizing the images to a size that coincides with the size of the images in the source task makes the training of the model easier. This is because the ResNet-26 models used in this project are pre-trained on ImageNet12 at a smaller size (a shorter side of 72 pixels). Usually, it is good to keep the sizes of images in the target task the same as those in the source task. This also ensures all images in the Aircraft and CIFAR100 datasets align with this size without any outliers present.

The images are also center cropped with a size of 72 x 72. The center cropping keeps only the features of images in the center region; any area other than the center (72 x 72 in this case) is removed from the images. Data augmentation is a common technique used in DL, which modifies images during training so that the NN sees additional images created by flipping or rotating the originals along different axes and angles. After seeing more variations of the same images, the model can obtain more knowledge from the images and has a better chance of identifying their classes. This usually results in better training performance. RandomHorizontalFlip is used to flip the images randomly

  • 3.2. Methodology 19

in the data transformation. By default, RandomHorizontalFlip flips images around the vertical axis with a probability of 50%, which means that around half of the images will be flipped horizontally during the training process.

Besides, image normalization is performed during the loading process. Normalization of images usually refers to the statistical normalization of the pixel values in the images. A common way to do the normalization is to subtract the mean pixel value of the whole image dataset for each channel from the pixel value of each image on its corresponding channel, and then divide by the standard deviation of the pixel values of the whole dataset. Equation 3.1 illustrates the formula of the normalization, where I stands for each pixel value of the images, n stands for the channel, µ is the mean of channel n for the whole dataset, and σ is the standard deviation of channel n for the whole dataset. Usually, normalization results in better training performance and faster convergence.

I′_n = (I_n − µ_n) / σ_n    (3.1)

In general, when applying TL to a target task, we use the mean and standard deviation of the target dataset to do the normalization. There are two target datasets used in this project, Aircraft and CIFAR100, so there are two sets of mean and standard deviation, which are listed in Table 3.2. The mean and standard deviation of these datasets are taken from Rebuffi et al.'s implementation of the method proposed in their paper. (Rebuffi, Bilen, and Vedaldi, 2018)

Dataset          Mean of Each Channel                     Standard Deviation of Each Channel
FGVC-Aircraft    [0.47983041, 0.51074066, 0.53437998]     [0.21070221, 0.20508901, 0.23729657]
CIFAR100         [0.50705882, 0.48666667, 0.44078431]     [0.26745098, 0.25647059, 0.27607843]

TABLE 3.2: Mean and Standard Deviation of the Datasets. (Rebuffi, Bilen, and Vedaldi, 2018)
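A minimal sketch of the transformation pipeline described above, shown for the CIFAR100 split with the statistics from Table 3.2 (the exact ordering of the steps is an assumption):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(72),                  # shorter side resized to 72 pixels
    transforms.CenterCrop(72),              # keep the central 72 x 72 region
    transforms.RandomHorizontalFlip(),      # p=0.5 by default
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.50705882, 0.48666667, 0.44078431],
                         std=[0.26745098, 0.25647059, 0.27607843]),
])
```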

    3.2.3 Learning Method

There are mainly three paradigms or algorithms used in machine learning or deep learning, which are supervised learning, unsupervised learning and reinforcement learning. In short, supervised learning has a "teacher", which is the labels of the training set. The model learns the relationship between the input data and its labels by using a loss function and backpropagation, and then predicts the labels of unseen data that are not in the training set. On the contrary, unsupervised learning does not have a "teacher", which means that the input data has no predefined labels. Therefore, in unsupervised learning, it is assumed that a group of data points near to each other or in the same

  • 20 Chapter 3. Methodology and Experiment

region represents the same category or the same class. The regions that divide input data into different categories are statistically determined by unsupervised learning algorithms, such as K-means and Principal Component Analysis (PCA). RL is different from supervised learning and unsupervised learning. It is about taking suitable actions to maximize reward in a particular situation. It enables an agent to learn a mapping from states to actions by trial and error, so that the expected cumulative reward in the future can be maximized.

The type of learning algorithm used in this project is supervised learning. Both Aircraft and CIFAR100 have predefined labels, and therefore these predefined labels can be used in supervised learning for image classification. These labels are represented by the name of the folder for each class. However, these labels are abstract rather than concrete: every class of these datasets is named with a four-digit code. For example, in CIFAR100, the folder name '0001' represents the class apple. When using PyTorch, the labels are further encoded as integers; for example, '0' represents '0001' when training the model using PyTorch.

    3.2.4 Transfer Learning Method

Fine-tuning rather than freezing is used as the TL method in this project. As mentioned, there are two ResNet-26 models used in the method proposed by this thesis. Both of them have their weights fine-tuned after transferring from the source task (ImageNet12 in this case), without freezing any weights in any layers.

However, standard fine-tuning is not used in this project. Instead of transferring all the weights learnt from ImageNet12 and fine-tuning them to fit the target task, the weights in the last block of these two fine-tuning models are initialized randomly before training. The layers in the last block are the deepest layers in the NN, which extract the most specific features of the dataset the NN is trained on. Therefore, the random initialization is used to make these deeper layers fit the target task rather than the source task. As a result, the NNs will only retain the general features learnt from ImageNet12 and learn the specific features of the target datasets from scratch. The settings and hyper-parameters of these two models are set differently to achieve better training performance. The details of the settings and hyper-parameters will be elaborated in the Experiment section, as sketched below.
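A minimal sketch of this setup: load ImageNet12-pretrained ResNet-26 weights, then re-initialise the parameters of the last macro block so that only the general features are retained. The build_resnet26 constructor, the checkpoint path and the "layer3" naming of the last block are hypothetical placeholders, not the thesis's actual identifiers.

```python
import torch
import torch.nn as nn

model = build_resnet26(num_classes=100)                 # hypothetical model constructor
state = torch.load("resnet26_imagenet72.pth")           # assumed checkpoint path
model.load_state_dict(state, strict=False)              # last FC layer differs in size

# Randomly re-initialise the last macro block so its specific features
# are learnt from scratch on the target dataset.
for name, module in model.named_modules():
    if name.startswith("layer3") and isinstance(module, (nn.Conv2d, nn.BatchNorm2d)):
        module.reset_parameters()
```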

    3.2.5 Implementation of MultiTune

Inspired by the attention mechanism used in NLP and CV, MultiTune is implemented by adding a single-layer neural network after the convolutional layers in the last block and before the last fully connected layer. The attention mechanism usually requires an encoder and a decoder when aiming to solve problems related to NLP. When using a CNN to do computer vision related tasks, the convolutional layers act like the encoder. Unlike NLP, a decoder is not required in this implementation because we do not need a decoder to

  • 3.2. Methodology 21

    translate the encoded data into words like what NLP tasks usually do. Thus,the MultiTune method used in this thesis only contains an encoder but notcontains a decoder. This one-layer network is denoted as MultiTune modelhere.

In detail, the features extracted by the two ResNet-26 models after the last block are concatenated and then passed through the MultiTune model before going through the final fully connected layer. Theoretically, the MultiTune model should determine which features to take from these two different fine-tuning models. The MultiTune model proposed here can be expressed as Equation 3.2, where Z is the output of the MultiTune model, W represents the weights of the MultiTune model, X_1 is the output after the last convolutional block of the first fine-tuning model, X_2 is the output after the last convolutional block of the second fine-tuning model, and α is a factor that controls what portion of each model is used in the MultiTune model. After applying the MultiTune model, the output Z is passed to the last fully connected layer for classification. Figure 3.2 is a visualization of the MultiTune technique.

Z = W \ast \operatorname{concat}[\alpha X_1;\ (1 - \alpha) X_2] \qquad (3.2)

    FIGURE 3.2: Visualization of MultiTune.
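A minimal PyTorch sketch of Equation 3.2 is given below; the feature dimension, class count and module names are illustrative assumptions, not the exact values used in the implementation.

```python
# Minimal sketch of Equation 3.2 (feature sizes and names are illustrative).
import torch
import torch.nn as nn

class MultiTune(nn.Module):
    def __init__(self, feat_dim=256, num_classes=100, alpha=0.5):
        super().__init__()
        self.alpha = alpha
        # single-layer network W applied to the concatenated, weighted features
        self.integrate = nn.Linear(2 * feat_dim, feat_dim)
        # final fully connected classification layer
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x1, x2):
        # x1, x2: pooled features from the last block of each fine-tuning model
        z = self.integrate(torch.cat([self.alpha * x1, (1 - self.alpha) * x2], dim=1))
        return self.classifier(z)

# Example: combine two 256-dimensional feature vectors for a batch of 8 images.
out = MultiTune()(torch.randn(8, 256), torch.randn(8, 256))   # shape: (8, 100)
```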

    3.2.6 Activation Function, Loss Function and Optimizer

As mentioned in Section 2.1.2, the ReLU activation function is commonly used in CNNs. The same applies here, so ReLU is used as the activation function in the convolutional layers of the ResNet-26 models. However, the activation function used in the final FC layer is different. Because the project task is a multi-class image classification problem, the SoftMax activation function is the most suitable one to use. The SoftMax function is shown in Equation 3.3, where z is the output of the final classification layer and σ(z)_i is the probability of an image belonging to class i.

\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \ldots, K \text{ and } z = (z_1, \ldots, z_K) \in \mathbb{R}^K \qquad (3.3)

After the last fully connected layer, the model makes its prediction. The loss is then calculated and backpropagated by the optimizer to minimize the loss and make the prediction more accurate. The CE loss function is ideal for multi-class classification. As the target task is image classification and both Aircraft and CIFAR100 have 100 classes, CE loss is used in the method proposed by this thesis. However, the default CE loss in PyTorch is not used directly: the CE loss is modified to include the L2-SP regularizer. Equation 3.4 shows the modified CE loss with the L2-SP regularizer. The parameters used in this equation are the same as the ones in Equation 2.3 and Equation 2.8. As mentioned before, the inclusion of the L2-SP regularizer not only prevents overfitting, but also prevents significant loss of the initially transferred knowledge, which is why it is used in the proposed method.

L(y, t) = -\sum_{i}^{C} t_i \log(y_i) + \frac{\alpha}{2} \sum_{i}^{W} \lVert w_i - w_i^0 \rVert_2^2 + \frac{\beta}{2} \lVert w_{\bar{S}} \rVert_2^2 \qquad (3.4)
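The following sketch shows one possible PyTorch implementation of Equation 3.4; the dictionary of starting-point weights and the way parameters are split between the two penalty terms are assumptions made for illustration, not the exact code used in this project.

```python
# Minimal sketch of cross-entropy with an L2-SP penalty (Equation 3.4).
# `sp_weights` maps parameter names to their starting-point (pretrained) values;
# parameters not in that dict (e.g. the reinitialized last block and classifier)
# receive the plain L2 penalty weighted by beta.
import torch
import torch.nn.functional as F

def ce_with_l2sp(logits, targets, model, sp_weights, alpha=0.01, beta=0.01):
    loss = F.cross_entropy(logits, targets)               # -sum_i t_i log(y_i)
    for name, w in model.named_parameters():
        if name in sp_weights:
            loss = loss + (alpha / 2) * torch.sum((w - sp_weights[name]) ** 2)
        else:
            loss = loss + (beta / 2) * torch.sum(w ** 2)
    return loss
```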

The optimizer used is Stochastic Gradient Descent (SGD) with momentum. SGD is a commonly used optimizer for backpropagation in NNs and DL. In general, the weights and biases are updated according to their gradients after training on every batch. This can be regarded as an optimization process which continuously optimizes the weights and biases used in the NN and thereby minimizes the loss of the model. When the loss is reduced to a considerably small figure, the model is fully trained and fits the training data well. Momentum is used to prevent oscillations in the parameters' update directions during optimization, which is a common approach to accelerate the convergence of the model. SGD with momentum remembers the weight update from each iteration and makes each new update a linear combination of the gradient and the previous update (Sutskever et al., 2013). SGD with momentum is shown in Equation 3.5, where w is the weight, L is the loss function, and α is a factor that controls the portion of ∆w added to the weight update.

w := w - \eta \frac{\partial L}{\partial w} + \alpha \Delta w \qquad (3.5)

    3.3 Experiment

    3.3.1 Environments Used

The programming code in this project is implemented using PyTorch. PyTorch is an open-source deep learning library based on the Torch library and was originally developed by Facebook. Due to its ease of use, it has drawn attention from numerous researchers and has become one of the most widely used DL libraries in computer vision and natural language processing. The PyTorch version used in this project is 1.2.0. As PyTorch uses Python as its interface, Python is used as the programming language in this project; the version of Python used is 3.6.10. Besides, because the TL techniques are tested on image classification tasks, the torchvision package is also used. The torchvision package consists of popular datasets, model architectures, and common image transformations for computer vision. The version of torchvision used in this project is 0.4.0.

Due to the use of image datasets, CUDA is also used for GPU programming. CUDA is a parallel computing platform and Application Programming Interface (API) model created by Nvidia, which enables general-purpose GPU computing. With the GPU, the model can be trained much faster than with the CPU alone. The version of CUDA used is 10.1. The use of CUDA requires a CUDA-enabled GPU; an Nvidia GTX 1060 (6GB version) is used throughout this project. This project and its implementation can be found on GitHub at https://github.com/YuWang24/MultiTune.
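A quick sanity check of this environment can be done with the snippet below; the printed values are simply expected to match the versions listed above.

```python
# Check that the installed versions and the GPU match the setup described above.
import torch
import torchvision

print(torch.__version__)          # expected: 1.2.0
print(torchvision.__version__)    # expected: 0.4.0
print(torch.version.cuda)         # expected: 10.1
print(torch.cuda.is_available())  # True when a CUDA-enabled GPU (e.g. GTX 1060) is present
```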

    3.3.2 Data Preparation

The datasets used in this thesis are two datasets from the Visual Decathlon Challenge, Aircraft and CIFAR100. These two datasets have been introduced in Section 3.1 and are available for download at https://www.robots.ox.ac.uk/~vgg/decathlon/#download. The mean and standard deviation of each dataset are obtained from Rebuffi et al.'s implementation on GitHub at https://github.com/srebuffi/residual_adapters. (Rebuffi, Bilen, and Vedaldi, 2017)
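A minimal sketch of the per-dataset normalization is shown below; the mean and standard deviation values here are placeholders, since the actual figures are taken from Rebuffi et al.'s repository.

```python
# Minimal sketch of dataset normalization; mean/std below are placeholders,
# the real per-dataset statistics come from Rebuffi et al.'s implementation.
from torchvision import transforms

normalize = transforms.Normalize(mean=[0.5, 0.5, 0.5],    # placeholder values
                                 std=[0.25, 0.25, 0.25])  # placeholder values

train_transform = transforms.Compose([
    transforms.ToTensor(),   # convert PIL image to a [0, 1] tensor
    normalize,               # standardize with the dataset statistics
])
```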

    3.3.3 Baseline Preparation

As the results of SpotTune are taken as the baseline in this thesis, running SpotTune's code is the first step of the whole experiment. The code of SpotTune is available on GitHub at https://github.com/gyhui14/spottune. (Guo et al., 2018) So, the first step is to download the code to my personal device and set up an environment able to execute it. The programming environment used in this thesis has been described in Section 3.3.1.

The downloaded code is modified slightly rather than run as is. There are mainly two modifications, both in the data-loading code. One of the purposes of SpotTune is to achieve better performance in the Visual Decathlon Challenge, so, to let the model see more images, the original SpotTune code includes the validation set in the training set for each dataset. The model is then trained on this larger combined training set but still evaluated on the validation set, which is now part of the training data. As a result, the validation accuracy reaches a very high figure, almost 100%, after tens of epochs, which makes it very hard to evaluate the performance of the model. In general, a model should be evaluated on unseen data. Consequently, the first modification removes the validation set from the training set, so that the model is trained only on the training set and evaluated on the unseen validation set. After that, the validation accuracy becomes a useful measure for evaluating the model. The other modification is the batch size set in the data loader. The original batch size used in the code is 128. However, due to the limitation of GPU memory, a batch size of 128 exceeds the limit of my GPU and causes a "CUDA out of memory" error. The batch size is therefore reduced slightly from 128 to 120. This small reduction should not cause any noticeable difference in the results. To keep consistency, the batch size in my method is also set to 120.
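The following sketch illustrates these two data-loading changes (a separate, unseen validation loader and a batch size of 120); it is illustrative only and not the actual SpotTune code.

```python
# Illustrative sketch of the modified data loading: the validation set is kept
# separate from the training set, and the batch size is reduced to 120.
from torch.utils.data import DataLoader

def make_loaders(train_set, val_set, batch_size=120):
    # the validation set is NOT merged into the training set
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False)
    return train_loader, val_loader
```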

    3.3.4 MultiTune Setup

    Basic Settings

Most of the basic settings used here are the same as those of SpotTune to keep consistency, so that the two methods can be compared fairly. Both methods are run for 110 epochs without early stopping. They also use the same CNN architecture, ResNet-26, as mentioned before. CE loss is used because the target task is image classification, and the optimizer is SGD with a momentum of 0.9.
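A minimal sketch of these shared settings is shown below; the placeholder model and variable names stand in for the actual ResNet-26 networks.

```python
# Minimal sketch of the shared training settings (placeholder model).
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(256, 100)          # placeholder standing in for a ResNet-26

num_epochs = 110                     # run for 110 epochs, no early stopping
criterion = nn.CrossEntropyLoss()    # CE loss for 100-class classification
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```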

    Settings of the Fine-tuning Models

SpotTune trains two CNNs, one with all weights frozen and the other with all weights fine-tuned, plus a policy network that determines the routing decision of each image for each layer. Different from SpotTune, there is no policy network and no frozen model here. Instead of using one frozen and one fine-tuned model, both CNN models used here are fine-tuning models. However, as mentioned, these two CNN models have different fine-tuning settings. For convenience, the first CNN model is denoted Fine-Tuning A and the second Fine-Tuning B. The settings of these two models are listed in Table 3.3.

Model         | Reinitialization | Learning Rate                                 | Learning Rate Decay | Learning Rate Decay Rate
Fine-Tuning A | Last block       | 0.1                                           | [20, 50, 80]        | 0.1
Fine-Tuning B | Last block       | 0.01 for last block, 0.1 for all other layers | [20, 50, 80]        | 0.1

TABLE 3.3: Settings of the Two Fine-tuning Models.

Overall, compared to the code of SpotTune, instead of transferring all the learnt weights and biases, the weights and biases in the last block of these two fine-tuning models are reinitialized with random numbers as described before. So, the layers in the last block of the ResNet-26 can be trained from scratch and become more specific to the target dataset. Also, by running the code of SpotTune, it was found that the model learns slowly at the initial stage. This is potentially due to the original learning rate decay setting in the SpotTune code, which is [40, 60, 80] with a decay rate of 0.1, meaning the learning rate is multiplied by 0.1 each time the number of epochs reaches 40, 60 and 80. To address this issue and let the training perform better at the initial stage, the learning rate decay is moved earlier: a schedule of [20, 50, 80] is used in both Fine-Tuning A and Fine-Tuning B. Apart from the learning rate decay, Fine-Tuning A is the same as the fine-tuning model used in SpotTune and can be regarded as a standard fine-tuning model. What makes MultiTune different is the settings of the other fine-tuning model, Fine-Tuning B. As shown in Table 3.3, Fine-Tuning B uses different learning rates for different blocks: the learning rate of the first two blocks is set the same as Fine-Tuning A at 0.1, while the learning rate of the last block is set to 0.01 to reduce the size of the weight updates in that block. The purpose of this setting is to prevent the training process from jumping over good minima and to help it converge more effectively. A small learning rate in the last block also helps the model learn the features specific to the target task more thoroughly. A sketch of these settings is given below.
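This sketch uses a torchvision ResNet as a stand-in for the ResNet-26; the block names and the parameter grouping are illustrative assumptions.

```python
# Sketch of Fine-Tuning B's per-block learning rates and the [20, 50, 80] decay
# schedule. resnet18 stands in for ResNet-26; layer4 plays the role of the last block.
import torch.optim as optim
from torchvision import models

model_b = models.resnet18(pretrained=True)
last_block_params = list(model_b.layer4.parameters())
other_params = [p for name, p in model_b.named_parameters()
                if not name.startswith("layer4")]

optimizer_b = optim.SGD([
    {"params": other_params, "lr": 0.1},        # earlier blocks and classifier: lr 0.1
    {"params": last_block_params, "lr": 0.01},  # reinitialized last block: lr 0.01
], momentum=0.9)

# multiply every group's learning rate by 0.1 after epochs 20, 50 and 80
scheduler = optim.lr_scheduler.MultiStepLR(optimizer_b, milestones=[20, 50, 80], gamma=0.1)
```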

    Settings of the MultiTune model

As described before, the MultiTune model is a one-layer neural network that integrates the outputs of the two fine-tuning models. It is placed after the convolutional layers of the last block and before the final FC layer. Theoretically, the MultiTune model should be able to adaptively integrate the two fine-tuning models and extract the better parts from each one. As shown in Equation 3.2, there are two control factors, α and 1 − α, that define the portion of each model used in the adaptive integration. Here α is set to 0.5, and as a result 1 − α is 0.5 as well, so the two models are adapted evenly by the MultiTune model.

Besides, MultiTune includes the L2-SP regularization in the loss function. As shown in Equation 3.4, there are two control factors, α and β, which control the degree of regularization. These factors are both set to 0.01, as suggested in Li et al.'s paper. (Li, Grandvalet, and Davoine, 2018) These settings are summarized in Table 3.4.

Model         | Equation 3.2 α | Equation 3.4 α | Equation 3.4 β
Fine-Tuning A | 0.5            | 0.01           | 0.01
Fine-Tuning B | 0.5            | 0.01           | 0.01

TABLE 3.4: Settings of the MultiTune model.


    3.3.5 Overall Approach

To better study the performance of MultiTune and to support the analysis, the following experimental approach is followed. First, the code of SpotTune is run on the Aircraft and CIFAR100 datasets to obtain the baseline; the best validation accuracy, total time used and the plot of validation accuracy versus the number of epochs are recorded. After that, the proposed method, MultiTune, is applied to the same datasets, and the same quantities are recorded for comparison with the baseline. Then, to analyze the performance of the two methods on small target datasets, both are also applied to smaller-sized Aircraft and CIFAR100 datasets with 20, 15, 10 and 5 images per class. The results of the baseline and MultiTune on these smaller-sized datasets are recorded for analysis as well. The details of the results and analysis are elaborated in Chapter 4.


    Chapter 4

    Results and Analysis

    4.1 Results of Baseline

As mentioned, the code of SpotTune includes the validation set in the training set to let the model see more images. The figures reported in Guo et al.'s paper are testing results, obtained by submitting predictions to the Visual Decathlon Challenge website, rather than validation results. (Guo et al., 2018) Because only two of the Visual Decathlon datasets are used here, the results are not submitted to the Visual Decathlon Challenge website, so the testing results of the MultiTune method on these two datasets are unknown. To address this, validation results instead of testing results are used to compare the methods, and the validation sets are therefore removed from the training sets in the code of SpotTune. Due to this removal, the number of images seen by the SpotTune model is reduced. As a result, the validation results shown later in this chapter are slightly lower than the testing results reported in their paper. The results of SpotTune on Aircraft and CIFAR100 after 110 epochs are listed in Table 4.1.

Dataset     | Best Validation Accuracy | Total Time Used (mins)
Aircraft    | 55.15%                   | 47.49
Aircraft-20 | 45.60%                   | 29.15
Aircraft-15 | 39.20%                   | 21.84
Aircraft-10 | 30.70%                   | 14.67
Aircraft-5  | 17.40%                   | 7.57
CIFAR100    | 78.45%                   | 454.80
CIFAR100-20 | 59.15%                   | 34.60
CIFAR100-15 | 55.73%                   | 23.74
CIFAR100-10 | 49.10%                   | 16.52
CIFAR100-5  | 33.40%                   | 8.96

TABLE 4.1: Results of SpotTune on Aircraft and CIFAR100.

In Table 4.1, Aircraft-20 denotes the smaller-sized Aircraft dataset with 20 images per class, Aircraft-15 the smaller-sized Aircraft dataset with 15 images per class, and Aircraft-10 and Aircraft-5 follow the same naming rule; the same applies to CIFAR100-20, CIFAR100-15, CIFAR100-10, and CIFAR100-5. Figure 4.1 illustrates the validation accuracy of SpotTune on Aircraft and CIFAR100 after every epoch, and the graphs in Figure 4.2 and Figure 4.3 show the validation accuracy of SpotTune on the smaller-sized Aircraft and CIFAR100 datasets.

    FIGURE 4.1: Validation Accuracy after every Epoch on Aircraftand CIFAR100 Datasets for SpotTune.

    FIGURE 4.2: Validation Accuracy after every Epoch on Smaller-sized Aircraft Dataset for SpotTune.


    FIGURE 4.3: Validation Accuracy after every Epoch on Smaller-sized CIFAR100 Dataset for SpotTune.

    4.2 Results of MultiTune

For better comparison, the results of MultiTune on the Aircraft, CIFAR100, and smaller-sized Aircraft and CIFAR100 datasets are recorded in the same way. The performance of MultiTune on these datasets is listed in Table 4.2.

Dataset     | Best Validation Accuracy | Total Time Used (mins)
Aircraft    | 59.59%                   | 38.19
Aircraft-20 | 47.85%                   | 22.50
Aircraft-15 | 40.73%                   | 16.88
Aircraft-10 | 29.90%                   | 11.51
Aircraft-5  | 18.80%                   | 5.91
CIFAR100    | 79.31%                   | 321.37
CIFAR100-20 | 59.00%                   | 22.80
CIFAR100-15 | 56.40%                   | 16.87
CIFAR100-10 | 49.10%                   | 11.37
CIFAR100-5  | 29.20%                   | 5.86

TABLE 4.2: Results of MultiTune on Aircraft and CIFAR100.

For better visualization, the validation accuracy of MultiTune on these datasets after every epoch is also recorded. To compare the two methods, their validation accuracy curves are plotted in the same figure: the validation accuracy of SpotTune is shown by blue lines, while the validation accuracy of MultiTune is shown by red lines. Figure 4.4 shows the validation accuracy versus the number of epochs of SpotTune and MultiTune on the Aircraft and CIFAR100 datasets.

(A) Validation Accuracy versus the Number of Epochs on Aircraft Dataset.

(B) Validation Accuracy versus the Number of Epochs on CIFAR100 Dataset.

FIGURE 4.4: Validation Accuracy versus the Number of Epochs of SpotTune and MultiTune on Aircraft and CIFAR100 datasets.

    4.3 Analysis of the Results

The results listed in Table 4.1 and Table 4.2 are obtained from a single run of the code, because for such large image datasets the results usually show very small variation. However, the random initialization and the stochastic nature of SGD can still introduce some randomness. To minimize its effect, both methods are run for 5 iterations. The box plot of the results of these 5 iterations is shown in Figure 4.5, and the corresponding validation accuracies are listed in Table 4.3. Due to the size of the dataset, only the Aircraft dataset is run for 5 iterations: the time required to train SpotTune on CIFAR100 is around 7 to 8 hours, while the time required to train MultiTune on CIFAR100 is around 6 to 7 hours, so running 5 iterations on CIFAR100 would be extremely time-intensive.

                   | SpotTune | MultiTune
Iteration 1        | 54.10%   | 60.04%
Iteration 2        | 54.16%   | 59.62%
Iteration 3        | 55.18%   | 58.54%
Iteration 4        | 56.20%   | 59.02%
Iteration 5        | 55.33%   | 59.86%
Mean               | 54.99%   | 59.42%
Standard Deviation | 0.79     | 0.56

TABLE 4.3: Results of 5 Iterations for MultiTune and SpotTune on Aircraft Dataset.


    FIGURE 4.5: Validation Accuracy of SpotTune and MultiTuneafter Running 5 Iterations on Aircraft Dataset.

It is clear that for both datasets used in this thesis, MultiTune outperforms SpotTune. On the Aircraft dataset, the validation accuracy of MultiTune is consistently higher than that of SpotTune after 20 epochs, with a difference of around 4.5%. Similarly, on the CIFAR100 dataset, the performance of MultiTune is also consistently better than that of SpotTune after 20 epochs, with an increase of around 1%. These results indicate that MultiTune performs better than SpotTune on both a more specific dataset (Aircraft) and a more generic dataset (CIFAR100) when the methods are run on the whole datasets. Apart from validation accuracy, the total training time is also taken as one of the performance considerations in this thesis. As shown in Table 4.1 and Table 4.2, the total training time on the Aircraft and CIFAR100 datasets is 47.49 minutes and 454.80 minutes for SpotTune, respectively, and 38.19 minutes and 321.37 minutes for MultiTune, respectively. In comparison, the reduction in total training time is 19.58% for Aircraft and 29.34% for CIFAR100.


    FIGURE 4.6: Validation Accuracy after every Epoch on Smaller-sized Aircraft Dataset for SpotTune and MultiTune.

    FIGURE 4.7: Validation Accuracy after every Epoch on Smaller-sized CIFAR100 Dataset for SpotTune and MultiTune.


For better visualization of the performance difference between the two methods when applied to smaller-sized datasets, the results of SpotTune and MultiTune on the smaller-sized Aircraft and CIFAR100 datasets are shown in Figure 4.6 and Figure 4.7. It is clear in Figure 4.6 that MultiTune also outperforms SpotTune on these smaller-sized Aircraft datasets, except for Aircraft-10. The Aircraft dataset has around 33 images per class; when reduced to 20 or 15 images per class, the number of images per class drops by roughly 43% and 60%, respectively. In these two cases, the results of MultiTune are consistently better than SpotTune after 20 epochs, by around 2.3% for Aircraft-20 and 1.5% for Aircraft-15. But for the extremely small datasets, Aircraft-10 and Aircraft-5, the differences between the two methods are not obvious. In Aircraft-10 and Aircraft-5, the number of images per class is roughly reduced by 70% and 85%, respectively. The inconspicuous differences on these two datasets may be due to their extremely small size: with so few images, it is hard for either model to learn enough knowledge to predict unseen data, so both methods perform poorly and the performance difference between them becomes insignificant. As for the smaller-sized CIFAR100 datasets, the situation is slightly different: MultiTune is slightly better than SpotTune for CIFAR100-15 and CIFAR100-10, about the same for CIFAR100-20, and noticeably worse than SpotTune for CIFAR100-5. This might be due to the large number of images per class in the original CIFAR100 dataset, which has 400 images per class in the training set. Once reduced to 20, 15, 10 or 5 images per class, the dataset becomes extremely small, which prevents the models from learning sufficient knowledge to predict the classes correctly. Even in this tough situation, however, MultiTune does not lose the ability to learn and still slightly outperforms SpotTune on 2 of these smaller-sized CIFAR100 datasets.

    FIGURE 4.8: Running time comparison between SpotTune andMultiTune.


However, MultiTune runs much faster than SpotTune on all eight smaller-sized Aircraft and CIFAR100 datasets. As shown in Table 4.1 and Table 4.2, the total running time of SpotTune on the four smaller-sized Aircraft datasets is 29.15 minutes for Aircraft-20, 21.84 minutes for Aircraft-15, 14.67 minutes for Aircraft-10, and 7.57 minutes for Aircraft-5. In comparison, the total running time for MultiTune is 22.50 minutes for Aircraft-20, 16.88 minutes for Aircraft-15, 11.51 minutes for Aircraft-10, and 5.91 minutes for Aircraft-5. The running times are similar for the smaller-sized CIFAR100 datasets, since the same numbers of images per class are used. To better visualize the running time differences between the two methods, the results are plotted in Figure 4.8; due to the similar performance on the smaller-sized CIFAR100, those results are not shown in this figure. MultiTune runs faster because the algorithm behind it is much simpler than SpotTune's. As mentioned previously, SpotTune trains a policy network to generate the routing decision for every image; although it uses a very simple CNN as the policy network, training it still requires considerable time. MultiTune, on the other hand, only adds a single-layer neural network after the convolutional layers of the last block and before the last fully connected layer. So, MultiTune is much less computationally expensive than SpotTune, which is also reflected in the running times of the two methods.


    Chapter 5

    Conclusion and Future Work

    5.1 Conclusion

In this thesis, a novel TL technique denoted MultiTune is proposed, which can adaptively integrate multiple fine-tuning CNN models with different settings. This method has been applied to image classification with two image datasets taken from the Visual Decathlon Challenge. To evaluate MultiTune's ability to generalize, these two datasets are intentionally selected so that one is specific and the other is generic: the Aircraft dataset acts as the specific one, while the CIFAR100 dataset acts as the generic one. The results of one of the state-of-the-art TL approaches, an adaptive fine-tuning method called SpotTune, are taken as the baseline for comparison. (Guo et al., 2018) The MultiTune model used in this thesis contains two ResNet-26 CNN models with different fine-tuning settings: one uses standard fine-tuning with the weights in the last block reinitialized, while the other also has the last block reinitialized but uses a lower learning rate of 0.01 for that block. The integration factor used to combine these models is 0.5 for each. Also, L2-SP regularization is employed to facilitate optimization and avoid overfitting.

Through several experiments, it is found that MultiTune achieves a validation accuracy of 59.59% on the Aircraft dataset, which is around 4.5% higher than the 55.15% obtained by SpotTune. MultiTune also outperforms SpotTune on the CIFAR100 dataset by around 1%, with accuracies of 79.31% and 78.45%, respectively. To study the performance of the two methods on small datasets, Aircraft and CIFAR100 datasets with different numbers of images per class are also used for evaluation. The results show that MultiTune also achieves higher performance than SpotTune on most of these smaller-sized datasets; the difference is around 2.3% for Aircraft-20, 1.5% for Aircraft-15, 1.4% for Aircraft-5 and 0.7% for CIFAR100-15. As for training time, MultiTune needs much less time than SpotTune on all the datasets used in this thesis.

The results achieved in this thesis indicate that the proposed MultiTune technique can improve the performance of TL on image classification problems, and suggest that MultiTune is a promising approach to be further adopted and applied in TL and image classification tasks.


    5.2 Future Work

No model is perfect, and there is always future work that could be done to improve the model in different respects. The current version of MultiTune only contains two ResNet-26 models with different fine-tuning settings. The performance of MultiTune could be further improved by including more fine-tuning models with different settings, because multiple different models could extract more hidden features than only two models.

Besides, the hyper-parameters and factors used in the current version could be further tuned to obtain a better model. These hyper-parameters include the learning rates used in the fine-tuning models, the schedule of the learning rate decay, and the factors used in the L2-SP regularization. More importantly, the integration factor used to combine the fine-tuning models could be further tuned. Currently, the two ResNet-26 models are treated evenly, each assigned a factor of 0.5. However, this integration factor is not fixed, and one model could be made dominant over the other by assigning it a higher integration factor.

Due to computational constraints and time limits, the results on CIFAR100 are obtained from a single training and evaluation run. Although the variation of results is very small for such a large image dataset, the evaluation on CIFAR100 should ideally be run for several iterations and averaged, as was done for the Aircraft dataset. Running multiple iterations minimizes the randomness introduced by random initialization and the stochastic nature of SGD. The averaged figure may turn out slightly lower or slightly higher than a single run, but it gives a more reliable estimate of the model's performance.

As mentioned, the datasets used to evaluate MultiTune are taken from the Visual Decathlon Challenge. To test MultiTune's capability to generalize, Aircraft and CIFAR100 are selected to represent a more specific dataset and a more generic dataset. However, the Visual Decathlon Challenge contains ten datasets in total. If time and the computational capability of the device permit, the MultiTune model proposed in this thesis should be evaluated on the entire Visual Decathlon Challenge. In that case, not only the validation sets but also the testing sets could be used, since results on the testing sets can be uploaded to the Visual Decathlon Challenge website, which provides a scoring mechanism to assess the performance of a model when testing results for the entire challenge are submitted. Guo et al. also reported the Visual Decathlon Challenge score of SpotTune in their paper. (Guo et al., 2018) Therefore, for a more thorough comparison with SpotTune and other state-of-the-art TL techniques, the MultiTune model should be further evaluated on all the datasets in the Visual Decathlon Challenge.


    Appendix A

    Project Contract


    Bibliography

Agrawal, Pulkit, Ross Girshick, and Jitendra Malik (2014). "Analyzing the performance of multilayer neural networks for object recognition". In: European Conference on Computer Vision, pp. 329–344.

    Arunava (2018). The Perceptron. URL: https://towardsdatascience.com/the-perceptron-3af34c84838c.

Azizpour, Hossein et al. (2016). "Factors of transferability for a generic convnet representation". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 38(9), pp. 1790–1802.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (2014). "Neural Machine Translation by Jointly Learning to Align and Translate". In: ICLR 2015. URL: http://arxiv.org/abs/1409.0473.

Deng, Jia et al. (2009). "Imagenet: A large-scale hierarchical image database". In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.

Guo, Yunhui et al. (2018). "SpotTune: Transfer Learning through Adaptive Fine-tuning". In: CoRR abs/1811.08737. arXiv: 1811.08737. URL: http://arxiv.org/abs/1811.08737.

Han, X. et al. (2017). "Pre-Trained AlexNet Architecture with Pyramid Pooling and Supervision for High Spatial Resolution Remote Sensing Image Scene Classification". In: Remote Sensing 9(8), p. 848.

He, Kaiming et al. (2015). "Deep Residual Learning for Image Recognition". In: CoRR abs/1512.03385. arXiv: 1512.03385. URL: http://arxiv.org/abs/1512.03385.

Hubel, D. H. and T. N. Wiesel (1968). "Receptive fields and functional architecture of monkey striate cortex". In: The Journal of Physiology 195(1), pp. 215–243.