
IMPERIAL COLLEGE LONDON

DEPARTMENT OF COMPUTING

Transfer Learning for Robot Control with Generative Adversarial Networks

Author: Jonathan Heng

Supervisor: Dr. Edward Johns

Submitted in partial fulfillment of the requirements for the MSc degree in Computing Science of Imperial College London

September 2017


Abstract

Training a robot in reality is more resource-intensive than training it in a simulator. Annotations are easily available in simulators, data can be generated much more quickly, and training speeds are faster. Hence, it would be ideal to be able to use simulated data to train a robot and deploy it successfully in the real world. However, even while physics simulators are improving and able to render increasingly realistic images, there still exists a difference between real data and simulated data. This difference is also known as the reality gap.

The reality gap is a classic transfer learning problem, where the source domain X is simulated data and the target domain Y is real-world data. To bridge this gap, we explore the idea of using image translation to create source-target domain pairs X-Y. The idea of image translation is to transfer the style of one domain onto another, and various related works have done this with generative adversarial networks, where the aim is to learn a mapping G : X → Y. This paper presents a novel approach to training generative adversarial networks that generates a paired target domain image G(X) for a source domain image X by minimizing the task loss of the generated target image on the same task as its paired source image.

We then evaluate methodologies under the condition of limited source-target pairs to train a robot arm to reach for a cube in the target domain. By using an image translation model to generate a large number of X-G(X) pairs, we find significant performance improvements ranging between 20-50% compared to using only the limited X-Y pairs, across various source-target domain pairings.


Acknowledgments

I would like to acknowledge my supervisor, Dr. Edward Johns, for all his time and patience in providing valuable advice on the thesis as well as guidance on future pathways. I am also grateful to my family and friends for all the support throughout the years. Finally, I would like to give special thanks to my parents, who gave me the wonderful opportunity to be here.


Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Contributions
  1.4 Outline

2 Background
  2.1 Introduction to transfer learning
  2.2 Neural networks
  2.3 Robot control systems
  2.4 Distance between distributions
  2.5 Domain shift
  2.6 Domain alignment
  2.7 Style transfer between image domains

3 Generating Training Data
  3.1 Data sources for transfer learning
  3.2 V-REP
  3.3 Simulation scene set-up
  3.4 V-REP remote API
  3.5 Training images
  3.6 Labelling images
  3.7 Domain transformation

4 Control of Robot Arm from Images
  4.1 Evaluation set-up and metrics
  4.2 Fully supervised learning
    4.2.1 Models
    4.2.2 Results
  4.3 Learning with limited data
    4.3.1 Models
    4.3.2 Results

5 Image Translation and Pairing across Domains
  5.1 Training details and architectures
  5.2 Typical GAN for image translation
  5.3 CycleGAN
    5.3.1 Model details
    5.3.2 Results
  5.4 TaskGAN
    5.4.1 Model details
    5.4.2 Results

6 Training with Generated Source-Target Pairs
  6.1 Training with CycleGAN images
    6.1.1 Models
    6.1.2 Results
  6.2 Training with TaskGAN images
    6.2.1 Model details
    6.2.2 Results

7 Conclusions and Future Work

Chapter 1

Introduction

1.1 Motivation

One of the key hallmarks of human intelligence is the capacity to adapt to new situations based on similar past experiences. For example, a tennis player is likely to be able to pick up other racket sports much more easily than a person who has no experience in racket sports. The ability to reuse knowledge learnt from solving one task on a separate but related task is present in all humans. In other words, being able to transfer knowledge between different task domains is a highly desirable trait of intelligence. However, it remains extremely difficult to come up with a general algorithm to allow a computer to effectively reuse the learnt knowledge across a range of different yet similar tasks. The objective of transfer learning is to effectively reuse past knowledge for new tasks. The potential benefits are far-reaching and this paper aims to study this topic in the context of training robots in simulation and transferring the knowledge over to a robot in the real world.

The ideal goal is the capability to train a robot completely in simulation and successfully deploy it in the real world. The desire to achieve transfer learning in this particular domain is due to the numerous obstacles faced when training a robot in the real world. Firstly, there is the issue of safety. The robot can potentially go out of control during the training stage, resulting in damage to surrounding property, itself, or even humans. Secondly, large amounts of time have to be put in by humans to supervise the robot and create relevant training data. Imagine the case of training a robot to pick up a cube. To create training data of images of objects on a table, a human has to manually set up the scene. When training the robot, a human needs to be present to supervise it due to safety concerns and to reset the scene for the task. Furthermore, simulation environments often come with the capability of providing annotations for the training data (e.g. joint states, object orientation/pose, etc.); such labels are expensive to obtain in reality. On the other hand, these issues are circumvented in a simulated environment. There are no safety concerns, simulation time can be sped up during training, and large amounts of annotated training data can be generated programmatically. Hence, if a robot is able to effectively use the knowledge learnt in a simulated environment in the real world, training in the real world can be reduced dramatically or even rendered completely unnecessary. This problem of transferring knowledge learned in a simulated world to the real world is one that transfer learning aims to solve. More generally, this can be seen as a problem of transferring knowledge from a source domain to a related target domain.

(a) Source domain (b) Target domain

Figure 1.1: A sample source-target domain pairing created in simulation. The source domain is a simple setup of a 3DOF arm and a cube, while the target domain adds lighting effects, slight colour changes, and slight shape deformations to simulate typical differences between simulation and reality.

This project studies the transfer of knowledge from a source to a target domain in a simulated setting. Figure 1.1 shows a possible source-target domain pairing that can be created in simulation. Since the project is motivated by the problem of training a robot in a simulated environment on a task and successfully performing the task in reality, the source and target domains will be designed with the respective characteristics found in simulation and reality.


1.2 Objectives

Figure 1.2: Cost of tasks for obtaining relevant training data given a source-target domain pairing of simulated world-real world.

Given a source-target domain pairing of simulated world-real world, the costs of obtaining relevant training data are described in Figure 1.2. It is desirable to minimize the need for high-cost tasks, and this project aims to study the transfer of knowledge from a source domain to a target domain with these characteristics. Reducing the dependency on target data will be an aim of the project. The specific problem for testing and evaluating transfer learning methodologies is to control a 3-degree-of-freedom (3DOF) robot arm to move towards a cube. This will be formulated as a regression problem, where joint velocities will be predicted from an image input. However, it remains difficult for any algorithm to learn a regression task in a fully unsupervised fashion. On the other hand, semi-supervised learning combines unsupervised and supervised methods, which suits our aim of reducing the dependence on target domain data. For training and evaluation of methodologies, source-target domain pairs will be created in simulation. This can be done by applying a transformation to the source domain image directly, or by adjusting the original source domain scene to create a target domain scene. The objectives can be summarized as follows:

• Setting up the simulated environment: The simulator of choice for this project is V-REP [1]. The simulator will act as a training data generator and an environment for the testing of models.

• Control a 3DOF robot arm in the simulated environment: Train a neural network that is able to control the robot using the images generated from the simulator. Design performance metrics to evaluate the performance of various models.

• Transfer knowledge between different simulated environments: Evaluate transfer learning methodologies from a source domain in simulation to a different target domain in simulation. Design training methodologies that reduce the dependence on target domain training examples, which are costly to obtain.


1.3 Contributions

The contributions of this paper are as follows:

• a novel dataset for the study of transfer learning for robot control, which is easily configurable to simulate various characteristics between simulation and reality

• motivation for the use of image translation with generative adversarial networks (GANs) to create image pairs between source and target domains, making it easy to generate more target domain data from source domain data

• analysis of methodologies for supplementing limited target domain data with ample source domain data to improve performance in the target domain

• a novel method of training GANs for image translation by minimizing the task loss of generated target domain images

1.4 Outline

This section gives an outline of this paper with a brief summary of each chapter.

• Chapter 2 covers background research and related work.

• Chapter 3 details the process behind generating relevant training data for the problem at hand, including the use of the simulator V-REP and processing techniques applied to the training data.

• Chapter 4 discusses training a neural network to perform regression for robot control directly from images. The evaluation set-up and metrics are described. Training and test results in conditions with full target domain data and limited target domain data are evaluated.

• Chapter 5 explores the use of generative adversarial networks (GANs) for image translation purposes. The use of CycleGAN from [2] and self-proposed variations of GANs are tested and evaluated.

• Chapter 6 combines the models from Chapters 4 and 5 to create a transfer learning process. Performance gains from transfer learning on several source-target domain pairings akin to that of simulation-reality are evaluated.

• Chapter 7 concludes the work with a summary and points out possible future directions of research.


Chapter 2

Background

2.1 Introduction to transfer learning

We now consider the following definitions, taken from a survey on transfer learning [3].

Definition 2.1 (Transfer Learning). Given a source domain D_S and learning task T_S, and a target domain D_T and learning task T_T, transfer learning aims to help improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T or T_S ≠ T_T.

Definition 2.2 (Domain). A domain is a pair D = {X, P(X)}, where X is a feature space and P(X) is a marginal probability distribution.

Definition 2.3 (Task). A task is a pair T = {Y, P(Y|X)}, where Y is a label space and P(Y|X) is an objective predictive function.

Particular to this project, we will be dealing with the case of transferring knowledge for controlling a robot from a simulated environment to the real world. This is characterized as a case of different source and target domains and different source and target tasks. The source and target domains share the same feature space X, the pixel values of an image, but do not share the marginal distribution P(X). Similarly, the source and target tasks share the same label space Y, the joint velocities, but do not share the same predictive function P(Y|X).

According to the survey [3], transfer learning problems can be further categorized into three distinct settings: inductive, transductive, and unsupervised. In the inductive setting, source and target labels are available. In the transductive setting, source labels are available while target labels are unavailable. For the scope of this project, the inductive and transductive settings are most relevant and will be investigated in greater detail.


2.2 Neural networks

In recent years, the applications of neural networks have grown explosively. Neural networks have achieved state-of-the-art performance in various domains including visual recognition [4], video game playing [5], and natural language processing [6]. Neural networks have proven to be a powerful learning model able to learn a wide range of tasks.

Work on transfer learning has been done with neural networks as the model. In [7], a neural network is trained in simulation and fine-tuned with a small number of real-world images to perform a planar reaching task. This was done by setting up a simulated environment to train a robot arm in and approaching it as a reinforcement learning problem. The robot arm's state includes its end-effector position and the joint configuration. Control of the robot is done by taking an action from a finite set of actions, quantified by a combination of decreasing or increasing the joint angles or leaving them unchanged. Instead of directly obtaining the Q-values from a neural network trained on raw pixel inputs as in [5], the authors separated the neural network into two distinct components, one for perception and the other for control. This allowed the networks to be trained in parallel. The perception network learns to predict the state of the robot arm from raw input pixels while the control network learns to predict the Q-values from the state of the robot arm as input. Furthermore, instead of utilizing the ε-greedy method for policy search, which is a form of random exploration, a kinematics-based controller is introduced to guide the policy search. Experiments show that training a prediction network from scratch in simulation and fine-tuning it with a good ratio of images (found to be 75% real and 25% simulated images in the paper) is able to outperform a network trained from scratch using only real-world images. This work shows that a neural network trained in simulation can provide an effective starting point for further fine-tuning with a much smaller amount of real-world data. One possible limitation of the above work is the lack of generality of the chosen search policy. Since it is designed specifically for a kinematics problem, it is not widely applicable to other problems.

Another interesting approach is to design a neural network architecture that can utilize previously learnt knowledge. [8] tackles the transfer learning problem through the innovative construction of an architecture, termed progressive networks, that grows as new tasks are added. A network column is first trained for a particular task. Instead of re-weighting the neural network when it is trained on a new task, the old weights are frozen and additional network columns are added in parallel to the previous network columns. The old network columns are linked to the new network columns to enable the reuse of previously learnt features. Freezing the weights also ensures that previously learnt features are not lost when the network is trained for a different task or on a separate data source. The new network columns are then free to learn new features and combine them with previously learnt features. Experiments compared the stability and performance of the fine-tuning and progressive network approaches and found progressive networks to be more robust and better performing. A possible downside to progressive networks is the need to expand the network. While the paper explores branching out to a couple of new tasks successfully, if there are a large number of new tasks, computational costs will rise significantly.

Neural networks have also proven to be highly effective feature extractors. In [9], features learnt from a convolutional neural network trained on ImageNet are used to train an SVM to perform classification tasks on different datasets and achieve leading results. This shows the reusability and generality of the features that are learnt by neural networks. In [10], extensive experiments are done to investigate the degree of specificity of features learnt in neural networks. It is found that later layers learn more specific features and, more interestingly, that transferring part of a network learnt in a separate domain leads to an overall better performance than just training in a single domain. From these results, transfer learning, when applied appropriately, can even be used to boost performance.

2.3 Robot control systems

Traditional approaches to controlling a robot [11] rely on designing multiple modules such as perception, planning, and motor control that process input sensor signals and output values to actuators. However, such approaches tend to require a significant amount of manually designed components to obtain a low-dimensional feature representation of the information at each step. Ideally, we would like to avoid using hand-crafted features as information tends to be lost. Intuitively, a model that is able to directly utilize input information and output an action can possibly perform better since all information is built and retained by the model.

In [12], the authors present a method of training robot control policies that predict an action directly from visual observations by using deep convolutional neural networks. This falls in line with the ideology of having a single end-to-end model that trains all modules of the robot control system jointly. While the overall methodology covered in [12] is largely reinforcement learning centric, it shares similarities with this paper in that supervised learning samples are used to guide the policy search. In Chapter 4, we will cover the use of a physics simulator to provide supervised learning samples of a trajectory. The difference is that the problem in this paper is formulated as a regression task instead of a reinforcement learning problem.


2.4 Distance between distributions

In the case of different source and target domains, either X_S ≠ X_T or P(X_S) ≠ P(X_T). Consider the latter case. Clearly it is not a trivial task to train a model to recognize data generated from separate probability distributions. One approach is to incorporate statistical measures of the distance between probability distributions during the training of the model. This was done in [13] by minimizing the Maximum Mean Discrepancy (MMD) as part of the loss function when training a neural network to perform object recognition and detection tasks. In this case, the statistical distance that the neural network is trying to minimize is between the features extracted from the source images and the target images. The authors also utilized a two-stream architecture, with one network being trained on the source domain and the other being trained on the target domain. This loss function encourages both networks to learn to recognize a similar set of features. In other words, these learnt features can be seen as domain invariant since they are effective in both domains.

The MMD has many properties that make it a good choice for a probability metric, and we shall consider its empirical estimate as taken from [14].

Definition 2.4 (Maximum Mean Discrepancy). Let p and q be distributions defined on a domain Z. Let the observations X := {x_1, ..., x_m} and Y := {y_1, ..., y_n} be drawn independently and identically distributed (i.i.d.) from p and q respectively. Then we define the maximum mean discrepancy's empirical estimate as

MMD[X, Y] = \left[ \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} k(x_i, x_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j) + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} k(y_i, y_j) \right]^{1/2}    (2.1)

where k(·, ·) is a kernel function. Ideal choices for the kernel are the Gaussian and Laplace kernels for their universal property, as further discussed in [15].

Computing the above empirical estimate costs O((m + n)²) time. More efficient approximations of the MMD can be found in [16]. Other probability metrics exist and can be considered, such as the Kullback-Leibler divergence, the Kolmogorov metric, and the χ² distance. A comprehensive study of these metrics can be found in [17].
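As a concrete illustration, below is a minimal NumPy sketch of the empirical estimate in Equation (2.1) with a Gaussian (RBF) kernel; the bandwidth sigma is an assumed hyperparameter, not a value taken from the thesis.

import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 * sigma^2)), computed for all pairs of rows
    sq_dists = (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd(X, Y, sigma=1.0):
    # Empirical MMD estimate of Equation (2.1); O((m + n)^2) kernel evaluations
    m, n = len(X), len(Y)
    k_xx = gaussian_kernel(X, X, sigma).sum() / m**2
    k_xy = gaussian_kernel(X, Y, sigma).sum() * 2.0 / (m * n)
    k_yy = gaussian_kernel(Y, Y, sigma).sum() / n**2
    return np.sqrt(max(k_xx - k_xy + k_yy, 0.0))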


2.5 Domain shift

One of the major problems that transfer learning attempts to solve is domain shift. For example, a robot arm can be trained to pick up a red cube. However, the learned model will often break down when the environment changes, such as when the colour of the cube is different or when the lighting effects are different. A simple 2-dimensional dataset can be used to illustrate this effect. In Figure 2.1, domain 1 is generated by a 2-dimensional Gaussian variable with mean 0 and variance 1. Domain 2 is generated by shifting the datapoints in domain 1 along the y-axis by a value of 10. In this case, the x-dimension represents a domain-invariant feature. Imagine trying to learn a function for these datapoints. Being able to identify the x-dimension would be much more helpful in solving the problem since it provides the same information for both domains, whereas the y-dimension is not helpful for learning a single function mapping for both domains simultaneously. Identifying a feature that describes data from separate domains is far from a trivial task, and we will discuss some relevant works in the next paragraphs. Other terms in the literature describing a similar phenomenon include domain adaptation and covariate shift.

Figure 2.1: 2-dimensional visualization of domain shift
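A minimal sketch of how the toy data illustrated in Figure 2.1 could be generated; the sample size is an assumption.

import numpy as np

rng = np.random.default_rng(0)

# Domain 1: 2D standard Gaussian (mean 0, variance 1 in each dimension)
domain1 = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Domain 2: the same points shifted by 10 along the y-axis,
# so x stays domain-invariant while y carries the domain shift
domain2 = domain1 + np.array([0.0, 10.0])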

One method of approaching the problem of discrepancies between the source and target data distributions is to learn a feature representation which is unable to distinguish between the two domains. This is known as domain confusion. In [18], a negative cross-entropy loss is added to the loss function. In addition to training the neural network for the task, a domain classifier is trained to recognize which domain an image originates from. The domain confusion loss is minimized when the domain classifier is maximally confused. Hence, training the model with this loss function will encourage the model to learn a feature representation that maps the data from different domains into a similar feature space.
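The exact formulation in [18] is not reproduced here; the sketch below only illustrates the general idea in PyTorch: a domain classifier on top of shared features, and a confusion loss that is minimized when the predicted domain distribution is uniform. The feature dimension and module names are assumptions.

import torch.nn as nn
import torch.nn.functional as F

feature_dim = 64  # assumed size of the shared feature layer
domain_classifier = nn.Linear(feature_dim, 2)  # predicts source (0) vs. target (1)

def domain_classifier_loss(features, domain_labels):
    # Standard cross-entropy used to train the domain classifier itself
    return F.cross_entropy(domain_classifier(features), domain_labels)

def domain_confusion_loss(features):
    # Cross-entropy against a uniform target over domains: minimized when the
    # classifier assigns equal probability to both domains, i.e. is maximally confused
    log_probs = F.log_softmax(domain_classifier(features), dim=1)
    return -log_probs.mean()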

A different take on bridging the gap between the simulated environment and the real world is domain randomization [19]. This approach introduces a large amount of variability into the simulated environment such that the real world appears to be just another variation. In that paper, the aim is to train an object detector on simulated images and test its effectiveness when given a scene from the real world. Aspects of the simulated environment that were randomized include the number and shape of distractor objects present, the position and texture of objects, the camera position and orientation, the lighting, and the type and amount of noise added. Even without any real-world training data, the object detector was able to perform well. A possible explanation for the effectiveness of the method is that the neural network learned a set of invariant features in order to cope with the large amount of variability in the training data.

2.6 Domain alignment

A recurring problem in machine learning is the difficulty of obtaining annotated real-world data. Ideally, we would like a model to be completely trained on easily generated synthetic data and perform well when thrown into the real world. Unfortunately, this is far from a trivial task. One way to overcome the lack of annotated real-world data is to have some form of automatic annotation. In [18], a weak pairwise constraint method was proposed. This method attempts to pair annotated synthetic data with an unannotated real-world counterpart by finding the nearest neighbouring source image for each target image. The distance between images can be measured by the Euclidean distance between the sets of features extracted by a convolutional neural network. This distance can then be added to the loss function as a pairwise loss.

Both the domain confusion loss and the pairwise loss were evaluated in [18]. It is found that pairing unlabelled real-world data with simulated data is able to outperform having only labelled real-world data.


2.7 Style transfer between image domains

The visual world is often characterized by style. What is stylistically coherent in reality might look completely out of place in a cartoon world. Humans are able to understand and find correspondences between similar objects across visually distinct domains. For example, while a real car and a cartoon car may possess distinct visual styles, humans easily understand that these two objects share the same semantic meaning. Recent works [20, 21] presented methods of transferring the visual styles and characteristics of one image domain to other images.

Generative adversarial networks, or GANs in short, are a class of neural networks that estimate a generative model by training a generative model and a discriminative model in an adversarial process [22]. GANs have been shown to be able to generate MNIST-like handwritten images from Gaussian input noise. This shows the viability of utilizing GANs as a method of generating images from a desired domain. This idea has been furthered in works [2, 23] which used generative adversarial networks to generate source-target domain image pairs. These models use a source domain image as input and produce a paired target domain image as output. The CycleGAN model in [2] is of particular interest since the training process does not require explicit paired source-target domain examples. This also removes the need for labelled target examples since labels can be transferred from the respective source examples. The CycleGAN model will be further investigated in Chapter 5.

Figure 2.2: An image visualized in the styles of famous works by various artists. The top left image (A) is the original image, while the other images show the output after transferring the style of the associated artist's work. Source: [20]


Chapter 3

Generating Training Data

The problem of interest here is to predict joint velocities for a 3DOF arm from an input image. This can be framed as a regression problem where the corresponding training dataset consists of images annotated with joint velocities. In this chapter, we cover the use of a simulator to create a scene of a 3DOF robot arm and a cube and utilize it to generate training data.

3.1 Data sources for transfer learning

In other studies of transfer learning between a simulated environment and the real world, custom training datasets are created. In [19], a physics simulator is used to generate various images of objects on a table, and object positions can be obtained from within the simulator. This is then used to train an image detector, which is tested on real-world images with no additional training. In another work [18], synthetic images are created using the Gazebo simulator and pose annotations are obtained through the simulator.

3.2 V-REP

For the purposes of this project, a custom training dataset will be created. Training and testing will be done completely in simulation. Target domains are created to simulate typical differences between a simulated source domain and a real target domain. A simulation environment that is able to create a training dataset of images with corresponding joint velocities has to be chosen. The Virtual Robot Experimentation Platform (V-REP) [1] is a robot simulator that fulfils the requirements of the project. It possesses an inverse kinematics module that is able to calculate a list of joint states such that following this list is equivalent to moving a point on the robot arm to a target position along a straight path. This list of joint states can be converted into corresponding joint velocities as labels. The platform also provides a remote API in various languages, including Python.


3.3 Simulation scene set-up

This section describes the various components of the V-REP scene for the 3DOF arm. References to specific objects in the scene will be marked with quotations, e.g. '3dof-arm'. A screenshot of the simulation scene can be found in Figure 3.1. For further details on how to properly set up a scene, V-REP provides much more comprehensive tutorials on its website.

Firstly, a custom model of the arm is created. Four simple shapes are needed to create the arm: a cylinder 'base', followed by 3 cuboid 'links'. These shapes are then linked together with 'joints'. Next, a simple red 'cube' is added to the scene.

To obtain images of the arm moving towards the cube, the inverse kinematics module in V-REP can be used. This creates the behaviour of moving a designated tip to a target in a straight line. A tip-to-target pair has to be created. A dummy object, 'target', was created and placed under the 'cube' in the scene hierarchy. Similarly, a dummy object, 'tip', was added at the end of the arm model, 'link3'. At this point, a tip-to-target relationship can be assigned between this pair of dummy objects, as indicated by the red arrows linking them in the scene hierarchy. For the inverse kinematics to start working, an inverse kinematics group has to be added under the inverse kinematics calculation module.

Vision sensors, which act as cameras, are also available. They can be added and positioned at various angles to capture different viewpoints of the scene. As seen in Figure 3.1, there are 4 vision sensors providing 4 different viewpoints of the arm and the cube, namely front, side, top, and diagonal. The resolution of the images can be controlled. Images in this paper are of dimension 128x128x3.

3.4 V-REP remote API

V-REP provides a remote API through which various other languages can be used to communicate with V-REP's simulation environment. The language of choice for this project is Python. Help with setting up the remote API can be found on V-REP's tutorial website¹.

V-REP provides a large range of functions that enable communication for various actions in the simulated environment. In general, objects in the simulated environment have handles, and there are functions that allow the user to obtain the handles of these objects. Different objects have different properties which can be controlled. For example, images from vision sensors can be obtained, joint states can be read and changed, and object positions can be read and changed. A detailed list of the remote API functions can be found on the website as well². For more advanced functions that are not available in this list, simxCallScriptFunction provides the flexibility of allowing the user to implement any kind of remote API function.

¹http://www.coppeliarobotics.com/helpFiles/en/remoteApiOverview.htm
²http://www.coppeliarobotics.com/helpFiles/en/remoteApiFunctionListAlphabetical.htm

Figure 3.1: V-REP scene of the 3DOF arm and a red cube.
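A minimal sketch of this kind of client code, based on the legacy V-REP remote API Python bindings (vrep.py plus the remoteApi library); the object names ('Vision_sensor', 'cube'), port, and positions are assumptions and must match the actual scene.

import vrep  # legacy V-REP remote API bindings

# Connect to a V-REP instance running a continuous remote API server on port 19997
client_id = vrep.simxStart('127.0.0.1', 19997, True, True, 5000, 5)
assert client_id != -1, 'Failed to connect to V-REP'

# Obtain handles for the objects of interest (names assumed to match the scene)
_, cam_handle = vrep.simxGetObjectHandle(client_id, 'Vision_sensor', vrep.simx_opmode_blocking)
_, cube_handle = vrep.simxGetObjectHandle(client_id, 'cube', vrep.simx_opmode_blocking)

# Move the cube to a new initial position and grab an image from the vision sensor
vrep.simxSetObjectPosition(client_id, cube_handle, -1, [0.3, 0.1, 0.05], vrep.simx_opmode_blocking)
_, resolution, image = vrep.simxGetVisionSensorImage(client_id, cam_handle, 0, vrep.simx_opmode_blocking)

vrep.simxFinish(client_id)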

By utilizing these remote API functions, the simulated scene can be prepared and manipulated completely from Python. For the training dataset, variability in the initial cube positions and initial joint positions is desired. After some initial testing of the performance of trained neural networks on various datasets, it was found that creating a dataset in a grid-like manner (e.g. if the initial cube x position ranges over [0, 1], taking 5 points in this range gives [0, 0.25, 0.5, 0.75, 1]) and mixing it with uniform random samples led to the best results. A sketch of this sampling scheme is shown below.
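The ranges and sample counts in this sketch are illustrative assumptions.

import numpy as np

def initial_cube_positions(x_range=(0.0, 1.0), y_range=(0.0, 1.0),
                           grid_points=5, n_random=100, seed=0):
    # Grid-like samples: the Cartesian product of evenly spaced x and y values
    xs = np.linspace(*x_range, grid_points)
    ys = np.linspace(*y_range, grid_points)
    grid = np.array([[x, y] for x in xs for y in ys])

    # Uniform random samples mixed in to break up the regular grid structure
    rng = np.random.default_rng(seed)
    random_points = np.column_stack([rng.uniform(*x_range, n_random),
                                     rng.uniform(*y_range, n_random)])
    return np.vstack([grid, random_points])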

3.5 Training images

Using the remote API, the inverse kinematics module is able to provide a list of joint states. This list of joint states can be followed iteratively to create the behaviour of the robot arm moving towards the cube. The size of this list can be specified, and it corresponds to the number of steps taken to move the robot arm from the initial position to the final position. At each step of following this list, the image can be obtained in Python through the remote API. The joint states are also obtained and recorded in a text file. Each image is then exported and saved to disk. The chosen export format is JPEG. Even though it is not a lossless compression, it is highly space efficient, and extremely high-fidelity images are not necessary to train the robot arm successfully.

Figure 3.2: Training images. Each row represents a set of images comprising 16 steps. From left to right, the robot arm gradually moves towards the cube.


3.6 Labelling images

Since the objective is to perform control of the robot as a regression problem, the joint states have to be transformed into joint velocities. To appropriately achieve the effect of moving the arm at each step, the joint velocity label should be the difference between the next joint state and the current joint state. For the final joint state, since there is no next state, the joint velocity labels are set to 0.

jointVelocity[i] = jointState[i+1] - jointState[i]    if i ≠ finalStep
jointVelocity[i] = [0, ..., 0]                         otherwise

The joint velocity labels are further processed by normalizing them and multiplying by a damping factor. Normalization is done by dividing the joint velocities by their absolute norm. The damping factor used is the distance between the tip and target positions, which can be obtained at each step through the remote API. The motivation for using a damping factor is to encode knowledge of distance in the labels and create the effect of taking bigger steps while the arm is far from the cube and slowing down as it approaches the cube.
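A sketch of this labelling procedure, interpreting the absolute norm as the L1 norm; joint_states is assumed to be the list returned by the inverse kinematics module, tip_target_distances the per-step tip-to-target distances queried over the remote API, and the epsilon guard is an added assumption to avoid division by zero.

import numpy as np

def make_velocity_labels(joint_states, tip_target_distances, eps=1e-8):
    """joint_states: (steps, n_joints) array; tip_target_distances: (steps,) array."""
    labels = np.zeros_like(joint_states)
    for i in range(len(joint_states) - 1):
        # Raw velocity: difference between the next joint state and the current one
        velocity = joint_states[i + 1] - joint_states[i]
        # Normalize by the absolute norm, then damp by the tip-to-target distance
        velocity = velocity / (np.sum(np.abs(velocity)) + eps)
        labels[i] = velocity * tip_target_distances[i]
    # The final step has no next state, so its label stays [0, ..., 0]
    return labels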

3.7 Domain transformation

To create distinct target domains, transformations can be applied to the original image. Simple transformations include gamma correction and multiplying the colour channels by constant factors. An example of a more complex transformation is to put an image through an edge detection algorithm such as Canny edge detection and create an edge version of the original image. Other ways include directly adjusting the colours, shapes, and sizes of the objects in the scene and generating new images.

Figure 3.3: An image of the 3DOF arm in various domains. Left: original. Centre: tinted image, created by multiplying the colour channels by [0.25, 0.5, 0.75] respectively after pre-processing the values to the range [-1, 1]. Right: image created by putting each colour channel through a Canny edge detector and recombining the colour channels.
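A sketch of the two transformations shown in Figure 3.3, using scikit-image's canny as a stand-in edge detector; the preprocessing range follows the caption, and sigma is an assumed parameter.

import numpy as np
from skimage.feature import canny

def tint(image, factors=(0.25, 0.5, 0.75)):
    # image: float array in [-1, 1] of shape (H, W, 3); scale each colour channel
    return image * np.array(factors)

def edge_domain(image, sigma=1.0):
    # Run a Canny edge detector on each colour channel and recombine the channels
    channels = [canny(image[..., c], sigma=sigma) for c in range(3)]
    return np.stack(channels, axis=-1).astype(np.float32)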


Chapter 4

Control of Robot Arm from Images

Convolutional neural networks have proven to be highly effective models for image recognition and other image-related tasks, and related works such as [7, 19] showed the viability of this model for application in transfer learning. Hence, convolutional neural networks will be trained to predict joint velocities given an image of the robot arm.

Under the transfer learning problem, we would like to study the performance of neural networks in both the source and target domains under different modes of training. Firstly, the fully supervised scenario, where all labels and pairings are assumed to be available, will be studied. This will set the performance benchmarks. Then, we will study methods of training that lead to better performance under the scenario of limited target domain data.

4.1 Evaluation set-up and metrics

An appropriate evaluation setting has to be designed to test the effectiveness of the trained neural networks. The same V-REP scene used for generating training data will be used to evaluate the performance of the neural networks. During test time, at every time step an image is obtained from the vision sensor in the scene and passed through the neural network. The predicted joint velocities are then applied to the robot arm. Each episode starts with a random cube position and joint position and is played over 100 time steps.


The following metrics will be used to evaluate the performance:

• Distance: Both the minimum distance and the final distance from the tip to the target point will be recorded. Since we want the arm to slow down and stop near the object, the final distance is the better metric of performance: the arm can reach a good minimum distance through random movements and eventually move away from the cube. The overall mean and standard deviation of these distances over multiple episodes are recorded.

• Success rate: A distance threshold will be used to decide which episodes are considered successful. If the minimum/final distance for an episode falls below this threshold, that episode is considered successful.

To ensure fairness when evaluating different models, the same test set will be used. The test set is composed of 100 episodes and ensures that every episode is loaded from an identical list of starting joint and cube states.
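A small sketch of how the distance and success-rate metrics above could be computed from per-episode logs; the 0.1 threshold matches the value used later in Section 4.2.2.

import numpy as np

def evaluate_episodes(final_distances, min_distances, threshold=0.1):
    final_distances = np.asarray(final_distances)
    min_distances = np.asarray(min_distances)
    return {
        'final_dist_mean': final_distances.mean(),
        'final_dist_std': final_distances.std(),
        'min_dist_mean': min_distances.mean(),
        'min_dist_std': min_distances.std(),
        # Fraction of episodes whose distance falls below the success threshold
        'final_success_rate': (final_distances < threshold).mean(),
        'min_success_rate': (min_distances < threshold).mean(),
    }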

4.2 Fully supervised learning

In the fully supervised learning scenario, we assume that all labels are available for both the source and target domains and that all examples are paired.

4.2.1 Models

• Source only: Train on source domain images only.

• Target only: Train on target domain images only.

• Mixed: Train on both source and target domain images.

• Separate feature extractor: Train separate feature extractors on each domain and use a shared regressor head. Target and source domain images are paired, and an L2 loss between the feature representations of the source and target domains is minimized.


The first 3 models share the same neural network architecture, which is shown in Figure 4.1. This is a relatively straightforward convolutional neural network architecture. Given a source domain X with images x_i and labels φ^x_i, the neural network is trained to learn a mapping function f_X : X → φ^X. The training process involves minimizing the following cost function:

L_task(X, φ^X) = \frac{1}{N} \sum_{i=1}^{N} (f_X(x_i) - φ^x_i)^2    (4.1)

Figure 4.1: Neural network architecture for the first 3 models. Each convolutional layer uses 3x3 filters with a stride of 2.
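The thesis does not state which deep learning framework was used; the following PyTorch sketch only illustrates a network consistent with the description above (3x3 convolutions with stride 2 on 128x128x3 inputs, a feature layer, and a regression head for the 3 joint velocities). The channel counts and feature layer size are assumptions.

import torch.nn as nn

class ArmRegressor(nn.Module):
    def __init__(self, n_joints=3):
        super().__init__()
        # Stack of 3x3, stride-2 convolutions: 128 -> 64 -> 32 -> 16 -> 8 spatial resolution
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(),  # feature layer θ used later for the L2 pairing loss
        )
        self.regressor = nn.Linear(128, n_joints)   # outputs the predicted joint velocities

    def forward(self, x):
        theta = self.features(x)
        return self.regressor(theta), theta

# Task loss of Equation (4.1): mean squared error between predictions and labels
task_loss = nn.MSELoss()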


Having separate feature extractors is inspired by the work in [13], and the corresponding neural network architecture is shown in Figure 4.2. Neural networks can be seen as a method of extracting image features, as discussed in [10]. By treating the neural networks as feature extractors and adding a loss that penalizes the Euclidean distance between the feature layers, the separate feature extractors for the distinct domains are encouraged to map pairs of images to the same feature space. The layer before the output layer is chosen as the feature layer in this implementation.

Given a source domain X with images x_i, labels φ^x_i, and feature layer representations θ^x_i, and a target domain Y with images y_i, labels φ^y_i, and feature layer representations θ^y_i, the loss between the feature layers can be expressed as follows:

L_θ(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (θ^x_i - θ^y_i)^2    (4.2)

Overall, the neural network is trained to minimize the following cost function:

L_sum(X, φ^X, Y, φ^Y) = \frac{1}{N} \sum_{i=1}^{N} \left[ (f_X(x_i) - φ^x_i)^2 + (f_Y(y_i) - φ^y_i)^2 + λ_θ (θ^x_i - θ^y_i)^2 \right]
                      = L_task(X, φ^X) + L_task(Y, φ^Y) + λ_θ L_θ(X, Y)    (4.3)

where λ_θ is a constant that controls the importance of the loss between the feature layers. It is set to a value of 1 for the experiments in this paper.
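A sketch of how the combined objective in Equation (4.3) could be computed with two domain-specific feature extractors sharing one regression head; feat_src, feat_tgt, and shared_head are hypothetical modules (for instance, two copies of the features part of the ArmRegressor sketch above plus a shared linear head), and the mean-squared losses average over all elements.

import torch.nn.functional as F

def combined_loss(feat_src, feat_tgt, shared_head, x, phi_x, y, phi_y, lambda_theta=1.0):
    # Separate feature extractors per domain, sharing a single regression head
    theta_x = feat_src(x)
    theta_y = feat_tgt(y)
    pred_x = shared_head(theta_x)
    pred_y = shared_head(theta_y)
    # Equation (4.3): task loss in each domain plus the L2 loss pulling paired features together
    return (F.mse_loss(pred_x, phi_x)
            + F.mse_loss(pred_y, phi_y)
            + lambda_theta * F.mse_loss(theta_x, theta_y))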


Figure 4.2: Neural network architecture for the separate feature extractor method. Input images from the separate domains are paired.


4.2.2 Results

These models are trained with the source-target domain pair shown in Figure 4.3. The target domain image is created by multiplying the RGB channels of the source domain image by the constant values [0.25, 0.5, 0.75] respectively. The training dataset consists of a total of 22656 images. This is created by saving images of 1416 episodes, with 16 steps per episode, of the arm moving towards the cube using the inverse kinematics module in V-REP, as described in Chapter 3.

The source only and target only models are trained for 15000 training steps with a batch size of 16 images. The mixed and separate feature extractor models are trained for 15000 training steps with a batch size of 16 images for each of the source and target domains. The images are shuffled before being fed into the neural network. The optimizer used is Adam with a learning rate of 0.001.

Each of the 4 models is tested in both the source and target domains, and the full results can be found in Figure 4.4. An episode is considered a success if the minimum/final distance falls below the chosen threshold value of 0.1.

(a) Source domain (b) Target domain

Figure 4.3: Sample images from the source and target domains used for training and testing.


Figure 4.4: Results of fully supervised models tested on both source and target domains.

As expected, the “Source only” model performs reasonably well in the source domain with a final distance success rate of 89%, and fails completely in the target domain with a 0% final distance success rate.

On the other hand, the “Target only” model surprisingly works in both the source and target domains, with final distance success rates of 90% in both. This is highly likely due to the way the target domain was created, which was to multiply the RGB channels of the source domain images by [0.25, 0.5, 0.75] respectively. A simple visual relation alone is clearly not sufficient, since the “Source only” model does not work well in the target domain. It is a coincidence that the chosen transformation led to good performance in the source domain despite training on target domain images only.

The “Mixed” model is trained on both target and source domain images. It fails to perform up to par in the respective domains in which the source only and target only models were trained. Both the final distance success rate of 70% in the source domain, compared to 89% for the source only model, and 87% in the target domain, compared to 90% for the target only model, are lower. This suggests that training a single neural network to perform a similar task across different domains leads to worse performance, even if the target domain is very similar to the source domain visually and numerically. This is due to the neural network being unable to learn a representation that works for both domains.

The “Separate feature extractor” model trains separate neural networks to extract a similar set of features θ and shares a regression layer on top of the extracted feature space. This model outperforms all other models, with final distance success rates of 91% and 92% in the source and target domains respectively. Learning a shared representation across different domains for a similar task improved the overall performance. While these tests are done with complete knowledge of pairs and labels in both domains, they suggest the potential of supplementing training for a target domain task (e.g. in reality) with rich and easily obtainable source domain data (e.g. from a simulator). Another plausible reason for the improved performance is that the training images are saved as JPEGs, which degrades the quality of the saved image. Hence, the image passed into the neural network during test time differs from the one used during training time. The L2 feature layer loss trains the neural network to become more resistant to such noisy images.

Overall, the “Separate feature extractor” model works the best, and further tests and studies will be done on variations of this model in the following section.

4.3 Learning with limited data

Since obtaining pairings between real-world examples and simulated environment examples and labelling real-world examples are expensive, we want to minimize the amount of such work. Ideally, the training could be done in a completely unsupervised manner. However, this remains extremely difficult. A more common approach is to provide a small amount of labels in the target domain. Another approach is semi-supervised learning, where unlabelled target domain data can be utilized. This section will study various models and their effectiveness under the stress of limited target data.

4.3.1 Models

Each model described in the following is trained with varying amounts of target domain examples available. The target only model uses the architecture shown in Figure 4.1. The other models, which train on both source and target domain examples, use the architecture shown in Figure 4.2. Some of the models described here take inspiration from the work in [13], which utilizes the maximum mean discrepancy (MMD) loss and initializes the neural network for the target domain with pretrained weights from a neural network trained in the source domain. The cost function for the MMD can be found in Section 2.4.

• Target: Train on target domain images only. All examples are labelled. This model serves as the benchmark for comparison with the other models.

• Paired + L2: Train from scratch on paired source-target examples and unpaired source examples. All examples are labelled. Paired examples are trained on both the task loss and the L2 loss between the feature layers. Unpaired source examples are trained on the task loss only.

• Paired + L2 + MMD: Train from scratch on paired source-target examples and unpaired source and target examples. Unpaired target examples are unlabelled and all other examples are labelled. Paired examples are trained on both the task loss and the L2 loss between the feature layers. Unpaired source and target examples are trained on the task loss and the MMD loss between the feature layers.

• Pre-train + Target: Pre-train the feature extractor for the source domain and initialize the feature extractor for the target domain with the pre-trained weights. All examples are labelled. Fine-tune on target examples only. Target examples are trained on the task loss only.

• Pre-train + Paired + L2: Pre-train the feature extractor for the source domain and initialize the feature extractor for the target domain with the pre-trained weights. All examples are labelled. Fine-tune on paired source-target examples and unpaired source examples. Paired examples are trained on both the task loss and the L2 loss between the feature layers. Unpaired source examples are trained on the task loss only.

• Pre-train + Paired + L2 + MMD: Pre-train the feature extractor for the source domain and initialize the feature extractor for the target domain with the pre-trained weights. Unpaired target examples are unlabelled and all other examples are labelled. Fine-tune on paired source-target examples and unpaired source examples. Paired examples are trained on both the task loss and the L2 loss between the feature layers. Unpaired source and target examples are trained on the task loss and the MMD loss between the feature layers.

• Pre-train + MMD: Pre-train the feature extractor for the source domain and initialize the feature extractor for the target domain with the pre-trained weights. All target examples are unlabelled and all source examples are labelled. Fine-tune on unpaired source and target examples only. Unpaired source and target examples are trained with the MMD loss between the feature layers. Source examples are trained on the task loss.


4.3.2 Results

Two different source-target domain pairings will be tested in this section. The first is a visually similar source-target domain pairing, while the second involves a more significant visual transformation from source to target domain.

The first source-target domain pairing is shown in Figure 4.3. The same source dataset used in Section 4.2.2 is used here as well. The source:target dataset ratios tested are 100:1 and 10:1, with the exception of the unsupervised “Pre-train + MMD” model, which was tested with a source:target dataset ratio of 1:1.

The pre-trained model used is trained for 15000 steps with a batch size of 16 on source domain images only. All models are trained or further trained for 15000 steps with a batch size of 16. The Adam optimizer with a learning rate of 0.001 is used here. Full results can be found in Figure 4.5. A summary of the results for models trained with a source:target dataset ratio of 100:1 can be found in Figure 4.6.

Figure 4.5: Results of semi-supervised models tested on test domains with varying source:target dataset ratios.

Generally, the models performed well when the source:target dataset ratio is 10:1. Final distance success rates range from 80% to 99%, with little loss in performance in most cases and surprising improvements in some. For example, the “Pre-train + Target” model, where a 10:1 source:target dataset ratio is used to fine-tune the pre-trained model, achieved a final distance success rate of 99%, outperforming the results found in the fully supervised scenarios where all source-target dataset pairs are available. One reason is that the dataset is noisy and the randomly sampled source-target pairs happened to favour performance on the given test set. It is also likely that the target domain is simply easier for the neural network to learn in.


Figure 4.6: Summary of semi-supervised models tested on test domains with a 100:1 source:target dataset ratio.

Incorporating the maximum mean discrepancy (MMD) loss consistently led to worse results. From the results summary in Figure 4.6, training with the MMD loss leads to much worse results than training the neural network with only the target domain images and the task loss. This contradicts the results of [13], where the addition of the MMD loss during training led to improved results. Furthermore, in the completely unsupervised test of the “Pre-train + MMD” model, where a target dataset similar to the source dataset is provided but unpaired, the neural network fails to learn well. While the MMD loss was not expected to work in the unsupervised setting, it remains unclear why it fails to perform well even in the semi-supervised setting; this problem is left open for future work.

The more prevalent and interesting problem at hand is to test the effectiveness of the models at bridging the lack of target domain data. The effect becomes more obvious when the available target domain data is further reduced to a ratio of 100:1. Initializing the target domain feature extractor with pre-trained weights led to consistently better results. The improvements are especially significant when less target domain data is available. For example, when the source:target dataset ratio is 10:1, the “Paired + L2” model achieved a final distance success rate of 94% compared to the 98% of the “Pre-train + Paired + L2” model, an improvement of 4%. When the source:target dataset ratio is 100:1, the “Paired + L2” model only achieved a final distance success rate of 39% compared to the 93% of the “Pre-train + Paired + L2” model, a significantly larger improvement of 54%.

Also consistent with the results in the fully supervised learning section is that training with pairs and the L2 loss outperforms just finetuning with target examples only. To summarize, training with paired examples and initializing with pre-trained weights from the source domain leads to better overall performance when the source and target domains are similar.


Intuitively, the less striking the change from source to target domain, the less the neural network pre-trained on the source domain needs to change to adapt to the target domain. The question arises whether pre-training is beneficial for visually dissimilar domains. In the above experiments, the chosen target domain happens to be both visually and numerically similar. On the other hand, we would also like to test the effects of these models on an extreme form of source to target domain transformation. The chosen target domain is created by putting each colour channel of the source domain image through a Canny edge detector and recombining the channels, as sketched below. A sample source-target domain pair can be seen in Figure 4.7. The models are trained with a source:target dataset ratio of 100:1.
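As an illustration only, the per-channel edge transformation could be implemented along the following lines with scikit-image; the helper name and the default sigma are assumptions (a sigma value of 4 is reported for the similar edge domain in Chapter 5), not the exact script used to build the dataset.

import numpy as np
from skimage.feature import canny

def to_edge_domain(rgb_image, sigma=4.0):
    # rgb_image: float array of shape (H, W, 3) with values in [0, 1].
    # Each colour channel is passed through a Canny edge detector and the
    # resulting binary edge maps are recombined into a 3-channel image.
    channels = [canny(rgb_image[..., c], sigma=sigma).astype(np.float32)
                for c in range(3)]
    return np.stack(channels, axis=-1)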

Figure 4.7: Sample of the source and target domain images. Left: source domain image, right: target domain image.


Figure 4.8: Results of selected semi-supervised models on the edge target domain. The source:target dataset ratio for these models is 100:1.

From the results, we find that pre-training has a negative effect when the source and target domains are highly dissimilar. In fact, initializing with pre-trained weights led to worse results than training a neural network from scratch, with a final distance success rate of 8% compared to 17%. Training with pairs and the L2 loss between feature layers still led to improved performance even when the domains are visually dissimilar. In the context of utilizing simulators as a source domain to generate training images, this problem is not significant since we would not expect such a large visual discrepancy between the source and target domains. The simulator-reality domains should hold a relationship more like the first set of source-target domains used for testing, where pre-training led to much improved results.


Chapter 5

Image Translation and Pairing across Domains

Being able to automate the task of pairing images between separate domains is highly valuable. For example, if there is an algorithm that can transform a simulated image into a corresponding real world image, a large amount of realistic training data with annotations can be created. Pairing simulated images with real images is a highly expensive task, hence there is large interest in automating pairing and translation between domains. A successful pairing between two different image domains maintains the semantics across domains. This concept is illustrated in Figure 5.1.

Figure 5.1: Pairing of images between different visual domains.


Generative adversarial networks have been used to generate pictures of a particular domain from random vectors. Figure 5.2 depicts this process. However, for our application, the input is an image from the source domain and the output is an image from the target domain. Furthermore, we want the source and target domain images to be semantically related. The works in [23, 2] present methods to train GANs to translate a source domain image to a corresponding target domain image. The model in [2] is of particular interest since its training process is unsupervised and fits in well with the motivation of this paper. This model is termed CycleGAN by the authors, and its application to training a robotic task will be discussed in greater detail in the later sections. Other methods include incorporating a loss function [24] to guide the training of the generator.

Figure 5.2: GAN as a generative model for images

This chapter will focus mainly on the study of GANs for image translation and pairing across domains, as seen in Figure 5.3. A novel model that utilizes the task loss of the generated target domain image to guide the GAN training process will be proposed and evaluated in Section 5.4.

Figure 5.3: Adapting GANs to achieve translation and pairing across domains


5.1 Training details and architectures

One flaw of using generative adversarial networks for image translation is the instability of the training process. This section covers general training details and tips that help to stabilize the training process.

In [24], the authors proposed training the discriminator with a history of generated images. A similar process is implemented for this project. The history of generated images is created as a first-in first-out (FIFO) queue with a limited capacity. At each step where the generator creates new images, they are added to the queue. When the number of images in the queue exceeds the capacity, the oldest images are discarded. For each mini-batch training step of the discriminator, the new batch of generated images is mixed with a randomly sampled batch of generated images from the queue. The ratio of the most recent batch of generated images to the randomly sampled batch can be varied, but the default ratio used in experiments is 1:3.
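A minimal sketch of such a history buffer is given below, assuming images are handled as individual array elements; the class and function names are illustrative and not taken from the actual implementation.

import random
from collections import deque

class ImageHistoryBuffer:
    # FIFO queue of previously generated images; the oldest images are
    # discarded automatically once the capacity is exceeded.
    def __init__(self, capacity=1000):
        self.queue = deque(maxlen=capacity)

    def add(self, images):
        # Add a batch (any iterable) of newly generated images.
        self.queue.extend(images)

    def sample(self, n):
        # Randomly sample up to n previously generated images.
        n = min(n, len(self.queue))
        return random.sample(list(self.queue), n)

def discriminator_batch(history, new_images, ratio=3):
    # Mix the newest generated images with replayed ones at a 1:ratio ratio.
    history.add(new_images)
    replayed = history.sample(ratio * len(new_images))
    return list(new_images) + replayed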

Other ways of stabilizing the training process utilize additional loss functions to guide the generator. In [2], the authors propose a cycle consistency loss which reduces the space of possible mapping functions, hence encouraging the generators to learn a mapping from a source input to a related target output. In [24], the authors propose a feature transform loss by mapping the generated output and the input to a similar feature space and penalizing the L1 loss between the two feature vectors. However, in the experiments of that paper, the authors use an identity map, which implicitly assumes a visual relation between the source and target domain, which may not always be the case.
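As an illustration, the feature transform loss from [24] can be sketched as follows, where feature_map stands for whichever feature extractor is chosen (the identity map in their experiments); the names are illustrative only.

import numpy as np

def feature_transform_loss(feature_map, x_batch, generated_batch):
    # L1 distance between the feature representations of the input images and
    # the corresponding generated images; with an identity feature map this
    # reduces to a pixel-wise L1 loss between input and output.
    return np.mean(np.abs(feature_map(x_batch) - feature_map(generated_batch)))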

We also found that the training of generative adversarial networks is highly sensitive to initialization. A common problem when training generative adversarial networks is getting stuck at poor local minima/maxima early in the training process, and usually re-initializing the training process helps.

The choice of discriminator and generator architectures in this paper follows those described in the appendix of the CycleGAN paper [2]. Since we are dealing with 128x128 images, we use the 6-block generator architecture described in the paper.


5.2 Typical GAN for image translation

Figure 5.4: Typical GAN architecture for image translation

Given a source domain X and a target domain Y, we want to train the generator to learn a mapping function G : X → Y and a discriminator DY to differentiate between generated target domain images G(x) and real target domain images y. The objective function can be expressed as:

\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim Y}[\log(D_Y(y))] + \mathbb{E}_{x \sim X}[\log(1 - D_Y(G(x)))]    (5.1)

The adversarial training leads to a minimax game where the discriminator tries to maximize the above loss function while the generator tries to minimize it. The actual implementation differs from the above equation in that the negative log likelihood objective is replaced with a least squares loss. This follows the implementation details from [2], where using the least squares loss is reported to be more stable during training and to generate higher quality images. The least squares version of the objective function is:

\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim Y}[D_Y(y)^2] + \mathbb{E}_{x \sim X}[(1 - D_Y(G(x)))^2]    (5.2)
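In practice the least squares objective is optimized by alternating a discriminator update and a generator update. The following NumPy sketch shows one common convention (real images labelled 1, generated images labelled 0, matching Equations 5.7 and 5.8 later in this chapter); the function names are illustrative only.

import numpy as np

def discriminator_ls_loss(d_real_scores, d_fake_scores):
    # Push D_Y(y) towards 1 for real target images and D_Y(G(x)) towards 0
    # for generated images.
    return np.mean((1.0 - d_real_scores) ** 2) + np.mean(d_fake_scores ** 2)

def generator_ls_loss(d_fake_scores):
    # Push D_Y(G(x)) towards 1 so that generated images look real to D_Y.
    return np.mean((1.0 - d_fake_scores) ** 2)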

With this basic architecture, the generator will learn to transform an input source image into a target domain image. However, this will likely not lead to a pairing between domains since there is no term in the objective that encourages this relation. This property will be explored and tested in Section 5.4.


5.3 CycleGAN

As discussed in Section 2.7 and in the introduction of this chapter, CycleGAN, developed by the authors of [2], is a method of training generative adversarial networks to perform image-to-image translation without the need for source-target domain pairs x-y.

5.3.1 Model details

Figure 5.5: CycleGAN architecture

A figure of the CycleGAN architecture can be found in Figure 5.5. Instead of learning only the mapping function G : X → Y, the inverse mapping function F : Y → X is learnt as well. Likewise, a separate discriminator DX, which learns to differentiate between images from domain X and the generated images F(y), is necessary to train the inverse mapping function.

An additional loss is added to the objective function to utilize the inverse mapping function. This loss is termed the cycle consistency loss. It penalizes the L1 loss between the original image and the reconstructed image after being put through the generator and the inverse generator. This encourages the generators to learn to map to a related output in the other domain. The cycle consistency loss and the full objective can be formulated as follows:

\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim X}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim Y}[\|G(F(y)) - y\|_1]    (5.3)

\mathcal{L}_{CycleGAN}(G, F, D_X, D_Y, X, Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(G, F)    (5.4)
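A minimal sketch of the cycle consistency term in Equation 5.3 for one minibatch is shown below, assuming G and F are callables that map batches of images between the domains; the per-pixel mean is used in place of the raw L1 norm, which only changes the loss by a constant scale.

import numpy as np

def cycle_consistency_loss(G, F, x_batch, y_batch):
    # G maps source-domain images to the target domain, F maps back.
    # x_batch, y_batch: arrays of shape (N, H, W, C).
    forward_cycle = np.mean(np.abs(F(G(x_batch)) - x_batch))    # x -> G(x) -> F(G(x))
    backward_cycle = np.mean(np.abs(G(F(y_batch)) - y_batch))   # y -> F(y) -> G(F(y))
    return forward_cycle + backward_cycle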


5.3.2 Results

The CycleGAN is trained with a learning rate of 0.0002 for 30000 training steps. Two different source-target domain pairings are tested. Each pairing has 22656 images in the source domain and 8000 images in the target domain. The experiments in the original paper are trained for much longer; due to a lack of computational resources and time, the experiments in this paper are run over a relatively much shorter period of time.

The first source-target domain pairing that was tested is shown in Figure 5.6. For this simple transformation, the pairings are quite successful and generalize well.

(a) From top to bottom: Source domain image x, generated target domain image G(x), reconstructed source domain image F(G(x))

(b) From top to bottom: Target domain image y, generated source domain image F(y), reconstructed target domain image G(F(y))

Figure 5.6: Sample images from CycleGAN after 30000 training steps. Target domain created by multiplying [0.25, 0.5, 0.75] with the RGB channels of the source domain image respectively.


Further testing is done on a more difficult source-target domain pairing. The source domain remains the same while the target domain is created by processing the original image with a Canny edge detection algorithm. Each colour channel of the source domain image is put through a Canny edge detection algorithm and recombined to form the target domain image. Samples during the training process can be found in Figures 5.7 and 5.8.

(a) From top to bottom: Source domain image x, generated target domain image G(x), reconstructed source domain image F(G(x))

(b) From top to bottom: Target domain image y, generated source domain image F(y), reconstructed target domain image G(F(y))

Figure 5.7: Sample images from CycleGAN after 8000 training steps. Target domain created by putting each colour channel of the original image through a Canny edge detection algorithm with a sigma value of 4 and recombining the channels afterwards.


(a) From top to bottom: Source domain image x, generated target domain image G(x), reconstructed source domain image F(G(x))

(b) From top to bottom: Target domain image y, generated source domain image F(y), reconstructed target domain image G(F(y))

Figure 5.8: Sample images from CycleGAN after 30000 training steps. Target domain created by putting each colour channel of the original image through a Canny edge detection algorithm with a sigma value of 4 and recombining the channels afterwards.

Interestingly, at training step 8000, a reasonably good image translation was achieved. Upon further training, the image translation diverged from the desired pairings. While the image translation at this point is far from perfect, other experiments were run and a better quality image translation for this source-target domain pairing was never found. A possible explanation is that this particular source-target domain translation is difficult to learn. The sample images at step 8000 show the potential for such a translation; however, achieving a stable training process towards it remains rather difficult with this approach.


Another point to note about the CycleGAN is that while the cycle consistency loss encourages a pairing between the source domain image x and the reconstructed source domain image F(G(x)), it does not necessarily ensure a pairing between x and the generated target domain image G(x), nor between G(x) and F(G(x)). This is best seen in Figure 5.8a, where a very strong pairing between x and F(G(x)) can be seen while the pairing to G(x) is with a visually completely black image. It is highly likely that there are very small values in this black image that allow this pairing to exist, and it shows that the cycle consistency loss cannot guarantee that a good and consistent pairing will be found. A similar problem holds for the target domain translations from y to F(y) and from F(y) to G(F(y)).

As stated at the start of this section, these experiments were run over 30000 training steps only, whereas those in the original paper were run over a much longer period of time. It is entirely possible that further training might lead to better results, but this will not be tested further in this paper due to a lack of computational resources and time. It is also worth noting that the failure to achieve a good image translation in the latter domain pairing could be due to limitations of the generator and discriminator architectures.

5.4 TaskGAN

In the scenario where labels exist for some amount of data in the target domain, we are able to train a neural network to perform a regression task with those labels. For example, in Section 4.3, we show that a small amount of labels in the target domain can be supplemented by rich source domain data to achieve good performance. The neural network trained for the regression contains knowledge about what a target domain image should look like given a label. Intuitively, this means that this neural network can also guide the generator in creating an image in the target domain that is paired with the source domain image, given the label of the source domain image. Given a data rich source domain with annotations, it is possible to backpropagate gradients from the task loss on the generated target domain image to the generator. The architecture for this model is similar to that of the typical GAN model described in Section 5.2. The key difference is training the generator with the task loss of the generated image.


5.4.1 Model details

The name TaskGAN comes from the addition of a neural network trained on a task in the target domain and the use of the task loss of the generated target domain image to guide the training of the generator.

Given a target domain Y with images y_i and labels φ_i^y, we can train a neural network to learn a mapping function f_Y : Y → φ_Y. Furthermore, as shown in Chapter 4, supplementing training with the data from a rich source domain X with images x_i and labels φ_i^x can lead to the neural network learning a more effective mapping function f_Y. We want the labels across domains to be transferrable in order to create a pairing effect when training. In other words, for each x_i-y_i pair, the labels φ_i^x and φ_i^y should be identical. This allows the source domain label φ_i^x to be used as the target domain label φ_i^y, and the task loss of the generated target domain image can be computed as (f_Y(G(x_i)) − φ_i^x)^2. Then, the TaskGAN can be trained without the need for any explicit x-y pairs with the following objective function:

\mathcal{L}_{TaskGAN}(G, D_Y) = \mathbb{E}_{y \sim Y}[D_Y(y)^2] + \mathbb{E}_{x \sim X}[(1 - D_Y(G(x)))^2] + \lambda_{\phi} \mathbb{E}_{x \sim X}[(f_Y(G(x)) - \phi^x)^2]    (5.5)

where λ_φ corresponds to the weight of the task loss of the generated image G(x). The weight used in the implementation is 0.01; we find that larger values tend not to work as well. Through adversarial training, we aim to solve the following minimax game:

G^* = \arg\min_G \max_{D_Y} \mathcal{L}_{TaskGAN}(G, D_Y)    (5.6)

To further break it down, at each training step, the generator and discriminator are trained to minimize the following functions respectively:

\mathcal{L}_G(G, D_Y, X, \phi_X) = \mathbb{E}_{x \sim X}[(1 - D_Y(G(x)))^2] + \mathbb{E}_{x \sim X}[(f_Y(G(x)) - \phi^x)^2]    (5.7)

\mathcal{L}_{D_Y}(G, D_Y, X, Y) = \mathbb{E}_{y \sim Y}[(1 - D_Y(y))^2] + \mathbb{E}_{x \sim X}[D_Y(G(x))^2]    (5.8)

Given a minibatch of size N, equations 5.7 and 5.8 can be rewritten in the following manner:

\mathcal{L}_G(G, D_Y, X, \phi_X) = \frac{1}{N} \sum_{i=1}^{N} \left[ (1 - D_Y(G(x_i)))^2 + (f_Y(G(x_i)) - \phi_i^x)^2 \right]    (5.9)

\mathcal{L}_{D_Y}(G, D_Y, X, Y) = \frac{1}{N} \sum_{i,j=1}^{N} \left[ (1 - D_Y(y_j))^2 + (D_Y(G(x_i)))^2 \right]    (5.10)
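The per-minibatch losses in Equations 5.9 and 5.10 can be sketched as follows, assuming the discriminator scores and the task network predictions have already been computed for the batch; the λ_φ weight from Equation 5.5 (0.01 in our implementation) is included explicitly, although Equations 5.7 and 5.9 omit it. The function names are illustrative only.

import numpy as np

def taskgan_generator_loss(d_fake_scores, task_preds, labels, lambda_phi=0.01):
    # Equation 5.9: adversarial term plus the task loss of the generated images.
    # d_fake_scores: D_Y(G(x_i)) over the minibatch.
    # task_preds:    f_Y(G(x_i)), predictions of the task network on generated images.
    # labels:        phi_i^x, source-domain labels reused for the generated images.
    adversarial = np.mean((1.0 - d_fake_scores) ** 2)
    task = np.mean((task_preds - labels) ** 2)
    return adversarial + lambda_phi * task

def taskgan_discriminator_loss(d_real_scores, d_fake_scores):
    # Equation 5.10: real target images are labelled 1, generated images 0.
    return np.mean((1.0 - d_real_scores) ** 2) + np.mean(d_fake_scores ** 2)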


At this point, it is also worth pointing out that in earlier sections, the mapping function f_Y was trained with some x-y pairs provided, while the above TaskGAN model does not use explicit pairs during training. It is possible to train the TaskGAN with explicit pairs with a small modification: an additional L1 loss between the generated image G(x) and the target domain image y can be added and used to train the generator. Equation 5.9 can be updated to the following when training with paired examples:

\mathcal{L}_G(G, D_Y, X, Y, \phi_X) = \frac{1}{N} \sum_{i=1}^{N} \left[ (1 - D_Y(G(x_i)))^2 + (f_Y(G(x_i)) - \phi_i^x)^2 + \|y_i - G(x_i)\|_1 \right]    (5.11)

Another method of training with paired examples is described in [23], where a discriminator is trained to distinguish between real x_i-y_i pairs and fake x_i-G(x_i) pairs. A reason for not wanting to use x-y pairs in the training of the TaskGAN is that the mapping function f_Y can be trained without pairs at all. If there is a target domain Y dataset that is already well annotated with labels that can be replicated in the source domain, a neural network can be trained solely on that dataset and used for the TaskGAN. Clearly, having some pairs during training would help the learning process, and we will evaluate the quality of the images generated using a TaskGAN trained with and without pairs in the next section.

5.4.2 Results

For testing of the TaskGAN, we create several source to target transformations of varying difficulties. We choose transformations that are typical of differences between simulators and reality. Each target transformation is created by editing elements of the V-REP scene. The mechanics of the robot arm are not modified (i.e. the lengths of the links and the relative positions of tip and target). Other attributes such as shape, colour, and width are changed. This is to test the effectiveness of transferring knowledge for the same task over different visual domains.

All datasets are created with 800 paired images between the source and target domains, made up of 50 episodes of 16 steps each. This is used in conjunction with the source domain dataset of 22656 images to train a neural network to learn the mapping function f_Y. The neural network is initialized with pre-trained weights and trained over 15000 steps with a learning rate of 0.001 on the task loss and the L2 loss between feature layers. Image translation models (i.e. the GAN models) are trained for 20000 steps with a learning rate of 0.0002.


(a) Source domain (b) Target domain

Figure 5.9: First source to target transformation. Link widths are changed, with the bottom link being wider, the middle link remaining the same, and the last link being thinner. Colours are slightly changed to a lighter shade.

The first transformation changes the widths of the links and slightly changes the colours of the cube and robot arm. A sample source-target domain image pair can be found in Figure 5.9. Some samples of the images generated by the TaskGAN can be found in Figure 5.10.

Figure 5.10: Samples from the TaskGAN model for the first source to target transformation. Top row: samples from the source domain. Bottom row: the generated target domain images from the source domain samples.

We find that the generated images are of high quality. Generally, the pose of the arm is well matched and the position of the cube is correct. The sizes of the links are also transformed to better represent the target domain, with a thicker link at the bottom and a thinner final link. The cube is also correctly resized to a smaller one, and colours are lightened to match the target domain. Overall, using the TaskGAN for this transformation is rather successful.


Next, we want to try a more difficult source to target transformation. A sample of the second source to target transformation can be found in Figure 5.11.

(a) Source domain (b) Target domain

Figure 5.11: Second source to target transformation. In addition to the size changes of the links and cube, the second transformation involves more colour changes, with different colours for each link and the colour of the cube inverted.

For the second source to target transformation, we will evaluate 3 different models: the typical GAN for image translation, the TaskGAN trained without paired examples, and the TaskGAN trained with paired examples. The respective samples can be found in Figures 5.12, 5.13, and 5.14.

As expected, the models' quality of translated images can be ranked in the following ascending order: typical GAN, TaskGAN trained without paired examples, TaskGAN trained with paired examples. From the figures, we can see that the images generated by the typical GAN capture the qualities of the target domain, but consistently distort information such as the position of the cube and the pose of the robot arm. Both TaskGAN models create better translations, which correctly retain arm pose and cube position information. From the paired samples, the TaskGAN trained without paired samples has very slight errors, as seen in the superimposed images, while the translations of the TaskGAN trained with paired samples are of higher quality. For both TaskGANs, we can see occasional large errors in the random unpaired samples where the cube disappears after translation. This might be due to the TaskGANs generalizing poorly, since a small number of target domain samples (800 in this case) is used to train both f_Y and the GAN. Ways to improve this include using more x-y pairs in the training of f_Y or having more unpaired y examples to train the discriminator.


(a) Paired samples. From top to bottom: Source domain images x, generated target domain images G(x), paired target domain images y, superimposed image from G(x) and y.

(b) Random unpaired samples. From top to bottom: Source domain images x, generated target domain images G(x).

Figure 5.12: Samples for second source to target transformation using typical GAN.


(a) Paired samples. From top to bottom: Source domain images x, generated target domain images G(x), paired target domain images y, superimposed image from G(x) and y.

(b) Random unpaired samples. From top to bottom: Source domain images x, generated target domain images G(x).

Figure 5.13: Samples for the second source to target transformation using the TaskGAN trained without paired examples.


(a) Paired samples. From top to bottom: Source domain images x, generated target domain images G(x), paired target domain images y, superimposed image from G(x) and y.

(b) Random unpaired samples. From top to bottom: Source domain images x, generated target domain images G(x).

Figure 5.14: Samples for the second source to target transformation using the TaskGAN trained with paired examples.


(a) Source domain (b) Target domain

Figure 5.15: Third source to target transformation. Cube size remains the same while the widths of the links are changed in a way similar to previous transformations. Lighting effects are added and all colours including the background are slightly changed.

For the third source to target transformation, we want to introduce realistic effects as well as a different background. We chose to do this by adding lighting effects and changing the background colour slightly. In the previous tests, we used 800 paired examples to train the neural network f_Y. However, we found that this was insufficient to guide the generator in this particular transformation. We trained another f_Y with 8000 paired examples to ensure good performance.

Figure 5.16: Performances of various models in the third target domain.

The results can be found in Figure 5.16. Training with 8000 paired examples led to a final distance success rate of 95% compared to 53% when trained with 800 paired examples. Intuitively, the better the performance of f_Y, the more effective it is at guiding the generator to learn to generate paired examples. 3 different models will be evaluated: the typical GAN for image translation, the TaskGAN with f_Y trained on 800 paired examples, and the TaskGAN with f_Y trained on 8000 paired examples. To be clear, the TaskGAN itself is trained without paired examples. The respective samples can be found in Figures 5.17, 5.18, and 5.19.


For the typical GAN, we find that with more training, the generator learns to map any given input source domain image to a similar target domain image. Since the only loss given to the generator is to create a target domain lookalike image, there is no incentive to create a related pairing between the source and target domains. This encourages the behaviour of mapping all source domain images to a highly similar target domain image, as seen in Figure 5.17b and c. It is worth noting that at earlier training steps, the generator is still able to generate a target domain image that somewhat matches the pose of the source domain image, as seen in Figure 5.17a. It fails to match the position of the cube and treats the cube as part of the background.

For the TaskGAN with f_Y trained with 800 paired examples, we find that the general translation quality is quite poor. While it still manages to match the pose of the arm, the arm itself is of low quality. It manages to recreate the general background of the target domain, but treats the cube as part of the background. In every translated target domain image, the cube is at the exact same position, disregarding the position of the cube in the source domain. This makes it a very poor image translation model for generating target domain data. Possible reasons for the difficulty of training a good image translation model are the complex background changes, the weak predictive strength of f_Y, or a combination of both factors. As a point of comparison, the f_Y used for the successful image translation in the second source to target domain translation had a final distance success rate of 43% in that target domain, making it even weaker than the f_Y used in this example. This leads us to believe the increased complexity of the source to target transformation played the larger role in degrading the image translation quality.

For the TaskGAN with f_Y trained with 8000 paired examples, we find that the generated target domain images are of good quality. Most aspects of the target domain are reconstructed, including the shadows of the cube and the arm. The shadows of the arm are not perfect, as they are treated as part of the background and always reconstructed in the same manner. However, the shadows are not significantly important to the task, so this is an acceptable flaw. While some shapes of the arms are slightly distorted, both arm pose and cube position are accurately reconstructed, making these translated images a highly desirable source of training data in the target domain.


(a) Samples at step 3000

(b) Samples at step 19000

(c) Samples at step 20000

Figure 5.17: Samples at various training steps for the third source to target transformation using a typical GAN for image translation. For each set of samples, the top row represents the source domain images x and the bottom row represents the generated target domain images G(x).


(a) Paired samples. From top to bottom: Source domain images x, generated target domain images G(x), paired target domain images y, superimposed image from G(x) and y.

(b) Random unpaired samples. From top to bottom: Source domain images x, generated target domain images G(x).

Figure 5.18: Samples for the third source to target transformation using the TaskGAN with f_Y trained with 800 paired examples.


(a) Paired samples. From top to bottom: Source domain images x, generated target domain images G(x), paired target domain images y, superimposed image from G(x) and y.

(b) Random unpaired samples. From top to bottom: Source domain images x, generated target domain images G(x).

Figure 5.19: Samples for the third source to target transformation using the TaskGAN with f_Y trained with 8000 paired examples.


(a) Source domain (b) Target domain

Figure 5.20: Fourth source to target transformation. Cube size remains the same while the widths of the links are changed in a way similar to previous transformations. Lighting effects are added and the background is changed to a wood-like texture.

For the fourth source to target transformation, we introduce a more significant background change by adding a wood-like texture. The f_Y used to train the TaskGAN is trained on 8000 paired examples and had a 91% final distance success rate when tested in the target domain. Samples at various training steps can be found in Figure 5.21.

At step 5000, we find that the generator has learnt to recreate the background and match the pose of the arm. However, the generator ignores the position of the cube in the source domain and recreates it at the same position in the generated target domain image.

At step 35000, we find that arm pose and cube position are generally well matched. However, the quality of the image translation is poor. The generated background is noisy and of poor quality. The generated arm is noisy and does not match the smoother colours generated by translations in previous experiments. The cube, while matching in position, has a highly faded appearance.

Upon further training, at step 40000, we find that the overall quality of the generated image improves. However, it once again fails to match the position of the cube and generates it at the exact same position in all generated target domain images. It seems that the generator trades off matching the object information in the source domain against creating higher quality target domain characteristics. It is unclear whether further training can lead to better results, but due to a lack of computational resources this will not be attempted in this paper. Overall, the image translation is not considered successful for this source-target domain pairing. However, it shows promise, and tweaks to various aspects such as the generator and discriminator architectures might lead to much better results.


(a) Samples at step 5000

(b) Samples at step 35000

(c) Samples at step 40000

Figure 5.21: Samples at various training steps for the fourth source to target transformation using a TaskGAN trained with 8000 paired examples. For each set of samples, the top row represents the source domain images x and the bottom row represents the generated target domain images G(x).


In summary, the simpler the source-target domain transformation, the easier it is for the GAN to learn it. Utilizing the task loss of even a weakly trained f_Y tends to lead to much better results than using only the typical GAN losses. Target domains with complicated and noisy backgrounds tend to be hard to re-create without losing other information and need a well trained f_Y for better results. For best results, source and target domains should have clean, non-noisy backgrounds and similar colours, shapes, and sizes. For application purposes, we would tend towards such a scenario anyway, where the source domain in a simulator is created to be visually similar to the target domain in reality.


Chapter 6

Training with Generated Source-Target Pairs

After discussing in depth the use of GANs for image translation in Chapter 5, we will investigate the use of these translated images to train neural networks for a robotic task in this chapter. Clearly, if there is a successful image translation from a source domain to a target domain, the labels from the source domain can be used for the paired generated target domain image. We will adapt the methods discussed in Chapter 4 to utilize the translated target domain images.

Section 6.1 covers a completely unsupervised method of performing image translation and training on a regression task. Section 6.2 covers methods of performing image translation with limited source-target domain pairs and training on a regression task. It also includes a more comprehensive study on the effects of utilizing image translation models for transfer learning on source to target domain transformations of various difficulties.

6.1 Training with CycleGAN images

This section describes a few methodologies to utilize the generated images from CycleGAN to train on a robotic task. Since the CycleGAN model can be trained in an unsupervised fashion, source domain images are assumed to possess labels and target domain images are unlabelled. This simulates the source-target domain pairing between simulator and reality and the aim of not requiring expensive annotations in reality.


6.1.1 Models

The models revolve heavily around the concept of extracting a feature vector from an image using a neural network. Figure 6.1 summarizes the earlier discussion in Chapter 4 of the feature extractor model.

Figure 6.1: Feature extractor example. Images of input size 128x128x3 are mapped to a 256-dimensional feature space.

• Source domain translation: Source domain images x, generated target domain images G(x), and reconstructed source domain images F(G(x)) are trained on the task with the labels of x. The L2 losses between their feature layers are minimized. Figure 6.2a depicts the architecture.

• Source + target domain translations: In addition to training on the source domain translations, the L2 losses between the target domain translations are minimized. Specifically, target domain images y, generated source domain images F(y), and reconstructed target domain images G(F(y)) have the L2 losses between their corresponding feature layers minimized. Figure 6.2b depicts the extensions to the architecture.

• Transferring of labels from source to target domain: All source domain images x are mapped to a feature space θ_x. Any nearest neighbours algorithm can be chosen to construct a tree from this feature space; the ball tree algorithm was chosen for this implementation. For each target domain image y, its corresponding feature vector θ_y is used to query the tree, and the nearest neighbour in the feature space transfers its label to the target domain image y (a minimal sketch follows below). This idea is inspired by the work in [18], where a similar idea of mapping source and target domain images to a shared feature space and transferring labels to the nearest neighbour is used.
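A minimal sketch of this label transfer step, assuming 256-dimensional features from the shared feature extractor and using scikit-learn's BallTree; the variable names are illustrative only.

import numpy as np
from sklearn.neighbors import BallTree

def transfer_labels(source_feats, source_labels, target_feats):
    # source_feats:  (N_s, 256) feature vectors of source-domain images.
    # source_labels: (N_s, ...) labels of those images.
    # target_feats:  (N_t, 256) feature vectors of unlabelled target-domain images.
    tree = BallTree(source_feats)           # build the ball tree over source features
    _, idx = tree.query(target_feats, k=1)  # index of the nearest source neighbour
    return source_labels[idx[:, 0]]         # transferred labels, one per target image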


(a) Source domain translation.

(b) Target domain translation.

Figure 6.2: Training architectures for translated images with CycleGAN


6.1.2 Results

The source-target domain pairing used can be found earlier in Figure 5.6. All generated and reconstructed images are pre-computed and saved on disk in jpg format for quicker training. Models are trained for 10000 training steps with a batch size of 16 for each type of image. For the transfer of labels, the “Source + target domain translation” model is used to construct and find the nearest neighbours in the feature space. That model is then further fine-tuned using the newly labelled target domain examples with the target domain translation.

Figure 6.3: Results of training on translated images of CycleGAN.

Training on the source domain translation only leads to final distance success rates of 90% and 87% in the source and target domains respectively. With the addition of minimizing the L2 loss on the target domain translation, the source and target domain performances improve to final distance success rates of 94% and 93% respectively. Both generated and reconstructed images can be considered noisy versions of the originals, and training on such noisy images encourages the neural network to identify the domain invariant features which are essential to the task, leading to an overall improved performance.

The motivation for introducing label transfer is to utilize the true target domain images for training on the task loss, since they are of high quality and it would be a waste not to use them for training. As seen in Figure 6.4, the majority of nearest neighbour pairings are of good quality. However, training the source + target domain translation model with the newly transferred labels did not lead to improved results. Final distance success rates fell slightly in the source domain from 94% to 93%, while performance in the target domain fell from 93% to 87%. This could be attributed to the fact that the pairings were not perfect and that the translated pairings were of greater quality. However, these results should not lead to the conclusion that this approach is unviable. In this source-target pairing, the translations were relatively easy to learn and of high quality. In other more complex scenarios, a high quality image translation model might not be learnt, and supplementing the training with label transfer might lead to performance improvements in those scenarios.


Figure 6.4: Nearest neighbour samples. Each set of samples represents 5 rows of randomly chosen target domain images and their 3 nearest neighbour source domain images superimposed onto them. Left to right for each row: Original target domain image, superimposed image with nearest neighbour, superimposed image with 2nd nearest neighbour, superimposed image with 3rd nearest neighbour.


6.2 Training with TaskGAN images

Under the scenario where a small number of paired source-target domain examples is available, we can utilize these pairs to train a TaskGAN image translation model as described in Chapter 5. This section describes approaches that use some combination of unpaired source domain images x, paired source-target domain images x-y, and paired source to generated target domain images x-G(x) to improve performance in the target domain.

6.2.1 Model details

There are 4 models being evaluated in the following section. The first model is trained on source domain images only, while the rest are trained in a manner similar to the two separate feature extractor model with L2 loss between feature layers and pre-training as described in Chapter 4. Models are trained for 15000 steps with a learning rate of 0.001 and a batch size of 16. More details on the 4 different methods are as follows:

• Source: This model is trained on the source domain images x only. It serves as a benchmark against the performances of the other models with transfer learning.

• Source + Paired: This model is trained on unpaired source domain images x and paired source-target domain images x-y. The aim is to use only a small number of x-y pairs and evaluate the performance improvements under such conditions.

• Source + Translated: This model is trained on unpaired source domain images x and paired source to generated target domain images x-G(x). Since source domain images and labels are assumed to be easily obtainable, we allow the use of a larger number of x-G(x) pairs and test the potential benefits of such an approach.

• Source + Paired + Translated: This model is trained on unpaired source domain images x, paired source-target domain images x-y, and paired source to generated target domain images x-G(x). In a similar vein to the above methodologies, we limit x-y pairs to a small number and allow a larger number of x-G(x) pairs. This model determines the effectiveness of training with a combination of a small number of true paired examples and a large number of generated paired examples.


6.2.2 Results

In this section, an evaluation of the performances of the 4 different methods on various target domains will be done. In particular, two sets of experiments will be done to study the performance changes under different conditions. The first set of experiments studies target domains with a wider range of changes, while the second set of experiments investigates the effect of varying only the cube size in the target domain.

Figure 6.5 shows the 3 target domains for the first set of experiments on which the models are trained and tested. The changes from the source domain are detailed as follows:

• Target A: Link widths are modified, with the base link becoming thicker and the tip link becoming thinner. Colours of the cube and arm are slightly changed. Cube size is slightly reduced.

• Target B: Link widths are modified, with the base link becoming thicker and the tip link becoming thinner. Colours of the cube, arm, and background are slightly changed. Lighting effects are added to simulate realism.

• Target C: Link widths are modified, with the base link becoming thicker and the tip link becoming thinner. Colours are changed more significantly, with the base link becoming red, the middle link becoming green, the tip link having a slight colour change, and the cube's colour inverted from red to cyan.

The unpaired source domain x dataset used for this set of experiments contains 22656 images, the small x-y pairs dataset contains 800 pairs of images, and the large x-G(x) pairs dataset contains 8000 pairs of images. The f_Y mapping functions for the TaskGANs for Targets A and C were trained with the small x-y pairs dataset of 800 pairs of images. It was more difficult to achieve a successful TaskGAN model for Target B, and the f_Y mapping function for the TaskGAN for Target B was trained with a larger x-y pairs dataset of 8000 pairs of images. The TaskGAN training itself was done without the use of the pairs (i.e. without the L1 loss). Further details on the training of the TaskGANs and image translation samples can be found in Chapter 5.

(a) Source (b) Target A (c) Target B (d) Target C

Figure 6.5: Sample image from the source domain and the various target domains being tested on.


Figure 6.6: Full results of models trained and tested under different conditions.

Figure 6.7: Final distance success rates of models trained and tested under different conditions.

The full results can be found in Figure 6.6 and a visual summary in the form of a graph can be found in Figure 6.7. While it is difficult to directly quantify the gap between source-target transformations, the performance of the “Source” model in the various target domains can be used as a form of approximation. From this perspective, Target A represents the simplest or smallest source-target domain gap, with the source model achieving a final distance success rate of 29%, while Target C is the most complex or largest source-target domain gap, as the “Source” model completely fails with a final distance success rate of 0%.

While we are able to approximately rank the difficulty of the transformations in ascending order as Target domains A, B, and C, we find that training a TaskGAN for Target B was the most difficult to achieve. Complex background changes are much more difficult for an image translation model to achieve. As described earlier, it is important to note that the TaskGAN for Target B uses a large number of x-y pairs in its training (8000 pairs, compared to the 800 pairs for the other TaskGANs) in order to achieve a reasonable translation. It is possible that the number of pairs can be reduced from 8000, but that was not attempted due to the lack of computational resources. It also goes against the motivation of this section, which attempts to build a model that performs in the target domain from a small number of x-y pairs. However, we include it as a proof of concept to show the possible impact if an image translation is successfully trained. To reiterate, the TaskGANs for Target A and Target C were trained with a small number of pairs (800), and the TaskGAN for Target B was trained with a large number of pairs (8000).

In general, we find the largest performance gains from the “Source + Paired + Translated” method of training with both a small number of x-y pairs and a large number of x-G(x) pairs. The other two methods, “Source + Paired” and “Source + Translated”, had performance gains over the “Source” method, which suggests that both types of image pairs are helpful. In test domains Target A and C, we find that having a large number of translated x-G(x) pairs worked better than a small number of x-y pairs, while the opposite was true in test domain Target B. A plausible reason is the quality of the translated images G(x). Target B involved the reconstruction of a complex background with shadows and lighting, and these generated target domain images tend to be noisy compared to the true samples y. Hence, generated target domain images G(x) in Target B are noisier and less effective in improving performance when used alone.

However, the noisy reconstructions G(x) are not useless. By combining both x-y and x-G(x) pairs in the "Source + Paired + Translated" method in domain Target B, a respectable final distance success rate of 77% could still be achieved. By training on both the noisy images and the true images, the neural network learns to ignore the noise, and the large number of G(x) images becomes a valuable source of information.


In the second set of experiments, we would like to test how the models perform when varying only a single factor of the source domain. This tests how performance changes on target domains that differ from the source domain by small incremental changes. The variable chosen is the cube length. The original cube length in the source domain is 7, and target domains with cube lengths of 5, 3, and 1 will be tested.

In all previous experiments and tests, the images were saved in jpg format. However, the lossy compression discards some information, which affects image quality much more significantly when the cube is small. Hence, the dataset configuration was changed for this set of experiments and all images were saved in png format. The unpaired source domain x dataset used here contains 8000 images, the small x-y pairs dataset contains 800 pairs of images, and the large x-G(x) dataset reuses the same unpaired source domain x dataset, giving 8000 pairs of images. We train TaskGANs Gc5, Gc3, and Gc1 to map from the source domain to the respective target domains, each with a mapping function fY trained using the "Source + Paired" method and with the L1 loss. Samples of source-target pairs can be found in Figure 6.8 and samples of source to generated target pairs can be found in Figure 6.9. Generally, we find that the image translation quality is good.
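The jpg artefacts mentioned above are a consequence of lossy compression. As a small, hedged illustration (Pillow and the file names here are assumptions, not necessarily what was used to build the dataset), saving the rendered frames as png preserves the exact pixel values of small cubes:

```python
from PIL import Image
import numpy as np

# Illustrative only: the rendering pipeline and file names are assumptions.
frame = (np.random.rand(128, 128, 3) * 255).astype(np.uint8)  # stand-in for a simulator render

Image.fromarray(frame).save("sample.jpg", quality=90)  # lossy: small cubes can blur into the background
Image.fromarray(frame).save("sample.png")              # lossless: pixel values are preserved exactly
```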

Figure 6.8: Sample image from the source domain and the target domains with decreasing cube length. Cube lengths from left to right: 7, 5, 3, 1.

(a) Source x (b) Gc5(x) (c) Gc3(x) (d) Gc1(x)

Figure 6.9: Sample image from the source domain and generated target domain images. From left to right: source domain image, generated target domain images with cube length 5, 3, and 1.


Figure 6.10: Full results of models trained and tested under varying cube lengths.

Figure 6.11: Final distance success rates of models trained and tested under varying cube lengths.

Full results can be found in Figure 6.10 and a summary of the performances in the form of a graph can be found in Figure 6.11. Interestingly, the results are not fully consistent with what was found in the first set of experiments. In the first set of experiments, where the target domains were much more different, the "Source + Paired + Translated" method outperformed all other methods. However, in these target domains, which are arguably much more similar to the source domain, the "Source + Paired + Translated" method does not always work best. There are a few possible reasons why the "Source + Translated" methodology performs better in the cases where the cube length difference is 2 and 4.

Firstly, a likely reason the "Source + Translated" methodology performs better in the cases where the cube length difference is 2 and 4 is overfitting caused by the small number of x-y pairs. The decrease in performance of the "Source" model is smaller than in the previous set of experiments, which can be interpreted as these two target domains being highly similar to the source domain. Hence, training with a small number of pairs can lead to overfitting and a decrease in performance. This idea is further reinforced by the performance of the "Source + Paired" model when the cube length difference from the source domain is 2. In that scenario, the "Source + Paired" model had a final distance success rate of 67%, worse than the "Source" model, which had a final distance success rate of 88%. Furthermore, the quality of the translated images was good, and just using the translated images as training data for the target domain task was sufficient to produce good performances of 88% in both cases.

As the difference from the source domain increases, the results once again become consistent with what was found in the first set of experiments. Training with some x-y pairs using the "Source + Paired" method led to slightly improved results, training with a larger number of x-G(x) pairs using the "Source + Translated" method led to better results, and training with both types of pairs using the "Source + Paired + Translated" method led to the best results. The overfitting argument does not hold here, since the target domain is far enough from the source domain that the small number of x-y pairs proved to be valuable training data rather than a cause of overfitting.

Overall, we find that if the source and target domains are highly similar, training with the small number of x-y pairs tends to be detrimental to overall performance. Performance of the "Source + Translated" method of using x-G(x) pairs tends to decrease as the source-to-target domain gap increases. At larger source-to-target domain gaps, training with the small number of x-y pairs has a positive effect, as it helps the neural network learn to ignore the noise in the generated target domain images G(x). For most realistic applications, it is highly likely that the gap between simulated data and real data falls into the category of a large source-to-target domain gap, where the "Source + Paired + Translated" model will work best.


Chapter 7

Conclusions and Future Work

In this paper, we presented TaskGAN, a method of training generative adversarial networks for image translation and pairing across domains. It requires training a neural network fY on a task in the target domain Y, which we do using some number of source-target domain pairs x-y. We showed the viability of using this model to generate target domain images G(x) that recreate target domain characteristics such as background, shape deformations, and colour changes while retaining source domain information such as arm pose and cube position. Results show good image translation and pairing quality for simple to moderately complex source-to-target domain transformations. The method's potential in applications where simulated source domain images can be enhanced to look more like target domain images and used as training data was also demonstrated in various target domains. In addition, we introduced a novel dataset that is easily configurable for the training and testing of robot control for transfer learning purposes.
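For completeness, a minimal sketch of the first stage of this pipeline, fitting fY on the small set of x-y pairs so that it predicts the task labels from real target-domain images, is shown below. The loss, optimiser, and training schedule are illustrative assumptions rather than the exact setup used in the thesis.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of fitting the target-domain task network f_Y on the x-y pairs.
# paired_loader is assumed to yield (y_img, label) batches, where y_img is a
# real target-domain image and label is the task label inherited from its
# paired source image x (available for free in the simulator).

def train_f_Y(f_Y, paired_loader, epochs=20, lr=1e-4):
    opt = torch.optim.Adam(f_Y.parameters(), lr=lr)
    for _ in range(epochs):
        for y_img, label in paired_loader:
            opt.zero_grad()
            loss = F.mse_loss(f_Y(y_img), label)  # regress the task label from the target-domain image
            loss.backward()
            opt.step()
    return f_Y
```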

A limitation of TaskGAN is that complicated transformations can be unsuccessful even with an fY that is well trained on many x-y pairs. There are a few areas that can be worked on to improve this. Firstly, improvements in generator and discriminator architectures could lead to higher-quality translations. Also, the fY mapping trained in this paper was highly dependent on being provided some x-y pairs. Pairing and manual annotation are expensive, and we want to reduce the dependency on x-y pairs or even remove the need for them. While we investigated some unsupervised and semi-supervised methods in this paper, they did not help performance. It would be interesting to see future work that focuses on these branches of methods to reduce the dependence on paired data between the source and target domains.

Another point worth noting is that the GANs in this paper were trained over a much shorter period of time than in other similar works due to the lack of computational resources. It remains to be seen whether extended training periods can lead to significantly better results on the more complicated transformations. Also, training and testing were carried out entirely in simulated environments. Further experiments that use a real robot for target domain data generation and testing would be highly interesting.

