
IN DEGREE PROJECT ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

One Shot Object Detection for Tracking Purposes

TIJMEN VERHULSDONCK

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING


Abstract

One of the things augmented reality depends on is object tracking, a problem classically found in cinematography and security. However, the algorithms designed for these classical applications are often too computationally expensive or too complex to run on simpler mobile hardware. One method of doing object tracking is with a trained neural network; this has already led to great results but unfortunately still runs into some of the same problems as the classical algorithms. For this reason a neural network designed specifically for object tracking on mobile hardware needs to be developed. This thesis proposes two different neural networks designed for object tracking on mobile hardware. Both are based on a siamese network structure, and methods to improve their accuracy using filtering are also introduced. The first network is a modified version of "CNN architecture for geometric matching" that utilizes an affine regression to perform object tracking. This network was shown to underperform in the MOT benchmark as well as the VOT benchmark and was therefore not developed further. The second network is an object detector based on "SqueezeDet" in a siamese network structure, utilizing the performance-optimized layers of "MobileNets". The accuracy of the object detector network is shown to be competitive in the VOT benchmark, placing 16th compared to trackers from the 2016 challenge. It was also shown to run in real-time on mobile hardware. Thus the one shot object detection network used for a tracking application can improve the experience of augmented reality applications on mobile hardware.

Keywords: Object tracking, Deep learning, Siamese neural network, Affine regression network, One shot learning, Object detector, PID controller


Sammanfattning

One of the things augmented reality depends on is object tracking, a problem classically found in cinematography and security. However, the algorithms designed for the classical applications are often too computationally expensive or too complex to run on simpler mobile hardware. One method of doing object tracking is with a trained neural network, which has already led to good results but unfortunately still runs into some of the same problems as the classical algorithms. For this reason a neural network designed specifically for object tracking on mobile hardware must be developed. This thesis proposes two different neural networks intended for object tracking on mobile hardware. Both are based on a siamese network structure, and methods to improve their accuracy using filtering are also introduced. The first network is a modified version of "CNN architecture for geometric matching" that uses an affine regression to perform object tracking. This network was shown to underperform in the MOT benchmark as well as the VOT benchmark and was therefore not developed further. The second network is an object detector based on "SqueezeDet" in a siamese network structure, using the performance-optimized layers of "MobileNets". The accuracy of the detector network is shown to be competitive in the VOT benchmark, placing 16th compared to trackers from the 2016 challenge. It was also shown to run in real time on mobile hardware. Thus the one shot object detection network used for a tracking application can improve the experience of augmented reality applications on mobile hardware.

Keywords: Object tracking, Deep learning, Siamese neural network, Affine regression network, One shot learning, Object detector, PID controller


Acknowledgements

I want to extend my gratitude to all the people involved in this project, but I feel that I should mention a couple of people who have been very closely involved. First and foremost I want to thank my advisor Kenneth van Hoey from ETH Zurich for his continued support throughout the project, guiding me and helping me navigate the various obstacles. Without him the thesis would not have been what it is now. Secondly I want to thank Maximilian Schneider from Viorama GmbH for his guidance and continued trust in my research, even though the first results were disappointing. I also want to thank Bichen Wu from UC Berkeley for his help in determining a course for this thesis, and associate professor Jim Dowling for his great course on deep learning at KTH and for being my second advisor. Finally I want to thank professor Magnus Boman from KTH for supporting me in this project and allowing me to do my research abroad. I would also like to thank the Kungliga Tekniska högskolan (KTH) for allowing me to pursue this research, and Viorama Ltd. for hosting me and supporting me wherever they could.

Of course none of this would have happened without the continued support from my family and my girlfriend. They made the difficult moments during the making of this thesis a lot more bearable. Thank you.

Tijmen Verhulsdonck
Berlin, 29-08-2017


Contents

List of Acronyms

1 Introduction
  1.1 Motivation
  1.2 Background
    1.2.1 Deep learning
    1.2.2 State of the art in tracking algorithms
    1.2.3 Problem
  1.3 Research Methodology
  1.4 Research Contributions
  1.5 Thesis Organization

2 Background
  2.1 Neural Networks
    2.1.1 Fully connected networks (FCN)
    2.1.2 Training process
    2.1.3 Inference stage
    2.1.4 Convolutional neural networks (CNN)
    2.1.5 Performance
    2.1.6 Siamese network
    2.1.7 Image classifiers
  2.2 Machine Learning APIs
    2.2.1 Tensorflow
    2.2.2 Metal
  2.3 Datasets
  2.4 Project Goals and Specifications
    2.4.1 Problem
    2.4.2 Goal
    2.4.3 Proposed solution

3 Related work
  3.1 One shot learning
  3.2 Tracking
    3.2.1 Tracking using deep regression
    3.2.2 Tracking using a CNN and recurrent layers
    3.2.3 Fully convolutional neural network for object tracking
    3.2.4 Learnet
    3.2.5 Visual Tracking by Reinforced Decision Making
    3.2.6 Correlation Filter based tracking
    3.2.7 Tracking using Recurrent net and LSTM Cells
    3.2.8 Tracking by detection
  3.3 Optimizing network performance
    3.3.1 Deep compression
    3.3.2 SqueezeNet
    3.3.3 SqueezeDet
    3.3.4 MobileNets
  3.4 Affine Transformations
    3.4.1 Spatial Transformer Networks
    3.4.2 CNN architecture for geometric matching
  3.5 Datasets and Benchmarks
    3.5.1 Imagenet video dataset
    3.5.2 Multiple object tracking benchmark & dataset
    3.5.3 Visual object tracking benchmark

4 Tracking algorithm
  4.1 Evaluating related works
    4.1.1 SqueezeDet performance
    4.1.2 Fully Convolutional Siamese Tracker
  4.2 Affine regression tracker
    4.2.1 Modifications
    4.2.2 Tracking algorithm
  4.3 One shot learning object detector
    4.3.1 Object detection
    4.3.2 Loss function
    4.3.3 Tracking Algorithm

5 Technical details
  5.1 Affine regression network
    5.1.1 Training
    5.1.2 Tracking
  5.2 One shot object detector
    5.2.1 Training
    5.2.2 Tracking

6 Evaluation and Results
  6.1 MOT Challenge
  6.2 VOT Benchmark
    6.2.1 Comparison with the VOT 2016 Challenge results
  6.3 Performance

7 Conclusion
  7.1 Discussion
  7.2 Future work

A Fire Module swift implementation
B SqueezeDet network architecture
C Expected Average Overlap results on the VOT benchmark
D Accuracy ranking on the VOT benchmark
E Speed of different tracking algorithms on the VOT benchmark
F Robustness ranking on the VOT benchmark

References


List of Acronyms

GPU    Graphics processing unit
CPU    Central processing unit
API    Application programming interface
SGD    Stochastic gradient descent
FCN    Fully connected network
CNN    Convolutional neural network
FPS    Frames per second
VOT    Visual object tracking
MOT    Multiple object tracking
Maccs  Multiply accumulates
NaN    Not a number


Chapter 1

Introduction

1.1 Motivation

Recognizing people or objects in an image when presented with an example of the object or person is a trivial task for humans. For machines, however, this is not the case. Tracking has a range of applications in fields such as cinematography, security, self-driving vehicles and many more. In some of these, tracking is actually still done by humans, as computer-based trackers are not yet accurate enough. This illustrates that there is still room for improvement.

Even when tracking is automated, the algorithms are often executed on devices with a lot of computational power and an unlimited source of energy. With the growing popularity of augmented reality on mobile platforms such as iOS or Android devices, there is a need for good tracking algorithms designed for these mobile platforms. This means the algorithm needs to be designed with computational limitations and a limited energy supply in mind. This goal is only becoming more relevant with the increasing popularity of mobile platforms.

1.2 Background

An automated tracking application or program can be imagined as a black box to which an exemplar of the target to track (tracking target) is given, e.g. an image of a person or object, together with a new image from a camera or video sequence in which that same person should be found. The objective of the black box is to locate the target within the new image, also known as the search window (seen in fig. 1.1).

The output of the black box needs to be the smallest box possible that, when overlaid on the search window, fully encompasses the target. So the desired output is in the form of a point in 2D space locating the center of the box, and a width and height defining its size and aspect ratio.


Figure 1.1: Goal of a tracking algorithm. The black box receives an exemplar and a search window as input and outputs the target's location.

1.2.1 Deep learning

In the past decade deep learning has become an established research field. It uses training data to teach a generic algorithm to perform a certain function; in essence, deep learning is training a predefined black box with annotated data to produce a desired output. The algorithm is defined by a neural network (explained in section 2.1) trained with annotated data, which differs from a manually designed and implemented algorithm. Neural networks are nothing new, but with the introduction of big data and the use of GPUs for increased computational power, they have started to outperform classical algorithms; the first example of that was "Alexnet" [1]. Alexnet was one of the first networks to be completely trained and executed on GPU hardware, and it beat the competition in the Imagenet classification challenge [2] with a lead of 10.8 percentage points in top-5 accuracy. These days neural networks can even beat humans in games like chess and Go [3].

1.2.2 State of the art in tracking algorithms

A good method of comparing tracking algorithms is with a benchmark that tests the performance of a tracking algorithm on a number of well-chosen video sequences. A good benchmark challenges a tracking algorithm by testing on video sequences with lighting changes, deformations of the target, occlusion of the target (when it is only partially in frame), and changes of the camera angle or camera movement. For single object tracking algorithms the most well-known benchmark is the visual object tracking (VOT) benchmark [4]. This benchmark is used in a yearly challenge that releases a report comparing all the tracking algorithms that participated. It measures their performance as well as the speed with which they execute. The main performance indicator is tracking accuracy, or how much the calculated bounding box overlaps with a bounding box annotated by a person. One of the best and fastest tracking algorithms according to that benchmark is STAPLE [5], which has a throughput of 80 video frames per second (FPS) on desktop hardware. This algorithm runs on the CPU and could not operate at these speeds on a much weaker mobile CPU, and is therefore not suited for mobile applications. Even more recent is CFNet, released in April of this year (2017). Instead of using the hand-crafted correlation filters that classical trackers use, it learned them using deep learning [6]. CFNet runs on a GPU at 52 FPS; it has yet to appear in the VOT benchmark, so it is hard to compare it to other trackers.

1.2.3 Problem

Even the most state-of-the-art tracking algorithms are often not able to do anisotropic scaling and are not designed for execution on mobile hardware. This prevents the application of these tracking algorithms for augmented reality on mobile hardware, as explained in section 2.4.

1.3 Research Methodology

The goal of the thesis is to develop a tracker that has good performance and accuracy, while also being able to run on mobile hardware, i.e. compact in memory and fast to execute. For this reason a literature study of currently available tracking algorithms is performed; a selection of papers will be examined that are not only focused on accuracy but also on performance. The keywords used to find papers within this scope are:

• Object detection neural networks

• Tracking algorithms

• Energy efficient neural networks

• Single shot learning algorithms

All papers will be evaluated on a quantitative basis, that is, whether they reach state-of-the-art results on either accuracy or speed. Speed will be measured in required computations and memory usage. Accuracy will be measured with the help of a benchmark; a relevant benchmark will be selected in order to compare different algorithms. Based on the results of this quantitative deduction, selected works will be evaluated against the following qualitative requirements: does the algorithm allow anisotropic scaling, and is the design of the algorithm simple?

This deduction will establish a base of knowledge and help with defining the problems that are currently still unsolved. Based on the literature research, a computer vision network will be selected for implementation on mobile hardware, to benchmark the capabilities of such hardware to run a computer vision neural network in real-time. This should result in a hard limit of computations that every proposed algorithm should stay within. A selected tracking algorithm based on a neural network will be evaluated to gain a better understanding of the problems and limitations of neural network based tracking algorithms. Based on this research, a novel or modified neural network architecture will be developed with the main goals of providing state-of-the-art accuracy combined with the ability to run on mobile hardware.

1.4 Research Contributions

This thesis contributes an evaluation of current neural network tracking algorithms, and of the possibility of implementing a neural network tracker on mobile hardware. Based on those findings two networks are proposed, implemented and evaluated. One network performs a prediction of an affine transformation, which is shown to decrease accuracy and is not competitive on accuracy or performance. The other proposed and implemented network is a one shot learning object detector that can do single target object detection based on an exemplar. This network is shown to achieve a competitive accuracy on a popular single object tracking benchmark while being simple in structure and efficient in performance. In addition, a number of filtering algorithms are used to improve the tracking performance of the two networks.

1.5 Thesis Organization

This thesis is organized in seven chapters as follows:

1. Introduction: Explaining the motivation, background and contribution of thisthesis.

2. Background: Explaining the concepts and technologies used in this thesis.

3. Related work: An overview of related work used and referenced in this thesis.

4. Tracking algorithm: A description of the two proposed network designs, and an evaluation of selected works.

5. Technical Details: Explaining all the specific technicalities used for imple-mentation

6. Evaluation and Results: Presenting the results of the two networks, evaluatedon accuracy and speed

7. Conclusion: Presenting the conclusion, discussion and future work.


Chapter 2

Background

This chapter explains the concepts behind this thesis: a short introduction to neural networks (section 2.1), the APIs used for deep learning and implementation on a mobile platform (section 2.2), and the use of big data in the form of datasets (section 2.3). The final section 2.4 presents the specifications and requirements that the final result should fulfill.

2.1 Neural Networks

Artificial intelligence - and by extension neural networks - has been a science for many years, dating as far back as the 1940s [7, Chapter 2], but has seen a sharp spike in scholarly interest in the last decade. This is mainly due to the practical application of deep learning and the large amounts of data available for training. Neural nets are essentially a series of mathematical operations that learn to produce the correct output when presented with an input. Unlike classical mathematics, where all operations are defined beforehand by a mathematician, neural nets work by defining a structure of basic building blocks consisting of simple mathematical operations. These building blocks (hereinafter referred to as layers) can be stacked and combined to create an advanced neural network. The layers in this neural network are then trained to produce a desired output when presented with a certain input. This means that while the structure of a neural network is known, what each layer actually does after training is not; for this reason the layers of a neural net are commonly known as hidden layers. The exact process of training a neural net is explained further in section 2.1.2.


2.1.1 Fully connected networks (FCN)

Among the most general layer designs used in neural networks are fully connected layers, consisting of artificial neurons. These artificial neurons are inspired by the biological neurons (seen in fig. 2.1) of the human brain [8]. Their function is to transmit a signal as a function of their inputs.

Figure 2.1: A 2d representation of a biological neuron, with dendrites, axon and axon terminals (image adapted from [9]).

The artificial neuron works by calculating a weighted sum of its inputs x, adding a bias value b, and applying an activation function f. This process can be written as (2.1), where N_inputs is the number of inputs:

y = f\left( b + \sum_{i=0}^{N_{\mathrm{inputs}}} x_i \cdot w_i \right) \quad (2.1)

A visual representation of this function can be seen in fig. 2.2. The weights can be used to adjust the influence a certain input has on the final result; this can in effect tune a neuron to produce a desired output when presented with a collection of inputs.

Figure 2.2: A visual representation of an artificial neuron (inputs x_i, weights w_i, bias b, summation, output y).

The bias and activation function are there, respectively, to compensate for a shared bias in the inputs and to make the function non-linear. Without an activation function, multiple layers of artificial neurons would always result in a linear function. This severely limits the capability of a neural net to approximate a more complex mathematical function. An activation function prevents this by including a simple non-linear function. For this reason it is also called a non-linearity function; common examples of non-linearity functions are the sigmoid, the hyperbolic tangent, and rectified linear units (ReLU).
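As a concrete illustration, the computation in (2.1) is only a few lines of code. The following is a minimal NumPy sketch of a single artificial neuron with a ReLU activation; the weights, inputs and bias are illustrative values, not taken from the thesis:

    import numpy as np

    def relu(z):
        # Rectified linear unit: a common non-linearity, max(0, z).
        return np.maximum(0.0, z)

    def neuron(x, w, b, f=relu):
        # Weighted sum of the inputs plus a bias, passed through the
        # activation function f, i.e. y = f(b + sum_i x_i * w_i) as in (2.1).
        return f(b + np.dot(w, x))

    x = np.array([0.5, -1.0, 2.0])   # three inputs
    w = np.array([0.8, 0.2, -0.5])   # one weight per input
    print(neuron(x, w, b=0.1))       # 0.0: the pre-activation sum is -0.7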


2.1.1.1 Layer structure & performance

A fully connected layer is defined by an arbitrary number of artificial neurons having the same inputs but different outputs. This is called a fully connected layer or densely connected layer due to all neurons being connected to all inputs. Combining two fully connected layers with an input layer, where data is fed in, and an output layer, where the results are presented, results in a simple neural network as seen in fig. 2.3.

Figure 2.3: A simple neural net with 2 fully connected layers (from [10]).

While the network presented in fig. 2.3 is relatively simple, neural nets can contain many layers and many neurons per layer. It is important to note that, though simple, fully connected layers scale badly with the number of inputs. The number of parameters (n_parameters) needed to be stored per neuron, based on (2.1), is N_inputs + 1. The number of parameters needed to be stored per layer can therefore be calculated as n_parameters = n_neurons · n_inputs + n_neurons. Immediately it can be seen that increasing the number of neurons or inputs linearly increases the number of variables; this becomes an issue when dealing with large numbers of inputs, for example when inputting an image. A simple color image of 250 × 250 pixels results in 250 · 250 · 3 = 187500 inputs. With a single layer containing the same number of neurons this results in n_parameters = 187500^2 = 35,156,250,000. With the computations requiring equal scaling, it can be determined that fully connected layers are not suitable for layers with a large number of inputs, as is the case with images. Convolutional layers (presented in section 2.1.4) can handle a large number of inputs better.

2.1.2 Training process

As mentioned in section 2.1, a neural net is trained to produce the desired output when presented with a certain input. Training a neural net can be done with a set of annotated examples that contain input data for the network as well as the corresponding desired output, known as a label. This collection of annotated examples is called a training dataset, and is used to train a neural network in a method called supervised learning.

Supervised learning is done by feeding an example to the input layer, and performing the calculations of all neurons and layers to produce an output (also known as the forward-pass or inference stage). The output produced by the network is then compared to the desired output by calculating the deviation using a loss function. The goal of the training is to minimize the output of the loss function by updating the weights and biases; the most popular way of doing this is by performing stochastic gradient descent (SGD). SGD works by calculating the gradient vector of the loss function and repeating this for all neurons in the network using the gradient of the previous layer and the delta rule [11]. This process is called gradient back propagation or the backward-pass. The calculated gradients are then used in the update phase to adjust the weights and biases by a certain magnitude, called the learning rate, in such a way as to minimize the loss function [12]. The steps shown below summarize this process:

1. Initialize all weights and biases with random values

2. Feed an example to the input of the neural net

3. Execute all neurons and layers to produce an output (forward-pass)

4. Calculate the deviation of the output from the label using the loss function

5. Back propagate the gradient through the network (backward-pass)

6. Update all weights and biases based on their respective gradient (update phase)

7. Repeat from step 2 until convergence

The steps shown above are repeated over and over in order to approach the desired output for an example as closely as possible. A small variant on the above process is to feed not just one example but multiple examples as a mini-batch; this has the advantage of faster convergence to a minimum due to less noise in the gradient. This method of using multiple examples to update the gradient can also be executed in parallel and is therefore especially beneficial when running on a GPU.
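As an illustration of steps 1-7, the sketch below trains a tiny two-layer network with mini-batch SGD on synthetic data. It assumes the TensorFlow 1.x API (the graph-and-session style used in this thesis); the network size, learning rate and toy task are illustrative assumptions, not values from the thesis:

    import numpy as np
    import tensorflow as tf

    # Toy supervised task: learn y = sum(x) from annotated examples.
    data_x = np.random.rand(256, 4).astype(np.float32)
    data_y = data_x.sum(axis=1, keepdims=True)

    x = tf.placeholder(tf.float32, [None, 4])        # network input
    y = tf.placeholder(tf.float32, [None, 1])        # label (desired output)
    hidden = tf.layers.dense(x, 8, activation=tf.nn.relu)
    pred = tf.layers.dense(hidden, 1)
    loss = tf.reduce_mean(tf.square(pred - y))       # step 4: loss function
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())  # step 1: random init
        for step in range(500):                      # step 7: repeat
            idx = np.random.choice(len(data_x), 32)  # step 2: feed a mini-batch
            sess.run(train_op,                       # steps 3-6: forward-pass,
                     feed_dict={x: data_x[idx],      # loss, backward-pass and
                                y: data_y[idx]})     # update phase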

2.1.3 Inference stage

The inference stage, where an example is fed to a neural network and the output is calculated, is part of the training process. In the deployment phase of a trained neural network, only a single forward-pass is required to produce a result, as the network no longer needs to be trained. This means that the inference stage is executed, but none of the other steps in the training process, besides the initialization stage (using pre-trained weights and biases), need to be carried out. Once the neural net has been initialized it can be used over and over again for inference, which requires far fewer computations than the whole training process with the gradient back-propagation. For this reason inference can often be executed on much weaker hardware at respectable speeds. These two separate processes are also called the offline stage, when training the neural net, and the online stage, when using the neural net for inference.

2.1.4 Convolutional neural networks (CNN)

Convolutional neural networks are based on convolutional layers instead of fully connected ones. Convolutional layers are mainly popular for use in image processing applications, as they are designed to exploit the strong spatial correlation present in images. They are inspired by the biological eye, which uses cells only sensitive to a small part of the image, called a receptive field, but tiles them to cover the whole image. A convolutional layer imitates the biological cell with something called a filter. A convolution works similarly to a fully connected neuron, but instead of having connections to every input it only has connections to the inputs in its receptive field. The receptive field of each convolution is small and constant, but by tiling many partially overlapping convolutions the receptive field effectively covers the whole input, just like in the human eye. While the receptive field of each individual convolution is constant and unique, the convolution itself is not: convolutions share weights and are executed in a scanning fashion, where the weights are reused for multiple separate convolutions. This scanning behaviour can be executed with a 2d filter, also known as a kernel. A convolution with a 2d filter can be used to scan a 2d plane or an image. A kernel can be an arbitrary size, but popular sizes are 3 × 3 and 1 × 1. A 3 × 3 kernel, presented as a matrix in (2.2), holds 9 different weights which can be tuned during training. In addition to the kernel, a convolution can also use a bias value that is added to the output of the kernel convolution.

\begin{bmatrix} w_1 & w_2 & w_3 \\ w_4 & w_5 & w_6 \\ w_7 & w_8 & w_9 \end{bmatrix} \quad (2.2)

Using a 3 × 3 kernel to apply a convolution to an input image produces an output image of similar size, as seen in fig. 2.4. The output size of a convolutional layer changes depending on whether or not padding is used around the edges, and on the step size or stride. Padding enables kernels larger than 1 × 1 to read data outside of the original image; this is necessary to prevent a reduction in the image size after every layer. The stride dictates the step size used in the scanning motion: a stride larger than one makes the kernel skip one or more positions before applying the next convolution. The result of a stride greater than one is a proportional decrease in the output size of the convolutional layer. There does not have to be only one kernel per convolutional layer; there can be multiple trainable kernels per layer. Each kernel applies its own convolution to the image, which increases the ability to recognize patterns: for example, one kernel could look for straight edges while another could look for diagonal edges. This results in an image with a number of channels, also called a feature map.


Figure 2.4: A diagram of a 2d kernel used to apply a convolution to a 2d plane (from [13]).

A color image also consists of multiple channels, namely a Red, Green and Blue (RGB) channel, and can therefore also be considered a feature map. In order to act on the increased dimensionality of feature maps, a convolution has to have multiple kernels to act on the different input channels. This results in a 3d kernel with an added dimension called depth; the depth of a kernel must match the number of input channels. This is true for a normal convolution; special depth-wise convolutions exist but are explained later in section 3.3.4. A visual illustration of a convolution with 3d kernels can be seen in fig. 2.5.

Figure 2.5: An illustration showing a 3d convolution. h_in, w_in and ch_in describe the size of the input feature map; the output feature map is described by h_out, w_out and ch_out. The number of kernels k per output channel is equal to ch_in (source: [14]).

Increasing the depth parameter is the most common way to increase the capabilities of a convolutional neural network. This is because increasing the depth of a convolution adds kernels, and thus the ability to detect a wider variety of features. Multiple convolutional layers can also be used, with each convolution using the output feature map of the previous convolution as input.
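The scanning behaviour of a single 2d kernel can be made explicit in a few lines. The following NumPy sketch is a naive direct implementation without padding, written for clarity rather than speed; it shows how kernel size and stride determine the output size:

    import numpy as np

    def conv2d(image, kernel, stride=1):
        # Slide the kernel over the image; each position produces one output
        # value: the weighted sum over that receptive field.
        kh, kw = kernel.shape
        ih, iw = image.shape
        oh = (ih - kh) // stride + 1
        ow = (iw - kw) // stride + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                patch = image[i * stride:i * stride + kh,
                              j * stride:j * stride + kw]
                out[i, j] = np.sum(patch * kernel)
        return out

    image = np.random.rand(8, 8)
    kernel = np.array([[1., 0., -1.],
                       [1., 0., -1.],
                       [1., 0., -1.]])  # responds strongly to vertical edges
    print(conv2d(image, kernel).shape)            # (6, 6): no padding
    print(conv2d(image, kernel, stride=2).shape)  # (3, 3): stride 2 shrinks each dimension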

2.1.4.1 Pooling layers

Pooling layers are similar to convolutional layers but simpler. A pooling layer has a stride and kernel size just like a convolutional layer, but it does not contain any weights and is therefore not trainable [1]. A pooling layer is a constant mathematical operation that is applied similarly to a convolution; popular pooling layers are max pooling and average pooling. The output of a convolution in a max pooling layer is the maximum value within that window; in the case of average pooling it is the average value. Pooling layers are often used to reduce the size of the feature map, as they do not require a lot of computations. In more recent works pooling layers are often dropped in favor of regular convolutions with a stride greater than one [15].

2.1.5 Performance

Using convolutions on images takes advantage of the inherent spatial correlations between the pixels and their respective locations. This advantage enables the convolutional layer to operate efficiently on image data, where a fully connected layer would be impractical. The number of parameters in a convolutional layer can be calculated as n_parameters = K_width · K_height · C_in · C_out + C_out (with K being the kernel size, and C_in and C_out the number of input and output channels). As one can see, the number of variables is not dependent on the number and size of the inputs. The number of computations required is, however, dependent on the number and size of the inputs; this means bigger input images require more computations. The computations consist of multiply accumulates, also known as Maccs. With S the stride, K_w and K_h the kernel width and height, and I_w and I_h the input width and height, the formula to calculate the number of Maccs can be seen in equation (2.3) (from [16]):

\mathrm{Maccs} = \left( (K_w \cdot K_h) \cdot \frac{I_w \cdot I_h}{S} \cdot C_{in} \right) \cdot C_{out} \quad (2.3)

When a 3 × 3 convolution (no bias is added, for simplicity) with 64 output channels is applied to an input image equal to the one of the example in section 2.1.1.1, the required parameters and computations are much smaller. The number of parameters is n_parameters = 3 · 3 · 3 · 64 = 1728, and the number of Maccs is Maccs = ((3 · 3) · (250 · 250)/1 · 3) · 64 = 108 · 10^6. Compared to the performance of a fully connected layer on the same input, this is a reduction of around 325 times in the number of Maccs, and the number of parameters is reduced around 2 · 10^7 times. It must be noted that the two layer types differ so much in their functioning that it is hard to compare them. It is shown, however, that convolutional layers are much more practical for inputs consisting of images than fully connected layers.
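The two counts are easy to reproduce. The helper below is a sketch following the formulas above (the fully connected comparison omits biases, as in the example) and recomputes the numbers for the 250 × 250 × 3 input:

    def conv_params(kw, kh, c_in, c_out):
        # Weights per kernel times the number of kernels, plus one bias
        # per output channel.
        return kw * kh * c_in * c_out + c_out

    def conv_maccs(kw, kh, iw, ih, s, c_in, c_out):
        # Multiply accumulates of a convolutional layer, following (2.3).
        return ((kw * kh) * (iw * ih) // s * c_in) * c_out

    print(conv_params(3, 3, 3, 64) - 64)         # 1728 weights (bias excluded)
    print(conv_maccs(3, 3, 250, 250, 1, 3, 64))  # 108000000 Maccs

    # Fully connected layer on the same flattened input, one neuron per input:
    n_inputs = 250 * 250 * 3
    print(n_inputs * n_inputs)                   # 35156250000 parameters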

2.1.6 Siamese network

Classically a neural network has a single data path from input to output, because networks are highly specialized and fine-tuned for performing a single task. However, some tasks require adaptation of the network based on a given input. A relatively new neural network design called the siamese neural network [17] is designed with these kinds of tasks in mind. These networks contain two or more distinct inputs that are combined somewhere later in the neural network. A simple example can be seen in fig. 2.6, where two distinct inputs are processed by two separate hidden layers and combined in a combination layer which is connected to the output layer.

Figure 2.6: An illustration of a siamese network with two distinct inputs A and B resulting in one output.

Each branch of the network is tuned to embed its input in a semantically meaningful feature map before combining those representations to compare them. The siamese network structure supports all layer types, and layers can also be added between the combination layer and the output. The main purpose of the combination layer is to combine the output of the two separate branches in a meaningful way. The exact combination layer to use differs per application, but some very basic combination layers include concatenation, fully connected layers (as shown in fig. 2.6), and addition or subtraction.
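A minimal sketch of this structure in TensorFlow 1.x style is shown below. The layer sizes are illustrative assumptions; here the combination layer is a concatenation followed by a fully connected layer, as in fig. 2.6:

    import tensorflow as tf

    def branch(inp, scope):
        # One branch embedding its input into a feature vector.
        with tf.variable_scope(scope):
            h = tf.layers.dense(inp, 32, activation=tf.nn.relu)
            return tf.layers.dense(h, 16, activation=tf.nn.relu)

    input_a = tf.placeholder(tf.float32, [None, 64])   # input layer A
    input_b = tf.placeholder(tf.float32, [None, 64])   # input layer B
    feat_a = branch(input_a, "branch_a")               # hidden layer A
    feat_b = branch(input_b, "branch_b")               # hidden layer B

    # Combination layer: concatenate the two embeddings and let a fully
    # connected layer learn how to compare them.
    combined = tf.concat([feat_a, feat_b], axis=1)
    output = tf.layers.dense(combined, 1)              # output layer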

2.1.7 Image classifiers

Image classifiers are algorithms that can recognize or predict what is represented in an image. Neural networks can also be trained to perform classification. Classification requires a neural network to output the probability of a certain class of object being in an image. This is a core function of computer vision, and it has therefore become a very popular field of research. A trained classifier performs a number of low level and high level feature extractions: it looks for edges, shapes and even specific objects like heads. These features are used in almost all computer vision applications, and for this reason a pre-trained classifier can also be reused. When reusing a pre-trained classifier as a feature extractor, the network is used to extract generic features which are then processed by another neural network trained for a different application. Another reason classifiers are popular is the annual Imagenet Challenge [18]. The Imagenet Challenge compares the top-1 and top-5 accuracy of different classifiers. Many of the big players in the field of artificial intelligence have participated in one way or another, and the top ranking networks sometimes differ from each other by as little as 0.05% accuracy.
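Reusing a pre-trained classifier as a feature extractor is a one-liner in most modern APIs. The sketch below assumes the Keras applications bundled with TensorFlow (an assumed setup for illustration, not the toolchain of this thesis); MobileNet stands in for any pre-trained classifier:

    import tensorflow as tf

    # Load a classifier pre-trained on Imagenet, dropping its classification
    # head (include_top=False) so it outputs generic feature maps instead of
    # class probabilities.
    base = tf.keras.applications.MobileNet(weights="imagenet",
                                           include_top=False,
                                           input_shape=(224, 224, 3))
    base.trainable = False  # keep the generic features; do not retrain them

    # A new task-specific head is trained on top of the frozen extractor.
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1),  # e.g. one output for a new regression task
    ])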

2.2 Machine Learning APIs

A simple network such as the one given as an example in section 2.1.1.1 is not difficult to implement, as it is only a couple of simple matrix multiplications. One could do it in Python using the library NumPy, or natively in C. This becomes less trivial when convolutions are used, as in that case multiple matrix multiplications with different slices of the original matrix and the kernel have to be executed. Although this could still be done in NumPy, it becomes problematic when networks get bigger and more complex: they start to run very slowly on a normal CPU. For this reason a GPU is used, as it is optimized for parallel computing and thus very suitable for performing these large and easy-to-parallelize matrix multiplications. Implementing all these layers on a GPU, however, is a lot more difficult, as native shaders have to be written, preferably heavily optimized for their respective tasks. For this reason numerous application programming interfaces (APIs) have been developed to handle the low level implementation of the layers in a neural network. The notable ones are Caffe [19], Theano [20] and Tensorflow [21]. The different APIs provide a back end for neural net development such that a user can focus on designing a network architecture and does not need to be bothered with implementing, debugging and optimizing low level code. Each of the APIs has its pros and cons, but for the neural network design and development in this thesis Tensorflow was used. Because Tensorflow does not yet run on mobile hardware, another API called Metal was used. Metal is developed by Apple and only runs on iOS devices; it provides a small API with most of the commonly used layers in neural networks. Metal is currently the only neural network API that can run all of its operations on the GPU of a mobile phone, which is required to run inference of any computer vision network at real-time frame rates.

2.2.1 Tensorflow

In this work Tensorflow is used for training the neural network. Tensorflow is an API that is interfaced with through Python. Python is an interpreted language, which means it interprets every line of code during execution; this becomes very inefficient for repeated executions of the same bit of code. For this reason Tensorflow uses something called a graph, describing the path of the training data, the computations applied to it, and different data modification operations. These operations and paths are also called tensors (inspired by mathematical tensors). A graph of tensors is set up using Tensorflow API calls from Python; no actual data is being processed during this setup phase of the graph. In this setup phase the inputs of the neural network are defined, as well as the tensors that act on these inputs; the tensors can be strung together to create complex computation graphs [21]. After the graph is set up as required for a certain neural network, a Tensorflow session is started. This session can then be used to feed data into the graph and evaluate certain tensors; in this session the graph cannot be changed anymore. Because the graph is fixed during execution, not only can it use native implementations of certain operations, but it can also optimize the execution order and data path between the different operations. As a result, Tensorflow can be used to implement complex neural networks using a high level language like Python, while still taking advantage of a very low level optimized implementation.
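The separation between the setup phase and the execution phase looks as follows in TensorFlow 1.x (a minimal sketch; the toy computation is illustrative):

    import numpy as np
    import tensorflow as tf

    # Setup phase: define the graph. No data flows here; `doubled` is a
    # tensor (a node in the graph), not a computed value.
    inp = tf.placeholder(tf.float32, shape=[None])
    doubled = inp * 2.0

    # Execution phase: a session runs the now-fixed graph on actual data.
    with tf.Session() as sess:
        result = sess.run(doubled, feed_dict={inp: np.array([1.0, 2.0, 3.0])})
        print(result)  # [2. 4. 6.]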

2.2.2 Metal

As stated in section 2.1.3, the inference stage requires far fewer computations than the training stage. This means that inference can be done on much weaker hardware than training, while still maintaining a respectable throughput or FPS. Unfortunately, while a smaller number of computations is required, doing inference of any computer vision network on a CPU will not result in real-time FPS, especially on a mobile CPU. Fortunately, most mobile devices nowadays have a modestly powerful GPU integrated for applications such as mobile games and other UI tasks. In theory, one could write custom shaders to use the GPU in order to execute a neural net. The issue is that the mobile hardware market is quite diverse, and writing custom optimized shaders for every single GPU used would be counterproductive. Apple, however, has released an API called Metal that already contains the most important neural network layers implemented in shaders [22], such that they can be executed on the GPU. Apple has the advantage of controlling both the hardware and software on their devices; this enables them to support running neural networks on the GPU of all devices from the iPhone 6 onwards. Of course more recent devices have a more powerful GPU able to run a neural network at a higher frame rate compared to older devices, so the frame rate varies from device to device.

2.3 Datasets

As stated in section 1.2.1, neural networks are often trained using big sets of annotated data [23]. For training very complex neural networks that are many layers deep, very big datasets are needed to enable the network to find complex similarities between pictures. With a small set of training data, a very deep neural network could start to learn irrelevant and tiny features specific to the training set; this behaviour is called over-fitting. The problem with over-fitting is that it can make a network non-generalizable, meaning that the network would perform well on the training data but not on any unseen data. To prevent over-fitting, the annotated data must be of a significant size to enable the neural network to find big-picture features and not focus on features only found in the training set. Reducing over-fitting allows for better generalization, meaning that when the neural net is used for inference on unseen data it produces better results.

Notable players in the information technology industry, including Google, Microsoft, Facebook, Apple and Baidu, are all investing heavily in gathering data, often using their own platforms. Unfortunately not every company has millions of users to generate big data, especially not researchers, and for this reason there are a lot of freely available datasets. Some of these datasets are provided by universities, like the famous Imagenet dataset [2], and are often used to compare image classifier accuracy. Others are provided by one of the aforementioned companies, such as the Youtube-8M dataset made available by Google [24].

Page 25: One Shot Object Detection1161376/FULLTEXT02.pdf · Thus the one shot object detection network used for a tracking application can improve the experience of augmented reality applications

16 CHAPTER 2. BACKGROUND

2.4 Project Goals and Specifications

This section establishes the goals and specifications of the project, and also determines the design constraints and limitations.

2.4.1 Problem

In a video or live stream a subject can move around, and often the camera moves as well. This results in transformations of the subject in the 3D world; these transformations include translations, scaling and changes in shape. Projected onto the 2D plane of a camera image this results in translations, anisotropic scaling and shape changes. Shape detection is part of a research field called segmentation, and is not within the scope of most tracking algorithms. Most state-of-the-art object tracking algorithms are able to accurately detect the translations and isotropic scaling of a target. Unfortunately most trackers are unable to handle anisotropic scaling. In practice this means that the bounding box of a target during the tracking process can only change in size by using the same scaling factor for both width and height. This behaviour is inherent to the design of most trackers, as they are often initialized with a subset of an original image containing only the target, called a patch. This patch is then compared to a search window, which can be the full original image or a subset of it; the location where the comparison returns the greatest activations is then assumed to be the new location of the target. To identify scale changes of the target, a scale pyramid is often used. A scale pyramid is a set of images containing the search window and slightly bigger and smaller versions of it, as can be seen in fig. 2.7. The comparison described earlier is performed on each of the different scales, and whichever scale has the greatest activations is assumed to be the new scale of the target [25]. While tracking algorithms using a scale pyramid have been shown to be very effective, they are not able to do anisotropic scaling, or any other transformations like rotation or shearing. The approach is also inefficient, as running the comparison for each scale linearly increases the computational complexity with the number of scales (see the sketch after fig. 2.7).

As explained, available tracking algorithms do not support anisotropic scaling by design, and they are also not designed for performance on mobile hardware. Thus the problems with the current state of the art are a lack of anisotropic scaling, complex algorithms, and not being optimized for speed and accuracy on mobile hardware.


Figure 2.7: A scale pyramid showing a search window at original scale, slightly upscaled, and slightly downscaled to detect scale changes (from [26]).
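The linear cost of the scale pyramid is visible in a sketch of the search loop. All names here are hypothetical, and the `compare` stand-in replaces one forward pass of a learned comparison function:

    import numpy as np
    from scipy.ndimage import zoom  # simple resampling, enough for a sketch

    def pyramid_search(search_window, compare, scales=(0.95, 1.0, 1.05)):
        # The comparison must be run once per pyramid level, so the cost
        # grows linearly with the number of scales.
        return max(scales, key=lambda s: compare(zoom(search_window, s)))

    window = np.random.rand(64, 64)
    # Toy stand-in for the learned comparison: mean activation of the window.
    print(pyramid_search(window, compare=lambda img: img.mean()))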

2.4.2 Goal

The performance problems highlighted earlier are often an afterthought when it comes to running trackers on powerful hardware. Tracking algorithms are comparatively light, and can be executed at very high frame rates, especially when run on a GPU [4]. Maintaining high frame rates starts to become a problem when executing trackers on less powerful hardware, for example a mobile phone. Another problem encountered when executing complex algorithms on a phone is an increased load on the battery, draining it faster than is acceptable from an end user's perspective. Mobile hardware is a relevant platform for the execution of tracking algorithms; these devices all contain video cameras which can be used for augmented reality applications. Solving the problems that come with running complex tracking algorithms can enable more advanced augmented reality applications. It is therefore an important research subject and can enable advancement in the field of augmented reality and its applications on mobile hardware. With the problems of section 2.4.1 in mind, the main goals of this research are:

Performance Minimize the computational complexity of a tracker while maintaining a respectable accuracy.

Scaling Enable anisotropic tracking of a target's scale.

Simplicity The tracker must be simple in design to allow for efficient implementation on mobile hardware.

Performance and simplicity are the primary goals of this thesis; a secondary goal is to enable anisotropic scaling of the target. Both of these goals have to be achieved with either a limited loss in accuracy or, preferably, an increase in accuracy.

2.4.3 Proposed solution

Currently one of the most state-of-the-art tracking algorithms is a neural net based tracker [27]. This tracker is based on a siamese network structure and can track a target over multiple scales with very high accuracy. It uses a neural net architecture to extract meaningful feature maps for comparison. Unfortunately this tracker still uses a scale pyramid to track the target's scale, which prevents any anisotropic scaling. This tracking algorithm also increased the computational complexity, as the neural net uses the structure of Alexnet [1] as a feature extractor, which is much more demanding in terms of required computations than hand-crafted feature extractors. This tracker would not achieve real-time frame rates [28] when run on mobile hardware. Nonetheless the idea is a promising one and has potential for improvement, possibly allowing it to be made suitable for mobile hardware.

In order to improve the performance of the siamese neural net tracker, one has to separate the tracker into two components: the first being the neural net that does the comparison, the second being the tracking algorithm that facilitates tracking over scales and temporal filtering. There is room for improvement in both components. Either the structure of the neural net that does the comparison can be improved to achieve similar accuracy with fewer computations; there are specific structures that promise this exact thing, e.g. Squeezenet [29] or MobileNets [30]. The other option is to improve the tracking algorithm: the multi-scale tracking algorithm requires whatever comparison function is being used to be run once per scale, so reducing the number of forward-passes required to one would increase performance linearly.

Based on this knowledge we propose to design a tracking algorithm utilizing a siamese neural net, but instead of using a classic multi-scale tracking algorithm, we add some form of regression to estimate the target's new position, scale and aspect ratio. To improve performance the neural net will be based on one of the performance optimized networks currently available, e.g. SqueezeNet or MobileNets.


Chapter 3

Related work

3.1 One shot learning

Neural networks classically contain a single data path from input to output. This is because neural networks are often trained to perform a single task, and this task does not change during the inference stage. If the task changes during inference, the neural network would need to be retrained in order to perform the new task. Often it is not an option to retrain a neural network for each new task, because there might be a lack of training data, or the hardware used for inference might not be powerful enough to perform SGD. This problem of having limited training data, or even only one example, is also known as one shot learning. Tracking is in essence a one shot learning problem, as an algorithm is given one example of the target and asked to track this target over multiple frames.

One of the first papers attempting to solve the problem of one shot learning by re-training a pre-trained neural network was released in 2013 [31]. It showed that a pre-trained network was better at generalizing to a new class than a network trained from scratch. A more recent paper utilizes a Neural Turing Machine to perform one shot learning; the Turing machine consists of a controller, such as a feed-forward network or a recurrent neural network [32], that interacts with an external memory module [33]. This machine has long-term storage in the network weights, which are slowly updated, and short-term storage in the form of the aforementioned external memory module. This structure achieved better results than a human on a few-shot problem using the Omniglot [34] dataset and was a big step forward compared to comparable methods at that time. The “matching networks for one shot learning” paper [35] showed that, besides designing for one shot learning as in [33], training for one shot learning can improve results even further. The authors proposed a network which was designed to be trained for one shot learning and showed significant improvement over the previously described methods, with 98% accuracy on 5-way challenges after 1-shot learning. These papers have laid out some of the best-practice design methods which are now used as a base by many other researchers working on one shot learning. The applications are very widespread, from simple visual recognition [36] all the way to object segmentation in video [37].

3.2 Tracking

This section presents the current state of research concerning tracking algorithms. All the tracking algorithms presented here were selected based on their tracking performance, whether they work on a frame-by-frame basis, and whether they can achieve real-time frame rates.

3.2.1 Tracking using deep regression

In [38] a method for subject tracking using a feed-forward neural network is described. The network uses the search window of the previous frame containing the target and a search window of the new frame as inputs. The neural network then applies a number of convolutions to both inputs and combines the outputs of the convolutions using fully connected layers. The neural network is trained to predict translations as well as anisotropic scaling of the target from search window to search window. This is an architecturally simple method which achieves a high speed (100 fps) on a Titan X GPU. The shortcoming of this method, and why it won't be used, is that it cannot look further back than one frame, so any occlusion longer than that will result in the target being lost and never recovered.

3.2.2 Tracking using a CNN and recurrent layers

A network called ROLO (recurrent YOLO), described in [39], utilizes the well-known YOLO network [40] and combines it with a layer of LSTM [32] cells to improve tracking of a single subject compared to individual detections each frame. This network bested most of the competitors in the OTB-30 benchmark [41]. The drawback of the network is its computational intensity; where the YOLO network by itself was already an expensive network to run, ROLO adds another layer to it. There might be an option to replace YOLO with a lighter architecture, e.g. SqueezeDet. The network is inefficient by design, as it takes the final predictions of a different network designed for other purposes and adds to them. Besides the architecture being computationally expensive, there is also no method of re-acquiring a target after it has been lost for a longer period of time.

3.2.3 Fully convolutional neural network for object tracking

In [42] a method is presented to do single subject tracking using a fully convolutional neural network. While one shot learning is often understood as the network parameters being updated with only one example of a class, what the network in [42] actually does is generate a feature map of the target once, which is then cross-correlated with a feature map of every search image to find the target to be tracked. This system uses AlexNet [1] for the convolutions and shows state-of-the-art quantitative and qualitative results, while also running at a high FPS. It remains to be seen, however, how sensitive the network is to a change in the target's pose, as the feature map of the target is never updated. Due to the good results on the VOT benchmark and the use of neural networks, this network will be evaluated in section 4.1.2.

3.2.4 Learnet

Learnet [43] is from some of the same authors as the network presented in [42]. It proposes another siamese network structure, but it does not only use the siamese branches as feature extractors: one of the branches is also trained to update the weights of a convolution in the other branch, as seen in fig. 3.1.

Figure 3.1: The structure of the Learnet (from [43])

The weight matrix M of the convolution that is being changed during inference can be generated using the factorization M = v · diag(d) · h^T. During inference, when the weight matrix is updated, only the diagonal d is updated; v and h, learned during offline training, stay the same. This greatly reduces the number of parameters to update. The result of updating the weights of a convolution is an improved tracking accuracy when compared to a siamese network whose weights do not change during inference. The siamese design proposed in this paper is an interesting idea, but due to its complexity it is not a candidate for implementation on mobile hardware.

3.2.5 Visual Tracking by Reinforced Decision Making

In [44] an extension of the siamese network of [42] is proposed: a policy network is added and trained to determine whether or not a new feature map of the subject should be used. This solves the pose sensitivity problem and shows improved quantitative results, while still performing at a real-time FPS. The only downside is the added complexity of the policy network and the requirement to recompute a feature map of the subject on every pass through.

3.2.6 Correlation filter based tracking

In [6], a method is presented to adapt the correlation filter algorithm [45] to an end-to-end neural net training process. It is a step forward from hand-crafted correlation filters to correlation filters learned using training data; the paper shows that even a shallow neural network using correlation layers can achieve a similar or better level of precision than deeper neural networks. The paper was only released in April of this year (2017), and since the correlation filter used is non-standard, it will be a challenge to implement and debug.

3.2.7 Tracking using a recurrent net and LSTM cells

In [46] a network is proposed that uses a recurrent neural network for temporal prediction, updates and track management, and combines it with an LSTM to solve the combinatorial problem of data association. Their experiments show middle-tier results compared with other recent trackers, but they do achieve a high frame rate (160 FPS) compared to other tracking algorithms in the MOT benchmark [47]. Implementing recurrent networks and LSTM cells is currently not trivial in TensorFlow or Metal, but the approach shows promise for future use in computer vision and tracking applications.


3.2.8 Tracking by detection

This method utilizes frame-by-frame detection and adds a method of data association to track multiple targets, similar to [48]. It can be implemented easily by taking any of the detectors currently available and adding an algorithm such as the solution path algorithm [49] for tracking. This can produce good results, as shown in [49], and potentially improve the qualitative performance of a tracking-by-detection algorithm. It will however not be used in this thesis, due to its focus on multi-object tracking.

3.3 Optimizing network performance

This section presents papers that focus on optimizing the performance of neural networks instead of their accuracy, meaning a trade-off between memory pressure or computations and total accuracy was considered. The techniques identified in these papers might help optimize the performance of a neural net tracker.

3.3.1 Deep compression

In [50] a method for reducing model size and improving performance is proposed, called deep compression. It uses a combination of pruning, quantization and Huffman coding to reduce a model's size while maintaining a similar accuracy. Pruning, proposed in [51], removes all weights that are below a certain threshold, as they are deemed to be of too little influence; this results in a decrease in size and computations required, and is shown to maintain a similar accuracy. Quantization means reducing the size of the weights and biases; often they are stored as single precision floating point numbers in 32 bits. The deep compression paper shows that 5-8 bit weights have similar accuracy to single precision floating point weights. Finally, Huffman coding is a method of lossless data compression [52], and works because of the non-uniform distribution of the weights, saving another 20%-30% in memory size. The combination of these three methods allows for a compression of around 35 to 49 times on popular networks while maintaining a similar accuracy. Deep compression is an exciting field of research with good results. However, in order to use the pruned and reduced-precision weights one needs to write custom convolutions, because a normal 3x3 kernel convolution will still execute all computations even though the middle weight might have been pruned. This prevents it from being used to the fullest extent with the limited API available on mobile hardware.
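To make the pruning and quantization stages concrete, the sketch below prunes small weights and uniformly quantizes the remainder to 8 bits. This is a minimal sketch assuming plain NumPy arrays; the threshold and bit width are illustrative, and the paper itself uses a k-means codebook rather than the uniform scheme shown here.

```python
import numpy as np

def prune(weights: np.ndarray, threshold: float) -> np.ndarray:
    """Magnitude pruning: zero out weights whose absolute value is small."""
    return np.where(np.abs(weights) < threshold, 0.0, weights)

def quantize(weights: np.ndarray, bits: int = 8):
    """Uniform quantization: map float weights onto 2**bits - 1 levels."""
    levels = 2 ** bits - 1
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / levels
    codes = np.round((weights - w_min) / scale).astype(np.uint8)
    return codes, w_min, scale  # dequantize with: w_min + codes * scale

w = np.random.randn(3, 3).astype(np.float32)
codes, w_min, scale = quantize(prune(w, threshold=0.1))
```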


3.3.2 SqueezeNet

In [29] a network architecture is proposed that maintains the same accuracy as AlexNet [1] on the ImageNet dataset, while reducing the model size by up to 500 times. To achieve this performance increase while maintaining the same accuracy, the authors proposed a new module called the fire module (seen in fig. 3.2). The fire module reduces the number of expensive 3×3 convolutions by using a combination of 1×1 and 3×3 convolutions after a squeeze layer. The cheaper 1×1 convolution used in the squeeze layer reduces the number of channels in the feature map to a preset value called s1×1. After this, a 1×1 and a 3×3 convolution are used to expand the number of channels in the feature map to preset values called e1×1 and e3×3 respectively. This prevents any convolution from having both a large number of input channels and a large number of output channels, which is very costly according to section 2.1.5. The combination of a 1×1 and a 3×3 convolution works especially well as they tend to cooperate: the 1×1 convolutions focus more on channel relationships and the 3×3 convolutions focus more on spatial information.

Figure 3.2: The Fire Module proposed in SqueezeNet (from [29])

The main contribution of SqueezeNet was the fire module, which showed that reducing model size and computations can be done not only by compression but also by making smart decisions in the architecture. To achieve the aforementioned 500-times model size reduction, SqueezeNet also utilized the deep compression technique explained in section 3.3.1.
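As an illustration of the module, a fire module can be written in a few lines. This is a minimal sketch assuming tf.keras; the channel counts s1x1, e1x1 and e3x3 are free parameters, not the exact configuration from the paper.

```python
import tensorflow as tf

def fire_module(x, s1x1: int, e1x1: int, e3x3: int):
    """SqueezeNet-style fire module: squeeze with a 1x1 convolution, then
    expand with parallel 1x1 and 3x3 convolutions and concatenate."""
    squeeze = tf.keras.layers.Conv2D(s1x1, 1, activation="relu")(x)
    expand_1x1 = tf.keras.layers.Conv2D(e1x1, 1, activation="relu")(squeeze)
    expand_3x3 = tf.keras.layers.Conv2D(e3x3, 3, padding="same",
                                        activation="relu")(squeeze)
    return tf.keras.layers.Concatenate()([expand_1x1, expand_3x3])
```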

3.3.3 SqueezeDet

SqueezeDet is an object detector released in December 2016 as an extension of SqueezeNet, designed to be as small as possible while still maintaining good accuracy. It utilizes the same layer structure as SqueezeNet but adds another two fire modules and another convolution to perform bounding box predictions; these last three layers are called ConvDet in the paper. It reduced energy consumption 84x compared to the previous work “Faster R-CNN” (FRCNN) [53], while achieving a similar accuracy and running at real-time speeds (57.2 FPS). It combines ideas from FRCNN and YOLO [40], as it only uses convolutions for its output layers and uses the k-means clustered anchors (explained in section 3.3.3.1) proposed in FRCNN. The output layer performs classification as well as region proposal, similar to the YOLO network.

3.3.3.1 Anchors

The anchors of SqueezeDet are default bounding boxes determined by k-means clustering of the bounding boxes in the annotated data. This method has a statistical advantage over simple square boxes, as it takes into account the specific sizes and aspect ratios of the classes. The anchors are arranged in a grid, with every default box repeated at every grid position. One of the goals of the network is to regress which anchors to use when presented with an input. To accommodate more fine-grained localization the network also predicts deltas for every anchor, such that every anchor can be adjusted slightly, as can be seen in fig. 3.3. The total number of anchors is

$$n_{anchors} = output_{width} \cdot output_{height} \cdot k_{clusters}$$

Every anchor is assigned a certain probability of a class being there, based on a class and confidence regression. This probability can be used to filter the output and select which of the predicted bounding boxes to keep.

Figure 3.3: An illustration showing the selection, adjustment and assignment of an anchor in ConvDet (from [54])
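To make the anchor layout concrete, the sketch below tiles a set of default box shapes over every grid position, as in the equation above. This is a minimal sketch assuming NumPy; the grid size and box shapes are illustrative placeholders, not the k-means clusters used by SqueezeDet.

```python
import numpy as np

def build_anchors(out_w: int, out_h: int, shapes):
    """Repeat k default box shapes (w, h) at every output grid position,
    giving out_w * out_h * k anchors of the form (cx, cy, w, h)."""
    anchors = [(i + 0.5, j + 0.5, w, h)
               for i in range(out_w)
               for j in range(out_h)
               for (w, h) in shapes]
    return np.array(anchors, dtype=np.float32)

# three illustrative k-means box shapes, in grid units
anchors = build_anchors(14, 14, [(1.0, 2.0), (2.0, 2.0), (3.0, 1.5)])
assert anchors.shape == (14 * 14 * 3, 4)
```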


3.3.3.2 Loss function

The loss function used by SqueezeDet to train the network was another improvement over FRCNN, as it enabled end-to-end training of the neural net as opposed to a four-step training strategy [53]. The loss function of SqueezeDet consists of three parts. The first part is the delta loss, which compares the predicted deltas $\delta^{pred}_{kj}$ to the ground truth deltas $\delta^{GT}_{kj}$. This loss is a sum of the square distances between the respective deltas, seen in equation (3.1). The delta loss is normalized with respect to the number of objects $N_{obj}$, and an input mask $I$ is used to only train relevant deltas. The $\lambda_{bbox}$ factor is used later when combining the different parts of the total loss function.

$$Deltas_{loss} = \frac{\lambda_{bbox}}{N_{obj}} \sum_{k=1}^{n_{anchors}} \Big( I_k \sum_{j=1}^{4} \big( \delta^{GT}_{kj} - \delta^{pred}_{kj} \big)^2 \Big) \tag{3.1}$$

The second part of the loss function, seen in equation (3.2), is the confidence loss, which trains the neural network to select the right anchor for a detected object. The predicted confidence $\gamma^{pred}_k$ and ground truth confidence $\gamma^{GT}_k$ are compared using a square distance; the loss function also penalizes any confidence that does not correspond to a ground truth anchor. To adjust the influence of the positive and negative confidence loss, a $\lambda_{confpos}$ and a $\lambda_{confneg}$ are used.

$$Confidence_{loss} = \sum_{k=1}^{n_{anchors}} \Big( \frac{\lambda_{confpos}}{N_{obj}} I_k \big( \gamma^{GT}_k - \gamma^{pred}_k \big)^2 + \frac{\lambda_{confneg}}{n_{anchors} - N_{obj}} \bar{I}_k \big( \gamma^{pred}_k \big)^2 \Big) \tag{3.2}$$

The last part of the loss function is the class loss, seen in equation (3.3); this is used to train the network to detect different classes of objects. The output of the neural network is normalized with a softmax activation function, and the loss is a simple cross-entropy loss for classification, where $l_{kc}$ is a one-hot encoded ground truth vector and $p_{kc}$ is the output of the softmax.

$$Class_{loss} = \frac{1}{N_{obj}} \sum_{k=1}^{n_{anchors}} \sum_{c=1}^{n_{classes}} I_k \, l_{kc} \log(p_{kc}) \tag{3.3}$$

The three separate equations (3.1), (3.2), (3.3) are summed together, and the lambda factors are used to adjust their effect on the final output. The factors used in the SqueezeDet paper are $\lambda_{bbox} = 5$, $\lambda_{confpos} = 75$, $\lambda_{confneg} = 100$. The confidence loss and anchor loss are also used in section 4.3 of this thesis.
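The sketch below combines the three parts into a single training loss in the spirit of equations (3.1)-(3.3). It is a minimal sketch assuming TensorFlow tensors for a single image with at least one object; all names and shapes are illustrative, and the class term is written with the conventional negative sign so that minimizing it maximizes the log-likelihood.

```python
import tensorflow as tf

def squeezedet_loss(d_pred, d_gt, conf_pred, conf_gt, cls_pred, cls_gt,
                    obj_mask, lam_bbox=5.0, lam_pos=75.0, lam_neg=100.0):
    """d_*: [n_anchors, 4] deltas, conf_*: [n_anchors] confidences,
    cls_*: [n_anchors, n_classes], obj_mask: [n_anchors] 0/1 float mask."""
    n_obj = tf.reduce_sum(obj_mask)                       # assumed >= 1
    n_anchors = tf.cast(tf.shape(obj_mask)[0], tf.float32)

    # (3.1) delta loss, only on anchors assigned to an object
    delta = lam_bbox / n_obj * tf.reduce_sum(
        obj_mask[:, None] * tf.square(d_gt - d_pred))

    # (3.2) confidence loss over positive and negative anchors
    conf = (lam_pos / n_obj * tf.reduce_sum(
                obj_mask * tf.square(conf_gt - conf_pred))
            + lam_neg / (n_anchors - n_obj) * tf.reduce_sum(
                (1.0 - obj_mask) * tf.square(conf_pred)))

    # (3.3) cross-entropy class loss against the one-hot ground truth
    cls = -1.0 / n_obj * tf.reduce_sum(
        obj_mask[:, None] * cls_gt * tf.math.log(cls_pred + 1e-8))

    return delta + conf + cls
```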

3.3.4 MobileNets

Released in April 2017 by the Google Brain team, MobileNets [30] is a neural network specifically designed to run on mobile hardware like smartphones. Its main contribution was demonstrating the power of using depthwise separable convolutions on mobile phones. Depthwise separable convolutions had been used previously [55], but never for mobile-specific applications. MobileNets' focus on mobile applications also shows in the structure of the network: it is a simple design, making it easier to implement on mobile hardware that has a reduced set of instructions compared to desktop hardware.

3.3.4.1 Depthwise separable convolution

A depthwise separable convolution splits a normal convolution into two parts: every input channel is first convolved with a single 3×3 (or greater) kernel per channel, after which a normal 1×1 convolution combines the depthwise convolutions and, if necessary, increases the number of filters. This way the number of parameters is only

$$n_{parameters} = (3 \cdot 3 \cdot n_{inchannels}) + (1 \cdot 1 \cdot n_{inchannels} \cdot n_{outchannels})$$

instead of the number of parameters shown in section 2.1.4. As a drop-in replacement, the depthwise separable convolution is only slightly worse than a normal convolution; MobileNets saw a drop of 1.1% when using depthwise separable convolutions as opposed to normal convolutions.
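A quick parameter count shows the savings; this is a minimal sketch, and the example channel counts are illustrative.

```python
def params_standard(c_in: int, c_out: int, k: int = 3) -> int:
    return k * k * c_in * c_out

def params_depthwise_separable(c_in: int, c_out: int, k: int = 3) -> int:
    return (k * k * c_in) + (1 * 1 * c_in * c_out)

# e.g. a 3x3 convolution from 128 to 256 channels:
# 294912 standard parameters vs 33920 separable, roughly 8.7x fewer
print(params_standard(128, 256), params_depthwise_separable(128, 256))
```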

3.4 Affine Transformations

The objective of the tracking network is to track a target and update the bounding box of that target in consecutive frames; most of these updates can be described in the form of affine transformations [56]. An affine transformation works by recalculating the position of a point through a matrix multiplication with an affine matrix, commonly referred to as θ, seen below.

$$\theta = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}, \qquad \theta \cdot \begin{bmatrix} x_{old} \\ y_{old} \\ 1 \end{bmatrix} = \begin{bmatrix} x_{new} \\ y_{new} \end{bmatrix}$$

Applying the same affine transformation to the four corners of a bounding box enables the following transformations of the bounding box as a whole: identity, translation, scaling, reflection, rotation and shearing. These transformations are sufficient for the application of tracking an object in 2D space. Affine transformation networks therefore might be able to predict the target transformation from one frame to the next, provided that they are capable of focusing on whatever is in the foreground.
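The sketch below applies such a θ to the two corners of a bounding box. It is a minimal sketch assuming NumPy and homogeneous corner points; the matrix values are illustrative.

```python
import numpy as np

def transform_points(theta: np.ndarray, points: np.ndarray) -> np.ndarray:
    """points: rows of homogeneous coordinates [x, y, 1]; each output row is
    the matrix product theta * [x, y, 1]^T."""
    return (theta @ points.T).T

theta = np.array([[1.2, 0.0, 0.1],    # x scaled by 1.2, shifted by 0.1
                  [0.0, 0.8, -0.2]])  # y scaled by 0.8, shifted by -0.2
corners = np.array([[-0.5, -0.5, 1.0],   # top-left of the bounding box
                    [ 0.5,  0.5, 1.0]])  # bottom-right
print(transform_points(theta, corners))
```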


3.4.1 Spatial Transformer Networks

Spatial transformer networks, proposed in [57], were designed to overcome the lack of scaling and rotation invariance in CNNs. A spatial transformer can be added between layers in any convolutional network, as seen in fig. 3.4. The spatial transformer network transforms the input according to a predicted affine transformation before feeding it to the neural network. It requires no extra modifications and can be trained in place, making it very versatile and usable in a lot of networks. The localisation network aims to predict an affine matrix that transforms a grid of

Figure 3.4: Spatial transformer network (from [57])

points that cover only the important part of an image. This grid is then used to sample the original image and output an image that is correctly scaled, rotated and translated. The authors suggested that a convolutional network or a fully connected network could be used for the localisation network. The paper showed an increase in accuracy of up to 2% when using multiple spatial transformer networks between convolutions, but this also meant adding a substantial amount of complexity to the network; for that reason we do not use it.

3.4.2 CNN architecture for geometric matching

The network presented in [58] was designed to predict the affine transformation between two pictures: presented with two pictures, it tries to predict a θ that transforms picture 1 to match picture 2 as closely as possible. To do this the network uses a siamese structure (seen in fig. 3.5), where both pictures are fed to a classifier (in this case VGG [59]). The outputs of the classifiers are combined using a matching layer designed by the authors. The matching layer combines the two inputs with a correlation layer and zeros out any negative activations using a ReLU function. The output of the ReLU is normalized with channel-wise L2 normalization [58]. The output of the matching layer, containing data from both pictures, is then fed to a custom convolutional regression network that outputs the transformation parameters: either parameters for an affine transformation or for a thin plate spline transformation [60].


Figure 3.5: The Siamese network structure of the geometric matching network (from [58])

The paper showed an increase in accuracy over previous papers, and the authors also demonstrated their matching layer to outperform more common combination layers used in siamese networks, such as subtraction and concatenation. We used the matching layer in both networks proposed in this thesis, due to its well-argued performance and relative simplicity.

3.4.2.1 Loss function

In order to train for different geometric transformations, the loss function was designed so it could be used for any geometric transformation. The authors did so by not training directly on the parameters of a transformation, but rather expressing the loss function on a transformed grid of points, which is transformation agnostic. The score grid was an evenly spaced grid of 400 points in an image whose top right corner was at (1,1) and bottom left corner at (-1,-1). The loss, seen in equation (3.4), calculates the square distance between the two score grids known as τ, one transformed with the ground truth transformation $\theta_{GT}$ and the other with the predicted transformation θ. The distances between all the points are summed and divided by the number of points, to calculate the average square distance between the points. This loss function is also used in the loss function of section 4.2.

$$L(\theta, \theta_{GT}) = \frac{1}{N} \sum_{i=1}^{N} d\big( \tau_\theta(i), \tau_{\theta_{GT}}(i) \big)^2 \tag{3.4}$$
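A direct transcription of equation (3.4) might look as follows; this is a minimal sketch assuming 2x3 affine matrices and a NumPy grid in homogeneous coordinates, with all names illustrative.

```python
import numpy as np

def grid_loss(theta_pred: np.ndarray, theta_gt: np.ndarray,
              grid: np.ndarray) -> float:
    """Mean squared distance between the score grid transformed by the
    predicted and by the ground truth transformation."""
    p_pred = (theta_pred @ grid.T).T       # N x 2 transformed points
    p_gt = (theta_gt @ grid.T).T
    return float(np.mean(np.sum((p_pred - p_gt) ** 2, axis=1)))

# a 20x20 evenly spaced grid over [-1, 1] x [-1, 1], homogeneous coordinates
xs, ys = np.meshgrid(np.linspace(-1, 1, 20), np.linspace(-1, 1, 20))
grid = np.stack([xs.ravel(), ys.ravel(), np.ones(400)], axis=1)
```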

3.5 Datasets and Benchmarks

In order to train a neural network to perform a tracking application, a dataset containing annotated video sequences is required. The dataset must be of a significant enough size to prevent over-fitting of the neural network and to allow for better performance on unseen video sequences. In general, a dataset used for tracking purposes should preferably contain the following:

• A sufficient number of different videos

• A wide variety of different objects, in a number of different videos

• Multiple annotated objects per video


• Frame-by-frame annotated bounding boxes, preferably with rotation

• A dense mask or single value describing the visibility of an object

A benchmark is important to compare different tracking algorithms. A good benchmark for a tracking algorithm uses a wide range of different video sequences. It is also important that the measured performance indicators are well argued and explained. Finally, a benchmark should preferably be used in a yearly challenge that compares tracking algorithms.

3.5.1 ImageNet video dataset

The ImageNet dataset [18] is most well known for its use in the “ImageNet Large Scale Visual Recognition Challenge” (ILSVRC). The dataset contains around 1.2 million training images with 200 different classes of objects. Less well known is its video dataset, used in the object detection from video challenge. As of 2017 this dataset consists of 4000 different training sequences with annotated bounding boxes. The dataset also contains an additional 1314 validation sequences of annotated data, which can be used to test performance on data not seen during training. The training and validation sequences both contain the following 30 classes of objects: airplane, antelope, bear, bicycle, bird, bus, car, cattle, dog, domestic cat, elephant, fox, giant panda, hamster, horse, lion, lizard, monkey, motorcycle, rabbit, red panda, sheep, snake, squirrel, tiger, train, turtle, watercraft, whale, zebra. The annotations consist of an axis-aligned bounding box without rotation, and a rudimentary occlusion flag that is set when part of the object is occluded. The annotations are frame by frame, and there can be multiple annotated objects in a single sequence. It must be noted that this dataset does not contain annotations of people, but it will still be used to train both networks due to it being the largest and most varied dataset available.

3.5.2 Multiple object tracking benchmark & dataset

The multiple object tracking benchmark (MOT) [47] is used to compare tracker performance on simultaneous tracking of multiple objects. The ground truth of the benchmark annotates multiple trajectories per frame, where a trajectory is the path of a single target during the whole sequence. The benchmark tests a number of different metrics, but the most important one is MOTA, which stands for multiple object tracking accuracy. MOTA is calculated using equation (3.5), where the number of false negatives $FN_t$ is the number of annotated targets that are not being tracked in frame t, the number of false positives $FP_t$ is the number of tracked targets that do not correspond to a target annotated in frame t, and $IDSW_t$ is the number of trackers that are tracking a target annotated with a different identity in the current frame from the one tracked by the same tracker in the previous frame. Finally, $GT_t$ is the number of targets annotated in the current frame t.

$$MOTA = 1 - \frac{\sum_t \big( FN_t + FP_t + IDSW_t \big)}{\sum_t GT_t} \tag{3.5}$$

The authors of [47] note that while MOTA is a good indicator of overall performance, it is debatable whether this number alone can serve as a good performance measure. Another metric used to measure performance is the multiple object tracking precision, which measures the average overlap of all tracked bounding boxes that have been matched to a ground truth annotated box [47]. The MOT benchmark is used to compare the performance of all the trackers submitted to the yearly MOT Challenge; the 2017 challenge was open for entries until May 31st. To aid participants of the challenge, a dataset is made available to train competing neural networks on. The dataset consists of 21 video sequences for training, in which a total of 1638 different people are annotated. The annotations consist of frame-by-frame bounding box annotations for each person in frame, and a visibility percentage from 0 to 1 describing how much of a person is visible. The dataset will be used to train the network presented in section 4.2 for the MOT challenge [47] submission.
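For reference, the MOTA computation of equation (3.5) reduces to a few lines; this is a minimal sketch with illustrative per-frame counts.

```python
def mota(fn, fp, idsw, gt) -> float:
    """fn, fp, idsw, gt: lists with one count per frame."""
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / sum(gt)

# three frames with a handful of errors against 10 annotated targets each
print(mota(fn=[1, 0, 2], fp=[0, 1, 0], idsw=[0, 0, 1], gt=[10, 10, 10]))  # ~0.833
```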

3.5.3 Visual object tracking benchmark

The visual object tracking (VOT) benchmark [4] is designed to challenge and compare single object tracking algorithms. It tests every tracker on a number of carefully selected sequences designed to test whether the tracker is capable of handling occlusion, camera motion, light changes, and sequences with excessive scale and position changes. The main performance indicators are accuracy (A), robustness (R), expected average overlap (EAO), and equivalent filter operations (EFO). Accuracy is a measure of the overlap between a predicted bounding box and the ground truth annotated bounding box. Robustness is a measure of how often a tracker fails, a failure meaning there is no overlap between the predicted bounding box and the ground truth bounding box. The expected average overlap estimates the average overlap a tracker would obtain on a short-term sequence without having to reset [61]. Finally, the equivalent filter operations metric is a measure of speed that prevents bias due to hardware differences by normalizing the speed of the tracker using a predefined filtering operation [4]. Although the VOT benchmark provides a dataset to test the tracking algorithms, it is forbidden to train on that dataset when submitting a tracker to the yearly VOT challenge [62].


Chapter 4

Tracking algorithm

The problem description in section 2.4 described the problems with current tracking implementations. They often do not support anisotropic scaling, which means the aspect ratio of the bounding box of the target is constant, and changes in the target's scale are often detected with the use of a scale pyramid [26] containing differently scaled search windows. The problem with using a scale pyramid is that it linearly increases the computations required, due to the tracking algorithm having to run once for each scale. A tracker using a scale pyramid is also harder to implement on mobile hardware, due to the lack of available API calls to perform scaling operations and batching of the different scales. This chapter will first present an evaluation of “SqueezeDet” and the “Fully-Convolutional Siamese Networks for Object Tracking” in section 4.1. Then a network for regression of an affine transformation, adapted for tracking applications, will be presented in section 4.2. Finally, a novel network performing object detection based on a single exemplar will be presented, including the filtering required to apply it to a tracking application, in section 4.3.

4.1 Evaluating related works

In section 2.4.2 the main goals of the thesis were determined to be: increasing the performance regarding the speed of the tracker, enabling tracking of scale including anisotropic scaling, and creating an algorithm that is simple to implement. Some of the related works already focus on these goals; for this reason two related works are evaluated in depth. The first one is “SqueezeDet” (SQDet), as it is one of the lightest and fastest networks [54] that does object detection. The goal is to implement it in iOS using the Metal API, which should show whether or not a state-of-the-art network designed to be lightweight (like SQDet) is able to run in real-time on mobile hardware. The second one is the “Fully-Convolutional Siamese Networks for Object Tracking” [42], which should give insights into the challenges regarding the implementation and training of a siamese tracker in TensorFlow, and whether the network in the paper can be altered to fulfill the requirements of section 2.4.2.


Release date        iPhone model   Speed in fps
19 September 2014   iPhone 6       14
9 September 2015    iPhone 6S      47
21 March 2016       iPhone SE      47
16 September 2016   iPhone 7       57

Table 4.1: Comparing the speed of SqueezeDet on different iPhone models

4.1.1 SqueezeDet performance

At the time of release, SqueezeDet was the lightest and fastest network for object detection, and thus a perfect candidate for an experimental implementation in iOS. The Metal API natively supports convolutional layers, and though the fire module is not natively supported, it could be implemented using an intermediate image and an image offset, as can be seen in appendix A. It was possible to execute the whole neural net on the GPU using the Metal API; this way only the final bounding box calculation and filtering needed to be done on the CPU. To filter the bounding boxes, first the anchors with a confidence score below a threshold of 0.4 were excluded from consideration. For the remaining anchors the bounding boxes were calculated using the predicted deltas, and the final bounding boxes were then sorted by confidence. Starting with the bounding box that had the highest confidence score, the IOU with every other bounding box was calculated, and any box with an IOU above 0.4 was dropped. This process of filtering the detections is also known as non-maximum suppression (NMS) [63]. Doing the NMS on the phone made the FPS fluctuate by ±5 FPS depending on how many detections were kept after the confidence threshold. The neural network and bounding box filtering were executed on different iPhone models that support the Metal API, the results of which can be seen in table 4.1.
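The filtering described above might be sketched as follows; this is a minimal sketch assuming boxes as (x1, y1, x2, y2) tuples, with the 0.4 thresholds taken from the text and everything else illustrative.

```python
def iou(a, b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, conf_thresh=0.4, iou_thresh=0.4):
    """Greedy non-maximum suppression: keep the most confident boxes and
    drop any remaining box that overlaps a kept one too much."""
    dets = sorted((d for d in zip(scores, boxes) if d[0] >= conf_thresh),
                  key=lambda d: d[0], reverse=True)
    keep = []
    for score, box in dets:
        if all(iou(box, kept) <= iou_thresh for _, kept in keep):
            keep.append((score, box))
    return keep
```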

4.1.1.1 Conclusion

The results show that the iPhone 6S and later models all run the network in real-time. This is somewhat surprising, as only two years earlier, in 2015, Faster R-CNN was running at only 5 fps on the much more powerful Titan X. It must be noted that GPU utilization is 100% when running SqueezeDet. Thus it can be concluded that SqueezeDet, configured as shown in appendix B and using around 850 million MACCs, can run in real-time on an iPhone. Due to the high GPU utilization it would be a significant drain on the battery; it would be preferable if the GPU utilization were decreased to save battery life. Thus 850 million MACCs can be seen as an upper limit for real-time execution of a neural network on mobile hardware.


4.1.2 Fully Convolutional Siamese Tracker

As stated in section 1.2.2, CFNet [6] is one of the most recent tracking algorithms utilizing deep learning. Since this tracking algorithm was only released in April 2017, after the work on this thesis was started, CFNet has not been evaluated in this thesis. The performance of CFNet is state-of-the-art, but its use of novel layers in its network architecture will not be supported by the Metal API in the foreseeable future. For this reason an earlier paper by the same authors, which introduced some of the ideas used in the CFNet paper, will be evaluated instead, namely [42]. In that paper a smaller version of AlexNet is used in a tracking application.

4.1.2.1 Tracking algorithm

The tracking algorithm in the paper uses a siamese network (seen in fig. 4.1) to perform a cross correlation of an exemplar and a search window. The output of the cross correlation is a score map where the highest activation represents the location of the target in the search window.

Figure 4.1: The Siamese network structure used in [42]. φ represents the neural network (from [42])

In order to perform the cross correlation, a feature map of the target is first generated. This is done by cropping a part of the image known to contain the target; the crop needs to include some amount of context to enable robust performance [6]. The size of the crop including the context can be calculated using equation (4.1), where p is a parameter determining the amount of context; the p value used in this thesis is 0.5, resulting in 50% of the crop being context.

$$context = p \cdot (target_{width} + target_{height})$$
$$size_{crop_{ex}} = \sqrt{(target_{width} + context) \cdot (target_{height} + context)} \tag{4.1}$$


The crop is resized to 127×127 pixels and fed as exemplar z to the neural net φ; the output is a feature map of 6×6×128 [42]. The search window is created in a similar manner: the size of the search window crop can be determined from the exemplar crop size and the ratio of the respective network input sizes, $size_{crop_{se}} = size_{crop_{ex}} \cdot (255/127)$. The crop of the search window is taken at the last known location of the target. It is resized to 255×255 pixels and fed as search window x into the neural net; the resulting feature map is 22×22×128. The exemplar feature map is cross correlated with the feature map of the search window, resulting in a feature map of size 17×17×128. This feature map is reduced by summing all channel values at each 2D position, resulting in a score map of 17×17×1. The score map is upscaled using bilinear interpolation to increase accuracy. To enable detection of scale changes, a scale pyramid of search windows is used. The scale of the target in the new frame is assumed to be that of the score map with the highest activation, while the position is the location of the highest activation on that specific score map. The size and location of the bounding box are adjusted based on the detected position and scale, and the same happens to the search window. It must be noted that due to the use of a scale pyramid, the neural net is not able to detect anisotropic scaling.
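The crop sizes might be computed as in the sketch below; this is a minimal sketch assuming the square-root form of equation (4.1), with illustrative target dimensions.

```python
import math

def crop_sizes(target_w: float, target_h: float, p: float = 0.5):
    """Exemplar crop size per equation (4.1), scaled up for the search window."""
    context = p * (target_w + target_h)
    size_ex = math.sqrt((target_w + context) * (target_h + context))
    size_se = size_ex * (255.0 / 127.0)
    return size_ex, size_se

print(crop_sizes(80, 40))  # an 80x40 target with 50% context
```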

4.1.2.2 Neural network

As stated before, [42] uses a smaller version of AlexNet, the structure of which can be seen in table 4.2. It is important to note that the network uses no padding around the edges, which means that a convolution is only applied on receptive fields containing image data. A side effect of this is that the output size of a layer is determined not only by its stride but also by the kernel size, as seen in equation (4.2).

$$output_{width} = \frac{input_{width} - kernel_{width}}{stride} + (kernel_{width} \bmod 2)$$
$$output_{height} = \frac{input_{height} - kernel_{height}}{stride} + (kernel_{height} \bmod 2) \tag{4.2}$$
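Equation (4.2) can be checked against table 4.2 in a few lines; this is a minimal sketch using integer division, valid for the odd kernel sizes used here.

```python
def out_size(in_size: int, kernel: int, stride: int) -> int:
    """Valid (unpadded) output size per equation (4.2)."""
    return (in_size - kernel) // stride + (kernel % 2)

assert out_size(127, 11, 2) == 59  # Conv1, exemplar branch
assert out_size(59, 3, 2) == 29    # Pool1
assert out_size(29, 5, 1) == 25    # Conv2
```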

4.1.2.3 Training method

The neural network is trained using an exemplar taken from a random frame of a random sequence, and a search window taken from a random frame (up to 50 frames later) in the same sequence. The exemplar is generated as usual, but the search window is always generated in such a way that the target is in the middle of the window. For this reason the ground truth labels are a 17×17×1 map named v, with ones within a radius R around the center and -1 outside R. This score map is used in the loss function seen in equation (4.3), where D = 17 · 17 and y is the predicted score map.


Table 4.2: The layers of the neural network used in the “fully convolutional siamese tracker”, with activation sizes for both branches (adapted from [42])

Layer   Kernel   Stride   Exemplar   Search window   Channels
Input   -        -        127x127    255x255         3
Conv1   11x11    2        59x59      123x123         96
Pool1   3x3      2        29x29      61x61           96
Conv2   5x5      1        25x25      57x57           256
Pool2   3x3      2        12x12      28x28           256
Conv3   3x3      1        10x10      26x26           192
Conv4   3x3      1        8x8        24x24           192
Conv5   3x3      1        6x6        22x22           128

$$Loss = \frac{1}{D} \sum_{i=1}^{D} \log\big(1 + \exp(-y_i v_i)\big) \tag{4.3}$$

This method of training is only possible when no padding is used in the network. If padding were used, the network could over-fit on the zeros used to pad the image: since the score map never changes, the network could learn to produce a perfect score using only the padded zeros. But since no padding is present, the neural network can only act on the image data, and thus does not develop a bias.

4.1.2.4 Performance

While this neural network has one of the best performance and accuracy scores in the 2016 VOT benchmark, it is not possible to run it in real-time on mobile hardware. The number of MACCs required per single forward pass, based on the network structure shown in table 4.2, can be calculated (using the equations in section 2.1.5) to be around 1.97 billion. While one pass could be executed in real-time, the fact that it has to run multiple times to allow for scale detection using a scale pyramid rules out any chance of running it in its current state on mobile hardware. One could try to replace AlexNet with a lighter network such as SqueezeNet or MobileNets to achieve acceptable performance. The problem with using either of those is that they rely heavily on padding in their structure. Without padding, the two branches of the fire module of SqueezeNet produce different-size feature maps which cannot be concatenated. MobileNets could be used without padding, but because every layer would reduce the feature map size, it would result in an output feature map too small to be used for cross correlation. This illustrates the problem with the “fully convolutional siamese network”: its reliance on the lack of padding rules out most of the high-performing networks in use, while increasing depth is often used as a means to achieve better accuracy, as each layer can learn more and more abstract features.

4.2 Affine regression tracker

The fully convolutional siamese tracker has the following issues: a lack of anisotropic scaling, difficulty in improving performance due to the reliance on a lack of padding in the network, and no real-time performance on mobile hardware. Based on these factors, an attempt was made by us to design a new network that does not rely on the lack of padding, does not need to be run multiple times, and supports anisotropic scaling. A recent paper released in March 2017 describes a method to perform geometric matching between two images [58]. The method uses VGG [59] as a feature extractor and a siamese network structure, as seen in fig. 3.5, to predict an affine transformation matrix. The geometric matching network was designed to predict an affine transformation that would transform one image to look similar to another image. We assumed that instead of predicting a transformation for the whole image, it would be able to track the transformation of a target from frame to frame. This would enable us to track the transformations of a target between frames; these transformations can theoretically include translation, anisotropic scaling, rotation and shearing.

4.2.1 Modifications

In order to reuse the network presented in [58] for a tracking application, a number of modifications needed to be made. Instead of training with randomly generated affine transformations, the affine transformation needs to be calculated based on ground truth data, as explained in section 4.2.1.1. A second problem was that the network was originally trained using a grid of points covering the whole image, called a score grid; this score grid had to be modified to cover only the bounding box of the target, as explained in section 4.2.1.2. Finally, the loss function had to be adjusted to detect when a target is no longer visible, as explained in section 4.2.1.3.

4.2.1.1 Affine transformation

An affine transformation is a mapping that acts linearly on the points in an affine space. The most common method for applying affine transformations to points in a 2D space is the affine matrix explained in section 3.4. An affine matrix is capable of describing the full transformation of one bounding box in 2D space to another bounding box in 2D space, including rotation, scaling and shearing. A visual demonstration of an affine transformation between two bounding boxes can be seen in fig. 4.2. Since the ILSVRC dataset only contains position and size data [18], the transformations are limited to translation and scaling; rotation and shearing were unfortunately not annotated in the ILSVRC dataset. To calculate the affine transformation matrix, also known as θ (seen in equation (4.4)), one needs to know the position of each bounding box and its respective width and height. These parameters can be used to calculate two points describing the top left and bottom right corners of each bounding box. Multiplying each of these coordinate points by the affine matrix results in the bbox′ matrix (4.5), describing the new bounding box of the target in the new frame.

$$\theta = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} \tag{4.4}$$

$$bbox' = \begin{bmatrix} x'_{left} & x'_{right} \\ y'_{top} & y'_{bottom} \end{bmatrix} = \theta \cdot \begin{bmatrix} bbox^T \\ \mathbf{1} \end{bmatrix} \tag{4.5}$$

From the annotated data, both sets of points describing a bounding box are known; in order to calculate θ using these points, the system of linear equations in (4.6) needs to be solved. Note that $a_{12} = a_{21} = 0$, because the ground truth only contains axis-aligned bounding boxes, and thus scaling and translation on one axis are independent of the other axis.

$$\begin{aligned}
x'_{left} &= a_{11} \cdot x_{left} + 0 \cdot y_{top} + a_{13} \\
y'_{top} &= 0 \cdot x_{left} + a_{22} \cdot y_{top} + a_{23} \\
x'_{right} &= a_{11} \cdot x_{right} + 0 \cdot y_{bottom} + a_{13} \\
y'_{bottom} &= 0 \cdot x_{right} + a_{22} \cdot y_{bottom} + a_{23}
\end{aligned} \tag{4.6}$$

Solving for all relevant $a_{ij}$ gives the equations in (4.7).

$$\begin{aligned}
a_{11} &= (x'_{right} - x'_{left})/(x_{right} - x_{left}) \\
a_{22} &= (y'_{bottom} - y'_{top})/(y_{bottom} - y_{top}) \\
a_{13} &= x'_{left} - x_{left} \cdot a_{11} \\
a_{23} &= y'_{top} - y_{top} \cdot a_{22}
\end{aligned} \tag{4.7}$$

Filling in the calculated values of $a_{ij}$ in the θ matrix gives the transformation describing the translation and scaling of a bounding box from one frame to another.
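The solution of equations (4.6)-(4.7) might be written as follows; this is a minimal sketch assuming corner-point boxes (x_left, y_top, x_right, y_bottom), with illustrative values.

```python
def solve_theta(old, new):
    """Recover the axis-aligned affine parameters mapping box old to box new."""
    xl, yt, xr, yb = old
    nxl, nyt, nxr, nyb = new
    a11 = (nxr - nxl) / (xr - xl)   # x scale
    a22 = (nyb - nyt) / (yb - yt)   # y scale
    a13 = nxl - xl * a11            # x translation
    a23 = nyt - yt * a22            # y translation
    return [[a11, 0.0, a13],
            [0.0, a22, a23]]

# the target doubles in width and shifts right by 10 pixels
print(solve_theta((0, 0, 20, 10), (10, 0, 50, 10)))
```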

4.2.1.2 Score grid

In order to train this network for a tracking application, the score grid used in the loss function explained in section 3.4.2.1 needs to be modified. Instead of generating a score grid covering the whole image, the score grid should only cover the bounding box at the position of the object in the previous frame. This score grid is colored green in fig. 4.2.


Figure 4.2: Image illustrating the score grid and θ in a tracking application. The green grid covers the bounding box of the original position of the target, and the red grid covers the bounding box of the target at its new position.

Since the score grid is only used to enable the neural net to train on predicting θ, and not during inference, one could use a score grid covering the whole image as done in the original paper. However, it is important to have a score grid covering only the bounding box of the target at its original position. If a score grid covering the whole image were used, any big position change of the target would transform it far outside the unity coordinate system defined from -1 to 1. This is not a problem in the original paper, but when a target can move from one corner to another it would put the score grid far outside the unit range. It is always beneficial for a neural network to have its inputs and variables in a symmetric range of -1 to 1, as this prevents a bias in the data and stays in the highest-precision range of floating point numbers [64].

4.2.1.3 Loss Function

The network could be trained using the loss function of the original paper [58] explained in section 3.4.2.1, but in any tracking application it is important to recognize when a target is lost. A target could be considered lost when it is no longer in the search window, or is fully occluded by another object. To solve this issue, we regress a visibility value. The MOT challenge training dataset contains a visibility value in the range of 0 to 1. This visibility value can be used to train a neural network to predict a similar value when a target is occluded. The occlusion is defined as $o_{GT} = 1 - visibility$, and it can simply be specified as an extra parameter o to regress next to the transformation parameters. Adding it as a square distance error to the original loss results in equation (4.8). The occlusion variable $o_{GT}$ is also used to inversely scale the loss of the score grid, as it should not be trained on when a target is fully occluded.

$$L(\theta, \theta_{GT}, o, o_{GT}) = (o - o_{GT})^2 + \frac{(1 - o_{GT})}{N} \sum_{i=1}^{N} d\big( \tau_\theta(i), \tau_{\theta_{GT}}(i) \big)^2 \tag{4.8}$$

Here N is the number of points in the score grid, $\tau_\theta$ is the score grid after being transformed with the predicted θ, and $\tau_{\theta_{GT}}$ is the score grid transformed with the ground truth $\theta_{GT}$. Note that it is important to apply a sigmoid activation function to the o variable before the loss is calculated; this squashes the value so it is always in the desired 0 to 1 range. To balance the loss function in equation (4.8) and prevent the occlusion loss from becoming the dominant factor, we integrated a hyperparameter $\alpha \in [0, 1]$ into the loss, as seen in equation (4.9).

$$L(\theta, \theta_{GT}, o, o_{GT}) = \alpha (o - o_{GT})^2 + \frac{(1 - o_{GT})}{N} \sum_{i=1}^{N} (1 - \alpha) \, d\big( \tau_\theta(i), \tau_{\theta_{GT}}(i) \big)^2 \tag{4.9}$$

4.2.2 Tracking algorithm

The affine regression tracker uses a tracking algorithm similar to that of the fully convolutional tracker. Instead of adjusting the search window and bounding box based on a score map and scale pyramid, it adjusts the search window and bounding box based on the predicted affine matrix. Using θ in the tracking algorithm enables detection of anisotropic scaling, and removes the need to run the network multiple times. The tracking algorithm is essentially a feedback control loop attempting to correct the error of the target position being off center and different in scale. A common method to increase the capabilities of an error-correcting feedback loop is the use of a proportional, integral, and derivative (PID) controller [65]. A PID controller takes into account the error over time and the derivative of the error, which adds some form of temporal awareness to the position updates. A PID controller can be tuned with the use of three factors: $K_p$ is a scaling factor to increase or decrease the immediate response to a measured error; the integrated error, or error over time, which can be used to predict an error, is scaled with $K_i$; and $K_d$ scales the derivative of the error, which is mainly used to dampen any sporadic changes. An illustration of the PID controller and the different K factors can be seen in fig. 4.3. The K factors are tuned offline by hand to decrease the overall error of the position; the reference is set to 0.0, representing the center of the search window.
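A textbook discrete PID update applied to the position error might look as follows; this is a minimal sketch assuming one update per frame, with illustrative gains rather than the hand-tuned values used in the thesis.

```python
class PID:
    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error: float) -> float:
        """error: offset of the target from the search-window center (ref 0.0)."""
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID(kp=0.5, ki=0.01, kd=0.1)
correction = pid.update(error=0.2)  # shift the search window by this amount
```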


Figure 4.3: Image illustrating the process of using a PID controller to minimize the error in a feedback loop.

4.3 One shot learning object detector

We designed a one shot learning object detector neural network to be used in a tracking algorithm. The network is a siamese network that uses MobileNets as a feature extractor and the matching layer from [58] as a combination layer. The output of the matching layer is then used in a small neural network, similar to ConvDet from [54], that performs object detection, as seen in fig. 4.4.

Figure 4.4: A visual representation of the network proposed as a one shot object detector: two MobileNets branches with shared weights and biases, a 112x112 exemplar input and a 224x224 search input, followed by the matching layer and the object detection layers.

4.3.1 Object detection

The object detection network consists of a small number of layers, similar to the ones used in [30], with the final layer being similar to the final layer of ConvDet. The final convolution has a 3×3 kernel and outputs a feature map with 5 channels. Each 2D location on the output feature map corresponds to a grid point in the original image. Combined with the original bounding box of the target, each of these grid points becomes a reference bounding box called an anchor [54]. Every anchor can be described as $(x_i, y_j, w, h)$ with $i \in [1, W]$, $j \in [1, H]$, which is a slight modification of the anchor description in [54], as the height and width are equal for every anchor. In order to allow a more fine-grained bounding box prediction and changes in the size of the anchor, every anchor can be slightly adjusted using a small offset $(\hat{\delta x}_{ij}, \hat{\delta y}_{ij}, \hat{\delta w}_{ij}, \hat{\delta h}_{ij})$, which is encoded in 4 of the 5 output channels of the final feature map. The ground truth offsets used for training can be calculated with equation (4.10) from [66].

$$\begin{aligned}
\delta x^{GT}_{i} &= (x^{GT} - x_i)/w \\
\delta y^{GT}_{j} &= (y^{GT} - y_j)/h \\
\delta w^{GT} &= \log(w^{GT}/w) \\
\delta h^{GT} &= \log(h^{GT}/h)
\end{aligned} \tag{4.10}$$

Equation (4.11) can be used to calculate the predicted bounding boxes from the anchors and deltas.

$$\begin{aligned}
x^{p}_{i} &= x_i + w \, \delta x_{ij} \\
y^{p}_{j} &= y_j + h \, \delta y_{ij} \\
w^{p} &= w \, e^{\delta w_{ij}} \\
h^{p} &= h \, e^{\delta h_{ij}}
\end{aligned} \tag{4.11}$$
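The offset encoding of (4.10) and its inverse (4.11) might be sketched as follows; this is a minimal sketch assuming (cx, cy, w, h) boxes and a single anchor shape, with all names illustrative. The assert verifies that decoding the encoded ground truth recovers it.

```python
import math

def encode(gt, anchor):
    """Ground truth offsets per equation (4.10)."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return ((gx - ax) / aw, (gy - ay) / ah,
            math.log(gw / aw), math.log(gh / ah))

def decode(deltas, anchor):
    """Predicted bounding box per equation (4.11)."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = deltas
    return (ax + aw * dx, ay + ah * dy,
            aw * math.exp(dw), ah * math.exp(dh))

anchor = (0.5, 0.5, 0.2, 0.2)
gt = (0.55, 0.45, 0.3, 0.1)
assert all(abs(a - b) < 1e-9
           for a, b in zip(decode(encode(gt, anchor), anchor), gt))
```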

The 5th channel of the output feature map is a confidence score γ for the target being at a certain position. A sigmoid activation function is applied to this parameter, squashing the output into the range of 0 to 1. During training, the ground truth confidence score is defined as $\gamma^{GT}_{ij} = IOU([x^{GT}, y^{GT}, w^{GT}, h^{GT}], [x_i, y_j, w, h])$, where IOU represents the intersection over union. In fig. 4.5 a visual illustration of the function of the output layer can be seen. Out of a grid of 14 · 14 = 196 anchors, the anchor with the highest confidence score (ideally the one closest to the center of the target in the search window) is chosen. Using the reference bounding box from the exemplar and the predicted offsets, a bounding box covering the target in the search window is calculated.

4.3.2 Loss function

To train the siamese detection network we used a loss function similar to the one used in SqueezeDet. Since only one anchor shape is used, the number of anchors equals the size of the output feature map: $n_{anchors} = W \cdot H$. The detector is trained to detect any generic object given to its exemplar branch. This negates the need for a class loss, since there can only be one object used as exemplar at a time; for this reason the class loss is removed from the original loss function.


Figure 4.5: A diagram demonstrating the use of anchors and the position of a detection in the feature map to predict a bounding box for the target in the search window.

The resulting loss function can be seen in equation (4.12).

$$Loss = \lambda_{bbox} \sum_{k=1}^{n_{anchors}} \Big( I_k \sum_{h=1}^{4} \big( \delta^{GT}_h - \delta^{P}_h \big)^2 \Big) + \sum_{k=1}^{n_{anchors}} \Big( \lambda_{confpos} I_k \big( \gamma^{GT}_k - \gamma^{p}_k \big)^2 + \frac{\lambda_{confneg}}{n_{anchors}} \bar{I}_k \big( \gamma^{p}_k \big)^2 \Big) \tag{4.12}$$

4.3.3 Tracking Algorithm

As explained in section 4.3.1 the neural network outputs a confidence score and deltas for every anchor. To predict the bounding box position of the target in the search window, the anchor with the highest confidence score is selected and the corresponding deltas are used to calculate the bounding box using equation (4.11). The position change of the target's search window is filtered using the same method as section 4.2.2. The size change of the bounding box is dampened using the width and height of the previous bounding box, w_old and h_old respectively. The dampened bounding box can be calculated using equation (4.13), where w_new and h_new are the width and height of the newly predicted bounding box and β_w and β_h are hyperparameters.

\[
\begin{aligned}
w' &= (1 - \beta_w) \cdot w_{old} + \beta_w \cdot w_{new} \\
h' &= (1 - \beta_h) \cdot h_{old} + \beta_h \cdot h_{new}
\end{aligned}
\tag{4.13}
\]
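As a minimal illustration, equation (4.13) reduces to a one-line helper; the β value follows the best configuration in table 6.1 (β_w = β_h = 0.15).

def dampen(old, new, beta=0.15):
    """Exponentially smooth a bounding box dimension (equation 4.13)."""
    return (1.0 - beta) * old + beta * new

w_damped = dampen(old=120.0, new=150.0)  # -> 124.5 pixels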


Chapter 5

Technical details

This chapter will describe the exact neural network structure of the two separate tracking algorithms, and the implementation details needed to use the neural networks in a tracking application. This includes the hyperparameters that were used during training of the respective networks, and the hyperparameters used during tracking. While the two tracking algorithms are different, they both consist of three distinct stages.

Initialization: in the initialization stage, executed at t = 0, the algorithm is supplied with the first video frame and a bounding box covering the target in the frame. Using the annotated bounding box the exemplar image is cropped from the frame according to the method described in section 4.1.2.1, using p = 0.5 for the context amount. This exemplar is then fed to the exemplar branch of the neural network used in the tracking algorithm to generate the initial feature map feat_ex of the exemplar. The scaling factor used to generate the search window in the next step needs to be calculated in this step as well. This was done using the following equation: scale = size_crop_ex / size_ex, where size_crop_ex is the crop size of the exemplar and size_ex is the size of the exemplar input to the neural network.
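The sketch below summarizes this initialization step in Python. crop_and_resize and exemplar_branch are hypothetical placeholders, and the context computation is an assumption about the crop method of section 4.1.2.1 (a SiamFC-style square crop).

def initialize(frame, bbox, size_ex=112, p=0.5):
    x, y, w, h = bbox                               # target center and size
    context = p * (w + h)                           # assumed context term
    size_crop_ex = ((w + context) * (h + context)) ** 0.5
    exemplar = crop_and_resize(frame, center=(x, y),
                               crop=size_crop_ex, out=size_ex)
    feat_ex = exemplar_branch(exemplar)             # initial exemplar feature map
    scale = size_crop_ex / size_ex                  # scaling factor for search windows
    return feat_ex, scale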

Figure 5.1: A diagram of the initialization step of the tracking algorithms.


Bounding box prediction: after initialization, the first step for every new frame is to predict a new bounding box location using the neural network. First a search window is created by scaling frame t = t with the scale calculated at t = t − 1, after which a crop the size of the neural network input size_se is extracted. A feature map of this search window is generated, called feat_se. The feature map of the exemplar feat_ex from t = t − 1 and feat_se are fed to the matching layer and object detection network. The output of the object detection is a predicted bounding box in the format (c_x, c_y, w, h), where c_x = center_x, c_y = center_y, w = width and h = height. A visual illustration of this process can be seen in fig. 5.2.

Figure 5.2: A diagram of the bounding box prediction step of the tracking algorithm.


Filtering & feature map update: when a bounding box is predicted, it is filtered using a PID controller (see section 4.2.2) to filter the position update. The scale is filtered with either the method explained in section 5.1.2.2 for the affine tracking algorithm or the method explained in section 5.2.2 for the one shot detection based algorithm. The feature map used in the next frame is updated using a rolling average as explained in section 5.1.2.3. A block diagram of the interaction of these filtering techniques can be seen in fig. 5.3.

Figure 5.3: A diagram of the bounding box filtering and feature map updating process.

Every frame after the initialization step has to be processed by the bounding box prediction and filtering & feature map update steps. During training of the neural networks, however, only the bounding box prediction step is executed as part of the inference stage. The search window and exemplar used for training are created with the method described in section 4.1.2.1.
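Put together, the per-frame flow can be summarized with the following Python sketch; every helper is a hypothetical placeholder for a component described in this chapter.

feat_ex, scale = initialize(first_frame, init_bbox)
bbox = init_bbox
for frame in video:
    search = crop_search_window(frame, bbox, scale)       # scaled crop of size_se
    feat_se = search_branch(search)
    prediction = detect(feat_ex, feat_se)                 # matching layer + detector
    bbox, scale = filter_update(bbox, prediction, scale)  # PID + damping
    new_exemplar = crop_exemplar(frame, bbox, scale)
    feat_ex = rolling_average(feat_ex, exemplar_branch(new_exemplar))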


5.1 Affine regression network

As stated in section 4.2, the affine regression network part of the affine tracking algorithm uses VGG [59] as a feature extractor. Specifically version D is used, but with only the first 10 layers, cutting off the network after the third max pooling layer. The results of the VGG feature extractors are combined using the matching layer from [58], as can be seen in table 5.1. Note that the VGG feature extractors used in both branches use the same pre-trained weights, which are frozen during training.

Table 5.1: Preprocessing structure before the affine regression network; match is the matching layer of [58] used to combine the feature maps of the VGG [59] based feature extractors.

Layer    Exemplar activation (size_ex)   Search window activation (size_se)   Channels
Input    126x126                         226x226                              3
VGG16    11x11                           28x28                                256
Match    -                               28x28                                121

The output feature map of the matching layer is used as input for the affine regression network. The specific structure of the affine regression network can be seen in table 5.2.

Table 5.2: Structure of the affine regression network; Conv stands for convolutional layer, FC for fully connected layer. Maxpool and Avgpool are max pooling and average pooling layers respectively.

Layer      Kernel   Stride   Activation size   Channels
Input      -        -        28x28             121
Conv1      3x3      1        28x28             512
Conv2      3x3      1        28x28             1024
Maxpool1   3x3      2        13x13             1024
Conv3      3x3      1        13x13             2048
Maxpool2   3x3      1        6x6               2048
Conv4      3x3      1        6x6               4096
Avgpool1   6x6      1        1x1               4096
Dropout    -        -        1x1               4096
FC 1       4096x5   -        1x1               5


5.1.1 Training

Training of the affine regression network was done on an Nvidia GTX 1080 Ti, with two different datasets. For the submission to the MOT challenge (explained in section 3.5.2), a yearly multi object tracking challenge requiring participants to track multiple people in a frame concurrently, the network was trained with the MOT training dataset described in section 3.5.2. For evaluation on the VOT benchmark the network was trained with the ILSVRC dataset described in section 3.5.1. Examples to train the network were generated by cropping an exemplar of a target from a randomly selected frame f_ex within a randomly selected sequence. The search window was generated by selecting a frame f_se up to T_frames earlier or later in the same sequence that contains the target. Frame f_se would be resized using the same scale as the exemplar scale (as explained in chapter 5), after which a crop the size of the search window input would be taken at the location of the target in f_ex. This method results in around T_frames · 2 combinations per frame in the dataset; there are 1181113 frames in the ILSVRC dataset [18]. Using T_frames = 50 this results in around 118 · 10^6 example pairs per epoch. For this reason the run time of the network is given in the number of mini-batches or steps, as it is not necessary to run one whole epoch to reach convergence.

To increase the robustness of the network, a percentage of the training examples were augmented by rotation, mirroring horizontally or mirroring vertically. The percentage of augmented examples was a hyperparameter set before training, determining what percentage of randomly selected examples per mini-batch were augmented during training. During training, care was taken to train with sufficient examples in which a target becomes occluded. This was done by searching for examples where the target in the exemplar was not occluded according to the ground truth, but the target in the search window was occluded or not inside the cropped search window. The percentage of exemplar and search window pairs where the target becomes occluded was another hyperparameter set before training. This hyperparameter determined the ratio of non-occluded to occluded examples within a mini-batch. The optimizer used to perform SGD was the Adam optimizer [67]. The hyperparameters used during training are listed in table 5.3.


Table 5.3: Hyperparameters used during training of the affine regression network

Name                        Value   Description
Initial learning rate       10^-3   anything higher would result in NaNs in the weights
Batch size                  64      maximum possible batch size, to reduce the large amount of noise on the loss function
Score grid size             3x3     more is not needed for affine transformations
T_frames                    50      from [42]
α                           0.03    experimentally set
Dropout percentage          40%     required to prevent overfitting
Occlusion change examples   25%     best guess
Augmentation percentage     50%     best guess
Number of training steps    500k    at this point convergence had been reached

5.1.2 Tracking

To use the output of the affine regression network in a tracking algorithm, the dependent variables in the affine transformation matrix θ have to be made independent, as explained in section 5.1.2.1. It is important that the translation and scaling variables are independent so that filtering can be applied to each of the variables independently. The independent variables were filtered to increase accuracy, and a rolling average was used to increase robustness. The hyperparameters used for the tracking process are listed in table 5.4; they were determined by experimentation on a small subset of video sequences.

Table 5.4: Hyperparameters used during tracking with the affine regression algorithm

Name     Value
K_p      2
K_i      0.1
K_d      0.001
β_avg    0.04


5.1.2.1 Separating variables

The output of the affine regression network is a predicted affine transformation. The variables in an affine transformation are dependent on each other; to enable filtering of the separated scaling and translation values, they first had to be made independent. This was done by applying the affine transformation to the normalized points (xmin_norm, ymin_norm) = (0, 0) and (xmax_norm, ymax_norm) = (1, 1). The output of this transformation was used to calculate the independent normalized variables d_w, d_h, d_x, d_y using the equations in (5.1).

\[
\begin{aligned}
d_w &= xmax_{norm} - xmin_{norm} \\
d_h &= ymax_{norm} - ymin_{norm} \\
d_x &= xmin_{norm} + \tfrac{d_w}{2} - 0.5 \\
d_y &= ymin_{norm} + \tfrac{d_h}{2} - 0.5
\end{aligned}
\tag{5.1}
\]
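Equation (5.1) translates directly into a few lines of NumPy: the predicted 2x3 affine matrix θ is applied to the two normalized corner points, and the independent variables are derived from the transformed corners.

import numpy as np

def separate_variables(theta):
    """theta: (2, 3) affine matrix -> independent (d_w, d_h, d_x, d_y)."""
    corners = np.array([[0.0, 0.0, 1.0],     # (x, y, 1) homogeneous points
                        [1.0, 1.0, 1.0]])
    (xmin, ymin), (xmax, ymax) = corners @ theta.T
    d_w = xmax - xmin
    d_h = ymax - ymin
    d_x = xmin + d_w / 2.0 - 0.5
    d_y = ymin + d_h / 2.0 - 0.5
    return d_w, d_h, d_x, d_y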

5.1.2.2 Filtering

The variables d_x and d_y of equation (5.1) are used as input to two different PID controllers sharing the same K factors, resulting in d^pid_x and d^pid_y. The first qualitative tests of the affine regression network showed very erratic bounding box changes, up to the point of rendering it unusable. The issue was most likely some form of oscillation between the two scaling factors: if one increased in size, the other would decrease. This oscillation was not stable and would quickly ruin any chance of tracking the target. As a result, the anisotropic scaling had to be removed. The separate scaling factors were combined into one value using d_size = √(d_w · d_h). This ensured no oscillation could happen. The new bounding box in the original image and the new scale factor were calculated with the equations in (5.2), where size_se is the size of the search window input of the neural network.

\[
\begin{aligned}
\hat{scale} &= \frac{scale}{d_{size}} \\
\hat{width} &= width \cdot d_{size} \\
\hat{height} &= height \cdot d_{size} \\
x_{center} &= x_{center} + \frac{d^{pid}_x \cdot size_{se}}{scale} \\
y_{center} &= y_{center} + \frac{d^{pid}_y \cdot size_{se}}{scale}
\end{aligned}
\tag{5.2}
\]
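The position filtering relies on a discrete PID controller. The sketch below is a standard textbook form using the gains of table 5.4; the exact controller of section 4.2.2 may differ, so this is an assumption.

class PID:
    """Discrete PID controller, one instance per filtered variable."""
    def __init__(self, kp=2.0, ki=0.1, kd=0.001):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid_x, pid_y = PID(), PID()   # separate controllers sharing the same K factors
d_pid_x = pid_x.step(0.08)    # example: normalized x error for this frame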


5.1.2.3 Feature map update

To increase the robustness of the tracker during the tracking process, the feature map feat_ex of the exemplar is updated. Updating feat_ex is done using a rolling average. Every frame, a new feature map is calculated from an exemplar generated with the newly predicted bounding box. This is done by scaling the frame with the newly calculated ŝcale, and cropping an image of size_ex with its center at (x_center, y_center), padding with mean values if necessary. The newly created exemplar is then fed to the exemplar branch of the neural network, which generates an output feature map called feat^new_ex. Using the new feature map and the old feature map feat_ex, a new feature map is calculated using equation (5.3), where β_avg is a hyperparameter used as a weight factor.

\[
\hat{feat}_{ex} = (1 - \beta_{avg}) \cdot feat_{ex} + \beta_{avg} \cdot feat^{new}_{ex}
\tag{5.3}
\]

5.2 One shot object detector

The one shot object detector is a neural network based detection algorithm that tries to find the exemplar in the search window. The output is a detection in the form of a bounding box covering the area in the search window where the network predicts the exemplar object to be. Since the goal of the network is to detect the same object in different search windows, it is suitable for a tracking algorithm. The object detector uses a reduced version of [30] as a feature extractor; only the layers up to the 6th depthwise separable convolution are used. The network uses the matching layer from [58] to combine the two branches. The output of the matching layer is processed in an object detection network that consists of MobileNets [30] style layers, each a depthwise separable convolution followed by a pointwise convolution (see section 3.3.4). Every convolution in the MobileNets style layers is followed by a batch normalization layer [68] and an activation function. MobileNets specifies the use of the ReLU activation function [30], but this has been replaced with the exponential linear unit (ELU) [69] activation to increase performance, as shown in section 3.5.3. The final layer of the detection network is a normal convolution without a batch normalization layer. To the output channel corresponding to the confidence score, used to select an anchor (section 4.3.1), a sigmoid activation function is applied. The full structure of the network can be seen in table 5.5.


Table 5.5: Structure of the siamese object detection network (dw stands for depthwise, pw stands for pointwise, and match is the matching layer)

Layer        Kernel   Stride   Exemplar activation (size_ex)   Search window activation (size_se)   Channels
Input        -        -        112x112                         224x224                              3
Mobilenets   -        -        14x14                           28x28                                256
Match        -        -        -                               28x28                                196
Conv dw 1    3x3      1        -                               28x28                                256
Conv pw 1    1x1      1        -                               28x28                                256
Conv dw 2    3x3      1        -                               28x28                                256
Conv pw 2    1x1      1        -                               28x28                                256
Dropout 1    -        -        -                               28x28                                256
Conv dw 3    3x3      2        -                               14x14                                512
Conv pw 3    1x1      1        -                               14x14                                512
Conv dw 4    3x3      1        -                               14x14                                512
Conv pw 4    1x1      1        -                               14x14                                512
Conv dw 5    3x3      1        -                               14x14                                512
Conv pw 5    1x1      1        -                               14x14                                512
Dropout 2    -        -        -                               14x14                                512
Conv 1       3x3      1        -                               14x14                                5

5.2.1 Training

Training the object detection network is done in a similar way to the affine regression network. The dataset used for training was the ILSVRC dataset presented in section 3.5.1. The selection of exemplar and search window pairs was done in the same way as explained in section 5.1.1. The augmentations explained in section 5.2.1.1 were added to make the network more robust. The optimizer used to perform SGD was a momentum optimizer with Nesterov accelerated gradient [70]; the switch to Nesterov was motivated by the better IOU observed during training, as seen in fig. 5.4. The learning rate was annealed over time using exponential decay with a decay constant λ_dec and a decay step of step_dec. The hyperparameters used during training can be seen in table 5.6; the parameters were selected either from other works or based on early observations. The initial learning rate, decay step and λ_dec were adjusted from [54] to allow for a longer training process, as there are 1181113 different images in the ILSVRC dataset [18] as opposed to the 7481 [71] images used to train SqueezeDet [54].


Figure 5.4: Graph showing the IOU (y axis) of the predicted bounding boxes against the ground truth bounding boxes, calculated every training step (x axis). The yellow line is the result of training with the Nesterov accelerated gradient optimizer, and the red line the result of training with the Adam optimizer.

Table 5.6: Hyperparameters used during training of the siamese object detection network

Name                        Value        Description
Initial learning rate       5 · 10^-3    adapted from [54]
Batch size                  32           adapted from [54]
Decay step                  15k          adapted from [54]
λ_dec                       0.75         adapted from [54]
Dropout 1 percentage        30%          based on [72]
Dropout 2 percentage        50%          from [54]
T_frames                    50           from [42]
γ_scale                     1.20         based on qualitative tests
γ_shift                     0.20         based on qualitative tests
con^lo_aug, con^up_aug      0.30, 1.00   experimental
sat^lo_aug, sat^up_aug      0.60, 1.10   experimental
δ_bright                    0.30         experimental
image_aug                   25%          from [54]
gauss_aug                   20%          from [54]
Number of training steps    500k         based on observations
λ_conf+                     75           from [54]
λ_conf-                     100          from [54]
λ_bbox                      5            from [54]


5.2.1.1 Augmentations

To make the network generalize better, two types of augmentations were added: augmentations that affect the target's position and size within the search window, and augmentations that affect the appearance of the images. To change a target's scale, a random rescale was applied to frame f_se, from which the search window was then cropped. This was done by multiplying the scale factor with a scaling value from a random uniform distribution using ŝcale = scale · scale_aug before resizing frame f_se and cropping out a search window. The factor scale_aug was picked from a random uniform distribution with the upper and lower limits calculated using the equations in (5.4), where γ_scale is a hyperparameter.

\[
\begin{aligned}
max_{rescale} &= 1 \cdot \gamma_{scale} \\
min_{rescale} &= \frac{1}{\gamma_{scale}}
\end{aligned}
\tag{5.4}
\]

To augment the target's position within the search window, the center of the crop was adjusted before actually cropping out a search window to use as input for the neural network. This was done by centering the search window on the target's position in f_ex, and moving it a small amount along the x and/or y axis. The shift in pixels from the center of the target's position in frame f_ex was selected from a random uniform distribution using a hyperparameter γ_shift and the size of the search window input to the neural network, size_se. The upper and lower limits of the distribution are calculated using equation (5.5).

\[
\begin{aligned}
max_{shift} &= \gamma_{shift} \cdot size_{se} \\
min_{shift} &= -\gamma_{shift} \cdot size_{se}
\end{aligned}
\tag{5.5}
\]

This network was only used for the VOT benchmark, where it is not necessary to detect whether a target is lost or out of frame. For this reason the search window crop was adjusted in such a way that the center of the target was always within ±0.45 · size_se pixels of the search window center.
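Sampling these two augmentations takes a few lines of NumPy; the hyperparameters follow table 5.6, and the starting values are placeholders.

import numpy as np

gamma_scale, gamma_shift, size_se = 1.20, 0.20, 224
scale, center = 0.5, np.array([320.0, 240.0])   # example frame values

# Equation (5.4): random rescale applied to frame f_se before cropping.
scale_aug = np.random.uniform(1.0 / gamma_scale, 1.0 * gamma_scale)
scale = scale * scale_aug

# Equation (5.5): random shift of the crop center along x and y.
shift = np.random.uniform(-gamma_shift * size_se, gamma_shift * size_se, size=2)
center = center + shift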

The second type of augmentation used was random image augmentations, consisting of brightness changes with a maximum factor of ±δ_bright, contrast changes with an upper limit con^up_aug and lower limit con^lo_aug, and saturation changes with an upper limit sat^up_aug and lower limit sat^lo_aug. The image augmentations were applied to a percentage image_aug of the images every mini-batch, and they were always applied to both the exemplar and search window within a pair. An image standardization layer linearly scales the augmented images to have zero mean and unit norm. A percentage gauss_aug of all images within a batch had Gaussian noise added to them, with µ = 0 and a standard deviation of σ = 0.2. Note that the Gaussian noise was added independently, so it could be added to both the search window and exemplar in a pair or to only one of them.


5.2.2 Tracking

Applying the siamese object detection network in a tracking application is done similarly to the affine tracking algorithm. The parameters predicted by the neural network are already independent and normalized; the error from the center can be calculated using the two equations d_x = 0.5 − x^p and d_y = 0.5 − y^p, which are fed to separate PID controllers. The width and height are updated using equation (4.13) from section 4.3.3, but a β_size parameter was added to compensate for any bias in the width and height prediction. The bounding box location, size in frame, and scale can be calculated using the equations in (5.6), where β_scale is a hyperparameter to adjust the scale update. One more important difference is that the crop size of the search window size_crop_se is reduced by a factor β_crop when the tracker is initialized.

\[
\begin{aligned}
x_{center} &= x_{center} + \frac{d^{pid}_x}{scale} \\
y_{center} &= y_{center} + \frac{d^{pid}_y}{scale} \\
w &= (1 - \beta_w) \cdot w + \beta_w \cdot w_{new} \cdot \frac{\beta_{size}}{scale} \\
h &= (1 - \beta_h) \cdot h + \beta_h \cdot h_{new} \cdot \frac{\beta_{size}}{scale} \\
\hat{scale} &= (1 - \beta_{scale}) \cdot scale + \beta_{scale} \cdot scale \cdot \frac{w_{new} \cdot h_{new}}{w \cdot h}
\end{aligned}
\tag{5.6}
\]

In order to improve the robustness of the one shot object detector based tracking algorithm, a rolling average of the feature map was used, as explained in section 5.1.2.3. The rolling average method of section 5.1.2.3 was extended to include the confidence score of the selected anchor, γ, and the percentage of the target still visible in frame, γ_vis. These factors were added based on two assumptions: firstly, a higher confidence score of the anchor means the network is more certain of the position, making it a better candidate to update the feature map (and vice versa for a lower score); secondly, a feature map extracted from an exemplar where half the target is visible in the frame is a worse candidate for updating the feature map than one where the target was fully in frame. The resulting equations that include this extension can be seen in (5.7).

\[
\begin{aligned}
\gamma_{adjusted} &= \beta_{avg} \cdot \gamma \cdot \gamma_{vis} \\
\hat{feat}_{ex} &= (1 - \gamma_{adjusted}) \cdot feat_{ex} + \gamma_{adjusted} \cdot feat^{new}_{ex}
\end{aligned}
\tag{5.7}
\]

Updating the feature map introduces problems with the anchors, as the anchor shape was based on the shape of the original exemplar. Thus the anchor width w_anchor and height h_anchor have to be updated when the feature map is updated; this is done using the equations in (5.8). The hyperparameter β_anchors is introduced to adjust the bias of the anchor update.

\[
\begin{aligned}
w_{anchor} &= (1 - \gamma_{adjusted}) \cdot w_{anchor} + \gamma_{adjusted} \cdot \frac{w \cdot \beta_{anchors} \cdot scale}{size_{ex}} \\
h_{anchor} &= (1 - \gamma_{adjusted}) \cdot h_{anchor} + \gamma_{adjusted} \cdot \frac{h \cdot \beta_{anchors} \cdot scale}{size_{ex}}
\end{aligned}
\tag{5.8}
\]
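Equations (5.7) and (5.8) combine into one small update function, sketched here with the β values of the best configuration in table 6.1.

def update_feature_map(feat_ex, feat_new, w_anchor, h_anchor, w, h,
                       gamma, gamma_vis, scale, size_ex=112,
                       beta_avg=0.02, beta_anchors=0.93):
    g = beta_avg * gamma * gamma_vis              # gamma_adjusted, eq. (5.7)
    feat_ex = (1 - g) * feat_ex + g * feat_new    # rolling average update
    # Equation (5.8): keep the anchor shape consistent with the new exemplar.
    w_anchor = (1 - g) * w_anchor + g * w * beta_anchors * scale / size_ex
    h_anchor = (1 - g) * h_anchor + g * h * beta_anchors * scale / size_ex
    return feat_ex, w_anchor, h_anchor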

A number of the different hyperparameters used for the tracking algorithm are evaluated in section 6.2; however, some were not changed between the different configurations. These are listed in table 5.7.

Table 5.7: Hyperparameters kept constant during tracking using the different configurations of the siamese object detection network presented in section 6.2.

Name      Value   Description
K_p       1.0     experimental
K_i       0.1     experimental
K_d       0.05    experimental
β_crop    0.75    experimental
β_scale   0.59    from [42]


Chapter 6

Evaluation and Results

In section 6.1 the results of the affine tracker on the MOT challenge [47] will be presented. There are no results for the single shot object detector on the MOT challenge, as the challenge was deemed not suitable for the goal of this thesis. The objective of the MOT challenge is to track multiple objects concurrently. To perform this task using a single target tracking algorithm, additional algorithms are needed to perform data association and detection filtering, as explained in section 6.1. For this reason a switch was made to the VOT benchmark, as it is focused on single object tracking, which aligns better with the goals of this thesis. In section 6.2 the results on the VOT benchmark of the affine tracker as well as the single shot object detector used for tracking will be presented and compared.

6.1 MOT Challenge

The MOT challenge [47] is a multi target tracking benchmark, and to accommodate multiple targets being tracked simultaneously the affine tracking algorithm had to be extended. The MOT challenge supplies a set of per frame detections of people using three different detectors: DPM [73], FRCNN [53] and SDP [74]. The detections were filtered by removing all detections with a confidence less than 0.95, and further filtered with a non maximum suppression using a threshold of 0.1. Initially, all detections were used to initialize an instance of the tracking algorithm. After the first frame, all detections were compared to the bounding boxes predicted by the tracking instances, and any that had an intersection over union greater than 0.05 were removed; the remaining detections were used to initialize new instances of the tracking algorithm. In fig. 6.1 the bottom seven results of the MOT challenge can be seen; the result of the affine tracking algorithm is the tracker named STSiam at the bottom of the list. The main performance indicators are the multiple object tracking accuracy (MOTA) and multiple object tracking precision (MOTP), describing the percentage of targets tracked accurately and the precision of the bounding box size and location, as explained in section 3.5.2. The other relative metrics in the table are: FAF, the ratio of incorrectly tracked bounding boxes per frame; MT, the percentage of mostly tracked trajectories; and ML, the percentage of mostly lost trajectories. The absolute metrics are: FP, the number of incorrectly tracked trajectories, and FN, the number of trajectories not tracked by the tracking algorithm. The number of id switches (ID Sw.) is the number of times a tracked bounding box was matched to a different trajectory than it was matched to in the previous frame. Frag is the number of times a new tracker instance was matched to a trajectory previously lost by a different tracker. Finally, Hz is the tracker speed in frames per second.

Figure 6.1: The results of the MOT challenge showing the 7 lowest ranked tracking algorithms (source: [75])

As one can see, STSiam, the affine tracking algorithm, came in last; it was outperformed on all metrics by nearly all other competing tracking algorithms. The disparity between the STSiam tracking algorithm and the second worst tracking algorithm is substantial, with a difference of 9.7 on the MOTA metric. The distance to the first place is 38.6 on the MOTA metric. This showed that the affine tracking algorithm could not produce a state-of-the-art or even competitive score on the MOT challenge. Qualitative reviews showed a similar result: a lot of id switches during tracking, and the scaling was still unstable even after applying filtering and updating the feature map. The results on the MOT challenge were very disappointing. Some of the problems were due to the added complexity of a multiple object tracking task, and some problems could have been fixed with more effort. It was however decided to stop further development of the affine regression network, due to the disappointing results. It showed that the complex task of estimating a target's scale, position and confidence might not be possible to regress into 5 simple variables.


6.2 VOT Benchmark

The VOT benchmark [4], explained in section 3.5.3, is a single target tracking benchmark, mainly focused on challenging tracking algorithms with target occlusion, big differences in scale, lighting changes, camera movement and target deformations. Both the affine tracking algorithm and the one shot object detector based tracking algorithm were evaluated on this benchmark. A number of different configurations of the one shot object detector tracking algorithm were tested to evaluate the usefulness of the different components. The configurations that have been tested are listed in table 6.1. SiamDet is the one shot object detection tracking algorithm, and SiamAffine is the affine regression based algorithm.

Table 6.1: The different configurations of the siamese tracking algorithms. EAO stands for expected average overlap, explained in section 3.5.3; ID is a unique identifier; AF stands for activation function.

ID                EAO      PID   β_avg   β_anchors   β_size   β_h, β_w   AF
SiamDet           0.2571   Yes   0.02    0.93        0.96     0.15       Elu
SiamDet_nopid     0.2459   No    0.02    0.93        0.96     0.15       Elu
SiamDet_v4        0.2425   Yes   0.02    0.9         0.96     0.15       Elu
SiamDet_v3        0.2368   Yes   0.02    0.93        0.96     0.1        Elu
SiamDet_v6        0.2357   Yes   0.02    0.93        1.0      0.15       Elu
SiamDet_v2        0.2242   Yes   0.02    0.9         0.96     0.2        Elu
SiamDet_v1        0.2242   Yes   0.05    0.9         0.96     0.15       Elu
SiamDet_v5        0.2238   Yes   0.02    1.0         0.96     0.15       Elu
SiamDet_default   0.2056   Yes   0.02    N.A.        0.96     0.15       Elu
SiamDet_relu      0.2035   Yes   0.02    0.93        0.96     0.15       Relu
SiamDet_noavg     0.1937   Yes   0.00    0.93        0.96     0.15       Elu
SiamDet_uc        0.1699   Yes   0.02    0.93        0.96     0.15       Elu
SiamDet_nosc      0.1606   Yes   0.02    0.93        0.96     0.15       Elu
SiamAffine        0.1557   Yes   0.04    N.A.        N.A.     N.A.       Relu

SiamDet is the best of the different networks, with an EAO of 0.2571 on the VOT benchmark, but several of the other versions confirm the usefulness of the different components. SiamDet_nopid shows that not using a PID degrades performance. SiamDet_default is a special version that works by predicting a bounding box based on a default shape, independent of the exemplar shape. This is shown to degrade performance, even though it eliminates the need to update the anchors, which simplifies the tracking algorithm. SiamDet_relu uses the ReLU activation function instead of the ELU, resulting in a significant decline in performance. The SiamDet_noavg version does not use a rolling average, resulting in a decrease in performance, proving the usefulness of the rolling average update. SiamDet_uc is a version where the crop was not reduced with the factor β_crop, thus β_crop = 1.0, and SiamDet_nosc is a model with β_scale = 1.0; the results of both models show a sharp decrease in accuracy. Finally, SiamAffine performs worst of all, once more confirming that the network is not suitable for object tracking.

6.2.1 Comparison with the VOT 2016 Challenge results

The best version of the tracking algorithms described in section 6.2 is compared to the top 20 tracking algorithms of the 2016 VOT challenge [4]. A comparison of the EAO can be seen in fig. 6.2, and the accuracy and robustness results can be seen in fig. 6.3. The full results on expected average overlap can be found in appendix C, the accuracy results in appendix D and the robustness results in appendix F. It can be seen that the best version of the SiamDet tracking algorithm ends up just below MDNet_N [76], at a respectable 16th place. The raw accuracy and robustness results seen in fig. 6.3 place SiamDet lower than SiamAN, which is the name of the network from [42]; this is contradictory to the results of the EAO. This might mean that SiamDet performs better over a longer time, leading to the higher EAO, while SiamAN produces better results before it needs to be reset. A comparison of speed can be seen in appendix E; it should be noted that the EFO of the siamese object detector is lower than that of SiamAN. This should not be the case according to the calculations of Maccs in section 4.1.2.4 and section 6.3. The loss in speed can most probably be attributed to the use of a weaker GPU for inference, namely a GTX 1060 as opposed to the Titan X used in [42]. It might also have been due to the use of the Trax protocol [77] to interface with Python, and the repeated initialization of the TensorFlow graph on the GPU.


Figure 6.2: Plot of the trackers' EAO on the y axis vs rank on the x axis (data from appendix C).

Figure 6.3: Plot of the trackers' accuracy (data from appendix D) on the y axis vs robustness on the x axis (data from appendix F). Trackers in the top right are considered better performing.

Figure 6.4: Legend of the different trackers and their respective symbols.

6.3 Performance

As the results of the affine tracking algorithm were disappointing and the network used VGG, which is a relatively heavy feature extractor, it was decided to no longer pursue work on this network, and for that reason its speed was not evaluated. The siamese object detector however did yield good results, and with the use of MobileNets style layers it was a promising candidate for real-time execution on mobile hardware. The limit of Maccs that can be used by a network was determined in section 4.1.1 to be around 850 million Maccs. Using [78], the total number of Maccs necessary to run the feature extractor is calculated to be around 202 million. This leaves around 650 million Maccs for the matching layer and the object detection network. Since the matching layer is not much more than a simple reshape and matrix multiplication, it is not factored into the limit of computations. Using the numbers from [78] and the equation from section 2.1.5, a calculation of the Maccs required to run the object detection network can be seen in table 6.2.


Table 6.2: Calculation of Maccs using the numbers from [78] and the equation in section 2.1.5. Conv dw stands for depthwise convolution, Conv pw for pointwise convolution; Conv is a regular convolution.

Layer name   Estimated million Maccs
Conv dw 1    1.8
Conv pw 1    51.3
Conv dw 2    1.8
Conv pw 2    51.3
Conv dw 3    0.45
Conv pw 3    25.7
Conv dw 4    0.9
Conv pw 4    51.3
Conv dw 5    0.9
Conv pw 5    51.3
Conv 1       0.5
Total        237.3

The Maccs required to run the search window feature extractor and the object detection network total around 440 million. This is well within the limit for real-time execution. Adding a rolling average update of the feature map (explained in section 5.1.2.3) requires recalculating the feature map of the exemplar every frame. Running the feature extractor for the exemplar input requires 4 times fewer computations than extracting the feature map of the search window, because the exemplar is 4 times smaller than the search window (see section 2.1.5); this results in 50 million Maccs to run the feature extractor for the exemplar. Thus even with the rolling average update the total comes to around 490 million Maccs, which is well within the limit of 850 million.
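The per-layer numbers of table 6.2 can be reproduced with the Macc formulas from section 2.1.5; the two helpers below compute the depthwise and pointwise cases.

def maccs_depthwise(k, channels, out_h, out_w):
    """Multiply-accumulates of a k x k depthwise convolution."""
    return k * k * channels * out_h * out_w

def maccs_pointwise(c_in, c_out, out_h, out_w):
    """Multiply-accumulates of a 1 x 1 pointwise convolution."""
    return c_in * c_out * out_h * out_w

print(maccs_depthwise(3, 256, 28, 28) / 1e6)    # Conv dw 1 -> ~1.8 million
print(maccs_pointwise(256, 256, 28, 28) / 1e6)  # Conv pw 1 -> ~51 million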


Chapter 7

Conclusion

This thesis presented an investigation into tracking algorithms and their possible application on mobile hardware. Two network designs were proposed: one based on the regression of an affine transformation and the other a one shot learning object detector. A proof of concept of a neural network running in real-time on mobile hardware was shown in section 4.1.1, and methods for filtering and updating bounding boxes were proposed in section 4.3.3. The affine regression network was shown to not increase accuracy or speed compared to existing tracking algorithms. It placed last in the MOT challenge and also produced mediocre results on the VOT benchmark. Based on these disappointing results, further efforts on the affine regression network were stopped and a new network design was pursued, based on an object detection algorithm. This siamese object detection network performed among the top 20 algorithms of the VOT 2016 challenge, with an expected average overlap of 0.2571. It would place at the 16th place, with a lead of 0.0213 over the network [42] evaluated in section 4.1.2. The Maccs required to run the siamese object detection network as a tracking algorithm were also shown to be within the range of real-time execution on mobile hardware. The siamese one shot object detection network is, as far as we are aware, the first of its kind, and is a good candidate for further research into applications outside of object tracking.

7.1 Discussion

The results of the affine regression tracker were disappointing and surprising, as the results in [58] showed promise. The neural network proved capable of focusing on the object and not on the background, but unable to do so accurately. The reason for the bad results of the affine regression tracker might be that object tracking is a more fine grained task than the task the network was originally designed for. It might be impossible to regress the complicated task of object tracking into 5 simple parameters describing a transformation. The one shot object detection algorithm solved this task more elegantly by splitting it up into smaller, simpler tasks. First a fine grained detection is done using the anchor selection, which is then adjusted slightly to increase accuracy. Another benefit is that the anchor selection uses the inherent location of the target in the receptive field; a target in the top left corner will most likely also have more activations in that area.

As a whole, the siamese network structure has shown itself capable of an end-to-end tracking task. It enables a network to adapt to a task during inference, which is especially relevant on weaker hardware that cannot run complex neural networks. Other applications of a siamese network structure could include segmentation, image comparison and data association.

7.2 Future work

The affine tracking algorithm will not be considered further due to its bad results. The one shot object detection algorithm, however, is an interesting candidate for further research. The network showed good accuracy and high-level performance, but due to a lack of time it has not been optimized. Most of the hyperparameters used to train the network are not yet optimized for the specific task. The augmentation parameters are currently best guesses, and their impact should be tested more extensively. The other training hyperparameters could also be tested further in order to optimize them. Another improvement could come from adding the MOT dataset to the training data of the one shot detection algorithm. The ILSVRC dataset does not include a person class, while this is one of the most common classes in the VOT benchmark. The tracking procedure built around the one shot object detection network could also be tested more thoroughly. The different β hyperparameters used to compensate for a bias in the scale might prove to be unnecessary if the bounding box and anchor update is done differently. There also needs to be a more thorough search of the hyperparameter space for a possible increase in accuracy.


Appendix A

Fire Module Swift implementation

public class FireLayer: NNLayer {
    var fireConfig: FireLayerConfiguration
    var squeeze: MPSCNNConvolution?   // 56x56x64 > 56x56x16; stride = 1, 1x1x3x16 = 48 + 16
    var expand_1: MPSCNNConvolution?  // 56x56x16 > 56x56x64; stride = 1, 1x1x3x64 = 192 + 64
    var expand_3: MPSCNNConvolution?  // 56x56x16 > 56x56x64; stride = 1, 3x3x3x64 = 192 + 64
    var squeezeImgDesc: MPSImageDescriptor?

    init(device: MTLDevice,
         config: BasicLayerConfiguration,
         fireConfig: FireLayerConfiguration,
         data: NeuralNetData) {
        self.fireConfig = fireConfig
        super.init(device: device, config: config)

        // Set up the fire structure with an intermediate squeeze image.
        self.squeeze = InitConv(inDepth: config.inDepth, outDepth: fireConfig.s1x1,
                                weights: data.getData(offset: fireConfig.weights.weightsS1x1),
                                bias: data.getData(offset: fireConfig.biases.biasesS1X1),
                                kWidth: 1, kHeight: 1)
        self.expand_1 = InitConv(inDepth: fireConfig.s1x1, outDepth: fireConfig.e1x1,
                                 weights: data.getData(offset: fireConfig.weights.weightsE1x1),
                                 bias: data.getData(offset: fireConfig.biases.biasesE1X1),
                                 kWidth: 1, kHeight: 1)
        self.expand_3 = InitConv(inDepth: fireConfig.s1x1, outDepth: fireConfig.e3x3,
                                 weights: data.getData(offset: fireConfig.weights.weightsE3x3),
                                 bias: data.getData(offset: fireConfig.biases.biasesE3X3),
                                 kWidth: 3, kHeight: 3, offset: fireConfig.e1x1)
    }

    public override func setSqueezeImg(width: Int, height: Int) {
        fireConfig.outWidth = width
        fireConfig.outHeight = height
        self.squeezeImgDesc = MPSImageDescriptor(channelFormat: .float16,
                                                 width: fireConfig.outWidth,
                                                 height: fireConfig.outHeight,
                                                 featureChannels: fireConfig.s1x1)
    }

    // Encoding function to encode the commands on the buffer using the
    // corresponding input and output image.
    public override func encodeLayer(commandBuffer: MTLCommandBuffer, inputImg: MPSImage, outputImg: MPSImage) {
        let squeezeImg = MPSTemporaryImage(commandBuffer: commandBuffer, imageDescriptor: squeezeImgDesc!)
        squeezeImg.readCount += 1 // to allow the second expand module to read from it
        squeezeImg.texture.label = config.name + "_squeeze_img" // name the images for easy debugging
        outputImg.texture.label = config.name + "_concat_img"

        squeeze!.encode(commandBuffer: commandBuffer, sourceImage: inputImg, destinationImage: squeezeImg)
        expand_1!.encode(commandBuffer: commandBuffer, sourceImage: squeezeImg, destinationImage: outputImg)
        expand_3!.encode(commandBuffer: commandBuffer, sourceImage: squeezeImg, destinationImage: outputImg)
    }
}


Appendix B

SqueezeDet network architecture


Figure B.1: A diagram of the SqueezeDet network structure


Appendix C

Expected Average Overlap results on the VOT benchmark


Table C.1: The expected average overlap results of the top 20 tracking algorithms from the VOT 2016 challenge [4], compared to six versions of the tracking algorithms proposed in this paper.

Name              Expected overlap (baseline)   Overall
CCOT              0.3310                        0.3310
TCNN              0.3249                        0.3249
SSAT              0.3207                        0.3207
MLDF              0.3106                        0.3106
Staple            0.2952                        0.2952
DDC               0.2929                        0.2929
EBT               0.2913                        0.2913
SRBT              0.2904                        0.2904
STAPLEp           0.2862                        0.2862
DNT               0.2783                        0.2783
SSKCF             0.2771                        0.2771
SiamRN            0.2766                        0.2766
DeepSRDCF         0.2763                        0.2763
SHCT              0.2661                        0.2661
MDNet_N           0.2572                        0.2572
SiamDet           0.2571                        0.2571
FCF               0.2510                        0.2510
SRDCF             0.2471                        0.2471
SiamDet_nopid     0.2459                        0.2459
RFD_CF2           0.2415                        0.2415
GGTv2             0.2377                        0.2377
DPT               0.2358                        0.2358
SiamAN            0.2352                        0.2352
SiamDet_default   0.2056                        0.2056
SiamDet_Relu      0.2035                        0.2035
SiamDet_noavg     0.1937                        0.1937
SiamAffine        0.1557                        0.1557


Appendix D

Accuracy ranking on the VOT benchmark


Table D.1: The accuracy results of the top 20 tracking algorithms from the VOT 2016 challenge [4], compared to the best version (SiamDet) of the tracking algorithms proposed in this paper.

Name        A-Rank   Overlap
TCNN        3.07     0.55
CCOT        4.02     0.54
MLDF        6.85     0.49
Staple      3.70     0.54
STAPLEp     2.77     0.55
DDC         3.53     0.54
EBT         8.87     0.46
SRBT        6.40     0.49
SSAT        2.28     0.58
DNT         5.08     0.51
SSKCF       4.55     0.54
SiamAN      5.90     0.53
SiamRN      2.42     0.55
DeepSRDCF   4.92     0.52
SHCT        3.42     0.54
MDNet_N     3.27     0.54
FCF         3.10     0.55
SRDCF       4.42     0.53
RFD_CF2     6.97     0.47
GGTv2       4.72     0.52
DPT         6.43     0.49
SiamDet     9.63     0.45


Appendix E

Speed of different tracking algorithms on the VOT benchmark


Table E.1: The speed comparison of the top 20 tracking algorithms from the VOT 2016 challenge [4] and six versions of the tracking algorithms proposed in this paper. Normalized speed is also known as equivalent filter operations or EFO.

Name              Normalized speed   Raw speed
CCOT              51.00              82.18
DDC               0.19               0.16
DNT               1.08               1.88
DPT               3.58               4.03
DeepSRDCF         47.43              65.30
EBT               2.97               2.87
FCF               1.68               2.39
GGTv2             0.34               0.52
MDNet_N           0.51               0.69
MLDF              1.40               2.20
RFD_CF2           0.77               1.20
SHCT              0.64               0.54
SRBT              2.22               2.90
SRDCF             363.72             503.18
SSAT              0.46               0.80
SSKCF             29.15              44.06
STAPLEp           19.50              18.12
SiamAN            9.05               11.93
SiamAffine        0.56               2.16
SiamDet           6.24               23.47
SiamDet_Relu      5.54               20.98
SiamDet_default   6.44               24.32
SiamDet_noavg     3.15               11.92
SiamDet_nopid     6.24               23.50
SiamRN            5.35               7.05
Staple            10.98              14.43
TCNN              1.00               1.35


Appendix F

Robustness ranking on the VOT benchmark


Table F.1: The robustness results of the top 20 tracking algorithms from the VOT 2016 challenge [4], compared to the best version (SiamDet) of the tracking algorithms proposed in this paper.

Name        R-Rank   Failures
TCNN        5.92     0.83
CCOT        4.40     0.89
MLDF        4.02     0.92
Staple      7.45     1.42
STAPLEp     6.97     1.31
DDC         7.27     1.27
EBT         4.27     1.05
SRBT        7.20     1.33
SSAT        5.78     1.05
DNT         7.50     1.20
SSKCF       7.47     1.43
SiamAN      8.93     1.91
SiamRN      7.98     1.36
DeepSRDCF   7.20     1.23
SHCT        7.68     1.39
MDNet_N     6.50     0.91
FCF         7.97     1.85
SRDCF       7.83     1.43
RFD_CF2     7.17     1.27
GGTv2       10.77    1.73
DPT         9.52     1.85
SiamDet     8.48     1.96


References

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

[2] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNetLarge Scale Visual Recognition Challenge,” International Journal of ComputerVision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.

[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche,J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman,D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach,K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Gowith deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp.484–489, Jan. 2016.

[4] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Čehovin, T. Vojíř, G. Häger, A. Lukežič, G. Fernández, et al., The Visual Object Tracking VOT2016 Challenge Results. Cham: Springer International Publishing, 2016, pp. 777–823. [Online]. Available: https://doi.org/10.1007/978-3-319-48881-3_54

[5] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr, “Staple:Complementary learners for real-time tracking,” CoRR, vol. abs/1512.01355,2015. [Online]. Available: http://arxiv.org/abs/1512.01355

[6] J. Valmadre, L. Bertinetto, J. F. Henriques, A. Vedaldi, and P. H. Torr, “End-to-end representation learning for correlation filter based tracking,” arXiv preprint arXiv:1704.06036, 2017.

[7] N. Yadav, A. Yadav, and M. Kumar, An Introduction to Neural Network Methods for Differential Equations (SpringerBriefs in Applied Sciences and Technology). Springer, 2015. [Online]. Available: https://www.amazon.com/Introduction-Differential-Equations-SpringerBriefs-Technology-ebook/dp/B00U2523PS?SubscriptionId=0JYN1NVW651KCA56C102&tag=techkie-20&linkCode=xm2&camp=2025&creative=165953&creativeASIN=B00U2523PS

[8] V. G. Maltarollo, K. M. Honório, and A. B. F. da Silva, “Applications of artificial neural networks in chemical problems,” in Artificial Neural Networks - Architectures and Applications, K. Suzuki, Ed. Rijeka: InTech, 2013, ch. 10. [Online]. Available: http://dx.doi.org/10.5772/51275

[9] Various, “Habituation, sensitization, and potentiation - boundless open textbook,” Sep 2016. [Online]. Available: https://www.boundless.com/psychology/textbooks/boundless-psychology-textbook/learning-7/biological-basis-of-learning-49/habituation-sensitization-and-potentiation-204-12739/

[10] A. Karpathy, “Convolutional neural networks: Architectures, convolution / pooling layers.” [Online]. Available: http://cs231n.github.io/convolutional-networks/

[11] I. Russel, “Neural networks module,” 1996. [Online]. Available:http://uhaweb.hartford.edu/compsci/neural-networks-delta-rule.html

[12] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.[Online]. Available: http://www.deeplearningbook.org


[13] “Bringing parallelism to the web with river trail.” [Online]. Available:http://intellabs.github.io/RiverTrail/tutorial/

[14] D. Gschwend, “Zynqnet: An fpga-accelerated embedded convolutional neuralnetwork,” Master’s thesis, ETH Zurich, Switzerland, 2016.

[15] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller, “Strivingfor simplicity: The all convolutional net,” CoRR, vol. abs/1412.6806, 2014.[Online]. Available: http://arxiv.org/abs/1412.6806

[16] D. Gschwend, “Netscope,” 2016. [Online]. Available: https://github.com/dgschwend/netscope/blob/406deb0ea3e9b8015f28e4b4a70532df729d94e9/src/analyzer.coffee#L77

[17] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-shot image recognition,” 2015.

[18] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei, “Imagenetlarge scale visual recognition challenge,” International Journal of ComputerVision, vol. 115, no. 3, pp. 211–252, 12 2015.

[19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick,S. Guadarrama, and T. Darrell, “Ca�e: Convolutional architecture for fastfeature embedding,” CoRR, vol. abs/1408.5093, 2014. [Online]. Available:http://arxiv.org/abs/1408.5093

[20] Theano Development Team, “Theano: A Python framework for fast com-putation of mathematical expressions,” arXiv e-prints, vol. abs/1605.02688,May 2016. [Online]. Available: http://arxiv.org/abs/1605.02688

[21] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S.Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp,G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg,D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens,B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan,F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu,and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneoussystems,” 2015, software available from tensorflow.org. [Online]. Available:http://tensorflow.org/

[22] “Metal Performance Shaders | Apple Developer Documentation,” (Accessed on 08/20/2017). [Online]. Available: https://developer.apple.com/documentation/metalperformanceshaders

[23] X. W. Chen and X. Lin, “Big data deep learning: Challenges and perspectives,”IEEE Access, vol. 2, pp. 514–525, 2014.


[24] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan, “YouTube-8M: A large-scale video classification benchmark,” CoRR, vol. abs/1609.08675, 2016. [Online]. Available: http://arxiv.org/abs/1609.08675

[25] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg, “Accurate scaleestimation for robust visual tracking,” in Proceedings of the British MachineVision Conference. BMVA Press, 2014.

[26] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, “Discriminativescale space tracking,” CoRR, vol. abs/1609.06141, 2016. [Online]. Available:http://arxiv.org/abs/1609.06141

[27] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr,“Fully-convolutional siamese networks for object tracking,” arXiv preprintarXiv:1606.09549, 2016.

[28] Z. Li and Z. Zhang, “Espresso.” [Online]. Available:http://codinfox.github.io/espresso/

[29] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,” arXiv:1602.07360, 2016.

[30] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861

[31] J. Hoffman, E. Tzeng, J. Donahue, Y. Jia, K. Saenko, and T. Darrell, “One-shot adaptation of supervised deep convolutional models,” CoRR, vol. abs/1312.6204, 2013. [Online]. Available: http://arxiv.org/abs/1312.6204

[32] H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition,” CoRR, vol. abs/1402.1128, 2014. [Online]. Available: http://arxiv.org/abs/1402.1128

[33] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. P. Lillicrap,“One-shot learning with memory-augmented neural networks,” CoRR, vol.abs/1605.06065, 2016. [Online]. Available: http://arxiv.org/abs/1605.06065

[34] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum, “Human-level concept learning through probabilistic program induction,” Science, vol. 350, no. 6266, pp. 1332–1338, 2015. [Online]. Available: http://science.sciencemag.org/content/350/6266/1332


[35] O. Vinyals, C. Blundell, T. P. Lillicrap, K. Kavukcuoglu, and D. Wierstra,“Matching networks for one shot learning,” CoRR, vol. abs/1606.04080, 2016.[Online]. Available: http://arxiv.org/abs/1606.04080

[36] B. Hariharan and R. B. Girshick, “Low-shot visual object recognition,” CoRR, vol. abs/1606.02819, 2016. [Online]. Available: http://arxiv.org/abs/1606.02819

[37] S. Caelles, K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. V.Gool, “One-shot video object segmentation,” CoRR, vol. abs/1611.05198,2016. [Online]. Available: http://arxiv.org/abs/1611.05198

[38] D. Held, S. Thrun, and S. Savarese, “Learning to track at 100 FPS with deepregression networks,” CoRR, vol. abs/1604.01802, 2016. [Online]. Available:http://arxiv.org/abs/1604.01802

[39] G. Ning, Z. Zhang, C. Huang, Z. He, X. Ren, and H. Wang,“Spatially supervised recurrent convolutional neural networks for visualobject tracking,” CoRR, vol. abs/1607.05781, 2016. [Online]. Available:http://arxiv.org/abs/1607.05781

[40] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You only lookonce: Unified, real-time object detection,” CoRR, vol. abs/1506.02640, 2015.[Online]. Available: http://arxiv.org/abs/1506.02640

[41] Y. Wu, J. Lim, and M. H. Yang, “Object tracking benchmark,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1834–1848, Sept 2015.

[42] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S.Torr, “Fully-convolutional siamese networks for object tracking,” CoRR, vol.abs/1606.09549, 2016. [Online]. Available: http://arxiv.org/abs/1606.09549

[43] L. Bertinetto, J. F. Henriques, J. Valmadre, P. H. S. Torr, and A. Vedaldi,“Learning feed-forward one-shot learners,” CoRR, vol. abs/1606.05233, 2016.[Online]. Available: http://arxiv.org/abs/1606.05233

[44] J. Choi, J. Kwon, and K. M. Lee, “Visual tracking by reinforced decision making,” CoRR, vol. abs/1702.06291, 2017. [Online]. Available: http://arxiv.org/abs/1702.06291

[45] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed trackingwith kernelized correlation filters,” CoRR, vol. abs/1404.7584, 2014. [Online].Available: http://arxiv.org/abs/1404.7584

[46] A. Milan, S. H. Rezatofighi, A. R. Dick, K. Schindler, and I. D. Reid,“Online multi-target tracking using recurrent neural networks,” CoRR, vol.abs/1604.03635, 2016. [Online]. Available: http://arxiv.org/abs/1604.03635


[47] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, “MOT16:A benchmark for multi-object tracking,” arXiv:1603.00831 [cs], Mar. 2016,arXiv: 1603.00831. [Online]. Available: http://arxiv.org/abs/1603.00831

[48] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. V. Gool,“Robust tracking-by-detection using a detector confidence particle filter,” in2009 IEEE 12th International Conference on Computer Vision, Sept 2009, pp.1515–1522.

[49] S.-I. Yu, D. Meng, W. Zuo, and A. Hauptmann, “The solution path algorithm for identity-aware multi-object tracking,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[50] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” CoRR, vol. abs/1510.00149, 2015. [Online]. Available: http://arxiv.org/abs/1510.00149

[51] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural networks,” CoRR, vol. abs/1506.02626, 2015. [Online]. Available: http://arxiv.org/abs/1506.02626

[52] J. van Leeuwen, “On the construction of Huffman trees,” in ICALP, 1976, pp. 382–410. [Online]. Available: http://dblp.uni-trier.de/db/conf/icalp/icalp76.html#Leeuwen76

[53] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-timeobject detection with region proposal networks,” CoRR, vol. abs/1506.01497,2015. [Online]. Available: http://arxiv.org/abs/1506.01497

[54] B. Wu, F. Iandola, P. H. Jin, and K. Keutzer, “SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving,” 2016.

[55] F. Chollet, “Xception: Deep learning with depthwise separableconvolutions,” CoRR, vol. abs/1610.02357, 2016. [Online]. Available:http://arxiv.org/abs/1610.02357

[56] M. Berger, Geometry. Berlin New York: Springer-Verlag, 1994.

[57] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatialtransformer networks,” CoRR, vol. abs/1506.02025, 2015. [Online]. Available:http://arxiv.org/abs/1506.02025

[58] I. Rocco, R. Arandjelovic, and J. Sivic, “Convolutional neural networkarchitecture for geometric matching,” CoRR, vol. abs/1703.05593, 2017.[Online]. Available: http://arxiv.org/abs/1703.05593


[59] K. Simonyan and A. Zisserman, “Very deep convolutional networks forlarge-scale image recognition,” CoRR, vol. abs/1409.1556, 2014. [Online].Available: http://arxiv.org/abs/1409.1556

[60] F. L. Bookstein, “Principal warps: thin-plate splines and the decomposition of deformations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 6, pp. 567–585, Jun 1989.

[61] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernandez, T. Vojir, G. Hager, G. Nebehay, and R. Pflugfelder, “The visual object tracking VOT2015 challenge results,” in The IEEE International Conference on Computer Vision (ICCV) Workshops, December 2015.

[62] “VOT2017 challenge | participate.” [Online]. Available: http://www.votchallenge.net/vot2017/participation.html

[63] R. Rothe, M. Guillaumin, and L. V. Gool, “Non-maximum suppression forobject detection by passing messages between windows,” in Asian Conferenceon Computer Vision (ACCV), November 2014.

[64] D. Goldberg, “What every computer scientist should know about floating-point arithmetic,” ACM Comput. Surv., vol. 23, no. 1, pp. 5–48, Mar. 1991.[Online]. Available: http://doi.acm.org/10.1145/103162.103163

[65] G. C. Goodwin, S. F. Graebe, and M. E. Salgado, Control System Design. Pearson, 2000.

[66] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich featurehierarchies for accurate object detection and semantic segmentation,” CoRR,vol. abs/1311.2524, 2013. [Online]. Available: http://arxiv.org/abs/1311.2524

[67] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980

[68] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015. [Online]. Available: http://arxiv.org/abs/1502.03167

[69] D. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” CoRR, vol. abs/1511.07289, 2015. [Online]. Available: http://arxiv.org/abs/1511.07289


[70] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance ofinitialization and momentum in deep learning,” in Proceedings of the 30thInternational Conference on International Conference on Machine Learning -Volume 28, ser. ICML’13. JMLR.org, 2013, pp. III–1139–III–1147. [Online].Available: http://dl.acm.org/citation.cfm?id=3042817.3043064

[71] A. Geiger, “The KITTI vision benchmark suite,” 2017, (Accessed on 08/29/2017). [Online]. Available: http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=2d

[72] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,“Dropout: A simple way to prevent neural networks from overfitting,”Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014. [Online].Available: http://jmlr.org/papers/v15/srivastava14a.html

[73] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Visual objectdetection with deformable part models,” Commun. ACM, vol. 56, no. 9, pp.97–105, Sep. 2013. [Online]. Available: http://doi.acm.org/10.1145/2494532

[74] F. Yang, W. Choi, and Y. Lin, “Exploit all the layers: Fast and accurate cnnobject detector with scale dependent pooling and cascaded rejection classi-fiers,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition(CVPR), June 2016, pp. 2129–2137.

[75] ETH Zurich, “MOT17 results,” Jul 2017. [Online]. Available: https://motchallenge.net/results/MOT17/

[76] H. Nam and B. Han, “Learning multi-domain convolutional neural networksfor visual tracking,” CoRR, vol. abs/1510.07945, 2015. [Online]. Available:http://arxiv.org/abs/1510.07945

[77] L. Cehovin, “TraX: The visual tracking exchange protocol and library,” CoRR, vol. abs/1705.04469, 2017. [Online]. Available: http://arxiv.org/abs/1705.04469

[78] A. G. Howard, “mobilenet_v1 code,” Jun 2017. [Online]. Available:https://github.com/tensorflow/models/blob/master/slim/nets/mobilenet_v1.py


TRITA-ICT-EX-2017:141

www.kth.se