Paul Missault

Semantic Relatedness in Convolutional Neural Networks

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Supervisors: Prof. dr. ir. Filip De Turck, Dr. Femke Ongenae
Counsellors: Ir. Rein Houthooft, Dr. Stijn Verstichel

Department of Information Technology
Chair: Prof. dr. ir. Daniël De Zutter
Faculty of Engineering and Architecture
Academic year 2015-2016



Semantic Relatedness in Convolutional Neural Networks

Paul Missault, Gent, Belgium. E-mail: [email protected].

Supervisor(s): Filip De Turck, Femke Ongenae

Abstract—This article introduces the semantic cross entropy, a novel objective function with interesting properties. It will be shown that when this semantically aware objective function is used to train deep networks, the resulting output confidences are very well calibrated: when such a network predicts a class with a high confidence, there is in fact a high probability that the prediction is correct, and conversely, a prediction with a low confidence has a high probability of being wrong. This stands in contrast to traditionally trained networks, which are typically extremely confident in all their predictions, regardless of the actual probability of being correct. Furthermore, with an appropriate choice of semantic relations among the labels of the dataset, the objective correlates the classification errors more strongly with the ground truth, i.e. the errors made by a classifier will be semantically closer to the ground truth. The impact of this novel objective is evaluated on Convolutional Neural Networks (CNNs).

Keywords—Semantics, Convolutional Neural Network, Deep Learning, Computer Vision

I. INTRODUCTION

The research within computer vision traditionally considers classification errors in a binary manner: a classification is correct or it is not. Furthermore, a classifier is always trained to achieve the highest possible confidence in the ground truth. Both these statements follow from the most widely used classification objective, the cross entropy, shown in Equation 1. In this equation (X_n, Y_n) is a datapoint from the training set (X, Y); P(Y_n | X_n, θ) is therefore the classifier's confidence in the ground truth Y_n given classifier parameters θ and input X_n.

C(\theta, X, Y) = -\frac{1}{N} \sum_{n=0}^{N-1} \log P(Y_n \mid X_n, \theta) \qquad (1)

A classifier trained to minimize this objective will push its confidence in the ground truth as close to 1 as possible for all datapoints in the training set. The output probabilities of these networks are therefore very spiked, a property that is not only present during training but that also holds at test time, as will be shown in what follows.
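As a concrete illustration, Equation 1 can be sketched in a few lines of NumPy. The probability arrays below are made-up placeholders, not outputs from the networks in this paper:

```python
import numpy as np

def cross_entropy(probs, labels):
    """Cross entropy of Equation 1.

    probs  -- (N, M) array, probs[n, m] = P(L_m | X_n, theta)
    labels -- (N,) integer array of ground-truth label indices Y_n
    """
    n = probs.shape[0]
    # pick the confidence assigned to the ground truth of each datapoint
    p_truth = probs[np.arange(n), labels]
    return -np.mean(np.log(p_truth))

# A classifier that is maximally confident in the ground truth
# drives this objective toward 0:
probs = np.array([[0.98, 0.01, 0.01],
                  [0.02, 0.96, 0.02]])
labels = np.array([0, 1])
print(cross_entropy(probs, labels))  # small value, ~0.031
```

Note that only the column of the ground-truth label enters the objective; the confidences in all other labels are ignored entirely.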

This paper proposes the following novel objective function, the semantic cross entropy, building on the work of Zhao et al. [1]:

C(\theta, X, Y) = -\frac{1}{N} \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} S_{Y_n, L_m} \log P(L_m \mid X_n, \theta) \qquad (2)

The values of S_{i,j} describe the semantic relatedness of labels i and j. The inner sum of Equation 2 is therefore a sum over all the possible labels, where the logarithm of the confidence in each label is weighed by that label's relatedness to the ground truth. If S is the identity matrix, the inner sum will only be non-zero when L_m = Y_n, which reverts the semantic cross entropy back to the general cross entropy. Such a choice for S assumes a semantic relatedness where a concept is only related to itself, as S_{i,j} = 0 for i ≠ j.

Perhaps a clearer way to think about the proposed function is that the value of S_{Y_n, L_m} allows a network to be somewhat confident in a label L_m that is not the ground truth Y_n, provided that label is semantically related to the ground truth.
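Equation 2 admits an equally short sketch. The probabilities and the 3-label S matrix below are hypothetical:

```python
import numpy as np

def semantic_cross_entropy(probs, labels, S):
    """Semantic cross entropy of Equation 2.

    probs  -- (N, M) array of output confidences P(L_m | X_n, theta)
    labels -- (N,) integer array of ground-truth indices Y_n
    S      -- (M, M) semantic relatedness matrix
    """
    # weigh the log-confidence in every label L_m by S[Y_n, L_m]
    weighted = S[labels] * np.log(probs)  # (N, M)
    return -np.mean(weighted.sum(axis=1))

# With S equal to the identity matrix the objective reverts
# to the ordinary cross entropy of Equation 1:
probs = np.array([[0.7, 0.2, 0.1]])
labels = np.array([0])
S = np.eye(3)
print(semantic_cross_entropy(probs, labels, S))  # == -log(0.7) ≈ 0.357
```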

II. CROSS ENTROPY GENERATES OVERLY CONFIDENT CLASSIFIERS

To show the claim that the outputs of classifiers trained with cross entropy are spiked even at test time, we built a network heavily inspired by the current best classifier on the CIFAR-100 dataset as designed by Clevert et al. [2]. This classifier is a CNN with the architecture described in Table I.

layer          filters  filter size
convolutional  384      3x3
convolutional  384      1x1
MaxPool                 2x2
convolutional  384      1x1
convolutional  480      3x3
convolutional  480      3x3
MaxPool                 2x2
convolutional  480      1x1
convolutional  520      3x3
convolutional  520      3x3
MaxPool                 2x2
convolutional  540      1x1
convolutional  560      3x3
convolutional  560      3x3
MaxPool                 2x2
convolutional  560      1x1
convolutional  600      3x3
convolutional  600      3x3
MaxPool                 2x2
convolutional  600      1x1
softmax        100

TABLE I: Network architecture

The architecture described in Table I will be used throughout the rest of this paper. The activation function used in all layers except the final softmax is the Exponential Linear Unit (ELU), as described by Clevert et al. Dropout [3] is applied to the output of the last convolutional layer, and to the output of every MaxPool above it, with respective dropout rates of [0.5, 0.4, 0.3, 0.2, 0.2, 0]. The size of the filters is slightly different from what is proposed in [2]: odd-sized filters can (with proper padding) preserve the size of the input, whereas the proposed 2x2 filter sizes require either complex padding schemes or systematic upsampling of the output.

This CNN was trained on CIFAR-100 [4], a dataset of 60 000 labeled tiny images with a fixed train-test split of 50 000 and 10 000 images. It was trained for 80 epochs using stochastic gradient descent with a batch size of 100. The initial learning rate was set at 0.01 and decayed by a factor of 1/1.05 after every epoch. Momentum was used with a parameter of 0.9, as was L2 weight decay with a strength of 0.0004. This trained network was evaluated on the fixed set of 10 000 test images, on which it achieved an accuracy of 69.24%. Statistics of the prediction confidence are summarized in Table II.

statistic  value
mean       0.912
median     0.999
std        0.164

TABLE II: Statistics of the prediction confidence on the test set

These statistics do not tell the full story. Despite being a clear indication that the prediction confidence is typically close to 1, we cannot decisively conclude whether or not this network is overly confident. It is still very well possible that this network is very confident in those predictions that are correct but has a lower confidence in all the predictions that ended up being wrong. To further examine this, all predictions were grouped by a threshold on their prediction confidence. For each of these groups the prediction accuracy was calculated once more, summarized in Table III.

threshold         # images  accuracy
none              10000     0.692
≥ mean (= 0.912)  7519      0.791
≥ 0.99            6229      0.860
< mean            2481      0.292

TABLE III: Accuracy for varying thresholds on prediction confidence

In an ideal case we would want the accuracy in each of the groups to be higher than their threshold; e.g. when we consider only predictions where the confidence is higher than 0.9, we would like at least 90% of those predictions to be correct. From Table III we see that this is not the case in a traditionally trained network. Out of all the predictions with a confidence higher than 0.99, only 86% are correct. Secondly, we also see that 6229 of the 10 000 images are predicted with a confidence higher than 0.99. This leads us to the conclusion that cross entropy generates overly confident networks. The confidence in a prediction is very hard to relate to the actual chance that the prediction is correct. This is especially an issue in a detection use case, as these networks will generate a lot of false positives ('The network has more than 0.99 confidence that I can find Waldo in this spot, therefore it surely is correct to assume there is a Waldo.'). This is further explored and visualized in Figure 1.
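The thresholded grouping behind Table III can be reproduced along these lines. The four toy predictions below are illustrative placeholders, not the actual test-set outputs:

```python
import numpy as np

def accuracy_above_threshold(confidences, predictions, truths, threshold):
    """Accuracy restricted to predictions whose confidence >= threshold."""
    mask = confidences >= threshold
    if not mask.any():
        return float('nan'), 0
    accuracy = float(np.mean(predictions[mask] == truths[mask]))
    return accuracy, int(mask.sum())

# four toy predictions standing in for the 10 000 test-set outputs
conf = np.array([0.99, 0.95, 0.40, 0.85])
pred = np.array([3, 7, 1, 2])
truth = np.array([3, 5, 1, 2])
acc, n = accuracy_above_threshold(conf, pred, truth, 0.9)
print(acc, n)  # 0.5 2: only one of the two high-confidence predictions is correct
```

A well-calibrated network would have `acc` at least as large as the threshold for every threshold; Table III shows that the traditionally trained network violates this.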

Fig. 1: Accuracy within confidence intervals for a network trained with cross entropy. Each confidence bin is annotated with its number of samples; the highest bin (confidence ≥ 0.9) contains 7632 of the 10 000 samples.

Again we see that 7632 out of 10 000 predictions have a confidence ≥ 0.9, showing that cross entropy generates very confident networks. The actual accuracy within each of these bins is significantly lower than the prediction confidence. The network is therefore not only very confident, it is in fact overly confident.

III. SEMANTICS IN SEMANTIC CROSS ENTROPY

From the definition of Semantic Cross Entropy in Equation 2 we see that the relatedness among labels is governed by a matrix S. The introduction already mentioned that when this S matrix is the identity matrix, a label is only related to itself, resulting in a Semantic Cross Entropy that is equivalent to Cross Entropy. Generally, this matrix can be any symmetric matrix, as semantic relatedness is a symmetric property (i relates to j just as much as j relates to i). Secondly, the sum of the elements of a row (equivalently, column) should be constant. If we let the sum of row (column) i be greater than the sum of row (column) j, we effectively unbalance our objective in favor of label i. This behavior is often used in datasets where label i is underrepresented, but should be avoided in the general case, as each label is equally important. The constant is assumed to be 1 without loss of generality.

Continuing on the work of Zhao et al. [1], we propose a general method to select a matrix S for any dataset whose labels are structured in a hierarchy or an ontology. We begin by defining a matrix D where the element D_{i,j} is a quantitative measure of the hierarchical or ontological distance of labels i and j. Consider for instance the following D_{i,j} for labels structured in a hierarchical tree.

D_{i,j} = \frac{\mathrm{length}(\mathrm{path}(i) \cap \mathrm{path}(j))}{\max(\mathrm{length}(\mathrm{path}(i)), \mathrm{length}(\mathrm{path}(j)))} \qquad (3)

Here path(i) defines the path from the root node (the base class) down the hierarchy to class i, while p1 ∩ p2 denotes the set of classes that are part of both paths p1 and p2. This very general construction of D will work when the labels are structured in a hierarchical tree, but any hierarchical or ontological distance measure can be used to construct D.
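Assuming each label comes with its root-to-leaf path, Equation 3 can be sketched as follows. The three-label hierarchy is a made-up miniature in the style of Figure 7:

```python
def tree_distance(path_i, path_j):
    """Hierarchical relatedness D_{i,j} of Equation 3.

    path_i, path_j -- lists of node names from the root down to the label
    """
    shared = len(set(path_i) & set(path_j))
    return shared / max(len(path_i), len(path_j))

# hypothetical two-level hierarchy in the style of CIFAR-100
paths = {
    'beaver': ['root', 'aquatic animals', 'beaver'],
    'seal':   ['root', 'aquatic animals', 'seal'],
    'maple':  ['root', 'trees', 'maple'],
}
print(tree_distance(paths['beaver'], paths['seal']))   # 2/3: shared root and parent
print(tree_distance(paths['beaver'], paths['maple']))  # 1/3: shared root only
```

Note that D_{i,i} = 1 for every label, so the exponent in Equation 4 below vanishes on the diagonal.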

From the matrix D we then construct S as:

S_{i,j} = \frac{1}{Z} \exp(-\kappa (1 - D_{i,j})) \qquad (4)

where κ is a hyper-parameter that governs the decay of the relatedness and Z is a normalization constant such that the rows of S sum to 1. The impact of κ is explored in Figure 2; D is built according to Equation 3 using the labels of CIFAR-100 structured hierarchically according to Figure 7.

Fig. 2: Impact of κ on the S matrix

Studying the diagonal of the S matrices can help select an appropriate κ. For κ = 2 we see that the elements on the diagonal are approximately 0.06. Such a choice for κ would result in an objective function that depends for only 6% on the confidence in the ground truth. Inversely, this means 94% of the objective is determined by the confidence in related labels, which is typically not what we want.

Selecting κ = 8 results in an S matrix with 0.9 on the diagonal. Such an S matrix will have more use cases, as now the objective is still dominated by the ground truth while the other 10% of the objective depends on how confident the network is in semantically related labels.
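Turning a D matrix into S via Equation 4 is a one-liner plus row normalization. The 3x3 D below is a hypothetical example with two sibling labels, not the CIFAR-100 matrix:

```python
import numpy as np

def relatedness_matrix(D, kappa):
    """Semantic relatedness matrix S of Equation 4.

    D     -- (M, M) matrix of hierarchical distances D_{i,j}
    kappa -- decay hyper-parameter
    """
    S = np.exp(-kappa * (1.0 - D))
    # the normalization constant Z is taken per row so every row sums to 1
    return S / S.sum(axis=1, keepdims=True)

# hypothetical D for three labels, the first two of which are siblings
D = np.array([[1.0, 2/3, 1/3],
              [2/3, 1.0, 1/3],
              [1/3, 1/3, 1.0]])
S = relatedness_matrix(D, kappa=8)
print(np.round(S, 3))  # heavy diagonal (~0.93 here), small mass on the sibling
```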

A second approach can be taken where we neglect the actual semantics of a dataset and superimpose a very simple, uniform relatedness. For a dataset of N possible labels we impose that a label is related to itself with strength α and related to every other label with strength (1 − α)/(N − 1). On a dataset of 100 labels such as CIFAR-100 and with α = 0.9, this translates to an S matrix with 0.9 on the diagonal and 0.001 off the diagonal.

S = \begin{pmatrix}
0.9    & 0.001  & \cdots & 0.001 \\
0.001  & 0.9    & \cdots & 0.001 \\
\vdots & \vdots & \ddots & \vdots \\
0.001  & 0.001  & \cdots & 0.9
\end{pmatrix}

This choice of S shall henceforth be called the uniform S.
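Constructing the uniform S is straightforward; `uniform_S` is a hypothetical helper name used here for illustration:

```python
import numpy as np

def uniform_S(num_labels, alpha):
    """Uniform relatedness: alpha on the diagonal, (1-alpha)/(N-1) elsewhere."""
    off_diagonal = (1.0 - alpha) / (num_labels - 1)
    S = np.full((num_labels, num_labels), off_diagonal)
    np.fill_diagonal(S, alpha)
    return S

S = uniform_S(100, alpha=0.9)
print(S[0, 0], S[0, 1])  # 0.9 and ~0.001; every row sums to 1
```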

IV. IMPACT OF SEMANTIC CROSS ENTROPY

To research the impact of semantic cross entropy we evaluate its performance on CIFAR-100, using the same network architecture and training parameters as discussed in Section II. We train this architecture three times, with three different choices of S. The semantic choice of S uses κ = 8, where D is built according to Equation 3 using the CIFAR hierarchy depicted in Figure 7. The non-semantic choice of S is that where S is the identity matrix; as then S_{i,j} = 0 for i ≠ j, the semantic cross entropy reverts to cross entropy, so the network trained with this choice of S is completely equivalent to the network discussed in Section II trained with cross entropy. The third choice is the uniform S with α = 0.9. The test error over 80 training epochs for these networks is shown in Figures 3 and 4.

Fig. 3: Test error rate shown for 80 epochs (semantic, uniform and non-semantic)

Fig. 4: Test error rate for the last 30 epochs (semantic, uniform and non-semantic)

                  non-semantic  semantic  uniform
lowest error      0.3076        0.3162    0.3002
highest accuracy  69.24%        68.38%    69.98%

TABLE IV: Lowest error, or equivalently highest accuracy, for the three networks

The results from Figure 3, summarized in Table IV, are quite surprising. Not only does the introduction of a uniform S cause the network to train faster, it also lowers the error by 0.0074, or equivalently increases accuracy by 0.74%. The prediction confidences of the semantic and uniform networks are analyzed in the same way as was already done for the non-semantic one in Section II; the results are shown in Figures 5 and 6.

Fig. 5: Accuracy for varying thresholds with semantic S (each confidence bin is annotated with its number of samples)

Fig. 6: Accuracy for varying thresholds with uniform S (each confidence bin is annotated with its number of samples)

These results show that the confidence of a network trained with semantic cross entropy is more tightly coupled to the actual chance that the predicted label is correct. Whereas the prediction confidence of a network trained with the non-semantic cross entropy is typically high, this confidence does not properly reflect whether we can actually trust the network's prediction. Should a network trained with semantic cross entropy exhibit a high confidence, we can be quite sure that the prediction is in fact correct. We will informally refer to this well-calibrated prediction confidence as the trustworthiness of the network.

The impact semantic cross entropy has on the trustworthiness of a network appears to be similar for both the uniform and the semantically inspired S. This remarkable quality means that trustworthy predictions can be generated for arbitrary datasets, regardless of the semantic relatedness of the labels in that dataset; we only have to decide on a fitting choice for α in S.

An additional benefit of the semantically inspired S matrix becomes apparent when we evaluate the errors made by the three networks. More specifically, we are interested in the parent accuracy: the percentage of errors that still have the same parent in the hierarchy of Figure 7 as the ground truth. E.g. if the ground truth label of an input is 'beaver', a prediction of 'seal' would be an error, but as they both have the parent label 'aquatic animals', they are parent accurate.

                 non-semantic  semantic  uniform
parent accuracy  28.13%        35.46%    33.59%

TABLE V: Parent accuracy evaluated on the errors made by all three networks

Table V reveals that a network trained using semantic cross entropy with a semantically inspired S makes errors that are semantically related to the ground truth. Should such a network make an error, there is still a 35.46% chance that the error has the same hierarchical parent as the ground truth. Remarkably, the network trained with a uniform S also outperforms the non-semantic network on this metric, despite having no knowledge of the hierarchy.
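Parent accuracy as defined above can be computed with a sketch like the following, using a made-up three-label hierarchy:

```python
def parent_accuracy(predictions, truths, parent):
    """Share of *errors* whose prediction has the same parent as the truth.

    parent -- dict mapping each label to its parent in the hierarchy
    """
    errors = [(p, t) for p, t in zip(predictions, truths) if p != t]
    if not errors:
        return 0.0
    same_parent = sum(parent[p] == parent[t] for p, t in errors)
    return same_parent / len(errors)

# hypothetical labels with CIFAR-style parents
parent = {'beaver': 'aquatic animals', 'seal': 'aquatic animals',
          'maple': 'trees'}
pred = ['seal', 'maple', 'beaver']
truth = ['beaver', 'beaver', 'beaver']
print(parent_accuracy(pred, truth, parent))  # 0.5: one of two errors stays in-family
```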

V. CONCLUSIONS

On the CIFAR-100 dataset a network trained using semantic cross entropy with a uniform S achieved an accuracy of 69.98%, while a non-semantically trained network reached 69.24%. Whether or not this improvement holds consistently is left to future work, but it is a clear indication that semantic cross entropy with a uniform S can be introduced without lowering the overall accuracy. The network with a semantically inspired S reached an accuracy of 68.38%, but of all its errors 35.46% still have the same hierarchical parent as the ground truth. This is only true for 28.13% of the errors made by the non-semantic network.

Arguably the most beneficial aspect of semantic cross entropy is the calibration of output confidences. Both the networks trained with the semantically inspired S and the uniform S generate predictions whose confidence is closely related to the probability that the prediction is correct. Provided this property can be consistently shown on other datasets, such networks could have many use cases. Most noticeably, these networks would drastically decrease the number of false positives in a detection task.

REFERENCES

[1] Bin Zhao, Li Fei-Fei, and Eric P. Xing, "Large-Scale Category Structure Aware Image Categorization," Advances in Neural Information Processing Systems 24 (Proceedings of NIPS), pp. 1–9, 2011.

[2] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)," under review at ICLR 2016, pp. 1–13, 2015.

[3] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv:1207.0580, pp. 1–18, 2012.

[4] Alex Krizhevsky, "Learning Multiple Layers of Features from Tiny Images," Tech. Report, Computer Science Department, University of Toronto, pp. 1–60, 2009.


Fig. 7: Hierarchy of labels in CIFAR


Acknowledgements

A big thanks goes out to Kristof and Rein for their efforts in proofreading this thesis. Their insightful comments have shaped many sections and paragraphs.

I'd like to thank Brecht, Femke and Stijn for their role in providing me with the necessary hardware, both before and after the relocation of the iMinds Virtual Wall. Femke and Stijn have also tremendously helped me with the final structure of the chapters, which has led to what I believe is a sequential story.

Special thanks goes out to Kristof and Tom, who personally helped me deal with floods due to heavy rains in the last few days of writing. Without their selfless efforts I wouldn't have been able to get my final revisions done in time.


Contents

1 Introduction

2 Convolutional Neural Networks
  2.1 Artificial Neural Networks
  2.2 Neural networks and supervised learning
  2.3 A brief history of Convolutional Neural Networks for computer vision
  2.4 The architecture of Convolutional Neural Networks
    2.4.1 The convolutional layer
    2.4.2 The pooling layer
    2.4.3 Activation Function
  2.5 Training Convolutional Neural Networks
    2.5.1 Objective Functions
    2.5.2 Gradient Descent
    2.5.3 Backpropagation
  2.6 Getting more out of Convolutional Neural Networks
    2.6.1 Stochastic gradient descent
    2.6.2 Data Augmentation
  2.7 Regularizing Convolutional Neural Networks
    2.7.1 The Convolutional Layer
    2.7.2 Dropout

3 Objective Function: Design and Adaptation
  3.1 Discriminative Categories
  3.2 Hedging Your Bets
  3.3 Semantic cost
  3.4 Semantic Cross Entropy
  3.5 Semantic relatedness matrix
  3.6 Example use
  3.7 Research questions

4 Research Strategy
  4.1 Hardware Setup
  4.2 Dataset selection
    4.2.1 Belgian Traffic Signs
    4.2.2 ImageNet
    4.2.3 CIFAR
  4.3 Network Architecture and Training Specification
  4.4 Overcoming numerical instability
  4.5 Tier-n Accuracy

5 Evaluation and Results
  5.1 Selecting κ
  5.2 Network Accuracy
  5.3 Network Confidence
  5.4 Semantic cross entropy as regularisation

6 Conclusion and Future Work
  6.1 Conclusions
  6.2 Future Work
    6.2.1 Does semantic cross entropy with a uniform S continuously outperform non-semantic cross entropy in terms of accuracy?
    6.2.2 Semantic Cross Entropy in Detection
    6.2.3 Erroneous Truth Labelling
    6.2.4 Evaluation on Complex Datasets

List of Figures

2.1 ANN with 1 hidden layer
2.2 Convolution taken over an image
2.3 Gabor filters
2.4 Example result of convolving the image on the left with a set of Gabor filters
2.5 First layer filters trained on natural images
2.6 Locally connected neurons
2.7 Stride of filters
2.8 Effects of a 2-by-2 maxpool layer
2.9 Commonly used activation functions
2.10 ReLU(x) = max(0, x)
2.11 Variants on ReLU with α1 = 0.1 and α2 = 1
2.12 Saddle point
3.1 Image of a rottweiler
4.1 Example from the traffic sign dataset
4.2 Example from ImageNet. A few of its labels are Panda, Giant Panda, Mammal and Vertebrate
4.3 Example from CIFAR, a 32x32 tiny color image labelled as dog
4.4 Tree of related labels in CIFAR
5.1 Visualization of the effect of κ on the S matrix
5.2 Error rates of finetuning a network with various values for κ
5.3 Zoom on the last epochs of Figure 5.2
5.4 Test error rate shown for 80 epochs
5.5 Test error rate for the last 30 epochs
5.6 Non-semantic cross entropy for 80 epochs
5.7 Accuracy for specific confidence intervals for the semantically trained network
5.8 Accuracy for specific confidence intervals for the non-semantically trained network
5.9 Test error rate shown for 80 epochs
5.10 Test error rate for the last 30 epochs
5.11 Accuracy for specific confidence intervals for the network trained with uniform S

List of Tables

3.1 Prediction results of the rottweiler image
3.2 Semantic and non-semantic cross entropy
4.1 Hardware specifications
4.2 Network architecture
5.1 Lowest error or equivalently highest accuracy for both networks
5.2 Tier-1 accuracy evaluated on tier-0 errors made by both networks
5.3 Tier-1 accuracy evaluated on all predictions made by both networks
5.4 Statistics of the highest confidence on each image in the test set
5.5 Accuracy for varying thresholds with semantic cross entropy
5.6 Accuracy for varying thresholds with non-semantic cross entropy
5.7 Lowest error or equivalently highest accuracy for three networks
5.8 Tier-1 accuracy evaluated on tier-0 errors made by all three networks

Acronyms

ANN   Artificial Neural Network
CNN   Convolutional Neural Network
DNN   Deep Neural Network
ELU   Exponential Linear Unit
LReLU Leaky Rectified Linear Unit
MSE   Mean Squared Error
ReLU  Rectified Linear Unit
SVM   Support Vector Machine


Chapter 1

Introduction

Research concerning deep learning has become increasingly popular over the past years. In the year I took to write this thesis, Google beat the world champion at Go by revolutionizing reinforcement learning, NVIDIA taught a computer how to drive a car using only Convolutional Neural Networks (CNNs), and both Microsoft and Apple considerably advanced speech recognition. The number of papers published on every aspect of deep learning is quite overwhelming and leads to quotes like "... to our knowledge we are the first to ..." in papers by even the most highly regarded research groups. In order to fully define where this thesis fits in the whole deep learning landscape, I will first attempt to define deep learning.

Deep learning is a subset of machine learning algorithms that consist of several modules that extract features from the data. Each module uses the output of the previous module in a hierarchical way, resulting in high-level features of the input data. These features can then be used by subsequent modules to perform the required task.

In the case of Deep Neural Networks (DNNs), every such module is a set of neurons, which we call a layer. The distinction between traditional Artificial Neural Networks (ANNs) and a DNN is not defined by the number of layers, but rather by the role of these layers. Every layer in a DNN transforms the output of the previous layer into some higher-level features of the data, whereas a layer in an ANN is typically regarded as performing a (possibly complex) task on its input. Perhaps the most widespread type of DNN is the CNN. These CNNs are highly structured and have arguably had the most impact in deep learning to date.

The true power of CNNs is most clear in the field of computer vision. This power has only recently been fully discovered due to the increased availability of high-resolution image datasets and increases in both computational power and computationally efficient libraries. For example, the ImageNet Large Scale Visual Recognition Challenge 2012 required classifying 150,000 images in more than 1,000 categories, using over 10 million labeled high-resolution images as training data. This challenge was won by a CNN achieving a top-5 error rate of 16.4% [1]. Even more remarkable is the fact that the second-placed classifier achieved a top-5 error rate of 26.1% using an ensemble of classifiers based on traditional image features such as SIFT, HOG, LBP, etc. This dominance of CNNs on large, high-resolution datasets makes them a very promising technique for solving complex computer vision tasks.

Upon further researching the role of CNNs in these computer vision tasks, I noticed that most of the tasks involve some form of hierarchy. If we aim to recognize traffic signs, the different types of signs can be structured hierarchically, e.g. all the different danger signs can be considered hierarchical descendants of a single parent (an abstract danger sign). Despite hierarchy being prevalent in the labels, it is rarely taken into account during training. Deep learning research aimed at computer vision concerns itself with finding algorithms that make as few errors as possible, but it generally does not care which errors it does end up making. This thesis will attempt to further research this issue and will propose a solution.

Chapter 2 handles CNNs and in what way they differ from ANNs. The chapter assumes the reader is familiar with the theory around machine learning and ANNs. It is by no means meant to be a complete explanation of CNNs, but it will help the reader understand the rest of the thesis. Readers with extensive knowledge of CNNs or DNNs can skip this chapter.

Chapter 3 sketches existing research on hierarchy in machine learning. From this, research questions are posed, and a new technique for handling hierarchy is proposed.

Chapter 4 provides the reader with the necessary information about theexperiments run in order to evaluate the proposed technique. It explains indetail the final network hierarchy on which the technique is evaluated andthe hardware on which it was run.

Chapter 5 evaluates the results achieved in Chapter 4. The reader can find a full discussion on the impact of the proposed technique, and how it compares to techniques currently in use by the deep learning community.

Finally, Chapter 6 will present an overall conclusion. The evaluationfrom Chapter 5 is summarized and the most notable results are discussed.


The end of that chapter contains a listing of future work that follows from this thesis.


Chapter 2

Convolutional Neural Networks

2.1 Artificial Neural Networks

Figure 2.1: ANN with 1 hidden layer

ANNs are biologically inspired models that pass information between neurons in order to perform complex tasks. Each neuron produces a numerical output that can serve as an input to any number of other neurons. A set of neurons that all draw inputs from the same set of neurons is called a layer. The layer whose neurons do not serve as inputs to subsequent layers is the output layer, as the outputs of these neurons are considered to be the outputs of the network. The layer whose neurons have no inputs is the input layer; its neurons can be used to feed a given value into the network. Every layer between the input and output is never directly observed and is therefore called a hidden layer.

Every connection between neurons has a specific weight associated with it. A neuron multiplies every one of its inputs with the associated weight before taking the sum of all these weighted inputs. To this linear combination of inputs and weights we then apply an activation function. The choice of this activation function turns out to be important and is reviewed in Section 2.4.3. It is important to note that this function is not necessarily linear. In fact, if it were linear, any network would be reducible to an equivalent network consisting of a single input and output layer. You can convince yourself of this fact by considering that if the activation function is linear in its input, the neuron is linear in its inputs. Any subsequent layer will then again take a linear combination of previously calculated linear combinations, which is itself again a linear combination of the same inputs.

The output of a neuron is given in Equation 2.1. y is the output of a neuron with N inputs x_i with associated weights w_i. For notational compactness we rephrase the linear combination as a dot product. The activation function a introduces the non-linearity of the model.

y = a\left(\sum_{i=0}^{N} w_i x_i\right) = a(\vec{w} \cdot \vec{x}) \quad (2.1)

The layers in an ANN are fully connected. This means that every neuron in a layer is connected to every neuron of the previous layer, or equivalently, all neurons in a layer have the same inputs. It is tempting to think all the neurons in a layer are therefore equivalent. One should however note that there is a lot of freedom in selecting the associated weights of every neuron.

Rather than evaluating the output of every neuron individually, we introduce a computationally efficient way to evaluate every neuron of a layer. Instead of evaluating the dot product for every neuron, we structure the weights \vec{w}_i as the columns of a matrix W. As all neurons have the same inputs, we can then simultaneously calculate the output of every neuron in a layer, which leads to the recurrent expression for evaluating an ANN shown in Equation 2.2.

X_{i+1} = a(X_i W) \quad (2.2)
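Equation 2.2 maps directly onto a matrix product followed by an elementwise activation. The sketch below is a minimal NumPy illustration (the function name `layer_forward` and the random weights are hypothetical, chosen only for the example):

```python
import numpy as np

def layer_forward(x, W, activation=np.tanh):
    """One fully connected layer as in Equation 2.2: X_{i+1} = a(X_i W).

    x : (batch, n_in) input activations
    W : (n_in, n_out) matrix whose columns are the per-neuron weight vectors
    """
    return activation(x @ W)

# A 3-input, 2-output network such as the one in Figure 2.1,
# with a single hidden layer of 4 neurons:
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # input -> hidden
W2 = rng.normal(size=(4, 2))   # hidden -> output
x = rng.normal(size=(1, 3))    # one input sample
y = layer_forward(layer_forward(x, W1), W2)
print(y.shape)  # (1, 2)
```

Note that the whole layer is evaluated in a single matrix product, which is exactly the computational advantage Equation 2.2 is after.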

2.2 Neural networks and supervised learning

Supervised learning generally entails two distinct tasks. Regression is the task in which we attempt to approximate a function based on a limited set of (noisy) samples from that function. To perform regression with an ANN it suffices to construct an ANN with the number of input neurons equal to the dimensionality of the function. Although a function has only one output by definition, we often generalize this definition by allowing ANNs with more than one output neuron to evaluate a function with multiple outputs. The ANN depicted in Figure 2.1 could be used to evaluate a three-dimensional function with two outputs.

The second task, classification, aims to detect distinct patterns in the input in order to predict how to label this input. A typical example here is the task of labelling handwritten digits (which are inherently noisy) with the intended digit. The desired output of an ANN tasked with a classification job is a set of probabilities, one for each possible classification label. In the example of the handwritten digits, we would like the output to be a set of 10 probabilities (as there are 10 digits) where each probability reflects the confidence that the given input is the respective digit. We will therefore name each output probability the confidence of the network in the associated label.

We cannot assume that the output of an ANN behaves like a probability distribution (i.e. the sum of the outputs is 1, and each output is in [0, 1]); therefore we add a transformation at the very end that turns the output of the network into a probability distribution. This transformation is called the softmax, given in Equation 2.3, where x_i is the output of a neuron i and y_i the associated confidence.

y_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}} \quad (2.3)

For consistency the softmax expression is implemented as another layer in the ANN. This does not go against our current definition of a layer, as it is merely a non-linear function of the outputs in the previous layer (albeit a rather complex one). It is important to note that, unlike the other layers in an ANN, the softmax layer is completely defined by its inputs and has no associated weights, so it cannot be shaped at will. It is merely a transformation of the network output into a probability distribution.
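Equation 2.3 can be sketched in a few lines of NumPy. One common practical detail not mentioned above, and assumed here: subtracting the maximum before exponentiating leaves the result unchanged but avoids overflow.

```python
import numpy as np

def softmax(x):
    """Turn raw network outputs into a probability distribution (Equation 2.3)."""
    e = np.exp(x - np.max(x))  # shift for numerical stability; result is identical
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)  # three values in [0, 1] that sum to 1 (up to floating point)
```

The largest raw output always receives the largest confidence, so the softmax preserves the network's ranking of the labels.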

2.3 A brief history of Convolutional Neural Networks for computer vision

CNNs are a type of ANN, typically with many layers, in which the layers are heavily constrained and structured. The nature of this structure will be explained in detail in Section 2.4. They gained popularity when they were found to perform well on visual data by LeCun et al. in 1998 [2]. Their performance is accredited to the multilayer architecture that allows CNNs to solve complex tasks, while the structure in each layer leads to better scaling behavior compared to general ANNs.

Up until the work of LeCun, image recognition tasks were tackled by using manually designed feature extractors. These extracted features were then fed through a classifier such as a Support Vector Machine (SVM) or an ANN to perform classification. It was the goal of LeCun et al. to come up with a multilayer neural network that could take raw pixel data as input instead of manually designed features. Such a neural network should transform the raw data into appropriate features in the first few layers and then perform classification of these features in the subsequent layers.

The most important philosophy behind the workings of a CNN is that the feature extraction is part of the learning process. Therefore it is no longer explicitly required to perform manual feature extraction. This does not mean that CNNs are an excuse not to look at the data, as we still have to find suitable hyperparameters for the new concepts that we will introduce. The choice of these parameters is highly dependent on the task at hand, but surprising results have been reached by using off-the-shelf CNNs or even randomly generated ones [3][4]. A detailed discussion on these cases can be found in Section 2.7.

Today CNNs are regarded as one of the most powerful techniques to perform computer vision tasks, yet it took until 2012 for CNNs to outperform every other classifier on the highly competitive MNIST dataset [5]. This is mainly due to the fact that MNIST is a low-resolution dataset that requires classification into only 10 labels, which makes it possible to manually design very good features. On the very large 1000-category ImageNet dataset this manual feature extraction becomes far less feasible, which is where CNNs show their dominance by outperforming the best non-CNN classifier by 10% top-5 error [1].

A recent improvement in CNNs is the use of GPUs. Efficient usage of GPUs has been found to speed up the training of a CNN by 2-3 times [5]. This alleviated some practical limits on how long a researcher was willing to train a CNN in order to obtain good results.

Recent work has also invested in understanding and visualizing the weights of a fully trained CNN. It is a known difficulty that the weights of ANNs are hard to interpret, but this has typically not been seen as a drawback when it comes to the performance of ANNs. Many of the weights in a CNN, which are explained in Section 2.4.1, do not share this problem, and this gives rise to the idea that a thorough understanding of what is learned in a CNN can help design more accurate architectures [6].

2.4 The architecture of Convolutional Neural Networks

Keeping in mind that we wish to achieve a network that takes raw pixels as input, we can quickly see that fully connected layers do not scale favorably in the number of input pixels. Consider small 32x32 color images. Such images have 32 · 32 · 3 = 3072 pixel values, and as we want a network that takes raw pixel data as input we would therefore need 3,072 input neurons. Let's say we get optimal results using 1,000 hidden neurons and we want to classify these images into 100 categories. A simple 2-layer network already has 3072 · 1000 + 1000 · 100 ≈ 3.1 · 10^6 weights to learn. Clearly, if we want to scale this network to deal with higher-resolution input, we need to find a way to reduce the number of learnable weights.
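The arithmetic above can be checked with a short helper (the function `fc_weights` is hypothetical and ignores biases, as the text does):

```python
def fc_weights(layer_sizes):
    """Number of learnable weights (biases ignored) in a fully connected
    network with the given layer sizes."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# 32x32 RGB input, 1,000 hidden neurons, 100 output classes:
n = fc_weights([32 * 32 * 3, 1000, 100])
print(n)  # 3172000, i.e. roughly 3.1e6 as computed above
```

Doubling the input resolution to 64x64 already quadruples the first term, which is exactly the unfavorable scaling the convolutional layer is meant to avoid.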

This problem is solved in CNNs by introducing two new types of layers, the convolutional layer and the pooling layer. A complete CNN consists of two parts, a feature extraction part and a classification part. The feature extraction part contains any number of these new convolutional and pooling layers, while the classification part uses the traditional fully connected layers. The first part transforms the input into a set of features which the second part can then use to classify the input.

We will show that the new layers are fundamentally no different from the fully connected layers of ANNs. Therefore we can train CNNs using the same supervised training methods we would use in an ANN. The most common method, the backpropagation algorithm, is discussed in Section 2.5.3.

To aid terminology in what follows, we discuss an image in terms of width, height and depth. The width and height are the spatial dimensions of the image, while the depth can be seen as the information channels of the image. For example, a typical internet thumbnail has a width and height of 200 pixels with a depth of three, the RGB channels. A grayscale image has only a depth of one, while a PNG image can feature the transparency and color of a pixel, resulting in a depth of four. It is important to note that while in these examples the depth has an intuitive meaning such as color, grayscale or transparency, this is not always the case. In the terminology of CNNs it is not uncommon for a transformed image to have a depth of 100 or more, without each of these 100 channels having an intuitively clear meaning such as color or transparency. These outputs are generally called feature maps, while a channel is referred to as a depth slice.

2.4.1 The convolutional layer

The key difference between a CNN and an ANN is the introduction of a convolutional layer. We aim to introduce a filter in this layer that scans a small region of the image along the full depth, measuring the similarity of the filter with each region in the image. In more formal wording, we will introduce a discrete mathematical convolution of the image and a filter of fixed size. This filter is small in width and height, but is as deep as the image.

Consider O to be the output of convolving the input I with filter F. I and F are both 3-dimensional matrices with each dimension as defined in Section 2.4, namely two spatial dimensions and one depth dimension. D denotes the depth of the input, and therefore also the depth of the filter, while W and H respectively denote the width and the height of the filter. We can then formally define the output O as:

O_{x,y} = \sum_{d=0}^{D-1} (F_d * I_d)_{x,y} = \sum_{d=0}^{D-1} \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} F_{i,j,d} \, I_{x-i,\, y-j,\, d} \quad (2.4)

A visual example is given in Figure 2.2, where the input image has a width of 5, a height of 1 and a depth of 3.
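Equation 2.4 can be transcribed literally into a (deliberately naive and slow) loop. The sketch below assumes values outside I are zero; the function name `conv2d_naive` is hypothetical:

```python
import numpy as np

def conv2d_naive(I, F):
    """Literal transcription of Equation 2.4, with values outside I treated as 0.

    I : input of shape (height, width, depth)
    F : filter of shape (H, W, D), with D equal to the input depth
    Returns O with the same spatial shape as I.
    """
    Hi, Wi, D = I.shape
    H, W, Df = F.shape
    assert D == Df, "the filter must span the full input depth"
    O = np.zeros((Hi, Wi))
    for x in range(Hi):
        for y in range(Wi):
            for d in range(D):          # sum over depth slices
                for i in range(H):      # sum over the filter's spatial extent
                    for j in range(W):
                        if 0 <= x - i < Hi and 0 <= y - j < Wi:
                            O[x, y] += F[i, j, d] * I[x - i, y - j, d]
    return O

# Sanity check: a 1x1x1 filter with value 2 simply doubles a single-channel input.
I = np.arange(12.0).reshape(3, 4, 1)
F = np.full((1, 1, 1), 2.0)
O = conv2d_naive(I, F)
print(np.allclose(O, 2 * I[:, :, 0]))  # True
```

Real implementations replace these loops with highly optimized matrix routines, but the quintuple loop makes the structure of Equation 2.4 explicit: one sum over depth, two over the filter's spatial extent.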

We see in Figure 2.2 that the resulting output decreases in width. In practice we pad the input to a convolutional layer with zeros at the borders in such a way that the result of the convolution has the same width and height as the input. We do so for two reasons. Firstly, ensuring that the output of a convolution has the same spatial dimensions as the input helps alleviate issues when designing a CNN, as we can determine the size of the output of an arbitrary layer in the network without much consideration. Secondly, padding the input with zeros prevents information at the edges from being under-valued. If a filter represents a particular pattern that is present in the image but cut off by the edge, it will be overlooked by the corresponding filter. Padding with zeros allows that filter to better match the cut-off pattern and therefore better preserves information at the edges of an input.

Figure 2.2: Convolution taken over an image.

Before continuing with how a convolutional layer works, we discuss why it works. The idea of convolving filters over an image is not new. Gabor filters are used to perform edge detection in images, and involve taking the convolution of certain handmade filters with natural images. These Gabor filters are shown in Figure 2.3. Each of these filters represents an edge along a specific angle, and the convolution of an image with a Gabor filter results in an output map in which edges are strongly visible, as can be seen in Figure 2.4.

Figure 2.3: Gabor filters

Figure 2.4: Example result of convolving the image on the left with a set of Gabor filters.

Now remember that we aim to use the first layers to extract features from an image. If we compare the first-layer filters of a fully trained CNN in Figure 2.5 to the Gabor filters in Figure 2.3, we can definitely spot similarities. The trained CNN has learned that these filters generate a useful output, and we can see this to be true as they are similar to the Gabor filters (which were manually designed to be ideal filters to detect edges).

Figure 2.5: First layer filters trained on natural images

This comparison leads us to the conclusion that training a convolutional layer makes it search for the most informative local characteristics of a set of images. In Figure 2.5 we see that a lot of these filters are Gabor-like edge detectors, but also checkerboard-like patterns and patterns that appear to be wavelets. Next to distinguishable shapes we can also see so-called color blob detectors, which detect the presence of a certain color in regions of the image.


The next question we can ask ourselves is how we can efficiently apply this mathematical convolution in such a way that the resulting operation does not differ fundamentally from a normal fully connected layer. The answer lies in local connectivity of neurons and weight sharing. As can be seen in Figure 2.2, the output of a convolution at a given point can be calculated from a small region of the input along the full depth. This gives rise to the idea that the output of this convolution can equivalently be seen as a neuron whose inputs are exactly the pixel values of that region. This neuron is no longer connected to the full input; rather, it is connected to a small region that holds all the information it needs. This property is named local connectivity, as a neuron is no longer fully connected to all neurons of the previous layer, but only to a small region of them.

Figure 2.6: Locally connected neurons


The second insight that makes convolutional layers tick is weight sharing. When we consider Equation 2.4, which describes the convolution, we can notice two important things. Firstly, the convolution is nothing more than a linear combination of a region of the input I and the filter F. Secondly, this filter F does not depend on the values of x and y, the point in which we evaluate the convolution. In the equivalent network of Figure 2.6 this means that all the neurons have exactly the same weights, as they represent the output of a convolution with exactly the same filter.

Together with the local connectivity, weight sharing makes a convolutional layer scale incredibly well. Whereas the number of weights in a fully connected layer grows with the product of its input and output sizes, a convolutional layer scales with the size of its filter. This filter is typically small, meaning a convolutional layer only holds a handful of weights.

These concepts help us understand why CNNs work so well on computer vision tasks. The convolutional filters are used to detect patterns in the input, but a filter is identical wherever it is applied in the input. This spatial invariance of the filter means that the same pattern is filtered out of the input wherever it appears, which is very similar to how we as humans perform visual tasks. We notice distinguishing features of an object, no matter where it lies in our visual field.

We limited ourselves to a single filter in the previous discussion, but in most cases not all necessary information can be extracted using a single filter. The generalisation to more filters can be done by creating arbitrarily many independent convolutional filters and stacking their outputs as depth slices in a feature map. This means the output of a convolutional layer is again an image of fixed width and height with any number of depth slices, albeit these depth slices do not carry the same intuitive information (like color or transparency) as they did in the input image. It does mean, however, that the output of a convolutional layer can serve as the input to a next convolutional layer.

Figure 2.7: Stride of filters.


Next to the size of the filters, the number of filters and the padding of the image, we also have to decide on the stride of each filter. In Figure 2.7 a stride of one can be seen on the left and a stride of two on the right. The stride determines the overlap of filters during the convolution. Typically a stride of 1 is used, but a higher stride is not uncommon for larger images, as it decreases the size of the output at the cost of accuracy.
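The interaction of filter size, padding and stride is captured by a standard relation that is not stated explicitly above but follows from counting filter positions: the output size along one spatial dimension is (n_in − filter_size + 2·padding) / stride + 1. A quick sketch (the function name is hypothetical):

```python
def conv_output_size(n_in, filter_size, padding, stride):
    """Spatial output size of a convolution along one dimension:
    (n_in - filter_size + 2 * padding) // stride + 1."""
    return (n_in - filter_size + 2 * padding) // stride + 1

# 'Same' padding (one zero on each side) with a 3-wide filter and stride 1
# preserves the width:
print(conv_output_size(32, 3, 1, 1))  # 32
# A stride of 2 roughly halves it, as discussed above:
print(conv_output_size(32, 3, 1, 2))  # 16
```

This is the bookkeeping that the zero-padding of the previous paragraph simplifies: with appropriate padding and stride 1, every layer's output has the same spatial size as its input.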

2.4.2 The pooling layer

The convolutional layer vastly lowers the number of learnable weights compared to a fully connected layer due to the weight sharing and the local connectivity. But the resulting output of these layers is not significantly smaller in width and height than the input image. In terms of depth the output can even increase due to the use of many filters. Therefore we have merely shifted the scalability problems to the classification layers.

The proposed solution is the use of pooling layers, which subsample the output considerably. Much like the convolutional layer, a pooling layer is connected to a local region of the input image, and its output can be a variety of subsampling operations such as the mean or the maximum of the region. A layer which performs the maximum operation over a region is depicted in Figure 2.8 and is typically called the Maxpool layer.

Figure 2.8: Effects of a 2-by-2 maxpool layer.

Prior work has empirically shown that of the possible subsampling operations, the Maxpool layer significantly outperforms others such as the mean [7]. The Maxpool layer also increases the spatial invariance of a CNN, as the pooling operation summarizes the prevalence of a feature in a certain local region.
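The 2-by-2 maxpool of Figure 2.8 can be sketched compactly with a reshape trick (a minimal illustration assuming even height and width; the function name is hypothetical):

```python
import numpy as np

def maxpool2x2(x):
    """2-by-2 max pooling with stride 2 over a (height, width) feature map,
    as in Figure 2.8. Height and width are assumed to be even."""
    h, w = x.shape
    # Split into non-overlapping 2x2 blocks, then take the max of each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 2, 5, 6],
                 [3, 4, 7, 8],
                 [9, 1, 2, 3],
                 [1, 1, 4, 0]])
print(maxpool2x2(fmap))
# [[4 8]
#  [9 4]]
```

Each output value only records that the strongest response in its 2x2 region was, say, 9; where exactly inside the region it occurred is discarded, which is both the source of the added spatial invariance and of Hinton's objection quoted below.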


An interesting side note is that, despite significant improvements due to pooling layers, some researchers are skeptical of their use. It is conjectured that the pooling operation filters out a lot of useful spatial information. Most notably, Geoff Hinton himself can be quoted as follows:

“The pooling operation used in CNNs is a big mistake and the fact that it works so well is a disaster.” [8]

Regardless of this powerful quote by one of the most influential researchers in machine learning, pooling layers allow us to tackle very complex problems by dramatically reducing the number of features extracted by the feature extracting layers.

2.4.3 Activation Function

The activation functions traditionally used in ANNs are the sigmoid and tanh functions, which are depicted in Figure 2.9 [9]. These functions suffer from two major drawbacks. For one, since they are applied to every neuron in a layer, they can be quite costly to calculate due to the exponentials and the division. Secondly, they suffer from what is called the Vanishing Gradient problem [10]. We will discuss this at length in Section 2.5.3, but for now note that a problem occurs due to the fact that the derivative of these functions is approximately zero when |x| is large.

Figure 2.9: Commonly used activation functions: (a) tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}); (b) sigmoid(x) = 1 / (1 + e^{−x}).

Rectified Linear Units (ReLUs) were first conceived in the context of Restricted Boltzmann machines [11] but have been shown to not only greatly speed up the training but also improve the results of CNNs [1]. For these reasons the ReLU and the activation functions derived from it are the activation functions of choice, and have almost completely pushed out the use of the sigmoid and tanh. The reason for its superior performance is often accredited to its constant derivative, which alleviates training issues in deep networks (this will be discussed in depth in Section 2.5.3), and its computationally efficient expression. The ReLU expression can be seen in Figure 2.10.

Figure 2.10: ReLU(x) = max(0, x).

The ReLU activation function effectively filters out all negative values. This property leads to what is called sparse coding, a biologically inspired phenomenon. The brain of mammals consists of billions of neurons, but only a fraction of those neurons are active at the same time when processing data. Using the ReLU activation function enforces this effect on any type of ANN. It has been argued that such sparse codes are more efficient (than non-sparse ones) in an information-theoretic sense [12], and that this property of ReLUs is another reason for their superior performance.

The previous paragraph also hints at a key limitation of ReLUs. To understand this, I argue the following. For a given input pattern only a small fraction of the total number of neurons will activate. If we wish to adapt the behavior of all the neurons based on their current behavior, we can only reliably adapt the small fraction of neurons that are active at that time. More formally, we see that the derivative of the output of a ReLU with respect to the input is 0 if the input is less than zero. This severely limits the flow of information (as a zero derivative means no small change in the input is reflected in the output), and the zero derivative will prove to be a significant hurdle when training neurons using the ReLU activation function. A few attempts have been made to tweak ReLUs so that they overcome this restricted flow of information, in the form of Leaky Rectified Linear Units (LReLUs) and Exponential Linear Units (ELUs).


Figure 2.11: Variants on ReLU with α1 = 0.1 and α2 = 1: (a) LReLU(x) = x if x > 0, α1·x if x ≤ 0; (b) ELU(x) = x if x > 0, α2(exp(x) − 1) if x ≤ 0.
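The three activation functions discussed here are one-liners in NumPy. A minimal sketch, using the α values from Figure 2.11:

```python
import numpy as np

def relu(x):
    """ReLU: all negative values are filtered out (Figure 2.10)."""
    return np.maximum(0.0, x)

def lrelu(x, alpha1=0.1):
    """Leaky ReLU: a small slope alpha1 keeps the derivative non-zero for x <= 0."""
    return np.where(x > 0, x, alpha1 * x)

def elu(x, alpha2=1.0):
    """ELU: smooth exponential saturation towards -alpha2 for negative inputs."""
    return np.where(x > 0, x, alpha2 * (np.exp(x) - 1.0))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))   # [0. 0. 3.]
print(lrelu(x))  # negative input scaled to -0.2, positive input unchanged
print(elu(x))    # negative input saturates to about -0.865
```

Note that all three agree for positive inputs; they differ only in how much information about negative inputs survives, which is exactly the limitation of the plain ReLU discussed above.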

Both the LReLU and ELU activation functions allow non-zero derivatives when the inputs are negative. LReLUs have been shown to perform on par with ReLUs while speeding up the training process [11]. ELUs are a recent development that have significantly improved CNNs on a competitive academic dataset [13]. It is quite interesting to note that despite a computationally heavy expression, ELUs still speed up training, as they reduce the number of training steps. They are however so recent that a full discussion and comparison is yet to be made.

2.5 Training Convolutional Neural Networks

As we discussed in Section 2.4, the architecture of CNNs allows them to perform tasks on high-dimensional inputs. Despite scaling favorably compared to fully connected ANNs, there are still a lot of weights to learn.

2.5.1 Objective Functions

Once the architecture of an ANN, and therefore of a CNN, is fixed, its output Y is a deterministic function of the input X and the weights W, as seen in Equation 2.5. We have a large degree of freedom in shaping this output function due to the high dimensionality of W. A notable recent example of this high dimensionality of W is VGGNet [14], which won the ImageNet ILSVRC-2014 contest in the localization task and had a total of 140 million weights.

Y = F(X, W) \quad (2.5)


The act of training a CNN is a (mainly) supervised learning problem. It is the task of adjusting the output function to match what is reflected in the training data. The first step in performing this task is quantifying how well the current output matches the desired output. We call this quantifier the Objective Function, but it is equivalently referred to as a Loss Function or a Cost Function. Once we have found such a fitting measure, we can rephrase the training of a neural network as a mathematical optimisation problem. It is our objective to adjust the weights such that they minimize the Loss/Cost/Objective Function.

An objective function is the average of costs c over the training data. Each cost c depends on a training sample and the weights of the network, and returns some measure of how well the network behaves under these weights according to the training sample it is given. A general objective function is given in Equation 2.6, where (X_n, Y_n) is a training sample and F(X_n, W) is the output of a network given input X_n and weights W.

C(W, X, Y) = \frac{1}{N} \sum_{n=0}^{N} c(Y_n, F(X_n, W)) \quad (2.6)

Recent work suggests that it can be beneficial to pre-train the network in an unsupervised manner [15] [16]. The benefits are numerous, especially when labelled data is scarce [17]. Despite having an important role in DNNs, unsupervised pre-training is not mentioned in the rest of this thesis, as CNNs suffer less from the problems solved by unsupervised pre-training than general DNNs do [1] [13].

In what follows I will discuss two different objective functions, Mean Squared Error (MSE) and Cross Entropy. These are two of the most commonly used objective functions in machine learning and statistics in general. They have also inspired a set of other objective functions, each solving one of the downsides of their respective original, often at the cost of computational efficiency. MSE is used in regression problems, while cross entropy is used in classification tasks.

Mean Squared Error

The MSE function seen in Equation 2.7 is widely used in statistics and machine learning to describe the error between a prediction (or estimation in statistics) and the ground truth. Its prevalence is partly due to its intuitive interpretation, but foremost due to its relationship with the Maximum Likelihood Estimator (MLE). It can be shown that under reasonable conditions the value of W that minimizes Equation 2.7 is equal to the MLE of W given the function F and training data (Y_n, X_n) [9].


C(W, X, Y) = \frac{1}{N} \sum_{n=0}^{N} (Y_n - F(X_n, W))^2 \quad (2.7)

Equation 2.7 is often multiplied by a constant 1/2 as a mathematical convenience to ease differentiating with respect to W. The reason why we want this differentiation to be convenient will become clear in Section 2.5.2.
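Equation 2.7 reduces to a one-line average. A minimal sketch (the function name `mse` is hypothetical, and Y here stands for scalar regression targets):

```python
import numpy as np

def mse(Y, Y_pred):
    """Mean squared error of Equation 2.7, averaged over the N training samples."""
    Y, Y_pred = np.asarray(Y), np.asarray(Y_pred)
    return np.mean((Y - Y_pred) ** 2)

# Three samples: errors of 0.5, 0 and 1 give (0.25 + 0 + 1) / 3:
print(mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # about 0.4167
```

The squaring makes the single error of 1 dominate the total cost, which is why MSE is said to emphasize the distance between prediction and ground truth.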

Cross Entropy

MSE emphasizes the distance between a prediction by the network F(X_n, W) and the ground truth Y_n. This implies there is some innate ordering in our predictions and truths, which is trivially so for regression problems (a 4 is closer to a 3 than a 5 is), but this is no longer the case for classification. When the ground truth is label 3, a classification as label 5 is not twice as bad as a classification as label 4.

C(W, X, Y) = \frac{1}{N} \sum_{n=0}^{N} -\log(P(F(X_n, W) = Y_n)) = -\frac{1}{N} \sum_{n=0}^{N} \log(P(Y_n \mid X_n, W)) \quad (2.8)

We've seen in Section 2.2 that ANNs used for classification output a probability distribution that represents the confidence in a given label. We aim to optimize the network such that the confidence in the truth (Y_n) is as close to 1 as possible (and therefore, since the output is a probability distribution, the confidence in the other labels drops to 0). To improve numerical stability, rather than maximizing the sum of a network's confidences in the ground truth, we minimize the sum of the negative logarithms of those confidences.
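Equation 2.8 can be sketched as follows, assuming the network outputs have already passed through the softmax of Equation 2.3 (the function name `cross_entropy` is hypothetical):

```python
import numpy as np

def cross_entropy(confidences, labels):
    """Cross entropy of Equation 2.8: the mean negative log of the network's
    confidence in the ground-truth label of each sample.

    confidences : (N, K) softmax outputs
    labels      : (N,) ground-truth label indices
    """
    p_truth = confidences[np.arange(len(labels)), labels]
    return -np.mean(np.log(p_truth))

# Two samples, three labels; the network is confident and correct on both:
conf = np.array([[0.9, 0.05, 0.05],
                 [0.1, 0.8,  0.1]])
loss = cross_entropy(conf, np.array([0, 1]))
print(loss)  # -(log 0.9 + log 0.8) / 2, a small loss of about 0.164
```

Only the confidence assigned to the true label enters the loss; a confident wrong prediction (true-label confidence near 0) is punished very heavily, since −log(p) grows without bound as p approaches 0.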

2.5.2 Gradient Descent

Gradient descent is a method to reach the nearest local minimum of a function given a starting point X_0. Assuming X is a vector of inputs (x_1, x_2, ..., x_n), the method is based on the notion that it can be proven, under reasonable assumptions, that at any point X_i in which a function f is differentiable the following statements hold for some small λ > 0:

X_{i+1} = X_i - \lambda \nabla_X f(X_i) \quad (2.9)

f(X_{i+1}) \leq f(X_i) \quad (2.10)


Where ∇_X is the gradient of the function, defined as the vector whose components are the partial derivatives of f with respect to the components of X.

\nabla f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, ..., \frac{\partial f}{\partial x_n}\right) \quad (2.11)

Applying Equation 2.9 to Equation 2.6 (the function we ultimately aim to minimize) gives us

W_{i+1} = W_i - \lambda \nabla_W C(W) \quad (2.12)

W_{i+1} = W_i - \lambda \nabla_W \frac{1}{N} \sum_{n=0}^{N} c(Y_n, F(X_n, W)) \quad (2.13)

W_{i+1} = W_i - \lambda \frac{1}{N} \sum_{n=0}^{N} \nabla_W c(Y_n, F(X_n, W)) \quad (2.14)
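Equation 2.12 is simply an iterated update rule. A minimal one-dimensional sketch (the function name `gradient_descent` and the toy target function are hypothetical):

```python
def gradient_descent(grad, x0, lam=0.1, steps=100):
    """Plain gradient descent (Equation 2.12): repeatedly step against the gradient."""
    x = x0
    for _ in range(steps):
        x = x - lam * grad(x)  # x_{i+1} = x_i - lambda * grad f(x_i)
    return x

# Minimizing f(x) = (x - 3)^2, whose gradient is 2(x - 3):
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # 3.0
```

In a real network, `grad` is the averaged gradient of Equation 2.14, computed by backpropagation over all (or, in practice, a mini-batch of) training samples.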

If we follow subsequent points X_i from Equation 2.9 until f(X_{i+1}) = f(X_i), we have either reached a local minimum, i.e. no small change of X_i in any of the n directions reaches a lower function value than X_i itself, or a saddle point, i.e. a point in which ∇f = 0 that isn't a local minimum.

Figure 2.12: Saddle point

At first glance the notion of a local minimum is a crippling counterargument for using gradient descent to find a minimum. It turns out we do not have to worry about getting stuck in such a local minimum, as the prevalence of local minima decays exponentially with the dimensionality of the problem [18]. A bigger problem, however, is the high prevalence of saddle points.

Due to the prevalence of these saddle points or pseudo-saddle points (where ∇f ≈ 0), the cost function shows a lot of regions of low curvature. Due to this issue DNNs (and therefore CNNs) were thought of as nearly impossible to train with gradient descent, as the weights would often get stuck in non-optimal regions of very low curvature. Techniques involving higher-order derivatives were introduced so that training can take a big step in low-curvature regions and small steps in high-curvature regions. These methods have been shown to obtain results superior to those trained with gradient descent. In one dimension we can develop such a second-order method, called Newton's Method, using the second-order Taylor expansion of a function:

f(x + Δx) = f(x) + Δx f′(x) + (Δx²/2) f″(x) + η    (2.15)

with η ∼ O(Δx³/6)    (2.16)

Provided Δx is small we can assume η to be negligible, as it is proportional to Δx³, such that Equation 2.15 is a strong approximation of f in the neighborhood of x. With this in mind we can minimize f by finding Δx such that f(x + Δx) is a minimum. Or:

d/dΔx (f(x + Δx)) = 0

d/dΔx (f(x) + Δx f′(x) + (Δx²/2) f″(x)) = 0

f′(x) + Δx f″(x) = 0

Δx = −f′(x)/f″(x)

xi+1 = xi + Δxi = xi − f′(xi)/f″(xi)    (2.17)

We then recursively follow Equation 2.17 to reach a minimum of f. The benefit of this technique over gradient descent is clear from the denominator of Δxi. When in a low-curvature region, where f″ is small as f′ changes only slightly, we take a big step in the direction of −f′ to avoid getting stuck. Similarly, when f″ is large, in an area of high curvature, the step we take will be small to avoid stepping over a minimum.
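A one-dimensional Newton step from Equation 2.17 is a one-liner. The sketch below (names and the toy function are mine) also shows a well-known property: on a quadratic the very first step lands exactly on the minimum, since the second-order Taylor expansion is then exact:

```python
def newton_step(x, f1, f2):
    """One iteration of Eq. 2.17: x <- x - f'(x) / f''(x)."""
    return x - f1(x) / f2(x)

# Minimize f(x) = (x - 2)^2 + 1:  f'(x) = 2(x - 2), f''(x) = 2.
f1 = lambda x: 2 * (x - 2)
f2 = lambda x: 2.0
x = 10.0
x = newton_step(x, f1, f2)  # quadratic f: a single step reaches the minimum
assert abs(x - 2.0) < 1e-12
```

The large constant curvature f″ = 2 keeps every step proportionate, which is exactly the step-size argument made above.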

The generalisation towards multiple dimensions follows from replacing first derivatives with gradients and second derivatives with Hessians, where the Hessian is the matrix of second-order derivatives.


f ′(x)⇒ ∇f(X)

f ′′(x)⇒ H(f)(X)

Despite being a clear solution to stuck gradients, these second-order techniques require the computation of the Hessian matrix H(f), or equivalently all second-order derivatives. As of today we do not know how to calculate that Hessian in a large neural network other than reverting to numerical estimations. A proposed solution is the Hessian-free second-order optimization developed by Martens et al. [19], but since state-of-the-art networks report no real issues when it comes to training with first-order methods [1][20], gradient descent is still preferred when carefully applied [21].

Momentum

From Equation 2.9 we see that the update rule in gradient descent is instantaneous, i.e. the change in X at one step depends only on the gradient at that point at that step. We can significantly improve the rate of convergence [22] by introducing a velocity vector in Equation 2.9 such that:

Xi+1 = Xi + Vi+1 (2.18)

with Vi+1 = µVi − λ∇f(Xi) (2.19)

In Equation 2.18 we refer to μ as the momentum and to λ as the learning rate. The benefits of including a momentum become clear if we again consider regions of low curvature. As follows from the definition of such regions, the gradient changes only slightly across steps. In traditional gradient descent this leads to a very slow learning process, or even to the algorithm getting stuck in such a region. With the inclusion of a momentum, however, the update accumulates those directions that don't change across steps, leading to a growing step size in such regions while staying small in highly variable regions.
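Equations 2.18 and 2.19 translate directly into code. The sketch below (the function name and toy objective are mine) keeps the velocity vector across steps:

```python
import numpy as np

def momentum_descent(grad, x0, lr=0.01, mu=0.9, steps=500):
    """Gradient descent with a velocity vector (Eqs. 2.18-2.19)."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        v = mu * v - lr * grad(x)  # accumulate directions that persist
        x = x + v
    return x

# On f(x) = x^2 the velocity grows while the gradient keeps pointing the same way.
x_min = momentum_descent(lambda x: 2 * x, [5.0])
assert np.allclose(x_min, [0.0], atol=1e-3)
```

With μ = 0 this reduces to plain gradient descent; μ close to 1 gives the long memory that carries the iterate through flat regions.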

From the previous discussions we can take away that if the objectivefunction is differentiable with respect to the weights of the network we canuse gradient descent to minimize this function. We do still have to considerhow we can efficiently find that gradient.

2.5.3 Backpropagation

Finding the gradient of an objective function with respect to the weights proved to be a difficult issue, as gradient descent was largely unexplored in neural networks until the famous 1986 paper by Rumelhart and Hinton [22]. In this paper the idea of propagating errors back through the network, dubbed backpropagation or backprop, is explored and shown to be a fast algorithm for computing the necessary gradient.

Backpropagation is based on the chain rule, which states that when g is differentiable in x and f is differentiable in g(x), f(g(x)) is differentiable and its derivative can be written as

df/dx = (df/du)(du/dx)    (2.20)

with u = g(x)    (2.21)

Next let's reconsider Equation 2.5: rather than looking at the input-output behavior of the network as a whole, we can regard it as a subsequent application of functions on the input. If we consider fl to be the function applied in layer l, and F to be the function of the complete network with L layers, we can write

F = fL ◦ fL−1 ◦ ... ◦ f2 ◦ f1

Or, with X as the input to the network, W all the weights of the network, and Wl the weights for layer l, we write

F (X,W ) = fL(fL−1(...f2(f1(X,W1),W2)...,WL−1),WL)

Remember that we are not looking for the gradient of the network, but rather the gradient of the objective function C. This C is one of the functions described in Section 2.5.1 applied to the output F(X, W), and is therefore subject to the chain rule once more. The resulting derivative of objective function C with respect to weight wi after applying Equation 2.20 looks like:

dC/dwi = (dC/dF)(dF/dfL)(dfL/dfL−1) ⋯ (df2/df1)(df1/dwi)    (2.22)

To see how this equation has helped us find the gradient, we consider what the factors in Equation 2.22 represent. dC/dF is the rate of change of the objective with respect to the output of the network, a quantity that is calculable from knowing the objective function C and the output of the network. dF/dfL is equal to 1, as fL represents the last layer, which is equivalent to the network output. The factors dfn/dfn−1 are the rate of change in a layer with respect to the output of the previous layer. Once the output of the very last layer is known, we can therefore calculate this factor and work our way through the equation, multiplying all the factors that can be calculated.

The quantity df1/dwi can be calculated as wi is part of the weights in layer 1. If it was not, and therefore part of some layer n > 1, Equation 2.22 would truncate at ⋯ (dfn+1/dfn)(dfn/dwi), leaving us with an equivalent case.

All the factors of Equation 2.22 can therefore be calculated locally in each layer (a layer only needs to know its inputs, which are the outputs of the previous layer, and its own weights). This is where backpropagation gets its name: we do a forward pass through the network such that all the outputs and the eventual objective score are known, after which we make a backward pass, calculating the derivatives along the way.
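The forward-then-backward pattern can be made concrete on a tiny two-layer network. The sketch below is my own toy example (a sigmoid hidden layer and a squared-error cost, not the architecture used elsewhere in this thesis); it ends by checking one backprop gradient against a numerical estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer network: x -> sigmoid(W1 x) -> W2 h, with squared-error cost.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, W2):
    """Forward pass, then a backward pass applying the chain rule (Eq. 2.22)."""
    z1 = W1 @ x                  # linear combination in layer 1
    h = sigmoid(z1)              # layer 1 output
    out = W2 @ h                 # layer 2 (= network) output
    cost = 0.5 * np.sum((out - y) ** 2)

    # Backward pass: every factor is computed locally, layer by layer.
    d_out = out - y              # dC/dF for the squared-error cost
    dW2 = np.outer(d_out, h)     # gradient for the layer-2 weights
    d_h = W2.T @ d_out           # propagate the error back through layer 2
    d_z1 = d_h * h * (1 - h)     # sigmoid'(z) = s(z)(1 - s(z))
    dW1 = np.outer(d_z1, x)      # gradient for the layer-1 weights
    return cost, dW1, dW2

# Sanity-check one backprop gradient entry against a finite difference.
x, y = rng.standard_normal(3), rng.standard_normal(2)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
cost, dW1, dW2 = forward_backward(x, y, W1, W2)
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
numeric = (forward_backward(x, y, W1p, W2)[0] - cost) / eps
assert abs(numeric - dW1[0, 0]) < 1e-4
```

Note how the intermediate outputs h and out from the forward pass are reused during the backward pass, which is what makes backprop cheap.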

To simplify notation, the process of training a network using a gradient-based optimization method where the gradient is calculated with backpropagation is often, pars pro toto, referred to simply as backprop.

Vanishing Gradient

In order to describe the vanishing gradient problem we take a deeper look at the factors dfn/dfn−1 from Equation 2.22. Recall from Equation 2.2 that the output of a layer is nothing more than a linear combination of its inputs with its weights, passed through an activation function a. For ease of notation we introduce a new function i that is this linear combination of inputs with the weights. Differentiating dfn/dfn−1 can therefore be done with the chain rule.

dfn/dfn−1 = (da/di)(di/dfn−1)    (2.23)

Substituting Equation 2.23 in Equation 2.22 pinpoints the vanishing gradient problem. The factor da/di (the derivative of the activation function with respect to the linear combination that serves as its input) suddenly appears once for every layer. We can therefore see that Equation 2.22 is proportional to the factor da/di raised to some power K, where K is the depth of the layer in which wi plays a role.

dC/dwi ∼ (da/di)^K

When our network has many layers, the weights in the first layers will have a large K. For da/di < 1 this means the gradient for weights in the first layers decays exponentially, whereas da/di > 1 would lead to enormous values for the gradient. The first case is dubbed the vanishing gradient problem and is prevalent when the sigmoid or tanh activation functions from Figure 2.9 are used, as their derivative is close to 0 outside a small interval around 0. This also explains why the ReLU alleviated (not solved: even with a constant da/di, Equation 2.22 is still a product of many possibly small values) the vanishing gradient problem, as its derivative is a constant 1 for positive inputs.

This fundamental issue with deep networks makes the results achieved on ImageNet by deep networks even more remarkable. Despite a clear mathematical reason why backprop shouldn't work on deep networks, it does not appear to be an issue. It is however important to note that CNNs are highly structured in the lower layers, and that the weights of the convolutions are strongly constrained. As will be discussed in Section 2.7.1, even randomly initialised convolution weights can be used to perform strong classification, which leads us to believe that the training needed in the lower layers is minimal, provided careful initialization of these layers [21].

Research now considers the vanishing gradient problem as a fundamentallimit, but not a dealbreaker. Careful initialization, greedy layerwise pre-training and highly structured or constrained networks (which CNNs are)can make it so the supervised learning required at lower layers is limited.

2.6 Getting more out of Convolutional Neural Networks

2.6.1 Stochastic gradient descent

Looking at the gradient of the general cost function in Equation 2.12, we notice that for large N it becomes quite a nuisance to calculate the cost or its gradient, as it involves summing over all N samples. Stochastic gradient descent avoids this issue by estimating the gradient based on a subsample S of the whole training set N.

Using the stochastic approach involves selecting a few new parameters, such as the size of the subset S and a learning rate (typically a lot smaller than with the non-stochastic variant). Some insights have been collected [23], but practically you can begin using stochastic gradient descent with only a few guidelines:

∇WC(W) ≈ (1/|S|) Σ_{n=0}^{|S|} ∇W c(Yn, F(Xn, W))    (2.24)

• Use stochastic gradient descent when time or memory are a bottleneckduring training.


• Employ more iterations (∼ |N|/|S|) with a smaller learning rate.

• Use a random permutation of the training set such that each subset Sis independent and representative of your full training set.
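The guidelines above can be sketched as a minibatch training loop. The helper below is my own illustration (names and the least-squares toy problem are not from the thesis); it shuffles the data each epoch so every subset is representative, then updates on each subset's gradient estimate from Equation 2.24:

```python
import numpy as np

def sgd_epoch(W, X, Y, grad_fn, lr=0.01, batch_size=32, rng=None):
    """One pass over the data with minibatch gradient estimates (Eq. 2.24)."""
    if rng is None:
        rng = np.random.default_rng()
    order = rng.permutation(len(X))  # shuffle so each batch is representative
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        W = W - lr * grad_fn(W, X[idx], Y[idx])  # gradient on the subset only
    return W

# Toy least squares: the gradient of mean((X W - Y)^2) is 2 X^T (X W - Y) / |batch|.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
true_W = np.array([1.0, -2.0, 0.5])
Y = X @ true_W
grad = lambda W, Xb, Yb: 2 * Xb.T @ (Xb @ W - Yb) / len(Xb)
W = np.zeros(3)
for _ in range(200):
    W = sgd_epoch(W, X, Y, grad, lr=0.05, rng=rng)
assert np.allclose(W, true_W, atol=1e-2)
```

Each update touches only `batch_size` samples, which is exactly what makes the stochastic variant attractive when time or memory are the bottleneck.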

2.6.2 Data Augmentation

Generally speaking, you can improve the results of any machine learning problem by obtaining more data. It is however not often easy or cheap to obtain such data. Consider a case where we would like to classify EEGs of a patient's brain to determine if they are likely to suffer an epileptic attack. In order for a model to learn this task it needs samples of an EEG when an attack is likely to happen, which is cumbersome to obtain. Therefore, despite the high value of good data, it is not straightforward (or profitable) to obtain arbitrary amounts of it.

Synthetically generating more data based on the existing data is dubbed data augmentation. This process is highly dependent on the task at hand and will therefore not be discussed in general, despite its relative importance.

2.7 Regularizing Convolutional Neural Networks

Highly complex models are capable of performing highly complex tasks, so it can be tempting to make your model infinitely complex (or at least as complex as your hardware will allow) such that it trivially has the necessary complexity to complete the task at hand. These infinitely complex models will be able to achieve perfect scores on the training data at any task, but will not behave accordingly on data they have not been trained on. Perhaps the clearest example is a model that is so complex it can keep every datapoint used in training in its memory. It can then perfectly perform tasks on these points as it has the answer in memory. It should however be noted that the eventual goal of a model is always to perform well on unseen data. This possibly large gap between performance on training data and performance on unseen data is called overfitting: as a model is overly eager to fit well on the training data, it tends to behave unexpectedly on unseen data. It would be beyond the scope of this thesis to discuss all the techniques used to prevent overfitting, but two cases are discussed explicitly due to their importance in CNNs.

2.7.1 The Convolutional Layer

As weight sharing is a strong constraint on the complexity of convolutional layers, they appear to be strongly regularized. Keeping in mind that the first layers of a CNN are considered to be the feature extractors, this strong regularization should mean that trained layers should perform well regardless of the task at hand. This is shown by the good results reached by using the convolutional layers of a trained network known as OverFeat as an off-the-shelf feature extractor to perform image classification. OverFeat was trained for the ImageNet 2013 challenge using millions of datapoints and still yields highly competitive results when used as a feature extractor for a classification task which has nothing to do with ImageNet [4]. Sermanet and LeCun take it one step further by using randomly initialized convolutional layers and training a classifier to work with these random features. They did so on a dataset of traffic signs, alongside a fully-trained CNN. Not surprisingly the fully trained CNN performed the best, with an accuracy of 99.17%. What is surprising however is that a classifier trained to use features generated by the random convolutional layers reached a competitive 97.33% [3].

This remarkable generality of the convolutional layers can be interpreted in two ways. For one, regularizing a CNN is not harder than regularizing a general feed-forward neural network of equivalent depth and parameters. Secondly, the feature extraction layers of a CNN can be used as an off-the-shelf feature extractor for an arbitrary problem. Provided the full weights of the pretrained layers are available, these weights can also be used as initialization weights, which can then be finetuned for a specific problem and drastically improve training time as compared to starting from random weights.

2.7.2 Dropout

The overfitting behavior of large neural networks (not just CNNs) has been found to greatly reduce when random neurons are omitted during the training phase [24]. At every training step the output of a neuron is randomly kept with probability p (and set to 0 otherwise). We can implement this by introducing an extra layer between all layers of the neural network that multiplies some outputs of the previous layer by 0. Another, perhaps more intuitive, way is to define the activation function of every neuron as

B ∼ Bernoulli(p)    (2.25)

a′(W · x) = B · a(W · x)    (2.26)

After training, dropout is no longer applied. At that time the number of neurons that is active on average increases by a factor 1/p. Therefore the weights of every neuron in the network should be multiplied by a factor of p, to account for the increased average input.


Dropout has been found to consistently improve the generalization of neural networks [24]. To understand why dropout improves performance, and how dropping random connections results in better generalisation, we informally consider the results of applying dropout in two ways.

Firstly, randomly removing (or nullifying) connections from a larger neural network results in a new neural network that is a random subset of the original network, as every neuron from the new network is also present in the original. When dropout is applied after every training step, the next training step trains a different one of these network subsets. After a number of these training steps we combine all these trained subsets into the original network, appropriately weighted (with a factor p). In general, a set of independent strong classifiers can be made to perform better than the individual classifiers when combined properly [9]. Dropout can therefore be regarded as an efficient way to combine an enormous set of strong classifiers. These classifiers might not be independent, but empirically and intuitively a combination of expert opinions outperforms a single expert. This property is analogous to that of random forests.

Secondly, when applying dropout a single neuron in the network cannot rely solely on a complex combination of its inputs, as that input is no longer guaranteed to be there for every iteration of the training process. This is best explained by the following example. Imagine that for a given dataset of images, all the images relating to a guitar have bright green pixels in the top left due to some camera error. Even though these pixels are not a significant feature (we know it's a camera error, the network does not), they can be used to identify images of guitars in the dataset. Applying dropout to such a network would make it so none of the neurons can rely on these pixels being present in the pictures of guitars, and therefore every neuron is forced to learn more reliable features rather than a (possibly complex) unreliable feature.


Chapter 3

Objective Function: Design and Adaptation

The research within computer vision traditionally considers classification errors in a binary manner: a classification is correct or it is not. As discussed in Section 2.2, a network tasked with classifying an input outputs a probability density that represents the chance that the input belongs to each respective class. These probabilities are often considered a network's confidence in each label. In this case the performance of the network is based solely on its confidence in the correct classification label. If a given input belongs to label A, the overall performance is based solely on how confident the network is in this label A.

This binary approach to errors means research generally does not carewhich error occurs, only that an error occurs. When a dataset is structuredin some hierarchy, classifying an oak as a pine tree is of course wrong, but itwould hierarchically be a better guess than classifying that oak as a sportscar.

In what follows several approaches are discussed that introduce some formof hierarchy on the data and/or errors in order to reduce the severity of anerror. At the end of this chapter several open questions will be formulatedwhich will be researched in following chapters.

3.1 Discriminative Categories

When the classes of a dataset form a hierarchy, it can be exploited for more favorable results as opposed to a flat class space. Gao et al. [25] aimed to exploit the hierarchy by building a tree in which each node is a binary SVM. Their main contribution is the reduced computational complexity of structuring a multiclass classifier this way, as opposed to the more general one-vs-all or one-vs-one multiclass classifiers.

Next to the improvement in computational complexity, they also report a significant increase in classification accuracy. Once an image is passed through a binary SVM it is effectively bound to the left or right side of its subtree. This provides a rather intuitive approach to multiclass classification where an image is first classified coarsely, and then classified using more specialist classifiers to perform the finer classification.

Another benefit, not discussed by Gao et al., is that this discriminative classification tree allows for an easy opt-out of a classification. If at some point a binary classifier in the nodes can no longer confidently classify an image, we can allow the global classifier to use this coarser category as a final prediction, rather than a fine-grained category that has a high probability of being wrong (as the classifier is not confident). This idea was further developed in the research discussed in the following section.

3.2 Hedging Your Bets

A more recent approach taken towards introducing this new look on er-rors in computer vision was done by Deng et al. [26]. Their work definesa trade-off between the accuracy and the specificity of a prediction. It isimportant to note these metrics relate to a prediction, not to the network.Whereas both accuracy and specificity are well defined metrics to describeperformance of a network, in this case they are used to describe respectivelythe correctness and information gain of a prediction. The accuracy is either1 (the prediction is correct) or 0. The specificity is a measure for the infor-mation in a prediction. If we label a tree as either a tree or an oak, bothwould be correct but the latter label holds more information and is thereforemore specific.

Deng et al. provide an algorithm to maximize the information given by a prediction (i.e. its specificity) while guaranteeing an arbitrarily high average accuracy. The specificity-accuracy trade-off is best illustrated by the following example.

Consider a classifier trained on a dataset of animals to detect the different species of animals. A classifier that confidently detects everything to be an 'animal' achieves a perfect accuracy as it is never wrong, but the extra information given by the prediction is naught, as the dataset already implies that everything is an animal. On the other hand, a classifier that can predict everything up to the subfamily of an animal's species provides us with maximum information, but such a classifier is bound to be less than perfect (the state-of-the-art top-5 error on ImageNet as of this writing is 6.7% [27]).

Deng et al. provide a way to automatically select the correct level of specificity. A classifier will decide the level within the hierarchy that has both the required accuracy and the highest specificity. A real comparison of how well it performs compared to the state of the art is rather difficult, as no comparable benchmark has been set.

I argue that despite this technique of opting out of a specific classification sounding promising, it is not generally applicable. Often we will still want our classifier to make a best guess, even if there is no guarantee on the correctness of this guess. A more interesting solution to the problem would be a classifier that is as specific as possible and therefore can make mistakes, but whose mistakes are still as semantically close to the ground truth as possible. Consider the case when a classifier is presented with an image of a tree. If the classifier is not sure what type of tree it is given, one trained with the 'hedging your bets' technique would label the image as just a tree. If that image is actually an oak, labelling it as a pine tree would contain just as much information as labelling it as a tree. The problem remains to make sure the classifier does pick one of these close guesses, namely a tree, rather than a completely different label that by coincidence shares a lot of features with the image of the oak.

3.3 Semantic cost

The work of Zhao et al. [28] studies the results of introducing a semantic relatedness of classes in the soft-max likelihood. They consider the benefits of hierarchical approaches to be two-fold. On the one hand it will aid in reducing the severity of the errors, as sketched in the introduction of this chapter, but it can also aid in reducing the effective dimensionality of features. It is shown that classes with a strong semantic relatedness will be able to share features, reducing the total number of features needed to distinguish between all possible classes.

The conditional likelihood is traditionally given by the soft-max:

P(yi | xi, W) = exp(w_yi^T xi) / Σ_k exp(w_k^T xi)    (3.1)

We now aim to introduce a measure S where S_{i,j} measures the semantic relatedness of classes i and j. This leads us to a new likelihood P such that:

P(yi | xi, W) = (1/Z) Σ_{r=1}^{M} S_{yi,r} P(r | xi, W)    (3.2)

Where 1/Z is a normalisation constant such that P is a probability distribution, and M is the total number of classes. Computing this 1/Z and simplifying the expression defines the augmented soft-max as:

P(yi | xi, W) = [ Σ_{r=1}^{M} S_{yi,r} exp(w_r^T xi) ] / [ Σ_{r=1}^{M} Σ_{k=1}^{M} S_{k,r} exp(w_r^T xi) ]    (3.3)

In order to define the measure for semantic relatedness S, Zhao et al. first define a distance matrix D whose elements are given by:

D_{i,j} = length(path(i) ∩ path(j)) / max(length(path(i)), length(path(j)))    (3.4)

path(i) defines the path from the root node (the base class) down the hierarchy to class i, while p1 ∩ p2 denotes the classes that are part of both paths p1 and p2. The matrix S is then defined as:

S = exp(−κ(1−D)) (3.5)

Where κ is a hyper-parameter that governs the decay of the relatedness.
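The augmented soft-max of Equation 3.3 has a compact vectorized form: the numerator for every class is the matrix-vector product of S with the exponentiated scores, and the denominator is the sum of those numerators. The sketch below is my own reading of the equation (names are mine, and the logits stand in for the scores w_r^T x):

```python
import numpy as np

def augmented_softmax(logits, S):
    """Eq. 3.3: blend the soft-max scores of related classes via S.

    logits: (M,) the pre-softmax scores w_r^T x;  S: (M, M) relatedness.
    """
    e = np.exp(logits)
    num = S @ e             # row y gives sum_r S_{y,r} exp(w_r^T x)
    return num / num.sum()  # the double sum in the denominator normalizes

# With S = I this reduces to the ordinary soft-max of Eq. 3.1.
logits = np.array([1.0, 2.0, 0.5])
plain = np.exp(logits) / np.exp(logits).sum()
assert np.allclose(augmented_softmax(logits, np.eye(3)), plain)
```

Checking the identity-matrix case against the plain soft-max is a quick way to confirm the two formulations agree.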

The work of Zhao et al. provided a useful framework on which to basewhat follows.

3.4 Semantic Cross Entropy

Previous work inspired the following research. I will introduce a hierarchy-aware objective function and evaluate the performance of CNNs trained with this new objective function against set benchmarks.

The semantic cross entropy is inspired heavily by the work of Zhao et al. [28] and the cross entropy objective function from Section 2.5.1. The proposed objective can be seen in Equation 3.6, where L is the set of M possible labels.

C(W, X, Y) = −(1/N) Σ_{n=0}^{N−1} Σ_{m=0}^{M−1} S_{Yn,Lm} log(P(Lm | Xn, W))    (3.6)


The values of S_{i,j} describe the semantic relatedness of labels i and j. The inner sum of Equation 3.6 is therefore a sum over all the possible labels, where the (logarithm of the) confidence in each label is weighed by that label's relatedness to the ground truth. If every concept is only related to itself, S will be the identity matrix, as S_{i,j} = 0 for i ≠ j. In that case the inner sum will only be non-zero when Lm = Yn, which reverts the semantic cross entropy back to the general cross entropy.

Perhaps a clearer way to think about the proposed function is that the value of S_{Yn,Lm} allows a network to be somewhat confident in a label Lm that is not the ground truth Yn, provided that label is semantically related to the ground truth.

3.5 Semantic relatedness matrix

Continuing on the work described in Section 3.3 we define a fitting seman-tic similarity measure S to be used in the proposed semantic cross entropy.

We assume the labels of the dataset over which we will train and evaluate our network are structured in a tree. The leaf nodes are the labels we want the network to learn, while the intermediate nodes are (possibly abstract) semantic parents of the lower nodes. In such a dataset we can define a distance between nodes i and j.

D_{i,j} = length(path(i) ∩ path(j)) / max(length(path(i)), length(path(j)))    (3.7)

Defined exactly as in Equation 3.4

Si,j = exp(−κ(1−Di,j)) (3.8)

As a final step the matrix S is normalized such that the rows (or equivalently the columns, due to symmetry) sum to 1. This normalization makes it so none of the labels is more important than any other when used in the semantic cross entropy objective. The impact of κ will be discussed in Section 5.1.
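The construction of Equations 3.7 and 3.8 plus the row normalization can be sketched on a toy hierarchy (the hierarchy, names, and κ value below are my own illustration):

```python
import numpy as np

# Toy hierarchy: root -> dog -> {rottweiler, doberman}, root -> cat.
# path(i) is the chain of nodes from the root down to label i (Eq. 3.7).
paths = {
    "rottweiler": ["root", "dog", "rottweiler"],
    "doberman": ["root", "dog", "doberman"],
    "cat": ["root", "cat"],
}
labels = list(paths)

def relatedness_matrix(paths, labels, kappa=2.0):
    M = len(labels)
    D = np.zeros((M, M))
    for i, a in enumerate(labels):
        for j, b in enumerate(labels):
            shared = len(set(paths[a]) & set(paths[b]))  # |path(i) ∩ path(j)|
            D[i, j] = shared / max(len(paths[a]), len(paths[b]))
    S = np.exp(-kappa * (1 - D))             # Eq. 3.8
    return S / S.sum(axis=1, keepdims=True)  # normalize rows to sum to 1

S = relatedness_matrix(paths, labels)
# The two dog breeds are more related to each other than either is to the cat.
i, j, k = labels.index("rottweiler"), labels.index("doberman"), labels.index("cat")
assert S[i, j] > S[i, k]
```

Larger κ concentrates the rows of S on the diagonal, moving the objective back toward the plain cross entropy.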


3.6 Example use

Consider a simple example where a network is asked to label images aseither a rottweiler, a doberman or a cat. An S matrix could look as follows.

S =
                rottweiler  doberman  cat
    rottweiler     0.7        0.3      0
    doberman       0.3        0.7      0
    cat            0          0        1

We will ask two fictional networks to classify the image of a rottweiler depicted in Figure 3.1. The resulting probabilities are listed in Table 3.1.

Figure 3.1: Image of a rottweiler

                         rottweiler  doberman  cat
    confidence network 1    0.2        0.6     0.2
    confidence network 2    0.3        0.1     0.6

Table 3.1: prediction results for the rottweiler image

The first network seems to think the presented image is that of a doberman. We know this to be wrong, but all in all the picture might not be clear enough to confidently distinguish between the breeds of dogs. We can however be quite sure it is not a cat; therefore network 2 can be considered blatantly wrong.

    network 1   semantic:      −0.7 log(0.2) − 0.3 log(0.6) = 0.43
                non-semantic:  −log(0.2) = 0.70
    network 2   semantic:      −0.7 log(0.3) − 0.3 log(0.1) = 0.66
                non-semantic:  −log(0.3) = 0.52

Table 3.2: semantic and non-semantic cross entropy

Table 3.2 depicts the results of evaluating both the non-semantic (the traditional) and the semantic cross entropy on the results of both networks. The non-semantic cross entropy prefers network 2, while the semantic variant picks network 1.


Now imagine that the ground truth on Figure 3.1 is wrong, and that the image in fact depicts a doberman. By introducing the semantic relatedness of labels we have made the network more resilient to errors in the training set, as long as those errors are still semantically related to the true label.

3.7 Research questions

• Current research shows that the introduction of hierarchy or semantic relatedness of concepts in classifiers can speed up training and improve classifier accuracy. Will the introduction of hierarchy in CNNs result in similar or other noticeable improvements?

Can a CNN be made semantically aware? If we give it some measure of how concepts are semantically related, will that lead to a network that outperforms others on semantically inspired criteria? For instance, if a CNN is not only trained on the different breeds of cats and dogs, but also on the fact that all of those labels are either a cat or a dog, will such a network correctly predict the animal even if it isn't correct in its prediction of the breed?

• How can we evaluate the performance of a hierarchy-aware classifier?

Most benchmarks in literature report the error on a test set. This means we can't effectively compare results set on traditional benchmarks with those achieved with a hierarchy-aware classifier. I argue that the total number of errors will increase, but that the average severity of an error will decrease considerably. An effort will have to be made to relate these new results to those reported in literature.


Chapter 4

Research Strategy

4.1 Hardware Setup

The experiments discussed in the following sections were run on an Ubuntu 14.04 LTS machine at the iMinds iLab.t Virtual Wall. The full specifications of the machine can be seen in Table 4.1.

GPU        NVIDIA GeForce GTX980
CPU        Intel Xeon E5-2620 12-core
memory     2x Samsung 16Gb 288-Pin DDR4 SDRAM
L1 cache   384Kb
L2 cache   1536Kb
L3 cache   15Mb
storage    OCZ Deneva 2 480Gb SSD

Table 4.1: Hardware specifications

The NVIDIA GeForce GTX980 has 2048 CUDA cores and was reported to be among the top-3 GPUs for deep learning in numerous benchmarks set in 2015. Despite the power of one GPU, time and memory are still a bottleneck in many cases. For this reason many researchers and deep learning tools have made the transition to multi-GPU systems [29][30]. In my work, the single GPU proved enough.

4.2 Dataset selection

4.2.1 Belgian Traffic Signs

Most of my early exploration of CNNs was done on a dataset of labelled traffic signs collected from Google Street View. As these images are extracted from Street View, they vary heavily in rotation, scale, and lighting due to weather or time of day, time-worn paint, etc. Therefore one of the main hurdles in this dataset was the normalization of the images.

Figure 4.1: Example from the traffic sign dataset

Next to normalization, the results on this dataset improved considerably with extensive data augmentation. As already mentioned in Section 2.6.2, data augmentation is highly dependent on the task at hand. In the case of classifying traffic signs, horizontally mirroring all the images is very effective, while vertically mirroring effectively worsens performance.

It is important to keep in mind that traffic signs were designed to be easily recognizable. It is therefore no surprise that even the simplest CNNs can reach error rates of less than 5% with proper normalization and augmentation. The dataset does not contain the required complexity to allow noticeable improvements. It is for this reason that I decided not to perform experiments with the proposed semantic cross entropy on this dataset.

4.2.2 ImageNet

ImageNet is one of the most competitive datasets in computer vision. As of this writing it contains 14,197,122 high-resolution images, labelled with one or more of 21,841 categories. The labelling is non-exclusive and semantically structured according to WordNet, a lexical database for the English language maintained at Princeton [31]. An example is given in Figure 4.2.

At first glance ImageNet is an ideal candidate for experimentation. The labels are structured semantically according to WordNet, and both the number of labels and the resolution of the images make the task complex enough to allow improvement. The size of the dataset however proved to be too large for the scope of this thesis.


Figure 4.2: Example from ImageNet. A few of its labels are Panda, Giant Panda, Mammal and Vertebrate

4.2.3 CIFAR

Collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton at the University of Toronto, the CIFAR dataset contains 60000 tiny images [32]. The images can be labelled either with 10 coarse labels or with 100 finer labels, respectively called the CIFAR-10 and CIFAR-100 datasets. All labels are mutually exclusive: a single image has only a single label. These labels are not inherently hierarchical, but as CIFAR-100 and CIFAR-10 represent respectively fine and coarse labelling over the same dataset, a single-step hierarchy can be found nonetheless.

Figure 4.3: Example from CIFAR, 32x32 tiny color image labelled as dog
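The single-step hierarchy mentioned above can be recovered programmatically whenever each image carries both a fine and a coarse annotation. The sketch below is a minimal illustration (the function name and the toy label arrays are mine, standing in for the real per-image annotations): it derives the fine-to-coarse mapping by co-occurrence and rejects inconsistent pairings.

```python
def build_fine_to_coarse(fine_labels, coarse_labels):
    """Map each fine label to the coarse label it always co-occurs with."""
    mapping = {}
    for f, c in zip(fine_labels, coarse_labels):
        # setdefault keeps the first coarse label seen for this fine label
        if mapping.setdefault(f, c) != c:
            raise ValueError(f"fine label {f!r} appears under two coarse labels")
    return mapping

# Toy stand-in for paired fine/coarse annotations over the same images
fine   = [3, 3, 7, 7, 1]   # e.g. 'shark', 'shark', 'ray', 'ray', 'trout'
coarse = [0, 0, 0, 0, 0]   # all 'fish'
print(build_fine_to_coarse(fine, coarse))  # {3: 0, 7: 0, 1: 0}
```

The same idea extends to deeper hierarchies by chaining such parent maps.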


An example of a CIFAR image can be seen in Figure 4.3. All images depict natural phenomena and are scaled to 32x32 pixels. Despite the relatively low resolution, the resulting classification task is not straightforward. This can be attributed to the fact that natural images do not contain easily distinguishable features, as was the case with the traffic signs.

Despite the labels not having an inherent hierarchy, I argue the CIFAR dataset is the best choice to evaluate the impact of the semantic cross entropy. The popularity of the dataset among research groups means I can compare results and base many choices on literature. The results on CIFAR still have a margin for improvement, and the label space is small enough to organize the labels manually and individually into a semantic hierarchy. I have done this organization and structured the labels to form a tree of related concepts: the leaf nodes are the fine-grained labels of CIFAR-100, and all non-leaf nodes are abstractions of what the labels represent. The resulting tree can be seen in Figure 4.4.

4.3 Network Architecture and Training Specification

The best result on CIFAR-100 at the time of writing is held by Clevert et al. with an accuracy of 75.72% [13]. Their main contribution was the choice of ELUs as activation functions, which is discussed at length in Section 2.4.3. The data augmentation used by Clevert et al. was minimal, and the reported time needed to train the full network is manageable. For these reasons the network architecture used in my experiments is heavily inspired by that of Clevert et al., albeit smaller.

The architecture described in Table 4.2 will be used throughout the rest of this thesis. The activation function used in all layers except the final softmax is the ELU, as described by Clevert et al. Dropout is applied to the output of the last convolutional layer and to the output of every MaxPool above it, with respective dropout rates of [0.5, 0.4, 0.3, 0.2, 0.2, 0]. The size of the filters differs slightly from that proposed in [13]. Uneven filters can (with proper padding) preserve the size of the input. The proposed 2x2 filter sizes require either complex padding schemes or systematic upsampling of the output. As this issue is not addressed in their work, I opted for a solution that involves 3x3 filters and a simple padding scheme.

layer           filters   filter size
convolutional   384       3x3
convolutional   384       1x1
MaxPool                   2x2
convolutional   384       1x1
convolutional   480       3x3
convolutional   480       3x3
MaxPool                   2x2
convolutional   480       1x1
convolutional   520       3x3
convolutional   520       3x3
MaxPool                   2x2
convolutional   540       1x1
convolutional   560       3x3
convolutional   560       3x3
MaxPool                   2x2
convolutional   560       1x1
convolutional   600       3x3
convolutional   600       3x3
MaxPool                   2x2
convolutional   600       1x1
softmax         100

Table 4.2: Network architecture

The use of 1x1 filters was first introduced by Lin et al. [33] and represents a (small) fully connected layer that is evaluated over all features in a feature map. This equivalence of 1x1 filters and fully connected layers is also visible in the last layer before the softmax layer: rather than introducing a fully connected layer to classify the features learned in the previous layer, we perform the same task with a convolutional layer that has a 1x1 filter.
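The 1x1-filters-as-fully-connected equivalence described above can be checked numerically. The following sketch (pure NumPy; the shapes and names are mine, chosen arbitrarily) applies the same weight matrix once as a 1x1 convolution over a feature map and once as a fully connected layer at every spatial position:

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out, H, W = 4, 3, 5, 5
fmap = rng.normal(size=(C_in, H, W))       # input feature map
weights = rng.normal(size=(C_out, C_in))   # 1x1 filter bank == FC weight matrix

# A 1x1 convolution is a matrix multiply over the channel axis at every pixel
conv_out = np.einsum('oc,chw->ohw', weights, fmap)

# The same weights applied as a fully connected layer per spatial location
fc_out = np.empty((C_out, H, W))
for y in range(H):
    for x in range(W):
        fc_out[:, y, x] = weights @ fmap[:, y, x]

assert np.allclose(conv_out, fc_out)  # identical outputs
```

Both paths produce the same (C_out, H, W) output, which is why the two layer types can be used interchangeably here.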

The network was trained with stochastic gradient descent with a batch size of 100. The initial learning rate was set at 0.01 and decayed by a factor 1/1.05 after every epoch. Momentum was used with a parameter of 0.9, as was L2 weight decay with a strength of 0.0004.

Data augmentation was done at runtime during training. At every iteration in the training process, the 100 images in a batch were randomly horizontally mirrored. Every image was also padded with 4 zero pixels at every border, and at every iteration a random 32x32 sample was cropped from these padded images, on which the network was trained.
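The augmentation above can be sketched as follows (a minimal NumPy version; the function name and the channels-first layout are my assumptions):

```python
import numpy as np

def augment_batch(batch, pad=4, crop=32, rng=None):
    """Random horizontal mirroring plus a random crop from a zero-padded
    copy, applied to every training batch at runtime."""
    rng = rng or np.random.default_rng()
    out = np.empty_like(batch)
    for i, img in enumerate(batch):            # img: (channels, 32, 32)
        if rng.random() < 0.5:
            img = img[:, :, ::-1]              # mirror left-right
        padded = np.pad(img, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
        y, x = rng.integers(0, 2 * pad + 1, size=2)  # crop offsets in [0, 2*pad]
        out[i] = padded[:, y:y + crop, x:x + crop]
    return out

batch = np.random.rand(100, 3, 32, 32).astype(np.float32)
assert augment_batch(batch).shape == (100, 3, 32, 32)
```

Because the crops are resampled every iteration, the network effectively never sees the exact same input twice.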

Training the network as such for 80 epochs, spanning 28 hours, with traditional cross entropy as an objective resulted in an accuracy of 69.24% on CIFAR-100. As the network is considerably smaller than that of [13], these results are a good indication that this network performs well on CIFAR. Also worth noting is that the best performance on CIFAR-100 in May 2015 was 67.7% [34], which this network outperforms.

4.4 Overcoming numerical instability

The semantic cross entropy as defined will not perform well as such, due to a numerical instability. An optimal output for a network trained with semantic cross entropy would have confidence distributed over all semantically related labels for a given input, strongly peaked at the ground truth. This means that, much like the non-semantic cross entropy, it enforces many confidences to be 0, as there are many labels that are not semantically related. In non-semantic cross entropy these zeroes are harmless, as only the confidence in the ground truth is taken into account. Those zeroes do however prove to be an issue in semantic cross entropy.

The inner sum of Equation 3.6 reveals the issue more formally. When

S_{Y_n, L_m} → 0,

the optimal confidence in the unrelated label L_m tends to zero as well,

P(L_m|X_n, W) → 0,

and hence

log(P(L_m|X_n, W)) → −∞.

Therefore

S_{Y_n, L_m} · log(P(L_m|X_n, W)) → 0 · (−∞),

an indeterminate form. The resulting product is therefore numerically unstable. We can however quickly solve this by clipping all probabilities to the interval [10^−a, 1 − 10^−a], where a is chosen to be 3.
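A minimal sketch of the clipped objective (NumPy; the names are mine, and S is assumed row-indexed by the ground-truth label as in Equation 3.6):

```python
import numpy as np

def semantic_cross_entropy(probs, truth, S, a=3):
    """Semantic cross entropy with the clipping trick.

    probs: (N, L) softmax outputs, truth: (N,) ground-truth indices,
    S: (L, L) relatedness matrix, row-indexed by the ground truth.
    Clipping to [10**-a, 1 - 10**-a] avoids the 0 * log(0) instability."""
    eps = 10.0 ** -a
    clipped = np.clip(probs, eps, 1.0 - eps)
    # inner sum over labels, weighted by semantic relatedness, then batch mean
    return -np.mean(np.sum(S[truth] * np.log(clipped), axis=1))

# With S equal to the identity this reduces to the traditional cross entropy.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
truth = np.array([0, 1])
S_eye = np.eye(3)
expected = -np.mean(np.log([0.7, 0.8]))
assert np.isclose(semantic_cross_entropy(probs, truth, S_eye), expected)
```

The clipping bounds the log terms, so the weighted sum stays finite regardless of how small any confidence becomes.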

4.5 Tier-n Accuracy

In order to evaluate which errors are made, we define a new metric: the tier-n accuracy.

Definition. A label is tier-n accurate given hierarchy H if the nth ancestor of that label is equal to the nth ancestor of the ground truth in the same hierarchy H.

Consider the following example, where hierarchy H is as defined in Figure 4.4. Predicting a 'ray' to be a 'shark' is tier-1 accurate, as they both share the 'fish' parent. Predicting the same 'ray' to be a 'bee' is not tier-1 accurate, but it is tier-2 accurate, as they both have the 2nd ancestor 'animals'. Tier-0 accuracy is equivalent to accuracy on the leaf nodes: a prediction is only tier-0 accurate if the prediction is equal to the ground truth.


The tier-n accuracy of a set of predictions can then be defined as the fraction of those predictions that are tier-n accurate.
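The definition translates directly into code (names are mine; the toy hierarchy follows the 'ray'/'shark'/'bee' example above):

```python
def tier_n_accurate(pred, truth, parent, n):
    """A prediction is tier-n accurate if its n-th ancestor equals the
    n-th ancestor of the ground truth; tier-0 is plain equality."""
    for _ in range(n):
        pred, truth = parent[pred], parent[truth]
    return pred == truth

def tier_n_accuracy(preds, truths, parent, n):
    """Fraction of a set of predictions that is tier-n accurate."""
    hits = sum(tier_n_accurate(p, t, parent, n) for p, t in zip(preds, truths))
    return hits / len(preds)

# Toy fragment of the hierarchy in Figure 4.4
parent = {'ray': 'fish', 'shark': 'fish', 'bee': 'insects',
          'fish': 'animals', 'insects': 'animals'}

assert not tier_n_accurate('shark', 'ray', parent, 0)  # wrong leaf
assert tier_n_accurate('shark', 'ray', parent, 1)      # both 'fish'
assert tier_n_accurate('bee', 'ray', parent, 2)        # both 'animals'
```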


Figure 4.4: Tree of related labels in CIFAR


Chapter 5

Evaluation and Results

5.1 Selecting κ

We begin by visualizing the effect κ has on the matrix S, with D constructed according to Section 3.5. The results can be seen in Figure 5.1.

Figure 5.1: Visualization of the effect of κ on the S matrix

Looking at the diagonals in Figure 5.1 can help us select an appropriate κ. For κ = 2 we see that the elements on the diagonal are approximately 0.06. Such a choice for κ would result in an objective function that depends for only 6% on the confidence in the ground truth. Inversely, this means 94% of the objective is determined by the confidence in related labels, which is typically not what we want.

Selecting κ = 8 results in an S matrix with 0.9 on the diagonal. Such an S matrix has more use cases, as the objective is still dominated by the ground truth. The other 10% of the objective depends on how confident the network is in related labels. An optimal network under this constraint therefore has to learn in which way semantically similar labels are similar, and output probabilities accordingly.

When introducing the parameter κ, its influence on the S matrix was examined. For small κ the S matrix would become almost uniform and the model would therefore never confidently distinguish between classes. A large value for κ led to an S matrix that closely resembled the identity matrix, resulting in a model that does not noticeably differ from one trained with traditional cross entropy.

Despite these intuitive limitations, we still have to find a way to select the best κ for the task at hand. Fully training the network takes roughly 30 hours, so we can't simply sweep a set of possible values for κ and fully train a network for each. Instead, a network was trained to 53.35% accuracy, or equivalently an error rate of 0.4665. This network was then fine-tuned for 2 hours for each κ in the set {2, 4, 6, 8, 10, 16, 32}. The initial learning rate was set at 0.001, and decayed every epoch by a factor 1/1.05. The results are shown in Figures 5.2 and 5.3.

Figure 5.2: Error rates of fine-tuning a network with various values for κ

Figure 5.2 reveals a very noticeable step during the first epochs. To explain this behavior, first consider that the network at epoch 0 is in fact a network that has been trained with non-semantic cross entropy to 53.35% accuracy. The weights of this network have consequently been optimized for the non-semantic objective. Recall from previous discussions that semantic cross entropy differs increasingly from non-semantic cross entropy as κ gets closer to 0. The weights of the networks trained with κ = 2 and κ = 4 therefore have to be completely readjusted in the first training iterations, whereas the networks with κ ≥ 6 optimize an objective more closely related to non-semantic cross entropy.

Figure 5.3: Zoom on the last epochs of Figure 5.2

Despite this initial step, the trend still seems clear: a higher value for κ results in a lower error rate. We do keep in mind that we wish to evaluate the impact of the semantic cross entropy, and that the S matrix should therefore differ sufficiently from the identity matrix. Careful consideration led to a choice of κ = 8 as the ideal parameter for experimentation with the semantic cross entropy. This choice of κ defines an S matrix sufficiently different from the identity matrix, as can be seen in Figure 5.1, while not destructively lowering the overall accuracy.

With κ chosen to be 8, we build and train a second network exactly like the one that reached an accuracy of 69.24% in Section 4.3. This second network differs only in the choice of objective function: it is trained with a semantic cross entropy where κ = 8.

5.2 Network Accuracy

We begin by examining the overall accuracy of both the network trained with non-semantic cross entropy and the network trained with the semantic variant where κ = 8. CIFAR-100 has a predefined test set of 10,000 images; the accuracy on this set across training epochs can be seen in Figures 5.4 and 5.5, summarized in Table 5.1.

Figure 5.4: Test error rate shown for 80 epochs

Figure 5.5: Test error rate for the last 30 epochs


                  non-semantic   semantic
lowest error          0.3076       0.3162
highest accuracy      69.24%       68.38%

Table 5.1: Lowest error, or equivalently highest accuracy, for both networks

As was already clear when we selected a value for κ, the semantic cross entropy does not improve overall accuracy on CIFAR-100. Whereas the non-semantic cross entropy trains a network such that its confidence in the ground truth is as high as possible, the semantic variant does not directly enforce this behavior. Networks trained with semantic cross entropy can often lower their objective by taking confidence away from the ground truth and distributing it over semantically related labels. In some cases the semantic objective can therefore cause a network's most confident label to shift from the ground truth to a semantically related concept.

Given the definition of tier-n accuracy in Section 4.5, it is tempting to evaluate the tier-1 accuracy of both networks in order to see if the errors are semantically related. This would be incorrect, as the overall tier-1 accuracy is heavily influenced by the tier-0 accuracy: any prediction that is tier-0 accurate is of course tier-1 accurate, but such a prediction is not considered an error, whereas we want to find out whether the errors are more semantically related. Therefore we take from both networks the set of tier-0 errors and evaluate the tier-1 accuracy on these errors. In other words, we aim to evaluate how many of the errors are children of the same parent node as the ground truth. The results can be seen in Table 5.2.

                  non-semantic   semantic
tier-1 accuracy       28.13%       35.46%

Table 5.2: Tier-1 accuracy evaluated on tier-0 errors made by both networks

For completeness the tier-1 accuracy is also evaluated over all predictions; the results can be found in Table 5.3.

                  non-semantic   semantic
tier-1 accuracy       77.60%       77.76%

Table 5.3: Tier-1 accuracy evaluated on all predictions made by both networks

Page 70: Semantic Relatedness in Convolutional Neural Networks Paul

52 CHAPTER 5. EVALUATION AND RESULTS

As already mentioned, the tier-1 accuracy is heavily influenced by the tier-0 accuracy. As the non-semantic network's tier-0 accuracy is 0.86% higher (which can be seen in Table 5.1), its overall tier-1 accuracy is only slightly worse than that of the semantically trained network. Table 5.2 does however reveal the expected behavior: evaluated over the errors only, the tier-1 accuracy increases significantly, by 7.33%.

The results described in this section reveal that a semantically trained network will make (slightly) more errors, but those errors are on average more semantically related to the ground truth.

5.3 Network Confidence

Figure 5.6: Non-semantic cross entropy for 80 epochs

Figure 5.4 shows that the overall accuracies of the semantically and non-semantically trained networks are quite comparable. Next to the accuracy, the non-semantic cross entropy was also evaluated on all the predictions made on the test set by both networks; the results can be seen in Figure 5.6. Remember that the cross entropy is merely a measure of the average confidence in the ground truth: the lower it is, the higher the average confidence. The figure therefore tells us that the network trained with semantic cross entropy is significantly less confident, but its highest confidence is still in the ground truth.

This could be interpreted in several ways: the value κ = 8 made the network uniformly less confident in its predictions; or it loses confidence only for some samples (as we have only measured the average confidence); or the network behaves as discussed in previous sections, namely that when it is unsure it distributes confidence over semantically related labels. To differentiate between these cases, a few summary statistics are shown in Table 5.4.

statistic   non-semantic   semantic
mean               0.912      0.591
median             0.999      0.628
std                0.164      0.286

Table 5.4: Statistics of the highest confidence on each image in the test set

These summary statistics show that the semantic cross entropy does not just cause the highest confidence to drop uniformly, as the standard deviation nearly doubles. In order to find out what does happen, we perform the following experiment. We let a network generate predictions for each image in the test set. For each of these predictions we compare the highest confidence to a given threshold, and split the predictions accordingly. This results in two sets of predictions: one where the highest confidence is larger than the given threshold, and one where it is lower. On both of these sets we calculate the accuracy, i.e. how many of the predictions have their highest confidence in the ground truth. The results can be seen in Tables 5.5 and 5.6.
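The experiment can be sketched as follows (NumPy; the function name and return keys are mine, and `probs` stands for the softmax outputs on the test set):

```python
import numpy as np

def accuracy_by_threshold(probs, truth, threshold):
    """Split predictions on whether the highest confidence reaches
    `threshold` and report the size and accuracy of each subset."""
    conf = probs.max(axis=1)
    correct = probs.argmax(axis=1) == truth
    above = conf >= threshold
    return {'# above': int(above.sum()),
            'acc above': float(correct[above].mean()) if above.any() else None,
            '# below': int((~above).sum()),
            'acc below': float(correct[~above].mean()) if (~above).any() else None}

# Tiny synthetic example: the confident predictions happen to be correct
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.55, 0.45], [0.45, 0.55]])
truth = np.array([0, 0, 1, 1])
split = accuracy_by_threshold(probs, truth, 0.6)
```

Running the same split with `threshold` equal to the mean, or the mean plus the standard deviation, of `conf` reproduces the rows of Tables 5.5 and 5.6.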

threshold                # images   accuracy
none                        10000      0.684
≥ mean (= 0.591)             5271      0.885
≥ mean + std (= 0.877)       2419      0.962
< mean (= 0.591)             4729      0.399

Table 5.5: Accuracy for varying thresholds with semantic cross entropy

Table 5.5 reveals that when the semantically trained network's confidence in a label is larger than the average confidence, it is correct 88.5% of the time. If its confidence is larger than the average plus the standard deviation, the label is almost certainly correct, with a probability of 96.2%. In other words, the network is rather conservative in its confidence (as the average is only 0.591), but a high confidence indicates a high probability that its confidence is in the correct label.

threshold            # images   accuracy
none                    10000      0.692
≥ mean (= 0.912)         7519      0.791
≥ 0.99                   6229      0.860
< mean                   2481      0.292

Table 5.6: Accuracy for varying thresholds with non-semantic cross entropy

Table 5.6 tells us that the non-semantically trained network tends to be overly confident. A confidence higher than the average confidence means the predicted label has only a 79.1% chance of being correct. Even when the network predicts a label with a confidence of 0.99, there is still only an 86% chance that it is correct.

Figure 5.7: Accuracy for specific confidence intervals for the semantically trained network

To further analyse this behavior, the predictions are binned according to their confidence. For each of these bins the actual accuracy is calculated and shown in Figure 5.7 for the semantically trained network and in Figure 5.8 for the non-semantically trained network.

Figure 5.8: Accuracy for specific confidence intervals for the non-semantically trained network
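The binning procedure behind these figures can be sketched as follows (NumPy; names are mine, and empty bins are reported as NaN):

```python
import numpy as np

def confidence_bins(probs, truth, n_bins=10):
    """Bin predictions by their highest confidence and compute the
    actual accuracy within each bin, as in Figures 5.7 and 5.8."""
    conf = probs.max(axis=1)
    correct = probs.argmax(axis=1) == truth
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    counts, accs = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # last bin is closed on the right so a confidence of 1.0 is counted
        in_bin = (conf >= lo) & (conf <= hi if hi == 1.0 else conf < hi)
        counts.append(int(in_bin.sum()))
        accs.append(float(correct[in_bin].mean()) if in_bin.any() else float('nan'))
    return counts, accs

probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.55, 0.45], [0.45, 0.55]])
truth = np.array([0, 0, 1, 1])
counts, accs = confidence_bins(probs, truth, n_bins=2)
```

A well-calibrated network is one for which the per-bin accuracy tracks the bin's confidence range, which is exactly what the semantically trained network exhibits.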

These results confirm that the confidence of a semantically trained network is more tightly coupled to the actual probability that the predicted label is correct. Whereas the prediction confidence of a network trained with the non-semantic cross entropy is typically high, that confidence does not properly reflect whether we can actually trust the network's prediction. Should a network trained with semantic cross entropy exhibit a large confidence, we can be quite sure that the label is in fact the correct one. We will informally refer to this well-calibrated output confidence as the trustworthiness of the network.

5.4 Semantic cross entropy as regularisation

So far we have only considered semantic cross entropy with an S matrix that reflects the semantic relations among possible labels. There is however no guarantee that the results described in the previous section are due to this semantically inspired S matrix. To further evaluate the impact of semantics, consider an S matrix such that we can rephrase the definition of semantic cross entropy as shown in Equation 5.1. This rephrasing is possible when S has 1 on the diagonal and α for all elements off the diagonal. This choice of S is no longer semantically inspired, and due to its values it is henceforth called the 'uniform' S matrix.

C(W, X, Y) = −(1/N) ∑_{n=0}^{N−1} [ log P(Y_n|X_n, W) + α ∑_{L_m ≠ Y_n} log P(L_m|X_n, W) ]        (5.1)

From Equation 5.1 it becomes clear that this particular choice of S results in an objective where the confidence of the network is constrained. Should the confidence in the ground truth be close to 1, the other confidences will be vanishingly small, resulting in a large value for the inner sum. Conversely, the value of α ensures the network can't drop the confidence in the ground truth to 0 in order to minimize the inner sum: with α << 1 the objective is still dominated by the confidence in the ground truth. Figure 5.9, Figure 5.10 and Table 5.7 summarize the results of a network trained with this uniform matrix.
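A sketch of this uniform variant (NumPy; names are mine, and the clipping from Section 4.4 is reused), checking that the rephrased Equation 5.1 agrees with the generic semantic form:

```python
import numpy as np

def uniform_S(n_labels, alpha):
    """The 'uniform' S matrix: 1 on the diagonal, alpha everywhere else."""
    return np.full((n_labels, n_labels), alpha) + (1.0 - alpha) * np.eye(n_labels)

def regularized_cross_entropy(probs, truth, alpha, a=3):
    """Equation 5.1 written out directly: the usual log term plus
    alpha times the log-confidences of all other labels."""
    eps = 10.0 ** -a
    logp = np.log(np.clip(probs, eps, 1.0 - eps))
    own = logp[np.arange(len(truth)), truth]    # confidence in the ground truth
    others = logp.sum(axis=1) - own             # all remaining labels
    return -np.mean(own + alpha * others)

# Agrees with the generic semantic form evaluated with S = uniform_S(...)
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
truth = np.array([0, 1])
S = uniform_S(3, 0.1)
generic = -np.mean(np.sum(S[truth] * np.log(np.clip(probs, 1e-3, 1 - 1e-3)), axis=1))
assert np.isclose(regularized_cross_entropy(probs, truth, 0.1), generic)
```

Because every off-diagonal entry is the same, the objective no longer encodes any relation between specific labels; only the overall confidence distribution is constrained.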

Figure 5.9: Test error rate shown for 80 epochs

The results from Figure 5.9, summarized in Table 5.7, are quite surprising. Not only does the introduction of a uniform S cause the network to train faster, it also lowers the error by 0.0074, or equivalently increases accuracy by 0.74%.

Figure 5.10: Test error rate for the last 30 epochs

                  non-semantic   semantic   uniform
lowest error          0.3076       0.3162     0.302
highest accuracy      69.24%       68.38%    69.98%

Table 5.7: Lowest error, or equivalently highest accuracy, for the three networks

The eventual outputs are analyzed in the same way as those in Section 5.3; the results are shown in Figure 5.11.

Figure 5.11 shows the same behavior as we discovered when using a semantically inspired S. These results indicate that the trustworthy predictions are a result of the regularization effect of semantic cross entropy and not of the added semantic knowledge. Finally, we also review the tier-1 accuracy of tier-0 errors on this new network, much like in Section 5.2.

                  non-semantic   semantic   uniform
tier-1 accuracy       28.13%       35.46%    33.59%

Table 5.8: Tier-1 accuracy evaluated on tier-0 errors made by all three networks

Table 5.8 shows that networks trained with the semantically inspired S achieve a higher tier-1 accuracy on their tier-0 errors than one trained with the uniform S.

Figure 5.11: Accuracy for specific confidence intervals for the network trained with uniform S


Chapter 6

Conclusion and Future Work

6.1 Conclusions

In the previous chapter we have shown that the semantic cross entropy behaves much as we hoped with regard to accuracy. A network trained with semantic cross entropy makes slightly more errors than a network trained with non-semantic cross entropy, but those errors are semantically more related to the ground truth. On the CIFAR-100 dataset the semantic network achieved an accuracy of 68.38%, while a non-semantically trained network reached 69.24%. Of the errors made by the semantic network, 35.46% still have the same hierarchical parent as the ground truth. This is only true for 28.13% of the errors made by the non-semantic network.

Next to having shown what we set out to show, two other interesting results were discovered. For one, semantic cross entropy produces networks whose confidence is closely related to the actual probability of being correct. These trustworthy confidences can however also be achieved when the S matrix is not semantically inspired but rather chosen to be uniform and peaked at the diagonal. Moreover, in that case the overall accuracy improved slightly compared to traditional non-semantic cross entropy.

This last discovery puts a different perspective on the entirety of this thesis. Whereas the semantics between labels have been the dominant inspiration for what ultimately led to semantic cross entropy, dropping the semantic aspect seems to result in more interesting cases. Semantic cross entropy with a uniform S matrix (no longer based on the semantics) acts as a regularized cross entropy. This regularized cross entropy shows great promise in terms of accuracy and trustworthy confidences.


6.2 Future Work

The experiments and results discussed in this thesis lead to a couple of questions. The research that follows from them will hopefully further extend the understanding of semantic cross entropy.

6.2.1 Does semantic cross entropy with a uniform S consistently outperform non-semantic cross entropy in terms of accuracy?

We saw that the network trained with semantic cross entropy achieved a higher accuracy on CIFAR-100 when the S matrix was selected to be the uniform S described in Section 5.4. The reported accuracy was still well below the highest achieved accuracy (75.72% as of this writing), and it is tempting to see if this improvement is still visible at those levels of accuracy. The network used in this thesis was substantially smaller, as it was never the goal to improve on this state of the art. From what is shown in this thesis, I argue that such an improvement would still be visible when introducing semantic cross entropy.

6.2.2 Semantic Cross Entropy in Detection

Detection is the task of finding all occurrences of a given entity in an image. For example, consider the task of finding and correctly labelling all the leaves in an image of a tree. Such a task is typically handled by running a classifier on different crops of the image; when the classifier is sufficiently confident, the algorithm assumes there is a leaf in the crop. A big problem here is selecting how confident a classifier needs to be before it is sufficiently confident. In the case of a classifier trained with semantic cross entropy, we can relate this selection directly to the required levels of accuracy.

6.2.3 Erroneous Truth Labelling

Smaller datasets are manually labelled and the concepts are simple enough that we can trust these labels. This assumption no longer holds when it comes to giant datasets: humans tend to make errors due to fatigue or insufficient knowledge about labels. Without costly methods to prevent errors, these datasets will always be error prone.

When the errors show that two labels are completely inseparable by humans, a solution would be to simply merge those labels and treat them as equal. No such solution can easily be thought of for the more prevalent case where labels are often confused, but not quite interchangeable. I argue that in such cases a semantically aware objective function could aid training. Rather than simply training on the errors, a well-constructed distance matrix, where the distance reflects the probability of an error, would yield a model that is not punished heavily for errors against classes on which the ground truth is erroneous. This would not only better reflect how a network should handle an erroneous dataset (i.e. allow the network to go against the ground truth); I argue it would also generalize better, as it will not overfit on those datapoints that are erroneous.

6.2.4 Evaluation on Complex Datasets

The work presented in this thesis ultimately led to an evaluation on the CIFAR-100 dataset. It proved to be a good choice for a proof of concept, as it is small enough to be trained on a single GPU in a manageable amount of time while still being complex enough to properly reflect the capabilities of both CNNs and the proposed objective function. However, as a conjecture I claim that the proposed objective function would have a stronger impact on a more complex dataset, where samples of semantically related labels are harder to distinguish.

On such a dataset, non-semantic objective functions train a network to classify with high confidence even labels for which such high confidence may not actually be attainable. When labels are hard to distinguish, this forced distinction will cause the classifier to base its prediction on overly complex features that have a high chance of being overfitted on the training data. A semantically aware objective, on the other hand, would train a network to be as confident as reasonably possible, and no more.
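One standard way to quantify "as confident as reasonably possible" is the expected calibration error (ECE), which is not part of this thesis but directly measures the gap between reported confidence and observed accuracy; a minimal sketch:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: bin predictions by confidence, then
    average |mean confidence - accuracy| per bin, weighted by bin size.
    Zero means the reported confidence matches the observed probability
    of being right; an overconfident classifier scores high.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Map confidence in [0, 1] to a bin index; clamp 1.0 to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += len(b) / total * abs(avg_conf - accuracy)
    return ece
```

Under the conjecture above, a network trained with a non-semantic objective on a hard dataset would show a large ECE, while one trained with the semantic cross entropy would stay close to zero.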

