
  • Modern Convolutional Neural Network

    techniques for image segmentation

    Deep Learning Journal Club

    Gioele Ciaparrone

    Michele Curci

    November 30, 2016

    University of Salerno

  • Index

    1. Introduction

    2. The Inception architecture

    3. Fully convolutional networks

    4. Hypercolumns

    5. Conclusion

    2

  • Introduction

  • CNN recap

    Sequence of convolutional and pooling layers

    Rectifier activation function

    Fully connected layers at the end

    Softmax function for classification

    4
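    As a concrete sketch of this layout (PyTorch is used here purely for illustration; the layer sizes are arbitrary, not from any network discussed later):

    import torch
    import torch.nn as nn

    # conv -> ReLU -> pool, repeated, then fully connected layers + softmax
    recap_cnn = nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5),   # convolutional layer
        nn.ReLU(),                        # rectifier activation
        nn.MaxPool2d(2),                  # pooling layer
        nn.Conv2d(6, 16, kernel_size=5),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(16 * 4 * 4, 10),        # fully connected layer at the end
        nn.Softmax(dim=1),                # class probabilities
    )

    x = torch.randn(1, 1, 28, 28)         # one 28x28 grayscale image
    print(recap_cnn(x).shape)             # torch.Size([1, 10])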

  • Convolution I

    5

  • Convolution II

    Valid padding (left) and same padding (right) convolutions

    6
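    For concreteness, the two padding modes can be compared in a few lines of PyTorch (an illustrative sketch; the 5x5 input and 3x3 kernel are arbitrary choices):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 1, 5, 5)                        # 5x5 input
    valid = nn.Conv2d(1, 1, kernel_size=3, padding=0)  # valid: no padding
    same = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # same: zero padding

    print(valid(x).shape)  # torch.Size([1, 1, 3, 3]) -- output shrinks
    print(same(x).shape)   # torch.Size([1, 1, 5, 5]) -- size preserved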

  • LeNet-5 (1989-1998)

    First CNN (1989) proven to work well, used for handwritten ZIP code recognition [1]

    Refined through the years until the LeNet-5 version (1998) [2]

    7

  • LeNet-5 interactive visualization [3]

    It's possible to interact with the network in 3D, manually drawing a digit to be classified, clicking on the neurons to get info about the parameters and the connected units, or rotating and zooming the network:

    http://scs.ryerson.ca/~aharley/vis/conv/

    8


  • AlexNet (2012) [5]

    After a long hiatus in which deep learning was ignored [4], neural networks received attention once again when Alex Krizhevsky overwhelmingly won the ILSVRC in 2012 with AlexNet

    Structure very similar to LeNet-5, but with some new key insights: very efficient GPU implementation, ReLU neurons and dropout

    9

  • The Inception architecture

  • Motivations

    Increasing model size tends to improve quality

    More computational resources are needed

    Computational efficiency and low parameter count are still important

    Mobile vision and embedded systems

    Big Data

    11

  • Going Deeper with Convolutions [6]

    The Inception module solves this problem by making better use of the computing resources

    Proposed in 2014 by Christian Szegedy and other Google researchers

    Used in the GoogLeNet architecture that won both the ILSVRC 2014 classification and detection challenges

    12

  • Inception module I

    Visual information is processed at various scales and then aggregated

    Since pooling operations are beneficial in CNNs, a parallel pooling path has been added

    Problems:

    3x3 and 5x5 convolutions can be very expensive on top of a layer with lots of filters

    The number of filters substantially increases for each Inception layer added, leading to a computational blow-up

    13

  • Inception module II

    Adding the 1x1 convolutions before the bigger convolutions reduces dimensionality

    The same is done after the pooling layer

    14
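    A back-of-the-envelope count shows why the 1x1 reduction pays off; the channel sizes below are illustrative, not GoogLeNet's actual ones:

    # 5x5 path on a 256-channel input producing 128 output channels,
    # with and without a 1x1 reduction to 64 channels first
    c_in, c_red, c_out = 256, 64, 128

    naive = 5 * 5 * c_in * c_out                            # 819,200 weights
    reduced = 1 * 1 * c_in * c_red + 5 * 5 * c_red * c_out  # 221,184 weights
    print(naive / reduced)                                  # ~3.7x fewer parameters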

  • GoogLeNet I

    GoogLeNet is a particular incarnation of the Inception architecture

    22 convolutional layers (27 including pooling)

    9 Inception modules

    2 auxiliary classifiers to solve the vanishing gradient problem and for regularization

    Designed with computational efficiency in mind

    Inference can be run on devices with limited computational resources, especially memory

    7 of these networks were used in an ensemble for the ILSVRC 2014 classification task

    15

  • GoogLeNet II

    16

  • GoogLeNet III

    17

  • GoogLeNet - Training

    Trained with the DistBelief distributed machine learning system

    Asynchronous stochastic gradient descent with 0.9 momentum

    Image sampling methods were changed many times before the competition

    Already-converged models were trained further with other options

    Models were trained on crops of different sizes

    There isn't definitive guidance on the single most effective way to train these networks

    18

  • GoogLeNet - ILSVRC 2014 Results

    Classification (above) and object detection (below) results

    19

  • DeepDream

    Google's DeepDream uses a GoogLeNet to produce machine dreams

    20

  • Inception-v2 and Inception-v3

    The Inception module authors later presented new optimized versions of the architecture, called Inception-v2 and Inception-v3 [7]

    They managed to significantly improve the GoogLeNet ILSVRC 2014 results

    The improvements were based on various key principles:

    Avoid representational bottlenecks

    Spatial aggregation on lower-dimensional embeddings doesn't usually induce relevant losses in representational power

    Balance the width and depth of the network

    21

  • Convolution factorization I

    Factorizing convolutions reduces the number of parameters without losing much expressiveness

    For example, 5x5 convolutions can be factorized into a pair of 3x3 convolutions

    It is also possible to factorize an NxN convolution into a 1xN convolution followed by an Nx1 convolution

    22
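    The savings can be verified with simple per-channel-pair arithmetic (biases ignored):

    # one 5x5 kernel vs. a pair of stacked 3x3 kernels
    print(2 * 3 * 3 / (5 * 5))  # 0.72 -> 28% fewer weights

    # one NxN kernel vs. a 1xN followed by an Nx1, for N = 7
    n = 7
    print(2 * n / (n * n))      # ~0.29 -> ~71% fewer weights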

  • Convolution factorization II

    The original Inception module (left) and the new factorized module

    (right).

    23

  • Efficient grid size reduction - problem

    Suppose we want to pass from a d×d grid with k filters to a (d/2)×(d/2) grid with 2k filters

    We need to compute a stride-1 convolution and then a pooling

    Computational cost dominated by the convolution: 2d²k² operations

    Inverting the order, the number of operations is reduced to 2(d/2)²k², but we violate the bottleneck principle

    24
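    Spelling the two counts out numerically (multiply-accumulate operations up to constant factors; d and k are arbitrary example values):

    d, k = 32, 64

    conv_then_pool = 2 * d**2 * k**2         # stride-1 conv at full resolution
    pool_then_conv = 2 * (d // 2)**2 * k**2  # conv on the pooled grid: 4x cheaper,
                                             # but it introduces a bottleneck
    print(conv_then_pool, pool_then_conv)    # 8,388,608 vs. 2,097,152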

  • Efficient grid size reduction - solution

    The solution is an Inception module with convolution and pooling blocks with stride 2

    Computationally efficient and no representational bottleneck introduced

    25
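    A minimal sketch of this parallel stride-2 module (the channel counts are illustrative, not taken from the paper):

    import torch
    import torch.nn as nn

    class ReductionBlock(nn.Module):
        def __init__(self, c_in):
            super().__init__()
            # both branches see the full-resolution input and halve the grid
            self.conv = nn.Conv2d(c_in, c_in, kernel_size=3, stride=2, padding=1)
            self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        def forward(self, x):
            # concatenating the branches doubles the filters: k -> 2k
            return torch.cat([self.conv(x), self.pool(x)], dim=1)

    x = torch.randn(1, 64, 32, 32)
    print(ReductionBlock(64)(x).shape)  # torch.Size([1, 128, 16, 16])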

  • The new architecture

    Using various modified Inception modules, here is the new Inception-v2 architecture

    26

  • Inception-v2: modules used

    n = 7

    27

  • Inception-v2: training and observations

    The network was trained on the ILSVRC 2012 images using stochastic gradient descent and the TensorFlow library

    Experimental testing showed the two auxiliary classifiers to have less impact on training convergence than expected

    In the early training phases, the model performance was not affected by the presence of the auxiliary classifiers: they only improved the performance near the end of training

    Removing the lower auxiliary classifier didn't have any effect

    The main classifier performs better if batch normalization or dropout are added to the auxiliary ones

    The model was also trained and tested on smaller receptive fields with only a small loss of top-1 accuracy (76.6% for a 299x299 RF vs. 75.2% for a 79x79 RF). Important for the post-classification of detections

    28

  • Inception-v2 to Inception-v3 results (single model)

    Each row's Inception-v2 model adds a feature with respect to the previous row's model

    The last line's model is referred to as the Inception-v3 model

    29

  • Inception-v3 vs other models (single and ensemble)

    Single model results

    Ensemble results

    On the ILSVRC 2012 dataset, there is a significant improvement over state-of-the-art models, both with a single model and with an ensemble of models

    Note that the ensemble errors here are validation errors (except for the one marked with *, which is a test error)

    30

  • Fully convolutional networks

  • Semantic segmentation

    Image segmentation is the process of partitioning an image into multiple segments (sets of pixels or super-pixels)

    Semantic segmentation is the partitioning of an image into semantically meaningful parts, classifying each part into one of a set of pre-determined classes

    It's possible to achieve the same result with pixel-wise classification, i.e. assigning a class to each pixel

    32

  • Fully convolutional networks

    Shelhamer et al. [8] showed that fully convolutional networks trained pixels-to-pixels exceed the state-of-the-art in semantic segmentation

    The fully convolutional networks they proposed take input of arbitrary size and produce same-sized output to make dense predictions

    33

  • Convolutionalization of a classic net I

    Typical recognition nets (AlexNet, GoogLeNet, etc.) take fixed-sized inputs and produce non-spatial outputs

    The fully connected layers have fixed dimensions and drop the spatial coordinates

    However, we can view these fully connected layers as convolutions that cover their entire input regions

    34
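    This reinterpretation can be checked directly: the sketch below copies the weights of a fully connected layer (on an assumed 512x7x7 feature map) into an equivalent 7x7 convolution, which then also accepts larger inputs and produces a spatial output map:

    import torch
    import torch.nn as nn

    fc = nn.Linear(512 * 7 * 7, 1000)
    conv = nn.Conv2d(512, 1000, kernel_size=7)
    # reuse the FC weights as a convolution kernel covering the whole input
    conv.weight.data = fc.weight.data.view(1000, 512, 7, 7)
    conv.bias.data = fc.bias.data

    x = torch.randn(1, 512, 7, 7)
    assert torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-4)

    bigger = torch.randn(1, 512, 14, 14)  # arbitrary-sized input now works
    print(conv(bigger).shape)             # torch.Size([1, 1000, 8, 8])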

  • Convolutionalization of a classic net II

    These fully convolutional networks take input of any size and output classification maps

    The resulting maps are equivalent to the evaluation of the original network on particular input patches

    The new network is more than 5 times faster than the original network both at learning time and at inference time (considering a 10x10 output grid)

    Note that the output dimensions are typically reduced by subsampling

    So output interpolation is needed to obtain dense predictions

    The interpolation is obtained through backwards convolutions

    35

  • Backwards strided convolution

    Upsampling from 3x3 grid to 5x5

    36
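    In code this is a transposed convolution; with stride 2, kernel size 3 and padding 1 it maps a 3x3 grid to 5x5 as in the figure (in an FCN the filter weights would be learned):

    import torch
    import torch.nn as nn

    upsample = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, padding=1)
    x = torch.randn(1, 1, 3, 3)
    print(upsample(x).shape)  # torch.Size([1, 1, 5, 5])
    # output size = (in - 1) * stride - 2 * padding + kernel = 2*2 - 2 + 3 = 5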

  • Architecture I

    Coarse and local information is fused by combining lower and higher layers

    3 network types with different layers fused were tested

    37

  • Architecture II

    3 proven classification architectures were transformed to fully convolutional form: AlexNet, VGG16 and GoogLeNet

    Each net's final classifier layer is discarded and all the fully connected layers are converted to convolutions

    A 1x1 convolution with 21 channels (the number of classes in the PASCAL VOC 2011 dataset) is added to the end, followed by a backwards convolution layer

    38

  • Architecture III

    The original nets were first pre-trained using image classification

    Then they were transformed to fully convolutional form for fine-tuning using whole images (using SGD with momentum)

    The best results were obtained with FCN-VGG16

    Training on whole images proved to be as effective as sampling patches

    39

  • Architecture comparison

    The first models (FCN-32s) didn't fuse different layers, but the resulting output is very coarse

    They then fused lower layers with the last one (as shown earlier) to obtain better results (mean IU 62.7 for FCN-8s vs. 59.4 for FCN-32s)

    40

  • Results comparison I

    The model reaches state-of-the-art performance on semantic segmentation

    The model is also much faster at inference time than previous architectures

    41

  • Results comparison II

    42

  • Hypercolumns

  • Hypercolumns I

    The last layer of a CNN captures general features of the image, but is too coarse spatially to allow precise localization

    Earlier layers instead may be precise in localization but will not capture semantics

    Hariharan et al. [9] presented the hypercolumn concept, which puts together the information from both higher and lower layers to obtain better results on 3 fine-grained localization tasks:

    Simultaneous detection and segmentation

    Keypoint localization

    Part labeling

    44

  • Hypercolumns II

    The hypercolumn corresponding to a given input location is defined as the outputs of all units above that location at all layers of the CNN, stacked into one vector

    45
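    One plausible way to build such vectors, sketched below: upsample each layer's feature map to a common resolution and stack them channel-wise (the layer shapes and the bilinear upsampling are illustrative choices, not the exact procedure of [9]):

    import torch
    import torch.nn.functional as F

    h = w = 50
    feats = [torch.randn(1, 64, 25, 25),   # shallow, spatially precise layer
             torch.randn(1, 256, 12, 12),  # deeper, coarser layer
             torch.randn(1, 512, 6, 6)]    # coarsest, most semantic layer

    upsampled = [F.interpolate(f, size=(h, w), mode='bilinear',
                               align_corners=False) for f in feats]
    stacked = torch.cat(upsampled, dim=1)  # shape (1, 832, 50, 50)
    print(stacked[0, :, 10, 10].shape)     # one location's hypercolumn: 832-dim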

  • Problem setting I

    Input: a set of detections (subjected to non-maximum suppression), each with a bounding box, a category label and a score

    According to the task we are performing, for each detection we want to:

    segment out the object

    segment its parts

    predict its keypoints

    Whichever the task, the bounding boxes are slightly expanded and a 50x50 heatmap is predicted on each of them

    46

  • Problem setting II

    The information encoded in each heatmap and the number of heatmaps depend on the chosen task:

    For segmentation, the heatmap encodes the probability that a particular location is inside the object

    For part labeling, a separate heatmap is predicted for each part, where each heatmap is the probability that a location belongs to that part

    For keypoint localization, a separate heatmap is predicted for each keypoint, with each heatmap encoding the probability that the keypoint is at a particular location

    The heatmaps are finally resized to the size of the expanded bounding boxes

    So all the tasks are solved by assigning a probability to each of the 50x50 locations

    47

  • Problem setting III

    For each of the 50x50 locations and for each category, a classifier should be trained

    But doing so has 3 problems:

    The amount of data that each classifier sees during training is heavily reduced

    Training so many classifiers is computationally expensive

    While the classifier should vary according to the location, adjacent pixels should be classified similarly

    The solution is to train a coarse K×K grid of classifiers (usually K = 5 or K = 10) and interpolate between them, as sketched below

    48
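    A minimal sketch of such an interpolated classifier grid (all sizes are illustrative; here each grid classifier is a 1x1 convolution, and the per-pixel mixing weights come from bilinearly upsampling a one-hot stack, so they sum to 1 at every location):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    K, C, S = 5, 832, 50             # grid size, feature dim, heatmap size
    classifiers = nn.Conv2d(C, K * K, kernel_size=1)  # all K*K classifiers at once

    x = torch.randn(1, C, S, S)      # per-location features
    scores = classifiers(x)          # (1, K*K, S, S): one score map per classifier

    # per-pixel interpolation weights over the K*K grid classifiers
    eye = torch.eye(K * K).view(1, K * K, K, K)
    weights = F.interpolate(eye, size=(S, S), mode='bilinear',
                            align_corners=True)
    heatmap = torch.sigmoid((scores * weights).sum(dim=1))  # (1, S, S)
    print(heatmap.shape)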

  • Network architecture

    [Figure: convolution layers feeding upsampling blocks, followed by classifier interpolation and a final sigmoid]

    Note: inverting the order of upsampling and the convolutions (that calculate the K×K grids) and computing them separately for each of the 3 combined layers reduces the computational cost

    49

  • Bounding box refining

    A special technique called rescoring is used to improve the box selection

    50

  • SDS results

    51

  • Keypoint prediction results

    52

  • Part labeling results

    53

  • Conclusion

  • Conclusion

    We have seen how the Inception modules allow training deeper and better networks in a computationally efficient manner

    We have then observed how to transform a classification CNN into a fully convolutional network for pixel-wise classification

    We have learned the hypercolumn technique to combine high- and low-level information to improve accuracy on various fine-grained localization tasks

    55

  • Thank you for your patience! :)

    56

  • References I

    [1] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition", Neural Computation, vol. 1(4), pp. 541–551, 1989.

    [2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition", Proc. IEEE, vol. 86, pp. 2278–2324, 1998.

    [3] A. W. Harley, "An interactive node-link visualization of convolutional neural networks", in ISVC, pp. 867–877, 2015.

    [4] A. Kurenkov, "A brief history of neural nets and deep learning, part 4". http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-4/

    57


  • References II

    [5] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks", Advances in Neural Information Processing Systems, vol. 25, pp. 1106–1114, 2012.

    [6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions", CoRR, vol. abs/1409.4842, 2014.

    [7] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception architecture for computer vision", CoRR, vol. abs/1512.00567, 2015.

    [8] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation", CoRR, vol. abs/1605.06211, 2016.

    58

  • References III

    [9] B. Hariharan, P. A. Arbeláez, R. B. Girshick, and J. Malik, "Hypercolumns for object segmentation and fine-grained localization", CoRR, vol. abs/1411.5752, 2014.

    59
