
  • Modern Convolutional Neural Network

    techniques for image segmentation

    Deep Learning Journal Club

    Gioele Ciaparrone

    Michele Curci

    November 30, 2016

    University of Salerno

  • Index

    1. Introduction

    2. The Inception architecture

    3. Fully convolutional networks

    4. Hypercolumns

    5. Conclusion

    2

  • Introduction

  • CNN recap

    Sequence of convolutional and pooling layers

    Rectifier activation function

    Fully connected layers at the end

    Softmax function for classification

    4
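    As a concrete sketch of this layout (PyTorch is used here purely for illustration; the layer sizes are arbitrary, not from any network discussed later):

    import torch
    import torch.nn as nn

    # conv -> ReLU -> pool, repeated, then fully connected layers + softmax
    recap_cnn = nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5),   # convolutional layer
        nn.ReLU(),                        # rectifier activation
        nn.MaxPool2d(2),                  # pooling layer
        nn.Conv2d(6, 16, kernel_size=5),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(16 * 4 * 4, 10),        # fully connected layer at the end
        nn.Softmax(dim=1),                # class probabilities
    )

    x = torch.randn(1, 1, 28, 28)         # one 28x28 grayscale image
    print(recap_cnn(x).shape)             # torch.Size([1, 10])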

  • Convolution I

    5

  • Convolution II

    Valid padding (left) and same padding (right) convolutions

    6
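    For concreteness, the two padding modes can be compared in a few lines of PyTorch (an illustrative sketch; the 5x5 input and 3x3 kernel are arbitrary choices):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 1, 5, 5)                        # 5x5 input
    valid = nn.Conv2d(1, 1, kernel_size=3, padding=0)  # valid: no padding
    same = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # same: zero padding

    print(valid(x).shape)  # torch.Size([1, 1, 3, 3]) -- output shrinks
    print(same(x).shape)   # torch.Size([1, 1, 5, 5]) -- size preserved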

  • LeNet-5 (1989-1998)

    First CNN (1989) proven to work well, used for handwritten ZIP code recognition [1]

    Refined through the years until the LeNet-5 version (1998) [2]

    7

  • LeNet-5 interactive visualization [3]

    It's possible to interact with the network in 3D, manually drawing a digit to be classified, clicking on the neurons to get info about the parameters and the connected units, or rotating and zooming the network:

    http://scs.ryerson.ca/~aharley/vis/conv/

    8


  • AlexNet (2012) [5]

    After a long hiatus in which deep learning was ignored [4], neural networks received attention once again when Alex Krizhevsky overwhelmingly won the ILSVRC in 2012 with AlexNet

    Structure very similar to LeNet-5, but with some new key insights: very efficient GPU implementation, ReLU neurons and dropout

    9

  • The Inception architecture

  • Motivations

    Increasing model size tends to improve quality

    More computational resources are needed

    Computational efficiency and low parameter count are still important

    Mobile vision and embedded systems

    Big Data

    11

  • Going Deeper with Convolutions [6]

    The Inception module solves this problem by making better use of the computing resources

    Proposed in 2014 by Christian Szegedy and other Google researchers

    Used in the GoogLeNet architecture that won both the ILSVRC 2014 classification and detection challenges

    12

  • Inception module I

    Visual information is processed at various scales and then aggregated

    Since pooling operations are beneficial in CNNs, a parallel pooling path has been added

    Problems:

    3x3 and 5x5 convolutions can be very expensive on top of a layer with lots of filters

    The number of filters substantially increases for each Inception layer added, leading to a computational blow-up

    13

  • Inception module II

    Adding the 1x1 convolutions before the bigger convolutions reduces dimensionality

    The same is done after the pooling layer

    14
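    A back-of-the-envelope count shows why the 1x1 reduction pays off; the channel sizes below are illustrative, not GoogLeNet's actual ones:

    # 5x5 path on a 256-channel input producing 128 output channels,
    # with and without a 1x1 reduction to 64 channels first
    c_in, c_red, c_out = 256, 64, 128

    naive = 5 * 5 * c_in * c_out                            # 819,200 weights
    reduced = 1 * 1 * c_in * c_red + 5 * 5 * c_red * c_out  # 221,184 weights
    print(naive / reduced)                                  # ~3.7x fewer parameters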

  • GoogLeNet I

    GoogLeNet is a particular incarnation of the Inception architecture

    22 convolutional layers (27 including pooling)

    9 Inception modules

    2 auxiliary classifiers to solve the vanishing gradient problem and for regularization

    Designed with computational efficiency in mind

    Inference can be run on devices with limited computational resources, especially memory

    7 of these networks were used in an ensemble for the ILSVRC 2014 classification task

    15

  • GoogLeNet II

    16

  • GoogLeNet III

    17

  • GoogLeNet - Training

    Trained with the DistBelief distributed machine learning system

    Asynchronous stochastic gradient descent with 0.9 momentum

    Image sampling methods were changed many times before the competition

    Already-converged models were trained further with other options

    Models were trained on crops of different sizes

    There isn't definitive guidance on the single most effective way to train these networks

    18

  • GoogLeNet - ILSVRC 2014 Results

    Classification (above) and object detection (below) results

    19

  • DeepDream

    Google's DeepDream uses a GoogLeNet to produce machine dreams

    20

  • Inception-v2 and Inception-v3

    The Inception module authors later presented new optimized versions of the architecture, called Inception-v2 and Inception-v3 [7]

    They managed to significantly improve the GoogLeNet ILSVRC 2014 results

    The improvements were based on various key principles:

    Avoid representational bottlenecks

    Spatial aggregation on lower-dimensional embeddings doesn't usually induce relevant losses in representational power

    Balance the width and depth of the network

    21

  • Convolution factorization I

    Factorizing convolutions reduces the number of parameters without losing much expressiveness

    For example, 5x5 convolutions can be factorized into a pair of 3x3 convolutions

    It is also possible to factorize an NxN convolution into a 1xN convolution followed by an Nx1 convolution

    22
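    The savings can be verified with simple per-channel-pair arithmetic (biases ignored):

    # one 5x5 kernel vs. a pair of stacked 3x3 kernels
    print(2 * 3 * 3 / (5 * 5))  # 0.72 -> 28% fewer weights

    # one NxN kernel vs. a 1xN followed by an Nx1, for N = 7
    n = 7
    print(2 * n / (n * n))      # ~0.29 -> ~71% fewer weights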

  • Convolution factorization II

    The original Inception module (left) and the new factorized module

    (right).

    23

  • Efficient grid size reduction - problem

    Suppose we want to pass from a d×d grid with k filters to a (d/2)×(d/2) grid with 2k filters

    We need to compute a stride-1 convolution and then a pooling

    Computational cost dominated by the convolution: 2d²k² operations

    Inverting the order, the number of operations is reduced to 2(d/2)²k², but we violate the bottleneck principle

    24
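    Spelling the two counts out numerically (multiply-accumulate operations up to constant factors; d and k are arbitrary example values):

    d, k = 32, 64

    conv_then_pool = 2 * d**2 * k**2         # stride-1 conv at full resolution
    pool_then_conv = 2 * (d // 2)**2 * k**2  # conv on the pooled grid: 4x cheaper,
                                             # but it introduces a bottleneck
    print(conv_then_pool, pool_then_conv)    # 8,388,608 vs. 2,097,152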

  • Efficient grid size reduction - solution

    The solution is an Inception module with convolution and pooling blocks with stride 2

    Computationally efficient and no representational bottleneck introduced

    25
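    A minimal sketch of this parallel stride-2 module (the channel counts are illustrative, not taken from the paper):

    import torch
    import torch.nn as nn

    class ReductionBlock(nn.Module):
        def __init__(self, c_in):
            super().__init__()
            # both branches see the full-resolution input and halve the grid
            self.conv = nn.Conv2d(c_in, c_in, kernel_size=3, stride=2, padding=1)
            self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        def forward(self, x):
            # concatenating the branches doubles the filters: k -> 2k
            return torch.cat([self.conv(x), self.pool(x)], dim=1)

    x = torch.randn(1, 64, 32, 32)
    print(ReductionBlock(64)(x).shape)  # torch.Size([1, 128, 16, 16])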

  • The new architecture

    Using various modified Inception modules, here is the new Inception-v2 architecture

    26

  • Inception-v2: modules used

    n = 7

    27

  • Inception-v2: training and observations

    The network was trained on the ILSVRC 2012 images using stochastic gradient descent and the TensorFlow library

    Experimental testing showed the two auxiliary classifiers to have less impact on training convergence than expected

    In the early training phases, the model performance was not affected by the presence of the auxiliary classifiers: they only improved the performance near the end of training

    Removing the lower auxiliary classifier didn't have any effect

    The main classifier performs better if batch normalization or dropout are added to the auxiliary ones

    The model was also trained and tested on smaller receptive fields with only a small loss of top-1 accuracy (76.6% for a 299x299 RF vs. 75.2% for a 79x79 RF). Important for the post-classification of detections

    28

  • Inception-v2 to Inception-v3 results (single model)

    Each row's Inception-v2 model adds a feature with respect to the previous row's model

    The last line's model is referred to as the Inception-v3 model

    29

  • Inception-v3 vs other models (single and ensemble)

    Single model results

    Ensemble results

    On the ILSVRC 2012 dataset, there is a significant improvement over state-of-the-art models, both with a single model and with an ensemble of models

    Note that the ensemble errors here are validation errors (except for the one marked with *, which is a test error)

    30

  • Fully convolutional networks

  • Semantic segmentation

    Image segmentation is the process of partitioning an image into multiple segments (sets of pixels or super-pixels)

    Semantic segmentation is the partitioning of an image into semantically meaningful parts, classifying each part into one of a set of pre-determined classes

    It's possible to achieve the same result with pixel-wise classification, i.e. assigning a class to each pixel

    32

  • Fully convolutional networks

    Shelhamer et al. [8] showed that fully convolutional networks trained pixels-to-pixels exceed the state-of-the-art in semantic segmentation

    The fully convolutional networks they proposed take input of arbitrary size and produce same-sized output to make dense predictions

    33

  • Convolutionalization of a classic net I

    Typical recognition nets (AlexNet, GoogLeNet, etc.) take fixed-sized inputs and produce non-spatial outputs

    The fully connected layers have fixed dimensions and drop the spatial coordinates

    However, we can view these fully connected layers as convolutions that cover their entire input regions

    34
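    This reinterpretation can be checked directly: the sketch below copies the weights of a fully connected layer (on an assumed 512x7x7 feature map) into an equivalent 7x7 convolution, which then also accepts larger inputs and produces a spatial output map:

    import torch
    import torch.nn as nn

    fc = nn.Linear(512 * 7 * 7, 1000)
    conv = nn.Conv2d(512, 1000, kernel_size=7)
    # reuse the FC weights as a convolution kernel covering the whole input
    conv.weight.data = fc.weight.data.view(1000, 512, 7, 7)
    conv.bias.data = fc.bias.data

    x = torch.randn(1, 512, 7, 7)
    assert torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-4)

    bigger = torch.randn(1, 512, 14, 14)  # arbitrary-sized input now works
    print(conv(bigger).shape)             # torch.Size([1, 1000, 8, 8])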

  • Convolutionalization of a classic net II

    These fully convolutional networks take input of any size and output classification maps

    The resulting maps are equivalent to the evaluation of the original network on particular input patches

    The new network is more than 5 times faster than the original network both at learning time and at inference time (considering a 10x10 output grid)

    Note that the output dimensions are typically reduced by subsampling

    So output interpolation is needed to obtain dense predictions

    The interpolation is obtained through backwards convolutions

    35

  • Backwards strided convolution

    Upsampling from 3x3 grid to 5x5

    36
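    In code this is a transposed convolution; with stride 2, kernel size 3 and padding 1 it maps a 3x3 grid to 5x5 as in the figure (in an FCN the filter weights would be learned):

    import torch
    import torch.nn as nn

    upsample = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, padding=1)
    x = torch.randn(1, 1, 3, 3)
    print(upsample(x).shape)  # torch.Size([1, 1, 5, 5])
    # output size = (in - 1) * stride - 2 * padding + kernel = 2*2 - 2 + 3 = 5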

  • Architecture I

    Coarse and local information is fused by combining lower and higher layers

    3 network types with different layers fused were tested

    37

  • Architecture II

    3 proven classification architectures were transformed to fully convolutional form: AlexNet, VGG16 and GoogLeNet

    Each net's final classifier layer is discarded and all the fully connected layers are converted to convolutions

    A 1x1 convolution with 21 channels (the number of classes in the PASCAL VOC 2011 dataset) is added to the end, followed by a backwards convolution layer

    38

  • Architecture III

    The original nets were first pre-trained using image classification

    Then they were transformed to fully convolutional form for fine-tuning using whole images (using SGD with momentum)

    The best results were obtained with FCN-VGG16

    Training on whole images proved to be as effective as sampling patches

    39

  • Architecture comparison

    The first models (FCN-32s) didn't fuse different layers, but the resulting output is very coarse

    They then fused lower layers with the last one (as shown earlier) to obtain better results (mean IU 62.7 for FCN-8s vs. 59.4 for FCN-32s)

    40

  • Results comparison I

    The model reaches state-of-the-art performance on semantic segmentation

    The model is also much faster at inference time than previous architectures

    41

  • Results comparison II

    42

  • Hypercolumns

  • Hypercolumns I

    The last layer of a CNN captures general features of the image, but is too coarse spatially to allow precise localization

    Earlier layers instead may be precise in localization but will not capture semantics

    Hariharan et al. [9] presented the hypercolumn concept, which puts together the information from both higher and lower layers to obtain better results on 3 fine-grained localization tasks:

    Simultaneous detection and segmentation

    Keypoint localization

    Part labeling

    44

  • Hypercolumns II

    The hypercolumn corresponding to a given input location is defined as the outputs of all units above that location at all layers of the CNN, stacked into one vector

    45
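    One plausible way to build such vectors, sketched below: upsample each layer's feature map to a common resolution and stack them channel-wise (the layer shapes and the bilinear upsampling are illustrative choices, not the exact procedure of [9]):

    import torch
    import torch.nn.functional as F

    h = w = 50
    feats = [torch.randn(1, 64, 25, 25),   # shallow, spatially precise layer
             torch.randn(1, 256, 12, 12),  # deeper, coarser layer
             torch.randn(1, 512, 6, 6)]    # coarsest, most semantic layer

    upsampled = [F.interpolate(f, size=(h, w), mode='bilinear',
                               align_corners=False) for f in feats]
    stacked = torch.cat(upsampled, dim=1)  # shape (1, 832, 50, 50)
    print(stacked[0, :, 10, 10].shape)     # one location's hypercolumn: 832-dim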

  • Problem setting I

    Input: a set of detections (subjected to non-maximum suppression), each with a bounding box, a category label and a score

    According to the task we are performing, for each detection we want to:

    segment out the object

    segment its parts

    predict its keypoints

    Whichever the task, the bounding boxes are slightly expanded and a 50x50 heatmap is predicted on each of them

    46

  • Problem setting II

    The information encoded in each heatmap and the number of heatmaps depend on the chosen task:

    For segmentation, the heatmap encodes the probability that a particular location is inside the object

    For part labeling, a separate heatmap is predicted for each part, where each heatmap is the probability that a location belongs to that part

    For keypoint localization, a separate heatmap is predicted for each keypoint, with each heatmap encoding the probability that the keypoint is at a particular location

    The heatmaps are finally resized to the size of the expanded bounding boxes

    So all the tasks are solved by assigning a probability to each of the 50x50 locations

    47

  • Problem setting III

    For each of the 50x50 locations and for each category, a classifier should be trained

    But doing so has 3 problems:

    The amount of data that each classifier sees during training is heavily reduced

    Training so many classifiers is computationally expensive

    While the classifier should vary according to the location, adjacent pixels should be classified similarly

    The solution is to train a coarse K×K grid of classifiers (usually K = 5 or K = 10) and interpolate between them, as sketched below

    48
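    A minimal sketch of such an interpolated classifier grid (all sizes are illustrative; here each grid classifier is a 1x1 convolution, and the per-pixel mixing weights come from bilinearly upsampling a one-hot stack, so they sum to 1 at every location):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    K, C, S = 5, 832, 50             # grid size, feature dim, heatmap size
    classifiers = nn.Conv2d(C, K * K, kernel_size=1)  # all K*K classifiers at once

    x = torch.randn(1, C, S, S)      # per-location features
    scores = classifiers(x)          # (1, K*K, S, S): one score map per classifier

    # per-pixel interpolation weights over the K*K grid classifiers
    eye = torch.eye(K * K).view(1, K * K, K, K)
    weights = F.interpolate(eye, size=(S, S), mode='bilinear',
                            align_corners=True)
    heatmap = torch.sigmoid((scores * weights).sum(dim=1))  # (1, S, S)
    print(heatmap.shape)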

  • Network architecture

    [Figure: convolution layers feeding upsampling blocks, followed by classifier interpolation and a final sigmoid]

    Note: inverting the order of upsampling and the convolutions (that calculate the K×K grids) and computing them separately for each of the 3 combined layers reduces the computational cost

    49

  • Bounding box refining

    A special technique called rescoring is used to improve the box selection

    50

  • SDS results

    51

  • Keypoint prediction results

    52

  • Part labeling results

    53

  • Conclusion

  • Conclusion

    We have seen how the Inception modules allow training deeper and better networks in a computationally efficient manner

    We have then observed how to transform a classification CNN into a fully convolutional network for pixel-wise classification

    We have learned the hypercolumn technique to combine high- and low-level information to improve accuracy on various fine-grained localization tasks

    55

  • Thank you for your patience! :)

    56

  • References I

    [1] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition", Neural Computation, vol. 1(4), pp. 541–551, 1989.

    [2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition", Proc. IEEE, vol. 86, pp. 2278–2324, 1998.

    [3] A. W. Harley, "An interactive node-link visualization of convolutional neural networks", in ISVC, pp. 867–877, 2015.

    [4] A. Kurenkov, "A brief history of neural nets and deep learning, part 4". http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-4/

    57


  • References II

    [5] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks", Advances in Neural Information Processing Systems, vol. 25, pp. 1106–1114, 2012.

    [6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions", CoRR, vol. abs/1409.4842, 2014.

    [7] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception architecture for computer vision", CoRR, vol. abs/1512.00567, 2015.

    [8] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation", CoRR, vol. abs/1605.06211, 2016.

    58

  • References III

    [9] B. Hariharan, P. A. Arbeláez, R. B. Girshick, and J. Malik, "Hypercolumns for object segmentation and fine-grained localization", CoRR, vol. abs/1411.5752, 2014.

    59
