90
COS 429: Computer Vision Lecture 23 Deep Learning: Segmentation COS429 : 12.12.16 : Andras Ferencz Thanks: most of these slides shamelessly adapted from Stanford CS231n: Convolutional Neural Networks for Visual Recognition Fei-Fei Li, Andrej Karpathy, Justin Johnson http://cs231n.stanford.edu/

Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

COS 429: Computer Vision

Lecture 23Deep Learning: Segmentation

COS429 : 12.12.16 : Andras Ferencz

Thanks: most of these slides shamelessly adapted fromStanford CS231n: Convolutional Neural Networks for Visual Recognition

Fei-Fei Li, Andrej Karpathy, Justin Johnson http://cs231n.stanford.edu/

Page 2: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

2 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Page 3: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

3 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

3

Szegedy et al, Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv 2016

Page 4: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

4 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

4

ClassificationClassification + Localization

Computer Vision Tasks

CAT CAT CAT, DOG, DUCK

Object DetectionInstance

Segmentation

CAT, DOG, DUCK

Single object Multiple objects

Page 5: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

5 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

5

Simple Recipe for Classification + Localization

Step 2: Attach new fully-connected “regression head” to the network

Image

Convolution and Pooling

Final conv feature map

Fully-connected layers

Class scores

Fully-connected layers

Box coordinates

“Classification head”

“Regression head”

Page 6: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

6 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

6

Sliding Window: Overfeat

Network input: 3 x 221 x 221

0.5 0.75

Classification scores: P(cat)

Larger image:3 x 257 x 257

Page 7: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

7 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

7

Sliding Window: Overfeat

Network input: 3 x 221 x 221

0.5 0.75

0.6 0.8

Classification scores: P(cat)

Larger image:3 x 257 x 257

Page 8: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

8 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

8

Sliding Window: Overfeat

Network input: 3 x 221 x 221

0.5 0.75

0.6 0.8

Classification scores: P(cat)

Larger image:3 x 257 x 257

Page 9: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

9 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

9

Sliding Window: Overfeat

Network input: 3 x 221 x 221 Classification score:

P(cat)Larger image:3 x 257 x 257

Greedily merge boxes and scores (details in paper)

0.8

Page 10: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

10 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

10

Sliding Window: Overfeat

In practice use many sliding window locations and multiple scales

Window positions + score maps Box regression outputs Final Predictions

Sermanet et al, “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014

Page 11: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

11 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

11

Efficient Sliding Window: Overfeat

Image: 3 x 221 x 221

Convolution + pooling

Feature map: 1024 x 5 x 5

4096 x 1 x 1 1024 x 1 x 1

5 x 5 conv

5 x 5 conv

1 x 1 conv

4096 x 1 x 1 1024 x 1 x 1

Box coordinates:(4 x 1000) x 1 x 1

Class scores:1000 x 1 x 1

1 x 1 conv

1 x 1 conv 1 x 1 conv

Efficient sliding window by converting fully-connected layers into convolutions

Page 12: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

12 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

12

Efficient Sliding Window: Overfeat

Training time: Small image, 1 x 1 classifier output

Test time: Larger image, 2 x 2 classifier output, only extra compute at yellow regions

Sermanet et al, “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014

Page 13: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

13 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

13

ClassificationClassification + Localization

Computer Vision Tasks

Instance Segmentation

Object Detection

Page 14: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

14 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

14

Region Proposals

● Find “blobby” image regions that are likely to contain objects● “Class-agnostic” object detector● Look for “blob-like” regions

Page 15: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

15 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

15

Region Proposals: Selective Search

Bottom-up segmentation, merging regions at multiple scales

Convert regions to boxes

Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013

Page 16: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

16 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

16

R-CNN

Girschick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014

Slide credit: Ross Girschick

Page 17: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

17 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

R-CNN Problems: Slow at test-time due to independent forward passes of the CNN

Solution: Share computation of convolutional layers between proposals for an image

R-CNN Problems: - Post-hoc training: CNN not updated in response to final classifiers and regressors- Complex training pipeline

Solution:Just train the whole system end-to-end all at once!

Fast R-CNN

Page 18: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

18 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

18

Fast R-CNN: Region of Interest Pooling

Hi-res input image:

3 x 800 x 600with region proposal

Convolution and Pooling

Hi-res conv features:C x H x W

with region proposal

Fully-connected layers

Can back propagate similar to max pooling

RoI conv features:C x h x w

for region proposal

Fully-connected layers expect low-res conv features:

C x h x w

Page 19: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

19 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

19

Faster R-CNN: TrainingIn the paper: Ugly pipeline- Use alternating optimization to train RPN,

then Fast R-CNN with RPN proposals, etc.- More complex than it has to be

Since publication: Joint training!One network, four losses- RPN classification (anchor good / bad)- RPN regression (anchor -> proposal)- Fast R-CNN classification (over classes)- Fast R-CNN regression (proposal -> box)

Slide credit: Ross Girschick

Page 20: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

20 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

20

Faster R-CNN: Results

R-CNN Fast R-CNN Faster R-CNN

Test time per image(with proposals)

50 seconds 2 seconds 0.2 seconds

(Speedup) 1x 25x 250x

mAP (VOC 2007) 66.0 66.9 66.9

Page 21: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

21 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

21

Object Detection State-of-the-art:ResNet 101 + Faster R-CNN + some extras

He et. al, “Deep Residual Learning for Image Recognition”, arXiv 2015

Page 22: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

22 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

22

ImageNet Detection 2013 - 2015

Page 23: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

23 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

23

YOLO: You Only Look OnceDetection as Regression

Divide image into S x S grid

Within each grid cell predict:B Boxes: 4 coordinates + confidenceClass scores: C numbers

Regression from image to 7 x 7 x (5 * B + C) tensor

Direct prediction using a CNN

Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, arXiv 2015

Page 24: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

24 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

24

YOLO: You Only Look OnceDetection as Regression

Faster than Faster R-CNN, but not as good

Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, arXiv 2015

Page 25: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

25 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

2525

ClassificationClassification + Localization

Computer Vision Tasks

CAT CAT CAT, DOG, DUCK

Object Detection Segmentation

CAT, DOG, DUCK

Multiple objectsSingle object

Page 26: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

26 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Classification

Today

Object DetectionClassification + Localization

Segmentation

Today

Page 27: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

27 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Semantic Segmentation

27

Label every pixel!

Don’t differentiate instances (cows)

Classic computer vision problem

Figure credit: Shotton et al, “TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context”, IJCV 2007

Page 28: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

28 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Instance Segmentation

28

Figure credit: Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015

Detect instances, give category, label pixels

“simultaneous detection and segmentation” (SDS)

Lots of recent work (MS-COCO)

Page 29: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

29 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Semantic Segmentation

29

Extract patch

Page 30: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

30 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Semantic Segmentation

30

CNN

Extract patch

Run througha CNN

Page 31: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

31 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Semantic Segmentation

31

CNN COW

Extract patch

Run througha CNN

Classify center pixel

Page 32: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

32 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Semantic Segmentation

32

CNN COW

Extract patch

Run througha CNN

Classify center pixel

Repeat for every pixel

Page 33: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

33 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Semantic Segmentation

33

CNN

Run “fully convolutional” network to get all pixels at once

Smaller output due to pooling

Page 34: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

34 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Semantic Segmentation: Multi-Scale

34

Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013

Page 35: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

35 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Semantic Segmentation: Multi-Scale

35

Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013

Resize image to multiple scales

Page 36: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

36 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Semantic Segmentation: Multi-Scale

36

Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013

Resize image to multiple scales

Run one CNN per scale

Page 37: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

37 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Semantic Segmentation: Multi-Scale

37

Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013

Resize image to multiple scales

Run one CNN per scale

Upscale outputsand concatenate

Page 38: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

38 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Semantic Segmentation: Multi-Scale

38

Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013

Resize image to multiple scales

Run one CNN per scale

Upscale outputsand concatenate

External “bottom-up” segmentation

Page 39: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

39 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Semantic Segmentation: Multi-Scale

39

Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013

Resize image to multiple scales

Run one CNN per scale

Upscale outputsand concatenate

External “bottom-up” segmentation

Combine everything for final outputs

Page 40: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

40 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Semantic Segmentation: Refinement

40

Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014

Apply CNN once to get labels

Page 41: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

41 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Semantic Segmentation: Refinement

41

Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014

Apply CNN once to get labels

Apply AGAIN to refine labels

Page 42: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

42 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Semantic Segmentation: Refinement

42

Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014

Apply CNN once to get labels

Apply AGAIN to refine labels

And again!

Same CNN weights:recurrent convolutional network

More iterations improve results

Page 43: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

43 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

43

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015

Semantic Segmentation: Upsampling

Learnable upsampling!

Page 44: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

44 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

44

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015

Semantic Segmentation: Upsampling

Page 45: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

45 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

45

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015

Semantic Segmentation: Upsampling

“skip connections”

Page 46: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

46 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

46

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015

Semantic Segmentation: Upsampling

Skip connections = Better results

“skip connections”

Page 47: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

47 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Learnable Upsampling: “Deconvolution”

47

Typical 3 x 3 convolution, stride 1 pad 1

Input: 4 x 4 Output: 4 x 4

Page 48: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

48 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Learnable Upsampling: “Deconvolution”

48

Typical 3 x 3 convolution, stride 1 pad 1

Input: 4 x 4 Output: 4 x 4

Dot product between filter and input

Page 49: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

49 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Learnable Upsampling: “Deconvolution”

49

Typical 3 x 3 convolution, stride 1 pad 1

Input: 4 x 4 Output: 4 x 4

Dot product between filter and input

Page 50: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

50 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Learnable Upsampling: “Deconvolution”

50

Typical 3 x 3 convolution, stride 2 pad 1

Input: 4 x 4 Output: 2 x 2

Page 51: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

51 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Learnable Upsampling: “Deconvolution”

51

Typical 3 x 3 convolution, stride 2 pad 1

Input: 4 x 4 Output: 2 x 2

Dot product between filter and input

Page 52: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

52 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Learnable Upsampling: “Deconvolution”

52

Typical 3 x 3 convolution, stride 2 pad 1

Input: 4 x 4 Output: 2 x 2

Dot product between filter and input

Page 53: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

53 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Learnable Upsampling: “Deconvolution”

53

3 x 3 “deconvolution”, stride 2 pad 1

Input: 2 x 2 Output: 4 x 4

Page 54: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

54 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Learnable Upsampling: “Deconvolution”

54

3 x 3 “deconvolution”, stride 2 pad 1

Input: 2 x 2 Output: 4 x 4

Input gives weight for filter

Page 55: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

55 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Learnable Upsampling: “Deconvolution”

55

3 x 3 “deconvolution”, stride 2 pad 1

Input: 2 x 2 Output: 4 x 4

Input gives weight for filter

Sum where output overlaps

Same as backward pass for normal convolution!

“Deconvolution” is a bad name, already defined as “inverse of convolution”

Better names: convolution transpose,backward strided convolution,1/2 strided convolution, upconvolution

Page 56: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

56 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Semantic Segmentation: Upsampling

56

Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

Normal VGG “Upside down” VGG

6 days of training on Titan X…

Page 57: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

57 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Instance Segmentation

57

Figure credit: Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015

Detect instances, give category, label pixels

“simultaneous detection and segmentation” (SDS)

Lots of recent work (MS-COCO)

Page 58: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

58 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Instance Segmentation

58

Hariharan et al, “Simultaneous Detection and Segmentation”, ECCV 2014

Similar to R-CNN, but with segments

Page 59: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

59 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Instance Segmentation

59

Hariharan et al, “Simultaneous Detection and Segmentation”, ECCV 2014

External Segment proposals

Similar to R-CNN, but with segments

Page 60: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

60 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Instance Segmentation

60

Hariharan et al, “Simultaneous Detection and Segmentation”, ECCV 2014

External Segment proposals

Similar to R-CNN

Page 61: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

61 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Instance Segmentation

61

Hariharan et al, “Simultaneous Detection and Segmentation”, ECCV 2014

External Segment proposals

Mask out background with mean image

Similar to R-CNN, but with segments

Page 62: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

62 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Instance Segmentation

62

Hariharan et al, “Simultaneous Detection and Segmentation”, ECCV 2014

External Segment proposals

Mask out background with mean image

Similar to R-CNN, but with segments

Page 63: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

63 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Instance Segmentation

63

Hariharan et al, “Simultaneous Detection and Segmentation”, ECCV 2014

External Segment proposals

Mask out background with mean image

Similar to R-CNN, but with segments

Page 64: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

64 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Instance Segmentation: Cascades

64

Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015

Similar to Faster R-CNN

Won COCO 2015 challenge (with ResNet)

Page 65: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

65 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Instance Segmentation: Cascades

65

Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015

Similar to Faster R-CNN

Region proposal network (RPN)

Won COCO 2015 challenge (with ResNet)

Page 66: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

66 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Instance Segmentation: Cascades

66

Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015

Similar to Faster R-CNN

Won COCO 2015 challenge (with ResNet)

Region proposal network (RPN)

Reshape boxes to fixed size,figure / ground logistic regression

Mask out background, predict object class

Learn entire model end-to-end!

Page 67: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

67 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Instance Segmentation: Cascades

67

Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015 Predictions Ground truth

Page 68: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

68 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Segmentation Overview

Semantic segmentationClassify all pixelsFully convolutional models, downsample then upsampleLearnable upsampling: fractionally strided convolutionSkip connections can help

Instance SegmentationDetect instance, generate maskSimilar pipelines to object detection

68

Page 69: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

69 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Quick overview ofOther Topics

69

Page 70: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

70 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

70

Recurrent Neural Networks (RNN)

Vanilla Neural Networks

Page 71: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

71 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

71

e.g. Image Captioningimage -> sequence of words

Recurrent Neural Networks (RNN)

Page 72: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

72 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

72

e.g. Sentiment Classificationsequence of words -> sentiment

Recurrent Neural Networks (RNN)

Page 73: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

73 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

73

e.g. Machine Translationseq of words -> seq of words

Recurrent Neural Networks (RNN)

Page 74: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

74 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

74

e.g. Video classification on frame level

Recurrent Neural Networks (RNN)

Page 75: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

75 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

75

x

RNN

y

Page 76: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

76 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

train more

train more

train more

Character RNN during training

Page 77: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

77 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

77

Page 78: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

78 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

78

Generated C code

Page 79: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

79 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

79

quote detection cell

Searching for interpretable cells

Page 80: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

80 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Multiple Object Recognition with

Visual Attention, Ba et al.

Sequential Processing of fixed inputs

Page 81: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

81 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

DRAW: A Recurrent

Neural Network For

Image Generation,

Gregor et al.

Sequential Processing of fixed outputs

Page 82: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

82 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

82

Explain Images with Multimodal Recurrent Neural Networks, Mao et al.Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-FeiShow and Tell: A Neural Image Caption Generator, Vinyals et al.Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick

Image Captioning

Page 83: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

83 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

83

Convolutional Neural Network

Recurrent Neural Network

Page 84: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

84 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

84

CNN

Image: H x W x 3

Features: L x D

h0

a1

z1

Weighted combination of features

y1

h1

First word

Distribution over L locations

Soft Attention for Captioning

a2 d1

h2

z2 y2Weighted

features: D

Distribution over vocab

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

Page 85: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

85 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Soft Attention for Captioning

85

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

Soft attention

Hard attention

Page 86: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

86 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Soft Attention for Captioning

86

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

Page 87: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

87 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Spatial Transformer Networks

Input image:H x W x 3

Box Coordinates:(xc, yc, w, h)

Cropped and rescaled image:

X x Y x 3

Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015

Can we make this function differentiable?

Page 88: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

88 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Spatial Transformer Networks

Input image:H x W x 3

Box Coordinates:(xc, yc, w, h)

Cropped and rescaled image:

X x Y x 3

Can we make this function differentiable?

Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015

Idea: Function mapping pixel coordinates (xt, yt) of output to pixel coordinates (xs, ys) of input

Repeat for all pixels in output to get a sampling grid

Then use bilinear interpolation to compute output

Network attends to input by predicting �

Page 89: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

89 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Spatial Transformer Networks

Input: Full image

A smallLocalization network predicts transform �

Grid generator uses to �compute sampling grid

Sampler uses bilinear interpolation to produce output

Output: Region of interest from input

Page 90: Lecture 23 Deep Learning: Segmentation...Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected

90 : COS429 : L23 : 12.12.16 : Andras Ferencz Slide Credit:

Spatial Transformer Networks

90

Differentiable “attention / transformation” module

Insert spatial transformers into a classification network and it learns to attend and transform the input