69

[Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)
Page 2: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

[Boston March for Science 2017 – photo Hendrik Strobelt]

Page 3: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

[Boston March for Science 2017]

Page 4: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

[Boston March for Science 2017]

Page 5: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

[Boston March for Science 2017]

Page 6: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)
Page 7: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Object Detectors Emerge in Deep Scene CNNs

Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, AntonioTorralba

Massachusetts Institute of Technology

ICLR 2015

Page 8: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

How Objects are Represented in CNN?

CNN uses distributed code to represent objects.

Agrawal, et al. Analyzing the performance of multilayer neural networks for object recognition. ECCV, 2014

Szegedy, et al. Intriguing properties of neural networks.arXiv preprint arXiv:1312.6199, 2013.

Zeiler, M. et al. Visualizing and Understanding Convolutional Networks, ECCV 2014.

Conv2

Conv3

Conv4

Pool5

Conv1

Page 9: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Estimating the Receptive Fields

Estimated receptive fieldspool1

Actual size of RF is much smaller than the theoretic size

conv3 pool5

Segmentation using the RF of Units

More semantically meaningful

Page 10: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Annotating the Semantics of Units

Top ranked segmented images are cropped and sent to Amazon Turk for annotation.

Page 11: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Annotating the Semantics of Units

Pool5, unit 76; Label: ocean; Type: scene; Precision: 93%

Page 12: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Pool5, unit 13; Label: Lamps; Type: object; Precision: 84%

Annotating the Semantics of Units

Page 13: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Pool5, unit 77; Label:legs; Type: object part; Precision: 96%

Annotating the Semantics of Units

Page 14: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Pool5, unit 112; Label: pool table; Type: object; Precision: 70%

Annotating the Semantics of Units

Page 15: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Annotating the Semantics of Units

Pool5, unit 22; Label: dinner table; Type: scene; Precision: 60%

Page 16: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Distribution of Semantic Types at Each Layer

Page 17: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Distribution of Semantic Types at Each Layer

Object detectors emerge within CNN trained to classify scenes, without any object supervision!

Page 18: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

18

“tabby cat”

1000-dim vector

< 1 millisecond

ConvNets perform classification

end-to-end learning

[Slides from Long, Shelhamer, and Darrell]

Page 19: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)
Page 20: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)
Page 21: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)
Page 22: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)
Page 23: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)
Page 24: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)
Page 25: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

R-CNN

26

many seconds

“cat”

“dog”

R-CNN does detection

[Long et al.]

Page 26: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

27

R-CNN: Region-based CNN

Figure: Girshick et al.

Page 27: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Fast R-CNN

RoI = Region of Interest

Figure: Girshick et al.

Multi-task loss

Page 28: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Fast R-CNN

Figure: Girshick et al.

- Convolve whole image into feature map (many layers; abstracted)- For each candidate RoI:

- Squash feature map weights into fixed-size ‘RoI pool’ – adaptive subsampling!- Divide RoI into H x W subwindows, e.g., 7 x 7, and max pool

- Learn classification on RoI pool with own fully connected layers (FCs)- Output classification (softmax) + bounds (regressor)

Page 29: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

What if we want pixels out?

semanticsegmentation

30

monocular depth estimation Eigen & Fergus 2015

boundary prediction Xie & Tu 2015optical flow Fischer et al. 2015

[Long et al.]

Page 30: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

31

~1/10 second

end-to-end learning

???

[Long et al.]

Page 31: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

UC Berkeley

Fully Convolutional Networksfor Semantic Segmentation

Jonathan Long* Evan Shelhamer* Trevor Darrell32

[CVPR 2015] Slides from Long, Shelhamer, and Darrell

Page 32: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

“tabby cat”

33

A classification network…

[Long et al.]

Number of filters, e.g., 64Number of perceptrons in MLP layer, e.g., 1024

Page 33: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

“tabby cat”

34

A classification network…

[Long et al.]

Page 34: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

“tabby cat”

35

A classification network…

[Long et al.]

The response of every kernel across all positions are attached densely to the array of perceptrons in the fully-connected layer.

Page 35: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

“tabby cat”

36

A classification network…

[Long et al.]

The response of every kernel across all positions are attached densely to the array of perceptrons in the fully-connected layer.

AlexNet: 256 filters over 6x6 response mapEach 2,359,296 response is attached to one of 4096 perceptrons, leading to 37 mil params.

Page 36: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Problem

• We want a label at every pixel

• Current network gives us a label for the whole image.

• We want a matrix of labels

• Approach:• Make CNN for sub-image size

• ‘Convolutionalize’ all layers of network, so that we can treat it as one (complex) filter and slide around our full image.

Page 37: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Long, Shelhamer, and Darrell 2014

Page 38: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

“tabby cat”

39

A classification network…

[Long et al.]

The response of every kernel across all positions are attached densely to the array of perceptrons in the fully-connected layer.

AlexNet: 256 filters over 6x6 response mapEach 2,359,296 response is attached to one of 4096 perceptrons, leading to 37 mil params.

Page 39: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)
Page 40: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

42

Convolutionalization

[Long et al.]

1x1 convolution operates across all filters in the previous layer, and is slid across all positions.

Number of filtersNumber of filters

Page 41: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Back to the fully-connected perceptron…

Perceptron is connected to everyvalue in the previous layer (across all channels; 1 visible).

[Long et al.]

Page 42: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)
Page 43: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

1x1100

Page 44: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

46

Convolutionalization

[Long et al.]

1x1 convolution operates across all filters in the previous layer, and is slid across all positions.

e.g., 64x1x1 kernel, with shared weights over 13x13 output, x1024 filters = 11mil params.

# filters, e.g. 1024# filters, e.g., 64

Page 45: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

47

Becoming fully convolutional

[Long et al.]

Arbitrary-sized image

When we turn these operations into a convolution, the 13x13 just becomes another parameter and our output size adjust dynamically.

Now we have a vector/matrix output, and our network acts itself like a complex filter.

Multiple outputs

Page 46: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Long, Shelhamer, and Darrell 2014

Page 47: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

49

Upsampling the output

[Long et al.]

Some upsamplingalgorithm to return us to H x W

Page 48: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

50

End-to-end, pixels-to-pixels network

[Long et al.]

Page 49: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

conv, pool,nonlinearity

upsampling

pixelwiseoutput + loss

End-to-end, pixels-to-pixels network

51

[Long et al.]

Page 50: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

52

What is the upsampling layer?

This one.

[Long et al.]

Hint: it’s actually an upsampling _network_

Page 51: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Upsampling with convolution

Convolution Transposed convolution = weighted kernel ‘stamp’

Often called “deconvolution”,but not actually the deconvolution that we previously saw in deblurring -> that is division in the Fourier domain.

Page 52: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Spectrum of deep features

Combine where (local, shallow) with what (global, deep)

Fuse features into deep jet

(cf. Hariharan et al. CVPR15 “hypercolumn”) 54

[Long et al.]

Page 53: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Learning upsampling kernels with skip layer refinement

interp + sum

interp + sum

dense output 55

End-to-end, joint learningof semantics and location

[Long et al.]

Page 54: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

stride 32

no skips

stride 16

1 skip

stride 8

2 skips

ground truthinput image

Skip layer refinement

56

[Long et al.]

Page 55: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

ResultsFCN SDS* Truth Input

58

Relative to prior state-of-the-art SDS:

- 30% relative improvementfor mean IoU

- 286× faster

*Simultaneous Detection and Segmentation Hariharan et al. ECCV14

[Long et al.]

Page 56: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Long, Shelhamer, and Darrell 2014

What can we do with an FCN?

Page 57: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

im2gps (Hays & Efros, CVPR 2008)

6 million geo-tagged Flickr images

http://graphics.cs.cmu.edu/projects/im2gps/

How much can an image tell about its geographic location?

Page 58: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Nearest Neighbors according to gist + bag of SIFT + color histogram + a few others

Page 59: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)
Page 60: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

PlaNet - Photo Geolocation with Convolutional Neural Networks

Tobias Weyand, Ilya Kostrikov, James Philbin

ECCV 2016

Page 61: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Discretization of Globe

Page 62: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Network and Training

• Network Architecture: Inception with 97M parameters

• 26,263 “categories” – places in the world

• 126 Million Web photos

• 2.5 months of training on 200 CPU cores

Page 63: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)
Page 64: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)
Page 65: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

PlaNet vs im2gps (2008, 2009)

Page 66: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

Spatial support for decision

Page 67: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

PlaNet vs Humans

Page 68: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

PlaNet vs. Humans

Page 69: [Boston March for Science 2017static.cs.brown.edu/courses/csci1430/2017_Spring/... · Fast R-CNN Figure: Girshick et al. - Convolve whole image into feature map (many layers; abstracted)

PlaNet summary

• Very fast geolocalization method by categorization.

• Uses far more training data than previous work (im2gps)

• Better than humans!