
Lecture 25


Page 1: Lecture 25


Deep learning and convolutional neural nets

Tues Dec 1
Kristen Grauman

UT Austin

Last time

• Support vector machines (SVM)

– Basic algorithm

– Kernels

– Multi-class

– HOG + SVM for person detection

• Visualizing a feature: Hoggles

• Evaluating an object detector

– Precision/recall, IoU, Average precision
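As a quick refresher on one of those metrics, here is a minimal sketch (not from the slides) of intersection-over-union for two axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    # Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))   # overlap width
    ih = max(0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, about 0.14
```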

Page 2: Lecture 25


Today

• Pyramid match kernels

• (Deep) Neural networks

• Convolutional neural networks

Recall: Examples of kernel functions

Linear: $K(x_i, x_j) = x_i^\top x_j$

Gaussian RBF: $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$

Histogram intersection: $K(x_i, x_j) = \sum_k \min\left(x_i(k),\, x_j(k)\right)$

• Kernels go beyond vector space data

• Kernels also exist for “structured” input spaces like

sets, graphs, trees…
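For concreteness, a small NumPy sketch of the three kernels above (illustrative only; the variable names and the bandwidth sigma are my own choices):

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y                                    # K(x, y) = x^T y

def gaussian_rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def histogram_intersection_kernel(x, y):
    return np.sum(np.minimum(x, y))                 # assumes nonnegative histograms

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.1, 0.6, 0.3])
print(linear_kernel(x, y), gaussian_rbf_kernel(x, y), histogram_intersection_kernel(x, y))
```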

Page 3: Lecture 25


Discriminative classification with

sets of features?

• Each instance is unordered set of vectors

• Varying number of vectors per instance

Slide credit: Kristen Grauman

Partially matching sets of features

We introduce an approximate matching kernel that

makes it practical to compare large sets of features

based on their partial correspondences.

Optimal match: $O(m^3)$
Greedy match: $O(m^2 \log m)$
Pyramid match: $O(m)$
(m = number of points)

[Previous work: Indyk & Thaper, Bartal, Charikar, Agarwal & Varadarajan, …]

Slide credit: Kristen Grauman

Page 4: Lecture 25


Pyramid match: main idea

descriptor space

Feature space partitions

serve to “match” the local

descriptors within

successively wider regions.

Slide credit: Kristen Grauman

Pyramid match: main idea

Histogram intersection

counts number of possible

matches at a given

partitioning.

Slide credit: Kristen Grauman

Page 5: Lecture 25


Pyramid match

• For similarity, weights inversely proportional to bin size

(or may be learned)

• Normalize these kernel values to avoid favoring large sets

[Grauman & Darrell, ICCV 2005]

measures

difficulty of a

match at level

number of newly matched

pairs at level

Slide credit: Kristen Grauman
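A rough one-dimensional sketch of this computation (illustrative only, not the authors' implementation): bins double in width at each level, histogram intersection counts the matches available at that level, and newly formed matches are weighted inversely to the bin size.

```python
import numpy as np

def pyramid_match(X, Y, D=256, levels=9):
    # X, Y: 1-D arrays of feature values assumed to lie in [0, D).
    score, prev_matches = 0.0, 0
    for i in range(levels):
        bin_size = 2 ** i                                    # bins double each level
        edges = np.arange(0, D + bin_size, bin_size)
        hx, _ = np.histogram(X, bins=edges)
        hy, _ = np.histogram(Y, bins=edges)
        matches = np.minimum(hx, hy).sum()                   # histogram intersection
        score += (1.0 / 2 ** i) * (matches - prev_matches)   # weight newly matched pairs
        prev_matches = matches
    return score

def normalized_pmk(X, Y, **kw):
    # Normalization to avoid favoring large sets, as noted on the slide.
    return pyramid_match(X, Y, **kw) / np.sqrt(pyramid_match(X, X, **kw) * pyramid_match(Y, Y, **kw))
```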

Pyramid match

optimal partial

matching

Optimal match: $O(m^3)$
Pyramid match: $O(mL)$

The Pyramid Match Kernel: Efficient Learning with Sets of Features. K. Grauman and T. Darrell. Journal of Machine Learning Research (JMLR), 8(Apr): 725-760, 2007.

Page 6: Lecture 25


BoW Issue:

No spatial layout preserved!

Too much? Too little?

Slide credit: Kristen Grauman

[Lazebnik, Schmid & Ponce, CVPR 2006]

• Make a pyramid of bag-of-words histograms.

• Provides some loose (global) spatial layout information

Spatial pyramid match

Page 7: Lecture 25


[Lazebnik, Schmid & Ponce, CVPR 2006]

• Make a pyramid of bag-of-words histograms.

• Provides some loose (global) spatial layout information

Spatial pyramid match

Sum over PMKs

computed in image

coordinate space,

one per word.
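A simplified sketch of that construction (my own illustration: each image is a list of (x, y, word) triples with coordinates normalized to [0, 1) and a vocabulary of size V; the level weighting here is a simplification of the exact scheme in the paper):

```python
import numpy as np

def spatial_pyramid_hist(points, V, levels=3):
    # Weighted, concatenated grid histograms of visual words at several resolutions.
    feats = []
    for l in range(levels):
        cells = 2 ** l                                  # cells x cells grid at level l
        h = np.zeros((V, cells, cells))
        for x, y, w in points:
            h[w, int(y * cells), int(x * cells)] += 1
        weight = 1.0 / 2 ** (levels - 1 - l)            # finer grids weighted more
        feats.append(weight * h.ravel())
    return np.concatenate(feats)

def spm_kernel(points_a, points_b, V):
    ha = spatial_pyramid_hist(points_a, V)
    hb = spatial_pyramid_hist(points_b, V)
    return np.minimum(ha, hb).sum()                     # histogram intersection
```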

• Can capture scene categories well---texture-like patterns

but with some variability in the positions of all the local

pieces.

Spatial pyramid match

Page 8: Lecture 25


• Can capture scene categories well---texture-like patterns

but with some variability in the positions of all the local

pieces.

• Sensitive to global shifts of the view

Confusion table

Spatial pyramid match

Today

• Pyramid match kernels

• (Deep) Neural networks

• Convolutional neural networks

Page 9: Lecture 25


Traditional Image Categorization: Training phase

Training Labels

Training

Images

Classifier Training

Training

Image Features

Trained Classifier

Jia-Bin Huang and Derek Hoiem, UIUC

Training Labels

Training

Images

Classifier Training

Training

Image Features

Trained Classifier

Image Features

Testing

Test Image

Outdoor

Prediction

Trained Classifier

Traditional Image Categorization: Testing phase

Page 10: Lecture 25


Features have been key

SIFT [Lowe IJCV 04] HOG [Dalal and Triggs CVPR 05]

SPM [Lazebnik et al. CVPR 06] Textons

SURF, MSER, LBP, Color-SIFT, Color histogram, GLOH, …..

and many others:

• Each layer of hierarchy extracts features from output of previous layer

• All the way from pixels → classifier

• Layers have the (nearly) same structure

• Train all layers jointly

Learning a Hierarchy of Feature Extractors

Layer 1 Layer 2 Layer 3 Simple Classifier

Image/Video Pixels

Image/video Labels

Slide: Rob Fergus

Page 11: Lecture 25


Learning Feature Hierarchy

Goal: Learn useful higher-level features from images

Feature representation

Input data

1st layer “Edges”

2nd layer “Object parts”

3rd layer “Objects”

Pixels

Lee et al., ICML 2009;

CACM 2011

Slide: Rob Fergus

Learning Feature Hierarchy

• Better performance

• Other domains (unclear how to hand engineer):
– Kinect
– Video
– Multi-spectral

• Feature computation time
– Dozens of features now regularly used [e.g., MKL]
– Getting prohibitive for large datasets (tens of seconds per image)

Slide: R. Fergus

Page 12: Lecture 25


Biological neuron and Perceptrons

A biological neuron An artificial neuron (Perceptron) - a linear classifier

Jia-Bin Huang and Derek Hoiem, UIUC

Simple, Complex and Hypercomplex cells

David H. Hubel and Torsten Wiesel

David Hubel's Eye, Brain, and Vision

Suggested a hierarchy of feature detectors

in the visual cortex, with higher level features

responding to patterns of activation in lower

level cells, and propagating activation

upwards to still higher level cells.

Jia-Bin Huang and Derek Hoiem, UIUC

Page 13: Lecture 25


Hubel/Wiesel Architecture and Multi-layer Neural Network

Hubel and Wiesel's architecture; Multi-layer Neural Network (a non-linear classifier)

Jia-Bin Huang and Derek Hoiem, UIUC

Neuron: Linear Perceptron

Inputs are feature values

Each feature has a weight

Sum is the activation

If the activation is:

Positive, output +1

Negative, output -1

Slide credit: Pieter Abbeel and Dan Klein
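A minimal sketch of that neuron, plus the classic mistake-driven update rule (the update rule is my own addition; the slide only describes prediction):

```python
import numpy as np

def perceptron_predict(w, x):
    # Weighted sum of feature values; +1 if the activation is positive, else -1.
    return 1 if np.dot(w, x) > 0 else -1

def perceptron_train(X, y, epochs=10):
    # Mistake-driven updates: nudge w toward misclassified positives, away from negatives.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if perceptron_predict(w, xi) != yi:
                w += yi * xi
    return w
```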

Page 14: Lecture 25


Multi-layer Neural Network

• A non-linear classifier

• Training: find network weights $\mathbf{w}$ to minimize the error between true training labels $y_i$ and estimated labels $f_{\mathbf{w}}(\mathbf{x}_i)$

• Minimization can be done by gradient descent provided 𝑓 is differentiable

• This training method is called back-propagation

Jia-Bin Huang and Derek Hoiem, UIUC
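To make $f_{\mathbf{w}}(\mathbf{x})$ concrete, a toy two-layer network and squared-error loss (the layer sizes, sigmoid non-linearity, and loss are my own choices; the slides do not fix them):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    h = sigmoid(W1 @ x + b1)        # hidden layer; the non-linearity is what
    return sigmoid(W2 @ h + b2)     # makes the overall classifier non-linear

def loss(X, y, W1, b1, W2, b2):
    preds = np.array([forward(x, W1, b1, W2, b2) for x in X]).ravel()
    return np.mean((preds - y) ** 2)  # error between true and estimated labels

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)   # 2 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # 4 hidden units -> 1 output
```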

Two-layer perceptron network

Slide credit: Pieter Abbeel and Dan Klein

Page 15: Lecture 25


Two-layer perceptron network

Slide credit: Pieter Abbeel and Dan Klein

Two-layer perceptron network

Slide credit: Pieter Abbeel and Dan Klein

Page 16: Lecture 25


Learning w

Training examples

Objective: a misclassification loss

Procedure:

Gradient descent / hill climbing

Slide credit: Pieter Abbeel and Dan Klein
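A generic version of that procedure, using numerical gradients for clarity (back-propagation computes the same gradients analytically and far more efficiently):

```python
import numpy as np

def numerical_grad(f, w, eps=1e-5):
    # Central-difference estimate of the gradient of f at w.
    g = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return g

def gradient_descent(loss_fn, w0, lr=0.1, steps=100):
    w = w0.copy()
    for _ in range(steps):
        w -= lr * numerical_grad(loss_fn, w)   # step downhill on the loss
    return w

# Tiny check on a quadratic: the minimum is at w = 3.
print(gradient_descent(lambda w: ((w - 3) ** 2).sum(), np.array([0.0])))
```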

Two-layer perceptron network

Slide credit: Pieter Abbeel and Dan Klein

Page 17: Lecture 25


Two-layer perceptron network

Slide credit: Pieter Abbeel and Dan Klein

Two-layer neural network

Slide credit: Pieter Abbeel and Dan Klein

Page 18: Lecture 25


Neural network properties

Theorem (Universal function approximators): A

two-layer network with a sufficient number of neurons can approximate any continuous function to any desired accuracy

Practical considerations:

Can be seen as learning the features

Large number of neurons

Danger for overfitting

Hill-climbing procedure can get stuck in bad local

optima

Slide credit: Pieter Abbeel and Dan Klein
Approximation by Superpositions of a Sigmoidal Function, 1989

Today

• Pyramid match kernels

• (Deep) Neural networks

• Convolutional neural networks

Page 19: Lecture 25


Convolutional Neural Networks (CNN, ConvNet, DCN)

• CNN = a multi-layer neural network with

– Local connectivity:
• Neurons in a layer are only connected to a small region of the layer before it
– Share weight parameters across spatial positions:
• Learning shift-invariant filter kernels

Image credit: A. Karpathy
Jia-Bin Huang and Derek Hoiem, UIUC
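A back-of-the-envelope illustration of why those two properties matter (the input size and filter count below are my own example, not from the slide):

```python
# 32x32x3 input, 32 output maps of the same spatial size.
H, W, C = 32, 32, 3
out_maps, k = 32, 5

fully_connected = (H * W * C) * (H * W * out_maps)   # every unit sees every pixel
conv_layer = out_maps * (k * k * C)                  # one shared 5x5x3 filter per map

print(f"{fully_connected:,} weights if fully connected")     # 100,663,296
print(f"{conv_layer:,} weights with local, shared filters")  # 2,400
```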

Neocognitron [Fukushima, Biological Cybernetics 1980]

Deformation-Resistant Recognition

S-cells: (simple)

- extract local features

C-cells: (complex)

- allow for positional errors

Jia-Bin Huang and Derek Hoiem, UIUC

Page 20: Lecture 25


LeNet [LeCun et al. 1998]

Gradient-based learning applied to document recognition [LeCun, Bottou, Bengio, Haffner 1998] LeNet-1 from 1993

Jia-Bin Huang and Derek Hoiem, UIUC

What is a Convolution?

• Weighted moving sum

Input Feature Activation Map

.

.

.

slide credit: S. Lazebnik
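A naive version of that weighted moving sum (strictly speaking a cross-correlation, i.e. no kernel flip, which is what CNN layers typically compute; no padding or stride here):

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the filter over the image and take a dot product at each position.
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.random.rand(8, 8)
vertical_edge = np.array([[1.0, 0.0, -1.0]] * 3)   # crude vertical-edge filter
print(conv2d(img, vertical_edge).shape)            # (6, 6) activation map
```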

Page 21: Lecture 25


Input Image

Convolution (Learned)

Non-linearity

Spatial pooling

Normalization

Convolutional Neural Networks

Feature maps

slide credit: S. Lazebnik

Input Image

Convolution (Learned)

Non-linearity

Spatial pooling

Normalization

Feature maps

Input Feature Map

.

.

.

Convolutional Neural Networks

slide credit: S. Lazebnik

Page 22: Lecture 25


Input Image

Convolution (Learned)

Non-linearity

Spatial pooling

Normalization

Feature maps

Convolutional Neural Networks

Rectified Linear Unit (ReLU)

slide credit: S. Lazebnik

Input Image

Convolution (Learned)

Non-linearity

Spatial pooling

Normalization

Feature maps

Max pooling

Convolutional Neural Networks

slide credit: S. Lazebnik

Max-pooling: a non-linear down-sampling

Provides translation invariance
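A minimal sketch of those two stages, ReLU followed by 2x2 max pooling (assumes the feature-map sides are multiples of 2):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)               # rectified linear unit

def max_pool_2x2(fmap):
    # Non-linear down-sampling: keep the max of every 2x2 block.
    H, W = fmap.shape
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.random.randn(4, 4)
print(max_pool_2x2(relu(fmap)))           # 2x2 pooled output
```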

Page 23: Lecture 25


Input Image

Convolution (Learned)

Non-linearity

Spatial pooling

Normalization

Feature maps

Convolutional Neural Networks

slide credit: S. Lazebnik

Engineered vs. learned features

Image

Feature extraction

Pooling

Classifier

Label

Image

Convolution/pool

Convolution/pool

Convolution/pool

Convolution/pool

Convolution/pool

Dense

Dense

Dense

Label

Convolutional filters are trained in a

supervised manner by back-propagating

classification error

Jia-Bin Huang and Derek Hoiem, UIUC

Page 24: Lecture 25


SIFT Descriptor

Image Pixels Apply

oriented filters

Spatial pool

(Sum)

Normalize to unit length

Feature Vector

Lowe [IJCV 2004]

slide credit: R. Fergus

Spatial Pyramid Matching

SIFT Features

Filter with Visual Words

Multi-scale spatial pool

(Sum)

Max

Classifier

Lazebnik, Schmid,

Ponce [CVPR 2006]

slide credit: R. Fergus

Page 25: Lecture 25


AlexNet

• Similar framework to LeCun’98 but:
– Bigger model (7 hidden layers, 650,000 units, 60,000,000 params)
– More data (10^6 vs. 10^3 images)
– GPU implementation (50x speedup over CPU)
– Trained on two GPUs for a week

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012

Jia-Bin Huang and Derek Hoiem, UIUC
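As a sanity check on the parameter count, a rough tally using the layer shapes commonly reported for AlexNet (this ignores the two-GPU split and bias terms, so it is only approximate):

```python
# (output units/filters, weights per unit/filter) for each weight layer.
conv = [
    (96, 11 * 11 * 3),     # conv1
    (256, 5 * 5 * 96),     # conv2
    (384, 3 * 3 * 256),    # conv3
    (384, 3 * 3 * 384),    # conv4
    (256, 3 * 3 * 384),    # conv5
]
fc = [
    (4096, 256 * 6 * 6),   # fc6
    (4096, 4096),          # fc7
    (1000, 4096),          # fc8
]
total = sum(n * k for n, k in conv + fc)
print(f"{total:,}")        # roughly 62 million; the fully connected layers dominate
```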

Applications

• Handwritten text/digits

– MNIST (0.17% error [Ciresan et al. 2011])

– Arabic & Chinese [Ciresan et al. 2012]

• Simpler recognition benchmarks

– CIFAR-10 (9.3% error [Wan et al. 2013])

– Traffic sign recognition
• 0.56% error vs 1.16% for humans [Ciresan et al. 2011]

Slide: R. Fergus

Page 26: Lecture 25


Application: ImageNet

[Deng et al. CVPR 2009]

• ~14 million labeled images, 20k classes

• Images gathered from Internet

• Human labels via Amazon Turk

https://sites.google.com/site/deeplearningcvpr2014 Slide: R. Fergus

AlexNet [Krizhevsky et al., NIPS 2012]

• 7 hidden layers, 650,000 neurons, 60,000,000 parameters

• Trained on 2 GPUs for a week

• Similar model as LeCun’98 but:

- Bigger model (8 layers)

- More data (10^6 vs 10^3 images)

- GPU implementation (50x speedup over CPU)

https://sites.google.com/site/deeplearningcvpr2014 Slide: R. Fergus

Page 27: Lecture 25


ImageNet Classification 2012

• Krizhevsky et al. -- 16.4% error (top-5)

• Next best (non-convnet) – 26.2% error

[Bar chart: top-5 error rate (%) for the SuperVision, ISI, Oxford, INRIA, and Amsterdam entries]

https://sites.google.com/site/deeplearningcvpr2014 Slide: R. Fergus

ImageNet Classification 2013 Results

• http://www.image-net.org/challenges/LSVRC/2013/results.php

[Bar chart: test error (top-5), roughly 0.11 to 0.17 across the 2013 entries]

https://sites.google.com/site/deeplearningcvpr2014

Page 28: Lecture 25


ImageNet Challenge 2012-2014

Team | Year | Place | Error (top-5) | External data
SuperVision – Toronto (7 layers) | 2012 | - | 16.4% | no
SuperVision | 2012 | 1st | 15.3% | ImageNet 22k
Clarifai – NYU (7 layers) | 2013 | - | 11.7% | no
Clarifai | 2013 | 1st | 11.2% | ImageNet 22k
VGG – Oxford (16 layers) | 2014 | 2nd | 7.32% | no
GoogLeNet (19 layers) | 2014 | 1st | 6.67% | no
Human expert* | | | 5.1% |

(Additional top-5 error figures listed on the slide: 5.33%, 4.94%, 4.82%.)

Best non-convnet in 2012: 26.2%

Jia-Bin Huang and Derek Hoiem, UIUC

Industry Deployment

• Used in Facebook, Google, Microsoft

• Image Recognition, Speech Recognition, ….

• Fast at test time

Taigman et al. DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR’14

Slide: R. Fergus

Page 29: Lecture 25


Beyond classification

• Detection

• Segmentation

• Regression

• Pose estimation

• Matching patches

• Synthesis

and many more…

Jia-Bin Huang and Derek Hoiem, UIUC

R-CNN: Regions with CNN features

• Trained on ImageNet classification

• Finetune CNN on PASCAL

R-CNN [Girshick et al. CVPR 2014]
Jia-Bin Huang and Derek Hoiem, UIUC

Page 30: Lecture 25


Labeling Pixels: Semantic Labels

Fully Convolutional Networks for Semantic Segmentation [Long et al. CVPR 2015]
Jia-Bin Huang and Derek Hoiem, UIUC

Labeling Pixels: Edge Detection

DeepEdge: A Multi-Scale Bifurcated Deep Network for Top-Down Contour Detection [Bertasius et al. CVPR 2015]
Jia-Bin Huang and Derek Hoiem, UIUC

Page 31: Lecture 25


CNN for Regression

DeepPose [Toshev and Szegedy CVPR 2014]
Jia-Bin Huang and Derek Hoiem, UIUC

CNN as a Similarity Measure for Matching

FaceNet [Schroff et al. 2015]
Stereo matching [Zbontar and LeCun CVPR 2015]
Compare patch [Zagoruyko and Komodakis 2015]
Match ground and aerial images [Lin et al. CVPR 2015]
FlowNet [Fischer et al. 2015]
Jia-Bin Huang and Derek Hoiem, UIUC

Page 32: Lecture 25


CNN for Image Generation

Learning to Generate Chairs with Convolutional Neural Networks [Dosovitskiy et al. CVPR 2015]
Jia-Bin Huang and Derek Hoiem, UIUC

Chair Morphing

Learning to Generate Chairs with Convolutional Neural Networks [Dosovitskiy et al. CVPR 2015]
Jia-Bin Huang and Derek Hoiem, UIUC

Page 33: Lecture 25


Recap

• Pyramid match kernels: example of structured input data for kernel-based classifiers (SVM)

• Neural networks / multi-layer perceptrons
– View of neural networks as learning a hierarchy of features

• Convolutional neural networks
– Architecture of network accounts for image structure
– “End-to-end” recognition from pixels
– Together with big (labeled) data and lots of computation → major success on benchmarks, in image classification and beyond

Coming up

Thursday

• Visual attributes and learning to rank

Final exam

• Dec 9, 2-5 pm, in JGB 2.216

• Comprehensive

• Closed book

• Two pages of notes allowed