
Page 1: 6.S093 Visual Recognition through Machine Learning Competition

6.S093 Visual Recognition through Machine Learning Competition

Image by kirkh.deviantart.com

Aditya Khosla and Joseph Lim

Page 2: 6.S093 Visual Recognition through Machine Learning Competition

Today’s class

• Part 1: Introduction to deep learning
  • What is deep learning?
  • Why deep learning?
  • Some common deep learning algorithms

• Part 2: Deep learning tutorial
  • Please install Python now!

Page 3: 6.S093 Visual Recognition through Machine Learning Competition

Slide credit

• Many slides are taken/adapted from Andrew Ng’s.

Page 4: 6.S093 Visual Recognition through Machine Learning Competition

Typical goal of machine learning

• Images/video → ML → label: “Motorcycle”, suggest tags, image search, …

• Audio → ML → speech recognition, music classification, speaker identification, …

• Text → ML → web search, anti-spam, machine translation, …

Page 5: 6.S093 Visual Recognition through Machine Learning Competition

Typical goal of machine learning

[Same input → ML → output diagram as the previous slide]

Feature engineering: most time-consuming!

Page 6: 6.S093 Visual Recognition through Machine Learning Competition

Our goal in object classification

[Diagram: image → ML → “motorcycle”]

Page 7: 6.S093 Visual Recognition through Machine Learning Competition

Why is this hard?

You see this:

But the camera sees this:

Page 8: 6.S093 Visual Recognition through Machine Learning Competition

Pixel-based representation

Input: raw image → Learning algorithm

[Scatter plot: motorbikes vs. “non”-motorbikes in raw-pixel space (axes: pixel 1, pixel 2)]

Page 9: 6.S093 Visual Recognition through Machine Learning Competition

Pixel-based representation

[Same raw-pixel scatter plot as the previous slide: motorbikes vs. “non”-motorbikes (axes: pixel 1, pixel 2)]

Page 10: 6.S093 Visual Recognition through Machine Learning Competition

Pixel-based representation

[Same raw-pixel scatter plot as the previous slide: motorbikes vs. “non”-motorbikes (axes: pixel 1, pixel 2)]

Page 11: 6.S093 Visual Recognition through Machine Learning Competition

What we want

Input: raw image → Feature representation → Learning algorithm

E.g., does it have handlebars? Wheels?

[Scatter plot: motorbikes vs. “non”-motorbikes in feature space (axes: handlebars, wheels)]

Page 12: 6.S093 Visual Recognition through Machine Learning Competition

Some feature representations

SIFT, Spin image, HoG, RIFT, Textons, GLOH

Page 13: 6.S093 Visual Recognition through Machine Learning Competition

Some feature representations

SIFT, Spin image, HoG, RIFT, Textons, GLOH

Coming up with features is often difficult, time-consuming, and requires expert knowledge.

Page 14: 6.S093 Visual Recognition through Machine Learning Competition

The brain: potential motivation for deep learning

[Roe et al., 1992]

Auditory cortex learns to see!


Page 15: 6.S093 Visual Recognition through Machine Learning Competition

The brain adapts!

[BrainPort; Welsh & Blasch, 1997; Nagel et al., 2005; Constantine-Paton & Law, 2009]

• Seeing with your tongue
• Human echolocation (sonar)
• Haptic belt: direction sense
• Implanting a 3rd eye

Page 16: 6.S093 Visual Recognition through Machine Learning Competition

Basic idea of deep learning

• Also referred to as representation learning or unsupervised feature learning (with subtle distinctions).

• Is there some way to extract meaningful features from data, even without knowing the task to be performed?

• Then, throw in some hierarchical ‘stuff’ to make it ‘deep’.

Page 17: 6.S093 Visual Recognition through Machine Learning Competition

Feature learning problem

• Given a 14x14 image patch x, we can represent it using 196 real numbers, e.g. [255, 98, 93, 87, 89, 91, 48, …].

• Problem: can we learn a better feature vector to represent this?
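A minimal Python sketch of this starting representation: flattening a 14x14 patch into its 196 raw pixel values (the pixel values here are made up).

```python
# Flatten a 14x14 grayscale patch into a vector of 196 raw intensities.
# The patch contents are random stand-ins for a real image patch.
import numpy as np

rng = np.random.RandomState(0)
patch = rng.randint(0, 256, size=(14, 14))   # a 14x14 image patch
x = patch.reshape(-1)                        # 196 real numbers, e.g. [255, 98, 93, ...]
print(x.shape)                               # (196,)
```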

Page 18: 6.S093 Visual Recognition through Machine Learning Competition

First stage of visual processing: V1

V1 is the first stage of visual processing in the brain. Neurons in V1 are typically modeled as edge detectors:

[Figure: model receptive fields of Neuron #1 and Neuron #2 of visual cortex]

Page 19: 6.S093 Visual Recognition through Machine Learning Competition

Learning sensor representations

Sparse coding (Olshausen & Field, 1996)

Input: images x(1), x(2), …, x(m) (each in R^{n×n})

Learn: a dictionary of bases φ1, φ2, …, φk (also in R^{n×n}), so that each input x can be approximately decomposed as

    x ≈ Σ_{j=1}^{k} a_j φ_j,   s.t. the a_j’s are mostly zero (“sparse”)
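As a concrete (if simplified) illustration, here is a sparse-coding sketch that uses scikit-learn’s MiniBatchDictionaryLearning in place of the original Olshausen & Field procedure; the random “patches”, the 14x14 patch size, the 64-atom dictionary, and the sparsity settings are all illustrative assumptions.

```python
# A minimal sketch of sparse coding, not the lecture's exact algorithm.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.RandomState(0)
patches = rng.randn(5000, 14 * 14)               # stand-in for natural-image patches
patches -= patches.mean(axis=1, keepdims=True)   # remove the DC component per patch

# Learn a dictionary of k = 64 bases phi_1..phi_64
learner = MiniBatchDictionaryLearning(
    n_components=64,
    alpha=1.0,                     # sparsity penalty
    transform_algorithm="omp",     # keep only a few nonzero a_j per patch
    transform_n_nonzero_coefs=3,
    random_state=0,
)
codes = learner.fit(patches).transform(patches)  # codes: (5000, 64), mostly zeros
bases = learner.components_                      # each row is one basis phi_j

# Each patch x is approximated as sum_j a_j * phi_j
x_hat = codes @ bases
print("avg nonzeros per patch:", (codes != 0).sum(axis=1).mean())
```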

Page 20: 6.S093 Visual Recognition through Machine Learning Competition

Sparse coding illustration

Natural images → learned bases (φ1, …, φ64): “edges”

[Figure: sample natural images and the 64 learned edge-like bases]

Test example:

    x ≈ 0.8 * φ36 + 0.3 * φ42 + 0.5 * φ63

    [a1, …, a64] = [0, 0, …, 0, 0.8, 0, …, 0, 0.3, 0, …, 0, 0.5, 0]  (feature representation)

Page 21: 6.S093 Visual Recognition through Machine Learning Competition

Sparse coding illustration

    ≈ 0.6 * φ15 + 0.8 * φ28 + 0.4 * φ37   →   represent as [a15=0.6, a28=0.8, a37=0.4]

    ≈ 1.3 * φ5 + 0.9 * φ18 + 0.3 * φ29   →   represent as [a5=1.3, a18=0.9, a29=0.3]

• Method “invents” edge detection.
• Automatically learns to represent an image in terms of the edges that appear in it; gives a more succinct, higher-level representation than the raw pixels.
• Quantitatively similar to primary visual cortex (area V1) in the brain.

Page 22: 6.S093 Visual Recognition through Machine Learning Competition

Going deep

pixels → edges → object parts (combinations of edges) → object models

[Honglak Lee]

Training set: aligned images of faces.

Page 23: 6.S093 Visual Recognition through Machine Learning Competition

Why deep learning?

Method / Accuracy:

• Hessian + ESURF [Willems et al 2008]: 38%
• Harris3D + HOG/HOF [Laptev et al 2003, 2004]: 45%
• Cuboids + HOG/HOF [Dollar et al 2005, Laptev 2004]: 46%
• Hessian + HOG/HOF [Laptev 2004, Willems et al 2008]: 46%
• Dense + HOG/HOF [Laptev 2004]: 47%
• Cuboids + HOG3D [Klaser 2008, Dollar et al 2005]: 46%
• Unsupervised feature learning (our method): 52%

[Le, Zhou & Ng, 2011]

Task: video activity recognition

Page 24: 6.S093 Visual Recognition through Machine Learning Competition

Audio
• TIMIT phone classification: Prior art (Clarkson et al., 1999) 79.6%; Feature learning 80.3%
• TIMIT speaker identification: Prior art (Reynolds, 1995) 99.7%; Feature learning 100.0%

Images
• CIFAR object classification: Prior art (Ciresan et al., 2011) 80.5%; Feature learning 82.0%
• NORB object classification: Prior art (Scherer et al., 2010) 94.4%; Feature learning 95.0%

Multimodal (audio/video)
• AVLetters lip reading: Prior art (Zhao et al., 2009) 58.9%; Stanford Feature learning 65.8%

Video
• Hollywood2 classification: Prior art (Laptev et al., 2004) 48%; Feature learning 53%
• KTH: Prior art (Wang et al., 2010) 92.1%; Feature learning 93.9%
• UCF: Prior art (Wang et al., 2010) 85.6%; Feature learning 86.5%
• YouTube: Prior art (Liu et al., 2009) 71.2%; Feature learning 75.8%

Text/NLP
• Paraphrase detection: Prior art (Das & Smith, 2009) 76.1%; Feature learning 76.4%
• Sentiment (MR/MPQA data): Prior art (Nakagawa et al., 2010) 77.3%; Feature learning 77.7%

Page 25: 6.S093 Visual Recognition through Machine Learning Competition

Speech recognition on Android

Page 26: 6.S093 Visual Recognition through Machine Learning Competition

Impact on speech recognition

Page 27: 6.S093 Visual Recognition through Machine Learning Competition

Application to Google Streetview

Page 28: 6.S093 Visual Recognition through Machine Learning Competition

ImageNet classification: 22,000 classes

…
smoothhound, smoothhound shark, Mustelus mustelus
American smooth dogfish, Mustelus canis
Florida smoothhound, Mustelus norrisi
whitetip shark, reef whitetip shark, Triaenodon obseus
Atlantic spiny dogfish, Squalus acanthias
Pacific spiny dogfish, Squalus suckleyi
hammerhead, hammerhead shark
smooth hammerhead, Sphyrna zygaena
smalleye hammerhead, Sphyrna tudes
shovelhead, bonnethead, bonnet shark, Sphyrna tiburo
angel shark, angelfish, Squatina squatina, monkfish
electric ray, crampfish, numbfish, torpedo
smalltooth sawfish, Pristis pectinatus
guitarfish
roughtail stingray, Dasyatis centroura
butterfly ray
eagle ray
spotted eagle ray, spotted ray, Aetobatus narinari
cownose ray, cow-nosed ray, Rhinoptera bonasus
manta, manta ray, devilfish
Atlantic manta, Manta birostris
devil ray, Mobula hypostoma
grey skate, gray skate, Raja batis
little skate, Raja erinacea
…

[Example images: stingray, manta ray]

Page 29: 6.S093 Visual Recognition through Machine Learning Competition

Random guess: 0.005%
State-of-the-art (Weston, Bengio ’11): 9.5%
Feature learning from raw pixels: ?

Le et al., Building high-level features using large-scale unsupervised learning. ICML 2012.

ImageNet Classification: 14M images, 22k categories

Page 30: 6.S093 Visual Recognition through Machine Learning Competition

Random guess: 0.005%
State-of-the-art (Weston, Bengio ’11): 9.5%
Feature learning from raw pixels: 21.3%

Le et al., Building high-level features using large-scale unsupervised learning. ICML 2012.

ImageNet Classification: 14M images, 22k categories

Page 31: 6.S093 Visual Recognition through Machine Learning Competition

Some common deep architectures

• Autoencoders
• Deep belief networks (DBNs)
• Convolutional variants
• Sparse coding

Page 32: 6.S093 Visual Recognition through Machine Learning Competition

Logistic regression

Logistic regression has a learned parameter vector θ. On input x, it outputs:

    h_θ(x) = 1 / (1 + exp(-θ^T x)),   where θ^T x = Σ_i θ_i x_i

Draw a logistic regression unit as:

[Figure: a unit with inputs x1, x2, x3 and a +1 bias feeding a single sigmoid output]
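A minimal NumPy sketch of one such unit; the weight vector and example input below are made-up values for illustration.

```python
# A single logistic regression unit: sigmoid of a weighted sum of the inputs.
import numpy as np

def logistic_unit(x, theta):
    """Return 1 / (1 + exp(-theta^T x)); the bias is folded into x as a +1 entry."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

x = np.array([0.5, -1.2, 2.0, 1.0])      # x1, x2, x3 and the +1 bias input
theta = np.array([0.3, 0.1, -0.4, 0.2])  # illustrative learned parameters
print(logistic_unit(x, theta))           # a value in (0, 1)
```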

Page 33: 6.S093 Visual Recognition through Machine Learning Competition

Neural Network

String a lot of logistic units together. Example 3-layer network:

[Figure: Layer 1 inputs x1, x2, x3 (+1 bias) → Layer 2 hidden units a1, a2, a3 (+1 bias) → Layer 3 output]
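A minimal sketch of such a 3-layer network as a forward pass over logistic units; the layer sizes and random weights are illustrative assumptions.

```python
# Forward pass of a small 3-layer network (input -> hidden a1..a3 -> output).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
W1, b1 = rng.randn(3, 3) * 0.1, np.zeros(3)   # Layer 1 -> Layer 2 (hidden units)
W2, b2 = rng.randn(1, 3) * 0.1, np.zeros(1)   # Layer 2 -> Layer 3 (output unit)

def forward(x):
    a = sigmoid(W1 @ x + b1)       # hidden activations a1, a2, a3
    return sigmoid(W2 @ a + b2)    # network output

print(forward(np.array([0.5, -1.2, 2.0])))
```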

Page 34: 6.S093 Visual Recognition through Machine Learning Competition

Neural Network

Example 4-layer network with 2 output units:

[Figure: Layer 1 inputs x1, x2, x3 (+1 bias) → Layer 2 (+1 bias) → Layer 3 (+1 bias) → Layer 4 with 2 output units]

Page 35: 6.S093 Visual Recognition through Machine Learning Competition

Training a neural network

Given a training set (x1, y1), (x2, y2), (x3, y3), …

Adjust the parameters θ (for every node) to make the outputs h_θ(xi) close to the targets yi, e.g. by minimizing Σ_i ||h_θ(xi) - yi||².

(Use gradient descent: the “backpropagation” algorithm. Susceptible to local optima.)
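A minimal sketch of this training loop (gradient descent with backpropagation) on a toy XOR problem; the layer sizes, learning rate, and squared-error loss are illustrative choices rather than the lecture’s exact setup.

```python
# Train a tiny network with backpropagation; may land in a local optimum,
# as the slide warns. All values below are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])                 # XOR targets
N = len(X)

W1, b1 = rng.randn(4, 2), np.zeros(4)              # input (2) -> hidden (4)
W2, b2 = rng.randn(1, 4), np.zeros(1)              # hidden (4) -> output (1)
lr = 2.0

for step in range(10000):
    # forward pass
    a = sigmoid(X @ W1.T + b1)                     # hidden activations, (N, 4)
    yhat = sigmoid(a @ W2.T + b2).ravel()          # outputs, (N,)

    # backpropagate the gradient of the squared-error loss
    d2 = (yhat - y) * yhat * (1 - yhat) / N        # error signal at the output
    d1 = np.outer(d2, W2.ravel()) * a * (1 - a)    # error signal at the hidden layer

    # gradient-descent parameter updates
    W2 -= lr * (d2 @ a)[None, :];  b2 -= lr * d2.sum(keepdims=True)
    W1 -= lr * d1.T @ X;           b1 -= lr * d1.sum(axis=0)

print("final squared error:", np.mean((yhat - y) ** 2))
```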

Page 36: 6.S093 Visual Recognition through Machine Learning Competition

Unsupervised feature learning with a neural network

Autoencoder: the network is trained to output its input (i.e., to learn the identity function).

[Figure: Layer 1 inputs x1…x6 (+1 bias) → Layer 2 hidden units a1, a2, a3 (+1 bias) → Layer 3 reconstructs x1…x6]

The solution is trivial unless we:
• constrain the number of units in Layer 2 (learn a compressed representation), or
• constrain Layer 2 to be sparse.
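A minimal sketch of a sparse autoencoder along these lines, trained by plain gradient descent; the layer sizes, the L1 sparsity penalty, and the random data are illustrative stand-ins, not the lecture’s exact formulation.

```python
# A 6 -> 3 -> 6 autoencoder with an L1 sparsity penalty on the hidden code.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = rng.rand(200, 6)                          # toy inputs x1..x6
W1, b1 = rng.randn(3, 6) * 0.1, np.zeros(3)   # encoder: 6 -> 3 hidden units a1..a3
W2, b2 = rng.randn(6, 3) * 0.1, np.zeros(6)   # decoder: 3 -> 6 reconstruction
lr, lam = 0.5, 1e-3                           # learning rate and sparsity weight

for step in range(2000):
    A = sigmoid(X @ W1.T + b1)                # hidden code (the new representation)
    Xhat = A @ W2.T + b2                      # linear reconstruction of the input

    # gradients of  mean ||Xhat - X||^2 + lam * |A|
    dXhat = 2 * (Xhat - X) / len(X)
    dA = dXhat @ W2 + lam * np.sign(A) / len(X)
    dZ1 = dA * A * (1 - A)

    W2 -= lr * dXhat.T @ A;  b2 -= lr * dXhat.sum(axis=0)
    W1 -= lr * dZ1.T @ X;    b1 -= lr * dZ1.sum(axis=0)

features = sigmoid(X @ W1.T + b1)             # use a1..a3 as features downstream
```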

Page 37: 6.S093 Visual Recognition through Machine Learning Competition

Unsupervised feature learning with a neural network

[Figure: the same autoencoder network as on the previous slide]

Page 38: 6.S093 Visual Recognition through Machine Learning Competition

Unsupervised feature learning with a neural network

[Figure: inputs x1…x6 feed the hidden units a1, a2, a3; the reconstruction layer is removed]

New representation for input.

Page 39: 6.S093 Visual Recognition through Machine Learning Competition

Unsupervised feature learning with a neural network

[Figure: same network, inputs x1…x6 feeding hidden units a1, a2, a3]

Page 40: 6.S093 Visual Recognition through Machine Learning Competition

Unsupervised feature learning with a neural network

[Figure: a second layer of units b1, b2, b3 (+1 bias) is added on top of a1, a2, a3]

Train the parameters so that the a’s can be reconstructed from the b’s, subject to the bi’s being sparse.

Page 41: 6.S093 Visual Recognition through Machine Learning Competition

Unsupervised feature learning with a neural network

[Figure: same network as on the previous slide]

Train the parameters so that the a’s can be reconstructed from the b’s, subject to the bi’s being sparse.

Page 42: 6.S093 Visual Recognition through Machine Learning Competition

Unsupervised feature learning with a neural network

[Figure: same network as on the previous slide]

Train the parameters so that the a’s can be reconstructed from the b’s, subject to the bi’s being sparse.
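A minimal, self-contained sketch of this greedy second-layer step: a first (already-trained) encoder produces a code A for every input, then a second autoencoder is trained so that A is reconstructed from a new, sparse code B. All weights, sizes, and penalties below are illustrative.

```python
# Greedy layer-wise training: freeze layer 1, train layer 2 on its activations.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = rng.rand(200, 6)                          # toy inputs x1..x6
W1, b1 = rng.randn(3, 6) * 0.1, np.zeros(3)   # stands in for the trained layer-1 encoder
A = sigmoid(X @ W1.T + b1)                    # first-layer code a1..a3 (held fixed)

V1, c1 = rng.randn(3, 3) * 0.1, np.zeros(3)   # second-layer encoder: a -> b
V2, c2 = rng.randn(3, 3) * 0.1, np.zeros(3)   # second-layer decoder: b -> a_hat
lr, lam = 0.5, 1e-3

for step in range(2000):
    B = sigmoid(A @ V1.T + c1)                # new code b1..b3
    Ahat = B @ V2.T + c2                      # reconstruction of the a's

    dAhat = 2 * (Ahat - A) / len(A)
    dB = dAhat @ V2 + lam * np.sign(B) / len(A)   # sparsity pressure on the b's
    dZ = dB * B * (1 - B)

    V2 -= lr * dAhat.T @ B;  c2 -= lr * dAhat.sum(axis=0)
    V1 -= lr * dZ.T @ A;     c1 -= lr * dZ.sum(axis=0)
```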

Page 43: 6.S093 Visual Recognition through Machine Learning Competition

Unsupervised feature learning with a neural network

[Figure: inputs x1…x6 → a1, a2, a3 → b1, b2, b3]

New representation for input.

Page 44: 6.S093 Visual Recognition through Machine Learning Competition

Unsupervised feature learning with a neural network

[Figure: same network as on the previous slide]

Page 45: 6.S093 Visual Recognition through Machine Learning Competition

Unsupervised feature learning with a neural network

[Figure: a third layer of units c1, c2, c3 (+1 bias) is added on top of b1, b2, b3]

Page 46: 6.S093 Visual Recognition through Machine Learning Competition

Unsupervised feature learning with a neural network

[Figure: inputs x1…x6 → a1, a2, a3 → b1, b2, b3 → c1, c2, c3]

New representation for input.

Use [c1, c2, c3] as the representation to feed to a learning algorithm.
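A minimal sketch of this last step, with scikit-learn’s LogisticRegression standing in for the downstream learning algorithm; the features and labels are random stand-ins for the learned codes [c1, c2, c3] and a real supervised task.

```python
# Feed the top-layer learned representation to an ordinary supervised classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
C = rng.rand(200, 3)                 # stands in for the learned codes [c1, c2, c3]
y = rng.randint(0, 2, size=200)      # labels for the supervised task

clf = LogisticRegression().fit(C, y)
print(clf.predict(C[:5]))
```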

Page 47: 6.S093 Visual Recognition through Machine Learning Competition


Deep Belief Net

Deep Belief Net (DBN) is another algorithm for learning a feature hierarchy.

Building block: 2-layer graphical model (Restricted Boltzmann Machine).

Can then learn additional layers one at a time.

Page 48: 6.S093 Visual Recognition through Machine Learning Competition

Restricted Boltzmann machine (RBM)

[Figure: input layer [x1, x2, x3, x4] fully connected to binary-valued hidden units a1, a2, a3 (Layer 2)]

MRF with joint distribution (bias terms omitted):

    P(x, a) ∝ exp( Σ_{i,j} W_ij x_i a_j )

Use Gibbs sampling for inference.

Given observed inputs x, we want the maximum likelihood estimate of the weights:

    max_W Σ_m log P(x^(m))

Page 49: 6.S093 Visual Recognition through Machine Learning Competition

Restricted Boltzmann machine (RBM)

[Figure: input layer [x1, x2, x3, x4] connected to binary-valued hidden units a1, a2, a3 (Layer 2)]

Gradient ascent on log P(x):

    W_ij := W_ij + α ( [x_i a_j]_obs - [x_i a_j]_prior )

• [x_i a_j]_obs: from fixing x to the observed value and sampling a from P(a | x).
• [x_i a_j]_prior: from running Gibbs sampling to convergence.

Adding a sparsity constraint on the ai’s usually improves results.
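A minimal sketch of this update using 1-step contrastive divergence as a cheap stand-in for running Gibbs sampling to convergence; the binary units without bias terms and the random toy data are illustrative simplifications.

```python
# RBM weight updates via 1-step contrastive divergence (an approximation of
# the [x_i a_j]_obs - [x_i a_j]_prior gradient on the slide).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = (rng.rand(500, 4) > 0.5).astype(float)   # observed binary inputs x1..x4
W = 0.01 * rng.randn(4, 3)                   # weights between the x's and a1..a3
lr = 0.1

for step in range(1000):
    # positive phase: clamp x to the data and sample a from P(a | x)
    pa = sigmoid(X @ W)
    a = (rng.rand(*pa.shape) < pa).astype(float)
    pos = X.T @ pa                            # ~ [x_i a_j]_obs

    # negative phase: one Gibbs step x -> a -> x' -> a'
    px = sigmoid(a @ W.T)
    x_neg = (rng.rand(*px.shape) < px).astype(float)
    neg = x_neg.T @ sigmoid(x_neg @ W)        # crude estimate of [x_i a_j]_prior

    W += lr * (pos - neg) / len(X)            # gradient ascent on log P(x)
```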

Page 50: 6.S093 Visual Recognition through Machine Learning Competition

Deep Belief Network

[Figure: Input [x1, x2, x3, x4] → Layer 2 [a1, a2, a3] → Layer 3 [b1, b2, b3]]

Similar to a sparse autoencoder in many ways. Stack RBMs on top of each other to get a DBN, trained one layer at a time with approximate maximum likelihood (often with a sparsity constraint on the ai’s).
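A minimal sketch of this greedy stacking under the same simplifications as the RBM snippet above; cd1_train is an illustrative helper written here, not a library function.

```python
# Greedy DBN training: train an RBM on the data, map the data through it,
# then train the next RBM on those activations.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_train(data, n_hidden, n_steps=1000, lr=0.1, seed=0):
    """Train a bias-free binary RBM on `data` with 1-step contrastive divergence."""
    rng = np.random.RandomState(seed)
    W = 0.01 * rng.randn(data.shape[1], n_hidden)
    for _ in range(n_steps):
        pa = sigmoid(data @ W)
        a = (rng.rand(*pa.shape) < pa).astype(float)
        px = sigmoid(a @ W.T)
        x_neg = (rng.rand(*px.shape) < px).astype(float)
        W += lr * (data.T @ pa - x_neg.T @ sigmoid(x_neg @ W)) / len(data)
    return W

rng = np.random.RandomState(0)
X = (rng.rand(500, 4) > 0.5).astype(float)    # toy binary inputs
W1 = cd1_train(X, n_hidden=3)                 # Layer 2: the a's
A = sigmoid(X @ W1)
W2 = cd1_train(A, n_hidden=3)                 # Layer 3: the b's, trained on the a's
```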

Page 51: 6.S093 Visual Recognition through Machine Learning Competition

Deep Belief Network

[Figure: Input [x1, x2, x3, x4] → Layer 2 [a1, a2, a3] → Layer 3 [b1, b2, b3] → Layer 4 [c1, c2, c3]]

Page 52: 6.S093 Visual Recognition through Machine Learning Competition

Convolutional DBN for audio

[Figure: spectrogram input → detection units → max-pooling unit]

Page 53: 6.S093 Visual Recognition through Machine Learning Competition

Convolutional DBN for audio

[Figure: spectrogram]

Page 54: 6.S093 Visual Recognition through Machine Learning Competition

Convolutional DBN for Images

[Figure: input data V → detection layer H of binary hidden nodes with shared “filter” weights Wk → max-pooling layer P of binary max-pooling nodes]
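A minimal NumPy sketch of the convolution + max-pooling structure this slide depicts; the input and the filter values are random stand-ins (in a real model the filter Wk is learned).

```python
# Shared filter Wk convolved over input V gives detection layer H;
# H is then shrunk by 2x2 max pooling into P.
import numpy as np

rng = np.random.RandomState(0)
V = rng.rand(8, 8)                    # input data V (e.g., an image patch or spectrogram)
Wk = rng.randn(3, 3)                  # shared "filter" weights Wk

# detection layer H: valid convolution, written as a plain double loop
H = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        H[i, j] = np.sum(V[i:i+3, j:j+3] * Wk)

# max-pooling layer P: each pooling node takes the max of a 2x2 block of H
P = H.reshape(3, 2, 3, 2).max(axis=(1, 3))
print(H.shape, P.shape)               # (6, 6) -> (3, 3)
```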

Page 55: 6.S093 Visual Recognition through Machine Learning Competition

Tutorial

image classifier demo