A framework for machine vision based on neuro-mimetic frontend and clustering
Emre Akbas*, Aseem Wadhwa#, Miguel Eckstein*, Upamanyu Madhow#
*Department of Psychology and Brain Sciences, UCSB; #Department of ECE, UCSB
Significant recent progress in machine vision
• Current records held by Supervised Deep Neural Nets
• Loosely inspired by spatial organization of the visual cortex
• Hierarchical structure
Krizhevsky et al. 2012; Ciresan et al. 2011
Supervised Deep Nets: Not quite the last word?
• Large number of tunable parameters
– Long training times
– Overfitting an issue
• Large amounts of labeled training data required
• Tricks, clever engineering required to make them work
– DropOut, DropConnect, etc.
– Setting correct values of learning rates, weight decay, momentum
• Not clear what information is extracted by different layers – lower layers? higher layers?
• What are they doing? Why do they work?
• Simpler ways of implementing hierarchical feature extraction?
Why do we need supervision?
• We can see things even if we don’t know their labels
• Perhaps our visual system is extracting a “universal” set of features?
Can’t we mostly learn unsupervised?
(Add in supervision at the final stage for a classification task.)
• Has been tried recently
• Was abandoned when supervised techniques were tuned to do better – Too quickly, we think.
• Can we do it mostly unsupervised?
• Can we leverage everything we know “for sure” about mammalian vision?
• Can we reduce the number of engineering judgment calls and parameters to be tuned?
Our Proposal
1. Neuro-mimetic Frontend (lower layers)
– detailed models available for retina processing and simple cells in V1
2. Unsupervised Learning via Clustering to build higher layers
– natural interpretation: segmenting data into patterns
– Clustering similar to neural processing, easy interpretation
– Drastically fewer parameters to tune (only two: number of cluster centers, sparsity level – discussed later)
3. Finally, use Supervised Classification (Generic nonlinear SVM)
Main focus: Unsupervised, universal, interpretable feature extraction
Architecture
[Architecture: N × N raw image → RGC layer → simple-cell layer → N × N × f “feature maps” → clustering and pooling → supervised classifier (e.g. SVM or neural net).]
1. “Neuro-mimetic Frontend”
2. Unsupervised Feature Extraction
• Universal frontend
• Independent of labels
Neuro-Mimetic Frontend
Loose neuro-inspiration plays a key role already
– Convolutional and hierarchical architecture
– Local Contrast Normalization (LCN)
– Rectification
[Figures: local spatial organization of cells in the visual cortex (from LeCun et al., 1989); local inhibition and competition between neighboring neurons, i.e. normalization (from Brady et al., 2000); neurons firing when inputs exceed a threshold.]
Yes, at least for the frontend
• Much less work on neuro-mimetic computational models
• Detailed models available, based on experiments, for the retinal frontend
• Some basic models for simple cells (V1)
• Higher than V1 layers (complex cells) – not enough understanding
• In this work – mimic the parts we know, use neuro-inspiration for the rest
• Potential benefits
– Fundamentally interesting for computational neuroscience
– Leverage evolution as much as possible: fewer parameters to tune
– “Universal” front-end
Graphic from Bengio & LeCun, ICML Workshop 2009, Montreal
Visual Pathway
Figure from Filipe and Alexandre, 2013.
Figure from Hubel, 1995.
[Diagram: retina (fovea) → optic nerve → RGC/LGN → V1 simple cells → complex cells.]
Frontend: in the fovea/optic nerve (Retinal Ganglion Cells / Lateral Geniculate Nucleus); simple and complex cells: in the cortex.
Retinal Ganglion Cells (RGCs)
What they do:
1. Luminance Gain Control
2. Center-Surround Filtering
3. Local Contrast Normalization (LCN) and Rectification
[Figure: the Retinal Ganglion Cell (RGC) filter is a difference of Gaussians performing local centre-surround filtering, with center-on (+ve) and center-off (−ve) channels; each channel passes through LCN and a threshold to give the rectified RGC output.]
References:
• Wandell 1995
• Croner et al. 1995
• Carandini et al. 2005
[Figure: luminance gain control example – center-ON and center-OFF outputs; “relevant” parts of the image light up (>0).]
V1 Simple Cells
• Simple cells sum RGC outputs: a bank of 48 edge-oriented filters (a generic sketch follows below)
[Diagram: rectified RGC output → simple-cell filter bank (48) → LCN + threshold → rectified simple-cell output.]
• Output: 48 “feature” maps
References:
• Hubel and Wiesel 1968
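An illustrative sketch of the oriented filtering (the Gabor-style kernels below are a generic stand-in, not the paper’s 48 filters):

```python
# Sketch of the simple-cell stage: rectified oriented responses obtained by
# convolving the RGC map with a bank of oriented edge filters.
import numpy as np
from scipy.ndimage import convolve

def gabor_kernel(theta, size=7, sigma=2.0, freq=0.5):
    """Odd-phase Gabor-like edge kernel at orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.sin(2 * np.pi * freq * xr)

def simple_cells(rgc_map, n_orient=48):
    """Return an (H, W, n_orient) stack of rectified oriented responses."""
    thetas = np.linspace(0, np.pi, n_orient, endpoint=False)
    return np.stack([np.maximum(convolve(rgc_map, gabor_kernel(t)), 0.0)
                     for t in thetas], axis=-1)
```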
Higher Layers
Does Unsupervised Learning Make Sense?
• It should… even though most state-of-the-art nets are purely supervised
• “Transfer Learning” has been shown to help
• Intuitively:
– extract low-level features like edges, corners, t-junctions, etc.
– independent of labels
• Low level “Representative” features (unsupervised) followed by higher level “Discriminative” features (supervised)
• Not entirely a new idea: Lee et al. 2009, Jarrett et al. 2009, Kavukcuoglu et al. 2010, Zeiler et al. 2011, Labusch et al. 2009, etc.
• Our approach: “clustering” to extract low level features
Why Clustering?
• Natural candidate to find patterns in data
• Simple
• Encoding operation similar to a layer in a neural network
• cluster center == neuron
[Figure: processing in neural layers vs. clustering – inputs x1 … xd feed the neurons of layer “i”, whose outputs feed the neurons of layer “i+1”.]
• Normalize the input and the “cluster centers” and perform K-means (spherical K-means)
• Once centers are learned, apply a non-linear function of the soft encoding
• Same as neural-layer processing (sketched below)
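A minimal sketch of this correspondence (assuming cosine-similarity soft encoding with unit-norm inputs and centers):

```python
# "Cluster center == neuron": with unit-norm inputs and centers, the soft
# encoding is a matrix of dot products followed by a pointwise nonlinearity,
# which is exactly the computation of one neural layer.
import numpy as np

def cluster_encode(X, C, f=lambda a: np.maximum(a, 0.0)):
    """X: (n, d) inputs; C: (k, d) centers; returns (n, k) activations."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize inputs
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)  # normalize centers
    return f(Xn @ Cn.T)  # cosine similarities, then nonlinearity f
```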
Related Work
• Unsupervised layers followed by Supervised layers
– Lee et al. 2009, Jarrett et al. 2009, Kavukcuoglu et al. 2010, Zeiler et al. 2011, Labusch et al. 2009
– All of them utilize some form of reconstruction + sparsity cost function for unsupervised training
– We investigate a much simpler alternative (K-means clustering)
• Using K-means clustering to build features
– Coates and Ng 2011, 2012
– Use K-means directly on raw images
– Number of cluster centers learned is very large (a few thousand)
– We build clustering on top of neuro-mimetic preprocessing and get competitive performance with a few cluster centers
• Benefits of tuning the pre-processing step
– Ciresan et al. 2011, Fukushima 2003
– Gains obtained using contrast normalization, center-surround filtering
Architecture: more details
[Diagram: N × N raw image → RGC + simple cells → N × N × 48 “feature maps”.]
Each spatial location: 48 neurons whose activations represent a 7 × 7 patch
• Simple-cell receptive field size: roughly 7 × 7 pixels in the original image space
• “Viewing distance”: the only parameter to be set for the frontend; we set it guided by the resolution of the image
• “Receptive field sizes” depend on the size of the fovea (fixed) and the “viewing distance”
Example:
[Figure: a 7 × 7 patch and its simple-cell response (48 filters), prior to LCN.]
• LCN (Local Contrast Normalization) operation:
– normalizes activations
– inhibits weaker responses
Example:
[Figure: the same 7 × 7 patch and its simple-cell response after LCN.]
LCN divides each activation by the pooled energy of its neighbors (i: feature index, j: spatial index), summing over “features” and over a spatial neighborhood:
$$\tilde{x}_{i,j} = \frac{x_{i,j}}{\left(c + \sum_{i'}\,\sum_{j' \in \mathcal{N}(j)} x_{i',j'}^{2}\right)^{1/2}}$$
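A sketch matching the divisive-normalization form above (the neighborhood size and the constant c are assumptions):

```python
# Sketch of LCN over (H, W, F) feature maps: divide each activation by the
# root of the squared activations summed over all features and over a local
# spatial neighborhood.
import numpy as np
from scipy.ndimage import uniform_filter

def lcn(fmap, window=3, c=1e-4):
    energy = (fmap ** 2).sum(axis=-1)            # sum over features
    local = uniform_filter(energy, size=window)  # local spatial mean...
    local *= window * window                     # ...converted to a sum
    return fmap / np.sqrt(c + local)[..., None]
```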
Clustering Simple Cell Outputs
[Diagram: N × N raw image → N × N × 48 simple-cell outputs (x1 … x48) → N × N × K1 cluster activations (c1 … cK1).]
• Input data: one 48 × 1 vector x per spatial location
• Learn K1 centers
• Use the online spherical K-means algorithm (Shi Zhong 2005; sketched below)
Cluster Centers (correspond to 7 × 7 patches)
• Mostly edges
• Interpretation: orientation-sensitive neurons
[Figure: the K1 learned cluster centers, visualized as 7 × 7 patches.]
Encoding Function (Activations): Sparsity Level
• Choice of f?
• We choose:
– soft threshold (sketched below)
– choose T to keep 80% sparsity on average
• Why?
– neuro-inspired (rectification)
– a patch is characterized by the relative contributions of several centers
– gives a direct handle on sparsity (unlike other unsupervised learning algorithms)
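A minimal sketch of this encoding; taking T as the empirical 80th percentile of the similarities is our reading of “80% sparsity on average”:

```python
# Soft-threshold encoding: shift similarities down by T and rectify, so that
# on average 80% of the activations are exactly zero.
import numpy as np

def soft_threshold_encode(S, sparsity=0.80):
    """S: (n, k) similarities to the k centers; returns sparse activations."""
    T = np.quantile(S, sparsity)   # 80% of entries fall below T
    return np.maximum(S - T, 0.0)  # soft threshold: shift + rectify
```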
[Figure: a patch and its activations after the threshold.]
[Diagram: N × N raw image → N × N × 48 simple-cell outputs → N × N × K1 activations.]
Next layer? 1. Either feed to the classifier, OR 2. zoom out via “pooling” and cluster “larger” patches
Next layer: Option 1
• Pool and feed it to a supervised classifier
• Pooling: local translation invariance – an edge anywhere in a cell (see the sketch below)
• e.g. MNIST, K1 = 200:
– before pooling: 28 × 28 × 200
– after pooling: 4 × 4 × 200 = 3200, the length of the feature vector
[Figure: pooling over a 4 × 4 grid of 7 × 7 cells on the 28 × 28 map.]
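A sketch of this pooling step (max pooling within each cell is our assumption for “edge anywhere in a cell”):

```python
# Pool an (H, W, K) activation map over a grid x grid partition and flatten.
import numpy as np

def grid_pool(fmap, grid=4):
    H, W, K = fmap.shape
    ch, cw = H // grid, W // grid    # cell size (7 x 7 for MNIST, grid=4)
    cells = fmap[:grid * ch, :grid * cw].reshape(grid, ch, grid, cw, K)
    return cells.max(axis=(1, 3)).reshape(-1)  # (grid, grid, K) -> flat

# e.g. MNIST with K1 = 200: a (28, 28, 200) map pools to 4 x 4 x 200 = 3200.
```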
Next layer: Option 2
• One more layer of extracting unsupervised features before discriminative learning
• “Zoom out” and cluster: Receptive fields now larger than 7X7
[Diagram: local 2 × 2 pooling and 2 × 2 concatenation – 28 × 28 × 200 → 7 × 7 × 800, representing 10 × 10 patches (K1: # first-layer centers); sketched below.]
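A sketch of this step (max pooling and non-overlapping 2 × 2 blocks are assumptions):

```python
# Local 2x2 pooling halves the grid; 2x2 concatenation stacks neighboring
# vectors, so (28, 28, 200) becomes (7, 7, 800).
import numpy as np

def pool_and_concat(fmap):
    H, W, K = fmap.shape
    pooled = fmap.reshape(H // 2, 2, W // 2, 2, K).max(axis=(1, 3))
    h, w = pooled.shape[0] // 2, pooled.shape[1] // 2
    blocks = pooled.reshape(h, 2, w, 2, K)           # group 2x2 neighbors
    return blocks.transpose(0, 2, 1, 3, 4).reshape(h, w, 4 * K)
```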
2nd Layer Clustering
• Individual matching score (0–1)
• Similarity metric for clustering: average matching score over the 4 quadrants (a sketch follows below)
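A hedged sketch of this metric (a cosine matching score per quadrant is our assumption for the 0–1 score):

```python
# Second-layer similarity: split the concatenated vector into its four
# quadrant blocks, score each pair of blocks in [0, 1], average the four.
import numpy as np

def quadrant_similarity(u, v, n_quadrants=4):
    scores = []
    for a, b in zip(np.split(u, n_quadrants), np.split(v, n_quadrants)):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        scores.append(a @ b / denom if denom > 0 else 0.0)
    return float(np.mean(scores))  # average matching score over quadrants
```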
[Figure: example 10 × 10 patches; each is represented by pooling and concatenating the first-layer activations of its four quadrants.]
2nd layer Cluster Centers
• curves, t-junctions
[Figure: second-layer cluster centers, visualized as 10 × 10 patches.]
The Datasets
Standard benchmarks for image classification
MNIST
• 10 digits
• 28 X 28 images
• 60K training, 10K testing
NORB (uniform dataset)
• 5 objects: animal, human, plane, truck, car
• 96 × 96 images
• Varying illumination, elevation, rotation
• 24,300 training, 24,300 testing
Experimental results
Summary
Decent performance on MNIST
Beats state of the art on NORB
Can do it with very sparse encoding
Three scenarios tested
1. 1 layer of clustering K1 = 200
2. 1 layer of clustering K1 = 600
3. 2 layers of clustering, K1 = 200, K2 = 600; concatenate layer-1 and layer-2 activations to form the feature vector (consistent with current neuroscientific understanding)
• Cases 2 and 3: similar lengths of feature vectors
• No augmentation using affine distortions (translation, rotation etc) of the data
RBF SVM as Supervised Classifier
• Since we are using clustering to organize patterns, we expect a classifier that uses a Gaussian mixture model (GMM) to do well: the radial basis function (RBF) classifier (Sung 1996)
• RBF classifier: like a GMM, but mixing probabilities are trained discriminatively
• RBF + SVM is superior to GMM-based classification (Schölkopf et al. 1996)
• Scale parameter of the RBF kernel set via cross-validation on the training set (a sketch follows below)
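A sketch of this stage with scikit-learn (the candidate grids below are illustrative):

```python
# RBF-kernel SVM with the kernel scale (gamma) and C chosen by
# cross-validation on the training set only.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_rbf_svm(features, labels):
    grid = {"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
    search.fit(features, labels)
    return search.best_estimator_
```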
Simulation Results (Sparsity Level: 80%)

MNIST                          Error Rate
Layer 1 (K1=200)               0.73%
Layer 1 (K1=600)               0.72%
Layers 1+2 (K1=200, K2=600)    0.66%

NORB                           Error Rate
Layer 1 (K1=200)               3.96%
Layer 1 (K1=600)               3.71%
Layers 1+2 (K1=200, K2=600)    2.94%
MNIST comparisons:
• State of the art (without distortions): 0.39% – Chen-Yu Lee et al. 2014, “Deeply-Supervised Nets”
• Works that use unsupervised layers followed by supervised classifiers:
– 0.82% – Lee et al. 2009
– 0.64% – Ranzato et al. 2007; uses a 2-layer neural net for supervised training on top of 2 unsupervised layers
– 0.59% – Labusch et al. 2009; sparsenet algorithm + RBF SVM
NORB comparisons:
• State of the art (with translations): 2.53% – Ciresan et al. 2011; without translations: 3.94% – supervised deep net
• State of the art (without translations): 2.87% – Uetz & Behnke 2009; supervised deep net
• 3.0% – Coates et al. 2010; unsupervised + SVM, K-means (K = 4000)
• 5.6% – Jarrett et al. 2009; unsupervised pretraining + fine tuning
Simulation Results (Sparsity Level Changed to 95%)

MNIST                          Error Rate
Layer 1 (K1=200)               0.78%
Layer 1 (K1=600)               0.78%
Layers 1+2 (K1=200, K2=600)    0.68%

NORB                           Error Rate
Layer 1 (K1=200)               2.58%
Layer 1 (K1=600)               2.52%
Layers 1+2 (K1=200, K2=600)    2.90%
• MNIST performance degrades slightly
• NORB performance with a single layer improves greatly: better than state of the art
Conclusions
A promising start
• Neuromimetic frontend + clustering is a promising approach to “universal” feature extraction
– Easy to implement, very few tunable parameters (viewing distance, #cluster centers, sparsity level)
• Potential for interpreting cluster centers as successively more abstract and “zoomed out” representations
Many unanswered questions
• How to tell if we are capturing all the information?
– Is there an alternative metric to classification performance?
• Impact of design choices on classification performance appears to be dataset-dependent
– How many layers?
– What sparsity level? Layer-dependent sparsity?
• Optimizing the choice of supervised layer
– Neural nets versus nonlinear SVM
• What are the best approaches for low-power hardware implementations?
– Power savings from sparsity, backend complexity