A framework for machine vision based on neuro-mimetic frontend and clustering
Emre Akbas*, Aseem Wadhwa#, Miguel Eckstein*, Upamanyu Madhow#
*Department of Psychology and Brain Sciences, UCSB; #Department of ECE, UCSB
Significant recent progress in machine vision
• Current records held by Supervised Deep Neural Nets
• Loosely inspired by spatial organization of the visual cortex
• Hierarchical structure
Krizhevsky et al. 2012; Ciresan et al. 2011
Supervised Deep Nets: Not quite the last word?
• Large number of tunable parameters
– Long training times
– Overfitting an issue
• Large amounts of labeled training data required
• Tricks, clever engineering required to make them work
– DropOut, DropConnect, etc.
– Setting correct values of learning rates, weight decay, momentum
• Not clear what information is extracted by different layers – lower layers? higher layers?
• What are they doing? Why do they work?
• Simpler ways of implementing hierarchical feature extraction?
Why do we need supervision?
• We can see things even if we don’t know their labels
• Perhaps our visual system is extracting a “universal” set of features?
Can’t we mostly learn unsupervised?
(Add in supervision at the final stage for a classification task.)
• Has been tried recently
• Was abandoned when supervised techniques were tuned to do better – Too quickly, we think.
• Can we do it mostly unsupervised?
• Can we leverage everything we know “for sure” about mammalian vision?
• Can we reduce the number of engineering judgment calls and parameters to be tuned?
Our Proposal
1. Neuro-mimetic Frontend (lower layers)
– detailed models available for retina processing and simple cells in V1
2. Unsupervised Learning via Clustering to build higher layers
– natural interpretation: segmenting data into patterns
– Clustering similar to neural processing, easy interpretation
– Drastically fewer parameters to tune (only two: number of cluster centers, sparsity level – discussed later)
3. Finally, use Supervised Classification (Generic nonlinear SVM)
Main focus: Unsupervised, universal, interpretable feature extraction
Architecture
[Architecture: N × N raw image → RGC layer → simple-cell layer → N × N × f “feature maps” → clustering and pooling → supervised classifier (e.g. SVM or neural net).]
1. “Neuro-mimetic Frontend”
2. Unsupervised Feature Extraction
• Universal frontend
• Independent of labels
Neuro-Mimetic Frontend
Loose neuro-inspiration plays a key role already
– Convolutional and hierarchical architecture
– Local Contrast Normalization (LCN)
– Rectification
[Figures: local spatial organization of cells in the visual cortex (from LeCun et al., 1989); local inhibition and competition between neighboring neurons, i.e. normalization (from Brady et al., 2000); neurons firing when inputs exceed a threshold.]
Yes, at least for the frontend
• Much less work on neuro-mimetic computational models
• Detailed models available, based on experiments, for the retinal frontend
• Some basic models for simple cells (V1)
• Higher than V1 layers (complex cells) – not enough understanding
• In this work – mimic the parts we know, use neuro-inspiration for the rest
• Potential benefits
– Fundamentally interesting for computational neuroscience
– Leverage evolution as much as possible: fewer parameters to tune
– “Universal” front-end
Graphic from Bengio & LeCun, ICML Workshop 2009, Montreal
Visual Pathway
Figure from Filipe and Alexandre, 2013.
Figure from Hubel, 1995.
[Diagram: retina (fovea) → optic nerve → RGC/LGN → V1 simple cells → complex cells.]
Frontend: in the fovea/optic nerve (Retinal Ganglion Cells / Lateral Geniculate Nucleus); simple and complex cells: in the cortex.
Retinal Ganglion Cells (RGCs)
What they do:
1. Luminance Gain Control
2. Center-Surround Filtering
3. Local Contrast Normalization (LCN) and Rectification
[Figure: the Retinal Ganglion Cell (RGC) filter is a difference of Gaussians performing local centre-surround filtering, with center-on (+ve) and center-off (−ve) channels; each channel passes through LCN and a threshold to give the rectified RGC output.]
References:
• Wandell 1995
• Croner et al. 1995
• Carandini et al. 2005
[Figure: luminance gain control example – center-ON and center-OFF outputs; “relevant” parts of the image light up (>0).]
V1 Simple Cells
• Simple cells sum RGC outputs: a bank of 48 edge-oriented filters (a generic sketch follows below)
[Diagram: rectified RGC output → simple-cell filter bank (48) → LCN + threshold → rectified simple-cell output.]
• Output: 48 “feature” maps
References:
• Hubel and Wiesel 1968
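An illustrative sketch of the oriented filtering (the Gabor-style kernels below are a generic stand-in, not the paper’s 48 filters):

```python
# Sketch of the simple-cell stage: rectified oriented responses obtained by
# convolving the RGC map with a bank of oriented edge filters.
import numpy as np
from scipy.ndimage import convolve

def gabor_kernel(theta, size=7, sigma=2.0, freq=0.5):
    """Odd-phase Gabor-like edge kernel at orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.sin(2 * np.pi * freq * xr)

def simple_cells(rgc_map, n_orient=48):
    """Return an (H, W, n_orient) stack of rectified oriented responses."""
    thetas = np.linspace(0, np.pi, n_orient, endpoint=False)
    return np.stack([np.maximum(convolve(rgc_map, gabor_kernel(t)), 0.0)
                     for t in thetas], axis=-1)
```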
Higher Layers
Does Unsupervised Learning Make Sense?
• It should… even though most state-of-the-art nets are purely supervised
• “Transfer Learning” has been shown to help
• Intuitively:
– extract low-level features like edges, corners, t-junctions, etc.
– independent of labels
• Low level “Representative” features (unsupervised) followed by higher level “Discriminative” features (supervised)
• Not entirely a new idea: Lee et al. 2009, Jarrett et al. 2009, Kavukcuoglu et al. 2010, Zeiler et al. 2011, Labusch et al. 2009, etc.
• Our approach: “clustering” to extract low level features
Why Clustering?
• Natural candidate to find patterns in data
• Simple
• Encoding operation similar to a layer in a neural network
• cluster center == neuron
[Figure: processing in neural layers vs. clustering – inputs x1 … xd feed the neurons of layer “i”, whose outputs feed the neurons of layer “i+1”.]
• Normalize the input and the “cluster centers” and perform K-means (spherical K-means)
• Once centers are learned, apply a non-linear function of the soft encoding
• Same as neural-layer processing (sketched below)
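A minimal sketch of this correspondence (assuming cosine-similarity soft encoding with unit-norm inputs and centers):

```python
# "Cluster center == neuron": with unit-norm inputs and centers, the soft
# encoding is a matrix of dot products followed by a pointwise nonlinearity,
# which is exactly the computation of one neural layer.
import numpy as np

def cluster_encode(X, C, f=lambda a: np.maximum(a, 0.0)):
    """X: (n, d) inputs; C: (k, d) centers; returns (n, k) activations."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize inputs
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)  # normalize centers
    return f(Xn @ Cn.T)  # cosine similarities, then nonlinearity f
```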
Related Work
• Unsupervised layers followed by Supervised layers
– Lee et al. 2009, Jarrett et al. 2009, Kavukcuoglu et al. 2010, Zeiler et al. 2011, Labusch et al. 2009
– All of them utilize some form of reconstruction + sparsity cost function for unsupervised training
– We investigate a much simpler alternative (K-means clustering)
• Using K-means clustering to build features
– Coates and Ng 2011, 2012
– Use K-means directly on raw images
– Number of cluster centers learned is very large (a few thousand)
– We build clustering on top of neuro-mimetic preprocessing and get competitive performance with a few cluster centers
• Benefits of tuning the pre-processing step
– Ciresan et al. 2011, Fukushima 2003
– Gains obtained using contrast normalization, center-surround filtering
Architecture: more details
[Diagram: N × N raw image → RGC + simple cells → N × N × 48 “feature maps”.]
Each spatial location: 48 neurons whose activations represent a 7 × 7 patch
• Simple-cell receptive field size: roughly 7 × 7 pixels in the original image space
• “Viewing distance”: the only parameter to be set for the frontend; we set it guided by the resolution of the image
• “Receptive field sizes” depend on the size of the fovea (fixed) and the “viewing distance”
Example:
[Figure: a 7 × 7 patch and its simple-cell response (48 filters), prior to LCN.]
• LCN (Local Contrast Normalization) operation:
– normalizes activations
– inhibits weaker responses
Example:
[Figure: the same 7 × 7 patch and its simple-cell response after LCN.]
LCN divides each activation by the pooled energy of its neighbors (i: feature index, j: spatial index), summing over “features” and over a spatial neighborhood:
$$\tilde{x}_{i,j} = \frac{x_{i,j}}{\left(c + \sum_{i'}\,\sum_{j' \in \mathcal{N}(j)} x_{i',j'}^{2}\right)^{1/2}}$$
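A sketch matching the divisive-normalization form above (the neighborhood size and the constant c are assumptions):

```python
# Sketch of LCN over (H, W, F) feature maps: divide each activation by the
# root of the squared activations summed over all features and over a local
# spatial neighborhood.
import numpy as np
from scipy.ndimage import uniform_filter

def lcn(fmap, window=3, c=1e-4):
    energy = (fmap ** 2).sum(axis=-1)            # sum over features
    local = uniform_filter(energy, size=window)  # local spatial mean...
    local *= window * window                     # ...converted to a sum
    return fmap / np.sqrt(c + local)[..., None]
```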
Clustering Simple Cell Outputs
[Diagram: N × N raw image → N × N × 48 simple-cell outputs (x1 … x48) → N × N × K1 cluster activations (c1 … cK1).]
• Input data: one 48 × 1 vector x per spatial location
• Learn K1 centers
• Use the online spherical K-means algorithm (Shi Zhong 2005; sketched below)
Cluster Centers (correspond to 7 × 7 patches)
• Mostly edges
• Interpretation: orientation-sensitive neurons
[Figure: the K1 learned cluster centers, visualized as 7 × 7 patches.]
Encoding Function (Activations): Sparsity Level
• Choice of f?
• We choose:
– soft threshold (sketched below)
– choose T to keep 80% sparsity on average
• Why?
– neuro-inspired (rectification)
– a patch is characterized by the relative contributions of several centers
– gives a direct handle on sparsity (unlike other unsupervised learning algorithms)
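A minimal sketch of this encoding; taking T as the empirical 80th percentile of the similarities is our reading of “80% sparsity on average”:

```python
# Soft-threshold encoding: shift similarities down by T and rectify, so that
# on average 80% of the activations are exactly zero.
import numpy as np

def soft_threshold_encode(S, sparsity=0.80):
    """S: (n, k) similarities to the k centers; returns sparse activations."""
    T = np.quantile(S, sparsity)   # 80% of entries fall below T
    return np.maximum(S - T, 0.0)  # soft threshold: shift + rectify
```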
[Figure: a patch and its activations after the threshold.]
[Diagram: N × N raw image → N × N × 48 simple-cell outputs → N × N × K1 activations.]
Next layer? 1. Either feed to the classifier, OR 2. zoom out via “pooling” and cluster “larger” patches
Next layer: Option 1
• Pool and feed it to a supervised classifier
• Pooling: local translation invariance – an edge anywhere in a cell (see the sketch below)
• e.g. MNIST, K1 = 200:
– before pooling: 28 × 28 × 200
– after pooling: 4 × 4 × 200 = 3200, the length of the feature vector
[Figure: pooling over a 4 × 4 grid of 7 × 7 cells on the 28 × 28 map.]
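A sketch of this pooling step (max pooling within each cell is our assumption for “edge anywhere in a cell”):

```python
# Pool an (H, W, K) activation map over a grid x grid partition and flatten.
import numpy as np

def grid_pool(fmap, grid=4):
    H, W, K = fmap.shape
    ch, cw = H // grid, W // grid    # cell size (7 x 7 for MNIST, grid=4)
    cells = fmap[:grid * ch, :grid * cw].reshape(grid, ch, grid, cw, K)
    return cells.max(axis=(1, 3)).reshape(-1)  # (grid, grid, K) -> flat

# e.g. MNIST with K1 = 200: a (28, 28, 200) map pools to 4 x 4 x 200 = 3200.
```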
Next layer: Option 2
• One more layer of extracting unsupervised features before discriminative learning
• “Zoom out” and cluster: Receptive fields now larger than 7X7
[Diagram: local 2 × 2 pooling and 2 × 2 concatenation – 28 × 28 × 200 → 7 × 7 × 800, representing 10 × 10 patches (K1: # first-layer centers); sketched below.]
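A sketch of this step (max pooling and non-overlapping 2 × 2 blocks are assumptions):

```python
# Local 2x2 pooling halves the grid; 2x2 concatenation stacks neighboring
# vectors, so (28, 28, 200) becomes (7, 7, 800).
import numpy as np

def pool_and_concat(fmap):
    H, W, K = fmap.shape
    pooled = fmap.reshape(H // 2, 2, W // 2, 2, K).max(axis=(1, 3))
    h, w = pooled.shape[0] // 2, pooled.shape[1] // 2
    blocks = pooled.reshape(h, 2, w, 2, K)           # group 2x2 neighbors
    return blocks.transpose(0, 2, 1, 3, 4).reshape(h, w, 4 * K)
```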
2nd Layer Clustering
• Individual matching score (0–1)
• Similarity metric for clustering: average matching score over the 4 quadrants (a sketch follows below)
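A hedged sketch of this metric (a cosine matching score per quadrant is our assumption for the 0–1 score):

```python
# Second-layer similarity: split the concatenated vector into its four
# quadrant blocks, score each pair of blocks in [0, 1], average the four.
import numpy as np

def quadrant_similarity(u, v, n_quadrants=4):
    scores = []
    for a, b in zip(np.split(u, n_quadrants), np.split(v, n_quadrants)):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        scores.append(a @ b / denom if denom > 0 else 0.0)
    return float(np.mean(scores))  # average matching score over quadrants
```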
[Figure: example 10 × 10 patches; each is represented by pooling and concatenating the first-layer activations of its four quadrants.]
2nd layer Cluster Centers
• curves, t-junctions
[Figure: second-layer cluster centers, visualized as 10 × 10 patches.]
The Datasets
Standard benchmarks for image classification
MNIST
• 10 digits
• 28 X 28 images
• 60K training, 10K testing
NORB (uniform dataset)
• 5 objects: animal, human, plane, truck, car
• 96 × 96 images
• Varying illumination, elevation, rotation
• 24,300 training, 24,300 testing
Experimental results
Summary
Decent performance on MNIST
Beats state of the art on NORB
Can do it with very sparse encoding
Three scenarios tested
1. 1 layer of clustering K1 = 200
2. 1 layer of clustering K1 = 600
3. 2 layers of clustering, K1 = 200, K2 = 600; concatenate layer-1 and layer-2 activations to form the feature vector (consistent with current neuroscientific understanding)
• Cases 2 and 3: similar lengths of feature vectors
• No augmentation using affine distortions (translation, rotation etc) of the data
RBF SVM as Supervised Classifier
• Since we are using clustering to organize patterns, we expect a classifier that uses a Gaussian mixture model (GMM) to do well: the radial basis function (RBF) classifier (Sung 1996)
• RBF classifier: like a GMM, but mixing probabilities are trained discriminatively
• RBF + SVM is superior to GMM-based classification (Schölkopf et al. 1996)
• Scale parameter of the RBF kernel set via cross-validation on the training set (a sketch follows below)
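A sketch of this stage with scikit-learn (the candidate grids below are illustrative):

```python
# RBF-kernel SVM with the kernel scale (gamma) and C chosen by
# cross-validation on the training set only.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_rbf_svm(features, labels):
    grid = {"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
    search.fit(features, labels)
    return search.best_estimator_
```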
Simulation Results (Sparsity Level: 80%)

MNIST                          Error Rate
Layer 1 (K1=200)               0.73%
Layer 1 (K1=600)               0.72%
Layers 1+2 (K1=200, K2=600)    0.66%

NORB                           Error Rate
Layer 1 (K1=200)               3.96%
Layer 1 (K1=600)               3.71%
Layers 1+2 (K1=200, K2=600)    2.94%
MNIST comparisons:
• State of the art (without distortions): 0.39% – Chen-Yu Lee et al. 2014, “Deeply-Supervised Nets”
• Works that use unsupervised layers followed by supervised classifiers:
– 0.82% – Lee et al. 2009
– 0.64% – Ranzato et al. 2007; uses a 2-layer neural net for supervised training on top of 2 unsupervised layers
– 0.59% – Labusch et al. 2009; sparsenet algorithm + RBF SVM
NORB comparisons:
• State of the art (with translations): 2.53% – Ciresan et al. 2011; without translations: 3.94% – supervised deep net
• State of the art (without translations): 2.87% – Uetz & Behnke 2009; supervised deep net
• 3.0% – Coates et al. 2010; unsupervised + SVM, K-means (K = 4000)
• 5.6% – Jarrett et al. 2009; unsupervised pretraining + fine tuning
Simulation Results (Sparsity Level Changed to 95%)

MNIST                          Error Rate
Layer 1 (K1=200)               0.78%
Layer 1 (K1=600)               0.78%
Layers 1+2 (K1=200, K2=600)    0.68%

NORB                           Error Rate
Layer 1 (K1=200)               2.58%
Layer 1 (K1=600)               2.52%
Layers 1+2 (K1=200, K2=600)    2.90%
• MNIST performance degrades slightly
• NORB performance with a single layer improves greatly: better than state of the art
Conclusions
A promising start
• Neuromimetic frontend + clustering is a promising approach to “universal” feature extraction
– Easy to implement, very few tunable parameters (viewing distance, #cluster centers, sparsity level)
• Potential for interpreting cluster centers as successively more abstract and “zoomed out” representations
Many unanswered questions
• How to tell if we are capturing all the information?
– Is there an alternative metric to classification performance?
• Impact of design choices on classification performance appears to be dataset-dependent
– How many layers?
– What sparsity level? Layer-dependent sparsity?
• Optimizing the choice of supervised layer
– Neural nets versus nonlinear SVM
• What are the best approaches for low-power hardware implementations?
– Power savings from sparsity, backend complexity