
798.01 / Y5
Understanding neural representations in early visual areas using convolutional neural networks

Yimeng Zhang a, Corentin Massot a, Tiancheng Zhi b, George Papandreou c, Alan Yuille c, Tai Sing Lee a

a Center for the Neural Basis of Cognition and Computer Science Department, Carnegie Mellon University, Pittsburgh, PA
b Peking University, Beijing, China
c UCLA, Los Angeles, CA

Motivations

• Convolutional neural networks (CNNs) have feature representations similar to those in higher layers of the primate and human visual cortex (Agrawal et al., 2014; Khaligh-Razavi & Kriegeskorte, 2014; Yamins et al., 2014).
• Recent data on V1 neurons suggest that they may encode much more complex features than classically assumed (see poster 798.03/Y7).
• CNNs might therefore be a useful tool for understanding the encoding of complex features in lower layers (V1/V2) of the visual cortex as well.

Images and neural data

• Responses of 286 V1 and 390 V2 neurons in 2 monkeys to 150 stimuli, recorded with multi-electrode arrays.
• The 150 stimuli comprise 3 subsets of 50 each: Edge (E), Appearance (A), and Exemplar (EX), in order of increasing complexity.

[Figure: example Edge, Appearance, and Exemplar stimuli (25 stimuli per panel); example neural responses; and tuning curves (firing rate in spk/s vs. ranked stimuli) for the E, A, and EX subsets.]

• Responses of around 3000 V1 neurons in 3 monkeys to 2250 natural images, recorded with calcium imaging (see poster 798.03/Y7).

Computer vision models

• CNN: "AlexNet" (Krizhevsky et al., 2012). Input 227 x 227 x 3; conv1 (convolve + threshold) gives 55 x 55 x 96, followed by norm1 and pool1 (27 x 27 x 96), six further layers up to pool5 (6 x 6 x 256), and fully connected layers fc6 and fc7 (product + threshold, 1 x 1 x 2048 each).
• V1like / V1likeSC (Pinto et al., 2008): convolve with a filter bank, threshold & saturate, normalize, and pool. V1like uses fixed Gabor filters; V1likeSC uses learned filters instead. A minimal sketch follows.

[Figure: schematics of the AlexNet and V1like / V1likeSC pipelines, with example filters for each model.]
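As a concrete illustration, here is a minimal sketch of a V1like-style pipeline in the spirit of Pinto et al. (2008): convolve with a filter bank, threshold and saturate, normalize, and pool. The Gabor parameters, the scikit-image `gabor_kernel` helper, and the pooling size are illustrative assumptions, not the exact settings used on the poster.

```python
import numpy as np
from scipy.signal import fftconvolve
from skimage.filters import gabor_kernel  # illustrative filter bank

def v1like_features(image, n_orientations=8, frequency=0.2, pool=8):
    """Sketch of a V1like pipeline: convolve with a Gabor bank,
    threshold & saturate, divisively normalize, then spatially pool."""
    maps = []
    for i in range(n_orientations):
        theta = i * np.pi / n_orientations
        kernel = np.real(gabor_kernel(frequency, theta=theta))
        maps.append(fftconvolve(image, kernel, mode="same"))
    x = np.stack(maps, axis=-1)                  # H x W x n_orientations
    x = np.clip(x, 0.0, 1.0)                     # threshold & saturate
    x = x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-6)  # normalize
    h, w, c = x.shape
    x = x[:h - h % pool, :w - w % pool]          # crop to a multiple of pool
    hp, wp = x.shape[0] // pool, x.shape[1] // pool
    x = x.reshape(hp, pool, wp, pool, c).max(axis=(1, 3))  # max pooling
    return x.ravel()                             # final feature vector

# usage: features = v1like_features(gray_image_as_2d_float_array)
```

Swapping the fixed Gabor bank for learned filters would, presumably, give the V1likeSC variant compared on the poster.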

Model comparison using Representational Similarity Analysis

• RSA (Kriegeskorte et al., 2008) was used to compare model representations φ_model and neural representations φ_area (where area can be V1, V2, etc.).
• Similarity between two representations is defined as the Spearman correlation coefficient of their representational dissimilarity matrices (RDMs), each of which captures the pairwise distances between images for a given representation.

[Figure: example RDMs over 50 images, RDM(φ_model) computed from 20 model units and RDM(φ_V1) from 30 neurons.]

For a representation φ evaluated on images x_1, ..., x_50,

    RDM(φ)_ij = 1 − ρ(φ(x_i), φ(x_j)),

where ρ denotes Pearson's correlation.
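A minimal sketch of this comparison, assuming feature matrices with one row per image (the names and shapes below are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def rdm(features):
    """RDM_ij = 1 - Pearson correlation between the feature vectors
    of images i and j (rows of `features` are images)."""
    return 1.0 - np.corrcoef(features)

def rsa_similarity(model_features, neural_responses):
    """Spearman correlation between the off-diagonal entries of the
    model RDM and the neural RDM."""
    n = model_features.shape[0]
    iu = np.triu_indices(n, k=1)  # upper triangle, excluding the diagonal
    rho, _ = spearmanr(rdm(model_features)[iu], rdm(neural_responses)[iu])
    return rho

# e.g. 50 images: 20 model units vs. 30 recorded neurons
similarity = rsa_similarity(np.random.randn(50, 20), np.random.randn(50, 30))
```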

Results

[Figure: Spearman correlation with the neural RDM for each model vs. V1 and V2 on the 150 stimuli set, by E/A/EX subset (left); normalized similarity of each CNN layer vs. V1 and V2 on the 150 stimuli set (top right); similarity of each CNN layer vs. V1 on the 2250 stimuli set (bottom right).]

• Left: comparison of models on the 150 stimulus set. Top right: all CNN layers on the 150 stimulus set. Bottom right: all CNN layers on the 2250 stimulus set.
• Horizontal lines estimate the achievable similarity, computed from the similarities of feature representations among different monkeys; this is similar in spirit to "explainable variance" (see the sketch after this list).
• 150 set: CNN > V1likeSC > V1like, especially on complex stimuli (EX). The best matching CNN layer is stimulus dependent: simpler stimuli (E) are best explained by lower layers, and complex stimuli (EX) by higher layers.
• 2250 set: the CNN is far from the achievable similarity, suggesting missing constraints in the CNN.
• Higher CNN layers can match V1/V2 as well as (or even better than) lower layers, suggesting complex coding in V1/V2 neurons.
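The inter-monkey "achievable similarity" could be estimated along these lines; this is a hedged sketch reusing `rsa_similarity` from the RSA sketch above, since the exact pairing and averaging scheme is not given on the poster.

```python
from itertools import combinations
import numpy as np

def achievable_similarity(responses_by_monkey):
    """Noise-ceiling estimate: mean RSA similarity between the neural
    representations (n_stimuli x n_neurons) of every pair of monkeys."""
    pairs = combinations(responses_by_monkey, 2)
    return float(np.mean([rsa_similarity(a, b) for a, b in pairs]))

# usage: ceiling = achievable_similarity([monkey1_resp, monkey2_resp, monkey3_resp])
```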

Neuron matching and visualization

• Visualizations of some best matching units were obtained using deconvolution (figure below); a simpler gradient-based stand-in is sketched after this bullet.
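The poster's visualizations come from deconvolution (Zeiler & Fergus style). As a rough, simpler stand-in, a unit's preferred input structure can be approximated by the gradient of its activation with respect to the input pixels. This sketch assumes a torchvision AlexNet; the layer index and channel are hypothetical, and this is not the poster's exact method.

```python
import torch
from torchvision.models import alexnet

def unit_input_gradient(image, layer_idx=5, channel=0):
    """Gradient-based stand-in for deconvolution: backpropagate one
    channel's activation at a chosen layer to the input pixels."""
    features = alexnet(weights="DEFAULT").features.eval()
    x = image.clone().requires_grad_(True)   # shape 1 x 3 x 227 x 227
    h = x
    for i, layer in enumerate(features):
        h = layer(h)
        if i == layer_idx:                   # stop at the chosen layer
            break
    h[0, channel].sum().backward()           # drive the chosen channel
    return x.grad[0]                         # per-pixel sensitivity map
```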

[Figure: distributions of the layers (conv1 through fc7) of best matching CNN units for each individual neuron, as normalized proportions, separately for V1 (peaking at pool1) and V2 (peaking at pool2) and for the E, A, and EX datasets; deconvolution visualizations of example matched units, panels a (E), b (E), c (EX), d (EX), e (E), and f (EX).]

• Single neuron matching results were consistent with RSA: V1 matched better to pool1, V2 to pool2. Complex stimuli (EX) shifted the matches to higher layers compared to simple stimuli (E). V2 neurons were also more correlated with higher-layer CNN units than V1 neurons (a minimal matching sketch follows below).
• While some neurons have visualizations consistent with the existing literature (a, b, c, e), some neurons preferred more complex features (d, f).
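A minimal sketch of the per-neuron matching, assuming that each neuron's responses and each CNN unit's activations to the same stimuli are available; the layer names and array shapes are illustrative, and the exact matching criterion on the poster is not specified.

```python
import numpy as np

def best_matching_layer(neuron_responses, layer_activations):
    """For one neuron, find the CNN layer containing the unit whose
    responses correlate best with the neuron across all stimuli.

    neuron_responses:  (n_stimuli,) firing rates
    layer_activations: dict of layer name -> (n_stimuli, n_units) array
    """
    z_neuron = (neuron_responses - neuron_responses.mean()) / neuron_responses.std()
    best_layer, best_r = None, -np.inf
    for name, acts in layer_activations.items():
        z_units = (acts - acts.mean(0)) / (acts.std(0) + 1e-12)
        # mean of z-score products = Pearson r; take the best unit in the layer
        r = (z_units * z_neuron[:, None]).mean(0).max()
        if r > best_r:
            best_layer, best_r = name, r
    return best_layer, best_r
```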

Why CNN performs better

• Network effects. Without normalization and pooling, V1like performed worse (not shown).
• Diverse filter shapes. V1likeSC and pool1 are better than V1like partly due to their learned, diverse filters, compared to the fixed Gabor filters in V1like.
• Network architecture might contribute as well. On the 2250 stimulus set, higher CNN layers performed better than lower layers even with all network parameters set at random (see the sketch below).
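The random-parameter control could be reproduced roughly as follows; this sketch assumes a torchvision AlexNet (the poster's own implementation is not specified) with every learned weight re-drawn at random.

```python
import torch
from torch import nn
from torchvision.models import alexnet

def random_alexnet(seed=0):
    """AlexNet with all parameters drawn at random, so that any fit to
    neural data reflects the architecture rather than training."""
    torch.manual_seed(seed)
    model = alexnet(weights=None)        # architecture only, no training
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.normal_(m.weight, std=0.01)  # random Gaussian weights
            nn.init.zeros_(m.bias)
    return model.eval()
```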

Conclusion

• Some V1/V2 neurons may encode more complex features than previously thought.
• CNNs are a good approximate model for understanding and visualizing V1 and V2 neurons.
• Future work: (1) add more biological constraints to CNN models so that they explain the neural data better; (2) explore CNNs with heterogeneous layers, each layer containing units of different complexities.

References

• Agrawal, P., et al. 2014, arXiv.org.
• Khaligh-Razavi, S.-M., & Kriegeskorte, N. 2014, PLoS Computational Biology, 10, e1003915.
• Kriegeskorte, N., Mur, M., & Bandettini, P. A. 2008, Frontiers in Systems Neuroscience, 2.
• Krizhevsky, A., Sutskever, I., & Hinton, G. E. 2012, in NIPS 25, 1097-1105.
• Olshausen, B. A. 2013, in IS&T/SPIE Electronic Imaging.
• Pinto, N., Cox, D. D., & DiCarlo, J. J. 2008, PLoS Comput Biol, 4, e27.
• Yamins, D. L. K., et al. 2014, Proceedings of the National Academy of Sciences, 111, 8619.

Acknowledgments

This research is supported by IARPA MICRONS contract #D16PC00007 and NIH R01 EY022247. Thanks to Professor Shiming Tang for the calcium imaging monkey data.

Contact: [email protected]