
A High-Throughput Screening Approach to Discovering Good Forms of Biologically-Inspired Visual Representation
Nicolas Pinto 1, David Cox 2, and James DiCarlo 1

1. The McGovern Institute for Brain Research and Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology
2. The Rowland Institute at Harvard

Take home message:

GPU Metaprogramming dramatically accelerates the discovery of bio-inspired vision models that beat state-of-the-art computer vision systems

1. Abstract

While many models of biological object recognition share a common set of "broad-stroke" properties, the performance of any one model depends strongly on the choice of parameters in a particular instantiation of that model - e.g. the number of units per layer, the size of pooling kernels, exponents in normalization operations, etc. Since the number of such parameters (explicit or implicit) is typically large, and the computational cost of evaluating one particular parameter set is high, the space of possible model instantiations goes largely unexplored. Thus, when a model fails to approach the abilities of biological visual systems, we are left uncertain whether this failure is because we are missing a fundamental idea, or because the correct "parts" have not been tuned correctly, assembled at sufficient scale, or provided with enough training.

Here, we present a high-throughput approach to the exploration of such parameter sets, leveraging recent advances in stream processing hardware (high-end NVIDIA graphics cards and the PlayStation 3's IBM Cell processor). In analogy to high-throughput screening approaches in molecular biology and genetics, we explored thousands of potential network architectures and parameter instantiations, screening those that show promising object recognition performance for further analysis. We show that this approach can yield significant, reproducible gains in performance across an array of basic object recognition tasks, consistently outperforming a variety of state-of-the-art vision systems from the literature.

As the scale of available computational power continues to expand, we argue that this approach has the potential to greatly accelerate progress in both artificial vision and our understanding of the computational underpinnings of biological vision.


5. High-throughput screening for good models

7. Analysis: Why are these models good?

3. Metaprogramming

2. A vast space of models to explore

Recipe:
1) Use scripting (e.g. Python) and templates,
2) Instrument your code,
3) Let the computer auto-tune it!
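As a concrete illustration of this recipe, here is a minimal, hypothetical sketch (not the poster's actual tuning harness): a Python string template is instantiated with different block sizes, each variant is compiled with PyCUDA and timed, and the fastest one is kept. Kernel, variable, and parameter names are illustrative.

# Hypothetical auto-tuning sketch (illustrative names; not the actual harness).
# Recipe: (1) generate CUDA source from a template, (2) instrument with timers,
# (3) let the computer pick the fastest variant.
import string
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

KERNEL_TEMPLATE = string.Template("""
__global__ void scale(float *x, int n)
{
    // BLOCK_W is baked in at code-generation time.
    int i = blockIdx.x * ${BLOCK_W} + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}
""")

x = np.random.rand(1 << 20).astype(np.float32)
best = None
for block_w in (32, 64, 128, 256):                      # candidate parameters
    src = KERNEL_TEMPLATE.substitute(BLOCK_W=block_w)   # (1) generate code
    fn = SourceModule(src).get_function("scale")
    start, stop = drv.Event(), drv.Event()
    start.record()                                      # (2) instrument
    fn(drv.InOut(x), np.int32(x.size),
       block=(block_w, 1, 1), grid=((x.size + block_w - 1) // block_w, 1))
    stop.record()
    stop.synchronize()
    t = stop.time_since(start)                          # elapsed milliseconds
    if best is None or t < best[0]:
        best = (t, block_w)                             # (3) keep the fastest
print("best BLOCK_W:", best[1])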

Pipeline: Generate random models → unsupervised learning → screen (test with a “screening” object recognition task) → choose the best models → validate on other tasks.

[Diagram: unsupervised learning on the “Law & Order” video “world”, and the resulting screening distribution.]
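A minimal sketch of this screening pipeline in Python (function names are hypothetical; the actual model generation, learning, and evaluation code is not shown on the poster):

# Hypothetical screening-loop sketch (function names are illustrative).
import random

def screen(n_models=2500, n_keep=5, seed=0):
    rng = random.Random(seed)
    results = []
    for _ in range(n_models):
        params = sample_random_params(rng)      # draw one of the ~10^25 candidate models
        model = build_model(params)             # bio-inspired 3-layer architecture
        unsupervised_learning(model, video="law_and_order")
        score = evaluate(model, task="cars_vs_planes_screen")  # percent correct
        results.append((score, params))
    results.sort(key=lambda r: r[0], reverse=True)
    return results[:n_keep]                     # top-5 models go on to validation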

Performance / Cost (summary)

Implementation | Manufacturer | Model | Year | # cores used | Full system cost (approx.) | Relative speedup | Relative perf. / $
MATLAB | Intel | Q9450 | 2008 | 1 | $1,500** | 1x | 1x
SSE2 | Intel | Q9450 | 2008 | 4 | $1,000 | 80x | 120x
MATLAB | Intel | Q9450 | 2008 | 4 | $2,700** | 4x | 2x
Cg | NVIDIA | 7900 GTX | 2006 | 4x96 | $3,000* | 544x | 272x
Cell SDK | Sony, IBM, Toshiba | PlayStation 3 | 2007 | 2+6 | $400 | 222x | 833x
CUDA | NVIDIA | 8800 GTX | 2007 | 4x128 | $3,000* | 1544x | 772x
CUDA | NVIDIA | GTX 280 | 2008 | 4x240 | $3,000* | 2712x | 1356x

[Bar chart: processing performance in GFLOPS (log scale), theoretical vs. observed, for each of the CPU and GPU implementations above.]

Hardware, software, and measured performance (detail)

Manufacturer | Model | # cores used | Languages | Extensions / Libraries | Year | Full system cost (approx.) | GFLOPS, theoretical (max) | GFLOPS, observed | Relative speedup | $ / GFLOPS | Relative GFLOPS / $
Intel | Q9450 | 1 | MATLAB / C | MEX | 2008 | $1,500** | 5.3 | 0.5 (max) | 1 | 3000 | 1
Intel | Q9450 | 4 | MATLAB / C | MEX | 2008 | $2,700** | 21 | 2 (max) | 4 | 1350 | 2
Intel | Q9450 | 4 | C | SSE2 / pthread | 2008 | $1,000 | 85 | 40 (max) | 80 | 25 | 120
NVIDIA | 7900 GTX | 96 | Python / C++ | OpenGL / Cg | 2006 | $1,500-$3,000* | 187 | 68 (max); 68-272* | 136-544* | 22-11* | 136-272*
Sony, IBM, Toshiba | PlayStation 3 (Cell) | 8 (2+6) | Python / C | Cell SDK | 2007 | $400 | 154 | 111 (max), 76 (mean), 9 (min) | 222 | 4 | 833
NVIDIA | 8800 GTX | 128 | Python / C | PyCUDA | 2007 | $1,500-$3,000* | 518 | 193 (max); 193-772* | 386-1544* | 8-4* | 386-772*
NVIDIA | GTX 280 | 240 | Python / C | PyCUDA | 2008 | $1,500-$3,000* | 933 | 339 (max); 339-1356* | 678-2712* | 4-2* | 678-1356*
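The derived columns follow directly from the observed GFLOPS, the approximate system cost, and the single-core MATLAB baseline (0.5 GFLOPS, $1,500). A quick sanity check for one row (GTX 280, single card), with values taken from the table above:

# Sanity check of the derived columns for the GTX 280 row (single card).
baseline_gflops, baseline_cost = 0.5, 1500.0    # Q9450, MATLAB / C (MEX), 1 core
gflops, cost = 339.0, 1500.0                    # GTX 280 observed GFLOPS, approx. cost

relative_speedup = gflops / baseline_gflops                                        # ~678x
dollars_per_gflops = cost / gflops                                                 # ~$4.4 / GFLOPS
relative_gflops_per_dollar = (gflops / cost) / (baseline_gflops / baseline_cost)   # ~678
print(relative_speedup, dollars_per_gflops, relative_gflops_per_dollar)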

[Histogram: screening distribution of percent correct (50-100%) over N=2,500 randomly generated models on the screening task (y axis: number of models, 0-250); the top 5 models sit at the high end of the distribution.]

• bio-inspired architecture
• 52 free parameters
• >10^25 possible models

Best Models

[Bar chart ("Best Models"): percent correct (50-100%) on Cars vs. Planes for the top 5 models from the screen (models 1-5 and their blend), re-run with different initial weights and different training videos.]

• Why are the best models better?
• How can we adjust the parameter space to get farther?

[Plots: percent correct (y axis, 50-90%) as a function of individual parameter values; example panels include Preprocessing / Norm / Zero-mean (False, True), Layer 2 / Linear Filtering / RF size (3, 5, 7, 9), and Layer 3 / Learning / Neighborhood size (1, 3, 5, 7, 9); a further panel sweeps values 16, 32, 64, 128.]

[Analysis panels:
- Parameters similarity (top-five vs rest): distribution of the median of Hamming distances between the 5-choose-2 pairs of binarized parameter vectors (observed value 32.5); N = 100,000, p = 0.136.
- Representation similarity (top-five vs rest): distribution of the median of Euclidean distances between the 5-choose-2 pairs of similarity matrices (observed value 12,828); N = 100,000, p = 0.082.
- Parameter significance: histogram of -ln(p) across parameters.
- Parameters by category: fraction of Filtering, Activation, Normalization, Pooling, and Learning parameters falling into the low-, medium-, and high-significance groups, compared with the expected fraction.]
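A hedged sketch of the kind of permutation test behind the "parameters similarity" panel (the exact statistic, binarization, and null construction used on the poster are assumptions here): compare the median pairwise Hamming distance among the top-five models' binarized parameter vectors against the same statistic computed for randomly drawn groups of five.

# Hypothetical permutation-test sketch for the "parameters similarity" analysis.
# Assumes `params` is an (n_models, n_bits) array of binarized parameter vectors
# and `top5_idx` indexes the five best models; details are illustrative.
from itertools import combinations
import numpy as np

def median_pairwise_hamming(vectors):
    # Median Hamming distance over the 5-choose-2 pairs within a group.
    return np.median([np.sum(a != b) for a, b in combinations(vectors, 2)])

def permutation_p_value(params, top5_idx, n_perm=100_000, seed=0):
    rng = np.random.default_rng(seed)
    observed = median_pairwise_hamming(params[top5_idx])
    null = np.array([
        median_pairwise_hamming(params[rng.choice(len(params), size=5, replace=False)])
        for _ in range(n_perm)
    ])
    # One-sided: are the top five more similar (smaller distances) than chance?
    return (np.sum(null <= observed) + 1) / (n_perm + 1)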

version A:

...
mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
mov.b32 $r1, c0[$ofs2+0x0008]
mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x000c]
mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x0010]
mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
...

version B (2x faster! each mad reads its constant-memory operand directly, so the separate mov disappears and the inner loop has roughly half the instructions):

...
mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1
...

conv_kernel_template.cu

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

#for j in xrange($FILTER_H)

__global__ void convolve_beta_j${j}(float4 *input, float4 *output) {

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
  __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

  // -- input/output offsets
  const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  float4 input_v4;

  // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
  if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
    shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
    shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
    shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
    ...

Cheetah templating expands this template into specialized kernels, e.g. conv_kernel_4x4x4.cu (20 kB) and conv_kernel_8x8x4.cu (64 kB). Generated code (conv_kernel_4x4x4.cu):

#include <stdio.h>

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[4][4][4];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

__global__ void convolve_beta_j0(float4 *input, float4 *output) {

  __shared__ float shared_in[131][4+1];

  // -- input/output offsets
  const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  float4 input_v4;

  // -- load input to shared memory
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
    shared_in[threadIdx.x+128*0][0] = input_v4.x;
    shared_in[threadIdx.x+128*0][1] = input_v4.y;
    shared_in[threadIdx.x+128*0][2] = input_v4.z;
    shared_in[threadIdx.x+128*0][3] = input_v4.w;
  }
  if((threadIdx.x+128*1)<131)
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
    shared_in[threadIdx.x+128*1][0] = input_v4.x;
    shared_in[threadIdx.x+128*1][1] = input_v4.y;
    shared_in[threadIdx.x+128*1][2] = input_v4.z;
    shared_in[threadIdx.x+128*1][3] = input_v4.w;
  }
  __syncthreads();

  // -- compute dot products
  float v, w;
  float sum0 = 0;
  float sum1 = 0;
  float sum2 = 0;
  float sum3 = 0;

  v = shared_in[threadIdx.x+0][0];
  w = constant[0][0][0]; sum0 += v*w;
  w = constant[0][0][1]; sum1 += v*w;
  w = constant[0][0][2]; sum2 += v*w;
  w = constant[0][0][3]; sum3 += v*w;

  v = shared_in[threadIdx.x+1][0];
  w = constant[0][1][0]; sum0 += v*w;
  w = constant[0][1][1]; sum1 += v*w;
  w = constant[0][1][2]; sum2 += v*w;
  w = constant[0][1][3]; sum3 += v*w;

  v = shared_in[threadIdx.x+2][0];
  w = constant[0][2][0]; sum0 += v*w;
  ...
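A minimal sketch of how such a template might be rendered and compiled from Python with Cheetah and PyCUDA (the poster does not show the driver script; the parameter values below are illustrative, and the full template, including the INPUT_W / OUTPUT_W definitions not visible in the excerpt above, is assumed to be available):

# Hypothetical driver sketch: render the Cheetah template with one parameter
# set and compile the resulting CUDA source with PyCUDA.
from Cheetah.Template import Template
import pycuda.autoinit
from pycuda.compiler import SourceModule

params = dict(FILTER_H=4, FILTER_W=4, FILTER_D=4, N_FILTERS=4,
              BLOCK_W=128, LOAD_ITERATIONS=2)          # illustrative values
cuda_src = str(Template(file='conv_kernel_template.cu', searchList=[params]))

# The template already wraps its kernels in extern "C", so disable PyCUDA's wrapper.
mod = SourceModule(cuda_src, no_extern_c=True)
convolve_j0 = mod.get_function('convolve_beta_j0')     # one kernel per filter row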

4. GPU performance

6. Result: Validation

[Bar charts: percent correct (50-100%) on four validation tasks - Cars vs. Planes (validation), Boats vs. Animals, Synthetic Faces, and MultiPIE Hybrid. Each panel compares a v1-like control, state-of-the-art systems from the literature (SIFT, GB, PHOG, PHOW, SLF; regular and blend variants), and the top 5 models from the high-throughput search (models 1-5 and their blend).]

We discovered models that beat state-of-the-art computer vision (object and face recognition)

[Architecture diagram: input → L1 → L2 → L3 → Read-out (standard “back-end” classifier). Each layer is parameterized by a linear filtering stage (kernel size, number of filters), an activation stage (threshold / saturation), a normalization stage (norm strength), and a learning stage (Rate, Trace, “Temp. Adv.”, “Auto-reset”, ...).]
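To make the per-layer parameterization concrete, here is a rough NumPy sketch of one such layer (filtering → activation → normalization); the function and parameter names are illustrative assumptions, and the learning stage and pooling are omitted:

# Hypothetical sketch of a single layer of the parameterized architecture
# (filtering -> activation -> normalization); names and defaults are illustrative.
import numpy as np
from scipy.signal import correlate2d

def layer_forward(x, filters, thresh=0.0, sat=1.0, norm_strength=1.0, eps=1e-6):
    # x: 2D input map; filters: (n_filters, k, k) filter bank (the number of
    # filters and the kernel size k are two of the free parameters).
    out = np.stack([correlate2d(x, f, mode='valid') for f in filters])
    out = np.clip(out, thresh, sat)                        # activation: threshold / saturation
    norm = np.sqrt((out ** 2).sum(axis=0, keepdims=True))  # local (across-filter) energy
    return out / (eps + norm_strength * norm)              # divisive normalization

# Example: 16 random 5x5 filters applied to a random 64x64 "image".
rng = np.random.default_rng(0)
maps = layer_forward(rng.standard_normal((64, 64)), rng.standard_normal((16, 5, 5)))
print(maps.shape)   # (16, 60, 60)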