Efficient Implementation of Convolutional Neural Networks using OpenCL on FPGAs
Dr. Deshanand Singh, Director of Software Engineering
12 May 2015
Copyright © 2015 Altera
Convolutional Neural Network (CNN)

• Convolutional Neural Network
  • Feed-forward network inspired by biological processes
  • Composed of different layers
  • More layers increase accuracy
• Convolutional Layer
  • Extracts different features from the input
  • Low-level features, e.g. edges, lines, corners
• Pooling Layer
  • Reduces variance
  • Invariant to small translations
  • Takes the max or average value of a feature over a region of the image
• Applications
  • Classification & detection, image recognition/tagging

Prof. Hinton’s CNN Algorithm
Basic Building Block of the CNN

[Figure: Image → Convolution → Pooling]

Source: Li Deng, Deep Learning Technology Center, Microsoft Research, Redmond, WA, USA. A Tutorial at International Workshop on Mathematical Issues in Information Sciences.
Key Observation: Pipelining

• Dataflow through the CNN can proceed in pipelined fashion
  • No need to wait until the entire execution is complete
  • A new set of data can enter stage 1 as soon as that stage completes its execution

Layer 1: Convolving Kernel → Layer 1: Pooling Kernel → Layer 2: Convolving Kernel
FPGAs: Compute Fabrics that Support Pipelining

• Basic Element: a 1-bit configurable operation plus a 1-bit register to store the result
  • Can be configured to perform any 1-bit operation: AND, OR, NOT, ADD, SUB
• Basic Elements are surrounded by a flexible interconnect
  • They can be grouped to build wider operations: a 16-bit add, a 32-bit sqrt, or your custom 64-bit bit-shuffle and encode
• Memory Blocks (20 Kb each, with addr, data_in, and data_out ports)
  • Can be configured and grouped using the interconnect to create various cache architectures
• Dedicated floating-point multiply and add blocks
• Blocks are connected into a custom data-path that matches your application
Programming FPGAs: SDK for OpenCL

• Users write software: an OpenCL host program plus OpenCL kernels
• The host program is compiled with a C compiler, linked against the Altera OpenCL host library, to produce the host binary
• The OpenCL kernels are compiled by the Altera Kernel Compiler, an FPGA-specific compilation target, to produce the device binary
OpenCL Programming Model

• Host + Accelerator Programming Model
  • Sequential host program on a microprocessor
  • Functions are offloaded onto a highly parallel accelerator device
  • Each accelerator has its own local memory; the accelerators and processor share global memory

Kernel (runs on the accelerator):

    __kernel void sum(__global float *a,
                      __global float *b,
                      __global float *y)
    {
        int gid = get_global_id(0);
        y[gid] = a[gid] + b[gid];
    }

Host program (runs on the processor):

    main() {
        read_data( … );
        manipulate( … );
        clEnqueueWriteBuffer( … );
        clEnqueueNDRange( …, sum, … );
        clEnqueueReadBuffer( … );
        display_result( … );
    }
OpenCL Kernels

• Data-parallel function
  • Defines many parallel threads of execution
  • Each thread has an identifier specified by “get_global_id”
  • Contains keyword extensions to specify parallelism and memory hierarchy
• Executed by a compute object
  • CPU, GPU, FPGA, DSP, or other accelerators

    __kernel void sum(__global const float *a,
                      __global const float *b,
                      __global float *answer)
    {
        int xid = get_global_id(0);
        answer[xid] = a[xid] + b[xid];
    }

Example data for the 8-element vector add:

    float *a      = 0 1 2 3 4 5 6 7
    float *b      = 7 6 5 4 3 2 1 0
    float *answer = 7 7 7 7 7 7 7 7
Dataflow / Pipeline Architecture from OpenCL

• On each cycle, the portions of the pipeline are processing different threads
• While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored

[Figure: the 8 threads (IDs 0–7) of the vector-add example advance one per cycle through a pipeline of Load, Load, +, and Store stages]
Convolutions: Our Basic Building Block

    I_new(x, y) = Σ_{y′=−1..1} Σ_{x′=−1..1} I_old(x + x′, y + y′) × F(x′, y′)
Processor (CPU/GPU) Implementation

• A cache between the CPU and main memory can hide poor memory access patterns

    // 3x3 convolution over the image interior
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            for (int y2 = -1; y2 <= 1; ++y2) {
                for (int x2 = -1; x2 <= 1; ++x2) {
                    i2[y][x] += i[y + y2][x + x2] * filter[y2 + 1][x2 + 1];
                }
            }
        }
    }
FPGA Implementation?

• Example performance point: 1 pixel per cycle
• Cache requirements: 9 reads + 1 write per cycle
  • 9 read ports means expensive hardware!
  • Power overhead
  • Cost overhead: more built-in addressing flexibility than we need
• Why not customize the cache for the application?

[Figure: Memory → Cache (9 read ports!) → Custom Data-path]
Optimizing the “Cache”

• Start out with the initial picture that is W pixels wide
• Remove all the lines that aren’t in the neighborhood of the 3x3 window
• Take all of the remaining lines and arrange them as a 1D array of pixels
• Drop the pixels at the edges that we don’t need for the computation
• What happens when we move the window one pixel to the right?
  • The entire 1D array simply shifts by one pixel

[Figure: successive frames showing a W-pixel-wide image reduced to a 1D array of 2W+3 pixels that shifts as the window moves]
Shift Registers in Software

• Managing data movement to match the FPGA’s architectural strengths is key to obtaining high performance.

    pixel_t sr[2*W + 3];
    while (keep_going) {
        // Shift data in (from the end, so no element is overwritten before it is copied)
        #pragma unroll
        for (int i = 2*W + 2; i > 0; --i)
            sr[i] = sr[i-1];
        sr[0] = data_in;

        // Tap output data: the 9 pixels of the 3x3 window
        data_out = {sr[0],   sr[1],     sr[2],
                    sr[W],   sr[W+1],   sr[W+2],
                    sr[2*W], sr[2*W+1], sr[2*W+2]};
        // ...
    }

data_in enters at sr[0]; the oldest pixel sits at sr[2*W+2]; data_out[9] holds the window.
Building a CNN for CIFAR-10 on an FPGA

• CIFAR-10 Dataset
  • 60000 32x32 colour images in 10 classes, with 6000 images per class
  • 50000 training images and 10000 test images
  • Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
• Many CNN implementations are available for CIFAR-10
  • Cuda-convnet provides a baseline implementation that many works build upon
• Layers implemented on the FPGA: 2-D convolution (5x5), max-pooling, fully connected
CIFAR-10 CNN: Traditional OpenCL Implementation

• High latency: requires access to global memory
• High memory bandwidth
• Requires host coordination to pass buffers from one kernel to another

[Figure: Conv 1, Pool 1, …, Conv 2 kernels each reading and writing buffers in global memory (DDR)]

Stratix V implementation: processes 183 images per second for CIFAR-10
CIFAR-10 CNN: Kernel-to-Kernel Channels

• Low-latency communication between kernels
• Significantly lower memory bandwidth requirements
• Host is not involved in coordinating communication between kernels

[Figure: Conv 1 → Pool 1 → … → Conv N kernels connected directly by channels; only the first and last kernels use buffers in global memory (DDR)]

Stratix V implementation: processes 400 images per second for CIFAR-10

• Channel declaration creates a queue:

      channel int my_channel;

• Channel write pushes data into the queue:

      void write_channel_altera(channel &ch, value_type data);
      write_channel_altera(my_channel, x);

• Channel read pops the first element from the queue:

      value_type read_channel_altera(channel &ch);
      int y = read_channel_altera(my_channel);
CIFAR-10: FPGA Code

• The entire algorithm can be expressed in ~500 lines of OpenCL for the FPGA
• Kernels are written as standard building blocks that are connected together through channels
• The shift-register convolution building block is the most heavily used portion of this code
• The concept of multiple concurrent kernels executing simultaneously on a device and communicating directly is currently unique to FPGAs
  • Will be portable in OpenCL 2.0 through the concept of “OpenCL Pipes”

    #pragma OPENCL EXTENSION cl_altera_channels : enable

    // Declaration of Channel API data types
    channel float prod_conv1_channel;
    channel float conv1_pool1_channel;
    channel float pool1_conv2_channel;
    channel float conv2_pool2_channel;
    channel float pool2_conv3_channel;
    channel float conv3_pool3_channel;
    channel float pool3_cons_channel;

    __kernel void convolutional_neural_network_prod(
        int batch_id_begin,
        int batch_id_end,
        __global const volatile float * restrict input_global)
    {
        for (...) {
            write_channel_altera(prod_conv1_channel, input_global[...]);
            write_channel_altera(prod_pool3_channel, input_global[...]);
        }
    }
Effect of Floating Point FPGAs

• Altera’s Arria 10 family of FPGAs introduces DSP blocks with a dedicated floating-point mode
• Each DSP block includes an IEEE 754 single-precision floating-point multiplier and adder
• FPGA logic blocks (lookup tables and registers) are no longer needed to implement floating-point functions
• The massive resource savings allow many more processing pipelines to run simultaneously on the FPGA

Arria 10 implementation: processes 6800 images per second for CIFAR-10
Lessons Learned

• CNN implementations are well suited to pipelined implementations
• Exploiting pipelining on the FPGA requires some attention to coding style, to overcome the assumptions inherent in writing “software”
  • FPGAs do not have caches; data reuse must be exploited in a more explicit way
• The concept of dataflow pipelining will not realize its full potential if we write intermediate results to memory
  • Bandwidth limitations begin to dominate compute
  • Use direct kernel-to-kernel communication, called channels
• Native support for floating point on the FPGA allows an order-of-magnitude performance increase
Resources

• Altera’s SDK for OpenCL
  • http://www.altera.com/products/software/opencl/opencl-index.html
• FPGA-optimized OpenCL examples (filters, convolutions, …)
  • http://www.altera.com/support/examples/opencl/opencl.html
• CIFAR-10 dataset
  • http://www.cs.toronto.edu/~kriz/cifar.html
• CUDA-Convnet
  • https://code.google.com/p/cuda-convnet/