fpgaConvNet: A Framework for Mapping Convolutional Neural

Preview:

Citation preview

fpgaConvNet:AFrameworkforMappingConvolutionalNeuralNetworksonFPGAs

Stylianos I. Venieris, Christos-Savvas Bouganisstylianos.venieris10@imperial.ac.uk

FCCM 2016, Washington DC2 May 2016

Deep Learning and AI

2

Deep Learning Success Stories - ConvNets

3

Image Recognition (Microsoft, 2015)

Image Captioning (Microsoft, 2015)“Deep Face” (Facebook, 2014)

Deep Learning on FPGAs

4

• Hand-tuned implementations • Memory I/O Optimisation

• Design Space Exploration

What is missing?

5

• Caffe• TensorFlo• Theano• Torch

• FPGA

– ConvNet Functionality

– Optimised for

HighPerformance

Deep Learning Developers fpgaConvNet

Our approach - fpgaConvNet

6

Automated Design Space Exploration

ConvNet Hardware Mapping

fpgaConvNet

FPGA Target Platform Specifications

Supplied by Deep Learning Expert

ConvNet Description

Bitstream

Convolutional Neural Networks (ConvNets)

7

convolutional+ nonlinearity

pooling convolutional+ nonlinearity

pooling

• SynchronousDataFlow

– ConvNetasadata-drivengraph

– Representedasamatrix

– Eachlayermappedtoatunablesetofhardwarebuildingblocks

fpgaConvNet – ConvNet Modelling Framework

8

Streaming

Analytical power

fpgaConvNet – Modelling ConvNets with SDF

9

ConvNet Hardware SDF Graph

Convolutional Layerwith4filtersNonlinLayer PoolingLayer Convolutional Layerwith4filters

Memory

Interface

Input

Data

fpgaConvNet – Design Space Perspective

10

ConvNet Hardware SDF Graph

Bottlenecks:

– Limitedcomputeresources

– Limitedoff-chipmemorybandwidth

– Limitedon-chipmemory formodelparameters

0

2

4

6

0 5 10

Thro

ughp

ut

Area

Design Space

Current Design PointFPGA 2

FPGA 1

Defineasetofactionstomovearoundthedesignspace

Action 1: Coarse-grained Folding

1) Exceeding the available compute resources

4 Convolutions / cycle

11

2) Not enough off-chip memory bandwidth

Action 1: Coarse-grained Folding

2 Convolutions / cycle

12

Action 2

Fine-grained FoldingCompute Resources

Required Bandwidth

Input Data

ConvLayer 1

NonlinLayer 1

PoolLayer 1

ConvLayer 2

NonlinLayer 2

PoolLayer 2

Subgraph 1 Subgraph 2 Subgraph 3

Bitstream 1 Bitstream 2 Bitstream 3

13

Bitstream

Exceeding the available compute resources

Action 3: Partitioning through Reconfiguration

Not enough on-chip memory FPGA Reconfiguration

InputData

Intermediate Results

Intermediate Results

Final ResultsOff-chip Memory

fpgaConvNet – SDF Analytical Power

14

Window Size = KPool Size = P

Design 1

Design 2

• SynchronousDataFlow– Actionsasalgebraicoperations

– Anylocalactionpropagates throughthenetwork

– Staticscheduling

– AnalyticalPerformanceModel

– CastDSEtoformalresource-constrainedoptimisation

Hardware Stages Interconnections

Evaluation - Experimental Setup

15

• fpgaConvNet

– XilinxZynq-7000XC7Z020SoCwith220DSPsat100MHz

– Q8.8fixed-pointprecisiontomatchexistingwork

(alsosupportsfloating-point)

– CurrenttoolflowsupportstheVivadoHLStoolchain

Performance Model Accuracy

16

0 2 4 6 8 10 12 14

CFF

LeNet-5

MPCNN

CNP

Sign Recognition ConvNet

Scene Labelling ConvNet

Performance Model Accuracy

Measured Performance (GOps/s) Predicted Performance (GOps/s)

Error between 1.73% and 11.76%

fpgaConvNet vs. Existing FPGA Work

17

00.000050.0001

0.000150.0002

0.000250.0003

0.000350.0004

0.000450.0005

Hand-tuned [1] Memory-optimised [2]

Performance Density Comparison (GOps/s/Slice)

Existing Work (GOps/s/Slice) fpgaConvNet (GOps/s/Slice)

1.62×

[1] C. Farabet et al., “CNP: An FPGA-Based Processor for Convolutional Networks”, in FPL, IEEE, 2009.

[2] M. Peemen et al., “Memory-centric accelerator design for Convolutional Neural Networks”, in ICCD, IEEE, 2013.

18

0

1

2

3

4

5

6

7

8

Hand-tuned Embedded GPU [3]

Performance Efficiency Comparison (GOps/s/Watt)

Existing Work (GOps/s/Watt)fpgaConvNet (GOps/s/Watt)

[3] L. Cavigelli et al., “Accelerating real-time embedded scene labeling with convolutional networks”, in DAC, ACM/EDAC/IEEE, 2015.

Hand-tuned Embedded GPU

• Tegra K1at800MHz

• MemoryBandwidth: 12GB/s

fpgaConvNet

• Zynq-7000XC7Z020at100MHz

• MemoryBandwidth:4.26GB/s

fpgaConvNet vs. Existing Embedded GPU Work

Conclusions

19

• Caffe• TensorFlo• Theano• Torch

Deep Learning Developers fpgaConvNet

Recommended