A Deep Convolutional Neural Network Based on Nested Residue Number System

Hiroki Nakahara (1), Tsutomu Sasao (2)
(1) Ehime University, Japan  (2) Meiji University, Japan
Outline
• Background
• Deep convolutional neural network (DCNN)
• Residue number system (RNS)
• DCNN using nested RNS (NRNS)
• Experimental results
• Conclusion
Background
• Deep Neural Network
– Multi-layer neuron model
– Used for embedded vision systems
• FPGA realization is suitable for real-time systems
– Faster than the CPU
– Lower power consumption than the GPU
– Fixed-point representation is sufficient
• High performance per area is desired
Deep Convolutional Neural Network (DCNN)
Artificial Neuron
[Figure: an artificial neuron; the inputs x0 = 1, x1, ..., xN are weighted by w0 (bias), w1, ..., wN, summed into the internal state u, and passed through the activation f(u) to produce the output y]
xi: input signal, wi: weight, u: internal state, f(u): activation function (Sigmoid, ReLU, etc.), y: output signal
$y = f(u), \qquad u = \sum_{i=0}^{N} w_i x_i$
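To make the formula concrete, here is a minimal Python sketch (not from the slides) of a single neuron; the ReLU activation and the example values are placeholder choices:

```python
# Minimal sketch of the artificial neuron above (illustration only, assumes ReLU).

def relu(u):
    return max(0.0, u)

def neuron(x, w, f=relu):
    """y = f(u), where u = sum_{i=0}^{N} w_i * x_i and x[0] = 1 so that w[0] acts as the bias."""
    u = sum(wi * xi for wi, xi in zip(w, x))
    return f(u)

# Example with two real inputs plus the constant bias input x0 = 1:
print(neuron([1, 0.5, -2.0], [0.1, 0.8, 0.3]))   # relu(0.1 + 0.4 - 0.6) = 0.0
```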
Deep Convolutional Neural Network (DCNN) for ImageNet
• 2D convolutional layer, pooling layer, and fully connected layer
2D Convolutional Layer
• Consumes more than 90% of the computation time
– Multiply-accumulate (MAC) operations are performed
$z_{ij} = y_{ij} + \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} x_{i+m,\, j+n} \, w_{mn}$
xij: input signal, yij: bias, wmn: weight, K: kernel size, zij: output signal
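As an illustration only (not the FPGA implementation), the same MAC loop in plain Python; the kernel size K, the feature-map sizes, and the single scalar bias are placeholder choices for the example:

```python
# Sketch of the 2D convolution above (illustration only):
#   z[i][j] = bias + sum over m, n of x[i+m][j+n] * w[m][n],  0 <= m, n < K

def conv2d(x, w, bias, K):
    H, W = len(x), len(x[0])
    z = [[bias] * (W - K + 1) for _ in range(H - K + 1)]
    for i in range(H - K + 1):
        for j in range(W - K + 1):
            for m in range(K):            # these two inner loops are the MAC operations
                for n in range(K):
                    z[i][j] += x[i + m][j + n] * w[m][n]
    return z

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 0], [0, 1]]                       # K = 2
print(conv2d(x, w, bias=0, K=2))           # [[6, 8], [12, 14]]
```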
FPGA Realization of 2D Convolutional Layer
• Requires more than a billion MACs!
• Our realization
– Time multiplexing
– Nested Residue Number System (NRNS)
[Figure: two FPGA realizations of the 2D convolutional layer, both streaming data from off-chip memory into on-chip BRAMs. Left: a binary multiplier/adder tree. Right: the proposed version, where converters translate the data to and from the residue representation and the MACs are split into small parallel units.]
Residue Number System (RNS)
Residue Number System (RNS)
Defined by a set of L mutually prime integer constants <m1, m2, ..., mL>
– No two moduli share a common factor
– Typically, prime numbers are used as the moduli
An arbitrary integer X can be uniquely represented by a tuple of L integers (X1,X2,…,XL), where
$X_i \equiv X \pmod{m_i}$

Dynamic range: $M = \prod_{i=1}^{L} m_i$
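A minimal Python sketch of this representation (illustration only); from_rns simply searches the dynamic range instead of applying the Chinese Remainder Theorem directly:

```python
from math import prod

def to_rns(x, moduli):
    """X -> (X mod m1, ..., X mod mL)."""
    return tuple(x % m for m in moduli)

def from_rns(residues, moduli):
    """Recover X by searching the dynamic range M = m1 * ... * mL (kept deliberately simple)."""
    M = prod(moduli)
    return next(x for x in range(M) if to_rns(x, moduli) == tuple(residues))

moduli = (3, 4, 5)                  # pairwise coprime, so M = 60
print(to_rns(8, moduli))            # (2, 0, 3)
print(from_rns((2, 0, 3), moduli))  # 8
```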
Multiplication on RNS
• Parallel, digit-by-digit multiplication
• Moduli set <3, 4, 5>, X = 8, Y = 2
  X = (2, 0, 3), Y = (2, 2, 2)
  Z = (4 mod 3, 0 mod 4, 6 mod 5)
    = (1, 0, 1) = 16, i.e. Z = X x Y = 16
[Figure: the binary operands pass through Binary2RNS conversion, are multiplied channel-wise in parallel, and the result returns through RNS2Binary conversion]
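The example can be checked with a few lines of Python (illustration only); each residue channel is multiplied independently, with no carries between channels:

```python
moduli = (3, 4, 5)

def rns_mul(a, b, moduli):
    """Multiply channel by channel: (a_i * b_i) mod m_i."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))

X = (8 % 3, 8 % 4, 8 % 5)      # (2, 0, 3)
Y = (2 % 3, 2 % 4, 2 % 5)      # (2, 2, 2)
print(rns_mul(X, Y, moduli))   # (1, 0, 1), which represents 16 = 8 * 2
```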
Binary2RNS Converter
X | X mod 2 | X mod 3 | X mod 4
0 |    0    |    0    |    0
1 |    1    |    1    |    1
2 |    0    |    2    |    2
3 |    1    |    0    |    3
4 |    0    |    1    |    0
5 |    1    |    2    |    1
Functional Decomposition
[Decomposition chart: the input bits of a 4-input function are split into bound variables X1 = (x1, x2) and free variables X2 = (x3, x4). The chart has column multiplicity 2, so the intermediate function h(X1) needs only 1 bit. Memory: a single table costs 2^4 x 1 = 16 [bit]; the decomposed pair costs 2^2 x 1 + 2^3 x 1 = 12 [bit].]
Decomposition Chart for X mod 3
[Chart: the 5-bit input X = (x1, x2, x3, x4, x5) is split into free variables X1 = (x1, x2) (rows) and bound variables X2 = (x3, x4, x5) (columns); each entry is X mod 3.]
Decomposition Chart for X mod 3
[Chart, after merging identical columns: only three distinct column patterns remain, so the intermediate function h(X2) takes the values 0, 1, 2 and needs only 2 bits.]
x3 x4 x5 | 000 001 010 011 100 101 110 111
h(X2)    |  0   1   2   0   1   2   0   1
Binary2RNS Converter
[Figure: the Binary2RNS converter is a set of LUT cascades, one computing X mod m1 and another X mod m2; each cascade cell is realized by a BRAM]
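As a software illustration of the idea (not the hardware itself), the following sketch emulates an LUT cascade for X mod m: each "cell" sees only the running intermediate value and a small chunk of input bits, which is what lets each stage fit into a BRAM. The 2-bit chunk size is an arbitrary choice for the example:

```python
def mod_by_cascade(bits, m, chunk=2):
    """bits: MSB-first binary digits of X; each cascade cell folds `chunk` bits into the residue."""
    h = 0                                          # intermediate value passed between cells
    for i in range(0, len(bits), chunk):
        part = bits[i:i + chunk]
        value = int("".join(map(str, part)), 2)
        h = (h * (1 << len(part)) + value) % m     # one cell = one small lookup table
    return h

print(mod_by_cascade([1, 0, 1, 1, 0], 3))          # 1, and indeed 0b10110 = 22 and 22 % 3 == 1
```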
RNS2Binary Converter (m=30)
x1 (X mod 2) -> y1: 0 -> 0, 1 -> 15
x2 (X mod 3) -> y2: 0 -> 0, 1 -> 10, 2 -> 20
x3 (X mod 5) -> y3: 0 -> 0, 1 -> 6, 2 -> 12, 3 -> 18, 4 -> 24
[Figure: the three table outputs are combined by a tree of mod-m adders (with carries) to recover X]
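A sketch of this table-based conversion (illustration only): the tables map each residue to a multiple of 15, 10 and 6 respectively, and mod-30 additions combine them; the mapping is written out here exactly as in the tables above:

```python
Y1 = {0: 0, 1: 15}                             # from x1 = X mod 2
Y2 = {0: 0, 1: 10, 2: 20}                      # from x2 = X mod 3
Y3 = {0: 0, 1: 6, 2: 12, 3: 18, 4: 24}         # from x3 = X mod 5

def rns2bin_30(x1, x2, x3):
    """Two mod-30 adders in hardware; a single expression here."""
    return (Y1[x1] + Y2[x2] + Y3[x3]) % 30

print(rns2bin_30(1, 2, 3))                     # 23  (23 mod 2 = 1, 23 mod 3 = 2, 23 mod 5 = 3)
```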
Problem
• The moduli set of RNS consists of mutually prime numbers
– Sizes of the circuits are all different
• Example: <7,11,13>
[Figure: for <7,11,13>, the modulo-7 unit fits a 6-input LUT (3-bit residues) while the modulo-11 and modulo-13 units need 8-input LUTs (4-bit residues); the Binary2RNS converter is realized by BRAMs and the RNS2Binary converter by DSP blocks and BRAMs]
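A small sketch (illustration only) of where the imbalance comes from: the number of LUT inputs a residue-by-residue multiplier needs grows with the residue width ceil(log2 m), so the three moduli do not map to identical circuits:

```python
from math import ceil, log2

for m in (7, 11, 13):
    bits = ceil(log2(m))                     # width of one residue
    print(f"mod {m}: {bits}-bit residues -> {2 * bits}-input LUT for a residue multiplier")
# mod 7 : 3-bit residues -> 6-input LUT
# mod 11: 4-bit residues -> 8-input LUT
# mod 13: 4-bit residues -> 8-input LUT
```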
DCNN using Nested RNS
Nested RNS
• (Z1, Z2, ..., Zi, ..., ZL) -> (Z1, Z2, ..., (Zi1, Zi2, ..., Zij), ..., ZL)
• Ex: <7,11,13> x <7,11,13>
  -> <7, <5,6,7>11, <5,6,7>13> x <7, <5,6,7>11, <5,6,7>13>
1. Reuse the same moduli set
2. Decompose a large modulus into smaller ones
(The subscript on <5,6,7> indicates the original modulus it replaces)
Example of Nested RNS
• 19 x 22 (= 418) on <7, <5,6,7>11, <5,6,7>13>

19 x 22
= <5,8,6> x <1,0,9>                                           (Binary2RNS on <7,11,13>)
= <5, <3,2,1>11, <1,0,6>13> x <1, <0,0,0>11, <4,3,2>13>       (Binary2NRNS conversion on <5,6,7>)
= <5, <0,0,0>11, <4,0,5>13>                                   (modulo multiplication)
= <5, 0, 2>                                                   (RNS2Bin of the nested parts)
= 418                                                         (RNS2Bin of the outer part)
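The example can be replayed in Python (illustration only; the brute-force from_rns stands in for the hardware converters). The nesting is safe because the residues modulo 11 and 13 are at most 10 and 12, so their products stay below 5 x 6 x 7 = 210, the dynamic range of the inner moduli set:

```python
from math import prod

def to_rns(x, moduli):
    return tuple(x % m for m in moduli)

def from_rns(residues, moduli):
    M = prod(moduli)
    return next(x for x in range(M) if to_rns(x, moduli) == tuple(residues))

outer, inner = (7, 11, 13), (5, 6, 7)

def to_nrns(x):
    r7, r11, r13 = to_rns(x, outer)
    # keep the small modulo-7 residue as-is, re-encode the larger residues on <5,6,7>
    return (r7, to_rns(r11, inner), to_rns(r13, inner))

def nrns_mul(a, b):
    return ((a[0] * b[0]) % 7,
            tuple((x * y) % m for x, y, m in zip(a[1], b[1], inner)),
            tuple((x * y) % m for x, y, m in zip(a[2], b[2], inner)))

z = nrns_mul(to_nrns(19), to_nrns(22))
print(z)                                              # (5, (0, 0, 0), (4, 0, 5))

# NRNS2Binary: decode the inner tuples, reduce by the original moduli, then decode the outer tuple.
r = (z[0], from_rns(z[1], inner) % 11, from_rns(z[2], inner) % 13)
print(r, from_rns(r, outer))                          # (5, 0, 2) 418
```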
Realization of Nested RNS
[Block diagram: the Binary2NRNS converter (a Bin2<7,11,13> stage followed by Bin2<5,6,7> stages) is realized by BRAMs; the modulo multipliers operate on the small nested residues and fit in 6-input LUTs; the NRNS2Binary converter (<5,6,7>2Bin stages followed by a <7,11,13>2Bin stage) is realized by BRAMs and DSP blocks]
Moduli Set for NRNS
• Conventional RNS (uses 23 moduli)
<3,4,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83>
• Applied the NRNS to moduli that are greater than 15
<3,4,5,7,11,13,
<3,4,5,7,11,13>17,
<3,4,5,7,11,13>19,
<3,4,5,7,11,13,<3,4,5,7,11,13>17>23,
<3,4,5,7,11,13,<3,4,5,7,11,13>17>29,
…, <3,4,5,7,11,13,<3,4,5,7,11,13>17>83>
All the 48-bit MAC operations are decomposed into 4-bit ones
DCNN Architecture using the NRNS
[Block diagram: a sequencer streams data from the external DDR3 SO-DIMM (via the DDR3 controller) into on-chip BRAM buffers; parallel Bin2NRNS converters feed 16 parallel modulo-mi 2D convolutional units, and tree-based NRNS2Bin converters assemble the binary result]
Experimental Results
Implementation Setup
• FPGA board: Xilinx VC707
– FPGA: Virtex-7 VX485T
– 1 GB DDR3 SO-DIMM (bus @ 800 MHz, 64-bit width)
• Realized the pre-trained ImageNet DCNN by Convnet2
– 48-bit fixed-point precision
• Synthesis tool: Xilinx Vivado 2014.1
– Timing constraint: 400 MHz
Comparison with Other Implementations
Work     | Precision    | Max. Freq. [MHz] | FPGA               | Performance [GOPS] | Perf. per area [GOPS/Slice x 10^-4]
ASAP2009 | 16-bit fixed | 115              | Virtex5 LX330T     | 6.7                | 1.3
PACT2010 | --- fixed    | 125              | Virtex5 SX240T     | 7.0                | 1.9
FPL2009  | 48-bit fixed | 125              | Spartan3A DSP3400  | 5.3                | 2.2
ISCA2010 | 48-bit fixed | 200              | Virtex5 SX240T     | 16.0               | 4.3
ICCD2013 | --- fixed    | 150              | Virtex6 LX240T     | 17.0               | 4.5
FPGA2015 | 32-bit float | 100              | Virtex7 VX485T     | 61.6               | 8.1
Proposed | 48-bit fixed | 400              | Virtex7 VX485T     | 132.2              | 25.2
Conclusion
• Realized the DCNN on the FPGA
– Time multiplexing
– Nested RNS
• MAC operations are realized by small LUTs
• Functional decomposition is used as follows:
– Bin2NRNS converter is realized by BRAMs
– NRNS2Bin converter is realized by DSP blocks and BRAMs
• Performance per area (GOPS/Slice)
– 5.86 times higher than that of ISCA2010