OrthrusPE: Runtime Reconfigurable Processing Elements
for Binary Neural Networks
Nael Fasfous
Technical University of Munich
Department of Electrical and Computer Engineering
Chair of Integrated Systems
Outline

• Optimization of Convolutional Neural Networks
• Challenges of Binary Neural Networks
• Motivation for Runtime Reconfigurable BNN Processing Elements
• OrthrusPE: Dual Modes
• Experimentation and Results

Nael Fasfous | Chair of Integrated Systems | TUM
Optimization of Convolutional Neural Networks

• Structural
  • Pruning: channel-wise, filter-wise, element-wise
  • Low-rank decomposition: Tucker decomp., CP decomp.
• Algorithmic
  • Quantization: floating point, fixed point, binarization, power-of-2 log base, weight sharing
  • Winograd: Winograd sparsity
  • FFT: spectral sparsity
  • GEMM: filter sharing
• Hardware
  • Vectorization
  • Dataflow: output stationary (OS), input stationary (IS), row stationary (RS), weight stationary (WS)
  • Partitioning
  • Array mapping (in/out, IFmap, filter)
  • Bypass: IO, OW, IW
Overview of Binary Neural Networks

• Binary Weights
• Binary Activations
• Binary Weights AND Activations
  • Replace multiplications by XNOR ops
  • Replace accumulations by popcount ops
  • Reduce memory requirements (1/16 of FP-16)
  • Severe information loss: accuracy holds on MNIST, CIFAR-10 and SVHN, but degrades on ImageNet
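The XNOR/popcount substitution above can be sketched as follows; the bit packing and helper names are illustrative, not from the slides:

```python
# Sketch: multiply-accumulate vs. XNOR-popcount for binarized vectors,
# where bit 1 encodes +1 and bit 0 encodes -1.

def mac_dot(w, a):
    """Reference dot product on +/-1 values."""
    return sum(wi * ai for wi, ai in zip(w, a))

def xnor_popcount_dot(w_bits, a_bits, n):
    """Binary dot product: XNOR, popcount, then rescale to +/-1 arithmetic."""
    xnor = ~(w_bits ^ a_bits) & ((1 << n) - 1)  # bitwise XNOR on n packed bits
    pop = bin(xnor).count("1")                  # popcount of matching positions
    return 2 * pop - n                          # map {0..n} back to [-n, n]

# +/-1 vectors and their {0,1} bit packings (LSB-first, a hypothetical layout)
w = [1, -1, 1, 1]
a = [1, 1, -1, 1]
assert mac_dot(w, a) == xnor_popcount_dot(0b1101, 0b1011, 4)  # both give 0
```

This is why the multipliers and adder trees of a conventional MAC unit can be replaced by far cheaper bitwise logic.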
Overview of Binary Neural Networks

Naïve binarization:

$$sign(x) = \begin{cases} 0, & x < 0 \\ 1, & x \geq 0 \end{cases}$$

Applied element-wise:

$$\begin{pmatrix} 1 & -2 & 4 \\ 10 & 0.1 & 3 \\ -5 & 7 & -3 \end{pmatrix} \xrightarrow{sign(x)} \begin{pmatrix} 1 & 0 & 1 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$

Severe information loss: 10 and 0.1 have the same effect on the network.
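A minimal sketch of this naïve binarization and its information loss (helper names are illustrative):

```python
# Naive binarization from the slide: sign(x) = 0 if x < 0 else 1.
def sign(x):
    return 0 if x < 0 else 1

W = [[1, -2, 4],
     [10, 0.1, 3],
     [-5, 7, -3]]

B = [[sign(x) for x in row] for row in W]
# B == [[1, 0, 1], [1, 1, 1], [0, 1, 0]]
# Information loss: 10 and 0.1 both map to 1.
assert sign(10) == sign(0.1)
```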
Overview of Binary Neural Networks

Approximation through binary bases [1]:

$$sign(x) = \begin{cases} 0, & x < 0 \\ 1, & x \geq 0 \end{cases}$$

$$W = \begin{pmatrix} 1 & -2 & 4 \\ 10 & 0.1 & 3 \\ -5 & 7 & -3 \end{pmatrix}$$

$$sign(W - 5) = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \quad sign(W - 4) = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \quad sign(W - 3) = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$

Captured information:
• 0.1 is less positive than 10
• 4 and 3 lie between 10 and 0.1
• 4 and 3 are a single unit away from each other
• Information can also be extracted for negative numbers, e.g. $sign(x + 3)$

The full-precision convolution is then approximated by a scaled sum of binary convolutions:

$$Conv \approx \alpha_1 \cdot binConv_1 + \alpha_2 \cdot binConv_2 + \alpha_3 \cdot binConv_3$$

α: learned during training to induce further diversity in information.

This yields BNNs with only 4-5% accuracy degradation from a full-precision CNN on ImageNet [1].

[1] X. Lin et al., "Towards accurate binary convolutional neural network," NIPS, 2017.
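The binary-bases construction can be sketched as follows; the α values here are arbitrary placeholders, since in [1] they are learned during training:

```python
# Sketch of the binary-bases approximation: each base is a shifted
# sign() of the same matrix; alphas below are illustrative only.
def sign(x):
    return 0 if x < 0 else 1

def base(W, shift):
    return [[sign(x - shift) for x in row] for row in W]

W = [[1, -2, 4],
     [10, 0.1, 3],
     [-5, 7, -3]]

bases = [base(W, s) for s in (5, 4, 3)]
# sign(W - 5) singles out the largest entries (10 and 7):
assert bases[0] == [[0, 0, 0], [1, 0, 0], [0, 1, 0]]

# The approximation is a fixed-point-weighted sum of binary bases:
alphas = [3.0, 1.0, 1.0]  # hypothetical values, not learned here
approx = [[sum(a * b[i][j] for a, b in zip(alphas, bases))
           for j in range(3)] for i in range(3)]
```

Each additional base captures one more ordering relation between the original values, which is exactly the "captured information" listed above.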
How Binary are Binary Neural Networks?

Full-precision convolution:

$$A^l = Conv(W^l, A^{l-1}), \quad W^l \in \mathbb{R}^{K \times K \times C_i \times C_o}, \quad A^{l-1} \in \mathbb{R}^{X_i \times Y_i \times C_i}$$

Binarized with $M$ weight bases and $N$ activation bases, $\mathbb{B} = \{0, 1\}$:

$$B^l \in \mathbb{B}^{K \times K \times C_i \times M \times C_o}, \quad H^{l-1} \in \mathbb{B}^{X_i \times Y_i \times C_i \times N}$$

$$A^l = \sum_{m=1}^{M} \sum_{n=1}^{N} \alpha_m \beta_n \, BinConv(B_m^l, H_n^{l-1}), \quad B_m^l \in \mathbb{B}^{K \times K \times C_i \times C_o}, \quad H_n^{l-1} \in \mathbb{B}^{X_i \times Y_i \times C_i}$$

Inside $BinConv$, a kernel window $b^l \subset B_m^l$ slides over the input $h^{l-1} \subset H_n^{l-1}$; per input channel $c_i$:

$$p_{m,n,c_i} = \sum_{k_x=1}^{K} \sum_{k_y=1}^{K} xnor\left(b_{k_x,k_y}, h_{x_i+k_x,\, y_i+k_y}\right) = popcnt\left(xnor\left(b_{k_x,k_y}, h_{x_i+k_x,\, y_i+k_y}\right)\right)$$

$$a_{m,n} = \sum_{c_i=1}^{C_i} p_{m,n,c_i}$$

The $xnor$ and $popcnt$ operations are binary; the $\alpha_m \beta_n$ scaling and the accumulation across bases are fixed-point.
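A sketch of the per-channel xnor/popcnt inner product above, using packed bits for one 3×3 window (all values illustrative):

```python
# Sketch of p_{m,n,c_i} for one K x K window and one input channel,
# with kernel and window bits packed row-major into integers.
K = 3

def popcnt(x):
    return bin(x).count("1")

def binconv_window(b_bits, h_bits):
    """p = popcnt(xnor(b, h)) over K*K packed kernel/window bits."""
    mask = (1 << (K * K)) - 1
    return popcnt(~(b_bits ^ h_bits) & mask)

# a_{m,n}: fixed-point accumulation of the binary per-channel popcounts
def accumulate(per_channel_p):
    return sum(per_channel_p)

b = 0b101101011  # 3x3 kernel bits, row-major (hypothetical values)
h = 0b101001011  # 3x3 input window bits
p = binconv_window(b, h)  # number of agreeing positions
```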
More FP in Binary Neural Networks

• Binary weight and activation bases (scale/shift)
• First layer remains non-binarized
• Batch normalization

Motivation for reconfigurable processing elements: BNN workloads mix fixed-point ops and binary ops.
OrthrusPE: Runtime Reconfigurable PEs for BNNs
Dual Modes
• Fixed-precision mode: First layer, Batch-norm, Scale and Shift Operations
• Binary mode: SIMD Binary Hadamard Products and Popcounts
• Achieved with high resource reuse
OrthrusPE: Binary Mode

Efficient SIMD binary Hadamard product execution.

[Figure: slices of the binary input h^(l-1) ⊂ H_n^(l-1) and kernel b^l ⊂ B_m^l are packed onto the DSP block's input signals (Sig A, Sig B, Sig C); the DSP computes their element-wise XNOR in a single operation and emits the packed Hadamard products on Sig P.]
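The SIMD idea can be simulated in software as a single wide bitwise XNOR; the 48-bit width mirrors a typical FPGA DSP-block operand, and the packed values are illustrative:

```python
# Sketch: one wide bitwise XNOR processes many Hadamard-product
# elements at once (48 bits here; widths and values are illustrative).
WIDTH = 48

def simd_xnor(a, b):
    return ~(a ^ b) & ((1 << WIDTH) - 1)

def pack(bits):
    """Pack a list of 0/1 elements into one integer, LSB-first."""
    word = 0
    for i, bit in enumerate(bits):
        word |= bit << i
    return word

h = pack([1, 1, 1, 0, 0, 0, 1, 0, 1])  # flattened binary input slice
b = pack([1, 0, 1] * 3)                # flattened binary kernel slice
prod = simd_xnor(h, b)                 # up to 48 XNORs in one operation
matches = bin(prod & 0b111111111).count("1")  # popcount over the 9 valid bits
```

Packing many single-bit products into one wide operation is what makes a DSP block competitive with LUT-based XNOR arrays for this workload.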
OrthrusPE: Dual Modes

Runtime reconfigurability: reconfiguration signals switch the PE between binary mode and fixed-precision mode, using the same hardware resource for two distinct, critical BNN operations.
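As a toy software model of such a dual-mode element (names and structure are illustrative, not the actual OrthrusPE microarchitecture):

```python
# Toy model of the dual-mode idea: one "PE" serves both operation
# types depending on a mode signal set at runtime.
class DualModePE:
    def __init__(self):
        self.mode = "binary"  # reconfiguration signal

    def compute(self, a, b):
        if self.mode == "binary":
            # SIMD XNOR on packed operands (popcount done downstream)
            return ~(a ^ b) & 0xFFFF
        # fixed-precision mode: multiply-style operation
        return a * b

pe = DualModePE()
x = pe.compute(0b1010, 0b1001)  # binary mode
pe.mode = "fixed"               # runtime reconfiguration
y = pe.compute(3, 7)            # fixed-precision mode
```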
Experimentation and Evaluation
Synthesized Four Throughput-Equivalent Configurations:
• OrthrusPE
• OrthrusPE-DS (Dual-Static): SIMD Binary Hadamard Products on Static DSP
• Hybrid (Common): Binary operations on LUTs, FP operations on DSP
• All-LUT: Execution restricted to LUTs
Experimentation and Evaluation

Resource Utilization
• OrthrusPE and OrthrusPE-DS are more resource-efficient across all target accelerator frequencies.
Experimentation and Evaluation

Resource Utilization
• Compared with its closest FINN configuration @ 200 MHz, OrthrusPE performs 16 extra bit accumulations and 3 MACs (through reconfigurability) while using 32% fewer LUTs.
Experimentation and Evaluation

Dynamic Power Estimation
• OrthrusPE is more efficient across all frequencies.
• The savings scale, since accelerators deploy hundreds to thousands of PEs.
Conclusion and Future Work

• Accurate BNNs cannot be achieved without fixed-point operations, which in turn rely on DSP blocks.
• OrthrusPE improves computational efficiency by executing both fixed-point and binary ops on FPGA hard blocks.
• Accurate BNNs solve many of the computation and memory challenges of deep neural network workloads on edge devices.
Thank you for your attention