OrthrusPE: Runtime Reconfigurable Processing Elements
for Binary Neural Networks
Nael Fasfous
Technical University of Munich
Department of Electrical and Computer Engineering
Chair of Integrated Systems
Outline

• Optimization of Convolutional Neural Networks
• Challenges of Binary Neural Networks
• Motivation for Runtime Reconfigurable BNN Processing Elements
• OrthrusPE: Dual Modes
• Experimentation and Results

Nael Fasfous | Chair of Integrated Systems | TUM
Optimization of Convolutional Neural Networks

• Structural
  • Pruning: channel-wise, filter-wise, element-wise
  • Low-rank decomposition: Tucker decomp., CP decomp.
• Algorithmic
  • Quantization: floating point, fixed point, binarization, power-of-2 log base, weight sharing
  • Winograd: Winograd sparsity
  • FFT: spectral sparsity
  • GEMM: filter sharing
• Hardware
  • Vectorization
  • Dataflow: output stationary (OS), input stationary (IS), row stationary (RS), weight stationary (WS)
  • Partitioning
  • Array mapping (in/out, IFmap, filter)
  • Bypass: IO, OW, IW
Overview of Binary Neural Networks

• Binary Weights
• Binary Activations
• Binary Weights AND Activations
  • Replace multiplications by XNOR ops
  • Replace accumulations by popcount ops
  • Reduce memory requirements (1/16 of FP-16)
  • Severe information loss: accuracy holds on MNIST, CIFAR-10 and SVHN, but degrades on ImageNet
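The XNOR/popcount substitution above can be sketched as follows; the bit packing and helper names are illustrative, not from the slides:

```python
# Sketch: multiply-accumulate vs. XNOR-popcount for binarized vectors,
# where bit 1 encodes +1 and bit 0 encodes -1.

def mac_dot(w, a):
    """Reference dot product on +/-1 values."""
    return sum(wi * ai for wi, ai in zip(w, a))

def xnor_popcount_dot(w_bits, a_bits, n):
    """Binary dot product: XNOR, popcount, then rescale to +/-1 arithmetic."""
    xnor = ~(w_bits ^ a_bits) & ((1 << n) - 1)  # bitwise XNOR on n packed bits
    pop = bin(xnor).count("1")                  # popcount of matching positions
    return 2 * pop - n                          # map {0..n} back to [-n, n]

# +/-1 vectors and their {0,1} bit packings (LSB-first, a hypothetical layout)
w = [1, -1, 1, 1]
a = [1, 1, -1, 1]
assert mac_dot(w, a) == xnor_popcount_dot(0b1101, 0b1011, 4)  # both give 0
```

This is why the multipliers and adder trees of a conventional MAC unit can be replaced by far cheaper bitwise logic.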
Overview of Binary Neural Networks

Naïve binarization:

$$sign(x) = \begin{cases} 0, & x < 0 \\ 1, & x \geq 0 \end{cases}$$

Applied element-wise:

$$\begin{pmatrix} 1 & -2 & 4 \\ 10 & 0.1 & 3 \\ -5 & 7 & -3 \end{pmatrix} \xrightarrow{sign(x)} \begin{pmatrix} 1 & 0 & 1 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$

Severe information loss: 10 and 0.1 have the same effect on the network.
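A minimal sketch of this naïve binarization and its information loss (helper names are illustrative):

```python
# Naive binarization from the slide: sign(x) = 0 if x < 0 else 1.
def sign(x):
    return 0 if x < 0 else 1

W = [[1, -2, 4],
     [10, 0.1, 3],
     [-5, 7, -3]]

B = [[sign(x) for x in row] for row in W]
# B == [[1, 0, 1], [1, 1, 1], [0, 1, 0]]
# Information loss: 10 and 0.1 both map to 1.
assert sign(10) == sign(0.1)
```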
Overview of Binary Neural Networks

Approximation through binary bases [1]:

$$sign(x) = \begin{cases} 0, & x < 0 \\ 1, & x \geq 0 \end{cases}$$

$$W = \begin{pmatrix} 1 & -2 & 4 \\ 10 & 0.1 & 3 \\ -5 & 7 & -3 \end{pmatrix}$$

$$sign(W - 5) = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \quad sign(W - 4) = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \quad sign(W - 3) = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$

Captured information:
• 0.1 is less positive than 10
• 4 and 3 lie between 10 and 0.1
• 4 and 3 are a single unit away from each other
• Information can also be extracted for negative numbers, e.g. $sign(x + 3)$

The full-precision convolution is then approximated by a scaled sum of binary convolutions:

$$Conv \approx \alpha_1 \cdot binConv_1 + \alpha_2 \cdot binConv_2 + \alpha_3 \cdot binConv_3$$

α: learned during training to induce further diversity in information.

This yields BNNs with only 4-5% accuracy degradation from a full-precision CNN on ImageNet [1].

[1] X. Lin et al., "Towards accurate binary convolutional neural network," NIPS, 2017.
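The binary-bases construction can be sketched as follows; the α values here are arbitrary placeholders, since in [1] they are learned during training:

```python
# Sketch of the binary-bases approximation: each base is a shifted
# sign() of the same matrix; alphas below are illustrative only.
def sign(x):
    return 0 if x < 0 else 1

def base(W, shift):
    return [[sign(x - shift) for x in row] for row in W]

W = [[1, -2, 4],
     [10, 0.1, 3],
     [-5, 7, -3]]

bases = [base(W, s) for s in (5, 4, 3)]
# sign(W - 5) singles out the largest entries (10 and 7):
assert bases[0] == [[0, 0, 0], [1, 0, 0], [0, 1, 0]]

# The approximation is a fixed-point-weighted sum of binary bases:
alphas = [3.0, 1.0, 1.0]  # hypothetical values, not learned here
approx = [[sum(a * b[i][j] for a, b in zip(alphas, bases))
           for j in range(3)] for i in range(3)]
```

Each additional base captures one more ordering relation between the original values, which is exactly the "captured information" listed above.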
How Binary are Binary Neural Networks?

Full-precision convolution:

$$A^l = Conv(W^l, A^{l-1}), \quad W^l \in \mathbb{R}^{K \times K \times C_i \times C_o}, \quad A^{l-1} \in \mathbb{R}^{X_i \times Y_i \times C_i}$$

Binarized with $M$ weight bases and $N$ activation bases, $\mathbb{B} = \{0, 1\}$:

$$B^l \in \mathbb{B}^{K \times K \times C_i \times M \times C_o}, \quad H^{l-1} \in \mathbb{B}^{X_i \times Y_i \times C_i \times N}$$

$$A^l = \sum_{m=1}^{M} \sum_{n=1}^{N} \alpha_m \beta_n \, BinConv(B_m^l, H_n^{l-1}), \quad B_m^l \in \mathbb{B}^{K \times K \times C_i \times C_o}, \quad H_n^{l-1} \in \mathbb{B}^{X_i \times Y_i \times C_i}$$

Inside $BinConv$, a kernel window $b^l \subset B_m^l$ slides over the input $h^{l-1} \subset H_n^{l-1}$; per input channel $c_i$:

$$p_{m,n,c_i} = \sum_{k_x=1}^{K} \sum_{k_y=1}^{K} xnor\left(b_{k_x,k_y}, h_{x_i+k_x,\, y_i+k_y}\right) = popcnt\left(xnor\left(b_{k_x,k_y}, h_{x_i+k_x,\, y_i+k_y}\right)\right)$$

$$a_{m,n} = \sum_{c_i=1}^{C_i} p_{m,n,c_i}$$

The $xnor$ and $popcnt$ operations are binary; the $\alpha_m \beta_n$ scaling and the accumulation across bases are fixed-point.
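A sketch of the per-channel xnor/popcnt inner product above, using packed bits for one 3×3 window (all values illustrative):

```python
# Sketch of p_{m,n,c_i} for one K x K window and one input channel,
# with kernel and window bits packed row-major into integers.
K = 3

def popcnt(x):
    return bin(x).count("1")

def binconv_window(b_bits, h_bits):
    """p = popcnt(xnor(b, h)) over K*K packed kernel/window bits."""
    mask = (1 << (K * K)) - 1
    return popcnt(~(b_bits ^ h_bits) & mask)

# a_{m,n}: fixed-point accumulation of the binary per-channel popcounts
def accumulate(per_channel_p):
    return sum(per_channel_p)

b = 0b101101011  # 3x3 kernel bits, row-major (hypothetical values)
h = 0b101001011  # 3x3 input window bits
p = binconv_window(b, h)  # number of agreeing positions
```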
More FP in Binary Neural Networks

• Binary weight and activation bases (scale/shift)
• First layer remains non-binarized
• Batch normalization

Motivation for reconfigurable processing elements: BNN workloads mix fixed-point ops and binary ops.
OrthrusPE: Runtime Reconfigurable PEs for BNNs
Dual Modes
• Fixed-precision mode: First layer, Batch-norm, Scale and Shift Operations
• Binary mode: SIMD Binary Hadamard Products and Popcounts
• Achieved with high resource reuse
OrthrusPE: Binary Mode

Efficient SIMD binary Hadamard product execution.

[Figure: slices of the binary input h^(l-1) ⊂ H_n^(l-1) and kernel b^l ⊂ B_m^l are packed onto the DSP block's input signals (Sig A, Sig B, Sig C); the DSP computes their element-wise XNOR in a single operation and emits the packed Hadamard products on Sig P.]
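The SIMD idea can be simulated in software as a single wide bitwise XNOR; the 48-bit width mirrors a typical FPGA DSP-block operand, and the packed values are illustrative:

```python
# Sketch: one wide bitwise XNOR processes many Hadamard-product
# elements at once (48 bits here; widths and values are illustrative).
WIDTH = 48

def simd_xnor(a, b):
    return ~(a ^ b) & ((1 << WIDTH) - 1)

def pack(bits):
    """Pack a list of 0/1 elements into one integer, LSB-first."""
    word = 0
    for i, bit in enumerate(bits):
        word |= bit << i
    return word

h = pack([1, 1, 1, 0, 0, 0, 1, 0, 1])  # flattened binary input slice
b = pack([1, 0, 1] * 3)                # flattened binary kernel slice
prod = simd_xnor(h, b)                 # up to 48 XNORs in one operation
matches = bin(prod & 0b111111111).count("1")  # popcount over the 9 valid bits
```

Packing many single-bit products into one wide operation is what makes a DSP block competitive with LUT-based XNOR arrays for this workload.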
OrthrusPE: Dual Modes

Runtime reconfigurability: reconfiguration signals switch the PE between binary mode and fixed-precision mode, using the same hardware resource for two distinct, critical BNN operations.
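As a toy software model of such a dual-mode element (names and structure are illustrative, not the actual OrthrusPE microarchitecture):

```python
# Toy model of the dual-mode idea: one "PE" serves both operation
# types depending on a mode signal set at runtime.
class DualModePE:
    def __init__(self):
        self.mode = "binary"  # reconfiguration signal

    def compute(self, a, b):
        if self.mode == "binary":
            # SIMD XNOR on packed operands (popcount done downstream)
            return ~(a ^ b) & 0xFFFF
        # fixed-precision mode: multiply-style operation
        return a * b

pe = DualModePE()
x = pe.compute(0b1010, 0b1001)  # binary mode
pe.mode = "fixed"               # runtime reconfiguration
y = pe.compute(3, 7)            # fixed-precision mode
```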
Experimentation and Evaluation
Synthesized Four Throughput-Equivalent Configurations:
• OrthrusPE
• OrthrusPE-DS (Dual-Static): SIMD Binary Hadamard Products on Static DSP
• Hybrid (Common): Binary operations on LUTs, FP operations on DSP
• All-LUT: Execution restricted to LUTs
Experimentation and Evaluation

Resource Utilization
• OrthrusPE and OrthrusPE-DS are more resource-efficient across all target accelerator frequencies.
Experimentation and Evaluation

Resource Utilization
• Compared with its closest FINN configuration @ 200 MHz, OrthrusPE performs 16 extra bit accumulations and 3 MACs (through reconfigurability) while using 32% fewer LUTs.
Experimentation and Evaluation

Dynamic Power Estimation
• OrthrusPE is more efficient across all frequencies.
• The savings scale, since accelerators deploy hundreds to thousands of PEs.
Conclusion and Future Work

• Accurate BNNs cannot be achieved without fixed-point operations, which in turn rely on DSP blocks.
• OrthrusPE improves computational efficiency by executing both fixed-point and binary ops on FPGA hard blocks.
• Accurate BNNs solve many of the computation and memory challenges of deep neural network workloads on edge devices.
Thank you for your attention