Towards High-Throughput with Low-Effort Programming: From General-Purpose Manycores to Dedicated Circuits
J. Andrade^1, G. Falcao^1, V. Silva^1, M. Owaida^2, C. Antonopoulos^2, N. Bellas^2 and P. Ienne^3

^1 Instituto de Telecomunicações, University of Coimbra, Portugal; ^2 Dept. of Computer and Communications Engineering, University of Thessaly, Greece; ^3 École Polytechnique Fédérale de Lausanne, Switzerland

Abstract

Parallel programming models such as OpenCL have emerged in the recent past to bring GPU application development effort on par with conventional CPU programming. The higher throughputs achieved have benefited a wide range of scientific applications. However, even though GPUs offer a substantially higher computational-power-to-watt ratio than CPUs, a large spectrum of applications still cannot enjoy the benefits of manycore execution due to the power consumption incurred. While RTL design typically requires long development times, OpenCL-to-RTL translation tools have been developed to reduce the programming effort required to target FPGA accelerators. The benefits of FPGA design using OpenCL models are substantial, as kernels can be developed in a unified parallel programming model. We analysed a set of short- and long-block LDPC decoders to quantify the throughput-to-effort ratio achieved by extending OpenCL portability to FPGAs. We also identified a series of challenges to be tackled in the near future in order to bring low-effort designs to the high throughput levels achieved with dedicated RTL.
[Figure: Achievable performance vs. programming effort. CPU/GPU programming requires low effort for limited performance; FPGA design reaches higher performance at a higher programming effort.]
SOpenCL RTL Generation

    __kernel void vecAdd(__global int *a, __global int *b)
    {
        int wkID = get_global_id(0);
        a[wkID] += b[wkID];
    }
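Because the host side of an OpenCL application is target-agnostic, the same host program can drive the kernel above on a CPU, a GPU, or an FPGA with a tool-generated bitstream. The following is a minimal host-code sketch for launching vecAdd; it is illustrative only (error handling omitted; the vector length and default-device selection are assumptions, not taken from the authors' setup).

    /* Hedged sketch: minimal OpenCL host code for the vecAdd kernel above.
       Error handling is elided for brevity. */
    #include <CL/cl.h>
    #include <stdio.h>

    static const char *src =
        "__kernel void vecAdd(__global int *a, __global int *b) {"
        "  int wkID = get_global_id(0);"
        "  a[wkID] += b[wkID];"
        "}";

    int main(void)
    {
        enum { N = 8 };                               /* assumed vector length */
        int a[N] = {0,1,2,3,4,5,6,7}, b[N] = {7,6,5,4,3,2,1,0};

        cl_platform_id plat;  cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        /* Online compilation shown here; FPGA flows compile offline instead. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "vecAdd", NULL);

        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                   sizeof a, a, NULL);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_COPY_HOST_PTR,
                                   sizeof b, b, NULL);
        clSetKernelArg(k, 0, sizeof da, &da);
        clSetKernelArg(k, 1, sizeof db, &db);

        size_t gsz = N;                               /* one work-item per element */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, da, CL_TRUE, 0, sizeof a, a, 0, NULL, NULL);

        for (int i = 0; i < N; i++) printf("%d ", a[i]);  /* prints all 7s */
        return 0;
    }

Only the device binary changes per target; tools such as SOpenCL consume the same kernel source to generate the FPGA datapath.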
[Figure: SOpenCL maps the kernel onto a pipelined FPGA datapath (Load, Load, +, Store stages) with pipeline depth D and initiation interval II = 2; scheduled work-items enter the pipeline every II cycles, and the datapath is built from FPGA resources (RAM, DSPs, CLBs).]
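As a quick sanity check on what the initiation interval implies for throughput (a standard pipelining identity, not a figure from the poster): a datapath with pipeline depth $D$ that accepts a new work-item every $II$ cycles completes $W$ work-items in roughly

$T_{\mathrm{cycles}} \approx D + II \cdot (W - 1)$

so with $II = 2$ the steady-state rate is one work-item every two clock cycles, regardless of $D$.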
[Figure: OpenCL mapping onto a GPU: an array of cores sharing an L2 cache, connected to off-chip memory through memory controllers.]
Linear Block Code: Parity Equations, Tanner Graph and Belief-Propagation
Parity equations:

$c_1 + c_4 + c_7 = 0$
$c_0 + c_3 + c_6 = 0$
$c_0 + c_1 + c_2 = 0$
$c_4 + c_5 + c_6 = 0$
Min-Sum Algorithm

Initialization:

$\gamma_n = \frac{2 y_n}{\sigma^2}, \qquad \alpha^{(0)}_{nm} = \gamma_n$

Check node processing:

$\beta^{(i)}_{mn} = \prod_{n' \in N(m) \setminus n} \operatorname{sign}\big(\alpha^{(i-1)}_{n'm}\big) \cdot \min_{n' \in N(m) \setminus n} \big|\alpha^{(i-1)}_{n'm}\big|$

Bit node processing:

$\alpha^{(i)}_{nm} = \gamma_n + \sum_{m' \in M(n) \setminus m} \beta^{(i)}_{m'n}$
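To make the check node update concrete, here is a hedged OpenCL kernel sketch of the Min-Sum computation above, with one work-item per check node. The edge-major storage layout and all identifiers (offCN, degCN, alpha, beta) are illustrative assumptions, not the decoder used in this work.

    /* Min-Sum check node update: one work-item per check node m.
       Messages are stored per-edge in check-node (row) order. */
    __kernel void checkNodeUpdate(__global const int   *offCN,  /* offset of row m's edges */
                                  __global const int   *degCN,  /* degree of check node m */
                                  __global const float *alpha,  /* alpha(i-1), bit -> check */
                                  __global float       *beta)   /* beta(i), check -> bit */
    {
        int m   = get_global_id(0);
        int off = offCN[m];
        int deg = degCN[m];

        /* First pass: total sign and the two smallest magnitudes over N(m). */
        float min1 = FLT_MAX, min2 = FLT_MAX;
        int   sgn  = 1;
        for (int k = 0; k < deg; k++) {
            float a = alpha[off + k];
            sgn *= (a < 0.0f) ? -1 : 1;
            float mag = fabs(a);
            if (mag < min1)      { min2 = min1; min1 = mag; }
            else if (mag < min2) { min2 = mag; }
        }

        /* Second pass: exclude the target edge n from the sign and the minimum. */
        for (int k = 0; k < deg; k++) {
            float a    = alpha[off + k];
            int   s    = sgn * ((a < 0.0f) ? -1 : 1);      /* sign over N(m)\n   */
            float mmin = (fabs(a) == min1) ? min2 : min1;  /* min  over N(m)\n   */
            beta[off + k] = (float)s * mmin;
        }
    }

The two-pass formulation exploits the fact that the minimum over $N(m)\setminus n$ is either the global minimum or, for the one edge holding it, the second minimum; multiplying by the edge's own sign likewise removes it from the total sign product.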
[Figure: Tanner graph with bit nodes BN0–BN7 and check nodes CN0–CN3; edges follow the parity-check matrix H.]

The parity equations above correspond to the parity-check matrix

$H = \begin{bmatrix} 0&1&0&0&1&0&0&1\\ 1&0&0&1&0&0&1&0\\ 1&1&1&0&0&0&0&0\\ 0&0&0&0&1&1&1&0 \end{bmatrix}$

with rows CN0–CN3 and columns BN0–BN7 ($c_0$–$c_7$).
[Figure: Message passing on the Tanner graph. CN processing: CN1 collects messages $\alpha^{(i-1)}_{31}$, $\alpha^{(i-1)}_{41}$, $\alpha^{(i-1)}_{51}$ from its bit nodes and returns $\beta^{(i)}_{14}$ to BN4. BN processing: BN0 updates $\alpha^{(i)}_{01}$ from the incoming $\beta^{(i)}_{10}$. The schedule ends with a hard decision.]

Hard-decision:

$\alpha^{(i)}_{n} = \gamma_n + \sum_{m \in M(n)} \beta^{(i)}_{mn}$
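A companion sketch for the bit node update and the hard decision, one work-item per bit node, under the same illustrative assumptions (offBN, degBN and the edgeIdx permutation that maps each bit node's edges into the check-node edge order are hypothetical names, not the authors' code).

    /* Min-Sum bit node update and hard decision: one work-item per bit node n. */
    __kernel void bitNodeUpdate(__global const int   *offBN,   /* offset of node n's edge list */
                                __global const int   *degBN,   /* degree of bit node n          */
                                __global const int   *edgeIdx, /* bit-node edge -> check-node edge order */
                                __global const float *gamma,   /* channel LLRs                  */
                                __global const float *beta,    /* beta(i), check -> bit         */
                                __global float       *alpha,   /* alpha(i), bit -> check        */
                                __global uchar       *chat)    /* decoded hard decisions        */
    {
        int n   = get_global_id(0);
        int off = offBN[n];
        int deg = degBN[n];

        /* A posteriori LLR: gamma_n plus all incoming beta messages. */
        float sum = gamma[n];
        for (int k = 0; k < deg; k++)
            sum += beta[edgeIdx[off + k]];

        /* alpha(i)_{nm}: exclude the message that came from check node m. */
        for (int k = 0; k < deg; k++)
            alpha[edgeIdx[off + k]] = sum - beta[edgeIdx[off + k]];

        /* Hard decision, assuming the usual BPSK mapping: negative LLR -> bit 1. */
        chat[n] = (sum < 0.0f) ? 1 : 0;
    }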
[Figure: SOpenCL-generated decoder architecture. Push and Pop functional units (FUs) stream data through input/output alignment and request generation units (Sin_align_0/RGU_0, Sin_align_1/RGU_1, Sout_align) and arbiters to compute engines CE_0 and CE_1; a sequencer (Idle/Start) controls the computing FUs, which are fed through multiplexers carrying valid/data (V/Data) signals. The streamed arrays are α[], β[] and tannerIdx[].]
[Figure: Clock cycles per decoded bit (log scale, $10^0$–$10^3$) for CPU, GPU, FPGA and ASIC [1] implementations of a regular and a DVB-S2 LDPC code; plotted values include 767, 233, 141, 27, 13, 4 and 1.059 cycles per bit. FPGA annotations: {15.2%, 23.3%, 54.3%}, {1, 2, 8}.]
[Figure: Regular 8000x4000 LDPC code: memory time and execution time (0–80 ms, left axis) and throughput (0–15 Mbit/s, right axis) for CPU, FPGA and GPU.]
[Figure: DVB-S2 64800x32400 LDPC code: memory time and execution time (0–200 ms, left axis) and throughput (0–80 Mbit/s, right axis) for CPU, FPGA and GPU.]