
Towards High-Throughput with Low-Effort Programming: From General-Purpose Manycores to Dedicated Circuits

J. Andrade1, G. Falcao1, V. Silva1, M. Owaida2, C. Antonopoulos2, N. Bellas2 and P. Ienne3

1Instituto de Telecomunicações, University of Coimbra, Portugal; 2Dept. of Computer and Communications Engineering, University of Thessaly, Greece; 3École Polytechnique Fédérale de Lausanne, Switzerland

Abstract

Parallel programming models such as OpenCL have emerged in the recent past to bring GPU application development effort on par with conventional CPU programming. The higher throughputs achieved have benefited a wide range of scientific applications. However, even though GPUs deliver a substantially higher performance-per-watt ratio than CPUs, a large spectrum of applications still cannot enjoy the benefits of manycore execution because of the power consumption incurred. While RTL design typically requires long development times, OpenCL-to-RTL translation tools have been developed to reduce the programming effort required to target FPGA accelerators. The benefits of FPGA design using OpenCL models are considerable, as kernels can be developed in a unified parallel programming model. We analysed a set of short- and long-block LDPC decoders to quantify the throughput-to-effort ratio achieved by extending OpenCL portability to FPGAs. We also identified a series of challenges to be tackled in the near future in order to bring low-effort designs to the high-throughput levels achieved with dedicated RTL.


Figure: Achievable performance versus programming effort. Moving from CPU to GPU to FPGA raises the achievable performance, but also the programming effort required to reach it.

SOpenCL RTL generation starts from an unmodified OpenCL kernel, for example:

    __kernel void vecAdd(__global int *a, __global int *b) {
        size_t wkID = get_global_id(0);  /* one work-item per vector element */
        a[wkID] += b[wkID];
    }
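
To make the portability claim concrete, here is a minimal host-side sketch (not from the paper; the platform choice, problem size and omitted error handling are simplifying assumptions). The same kernel source is dispatched unchanged to whichever OpenCL device is present; an FPGA flow would typically load an offline-compiled binary via clCreateProgramWithBinary instead of building from source.

    #include <stdio.h>
    #include <CL/cl.h>

    #define N 1024

    int main(void) {
        /* Pick the first available platform/device: CPU, GPU or FPGA. */
        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        /* Same kernel source as above, built at runtime. */
        const char *src =
            "__kernel void vecAdd(__global int *a, __global int *b) {"
            "    size_t wkID = get_global_id(0);"
            "    a[wkID] += b[wkID];"
            "}";
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "vecAdd", NULL);

        int a[N], b[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                   sizeof(a), a, NULL);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof(b), b, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &da);
        clSetKernelArg(k, 1, sizeof(cl_mem), &db);

        /* One work-item per vector element. */
        size_t global = N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, da, CL_TRUE, 0, sizeof(a), a, 0, NULL, NULL);

        printf("a[10] = %d (expected 30)\n", a[10]);
        return 0;
    }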

Figure: OpenCL execution on GPU versus SOpenCL-generated hardware on FPGA. On the GPU, the kernel runs on the cores behind the L2 cache and memory controllers; on the FPGA, SOpenCL synthesises a custom datapath out of CLBs, DSPs and on-chip RAM. For vecAdd, the generated datapath is Load, Load, +, Store; scheduled work-items 0, 1, 2, 3, 4, ... enter the pipeline every II = 2 cycles (initiation interval) and overlap along the pipeline depth D.
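
As a back-of-the-envelope illustration (not from the paper; all numbers below are hypothetical), the cycle count of such a pipeline follows directly from II and the pipeline depth D: work-item i enters at cycle i·II, so W work-items drain after (W−1)·II + D cycles.

    #include <stdio.h>

    /* Illustrative only: total cycles for W work-items on a pipelined
       datapath with initiation interval II and pipeline depth D. */
    static unsigned long pipeline_cycles(unsigned long W, unsigned II, unsigned D) {
        return (W - 1) * (unsigned long)II + D;  /* last item enters, then drains */
    }

    int main(void) {
        /* Hypothetical numbers: 1M work-items, II = 2, depth D = 10. */
        unsigned long c = pipeline_cycles(1000000UL, 2, 10);
        printf("%lu cycles -> %.3f cycles per work-item\n", c, (double)c / 1e6);
        return 0;
    }

For large W the cost per work-item approaches II, which is why minimising the initiation interval dominates FPGA throughput.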

Linear Block Code: Parity Equations, Tanner Graph and Belief-Propagation

The running example is a short (N, M) = (8, 4) linear block code. Its parity equations,

c_0 + c_1 + c_2 = 0
c_3 + c_4 + c_5 = 0
c_0 + c_3 + c_6 = 0
c_1 + c_4 + c_7 = 0

are the rows of the parity-check matrix

H = \begin{bmatrix}
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 0 & 1
\end{bmatrix}

Each column of H corresponds to a bit node (BN0–BN7), each row to a check node (CN0–CN3), and each non-zero entry to an edge of the Tanner graph along which belief-propagation messages are exchanged.

Min-Sum Algorithm

Initialization. Each bit node n is loaded with the channel log-likelihood ratio and broadcasts it to its check nodes:

\gamma_n = \frac{2 y_n}{\sigma^2}, \qquad \alpha_{nm}^{(0)} = \gamma_n.

Check node (CN) processing. Each check node m sends to each neighbouring bit node n the extrinsic message

\beta_{mn}^{(i)} = \prod_{n' \in N(m) \setminus n} \operatorname{sign}\big(\alpha_{n'm}^{(i-1)}\big) \cdot \min_{n' \in N(m) \setminus n} \big|\alpha_{n'm}^{(i-1)}\big|.

For instance, CN1 gathers \alpha_{31}^{(i-1)}, \alpha_{41}^{(i-1)} and \alpha_{51}^{(i-1)} from BN3, BN4 and BN5, and returns \beta_{14}^{(i)} to BN4.

Bit node (BN) processing. Each bit node n replies to each neighbouring check node m with

\alpha_{nm}^{(i)} = \gamma_n + \sum_{m' \in M(n) \setminus m} \beta_{m'n}^{(i)}.

Hard decision. After the last iteration, the a posteriori value of each bit,

\alpha_{n}^{(i)} = \gamma_n + \sum_{m \in M(n)} \beta_{mn}^{(i)},

is computed and its sign yields the decoded bit.
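
The Min-Sum updates above map directly onto a per-edge message array. The following C sketch (illustrative only, not the paper's decoder; the edge layout, float messages and made-up channel LLRs are assumptions) runs the algorithm on the (8, 4) example:

    #include <math.h>
    #include <stdio.h>

    #define N 8   /* bit nodes   */
    #define M 4   /* check nodes */
    #define E 12  /* edges: one per non-zero entry of H */

    /* Tanner-graph edges of the (8,4) example: CN0={0,1,2}, CN1={3,4,5},
       CN2={0,3,6}, CN3={1,4,7}. */
    static const int edge_cn[E] = {0,0,0, 1,1,1, 2,2,2, 3,3,3};
    static const int edge_bn[E] = {0,1,2, 3,4,5, 0,3,6, 1,4,7};

    int main(void) {
        /* Channel LLRs gamma_n = 2*y_n/sigma^2; values here are made up. */
        float gamma[N] = {1.2f,-0.8f,0.6f,-1.5f,0.3f,2.1f,-0.4f,0.9f};
        float alpha[E], beta[E];

        for (int e = 0; e < E; e++) alpha[e] = gamma[edge_bn[e]];  /* init */

        for (int iter = 0; iter < 10; iter++) {
            /* CN processing: sign product and minimum over N(m)\n. */
            for (int e = 0; e < E; e++) {
                float sgn = 1.0f, mn = INFINITY;
                for (int f = 0; f < E; f++) {
                    if (edge_cn[f] != edge_cn[e] || f == e) continue;
                    sgn *= (alpha[f] < 0.0f) ? -1.0f : 1.0f;
                    if (fabsf(alpha[f]) < mn) mn = fabsf(alpha[f]);
                }
                beta[e] = sgn * mn;
            }
            /* BN processing: gamma_n plus all beta from M(n)\m. */
            for (int e = 0; e < E; e++) {
                float sum = gamma[edge_bn[e]];
                for (int f = 0; f < E; f++)
                    if (edge_bn[f] == edge_bn[e] && f != e) sum += beta[f];
                alpha[e] = sum;
            }
        }

        /* Hard decision: sign of the a posteriori LLR per bit. */
        for (int n = 0; n < N; n++) {
            float apo = gamma[n];
            for (int e = 0; e < E; e++)
                if (edge_bn[e] == n) apo += beta[e];
            printf("c%d = %d\n", n, apo < 0.0f);  /* negative LLR -> bit 1 */
        }
        return 0;
    }

For realistic codes the inner scans over E are replaced by precomputed per-node edge lists; the irregular indexing they imply is exactly what makes LDPC decoding memory-bound on all three platforms.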

Figure: Architecture of the SOpenCL-generated LDPC decoder accelerator. A sequencer issues Start/Idle control to two compute engines (CE_0, CE_1). On the input side, request generation units (RGU_0, RGU_1) place local requests on the address lines through an arbiter, and stream-alignment units (Sin_align_0, Sin_align_1) align the incoming data lines before the push functional units (FUs) feed the computing FUs; per-FU multiplexers with valid/data (V/Data) signalling distribute the operands. The message streams are the α[] and β[] arrays, written back as ∗α[] and ∗β[], with the irregular Tanner-graph connectivity resolved through the tannerIdx[] address stream. On the output side, pop FUs collect results and Sout_align realigns the outgoing data before it leaves through the output arbiter (Data_out).
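
Functionally, the RGU/tannerIdx[] machinery implements an indirection over the message memories. A minimal sketch of the access pattern it serves, with the array names borrowed from the figure and the function itself a hypothetical illustration:

    /* Illustrative gather served by the RGUs and arbiters: messages are
       streamed to the FUs in Tanner-graph order via tannerIdx[]. */
    void gather_messages(const float *alpha, const int *tannerIdx,
                         float *stream, int num_edges) {
        for (int e = 0; e < num_edges; e++)
            stream[e] = alpha[tannerIdx[e]];  /* indirect, non-contiguous read */
    }

In hardware, this indirect read is what the arbiters serialise over the local memory banks, which is why the memory subsystem rather than the arithmetic dominates the generated design.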

Figure: Clock cycles per decoded bit (log scale, 10^0 to 10^3) for CPU, GPU, FPGA and an ASIC [1], for both the regular and the DVB−S2 codes. The recoverable data labels are 767, 233, 141, 27, 13 and 4 cycles per bit across the platform/code combinations, with 1.059 cycles per bit for the ASIC; the annotations {15.2%, 23.3%, 54.3%} and {1, 2, 8} appear to give FPGA resource occupation for 1, 2 and 8 decoder instances.

Figure: Decoding cost and throughput per platform. Left panel, regular 8000×4000 LDPC code: memory time and execution time (left axis, 0 to 80 ms) and throughput (right axis, 0 to 15 Mbit/s) for CPU, FPGA and GPU. Right panel, DVB−S2 64800×32400 LDPC code: memory time and execution time (0 to 200 ms) and throughput (0 to 80 Mbit/s) for the same platforms.
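
Reading the dual axes together: if throughput is taken as decoded bits per total (memory + execution) time, then

Throughput = N / (t_mem + t_exec),

so, for instance, a hypothetical 1 ms per 8000-bit codeword of the regular code would correspond to 8 Mbit/s. Whether N counts coded or information bits, and how many codewords are batched per kernel launch, cannot be recovered from the figure alone.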
