FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efﬁciency OpenCL Portable and open programming model for heterogeneous ... High

FPGA-Based Accelerator Design from aDomain-Specific Language

M. Akif Özkan, Oliver Reiche, Frank Hannig, and Jürgen TeichHardware/Software Co-Design, Friedrich-Alexander University Erlangen-NürnbergFPL, August 29, 2016, Lausanne, Switzerland

Motivation

FPGAs High throughput, great energy efficiency

OpenCL Portable and open programming model for heterogeneoussystems, including CPUs, GPUs, DSPs etc.

© Copyright Khronos Group 2015 - Page 3

OpenCL – Portable Heterogeneous Computing•OpenCL = Two APIs and Two Kernel languages- C Platform Layer API to query, select and initialize compute devices- OpenCL C and (soon) OpenCL C++ kernel languages to write parallel code - C Runtime API to build and execute kernels across multiple devices

•One code tree can be executed on CPUs, GPUs, DSPs, FPGA and hardware- Dynamically balance work across available processors

OpenCL Kernel Code

OpenCL Kernel Code

OpenCL Kernel Code

OpenCL Kernel Code

GPU

DSPCPU

CPUFPGA

HW

Kernel code compiled for

devices

Devices

CPU

Host

Runtime APIloads and executes

kernels across devices

https://www.khronos.org/assets/uploads/developers/library/overview/opencl_overview.pdf

Motivation: High Level Synthesis using Altera OpenCL

+ Plug and accelerate: Automatic interface design+ Directives for exploiting data-level parallelism- Efficiency comes with hardware design patterns! Develop Altera OpenCL (AOC) software in Hardware Manner (HwM)

0 100 200 300 400 500 600 700

20

40

60

80

NDRange

SingleCU2

CU4

SIMD2

SIMD4

CU2/SIMD2

Line Buffer

LB CLAMP

LB CLAMP HwM

Throughput [frame/s]

Logi

cU

tiliz

atio

n[%

]

GPU KernelLine Buffer (LB)

Figure: Design points for a 5×5 Gaussian blur filter.



0 100 200 300 400 500 600 700

20

40

60

80

NDRange

SingleCU2

CU4

SIMD2

SIMD4

CU2/SIMD2

Line Buffer

LB CLAMP

LB CLAMP HwM


Logi

cU

tiliz

atio

n[%

]





0 100 200 300 400 500 600 700

20

40

60

80

NDRange

SingleCU2

CU4

SIMD2

SIMD4

CU2/SIMD2

Line Buffer

LB CLAMP

LB CLAMP HwM


Logi

cU

tiliz

atio

n[%

]





0 100 200 300 400 500 600 700

20

40

60

80

NDRange

SingleCU2

CU4

SIMD2

SIMD4

CU2/SIMD2

Line Buffer

LB CLAMP

LB CLAMP HwM


Logi

cU

tiliz

atio

n[%

]





0 100 200 300 400 500 600 700

20

40

60

80

NDRange

SingleCU2

CU4

SIMD2

SIMD4

CU2/SIMD2

Line Buffer

LB CLAMP

LB CLAMP HwM


Logi

cU

tiliz

atio

n[%

]



Outline

Programming MethodologyDeveloping Software in Hardware Manner

Efficient Code Generation through a Domain-Specific Language

Extensions to HIPAcc FrameworkAutomatic Bit-width Reduction

Evaluation and Results

Conclusion

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 0

Programming Methodology

Efficient OpenCL program for AOC

• Use hardware design patterns• Single-item line-buffered kernels for image processing

input output

win

dow

load

store

line

buffe

r

loca

l op

erat

or

f



• Use hardware design patterns• Apply known hardware design optimizations

• Strength Reduction• Bit-width reduction• Use 1-bit control signals instead of repeated conditions

Listing 1: Source Code

c = a * 16;

Listing 2: Strength Reduction

c = a << 4;











• Fetch best practices• Remove the else branch by moving its content to preceding area• Nested loops should always have constant bounds• Use temporary registers wherever it is possible• Limit life time of variables using scope operators• Avoid using 2-dimensional arrays• Writing to arrays should be sequential and simple as possible


if(condition1){a = 3;

}else{a = 5;

}

Listing 2: Remove else condition

a = 5;if(condition1){

a = 3;}





for (int i = 0; i < 5; ++i){for (int j = 0; j < i; ++j){

/* operations */}

}

Listing 2: Constant loop bounds

for (int i = 0; i < 5; ++i){for (int j = 0; j < 5; ++j){

if (j < i) {/* operations */

}}

}





a = array[i];if (condition){

b = array[i];}c = array[i]*5;


char temp = array[i];a = temp;if (condition){

b = temp;}c = temp *5;





/* code portion1 */char temp;/* code_portion2(temp) *//* code_portion3 */

Listing 2: Scope Operators

/* code portion1 */{

char temp;/* code_portion2(temp) */

}/* code_portion3 */





char array [5][5];

Listing 2: Avoid 2 dimensional arrays

char arrayRow1 [5];char arrayRow2 [5];char arrayRow3 [5];char arrayRow4 [5];char arrayRow5 [5];





char array [7];// Load new datafor(int i=0; i<2; i++){

char newdata = newdata[i];array[i+3] = newdata;array[i+5] = newdata;

}// Shift only a portionif(condition){

array [5] = array [6];}

Listing 2: Simplify array access

char array1 [5];char array2 [2];// Load new datafor(int i=0; i<2; i++){

char newdata = newdata[i];array1[i+3] = newdata;array2[i] = newdata;

}// Shift only a portionif(condition){

array2 [0] = array2 [1];}


Proposed Approach: Develop Software in Hardware Manner

• Write Altera OpenCL code following these two principles:1. Calculate all the conditional cases first, decide at the end2. Exploit temporal locality




Listing 3: Software Manner

if (condition){calculate_a (...);result = a;

}else{calculate_b (...);result = b;

}

Listing 4: Hardware Manner

calculate_a (...);calculate_b (...);result = b;if (condition){

result = a;}

regreg in[i]compute(x, y, z) x y z

MUX

calculate_a(…)

calculate_b(…)

result

a

b




Listing 3: Software Manner

for(int i=0; i<SIZE; i++){x = in[i-2];y = in[i-1];z = in[i];out = compute(x, y, z);

}

Listing 4: Hardware Manner

for(int i=0; i<SIZE; i++){x = y;y = z;z = in[i];out = compute(x, y, z);

}

regreg in[i]compute(x, y, z) x y z

MUX

calculate_a(…)

calculate_b(…)

result

a

b


Writing HLS Code for AOC: CLAMP Boundary Handling (HW)

M M

M M

M N O P M N O P

M N O P M N O P

P P

P P

M M

M M

I I

I I

E E

E E

A A

A A

M N O P

I J K L

E F G H

A B C D

M N O P

I J K L

E F G H

A B C D

M N O P

I J K L

E F G H

A B C D

M N O P

I J K L

E F G H

A B C D

P P

P P

L L

L L

H H

H H

D D

D D

A A

A A

A B C D A B C D

A B C D A B C D

D D

D D

Figure: Clamp boundary condition on an image



M M

M M

M N O P M N O P

M N O P M N O P

P P

P P

M M

M M

I I

I I

E E

E E

A A

A A

M N O P

I J K L

E F G H

A B C D

M N O P

I J K L

E F G H

A B C D

M N O P

I J K L

E F G H

A B C D

M N O P

I J K L

E F G H

A B C D

P P

P P

L L

L L

H H

H H

D D

D D

A A

A A

A B C D A B C D

A B C D A B C D

D D

D D

BH_NO

BH_TL BH_T BH_TR

BH_L BH_R

BH_BL BH_B BH_BR

Figure: Clamp boundary condition on an image



reg30'

reg40

reg30

reg20

reg10

reg40'

reg00 M

UX

MUX

MUX

MUX

Line BufferMUX

reg31’

reg41

reg31

reg21

reg11

reg41’

reg01 M

UX

MUX

MUX

MUX

MUX

reg33’

reg43

reg33

reg23

reg13

reg43’

reg03 M

UX

MUX

MUX

MUX

MUX

reg34’

reg44

reg34

reg24

reg14

reg44’

reg04 M

UX

MUX

MUX

MUX

MUX

Line Buffer

Line Buffer

Pixel Stream

Line Buffer

reg32’

reg42

reg32

reg22

reg12

reg42’

reg02 M

UX

MUX

MUX

MUX

[3:4] ROWs SELECTION

5x5 LOCAL OPERATOR

[0:2] ROWs

Requires hardware design expertise!

Best practices on syntactic description matters!

No automatic kernel and accelerator replication!

Code Size increases upto 10x!



reg30'

reg40

reg30

reg20

reg10

reg40'

reg00 M

UX

MUX

MUX

MUX

Line BufferMUX

reg31’

reg41

reg31

reg21

reg11

reg41’

reg01 M

UX

MUX

MUX

MUX

MUX

reg33’

reg43

reg33

reg23

reg13

reg43’

reg03 M

UX

MUX

MUX

MUX

MUX

reg34’

reg44

reg34

reg24

reg14

reg44’

reg04 M

UX

MUX

MUX

MUX

MUX

Line Buffer

Line Buffer

Pixel Stream

Line Buffer

reg32’

reg42

reg32

reg22

reg12

reg42’

reg02 M

UX

MUX

MUX

MUX


5x5 LOCAL OPERATOR

[0:2] ROWs







reg30'

reg40

reg30

reg20

reg10

reg40'

reg00 M

UX

MUX

MUX

MUX

Line BufferMUX

reg31’

reg41

reg31

reg21

reg11

reg41’

reg01 M

UX

MUX

MUX

MUX

MUX

reg33’

reg43

reg33

reg23

reg13

reg43’

reg03 M

UX

MUX

MUX

MUX

MUX

reg34’

reg44

reg34

reg24

reg14

reg44’

reg04 M

UX

MUX

MUX

MUX

MUX

Line Buffer

Line Buffer

Pixel Stream

Line Buffer

reg32’

reg42

reg32

reg22

reg12

reg42’

reg02 M

UX

MUX

MUX

MUX


5x5 LOCAL OPERATOR

[0:2] ROWs







reg30'

reg40

reg30

reg20

reg10

reg40'

reg00 M

UX

MUX

MUX

MUX

Line BufferMUX

reg31’

reg41

reg31

reg21

reg11

reg41’

reg01 M

UX

MUX

MUX

MUX

MUX

reg33’

reg43

reg33

reg23

reg13

reg43’

reg03 M

UX

MUX

MUX

MUX

MUX

reg34’

reg44

reg34

reg24

reg14

reg44’

reg04 M

UX

MUX

MUX

MUX

MUX

Line Buffer

Line Buffer

Pixel Stream

Line Buffer

reg32’

reg42

reg32

reg22

reg12

reg42’

reg02 M

UX

MUX

MUX

MUX


5x5 LOCAL OPERATOR

[0:2] ROWs







reg30'

reg40

reg30

reg20

reg10

reg40'

reg00 M

UX

MUX

MUX

MUX

Line BufferMUX

reg31’

reg41

reg31

reg21

reg11

reg41’

reg01 M

UX

MUX

MUX

MUX

MUX

reg33’

reg43

reg33

reg23

reg13

reg43’

reg03 M

UX

MUX

MUX

MUX

MUX

reg34’

reg44

reg34

reg24

reg14

reg44’

reg04 M

UX

MUX

MUX

MUX

MUX

Line Buffer

Line Buffer

Pixel Stream

Line Buffer

reg32’

reg42

reg32

reg22

reg12

reg42’

reg02 M

UX

MUX

MUX

MUX


5x5 LOCAL OPERATOR

[0:2] ROWs






Efficient Code Generation through aDomain-Specific Language

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

C++

embedded DSL

Source-to-SourceCompiler

Clang/LLVM

Domain

Knowledge

Architecture

Knowledge

CUDA(GPU)

OpenCL(x86/GPU)

C/C++

(x86)

Renderscript(x86/ARM/GPU)

OpenCL(ALTERA FPGA)

Vivado C++

(FPGA)

CUDA/OpenCL/Renderscript Runtime LibraryCUDA/OpenCL/Renderscript Runtime Library AOCL Vivado HLS


Operators in the Image Processing Domain

We can define three characteristic data operations in the domain:

Point Operators:Output data is determined by single input data

Local Operators:Output data is determined by local region of the inputdata

Global Operators:Output data is determined by all of the input data

input image output image




Streaming Pipeline: Harris Corner Detection Example

Transform sequential execution order. . .

dx

dy

sx

sxy

sy

gx

gxy

gy

hcinput outputin

dx

dy

sx

in’

dx’

dy’

sx’

in’’

dx’’

Figure: HIPAcc’s sequential execution for the Harris corner detector

. . . into streaming pipeline of FPGA kernels.

dx

dy

sxy

sy

gxy

gy

hcinput output

sx gx

Figure: Representation of AOC kernels


Streaming Pipeline: Harris Corner Detection Example

Transform sequential execution order. . .

dx

dy

sx

sxy

sy

gx

gxy

gy

hcinput outputin

dx

dy

sx

in’

dx’

dy’

sx’

in’’

dx’’

Figure: HIPAcc’s sequential execution for the Harris corner detector

. . . into streaming pipeline of FPGA kernels.

dx

dy

sxy

sy

gxy

gy

hcinput output

sx gx

Figure: Representation of AOC kernels


Example: Laplacian Operator

// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },

{ 1, -4, 1 },{ 0, 1, 0 } };

// read inputuchar4 *image_bits = readImage ();

Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);

// load image datain = image_bits;

// Mask (Stencil) of local operatorMask <int > mask(coef);

// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);

// output imageIterationSpace <uchar4 > iter(out);

// define kernelLaplacian filter(iter , acc , mask);

// execute kernelfilter.execute ();

//Write outputuchar4* out = out.data();

InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask


Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

Output image Crop of outputimage

Crop of outputimage with offset


Output Image Crop of output image

Crop of output imagewith offset

OutputImage







C DG H

A B C DE F G H

A BE F

O PK LG HC D

M N O PI J K LE F G HA B C D

M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A


P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A


P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q


Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant


Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework




BoundaryCondition boundary handling mode


Image and boundary Image crop Image crop withoffset

Image offset


Image andBoundary Image crop Image crop

with offset Image offset




{ 1, -4, 1 },{ 0, 1, 0 } };










InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask












OutputImage







C DG H

A B C DE F G H

A BE F

O PK LG HC D


M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A


P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A


P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q


Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant









Image offset







{ 1, -4, 1 },{ 0, 1, 0 } };










InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask












OutputImage







C DG H

A B C DE F G H

A BE F

O PK LG HC D


M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A


P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A


P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q


Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant









Image offset







{ 1, -4, 1 },{ 0, 1, 0 } };










InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask












OutputImage







C DG H

A B C DE F G H

A BE F

O PK LG HC D


M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A


P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A


P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q


Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant









Image offset







{ 1, -4, 1 },{ 0, 1, 0 } };










InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask












OutputImage







C DG H

A B C DE F G H

A BE F

O PK LG HC D


M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A


P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A


P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q


Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant


Repeat Clamp Mirror Constant








Image offset







{ 1, -4, 1 },{ 0, 1, 0 } };










InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask












OutputImage







C DG H

A B C DE F G H

A BE F

O PK LG HC D


M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A


P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A


P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q


Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant









Image offset







{ 1, -4, 1 },{ 0, 1, 0 } };










InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask












OutputImage







C DG H

A B C DE F G H

A BE F

O PK LG HC D


M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A


P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A


P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q


Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant









Image offset







{ 1, -4, 1 },{ 0, 1, 0 } };










InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask












OutputImage







C DG H

A B C DE F G H

A BE F

O PK LG HC D


M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A


P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A


P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q


Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant









Image offset







{ 1, -4, 1 },{ 0, 1, 0 } };










InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask












OutputImage







C DG H

A B C DE F G H

A BE F

O PK LG HC D


M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A


P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A


P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q


Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant









Image offset







{ 1, -4, 1 },{ 0, 1, 0 } };










InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask












OutputImage







C DG H

A B C DE F G H

A BE F

O PK LG HC D


M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A


P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A


P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q


Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant









Image offset





Example: Laplacian Operator Kernel

class Laplacian : public Kernel <uchar4 > {private:

Accessor <uchar4 > &input;Mask <int > &mask;

public:Laplacian(IterationSpace <uchar4 > &iter ,

Accessor <uchar4 > &input , Mask <int > &mask): Kernel(iter), input(input), mask(mask) {

addAccessor (& input);}

void kernel () {int4 sum = convolve(mask , HipaccSUM , [&] () -> int4 {

return mask() * convert_int4(input(mask));});

sum = max(sum , 0);sum = min(sum , 255);output () = convert_uchar4(sum);

}};


Example: Laplacian Operator Kernel

class Laplacian : public Kernel <uchar4 > {private:

Accessor <uchar4 > &input;Mask <int > &mask;

public:Laplacian(IterationSpace <uchar4 > &iter ,

Accessor <uchar4 > &input , Mask <int > &mask): Kernel(iter), input(input), mask(mask) {

addAccessor (&input);}

void kernel () {int4 sum = convolve(mask , HipaccSUM , [&] () -> int4 {

return mask() * convert_int4(input(mask));});

sum = max(sum , 0);sum = min(sum , 255);output () = convert_uchar4(sum);

}};

Convolution call


Extensions to HIPAcc Framework

Arbitrary Bit Width Hardware Design

- The OpenCL standard contains primitive data types only, e.g., char/short/int

+ AOC supports arbitrary bit width declaration

- AOC arbitrary bit width declaration is masking

- Compound assignment operations need to be expanded

- Unary operations need to be expanded

Utilizing exact bit widths is tedious in Altera OpenCL 14.1!






x = (RHS) & 0x7;









x = (RHS) & 0x7;


// x += (RHS) & 0x7;x = ((x & 0x7) + RHS) & 0x7;


// x++;x = ((x & 0x7) + 1) & 0x7;







x = (RHS) & 0x7;


// x += (RHS) & 0x7;x = ((x & 0x7) + RHS) & 0x7;


// x++;x = ((x & 0x7) + 1) & 0x7;



Automatic Bit Width Reduction: An extension to HIPAcc

Listing 5: Gaussian blur in HIPAcc with annotated bit widths

#pragma hipacc bw(sum ,12)uint sum = 0;#pragma hipacc bw(x,2)uint x = 0;#pragma hipacc bw(y,2)uint y = 0;for (y = 0; y < size; ++y) {

for (x = 0; x < size; ++x) {sum += mask[y][x] * Input(x-1,y-1);

}}sum /= 16;output () = sum;

↓Listing 6: Gaussian blur in HIPAcc with annotated bit widths

uint sum = 0;uint x = 0;uint y = 0;for (y = ((0) & 3); ((y) & 3) < size; y = ((((y) & 3) + 1) & 3)){

for (x = ((0) & 3); ((x) & 3) < size; x = ((((x) & 3) + 1) & 3)){sum = ((((sum) & 4095) + mask[y][x]

* getWindowAt(Input , 1 + ((x) & 3) - 1, 1 + ((y) & 3) -1)) & 4095);}

}sum = ((((sum) & 4095) / 16) & 4095);return sum;


Automatic Bit Width Reduction: An extension to HIPAcc


#pragma hipacc bw(sum ,12)uint sum = 0;#pragma hipacc bw(x,2)uint x = 0;#pragma hipacc bw(y,2)uint y = 0;for (y = 0; y < size; ++y) {

for (x = 0; x < size; ++x) {sum += mask[y][x] * Input(x-1,y-1);

}}sum /= 16;output () = sum; ↓


uint sum = 0;uint x = 0;uint y = 0;for (y = ((0) & 3); ((y) & 3) < size; y = ((((y) & 3) + 1) & 3)){

for (x = ((0) & 3); ((x) & 3) < size; x = ((((x) & 3) + 1) & 3)){sum = ((((sum) & 4095) + mask[y][x]

* getWindowAt(Input , 1 + ((x) & 3) - 1, 1 + ((y) & 3) -1)) & 4095);}

}sum = ((((sum) & 4095) / 16) & 4095);return sum;


Evaluation and Results

HIPAcc vs Handwritten Examples provided by Altera

Edge-DetectorAltera

Edge-DetectorHIPAcc

Lukas KanadeAltera

Lukas KanadeHIPAcc

0

10

20

30

40

50

13.5515.04

18.05

25.63

19.59 19.21

22.59

27.62

7.03

1.17

18.36

14.45

Har

dwar

eU

tiliz

atio

n[%

]M20K Logic DSP

0

100

200

300

Clo

ckFr

eque

ncy

[MH

z]

Fmax

Figure: HIPAcc vs Altera handwritten examples for 1024×1024 image size.


FPGA vs GPU

100 101 102 103 104

Gaussian

Laplacian

Sobel

Bilateral

Harris Corner

Optical Flow

Throughput [MPixel/s]

Tegra K1 Tesla K20 Stratix V

Figure: Comparison of throughput for the Nvidia Tegra K1, Nvidia Tesla K20, and Altera Stratix V.


Conclusion

Conclusion

Advantages of DSL-based Approach

Productivity – compact algorithm description– less error-prone

Performance – efficient target-specific code generation

Portability – flexible target choice– performance portability, not just functional portability

HIPAcc DSL code serves as baseline implementation⇒ Test bench

Develop the algorithm first, decide the target architecture afterwards!


Conclusion

Advantages of DSL-based Approach

Productivity – compact algorithm description– less error-prone

Performance – efficient target-specific code generation

Portability – flexible target choice– performance portability, not just functional portability

HIPAcc DSL code serves as baseline implementation⇒ Test bench

Develop the algorithm first, decide the target architecture afterwards!


Questions?Thanks for listening.

Any questions?

C++

embedded DSL

Source-to-SourceCompiler

Clang/LLVM

Domain

Knowledge

Architecture

Knowledge

CUDA(GPU)

OpenCL(x86/GPU)

C/C++

(x86)

Renderscript(x86/ARM/GPU)

OpenCL(ALTERA FPGA)

Vivado C++

(FPGA)

CUDA/OpenCL/Renderscript Runtime LibraryCUDA/OpenCL/Renderscript Runtime Library AOCL Vivado HLS

http://github.com/hipacc/hipacc-fpga

Title FPGA-Based Accelerator Design from a Domain-Specific Language

Speaker M. Akif Özkan, [email protected]

http://github.com/hipacc/hipacc-fpga

Backup Slides

Additional Results

Boundary Handling: SW Manner vs HW Manner

UNDEF-SW CONST-SW CLAMP-SW MIRROR-SW0

10

20

30

40

50

60

Dataset A

12.85 13.1 13.1 13.1

16.1517.55

41.11 41.42

2.3 2.3

13.79 13.79

Har

dwar

eU

tiliz

atio

n[%

]

M10K Logic DSP

0

20

40

60

80

100

120

140

160

Clo

ckFr

eque

ncy

[MH

z]

Fmax

(a) Single-Item Line-Buffered Kerneldeveloped in Software Manner

UNDEF-SW CONST-SW CLAMP-SW MIRROR-SW0

10

20

30

40

50

60

Dataset A

12.85 13.1 13.1 13.1

16.18 16.71 17.03 17.21

2.3 2.3 2.3 2.3

Har

dwar

eU

tiliz

atio

n[%

]

M10K Logic DSP

0

20

40

60

80

100

120

140

160

Clo

ckFr

eque

ncy

[MH

z]

Fmax

(b) Single-Item Line-Buffered Kerneldeveloped in Hardware Manner

Figure: Boundary conditions for a 5×5 Gaussian blur on a 1024×1024 image.


Automatic Bit Width Reduction: Results

• AOC is very successful when inner loop trip counts are known during compiletime and the unroll directive has been specified for synthesis

• Great improvement in clock frequency for dynamic loops

Table: Automatic bit width reduction on local operators of size 3×3 with image size 1024×1024.

Filter Type II ALUTs Registers Logic (%) M10K DSP Freq [MHz]

Sobelnormal 2 8552 12433 19.76 84 4 131.25reduced 2 8230 12098 19.11 84 3 152.09

Gaussiannormal 2 8593 12423 19.86 84 4 132.50reduced 2 8439 11843 19.26 83 3 153.11


Kernel Vectorization


Vectorization by loop coarsening [1]: unroll the outer loop by factor vi

• replicate only the kernel: sublinear increase of resource usage• better exploitation of bandwidth: nearly linear speedup

[1] Moritz Schmid et al. “Loop Coarsening in C-based High-Level Synthesis”. In: Proceedings of the 26th IEEE International Conference on Application-specificSystems, Architectures and Processors (ASAP). (Toronto, Canada). IEEE, July 27–29, 2015, pp. 166–173



1 2 4 8 160

20

40

60

80

100

120

14.57 14.49 14.49 14.65 14.88

22.77 24.4527.91

34.28

48.93

8.9814.45

25.39

47.27

91.02

Vectorization Factor v

Har

dwar

eU

tiliz

atio

n[%

]M20K Logic DSP

0

100

200

300

400

Clo

ckFr

eque

ncy

[MH

z]

Fmax

Figure: Vectorization of a 3×3 bilateral filter with clamping on an image of size 1024×1024.


Generating the Streaming Pipeline


Trace host code and translate it to internal representation:

• model as combination ofprocesses and spaces

• create unique channelobjects for each space

• identify memory reuseand utilize processesaccordingly

• build dependency graph• traverse in depth-first

search starting fromoutput spaces

1× channel2× channels

Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process










Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process










Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process









1× channel

2× channels

Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process










Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process










Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process










Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process









1× channel

2× channels

Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process









1× channel

2× channels

Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process


Altera OpenCL System Level Interfaces


Channels Data sharing between kernelsFigure 1-5: Difference in Global Memory Access Pattern as a Result of Channels or PipesImplementation

Global Memory

Kernel 1 Kernel 4Kernel 2 Kernel 3Write

ReadRead

ReadRead

WriteWriteWrite

Global Memory

Kernel 1 Kernel 4Kernel 2 Kernel 3

Read

Write

Channel/Pipe Channel/Pipe

Global Memory Access Pattern Before AOCL Channels or Pipes Implementation

Global Memory Access Pattern After AOCL Channels or Pipes Implementation

Channel/Pipe

For more information on the usage of channels, refer to the Implementing AOCL Channels Extensionsection of the Altera SDK for OpenCL Programming Guide.

For more information on the usage of pipes, refer to the Implementing OpenCL Pipes section of the AlteraSDK for OpenCL Programming Guide.

Related Information

• Implementing AOCL Channels Extension• Implementing OpenCL Pipes

Channels and Pipes CharacteristicsTo implement channels or pipes in your OpenCL kernel program, keep in mind their respectivecharacteristics that are specific to the Altera SDK for OpenCL (AOCL).

Default Behavior

The default behavior of channels is blocking. The default behavior of pipes is nonblocking.

Concurrent Execution of Multiple OpenCL Kernels

You can execute multiple OpenCL kernels concurrently. To enable concurrent execution, modify the hostcode to instantiate multiple command queues. Each concurrently executing kernel is associated with aseparate command queue.Important: Pipe-specific considerations:

The OpenCL pipe modifications outlined in Ensuring Compatibility with Other OpenCLSDKs in the Altera SDK for OpenCL Programming Guide allow you to run your kernel on theAOCL. However, they do not maximize the kernel throughput. The OpenCL Specificationversion 2.0 requires that pipe writes occur before pipe reads so that the kernel is not reading

1-8 Channels and Pipes CharacteristicsOCL003-15.0.0

2015.05.04

Altera Corporation Altera SDK for OpenCL Best Practices Guide

Send Feedback

Altera, SDK. for OpenCL: Best Practices Guide, Altera, 2016.



Channels Data sharing between kernels

Table: Impact of channels on line-buffered single-work item implementations.

Kernel Type II ALUTs Registers LU (%) BRAM DSP Freq [MHz]

Copy Kernel single 1 6821 7812 14.04 43 0 152.89Mean Filter single 1 8248 9415 16.78 51 0 151.52

2 Mean Filter separate 1 12104 15132 25.36 100 0 145.78

2 Mean Filter channel 1 10439 12466 21.47 60 0 154.142 Mean Filter fused 1 9480 11715 19.47 64 0 122.473 Mean Filter channel 1 12829 15451 26.56 67 0 145.573 Mean Filter fused 1 10899 13651 22.24 79 0 125.00

Luma+Mean single 1 8244 9694 16.84 66 2 132.56Luma+Mean channel 1 11281 13653 23.75 64 3 148.75

Luma+Mean fused 1 9635 11846 20.09 72 2 123.53

Harris corner channel 1 43127 54334 89.87 186 9 142.52Harris corner fused 1 33263 43319 70.87 177 9 107.35

- Channels use considerable amount of logic!






Copy Kernel single 1 6821 7812 14.04 43 0 152.89Mean Filter single 1 8248 9415 16.78 51 0 151.522 Mean Filter separate 1 12104 15132 25.36 100 0 145.78

2 Mean Filter channel 1 10439 12466 21.47 60 0 154.14

2 Mean Filter fused 1 9480 11715 19.47 64 0 122.473 Mean Filter channel 1 12829 15451 26.56 67 0 145.573 Mean Filter fused 1 10899 13651 22.24 79 0 125.00

Luma+Mean single 1 8244 9694 16.84 66 2 132.56Luma+Mean channel 1 11281 13653 23.75 64 3 148.75

Luma+Mean fused 1 9635 11846 20.09 72 2 123.53



+ Channels are cheaper than kernel IOs.






Copy Kernel single 1 6821 7812 14.04 43 0 152.89Mean Filter single 1 8248 9415 16.78 51 0 151.522 Mean Filter separate 1 12104 15132 25.36 100 0 145.78

2 Mean Filter channel 1 10439 12466 21.47 60 0 154.142 Mean Filter fused 1 9480 11715 19.47 64 0 122.473 Mean Filter channel 1 12829 15451 26.56 67 0 145.573 Mean Filter fused 1 10899 13651 22.24 79 0 125.00

Luma+Mean single 1 8244 9694 16.84 66 2 132.56Luma+Mean channel 1 11281 13653 23.75 64 3 148.75Luma+Mean fused 1 9635 11846 20.09 72 2 123.53



+ Channels are cheaper than kernel IOs.

- Smaller the size of kernel body, faster the clock frequency.


Documents

FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efﬁciency OpenCL Portable and open programming model for heterogeneous ... High