82
FPGA-Based Accelerator Design from a Domain-Specific Language M. Akif Özkan , Oliver Reiche, Frank Hannig, and Jürgen Teich Hardware/Software Co-Design, Friedrich-Alexander University Erlangen-Nürnberg FPL, August 29, 2016, Lausanne, Switzerland

FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

FPGA-Based Accelerator Design from aDomain-Specific Language

M. Akif Özkan, Oliver Reiche, Frank Hannig, and Jürgen TeichHardware/Software Co-Design, Friedrich-Alexander University Erlangen-NürnbergFPL, August 29, 2016, Lausanne, Switzerland

Page 2: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Motivation

FPGAs High throughput, great energy efficiency

OpenCL Portable and open programming model for heterogeneoussystems, including CPUs, GPUs, DSPs etc.

© Copyright Khronos Group 2015 - Page 3

OpenCL – Portable Heterogeneous Computing•OpenCL = Two APIs and Two Kernel languages- C Platform Layer API to query, select and initialize compute devices- OpenCL C and (soon) OpenCL C++ kernel languages to write parallel code - C Runtime API to build and execute kernels across multiple devices

•One code tree can be executed on CPUs, GPUs, DSPs, FPGA and hardware- Dynamically balance work across available processors

OpenCL Kernel Code

OpenCL Kernel Code

OpenCL Kernel Code

OpenCL Kernel Code

GPU

DSPCPU

CPUFPGA

HW

Kernel code compiled for

devices

Devices

CPU

Host

Runtime APIloads and executes

kernels across devices

https://www.khronos.org/assets/uploads/developers/library/overview/opencl_overview.pdf

Page 3: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Motivation: High Level Synthesis using Altera OpenCL

+ Plug and accelerate: Automatic interface design+ Directives for exploiting data-level parallelism- Efficiency comes with hardware design patterns! Develop Altera OpenCL (AOC) software in Hardware Manner (HwM)

0 100 200 300 400 500 600 700

20

40

60

80

NDRange

SingleCU2

CU4

SIMD2

SIMD4

CU2/SIMD2

Line Buffer

LB CLAMP

LB CLAMP HwM

Throughput [frame/s]

Logi

cU

tiliz

atio

n[%

]

GPU KernelLine Buffer (LB)

Figure: Design points for a 5×5 Gaussian blur filter.

Page 4: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Motivation: High Level Synthesis using Altera OpenCL

+ Plug and accelerate: Automatic interface design+ Directives for exploiting data-level parallelism- Efficiency comes with hardware design patterns! Develop Altera OpenCL (AOC) software in Hardware Manner (HwM)

0 100 200 300 400 500 600 700

20

40

60

80

NDRange

SingleCU2

CU4

SIMD2

SIMD4

CU2/SIMD2

Line Buffer

LB CLAMP

LB CLAMP HwM

Throughput [frame/s]

Logi

cU

tiliz

atio

n[%

]

GPU KernelLine Buffer (LB)

Figure: Design points for a 5×5 Gaussian blur filter.

Page 5: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Motivation: High Level Synthesis using Altera OpenCL

+ Plug and accelerate: Automatic interface design+ Directives for exploiting data-level parallelism- Efficiency comes with hardware design patterns! Develop Altera OpenCL (AOC) software in Hardware Manner (HwM)

0 100 200 300 400 500 600 700

20

40

60

80

NDRange

SingleCU2

CU4

SIMD2

SIMD4

CU2/SIMD2

Line Buffer

LB CLAMP

LB CLAMP HwM

Throughput [frame/s]

Logi

cU

tiliz

atio

n[%

]

GPU KernelLine Buffer (LB)

Figure: Design points for a 5×5 Gaussian blur filter.

Page 6: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Motivation: High Level Synthesis using Altera OpenCL

+ Plug and accelerate: Automatic interface design+ Directives for exploiting data-level parallelism- Efficiency comes with hardware design patterns! Develop Altera OpenCL (AOC) software in Hardware Manner (HwM)

0 100 200 300 400 500 600 700

20

40

60

80

NDRange

SingleCU2

CU4

SIMD2

SIMD4

CU2/SIMD2

Line Buffer

LB CLAMP

LB CLAMP HwM

Throughput [frame/s]

Logi

cU

tiliz

atio

n[%

]

GPU KernelLine Buffer (LB)

Figure: Design points for a 5×5 Gaussian blur filter.

Page 7: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Motivation: High Level Synthesis using Altera OpenCL

+ Plug and accelerate: Automatic interface design+ Directives for exploiting data-level parallelism- Efficiency comes with hardware design patterns! Develop Altera OpenCL (AOC) software in Hardware Manner (HwM)

0 100 200 300 400 500 600 700

20

40

60

80

NDRange

SingleCU2

CU4

SIMD2

SIMD4

CU2/SIMD2

Line Buffer

LB CLAMP

LB CLAMP HwM

Throughput [frame/s]

Logi

cU

tiliz

atio

n[%

]

GPU KernelLine Buffer (LB)

Figure: Design points for a 5×5 Gaussian blur filter.

Page 8: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Outline

Programming MethodologyDeveloping Software in Hardware Manner

Efficient Code Generation through a Domain-Specific Language

Extensions to HIPAcc FrameworkAutomatic Bit-width Reduction

Evaluation and Results

Conclusion

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 0

Page 9: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Programming Methodology

Page 10: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Efficient OpenCL program for AOC

• Use hardware design patterns• Single-item line-buffered kernels for image processing

input output

win

dow

load

store

line

buffe

r

loca

l op

erat

or

f

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 1

Page 11: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Efficient OpenCL program for AOC

• Use hardware design patterns• Apply known hardware design optimizations

• Strength Reduction• Bit-width reduction• Use 1-bit control signals instead of repeated conditions

Listing 1: Source Code

c = a * 16;

Listing 2: Strength Reduction

c = a << 4;

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 1

Page 12: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Efficient OpenCL program for AOC

• Use hardware design patterns• Apply known hardware design optimizations

• Strength Reduction• Bit-width reduction• Use 1-bit control signals instead of repeated conditions

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 1

Page 13: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Efficient OpenCL program for AOC

• Use hardware design patterns• Apply known hardware design optimizations

• Strength Reduction• Bit-width reduction• Use 1-bit control signals instead of repeated conditions

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 1

Page 14: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Efficient OpenCL program for AOC

• Fetch best practices• Remove the else branch by moving its content to preceding area• Nested loops should always have constant bounds• Use temporary registers wherever it is possible• Limit life time of variables using scope operators• Avoid using 2-dimensional arrays• Writing to arrays should be sequential and simple as possible

Listing 1: Source Code

if(condition1){a = 3;

}else{a = 5;

}

Listing 2: Remove else condition

a = 5;if(condition1){

a = 3;}

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 2

Page 15: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Efficient OpenCL program for AOC

• Fetch best practices• Remove the else branch by moving its content to preceding area• Nested loops should always have constant bounds• Use temporary registers wherever it is possible• Limit life time of variables using scope operators• Avoid using 2-dimensional arrays• Writing to arrays should be sequential and simple as possible

Listing 1: Source Code

for (int i = 0; i < 5; ++i){for (int j = 0; j < i; ++j){

/* operations */}

}

Listing 2: Constant loop bounds

for (int i = 0; i < 5; ++i){for (int j = 0; j < 5; ++j){

if (j < i) {/* operations */

}}

}

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 2

Page 16: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Efficient OpenCL program for AOC

• Fetch best practices• Remove the else branch by moving its content to preceding area• Nested loops should always have constant bounds• Use temporary registers wherever it is possible• Limit life time of variables using scope operators• Avoid using 2-dimensional arrays• Writing to arrays should be sequential and simple as possible

Listing 1: Source Code

a = array[i];if (condition){

b = array[i];}c = array[i]*5;

Listing 2: Source Code

char temp = array[i];a = temp;if (condition){

b = temp;}c = temp *5;

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 2

Page 17: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Efficient OpenCL program for AOC

• Fetch best practices• Remove the else branch by moving its content to preceding area• Nested loops should always have constant bounds• Use temporary registers wherever it is possible• Limit life time of variables using scope operators• Avoid using 2-dimensional arrays• Writing to arrays should be sequential and simple as possible

Listing 1: Source Code

/* code portion1 */char temp;/* code_portion2(temp) *//* code_portion3 */

Listing 2: Scope Operators

/* code portion1 */{

char temp;/* code_portion2(temp) */

}/* code_portion3 */

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 2

Page 18: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Efficient OpenCL program for AOC

• Fetch best practices• Remove the else branch by moving its content to preceding area• Nested loops should always have constant bounds• Use temporary registers wherever it is possible• Limit life time of variables using scope operators• Avoid using 2-dimensional arrays• Writing to arrays should be sequential and simple as possible

Listing 1: Source Code

char array [5][5];

Listing 2: Avoid 2 dimensional arrays

char arrayRow1 [5];char arrayRow2 [5];char arrayRow3 [5];char arrayRow4 [5];char arrayRow5 [5];

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 2

Page 19: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Efficient OpenCL program for AOC

• Fetch best practices• Remove the else branch by moving its content to preceding area• Nested loops should always have constant bounds• Use temporary registers wherever it is possible• Limit life time of variables using scope operators• Avoid using 2-dimensional arrays• Writing to arrays should be sequential and simple as possible

Listing 1: Source Code

char array [7];// Load new datafor(int i=0; i<2; i++){

char newdata = newdata[i];array[i+3] = newdata;array[i+5] = newdata;

}// Shift only a portionif(condition){

array [5] = array [6];}

Listing 2: Simplify array access

char array1 [5];char array2 [2];// Load new datafor(int i=0; i<2; i++){

char newdata = newdata[i];array1[i+3] = newdata;array2[i] = newdata;

}// Shift only a portionif(condition){

array2 [0] = array2 [1];}

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 2

Page 20: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Proposed Approach: Develop Software in Hardware Manner

• Write Altera OpenCL code following these two principles:1. Calculate all the conditional cases first, decide at the end2. Exploit temporal locality

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 3

Page 21: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Proposed Approach: Develop Software in Hardware Manner

• Write Altera OpenCL code following these two principles:1. Calculate all the conditional cases first, decide at the end2. Exploit temporal locality

Listing 3: Software Manner

if (condition){calculate_a (...);result = a;

}else{calculate_b (...);result = b;

}

Listing 4: Hardware Manner

calculate_a (...);calculate_b (...);result = b;if (condition){

result = a;}

regreg in[i]compute(x, y, z) x y z

MUX

calculate_a(…)

calculate_b(…)

result

a

b

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 3

Page 22: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Proposed Approach: Develop Software in Hardware Manner

• Write Altera OpenCL code following these two principles:1. Calculate all the conditional cases first, decide at the end2. Exploit temporal locality

Listing 3: Software Manner

for(int i=0; i<SIZE; i++){x = in[i-2];y = in[i-1];z = in[i];out = compute(x, y, z);

}

Listing 4: Hardware Manner

for(int i=0; i<SIZE; i++){x = y;y = z;z = in[i];out = compute(x, y, z);

}

regreg in[i]compute(x, y, z) x y z

MUX

calculate_a(…)

calculate_b(…)

result

a

b

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 3

Page 23: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Writing HLS Code for AOC: CLAMP Boundary Handling (HW)

M M

M M

M N O P M N O P

M N O P M N O P

P P

P P

M M

M M

I I

I I

E E

E E

A A

A A

M N O P

I J K L

E F G H

A B C D

M N O P

I J K L

E F G H

A B C D

M N O P

I J K L

E F G H

A B C D

M N O P

I J K L

E F G H

A B C D

P P

P P

L L

L L

H H

H H

D D

D D

A A

A A

A B C D A B C D

A B C D A B C D

D D

D D

Figure: Clamp boundary condition on an image

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 4

Page 24: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Writing HLS Code for AOC: CLAMP Boundary Handling (HW)

M M

M M

M N O P M N O P

M N O P M N O P

P P

P P

M M

M M

I I

I I

E E

E E

A A

A A

M N O P

I J K L

E F G H

A B C D

M N O P

I J K L

E F G H

A B C D

M N O P

I J K L

E F G H

A B C D

M N O P

I J K L

E F G H

A B C D

P P

P P

L L

L L

H H

H H

D D

D D

A A

A A

A B C D A B C D

A B C D A B C D

D D

D D

BH_NO

BH_TL BH_T BH_TR

BH_L BH_R

BH_BL BH_B BH_BR

Figure: Clamp boundary condition on an image

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 4

Page 25: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Writing HLS Code for AOC: CLAMP Boundary Handling (HW)

reg30'

reg40

reg30

reg20

reg10

reg40'

reg00 M

UX

MUX

MUX

MUX

Line BufferMUX

reg31’

reg41

reg31

reg21

reg11

reg41’

reg01 M

UX

MUX

MUX

MUX

MUX

reg33’

reg43

reg33

reg23

reg13

reg43’

reg03 M

UX

MUX

MUX

MUX

MUX

reg34’

reg44

reg34

reg24

reg14

reg44’

reg04 M

UX

MUX

MUX

MUX

MUX

Line Buffer

Line Buffer

Pixel Stream

Line Buffer

reg32’

reg42

reg32

reg22

reg12

reg42’

reg02 M

UX

MUX

MUX

MUX

[3:4] ROWs SELECTION

5x5 LOCAL OPERATOR

[0:2] ROWs

Requires hardware design expertise!

Best practices on syntactic description matters!

No automatic kernel and accelerator replication!

Code Size increases upto 10x!

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Page 26: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Writing HLS Code for AOC: CLAMP Boundary Handling (HW)

reg30'

reg40

reg30

reg20

reg10

reg40'

reg00 M

UX

MUX

MUX

MUX

Line BufferMUX

reg31’

reg41

reg31

reg21

reg11

reg41’

reg01 M

UX

MUX

MUX

MUX

MUX

reg33’

reg43

reg33

reg23

reg13

reg43’

reg03 M

UX

MUX

MUX

MUX

MUX

reg34’

reg44

reg34

reg24

reg14

reg44’

reg04 M

UX

MUX

MUX

MUX

MUX

Line Buffer

Line Buffer

Pixel Stream

Line Buffer

reg32’

reg42

reg32

reg22

reg12

reg42’

reg02 M

UX

MUX

MUX

MUX

[3:4] ROWs SELECTION

5x5 LOCAL OPERATOR

[0:2] ROWs

Requires hardware design expertise!

Best practices on syntactic description matters!

No automatic kernel and accelerator replication!

Code Size increases upto 10x!

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Page 27: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Writing HLS Code for AOC: CLAMP Boundary Handling (HW)

reg30'

reg40

reg30

reg20

reg10

reg40'

reg00 M

UX

MUX

MUX

MUX

Line BufferMUX

reg31’

reg41

reg31

reg21

reg11

reg41’

reg01 M

UX

MUX

MUX

MUX

MUX

reg33’

reg43

reg33

reg23

reg13

reg43’

reg03 M

UX

MUX

MUX

MUX

MUX

reg34’

reg44

reg34

reg24

reg14

reg44’

reg04 M

UX

MUX

MUX

MUX

MUX

Line Buffer

Line Buffer

Pixel Stream

Line Buffer

reg32’

reg42

reg32

reg22

reg12

reg42’

reg02 M

UX

MUX

MUX

MUX

[3:4] ROWs SELECTION

5x5 LOCAL OPERATOR

[0:2] ROWs

Requires hardware design expertise!

Best practices on syntactic description matters!

No automatic kernel and accelerator replication!

Code Size increases upto 10x!

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Page 28: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Writing HLS Code for AOC: CLAMP Boundary Handling (HW)

reg30'

reg40

reg30

reg20

reg10

reg40'

reg00 M

UX

MUX

MUX

MUX

Line BufferMUX

reg31’

reg41

reg31

reg21

reg11

reg41’

reg01 M

UX

MUX

MUX

MUX

MUX

reg33’

reg43

reg33

reg23

reg13

reg43’

reg03 M

UX

MUX

MUX

MUX

MUX

reg34’

reg44

reg34

reg24

reg14

reg44’

reg04 M

UX

MUX

MUX

MUX

MUX

Line Buffer

Line Buffer

Pixel Stream

Line Buffer

reg32’

reg42

reg32

reg22

reg12

reg42’

reg02 M

UX

MUX

MUX

MUX

[3:4] ROWs SELECTION

5x5 LOCAL OPERATOR

[0:2] ROWs

Requires hardware design expertise!

Best practices on syntactic description matters!

No automatic kernel and accelerator replication!

Code Size increases upto 10x!

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Page 29: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Writing HLS Code for AOC: CLAMP Boundary Handling (HW)

reg30'

reg40

reg30

reg20

reg10

reg40'

reg00 M

UX

MUX

MUX

MUX

Line BufferMUX

reg31’

reg41

reg31

reg21

reg11

reg41’

reg01 M

UX

MUX

MUX

MUX

MUX

reg33’

reg43

reg33

reg23

reg13

reg43’

reg03 M

UX

MUX

MUX

MUX

MUX

reg34’

reg44

reg34

reg24

reg14

reg44’

reg04 M

UX

MUX

MUX

MUX

MUX

Line Buffer

Line Buffer

Pixel Stream

Line Buffer

reg32’

reg42

reg32

reg22

reg12

reg42’

reg02 M

UX

MUX

MUX

MUX

[3:4] ROWs SELECTION

5x5 LOCAL OPERATOR

[0:2] ROWs

Requires hardware design expertise!

Best practices on syntactic description matters!

No automatic kernel and accelerator replication!

Code Size increases upto 10x!

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Page 30: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Efficient Code Generation through aDomain-Specific Language

Page 31: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

C++

embedded DSL

Source-to-SourceCompiler

Clang/LLVM

Domain

Knowledge

Architecture

Knowledge

CUDA(GPU)

OpenCL(x86/GPU)

C/C++

(x86)

Renderscript(x86/ARM/GPU)

OpenCL(ALTERA FPGA)

Vivado C++

(FPGA)

CUDA/OpenCL/Renderscript Runtime LibraryCUDA/OpenCL/Renderscript Runtime Library AOCL Vivado HLS

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 6

Page 32: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Operators in the Image Processing Domain

We can define three characteristic data operations in the domain:

Point Operators:Output data is determined by single input data

Local Operators:Output data is determined by local region of the inputdata

Global Operators:Output data is determined by all of the input data

input image output image

input image output image

input image output image

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 7

Page 33: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Streaming Pipeline: Harris Corner Detection Example

Transform sequential execution order. . .

dx

dy

sx

sxy

sy

gx

gxy

gy

hcinput outputin

dx

dy

sx

in’

dx’

dy’

sx’

in’’

dx’’

Figure: HIPAcc’s sequential execution for the Harris corner detector

. . . into streaming pipeline of FPGA kernels.

dx

dy

sxy

sy

gxy

gy

hcinput output

sx gx

Figure: Representation of AOC kernels

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 8

Page 34: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Streaming Pipeline: Harris Corner Detection Example

Transform sequential execution order. . .

dx

dy

sx

sxy

sy

gx

gxy

gy

hcinput outputin

dx

dy

sx

in’

dx’

dy’

sx’

in’’

dx’’

Figure: HIPAcc’s sequential execution for the Harris corner detector

. . . into streaming pipeline of FPGA kernels.

dx

dy

sxy

sy

gxy

gy

hcinput output

sx gx

Figure: Representation of AOC kernels

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 8

Page 35: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Example: Laplacian Operator

// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },

{ 1, -4, 1 },{ 0, 1, 0 } };

// read inputuchar4 *image_bits = readImage ();

Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);

// load image datain = image_bits;

// Mask (Stencil) of local operatorMask <int > mask(coef);

// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);

// output imageIterationSpace <uchar4 > iter(out);

// define kernelLaplacian filter(iter , acc , mask);

// execute kernelfilter.execute ();

//Write outputuchar4* out = out.data();

InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

Output image Crop of outputimage

Crop of outputimage with offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Output Image Crop of output image

Crop of output imagewith offset

OutputImage

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

C DG H

A B C DE F G H

A BE F

O PK LG HC D

M N O PI J K LE F G HA B C D

M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A

M N O PI J K LE F G HA B C D

P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A

M N O PI J K LE F G HA B C D

P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q

M N O PI J K LE F G HA B C D

Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling mode

Mask convolution mask

Image and boundary Image crop Image crop withoffset

Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Image andBoundary Image crop Image crop

with offset Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9

Page 36: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Example: Laplacian Operator

// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },

{ 1, -4, 1 },{ 0, 1, 0 } };

// read inputuchar4 *image_bits = readImage ();

Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);

// load image datain = image_bits;

// Mask (Stencil) of local operatorMask <int > mask(coef);

// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);

// output imageIterationSpace <uchar4 > iter(out);

// define kernelLaplacian filter(iter , acc , mask);

// execute kernelfilter.execute ();

//Write outputuchar4* out = out.data();

InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

Output image Crop of outputimage

Crop of outputimage with offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Output Image Crop of output image

Crop of output imagewith offset

OutputImage

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

C DG H

A B C DE F G H

A BE F

O PK LG HC D

M N O PI J K LE F G HA B C D

M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A

M N O PI J K LE F G HA B C D

P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A

M N O PI J K LE F G HA B C D

P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q

M N O PI J K LE F G HA B C D

Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling mode

Mask convolution mask

Image and boundary Image crop Image crop withoffset

Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Image andBoundary Image crop Image crop

with offset Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9

Page 37: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Example: Laplacian Operator

// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },

{ 1, -4, 1 },{ 0, 1, 0 } };

// read inputuchar4 *image_bits = readImage ();

Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);

// load image datain = image_bits;

// Mask (Stencil) of local operatorMask <int > mask(coef);

// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);

// output imageIterationSpace <uchar4 > iter(out);

// define kernelLaplacian filter(iter , acc , mask);

// execute kernelfilter.execute ();

//Write outputuchar4* out = out.data();

InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

Output image Crop of outputimage

Crop of outputimage with offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Output Image Crop of output image

Crop of output imagewith offset

OutputImage

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

C DG H

A B C DE F G H

A BE F

O PK LG HC D

M N O PI J K LE F G HA B C D

M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A

M N O PI J K LE F G HA B C D

P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A

M N O PI J K LE F G HA B C D

P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q

M N O PI J K LE F G HA B C D

Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling mode

Mask convolution mask

Image and boundary Image crop Image crop withoffset

Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Image andBoundary Image crop Image crop

with offset Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9

Page 38: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Example: Laplacian Operator

// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },

{ 1, -4, 1 },{ 0, 1, 0 } };

// read inputuchar4 *image_bits = readImage ();

Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);

// load image datain = image_bits;

// Mask (Stencil) of local operatorMask <int > mask(coef);

// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);

// output imageIterationSpace <uchar4 > iter(out);

// define kernelLaplacian filter(iter , acc , mask);

// execute kernelfilter.execute ();

//Write outputuchar4* out = out.data();

InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

Output image Crop of outputimage

Crop of outputimage with offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Output Image Crop of output image

Crop of output imagewith offset

OutputImage

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

C DG H

A B C DE F G H

A BE F

O PK LG HC D

M N O PI J K LE F G HA B C D

M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A

M N O PI J K LE F G HA B C D

P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A

M N O PI J K LE F G HA B C D

P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q

M N O PI J K LE F G HA B C D

Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling mode

Mask convolution mask

Image and boundary Image crop Image crop withoffset

Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Image andBoundary Image crop Image crop

with offset Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9

Page 39: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Example: Laplacian Operator

// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },

{ 1, -4, 1 },{ 0, 1, 0 } };

// read inputuchar4 *image_bits = readImage ();

Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);

// load image datain = image_bits;

// Mask (Stencil) of local operatorMask <int > mask(coef);

// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);

// output imageIterationSpace <uchar4 > iter(out);

// define kernelLaplacian filter(iter , acc , mask);

// execute kernelfilter.execute ();

//Write outputuchar4* out = out.data();

InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

Output image Crop of outputimage

Crop of outputimage with offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Output Image Crop of output image

Crop of output imagewith offset

OutputImage

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

C DG H

A B C DE F G H

A BE F

O PK LG HC D

M N O PI J K LE F G HA B C D

M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A

M N O PI J K LE F G HA B C D

P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A

M N O PI J K LE F G HA B C D

P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q

M N O PI J K LE F G HA B C D

Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Repeat Clamp Mirror Constant

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling mode

Mask convolution mask

Image and boundary Image crop Image crop withoffset

Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Image andBoundary Image crop Image crop

with offset Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9

Page 40: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Example: Laplacian Operator

// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },

{ 1, -4, 1 },{ 0, 1, 0 } };

// read inputuchar4 *image_bits = readImage ();

Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);

// load image datain = image_bits;

// Mask (Stencil) of local operatorMask <int > mask(coef);

// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);

// output imageIterationSpace <uchar4 > iter(out);

// define kernelLaplacian filter(iter , acc , mask);

// execute kernelfilter.execute ();

//Write outputuchar4* out = out.data();

InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

Output image Crop of outputimage

Crop of outputimage with offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Output Image Crop of output image

Crop of output imagewith offset

OutputImage

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

C DG H

A B C DE F G H

A BE F

O PK LG HC D

M N O PI J K LE F G HA B C D

M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A

M N O PI J K LE F G HA B C D

P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A

M N O PI J K LE F G HA B C D

P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q

M N O PI J K LE F G HA B C D

Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling mode

Mask convolution mask

Image and boundary Image crop Image crop withoffset

Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Image andBoundary Image crop Image crop

with offset Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9

Page 41: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Example: Laplacian Operator

// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },

{ 1, -4, 1 },{ 0, 1, 0 } };

// read inputuchar4 *image_bits = readImage ();

Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);

// load image datain = image_bits;

// Mask (Stencil) of local operatorMask <int > mask(coef);

// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);

// output imageIterationSpace <uchar4 > iter(out);

// define kernelLaplacian filter(iter , acc , mask);

// execute kernelfilter.execute ();

//Write outputuchar4* out = out.data();

InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

Output image Crop of outputimage

Crop of outputimage with offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Output Image Crop of output image

Crop of output imagewith offset

OutputImage

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

C DG H

A B C DE F G H

A BE F

O PK LG HC D

M N O PI J K LE F G HA B C D

M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A

M N O PI J K LE F G HA B C D

P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A

M N O PI J K LE F G HA B C D

P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q

M N O PI J K LE F G HA B C D

Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling mode

Mask convolution mask

Image and boundary Image crop Image crop withoffset

Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Image andBoundary Image crop Image crop

with offset Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9

Page 42: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Example: Laplacian Operator

// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },

{ 1, -4, 1 },{ 0, 1, 0 } };

// read inputuchar4 *image_bits = readImage ();

Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);

// load image datain = image_bits;

// Mask (Stencil) of local operatorMask <int > mask(coef);

// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);

// output imageIterationSpace <uchar4 > iter(out);

// define kernelLaplacian filter(iter , acc , mask);

// execute kernelfilter.execute ();

//Write outputuchar4* out = out.data();

InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

Output image Crop of outputimage

Crop of outputimage with offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Output Image Crop of output image

Crop of output imagewith offset

OutputImage

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

C DG H

A B C DE F G H

A BE F

O PK LG HC D

M N O PI J K LE F G HA B C D

M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A

M N O PI J K LE F G HA B C D

P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A

M N O PI J K LE F G HA B C D

P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q

M N O PI J K LE F G HA B C D

Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling mode

Mask convolution mask

Image and boundary Image crop Image crop withoffset

Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Image andBoundary Image crop Image crop

with offset Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9

Page 43: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Example: Laplacian Operator

// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },

{ 1, -4, 1 },{ 0, 1, 0 } };

// read inputuchar4 *image_bits = readImage ();

Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);

// load image datain = image_bits;

// Mask (Stencil) of local operatorMask <int > mask(coef);

// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);

// output imageIterationSpace <uchar4 > iter(out);

// define kernelLaplacian filter(iter , acc , mask);

// execute kernelfilter.execute ();

//Write outputuchar4* out = out.data();

InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

Output image Crop of outputimage

Crop of outputimage with offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Output Image Crop of output image

Crop of output imagewith offset

OutputImage

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

C DG H

A B C DE F G H

A BE F

O PK LG HC D

M N O PI J K LE F G HA B C D

M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A

M N O PI J K LE F G HA B C D

P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A

M N O PI J K LE F G HA B C D

P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q

M N O PI J K LE F G HA B C D

Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling mode

Mask convolution mask

Image and boundary Image crop Image crop withoffset

Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Image andBoundary Image crop Image crop

with offset Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9

Page 44: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Example: Laplacian Operator

// coefficients for Laplacian operatorconst int coef [3][3] = { { 0, 1, 0 },

{ 1, -4, 1 },{ 0, 1, 0 } };

// read inputuchar4 *image_bits = readImage ();

Image <uchar4 > in(width , height , inputImage);Image <uchar4 > out(width , height);

// load image datain = image_bits;

// Mask (Stencil) of local operatorMask <int > mask(coef);

// reading from in with mirroring as boundary conditionBoundaryCondition <uchar4 > bound(in, mask , BOUNDARY_MIRROR);Accessor <uchar4 > acc(bound);

// output imageIterationSpace <uchar4 > iter(out);

// define kernelLaplacian filter(iter , acc , mask);

// execute kernelfilter.execute ();

//Write outputuchar4* out = out.data();

InputImage

BoundaryCondition

Accessor

KERNEL

IterationSpace

Mask

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

Output image Crop of outputimage

Crop of outputimage with offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Output Image Crop of output image

Crop of output imagewith offset

OutputImage

HIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling modes

Mask convolution mask

C DG H

A B C DE F G H

A BE F

O PK LG HC D

M N O PI J K LE F G HA B C D

M NI JE FA B

O PK L

M N O PI J K L

M NI J

Repeat

M MM M

M N O PM N O P

P PP P

M MI IE EA A

M N O PI J K LE F G HA B C D

P PL LH HD D

A AA A

A B C DA B C D

D DD D

Clamp

I MJ N

M N O PI J K L

P LO K

N MJ IF EB A

M N O PI J K LE F G HA B C D

P OL KH GD C

E AF B

A B C DE F G H

D HC G

Mirror

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Q QQ QQ QQ Q

M N O PI J K LE F G HA B C D

Q QQ QQ QQ Q

Q QQ Q

Q Q Q QQ Q Q Q

Q QQ Q

Constant

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Repeat Clamp Mirror ConstantHIPAcc: The Heterogeneous Image Processing AccelerationFramework

Domain-specific Extensions

IterationSpace defines ROI of the output image

Accessor input ROI with filtering

BoundaryCondition boundary handling mode

Mask convolution mask

Image and boundary Image crop Image crop withoffset

Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 5

Image andBoundary Image crop Image crop

with offset Image offset

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 9

Page 45: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Example: Laplacian Operator Kernel

class Laplacian : public Kernel <uchar4 > {private:

Accessor <uchar4 > &input;Mask <int > &mask;

public:Laplacian(IterationSpace <uchar4 > &iter ,

Accessor <uchar4 > &input , Mask <int > &mask): Kernel(iter), input(input), mask(mask) {

addAccessor (& input);}

void kernel () {int4 sum = convolve(mask , HipaccSUM , [&] () -> int4 {

return mask() * convert_int4(input(mask));});

sum = max(sum , 0);sum = min(sum , 255);output () = convert_uchar4(sum);

}};

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 10

Page 46: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Example: Laplacian Operator Kernel

class Laplacian : public Kernel <uchar4 > {private:

Accessor <uchar4 > &input;Mask <int > &mask;

public:Laplacian(IterationSpace <uchar4 > &iter ,

Accessor <uchar4 > &input , Mask <int > &mask): Kernel(iter), input(input), mask(mask) {

addAccessor (&input);}

void kernel () {int4 sum = convolve(mask , HipaccSUM , [&] () -> int4 {

return mask() * convert_int4(input(mask));});

sum = max(sum , 0);sum = min(sum , 255);output () = convert_uchar4(sum);

}};

Convolution call

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 10

Page 47: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Extensions to HIPAcc Framework

Page 48: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Arbitrary Bit Width Hardware Design

- The OpenCL standard contains primitive data types only, e.g., char/short/int

+ AOC supports arbitrary bit width declaration

- AOC arbitrary bit width declaration is masking

- Compound assignment operations need to be expanded

- Unary operations need to be expanded

Utilizing exact bit widths is tedious in Altera OpenCL 14.1!

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 11

Page 49: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Arbitrary Bit Width Hardware Design

- The OpenCL standard contains primitive data types only, e.g., char/short/int

+ AOC supports arbitrary bit width declaration

- AOC arbitrary bit width declaration is masking

x = (RHS) & 0x7;

- Compound assignment operations need to be expanded

- Unary operations need to be expanded

Utilizing exact bit widths is tedious in Altera OpenCL 14.1!

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 11

Page 50: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Arbitrary Bit Width Hardware Design

- The OpenCL standard contains primitive data types only, e.g., char/short/int

+ AOC supports arbitrary bit width declaration

- AOC arbitrary bit width declaration is masking

x = (RHS) & 0x7;

- Compound assignment operations need to be expanded

// x += (RHS) & 0x7;x = ((x & 0x7) + RHS) & 0x7;

- Unary operations need to be expanded

// x++;x = ((x & 0x7) + 1) & 0x7;

Utilizing exact bit widths is tedious in Altera OpenCL 14.1!

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 11

Page 51: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Arbitrary Bit Width Hardware Design

- The OpenCL standard contains primitive data types only, e.g., char/short/int

+ AOC supports arbitrary bit width declaration

- AOC arbitrary bit width declaration is masking

x = (RHS) & 0x7;

- Compound assignment operations need to be expanded

// x += (RHS) & 0x7;x = ((x & 0x7) + RHS) & 0x7;

- Unary operations need to be expanded

// x++;x = ((x & 0x7) + 1) & 0x7;

Utilizing exact bit widths is tedious in Altera OpenCL 14.1!

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 11

Page 52: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Automatic Bit Width Reduction: An extension to HIPAcc

Listing 5: Gaussian blur in HIPAcc with annotated bit widths

#pragma hipacc bw(sum ,12)uint sum = 0;#pragma hipacc bw(x,2)uint x = 0;#pragma hipacc bw(y,2)uint y = 0;for (y = 0; y < size; ++y) {

for (x = 0; x < size; ++x) {sum += mask[y][x] * Input(x-1,y-1);

}}sum /= 16;output () = sum;

↓Listing 6: Gaussian blur in HIPAcc with annotated bit widths

uint sum = 0;uint x = 0;uint y = 0;for (y = ((0) & 3); ((y) & 3) < size; y = ((((y) & 3) + 1) & 3)){

for (x = ((0) & 3); ((x) & 3) < size; x = ((((x) & 3) + 1) & 3)){sum = ((((sum) & 4095) + mask[y][x]

* getWindowAt(Input , 1 + ((x) & 3) - 1, 1 + ((y) & 3) -1)) & 4095);}

}sum = ((((sum) & 4095) / 16) & 4095);return sum;

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 12

Page 53: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Automatic Bit Width Reduction: An extension to HIPAcc

Listing 5: Gaussian blur in HIPAcc with annotated bit widths

#pragma hipacc bw(sum ,12)uint sum = 0;#pragma hipacc bw(x,2)uint x = 0;#pragma hipacc bw(y,2)uint y = 0;for (y = 0; y < size; ++y) {

for (x = 0; x < size; ++x) {sum += mask[y][x] * Input(x-1,y-1);

}}sum /= 16;output () = sum; ↓

Listing 6: Gaussian blur in HIPAcc with annotated bit widths

uint sum = 0;uint x = 0;uint y = 0;for (y = ((0) & 3); ((y) & 3) < size; y = ((((y) & 3) + 1) & 3)){

for (x = ((0) & 3); ((x) & 3) < size; x = ((((x) & 3) + 1) & 3)){sum = ((((sum) & 4095) + mask[y][x]

* getWindowAt(Input , 1 + ((x) & 3) - 1, 1 + ((y) & 3) -1)) & 4095);}

}sum = ((((sum) & 4095) / 16) & 4095);return sum;

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 12

Page 54: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Evaluation and Results

Page 55: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

HIPAcc vs Handwritten Examples provided by Altera

Edge-DetectorAltera

Edge-DetectorHIPAcc

Lukas KanadeAltera

Lukas KanadeHIPAcc

0

10

20

30

40

50

13.5515.04

18.05

25.63

19.59 19.21

22.59

27.62

7.03

1.17

18.36

14.45

Har

dwar

eU

tiliz

atio

n[%

]M20K Logic DSP

0

100

200

300

Clo

ckFr

eque

ncy

[MH

z]

Fmax

Figure: HIPAcc vs Altera handwritten examples for 1024×1024 image size.

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 13

Page 56: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

FPGA vs GPU

100 101 102 103 104

Gaussian

Laplacian

Sobel

Bilateral

Harris Corner

Optical Flow

Throughput [MPixel/s]

Tegra K1 Tesla K20 Stratix V

Figure: Comparison of throughput for the Nvidia Tegra K1, Nvidia Tesla K20, and Altera Stratix V.

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 14

Page 57: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Conclusion

Page 58: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Conclusion

Advantages of DSL-based Approach

Productivity – compact algorithm description– less error-prone

Performance – efficient target-specific code generation

Portability – flexible target choice– performance portability, not just functional portability

HIPAcc DSL code serves as baseline implementation⇒ Test bench

Develop the algorithm first, decide the target architecture afterwards!

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 15

Page 59: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Conclusion

Advantages of DSL-based Approach

Productivity – compact algorithm description– less error-prone

Performance – efficient target-specific code generation

Portability – flexible target choice– performance portability, not just functional portability

HIPAcc DSL code serves as baseline implementation⇒ Test bench

Develop the algorithm first, decide the target architecture afterwards!

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 15

Page 60: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Questions?Thanks for listening.

Any questions?

C++

embedded DSL

Source-to-SourceCompiler

Clang/LLVM

Domain

Knowledge

Architecture

Knowledge

CUDA(GPU)

OpenCL(x86/GPU)

C/C++

(x86)

Renderscript(x86/ARM/GPU)

OpenCL(ALTERA FPGA)

Vivado C++

(FPGA)

CUDA/OpenCL/Renderscript Runtime LibraryCUDA/OpenCL/Renderscript Runtime Library AOCL Vivado HLS

http://github.com/hipacc/hipacc-fpga

Title FPGA-Based Accelerator Design from a Domain-Specific Language

Speaker M. Akif Özkan, [email protected]

Page 61: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Backup Slides

Page 62: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Additional Results

Page 63: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Boundary Handling: SW Manner vs HW Manner

UNDEF-SW CONST-SW CLAMP-SW MIRROR-SW0

10

20

30

40

50

60

Dataset A

12.85 13.1 13.1 13.1

16.1517.55

41.11 41.42

2.3 2.3

13.79 13.79

Har

dwar

eU

tiliz

atio

n[%

]

M10K Logic DSP

0

20

40

60

80

100

120

140

160

Clo

ckFr

eque

ncy

[MH

z]

Fmax

(a) Single-Item Line-Buffered Kerneldeveloped in Software Manner

UNDEF-SW CONST-SW CLAMP-SW MIRROR-SW0

10

20

30

40

50

60

Dataset A

12.85 13.1 13.1 13.1

16.18 16.71 17.03 17.21

2.3 2.3 2.3 2.3

Har

dwar

eU

tiliz

atio

n[%

]

M10K Logic DSP

0

20

40

60

80

100

120

140

160

Clo

ckFr

eque

ncy

[MH

z]

Fmax

(b) Single-Item Line-Buffered Kerneldeveloped in Hardware Manner

Figure: Boundary conditions for a 5×5 Gaussian blur on a 1024×1024 image.

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 16

Page 64: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Automatic Bit Width Reduction: Results

• AOC is very successful when inner loop trip counts are known during compiletime and the unroll directive has been specified for synthesis

• Great improvement in clock frequency for dynamic loops

Table: Automatic bit width reduction on local operators of size 3×3 with image size 1024×1024.

Filter Type II ALUTs Registers Logic (%) M10K DSP Freq [MHz]

Sobelnormal 2 8552 12433 19.76 84 4 131.25reduced 2 8230 12098 19.11 84 3 152.09

Gaussiannormal 2 8593 12423 19.86 84 4 132.50reduced 2 8439 11843 19.26 83 3 153.11

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 17

Page 65: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Kernel Vectorization

Page 66: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Kernel Vectorization

Vectorization by loop coarsening [1]: unroll the outer loop by factor vi

• replicate only the kernel: sublinear increase of resource usage• better exploitation of bandwidth: nearly linear speedup

[1] Moritz Schmid et al. “Loop Coarsening in C-based High-Level Synthesis”. In: Proceedings of the 26th IEEE International Conference on Application-specificSystems, Architectures and Processors (ASAP). (Toronto, Canada). IEEE, July 27–29, 2015, pp. 166–173

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 18

Page 67: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Kernel Vectorization

1 2 4 8 160

20

40

60

80

100

120

14.57 14.49 14.49 14.65 14.88

22.77 24.4527.91

34.28

48.93

8.9814.45

25.39

47.27

91.02

Vectorization Factor v

Har

dwar

eU

tiliz

atio

n[%

]M20K Logic DSP

0

100

200

300

400

Clo

ckFr

eque

ncy

[MH

z]

Fmax

Figure: Vectorization of a 3×3 bilateral filter with clamping on an image of size 1024×1024.

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 19

Page 68: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Generating the Streaming Pipeline

Page 69: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Generating the Streaming Pipeline

Trace host code and translate it to internal representation:

• model as combination ofprocesses and spaces

• create unique channelobjects for each space

• identify memory reuseand utilize processesaccordingly

• build dependency graph• traverse in depth-first

search starting fromoutput spaces

1× channel2× channels

Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20

Page 70: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Generating the Streaming Pipeline

Trace host code and translate it to internal representation:

• model as combination ofprocesses and spaces

• create unique channelobjects for each space

• identify memory reuseand utilize processesaccordingly

• build dependency graph• traverse in depth-first

search starting fromoutput spaces

1× channel2× channels

Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20

Page 71: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Generating the Streaming Pipeline

Trace host code and translate it to internal representation:

• model as combination ofprocesses and spaces

• create unique channelobjects for each space

• identify memory reuseand utilize processesaccordingly

• build dependency graph• traverse in depth-first

search starting fromoutput spaces

1× channel2× channels

Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20

Page 72: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Generating the Streaming Pipeline

Trace host code and translate it to internal representation:

• model as combination ofprocesses and spaces

• create unique channelobjects for each space

• identify memory reuseand utilize processesaccordingly

• build dependency graph• traverse in depth-first

search starting fromoutput spaces

1× channel

2× channels

Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20

Page 73: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Generating the Streaming Pipeline

Trace host code and translate it to internal representation:

• model as combination ofprocesses and spaces

• create unique channelobjects for each space

• identify memory reuseand utilize processesaccordingly

• build dependency graph• traverse in depth-first

search starting fromoutput spaces

1× channel2× channels

Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20

Page 74: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Generating the Streaming Pipeline

Trace host code and translate it to internal representation:

• model as combination ofprocesses and spaces

• create unique channelobjects for each space

• identify memory reuseand utilize processesaccordingly

• build dependency graph• traverse in depth-first

search starting fromoutput spaces

1× channel2× channels

Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20

Page 75: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Generating the Streaming Pipeline

Trace host code and translate it to internal representation:

• model as combination ofprocesses and spaces

• create unique channelobjects for each space

• identify memory reuseand utilize processesaccordingly

• build dependency graph• traverse in depth-first

search starting fromoutput spaces

1× channel2× channels

Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20

Page 76: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Generating the Streaming Pipeline

Trace host code and translate it to internal representation:

• model as combination ofprocesses and spaces

• create unique channelobjects for each space

• identify memory reuseand utilize processesaccordingly

• build dependency graph• traverse in depth-first

search starting fromoutput spaces

1× channel

2× channels

Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20

Page 77: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Generating the Streaming Pipeline

Trace host code and translate it to internal representation:

• model as combination ofprocesses and spaces

• create unique channelobjects for each space

• identify memory reuseand utilize processesaccordingly

• build dependency graph• traverse in depth-first

search starting fromoutput spaces

1× channel

2× channels

Kernel

IterSpace

Image

Accessor

Kernel

processprocess

space

process

Accessor Accessor

Kernel Kernel

process1to2

space space

process process

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 20

Page 78: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Altera OpenCL System Level Interfaces

Page 79: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Altera OpenCL System Level Interfaces

Channels Data sharing between kernelsFigure 1-5: Difference in Global Memory Access Pattern as a Result of Channels or PipesImplementation

Global Memory

Kernel 1 Kernel 4Kernel 2 Kernel 3Write

ReadRead

ReadRead

WriteWriteWrite

Global Memory

Kernel 1 Kernel 4Kernel 2 Kernel 3

Read

Write

Channel/Pipe Channel/Pipe

Global Memory Access Pattern Before AOCL Channels or Pipes Implementation

Global Memory Access Pattern After AOCL Channels or Pipes Implementation

Channel/Pipe

For more information on the usage of channels, refer to the Implementing AOCL Channels Extensionsection of the Altera SDK for OpenCL Programming Guide.

For more information on the usage of pipes, refer to the Implementing OpenCL Pipes section of the AlteraSDK for OpenCL Programming Guide.

Related Information

• Implementing AOCL Channels Extension• Implementing OpenCL Pipes

Channels and Pipes CharacteristicsTo implement channels or pipes in your OpenCL kernel program, keep in mind their respectivecharacteristics that are specific to the Altera SDK for OpenCL (AOCL).

Default Behavior

The default behavior of channels is blocking. The default behavior of pipes is nonblocking.

Concurrent Execution of Multiple OpenCL Kernels

You can execute multiple OpenCL kernels concurrently. To enable concurrent execution, modify the hostcode to instantiate multiple command queues. Each concurrently executing kernel is associated with aseparate command queue.Important: Pipe-specific considerations:

The OpenCL pipe modifications outlined in Ensuring Compatibility with Other OpenCLSDKs in the Altera SDK for OpenCL Programming Guide allow you to run your kernel on theAOCL. However, they do not maximize the kernel throughput. The OpenCL Specificationversion 2.0 requires that pipe writes occur before pipe reads so that the kernel is not reading

1-8 Channels and Pipes CharacteristicsOCL003-15.0.0

2015.05.04

Altera Corporation Altera SDK for OpenCL Best Practices Guide

Send Feedback

Altera, SDK. for OpenCL: Best Practices Guide, Altera, 2016.

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 21

Page 80: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Altera OpenCL System Level Interfaces

Channels Data sharing between kernels

Table: Impact of channels on line-buffered single-work item implementations.

Kernel Type II ALUTs Registers LU (%) BRAM DSP Freq [MHz]

Copy Kernel single 1 6821 7812 14.04 43 0 152.89Mean Filter single 1 8248 9415 16.78 51 0 151.52

2 Mean Filter separate 1 12104 15132 25.36 100 0 145.78

2 Mean Filter channel 1 10439 12466 21.47 60 0 154.142 Mean Filter fused 1 9480 11715 19.47 64 0 122.473 Mean Filter channel 1 12829 15451 26.56 67 0 145.573 Mean Filter fused 1 10899 13651 22.24 79 0 125.00

Luma+Mean single 1 8244 9694 16.84 66 2 132.56Luma+Mean channel 1 11281 13653 23.75 64 3 148.75

Luma+Mean fused 1 9635 11846 20.09 72 2 123.53

Harris corner channel 1 43127 54334 89.87 186 9 142.52Harris corner fused 1 33263 43319 70.87 177 9 107.35

- Channels use considerable amount of logic!

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 21

Page 81: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Altera OpenCL System Level Interfaces

Channels Data sharing between kernels

Table: Impact of channels on line-buffered single-work item implementations.

Kernel Type II ALUTs Registers LU (%) BRAM DSP Freq [MHz]

Copy Kernel single 1 6821 7812 14.04 43 0 152.89Mean Filter single 1 8248 9415 16.78 51 0 151.522 Mean Filter separate 1 12104 15132 25.36 100 0 145.78

2 Mean Filter channel 1 10439 12466 21.47 60 0 154.14

2 Mean Filter fused 1 9480 11715 19.47 64 0 122.473 Mean Filter channel 1 12829 15451 26.56 67 0 145.573 Mean Filter fused 1 10899 13651 22.24 79 0 125.00

Luma+Mean single 1 8244 9694 16.84 66 2 132.56Luma+Mean channel 1 11281 13653 23.75 64 3 148.75

Luma+Mean fused 1 9635 11846 20.09 72 2 123.53

Harris corner channel 1 43127 54334 89.87 186 9 142.52Harris corner fused 1 33263 43319 70.87 177 9 107.35

- Channels use considerable amount of logic!

+ Channels are cheaper than kernel IOs.

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 21

Page 82: FPGA-Based Accelerator Design from a Domain-Specific Language · FPGAs High throughput, great energy efficiency OpenCL Portable and open programming model for heterogeneous ... High

Altera OpenCL System Level Interfaces

Channels Data sharing between kernels

Table: Impact of channels on line-buffered single-work item implementations.

Kernel Type II ALUTs Registers LU (%) BRAM DSP Freq [MHz]

Copy Kernel single 1 6821 7812 14.04 43 0 152.89Mean Filter single 1 8248 9415 16.78 51 0 151.522 Mean Filter separate 1 12104 15132 25.36 100 0 145.78

2 Mean Filter channel 1 10439 12466 21.47 60 0 154.142 Mean Filter fused 1 9480 11715 19.47 64 0 122.473 Mean Filter channel 1 12829 15451 26.56 67 0 145.573 Mean Filter fused 1 10899 13651 22.24 79 0 125.00

Luma+Mean single 1 8244 9694 16.84 66 2 132.56Luma+Mean channel 1 11281 13653 23.75 64 3 148.75Luma+Mean fused 1 9635 11846 20.09 72 2 123.53

Harris corner channel 1 43127 54334 89.87 186 9 142.52Harris corner fused 1 33263 43319 70.87 177 9 107.35

- Channels use considerable amount of logic!

+ Channels are cheaper than kernel IOs.

- Smaller the size of kernel body, faster the clock frequency.

M. Akif Özkan | Hardware/Software Co-Design | FPGA-Based Accelerator Design from a Domain-Specific Language FPL’16 21