FPGA computing @ SISSA

Roberto Innocente [email protected], 2016-10-19

Source: people.sissa.it/~inno/pubs/FPGA-computing.pdf (2016-12-06)


“Begin at the beginning ...” Lewis Carroll, Alice in Wonderland

Table of Contents

1. Project history

2. What is an FPGA ?

3. INTEL/Altera Arria 10

4. 7 Dwarfs

5. Arithmetic Intensity(AI)

6. Roofline Model

7. CUDA/OpenCL

8. Actual Performance

9. OpenCL for FPGA

10. Getting the most

11. Schematics

12. HDL for FPGA

13. Spatial Computing (SC)

14. What next ?

15. Competitors

16. Can I use it ?

FPGA Computing project

● Project proposed at the beginning of 2014:
– http://people.sissa.it/~inno/pubs/reconfig-computing-16-9-tris.pdf

● Nallatech board with Arria 10 FPGA ordered at the end of April (2016-04-22)

● Nallatech board arrived 2016-06-24

● Troubles with software licenses solved in mid-August (2016-08-14)

II. What is an FPGA ?


What is an FPGA ?

● FPGA (acronym of Field Programmable Gate Array ) is a misnomer (gates in digital electronics are very simple circuits like: and, or, not, xor,..)

● It is in fact an array of Configurable Logic Blocks (CLB : 6/7/8 inputs, output can be any boolean function over them or 2/3 subsets of them)

● A “blank slate” in which you have to program both the functions that the Logic Blocks perform and the interconnections between them

● Today some of the logic blocks are specialized to be more efficient (memory blocks, DSP(1) blocks, I/O blocks, ...)

1) DSP = Digital Signal Processor (multiplier/adder)

Array of Configurable Logic Blocks

Picture from National Instruments

Scalar Product on an FPGA

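[Figure: data flow graph of the scalar product of two 4-element vectors. Four multipliers compute x[i]*y[i] for i = 0..3; two adders sum the products pairwise; a final adder produces the result.]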

x . y = Sum x[i]*y[i]

DFG = Data Flow Graph

While with other architectures you need to adapt your program to the architecture, with FPGA you adapt the architecture to your program.

A new result every cycle, each after 7 flops (4 multiplications + 3 additions)
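A minimal Python sketch of the data flow graph above, added for illustration and not part of the original slides. It evaluates the scalar product the way the graph does (one multiplier stage, then a binary tree of adders) and counts the floating point operations. The function name `scalar_product_dfg` is made up here; the code only mimics the structure of the graph, not the per-cycle pipelining of the hardware.

```python
def scalar_product_dfg(x, y):
    """Evaluate x . y as a data flow graph: one multiplier per element
    pair, then a binary tree of adders. Returns (result, flop count)."""
    level = [a * b for a, b in zip(x, y)]      # multiplier stage
    flops = len(level)
    while len(level) > 1:                      # one adder stage per pass
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                     # an odd element passes through
            nxt.append(level[-1])
        flops += len(level) // 2
        level = nxt
    return level[0], flops


result, flops = scalar_product_dfg([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
print(result, flops)
```

For 4-element vectors the count is 4 multiplications plus 3 additions, which is the "7 flops" of the slide.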

III. The FPGA of our tests


INTEL/Altera Arria 10

● This was the first FPGA on the market to offer native floating point multiply/add in its DSPs.

● That's the reason why we bought it.

● Of course, on the other large FPGAs you can also implement floating point ops if you want, using the IP cores offered by vendors.

NB. IP core: a function implemented in schematics or in an HDL that is not free but proprietary (IP = Intellectual Property)

INTEL (Altera) Arria 10

● INTEL Arria 10 GX1150:
– Logic Elements 1,150 K
– ALMs 427,200
– Registers 1,708,800
– M20K memory blocks 2,713
– DSP 1,518 (integer and float SP)

● Back-of-the-envelope calculation: each DSP can output a single-precision fused multiply-add per cycle
– 2 × 1518 = 3036 flops/cycle × 0.5 GHz ≈ 1500 Gflop/s

INTEL bought Altera in 2015 and they are now re-branding everything.

To avoid short-term obsolescence I will call it INTEL FPGA
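The back-of-the-envelope peak can be written as a one-line function. This is only a sketch of the arithmetic on the slide; the function name `peak_gflops` is hypothetical, and the 0.5 GHz clock is the figure assumed in the slide, not a measured value.

```python
def peak_gflops(n_dsp, flops_per_dsp_per_cycle, clock_ghz):
    """Peak throughput in Gflop/s if every DSP block issues every cycle."""
    return n_dsp * flops_per_dsp_per_cycle * clock_ghz


# Arria 10 GX1150: 1518 DSP blocks, 1 fused multiply-add (2 flops) per cycle
arria10_peak = peak_gflops(1518, 2, 0.5)
print(arria10_peak)
```

The exact product is 1518 Gflop/s, which the slides round to 1.5 Tflop/s.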

From INTEL/Altera docs

IV. How to measure performance of new architectures ?

The “Seven dwarfs”

At the dawn of the new many-core and heterogeneous computer architectures, Phil Colella of LBL wrote the presentation "Defining Software Requirements for Scientific Computing", in which he claimed that all new architectures should measure themselves against seven computational kernels common across every branch of scientific computing.

These computational kernels were later nicknamed the Seven Dwarfs because, as in the Snow White fairy tale, they should be mining for gold in new computer architectures.

“A dwarf is an algorithmic method that captures a pattern of computation and communication.”

http://view.eecs.berkeley.edu/wiki/Dwarf_Mine

Over time the dwarfs grew to 13.

High-end simulation in the physical sciences consists of seven algorithms:

• Structured Grids (stencils, including locally structured grids, e.g. AMR)
• Unstructured Grids
• Fast Fourier Transform
• Dense Linear Algebra
• Sparse Linear Algebra
• Particles
• Monte Carlo

Phil Colella, 2004 (LBL)

V. Arithmetic Intensity AI


Arithmetic/Computational Intensity (AI)

$AI = \frac{\text{FLOPS}}{\text{bytes transferred from/to off-chip memory}}$

Kernel                                                         AI single precision    AI double precision
Vector addition z[i] = x[i] + y[i]                             1/12  = 0.083          1/24
Scalar product Σ a[i] * b[i]                                   1/4   = 0.250          1/8
Vector magnitude Σ a[i] * a[i]                                 1/2   = 0.500          1/4
SAXPY α∗x + y                                                  1/6   = 0.167          1/12
Stencil 4 neighbors C[i,j] = a*A[i,j]+b*(A[i-1,j]+A[i+1,j]...) 5/24  = 0.208          5/48
Matrix multiply C[i,j] = Σk A[k,i] * B[k,j]                    1/4   = 0.250          1/8
FFT 1d                                                         0.9 * log(N) = 7.48 for N = 4096
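The first rows of the table follow directly from counting flops and memory words per element. The sketch below is an illustration, not from the original slides: `arithmetic_intensity` is a made-up helper, and the (flops, words) pairs are the per-element counts assumed here, with no data reuse and 4-byte (single precision) or 8-byte (double precision) words.

```python
from fractions import Fraction


def arithmetic_intensity(flops, words_moved, bytes_per_word=4):
    """Flops per byte moved to/from off-chip memory, as an exact fraction."""
    return Fraction(flops, words_moved * bytes_per_word)


# (flops per element, words read + written per element)
kernels = {
    "vector addition":  (1, 3),   # 1 add; read x[i], y[i]; write z[i]
    "scalar product":   (2, 2),   # 1 mul + 1 add; read a[i], b[i]
    "vector magnitude": (2, 1),   # 1 mul + 1 add; read a[i]
    "SAXPY":            (2, 3),   # 1 mul + 1 add; read x[i], y[i]; write y[i]
}

for name, (flops, words) in kernels.items():
    sp = arithmetic_intensity(flops, words, 4)
    dp = arithmetic_intensity(flops, words, 8)
    print(f"{name:18s} SP {sp} ({float(sp):.3f})  DP {dp}")
```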

AI Arithmetic Intensity

From https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/


VI. The Roofline Model (RM)


The Roofline Model

S. Williams, A. Waterman, D. Patterson (2009-04-01). "Roofline: An Insightful Visual Performance Model for Multicore Architectures". http://doi.acm.org/10.1145/1498765.1498785

● It is an intuitive visual performance model that provides estimates of performance

● It is based on two ceilings:
– Peak flop performance of the architecture
– Maximum throughput of off-chip memory

Arria10 Roofline Model

[Figure: Roofline model of the Arria 10, log-log plot. Horizontal axis: arithmetic intensity (AI), flops/byte transferred, from 0.01 to 1000. Vertical axis: attainable Gflop/s, from 0.01 to 1000. Limits: I/O bandwidth 6 GB/s, peak flops 1.5 Tflops. Markers for vector add, scalar product, SAXPY, vector magnitude and stencil with 4 neighbors.]

Break-even point @ AI = 250

Theoretical peak: (x < 250) ? 6*x : 1500
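The two ceilings of the roofline can be sketched in a few lines of Python. This is an illustration added here, using the limits quoted for the Arria 10 board (6 GB/s I/O bandwidth, 1500 Gflop/s peak) as default parameters; the function names are hypothetical.

```python
def attainable_gflops(ai, bandwidth_gb_s=6.0, peak_gflops=1500.0):
    """Roofline: the lower of the memory ceiling and the compute ceiling."""
    return min(peak_gflops, bandwidth_gb_s * ai)


def break_even_ai(bandwidth_gb_s=6.0, peak_gflops=1500.0):
    """Arithmetic intensity at which the two ceilings meet."""
    return peak_gflops / bandwidth_gb_s


print(break_even_ai())             # intensity needed to reach the peak
print(attainable_gflops(1 / 12))   # vector addition, single precision
print(attainable_gflops(1000))     # compute-bound kernel
```

Any kernel with AI below the break-even point is limited by the off-board bandwidth, no matter how many DSP blocks it uses.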

VII. CUDA/OpenCL


Data parallelism / Task parallelism

From www.fixtars.com

[Figure: data parallelism versus task parallelism. Data parallel: the same TASK is applied to the index ranges 0..¼N, ¼N..½N, ½N..¾N and ¾N..N, followed by SUM steps. Task parallel: different TASKs run side by side.]

The rise of the CUDA/OpenCL model

● In the middle of the past decade it became clear that Moore's law could be kept up only through parallelism. Many-core and heterogeneous computers appeared: GPUs, FPGAs, CPUs, DSPs

● GPUs with hundreds and then thousands of simple cores (forthcoming NVIDIA Pascal: ~ 3,800 [available from 2017])

● Data parallelism, unlike task parallelism, can be supported with a simple model: a compute pattern (kernel) instantiated on every core with a different set of indices.
– a[i]+b[i] (vector addition kernel)
– Σk a[i,k] * b[k,j] (matrix multiplication kernel)
– 1/2/3-dimensional NDRange / grid

● Each instantiation (work-item/thread) is provided with different parameters through a function call (e.g. get_global_id(); in fact the core computes displacements by itself, knowing its work-group (wg) and work-item (wi) numbers)
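The kernel-per-index model can be emulated sequentially in plain Python. This is only a sketch to show the idea: `launch_1d` is a made-up stand-in for a 1-dimensional NDRange launch, and the `gid` argument plays the role of the value a real kernel would obtain from get_global_id(0).

```python
def vecadd_kernel(gid, a, b, c):
    """Kernel body: executed once per work-item, selected by its global id."""
    c[gid] = a[gid] + b[gid]


def launch_1d(kernel, global_size, *args):
    """Sequential stand-in for a 1-dimensional NDRange launch."""
    for gid in range(global_size):
        kernel(gid, *args)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
c = [0.0] * len(a)
launch_1d(vecadd_kernel, len(c), a, b, c)
print(c)
```

On a real device the iterations of the loop in `launch_1d` run in parallel; the kernel body itself does not change.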

NVIDIA Pascal / GP100

● GP100 (device)

● SM (compute unit): 2 x vector processors, 32-wide SIMD (because there is only 1 PC per warp)

OpenCL for FPGAs

● There is a compiler front end (UIUC LLVM) for the HDL Place&Route (PAR) package (in the INTEL/Altera case Quartus Pro)

● For FPGAs the compilers are all offline compilers. Why ?
– It takes many hours or days of CPU time to synthesize a complete project
– Forget about the Apple/NVIDIA examples in which the OpenCL code is a string inside the host C++ program.
– INTEL/Altera say you need 32 GB of main memory, but in fact I have seen the compilation processes use 40/50 GB many times (so 64 GB is a better size).

● aoc: INTEL/Altera Offline Compiler:
– aoc krnl.cl -o krnl.aocx

[Figure: two compilation paths. Host code path: host source code (.c or .cpp) → host compiler → host binary. Kernel code path: kernel source code (.cl) → AOC (INTEL/Altera Offline Compiler) → FPGA binary (.aocx). The host application is then executed on the host.]

FPGA/OpenCL

● OpenCL was born for different computer architectures and doesn't capture all the possibilities FPGAs can offer.

● Anyway, OpenCL for FPGA seems a mature product that offers a big step up in easily obtainable FPGA performance.

VIII. Actual performance


Results Reported

● All the results reported here were obtained using the INTEL/Altera OpenCL compiler 16.0.0 Build 211 and the same version of Quartus Pro

● In a future report I will discuss Verilog results.

Vector Addition

● z[i] = x[i] + y[i]

● Computational intensity is very low

● The limit then comes from I/O:
– 6 GB/s × 1/12 flop/byte = 0.5 Gflop/s

 ./vector_add 

Initializing OpenCL

Platform: Altera SDK for OpenCL

Performance on CPU 1 core of intel i7 : 

   Processing time on CPU   = 1.1313ms

   Mflops/s 883.948201

Launching for device 0 (1000000 elements)

Performance on FPGA :

   Processing time on FPGA  = 6.5348ms

   Mflop/s on FPGA= 153.027972

   Time: 6.535 ms

   Kernel time (device 0): 3.668 ms

AI = 1/12
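The Mflop/s figures in the program output can be re-derived from the element count and the processing times. The sketch below is an added illustration; `mflops` is a hypothetical helper, and the I/O bound uses the 6 GB/s and AI = 1/12 values from the slide.

```python
def mflops(n_flops, time_ms):
    """Throughput in Mflop/s for n_flops operations done in time_ms."""
    return n_flops / (time_ms * 1e-3) / 1e6


n = 1_000_000                      # one addition per element
fpga = mflops(n, 6.5348)           # processing time on FPGA, from the log
cpu = mflops(n, 1.1313)            # processing time on CPU, from the log
io_bound = 6e9 * (1 / 12) / 1e6    # roofline limit in Mflop/s

print(f"FPGA {fpga:.1f}  CPU {cpu:.1f}  I/O bound {io_bound:.1f} Mflop/s")
```

The measured FPGA value stays below the 500 Mflop/s bound set by the off-board bandwidth, as the roofline model predicts for such a low arithmetic intensity.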

Stencil code

From PDE: substitute derivatives with discrete approximations using a symbolic algebra package

Difference equation

Stencil code: code that updates a point using the values of its neighbor points

3D stencil of order 8

Stencil code/2

● 504³ = 0.128 G points in the lattice

● 5 time steps:
– 0.128 * 5 = 0.640 G points processed

● 321 ms:
– 0.640 / 0.321 = 1.993 Gpoints/s processed

● 24 neighbors + 1 = 25 points * 2 ops =
– 50 ops per point

● 1.993 Gpoints/s * 50 flops =
– 99 Gflop/s on FPGA

● On a single core of an Intel i7 CPU:
– 0.85 Gflop/s

● Arithmetic intensity: AI = 1/2

 

$ ./stencil 

Volume size: 504 x 504 x 504

order­8 stencil computation for 5 time steps

Performance on FPGA :

      Processing time : 321 ms

      Throughput = 1.9897 Gpoints / sec

      Gflops per second  99.486999

Performance on CPU intel i7 1 core : 

      Processing time on cpu = 37524.9531ms

      Throughput on cpu = 0.0171 Gpoints / sec

      Gflops per second  on cpu 0.852926

      Verifying data ­­> PASSED

$AI = \frac{2 \times 25 \times N^3}{4 \times 25 \times N^3} = \frac{1}{2}$

… but ..
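The stencil figures can be checked with a short calculation. This sketch is added for illustration; `stencil_throughput` is a made-up name, and the inputs (504³ lattice, 5 time steps, 321 ms, 50 flops per point) are the values quoted on the slide.

```python
def stencil_throughput(n, steps, time_s, flops_per_point):
    """Return (points processed, Gpoints/s, Gflop/s) for an n^3 lattice."""
    points = n ** 3 * steps
    gpoints_s = points / time_s / 1e9
    return points, gpoints_s, gpoints_s * flops_per_point


points, gpoints_s, gflops = stencil_throughput(504, 5, 0.321, 50)
ai = (2 * 25) / (4 * 25)    # flops per byte when every point is fetched from memory

print(points, round(gpoints_s, 3), round(gflops, 1), ai)
```

With exact point counts the result is marginally above the rounded 1.993 Gpoints/s of the slide, and it stays consistent with the roughly 99 Gflop/s reported by the program.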

Matrix Multiplication

● Matrix sizes:
– A: 2048 x 1024
– B: 1024 x 1024
– C: 2048 x 1024

● FPGA 128.77 Gflops

● CPU 1.48 Gflops (on 1 core of Intel i7)

Generating input matrices

Launching for device 0 (global size: 1024, 2048)

Performance of FPGA :

   Time: 33.353 ms

   Kernel time (device 0): 33.294 ms

   Throughput: 128.77 GFLOPS

Computing reference output

Performance of CPU Intel i7 single core :

   Time: 2907.730 ms

   Throughput: 1.48 GFLOPS

$AI = \frac{N^2 \times (2 \times N - 1)}{4 \times 3 \times N^2} = \frac{2N - 1}{12} \approx \frac{N}{6}$

… but ..

More efficient Matrix Multiplication

[Figure: matrices A, B and C, with a slice of rows of A and a slice of columns of B producing one tile of C]

Find 2 slices of A rows and B columns that you can keep in fast memory; then you can compute the corresponding tile of C without accessing any other data (data re-use due to caching). This can greatly increase the arithmetic intensity.

If you can store stripes of width k, then you read B only once and A N/k times.

$AI = \frac{2 N k^2 \times \frac{N^2}{k^2}}{N^2 + \frac{N}{k} \times N^2} = \frac{2 N^3}{N^2 \left(1 + \frac{N}{k}\right)} = \frac{2}{\frac{1}{k} + \frac{1}{N}} \approx 2k$
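A quick numerical check of the tiled formula, as a sketch added here. The function names are hypothetical. As in the formula above, the intensity is counted in flops per matrix element moved (divide by the word size to get flops per byte), and N = 1024, k = 32 are example values chosen for illustration.

```python
def tiled_matmul_ai(n, k):
    """Flops per element moved: B is read once, A is read n/k times."""
    flops = 2 * n ** 3
    elements_moved = n ** 2 + (n / k) * n ** 2
    return flops / elements_moved


def tiled_matmul_ai_closed_form(n, k):
    """Same quantity written as 2 / (1/k + 1/n)."""
    return 2 / (1 / k + 1 / n)


n, k = 1024, 32
print(tiled_matmul_ai(n, k), tiled_matmul_ai_closed_form(n, k), 2 * k)
```

For k much smaller than N the intensity approaches 2k, so doubling the stripe width roughly doubles the arithmetic intensity.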

Matrix Multiplication/2

+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               + Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ;   42%                     ;
; ALUTs                                  ;   17%                     ;
; Dedicated logic registers              ;   25%                     ;
; Memory blocks                          ;   40%                     ;
; DSP blocks                             ;   31%                     ;
+----------------------------------------+---------------------------;

Matrix Multiplication/3

+---------------------------------------------+-----------------+
; Resource                                    ; Usage           ;
+---------------------------------------------+-----------------+
; Estimate of Logic utilization (ALMs needed) ; 64255           ;
;                                             ;                 ;
; Combinational ALUT usage for logic          ; 49717           ;
;     -- 7 input functions                    ; 32              ;
;     -- 6 input functions                    ; 12400           ;
;     -- 5 input functions                    ; 1882            ;
;     -- 4 input functions                    ; 5526            ;
;     -- <=3 input functions                  ; 29877           ;
;                                             ;                 ;
; Dedicated logic registers                   ; 122269          ;
;                                             ;                 ;
; I/O pins                                    ; 0               ;
; Total MLAB memory bits                      ; 0               ;
; Total block memory bits                     ; 5220584         ;
;                                             ;                 ;
; Total DSP Blocks                            ; 392             ;
;     -- Total Fixed Point DSP Blocks         ; 8               ;
;     -- Total Floating Point DSP Blocks      ; 384             ;
;                                             ;                 ;
; Maximum fan-out node                        ; clock_reset_clk ;
; Maximum fan-out                             ; 147565          ;
; Total fan-out                               ; 914768          ;
; Average fan-out                             ; 4.56            ;
+---------------------------------------------+-----------------+

FFT 1d

● AI ~ 7.48

● FPGA ~ 120 Gflop/s

Fixed 4k points transform

Launching FFT transform for 2000 iterations

FFT kernel initialization is complete.

Processing time = 4.0878ms

Throughput = 2.0040 Gpoints / sec (120.2420 Gflops)

Signal to noise ratio on output sample: 137.677661 ­­> PASSED

Launching inverse FFT transform for 2000 iterations

Inverse FFT kernel initialization is complete.

Processing time = 4.0669ms

Throughput = 2.0143 Gpoints / sec (120.8579 Gflops)

Signal to noise ratio on output sample: 137.041007 ­­> PASSED

$AI = \frac{5 \times N \times \log(N) / \log(2)}{4 \times 2 \times N} = \frac{5}{8} \times \frac{\log(N)}{\log(2)}$

… but ..
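The FFT numbers can be reproduced with the usual 5 N log2(N) flop count that the formula above uses. This is a sketch added for illustration; the function names are made up, and the 4 × 2 × N bytes in the denominator are taken from the formula as is.

```python
import math


def fft_flops(n):
    """Conventional flop count of an n-point complex FFT: 5 n log2(n)."""
    return 5 * n * math.log2(n)


def fft_ai(n, bytes_moved_per_point=4 * 2):
    """Arithmetic intensity with 4 * 2 * n bytes moved, as in the formula."""
    return fft_flops(n) / (bytes_moved_per_point * n)


n = 4096
gpoints_s = 2.0040                                # throughput from the log
gflops = gpoints_s * fft_flops(n) / n             # flops per point = 5 log2(n)

print(fft_ai(n), gflops)
```

The intensity comes out as 7.5 (the slide quotes 7.48 after rounding the coefficient to 0.9), and the throughput matches the 120.24 Gflops printed by the program.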

FFT 2d

● ~ 66 Gflop/s

Launching FFT transform (alternative data layout)

Kernel initialization is complete.

Processing time = 1.5787ms

Throughput = 0.6642 Gpoints / sec (66.4201 Gflops)

Signal to noise ratio on output sample: 137.435876 ­­> PASSED

Launching inverse FFT transform (alternative data layout)

Kernel initialization is complete.

Processing time = 1.5781ms

Throughput = 0.6644 Gpoints / sec (66.4440 Gflops)

Signal to noise ratio on output sample: 136.689050 ­­> PASSED

$AI = \frac{5 \times N^2 \times \log(N^2)}{4 \times 2 \times N^2 \times \log(2)} = \frac{5 \times 2 \times \log(N)}{4 \times 2 \times \log(2)} = \frac{5}{4} \times \frac{\log(N)}{\log(2)}$

Computing π with Monte Carlo

Computes π with a Mersenne twister RNG.

Points = 2^22

GlobalWS = WG 32, LocalWS = WI 32

I. 32 x 32 WI = 1024 work-items generate 4096 random numbers (rn) each in [0,1]x[0,1]: 2^22 = 4194304

II. For each batch of 4096 rn, compute the ins and outs with respect to the circle

III. Compute the average

It takes ~ 854 ns for each rn

Using AOCX: mt.aocx

Reprogramming device with handle 1

Count all 26354932.000000 / ( rn 4096 * rng 32 *32) * 4

Computed pi  = 3.141753

 Mersenne twister : 1954.146849[ms]

 Computing     pi : 1632.340594[ms]

 Copy results     :    0.077611[ms]

 Total time       : 3586.565054[ms]

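The same estimate can be sketched on a CPU in a few lines of Python. This is not the FPGA kernel: it is a sequential illustration with a hypothetical function name, a smaller number of points (2^16 instead of 2^22) and an arbitrary seed. CPython's random module happens to use a Mersenne Twister generator as well.

```python
import random


def estimate_pi(n_points, seed=12345):
    """Monte Carlo estimate of pi from points drawn in [0,1] x [0,1]."""
    rng = random.Random(seed)          # Mersenne Twister (MT19937) in CPython
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:       # point falls inside the quarter circle
            inside += 1
    return 4.0 * inside / n_points


pi_estimate = estimate_pi(1 << 16)
print(pi_estimate)
```

The statistical error shrinks only as one over the square root of the number of points, which is why the FPGA run uses millions of random numbers.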

Sobel filter

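● 1920 x 1080 pixel image, 3 color planes x 8 bits ~ 6 MB

● Filter can be applied at 140 fps

$\mathrm{luma} = \left(\left([R\ G\ B] \begin{bmatrix} 66 \\ 129 \\ 25 \end{bmatrix} + 128\right) \gg 8\right) + 16$ (Rec BT 709)

Sobel operators:

$S_x = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}, \quad S_y = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}$

$\frac{\partial I}{\partial x} = I * S_x, \quad \frac{\partial I}{\partial y} = I * S_y$ (convolution)

$\nabla I = \left[\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}\right], \quad \|\nabla I\| = \sqrt{\left(\frac{\partial I}{\partial x}\right)^2 + \left(\frac{\partial I}{\partial y}\right)^2}$

$\|\nabla I(i,j)\| < \theta \rightarrow \mathrm{pixel}(i,j) = (0,0,0)$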
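A small pure-Python sketch of the operators above, added for illustration on a tiny gray-level image. The function names are made up. The 3 x 3 window is applied without flipping the kernel (a correlation); for the Sobel operators this only changes the sign of each derivative, so the gradient magnitude is the same as with a true convolution.

```python
import math

SX = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
SY = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]


def luma(r, g, b):
    """Integer luma from 8-bit R, G, B, following the formula above."""
    return ((66 * r + 129 * g + 25 * b + 128) >> 8) + 16


def sobel_magnitude(img, i, j):
    """Gradient magnitude at the interior pixel (i, j) of a 2D list."""
    gx = sum(SX[di][dj] * img[i - 1 + di][j - 1 + dj]
             for di in range(3) for dj in range(3))
    gy = sum(SY[di][dj] * img[i - 1 + di][j - 1 + dj]
             for di in range(3) for dj in range(3))
    return math.hypot(gx, gy)


# 4 x 5 test image with a vertical edge between columns 2 and 3
img = [[0, 0, 0, 255, 255] for _ in range(4)]
print(sobel_magnitude(img, 1, 1), sobel_magnitude(img, 1, 2))
print(luma(0, 0, 0), luma(255, 255, 255))
```

Pixels in the flat region give a zero gradient and would be set to (0,0,0) by the threshold test, while pixels next to the edge give a large magnitude and are kept.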

Other implementations

● Smith-Waterman
– Algorithm for computing the best match (with gaps and mismatches) between 2 DNA sequences

Status: in progress

● Spiking Neurons
– McCulloch-Pitts neurons (and later the Rosenblatt perceptron) are too simple as models of neuron communication. Neurons certainly use spike frequency to signal the strength of activation, and may even use spikes as a kind of binary code between them

Status: thought about it

IX. More on OpenCL for FPGA


OpenCL

https://www.khronos.org/files/opencl-1-1-quick-reference-card.pdf

Originally authored by Apple, tired of having to support all the newly arriving computing devices (NVIDIA, AMD, Intel, ...). (2007/2008)

It mostly follows the lines of its predecessor NVIDIA CUDA, but uses a different terminology.

The rights were passed to Khronos, a consortium that develops standards and also develops the OpenGL standard (2008/2009).

OpenCL platform model: 1 host + 1 or more compute devices

[Figure: a host connected to compute devices; each compute device contains compute units, each made of processing elements (PE)]

OpenCL platform model and FPGAs

FPGA:

● A Compute Device is an FPGA card (there can be many in a PC)

● A Compute Unit is a pipeline instantiated by the FPGA OpenCL compiler (you can implement multiple pipelines on the FPGA, as a later slide shows).

● A Processing Element (PE) is e.g. a DSP adder or multiplier in a pipeline.

NVIDIA CUDA:

● A Compute Device is an NVIDIA CUDA card

● A Compute Unit is a Streaming Multiprocessor (SM)

● A Processing Element (PE) is a CUDA core (on NVIDIA all cores in a warp execute the same instruction)

OpenCL / CUDA Data Parallel Model

OpenCL       CUDA
NDRange      Grid
WorkGroup    ThreadBlock
WorkItem     Thread

The problem is represented as a computation carried out over a 1-, 2- or 3-dimensional array.

OpenCL NDRange, work-group, work-item

From Intel https://software.intel.com/sites/landingpage/opencl/optimization-guide/Basic_Concepts.htm

[Figure: an NDRange (CUDA grid) divided into work-groups (CUDA thread blocks), each made of work-items (CUDA threads)]

OpenCL attributes for FPGA

#define NUM_SIMD_WORK_ITEMS  4  

#define REQD_WORK_GROUP_SIZE (64,1,1) 

#define NUM_COMPUTE_UNITS  2 

#define MAX_WORK_GROUP_SIZE 512  

__kernel 

__attribute__((max_work_group_size( MAX_WORK_GROUP_SIZE )))

__attribute__((reqd_work_group_size REQD_WORK_GROUP_SIZE ))

__attribute__((num_compute_units( NUM_COMPUTE_UNITS )))

__attribute__((num_simd_work_items( NUM_SIMD_WORK_ITEMS )))

void function(..) { ...; }

             

But ... the compiler is mostly resource driven and often it doesn't obey your will, despite what the docs promise.

Vector Addition/Matrix Multiplication OpenCL kernels

// vector addition

C:
  for (i = 0; i < N; i++) {
    C[i] = A[i] + B[i];
  }

OpenCL:
__kernel void vecadd(__global const float* A,
                     __global const float* B,
                     __global float* C)
{
      int i = get_global_id(0);   // one work-item per element
      C[i] = A[i] + B[i];
}

// matrix multiplication

C:
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      Temp = 0.0f;
      for (k = 0; k < N; k++) {
        Temp += A[i][k] * B[k][j];
      }
      C[i][j] = Temp;
    }
  }

OpenCL:
__kernel void matmul(__global const float* A,
                     __global const float* B,
                     __global float* C,
                     const int N)            // N x N matrices, stored row-major
{
      int i = get_global_id(0);
      int j = get_global_id(1);
      float sum = 0.0f;                      // private to each work-item
      for (int k = 0; k < N; k++) {
          sum += A[i * N + k] * B[k * N + j];
      }
      C[i * N + j] = sum;
}

X. Getting the most: we need to look at the architecture !

Arria 10 DSP in Floating Point mode

Arria 10 killah kernel

Initializing OpenCL

Platform: Altera SDK for OpenCL

Using 1 device(s)

p385a_sch_ax115 : nalla_pcie (aclnalla_pcie0)

Using AOCX: loop.aocx

Reprogramming device with handle 1

Launching for device 0 (100000 elements)

Total runs 100000 , gflop 107374.182400

100,000 x 4 x (256*1024*1024)

Wall Time: 139909.012 ms

Gflop/s 767.457225

Kernel time (device 0): 139908.517 ms

Gflop/s 767.459945

[Figure: data flow of one loop iteration: x[i] -> * 2.0 -> + 2.0 -> * 0.5 -> + (-1.0) -> res]

#define N (256*1024*1024)

__kernel void loop(__global const float* x, __global float* restrict y)
{
  int i = get_global_id(0);
  float res = x[i];               // private accumulator, one per work-item

  #pragma unroll 700
  for (int j = 0; j < N; j++) {   // separate counter, so that i keeps the global id
    res = res*2.0f + 2.0f;
    res = res*0.5f - 1.0f;
  }
  y[i] = res;
}

Arria 10 killah kernel – Quartus report

+---------------------------------------------------------------+
; Spectra-Q Synthesis Resource Usage Summary for Partition "|"  ;
+---------------------------------------------+-----------------+
; Resource                                    ; Usage           ;
+---------------------------------------------+-----------------+
; Estimate of Logic utilization (ALMs needed) ; 82174           ;
;                                             ;                 ;
; Combinational ALUT usage for logic          ; 102803          ;
;     -- 7 input functions                    ; 5               ;
;     -- 6 input functions                    ; 1842            ;
;     -- 5 input functions                    ; 11104           ;
;     -- 4 input functions                    ; 18594           ;
;     -- <=3 input functions                  ; 71258           ;
;                                             ;                 ;
; Dedicated logic registers                   ; 151334          ;
;                                             ;                 ;
; I/O pins                                    ; 0               ;
; Total MLAB memory bits                      ; 0               ;
; Total block memory bits                     ; 1348604         ;
;                                             ;                 ;
; Total DSP Blocks                            ; 1400            ;
;     -- Total Fixed Point DSP Blocks         ; 0               ;
;     -- Total Floating Point DSP Blocks      ; 1400            ;
;                                             ;                 ;
; Maximum fan-out node                        ; clock_reset_clk ;
; Maximum fan-out                             ; 155035          ;
; Total fan-out                               ; 692846          ;
; Average fan-out                             ; 2.65            ;
+---------------------------------------------+-----------------+

This is extremely good. It shows that the OpenCL compiler really created the same design an experienced hardware engineer could have created using Verilog. It used one DSP in multiply-add mode for each line of the loop body: 2 per unrolled iteration, 1400 in total. And it reached 50-60 % of the peak performance.
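The Gflop/s figure of the killah kernel can be re-derived from the numbers in the program output. This is a sketch added for illustration; the variable names are made up, and the peak used for comparison is the 2 × 1518 × 0.5 GHz estimate from the earlier slide.

```python
N = 256 * 1024 * 1024          # loop iterations per work-item
FLOPS_PER_ITERATION = 4        # two multiply-add lines per iteration
RUNS = 100_000                 # elements (work-items) launched
KERNEL_TIME_S = 139.908517     # kernel time from the log, in seconds
PEAK_GFLOPS = 2 * 1518 * 0.5   # back-of-the-envelope peak of the Arria 10

total_gflop = RUNS * FLOPS_PER_ITERATION * N / 1e9
gflops = total_gflop / KERNEL_TIME_S
fraction_of_peak = gflops / PEAK_GFLOPS

print(total_gflop, round(gflops, 2), round(fraction_of_peak, 3))
```

The result agrees with the 767.46 Gflop/s reported by the host program, about half of the estimated peak.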

XI. Programming FPGAs with Schematics

Only very small projects can be handled using schematics

Quartus – FPGA using Schematics

1. Create new project with wizard (give a dir and a project name), select an empty project

2. Choose FPGA model:10AX115N3F40E2SG

3. Open New File, choose a Design->Schematic : a design whiteboard opens up

4. Choose all the components you need: I/O pins, DSP blocks (choose them in integer or fp mode from the IP catalog; a parameter editor will open up and you can program them to be adders/multipliers/FMA)

5. Connect components with buses from the top menu

Quartus – Schematic for scalar product


Quartus report using Schematics

XII. Programming FPGAs with an HDL: Verilog

Again the scalar product of 2 vectors of length 4

Large projects can't be managed using schematics: hundreds/thousands/tens of thousands of components, millions of interconnections, ...

top.v

module top( x0,y0,x1,y1,x2,y2,x3,y3,z,clk,ena,aclr);

input [31:0]x0; input [31:0]y0;

input [31:0]x1; input [31:0]y1;

input [31:0]x2; input [31:0]y2;

input [31:0]x3; input [31:0]y3;      

output [31:0]z;

input clk; input ena; input [1:0]aclr;

wire [31:0]ir0; wire [31:0]ir1; wire [31:0]ir2; wire [31:0]ir3; 

wire [31:0]ir4; wire [31:0]ir5;

    dsp_fp_mul m1(.aclr(aclr),.ay(x0),.az(y0),.clk(clk),.ena(ena),.result(ir0));

    dsp_fp_mul m2(.aclr(aclr),.ay(x1),.az(y1),.clk(clk),.ena(ena),.result(ir1));

    dsp_fp_mul m3(.aclr(aclr),.ay(x2),.az(y2),.clk(clk),.ena(ena),.result(ir2));

    dsp_fp_mul m4(.aclr(aclr),.ay(x3),.az(y3),.clk(clk),.ena(ena),.result(ir3));

        

    dsp_fp_add a1(.aclr(aclr),.ax(ir0),.ay(ir1),.clk(clk),.ena(ena),.result(ir4));

    dsp_fp_add a2(.aclr(aclr),.ax(ir2),.ay(ir3),.clk(clk),.ena(ena),.result(ir5));

    dsp_fp_add a3(.aclr(aclr),.ax(ir4),.ay(ir5),.clk(clk),.ena(ena),.result(z));

endmodule

[Figure: module hierarchy. top.v instantiates dsp_fp_mul.v as m1, m2, m3, m4 and dsp_fp_add.v as a1, a2, a3]

In Verilog, what looks like a function call is in fact the instantiation of one circuit inside another. The parameter syntax represents the correspondence (connection) of wires with wires.

dsp_fp_xxx

// dsp_fp_mul.v

// Generated using ACDS version 16.0 211

`timescale 1 ps / 1 ps

module dsp_fp_mul (

input  wire [1:0]  aclr,   //   aclr.aclr

input  wire [31:0] ay,     //     ay.ay

input  wire [31:0] az,     //     az.az

input  wire        clk,    //    clk.clk

input  wire        ena,    //    ena.ena

output wire [31:0] result  // result.result

);

dsp_fp_mul_altera_fpdsp_block_160_ebvuera fpdsp_block_0 (

.clk    (clk),    //    clk.clk

.ena    (ena),    //    ena.ena

.aclr   (aclr),   //   aclr.aclr

.result (result), // result.result

.ay     (ay),     //     ay.ay

.az     (az)      //     az.az

);

endmodule

// dsp_fp_add.v
`timescale 1 ps / 1 ps
module dsp_fp_add (
	input  wire [1:0]  aclr,   //   aclr.aclr
	input  wire [31:0] ax,     //     ax.ax
	input  wire [31:0] ay,     //     ay.ay
	input  wire        clk,    //    clk.clk
	input  wire        ena,    //    ena.ena
	output wire [31:0] result  // result.result
);
	// port names match the .ax/.ay/.result connections used in top.v
	dsp_fp_add_altera_fpdsp_bloc_160_nmfrqti fdsp_block_0 (
		.clk    (clk),    //    clk.clk
		.ena    (ena),    //    ena.ena
		.aclr   (aclr),   //   aclr.aclr
		.ax     (ax),     //     ax.ax
		.ay     (ay),     //     ay.ay
		.result (result)  // result.result
	);
endmodule

These 2 modules (dsp_fp_mul.v and dsp_fp_add.v) are generated automatically when you instantiate a DSP in floating point mode from the IP cores and configure it as an adder or a multiplier

Quartus report on Scalar Product using HDL

Exactly the same as for the project using schematics

SystemVerilog(1) – Killah kernel sp_12.sv

module sp_12 #(parameter N = 700) (
    input  logic [31:0] x,
    output logic [31:0] out,
    input  logic        clk, ena,
    input  logic [1:0]  aclr
);
    logic [31:0] mul_2, add_2, mul_05, sub_1;
    logic [31:0] ir[2*N+4];

    assign mul_2  = shortreal'(2.0);
    assign add_2  = shortreal'(2.0);
    assign mul_05 = shortreal'(0.5);
    assign sub_1  = shortreal'(-1.0);
    assign ir[0]  = x;

    genvar i;
    generate
        for (i = 0; i <= N; i = i + 1) begin : FMA2_LOOP
            // res = res*2.0 + 2.0
            dsp_fp_fma inst  (.ax(add_2), .ay(ir[2*i]),   .az(mul_2),  .result(ir[2*i+1]), .clk(clk), .ena(ena), .aclr(aclr));
            // res = res*0.5 - 1.0 (uses mul_05 and sub_1, declared above)
            dsp_fp_fma inst1 (.ax(sub_1), .ay(ir[2*i+1]), .az(mul_05), .result(ir[2*i+2]), .clk(clk), .ena(ena), .aclr(aclr));
        end
    endgenerate

    assign out = ir[2*N+2];
endmodule

Quartus report: 1,402 DSP used

1) SystemVerilog is a new edition of Verilog (1800-2012) with many additions

// dsp_fp_fma.v

// Generated using ACDS version 16.0 211

`timescale 1 ps / 1 ps

module dsp_fp_fma (

input  wire [1:0]  aclr,   //   aclr.aclr

input  wire [31:0] ax,     //     ax.ax

input  wire [31:0] ay,     //     ay.ay

input  wire [31:0] az,     //     az.az

input  wire        clk,    //    clk.clk

input  wire        ena,    //    ena.ena

output wire [31:0] result  // result.result

);

dsp_fp_fma_altera_fpdsp_block_160_fj4u2my fpdsp_block_0 (

.clk    (clk),    //    clk.clk

.ena    (ena),    //    ena.ena

.aclr   (aclr),   //   aclr.aclr

.result (result), // result.result

.ax     (ax),     //     ax.ax

.ay     (ay),     //     ay.ay

.az     (az)      //     az.az

);

endmodule

Verilog : dsp_fp_fma.v

This file is generated automatically when you instantiate a DSP as a multiplier/adder with the parameter editor. It differs from the others that resulted from single operation instantiation (like only mul or only add) : it uses all 3 input busses as you can see.


XIII. Spatial Computing (OpenSPL)

OpenSPL: Open Spatial Programming Language

● Buzzword in the hands of a consortium led by Maxeler and Juniper on the industrial side, and Stanford University, Imperial College, Tokyo University, ... on the academic side

● Everything kept as a trade secret for now

● Java interface ...

● IMHO this is a missed opportunity:
– “Spatial Programming” is probably the wrong term at a time when thousands of things around GPS, GEO, etc. are already called that
– Plans and standards should be open, not kept secret from everyone except consortium members
– The industrial members are weak in this market
– Java is, IMHO, not the right tool here
– An open source movement should be started instead

My Proposal: json-graph-fpga

Use a simple, already existing format to describe the graph of components: JSON, for instance, or JSON Graph. (We assume all components are connected to a global clock.)

{"inputs":["x0","x1","x2","x3","y0","y1","y2","y3"],
 "x0":["m1"],"y0":["m1"],
 "x1":["m2"],"y1":["m2"],
 "x2":["m3"],"y2":["m3"],
 "x3":["m4"],"y3":["m4"],
 "m1":["a1"],"m2":["a1"],
 "m3":["a2"],"m4":["a2"],
 "a1":["a3"],"a2":["a3"],
 "a3":["outputs"]}

[Figure: the same graph drawn as a diagram. Inputs feed the multipliers m1..m4, which feed the adders a1 and a2, which feed the adder a3, which drives Outputs]
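To show that such a description is enough to drive a tool, here is a minimal Python sketch that loads the graph and evaluates it in software. It is only an illustration of the proposal, not an existing tool. The convention that nodes named m* multiply and nodes named a* add is an assumption made here, since the JSON itself carries no operation types; the function name `evaluate` is likewise made up.

```python
import json

GRAPH = json.loads("""
{"inputs":["x0","x1","x2","x3","y0","y1","y2","y3"],
 "x0":["m1"],"y0":["m1"],"x1":["m2"],"y1":["m2"],
 "x2":["m3"],"y2":["m3"],"x3":["m4"],"y3":["m4"],
 "m1":["a1"],"m2":["a1"],"m3":["a2"],"m4":["a2"],
 "a1":["a3"],"a2":["a3"],"a3":["outputs"]}
""")


def evaluate(graph, input_values):
    """Evaluate the graph; assumed convention: m* multiplies, a* adds."""
    preds = {}                                    # node -> nodes feeding it
    for src, dsts in graph.items():
        if src == "inputs":
            continue
        for dst in dsts:
            preds.setdefault(dst, []).append(src)

    known = dict(input_values)

    def value(node):
        if node not in known:
            a, b = (value(p) for p in preds[node])
            known[node] = a * b if node.startswith("m") else a + b
        return known[node]

    (last,) = preds["outputs"]                    # single node drives the output
    return value(last)


xs, ys = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]
inputs = {f"x{i}": v for i, v in enumerate(xs)}
inputs.update({f"y{i}": v for i, v in enumerate(ys)})
print(evaluate(GRAPH, inputs))
```

A real flow would emit one DSP instantiation per node and one wire per edge instead of computing values, but the traversal of the graph would look much the same.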

XIV. What's next ?


Top INTEL/Altera Product Stratix 10

                  Arria 10 (10AX115)    Stratix 10 (GX2800)
Technology        20 nm                 Intel 14 nm (Tri-Gate) FinFET
Logic elements    1,150,000             2,753,000
ALMs              427,200               933,120
DSP               1,518                 5,760
M20K blocks       2,713                 11,721
Registers         1,708,800             3,732,480
Peak Tflops       1.5                   10

Stratix 10 ≈ 6 x (fp perf of Arria 10)

How to lift off-board b/w limitations ?

Directly to QPI or PCIe

● Connect directly to the Intel QPI (Quick Path Interconnect) or the future Intel UPI (Ultra Path Interconnect), the processor/chipset point-to-point interconnect (60-80 GB/s). Already done with Xilinx chips

● Stratix 10 supports 4x PCIe Gen3 x16 ~ 60 GB/s

Stand alone

● Use FPGAs stand alone. The Stratix 10 supports DDR4 memory or HMC (Hybrid Memory Cube). Connections with Interlaken channels support 14.7 Gb/s per lane.

XV. Competitors


Competitors

NVIDIA GPGPU

● NVIDIA P100 (next year)
– 3,584 cores
– 1,328 MHz at 300 W, 1,126 MHz at 250 W
– CUDA 6.0
– Single precision Gflops 8,000-10,000
– 3584 * 1328 MHz * 2 = 9,519 Gflops
– TDP 250-300 Watt
– ~ 10,000-12,000 USD

INTEL Xeon Phi

● INTEL Xeon Phi 7290
– 72 cores
– Freq. 1.50 GHz
– TDP 245 Watt
– ~ 4,110 USD

INTEL Xeon

● INTEL Xeon E5-4699v4
– 22 cores
– Freq. 2.20 GHz
– TDP 135 Watt
– ~ 7,000 USD

INTEL/Altera FPGAs

● INTEL Arria 10 GX
– 1518 DSP
– Freq. 0.5 GHz
– Peak 1518*2*0.5 = 1.5 Tflops
– TDP ~ 30 Watt
– ~ 5,000 USD

● INTEL Stratix 10
– 5,760 DSP
– Freq. 1.0 GHz
– Peak 10 Tflops
– TDP ~ 30-40 Watt
– ~ 20 K USD (???)

Competitors/2

[Figure: bar chart "TDP / Peak GFlop/s / Price" for Arria 10, Stratix 10, INTEL E5-2699v4, INTEL Phi 7290 and NVIDIA P100. Series: TDP (Watt x 100), peak FP (GFlop/s), price. Vertical scale up to 35,000.]

Competitors/3

[Figure: bar chart "GFlops / Watt" for Arria 10, Stratix 10, INTEL E5-2699v4, INTEL Phi 7290 and NVIDIA P100. Vertical scale up to 300.]
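The Gflops/Watt comparison can be re-derived for the devices whose peak is quoted in these slides. This sketch is an added illustration: the function name is made up, the Stratix 10 TDP is taken as 35 W (the midpoint of the quoted 30-40 W range, an assumption), and the P100 is taken at its 300 W operating point. The Xeon and Xeon Phi are left out because no peak figure is given for them here.

```python
def gflops_per_watt(peak_gflops, tdp_watt):
    """Energy efficiency from peak single-precision Gflop/s and TDP."""
    return peak_gflops / tdp_watt


# (peak Gflop/s, TDP in Watt), values taken from the Competitors slide
devices = {
    "Arria 10 GX": (1500.0, 30.0),
    "Stratix 10":  (10000.0, 35.0),   # 35 W assumed: midpoint of 30-40 W
    "NVIDIA P100": (9519.0, 300.0),   # 1,328 MHz operating point
}

for name, (peak, tdp) in devices.items():
    print(f"{name:12s} {gflops_per_watt(peak, tdp):6.1f} Gflops/W")
```

These are peak figures only; measured efficiency depends on how much of the peak a given kernel actually reaches.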

XVI. Can I use it ?


Can I use it ?

– I'm interested in making comparison tests with Tesla and other architectures

– I'm interested in trying kernels with sufficient arithmetic intensity to run efficiently

– I'm interested in interesting problems :)

– The limit is that there is only 1 board in 1 PC and the compiler license is for 1 seat.

Please write to me about this !

“... and go on till you come to the end: then stop.” Lewis Carroll

but I think Jacques de La Palice (or de La Palisse) could also have said something like that

END