Combined Instruction and Loop Level Parallelism for Regular Array Synthesis on FPGAs

S. Derrien, S. Rajopadhye, S. Sur-Kolay*

IRISA, France; *ISI Calcutta

ISSS'01 (ISSS 2001, Montréal)

Outline

Context and motivation
Space-time transformations
Transformation flow
Experimental validation
Conclusion

High performance IP-Cores

High-level specifications
  Matlab, C, C++ or a specific language (Alpha)
  Targeting nested loops
  Core must be formally correct

Hard/Soft co-generation
  Hardware RTL module (VHDL)
  Simple driver API (C)

Regular Processor Arrays
  High data throughput, specialized datapath
  Well suited for VLSI/FPGA

Targeting FPGAs

Poor clock speed
  Typical clock speed is 1/10 of ASIC speed
  Very design dependent
  Good at low-precision arithmetic (8 bits)
  Really bad for complex operations (floats)

But high performance
  Optimized designs can compete with ASICs
  Performance gain due to parallelism
  Pipelining comes for free (lots of DFFs)

Processor Array Synthesis

For i := 1 to 3
  For j := 1 to 3
    For k := 1 to 3
      C[i,j] := C[i,j] + A[i,k] * B[k,j];
    End for;
  End for;
End for;

Iteration domain extracted from loop bounds

Data dependence vector between iterations

Iteration domain is projected on the processor grid

Matrix multiplication example

Iterations are scheduled on their associated PE
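The synthesis steps listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not give the dependence vectors, the projection direction or the schedule, so the uniform dependences, the projection along k (`allocate`) and the schedule t = i + j + k (`schedule`) below are common textbook choices, not necessarily the ones used by the authors.

```python
from itertools import product

N = 3  # loop bounds from the slide: i, j, k in 1..3

# Iteration domain extracted from the loop bounds.
domain = list(product(range(1, N + 1), repeat=3))

# Assumed uniform dependence vectors (di, dj, dk):
# C accumulates along k, A is propagated along j, B along i.
dependences = {"C": (0, 0, 1), "A": (0, 1, 0), "B": (1, 0, 0)}

def allocate(i, j, k):
    """Assumed projection along k: iteration (i, j, k) runs on PE (i, j)."""
    return (i, j)

def schedule(i, j, k):
    """Assumed linear schedule t = i + j + k."""
    return i + j + k

# The schedule is linear, so applying it to a dependence vector gives the
# delay between producer and consumer; it must be at least one time step.
causal = all(schedule(*d) >= 1 for d in dependences.values())

# Timeline of each PE: the time steps at which it executes an iteration.
timeline = {}
for point in domain:
    timeline.setdefault(allocate(*point), []).append(schedule(*point))

# A PE may execute at most one iteration per time step.
conflict_free = all(len(set(ts)) == len(ts) for ts in timeline.values())

print(len(domain), "iterations on", len(timeline), "PEs")
print("PE (1, 1) is busy at t =", sorted(timeline[(1, 1)]))
print("causal:", causal, "conflict-free:", conflict_free)
```

With these assumptions the 27 iterations are projected onto a 3 x 3 processor grid, and each PE executes its three iterations on consecutive time steps.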

PE Architecture

[Figure: PE datapath]

Temporal registers act as local memory

Combinational datapath connected to registers

Unidirectional flow and pipelined connections

N classes of registers (N = loop dimension)

One critical path for each register class

Operating frequency set by worst critical path

Spatio-temporal registers must be disambiguated

Spatial registers serve as interconnect between PEs

Conclusion

Simplistic schedule inside a PE (no ILP)

Complex loop bodies induce poor performance
  Floating-point matrix multiplication operating at 12 MHz
  2D SOR on 16 bits operating at 40 MHz

The PE architecture is not suited to FPGAs!

Proposed solution: allow pipelined datapaths by altering the PE architecture through simple space-time transformations.

Retiming

[Figure: LUT network shown with two register placements, one with Tc = 1 logic level and one with Tc = 2 logic levels]

Move registers to minimize clock period

Handled by most FPGA RTL synthesis tools

Efficient only if there are enough registers to move

We just need to add registers in the PE!
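The dependence of retiming on the number of available registers can be illustrated with a deliberately simplified model. The sketch below assumes a purely linear chain of LUTs (real retiming works on arbitrary circuit graphs) and searches exhaustively for the register placement that minimizes the clock period; `best_clock_period` and the six-LUT chain are hypothetical names and values chosen for illustration.

```python
from itertools import combinations

def best_clock_period(delays, n_registers):
    """Smallest clock period reachable when n_registers can be placed
    between the stages of a linear chain of combinational delays."""
    n = len(delays)
    n_registers = min(n_registers, n - 1)  # at most one register per gap
    best = sum(delays)
    # A cut at position c puts a register between stage c-1 and stage c.
    for cuts in combinations(range(1, n), n_registers):
        bounds = (0, *cuts, n)
        period = max(sum(delays[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, period)
    return best

lut_chain = [1, 1, 1, 1, 1, 1]  # six LUTs, one logic level each
for regs in (0, 1, 2, 5):
    print(regs, "registers ->", best_clock_period(lut_chain, regs), "logic levels")
```

In this toy model the clock period only drops to one logic level once five registers are available, which is the motivation for adding registers to the PE before handing it to the synthesis tool.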

Serialization (1/2)

Regroup PEs into clusters

Iterations in a cluster executed sequentially

Throughput is reduced by a factor equal to the cluster size

Local memory is duplicated

[Figure: original PE array before clustering, and array after clustering]
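Clustering along one axis can be sketched as follows. The mapping used here (virtual PE `pe` goes to physical PE `pe // sigma`, and its iteration `t` runs at cycle `sigma * t + pe % sigma`) is one common way to serialize a processor array and is an assumption, as are the names `serialize`, `serialized_cycle` and `sigma`; the slide only states that the iterations of a cluster are executed sequentially.

```python
def serialize(n_pes, sigma):
    """Group the virtual PEs of one axis into clusters of size sigma."""
    clusters = {}
    for pe in range(n_pes):
        clusters.setdefault(pe // sigma, []).append(pe)
    return clusters

def serialized_cycle(pe, t, sigma):
    """Cycle at which iteration t of virtual PE `pe` runs after clustering."""
    return sigma * t + pe % sigma

sigma = 2
clusters = serialize(n_pes=4, sigma=sigma)
print(clusters)  # {0: [0, 1], 1: [2, 3]}

# Each physical PE interleaves the iterations of its virtual PEs, so every
# virtual PE advances only once every sigma cycles.
for t in range(2):
    print("t =", t, "->", [serialized_cycle(pe, t, sigma) for pe in clusters[0]])
```

Each physical PE has to keep the state of all the virtual PEs it emulates, which is why the local memory is duplicated.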

Serialization (2/2)

[Figure: PE datapath]

Decomposed along each spatial dimension

Serialization impacts the PE according to simple transformation rules

Loop level Parallelism traded for Instruction Level Parallelism

Temporal registers are duplicated by the serialization factor of axis i

Feedback loops are created for all spatial paths along the i-th axis

Skewing

[Figure: PE datapath]

Skewing by factor 2 along vertical PE axis

Affects latency, but not throughput

Adds temporal registers along spatial axis

Skewing can be used before and after serialization

Cannot reduce original temporal critical path
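The effect of skewing on register counts can be sketched with a small model. It assumes that skewing reschedules every PE as t' = t + factor * position along the chosen axis, so a path that crosses `step` PEs along that axis gains `factor * step` registers. The path names and initial register counts are hypothetical.

```python
def skew(paths, axis, factor):
    """Apply t' = t + factor * position[axis] to every path.
    Each path is (spatial step, number of registers)."""
    return {
        name: (step, regs + factor * step[axis])
        for name, (step, regs) in paths.items()
    }

def added_latency(n_pes_on_axis, factor):
    """Extra cycles before the last PE of the skewed axis starts."""
    return factor * (n_pes_on_axis - 1)

# Hypothetical PE paths: (step along x, step along y), register count.
paths = {
    "temporal": ((0, 0), 1),
    "horizontal": ((1, 0), 1),
    "vertical": ((0, 1), 1),
}

skewed = skew(paths, axis=1, factor=2)  # factor 2 along the vertical axis
for name, (step, regs) in skewed.items():
    print(name, regs)
print("added latency for 4 rows:", added_latency(4, 2))
```

Only the vertical path gains registers. The purely temporal path has a zero spatial step, so its register count and its critical path are left unchanged, and the array still accepts one iteration per cycle once the pipeline is filled.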

Problem formulation

[Figure: PE datapath]

Find the optimal set of transformation parameters

Minimize the number of registers

Preserve loop-level parallelism

Tc = 86 ns requires dj = 6 stages to obtain Tc = 15 ns

Tc = 70 ns requires di = 5 stages to obtain Tc = 15 ns

Tc = 60 ns requires dt = 4 stages to obtain Tc = 15 ns
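The stage counts quoted on this slide are consistent with a simple ceiling computation, sketched below. It assumes that retiming can split each critical path into stages of roughly equal delay, which real tools only approximate; `stages_needed` is a hypothetical helper name.

```python
def stages_needed(tc_ns, target_ns):
    """Number of pipeline stages so that each stage fits in target_ns."""
    return -(-tc_ns // target_ns)  # integer ceiling division

critical_paths_ns = {"dj": 86, "di": 70, "dt": 60}  # values from the slide
target_ns = 15

stages = {name: stages_needed(tc, target_ns)
          for name, tc in critical_paths_ns.items()}
print(stages)  # {'dj': 6, 'di': 5, 'dt': 4}

# The operating frequency is set by the worst critical path.
f_before_mhz = 1000 / max(critical_paths_ns.values())
f_target_mhz = 1000 / target_ns
print(round(f_before_mhz, 1), "MHz before,", round(f_target_mhz, 1), "MHz targeted")
```

The 86 ns path corresponds to a clock of roughly 12 MHz, and a 15 ns target to roughly 67 MHz.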

Proposed heuristic

1. Assume the serialization factor of each axis i is given (partitioning step)

2. Sort the PE space axes in ascending order of Tc

3. Determine all the skewing parameters

4. For each PE axis i do

   i. Pre-serialization skewing along axis i

   ii. Serialization along axis i

5. For each PE axis i do

   i. Post-serialization skewing along axis i
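The ordering imposed by the heuristic can be sketched as a function that turns per-axis parameters into a sequence of transformations. This is a minimal sketch, not the authors' implementation: the dictionary layout, the function name `transformation_sequence`, the Tc values assigned to axes x and y, and the final retiming step (taken from the transformation example) are assumptions, and the computation of the skewing parameters themselves is not modelled.

```python
def transformation_sequence(axes):
    """axes maps an axis name to its parameters:
    tc (critical path, ns), pre / post (skewing factors), ser (serialization)."""
    order = sorted(axes, key=lambda name: axes[name]["tc"])  # ascending Tc
    steps = []
    for name in order:
        params = axes[name]
        if params["pre"]:
            steps.append(("pre-skew", name, params["pre"]))
        if params["ser"] > 1:
            steps.append(("serialize", name, params["ser"]))
    for name in order:
        if axes[name]["post"]:
            steps.append(("post-skew", name, axes[name]["post"]))
    steps.append(("retime", None, None))  # left to the RTL synthesis tool
    return steps

# Factors follow the transformation example; the Tc values are hypothetical.
axes = {
    "x": {"tc": 86, "pre": 2, "ser": 2, "post": 0},
    "y": {"tc": 70, "pre": 1, "ser": 2, "post": 1},
}
for step in transformation_sequence(axes):
    print(step)
```

With these inputs the sequence is: pre-skew y by 1, serialize y by 2, pre-skew x by 2, serialize x by 2, post-skew y by 1, then retiming.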

Transformation example

1. Pre-skew along axis y by a factor of 1

2. Serialization along axis y by a factor of 2

3. Pre-skew along axis x by a factor of 2

4. Serialization along axis x by a factor of 2

5. Post-skew along axis y by a factor of 1

6. Apply retiming

[Figure: PE datapath after each transformation step]

Experimental validation

Chosen benchmarks
  Matrix multiplication (8 bits, 16 bits and floats)
  Adaptive filter (DLMS) (8 bits, 16 bits and floats)
  String matching (DNA, protein)

Performance metrics
  Ape: PE area usage
  fpe: PE operating frequency
  Raw performance = Npe × fpe, with Npe approximated by 1/Ape
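The raw performance metric can be written as a one-line function, which makes the trade-off between area overhead and frequency improvement explicit. The area and frequency values below are purely illustrative and are not measurements from the talk.

```python
def raw_performance(a_pe, f_pe):
    """Raw performance = Npe * fpe, with Npe approximated by 1 / Ape."""
    return f_pe / a_pe

# Hypothetical values: normalized PE area and frequency in MHz.
baseline = raw_performance(a_pe=1.0, f_pe=12.0)
pipelined = raw_performance(a_pe=1.25, f_pe=60.0)

gain = pipelined / baseline
print("raw performance gain:", gain)  # 4.0
```

In this made-up case a 5x frequency improvement bought with a 25% area overhead yields a 4x gain in raw performance, because fewer PEs fit on the device.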

Area overhead

Area overhead decreases as combinational datapath area cost grows

Frequency improvement

Speed improvement of up to one order of magnitude (for floats)

Raw performance

Speed improvement of up to one order of magnitude (for floats)

Conclusion

Extract very fine-grain ILP from the datapath as a whole

Simple space-time transformations, but they yield impressive results

Preserve circuit correctness and control logic regularity and simplicity

Performance benefits are limited by the lack of place-and-route-aware retiming tools