
Offline Synthesis of Online Dependence Testing: Parametric Loop Pipelining for HLS

Junyi Liu, Samuel Bayliss, George A. Constantinides
Department of Electrical and Electronic Engineering, Imperial College London, SW7 2AZ, United Kingdom

{junyi.liu13, s.bayliss08, g.constantinides}@imperial.ac.uk

Abstract—Loop pipelining is probably the most important optimization method in high-level synthesis (HLS), allowing multiple loop iterations to execute in a pipeline. In this paper, we extend the capability of loop pipelining in HLS to handle loops with uncertain memory behaviours. We extend polyhedral synthesis techniques to the parametric case, offloading the uncertainty to parameter values determined at run time. Our technique then synthesizes lightweight runtime checks to detect the case where a low initiation interval (II) is achievable, resulting in a run-time switch between aggressive (fast) and conservative (slow) execution modes. This optimization is implemented in an automated source-to-source code transformation framework with Xilinx Vivado HLS as one RTL generation backend. Over a suite of benchmarks, experiments show that our optimization can implement transformed pipelines at almost the same clock frequency as that generated directly with Vivado HLS, but with an approximately 10× smaller initiation interval in the fast case, while consuming approximately 60% more resources.

I. INTRODUCTION

High-level synthesis (HLS) tools have recently reached commercial maturity, enabling high hardware design productivity for field-programmable gate array (FPGA) technology. However, for many applications, there is still a considerable gap between the quality of results produced by HLS tools and those obtained by manually optimized RTL design. Computational bottlenecks are typically located in a few critical loops of high-level programs, and hence loop pipelining has emerged as one of the preeminent optimization techniques in HLS.

Fig. 1: Motivational code.

The optimization method of this paper addresses the issue of loops with uncertain data access patterns, which cause existing commercial HLS tools to take an overly conservative approach to pipeline scheduling. In the motivational loop shown in Fig. 1, there is one uncertain variable m, whose value is not known at compile time, in the read access pattern of array A. The loop iterator i is bounded by two constant bounds LB and UB. Whether the loop can be pipelined actually depends on the value of the parameter m. If m == -1, the result of each iteration has to be generated before the start of the next iteration, which implies an inter-iteration dependency. If m >= 0, there is no recurrence in this loop. This uncertain data dependency prevents modern HLS tools from exploiting loop pipelining by default.
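A C sketch of a loop with this shape is below. The exact body of Fig. 1 is not reproduced in the text, so the accumulation with B[i] and the bound values are assumptions for illustration only:

```c
#include <stddef.h>

#define LB 1   /* assumed constant lower bound */
#define UB 8   /* assumed constant upper bound */

/* Fig. 1-style loop: m is not known at compile time. If m == -1,
 * iteration i reads the value written by iteration i-1, creating an
 * inter-iteration dependency; if m >= 0, there is no recurrence. */
void motivational_loop(float A[], const float B[], int m)
{
    for (int i = LB; i <= UB; i++)
        A[i] = A[i + m] + B[i];   /* body is an assumed placeholder */
}
```

With m = 1 every read sees the original array contents before they are overwritten, so iterations are independent; with m = -1 each iteration consumes the value just produced by its predecessor, a tight recurrence.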

This is the basic idea of our approach: synthesize a lightweight runtime check (in this case m >= 0) that switches the pipeline schedule. These lightweight checks can be introduced, alongside appropriate loop-pipelining directives, through a source-to-source transformation applied before invoking a commercial HLS tool. The rest of this paper explains how this idea can be generalized through polyhedral analysis and automated within a tool flow.

II. RELATED WORK

In recent work on loop pipelining [1], [2], the authors rely on knowing, at compile time, all the dependencies that exist between operations in order to exploit pipeline scheduling. There are also active HLS research efforts, such as [3], [4], investigating loop pipelining for loops with irregular behaviours and structures. Polyhedral optimization has also been widely applied to optimizing custom memory systems in recent HLS research. The previous works in [5]–[8] apply polyhedral analysis to memory reuse and partitioning problems for improving loop latency and parallelism.

III. MOTIVATION

A. Loop Pipelining

Loop pipelining is implemented by overlapping the execution of loop iterations. Where read-after-write loop dependencies exist in the original code (a value is written in one iteration and read in a subsequent iteration), a pipelined schedule must be constrained to preserve these dependencies [1], [2]. The constant interval between the starts of successive iterations is called the initiation interval (II), and reflects the degree of parallelism, in the sense that for the same latency, a pipeline with a smaller II has more iterations running in parallel at any given clock cycle. If we denote the latency of a single loop iteration and the loop trip count as L and N respectively, then the computation delay of the whole loop is equal to L + (N − 1) · II. When N is large enough, the loop delay is approximately equal to N · II. Therefore, the performance of a loop is mainly determined by its II. Unlike resource constraints, which may vary with the requirements of different hardware implementations, the loop-dependency constraints implied by a loop are intrinsic to it. A complex dependency constraint can significantly limit our ability to reduce the II of a loop pipeline.
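The delay expression can be checked with a two-line helper (illustrative only; `loop_delay` is our name, not the paper's):

```c
/* Total delay of a pipelined loop: the first iteration completes after
 * L cycles, and each of the remaining N-1 iterations starts II cycles
 * after its predecessor. For large N this is dominated by N * II. */
long loop_delay(long L, long N, long II)
{
    return L + (N - 1) * II;
}
```

For L = 10 and N = 1000, halving II from 2 to 1 nearly halves the total delay, which is why II is the primary performance metric for a pipelined loop.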

B. Loop Dependence Analysis

To analyze the data dependencies of a loop, we need to formally model the memory access sequence. These patterns are described by loop bounds and array indexing functions, in which uncertain variables may participate. Two paired accesses are dependent if and only if the data point written in the current iteration will be read in a future iteration. If p is a vector of uncertain variables, the dependence iteration distance δ(p) is the smallest number of iterations between the executions of two such dependent memory accesses.

Fig. 2: Evaluate the safe region of δ(p).
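For the 1D loop of Fig. 1 (write A[i], read A[i+m]), δ(p) can be found by brute force; the following is a throwaway sketch of ours (the paper instead derives δ(p) in closed form via polyhedral analysis):

```c
/* Smallest d > 0 such that iteration i+d reads the element written by
 * iteration i, i.e. (i+d) + m == i. Returns 0 if no inter-iteration
 * read-after-write dependence exists within the iteration range. */
int iter_distance(int m, int lb, int ub)
{
    for (int d = 1; d <= ub - lb; d++)
        for (int i = lb; i + d <= ub; i++)
            if ((i + d) + m == i)
                return d;          /* first hit is the smallest d */
    return 0;
}
```

For m = -1 this gives δ = 1 (the tightest possible recurrence), while for any m >= 0 no forward dependence exists.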

Since the dependence iteration distance is variable in our target loops, we can evaluate a safe region of δ(p), in which no read access ever executes before the completion of its dependent write access during pipeline execution. The motivational example loop in Fig. 1 is used for illustration. As shown in Fig. 2, according to the given loop schedule, the latency L between the read access A[i+m] and the write access A[i] in iteration i is the period during which the start of the dependent read access A[i+δ(p)+m] in iteration i+δ(p) would violate the inter-iteration loop dependency. If the target initiation interval is equal to II, the number of iterations scheduled to start within the latency L is ⌈L/II⌉ − 1. To avoid violating the inter-iteration dependency, the iteration distance δ(p) should satisfy one of the conditions in (1), which together denote the safe region of δ(p). Intuitively, dependences between a write and a future read should either not exist (non-positive δ(p)) or should be enough iterations away that they do not impact scheduling decisions.

    δ(p) ≤ 0   or   δ(p) ≥ ⌈L/II⌉.        (1)
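In integer arithmetic, ⌈L/II⌉ is (L + II − 1)/II for positive L and II, so condition (1) reduces to a predicate like the following sketch (the function name is ours, not the paper's):

```c
/* Returns 1 iff delta lies in the safe region of condition (1):
 * either no forward dependence exists (delta <= 0), or the dependent
 * read is scheduled at least ceil(L/II) iterations later. */
int in_safe_region(long delta, long L, long II)
{
    long ceil_L_over_II = (L + II - 1) / II;  /* ceiling division */
    return delta <= 0 || delta >= ceil_L_over_II;
}
```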

C. Proposed pipeline optimization

In current HLS tools, only the worst case of an uncertain data dependency is considered for loop pipelining. To fully unleash the potential parallelism, we can implement conditional loop pipelining in hardware, optimized for runtime performance. The conceptual architecture of the proposed pipeline is shown in Fig. 3. The pipeline is able to speed up when the safe region detector determines that the loop dependency does not limit the loop parallelism. Since the conditions of the safe region can be calculated at compile time with our method, the hardware complexity of the detection logic can be kept minimal.

IV. POLYHEDRAL RUNTIME OPTIMIZATION

A. Parametric polyhedral analysis of safe region

In this work, we use a parametric polyhedral model to analyze the uncertain loop dependences. The example loop shown in Fig. 4 is analyzed as an illustration.

Fig. 3: Proposed pipeline architecture.

Fig. 4: Sample code of a 2D loop.

The uncertain variables in the memory access patterns can be represented by a parameter vector p ∈ ℤ^dP of the polyhedral model, where dP is the number of uncertain variables. In the example loop shown in Fig. 4, n and m are two uncertain variables, used in a loop bound and an array indexing expression respectively. LB and UB are constants known at compile time. The iteration domain of this sample loop can be represented by the inequality constraint shown below.

    [ −1   0   0   0 ] [ i ]      [ −LB ]
    [  1   0   0   0 ] [ j ]      [  UB ]
    [  0  −1   0   0 ] [ m ]  ≤   [ −LB ]
    [  0   1   0  −1 ] [ n ]      [   0 ]

The memory read access A[i-1][j+m] in Fig. 4 has an array indexing expression with one uncertain variable m: [i−1, j+m]ᵀ, where

    [ i−1 ]   [ 1  0 ] [ i ]   [ 0  0 ] [ m ]   [ −1 ]
    [ j+m ] = [ 0  1 ] [ j ] + [ 1  0 ] [ n ] + [  0 ] .

By linking the array indexing expressions of a write access and a read access to the same array, we can obtain the iteration dependency map between two dependent iterations accessing the same memory element, which signifies the possible presence of a data dependence. The equality constraint of the iteration dependency map can be used to formulate the dependence vector distance function (v′ − v), where v and v′ represent the iteration vectors of the memory write (source) and read (sink) accesses respectively. For the example loop in Fig. 4, the dependence vector distance function of the memory write access A[i][j] is

    v′ − v = − [ 0  0 ] [ m ] + [ 1 ]  =  [  1 ]
               [ 1  0 ] [ n ]   [ 0 ]     [ −m ] .

Knowing a time stamp function t(p), we calculate t(p)ᵀ(v′ − v) to transform the vector distance into scalar form, which is equivalent to the dependence iteration distance δ(p) introduced in Section III. Therefore, the conflict region C is the set of parameter values p such that δ(p) satisfies neither condition in (1).

Fig. 5: Tool flow of code transformation framework.

The example loop in Fig. 4 has the time stamp function t(p)ᵀ = [n − LB + 1, 1], so that t(p)ᵀ(v′ − v) = n − LB + 1 − m. This result indicates that the write access A[i][j] of each iteration has its dependent read access A[i-1][j+m] appearing in its next (n − m − LB + 1)-th iteration. Therefore, the conflict region of p in this example loop is

    C = { [m, n]ᵀ ∈ ℤ² | 1 ≤ n − m − LB + 1 ≤ ⌈L/II⌉ − 1 }.

With this C of the example loop generated, two safe-region constraints on p can easily be obtained:

    n − m ≤ LB − 1   or   n − m ≥ ⌈L/II⌉ + LB − 1.

These constraints cover the vectors p ∉ C for which the iteration dependency map exists. In addition, the safe region includes those vectors p that result in no iteration dependency at all, i.e. for which the iteration dependency map does not exist; in the example loop this is n + m ≤ LB − 1. This kind of constraint corresponds to the case where A[i-1][j+m] in any iteration never reads data written by A[i][j] in any previous iteration.
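For the Fig. 4 example, the three conditions combine into one lightweight detector; a sketch assuming ⌈L/II⌉ has been precomputed at compile time (as in our flow), with function and parameter names of our own choosing:

```c
/* Safe-region detector for the Fig. 4 example: the fast pipeline may
 * run iff (m, n) lies outside the conflict region C, or no iteration
 * dependency map exists at all. */
int fast_mode_ok(long m, long n, long LB, long ceil_L_over_II)
{
    return (n - m <= LB - 1)                   /* delta(p) <= 0           */
        || (n - m >= ceil_L_over_II + LB - 1)  /* delta(p) >= ceil(L/II)  */
        || (n + m <= LB - 1);                  /* dependence never exists */
}
```

This is exactly the kind of comparison logic the runtime check compiles down to: a handful of adds and compares on scalar parameters, evaluated once before the loop starts.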

B. Source-to-source Transformation

The polyhedral analysis that generates the safe region, introduced above, is implemented as an algorithm using the Integer Set Library (ISL) [9]. To make our new loop optimization compatible with a commercial HLS tool, we integrated our analysis algorithm into the source-to-source code transformation framework shown in Fig. 5. In this paper, we select Xilinx Vivado HLS as the HLS back-end tool to generate RTL code from the original and transformed C code. The HLS tool is first used to synthesize the original code to generate the scheduling information needed for the subsequent loop analysis. The loop information is captured by two open-source tools. The Clang front-end parser [10] generates an abstract syntax tree (AST) from the input C code. The Polyhedral Extraction Tool (PET) [11] extracts the loops as static control parts (SCoPs) with ISL from the Clang AST. PoTHoLeS [12] is a polyhedral compilation tool developed by us on top of ISL, which conducts user-specified loop analysis and transformation. Finally, the transformed C code is generated by PoTHoLeS. Fig. 6 illustrates the code transformation result of our framework. The input 2D loop has constant loop bounds and one uncertain variable m in the write access A[j][i-m]. Currently, potential violations of the uncertain memory access patterns against real memory bounds are not considered in the generation of the safe region.
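In the spirit of Fig. 6, the transformed source has roughly the following shape. This is a hand-written sketch, not literal tool output: the 1D body and the check m >= 0 are taken from the Fig. 1 example rather than the 2D loop of Fig. 6, and the pragma spellings follow Vivado HLS conventions but are assumptions here.

```c
/* Conditional loop pipelining: a runtime check selects between an
 * aggressively pipelined copy (dependences proven absent) and a
 * conservative copy of the same loop. */
void transformed_loop(float A[], const float B[], int n, int m)
{
    if (m >= 0) {
        /* fast mode: parameters detected inside the safe region */
        for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
#pragma HLS DEPENDENCE variable=A inter false
            A[i] = A[i + m] + B[i];
        }
    } else {
        /* slow mode: conservative schedule honouring the possible
         * recurrence through A */
        for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE
            A[i] = A[i + m] + B[i];
        }
    }
}
```

Both branches compute the same function; only the scheduling directives differ, so the duplication costs area (as Section V quantifies) but never correctness.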

Fig. 6: Code transformation for a 2D loop nest.

V. EXPERIMENTAL RESULTS

In this work, our code transformation framework currently uses Xilinx Vivado HLS 2014.4 as the RTL generation backend. The target FPGA device is a Virtex-7 XC7VX485T. Instead of the scheduled latency between two dependent memory accesses, we use the entire iteration latency in cycles from the loop scheduling information, which may shrink the safe region to a smaller, but still correct, region. In addition, no resource limitation is applied in the benchmarks. For calculating the target initiation interval, we consider just one physical limitation: the block RAM in Virtex-7 is dual-port memory.

A. Benchmarks

We choose six loops as benchmarks for our experimental study in this paper. The data type of all memory arrays is single-precision floating point. All uncertain variables are int values, i.e. they lie between INT_MIN and INT_MAX as defined in <limits.h>. The benchmark loops are derived from real applications and other publicly available benchmarks, and are introduced below.

Fig. 7: 1D loop of tri_sp_slv.

typ_loop is the 2D loop shown in Fig. 4 in Section IV. row_col is a 2D loop simplified from the example shown on page 208 of the Xilinx Vivado HLS user guide [13]. pivot is a 2D loop extracted from the forward reduction step (line 208) of the Gaussian elimination with pivoting code from MIT [14]. tri_sp_slv is a 1D loop as shown in Fig. 7, where lb, ub and m are uncertain variables. It is obtained from a triangular sparse matrix solver. jacobi_2d is a modified 2D loop from the 2D Jacobi stencil computation of Polybench [15]. Its offsets of ±1 in the memory access patterns are replaced by ± one uncertain variable. adi_int is a 2D loop from Kernel 8 of the Livermore Loops benchmark suite [16].

B. Results Analysis

In all experiments, the target clock period is set to 3 ns. We export the generated RTL code to the Xilinx Vivado design suite to collect clock and resource usage results after RTL synthesis, place and route. Furthermore, all generated pipelines are tested by C/RTL co-simulation with dedicated testbenches.


TABLE I. Pipeline performance and resource usage results.

| Benchmark  | Clock (ns) (Orig/Tran) | Iteration Cycles (Orig/Tran_S/Tran_F) | Initiation Interval (Orig/Tran_S/Tran_F) | Cycles Before Loop (Orig/Tran) | LUT (Orig/Tran) | FF (Orig/Tran) | DSP48E1 (Orig/Tran) |
|------------|------------------------|---------------------------------------|------------------------------------------|--------------------------------|-----------------|----------------|---------------------|
| typ_loop   | 3.574 / 3.272          | 20 / 20 / 21                          | 12 / 12 / 1                              | 6 / 10                         | 785 / 1103      | 1015 / 1503    | 4 / 6               |
| row_col    | 3.146 / 3.100          | 5 / 5 / 5                             | 2 / 2 / 1                                | 8 / 9                          | 267 / 436       | 578 / 946      | 4 / 8               |
| pivot      | 2.556 / 2.523          | 49 / – / 55                           | 47 / – / 3                               | 6 / 5                          | 1421 / 1438     | 2328 / 2531    | 9 / 10              |
| tri_sp_slv | 3.137 / 2.888          | 25 / 25 / 28                          | 18 / 18 / 2                              | 1 / 3                          | 520 / 865       | 725 / 1007     | 6 / 7               |
| jacobi_2d  | 2.523 / 2.887          | 55 / 55 / 60                          | 48 / 48 / 3                              | 1 / 2                          | 657 / 1376      | 923 / 1739     | 8 / 13              |
| adi_int    | 3.889 / 3.634          | 68 / 63 / 66                          | 52 / 52 / 3                              | 5 / 6                          | 1500 / 4512     | 1984 / 5150    | 13 / 21             |
| Geomean    | 3.098 / 3.031          | 27.765 / 27.414 / 29.359              | 19.237 / 19.237 / 1.944                  | 3.360 / 5.030                  | 731 / 1244      | 1104 / 1786    | 7 / 10              |
| Ratio      | 1 / 0.98               | 1 / 0.99 / 1.06                       | 1 / 1.00 / 0.10                          | 1 / 1.50                       | 1 / 1.70        | 1 / 1.62       | 1 / 1.47            |

Table I provides detailed results of pipeline performance and resource usage. "Tran_S" and "Tran_F" correspond to the slow and fast execution modes of the transformed pipelines. The column "Cycles Before Loop" gives the number of cycles for the operations executed before the start of the pipeline. It should be noted that for the benchmark pivot, the analysis shows that the uncertain variables are always inside the safe region; our implementation therefore only generates the fast pipeline mode for pivot.

Firstly, the HLS tool is able to achieve clock periods for the transformed pipelines very close to those of their original implementations. The fast pipeline mode has its geometric mean single-iteration latency increased by 6%, which buys more opportunities for HLS scheduling to achieve a small initiation interval. Having proved the absence of data dependencies in our analysis, the backend tool is instructed to ignore all dependencies using a #pragma. This allows the geometric mean initiation interval of the fast pipeline mode to be 10× smaller than that of the original code. According to Fig. 3, the detector logic is observed to increase the geometric mean "Cycles Before Loop" by 50%. However, this is still a small number of cycles if the loop body has a fairly large number of iterations, which also indicates that the detector logic is lightweight.

As shown in Table I, the overall resource increase is around 60%. Vivado HLS is observed to apply resource sharing to the floating-point units and memory ports implied by the two different loop modes. Meanwhile, some other resources, such as memory address calculation, are not shared by Vivado HLS, which accounts for a large part of the resource overhead of our transformed pipelines. Besides, the increase in DSPs is also related to the increase in parallelism, which requires more operations running at the same time, especially for large loops like jacobi_2d and adi_int.

VI. CONCLUSION

In this paper, we proposed a new optimization method for one class of loops with uncertainty. This method combines compiler-based analysis and runtime optimization. With experiments over a suite of benchmarks, we show that the fast runtime mode of the optimized pipelines reduces the initiation interval by 90%, i.e. a 10× speedup. This comes at the cost of a 60% increase in resource usage. In future work, we intend to lift the restriction to schedules that are linear in the uncertain parameters, allowing more complex loop iteration spaces to be analyzed.

ACKNOWLEDGEMENT

The authors acknowledge the support of Imagination Technologies, the Royal Academy of Engineering, and EPSRC (EP/K034448/1, EP/I020357/1, EP/I012036/1).

REFERENCES

[1] Z. Zhang and B. Liu, "SDC-based modulo scheduling for pipeline synthesis," in Proceedings of the International Conference on Computer-Aided Design (ICCAD '13). Piscataway, NJ, USA: IEEE Press, 2013, pp. 211–218.

[2] A. Canis, S. D. Brown, and J. H. Anderson, "Modulo SDC scheduling with recurrence minimization in high-level synthesis," in 24th International Conference on Field Programmable Logic and Applications (FPL), Sept 2014, pp. 1–8.

[3] S. Dai, M. Tan, K. Hao, and Z. Zhang, "Flushing-enabled loop pipelining for high-level synthesis," in Proceedings of the 51st Annual Design Automation Conference (DAC '14). New York, NY, USA: ACM, 2014, pp. 76:1–76:6.

[4] M. Alle, A. Morvan, and S. Derrien, "Runtime dependency analysis for loop pipelining in high-level synthesis," in Proceedings of the 50th Annual Design Automation Conference (DAC '13). New York, NY, USA: ACM, 2013, pp. 51:1–51:10.

[5] Q. Liu, G. Constantinides, K. Masselos, and P. Y. K. Cheung, "Automatic on-chip memory minimization for data reuse," in 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2007), April 2007, pp. 251–260.

[6] S. Bayliss and G. A. Constantinides, "Optimizing SDRAM bandwidth for custom FPGA loop accelerators," in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '12). New York, NY, USA: ACM, 2012, pp. 195–204.

[7] L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong, "Polyhedral-based data reuse optimization for configurable computing," in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '13). New York, NY, USA: ACM, 2013, pp. 29–38.

[8] Y. Wang, P. Li, and J. Cong, "Theory and algorithm for generalized memory partitioning in high-level synthesis," in Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '14). New York, NY, USA: ACM, 2014, pp. 199–208.

[9] "Integer Set Library." [Online]. Available: http://isl.gforge.inria.fr/

[10] "Clang." [Online]. Available: http://clang.llvm.org

[11] "Polyhedral Extraction Tool." [Online]. Available: http://freecode.com/projects/libpet/

[12] "PoTHoLeS: Polyhedral Compilation tool for High Level Synthesis." [Online]. Available: https://github.com/SamuelBayliss/Potholes

[13] "Vivado Design Suite User Guide: High-Level Synthesis." [Online]. Available: http://www.xilinx.com/support/documentation/sw_manuals/xilinx2014_4/ug902-vivado-high-level-synthesis.pdf

[14] "Gaussian elimination with pivoting." [Online]. Available: http://web.mit.edu/10.001/Web/Course_Notes/Gauss_Pivoting.c

[15] "Polybench." [Online]. Available: http://web.cse.ohio-state.edu/~pouchet/software/polybench/

[16] "Livermore loops coded in C." [Online]. Available: http://www.netlib.org/benchmark/livermorec