On Data Forwarding in Deeply Pipelined Soft Processors 5/3_fajhmy_fpga2015.pdf · On Data Forwarding in Deeply Pipelined Soft Processors H. Y. Cheah, S. A. Fahmy, N. Kapre School

On Data Forwarding in Deeply Pipelined Soft Processors

H. Y. Cheah, S. A. Fahmy, N. Kapre School of Computer Engineering

Nanyang Technological University

FPGA 2015

Introduction

• Hard macros allow us to build high frequency soft processors (Octavo, iDEA, etc.)

• To achieve high frequency, deep pipelining is required

• Results in significant idle cycles to resolve dependencies

• Full data forwarding is too complex for a lean processor

• Exploit loopback path in DSP block to provide forwarding in iDEA soft processor

2

iDEA Soft Processor

• Fundamental novelty is exploiting DSP block dynamic control signals

3

A1

B1

A2

B2

M P

D

A

B

C

= patterndetect

C

AD

D

P

30

25

18

48

48

18

48

48

25

INMODE ALUMODEOPMODE

iDEA Soft Processor

• A DSP-block based soft processor • Maximises use of the DSP48E1 block • 450MHz+ achievable frequency • Architecture described in detail in FPT 2012 and

TRETS 2014

4

Fetch Decode Execute WriteBack

DSP

IMEM RF DMEM

loopback

iDEA Soft Processor

• We deeply pipeline iDEA by adding extra cycles to the different processor stages

• DSP48E1 primitive requires 3 cycles to achieve maximum speed

• Multiple ways of distributing extra cycles among stages

5

Fetch Decode Execute WriteBack

DSP

IMEM RF DMEM

loopback

Deep Pipelining and NOPs

• High number of empty cycles required to pad dependent instructions, increasing with depth

6

4 5 6 7 8 9 10 11 12 13 14 150

10 000

20 000

30 000

40 000

Pipeline Depth

Number

ofNOPs

crc fib firmed mmul qsort

Padding for Dependencies

• Instruction cannot enter decode stage until its input operand has been written back

• Insert empty instructions (NOPs) between dependent instructions

7




7

IF ID ID EX EX WB

IF ID ID EX EX WB4 NOPs




7

IF IF ID ID EX EX EX WB

IF IF ID ID EX EX EX WB5 NOPs

IF ID ID EX EX WB

IF ID ID EX EX WB4 NOPs


• A few routes to overcoming this issue: • Dynamic in hardware – cost • Static in software

• External Loopback – add a forwarding path around the execute stage

• Internal Loopback – proposed approach using DSP block feature

• Hardware support through modified opcodes

8

Assembly with Loopback• Original Assembly li $1, x li $2, a li $3, b li $4, c mul $5, $1, $2 nop nop mul $6, $5, $1 mul $5, $1, $3 nop nop add $7, $5, $6 nop nop add $8, $7, $4 nop nop sw $8, 0($y)

9

• With loopback instructions li $1, x li $2, a li $3, b li $4, c mul loop0, $1, $2 mul $6, loop0, $1 mul loop1, $1, $3 add loop2, loop1, $6 add $8, loop2, $4 nop nop sw $8, 0($y)

→ save 6 instructions

External Forwarding

• Add a path around the execute stage • Allows a dependent instruction to start its

execute stage once the previous one has completed

10

A

B

P

DSP Block

External Forwarding



10

A

B

P

DSP Block

External Forwarding



10

A

B

P

DSP Block


IF IF ID ID EX EX EX WB2 NOPs

Internal Forwarding

• DSP block primarily used in filtering • Multiply-accumulate is functional primitive • Loopback path allows ALU output to be used for

accumulation • Dynamic control signal

determines whetherthis path is used

11

A

B

P

DSP Block

Internal Forwarding




11

A

B

P

DSP Block

Internal Forwarding




11

A

B

P

DSP Block



Loopback Analysis

• Identify loopback opportunities in assembly • Loopback instruction — a subsequent dependent

arithmetic operation • For external loopback: sufficient NOPs to pad the

execute stage • For internal loopback: no NOPs required • NOPs are inserted for dependencies that cannot

be resolved by this forwarding path

12

Parametric Verilog

Constraints

XilinxISE

BenchmarkC code

LLVMMIPS

Area

Freq.

Cycles

LoopbackAnalysis

Functional Simulator

RTL SimulatorAssembler

FPGADriver

ML605Platform

In-System Execution

Loopback Analysis

13

Processor Design Space

• First explore the impact of pipeline depth on area and frequency

• Parameterised register stages in each processor stage (1–5 cycles for Fetch, Decode, Execute)

• Overall pipeline depths of 4–15 cycles • Multiple possible combinations for each overall

depth

14

Processor Design Space• Consider all combinations of stage depths for each

overall processor depth • Reach close to 500MHz from 10 stages onwards

154 5 6 7 8 9 10 11 12 13 14 15

200

250

300

350

400

450

500

Pipeline Depth

Frequen

cy(M

Hz)


• Clusters near where DSP block throughput limits performance

• Significantly more registers than LUTs

16150 200 250 300 350 400 450 500

200

400

600

800

1,000

Frequency (MHz)

AreaUnit

RegistersLUTs


• Generally minimal impact for internal forwarding and a little more for external forwarding

17

0

500

1,000

237 371

370

407 479 543

542 602

764

756

754

992

232 372

375

416 482 543

535 611

795 877

866 990

179

377

373

362 5

18 585

591 662

823

793

793

1,000

Registers

None Internal Loopback External Forwarding

4 5 6 7 8 9 10 11 12 13 14 15

0

500

280

281

283

319

336

349

362

369

384

385

383

430

288

301

298

335

348

360

374

378

422

425

408

431

331

314

310 362

367

380

393

406

420

418

418

457

Pipeline Depth

LUTs

IPC Improvement

• External Loopback • Padding NOPs not

totally eliminated • For depths of 4–6

(with 1-cycle EX), no NOPs needed

• As EX depth increases, savings are curtailed due to need for some NOPs

18

4 5 6 7 8 9 10 11 12 13 14 150

5

10

15

20

25

30

35

Pipeline Depth

%IP

CSavings

crc

fib fir

median mmult

qsort

IPC Improvement

• Internal Loopback • Sustained savings

over no forwarding • Benchmarks with long

dependency chains • A 5–30%

improvement

19

4 5 6 7 8 9 10 11 12 13 14 150

5

10

15

20

25

30

35

Pipeline Depth

%IP

CSavings

crc

fib fir

median mmult

qsort

Execution Time

• Frequency and geomean wall clock time for all benchmarks • Frequency of external loopback implementation lags initially • At higher depths, differences disappear as the extra cycles

are added to stages other than execute • A 25% improvement in runtime compared to no forwarding

20

14

16

18

20

22

24

26

28

30Tim

e(us)

4 5 6 7 8 9 10 11 12 13 14 15

200

250

300

350

400

450

500

Pipeline Depth

Frequen

cy(M

Hz)

Frequency (Int)

Frequency (Ext)

Internal Loopback

External Forwarding

Execution Time

• External loopback • Depth 4 — 6 increase in execution time due to lower

operating frequency • Gap is significant in shallower depths, but closes as depth

increases • Reduced frequency is fundamental barrier 21

14

16

18

20

22

24

26

28

30Tim

e(us)

4 5 6 7 8 9 10 11 12 13 14 15

200

250

300

350

400

450

500

Pipeline Depth

Frequen

cy(M

Hz)

Frequency (Int)

Frequency (Ext)

Internal Loopback

External Forwarding

Execution Time

• Internal loopback • Minimum runtime is at 10 cycle processor depth • Anomalous peak at depth 9, due to a significant EX cycle

increase • Results in 11–20% improvement at low depths

22

14

16

18

20

22

24

26

28

30Tim

e(us)

4 5 6 7 8 9 10 11 12 13 14 15

200

250

300

350

400

450

500

Pipeline Depth

Frequen

cy(M

Hz)

Frequency (Int)

Frequency (Ext)

Internal Loopback

External Forwarding

Limitations

• As a post-assembly step, limited by the instruction order chosen by compiler

• Only ALU operations (add, subtract) can benefit from internal forwarding

• Ideally, a compiler that could ensure such instructions are kept together would improve results

23

Potential

• Aside from small benchmarks, explored potential for CHStone benchmarks

24

Table 5: LLVM IR Profiling Results for CHStone

Benchmark

Static Dynamic

Instr. Occur. % Instr. Occur. %

adpcm 1367 184 13 71,105 8,300 11aes 2259 51 2 30,596 3,716 12blowfish 1184 314 26 711,718 180,396 25gsm 1205 82 6 27,141 1,660 6jpeg 2388 95 4 1,903,085 131,092 6mips 378 15 3 31,919 123 0.3mpeg 782 80 10 17,032 60 0.3sha 405 64 15 990,907 238,424 24

and just-in-time (JIT) compiler to obtain dynamic counts ofpossible loopback occurrences. The results in Table 5 showa mean occurrence of over 10% within these benchmarks.We cannot currently support CHStone completely due tomissing support for 32b instructions and other developmentissues.

Program Size Sensitivity: We also synthesized RTLfor iDEA with increased instruction memory sizes to holdlarger programs. We observed, that iDEA maintains its op-timal frequency for up to 8 BRAMs. Beyond this, frequencydegrades by 10–30% to support routing delays and place-ment e↵ects of these larger memories. However, we envi-sion tiling multiple smaller soft processors with fewer than8 BRAMs to retain frequency advantages for a larger multi-processor system, and to reflect more closely the resource ra-tio on modern FPGAs. Each processor in this system wouldhold only a small portion of the complete system binary.

6. CONCLUSIONS AND FUTURE WORKWe have shown an e�cient way of incorporating data

forwarding in DSP block based soft processors like iDEA.By taking advantage of the internal loopback path typicallyused for multiply accumulate operations, it is possible to al-low dependent ALU instructions to immediately follow eachother, eliminating the need for padding NOPs. The resultis an increase in e↵ective IPC, and 5– 30% (mean 25%) im-provement in wall clock time for a series of benchmarks whencompared to no forwarding and a 5% improvement whencompared to external forwarding. We have also undertakenan initial study to explore the potential for such forwardingin more complex benchmarks by analysing LLVM interme-diate representation, and found that such forwarding is sup-ported in a significant proportion of dependent instructions.We aim to finalise full support for the CHStone benchmarksuite as well as open-sourcing the updated version of iDEAand the toolchain we have described.

7. REFERENCES[1] Aeroflex Gaisler. GRLIB IP Library User’s Manual,

2012.[2] Altera Corpration. Nios II Processor Design, 2011.[3] G. M. Amdahl. Validity of the single processor

approach to achieving large scale computingcapabilities. In Proceedings of the Spring JointComputer Conference, pages 483–485, 1967.

[4] ARM Ltd. Cortex-M1 Processor, 2011.

[5] S. Beyer, C. Jacobi, D. Kroning, D. Leinenbach, andW. J. Paul. Putting it all together - formal verificationof the VAMP. International Journal on Software Toolsfor Technology Transfer, 8:411–430, 2006.

[6] H. Y. Cheah, F. Brosser, S. A. Fahmy, and D. L.Maskell. The iDEA DSP block based soft processor forFPGAs. ACM Transactions on ReconfigurableTechnology and Systems, 7(3):19, 2014.

[7] H. Y. Cheah, S. A. Fahmy, and N. Kapre. Analysisand optimization of a deeply pipelined FPGA softprocessor. In Proceedings of the InternationalConference on Field Programmable Technology (FPT),pages 235–238, 2014.

[8] H. Y. Cheah, S. A. Fahmy, and D. L. Maskell. iDEA:A DSP block based FPGA soft processor. InProceedings of the International Conference on FieldProgrammable Technology (FPT), pages 151–158, Dec.2012.

[9] P. G. Emma and E. S. Davidson. Characterization ofbranch and data dependencies in programs forevaluating pipeline performance. IEEE Transactionson Computers, 36:859–875, 1987.

[10] Y. Hara, H. Tomiyama, S. Honda, and H. Takada.Proposal and quantitative analysis of the CHStonebenchmark program suite for practical C-basedhigh-level synthesis. In Journal of InformationProcessing, 2009.

[11] A. Hartstein and T. R. Puzak. The optimum pipelinedepth for a microprocessor. ACM Sigarch ComputerArchitecture News, 30:7–13, 2002.

[12] R. Jayaseelan, H. Liu, and T. Mitra. Exploitingforwarding to improve data bandwidth ofinstruction-set extensions. In Proceedings of theDesign Automation Conference, pages 43–48, 2006.

[13] N. Kapre and A. DeHon. VLIW-SCORE: Beyond Cfor sequential control of SPICE FPGA acceleration. InProceedings of the International Conference on FieldProgrammable Technology (FPT), Dec. 2011.

[14] C. E. LaForest and J. G. Ste↵an. Octavo: anFPGA-centric processor family. In Proceedings of theACM/SIGDA International Symposium on FieldProgrammable Gate Arrays (FPGA), pages 219–228,Feb. 2012.

[15] Lattice Semiconductor Corp. LatticeMico32 ProcessorReference Manual, 2009.

[16] C. Lattner and V. Adve. LLVM: A compilationframework for lifelong program analysis andtransformation. In International Symposium on CodeGeneration and Optimization, pages 75–86, 2004.

[17] M. Sami, D. Sciuto, C. Silvano, V. Zaccaria, andR. Zafalom. Exploiting data forwarding to reduce thepower budget of VLIW embedded processors. InProceedings of Design, Automation and Test inEurope, 2001, pages 252–257, 2001.

[18] K. Vipin, S. Shreejith, D. Gunasekara, S. A. Fahmy,and N. Kapre. System-level FPGA device driver withhigh-level synthesis support. In Proceedings of theInternational Conference on Field ProgrammableTechnology (FPT), pages 128–135, Dec. 2013.

[19] Xilinx Inc. UG081: MicroBlaze Processor ReferenceGuide, 2011.

Conclusion

• Exploiting low-level DSP block feature enabled efficient data forwarding to overcome significant number of padding NOPs

• Maintain frequency of iDEA at close to 500MHz • Up to 25% improvement across range of

benchmarks • Demonstrated applicability to larger programs

25

On Data Forwarding in Deeply Pipelined Soft Processors

Thank You

Documents

On Data Forwarding in Deeply Pipelined Soft Processors 5/3_fajhmy_fpga2015.pdf · On Data Forwarding in Deeply Pipelined Soft Processors H. Y. Cheah, S. A. Fahmy, N. Kapre School