Conjoining Soft-Core FPGA Processors David Sheldon a , Rakesh Kumar b , Frank Vahid a* , Dean Tullsen b , Roman Lysecky c a Department of Computer Science and Engineering University of California, Riverside *Also with the Center for Embedded Computer Systems at UC Irvine b Department of Computer Science and Engineering University of California, San Diego c Department of Electrical and Computer Engineering University of Arizona This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and by hardware and software donations from Xilinx


Conjoining Soft-Core FPGA Processors

David Sheldon (a), Rakesh Kumar (b), Frank Vahid (a)*, Dean Tullsen (b), Roman Lysecky (c)

(a) Department of Computer Science and Engineering, University of California, Riverside
* Also with the Center for Embedded Computer Systems at UC Irvine
(b) Department of Computer Science and Engineering, University of California, San Diego
(c) Department of Electrical and Computer Engineering, University of Arizona

This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and by hardware and software donations from Xilinx


FPGA Soft Core Processors

Soft-core processor: HDL description
Flexible implementation: FPGA or ASIC
Technology independent

[Figure: one HDL description targets either an FPGA (Spartan 3, Virtex 2, Virtex 4) or an ASIC]


FPGA Soft Core Processors

Soft-core processors can have configurable options: datapath units, cache, bus architecture

Current commercial FPGA soft-core processors: Xilinx MicroBlaze, Altera Nios

[Figure: FPGA containing a μP with optional cache, FPU, and MAC units]


Conjoinment Overview

Add necessary units to both processors
“Conjoining”: conjoin the FPU unit

[Figure: Application 1 and Application 2 each run on a base microprocessor with its own FPU; after conjoining, the two base microprocessors share one conjoined FPU unit]


Conjoinment Background

Conjoinment proposed for multicore desktop processing (Kumar 2004)
Reduces size with reasonable performance overhead
e.g., cache conjoinment overhead: 1%-13%

[Figure: ICache sharing and DCache sharing]


Outline

Conjoinment for soft-core FPGA processors
Area savings
Performance overhead
Tuning heuristic for two configurable soft-cores with conjoin option

[Figure: size versus perf sketch]


Area Savings

Significant potential area savings

Limitations
Does not consider multiplexing costs, due to the absence of FPGA synthesis tools supporting conjoinment
But the good potential justifies further investigation

Unit             Size (equivalent LUTs)
Multiplier       1331
Barrel Shifter   228
Divider          122
FPU              2738

[Figure: bar chart of equivalent LUTs (0 to 10000), unconjoined versus conjoined, by unit instantiated with the base processor. Labeled savings: bs 6%, div 4%, mul 23%, fpu 32%. Diagram: base MicroBlaze with multiplier, barrel shifter, divider, and FPU.]
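The savings percentages can be reproduced with a short calculation. This is a minimal sketch that assumes two processors, counts only LUTs, and ignores multiplexing costs (as the slide does). The base MicroBlaze size is not given in the slides; `ASSUMED_BASE_LUTS` is a hypothetical value chosen so that the FPU case lands near the 32% label.

```python
def conjoin_savings(base_luts, unit_luts):
    """Fraction of area saved when two processors share one copy of a unit.

    Unconjoined: two base processors, each with its own copy of the unit.
    Conjoined:   two base processors and one shared copy (mux cost ignored).
    """
    unconjoined = 2 * (base_luts + unit_luts)
    conjoined = 2 * base_luts + unit_luts
    return (unconjoined - conjoined) / unconjoined


# Unit sizes in equivalent LUTs, taken from the Unit/Size table on this slide.
UNIT_LUTS = {"bs": 228, "div": 122, "mul": 1331, "fpu": 2738}

# Hypothetical base-processor size: an assumption, not stated in the slides.
ASSUMED_BASE_LUTS = 1540

for name, luts in UNIT_LUTS.items():
    print(f"{name}: {conjoin_savings(ASSUMED_BASE_LUTS, luts):.0%}")
```

Under that assumed base size, the four results round to the percentages labeled on the chart, which suggests the chart follows this simple two-processor model.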


Performance Overhead

No simulator exists for conjoined processors
We developed our own: a trace-based conjoined processor simulator
Simulation uses pessimistic performance assumptions
Kumar's techniques can improve on these
Simulator outputs contention information
Final cycles can be compared to unconjoined to determine performance overhead

[Figure: two applications (e.g., brev and bitmnp) run on the Xilinx simulator to produce trace1 and trace2, which feed the conjoined simulator; the output marks access stalls and contention stalls]
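The slides do not describe the simulator's internals, so the following is only a minimal sketch of what a trace-based conjoined simulation could look like. Everything here is an assumption made for illustration: each trace is a string in which "U" is an instruction that uses the shared unit and any other character is a one-cycle instruction; the shared unit has a fixed latency, adds a fixed access delay, serves one request at a time, and gives processor 0 priority on ties.

```python
def unconjoined_cycles(trace, unit_latency=3):
    """Cycles for one trace when the processor owns its unit (no sharing)."""
    return sum(unit_latency if op == "U" else 1 for op in trace)


def simulate_conjoined(trace1, trace2, unit_latency=3, access_delay=1):
    """Replay two traces against one shared unit (hypothetical timing model).

    Returns ([cycles_proc0, cycles_proc1], stall counts by cause).
    """
    traces = (trace1, trace2)
    now = [0, 0]        # current cycle of each processor
    pos = [0, 0]        # next instruction index in each trace
    unit_free_at = 0    # first cycle at which the shared unit is idle
    stalls = {"access": 0, "contention": 0}

    while pos[0] < len(trace1) or pos[1] < len(trace2):
        # Advance whichever processor is furthest behind in time, so that
        # requests reach the shared unit in issue order (ties: processor 0).
        ready = [p for p in (0, 1) if pos[p] < len(traces[p])]
        p = min(ready, key=lambda k: now[k])
        op = traces[p][pos[p]]
        pos[p] += 1
        if op == "U":
            arrival = now[p] + access_delay          # access stall
            start = max(arrival, unit_free_at)       # wait if unit is busy
            stalls["access"] += access_delay
            stalls["contention"] += start - arrival
            unit_free_at = start + unit_latency
            now[p] = unit_free_at
        else:
            now[p] += 1
    return now, stalls


cycles, stalls = simulate_conjoined("U.U", "UU")
overhead = cycles[0] / unconjoined_cycles("U.U") - 1
print(cycles, stalls, f"{overhead:.0%}")
```

Comparing the conjoined cycle counts with `unconjoined_cycles` gives the per-application overhead, and the stall counters separate access delay from contention, mirroring the two stall types in the figure.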


Performance Overhead

[Figure: bar chart of speedup (0 to 4.5), conjoined versus unconjoined, for pairings of brev, bitmnp, and canrdr; callouts of 17% and 2.4%]

Speedup: application time on optimally configured processor / avg. app. time on base processor
Compared configuration with conjoinment versus without
Performance overhead usually small, averaging just 4.2%
Overhead caused by access delays and contention for the hardware units


Tuning Heuristic

5 choices per unit, e.g., FPU: no unit, processor 1 only, processor 2 only, both 1 and 2, or conjoined
4 units: 5^4 = 625 possible configurations
Simulation: ~30 minutes per configuration
Need a search heuristic to tune

[Figure: two base MicroBlaze processors (1 and 2) with multiplier, barrel shifter, and divider units, showing the FPU options: no FPU, FPU 1, FPU 2, or a conjoined FPU]
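To make the size of the search space concrete, the configurations can be enumerated directly. This is a small sketch; the unit and choice names are illustrative labels, not identifiers from any tool.

```python
from itertools import product

UNITS = ("bs", "div", "mul", "fpu")
CHOICES = ("none", "proc1", "proc2", "both", "conjoined")  # 5 choices per unit


def all_configurations():
    """Every assignment of one choice to each of the four units."""
    return [dict(zip(UNITS, combo)) for combo in product(CHOICES, repeat=len(UNITS))]


configs = all_configurations()
minutes_per_config = 30  # approximate figure from the slide
total_hours = len(configs) * minutes_per_config / 60
print(len(configs), total_hours)
```

At roughly 30 minutes each, simulating all 625 configurations takes about 312 hours (around 13 days) for a single pair of applications, which is why a search heuristic is needed.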


Map to 0-1 Knapsack Problem

Creating the model

                 BS     FPU    MUL    DIV
Perf increment   1.1    0.9    1.2    1.0
Size increment   1.4    2.7    1.8    1.1
Perf/Size        0.96   0.34   0.63   0.93

[Figure: model creation flow: the base MicroBlaze and the MicroBlaze with each unit (multiplier, divider, barrel shifter, FPU) go through synthesis and are run with the application, giving a size and a perf value per unit]


Map to 0-1 Knapsack Problem

First consider tuning without conjoinment
Problem of instantiating units within a limited FPGA size can be mapped to the 0-1 knapsack problem
Add items, each with a weight and a benefit, to a weight-constrained knapsack such that profit is maximized

Items:      MUL 1   BS 1   DIV 1   FPU 1   MUL 2   BS 2   DIV 2   FPU 2
Weights:    1331    228    121     2738    1331    228    121     2738
Benefits:   0.08    0.62   0.00    0.00    0.22    0.76   0.00    0.00

Note: Mapping inexact; weights/benefits are not strictly additive

[Figure: the knapsack is the available FPGA area alongside the two base MicroBlaze processors; items such as MUL 1, FPU 1, and MUL 2 are placed into it]
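The mapping can be illustrated with the textbook dynamic-programming solution to the 0-1 knapsack problem, applied to the weights and benefits listed on this slide. This is a sketch, not the authors' tool: the capacity of 2000 LUTs is a hypothetical value, the BS and DIV item names are inferred from the unit sizes, and benefits are treated as additive even though the slide notes that this is inexact.

```python
def knapsack_01(items, capacity):
    """Classic 0-1 knapsack by dynamic programming.

    items: list of (name, integer_weight, benefit); returns (benefit, names).
    """
    n = len(items)
    # best[i][c] = best total benefit using the first i items within capacity c
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i, (_, weight, benefit) in enumerate(items, start=1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]                      # skip item i
            if weight <= c and best[i - 1][c - weight] + benefit > best[i][c]:
                best[i][c] = best[i - 1][c - weight] + benefit  # take item i
    # Walk back through the table to recover which items were taken.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            name, weight, _ = items[i - 1]
            chosen.append(name)
            c -= weight
    return best[n][capacity], chosen[::-1]


ITEMS = [
    ("MUL 1", 1331, 0.08), ("BS 1", 228, 0.62), ("DIV 1", 121, 0.00), ("FPU 1", 2738, 0.00),
    ("MUL 2", 1331, 0.22), ("BS 2", 228, 0.76), ("DIV 2", 121, 0.00), ("FPU 2", 2738, 0.00),
]
ASSUMED_CAPACITY = 2000  # hypothetical free LUTs; not from the slides

value, chosen = knapsack_01(ITEMS, ASSUMED_CAPACITY)
print(round(value, 2), chosen)
```

With this assumed capacity, both barrel shifters and the second processor's multiplier are selected; zero-benefit items are never taken because an item is added only when it strictly improves the total.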


Disjunctively Constrained Knapsack

Problem: If a conjoined unit is included, the corresponding standalone units can't also be included

Solution: Map to the disjunctively constrained 0-1 knapsack
Yamada T., “Heuristic and Exact Algorithms for the Disjunctively Constrained Knapsack Problem”, 2002
Prohibits specific item pairs from being in the knapsack
ILP solution; running time is pseudo-polynomial

Items: MUL 1, BS 1, DIV 1, FPU 1, MUL 2, BS 2, DIV 2, FPU 2, MUL C, BS C, DIV C, FPU C

[Figure: the knapsack is the available FPGA area alongside the two base MicroBlaze processors]


Disjunctively Constrained Knapsack

Unconjoined items
Items:        MUL 1   BS 1   DIV 1   FPU 1   MUL 2   BS 2   DIV 2   FPU 2
Weights:      1331    228    121     2738    1331    228    121     2738
Benefits:     0.08    0.62   0       0       0.22    0.76   0       0

Conjoined items
Items:        MUL C   BS C   DIV C   FPU C
Weights:      1331    228    121     2738
Benefits 1:   0.06    0.54   0       0
Benefits 2:   0.21    0.71   0       0

Conjoined benefits show a small decrease from the benefit of the unconjoined unit (e.g., MUL C versus MUL 1)
Conjoined units provide benefits to both processors

[Figure: the knapsack is the available FPGA area alongside the two base MicroBlaze processors]
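A small exhaustive search shows how the disjunctive constraint changes the answer. This sketch is not the algorithm from the cited paper and not the authors' heuristic; it simply tries every subset of the 12 items. It assumes that a conjoined unit's benefit is the sum of its benefits to the two processors and uses a hypothetical capacity of 1600 LUTs; the BS and DIV item names are inferred from the unit sizes.

```python
from itertools import combinations


def dck_exhaustive(items, conflicts, capacity):
    """Best conflict-free subset within capacity, by brute force.

    items: {name: (weight, benefit)}; conflicts: set of frozenset name pairs.
    """
    names = list(items)
    best_value, best_set = 0.0, ()
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            if sum(items[n][0] for n in subset) > capacity:
                continue  # does not fit in the available area
            if any(frozenset(pair) in conflicts for pair in combinations(subset, 2)):
                continue  # contains a prohibited pair
            value = sum(items[n][1] for n in subset)
            if value > best_value:
                best_value, best_set = value, subset
    return best_value, set(best_set)


WEIGHTS = {"MUL": 1331, "BS": 228, "DIV": 121, "FPU": 2738}
BENEFIT_1 = {"MUL": 0.08, "BS": 0.62, "DIV": 0.0, "FPU": 0.0}
BENEFIT_2 = {"MUL": 0.22, "BS": 0.76, "DIV": 0.0, "FPU": 0.0}
CONJ_BENEFIT_1 = {"MUL": 0.06, "BS": 0.54, "DIV": 0.0, "FPU": 0.0}
CONJ_BENEFIT_2 = {"MUL": 0.21, "BS": 0.71, "DIV": 0.0, "FPU": 0.0}

items, conflicts = {}, set()
for unit, weight in WEIGHTS.items():
    items[f"{unit} 1"] = (weight, BENEFIT_1[unit])
    items[f"{unit} 2"] = (weight, BENEFIT_2[unit])
    # Assumption: a shared unit's benefit is the sum over both processors.
    items[f"{unit} C"] = (weight, CONJ_BENEFIT_1[unit] + CONJ_BENEFIT_2[unit])
    # A conjoined unit excludes either standalone copy of the same unit.
    conflicts.add(frozenset((f"{unit} C", f"{unit} 1")))
    conflicts.add(frozenset((f"{unit} C", f"{unit} 2")))

ASSUMED_CAPACITY = 1600  # hypothetical free LUTs; not from the slides
value, chosen = dck_exhaustive(items, conflicts, ASSUMED_CAPACITY)
print(round(value, 2), sorted(chosen))
```

In this tight-area scenario the search picks the conjoined multiplier and conjoined barrel shifter, since two standalone barrel shifters plus a multiplier would not fit. Brute force is only practical here because there are 2^12 subsets.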


Disjunctively Constrained Knapsack

Running Time

Modeling
5 synthesis runs for each processor
At most 4 runs of the conjoined simulator

Disjunctively constrained 0-1 knapsack
NP-complete problem
Solved with a heuristic
Heuristic takes < 1 min


Results

Data gathered for the Xilinx MicroBlaze soft-core processor
10 EEMBC and Powerstone benchmarks: aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, tblook, ttsprk
Obtained results for all possible pairwise conjoinments
We only show conjoinment data when both applications use the unit, to avoid making conjoinment appear better than it is


Results

[Figure: speedup (0 to 8) versus size in equivalent LUTs (0 to 9000), comparing knapsack and optimal configurations for the application pairs (bitmnp, bitmnp), (canrdr, canrdr), (BaseFP01, BaseFP01), (BaseFP01, bitmnp), (BaseFP01, canrdr), (tblook, tblook), (tblook, bitmnp), and (tblook, canrdr)]

Knapsack approach finds near-optimal configurations in most cases


Results

Knapsack heuristic finds near-optimal configurations in most cases (versus exhaustive with conjoinment)
Runs in seconds
One example had sub-optimal results (2.9 times slower)
Performance overhead due to conjoinment is just a few percent on average

[Figure: bar chart (axis 0 to 8) comparing knapsack, exhaustive with conjoinment, and exhaustive without conjoinment]


Results

[Figure: bar chart (axis 0 to 12000) comparing knapsack, exhaustive with conjoinment, and exhaustive without conjoinment]

On average, the knapsack approach yields the same size as exhaustive search with conjoinment
Average size savings of 16%


Conclusions

Conjoining two soft-core FPGA processors reduces average size by 16%
Performance overhead is just a few percent in most cases
Disjunctively constrained 0-1 knapsack approach finds near-optimal configurations in most cases
But could be improved for some examples

Future
Consider multiplexing size and delay overheads
Apply Kumar's advanced conjoining techniques to reduce overheads