Conjoining Soft-Core FPGA Processors David Sheldon a, Rakesh Kumar b, Frank Vahid a*, Dean Tullsen b, Roman Lysecky c a Department of Computer Science

Conjoining Soft-Core FPGA Processors

David Sheldon (a), Rakesh Kumar (b), Frank Vahid (a)*, Dean Tullsen (b), Roman Lysecky (c)

(a) Department of Computer Science and Engineering, University of California, Riverside

* Also with the Center for Embedded Computer Systems at UC Irvine

(b) Department of Computer Science and Engineering, University of California, San Diego

(c) Department of Electrical and Computer Engineering, University of Arizona

This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and by hardware and software donations from Xilinx.


FPGA Soft Core Processors

Soft-core processor: HDL description
  Flexible implementation: FPGA or ASIC
  Technology independent

[Figure: one HDL description targets either an FPGA (Spartan 3, Virtex 2, Virtex 4) or an ASIC]


FPGA Soft Core Processors

Soft-core processors can have configurable options
  Datapath units
  Cache
  Bus architecture

Current commercial FPGA soft-core processors
  Xilinx MicroBlaze
  Altera Nios

[Figure: an FPGA containing a microprocessor (μP) with optional cache, FPU, and MAC units]


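Conjoinment Overview

Application 1 and Application 2 each run on a base microprocessor

Add necessary units to both processors

“Conjoining”: conjoin the FPU unit, so that both processors share one conjoined FPU unit

[Figure: two base microprocessors, shown first with an FPU each and then sharing a single conjoined FPU]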


Conjoinment Background

Conjoinment proposed for multicore desktop processing (Kumar 2004)
  Reduces size with reasonable performance overhead
  e.g., cache conjoinment overhead: 1%-13%

[Figure: ICache sharing and DCache sharing]


Outline

Conjoinment for soft-core FPGA processors
  Area savings
  Performance overhead

Tuning heuristic for two configurable soft-cores with conjoin option

[Figure: size versus perf design space, with the best configuration marked "?"]


Area Savings

Significant potential area savings

Limitations
  Does not consider multiplexing costs, due to the absence of FPGA synthesis tools supporting conjoinment
  But the good potential justifies further investigation

Unit              Size (equivalent LUTs)
Multiplier        1331
Barrel Shifter    228
Divider           122
FPU               2738

[Figure: base MicroBlaze with optional multiplier, barrel shifter, divider, and FPU]

[Figure: bar chart of equivalent LUTs (0 to 10000), unconjoined versus conjoined, for each unit instantiated with the base processor. Area savings from conjoining: bs 6%, div 4%, mul 23%, fpu 32%]
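The savings percentages in the chart can be approximated with simple arithmetic. The sketch below is only an illustration: the base MicroBlaze size of 1540 equivalent LUTs is an assumption (the slide does not state it) chosen because it is consistent with the reported percentages, and the function name is hypothetical. Like the slide, it ignores multiplexing costs.

```python
# Unit sizes in equivalent LUTs, taken from the Unit Size table.
UNIT_SIZE = {"bs": 228, "div": 122, "mul": 1331, "fpu": 2738}

# ASSUMPTION: base MicroBlaze size; this value is not given on the slide.
BASE_LUTS = 1540


def conjoin_savings(unit_luts, base_luts=BASE_LUTS):
    """Fraction of area saved when two processors share one copy of a unit."""
    unconjoined = 2 * (base_luts + unit_luts)  # each processor owns a unit
    conjoined = 2 * base_luts + unit_luts      # one shared unit, no mux cost
    return (unconjoined - conjoined) / unconjoined


savings = {name: conjoin_savings(size) for name, size in UNIT_SIZE.items()}
for name, fraction in savings.items():
    print(f"{name}: {fraction:.0%}")
```

Under this assumption, the larger the unit is relative to the base processor, the larger the share of total area that conjoining recovers.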





Performance Overhead

No simulator exists for conjoined processors, so we developed our own
  Trace-based conjoined processor simulator

Simulation uses pessimistic performance assumptions
  Kumar's techniques can improve on them

Simulator outputs contention information
  Final cycle counts can be compared to the unconjoined case to determine performance overhead

[Figure: app1 and app2 (e.g., brev and bitmnp) run on the Xilinx simulator to produce trace1 and trace2, which feed the conjoined simulator; it reports access stalls and contention stalls]
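The idea of a trace-based conjoined simulator can be illustrated with a toy model. This is a minimal sketch and not the authors' simulator: the trace format, the function names, the 3-cycle unit latency, the 1-cycle access stall, and the fixed-priority arbitration are all assumptions made for the example.

```python
def simulate(trace1, trace2, unit_latency=3, access_stall=1):
    """Replay two traces that share one unit.

    Each trace is a list of "alu" (1 cycle) or "unit" (uses the shared unit).
    Returns (finish_cycles, contention_stalls), one entry per processor.
    """
    traces = [trace1, trace2]
    pc = [0, 0]          # index of the next instruction for each processor
    ready = [0, 0]       # cycle at which each processor may issue again
    contention = [0, 0]  # cycles spent waiting for a busy unit
    unit_free = 0        # cycle at which the shared unit becomes free
    cycle = 0
    while pc[0] < len(trace1) or pc[1] < len(trace2):
        for p in (0, 1):  # fixed priority: processor 0 always wins ties
            if pc[p] >= len(traces[p]) or ready[p] > cycle:
                continue
            if traces[p][pc[p]] == "unit":
                if unit_free > cycle:
                    contention[p] += 1  # contention stall: retry next cycle
                    continue
                # The access stall models the longer path to the shared unit.
                unit_free = cycle + access_stall + unit_latency
                ready[p] = unit_free
            else:
                ready[p] = cycle + 1
            pc[p] += 1
        cycle += 1
    return ready, contention


def alone(trace, unit_latency=3):
    """Cycle count when the processor has a private unit (no stalls)."""
    return simulate(trace, [], unit_latency, access_stall=0)[0][0]


app1 = ["alu", "unit", "alu"]
app2 = ["unit", "alu", "alu"]
cycles, stalls = simulate(app1, app2)
for p, trace in enumerate((app1, app2)):
    overhead = cycles[p] / alone(trace) - 1
    print(f"processor {p + 1}: {cycles[p]} cycles, "
          f"{stalls[p]} contention stalls, overhead {overhead:.0%}")
```

Comparing each processor's final cycle count against its standalone run gives the performance overhead, which is the comparison the slide describes.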


Performance Overhead

[Figure: bar chart of speedup (0 to 4.5), conjoined versus unconjoined, for pairs of brev, bitmnp, and canrdr, with labels such as (brev),canrdr and brev,(canrdr); annotated overheads of 17% and 2.4%]

Speedup: average application time on the base processor relative to application time on the optimally configured processor (larger is faster)

Compared configurations with conjoinment versus without
  Performance overhead usually small, averaging just 4.2%
  Overhead caused by access delays and contention for the shared hardware units
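The two metrics on this slide can be written out as follows. This is a sketch under the assumption that speedup is base time divided by configured time; the function names are hypothetical and the cycle counts are made-up numbers chosen only to illustrate a 4.2% overhead, not measured data.

```python
def speedup(base_cycles, configured_cycles):
    """Speedup of a configured processor over the base processor."""
    return base_cycles / configured_cycles


def conjoin_overhead(unconjoined_cycles, conjoined_cycles):
    """Relative slowdown caused by sharing units instead of owning them."""
    return conjoined_cycles / unconjoined_cycles - 1.0


# Hypothetical cycle counts for one application (illustration only).
base = 1_000_000
unconjoined = 400_000
conjoined = 416_800

print(f"speedup unconjoined: {speedup(base, unconjoined):.2f}")
print(f"speedup conjoined:   {speedup(base, conjoined):.2f}")
print(f"overhead:            {conjoin_overhead(unconjoined, conjoined):.1%}")
```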





Tuning Heuristic

5 choices per unit
  e.g., FPU: no unit, processor 1 only, processor 2 only, both 1 and 2, or conjoined

4 units: 5^4 = 625 possible configurations
  Simulation: ~30 minutes per configuration
  Need a search heuristic to tune

[Figure: base MicroBlaze 1 and base MicroBlaze 2 with optional multiplier, barrel shifter, divider, and FPU; the FPU is shown as absent, FPU 1, FPU 2, or conjoined]
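The size of the search space follows directly from the counts on this slide. A small sketch, with hypothetical labels for the units and choices, enumerates the configurations and estimates the cost of simulating all of them at 30 minutes each:

```python
from itertools import product

# Hypothetical labels for the four units and the five choices per unit.
UNITS = ["mul", "bs", "div", "fpu"]
CHOICES = ["none", "proc1", "proc2", "both", "conjoined"]

# One configuration assigns a choice to every unit.
configs = list(product(CHOICES, repeat=len(UNITS)))

MINUTES_PER_SIMULATION = 30  # approximate figure from the slide
total_minutes = len(configs) * MINUTES_PER_SIMULATION

print(f"{len(configs)} configurations")
print(f"{total_minutes / 60:.1f} hours, or {total_minutes / 60 / 24:.1f} days")
```

At roughly 13 days of simulation per application pair, exhaustive search is impractical, which motivates the heuristic.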


Map to 0-1 Knapsack Problem

Creating the model

[Figure: the base MicroBlaze and the MicroBlaze with an added unit (e.g., FPU) are each synthesized and run with the application, giving a size and a perf value for the multiplier, divider, barrel shifter, and FPU]

                  BS      FPU     MUL     DIV
Perf increment    1.1     0.9     1.2     1.0
Size increment    1.4     2.7     1.8     1.1
Perf/Size         0.96    0.34    0.63    0.93


Map to 0-1 Knapsack Problem

First consider tuning without conjoinment
  Problem of instantiating units within limited FPGA size can be mapped to the 0-1 knapsack problem
  Add items, each with a weight and a benefit, to a weight-constrained knapsack such that profit is maximized

Items:       MUL 1   BS 1   DIV 1   FPU 1   MUL 2   BS 2   DIV 2   FPU 2
Weights:     1331    228    121     2738    1331    228    121     2738
Benefits:    0.08    0.62   0.00    0.00    0.22    0.76   0.00    0.00

[Figure: the knapsack is the available FPGA area beside the two base MicroBlaze processors; MUL 1, FPU 1, and MUL 2 are shown placed in it]

Note: Mapping inexact; weights/benefits not strictly additive
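The mapping can be made concrete with a textbook dynamic-programming solver for the 0-1 knapsack problem. This is a sketch and not necessarily the solver the authors used. The weights and benefits come from the slide, but the item labels BS and DIV are inferred from the unit sizes, and the capacities (available LUTs after the two base processors) are hypothetical. As the slide notes, treating benefits as additive is itself an approximation.

```python
# (name, weight in equivalent LUTs, benefit); values from the slide.
# ASSUMPTION: the "BS" and "DIV" labels are inferred from the unit sizes.
ITEMS = [
    ("MUL 1", 1331, 0.08), ("BS 1", 228, 0.62),
    ("DIV 1", 121, 0.00), ("FPU 1", 2738, 0.00),
    ("MUL 2", 1331, 0.22), ("BS 2", 228, 0.76),
    ("DIV 2", 121, 0.00), ("FPU 2", 2738, 0.00),
]


def knapsack_01(items, capacity):
    """Return (total benefit, chosen names) for an integer-weight knapsack."""
    # best[c] holds the best (benefit, names) with total weight <= c.
    best = [(0.0, ())] * (capacity + 1)
    for name, weight, benefit in items:
        # Iterate capacity downward so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            candidate = best[c - weight][0] + benefit
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - weight][1] + (name,))
    return best[capacity]


# Hypothetical capacities: LUTs left after placing both base processors.
for capacity in (1600, 3200):
    benefit, chosen = knapsack_01(ITEMS, capacity)
    print(capacity, round(benefit, 2), chosen)
```

With the tighter budget only the two barrel shifters fit well; with the larger one the multipliers are added too. Units with zero benefit for an application are never selected.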


Disjunctively Constrained Knapsack

Problem: If a conjoined unit is included, the standalone unit can't also be included

Solution: Map to disjunctively constrained 0-1 knapsack
  Yamada T., “Heuristic and Exact Algorithms for the Disjunctively Constrained Knapsack Problem”, 2002
  Prohibits specific item pairs from being in the knapsack
  ILP solution; running time is pseudo-polynomial

Standalone items
Items:         MUL 1   BS 1   DIV 1   FPU 1   MUL 2   BS 2   DIV 2   FPU 2
Weights:       1331    228    121     2738    1331    228    121     2738
Benefits:      0.08    0.62   0       0       0.22    0.76   0       0

Conjoined items
Items:         MUL C   BS C   DIV C   FPU C
Weights:       1331    228    121     2738
Benefits 1:    0.06    0.54   0       0
Benefits 2:    0.21    0.71   0       0

Conjoined benefits show a small decrease from the unconjoined unit (e.g., MUL C versus MUL 1)

Conjoined units provide benefits to both processors

[Figure: the knapsack is the available FPGA area beside the two base MicroBlaze processors]
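The disjunctive constraint can be illustrated with a brute-force search over this small instance. This sketch is neither Yamada's algorithm nor the authors' heuristic; it simply checks every subset. The BS and DIV labels are inferred from the unit sizes, the capacities are hypothetical, and summing the two per-processor benefits of a conjoined unit is an assumption made for the example.

```python
from itertools import combinations

UNITS = ["MUL", "BS", "DIV", "FPU"]  # BS and DIV labels are inferred
WEIGHT = dict(zip(UNITS, [1331, 228, 121, 2738]))
ALONE_1 = dict(zip(UNITS, [0.08, 0.62, 0.0, 0.0]))  # standalone, processor 1
ALONE_2 = dict(zip(UNITS, [0.22, 0.76, 0.0, 0.0]))  # standalone, processor 2
CONJ_1 = dict(zip(UNITS, [0.06, 0.54, 0.0, 0.0]))   # conjoined, processor 1
CONJ_2 = dict(zip(UNITS, [0.21, 0.71, 0.0, 0.0]))   # conjoined, processor 2

items = {}         # name -> (weight, benefit)
conflicts = set()  # pairs of names that may not both be chosen
for u in UNITS:
    items[f"{u} 1"] = (WEIGHT[u], ALONE_1[u])
    items[f"{u} 2"] = (WEIGHT[u], ALONE_2[u])
    # ASSUMPTION: a conjoined unit's benefit is the sum over both processors.
    items[f"{u} C"] = (WEIGHT[u], CONJ_1[u] + CONJ_2[u])
    conflicts.add(frozenset((f"{u} C", f"{u} 1")))
    conflicts.add(frozenset((f"{u} C", f"{u} 2")))


def solve(items, conflicts, capacity):
    """Exhaustively find the best conflict-free subset within capacity."""
    best = (0.0, ())
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            if sum(items[n][0] for n in subset) > capacity:
                continue
            if any(frozenset(p) in conflicts for p in combinations(subset, 2)):
                continue
            benefit = sum(items[n][1] for n in subset)
            if benefit > best[0]:
                best = (benefit, subset)
    return best


for capacity in (1600, 4000):  # hypothetical available LUTs
    benefit, chosen = solve(items, conflicts, capacity)
    print(capacity, round(benefit, 2), sorted(chosen))
```

Under these assumptions, a tight area budget favors the conjoined multiplier and barrel shifter, while a larger budget favors separate copies, since standalone units carry slightly higher benefits.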



Running Time

Modeling
  5 synthesis runs for each processor
  At most 4 runs of the conjoined simulator

Disjunctively constrained 0-1 knapsack
  NP-complete problem
  Solved with a heuristic
  Heuristic takes < 1 min


Results

Data gathered for the Xilinx MicroBlaze soft-core processor

10 EEMBC and Powerstone benchmarks
  aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, tblook, ttsprk

Obtained results for all possible pairwise conjoinments

We only show conjoinment data when both applications use a unit, to avoid making conjoinment appear better than it is


Results

[Figure: speedup (0 to 8) versus size in equivalent LUTs (0 to 9000), knapsack versus optimal, for the pairs bitmnp/bitmnp, canrdr/canrdr, BaseFP01/BaseFP01, BaseFP01/bitmnp, BaseFP01/canrdr, tblook/tblook, tblook/bitmnp, and tblook/canrdr]

Knapsack approach finds near-optimal configurations in most cases


Results

Knapsack heuristic finds near-optimal configurations in most cases (versus exhaustive with conjoinment)
  Runs in seconds
  One example had sub-optimal results (2.9 times slower)

Performance overhead due to conjoinment is just a few percent on average

[Figure: bar chart (axis 0 to 8) comparing knapsack, exhaustive with conjoinment, and exhaustive without conjoinment]


Results

[Figure: bar chart of size (axis 0 to 12000) comparing knapsack, exhaustive with conjoinment, and exhaustive without conjoinment]

On average the knapsack approach yields the same size as the exhaustive search with conjoinment

Average size savings of 16%


Conclusions

Conjoining two soft-core FPGA processors reduces average size by 16%
  Performance overhead is just a few percent in most cases

Disjunctively constrained 0-1 knapsack approach finds near-optimal configurations in most cases
  But could be improved for some examples

Future work
  Consider multiplexing size and delay overheads
  Apply Kumar's advanced conjoining techniques to reduce overheads
